LDI Docs – 3 Procedures, Functions, Operators and Examples
March 11, 2011 Leave a comment
3. Procedures, Functions, Operators and Examples
Before you start to work on through the examples below, do grant the LUCENEUSER role to any dedicated Oracle user/schema, who has become selected to run the Lucene Domain Index (LDI). Remember, that you must not run a index within user/schema LUCENE. For example:
-- connected as sysdba grant LUCENEUSER to scott;
3.1 Create a Lucene Domain Index
3.1.1 Single column index
The first example creates a domain index on table t1, column f2 using Lucene’s SimpleAnalyzer along with the Analyzer parameter. After execution, a new index, T1.IT1, and two new LDI-tables, the index storage table, IT1$T, and the index queue table (AQ), IT1$QT, will have been added to the user’s schema. Because of the generated objects, can no Lucene Domain Index name be longer than 21 characters! This is due the secondary generation of the Oracle AQ table name.
create table t1 (
f1 number,
f2 varchar2(200),
f3 varchar2(200),
f4 number unique);
create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Analyzer:org.apache.lucene.analysis.SimpleAnalyzer');
Another simple example may employ a stemmer instead of an analyzer. A stemmer is kind of a language specific analyzer and behaves comparable to Lucene’s StandardAnalyzer. The Lucene stemming approach used here, the SnowballAnalyzer, is based on the Snowball code stack (snowball.tartarus.org). The employment of a stemmer requires setting the Stemmer parameter. Since a stemmer is somehow an extended analyzer, any Stemmer parameter in a parameter list will override any Analyzer parameter given as well. See Appendix A of the LDI docs for more information about analyzing and stemming.
create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English');
3.1.2 Multiple columns
Extending the previous example, you can also index additional columns of the same base table, in sum called compound columns. This requires the usage of the ExtraCols parameter, handing in the column identifier, that can be modified with an alias name, iff required. Note that the parameter list provided for a create or alter index needs to be a string without any line breaks and without any dispensible whitespace characters. Here is the code:
create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English;ExtraCols:F1 "f1"');
Note that employing an alias name also modifies the syntax of field searches with lcontains. That is, executing the following search will not result in any rows returned:
select * from t1 where lcontains(f2, 'F1:xyz', 1) > 0;
You have stick to the exact alias field name, with an appropriate character case instead:
select * from t1 where lcontains(f2, 'f1:xyz', 1) > 0;
However, the column actually indexed can be searched using an all uppercase field identifier or just no field identifier at all:
select * from t1 where lcontains(f2, 'xyz', 1) > 0; select * from t1 where lcontains(f2, 'F2:xyz', 1) > 0;
gennff: the following statement and example may be misleading because oracle is sometimes smart enough to detect blind updates and does not fire any update trigger under that circumstance. this technique needs to be checked and proved in detail.
Because Oracle ODCI API will not detect changes on other columns than the indexed master column, you need to create a trigger that will fire on any update on columns in the ExtraCols list to rewrite the master column. Such a way, any changes on f1 will also force to change f2. ODCI will notify LDI that an specific rowid was updated and LDI, being based on the specific parameter definition of this index, will update the index tables to reflect the changes. Here is an example:
CREATE OR REPLACE TRIGGER L$IT1 BEFORE UPDATE OF f1 ON t1 FOR EACH ROW
BEGIN
:new.f2 := :new.f2;
END;
/
3.1.3 Multiple tables
LDI also supports indexing multiple columns over multiple tables, that can be joined in a natural form. All to be done ist to define a list of tables with the ExtraTabs parameter as well as to specify a where condition with the WhereCondition parameter. The master table alias L$MT is automatically introduced by LDI and can be applied without any preparations. This alias is furthermore important, for example, to create complex joins with xml tables, that imply the use of the existsNode or extracValue operators. This functionality has been added starting with the 2.9.0.1.0 release of LDI. Here is an example:
create table t2 (
f4 number primary key,
f5 VARCHAR2(200));
create table t1 (
f1 number,
f2 VARCHAR2(4000),
f3 number,
CONSTRAINT t1_t2_fk FOREIGN KEY (f3)
REFERENCES t2(f4) ON DELETE cascade);
create index it1 on t1(f3) indextype is lucene.LuceneIndex
parameters('ExtraCols:L$MT.f2 "f2",t2.f5 "f5";ExtraTabs:t2;WhereCondition:L$MT.f3=t2.f4');
Note that the tables t1 and t2 are joined directly by a foreign key, such that t1 could be considered a child table of t2. Using this set of parameters, when the ODCI API detects a change on the index master column t1.f3, a select like this will be executed:
select L$MT.f3, L$MT.f2 "f2", t2.f5 "f5"
from t1 L$MT, t2
where L$MT.rowid=? and L$MT.f3=t2.f4;
gennff: the following example may be wrong in terms of the t2 trigger. that is, the it1 index will never see any inserts and/or deletes on table t2 and will therefore hardly stay in sync with the actual table data. in general, i don’t think it is an easy to follow approach to use an index on a child table in this example (t1 is a child of t2 since it references it by fk).
Keeping the LDI index in sync with the changes upon any columns (and tables) defined with ExtraCols (and ExtraTabs) parameters is no more complex than the compound columns on a single table example above. It just requires a combination of two or more triggers, according to the number of additional tables in the index definition. The additional trigger determines all rowids at the master table who have a reference to the row in change and then employs the LuceneDomainIndex.enqueueChange procedure to just notify LDI about the changes, while sys.ODCIRidList is a special ODCI structure to hold a group of rowids. Here you go:
CREATE OR REPLACE TRIGGER L$IT1 BEFORE UPDATE OF f2 ON t1 FOR EACH ROW
BEGIN
:new.f3 := :new.f3;
END;
/
CREATE OR REPLACE TRIGGER LT$IT1 AFTER UPDATE OF f5 ON t2 FOR EACH ROW
DECLARE
ridlist sys.ODCIRidList;
BEGIN
SELECT ROWID BULK COLLECT INTO ridlist FROM T1 WHERE F3=:NEW.f4;
LuceneDomainIndex.enqueueChange(USER || '.IT1', ridlist, 'update');
END;
/
3.1.4 Padding and formatting
Lucene Domain Index can be customized with a parameter named UserDataStore, that defines which class is responsible for creating Lucene documents. A Lucene document is a list of fields for each column indexed, plus an extra field named rowid, being stored compressed and untokenized. By default UserDataStore is defined to be implemented by org.apache.lucene.indexer.DefaultUserDataStore.
Default UserDataStore supports left padding for NUMBER or FLOAT columns as well as left character padding for VARCHAR2 or CHAR columns. To define padding, use the FormatCols parameter within the create or alter index DDL command. Here is an example that automatically pads all F2 column values to 15 characters and formats all F3 column values to comply to 00.00:
create table t1 (
f1 number primary key,
f2 varchar2(200),
f3 number(4,2))
ORGANIZATION INDEX;
insert into t1 values (1, 'ravi', 3.46);
insert into t1 values (3, 'murthy', 15.87);
commit;
create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English;ExtraCols:F3;FormatCols:F2(zzzzzzzzzzzzzzz),F3(00.00)');
then these rows will be indexed as Lucene documents like:
FINE: Document<stored,indexed<rowid:*BAGAGfsCwQL+> indexed,tokenized<F2:zzzzzzzzzzzravi> indexed<F3:03,46>> FINE: Document<stored,indexed<rowid:*BAGAGfsCwQT+> indexed,tokenized<F2:zzzzzzzzzmurthy> indexed<F3:15,87>>
For columns based on Oracle XMLType, FormatCols parameter can be used to define an XPath expression which controls a subset of XML nodes to be indexed.
create table t1 (
f1 VARCHAR2(10),
f2 XMLType);
insert into t1 values ('1', XMLType('<emp id="1"><name>ravi</name></emp>'));
insert into t1 values ('3', XMLType('<emp id="3"><name>murthy</name></emp>'));
commit;
create index it1 on t1(f1) indextype is lucene.LuceneIndex
parameters('Analyzer:org.apache.lucene.analysis.WhitespaceAnalyzer;ExtraCols:F2;FormatCols:F1(000),F2(/emp/name)');
Inspecting the trace output of the create index statement, above rows will be indexed as:
FINE: Document<stored,indexed<rowid:AAAWqqAAGAAABodAAA> indexed,tokenized<F1:001> indexed,tokenized<F2:ravi >> FINE: Document<stored,indexed<rowid:AAAWqqAAGAAABodAAB> indexed,tokenized<F1:003> indexed,tokenized<F2:murthy >>
Columns of type VARCHAR2, CHAR and CLOB allow for some special formatting options, namely NOT_ANALYZED, NOT_ANALYZED_STORED, ANALYZED_WITH_OFFSETS, ANALYZED_WITH_POSITIONS and ANALYZED_WITH_POSITIONS_OFFSETS. These options control the way UserDataStore chooses to analyze/index, store and vectorize the field values. For example, a field value that only holds one token per document may be analyzed/indexed without tokenization and also not be stored with the index to be employed as a sort field:
gennff: the example should also include sorting for numbers, probably using the NumericField class as of Lucene >= 2.9 .
create table emails (
emailFrom VARCHAR2(256),
emailTo VARCHAR2(256),
subject VARCHAR2(4000),
emailDate DATE,
bodyText CLOB);
INSERT INTO EMAILS (EMAILFROM, EMAILTO, SUBJECT, EMAILDATE, BODYTEXT)
VALUES ('EMAILFROM', 'EMAILTO', 'SUBJECT', sysdate, 'BODYTEXT');
commit;
create index emailbodyText on emails(bodyText) indextype is lucene.LuceneIndex
parameters('Analyzer:org.apache.lucene.analysis.StopAnalyzer;ExtraCols:emailDate "emailDate",subject "subject",emailFrom "emailFrom",emailTo "emailTo";FormatCols:subject(NOT_ANALYZED),emailFrom(NOT_ANALYZED),emailTo(NOT_ANALYZED);LogLevel:ALL');
Then the trace output for the row processed looks like this, where the fields subject, emailFrom and emailTo are only indexed but not tokenized by the analyzer (obviously, emailDate is also only indexed because the analyzer is smart enough to deduce this approach from the datatype of the value):
FINE: Document<stored,indexed<rowid:AAAWvbAAGAAABrrAAA> indexed,tokenized<BODYTEXT:BODYTEXT> indexed<emailDate:20110617> indexed<subject:SUBJECT> indexed<emailFrom:EMAILFROM> indexed<emailTo:EMAILTO>>
Another example, using the ANALYZED_WITH_VECTORS option of the FormatCols parameter prints:
FINE: Document<stored,indexed<rowid:AAAW5IAAGAAABqOAAG> stored,indexed,tokenized,termVector<STR1:20.12.2010> stored,indexed,tokenized,termVector<STR2:20.12.2010>>
Iff you go and read the Lucene documentation, you won’t find the options introduced above. They have been set up by LDI to define a mapping to Lucene’s concept of field, that is:
- analysis/indexing: (how will the value be searchable via the inverted index
- storage: is there any need to preserve the original value with the index as well, for display purposes probably and
- term vectorization: used for highlightning, categrization etc.
Here is the mapping table where the first row depicts the default value, i.e. when no FormatCols parameter is given:
| LDI format option | Lucene field analysis/indexing (Field.Index.*) | Lucene field storage (Field.Store.*) | Lucene field term vectorization (TermVector.*) |
|---|---|---|---|
| default | ANALYZED | NO | NO |
| ANALYZED_\ STORED |
ANALYZED | YES | NO |
| ANALYZED_\ WITH_VECTORS |
ANALYZED | YES | YES |
| ANALYZED_\ WITH_OFFSETS |
ANALYZED | YES | WITH_OFFSETS |
| ANALYZED_\ WITH_POSITIONS |
ANALYZED | YES | WITH_POSITIONS |
| ANALYZED_\ WITH_POSITIONS_\ OFFSETS |
ANALYZED | YES | WITH_POSITIONS_\ OFFSETS |
| NOT_ANALYZED | NOT_ANALYZED | NO | NO |
| NOT_ANALYZED_\ STORED |
NOT_ANALYZED | YES | NO |
The following table depicts mappings that regard the *_NORMS options to describe index time boosting of terms. It is just an extension of the preceding table, this time incorporation the *_NORMS option.
| LDI format option | Lucene field analysis/indexing (Field.Index.*) | Lucene field storage (Field.Store.*) | Lucene field term vectorization (TermVector.*) |
|---|---|---|---|
| ANALYZED_\ NO_NORMS |
ANALYZED_\ NO_NORMS |
NO | NO |
| ANALYZED_\ NO_NORMS_\ STORED |
ANALYZED_\ NO_NORMS |
YES | NO |
| ANALYZED_\ NO_NORMS_\ WITH_VECTORS |
ANALYZED_\ NO_NORMS |
YES | YES |
| ANALYZED_\ NO_NORMS_\ WITH_OFFSETS |
ANALYZED_\ NO_NORMS |
YES | WITH_OFFSETS |
| ANALYZED_\ NO_NORMS_\ WITH_POSITIONS |
ANALYZED_\ NO_NORMS |
YES | WITH_POSITIONS |
| ANALYZED_\ NO_NORMS_\ WITH_POSITIONS_\ OFFSETS |
ANALYZED_\ NO_NORMS |
YES | WITH_POSITIONS_\ OFFSETS |
| NOT_ANALYZED_\ NO_NORMS |
NOT_ANALYZED_\ NO_NORMS |
NO | NO |
| NOT_ANALYZED_\ NO_NORMS_\ STORED |
NOT_ANALYZED_\ NO_NORMS |
YES | NO |
Concluding, the following table introduces the possibility to directly index numeric values as numbers, dates and times as of Lucene’s release >= 2.9 NumericField class. Since LDI currently only supports sorting by int and float (see § 3.4.4), the basic types long and double are not yet accepted. Specifying NUMERIC_DATETIME should be accompanied with an according FormatCols format to set the desired datetime resolution from year down to second, e.g. FormatCols:revisionDate(day).
| LDI format option | Lucene field analysis/indexing (Field.Index.*) | Lucene field storage (Field.Store.*) | Lucene field term vectorization (TermVector.*) |
|---|---|---|---|
| NUMERIC_INT | n/a (NOT_ANALYZED) | NO | n/a (NO) |
| NUMERIC_FLOAT | n/a (NOT_ANALYZED) | NO | n/a (NO) |
| NUMERIC_DATETIME | n/a (NOT_ANALYZED) | NO | n/a (NO) |
3.1.5 Functional columns
Since some ExtraCols parameter list will just be added to the selected columns of the base table for indexing, one can also define additional or specialized fields for a Lucene index. Actually, any valid SQL expression in a select section is allowed here. For example, using the above table definition, one can introduce another Lucene field named id, that can even be a subject to formatting with the same statement. Again, be careful to use the same character case when writing down database column and Lucene field identifiers:
create index it1 on t1(f1) indextype is lucene.LuceneIndex
parameters('Analyzer:org.apache.lucene.analysis.standard.StandardAnalyzer;ExtraCols:F2,extractValue(F2,''/emp/@id'') "id";FormatCols:F1(000),F2(/emp/name),id(00)');
The LDI indexer will use a statement roughly comparable to the following:
select f1, F2, extractValue(F2,''/emp/@id'') "id" from t1;
Appropriate selects may look like this:
select * from t1 where lcontains(f1, '001') > 0; select * from t1 where lcontains(f1, 'id:01') > 0; select * from t1 where lcontains(f1, 'F2:ravi') > 0;
3.1.6 Online synchronization
If you put SyncMode:OnLine, in contrast to SyncMode:Defered, in the parameter list of your create index statement, LDI will set up a PLSQL AQ Callback to the indexes queue it1$q in it1$qt, see above, that immediately synchronizes all changes of insert or update statements with the Lucene index structure in the background. LDI will enqueue and process affected rowids of the master table in batches of BatchCount rows, where the default batch size is 115. For example:
create index pages_lidx_all on pages p (value(p)) indextype is Lucene.LuceneIndex
parameters('SyncMode:OnLine;LogLevel:WARNING;Stemmer:Spanish;ExtraCols:extractValue(object_value,''/page/title'') "title",extractValue(object_value,''/page/revision/comment'') "comment",extract(object_value,''/page/revision/text/text()'') "text",extractValue(object_value,''/page/revision/timestamp'') "revisionDate";IncludeMasterColumn:false;LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');
3.1.7 Populate Index
Using PopulateIndex:false with an index create statement will build an empty Lucene index structure that is ready for use but will of course not return any results on a select statement. You may execute an alter index rebuild statement afterwards to actually populate the index. Here is an example:
create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('PopulateIndex:false;LogLevel:ALL;IncludeMasterColumn:false;ExtraCols:F1,extractValue(F2,''/emp/name/text()'') "name",extractValue(F2,''/emp/@id'') "id";FormatCols:F1(000),id(00)');
-- At this point the index is set up but not populated, so no rows will be returned
select lscore(1),f2 from t1 where lcontains(f2, 'name:ravi',1) > 0;
-- Populate the index
alter index it1 rebuild parameters('Analyzer:org.apache.lucene.analysis.WhitespaceAnalyzer');
-- Now, upon a query, rows will be returned (iff matched)
select lscore(1), f2 from t1 where lcontains(f2, 'name:ravi',1) > 0;
3.1.8 Parallel Index Operations
Starting with Lucene Domain Index 2.9.1.1.0, you can enable parallel operations with the ParallelDegree parameter that may be explicitely set to 0, the default, or to a range of 2 to 9. Parallel operations is implemented by using multiple UserDataStore segements and is most useful on multi core chip boxes or in a RAC environment. Up to now, only insert statemets upon the base table is parallelized and the index must be configured in SyncMode:OnLine mode. This is an example of an index that enables parallel inserts:
create index source_big_lidx on test_source_big(text)
indextype is lucene.luceneindex
parameters('BatchCount:1000;ParallelDegree:4;SyncMode:OnLine;LogLevel:INFO;AutoTuneMemory:true;PerFieldAnalyzer:line(org.apache.lucene.analysis.KeywordAnalyzer),TEXT(org.apache.lucene.analysis.SimpleAnalyzer);FormatCols:line(0000);ExtraCols:line "line"');
After execution of the statement, ten new tables will be visible on the user’s schema: SOURCE_BIG_LIDX$T (master index storage) and SOURCE_BIG_LIDX$[0..3]$T (slave index storages) as well as SOURCE_BIG_LIDX$QT (master index queue) and SOURCE_BIG_LIDX$[0..3]$QT (slave index queues). A sequence SOURCE_BIG_LIDX$S is also created and serves generating the numbers 0 to 3.
The parallel implementation will enqueue batches of BatchCount:1000 rows in the master queue SOURCE_BIG_LIDX$Q of the index. Then the PLSQL AQ Callback, which is enabled for this queue by SyncMode:OnLine will dequeue each batch and re-enqueue in the slaves queues SOURCE_BIG_LIDX$[0..3]$Q. As a result, Oracle AQ will execute multiple AQ server processes that you can see as multiple ora_j00x_sid processes on a *U*X box.
With Oracle 11g we saw that that AQ may not always start another slave process if one callback is getting a lot of (too much) CPU usage. Experience shows, that a BatchCount parameter setting around 250 always leaves enough machine ressources for other slaves processes to be started successfully and setting parallelizing to real work.
gennff: the following section is quite hard to understand and should be rewritten, pointing out the several keywords in a more declarative way.
Lucene Domain Index 2.9.2.1.1+ and 3.0.1.1.0+ also includes a new parameter IndexOnRam, default true, that executes indexing in RAM, using the Lucene RAMDirectory implementation. RAMDirectory is around 40% faster than indexing a similar batch by means of the slave storages and parallel index operations, as introduced before. Obviously each slave process will consume more RAM compared to the same batch of rows using a disk storage based on OJVMDirectory. The rule of thumb is to enable LogLevel:INFO and check for a certain BatchCount value, what time it takes to index a new batch on RAM and what time it takes to merge the slave directory with the main directory storage. Here is an example:
INFO: .addDocToIdx - start indexing on SCOTT.SOURCE_BIG_LIDX numRows= 500 INFO: .addDocToIdx - indexing done SCOTT.SOURCE_BIG_LIDX elapsedTime: 200 ms. INFO: .addDocToIdx - addIndexesNoOptimize merge done SCOTT.SOURCE_BIG_LIDX elapsedTime: 317 ms.
Choosing a greater BatchCount value here will promote an indexing time similar or greater than the merge time and will, in sum, decrease the overall processing time, for example:
INFO: .addDocToIdx - start indexing on SCOTT.SOURCE_BIG_LIDX numRows= 700 INFO: .addDocToIdx - indexing done SCOTT.SOURCE_BIG_LIDX elapsedTime: 312 ms. INFO: .addDocToIdx - addIndexesNoOptimize merge done SCOTT.SOURCE_BIG_LIDX elapsedTime: 74 ms.
Parallel indexing is are also used for insert .. into .. select .. from DML operations, however, the BatchCount parameter is not used in this case because Oracle automatically chooses the number of rows that are inserted in a batch, usually around 115 rows.
A Lucene Domain Index working in SyncMode:OnLine and using a ParallelDegree greater than 1 will be populated in parallel using RAMDirectory or OJVMDirectory slave storage depending on IndexOnRam parameter. Note that indexing is always done in parallel with other indexing operation because it doesn’t require a write lock on the master indexing.
3.2 Alter index
The Lucene Domain Index alter index command can be used to change any parameter after index creation time. LDI parameters are a simple list of name:value pairs being stored into LDI’s OJVMDirectory storage. If you want to remove any parameter from the index definition, do prepend the parameter name with a "~". Here are some examples of alter index.
Change the Lucene index writer parameter MaxBufferedDocs to 500 and disable auto tuning of indexing memory:
alter index it1
parameters('MaxBufferedDocs:500;AutoTuneMemory:false');
Similar to the previous one example but enabling online synchronization:
alter index it1
parameters('MaxBufferedDocs:500;AutoTuneMemory:false;SyncMode:OnLine');
This disables online synchronization from the above example. You can get a similar effect by setting SyncMode:Deferred, which is the default value for SyncMode, to overwrite any previous SyncMode setting.
alter index it1 parameters('~SyncMode:OnLine');
3.2 Alter index rebuild
The alter index rebuild statement rebuilds an index from scratch. This is useful when a LDI instance is damaged, corrupted or you need to change some parameter setting that affects field value preprocessing and indexing. An example is the Lucene Analyzer parameter.
3.2.1 Manual
The following example shows how to change Lucene index analyzer. If you change your index analyzer it is necessary to rebuild the complete index because you should not query an index with an analyzer that is different from the index time.
alter index it1 rebuild
parameters('Analyzer:org.apache.lucene.analysis.StopAnalyzer;MaxBufferedDocs:500;AutoTuneMemory:false);
3.2.2 On Line
gennff: it is not comprehensible to the reader what an oracle index online rebuild has in common with a ldi online sync.
Alter index rebuild will not return up to the complete operation is finished. Rebuild On Line is a functionality for Oracle Index available in enterprise edition databases, but with a little trick you can rebuild Lucene Domain Index On Line too. If you are working with SyncMode:Deferred you need to change to SyncMode:OnLine, then you can rebuild the index by using:
alter index it1 rebuild
parameters('SyncMode:OnLine;MergeFactor:100;BatchCount:1000');
commit; -- notify change to AQ Callback
Rebuild command enqueues batchs of 1000 rowids of the master table (it1) for addition to Lucene Index structure then Lucene Domain Index AQ Callback will process these messages using background database process and automatically commit changes when it finish.
3.3 Drop
Dropping a LDI instance is no way different from dropping any other index in an Oracle database. Under the covers, this operation also drops the index’ storage table, IT1$T for the above example, and the index’ AQ IT1$Q with its storage IT1$QT. If the index is configured with SyncMode:OnLine, the PL/SQL AQ Callback is disabled first.
drop index it1;
If something goes wrong during index drop command you can add a force at the end of the command. That will clear any system views from any stale references to the index.
drop index it1 force;
3.4 Querying
Lucene Domain Index introduces a new SQL operator named lcontains() with its ancillary operators lscore() and lhighlight() (see below). The functionality of lcontains() and lscore() is comparable to the Oracle Text operators contains() and score().
3.4.1 Simple columns
Based on the xml indexing by FormatCols example from § 3.1.4, some simple statement employs contains() in the where clause and lscore() with the select list. In that order, the operators serve a parameterized text query and a normalized hit scoring. The first parameter to lcontains() denotes the column on which the Lucene Domain Index resides, the second one is the actual string (of tokens) to be searched. Both parameters are mandatory. The third parameter of lcontains() as well as the first parameter of lscore() is a correlation id that establishes a connection between a dedicated lcontains() and a dedicated lscore() in some SQL statement. That is, every call to Lucene by means of lcontains() delivers its own scoring! Note that lcontains() must always return a value > 0 to some consumer (a boolean expression here) to identify a successful match.
Since the Lucene index resides on column f1, this column is called the master column of the index and makes up the default search field of a query expression (see the Lucene Query Parser syntax). That is, leaving out any field qualifier in front of a query expression term, executes a search on column f1.
select lscore(1) as sc, f1 from t1 where lcontains(f1, '001', 1) > 0;
SC|F1
------|---
1.000|1
The approach is different when searching an ExtraCols column, here the search string has to be prefixed by “F2:”.
select lscore(1) as sc, f1 from t1 where lcontains(f1, 'F2:ravi', 1) > 0;
SC|F2
------|----------------------------------------
1.000|<emp id="1"><name>ravi</name></emp>
3.4.2 Multiple columns
Lucene query parser syntax provides a rich query language that comprises logical operators, term modifiers, grouping and stuff. You can apply any of them to each column indexed as long as the resulting expression conforms to the query language. Here is a some more complex example, using the xml indexing by FormatCols example from § 3.1.5 . Note that first row matches against the extra column F2:ravi and the functional column id:01, the second row matches with F1 equal to 003 (where some qualifier its not necessary for F1 because it is the master column of the index).
select f1, lscore(1) sc, extractValue(f2, '/emp/@id') id from t1
where lcontains(f1, '003 OR (F2:(ravi OR ravie) AND id:01)', 1) > 0;
F1 | SC|ID
---|------|---
1 | .577|1
3 | .206|3
3.4.3 Pagination
The lcontains() operator has an extension to the Lucene query parser syntax that includes in-line pagination information for the Lucene Domain Index result set. You can select a specific window (pagination) of your sorted query results, note the DOMAIN_INDEX_SORT optimizer hint and see next section, by injecting a query qarser like range term inside the query expression. For example (from § 3.1.5):
select /*+ DOMAIN_INDEX_SORT */
f1, lscore(1) sc, extractValue(f2, '/emp/@id') id from t1
where lcontains(f1, 'rownum:[2 TO 2] AND (003 OR (F2:(ravi OR ravie) AND id:01))', 1) > 0
order by lscore(1) desc;
F1 | SC|ID
---|------|---
3 | .206|3
Lucene Domain Index implementation automatically extracts pagination information rownum:[n TO m] AND from the beginning of the query expression and only returns the required subset of n rowids to the Oracle optimizer. This extension provides a lot of performance gain by eliminating the outer statement of Oracle’s Top-N syntax that, in a worst case, collects all resulting rowids to filter the result set window. Because inline pagination is an home brew extension to the standard Lucene query parser syntax, there are some home brew rules also:
- The
rownum:[n TO m] ANDterm must start the query expression and, as is, we simply use positioned string matching of therownum .. ANDkeywords to extract the start and the stop index of the window. Such a way it is your responsibility to provide a well formatted term. - Pagination information is concatenated to the actual query expression using the
ANDboolean operator. However, this operator does not have any sense concerning the grouping logic of terms in a search expression, it is a placeholder, likerownum. For example,rownum:[n TO m] AND xx OR bbwill be evaluated as((rownum:[n TO m] AND xx) OR bb)but of course only seached as(xx OR bb).
Although all the pagination functionality looks really easy to use and fast executing yould should always keep the following in mind. Query result sets may change over time, they may change in value, in lenght and in order. For example, a select like the one above may return a completely different result set iff some new row has been added to the base table. The row will, iff matched, also move the page window around. A common pifall is that you inspect some search results on page 123, then a new row comes in and matches right at the top, and all of a sudden when you switch to page 123+1, the last result from page 123 reappears as the first result from page 123+1.
3.4.4 Sort
Pure Lucene provides sorting over the result of a particular query, Lucene Domain Index goes further and provides sorting by using an extra argument to the lcontains() operator (see § 3.4.1). The sort parameter syntax is a coma separated string of field[:ORDER[:TYPE] values, where the fields being included in the sorting spec should be NOT_ANALYZED or NOT_ANALYZED_STORED (see the FormatCols parameter in § 3.1.4). The ORDER can be set to ASC or DESC, default value is ASC. The TYPE key can be string, float or int, starting with Lucene 2.9.0 the default value is string.
Note that if you are using lcontains() to sort anything within the index, you have to add the DOMAIN_INDEX_SORT optimizer hint. This hint tells the Oracle optimizer that the order of the rows will be dictated by that Lucene Domain Index. Also note that the usage of lscore() in conjunction with a non-scored sort does not make any sense at all and will, due to the score computation on the index engine, only produce an overhead to the execution time of some query.
Here are some examples of sorted queries against the emails table created in § 3.1.4 and on that data:
INSERT INTO EMAILS (EMAILFROM, EMAILTO, SUBJECT, EMAILDATE, BODYTEXT)
VALUES ('arthur@schop.de', 'friedrich@nietz.de', 'Pessimismus', sysdate-12, 'Denn alles Streben entspringt aus Mangel');
INSERT INTO EMAILS (EMAILFROM, EMAILTO, SUBJECT, EMAILDATE, BODYTEXT)
VALUES ('friedrich@nietz.de', 'arthur@schop.de', 'Deine Philosophie', sysdate-7, 'wie beim Eintritt in den Hochwald');
INSERT INTO EMAILS (EMAILFROM, EMAILTO, SUBJECT, EMAILDATE, BODYTEXT)
VALUES ('irgendwer@schop.de', 'arthur@schop.de', 'Metaphysik ist pessimistisch', sysdate-3, 'Die Welt als Wille und Vorstellung');
INSERT INTO EMAILS (EMAILFROM, EMAILTO, SUBJECT, EMAILDATE, BODYTEXT)
VALUES ('friedrich@nietz.de', 'arthur@schop.de', 'Der neue Pessimismus', sysdate, 'Vitalismus ist doch schicker!');
commit;
SELECT /*+ DOMAIN_INDEX_SORT */ subject, emailfrom as src, emaildate ts FROM emails where lcontains(bodytext, 'streben || wille || ruhe', 'subject', 1) > 0; SUBJECT |SRC |TS ------------------------------|------------------------------|-------- Metaphysik ist pessimistisch |irgendwer@schop.de |18.06.11 Metaphysik ist pessimistisch |arthur@schop.de |18.06.11 Metaphysik und Pessimismus |arthur@schop.de |09.06.11 SELECT /*+ DOMAIN_INDEX_SORT */ subject, emailfrom as src, emaildate ts FROM emails where lcontains(bodytext, 'streben || wille || ruhe', 'subject:DESC', 1) > 0; SUBJECT |SRC |TS ------------------------------|------------------------------|-------- Metaphysik und Pessimismus |arthur@schop.de |09.06.11 Metaphysik ist pessimistisch |irgendwer@schop.de |18.06.11 Metaphysik ist pessimistisch |arthur@schop.de |18.06.11 SELECT /*+ DOMAIN_INDEX_SORT */ subject, emailfrom as src, emaildate ts FROM emails where lcontains(bodytext, 'streben || wille || ruhe', 'subject:DESC,emailFrom', 1) > 0; SUBJECT |SRC |TS ------------------------------|------------------------------|-------- Metaphysik und Pessimismus |arthur@schop.de |09.06.11 Metaphysik ist pessimistisch |arthur@schop.de |18.06.11 Metaphysik ist pessimistisch |irgendwer@schop.de |18.06.11 SELECT /*+ DOMAIN_INDEX_SORT */ subject, emailfrom as src, emaildate ts FROM emails where lcontains(bodytext, 'streben || wille || ruhe', 'subject:DESC,emailFrom:DESC', 1) > 0; SUBJECT |SRC |TS ------------------------------|------------------------------|-------- Metaphysik und Pessimismus |arthur@schop.de |09.06.11 Metaphysik ist pessimistisch |irgendwer@schop.de |18.06.11 Metaphysik ist pessimistisch |arthur@schop.de |18.06.11
Again, be very careful to not change the case of any identifier or add redundant whitespace with the sort specification, Lucene Domain Index is very sloppy as a query parser. For example, writing emailFrom as emailfrom (no capital F) changes the semantics of the result set without notice:
SELECT /*+ DOMAIN_INDEX_SORT */ subject, emailfrom as src, emaildate ts FROM emails where lcontains(bodytext, 'streben || wille || ruhe', 'emailFrom:DESC', 1) > 0; SUBJECT |SRC |TS ------------------------------|------------------------------|-------- Metaphysik ist pessimistisch |irgendwer@schop.de |18.06.11 Metaphysik und Pessimismus |arthur@schop.de |09.06.11 Metaphysik ist pessimistisch |arthur@schop.de |18.06.11 SELECT /*+ DOMAIN_INDEX_SORT */ subject, emailfrom as src, emaildate ts FROM emails where lcontains(bodytext, 'streben || wille || ruhe', 'emailfrom:DESC', 1) > 0; SUBJECT |SRC |TS ------------------------------|------------------------------|-------- Metaphysik und Pessimismus |arthur@schop.de |09.06.11 Metaphysik ist pessimistisch |irgendwer@schop.de |18.06.11 Metaphysik ist pessimistisch |arthur@schop.de |18.06.11
The following query doesn’t include any sort specification but the DOMAIN_INDEX_SORT hint such that the result set will by default be sorted by score descending. This is an abbreviated syntax for the probably most widely used application of sorts with text searches. You can, of course, exchange it to the classic syntax of order by lscore(1) desc. Yes, ok, the result set are not identical, however, this is because two rows have, to a mantissa of 15 numbers, ichecked it, the same score, so Oracle / LDI has no other choice than returning the rows by accident.
SELECT /*+ DOMAIN_INDEX_SORT */ subject, emailfrom as src, lscore(1) sc FROM emails where lcontains(bodytext, 'streben || wille || ruhe', 1) > 0; SUBJECT |SRC | SC ------------------------------|------------------------------|------ Metaphysik ist pessimistisch |arthur@schop.de | .184 Metaphysik und Pessimismus |arthur@schop.de | .138 Metaphysik ist pessimistisch |irgendwer@schop.de | .138 SELECT subject, emailfrom as src, lscore(1) sc FROM emails where lcontains(bodytext, 'streben || wille || ruhe', 1) > 0 order by lscore(1) desc; SUBJECT |SRC | SC ------------------------------|------------------------------|------ Metaphysik ist pessimistisch |arthur@schop.de | .184 Metaphysik ist pessimistisch |irgendwer@schop.de | .138 Metaphysik und Pessimismus |arthur@schop.de | .138
gennff: the reader gained no understanding how to index and sort by date or timestamp values. one may of course convert dates to well formatted strings or numbers using formatcols. However, per chance ldi may already offer some automated process that, for whatever reason, did not make it to the docs yet.
3.4.5 Count Hits
The countHits() function is a Lucene Domain Index optimization to replace the regular SQL count(*) functionality. By comparison, countHits() is extremely fast because there is no need to pass rowid information from the Lucene Data Cartridge to the Oracle Engine to count matching rows. Here is an example:
select LuceneDomainIndex.countHits('EMAILBODYTEXT','security') hits from dual;
HITS
----
5
The first argument of countHits() is the Lucene Domain Index name, the second argument is the search expression in question. You can optionally use a three argument version of countHits() to check some index in another schema. countHits() does not only serve the functionality explained above, a call to countHits() will also cache the rowid information along with the query expression. This is useful for lcontains() queries executed immediately afterwards, because the LDI can shortcut to this cached information. Under the covers, LDI keyes to the cache by sort_string(QueryParser.toString()) such that at least the query expressions of countHits() and lcontains() must match exactly. Following is an example of countHits() in correlation with lcontains() where emailFrom:(security) ist the match key:
select LuceneDomainIndex.countHits('EMAILBODYTEXT','security') from dual;
LUCENEDOMAININDEX.COUNTHITS('EMAILBODYTEXT','SECURITY')
-------------------------------------------------------
5
Elapsed: 00:00:00.02
select emailFrom FROM emails
where lcontains(bodytext,'security','emailFrom:ASC',1)>0;
EMAILFROM
---------
codeshepherd@gmail.com
codeshepherd@gmail.com
erik@ehatchersolutions.com
lucenelist2005@danielnaber.de
lucenelist2005@danielnaber.de
Elapsed: 00:00:00.04
3.4.6 First Rows Hint
Starting with the 2.4.0.1.0 release of LDI we have replaced deprecated Lucene hit classes by the TopDocs class. If you use the FIRST_ROWS optimizer hint in conjuction with the lcontains() inline pagination, Lucene Domain Index will execute a call to TopDocs to get the first N hits only. For example:
select /*+ FIRST_ROWS DOMAIN_INDEX_SORT */ lhighlight(1), extractValue(object_value,'/page/title')
from pages
where lcontains(object_value, 'rownum:[1 TO 10] AND (musica tango rock)', 1)>0;
FIRST_ROWS and rownum:[1 TO 10] being used together performs a Lucene Query for the first 10 hits only. However, the next query along rownum:[10 TO 20] will find most of the Lucene structures, like the Searcher class instance and the ROWIDLucene to DocID association, already cached in memory. The Lucene index will be nevertheless be re-queried to get first 20 Hits (1..20) again. On the other hand, if you omit FIRST_ROWS, Oracle will by default switch to ALL_ROWS mode which means, if you are using a pagination of (rownum:[n TO > 2000]), Lucene Domain Index will fetch m first hits, but if m is lower than 2000, Lucene Domain Index will try to fetch by default 2000 hits. The magic number of 2000 is due Oracle ODCI API calls to the ODCIFetch routine in batches of 2000 rowids. If FIRST_ROWS and in-line pagination are not included in query, Lucene Domain Index alwasy works in batches of 2000 hits causing several cache misses in a full scan mode. For example, given a query:
select count(*) from pages where lcontains(object_value, 'musica tango rock')>0;
Lucene Domain Index fetches the first 2000 hits, finally with the information that the hit amount is 2736 and it re-fetches (cache miss) the 2736 hits. Obviously you can use LuceneDomainIndex.countHits() to estimate the hit count in advance being faster than the previous query.
3.4.7 Highlighting
The lhighlight() ancillary operator works just as lscore(), remember the corellation id to lcontains(), but instead returns a VARCHAR2 text with the hit words (tokens) highlightened. The HTML tag used to remark hit words denotes <B></B> and is, as well as the fragment separator (…) and the maximum number of fragments (4), not yet customizable per call. However, starting with the 2.4.1.1.0 release of LDI, these parameters have become statically customizable through alter index ... parameters();, see below. here is some highlighting example:
SELECT /*+ DOMAIN_INDEX_SORT */ subject, lscore(1) sc, lhighlight(1) txt
FROM emails
where lcontains(bodytext, 'security OR mysql', 'subject:ASC', 1)>0;
SUBJECT
SC
TXT
---
Re: lucene injection
.27477634
On Dec 21, 2006, at 4:56 AM, Deepan wrote: I am bothered about <strong>security</strong> problems with lucene. Is it vulnerable to any kind of injection like <strong>mysql</strong> injection? many times the query from user is passed to lucene for search without validating. Rest easy. There are no known <strong>security</strong> issues with Lucene, and it has even undergone a recent static code analysis by Fortify (see the lucene-dev e-mail list).
Highlighting only works with columns of types VARCHAR2, CLOB and XMLType. You can perform highlighting operation even if your master columns is not indexed/stored. So far, the index creation DDL below features a IncludeMasterColumn:false, which means the actual XMLType representation of the Spanish Wikipedia page dump is not rawly indexed. Only the virtual columns title, comment, text and revisionDate become processed by Lucene. However, the LDI TextHighlight Java method must receive the full value(p) XMLType from the RDBMS engine to feed the Lucene Highlighter class with all the information that is not necessarily contained in the Lucene index.
create index pages_lidx_all on pages p (value(p))
indextype is Lucene.LuceneIndex
parameters('PopulateIndex:false;
DefaultColumn:text;
SyncMode:Deferred;
LogLevel:INFO;
Analyzer:org.apache.lucene.analysis.SpanishWikipediaAnalyzer;
ExtraCols:extractValue(object_value,''/page/title'') "title",
extractValue(object_value,''/page/revision/comment'') "comment",
extract(object_value,''/page/revision/text/text()'') "text",
extractValue(object_value,''/page/revision/timestamp'') "revisionDate";
FormatCols:revisionDate(day);
IncludeMasterColumn:false;
LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');
select /*+ DOMAIN_INDEX_SORT */ lhighlight(1), extractValue(object_value,'/page/title')
from pages
where lcontains(object_value, 'rownum:[1 TO 10] AND (musica tango rock)', 1)>0;
<strong>Música</strong> de Argentina... [[Latinoamérica|latinoamericanos]] con más desarrollo en su [[<strong>música</strong>]]. Se encuentra una gran... argentinos, un instrumento tradicional andino]] Aún se mantiene la <strong>música</strong> de los [[Indígenas_en_Argentina... de grandes corrientes de [[inmigración|inmigrantes]] europeos, la <strong>música</strong> argentina se enriqueció
Música de Argentina
musical emparentado con la [[habanera]] y el [[<strong>tango</strong> (<strong>música</strong>)|<strong>tango</strong>]].
==Diferencias con el <strong>tango</strong>==
Aunque tanto la milonga como el <strong>tango</strong> están en [[compás]] de 2/4, las 8 [[semicorchea]]s de la milonga están distribuidas en 3 + 3 + 2 en cambio el <strong>tango</strong> posee un ritmo más «cuadrado». Las letras...]] criticó en algún momento el <strong>tango</strong> y prefirió la milonga, que no trasmite la melancolía
Milonga (género musical)
Parameters supported by highlighting functions are:
Formatter, a valid class name that implements Lucene’sFormatterinterface and a constructor with no arguments. The default value isorg.apache.lucene.search.highlight.SimpleHTMLFormatter.MaxNumFragmentsRequired, number of text fragments returned by thelhighlight()function, the default value is 4.FragmentSize, the size of each fragment returned, the default value is100.FragmentSeparator, the string used as fragment separator, the default is “…”. Note that you can not use “;” or “:” as fragment separator because these tokens are used as parameter and value delimiters with thecreateoralter index ... parameters();statements.
So far there is no customization allowed by passing any constructor arguments to the Formatter class, but you can easily create your own formatter to call SimpleHTMLFormatter with arguments like this:
create or replace and compile java source named "org.apache.lucene.search.highlight.MyHTMLFormatter"
as
package org.apache.lucene.search.highlight;
public class MyHTMLFormatter extends SimpleHTMLFormatter {
public MyHTMLFormatter() {
super("<span class=\"myhighlightclass\">","</span>");
}
}
/
alter index emailbodyText
parameters('Formatter:org.apache.lucene.search.highlight.MyHTMLFormatter;
MaxNumFragmentsRequired:3;
FragmentSeparator:...;
FragmentSize:50');
3.4.8 Highlighting using pipeline table functions
gennff: the author should at least give a short application scenario where the pipeline functions outperform the classis approach od § 3.4.7 .
phighlight() and rhighlight() provide a more general usage pattern for Lucene’s highlighting functionality. phighlight() receives an SQL query as string and performs highlighting according to a set of user defined columns on the query result. rhighlight() receives a SYS_REFCURSOR argument and adain performs highlighting on a set of user defined query columns but, unlike phighlight(), rhighlight() requires that the user defines a return type of the query, usually a TABLE OF collection, because with a SYS_REFCURSOR argument there is no option to know the return type of the query at compilation time. Both functions support the highlighting parameters introduced in § 3.4.7 . Here are two examples of highlighting by pipeline table functions:
SELECT * FROM
TABLE(phighlight(
'EMAILBODYTEXT',
'lucene OR mysql',
'SUBJECT,BODYTEXT',
'select /*+ DOMAIN_INDEX_SORT FIRST_ROW */ lscore(1) sc,e.*
from eMails e where lcontains(bodytext,''security OR mysql'',''subject:ASC'',1)>0'
));
SELECT * FROM
TABLE(rhighlight(
'EMAILBODYTEXT',
'lucene OR mysql',
'SUBJECT,BODYTEXT',
'EMAILRSET',
CURSOR(select /*+ DOMAIN_INDEX_SORT FIRST_ROW */ lscore(1) sc,e.*
from eMails e where lcontains(bodytext,'security OR mysql','subject:ASC',1)>0)
));
The first three arguments of both pipeline functions read the same: the LDI index used, the Lucene query expression (that should match the lcontains() argument in the query) and finally the column list to be highlightened. The last argument for phighlight() is a VARCHAR2 type that transfers the SQL query to be executed by DBMS_SQL package. Note the additional single quotes used as an escape character. For rhighlight() two further arguments are required. The type returned by the cursor, EMAILRSET, that is a collection of the EMAILR record which holds all columns of the table EMAILS plus the score returned by the lscore() function (see the example below). And finally, the last argument is of CURSOR type which means any SQL query.
CREATE TYPE EMAILR AS OBJECT (
sc NUMBER,
emailFrom VARCHAR2(256),
emailTo VARCHAR2(256),
subject VARCHAR2(4000),
emailDate DATE,
bodyText CLOB
);
CREATE OR REPLACE TYPE EMAILRSET AS TABLE OF EMAILR;
3.4.9 More like this functionality
The more like this functionality of Lucene is provided in the LDI package MoreLike, funtion this (again overloaded to allow for an additional owner name parameter) as follows.
FUNCTION this(index_name IN VARCHAR2,
x IN ROWID,
f IN NUMBER DEFAULT 1,
t IN NUMBER DEFAULT 10,
minTermFreq IN NUMBER DEFAULT 2,
minDocFreq IN NUMBER DEFAULT 5) RETURN sys.odciridlist;
A typical use case may look like this, where the anonymous PL/SQL block gets the first ROWID returned from the first query as pivot element and then expands the result set with other rows that also include terms like “procedure (C, Java or PL/SQL), optionally qualified”. Note that the “C” token will not be taken into account because it is regarded a stop word. Refer to the Appendix D.6 for a full explanation of each parameter.
select rowid,lscore(1),text from test_source_big
where lcontains(text,'"procedure java"~10',1)>0 order by lscore(1) desc;
AAAOaPAAEAAAAnnABV 1.00000003 procedure (C, Java or PL/SQL), optionally qualified
AAAOaPAAEAAAA0aAAV .84852819 STATIC PROCEDURE refreshParameterCache as LANGUAGE JAVA NAME
...
declare
ridlist sys.odciridlist;
begin
ridlist := MoreLike.this(index_name=>'SOURCE_BIG_LIDX',x=>'AAAOaPAAEAAAAnnABV',minTermFreq=>1);
FOR i IN (
select rowid,text from test_source_big
where rowid in (select * from table(ridlist_table(ridlist)))
) LOOP
dbms_output.put_line('rowid: '||i.rowid||' text: '||i.text);
END LOOP;
end;
/
rowid: AAAOaPAAEAAAAhLAAc text: after issuing insert, update, delete or anonymous PL/SQL calls
rowid: AAAOaPAAEAAAAjrAAo text: QUALIFIED_SQL_NAME
rowid: AAAOaPAAEAAAAk5AAe text: ORA-06502: PL/SQL: numeric or value error: character string buffer
...
rowid: AAAOaPAAEAAAAtXAAb text: The name of the Java class, PL/SQL package or object type implementing
3.4.10 Facets
Starting with Lucene Domain Index release 2.4.1.1.0, Lucene’s facets functionality is available through a SQL aggregate function lfacets(). The input parameter is an encoded string containing the LDI (schema.)index name and a list of categories. The aggregated function only accepts a simple scalar value as an input argument so we need to encode the index name and categories list in a comma separated value. Using the index created in § 2.5, some categories in Lucene query Syntax are prefixed by TEXT:, according to the actually indexed column, and carry “procedure” as the main and “java” as the sub category, respectively.
CREATE OR REPLACE function lfacets(input varchar2)
return agg_tbl
parallel_enable aggregate using facets_agg_type;
/
select lfacets('SOURCE_BIG_LIDX,TEXT:procedure,TEXT:java') from dual;
Creating a table with categories and linking the rows in a parent-child-relationship is an option to automatically generate facets, for example:
create table source_categories (
cat_code number(4),
cat_name varchar2(256),
cat_parent number(4),
CONSTRAINT PK_SOURCE_CATEGORIES PRIMARY KEY (cat_code),
CONSTRAINT FK_CAT_PARENT FOREIGN KEY (cat_parent)
REFERENCES source_categories (cat_code)
);
insert into source_categories values (1,'TEXT:procedure',null);
insert into source_categories values (2,'TEXT:function',null);
...
insert into source_categories values (6,'TEXT:java',1);
insert into source_categories values (7,'TEXT:(pl sql)',1);
insert into source_categories values (8,'TEXT:wrapped',1);
...
insert into source_categories values (21,'line:[1 TO 1000]',1);
insert into source_categories values (22,'line:[1001 TO 2000]',1);
insert into source_categories values (23,'line:[2001 TO 3000]',1);
Now we can query the above table by executing a call to lfacets(), passing a category and sub a category. Note that we are using the ljoin() function which will convert the agg_tbl type to a comma separated string current plus hit cardinality. The first row returned does not have a sub category because the parent column value is null. The trailing 5116, in parantheses, is the number of rows that match the token “procedure”. The TEXT:procedure,line:[1001 TO 2000] result implies an logical intersection between the set of rows that include the token “procedure” against the set of rows that match “line[1001 TO 2000]“. The group by cat_code causes the Oracle ODCI API to first calculate the bit set for “procedure” and then iterate over all the sub categories “java”, “pl sql”, “wrapped”, doing bit calculations. This is fast and once the facets is computed it is stored as a filter in the Lucene Domain Index memory structures.
select ljoin(lfacets('SOURCE_BIG_LIDX,' ||
case level when (1) then cat_name
else prior cat_name || ',' || cat_name end)) facet,
cat_code, level
from source_categories
start with cat_parent is null
connect by prior cat_code = cat_parent
group by cat_code,level;
FACET CAT_CODE LEVEL
--------------------------------------------------------
TEXT:procedure(5116) 1 1
TEXT:function(5574) 2 1
TEXT:trigger(96) 3 1
TEXT:package(860) 4 1
TEXT:(object type)(5140) 5 1
TEXT:procedure,TEXT:java(9) 6 2
.....
TEXT:procedure,line:[1 TO1000](3) 21 2
TEXT:procedure,line:[1001 TO2000](615) 22 2
...
When a number of rows or the amount of categories is quite large, one can use a materialized view to work as cache for the facets computation. Such a materialized view can be queried as any other table and the access will be too fast (and can be indexed as well).
CREATE MATERIALIZED VIEW source_facets
AS
select ljoin(lfacets('SOURCE_BIG_LIDX,' ||
case level when (1) then cat_name
else prior cat_name || ',' || cat_name end)) facet,
cat_code, level
from source_categories
start with cat_parent is null
connect by prior cat_code = cat_parent
group by cat_code,level;
gennff: __rework marker__
3.4.11 Terms pipeline table functions
Starting with Lucene Domain Index 2.9.1.1.0, two pipeline table functions has been included to iterate over terms of Lucene Index structure, high_freq_terms():
FUNCTION high_freq_terms(index_name VARCHAR2,
term_name VARCHAR2,
num_terms NUMBER) RETURN term_info_set
is available for getting the Top-N (num_terms) most used terms on the whole index or in a particular field. term_info_set is defined as:
TYPE term_info AS OBJECT (
term VARCHAR2(4000),
docFreq NUMBER(10)
);
TYPE term_info_set AS TABLE OF term_info;
You can query your index by using:
select * from table(high_freq_terms('SOURCE_BIG_LIDX','TEXT',10));
select * from table(high_freq_terms('SOURCE_BIG_LIDX',null,10));
select * from table(high_freq_terms('SOURCE_BIG_LIDX','line',100));
and, index_terms():
FUNCTION index_terms(index_name VARCHAR2,
term_name VARCHAR2) RETURN term_info_set
select * from table(index_terms('SOURCE_BIG_LIDX','TEXT')) order by docFreq desc;
select * from table(index_terms('SOURCE_BIG_LIDX','TEXT'));
select * from table(index_terms('SOURCE_BIG_LIDX',null)) where rownum select * from (select * from table(index_terms('SOURCE_BIG_LIDX','line')) order by docFreq desc) where rownum
on both functions if argument term is NULL, these functions will iterate over all index terms. The natural order for high_freq_terms() is descendent by docFreq, but index_terms() is ordered by term_name:term_value ascending. Note that if you pass a non NULL value to term to starts with the first value for the specific term index_terms() do not stop when all the values of this term are completed, this functionality is similar to Lucene Java method reader.terms(new Term(term)). Here example if you want only iterate on an specific term name:
BEGIN
FOR term_rec IN (SELECT * FROM table(index_terms('SOURCE_BIG_LIDX','line'))) LOOP
/* Fetch from cursor variable. */
EXIT WHEN substr(term_rec.term,1,length('line'))<>'line'; -- exit when last row is fetched
-- process data record
dbms_output.put_line('Name = ' || term_rec.term || ' ' || term_rec.docFreq);
END LOOP;
END;
/
You can use index_terms() to get the Top-N terms order by docFreq, for example:
select * from (select * from table(index_terms('SOURCE_BIG_LIDX',null))
order by docFreq desc) where rownumselect * from table(high_freq_terms('SOURCE_BIG_LIDX',null,10)); TEXT:in 24952 TEXT:varchar 16996 ... TEXT:return 6241 Elapsed: 00:00:00.02
Two queries are equivalent semantically but high_freq_terms() is more efficient because it uses TermInfoQueue structure for sorting, caches his computation one is executed and do not creates a lot of term_info objects which then are sorted by the RDBMS engine. lfreqterms ancillary operator is a complimentary function which requires a column indexed with ANALYZED_WITH_VECTORS option in FormatCols index parameter. Example usage:
select TERMVECTOR TXT, LFREQTERMS(1) FQTERMS from t1 where lcontains(termvector,'two',1)>0;
TXT FQTERMS
--- -------
one two two three three three LUCENE.TERM_INFO_SET('LUCENE.TERM_INFO(one,1)','LUCENE.TERM_INFO(three,3)','LUCENE.TERM_INFO(two,2)')
one two two three three three LUCENE.TERM_INFO_SET('LUCENE.TERM_INFO(one,1)','LUCENE.TERM_INFO(three,3)','LUCENE.TERM_INFO(two,2)')
one two two three three three LUCENE.TERM_INFO_SET('LUCENE.TERM_INFO(one,1)','LUCENE.TERM_INFO(three,3)','LUCENE.TERM_INFO(two,2)')
two two three three three four four four four LUCENE.TERM_INFO_SET('LUCENE.TERM_INFO(four,4)','LUCENE.TERM_INFO(three,3)','LUCENE.TERM_INFO(two,2)')
3.4.12 Did You Mean functionality
Starting with Lucene Domain Index 2.9.2.1.0, Did You Mean Lucene functionality was added as an extended LDI property using the Lucene SpellChecker library to create the dictionary index from the main index. Finaly, the dictionary index will be merged to the main index.
PROCEDURE indexDictionary(
index_name IN VARCHAR2,
spellColumns IN VARCHAR2 DEFAULT null,
distancealg IN VARCHAR2 DEFAULT 'Levenstein')
is available to create the dictionary index to be merged with main index.
You can create the dictionary by using:
call didyoumean.indexdictionary('SOURCE_BIG_LIDX');
Call completed.
Elapsed: 00:01:11.61
exec didyoumean.indexdictionary('EMAILBODYTEXT','BODYTEXT,subject,emailFrom,emailTo','NGram');
PL/SQL procedure successfully completed.
Elapsed: 00:00:01.62
Only index_name is mandatory. If spellColumns parameter is NULL, the master column of the main index will be used. By default Levenstein Distance Algorithm (a.k.a. edit distance) is applied (other options are Jaro – Jaro Winkler metric – and Ngram distance).
Note: The dictionary structure create the “word”, “gramN”, “startN” and “endN” Lucene fields, so be carefull if you have this fieds in the main index. The structure of this index is (for a 3-4 gram) this:
Index Structure
Example
word
kings
gram3
kin, ing, ngs
gram4
king, ings
start3
kin
start4
king
end3
ngs
end4
ings
and,
FUNCTION suggest (
index_name IN VARCHAR2,
cmpval IN VARCHAR2,
highlight IN VARCHAR2 DEFAULT null,
distancealg IN VARCHAR2 DEFAULT 'Levenstein'
) RETURN VARCHAR2;
is available to query the dictionary index. You can query the dictionary by using:
select didyoumean.suggest('SOURCE_BIG_LIDX','sorce') suggestion from dual;
SUGGESTION
----------
source
Elapsed: 00:00:00.31
select didyoumean.suggest('SOURCE_BIG_LIDX','sorce','b') suggestion from dual;
SUGGESTION
----------
<strong>source</strong>
Elapsed: 00:00:00.09
select didyoumean.suggest('SOURCE_BIG_LIDX','sorce','b','Jaro') suggestion from dual;
SUGGESTION
----------
<strong>source</strong>
Elapsed: 00:00:00.07
select didyoumean.suggest('EMAILBODYTEXT','lucene searhc','i') suggestion from dual;
SUGGESTION
----------
lucene <em>search</em>
Elapsed: 00:00:00.06
select didyoumean.suggest('EMAILBODYTEXT','lucine injetion','b','Levenstein') suggestion from dual;
SUGGESTION
----------
<strong>lucene</strong> <strong>injection</strong>
Elapsed: 00:00:00.06
The index_name parameter and the word to respell (cmpval) parameter are mandatory. You can define, optionaly, the highlight to be used (e.g. b for bold, i for italic, etc.) and define the distance algorithm to apply.
Pipeline table lautocomplete (
index_name IN VARCHAR2,
term_name VARCHAR2 DEFAULT NULL,
term_value VARCHAR2 DEFAULT '__ALL__',
num_terms NUMBER DEFAULT 10
) RETURN term_info_set;
is available to query the dictionary index. You can query the dictionary by using:
select * from table(lautocomplete('DICC_LIDX','TERM','th',15)) t;
TERM DOCFREQ
----------------------------- ------
there 3
theory 2
thaw 2
then 2
therefore 2
thence 2
....
thank 1
Pipeline table ldidyoumean (
index_name IN VARCHAR2,
cmpval VARCHAR2,
numSug NUMBER DEFAULT 10,
highlight VARCHAR2 DEFAULT null,
distancealg VARCHAR2 DEFAULT 'Levenstein') RETURN term_info_set;
is available to query the dictionary index. You can query the dictionary by using:
select * from table(ldidyoumean('DICC_LIDX', 'atention', 5, 'b', 'Levenstein')) t;
TERM DOCFREQ
-------------------- -------
<strong>attention</strong> 2
<strong>intention</strong> 1
<strong>detention</strong> 1
<strong>mention</strong> 3
<strong>attenuation</strong> 1
lsimilarity ancillary operator computes the similarity between the indexed column and the query passed to lcontains using Levenstein Distance algorithm. Example:
select lsimilarity(1) sim1, term t from dicc where lcontains(term, 'there~0.7', 1) > 0; SIM1 T ---- _ 1 there 1 there 1 there 0.800000011920928955078125 where 0.800000011920928955078125 theme 0.800000011920928955078125 here 0.800000011920928955078125 here select lsimilarity(1) sim1, term t from dicc where lcontains(term, 'there*', 1) > 0 SIM1 T ---- - 1 there 0.5555555820465087890625 therefore 1 there 1 there 0.5555555820465087890625 therefore
3.5 Synchronize
Working with SyncMode:Deferred you has to manually synchronize your index, it means update Lucene Domain Index structure applying pending changes such as insert and update. Deletes operations are always applied due ODCI Api do not accept rowid of deleted rows.
Here an example:
begin
LuceneDomainIndex.sync('IT1');
commit; -- release locks
end;
/
LuceneDomainIndex.sync procedure requires an argument of type VARCHAR2 with the index object name, index object name are usually capitalized and have the syntax SCHEMA_OWNER.IDX_NAME.
Synchronize operation could raise an exception if some rows being indexed are locked for update, in that case you have release first locked rows and re-sync the index.
An exclusive lock at Lucene Index storage is obtained during index synchronization, so you has to commit or rollback the connection immediately after this operation to release exclusive lock.
Since Lucene Domain Index 2.4.0.1.0 you can use LuceneDomainIndex.sync(‘IT1′) or LuceneDomainIndex.sync(USER,’IT1′), both procedure are equivalent.
Note: Due a limitation on SYS.ODCIRidList() array you can enqueue more than 32767 additions or deletions, an update is counted as one deletion plus one addition by Lucene implementation code. This limitation will be removed in future releases of Lucene Domain Index.
3.6 Optimize
Optionally you can optimize Lucene Index storage, for doing that execute:
begin
LuceneDomainIndex.optimize('IT1');
commit; -- release locks
end;
/
Like sync operation this procedure get an exclusive lock at Lucene Index storage table and perform an optimization of Lucene Index merging multiples segment in new one for example. You can still performing select operation (read-only) using Lucene Domain Index during optimization time, Oracle concurrency system (redo logs) provides you this functionality, once you perform a commit operation any other concurrent session will automatically see index changes.
3.7 XMLDB Export
Lucene Domain Index provides for a raw dump of the Lucene index directory files. You can perform this task by means of an Oracle XMLDB export operation that will create xdb resources, which in turn can be accessed by a variety of methods such as FTP, HTTP or WebDAV.
Before writing the raw dump to XMLDB, go and check the accessibilty of the access method in question. For FTP in Oracle 11g, for example, access to the protocol is switched off by default, having the FTP port initially at value of 0. You can check the current setting of the FTP port using this statement:
-- SYSDBA
select extractValue(DBMS_XDB.cfg_get(),
'/xdbconfig/sysconfig/protocolconfig/ftpconfig/ftp-port/text()')
from dual;
Resetting the port to an appropriate value goes like this. Test your FTP immediately after resetting the port! You should at least be able to read/see the ftp://thathost:1531/public directory.
-- SYSDBA
DECLARE
newconfig XMLType;
BEGIN
SELECT
updateXML(DBMS_XDB.cfg_get(),
'/xdbconfig/sysconfig/protocolconfig/ftpconfig/ftp-port/text()', 1531)
INTO newconfig
FROM DUAL;
DBMS_XDB.cfg_update(newconfig);
COMMIT;
END;
/
-- or for short in 11g
exec dbms_xdb.setFtpPort(1531);
gennff: it appears nondistinctive to the reader why the method does not comprise the commit with its body. is there a special reason that necessitated this design decision?
The next step is to execute the utility method LuceneDomainIndex.xdbExport() that may be called with parameters (schemaname, indexname) or just (indexname) (the latter will derive the schema name from the index object). It is essential to execute the commit; afterwards because the method itself will not commit and will keep holding all incoming locks until commiting (or rolling back):
begin
LuceneDomainIndex.xdbExport('IT1');
commit; -- makes change visible to Ftp or WebDAV
end;
/
For an index IT1 of user SCOTT, the file resources will be available at this xdb directory: ftp://thathost:1531/public/lucene/SCOTT.IT1. Once you copied the files to some public file system, you can open the Lucene index with any Lucene compatible application like, for example, Luke. Here are some screen shots of Luke, analyzing several properties of a Lucene index.
3.8 Exporting/Importing functional index with exp/imp Oracle tools
You can perform an Oracle exp operation for your Lucene Domain Index. Oracle exp tool performs by default functional index for every table being exported during the backup process. As I mention early Lucene Domain Index creates a table named IDX_NAME$T which have Lucene file storage replaced by BLOB, also a DBMS AQ is created during the index creation time, this queue is associated to a table IDX_NAME$QT, both tables have a flag marked as SECONDARY, which means that you can not export these tables alone, but they are automatically includes when Lucene Domain Index is included into the export.
During import operation Oracle re-create the index using a create index … parameters(‘your lucene parameters’) DML statement, all Lucene Domain Index parameters are included except for the parameter PopulateIndex which always is stored as false into Oracle System’s views. This parameter is altered intentional by Lucene Domain Index because if its set to true, during import operation Lucene Domain Index will try to re-create the Lucene Index structure instead of using the information restored into IDX_NAME$T table.
Alternative to XMLDB Export or Oracle exp tool you can also exports your Lucene Domain Index storage using a create table as … DML statement. For example:
create table SOURCE_BIG_LIDX$T$BK as (select * from SOURCE_BIG_LIDX$T);
You can export now using exp tool SOURCE_BIG_LIDX$T$BK because is regular table:
-bash-3.2$ exp Export: Release 10.2.0.3.0 - Production on Fri Mar 27 02:46:18 2009 Copyright (c) 1982, 2005, Oracle. All rights reserved. Username: scott/tiger Connected to: Oracle Database 10g Release 10.2.0.3.0 - Production Enter array fetch buffer size: 4096 > Export file: expdat.dmp > SOURCE_BIG_LIDX_BK.dmp (2)U(sers), or (3)T(ables): (2)U > 3 Export table data (yes/no): yes > yes Compress extents (yes/no): yes > Export done in US7ASCII character set and AL16UTF16 NCHAR character set server uses AL32UTF8 character set (possible charset conversion) About to export specified tables via Conventional Path ... Table(T) or Partition(T:P) to be exported: (RETURN to quit) > SOURCE_BIG_LIDX$T$BK . . exporting table SOURCE_BIG_LIDX$T$BK 19 rows exported Table(T) or Partition(T:P) to be exported: (RETURN to quit) > Export terminated successfully without warnings.
Now you can drop your index and re-create again without populating it:
select count(*) from test_source_big where lcontains(text,'function')>0;
COUNT(*)
--------
6167
drop index SOURCE_BIG_LIDX;
Index dropped.
create index source_big_lidx on test_source_big(text)
indextype is lucene.LuceneIndex
parameters('PopulateIndex:false;
AutoTuneMemory:true;
Analyzer:org.apache.lucene.analysis.SimpleAnalyzer;
MergeFactor:500;
FormatCols:line(0000);
ExtraCols:line "line"');
Index created.
drop table SOURCE_BIG_LIDX$T$BK;
Table dropped.
Restore your .dmp now and check again if your index returns a correct result:
-bash-3.2$ imp scott/tiger Import: Release 10.2.0.3.0 - Production on Fri Mar 27 02:49:40 2009 Copyright (c) 1982, 2005, Oracle. All rights reserved. Connected to: Oracle Database 10g Release 10.2.0.3.0 - Production Import file: expdat.dmp > SOURCE_BIG_LIDX_BK.dmp Enter insert buffer size (minimum is 8192) 30720> Export file created by EXPORT:V10.02.01 via conventional path import done in US7ASCII character set and AL16UTF16 NCHAR character set import server uses AL32UTF8 character set (possible charset conversion) List contents of import file only (yes/no): no > Ignore create error due to object existence (yes/no): no > Import grants (yes/no): yes > Import table data (yes/no): yes > Import entire export file (yes/no): no > yes . importing SCOTT's objects into SCOTT . importing SCOTT's objects into SCOTT . . importing table "SOURCE_BIG_LIDX$T$BK" 19 rows imported Import terminated successfully without warnings.
Check first that your index do not have information and populate them with Lucene Index information:
conn scott/tiger
Connected.
select count(*) from test_source_big where lcontains(text,'function')>0;
COUNT(*)
--------
0
truncate table SOURCE_BIG_LIDX$T;
Table truncated.
insert into SOURCE_BIG_LIDX$T (select * from SOURCE_BIG_LIDX$T$BK);
19 rows created.
exit
..... and connect again to refresh Lucene Domain Index in memory structures ....
conn scott/tiger
Connected.
select count(*) from test_source_big where lcontains(text,'function')>0;
COUNT(*)
----------
6167
As you can see the Lucene Domain Index structure can be export alone without exporting the master table, this is useful when you are upgrading Lucene Domain Index that requires that all index need to be dropped first and you don’t want to re-create a very big index.
Doc Links
Previous / LDI Docs – 2 Installing and Testing
Next / LDI Docs – 4 Locking and Performance




Recent Comments