LDI Docs – Appendix E (Project Change Log)

E Project Change Log

3.5.0.1.0 Second Release based on Lucene 3 (3.5.0, 23/Jan/12) core base

  • Fully customizable Searcher and Updater process to enable JMX monitoring and time outs
  • Updated to latest spellchecker interfaces
  • Fixed missing changes to work with backported version on 10g

3.4.0.1.0 Second Release based on Lucene 3 (3.4.0, 25/Nov/11) core base

  • Use latest merge policy implementation TieredMergePolicy
  • Use total RAM reported by getJavaPoolSize() when setting MaxBufferedDocs
  • Better error reporting when an Analyzer is not found
  • Replaced execute immediate with open-fech-close functionality to avoid core dump on 10g when double check for deleted rowid
  • Included a backported version of JUnit4 to jdk1.4 version for 10g releases
  • Added a parallel updater process, when working in OnLine mode this process do write operations on LDI structure on behalf of the AQ process
  • Delete do not longer required a write exclusive lock on index storage, now deletes are also enqueued as inserts or updates
  • Updated source to Lucene 3.4.0 code, removed some deprecated API

3.0.2.1.0 Initial Release based on Lucene 3 (3.0.2, 14/Sep/10) core base

  • Added a long awaited functionality, a parallel/shared/slave search process used during a start-fetch-close and CountHits function
  • Added lfreqterms ancillary operator returning the freq terms array of rows visited
  • Added lsimilarity ancillary operator returning a computed Levenshtein distance of the row visited
  • Added a ldidyoumean pipeline table function using DidYouMean.indexDictionary storage
  • Added test using SQLUnit

3.0.1.1.0 Initial Release based on Lucene 3 (3.0.1) core base

  • Added IndexOnRam functionality by using RAMDirectory for building intermediate index after merging a new set of rows
  • RewriteScore and SimilarityMethod can be used to get better result using wildcard operator
  • Added auto complete functionality using lautocomplete pipe-line table function
  • Removed all deprecated method usage and implementations
  • New CVS repository only compatible with 11g, 10g version is implemented by using a retro-translator

2.9.2.1.0 Production Release based on Lucene 2.9 (2.9.2) core base

  • Added elapsed time information when log level is INFO
  • Removed deprecated usage of LUCENE_CURRENT constant
  • Fixed facets inconsitence due ignore internal parameter ColName
  • Initial implementation of DidYouMean functionality contributed by Pedro Pinheiro
  • Temporary fix until Lucene defines clear semantics for Directory.fileLength (see Lucene issue 2316)

2.9.1.1.0 Production Release based on Lucene 2.9 (2.9.1) core base

  • New Lucene Core base libraries
  • Full Lucene Test Suites certified
  • Fixed bug enqueue more rowids than required when using OnLine mode and ExtraTabs, WhereCondition parameters
  • Fixed operator priority when WhereCondition have OR operator
  • DefaultUserDataStore now uses an array of cached fields to improve performance
  • Spanish Analyzer use latest ASCIIFoldingFilter
  • high_freq_terms(idx_name,term,max_num_term) pipeline table function was added to return high frequent terms and the associated docFreq value
  • index_terms(idx_name,term) pipeline table function was added to return a list of terms and their associated frequency
  • DefaultUserDataStore now have support for ANALYZED, ANALYZED_WITH_VECTORS, ANALYZED_WITH_OFFSETS, ANALYZED_WITH_POSITIONS and ANALYZED_WITH_POSITIONS_OFFSETS Lucene Field option values
  • OJVMLock was replaced by SingleInstanceLockFactory for per instance locking, cross sessions lockings are implemented by select for update functionality
  • an automatic upgrade from 2.9.0 is possible without Index deletions or rebuild, you have to execute:
    ant upgrade-domain-index
    ant ncomp-lucene-ojvm (10g only)
    ant jit-lucene-classes (11g only)
    

2.9.0.1.0 Production release based on Lucene 2.9.0 core base, 29/Sep/09

  • Tested with Oracle 11gR2, 11gR1 and 10.2 databases
  • DefaultUserDataStore do a SAX parsing to get text nodes and attributes from an XMLType value.
  • A SimpleLRUCache is used to load rowids and his associated Lucene doc id, this reduce memory consumption when querying very big tables. A new parameters has been added, CachedRowIdSize by default 10000 to control the size of the LRU cache.
  • Lucene Domain Index core was updated to use TopFieldCollector and to avoid computation time when lscore() is not used.
  • Two new parameter has been added NormalizeScore which control when to track the Max Score and PreserveDocIdOrder when querying, both parameters are consequence of new Lucene Collector API and boost performance when querying.
  • A table alias L$MT is defined for the master table associated to the index to be used in complex queries to associate columns from master tables and columns from dependent tables

2.4.1.1.0 (maintenance release based on Lucene 2.4.1, 27/Mar/09)

  • Do not store internal parameters into system’s views and force to PopulateIndex:false
  • After every sync, now files marked as deleted are purged to free BLOB storage
  • Added lfacets aggregated function for doing facets
  • CountHits function no longer requires sort argument
  • Filter are stored/retrived only using QueryParser.toString() key
  • UN_TOKENIZED format string at DefaultUserDataStore class was replaced by NOT_ANALYZED or NOT_ANALYZED_STORED according to new Lucene definitions.
  • Fix bug when sync try to process more than 32767 rowids enqueued.
  • Added parameters for highlighting functions Formatter, MaxNumFragmentsRequired, FragmentSeparator and FragmentSize.
  • Added PerFieldAnalyzer parameter to use independent Analyzer for each columns.
  • Added sample of a custom Formatter org.apache.lucene.search.highlight.MyHTMLFormatter

2.4.1.0.0 (first release based on Lucene 2.4.1, 9/Mar/09)

  • Fix compatibility problem between 10g/11g SQL Date representation on pipeline table function.

2.4.0.1.0 (maintenance release based on Lucene 2.4.0, 10/Jan/09)

  • Added Rhighlight(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, rType IN VARCHAR2, rws IN SYS_REFCURSOR) RETURN ANYDATASET pipeline table function
  • Added Phighlight(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, stmt IN VARCHAR2) RETURN ANYDATASET pipeline table function
  • Added lhighlight(NUMBER):VARCHAR2 ancilliary operator
  • Removed usage of Lucene deprecated API (Hits and IndexWriter for example)
  • Usage of FIRST_ROWS optimizer hits to decide how many rows load at first time
  • sync, optimize and rebuild interfaces now use index_name or [owner,index_name] arguments
  • A better build system to build Lucene Domain Index from sources
  • More tests
  • Tested against 11.1.0.7 and 10.2.0.3
  • See online docs to see usage of FIRST_ROWS and lhighlight() operator

2.4.0.0.0 (production release based on Lucene 2.4.0, 10/10/08)

  • Added parameter for CLOB enconding
  • More Like this function
  • NGram analyzer
  • EnglishWikipediaAnalyzer
  • DataStore interface include API for setting current connection
  • Now analyzers, queries, snowball and WikiPedia contrib packages are required

2.3.2.0.0 (binary release based on Lucene 2.3.2, 1/Jun/08)

  • Compiled against Lucene 2.3.2 production release
  • Used latest API for merging based on RAM usage
  • Use Writer for deleting during Sync
  • Confirm 4x improvement during indexing reported by Lucene dev group
  • Fix workaround which changes order of the rowids in ODCRIDList
  • Added an Spanish WikiPedia Analyzer for testing
  • Reports IOException instead of RunTimeException to signal EOF or File Not Found
  • Decouple Flush functionality from TableIndexer

2.2.0.2.2 (fixpack for 2.2.0.2.0 release, 5/Apr/08)

  • Added Rowid to lucene doc id caching.
  • Usage of LoadFirstFieldSelector during Document loading to only load rowid field.
  • Added a test suite which index a wikipedia dump inside the OJVM.

2.2.0.2.1 (fixpack for 2.2.0.2.0 release, 12/Dec/07)

  • DefaultUserDataStore requires usage of XPath text() expresion for getting only textual value
  • Added logging info SQL being executed at table indexer
  • Change document logging to FINER level
  • More pre-defined mapping at DefaultUserDataStore for NUMBER, BINARY_FLOAT, BINARY_DOUBLE, TIMESTAMP, TIMESTAMPTZ and TIMESTAMPLTZ Oracle types.
  • New parameter PopulateIndex:[true|false] for populating or not Lucene Index at creation time.
  • New parameter IncludeMasterColumn:[true|false], to choose whether or not index master column, useful with Virtual Columns and XMLType.
  • New parameter BatchCount:integer, to choose how many rows count are enqueued for indexing using create … index … parameters(‘SyncMode:OnLine’);
  • Creating an index with SyncMode:OnLine causes that LuceneDomain index will enqueue batchs of “BatchCount” rows for index by AQ PLSQL callback in background. Lucene Domain Index is intermediately ready for querying after create.
  • Batch rowid indexing is doing using a pipeline function.

2.2.0.2.0 (third major release synchronized with Lucene 2.2.0, 12/Dec/07)

http://sourceforge.net/project/showfiles.php?group_id=56183

# CVS access:
cvs -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism login
cvs -z3 -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism co -P ojvm

  • sort by column passed at lcontains(col,query_parser_str,sort_str,corr_id) syntax
  • Logging support using Java Util Logging package
  • JUnit test suites emulating middle tier environment
  • Support for rebuild and optimize online for SyncMode:OnLine index
  • XMLDB Export
  • AutoTuneMemory parameter for replacing MaxBufferedDocs parameter
  • Functional column support

2.2.0.1.1 (second release, 27/Sep/07 05:39 AM)

https://issues.apache.org/jira/secure/attachment/12366661/ojvm-09-27-07.tar.gz

# CVS access:
cvs -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism login
cvs -z3 -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism co -P ojvm

  • LuceneDomainIndex.countHits() function to replace select count from .. where lcontains(..)>0 syntax.
  • support inline pagination at lcontains(col,’rownum:[n TO m] AND …”) function
  • rounding and padding support for columns date, timestamp, mumber, float, varchar2 and char
  • ODCI API array DML support
  • BLOB parameter support

2.2.0.1.0 (first release synchronized with lucene 2.2.0, 14/Sep/07 06:44 AM)

# CVS access:
cvs -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism login
cvs -z3 -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism co -P ojvm

  • Synchronized with latest Lucene 2.2.0 production
  • Replaced in memory storage using Vector based implementation by direct BLOB IO, reducing memory usage for large index.
  • Support for user data stores, it means you can not only index one column at time (limited by Data Cartridge API on 10g), now you can index multiples columns at base table and columns on related tabled joined together.
  • User Data Stores can be customized by the user, it means writing a simple Java Class users can control which column are indexed, padding used or any other functionality previous to document adding step.
  • There is a DefaultUserDataStore which gets all columns of the query and built a Lucene Document with Fields representing each database columns these fields are automatically padded if they have NUMBER or rounded if they have DATE data, for example.
  • lcontains() SQL operator support full Lucene’s QueryParser syntax to provide access to all columns indexed, see examples below.
  • Support for DOMAIN_INDEX_SORT and FIRST_ROWS hint, it means that if you want to get rows order by lscore() operator (ascending,descending) the optimizer hint will assume that Lucene Domain Index will returns rowids in proper order avoided an inline-view to sort it.
  • Automatic index synchronization by using AQ’s Call Back.
  • Lucene Domain Index creates extra tables named IndexName$T and an Oracle AQ named IndexName$Q with his storage table IndexName$QT at user’s schema, so you can alter storage’s preference if you want.
  • ojvm project is at SourceForge.net CVS, so anybody can get it and collaborate
  • Tested against 10gR2 and 11gR1 database.

2.0.0.1.3 (third release, 09/Jan/07 11:40 AM)

https://issues.apache.org/jira/secure/attachment/12348574/ojvm-01-09-07.tar.gz

  • The Data Cartridge API is used without column data to reduce the data stored on the queue of changes and speedup the operation of the synchronize method.
  • Query Hits are cached associated to the index search and the string returned by the QueryParser.toString() method.
  • If no ancillary operator is used in the select, do not store the score list.
  • The “Stemmer” argument is recognized as parameter given the argument for the SnowBall analyzer, for example:
    create index it1 on t1(f2) indextype is lucene.LuceneIndex parameters('Stemmer:English');
    
  • Before installing the ojvm extension is necessary to execute “ant jar-core” on the snowball directory.
  • The IndexWriter.setUseCompoundFile(false) is called to use multi file storage (faster than the compound file) because there is no file descriptor limitation inside the OJVM, BLOBs are used instead of File.
  • Files are marked for deletion and they are purged when calling to Sync or Optimize methods.
  • Blob are created and populated in one call using Oracle SQL RETURNING information.
  • A testing script for using OE sample schema, with query comparisons against Oracle Text ctxsys.context index.

2.0.0.1.2 (second release, 20/Dec/06 02:03 PM)

https://issues.apache.org/jira/secure/attachment/12347614/ojvm-12-20-06.tar.gz

  • This new release of the OJVMDirectory Lucene Store includes a fully functional Oracle Domain Index with a queue for update/insert massive operations and a lot of performance improvement.

2.0.0.1.1 (first release, 28/Nov/06 01:04 PM)

https://issues.apache.org/jira/secure/attachment/12345967/ojvm-11-28-06.tar.gz

  • The complet API for the Oracle Domain index was completed, but the solution for the operator contains outside the where clause is not good.
  • I will implement a singleton solution for the OJVMDirectory object when is used in read only mode, typically when user performs select operations against tables which have columns indexed with Lucene. This implementation will increase a lot the final performance because the index reader will be ready for each select operation. Obviously I will check if another user or thread makes a write operation on the index to reload the read-only singleton.
  • The queue for storing the changes on the index is not implemented yet, I’ll add it in a short time.

2.0.0.1.0 (initial implementation, 22/Nov/06 03:45 PM)

https://issues.apache.org/jira/secure/attachment/12345516/ojvm.tar.gz

Doc Links

Previous / LDI Docs – Appendix D (Functions, operators and utilities)

LDI Docs – Appendix D (Functions, operators and utilities)

Doc Links

Previous / LDI Docs – Appendix C (JUnit test suites explained)
Next / LDI Docs – Appendix E (Project Change Log)

LDI Docs – Appendix C (JUnit test suites explained)

C JUnit test suites explained

C.1 DBTestCase base class

This is base class for most of the test suites includes. It provides a connection pool using OracleDataSource with a minimum of two ready to use connection and growing to 5, after this it will wait up to 20 seconds for free connection. This connection pool is created at the class constructor. Utility methods provided by this class, each method use is own SQLConnection, so they are autonomous transactions:

  • createTable(), create a test table as follow, (T1 is a constant value defined as TABLE):
    create table T1 (
      f1 number primary key,
      f2 varchar2(200),
      f3 varchar2(200),
      f4 number);
    
  • dropTable(), drop table created above.
  • createIndex(), add a Lucene Domain Index to previous one created table as follow, (LogLevel,Analyzer,MergeFactor,ExtraCols and FormatCols are customizable at class level, after index creation MergeFactor is reduced to 2):
    create index IT1 on T1(f2)
      indextype is lucene.LuceneIndex
        parameters('LogLevel:WARNING;
          Analyzer:org.apache.lucene.analysis.StopAnalyzer;
          MergeFactor:500;
          ExtraCols:F1;
          FormatCols:F1(0000)')
    
  • dropIndex(), drop previous one index.
  • int insertRows(int startIndex, int endIndex), insert a set of rows at above table with F1 column varying from startIndex to endIndex. F2 column is an english text representation of F1, F4 is F1*10 and F3 is an english text representation of F1*10. Return a number of rows inserted. If there are problems such as primary key violation it rollback the transaction.
  • int deleteRows(int startIndex, int endIndex), delete a set of rows where F1 between startIndex and endIndex. Return a number of rows deleted. If there are problems rollback the transaction. Note that deleting rows automatically update Lucene Index.
  • int updateRows(int startIndex, int endIndex), update F2 column with his own value to fire ODCI update method on each row between startIndex and endIndex.
    Return a number of rows updated.
  • findRows(int n), find rows which F2 match again a text representation of n using lcontains operator. It only test for a result having 0 or more rows.
  • long syncIndex(), perform a sync operation at Lucene Domain Index applying pending changes (inserts, updates). If there are errors, usually caused by another transaction having an exclusive lock in a row being indexed, it rollback the operation. Next successful sync will apply pending changes of failed operations. Return a long value with the amount of milliseconds spent during sync.
  • long optimizeIndex(), perform an optimize operation at Lucene Domain Index merging segments in a new one. If there are errors, usually caused by another transaction having an exclusive lock on the index, it rollback the operation. Return a long value with the amount of milliseconds spent during optimize.

C.2 TestDBIndex

Simple test which create a table his index and performs insertions, sync, optimize and deletions, finally drop index and table. His output look like:

[junit] Testsuite: org.apache.lucene.index.TestDBIndex
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 3.836 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 40 total char inserted: 415 avg text length: 10
[junit] Index synced: IT1 elapsed time: 265 ms.
[junit] Avg Sync time: 6
[junit] Index optimized: IT1 elapsed time: 40 ms.
[junit] Avg Optimize time: 1
[junit] Row deleted 40, from: 10 to: 49 elapsed time: 1303 ms. Avg time: 32 ms.
[junit] Index droped: IT1
[junit] Table droped: T1

C.3 TestDBIndexAddDoc

Performs several insertions and sync, starting with 10 rows, then 90 and so on, ending with 3.000 insertions using insertRow method of DBTestCase base class. After each batch of insertions calls to syncIndex method calculating average time of sync method for each row inserted. His output look like:

[junit] Testsuite: org.apache.lucene.index.TestDBIndexAddDoc
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 64.696 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Index synced: IT1 elapsed time: 126 ms.
[junit] Inserted rows: 10 total char inserted: 49 avg text length: 4
[junit] Index synced: IT1 elapsed time: 142 ms.
[junit] Avg Sync time: 14
[junit] Inserted rows: 90 total char inserted: 988 avg text length: 10
[junit] Index synced: IT1 elapsed time: 374 ms.
[junit] Avg Sync time: 4
[junit] Inserted rows: 400 total char inserted: 9201 avg text length: 23
[junit] Index synced: IT1 elapsed time: 1276 ms.
[junit] Avg Sync time: 3
[junit] Inserted rows: 500 total char inserted: 11726 avg text length: 23
[junit] Index synced: IT1 elapsed time: 1601 ms.
[junit] Avg Sync time: 3
[junit] Inserted rows: 1000 total char inserted: 35950 avg text length: 35
[junit] Index synced: IT1 elapsed time: 4675 ms.
[junit] Avg Sync time: 4
[junit] Inserted rows: 3000 total char inserted: 110851 avg text length: 36
[junit] Index synced: IT1 elapsed time: 25480 ms.
[junit] Avg Sync time: 8
[junit] Index droped: IT1
[junit] Table droped: T1

C.4 TestDBIndexDelDoc

At setup method this test case a create a table and fill it with 500 rows. Then performs deletions batch of 10, 90 and 400 rows each calculating average time for each row deleted. His output look like:

[junit] Testsuite: org.apache.lucene.index.TestDBIndexDelDoc
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 20.543 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 500 total char inserted: 10238 avg text length: 20
[junit] Index synced: IT1 elapsed time: 1643 ms.
[junit] Row deleted 10, from: 1 to: 10 elapsed time: 356 ms. Avg time: 35 ms.
[junit] Row deleted 90, from: 11 to: 100 elapsed time: 2535 ms. Avg time: 28 ms.
[junit] Row deleted 400, from: 101 to: 500 elapsed time: 11526 ms. Avg time: 28 ms.
[junit] Index droped: IT1
[junit] Table droped: T1

C.5 TestDBIndexParallel

This is more complex test case to check concurrent access to Lucene Domain Index. To do this creates several threads, some for simulating batch insertions of 10 rows, others for simulating batch deletions of 10 rows, another for simulating batch updates of 10 rows and finally many threads searching for rows each 0.5 seconds. By default creates 3 threads for each kind of operations and each thread perform:

  • 20 inserts
  • 5 deletes
  • 5 update
  • 100 search

Each thread takes his own connection from the connection pool and do his job, if fastSync constant is true after each successful insert and update it calls to syncIndex method to update Lucene Index, if fastSync is false another thread is started performing sync index each 1 second. It end when all threads (inserts, deletes, updates) finish. Here some part of his output:

[junit] Testsuite: org.apache.lucene.index.TestDBIndexParallel
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 97.7 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] FastSync: true
[junit] Deleter 1 deleting at block 70
[junit] Updater 1 updating at block 70
[junit] Inserter 2 inserting at block 90
[junit] No Row deleted at: 70 to: 79 elapsed time: 131 ms.
[junit] No Row updated at: 70 to: 79 elapsed time: 12 ms.
[junit] Searcher 2 searching row 30
[junit] Searcher 1 searching row 77
[junit] Not Found rows with: thirty  elapsed time: 211 ms.
[junit] Not Found rows with: seventy-seven  elapsed time: 170 ms.
[junit] Inserted rows: 10 total char inserted: 115 avg text length: 11
[junit] Searcher 2 searching row 62
[junit] Searcher 0 searching row 63
[junit] Searcher 1 searching row 49
[junit] Not Found rows with: sixty-two  elapsed time: 64 ms.
[junit] Index synced: IT1 elapsed time: 283 ms.
[junit] Not Found rows with: sixty-three  elapsed time: 215 ms.
[junit] Searcher 2 searching row 74
[junit] Not Found rows with: seventy-four  elapsed time: 39 ms.
[junit] Not Found rows with: forty-nine  elapsed time: 137 ms.
[junit] Searcher 1 searching row 95
[junit] Searcher 2 searching row 46
[junit] Found rows with: ninety-five  elapsed time: 103 ms.
....
[junit] Updater 2 updating at block 20
[junit] No Row updated at: 20 to: 29 elapsed time: 3 ms.
[junit] Inserted rows: 10 total char inserted: 80 avg text length: 8
[junit] Searcher 0 searching row 97
[junit] Found rows with: ninety-seven  elapsed time: 60 ms.
[junit] Index synced: IT1 elapsed time: 147 ms.
.....
[junit] Searcher 2 searching row 39
[junit] Searcher 1 searching row 84
[junit] Not Found rows with: thirty-nine  elapsed time: 33 ms.
[junit] Not Found rows with: eighty-four  elapsed time: 38 ms.
[junit] Updater 0 updating at block 90
[junit] Row updated 10, from: 90 to: 99 elapsed time: 16 ms. Avg time: 1 ms.
[junit] Index synced: IT1 elapsed time: 162 ms.
......
[junit] Inserted rows: 10 total char inserted: 125 avg text length: 12
[junit] Searcher 0 searching row 57
[junit] Searcher 1 searching row 28
[junit] Deleter 1 deleting at block 80
[junit] Searcher 2 searching row 64
[junit] No Row deleted at: 80 to: 89 elapsed time: 58 ms.
[junit] Not Found rows with: twenty-eight  elapsed time: 112 ms.
[junit] Not Found rows with: fifty-seven  elapsed time: 155 ms.
[junit] Index synced: IT1 elapsed time: 242 ms.
[junit] Searcher 0 searching row 98
[junit] Found rows with: ninety-eight  elapsed time: 72 ms.
[junit] Not Found rows with: sixty-four  elapsed time: 175 ms.
[junit] Searcher 0 searching row 27
[junit] Not Found rows with: twenty-seven  elapsed time: 75 ms.
[junit] Searcher 1 searching row 5
[junit] Deleter 2 deleting at block 50
[junit] Searcher 2 searching row 84
[junit] Not Found rows with: eighty-four  elapsed time: 20 ms.
[junit] Updater 2 updating at block 10
[junit] No Row deleted at: 50 to: 59 elapsed time: 28 ms.
[junit] Row updated 10, from: 10 to: 19 elapsed time: 36 ms. Avg time: 3 ms.
[junit] Found rows with: five  elapsed time: 216 ms.
.................
[junit] Inserter 1 inserting at block 50
[junit] Found rows at: 50 position, ignoring insertions
[junit] Index droped: IT1
[junit] Table droped: T1

C.6 TestDBIndexSearchDoc

This test check some special features of lcontains operator such as in-line pagination, sort by and filter by expressions. First create a table with 200 rows and then query them, his output look like:

[junit] Testsuite: org.apache.lucene.index.TestDBIndexSearchDoc
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 14.001 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 746 ms.
[junit] testFilterAll()
[junit] Excecution time: 129 ms.
[junit] 120 Score: 0.9606395 str: one hundred twenty
[junit] 119 Score: 0.25453204 str: one hundred nineteen
[junit] 118 Score: 0.25453204 str: one hundred eighteen
[junit] 117 Score: 0.25453204 str: one hundred seventeen
[junit] 116 Score: 0.25453204 str: one hundred sixteen
[junit] 115 Score: 0.25453204 str: one hundred fifteen
[junit] 114 Score: 0.25453204 str: one hundred fourteen
[junit] 113 Score: 0.25453204 str: one hundred thirteen
[junit] 112 Score: 0.25453204 str: one hundred twelve
[junit] 111 Score: 0.25453204 str: one hundred eleven
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 721 ms.
[junit] testFilterBy()
[junit] Excecution time: 162 ms.
[junit] 103 Score: 1.0 str: one hundred three
[junit] 120 Score: 0.9606395 str: one hundred twenty
[junit] 101 Score: 0.28600293 str: one hundred one
[junit] 100 Score: 0.27352643 str: one hundred
....
[junit] 115 Score: 0.25453204 str: one hundred fifteen
[junit] 116 Score: 0.25453204 str: one hundred sixteen
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 751 ms.
[junit] testFilterByOrderBy()
[junit] Excecution time: 138 ms.
[junit] 120 Score: 0.9606395 str: one hundred twenty
[junit] 119 Score: 0.25453204 str: one hundred nineteen
....
[junit] 103 Score: 1.0 str: one hundred three
[junit] 102 Score: 0.25453204 str: one hundred two
[junit] 101 Score: 0.28600293 str: one hundred one
[junit] 100 Score: 0.27352643 str: one hundred
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 761 ms.
[junit] testPagination()
[junit] Excecution time: 193 ms.
[junit] 117 Score: 0.03489425 str: one hundred seventeen
[junit] 118 Score: 0.03489425 str: one hundred eighteen
....
[junit] 132 Score: 0.03489425 str: one hundred thirty-two
[junit] 134 Score: 0.03489425 str: one hundred thirty-four
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 743 ms.
[junit] testCountHits()
[junit] Excecution time: 53 ms.
[junit] Hits: 126
[junit] Index droped: IT1
[junit] Table droped: T1

C.7 TestQueryHits

This test is not autonomous because requires an additional step to run. Before run it create a table and his Lucene Index with:

create table test_source_big as (select * from all_source);
create index source_big_lidx on test_source_big(text)
  indextype is lucene.LuceneIndex
    parameters('AutoTuneMemory:true;
      MergeFactor:500;
      FormatCols:line(0000);
      ExtraCols:line "line"');

For 11g databases you can create a best optimize Lucene Index using some new Secure LOB features:

create index source_big_lidx on test_source_big(text)
  indextype is lucene.LuceneIndex
    parameters('FormatCols:line(0000);
      ExtraCols:line "line";
      Analyzer:org.apache.lucene.analysis.StopAnalyzer;
      MergeFactor:500;
      LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');

On 10g running it as SCOTT, TEST_SOURCE_BIG table will have 220731 rows using a typical installation based on database templates. Using above table two test checks performance with a query which returns 18387 hits, once call to LuceneDomainIndex.countHits function and another iterate over the result in pages of ten rows, typical scenario of web applications. His output look like:

[junit] Testsuite: org.apache.lucene.indexer.TestQueryHits
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 2.656 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] iteration from: 13775 to: 13785
[junit] Step time: 791 ms.
[junit] iteration from: 13785 to: 13795
[junit] Step time: 49 ms.
[junit] iteration from: 13795 to: 13805
[junit] Step time: 40 ms.
[junit] iteration from: 13805 to: 13815
[junit] Step time: 44 ms.
[junit] iteration from: 13815 to: 13825
[junit] Step time: 40 ms.
[junit] iteration from: 13825 to: 13835
[junit] Step time: 42 ms.
[junit] iteration from: 13835 to: 13845
[junit] Step time: 41 ms.
[junit] iteration from: 13845 to: 13855
[junit] Step time: 50 ms.
[junit] iteration from: 13855 to: 13865
[junit] Step time: 41 ms.
[junit] iteration from: 13865 to: 13875
[junit] Step time: 41 ms.
[junit] Elapsed time: 1877
[junit] Hits: 18387
[junit] Elapsed time: 564

Note that first iteration took more time because it includes parsing time and caching, also to simulate a real word web application an SQLConnection is take and returned to the pool on each iteration.

Doc Links

Previous / LDI Docs – Appendix B (Lucene Domain Index Storage)
Next / LDI Docs – Appendix D (Functions, operators and utilities)

LDI Docs – Appendix B (Lucene Domain Index Storage)

B Lucene Domain Index Storage

OJVMDirectory class creates a set of Oracle objects to represent Lucene Inverted Index and Domain Index functionality. First it creates a table named IDX_NAME$T (IDX_NAME is your Lucene Domain Index used at create index DDL statement) with this structure:

Name Null? Type
NAME NOT NULL VARCHAR2(30)
LAST_MODIFIED TIMESTAMP(6)
FILE_SIZE NUMBER(38)
DATA BLOB
DELETED CHAR(1)

Also have and index based on IDX_NAME$T.DELETED column to speedy up purge operations. To enqueue operation at the index it defines a DBMS_AQ Queue IDX_NAME$Q with his storage table IDX_NAME$QT. IDX_NAME$Q queue have payload defined as LUCENE_MSG_TYP object. This object type is defined as:

Name Null? Type
RIDLIST SYS.ODCIRIDLIST
OPERATION VARCHAR2(32)

SYS.ODCIRIDLIST is an special structure defined by ODCI API to hold a list of rowid changed by an DML operation. OPERATION is one of insert, delete, update, rebuild, optimize, insert-ram or insert-disk reserved keyword. rebuild and optimize operations are used with SyncMode:OnLine to perform these tasks automatically using a background process. insert-ram and insert-disk are messages enqueued internally by LDI when ParallelDegree is enabled.
When using ParallelDegree grater than 1 the structure showed above is replicated for N parallel storages used when indexing, it means that, for example, with ParallelDegree:2 for an index name SOURCE_BIG_LIDX there will be two extra structures named SOURCE_BIG_LIDX$0{$Q|$QT|$T} and SOURCE_BIG_LIDX$1{$Q|$QT|$T}.

Doc Links

Previous / LDI Docs – Appendix A (Parameter reference and syntax)
Next / LDI Docs – Appendix C (JUnit test suites explained)

LDI Docs – Appendix A (Parameter reference and syntax)

A Parameter reference and syntax

Lucene Domain Index accept several parameters which can be passed using create index or alter index DDL commands. This parameters are divided into four categories, Index Writer, Analyzer, User Data Store and General parameters.

A.1 Lucene Index Writer parameters

This section covers Lucene Index Writer parameters for more information about this parameter see Lucene docs and Wiki.

A.1.1 MergeFactor

Determines how often segment indices are merged by addDocument(). If you are creating a new index over a table with thousands of rows a value of 100 to 500 is good value.

A.1.2 MaxBufferedDocs

Determines the minimal number of documents required before the buffered in-memory documents are merged and a new Segment is created. This value can cause an out of memory exception you provide a value larger than user space available. A typical SGA configuration can accept values of 4000 or 5000 depending how big are your rows being indexed. If you are not sure of how megabytes can consume your rows you can use AutoTuneMemory:true parameter which is a default value, so you choose true MaxBufferedDocs will be ignored and Lucene Domain Index will try to uso 90% of Oracle Java Pool Size value.

A.1.3 MaxMergeDocs

Determines the largest number of documents ever merged by addDocument().

A.1.4 MaxBufferedDeleteTerms

Determines the minimal number of delete terms required before the buffered in-memory delete terms are applied and flushed.

A.1.5 UseCompoundFile

Setting to turn on usage of a compound file. When on, multiple files for each segment are merged into a single file once the segment creation is finished. This is done regardless of what directory is in use. By default Lucene Domain Index do not use compound file format because its not affected by max open file descriptors.

A.1.6 MaxFieldLength

Determines the maximum number of char indexed for any column of this index, default value is 10000.

A.1.7 AutoTuneMemory

AutoTuneMemory:true (default) overrides MaxBufferedDocs parameter, it defines dynamically MaxBufferedDocs based on how much memory is reported by OracleRuntime.getJavaPoolSize() method.
After each document is added to the index it calls to writer.ramSizeInBytes() and test that is not over a 50% of the ram free.
This parameter works in most of the common cases, but you can get a Java out of memory error in multiuser environments because Java Pool Size is common parameter for all the sessions. If you get an exception during index creation time set AutoTuneMemory:false and adjust MaxBufferedDocs to a value which not raise an out of memory exception.

A.2 Analyzer parameters

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer.
Analyzer, PerFieldAnalyzer or Stemmer parameter affects indexing and query expressions, so if you want to change this parameter on a exists index you to must rebuild it, the priority of these three parameters is first check for the Stemmer if its not present check for PerFieldAnalyzer if its not present checks for Analyzer parameter, finally if none of them are defined will use SimpleAnalyzer.

A.2.1 Analyzer

This parameter is fully qualified Java class name which extends org.apache.lucene.analysis.Analyzer. For example:

  • BrazilianAnalyzer
  • ChineseAnalyzer
  • CJKAnalyzer
  • CzechAnalyzer
  • DutchAnalyzer
  • FrenchAnalyzer
  • GermanAnalyzer
  • GreekAnalyzer
  • KeywordAnalyzer
  • PatternAnalyzer
  • RussianAnalyzer
  • SimpleAnalyzer
  • StandardAnalyzer
  • StopAnalyzer
  • ThaiAnalyzer
  • WhitespaceAnalyzer

See Lucene Java Docs for more details. A default analyzer is SimpleAnalyzer.

A.2.2 Stemmer

Stemmer is another kind of analyzer which divides words, stop words and another term related object based on an specific language. Stemmer parameter use Snowball Analyzer, possible values for Stemmer parameter using Lucene 2.2.0 distribution are:

  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • German2
  • Italian
  • Kp
  • Lovins
  • Norwegian
  • Porter
  • Portuguese
  • Russian
  • Spanish
  • Swedish

Stemmer parameter override Analyzer parameter.

A.2.3 PerFieldAnalyzer

PerFieldAnalyzer is a wrapper of other analyzers which provides an independent analyzer for each column being indexed, see PerFieldAnalyzerWrapper class in Lucene documentation. Each column could have his own analyzer which extends org.apache.lucene.analysis.Analyzer. If a column is not in the list StandardAnalyzer will be used as default. For example:

create table t1 (f1 VARCHAR2(10), f2 XMLType);
insert into t1 values ('1', XMLType('<emp id="1"><name>ravi</name></emp>'));
insert into t1 values ('3', XMLType('<emp id="3"><name>murthy</name></emp>'));

create index it1 on t1(f2) indextype is lucene.LuceneIndex
 parameters('IncludeMasterColumn:false;
 ExtraCols:F1,extractValue(F2,''/emp/name/text()'') "name",extractValue(F2,''/emp/@id'') "id";
 FormatCols:F1(000),id(00)');

alter index it1 rebuild
 parameters('PerFieldAnalyzer:F1(org.apache.lucene.analysis.KeywordAnalyzer),id(org.apache.lucene.analysis.KeywordAnalyzer)');

In the above example four columns are being indexed by Lucene Domain Index rowid (added by default) using KeywordAnalyzer, F1 and id (added by ExtraCols parameter) using KeywordAnalyzer too, and finally name which is not included into PerFieldParameter and then using StandardAnalyzer.

A.3 User Data Store parameters

Lucene Domain Index implements a User Data Store functionality, this functionality provides many parameters to control which column will be included into a Lucene Document which is inserted into the index.
and First three parameters are used to choose which columns will added to the index in addition to the master column. Oracle Domain Index are bound to a single column, this is a limitation with Oracle 10g version. To avoid this problem passing ExtraCols, ExtraTabsWhereCondition you can easily build a set of new column from the master table and others. Basically a select DML statement is built using these parameters. To clarify this Lucene Domain Index will performs a query like:

– Full table scan (create index statement):
SELECT rowid, MasterTable.MasterColumn, ExtraCols
FROM MasterTable,ExtraTabs
where WhereCondition;

– Find a particular rowid (insert,update operations):
SELECT MasterTable.MasterColumn, ExtraCols
FROM MasterTable,ExtraTabs
where MasterTable.rowid=:rowid AND WhereCondition;

Text in italic are injected by Lucene Domain Index and text in bold are user defined.

A.3.1 ExtraCols

A coma separated list of columns of the Master table of table being indexed or the tables defined into ExtraTabs parameter. Note that if you don’t define columns alias column name are capitalized by default on Oracle databases. For example ‘ExtraCols:F2 “f2″,T2.F3 “f3″‘ note that you can omit master table name if there is no collisions

A.3.2 ExtraTabs

A coma separated list of table name and alias for this tables. For example ‘ExtraTabs:T2 aliasT2,T3 aliasT3′. Note that ODCI API only will detect changes at index master column, to notify changes based on ExtraCols list you need to attach triggers, see section examples above for more detail.

A.3.3 WhereCondition

An SQL where condition used to join index’s master table with ExtraTabs tables. For example: ‘WhereCondition:T1.f1=T2.f2(+) AND T1.F1=aliasT3.f3′. Be careful to produce a correct join condition to guaranty single row result; multiple or zero row result based on the master table values are not allowed.

Note: Up to Lucene Domain Index 2.9.0, if you use a WhereCondition which have an OR operator put this where condition enclosed with () because the precedence of the OR over the AND operator makes that some queries returns more rows that the correct behavior, for example instead of:
WhereCondition:T1.F1=’AA’ OR T1.F1=’BB’
put:
WhereCondition:(T1.F1=’AA’ OR T1.F1=’BB’)
this workaround fix some problems when working in OnLine mode. Starting with 2.9.1 version this extra () are not required.

A.3.4 UserDataStore

This is a fully Java Class name which implements org.apache.lucene.indexer.UserDataStore interface, you can create your own Data Store class implementing this interface. By default Lucene Domain Index provides an implementation which covers most of the typical scenarios, this class is org.apache.lucene.indexer.DefaultUserDataStore and use FormatCols parameter to create Lucene Fields.

A.3.5 FormatCols

A coma separated list of column(format) strings interpreted by User Data Store class to control how an specific database column will be transformed in a Lucene Field. For example you can choose padding, un-tokenized values and so on.
Supported formats by Default Data Store class are:

  • Number padding for numeric columns using java.text.DecimalFormat class syntax, default is 0000000000.
  • Date rounding for timestamp and date columns using org.apache.lucene.document.DateTools, default is day.
  • Character left padding for VARCHAR2 or CHAR columns using org.apache.lucene.util.StringUtils class (leftPad method), default is no left char padding. Any char can be used for left padding.
  • XPath expression for XMLType columns, this XPath string will be passed to XMLType.extract(“format”,”") method, the result of the XPath extraction will be a new XMLType object over getStringVal() will executed. If you want to perform more user defined XMLType to Field extraction extend DefaultUserDataStore class or use virtual column indexing.
  • For columns of type VARCHAR2 or CHAR you can use an special string NOT_ANALYZED or NOT_ANALYZED_STORED as format which tell to Default User Data Store class that this column will be indexed but un-tokenized, this is useful with columns which will be used for sorting.

A.3.6 LockMasterTable

When table indexer is getting the row which will be indexed it can use either FOR UPDATE NOWAIT SQL construction or not, setting this parameter to true cause that the row is acquired with a lock.

A.4 General parameters

This set of parameters are Lucene Domain Index specific parameters.

A.4.1 SyncMode

SyncMode tells to Lucene Domain Index which strategy is used to update the index. SyncMode:Deferred (default) left to the application when the index is synced either by calling LuceneDomainIndex.sync procedure after a set of changes pending or by DBMS_SCHEDULER process at an specific time. With SyncMode:Deferred update and insert operations are queued using DBMS_AQ package. Delete operations are never enqueued because require an update on Lucene Index to not return rowid of deleted rows.
SyncMode:OnLine is implemented by using DBMS_AQ PLSQL callback, so immediately after a commit operation which involves insert or update rows a parallel process dbms_j* is automatically started by DBMS_AQ package to applied pending changes. SyncMode:OnLine should be reserved for index which update, insert or delete operations are much lower than select, AQ callbacks can not handle very well exceptions during sync time, for example when a row being index is locked by another session, so some changes can be lost with this scenario.

A.4.2 Updater, Searcher

Lucene Domain Index can be configured to start several parallel shared process which do reader and writer operations on LDI storage on behalf of the user connected session, you can configure multiple searcher process selected randomly using the syntax host1@port1,host2@port2 and one updater process using similar syntax. By default these parameters are defined with the value local which means not using parallel shared servers. Two parallel server are configured and started during database startup process, a searcher process listen at SYS_CONTEXT(‘USERENV’,'SERVER_HOST’)@1099 which usually is localhost@1099 and the updater process at localhost@1098, you can register multiples searcher/updater processes editing the properties db.searcher.job/db.searcher.port,db.updater.job/db.updater.port at build.xml file and calling to the targets create-searcher-job and create-updater-job respectively.
Updater and Searcher processes can be stopped, started using Ant’s targets disable-jobs and enable-jobs.

A.4.3 LobStorageParameters

Lucene Domain Index uses a BLOB column named “data” for storing Lucene Inverted index files. You can control any LOB storage parameter with this parameter during index creation time, his default value is ‘LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CACHE READS NOLOGGING’ for 11g databases you can use a better optimize storage by using newest Secure LOB parameter, for example: ‘LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING’

A.4.4 LogLevel

Lucene Domain Index uses JDK Java Util Logging package, LogLevel parameter is any of the string defined by Level.parse() method, for example: LogLevel:ALL. By default logging level is defined to WARNING.
Lucene Domain Index uses:

  • SEVERE for non recoverable error conditions
  • FINER for debugging purpose such as ODCI API arguments
  • INFO for checking index operations such as value being indexed
  • WARNING for error messages which are reported as ERROR through ODCI API
  • CONFIG to see user parameters changed by users

Logging information is sent by default to Oracle .trc files, but you can redirect this output using dbms_java.set_output procedure for example.
If you are not sure which field and how these fields are added to the index change LogLevel to INFO and check for lines starting with: “INFO: Document<”
exiting and throwing methods does not print messages also with log level defined to ALL. This is because logging level used by these methods are controlled by ConsoleHandler level.
To get these methods work copy logging.properties file from your JAVA_HOME/jre/lib to ORACLE_HOME/javavm/lib directory and edit the line which includes level property:

# Limit the message that are printed on the console to INFO and above.
java.util.logging.ConsoleHandler.level = ALL
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter

Then shutdown and startup your Oracle database.

A.4.5 CachedRowIdSize

CachedRowIdSize is used by an LRU cached used to maintain the association between Lucene Doc ID and a particular Oracle ROWID. For very big table using an array to store this association can consume a lot of SGA RAM, starting with Lucene Domain Index 2.9.0.1.0 only 10.000 ROWID are stored in this cache, tables with high frequency of updates can use this LRU small due every caused that LRU is completed flushed, but tables with low frequency of updates/deletes can get a lot of performance improvement by using larger LRU cached size.

A.4.6 BatchCount, IndexOnRam and ParallelDegree

These three parameters control parallel index operations (inserts) when OnLine mode is enabled, ParalellelDegree defines how many slave index storage will be created to hold temporary parallel index operations when news rows are inserted or the index is created or rebuild. During index creation or rebuild time BatchCount defines how many rows will processed in batch and parallel with another set of rows. IndexOnRam defines when the new set of rows is indexed in a temporary index in RAM or disk, prior to Lucene Domain Index 2.9.2.1.0 a batch of new rows where processes in temporary index stored in disk, using IndexOnRam:true tells to Lucene Domain Index that the new rows will be indexed in RAM and finally merged into the main index stored in disk.

A.5 Query parameters

This set of parameters which affects QueryParser and search functionality.

A.5.1 DefaultColumn

DefaultColumn defines which columns is used as default column in QueryParser syntax, if this parameter is not set master column of the index is used, this name is a Lucene Field name. Here an example:

create index pages_lidx_all on pages p (value(p))
  indextype is Lucene.LuceneIndex
  parameters('PopulateIndex:false;
    DefaultColumn:text;
    SyncMode:Deferred;
    LogLevel:WARNING;
    Analyzer:org.apache.lucene.analysis.SpanishWikipediaAnalyzer;
    ExtraCols:extractValue(object_value,''/page/title'') "title", extractValue(object_value,''/page/revision/comment'') "comment", extract(object_value,''/page/revision/text/text()'') "text", extractValue(object_value,''/page/revision/timestamp'') "revisionDate";
    FormatCols:revisionDate(day);
    IncludeMasterColumn:false;
    LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');

Note the correlation between DefaultColumn and ExtraCols. ExtraCols defines a Lucene Field named “text” with a value calculated by the SQL expression extract(object_value,”/page/revision/text/text()”), then you can use a Lucene Field text as default Field in QueryParser syntax.

A.5.2 DefaultOperator

DefaultOperator defines which Boolean operator is used in QueryParser syntax, if this parameter is not set OR operator is his default value.

A.5.3 NormalizeScore

NormalizeScore is used during Lucene Index scan to know if they need to track the maximum score, the maximum score then used to normalize the result of lscore() operator to return only values between 0 to 1. If you don’t need a normalized range of the score you can avoid this computation and your query will be fast. Note that a not normalized score not implied that the document are not in order of relevance.

A.5.4 PreserveDocIdOrder

PreserveDocIdOrder is an internal parameter which is used by Lucene in some kind of operator, if you don’t need that result preserve Lucene Doc ID in order rather than the relevance, you can put this value to false (default) and some operator will be fast.

A.5.5 RewriteScore and SimilarityMethod

RewriteScore (true or false) and SimilarityMethod (fully class name) are used when query using wildcard operator (*) these parameters produces better recall values, for example:

create table t1 (f1 number primary key, f2 varchar2(2000), f3 number(5,3));
insert into t1 values (1, 'Cefaleias', 1);
insert into t1 values (2, 'Cefaleia', 1);
insert into t1 values (3, 'Cefaleia em salva', 0.625);
insert into t1 values (4, 'Cefaleias de tensão', 0.625);
insert into t1 values (5, 'Cefaleias / enxaquecas', 0.625);
insert into t1 values (6, 'Desproporção céfalo-pélvica', 0.5);
insert into t1 values (7, 'Deformidade por redução cefálica congénita', 15.87);
insert into t1 values (8, 'Intoxicação por antibióticos do grupo das cefalosporinas', 0.5);
commit;

create index it1 on t1(f2)
  indextype is lucene.luceneindex 
  parameters('LogLevel:ALL;
    Analyzer:org.apache.lucene.analysis.PortugueseAnalyzer;
    FormatCols:F3(00.000);
    ExtraCols:F3;
    RewriteScore:true;
    SimilarityMethod:org.apache.lucene.search.WildcardSimilarity');

select /*+ DOMAIN_INDEX_SORT */ lscore(1) f1, f2 from t1
  where lcontains(f2, 'cefa cefa*',1) > 0

 F1 F2 
 1 Cefaleias 
 1 Cefaleia 
 0.625 Cefaleia em salva 
 0.625 Cefaleias de tensão 
 0.625 Cefaleias / enxaquecas 
 0.5 Desproporção céfalo-pélvica 
 0.5 Deformidade por redução cefálica congénita 
 0.5 Intoxicação por antibióticos do grupo das cefalosporinas 
 8 rows selected

alter index it1 
  parameters('LogLevel:ALL;
    SimilarityMethod:org.apache.lucene.search.DefaultSimilarity');

select /*+ DOMAIN_INDEX_SORT */ lscore(1) f1,f2 from t1
  where lcontains(f2, 'cefa cefa*',1) > 0

 F1 F2 
 0.3539437353610992431640625 Intoxicação por antibióticos do grupo das cefalosporinas 
 0.12431289255619049072265625 Cefaleias 
 0.12431289255619049072265625 Cefaleia 
 0.077695555984973907470703125 Cefaleia em salva 
 0.077695555984973907470703125 Cefaleias de tensão 
 0.077695555984973907470703125 Cefaleias / enxaquecas 
 0.062156446278095245361328125 Desproporção céfalo-pélvica 
 0.062156446278095245361328125 Deformidade por redução cefálica congénita 
 8 rows selected

alter index it1
  parameters('LogLevel:ALL;
    RewriteScore:false');

select /*+ DOMAIN_INDEX_SORT */ lscore(1) f1, f2 from t1
  where lcontains(f2, 'cefa cefa*',1) > 0

 F1 F2 0.15442870557308197021484375 Cefaleias
 0.15442870557308197021484375 Cefaleia
 0.15442870557308197021484375 Cefaleia em salva 
 0.15442870557308197021484375 Cefaleias de tensão 
 0.15442870557308197021484375 Cefaleias / enxaquecas 
 0.15442870557308197021484375 Desproporção céfalo-pélvica 
 0.15442870557308197021484375 Deformidade por redução cefálica congénita 
 0.15442870557308197021484375 Intoxicação por antibióticos do grupo das cefalosporinas
 8 rows selected

A.6 Highlight parameters

This set of parameters which affects lhighlight, phighlight and rhighlight functionality.

A.6.1 Formatter

Formatter defines a valid class name which implements Lucene Interface Formatter and with a constructor with no arguments, default value org.apache.lucene.search.highlight.SimpleHTMLFormatter.

A.6.2 MaxNumFragmentsRequired

MaxNumFragmentsRequired defines a number of text fragments returned by Highlight function, default value is 4.

A.6.3 FragmentSize

FragmentSize defines the size of each fragment returned in characters of each fragment, default value is 100.

A.6.4 FragmentSeparator

FragmentSeparator defines a String used as fragment separator, default value is “…”. Note that you can not use “;” or “:” as fragment separator because are used as parameter and value delimiters into alter index … parameters(..) DDL statement.

Doc Links

Previous / LDI Docs – 4 Locking and Performance
Next / LDI Docs – Appendix B (Lucene Domain Index Storage)

LDI Docs – 4 Locking and Performance

4. Locking and Performance

4.1 Locking used by Lucene Domain Index

Operation Base Table (row/table) Index Table (SCHEMA.IDX$T) Queue Table (SCHEMA.IDX$QT)
Insert X/RX (1) NONE NONE
Update X/RX NONE NONE
Delete X/RX NONE NONE
Manually Sync X/RS (2) X/T|X/RX (3) DBMS_AQ.BLOCKED (4)
Automatically Sync X/RS (2) X/T|X/RX (3) DBMS_AQ.BLOCKED (4)
Optimize NONE X/T|X/RX (3) NONE
  1. X = Row exclusive lock at the row being inserted, RX = Table row exclusive lock – if index parameter LockMasterTable=true.
  2. X = Row exclusive lock at the row being indexed, RS = Table row share lock. A select … for update no wait is performed at all rows being added to Lucene Index.
  3. X/T this is a writer lock semaphore of Lucene Index and provide serialize write operations, the write lock is performed using lock table $SCHEMA.IDX$T. X/RX is performed at many rows of this table because Lucene creates and deletes many files.
  4. To perform massive dequeue operations at DBMS AQ queue Sync scan this queue with DBMS_AQ.BLOCKED option.
    1. 4.2 Performance tips

      4.2.1 Index Writer parameters

      Lucene Index Writer class uses several parameters to control his index structure. Lucene Domain Index pass to Index Writer several parameters such as MergeFactor, MaxBufferedDocs among others. As best practice if you want to index thousands of rows you can override default Lucene parameters for other which speed up indexing time. With create index or alter index rebuild you can set MergeFactor to 100 and MaxBufferedDocs to 4000. This parameters increase index performance but then DML operations at the base table will batch small set of rows, so after DDL commands change MergeFactor to 2 and MaxBufferedDocs to 100. A good place to start knowing these parameters behavior is the Wiki page Improving Indexing Speed.

      4.2.2 Auto Tune Memory functionality

      Lucene Domain Index have a parameter called AutoTuneMemory a true value means that for Index Writer operations it will try to use up to 50% of the Java Pool Size configured at the Oracle SGA to adjust how many documents are buffered (MaxBufferedDocs) before call IndexWritter.flush().
      With AutoTuneMemory:true MergeFactor,MaxBufferedDocs,MaxMergeDocs are not required, its calculated using free RAM at the SGA, but you has to set MergeFactor. Due Java Pool Size is global parameter the rule is not valid if you want to create many index with parallel connexions, two connections will try to use 50% of the SGA, so one of them will ran out of memory.

      4.2.3 Keep Index on RAM

      OJVMDirectory replaces Lucene file system storage by a table storage with BLOBs. For every Lucene Domain Index created there is a new table which stores every Lucene file as a row with a BLOB column, see section 6 for more detail, using similar strategy as Oracle Text you can keep this table in RAM. Unlike Oracle Text which uses multiples tables for storing the inverted index, Lucene Domain Index use one table, execute this DDL command to keep Lucene Index on RAM:

      create index source_small_lidx on test_source_small(text)
      indextype is lucene.LuceneIndex parameters(
        'FormatCols:line(0000); ExtraCols:line "line"; Analyzer:org.apache.lucene.analysis.StopAnalyzer; MergeFactor:500');
      alter index source_small_lidx parameters('MergeFactor:100');
      alter table source_small_lidx$t storage (buffer_pool keep) modify lob (data) (storage (buffer_pool keep));
      

      During Index creation use AutoTuneMemory:true (default value). Finally change OJVMDirectory storage table and LOB to keep them in RAM. Be sure that your SGA has a enough RAM to keep it. To know how big your index you can query the table:

      SQL> select sum(file_size) from source_small_lidx$t where deleted='N';
      SUM(FILE_SIZE)
      --------------
              147444
      

      Finally as Tom Kyte say, tkprof, tkprof, …. ;) . You can see Lucene Domain Index IO operations with an “alter session set events ’10046 trace name context forever, level 12′; then you can find operations at Lucene Domain Index table SCHEMA.IDX_NAME$T. Using TKPROF information you can alter table and lob storage parameters manually.

      4.2.4 Compare your execution plan

      To be sure that your Lucene Domain Index is properly used compare your executions plans and try to avoid non necessary filter by or sort order by predicates by using in-line sort or multiples field Query Parser conditions. Here examples of sorting using emails table created in section 3.1.4:

      SQL> explain plan for
        2  SELECT subject FROM emails where lcontains(bodytext,'security',1)>0
        3  order by subject ASC;
      

      PLAN_TABLE_OUTPUT
      -----------------------------------------------------------------------------------
      Plan hash value: 1542204867
      Id   Operation                    Name           Rows  Bytes  Cost (%CPU)  Time
      0    SELECT STATEMENT                            1     4016   3 (34)       00:00:01
      1    SORT ORDER BY                               1     4016   3 (34)       00:00:01
      2    TABLE ACCESS BY INDEX ROWID  EMAILS         1     4016   2  (0)       00:00:01
      * 3  DOMAIN INDEX                 EMAILBODYTEXT
      
      Predicate Information (identified by operation id):
      -----------------------------------------------------------------------------------
      3 - access("LUCENE"."LCONTAINS"("BODYTEXT",'security',1)>0)
      

      Above execution plan tells that you are using Lucene Domain Index but you can get a better optimizer plan by using lcontains sort:

      SQL> explain plan for
        2  SELECT /*+ DOMAIN_INDEX_SORT */ subject FROM emails
        3  where lcontains(bodytext,'security','subject:ASC',1)>0;
      

      PLAN_TABLE_OUTPUT
      -----------------------------------------------------------------------------------
      Plan hash value: 1450245214
      Id   Operation                    Name           Rows  Bytes  Cost (%CPU)  Time
      0    SELECT STATEMENT                            1     4016   2 (0)        00:00:01
      1    TABLE ACCESS BY INDEX ROWID  EMAILS         1     4016   2 (0)        00:00:01
      * 2  DOMAIN INDEX                 EMAILBODYTEXT
      
      Predicate Information (identified by operation id):
      -----------------------------------------------------------------------------------
      2 - access("LUCENE"."LCONTAINS"("BODYTEXT",'security','subject:ASC',1)>0)
      

      Here we have a better optimizer plan and lower cost.

      4.2.5 Filtering and sorting at index level

      This functionality only available on Oracle 11g is valid for Lucene Domain Index in 10g/11g databases and also for standard edition version. The performance improvement is done when most of the rows can be filtered and sorted at index level, to do that you have to push the value of the column(s) involved in filter by or order by at index level during index creation. The syntax differs from Oracle Text but the performance improve is similar, let see an example:

      -- Oracle Text 11g syntax
      create index source_big_idx on test_source_big(text) indextype is ctxsys.context
        filter by line
        order by line;
      -- Lucene Domain Index syntax
      create index source_big_lidx on test_source_big(text) indextype is lucene.luceneindex parameters(
        'PerFieldAnalyzer:line(org.apache.lucene.analysis.KeywordAnalyzer),TEXT(org.apache.lucene.analysis.SimpleAnalyzer);
        FormatCols:line(0000);
        ExtraCols:line "line"');
      

      Note that in both cases we choose line as the filter/order by column. Now let see the execution plan and auto trace for an equivalent query using 11g syntax and Lucene Domain Index syntax.

      To see the real impact on the performance of using a filter by at index level let see the time involved of an equivalent query:

      select count(line) from test_source_big
        where lcontains(text,'varchar2 AND line:[2600 TO 9000]')>0;
        2
      COUNT(LINE)
      -----------
      587
      Elapsed: 00:00:00.03
      
      select count(line) from test_source_big
        where lcontains(text,'varchar2')>0 and line>=2600;
        2
      COUNT(LINE)
      -----------
      587
      Elapsed: 00:00:00.89
      

      The point here is that into the first example Lucene Domain Index performs the two operations:

      1. find all the rows which contains the word varchar2
      2. filter the rows that only have line in a range 2600 to 9000

      returning only the rows (587) that match both sentence, for the second example the RDBMS:

      1. look for the rowid that contains the word varchar2 (19963),
      2. visit above rows looking for the value of the column line and filter all that are >=2600

      the difference between the rows visited by the RDBMS is the difference on the performance.

      4.2.6 OnLine mode, ParallelDegree and IndexOnRam

      Starting with 2.9.2.1.1 and 3.0.1.1.0 version inserts are performed in parallel if ParallelDegree is greater than 1 and SyncMode:OnLine, in that case an AQ slave process will create temporary Lucene index adding the rows being indexed, this index is created in RAM if IndexOnRam:true or in disk otherwise, once the index contains the batch of rows added the temporary slave index is merged with the master storage. This speed up massive index additions such as index rebuild, index creation or insert .. into .. select .. from DML operations. Parallel index operations are important in servers which have multiples cores or RAC installations because Oracle AQ starts parallel process doing the job, an SQL trace of WikiPedia dump indexing shows that mostly of time is involved in the scan of table loading the data which is indexed, so using parallel indexing increase the throughput of Lucene Domain Index in multi-core chips.
      Oracle 11g AQ implementation checks how many milliseconds are consumed by an AQ Callback, so choosing a BatchCount too bigger causes that no other slave process is started by Oracle AQ engine, the experience with BatchCount values in a range starting with 100 to 500 is good value to guaranty a correct parallel operation. Following screenshot shows multiple AQ process indexing WikiPedia dump:

      4.2.7 Parallel Shared Slave Index Scan (available since 3.0.2.1.0)

      Starting with 3.0.2.1.0 version a parallel shared slave index scan process is started automatically when your DB start, this process is accepting RMI connections from the other Oracle internal process, that is, once a connection from a client is accepted by the RDBMS a dedicated or shared server is started to performs the SQL operations, this process have the internal OJVM associated to execute the LDI operations. Previous to 3.0.2.1.0 each OJVM have his internal Lucene structures to query the inverted index, due each OJVM is isolated from another if two concurrent connections executes the same SQL operations on LDI each process will load the inverted index structure on RAM and performs the Hit collector operation. The new process now do the same operations but is shared by all the OJVM process which connect to him using RMI, the consequence is that only the first query will load the inverted index structure on RAM, next queries coming from the same or different OJVM process will re-use these structures on RAM. This architectural change increase a lot the Cache Hit rate decreasing the time to performs lcontains operations and reducing the RAM usage by LDI. The screenshot below shows this concept in action:

      There are several process named oracletest (LOCAL=NO) these are dedicated process associated to each client connections, there is one process named ora_j009_test, this process is the slave shared server which is performing the search operations on Lucene Index on behalf of the others. The parallel slave search process is started and stopped automatically by two instance triggers registered at SYS schema, these triggers are after startup and before shutdown events. If you disable these triggers Lucene Domain Index back to previous functionality which means every OJVM process has his own memory structures and do the index scan without dispatching RMI calls.

      4.3 Know caveats

      1. Lucene Domain Index uses Java Util Logging API and RMI to connect to the search process, it means that a grant is required to create and operate on LDI, for example:
        grant LUCENEUSER to scott;
        
      2. SyncMode:OnLine should be reserved only for index which a number of update/insert/delete operation are too small compared to select operations, because each message process requires almost open an IndexWriter/IndexReader on the associated Lucene Index by a background process, except for bulk collect operation or “insert into … select … from” which are processed in batch off 150 rows. Tables with many insert/update operations by seconds should use LuceneDomainIndex.sync(idx) procedure called by DBMS_JOB periodically or by the application.
      3. Syntax for Inline pagination is only supported at the beginning of the Query, it means that if you want to perform pagination using lcontains() query syntax it must start with “rownum:[n TO m] AND” note that this syntax is case sensitive. Also this extraction is performed by splitting the query by position and does not take into account grouping operator, so this query “rownum:[1 TO 10] AND word1 OR word2″ will be passed to Lucene’s Query Parser as “word1 OR word2″ which is not semantically the original one if you look to the precedence operator. We can try to modify Query Parser class in a future to solve this semantic issues.
      4. Columns name are case sensitive in ExtraCols and FormatCols parameters using traditional SQL behavior, it means that for this DDL index creation:
        create index it1 on t1(f2)
          indextype is lucene.LuceneIndex
            parameters('Stemmer:English;FormatCols:F2(zzzzzzzzzzzzzzz),F3(00.00);ExtraCols:F3');
        

        You can use ExtraCols with f3 or F3 but FormatCols should be F3 because f3 is returned by the SQL select operation as F3 during the table full scan, also Lucene Index will have a document with a Field F3 instead of f3. If you want to use f3 as is you can re-write DDL index creation with:
        create index it1 on t1(f2)
          indextype is lucene.LuceneIndex
            parameters('Stemmer:English;FormatCols:F2(zzzzzzzzzzzzzzz),f3(00.00);ExtraCols:F3 "f3"');
        

        With this sentence Lucene will create documents with two field F2 and f3, F2 is uppercase because is the master column of the index and his passed as “F2″ by ODCI API but, due is the default Field of the query, you can omit his name at lcontains syntax, F3 now is lowercase and will be indexed as a Field “f3″.
      5. Index parameters are pre-cached in memory for faster response. Due isolation behaviour of Oracle JVM sessions, if you call to alter index or re-create a new one in another session you need to close all SQL session that have a pre-load index parameter storage.
        Calling to LuceneDomainIndex.getParameter(‘owner.index_name’,'parameter_name’) you can see the values of any parameter passed to the ODCI API either by calling create index or alter index. Otherwise you can call to LuceneDomainIndex.refreshParameterCache stored procedure.
      6. If you re-install Lucene Domain Index without deleting existing indexes you can manually drop resources associated to and old index. For example:
        drop index source_big_lidx force;
        Index dropped.
        select table_name from tabs;
        
        TABLE_NAME
        ------------------------------
        DEPT
        EMP
        BONUS
        SALGRADE
        SOURCE_BIG_LIDX$QT
        DR$SOURCE_BIG_IDX$I
        DR$SOURCE_BIG_IDX$R
        SOURCE_BIG_LIDX$T
        TEST_SOURCE_BIG
        DR$SOURCE_BIG_IDX$N
        DR$SOURCE_BIG_IDX$K
        
        11 rows selected.
        
        drop table SOURCE_BIG_LIDX$T;
        Table dropped.
        
        conn / as sysdba
        connected.
        
        exec DBMS_AQADM.STOP_QUEUE ('SCOTT.SOURCE_BIG_LIDX$Q');
        PL/SQL procedure successfully completed.
        
        exec DBMS_AQADM.DROP_QUEUE ('SCOTT.SOURCE_BIG_LIDX$Q');
        PL/SQL procedure successfully completed.
        
        exec DBMS_AQADM.DROP_QUEUE_TABLE(queue_table  => 'SCOTT.SOURCE_BIG_LIDX$QT', force=>true);
        PL/SQL procedure successfully completed.
        
        exit
        

        Note that “drop index … force” will de-register Lucene Domain Index from Oracle’s system views, then Lucene Domain Index storage’s table is manually dropped, finally connected as SYS Lucene Domain Index AQ’s table is dropped.
      7. Oracle 11g have a know bug “6445561 – ORA-00600 [26599] [62] DUE TO INCORRECT PERSISTENCE OF BY INVOKER PIN” please apply patch number p6445561_111060_LINUX.zip available at Metalink, this bug affects select count(*) with a large results.
      8. Up to Lucene Domain Index 2.9.0 there is known problem with the WhereCondition parameter using OR SQL operator, see section A.3.3 to see the workaround.

      Doc Links

      Previous / LDI Docs – 3 Procedures, Functions, Operators and Examples
      Next / LDI Docs – Appendix A (Parameter reference and syntax)

LDI Docs – 3 Procedures, Functions, Operators and Examples

3. Procedures, Functions, Operators and Examples

Before you start to work on through the examples below, do grant the LUCENEUSER role to any dedicated Oracle user/schema, who has become selected to run the Lucene Domain Index (LDI). Remember, that you must not run a index within user/schema LUCENE. For example:

 -- connected as sysdba
  grant LUCENEUSER to scott;

3.1 Create a Lucene Domain Index

3.1.1 Single column index

The first example creates a domain index on table t1, column f2 using Lucene’s SimpleAnalyzer along with the Analyzer parameter. After execution, a new index, T1.IT1, and two new LDI-tables, the index storage table, IT1$T, and the index queue table (AQ), IT1$QT, will have been added to the user’s schema. Because of the generated objects, can no Lucene Domain Index name be longer than 21 characters! This is due the secondary generation of the Oracle AQ table name.

  create table t1 (
    f1 number,
    f2 varchar2(200),
    f3 varchar2(200),
    f4 number unique);

  create index it1 on t1(f2) indextype is lucene.LuceneIndex
    parameters('Analyzer:org.apache.lucene.analysis.SimpleAnalyzer');

Read more of this post

LDI Docs – 2 Installing and Testing

2 Installing and Testing

2.1 Requirements

  • JDeveloper 11g (optional) only if you want to edit the Java code
  • Ant 1.7.0
  • Sun JDK 1.5.0_05/1.4.2 ($ORACLE_HOME/jdk directory works fine as Java Home for compiling on 10g and 11g)
  • Linux/Windows Database Oracle 10g 10.2/11g production

2.2 Install binary distributions

Binary distributions are available at SourceForge.net and provides a very straightforward installation.

Read more of this post

LDI Docs – 1 Introduction

1 Introduction

General introduction, features, benefits and comparison with Lucene standalone implementation and Oracle Text.

1.1 What is Lucene

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
Apache Lucene is an open source project available for free download.
If Lucene is a pure Java framework why not use it inside Oracle Database JVM environment?

1.2 What is Lucene Domain Index

Lucene Domain Index is full integration of Lucene project running inside the Oracle database using Oracle JVM. Oracle provides a full featured JVM inside your Oracle Database compliant with JDK 1.4 in 10g release and 1.5 in 11g.
OJVMDirectory is a replacement for Lucene’s file system storage by a BLOB based storage, the name is related to the class which overrides (Directory.java)

Read more of this post

Follow

Get every new post delivered to your Inbox.