LDI Docs – Appendix A (Parameter reference and syntax)
March 11, 2011
A Parameter reference and syntax
Lucene Domain Index accepts several parameters, which can be passed using the create index or alter index DDL commands. These parameters are divided into six categories: Index Writer, Analyzer, User Data Store, General, Query and Highlight parameters.
A.1 Lucene Index Writer parameters
This section covers the Lucene Index Writer parameters. For more information about these parameters, see the Lucene docs and Wiki.
A.1.1 MergeFactor
Determines how often segment indices are merged by addDocument(). If you are creating a new index over a table with thousands of rows, a value between 100 and 500 is a good choice.
A.1.2 MaxBufferedDocs
Determines the minimal number of documents required before the buffered in-memory documents are merged and a new Segment is created. This value can cause an out of memory exception if you provide a value larger than the user space available. A typical SGA configuration can accept values of 4000 or 5000, depending on how big the rows being indexed are. If you are not sure how many megabytes your rows can consume, you can use the AutoTuneMemory:true parameter, which is the default; with AutoTuneMemory:true, MaxBufferedDocs is ignored and Lucene Domain Index will try to use 90% of the Oracle Java Pool Size value.
A.1.3 MaxMergeDocs
Determines the largest number of documents ever merged by addDocument().
A.1.4 MaxBufferedDeleteTerms
Determines the minimal number of delete terms required before the buffered in-memory delete terms are applied and flushed.
A.1.5 UseCompoundFile
Setting to turn on usage of a compound file. When on, multiple files for each segment are merged into a single file once the segment creation is finished. This is done regardless of what directory is in use. By default Lucene Domain Index does not use the compound file format, because it is not affected by the limit on open file descriptors.
A.1.6 MaxFieldLength
Determines the maximum number of terms indexed for any column of this index (Lucene's IndexWriter maxFieldLength setting); the default value is 10000.
A.1.7 AutoTuneMemory
AutoTuneMemory:true (default) overrides the MaxBufferedDocs parameter; it sets MaxBufferedDocs dynamically, based on how much memory is reported by the OracleRuntime.getJavaPoolSize() method.
After each document is added to the index, it calls writer.ramSizeInBytes() and checks that the result is not over 50% of the free RAM.
This parameter works in most common cases, but you can get a Java out of memory error in multiuser environments, because Java Pool Size is a parameter shared by all sessions. If you get an exception during index creation, set AutoTuneMemory:false and adjust MaxBufferedDocs to a value which does not raise an out of memory exception.
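As a sketch of the fallback described above, the statement below rebuilds an index with automatic tuning switched off and an explicit buffer size. The index name it1 and the values 4000 and 100 are illustrative assumptions, not recommendations; choose MaxBufferedDocs according to the size of your rows.

```sql
-- Hypothetical index name (it1); the values are examples only.
-- Disable automatic memory tuning and set the writer buffer explicitly.
alter index it1 rebuild
parameters('AutoTuneMemory:false;MaxBufferedDocs:4000;MergeFactor:100');
```

If the rebuild still fails with an out of memory error, lowering MaxBufferedDocs is the first thing to try.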
A.2 Analyzer parameters
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer.
The Analyzer, PerFieldAnalyzer and Stemmer parameters affect both indexing and query expressions, so if you want to change any of them on an existing index you must rebuild it. The priority of these three parameters is as follows: first Stemmer is checked; if it is not present, PerFieldAnalyzer is checked; if it is not present, Analyzer is checked; finally, if none of them is defined, SimpleAnalyzer is used.
A.2.1 Analyzer
This parameter is the fully qualified name of a Java class which extends org.apache.lucene.analysis.Analyzer. For example:
- BrazilianAnalyzer
- ChineseAnalyzer
- CJKAnalyzer
- CzechAnalyzer
- DutchAnalyzer
- FrenchAnalyzer
- GermanAnalyzer
- GreekAnalyzer
- KeywordAnalyzer
- PatternAnalyzer
- RussianAnalyzer
- SimpleAnalyzer
- StandardAnalyzer
- StopAnalyzer
- ThaiAnalyzer
- WhitespaceAnalyzer
See the Lucene Java Docs for more details. The default analyzer is SimpleAnalyzer.
A.2.2 Stemmer
Stemmer is another kind of analyzer, which handles words, stop words and other term-related objects based on a specific language. The Stemmer parameter uses the Snowball Analyzer; possible values for the Stemmer parameter using the Lucene 2.2.0 distribution are:
- Danish
- Dutch
- English
- Finnish
- French
- German
- German2
- Italian
- Kp
- Lovins
- Norwegian
- Porter
- Portuguese
- Russian
- Spanish
- Swedish
The Stemmer parameter overrides the Analyzer parameter.
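A minimal sketch of selecting a stemmer at index creation time, assuming a hypothetical table articles with a text column body (neither name comes from this document). Because Stemmer takes priority, an Analyzer parameter given in the same statement would be ignored.

```sql
-- Hypothetical table and index names.
create table articles (id number primary key, body varchar2(4000));

-- Use the Snowball stemmer for Spanish.
create index articles_lidx on articles(body)
indextype is lucene.LuceneIndex
parameters('Stemmer:Spanish');
```

Remember that changing the stemmer later requires rebuilding the index, since it affects both indexing and querying.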
A.2.3 PerFieldAnalyzer
PerFieldAnalyzer is a wrapper around other analyzers which provides an independent analyzer for each column being indexed; see the PerFieldAnalyzerWrapper class in the Lucene documentation. Each column can have its own analyzer, which must extend org.apache.lucene.analysis.Analyzer. If a column is not in the list, StandardAnalyzer is used as the default. For example:
create table t1 (f1 VARCHAR2(10), f2 XMLType);
insert into t1 values ('1', XMLType('<emp id="1"><name>ravi</name></emp>'));
insert into t1 values ('3', XMLType('<emp id="3"><name>murthy</name></emp>'));
create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('IncludeMasterColumn:false;
ExtraCols:F1,extractValue(F2,''/emp/name/text()'') "name",extractValue(F2,''/emp/@id'') "id";
FormatCols:F1(000),id(00)');
alter index it1 rebuild
parameters('PerFieldAnalyzer:F1(org.apache.lucene.analysis.KeywordAnalyzer),id(org.apache.lucene.analysis.KeywordAnalyzer)');
In the above example four columns are indexed by Lucene Domain Index: rowid (added by default) using KeywordAnalyzer; F1 and id (added by the ExtraCols parameter) also using KeywordAnalyzer; and finally name, which is not included in the PerFieldAnalyzer parameter and therefore uses StandardAnalyzer.
A.3 User Data Store parameters
Lucene Domain Index implements a User Data Store functionality, which provides several parameters to control which columns are included in the Lucene Document inserted into the index.
The first three parameters (ExtraCols, ExtraTabs and WhereCondition) are used to choose which columns will be added to the index in addition to the master column. An Oracle Domain Index is bound to a single column; this is a limitation of Oracle 10g. To work around it, you can pass ExtraCols, ExtraTabs and WhereCondition to build a set of new columns from the master table and other tables. Basically, a SELECT statement is built using these parameters. To clarify, Lucene Domain Index will perform queries like:
-- Full table scan (create index statement):
SELECT rowid, MasterTable.MasterColumn, ExtraCols
FROM MasterTable,ExtraTabs
where WhereCondition;
-- Find a particular rowid (insert, update operations):
SELECT MasterTable.MasterColumn, ExtraCols
FROM MasterTable,ExtraTabs
where MasterTable.rowid=:rowid AND WhereCondition;
In these templates, MasterTable, MasterColumn and the rowid predicate are injected by Lucene Domain Index, while ExtraCols, ExtraTabs and WhereCondition are user defined.
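To make the templates concrete, here is a sketch of how they might expand for one set of parameters. The tables T1 and T2, the index on T1(F1) and the parameter values are assumptions chosen for illustration; rowid is qualified with the master table and the bind variable is written as :rid (standing for the template's :rowid) so the statements are unambiguous when run by hand.

```sql
-- Assumed parameters (hypothetical tables T1 and T2, index on T1(F1)):
--   ExtraCols:aliasT2.F3 "f3"
--   ExtraTabs:T2 aliasT2
--   WhereCondition:T1.F1=aliasT2.F2(+)

-- Full table scan (create index statement):
SELECT T1.rowid, T1.F1, aliasT2.F3 "f3"
FROM T1, T2 aliasT2
WHERE T1.F1 = aliasT2.F2(+);

-- Find a particular rowid (insert, update operations):
SELECT T1.F1, aliasT2.F3 "f3"
FROM T1, T2 aliasT2
WHERE T1.rowid = :rid AND T1.F1 = aliasT2.F2(+);
```

Running the expanded full-scan query yourself is a quick way to check that the join returns exactly one row per master table row.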
A.3.1 ExtraCols
A comma-separated list of columns of the master table being indexed or of the tables defined in the ExtraTabs parameter. Note that if you don't define column aliases, column names are upper case by default on Oracle databases. For example: 'ExtraCols:F2 "f2",T2.F3 "f3"'. Note that you can omit the master table name if there are no name collisions.
A.3.2 ExtraTabs
A comma-separated list of table names and aliases for these tables. For example: 'ExtraTabs:T2 aliasT2,T3 aliasT3'. Note that the ODCI API only detects changes to the index master column; to be notified of changes to the columns in the ExtraCols list you need to attach triggers, see the examples section above for more detail.
A.3.3 WhereCondition
An SQL where condition used to join the index's master table with the ExtraTabs tables. For example: 'WhereCondition:T1.f1=T2.f2(+) AND T1.F1=aliasT3.f3'. Be careful to write a correct join condition that guarantees a single row result; multiple rows or zero rows for a given master table row are not allowed.
Note: Up to Lucene Domain Index 2.9.0, if you use a WhereCondition which contains an OR operator, enclose the whole condition in parentheses. The condition is appended to the rowid predicate with AND, and because AND takes precedence over OR, some queries would otherwise return more rows than they should. For example, instead of:
WhereCondition:T1.F1='AA' OR T1.F1='BB'
put:
WhereCondition:(T1.F1='AA' OR T1.F1='BB')
This workaround fixes some problems when working in OnLine mode. Starting with version 2.9.1 these extra parentheses are not required.
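The precedence problem is easier to see once the condition is placed into the single-row template from section A.3. This is a sketch under assumed names (master table T1, master column F2, bind variable :rid standing for the template's :rowid), not the exact statement generated by Lucene Domain Index.

```sql
-- Without parentheses, AND binds first, so this is evaluated as
--   (T1.rowid = :rid AND T1.F1 = 'AA') OR T1.F1 = 'BB'
-- and every row with F1 = 'BB' is returned, whatever its rowid.
SELECT T1.F2
FROM T1
WHERE T1.rowid = :rid AND T1.F1 = 'AA' OR T1.F1 = 'BB';

-- With parentheses, the rowid predicate applies to both alternatives,
-- so at most one row is returned.
SELECT T1.F2
FROM T1
WHERE T1.rowid = :rid AND (T1.F1 = 'AA' OR T1.F1 = 'BB');
```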
A.3.4 UserDataStore
This is the fully qualified name of a Java class which implements the org.apache.lucene.indexer.UserDataStore interface; you can create your own Data Store class by implementing this interface. By default Lucene Domain Index provides an implementation which covers most typical scenarios; this class is org.apache.lucene.indexer.DefaultUserDataStore, and it uses the FormatCols parameter to create Lucene Fields.
A.3.5 FormatCols
A comma-separated list of column(format) strings interpreted by the User Data Store class to control how a specific database column is transformed into a Lucene Field. For example, you can choose padding, un-tokenized values and so on.
Supported formats by Default Data Store class are:
- Number padding for numeric columns using java.text.DecimalFormat class syntax, default is 0000000000.
- Date rounding for timestamp and date columns using org.apache.lucene.document.DateTools, default is day.
- Character left padding for VARCHAR2 or CHAR columns using org.apache.lucene.util.StringUtils class (leftPad method), default is no left char padding. Any char can be used for left padding.
- XPath expression for XMLType columns; this XPath string is passed to the XMLType.extract("format","") method, and the result of the XPath extraction is a new XMLType object on which getStringVal() is executed. If you want to perform a more customized XMLType to Field extraction, extend the DefaultUserDataStore class or use virtual column indexing.
- For columns of type VARCHAR2 or CHAR you can use the special strings NOT_ANALYZED or NOT_ANALYZED_STORED as the format, which tell the Default User Data Store class that this column will be indexed but un-tokenized; this is useful for columns which will be used for sorting.
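The sketch below combines several of the formats listed above in one FormatCols value. The table orders, its columns and the index name are hypothetical; the format strings follow the column(format) pattern used elsewhere in this appendix, and the column names are written in upper case because no aliases are defined in ExtraCols.

```sql
-- Hypothetical table: a text column plus numeric, date and code columns.
create table orders (
  id      number primary key,
  notes   varchar2(4000),
  amount  number(9,2),
  created date,
  status  varchar2(10));

create index orders_lidx on orders(notes)
indextype is lucene.LuceneIndex
parameters('ExtraCols:AMOUNT,CREATED,STATUS;
FormatCols:AMOUNT(0000000.00),CREATED(day),STATUS(NOT_ANALYZED)');
```

Here AMOUNT is zero padded with a DecimalFormat pattern so that it compares correctly as text, CREATED is rounded to the day, and STATUS is indexed as a single un-tokenized value.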
A.3.6 LockMasterTable
When the table indexer fetches the row to be indexed, it can either use the FOR UPDATE NOWAIT SQL construct or not; setting this parameter to true causes the row to be acquired with a lock.
A.4 General parameters
This set of parameters is specific to Lucene Domain Index.
A.4.1 SyncMode
SyncMode tells Lucene Domain Index which strategy is used to update the index. SyncMode:Deferred (default) leaves it to the application to decide when the index is synced, either by calling the LuceneDomainIndex.sync procedure once a set of changes is pending or through a DBMS_SCHEDULER process at a specific time. With SyncMode:Deferred, update and insert operations are queued using the DBMS_AQ package. Delete operations are never enqueued, because they require an immediate update of the Lucene index so that the rowids of deleted rows are not returned.
SyncMode:OnLine is implemented using a DBMS_AQ PL/SQL callback, so immediately after a commit that involves inserted or updated rows, a parallel process dbms_j* is automatically started by the DBMS_AQ package to apply the pending changes. SyncMode:OnLine should be reserved for indexes where update, insert or delete operations are much less frequent than selects. AQ callbacks cannot handle exceptions during sync time very well, for example when a row being indexed is locked by another session, so some changes can be lost in this scenario.
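A short sketch of working with both modes, assuming an existing index named it1. The text above names the LuceneDomainIndex.sync procedure but does not give its signature; the single index-name argument used here is an assumption, so check the package specification in your installation.

```sql
-- Hypothetical index name (it1).
-- Deferred mode (the default): inserts and updates are queued.
alter index it1 parameters('SyncMode:Deferred');

-- Apply the queued changes on demand.
-- The argument form (index name as a string) is an assumption.
begin
  LuceneDomainIndex.sync('IT1');
end;
/

-- OnLine mode: pending changes are applied right after each commit.
alter index it1 parameters('SyncMode:OnLine');
```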
A.4.2 Updater, Searcher
Lucene Domain Index can be configured to start several parallel shared processes which perform read and write operations on LDI storage on behalf of the user's connected session. You can configure multiple searcher processes, selected randomly, using the syntax host1@port1,host2@port2, and one updater process using a similar syntax. By default these parameters are set to the value local, which means parallel shared servers are not used. Two parallel servers are configured and started during the database startup process: a searcher process listening at SYS_CONTEXT('USERENV','SERVER_HOST')@1099, which usually is localhost@1099, and an updater process at localhost@1098. You can register multiple searcher/updater processes by editing the properties db.searcher.job/db.searcher.port and db.updater.job/db.updater.port in the build.xml file and calling the targets create-searcher-job and create-updater-job respectively.
Updater and Searcher processes can be stopped and started using the Ant targets disable-jobs and enable-jobs.
A.4.3 LobStorageParameters
Lucene Domain Index uses a BLOB column named "data" for storing the Lucene inverted index files. You can control any LOB storage parameter with this parameter at index creation time; its default value is 'LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CACHE READS NOLOGGING'. For 11g databases you can get better optimized storage by using the newer Secure LOB parameters, for example: 'LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING'
A.4.4 LogLevel
Lucene Domain Index uses the JDK Java Util Logging package. The LogLevel parameter is any of the strings accepted by the Level.parse() method, for example: LogLevel:ALL. By default the logging level is set to WARNING.
Lucene Domain Index uses:
- SEVERE for non recoverable error conditions
- FINER for debugging purpose such as ODCI API arguments
- INFO for checking index operations such as value being indexed
- WARNING for error messages which are reported as ERROR through ODCI API
- CONFIG to see user parameters changed by users
Logging information is sent by default to Oracle .trc files, but you can redirect this output, for example by using the dbms_java.set_output procedure.
If you are not sure which fields are added to the index and how, change LogLevel to INFO and check for lines starting with: "INFO: Document<"
The exiting and throwing methods do not print messages even with the log level set to ALL. This is because the logging level used by these methods is controlled by the ConsoleHandler level.
To get these methods to work, copy the logging.properties file from your JAVA_HOME/jre/lib to the ORACLE_HOME/javavm/lib directory and edit the line which includes the level property:
# Limit the message that are printed on the console to INFO and above.
java.util.logging.ConsoleHandler.level = ALL
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter
Then shut down and restart your Oracle database.
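As a sketch of a typical debugging session in SQL*Plus, the commands below redirect Java output to the session and raise the log level for one index. The index name it1 and the buffer size are assumptions.

```sql
-- Show DBMS_OUTPUT in this SQL*Plus session.
set serveroutput on

-- Redirect Java output to the DBMS_OUTPUT buffer (size in bytes).
exec dbms_java.set_output(1000000);

-- Hypothetical index name (it1): log the documents being indexed.
alter index it1 parameters('LogLevel:INFO');
```

After this, statements that modify the indexed table in the same session should show the "INFO: Document<" lines mentioned above; set LogLevel back to WARNING when you are done.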
A.4.5 CachedRowIdSize
CachedRowIdSize sets the size of an LRU cache used to maintain the association between a Lucene Doc ID and a particular Oracle ROWID. For very big tables, using an array to store this association can consume a lot of SGA RAM, so starting with Lucene Domain Index 2.9.0.1.0 only 10,000 ROWIDs are stored in this cache. Tables with a high frequency of updates can keep this cache small, because every update causes the cache to be completely flushed, but tables with a low frequency of updates/deletes can get a significant performance improvement from a larger cache size.
A.4.6 BatchCount, IndexOnRam and ParallelDegree
These three parameters control parallel index operations (inserts) when OnLine mode is enabled. ParallelDegree defines how many slave index storages are created to hold temporary parallel index operations when new rows are inserted or the index is created or rebuilt. During index creation or rebuild, BatchCount defines how many rows are processed as a batch, in parallel with another set of rows. IndexOnRam defines whether the new set of rows is indexed in a temporary index in RAM or on disk: prior to Lucene Domain Index 2.9.2.1.0 a batch of new rows was processed in a temporary index stored on disk, while IndexOnRam:true tells Lucene Domain Index to index the new rows in RAM and finally merge them into the main index stored on disk.
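A sketch of how these parameters might be combined at index creation time. The table t1, column f2 and index name follow the naming used in this appendix's other examples, and the values 2 and 1000 are illustrative assumptions rather than tuned settings.

```sql
-- Hypothetical values: two slave index storages, batches of 1000 rows,
-- with each batch indexed in RAM before being merged into the main index.
create index it1 on t1(f2)
indextype is lucene.LuceneIndex
parameters('SyncMode:OnLine;
ParallelDegree:2;
BatchCount:1000;
IndexOnRam:true');
```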
A.5 Query parameters
This set of parameters affects QueryParser and search functionality.
A.5.1 DefaultColumn
DefaultColumn defines which column is used as the default column in QueryParser syntax; if this parameter is not set, the master column of the index is used. This name is a Lucene Field name. Here is an example:
create index pages_lidx_all on pages p (value(p))
indextype is Lucene.LuceneIndex
parameters('PopulateIndex:false;
DefaultColumn:text;
SyncMode:Deferred;
LogLevel:WARNING;
Analyzer:org.apache.lucene.analysis.SpanishWikipediaAnalyzer;
ExtraCols:extractValue(object_value,''/page/title'') "title", extractValue(object_value,''/page/revision/comment'') "comment", extract(object_value,''/page/revision/text/text()'') "text", extractValue(object_value,''/page/revision/timestamp'') "revisionDate";
FormatCols:revisionDate(day);
IncludeMasterColumn:false;
LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');
Note the correlation between DefaultColumn and ExtraCols. ExtraCols defines a Lucene Field named "text" with a value calculated by the SQL expression extract(object_value,'/page/revision/text/text()'), so you can use the Lucene Field text as the default Field in QueryParser syntax.
A.5.2 DefaultOperator
DefaultOperator defines which Boolean operator is used by default in QueryParser syntax; if this parameter is not set, the OR operator is used.
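A minimal sketch of changing the default operator, assuming an index it1 on column f2 of table t1. The value spelling AND is an assumption based on the operator names used above.

```sql
-- Hypothetical index: require all terms to match unless the query says otherwise.
alter index it1 parameters('DefaultOperator:AND');

-- With AND as the default, this query is read as: cefaleia AND tensão.
-- With the default (OR), it is read as: cefaleia OR tensão.
select lscore(1) sc, f2 from t1
where lcontains(f2, 'cefaleia tensão', 1) > 0;
```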
A.5.3 NormalizeScore
NormalizeScore is used during a Lucene index scan to decide whether the maximum score needs to be tracked; the maximum score is then used to normalize the result of the lscore() operator so that it returns only values between 0 and 1. If you don't need a normalized score range, you can avoid this computation and your query will be faster. Note that a non-normalized score does not imply that the documents are not in order of relevance.
A.5.4 PreserveDocIdOrder
PreserveDocIdOrder is an internal parameter which is used by Lucene in some kinds of operators. If you don't need results to preserve Lucene Doc ID order rather than relevance order, you can leave this value set to false (default) and some operators will be faster.
A.5.5 RewriteScore and SimilarityMethod
RewriteScore (true or false) and SimilarityMethod (fully qualified class name) are used when querying with the wildcard operator (*); these parameters produce better recall values, for example:
create table t1 (f1 number primary key, f2 varchar2(2000), f3 number(5,3));
insert into t1 values (1, 'Cefaleias', 1);
insert into t1 values (2, 'Cefaleia', 1);
insert into t1 values (3, 'Cefaleia em salva', 0.625);
insert into t1 values (4, 'Cefaleias de tensão', 0.625);
insert into t1 values (5, 'Cefaleias / enxaquecas', 0.625);
insert into t1 values (6, 'Desproporção céfalo-pélvica', 0.5);
insert into t1 values (7, 'Deformidade por redução cefálica congénita', 15.87);
insert into t1 values (8, 'Intoxicação por antibióticos do grupo das cefalosporinas', 0.5);
commit;
create index it1 on t1(f2)
indextype is lucene.luceneindex
parameters('LogLevel:ALL;
Analyzer:org.apache.lucene.analysis.PortugueseAnalyzer;
FormatCols:F3(00.000);
ExtraCols:F3;
RewriteScore:true;
SimilarityMethod:org.apache.lucene.search.WildcardSimilarity');
select /*+ DOMAIN_INDEX_SORT */ lscore(1) f1, f2 from t1
where lcontains(f2, 'cefa cefa*',1) > 0
F1 F2
1 Cefaleias
1 Cefaleia
0.625 Cefaleia em salva
0.625 Cefaleias de tensão
0.625 Cefaleias / enxaquecas
0.5 Desproporção céfalo-pélvica
0.5 Deformidade por redução cefálica congénita
0.5 Intoxicação por antibióticos do grupo das cefalosporinas
8 rows selected
alter index it1
parameters('LogLevel:ALL;
SimilarityMethod:org.apache.lucene.search.DefaultSimilarity');
select /*+ DOMAIN_INDEX_SORT */ lscore(1) f1,f2 from t1
where lcontains(f2, 'cefa cefa*',1) > 0
F1 F2
0.3539437353610992431640625 Intoxicação por antibióticos do grupo das cefalosporinas
0.12431289255619049072265625 Cefaleias
0.12431289255619049072265625 Cefaleia
0.077695555984973907470703125 Cefaleia em salva
0.077695555984973907470703125 Cefaleias de tensão
0.077695555984973907470703125 Cefaleias / enxaquecas
0.062156446278095245361328125 Desproporção céfalo-pélvica
0.062156446278095245361328125 Deformidade por redução cefálica congénita
8 rows selected
alter index it1
parameters('LogLevel:ALL;
RewriteScore:false');
select /*+ DOMAIN_INDEX_SORT */ lscore(1) f1, f2 from t1
where lcontains(f2, 'cefa cefa*',1) > 0
F1 F2
0.15442870557308197021484375 Cefaleias
0.15442870557308197021484375 Cefaleia
0.15442870557308197021484375 Cefaleia em salva
0.15442870557308197021484375 Cefaleias de tensão
0.15442870557308197021484375 Cefaleias / enxaquecas
0.15442870557308197021484375 Desproporção céfalo-pélvica
0.15442870557308197021484375 Deformidade por redução cefálica congénita
0.15442870557308197021484375 Intoxicação por antibióticos do grupo das cefalosporinas
8 rows selected
A.6 Highlight parameters
This set of parameters affects lhighlight, phighlight and rhighlight functionality.
A.6.1 Formatter
Formatter defines a valid class name which implements the Lucene Formatter interface and has a no-argument constructor; the default value is org.apache.lucene.search.highlight.SimpleHTMLFormatter.
A.6.2 MaxNumFragmentsRequired
MaxNumFragmentsRequired defines the maximum number of text fragments returned by the highlight functions; the default value is 4.
A.6.3 FragmentSize
FragmentSize defines the size in characters of each fragment returned; the default value is 100.
A.6.4 FragmentSeparator
FragmentSeparator defines a String used as the fragment separator; the default value is "…". Note that you cannot use ";" or ":" as the fragment separator, because they are used as parameter and value delimiters in the alter index … parameters(..) DDL statement.
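A sketch of adjusting the highlight parameters on an existing index. The index name it1 and the values are assumptions; the Formatter shown is simply the documented default written out explicitly.

```sql
-- Hypothetical index name (it1): return at most 2 fragments of about 60 characters.
alter index it1
parameters('Formatter:org.apache.lucene.search.highlight.SimpleHTMLFormatter;
MaxNumFragmentsRequired:2;
FragmentSize:60');
```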
Hi. I found a small mistake.
In paragraph "A.1.2 MaxBufferedDocs" you wrote that when you choose true, MaxBufferedDocs will be ignored and Lucene Domain Index will try to use 90% of the Oracle Java Pool Size value,
but in reality it is 50%.