LDI Docs – 1 Introduction

1 Introduction

This chapter gives a general introduction to Lucene Domain Index: its features, its benefits, and how it compares with a standalone Lucene implementation and with Oracle Text.

1.1 What is Lucene

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
Apache Lucene is an open source project available for free download.
Since Lucene is a pure Java library, why not run it inside the Oracle Database JVM?

1.2 What is Lucene Domain Index

Lucene Domain Index is a full integration of the Lucene project running inside the Oracle database on the Oracle JVM. Oracle provides a full-featured JVM inside the database, compliant with JDK 1.4 in the 10g release and with JDK 1.5 in 11g.
OJVMDirectory replaces Lucene’s file system storage with BLOB-based storage. The name comes from the class that implements it, which extends Lucene’s Directory class (Directory.java).

Here is a short list of points to take into account when choosing this storage:

  • Using a traditional file system to store the inverted index is not a good option for some users: you get no commit or rollback behavior, no integrated backup, etc.
  • Storing the inverted index in a BLOB while running Lucene outside the Oracle database performs badly because of the many network round trips and the data marshaling involved.
  • Indexing relational data, such as tables with VARCHAR2, CLOB or XMLType columns, with Lucene running outside the database has the same problem as the previous point.
  • With SecureFiles BLOBs on Oracle 11g you can transparently encrypt and compress the Lucene index storage. This reduces disk usage and avoids exposing your relational data outside the database, which would increase risk or violate SOX-related company regulations.
  • The JVM included in the Oracle database can scale up to 10,000+ concurrent sessions without memory leaks or deadlocks, and all operations on tables happen in the same memory space.

In addition, Oracle provides the Data Cartridge API (ODCI), also called the Extensible Indexing mechanism, which lets you write your own Domain Index and integrate it with the Oracle engine and optimizer.

Integrating Lucene through ODCI has some important consequences:

  • Changes to rows are automatically propagated to Lucene; these changes are now enqueued using Oracle AQ. The user can control whether they are applied OnLine (immediately after commit) or Deferred (when the application calls Sync).
  • The Oracle optimizer can choose a proper execution plan when a Domain Index has been created.
  • You can mix lcontains(), lhighlight(), lscore() and many other operators, procedures or functions in your queries.
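The difference between the OnLine and Deferred modes described above can be pictured with a small Java sketch. This is only an illustration under stated assumptions: the class `SyncModeSketch`, its methods and the in-memory queue are all hypothetical, and the sketch uses neither Oracle AQ nor any real Lucene Domain Index class.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration only: an in-memory stand-in for the change queue.
public class SyncModeSketch {
    public enum SyncMode { ONLINE, DEFERRED }

    private final SyncMode mode;
    // Rowids of changed rows waiting to be applied to the index.
    private final Deque<String> pending = new ArrayDeque<>();
    // Stand-in for the inverted index: just remembers which rowids were applied.
    private final Set<String> indexed = new LinkedHashSet<>();

    public SyncModeSketch(SyncMode mode) {
        this.mode = mode;
    }

    // Called when a row changes: the change is queued, never applied directly.
    public void rowChanged(String rowid) {
        pending.add(rowid);
    }

    // In ONLINE mode the queue is drained right after commit;
    // in DEFERRED mode commit leaves the queue untouched.
    public void commit() {
        if (mode == SyncMode.ONLINE) {
            sync();
        }
    }

    // Explicit synchronization requested by the application.
    public void sync() {
        while (!pending.isEmpty()) {
            indexed.add(pending.poll());
        }
    }

    public boolean isIndexed(String rowid) {
        return indexed.contains(rowid);
    }

    public int pendingCount() {
        return pending.size();
    }
}
```

In this sketch, a DEFERRED index keeps changes queued across commits until `sync()` is called, trading index freshness for cheaper commits, while an ONLINE index is current after every commit.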

1.3 Why do I use Lucene Domain Index?

Oracle includes a full-featured, enterprise-grade text search engine named Oracle Text, which is coded in C and fully integrated into the Oracle kernel, but:

  • with Oracle Text you cannot:
    • control which functionality will be included into next release
    • easily customize it for your needs
    • index Index Organized Tables (IOT)
    • index joined tables
    • index unlimited extra columns
    • easily highlight text
    • index NCLOB and NVARCHAR data types
  • on Oracle 10g you cannot:
    • index multiple columns in the same index
    • sort and filter by using indexed columns at index level
  • on Oracle 11g you cannot:
    • filter or sort on TIMESTAMP WITH TIME ZONE columns, which are commonly used in XMLDB because this is the official data type for the xsd:date type
  • with Lucene Domain Index you get:
    • indexes that are usually smaller, because Lucene Domain Index does not store any column except the rowid inside Lucene’s inverted index structure. Using the rowid, Oracle can look up any column value faster than it could be retrieved from the Lucene inverted index
    • support for padding of Text columns
    • support for formatting (rounding/padding) of Number and Date/Time columns
    • on-line index creation even in Standard Edition databases (for Oracle Text this feature is available in EE)
    • custom data type mapping: by extending the DefaultUserDataStore class an application can implement any mapping, especially for BLOB columns, which often have a non-standard encoding
    • an experimental native REST WS that can be used to query the index
    • a transactional Lucene inverted index: if a SQL operation is rolled back, the index stays consistent too, avoiding phantom reads and negative hits (rows which should have been returned as hits but were not included in the Lucene index)
    • a ready-to-use, up-to-date solution for any programming language, for example Ruby, .Net, Python or PHP
    • an elegant solution for highlighting text using pipelined table functions
    • a high-level abstraction layer over the Lucene IR library, so developers only deal with SQL
    • transparent compression and encryption of the Lucene storage if you enable Oracle Transparent Data Encryption and SecureFiles compression
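To see why the padding support for Number columns listed above matters, here is a minimal Java sketch. It assumes that index terms are compared as plain strings, so an unpadded "10" sorts before "9". The class `PaddingSketch` and its helpers are hypothetical and are not part of Lucene Domain Index or of Lucene itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: shows the effect of zero-padding on text ordering.
public class PaddingSketch {
    // Left-pad a non-negative number with zeros so every term has the same width.
    public static String padNumber(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        String digits = Long.toString(value);
        if (digits.length() > width) {
            throw new IllegalArgumentException("value does not fit in " + width + " digits");
        }
        StringBuilder sb = new StringBuilder(width);
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    // Sort a copy of the terms as plain strings (lexicographic order).
    public static List<String> sortAsText(List<String> terms) {
        List<String> copy = new ArrayList<>(terms);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Unpadded: "10" and "100" sort before "9".
        System.out.println(sortAsText(Arrays.asList("9", "10", "100")));
        // Padded to 5 digits: text order now matches numeric order.
        System.out.println(sortAsText(Arrays.asList(
                padNumber(9, 5), padNumber(10, 5), padNumber(100, 5))));
    }
}
```

With a fixed width, sorting and range filtering on the text form give the same result as on the numeric value, which is what makes sorting and filtering at index level possible for such columns. The width has to be chosen large enough for the largest expected value.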

