Getting Started With Zend_Lucene

20th February 2009

Zend_Lucene (the Zend_Search_Lucene component) is an implementation of the Lucene search engine in PHP5 and is included as part of the Zend Framework from version 1.6. Lucene implements all of the standard search engine query syntaxes (e.g. boolean and wildcard searches) and stores its index as files, so it doesn't need a database server to run. It is a good fit if you want to add search functionality to a site but don't want to build a query syntax from scratch.

To get started with Lucene you need to create an index. The following code has the effect of creating a directory on your server that Lucene will use to store and retrieve documents.

$index = Zend_Search_Lucene::create('/data/my-index');

To open an existing index, use the following code.

$index = Zend_Search_Lucene::open('/data/my-index');
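In practice you often don't know whether the index has been created yet. The following is a minimal sketch of an "open or create" helper, assuming the Zend Framework 1 library is on your include path and that open() throws a Zend_Search_Lucene_Exception when no index exists at the given path. The function name openOrCreateIndex() is my own and is not part of the framework.

```php
<?php
require_once 'Zend/Search/Lucene.php';

/**
 * Open the index at $path, creating it first if it does not exist yet.
 * openOrCreateIndex() is a hypothetical helper, not a framework function.
 */
function openOrCreateIndex($path)
{
    try {
        // Succeeds only if an index already exists in this directory.
        return Zend_Search_Lucene::open($path);
    } catch (Zend_Search_Lucene_Exception $e) {
        // No usable index was found, so start a new, empty one.
        return Zend_Search_Lucene::create($path);
    }
}
```

You would then call $index = openOrCreateIndex('/data/my-index'); in place of the separate create() and open() calls.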

Of course your index will not contain anything so the next step is to add some documents to it.

To add a document you first need to create a new document object. This is done using the Zend_Search_Lucene_Document class.

$doc = new Zend_Search_Lucene_Document();

You can then assign fields to this document using the static functions of the Zend_Search_Lucene_Field class.

$doc->addField(Zend_Search_Lucene_Field::Text('title', 'The title of the document'));
$doc->addField(Zend_Search_Lucene_Field::Text('contents', 'The contents of the document.'));
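Text is only one of the field types available. As a rough sketch of how the other static functions differ, the example below builds a document using four of them. The field names and values are made up for illustration, and the helper name buildExampleDocument() is my own. It assumes the Zend Framework 1 library is on your include path.

```php
<?php
require_once 'Zend/Search/Lucene.php';

/**
 * Build a document that uses several field types.
 * buildExampleDocument() and the field names are illustrative only.
 */
function buildExampleDocument()
{
    $doc = new Zend_Search_Lucene_Document();

    // Text: tokenized, indexed and stored, so it is searchable and retrievable.
    $doc->addField(Zend_Search_Lucene_Field::Text('title', 'The title of the document'));

    // UnStored: tokenized and indexed but not stored, useful for large bodies of text.
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', 'The contents of the document.'));

    // Keyword: indexed as a single term without tokenizing, and stored.
    $doc->addField(Zend_Search_Lucene_Field::Keyword('docId', '42'));

    // UnIndexed: stored with the document but not searchable.
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', 'http://www.hashbangcode.com/'));

    return $doc;
}
```

Choosing UnStored for long text keeps the index smaller, at the cost of not being able to read the text back from a search hit.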

You can also store binary data in the index. Although it isn't searchable, it is held with the document, so it is possible, for example, to add thumbnail data to your documents.

$doc->addField(Zend_Search_Lucene_Field::Binary('originalfile', $filedata));

Any binary data you assign like this is neither tokenized nor indexed, only stored, so you need to add other fields to the document so that it can be found by a search.

Once you have added your fields you can add the document to the index using the addDocument() function of the opened index object.

$index->addDocument($doc);

If you are building a search index for a site then you might want to use the built-in HTML parsing functionality. This makes it easy to give Lucene either an HTML string or an HTML filename to parse into a document. You then add this document to the index using the addDocument() function of the opened index object. Note that when adding documents in this way you should also add the URL of the document as a field so that you can retrieve it later.

$doc = Zend_Search_Lucene_Document_Html::loadHTMLFile('http://www.hashbangcode.com/');
$doc->addField(Zend_Search_Lucene_Field::Text('url', 'http://www.hashbangcode.com/'));
$index->addDocument($doc);
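If you already have the markup in memory, the HTML string variant looks much the same. This is a small sketch, assuming the Zend Framework 1 library is on your include path and the PHP DOM extension is available. The markup, the example.com URL and the helper name documentFromHtml() are placeholders of my own.

```php
<?php
require_once 'Zend/Search/Lucene.php';

/**
 * Parse an HTML string into a Lucene document and record where it came from.
 * documentFromHtml() is a hypothetical helper, not a framework function.
 */
function documentFromHtml($html, $url)
{
    // loadHTML() parses a string, where loadHTMLFile() reads from a file or URL.
    $doc = Zend_Search_Lucene_Document_Html::loadHTML($html);

    // Keep the source URL with the document so search results can link to it.
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $url));

    return $doc;
}

$html = '<html><head><title>Example page</title></head>'
      . '<body><p>Some body text.</p></body></html>';
$doc = documentFromHtml($html, 'http://www.example.com/page');
```

The returned object can be passed to $index->addDocument() like any other document.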

You can also index and search Word, Excel and PowerPoint documents in much the same way as this.

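Once you have the index you can search it. This is done using an opened index object. You can also find out how many documents your index holds by using the count() and numDocs() functions respectively: count() returns the total number of documents, including any that have been marked as deleted, and numDocs() returns the number of documents that have not been deleted.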

$indexSize = $index->count();
$documents = $index->numDocs();

To construct a query that supports boolean and wildcard searching, parse the query string with the Zend_Search_Lucene_Search_QueryParser class. The parsed query is then added to a Zend_Search_Lucene_Search_Query_Boolean object using the addSubquery() function. Passing true as the second argument marks the subquery as required.

$queryStr = 'hash';
$userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);

$query = new Zend_Search_Lucene_Search_Query_Boolean();
$query->addSubquery($userQuery, true);

// do the search
$hits = $index->find($query);
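To get a feel for the query syntax, it can help to run a handful of query strings against the same index. The sketch below assumes the Zend Framework 1 library is on your include path. The helper name runQueries() and the sample query strings are my own, and which of them return hits depends entirely on what is in your index.

```php
<?php
require_once 'Zend/Search/Lucene.php';

/**
 * Parse and run each query string, returning the hits keyed by query string.
 * runQueries() is a hypothetical helper, not a framework function.
 */
function runQueries($index, array $queryStrings)
{
    $results = array();
    foreach ($queryStrings as $queryStr) {
        // The parser turns the string into a query object.
        $query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
        $results[$queryStr] = $index->find($query);
    }
    return $results;
}

// Sample query strings, for illustration only.
$sampleQueries = array(
    'hash AND code',  // boolean: both terms must match
    'title:hash',     // restrict the term to the title field
    'has*',           // wildcard: terms starting with "has"
    '"hash bang"',    // phrase: the words must appear together
);
```

Calling runQueries($index, $sampleQueries) on an opened index gives you one array of hits per query string, which makes it easy to compare how each form of syntax behaves.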

The variable $hits now contains an array of Zend_Search_Lucene_Search_QueryHit objects. Each of these objects has a property called score, which is an indication (between 0 and 1) of how closely the document matched the query. The hits are sorted by score, so the first item in the $hits array will have the highest score value. Every field that you defined for the document whilst indexing is also available as a property of the hit object. So if you set a URL field for your documents you can list the matching documents using the following code:

$hits = $index->find($query);
foreach ($hits as $hit) {
    echo $hit->score.'<br />';
    echo $hit->url.'<br />';
}
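If the URLs or other field values might contain characters such as & or <, it is worth escaping them before printing. Here is a small sketch of one way to do that. The function name renderHits() is my own, and it only relies on the score and url properties used in the loop above.

```php
<?php
/**
 * Turn a list of hits into a block of HTML, one line per hit.
 * renderHits() is a hypothetical helper; it expects objects that
 * expose score and url properties, as the query hits above do.
 */
function renderHits(array $hits)
{
    $lines = array();
    foreach ($hits as $hit) {
        // Show the score to two decimal places and escape the URL for HTML output.
        $lines[] = sprintf(
            '%.2f %s',
            $hit->score,
            htmlspecialchars($hit->url, ENT_QUOTES, 'UTF-8')
        );
    }
    return implode("<br />\n", $lines);
}
```

You would then write echo renderHits($hits); instead of echoing each property inside the loop.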

Lucene can do a lot more than what I have briefly detailed here so I might write some posts in the future on how to refine updating, indexing and searching.
