Zend Lucene And PDF Documents Part 4: Searching

16th November 2009 - 10 minutes read time

Last time we indexed our PDF documents and were ready to add a search form to our application. Adding search requires two things: a form to enter the search terms into, and an action to control what happens when the form is submitted.

The easiest task is to set up the search view. Create the view file search.phtml (in views/scripts/search/) and add the following code. This is just a basic form with a single text input. Note the addition of the queryString parameter: it lets us print the last searched-for value back into the search box, which is good usability practice.

<form action="" method="get" name="searchForm">
    <input type="text" name="q" id="q" value="<?php echo $this->queryString; ?>" />
    <input type="submit" name="search" value="Search" />
</form>

There are many ways to set up the search action, and the right one depends on what you plan to do with it. The simplest method is to pass the query string straight to the find() method of the Lucene index. The method I have chosen here uses the Zend_Search_Lucene_Search_Query_Boolean class to create a multiple term query in which every term is either required, prohibited, or neither. Although at first we only want to pass a single parameter, this approach lets us add search terms in the future without too much fuss. Here is the search action in full.

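public function searchAction()
{
    // Trim whitespace and strip HTML tags from q, and require it to be present.
    $filters = array('q' => array('StringTrim', 'StripTags'));
    $validators = array('q' => array('presence' => 'required'));
    $input = new Zend_Filter_Input($filters, $validators, $_GET);

    // Only validate once a search has actually been submitted.
    if (is_string($this->_request->getParam('q'))) {
        // Escaped value, safe to echo back into the search box.
        $queryString = $input->getEscaped('q');
        $this->view->queryString = $queryString;

        if ($input->isValid()) {
            $config = Zend_Registry::get('config');
            $index = App_Search_Lucene::open($config->luceneIndex);

            $query = new Zend_Search_Lucene_Search_Query_Boolean();

            // No field name is given, so with the default settings the term
            // is looked for in every indexed field. Terms built through the
            // API are not passed through the analyser.
            $pathTerm = new Zend_Search_Lucene_Index_Term($queryString);
            $pathQuery = new Zend_Search_Lucene_Search_Query_Term($pathTerm);
            // The second argument (true) marks the subquery as required.
            $query->addSubquery($pathQuery, true);

            try {
                $hits = $index->find($query);
            } catch (Zend_Search_Lucene_Exception $ex) {
                // Treat a failed search as a search with no results.
                $hits = array();
            }

            $this->view->hits = $hits;
        } else {
            $this->view->messages = $input->getMessages();
        }
    }
}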

Here is what is going on in the action.

A filter and a validator are created so that we can make sure the incoming search conforms to certain rules. The filters also trim the incoming text and strip any tags from it, which keeps markup out of the query.
Because we don't want to run the validator when we first visit the page, we make sure that the query is a string before going any further.
Validate the query string and send any errors found to the view.
Get the config object and open the Lucene index.
Initialise the Zend_Search_Lucene_Search_Query_Boolean object.
Create our search term and add it to our query as a required subquery using the addSubquery() method.
Attempt to run the query and return the results as an array of Zend_Search_Lucene_Search_QueryHit objects. If an error occurs during the search, set $hits to an empty array.
Send the $hits array to the view so we can print it out.
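
To make the required, prohibited and neither options a little more concrete, here is a small sketch in plain PHP. It is not Zend_Search_Lucene code: documentMatches() is a hypothetical function that models, in a simplified way, how I understand the second argument of addSubquery() to behave (true for required, false for prohibited, null for neither). It ignores scoring completely.

```php
<?php
// Plain-PHP illustration only; documentMatches() is hypothetical and is
// not part of Zend_Search_Lucene.
// Each clause is array(term, sign): true = required, false = prohibited,
// null = optional (neither required nor prohibited).
function documentMatches(array $documentTerms, array $clauses)
{
    $hasRequired = false;
    $optionalHit = false;

    foreach ($clauses as $clause) {
        list($term, $sign) = $clause;
        $present = in_array($term, $documentTerms, true);

        if ($sign === true) {
            $hasRequired = true;
            if (!$present) {
                return false; // a required term is missing
            }
        } elseif ($sign === false) {
            if ($present) {
                return false; // a prohibited term is present
            }
        } elseif ($present) {
            $optionalHit = true;
        }
    }

    // With no required clauses, at least one optional term has to match.
    return $hasRequired || $optionalHit;
}

$terms = array('zend', 'lucene', 'pdf');

// 'lucene' is required and present, 'word' is prohibited and absent.
var_dump(documentMatches($terms, array(
    array('lucene', true),
    array('word', false),
)));
```

In the search action only one required clause is added, so a document is a hit whenever it contains that term.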

The Zend_Search_Lucene_Search_QueryHit objects in the $hits array can be used to obtain all of the information that we added during the indexing phase, as well as some other information such as the score of the document in the search. The next step is to add the error messages and search results to the search.phtml file. Adding the messages is quite simple, but adding the search results can be complex, especially as not all of the PDFs in our index will have exactly the same fields. This might be because we haven't filled them in yet, or because we are indexing PDFs from a third party that don't contain some of the meta-tags we have added to our own documents. Because of this we need to make sure that a field exists before we try to print it out, or an exception will be thrown.

Here is the search.phtml file in full. Note that it contains very little formatting, as it is intended to be adapted to whatever you need.

<form action="" method="get" name="searchForm">
    <input type="text" name="q" id="q" value="<?php echo $this->queryString; ?>" />
    <input type="submit" name="search" value="Search" />
</form>
<?php

// Validation messages from the action, grouped by rule name.
if (!empty($this->messages)) {
    foreach ($this->messages as $ruleMessages) {
        foreach ($ruleMessages as $message) {
            echo $this->escape($message) . '<br />';
        }
    }
}

if (!is_null($this->hits)) {
    echo count($this->hits) . ' results found.<hr />';

    foreach ($this->hits as $hit) {
        // Load the stored document once rather than on every field access.
        $document = $hit->getDocument();
        $fields = $document->getFieldNames();

        echo 'id: ' . $hit->id . '<br />';
        echo 'score: ' . $hit->score . '<br />';

        // Filename and Key are printed unconditionally, so they must exist
        // on every document in the index.
        echo 'Filename: ' . $this->escape($document->Filename) . '<br />';
        echo 'Key: ' . $this->escape($document->Key) . '<br />';

        // Optional metadata: only print the fields this PDF actually has.
        foreach (array('Title', 'CreationDate', 'Author') as $name) {
            if (in_array($name, $fields)) {
                echo $name . ': ' . $this->escape($document->getFieldValue($name)) . '<br />';
            }
        }
        echo '<hr />';
    }
}
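
If the list of optional fields grows, the existence check can be pulled out into a small helper. The following is only a sketch: presentFields() and FakeDocument are names I have made up for illustration. FakeDocument is a stand-in so that the example runs without the Zend library; it offers the same getFieldNames() and getFieldValue() methods that Zend_Search_Lucene_Document provides.

```php
<?php
// Hypothetical helper: returns only the wanted fields that exist on a document.
// $document is expected to offer getFieldNames() and getFieldValue(),
// as Zend_Search_Lucene_Document does.
function presentFields($document, array $wanted)
{
    $available = $document->getFieldNames();
    $values = array();

    foreach ($wanted as $name) {
        if (in_array($name, $available, true)) {
            $values[$name] = $document->getFieldValue($name);
        }
    }

    return $values;
}

// Stand-in document class so the sketch runs without the Zend library.
class FakeDocument
{
    private $fields;

    public function __construct(array $fields)
    {
        $this->fields = $fields;
    }

    public function getFieldNames()
    {
        return array_keys($this->fields);
    }

    public function getFieldValue($name)
    {
        return $this->fields[$name];
    }
}

// This document has a Title but no CreationDate or Author.
$document = new FakeDocument(array(
    'Filename' => 'report.pdf',
    'Title'    => 'Annual Report',
));

foreach (presentFields($document, array('Title', 'CreationDate', 'Author')) as $name => $value) {
    echo $name . ': ' . htmlspecialchars($value) . "\n";
}
```

Only the Title line is printed here, because the other two fields are missing from the stand-in document and are skipped rather than requested.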

A good addition to this program would be another search term on the $query object that restricts the search results to a particular day. Rather than use the timestamp for this value, I changed the App_Search_Lucene_Index_Pdfs class so that the CreationDate is stored in the form yyyymmdd. Change the appropriate line in the App_Search_Lucene_Index_Pdfs class (around line 66) to the following.

$indexValues['CreationDate'] = date('Ymd', $dateCreated);

Next, I added another search term to the $query object in the search action. I have hard coded the date in this section, but it shouldn't be too difficult to add a control to the search form along with new validation and filtering rules.

$pathTerm  = new Zend_Search_Lucene_Index_Term('20091023', 'CreationDate');
$pathQuery = new Zend_Search_Lucene_Search_Query_Term($pathTerm);
$query->addSubquery($pathQuery, true);
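
As a rough idea of what the validation for such a date control might look like, here is a sketch in plain PHP. normaliseCreationDate() is a hypothetical helper, and the assumption that the form submits the date as YYYY-MM-DD is mine. It returns the yyyymmdd string that would be used as the CreationDate term, or null if the input is not a real date.

```php
<?php
// Hypothetical helper: converts a 'YYYY-MM-DD' form value into the
// yyyymmdd string stored in the CreationDate field, or null if invalid.
function normaliseCreationDate($input)
{
    $input = trim((string) $input);
    $date = DateTime::createFromFormat('!Y-m-d', $input);

    // createFromFormat() can roll impossible dates such as 2009-13-45 over
    // into a later date, so compare the round trip with the input.
    if ($date === false || $date->format('Y-m-d') !== $input) {
        return null;
    }

    return $date->format('Ymd');
}

var_dump(normaliseCreationDate('2009-10-23')); // string(8) "20091023"
var_dump(normaliseCreationDate('2009-13-45')); // NULL
```

The returned string could then replace the hard coded '20091023' when the Zend_Search_Lucene_Index_Term is created, with the subquery only added when the helper returns a value.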

However, once all of this has been added it doesn't seem to work: the extra term doesn't affect the search results at all. This is because Zend_Search_Lucene uses an analyser to parse and process the text of each indexed field. The default analyser is Zend_Search_Lucene_Analysis_Analyzer_Common_Text_CaseInsensitive, which only produces tokens from letters (converted to lowercase), so the digits in our CreationDate field are dropped when the documents are indexed. The knock-on effect is that searching for numeric values is impossible.
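
The effect is easy to see with a rough imitation of the tokenising step. The sketch below is plain PHP and only approximates the analysers: tokenise() is a hypothetical function that keeps either runs of letters, or runs of letters and digits, and lowercases the result. The real analyser classes do more than this.

```php
<?php
// Rough approximation of the tokenising rules; tokenise() is hypothetical
// and is not part of Zend_Search_Lucene.
function tokenise($text, $keepNumbers)
{
    // Letters only, or letters and digits, depending on the flag.
    $pattern = $keepNumbers ? '/[a-zA-Z0-9]+/' : '/[a-zA-Z]+/';
    preg_match_all($pattern, $text, $matches);

    // Case-insensitive analysers lowercase every token.
    return array_map('strtolower', $matches[0]);
}

var_dump(tokenise('20091023', false)); // empty array: nothing to index
var_dump(tokenise('20091023', true));  // one token: "20091023"
```

With a letters-only rule the date produces no tokens at all, which matches the behaviour seen in the search results.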

To solve this we need to use a different analyser to process our fields so that numbers as well as letters are used in our index. To set a new analyser you need to use the static setDefault() method of the Zend_Search_Lucene_Analysis_Analyzer class and pass it a text analyser object. This can be done with the following code.

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
  new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());

The analysis is done when the document is added to the index via the addDocument() method, so the call to setDefault() can go anywhere before that. In this application the code above can be added either to the App_Search_Lucene_Index_Pdfs class or to the App_Search_Lucene_Document class. I have tried it in both and it works in either location, but I think the best place for it is probably the App_Search_Lucene_Document class. It makes sense to have this as a standard for creating documents from different formats, rather than having to add it to the App_Search_Lucene_Index classes every time a new format is added.

After adding this to your code you will need to re-index the documents before numeric search will work.

In the next (and final) post about Zend Lucene and PDF documents I will add an observer to the code so that we don't have to keep re-indexing the entire file directory every time we make a change to any documents. I will also be making the full source code available for download.
