Last time we looked at viewing and saving metadata to PDF documents using Zend Framework. The next step, before we try to index them with Zend Lucene, is to extract the text from the documents themselves. I should note here that we can't extract the data perfectly from every PDF document, and we certainly can't turn any images or tables in the PDF into recognisable text. Extracting the text is a little awkward because we are essentially looking at compressed data: the text isn't saved into the document as plain text, it is rendered into the document using a font. So what we need to do is extract this data into a format that Lucene can tokenize.
Because we are just getting the text out of the document for our search index, we can take a few short-cuts in order to get as much textual data out of the document as possible. Not all of this data will be fully readable, and we will definitely lose any formatting and images, but for a search index we don't really need them. The idea is to retrieve as much relevant and indexable content as possible for Zend Lucene to tokenize. Also, it is not possible to extract the data out of encrypted PDF documents.
What we need to do first is set things up so that we can simply use a PDF extraction service to do the hard work for us. This requires a little more understanding of Zend Framework than the last post did. What we are going to do is register a namespace with Zend_Loader_Autoloader. This will allow us to create classes that we can keep in a tidy folder structure and that are automatically included when we need them. If you don't have one already, create a method called _initAutoload() or similar in the Bootstrap class in your Bootstrap.php file, then enter the following code (the whole class is included here for clarity). You might have already done this in your Zend Framework project, in which case you can skip this step.
class Bootstrap extends Zend_Application_Bootstrap_Bootstrap
{
    protected function _initAutoload()
    {
        $autoloader = Zend_Loader_Autoloader::getInstance();
        $autoloader->registerNamespace(array('App_'));
    }
}
This registers the App_ class prefix with the Zend Framework autoloader, so that classes in a folder called App, which is located in our library folder, are loaded automatically. Create a class called App_Search_Helper_PdfParser and put it in the folder \library\App\Search\Helper\ like this:
--application
--library
----App
------Search
--------Helper
----------PdfParser.php
----Zend
Now we can instantiate the object without having to worry about whether it has been included or not. The Zend Framework autoloader will simply work out the right place for the file from the class name and include it for us. We will use this folder structure for the rest of the application and build upon it as we add classes.
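To make the naming convention concrete, here is a minimal sketch of how a class name maps to a file path under the library folder. The function classNameToPath() is a made-up helper for illustration only; it mirrors the underscore-to-folder convention and is not the actual Zend_Loader code.

```php
<?php
// Hypothetical helper: mirrors the underscore-to-folder naming convention.
// It is NOT the real Zend_Loader implementation.
function classNameToPath($className)
{
    // Each underscore in the class name becomes a directory separator.
    return str_replace('_', '/', $className) . '.php';
}

echo classNameToPath('App_Search_Helper_PdfParser'), "\n";
// prints "App/Search/Helper/PdfParser.php"
```

The resulting path is relative to a folder on the include path, which in this project is the library folder.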
What we need to do now is create the code that will run over our PDF document and pick out the text. I have to admit that I didn't write this entirely myself; it is the result of a couple of hours of picking bits and pieces of code from examples and applications so that I could do what I needed to do. I have tested this code with about 50 PDF documents from different sources, so it should be able to extract data from most PDF types. What this code essentially does is split the document into sections and then try to uncompress each section that has a FlateDecode filter type. If the decompression works (i.e. we get some data back), the text found in it is added to a string, and that string is returned once the end of the document is reached. I have also added some string manipulation that strips out any odd characters or white space that we don't need. Here is the class in full. There is rather a lot of code here, so I have commented it to make it clearer.
Also, because the code uses gzuncompress(), you will need PHP's zlib extension enabled on your server for this to work properly.
class App_Search_Helper_PdfParser
{
    /**
     * Convert a PDF into text.
     *
     * @param string $data The rendered PDF document as a string.
     * @return string|null The extracted text from the PDF, or NULL if none was found.
     */
    public function pdf2txt($data)
    {
        /**
         * Split apart the PDF document into sections. We will address each
         * section separately.
         */
        $a_obj = $this->getDataArray($data, "obj", "endobj");
        // Start with an empty list so the decode loop is safe if nothing is found.
        $a_chunks = array();
        $j = 0;
        /**
         * Attempt to extract each part of the PDF document into a "filter"
         * element and a "data" element. This can then be used to decode the
         * data.
         */
        foreach ($a_obj as $obj) {
            $a_filter = $this->getDataArray($obj, "<<", ">>");
            if (is_array($a_filter) && isset($a_filter[0])) {
                $a_chunks[$j]["filter"] = $a_filter[0];
                $a_data = $this->getDataArray($obj, "stream", "endstream");
                if (is_array($a_data) && isset($a_data[0])) {
                    $a_chunks[$j]["data"] = trim(substr($a_data[0], strlen("stream"), strlen($a_data[0]) - strlen("stream") - strlen("endstream")));
                }
                $j++;
            }
        }
        $result_data = '';
        // Decode the chunks.
        foreach ($a_chunks as $chunk) {
            // Look at each chunk and decide if we can decode it by looking at the contents of the filter.
            if (isset($chunk["data"])) {
                // Look at the filter to find out which encoding has been used.
                if (strpos($chunk["filter"], "FlateDecode") !== false) {
                    // Use gzuncompress but suppress error messages.
                    $data = @gzuncompress($chunk["data"]);
                    if (trim($data) != "") {
                        // If we got data then attempt to extract it.
                        $result_data .= ' ' . $this->ps2txt($data);
                    }
                }
            }
        }
        /**
         * Make sure we don't have large blocks of white space before and after
         * our string. Also keep only alphanumeric characters and spaces to
         * reduce redundant data.
         */
        $result_data = trim(preg_replace('/([^a-z0-9 ])/i', ' ', $result_data));
        // Return the data extracted from the document.
        if ($result_data == "") {
            return NULL;
        } else {
            return $result_data;
        }
    }
    /**
     * Strip out the text from a small chunk of data.
     *
     * @param string $ps_data The chunk of data to convert.
     * @return string The string extracted from the data.
     */
    public function ps2txt($ps_data)
    {
        // Stop this function returning bogus information from non-data string.
        if (ord($ps_data[0]) < 10) {
            return $ps_data;
        }
        if (substr($ps_data, 0, 8) == '/CIDInit') {
            return '';
        }
        $result = "";
        $a_data = $this->getDataArray($ps_data, "[", "]");
        // Extract the data.
        if (is_array($a_data)) {
            foreach ($a_data as $ps_text) {
                $a_text = $this->getDataArray($ps_text, "(", ")");
                if (is_array($a_text)) {
                    foreach ($a_text as $text) {
                        // Drop the surrounding parentheses.
                        $result .= substr($text, 1, strlen($text) - 2);
                    }
                }
            }
        }
        // Didn't catch anything, try a different way of extracting the data.
        if (trim($result) == "") {
            // The data may just be in raw format (outside of [] tags).
            $a_text = $this->getDataArray($ps_data, "(", ")");
            if (is_array($a_text)) {
                foreach ($a_text as $text) {
                    $result .= substr($text, 1, strlen($text) - 2);
                }
            }
        }
        // Replace stray single characters (anything other than "a" or "i")
        // that sit between word boundaries with a space.
        $result = preg_replace('/\b([^ai])\b/i', ' ', $result);
        return trim($result);
    }
    /**
     * Convert a section of data into an array, separated by the start and end words.
     *
     * @param string $data The data.
     * @param string $start_word The start of each section of data.
     * @param string $end_word The end of each section of data.
     * @return array The array of data.
     */
    public function getDataArray($data, $start_word, $end_word)
    {
        $start = 0;
        $end = 0;
        $a_result = array();
        while ($start !== false && $end !== false) {
            $start = strpos($data, $start_word, $end);
            $end = strpos($data, $end_word, $start);
            if ($end !== false && $start !== false) {
                // data is between start and end
                $a_result[] = substr($data, $start, $end - $start + strlen($end_word));
            }
        }
        return $a_result;
    }
}
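To see the two steps the class performs in isolation, here is a small sketch that inflates a compressed stream and then pulls the text strings out of it. The helpers inflateStream() and extractShownText() are made up for illustration and are not part of the class; the content stream is an invented sample, and the regular expression is a simplification that ignores escaped or nested parentheses.

```php
<?php
// Hypothetical helpers for illustration; they are not part of the class above.

// Inflate a zlib-compressed stream body, returning null if the data is not valid.
function inflateStream($bytes)
{
    $decoded = @gzuncompress($bytes);
    return ($decoded === false) ? null : $decoded;
}

// Collect the contents of every (...) string in a content stream.
// Simplification: escaped or nested parentheses are not handled.
function extractShownText($contentStream)
{
    preg_match_all('/\(([^()]*)\)/', $contentStream, $matches);
    return implode('', $matches[1]);
}

// An invented content stream, compressed in the zlib format that
// gzuncompress() expects.
$stream = gzcompress("BT /F1 12 Tf [(Hel) -20 (lo ) 15 (World)] TJ ET");

$decoded = inflateStream($stream);
echo extractShownText($decoded), "\n"; // prints "Hello World"
```

The class does the same job by scanning with strpos() rather than a regular expression, and it only attempts this on sections whose filter mentions FlateDecode.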
To use this within your application just instantiate the object and call the pdf2txt() method, passing in the rendered PDF string as the parameter. Rather than get this object to open the file a second time (after first being opened to inspect the PDF data) I decided to use the Zend_Pdf object to transfer the data into the class. The following code shows how to load a PDF using Zend_Pdf and pass the rendered string to the pdf2txt() method.
$pdf = Zend_Pdf::load($pdfPath);
$pdfParse = new App_Search_Helper_PdfParser();
$contents = $pdfParse->pdf2txt($pdf->render());
What we should be left with after this process is a block of text that we can use in our search index.
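If you want the call to fail quietly instead of stopping the page, something like the following wrapper could be used. This is only a sketch: extractPdfText() is a hypothetical function name, and it assumes that any problem while loading or rendering the document surfaces as an exception. It also returns an empty string when Zend_Pdf cannot be found, which is what happens when Zend Framework is not on the include path.

```php
<?php
// Hypothetical wrapper; extractPdfText() is not part of Zend Framework.
function extractPdfText($pdfPath)
{
    // Without Zend Framework on the include path these classes cannot be loaded.
    if (!class_exists('Zend_Pdf') || !class_exists('App_Search_Helper_PdfParser')) {
        return '';
    }
    // Nothing to do if the file is missing or unreadable.
    if (!is_readable($pdfPath)) {
        return '';
    }
    try {
        $pdf = Zend_Pdf::load($pdfPath);
        $parser = new App_Search_Helper_PdfParser();
        $contents = $parser->pdf2txt($pdf->render());
    } catch (Exception $e) {
        // Assumption: loading or rendering problems are reported as exceptions.
        return '';
    }
    // pdf2txt() returns NULL when nothing could be extracted.
    return ($contents === null) ? '' : $contents;
}
```

An empty string can then simply be skipped when building the search index.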
In the next post I will tie together the metadata and the contents retrieval and use them to index our PDF documents using Zend Lucene. Again I will make all of the source code available for this project in the final instalment, so stay tuned if you would like it.
Comments
Submitted by Shankar on Thu, 02/04/2010 - 19:57
Submitted by giHlZp8M8D on Thu, 02/04/2010 - 20:36
causes an error. Any ideas? Is this a valid substr-command? expects parameter 2 to be long, string given in
Submitted by Anonymous on Mon, 04/11/2011 - 13:37
No, that was incorrect, it should have been a call to strpos() instead of substr(). I have corrected that now.
Submitted by giHlZp8M8D on Mon, 04/11/2011 - 13:45
Submitted by Edward on Sat, 08/02/2014 - 01:18
Submitted by Jamil on Wed, 02/24/2016 - 20:02
$pdf = Zend_Pdf::load($pdfPath);
Uncaught Error: Class "Zend_Pdf" not found in E:\webSoft\xampp_sarber\htdocs\2021_\index.php:15
Submitted by Najmul Hasan Ferdous on Sat, 05/01/2021 - 06:49