Using XPath With HTML Files In PHP

14th March 2013 - 11 minutes read time

I have recently started working towards becoming a Zend Certified PHP Engineer, and after a bit of research I found that the standard PHP string and array functions appear to make up a large part of the exam material. So as a starting point (and for future revision) I decided to create a revision sheet for those functions.

Rather than do this job by hand I wanted an automated way of getting hold of the function definitions. It is possible to download the PHP documentation as a set of many HTML files from the PHP website. This gave me the files I needed to start extracting the necessary information. The function declarations I needed were spread across two types of HTML files. There was an index file that contained single line descriptions of the functions, each of which links to a page describing that function in detail. The inner pages contained the function declarations so what I needed was to extract all of the links from the function index file and all of the declarations at the top of each linked page.

To do this I used PHP's DOM classes, which were introduced in PHP 5. These classes are a great way of parsing XML and HTML without messing around with regular expressions, where things can go wrong very quickly. Now that I have learned how to use them, the DOM classes are my preferred approach to extracting data from HTML files. However, real-world HTML is often not well formed, and the parser raises a warning for every oddity it finds, so we first need to suppress those errors. This can be done by passing a value of true to the libxml_use_internal_errors() function.

libxml_use_internal_errors(TRUE);

We could retrieve any errors found in the document using the libxml_get_errors() function, but we will just be throwing the errors away so there is no need to do that. This function returns an array of LibXMLError objects, so if you wanted to you could loop through them and try to do something about them.
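If you did want to look at the errors rather than throw them away, a minimal sketch might look like the following. The HTML string is made up for this example and is broken on purpose; libxml_clear_errors() is called afterwards so the error buffer does not keep growing as more documents are parsed.

```php
<?php
// Collect parse errors internally instead of emitting PHP warnings
libxml_use_internal_errors(true);

// Deliberately broken HTML (made up for this example)
$html = '<html><body><p>Unclosed paragraph<div></span></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line and message properties
$errors = libxml_get_errors();
foreach ($errors as $error) {
  echo 'Line ' . $error->line . ': ' . trim($error->message) . PHP_EOL;
}

// Empty the error buffer now that we have finished with it
libxml_clear_errors();
```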

I then loaded the strings functions index page into a DOMDocument object using the loadHTMLFile() method. This creates a usable DOM from the HTML file, which is then passed into a new DOMXPath object so that I can query the document using XPath.

$dom = new DOMDocument();
$dom->loadHTMLFile('ref.strings.html');
$x = new DOMXPath($dom);

XPath is an XML query language that is used to find elements within an XML document and is pretty easy to use. What we essentially need to do is find all of the anchor elements (a) that are children of list elements (li), which are children of an unordered list element (ul) with a class value of 'chunklist chunklist_reference'. The DOMXPath object is used to run the XPath query using the appropriately named query() method. The query() method returns a traversable DOMNodeList object that contains the matching DOMNode objects. What this means is that we can use it in a foreach() loop and look at each node individually. Because each match here is an element, the node is a DOMElement object (a subclass of DOMNode), so we can get hold of the href of the anchor tag by calling its getAttribute() method and asking for the 'href' attribute. The following code will print out all of the href attributes found in the links.

foreach($x->query("//ul[@class='chunklist chunklist_reference']/li/a") as $node) {
  $href = $node->getAttribute("href");
  echo $href . PHP_EOL;
}
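One thing to be aware of is that [@class='...'] is an exact string comparison, so it only matches when the attribute contains those class names in that order. The sketch below uses made-up markup to compare it with a commonly used XPath 1.0 idiom for matching a single class name; the query in this post doesn't need it, but it can help if the markup varies.

```php
<?php
libxml_use_internal_errors(true);

// Made-up markup: the second list has its class names in a different order
$html = '<html><body>
<ul class="chunklist chunklist_reference"><li><a href="function.trim.html">trim</a></li></ul>
<ul class="chunklist_reference chunklist"><li><a href="function.strlen.html">strlen</a></li></ul>
</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$x = new DOMXPath($dom);

// Exact string comparison: only matches the first list
$exact = $x->query("//ul[@class='chunklist chunklist_reference']/li/a");

// Token match: pads the attribute with spaces and looks for one class name
$token = $x->query("//ul[contains(concat(' ', normalize-space(@class), ' '), ' chunklist_reference ')]/li/a");

echo $exact->length . PHP_EOL; // 1
echo $token->length . PHP_EOL; // 2
```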

Now that I have all of the file references I need, I can load each inner page into a new DOMDocument object and run a different XPath query. This time, however, we should only ever get a single result (i.e. the function definition), so we just need to get hold of that single item. This can be done by using the item() method of the DOMNodeList object. Here is the code that loads the inner function page (based on the href we picked up in the loop above) and finds the function definition contained within a div element with the class attribute of 'methodsynopsis dc-description'.

$function_dom = new DOMDocument();
$function_dom->loadHTMLFile('php-chunked-xhtml/' . $href);
$function_x = new DOMXPath($function_dom); 

// Pick out the function definition
$function_node_list = $function_x->query("//div[@class='methodsynopsis dc-description']");
$function_node = $function_node_list->item(0);

The node we now have contains far more markup than we actually need, as the function definition contains lots of inner tags that separate each component. What we have is a DOMNode object, which contains several child DOMNode objects, and we now need to translate this into a text format. What we could do is traverse this tree of DOMNode objects, pulling the text content out of each, one at a time. Thankfully, the DOMNode object has a property called textContent, which already contains the text for this object and all child objects. We can therefore pull out the contents of the DOMNode tree like this:

$function_definition = $function_node->textContent;
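To see what textContent does with nested tags, here is a small self-contained sketch. The markup is simplified and made up (example_function is not a real PHP function); it is only loosely modelled on a function synopsis.

```php
<?php
libxml_use_internal_errors(true);

// Simplified, made-up markup loosely modelled on a function synopsis
$html = '<html><body><div class="methodsynopsis dc-description">
  <span class="type">string</span>
  <span class="methodname"><strong>example_function</strong></span>
  ( <span class="methodparam">string <code>$input</code></span> )
</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$x = new DOMXPath($dom);

$node = $x->query("//div[@class='methodsynopsis dc-description']")->item(0);

// textContent concatenates every text node beneath the element,
// so the tags disappear but the whitespace between them stays
var_dump($node->textContent);
```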

The actual text content that this produces is a little messy due to the white space that's left behind after the HTML tags are removed. So the definition just needs to be run through a couple of cleaning steps to tidy up the output.

$function_definition = trim(preg_replace("/\s{2,}/", ' ', $function_node->textContent));
$function_definition = str_replace(array(' (', '( ', ' )'), array('(', '(', ')'), $function_definition);
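As a quick illustration of what those two steps do, here they are applied to a made-up string of the sort that textContent might leave behind. The input is an assumption for demonstration purposes, not real output from the PHP documentation.

```php
<?php
// Made-up example of the raw text left behind once the tags are gone
$raw = "\n   string\n   example_function\n    ( string \$input\n    )\n";

// Collapse every run of two or more whitespace characters into one space
$clean = trim(preg_replace("/\s{2,}/", ' ', $raw));

// Tighten the spacing around the brackets
$clean = str_replace(array(' (', '( ', ' )'), array('(', '(', ')'), $clean);

echo $clean . PHP_EOL; // string example_function(string $input)
```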

One problem I encountered was that some of the pages in the list of functions are actually aliases, and therefore have a slightly different structure to the normal function pages. This means that the XPath query we ran before will not find anything. If this happens then the query() method of the DOMXPath object returns an empty DOMNodeList, and calling item(0) on that list returns a NULL value, which we can easily detect. All we need to do then is run a slightly different query to pick out the alias definition.

if (is_null($function_node)) {
  // This is an alias, slightly different structure to a function page
  $alias_node_list = $function_x->query("//p[@class='refpurpose']");
  $function_node = $alias_node_list->item(0);
}
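It can help to see exactly what comes back when a query matches nothing. In this sketch (the one-line alias page is made up) the query for the synopsis div finds no elements, yet query() still hands back a DOMNodeList; it is the item(0) call that produces the NULL.

```php
<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<html><body><p class="refpurpose">Made-up alias page</p></body></html>');
$x = new DOMXPath($dom);

// No div matches, but query() still returns a DOMNodeList
$list = $x->query("//div[@class='methodsynopsis dc-description']");

var_dump($list instanceof DOMNodeList); // bool(true)
var_dump($list->length);                // int(0)
var_dump($list->item(0));               // NULL
```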

One last thing I wanted to do was to extract the function name from the definition. This can be done easily by using an XPath subquery. If you pass a DOMNode object as the second parameter to the query() method, the query you run is then relative to that node. This means that I can search for a span with a certain class attribute within the function declaration node without having to worry if the same thing appears elsewhere within the global DOM. The XPath queries needed for the function and alias names are slightly different here so I have posted them both below.

// Function name
$function_name = $function_x->query("./span[@class='methodname']/strong", $function_node)->item(0)->textContent;
// Alias name
$function_name = $function_x->query("./span[@class='refname']", $function_node)->item(0)->textContent;
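The following sketch, using made-up markup, shows the difference the context node makes. Note that the query has to be a relative path (starting with ./) for this to work; a path starting with // searches the whole document even when a context node is supplied.

```php
<?php
libxml_use_internal_errors(true);

// Made-up markup with the same span class inside and outside the synopsis
$html = '<html><body>
<span class="methodname"><strong>outside</strong></span>
<div class="methodsynopsis dc-description"><span class="methodname"><strong>inside</strong></span></div>
</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$x = new DOMXPath($dom);

$context = $x->query("//div[@class='methodsynopsis dc-description']")->item(0);

// Relative to the context node: only the span inside the div is found
$relative = $x->query("./span[@class='methodname']/strong", $context);

// An absolute path ignores the context node and finds both spans
$global = $x->query("//span[@class='methodname']/strong", $context);

echo $relative->length . ' ' . $relative->item(0)->textContent . PHP_EOL; // 1 inside
echo $global->length . PHP_EOL;                                           // 2
```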

All of the code above is ready to be brought together into a single function. The following code takes the location of a listings page as input and extracts the function definitions within it into a single array, which is then returned.

function get_function_list($index_file) {
  // Collect invalid HTML errors internally instead of emitting warnings
  libxml_use_internal_errors(TRUE);

  $functions = array();

  // Links in the index are relative to the directory that contains it
  $base_dir = dirname($index_file);

  // Parse the main HTML document
  $dom = new DOMDocument();
  $dom->loadHTMLFile($index_file);
  $x = new DOMXPath($dom);

  // Get all of the function page links
  foreach ($x->query("//ul[@class='chunklist chunklist_reference']/li/a") as $node) {
    $href = $node->getAttribute("href");

    // Get the function file contents and parse it
    $function_dom = new DOMDocument();
    $function_dom->loadHTMLFile($base_dir . '/' . $href);
    $function_x = new DOMXPath($function_dom);

    // Pick out the function definition
    $function_node = $function_x->query("//div[@class='methodsynopsis dc-description']")->item(0);
    $name_query = "./span[@class='methodname']/strong";

    if (is_null($function_node)) {
      // This is an alias, slightly different structure to a function page
      $function_node = $function_x->query("//p[@class='refpurpose']")->item(0);
      $name_query = "./span[@class='refname']";
    }

    if (is_null($function_node)) {
      // Neither structure matched, so skip this page
      continue;
    }

    // Get the function (or alias) name, relative to the node we found
    $name_node = $function_x->query($name_query, $function_node)->item(0);
    if (is_null($name_node)) {
      continue;
    }
    $function_name = $name_node->textContent;

    // Extract the contents into string, stripping some of the whitespace
    $function_definition = trim(preg_replace("/\s{2,}/", ' ', $function_node->textContent));
    $function_definition = str_replace(array(' (', '( ', ' )'), array('(', '(', ')'), $function_definition);

    // Add the function to our definitions list
    $functions[$function_name] = $function_definition;
  }

  return $functions;
}

Here is the code I used to run the above function and save the output into a file. As I said before I only wanted to extract the string and array functions so I am only looking at those index files.

$file_contents = '';
$file_contents .= '--STRING FUNCTIONS--' . PHP_EOL;
$functions = get_function_list('ref.strings.html');
foreach ($functions as $function) {
  $file_contents .= $function . PHP_EOL;
}

file_put_contents('string_functions.txt', $file_contents);

$file_contents = '';
$file_contents .= '--ARRAY FUNCTIONS--' . PHP_EOL;
$functions = get_function_list('ref.array.html');
foreach ($functions as $function) {
  $file_contents .= $function . PHP_EOL;
}

file_put_contents('array_functions.txt', $file_contents);
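Since the two blocks above only differ in the heading, the index file and the output file, the repeated part could be pulled into a small helper. This is just a sketch: write_function_file() is a hypothetical name, and the array below is stand-in data in the shape that get_function_list() returns, written to a temporary file so the example runs on its own.

```php
<?php
// Hypothetical helper: writes a heading followed by one definition per line
function write_function_file($filename, $heading, array $functions) {
  $lines = array_merge(array('--' . $heading . '--'), array_values($functions));
  return file_put_contents($filename, implode(PHP_EOL, $lines) . PHP_EOL);
}

// Stand-in data in the shape returned by get_function_list()
$functions = array(
  'example_function' => 'string example_function(string $input)',
);

// Write to a temporary file, read it back, then tidy up
$file = tempnam(sys_get_temp_dir(), 'fn');
write_function_file($file, 'STRING FUNCTIONS', $functions);
$written = file_get_contents($file);
unlink($file);

echo $written;
```

With the real data this would be called as write_function_file('string_functions.txt', 'STRING FUNCTIONS', get_function_list('ref.strings.html')).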

I now have two files that contain the array and string functions available in PHP, which I printed out and stuck to the wall of my office as a revision guide.