Validating XML Files With XML Schema Definitions In PHP

XML is a useful format for configuration, data storage, and transmitting data from one system to another. Because it is human-readable and also easy for machines to parse, it quickly gained favor in lots of different systems as a mechanism for data storage.

Many systems transparently make use of XML without the user ever seeing it. The SOAP protocol is built around XML data, and many API endpoints can be asked to respond in either JSON or XML. It is also common to see XML files used for system configuration, as they can be easily edited and parsed when needed.

XML Schema Definitions (XSD) were first published by the W3C in 2001 and are one of a number of different XML schema formats that exist. Using this format we can check whether an XML file is valid and, if it is not, get some insight into why. This avoids having to hard-code validation logic for the XML document into your systems.

In this article we will look at using PHP to validate an XML document against an XSD document, and also how to render the error output correctly.

The XML and XSD Files

Let's take a very simple XSD document that might be used on a system that stores company address information.

<?xml version="1.0" encoding="utf-8"?>
<xsd:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="company">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element type="xsd:string" name="company_email" minOccurs="1" />
        <xsd:element type="xsd:string" name="company_name" minOccurs="1" />
        <xsd:element type="xsd:string" name="company_address1" minOccurs="1" />
        <xsd:element type="xsd:string" name="company_postcode" minOccurs="1" />
        <xsd:element type="xsd:string" name="company_tel" minOccurs="1" />
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

We might receive an XML file that looks like this.

<?xml version="1.0" encoding="utf-8"?>
<company>
  <company_email>[email protected]</company_email>
  <company_name>example</company_name>
  <company_address1>1 Testing Street</company_address1>
  <company_postcode>T35 7ER</company_postcode>
  <company_tel>0123456789</company_tel>
</company>

Validation

To validate this XML document we need to load it into memory, which we can do using the DOMDocument object that is built into PHP. The load() method allows us to load an XML file from a given filename.

$xml = new DOMDocument();
$xmlDoc = 'company.xml';
$xml->load($xmlDoc, LIBXML_NOBLANKS);

We can also use the loadXML() method to load the XML from a string, which might be more useful if we received the XML via an API call or similar.

$xml = new DOMDocument();
$xmlDoc = 'company.xml';
$xmlString = file_get_contents($xmlDoc);
$xml->loadXML($xmlString, LIBXML_NOBLANKS);
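
Both load() and loadXML() return false when the XML is not well formed, so it can be worth checking the result before moving on to schema validation. The following is a minimal sketch of that check; loadXmlString() is a hypothetical helper name used for illustration and is not part of PHP.

```php
<?php
// Hypothetical helper (not part of PHP): returns a DOMDocument, or null if
// the supplied string is not well-formed XML.
function loadXmlString(string $xmlString): ?DOMDocument {
  // Collect libxml errors internally instead of emitting PHP warnings,
  // remembering the previous setting so that it can be restored afterwards.
  $previous = libxml_use_internal_errors(true);

  $xml = new DOMDocument();
  $loaded = $xml->loadXML($xmlString, LIBXML_NOBLANKS);

  // Discard any collected parse errors and restore the previous setting.
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  return $loaded ? $xml : null;
}

// A well-formed document is loaded, a document with mismatched tags is not.
$good = loadXmlString('<company><company_name>example</company_name></company>');
$bad = loadXmlString('<company><company_name>example</company>');

var_dump($good instanceof DOMDocument);
var_dump($bad === null);
```

Returning null here is just one possible design; throwing an exception would work equally well if that suits your codebase better.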

With that step complete we can now validate the XML using the schemaValidate() method of the DOMDocument object. This method takes the schema filename as an argument and will return true if the XML document that we have loaded conforms to the supplied schema.

$xmlSchema = 'company.xsd';
if (!$xml->schemaValidate($xmlSchema)) {
  print 'XML file is invalid';
}

Alternatively, we can use the schemaValidateSource() method to load a schema definition from a string.

$xmlSchema = 'company.xsd';
// Read the schema (not the XML document) into a string.
$xmlSchemaString = file_get_contents($xmlSchema);
if (!$xml->schemaValidateSource($xmlSchemaString)) {
  print 'XML file is invalid';
}

These methods tell us if the XML document is valid against the schema, but we can find out more information about why a document is invalid.

Printing Schema Validation Messages

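If the document is invalid then we can find out why using the libxml_get_errors() function. For this to work we need to call libxml_use_internal_errors(true) before loading and validating the document; otherwise libxml reports problems as PHP warnings and libxml_get_errors() returns an empty array. The function returns an array of LibXMLError objects that we can use to print out information about each validation error using a custom function. Once we have printed out the errors we can clear the error buffer using the libxml_clear_errors() function.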

$errors = libxml_get_errors();
foreach ($errors as $error) {
  print libxml_render_error($error, $xml) . PHP_EOL;
}
libxml_clear_errors();
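
To see the whole flow in one place, here is a minimal, self-contained sketch that switches on internal error handling, validates an XML string against an XSD string, and collects the libxml messages. The validateXmlString() helper, the cut-down schema, and the company_phone element are assumptions made for illustration only.

```php
<?php
// Hypothetical helper: validates an XML string against an XSD string and
// returns the trimmed libxml messages. An empty array means the XML is valid.
function validateXmlString(string $xmlString, string $xsdString): array {
  // Errors are only collected if internal error handling is switched on.
  $previous = libxml_use_internal_errors(true);
  libxml_clear_errors();

  $messages = [];
  $xml = new DOMDocument();
  if (!$xml->loadXML($xmlString) || !$xml->schemaValidateSource($xsdString)) {
    foreach (libxml_get_errors() as $error) {
      $messages[] = trim($error->message);
    }
  }

  // Empty the error buffer and restore the previous setting.
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  return $messages;
}

// A cut-down version of the company schema, held in a string.
$xsd = <<<'XSD'
<?xml version="1.0" encoding="utf-8"?>
<xsd:schema elementFormDefault="qualified" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="company">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element type="xsd:string" name="company_name" />
        <xsd:element type="xsd:string" name="company_tel" />
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
XSD;

$valid = '<company><company_name>example</company_name><company_tel>0123456789</company_tel></company>';
$invalid = '<company><company_name>example</company_name><company_phone>0123456789</company_phone></company>';

$validMessages = validateXmlString($valid, $xsd);
$invalidMessages = validateXmlString($invalid, $xsd);

print_r($validMessages);
print_r($invalidMessages);
```

Restoring the previous libxml_use_internal_errors() setting keeps the helper from changing error handling for any other code that uses libxml.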

The libxml_render_error() function is a custom function that takes a LibXMLError object and the DOMDocument object and returns a string containing as much information as possible about the issue. Here is the function in full.

function libxml_render_error(LibXMLError $error, DOMDocument $domDocument) {
  // Extract a formatted representation of the XML file as an array of lines.
  $domDocument->formatOutput = true;
  $lines = explode("\n", $domDocument->saveXML());

  $return = '';

  // Print out the line that has the problem, along with one line of context
  // on either side. $error->line is one-based but $lines is zero-based, so the
  // line at index $error->line - 1 is the line that contains the error. The
  // number printed before each line is its zero-based index in $lines.
  if ($error->line >= 1 && isset($lines[$error->line])) {
    $return .= (isset($lines[$error->line - 2]) ? ($error->line - 2) . ':' . $lines[$error->line - 2] . PHP_EOL : '');
    $return .= (isset($lines[$error->line - 1]) ? ($error->line - 1) . ':' . $lines[$error->line - 1] . PHP_EOL : '');
    $return .= $error->line . ':' . $lines[$error->line] . PHP_EOL;
    // Try to put a pointer to where the error occurred.
    if ($error->column === 0) {
      $return .= str_pad('', strlen(trim($lines[$error->line - 1])) - 1, '-') . '^' . PHP_EOL;
    } else {
      $return .= str_pad('', $error->column, '-') . '^' . PHP_EOL;
    }
  }

  // Print the error level.
  switch ($error->level) {
    case LIBXML_ERR_WARNING:
      $return .= 'Warning ' . $error->code . ': ';
      break;
    case LIBXML_ERR_ERROR:
      $return .= 'Error ' . $error->code . ': ';
      break;
    case LIBXML_ERR_FATAL:
      $return .= 'Fatal Error ' . $error->code . ': ';
      break;
  }

  // Trim the error message.
  $return .= trim($error->message);

  if ($error->file) {
    // If the error has a file reference then print this out also.
    $return .= ' in ' . basename($error->file);
  }

  // Render the line and column of the error.
  $return .= ' Line: ' . $error->line . ' Column: ' . $error->column;

  return $return . PHP_EOL;
}

As an example of this in action, let's change company_address1 in our original XML file to be company_address, which is the kind of mistake that can easily creep into an XML file.

  <company_address>1 Testing Street</company_address>

Now, when we attempt to validate the XML document we will get the following output.

3:  <company_name>example</company_name>
4:  <company_address>1 Testing Street</company_address>
5:  <company_postcode>T35 7ER</company_postcode>
--------------------------------------------------^
Error 1871: Element 'company_address': This element is not expected. Expected is ( company_address1 ). in xml_validation Line: 5 Column: 0

If the XML schema validation has multiple errors then they will just be printed in sequence.

The LIBXML_SCHEMA_CREATE Flag

The schemaValidate() and schemaValidateSource() methods accept an optional second argument for flags. LIBXML_SCHEMA_CREATE is currently the only flag that these methods accept, and it can be used to inject default values into the DOM object during the validation step.

To get this working we need to alter the original XSD document to add a default attribute. Here we are setting the default attribute to be "0123" for the company_tel element.

<xsd:element type="xsd:string" name="company_tel" minOccurs="1" default="0123" />

Now we change the XML document slightly so that the company_tel element is still present, but the value is missing.

  <company_tel />

We then validate the document, which will pass, and then print out the formatted XML document if the XML is valid.

$xml = new DOMDocument();

$xmlDoc = 'company.xml';
$xmlSchema = 'company.xsd';

$xml->load($xmlDoc, LIBXML_NOBLANKS);
if ($xml->schemaValidate($xmlSchema, LIBXML_SCHEMA_CREATE)) {
  $xml->formatOutput = true;
  print $xml->saveXML() . PHP_EOL;
}

After running the validation we find that the default value of "0123" for the company_tel element has been added to the document.

<?xml version="1.0" encoding="utf-8"?>
<company>
  <company_email>[email protected]</company_email>
  <company_name>example</company_name>
  <company_address1>1 Testing Street</company_address1>
  <company_postcode>T35 7ER</company_postcode>
  <company_tel>0123</company_tel>
</company>

The value is also available in the DOMDocument object.

print $xml->getElementsByTagName('company_tel')->item(0)->nodeValue; // Prints "0123".

This technique is useful for enforcing default values in your XML documents without having to add custom logic to your codebase.
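
To make the effect of the flag easier to see, the following self-contained sketch validates the same document twice, once without the flag and once with it, and compares the resulting company_tel value. The cut-down schema and the companyTelAfterValidation() helper are assumptions made for illustration.

```php
<?php
// A cut-down version of the company schema with a default for company_tel.
$xsd = <<<'XSD'
<?xml version="1.0" encoding="utf-8"?>
<xsd:schema elementFormDefault="qualified" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="company">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element type="xsd:string" name="company_name" />
        <xsd:element type="xsd:string" name="company_tel" default="0123" />
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
XSD;

// The company_tel element is present but has no value.
$xmlString = '<company><company_name>example</company_name><company_tel /></company>';

// Hypothetical helper: validates the XML string with the given flags and
// returns the text content of the company_tel element afterwards.
function companyTelAfterValidation(string $xmlString, string $xsd, int $flags): string {
  $xml = new DOMDocument();
  $xml->loadXML($xmlString);
  $xml->schemaValidateSource($xsd, $flags);
  return $xml->getElementsByTagName('company_tel')->item(0)->nodeValue;
}

// Without the flag the document is validated but left untouched.
$withoutFlag = companyTelAfterValidation($xmlString, $xsd, 0);
// With the flag the default value is written into the DOM during validation.
$withFlag = companyTelAfterValidation($xmlString, $xsd, LIBXML_SCHEMA_CREATE);

var_dump($withoutFlag);
var_dump($withFlag);
```

Note that the flag changes the DOMDocument object in memory only; the original XML file or string is left unchanged unless you save the document again.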

Conclusion

If you are accepting XML files then it is a good idea to pass them through a validation step in order to ensure that the files are valid before making use of them. You can also use the LIBXML_SCHEMA_CREATE flag to inject default values into your XML without having to write custom logic.

Whilst XML has lost favor in recent years to formats like JSON and YAML, there are still many systems that make use of this format or even allow you to return data in this format. Having a validation handler to ensure that your XML is fully valid helps to protect your system against errors.

Currently, PHP only supports XSD version 1.0, because the underlying libxml2 library does not implement version 1.1. Version 1.1 of the XSD specification introduced features such as assertions and conditional type assignment, and whilst they are useful, they aren't always strictly needed to validate an XML document.
