I have already talked about converting a sitemap.xml file into a urllist.txt file, but what if you want to create an HTML sitemap? If you have a sitemap.xml file, you can use it to spider your site, scrape the contents of each page and populate the HTML file with this information.
The following code does this. For every page it looks for the title tag, the description meta tag and the first h2 tag on the page. These items are then used to construct a segment of HTML for that page.
<?php
$header = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>HTML Sitemap</title>
</head>
<body>';
set_time_limit(400);
$currentElement = '';
$currentLoc = '';
$map = "<h1>HTML Sitemap</h1>"."\n";
function parsePage($data)
{
global $map;
if ( stripos($data,".pdf") > 0 ) {
$map .= '<p><a href="'.$data.'">PDF document.</a></p>'."\n";
$map .= '<p>A pdf document.</p>'."\n";
} elseif ( stripos($data, ".txt")>0 ) {
$map .= '<p><a href="'.$data.'">Text document.</a></p>'."\n";
$map .= '<p>A text document.</p>'."\n";
} else {
if ( $urlh = @fopen($data, 'rb') ) {
$contents = '';
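// stream_get_contents() was added in PHP 5; test for it directly rather than comparing the version string with a number.
if ( function_exists('stream_get_contents') ) {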
$contents = stream_get_contents($urlh);
} else {
while ( !feof($urlh) ) {
$contents .= fread($urlh, 8192);
};
};
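// Case-insensitive, non-greedy match of the text inside the title tag.
// The original /U flag inverted the lazy quantifiers and made them greedy.
preg_match('/<title\b[^>]*>\s*(.*?)\s*<\/title>/is', $contents, $title);
$title = isset($title[1]) ? $title[1] : '';
// The first h2 on the page, with any nested markup stripped out.
$header = array();
preg_match('/<h2\b[^>]*>(.*?)<\/h2>/is', $contents, $header);
$header = isset($header[1]) ? strip_tags($header[1]) : '';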
// Ampersands in the URL must be written as &amp; inside an HTML attribute.
if ( strlen($title) > 0 && strlen($header) > 0 ) {
$map .= '<p class="link"><a href="'.str_replace('&','&amp;',$data).'" title="'.trim($header).'">'.trim($title).' - '.trim($header).'</a></p>'."\n";
} elseif ( strlen($title) > 0 ) {
$map .= '<p class="link"><a href="'.str_replace('&','&amp;',$data).'" title="'.trim($title).'">'.trim($title).'</a></p>'."\n";
} elseif ( strlen($header) > 0 ) {
$map .= '<p class="link"><a href="'.str_replace('&','&amp;',$data).'" title="'.trim($header).'">'.trim($header).'</a></p>'."\n";
};
// Content of the description meta tag, if the page has one.
preg_match('/<meta\s+name="description"\s+content="(.*?)"\s*\/?>/is', $contents, $description);
$description = isset($description[1]) ? $description[1] : '';
if ( strlen($description)>0 ) {
$map .= '<p class="desc">'.trim($description).'</p>'."\n";
};
fclose($urlh);
};
};
};
function startElement($xmlParser, $name, $attribs)
{
global $currentElement;
$currentElement = $name;
};
function endElement($parser, $name)
{
global $currentElement,$currentLoc;
if ( $currentElement == 'loc') {
// Trim any whitespace or line breaks that surround the URL inside the loc element.
parsePage(trim($currentLoc));
$currentLoc = '';
};
$currentElement = '';
};
function characterData($parser, $data)
{
global $currentElement,$currentLoc;
if ( $currentElement == 'loc' ) {
$currentLoc .= $data;
};
};
$xml_parser = xml_parser_create();
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler($xml_parser,"startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
if ( !($fp = fopen('sitemap.xml', "r")) ) {
die("could not open XML input");
};
while ( $data = fread($fp,4096) ) {
if ( !xml_parse($xml_parser, $data,feof($fp)) ) {
die(sprintf("XML error: %s at line %d",xml_error_string(xml_get_error_code($xml_parser)), xml_get_current_line_number($xml_parser)));
};
};
fclose($fp);
$footer = '</body>
</html>';
$fp = fopen('sitemap.html', "w+");
fwrite($fp,$header.$map.$footer);
fclose($fp);
echo $header.$map.$footer;
This script prints out the sitemap and also saves it to a file for later use. Saving the output is essential because the script has to fetch every page listed in sitemap.xml, so it can take a long time to run.
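Because the crawl is slow, one way to use the saved file is to check its age before crawling again. The following is only a minimal sketch of that idea: the function name `isCacheFresh`, its parameters and the one-day lifetime mentioned below are my own assumptions, not part of the script above.

```php
<?php
// Hypothetical helper (not part of the original script): returns true when
// $file exists and was modified less than $maxAge seconds before $now.
function isCacheFresh($file, $maxAge, $now = null)
{
    if ($now === null) {
        $now = time();
    }
    if (!is_file($file)) {
        return false;
    }
    $modified = filemtime($file);
    return $modified !== false && ($now - $modified) < $maxAge;
}
```

Calling `isCacheFresh('sitemap.html', 86400)` at the top of the script and, when it returns true, sending the saved file with `readfile('sitemap.html')` and exiting would skip the crawl for a day after each run.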
This script is fairly complicated and has gone through several versions since I first created it, so if you find any bugs or improvements, let me know and I will incorporate them.
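One possible improvement is to replace the regular expressions with PHP's DOM extension, which copes better with attributes, mixed case and line breaks inside tags. This is a sketch of that alternative, not the method the script above uses. It assumes the DOM extension is installed, and `extractPageInfo` is a name I have made up for illustration.

```php
<?php
// Hypothetical helper: pulls the title, the first h2 and the description
// meta tag out of an HTML string using DOMDocument instead of regexes.
function extractPageInfo($html)
{
    $info = array('title' => '', 'heading' => '', 'description' => '');
    if (trim($html) === '') {
        return $info;
    }

    $doc = new DOMDocument();
    // Real pages are often invalid HTML, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $info['title'] = trim($titles->item(0)->textContent);
    }

    // textContent drops nested tags, much like strip_tags() does.
    $headings = $doc->getElementsByTagName('h2');
    if ($headings->length > 0) {
        $info['heading'] = trim($headings->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $info['description'] = trim($meta->getAttribute('content'));
            break;
        }
    }

    return $info;
}
```

Inside parsePage() the three preg_match() calls could then be swapped for a single call such as `$info = extractPageInfo($contents);`, with the link and description paragraphs built from the returned array.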