Convert HTML To ASCII With PHP

The reverse of turning ASCII text into HTML is converting HTML back into ASCII. To that end, here is a small function that does it using regular expressions.

function html2ascii($s) {
 // convert links
 $s = preg_replace('/<a\s+[^>]*?href="?([^\">]*)"?[^>]*>(.*?)<\/a>/is','$2 ($1)',$s);
 
 // convert p, br and hr tags
 $s = preg_replace('@<(b|h)r[^>]*>@i',"\n",$s);
 $s = preg_replace('@<p[^>]*>@i',"\n\n",$s);
 $s = preg_replace('@<div[^>]*>(.*?)</div>@is',"\n".'$1'."\n",$s);
  
 // convert bold and italic tags
 $s = preg_replace('@<b\b[^>]*>(.*?)</b>@is','*$1*',$s);
 $s = preg_replace('@<strong\b[^>]*>(.*?)</strong>@is','*$1*',$s);
 $s = preg_replace('@<i\b[^>]*>(.*?)</i>@is','_$1_',$s);
 $s = preg_replace('@<em\b[^>]*>(.*?)</em>@is','_$1_',$s);
   
 // decode any entities
 $s = strtr($s,array_flip(get_html_translation_table(HTML_ENTITIES)));
 
 // decode numbered entities
 // (the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0; see the update below)
 $s = preg_replace('/&#(\d+);/e','chr(str_replace(";", "", str_replace("&#","","$0")))', $s);
 
 // strip any remaining HTML tags
 $s = strip_tags($s);
 
 // return the string
 return $s;
}

To use this function just pass it a string. Here is an example of it at work.

$htmlString = '<p>This is some <strong>XHTML</strong> markup that <em>will</em> be<br />
turned <a href="http://www.hashbangcode.com/" title="#! code">into</a> an ascii string.</p>';

echo html2ascii($htmlString);

This produces the following output (the two leading newlines that come from the opening p tag are not shown).

This is some *XHTML* markup that _will_ be

turned into (http://www.hashbangcode.com/) an ascii string.

Update:

It turns out that the 'e' modifier in preg_replace() is no longer valid: it was deprecated in PHP 5.5 and removed in PHP 7.0. You need to use preg_replace_callback() instead, which takes a callback (such as an anonymous function) as its second argument rather than a string of code to evaluate.

I've updated the function a little here:

function html2ascii($s) {
  // convert links
  $s = preg_replace('/<a\s+[^>]*?href="?([^\">]*)"?[^>]*>(.*?)<\/a>/is', '$2 ($1)', $s);

  // convert p, br and hr tags
  $s = preg_replace('@<(b|h)r[^>]*>@i', "\n", $s);
  $s = preg_replace('@<p\b[^>]*>@i', "\n\n", $s);
  $s = preg_replace('@<div[^>]*>(.*?)</div>@is', "\n" . '$1' . "\n", $s);

  // convert bold and italic tags
  $s = preg_replace('@<b\b[^>]*>(.*?)</b>@is', '*$1*', $s);
  $s = preg_replace('@<strong\b[^>]*>(.*?)</strong>@is', '*$1*', $s);
  $s = preg_replace('@<i\b[^>]*>(.*?)</i>@is', '_$1_', $s);
  $s = preg_replace('@<em\b[^>]*>(.*?)</em>@is', '_$1_', $s);

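  // strip any remaining HTML tags first, so that decoded entities such as
  // &lt;b&gt; are not mistaken for real tags and removed
  $s = strip_tags($s);

  // decode named entities
  $s = strtr($s, array_flip(get_html_translation_table(HTML_ENTITIES)));

  // decode numbered (decimal) entities; chr() returns a single byte, so this
  // is only meaningful for code points in the ASCII range
  $s = preg_replace_callback('/&#(\d+);/', function ($matches) {
    return chr((int) $matches[1]);
  }, $s);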

  // return the string
  return $s;
}
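
As an aside on the two entity-decoding steps in the function above: PHP's built-in html_entity_decode() handles named, decimal and hexadecimal references in a single call. Here is a minimal sketch of that approach, assuming UTF-8 output is acceptable and that non-breaking spaces should become ordinary spaces. The helper name decode_entities_sketch is made up for illustration and is not part of the original function.

```php
<?php
// Hypothetical helper (not from the original post).
function decode_entities_sketch(string $s): string
{
    // Decode named (&amp;), decimal (&#65;) and hexadecimal (&#x42;)
    // character references in one pass, producing UTF-8.
    $s = html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // &nbsp; decodes to U+00A0, which is the byte pair C2 A0 in UTF-8.
    // Assumption: for plain-text output we want an ordinary space instead.
    return str_replace("\u{00A0}", ' ', $s);
}

echo decode_entities_sketch('&#65;&#x42; &amp;&nbsp;&quot;x&quot;'), "\n";
```

Replacing the whole two-byte sequence at once avoids stripping stray C2 bytes on their own, which could damage other UTF-8 characters that happen to start with the same byte.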

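However, I really wouldn't use regular expressions to parse HTML. Patterns like these break on nested tags, unusual attribute quoting and malformed markup, so a real HTML parser is the more robust choice.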
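
To illustrate the parser-based alternative, here is a rough sketch that uses PHP's DOMDocument to walk the parsed tree and apply the same text conventions as the function above. The function names html2ascii_dom_sketch and render_node_sketch are hypothetical, the input is assumed to be UTF-8, and the result is trimmed, so leading and trailing newlines are dropped. Treat it as a starting point rather than a finished converter.

```php
<?php
// Hypothetical helpers for illustration; these names are not from the original post.

// Recursively render one DOM node as plain text.
function render_node_sketch(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // The parser has already decoded entities in text nodes.
        return $node->data;
    }
    if (!($node instanceof DOMElement)) {
        return ''; // skip comments and other non-element nodes
    }

    // Render the children first so nested markup is handled naturally.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= render_node_sketch($child);
    }

    switch (strtolower($node->nodeName)) {
        case 'br':
        case 'hr':
            return "\n";
        case 'p':
            return "\n\n" . $inner;
        case 'div':
            return "\n" . $inner . "\n";
        case 'b':
        case 'strong':
            return '*' . $inner . '*';
        case 'i':
        case 'em':
            return '_' . $inner . '_';
        case 'a':
            $href = $node->getAttribute('href');
            return $href === '' ? $inner : $inner . ' (' . $href . ')';
        case 'script':
        case 'style':
            return ''; // drop non-visible content
        default:
            return $inner; // unknown tags: keep only their text
    }
}

function html2ascii_dom_sketch(string $html): string
{
    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Assumption: the input is UTF-8; the meta tag tells the parser so.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : trim(render_node_sketch($body));
}

echo html2ascii_dom_sketch('<p>Some <strong>bold</strong> text with a <a href="http://example.com/">link</a> &amp; more.</p>'), "\n";
```

Because the tree walk renders children before their parent, nested tags and attributes in any order need no special handling, which is where the regular expression version tends to go wrong.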

Comments

I got an error on line 19 --> $s = preg_replace('//e','chr(\\1)',$s); Warning: Wrong parameter count for chr() in C:\PHP-test\xxxxx.php (??) : regexp code on line 1
You are quite right, that would never work! I have updated the script with the fix.
Philip Norton

preg_replace_callback(): Requires argument 2, 'chr(str_replace(";", "", str_replace("&#","","$0")))', to be a valid callback

Thanks for the info Tradesouthwest. I have updated the post with some changes.

Philip Norton

Really helpful piece of code -- thanks!  I had to add some replacements for my specific case, but the rest worked well (after all these years).

// LER: begin
// undo my prior replace space and tag chars
$s = str_replace('&#x25c3;', '<', $s);
$s = str_replace('&#x25b9;', '>', $s);
$s = str_replace('&nbsp;', ' ', $s);
// convert my line numbers, after above, * or **\d{1,3} to \n
$s = preg_replace('@\*{1,3}\d{1,4}:\s@',"\n", $s);
// strtr above is changing some spaces or tabs to hex a0 and c2 chars, undo it
$s = preg_replace('@\xa0@', ' ', $s);
$s = preg_replace('@\xc2@', '', $s);
// my tabs, but above replacements resulted in 3 not 4 spaces per tab
$s = str_replace('   ', "\t", $s);
$s = str_replace(" \n", "\n", $s); // ok; preg_replace('@\s$@', "\n", $s) FAILED!
// LER: end

Thanks for the info Lawrence, glad you found it useful!

Philip Norton