Extract Links From An HTML File With PHP

Use the following function to extract all of the links from an HTML string.

function linkExtractor($html)
{
    $linkArray = array();
    // Match every anchor tag, capturing the href value and the link text.
    if (preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i', $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            // $match[1] is the link location, $match[2] is the link text.
            array_push($linkArray, array($match[1], $match[2]));
        }
    }
    return $linkArray;
}

To use it, read a web page or file into a string and pass that string to the function. The following example reads a web page using the PHP cURL functions and then passes the result to linkExtractor() to retrieve the links.

$url = 'http://www.hashbangcode.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
$html = curl_exec($ch);
curl_close($ch);
// Pass the downloaded HTML to linkExtractor() and print the resulting array.
echo '<pre>' . print_r(linkExtractor($html), true) . '</pre>';
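
If you don't need the extra control that cURL provides, the same string can be fetched with file_get_contents(), assuming that allow_url_fopen is enabled in your PHP configuration. A minimal sketch:

$html = file_get_contents('http://www.hashbangcode.com');
echo '<pre>' . print_r(linkExtractor($html), true) . '</pre>';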

The function returns an array, with each element being an array that contains the link location and the text of the link.
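
For example, a page containing the links <a href="/about">About us</a> and <a href="http://www.example.com/">Example</a> would produce output along these lines (hypothetical page content, so your own results will differ):

Array
(
    [0] => Array
        (
            [0] => /about
            [1] => About us
        )

    [1] => Array
        (
            [0] => http://www.example.com/
            [1] => Example
        )

)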

Comments

Cool Scripts.
Hi Philip, I have a similar script:

PRINT("<ul>\n");
WHILE (!FEOF($page)) {
    $line = FGETS($page, 255);
    WHILE (EREGI("HREF=\"[^\"]*\"", $line, $match)) {
        PRINT("<li>");
        PRINT($match[0]);
        PRINT("</li>\n");
        $replace = EREG_REPLACE("\?", "\?", $match[0]);
        $line = EREG_REPLACE($replace, "", $line);
    }
}
PRINT("</ul>\n");
FCLOSE($page);
?>
How do I get only the links with a .zip extension, without the href=" at the start and the closing " at the end of each line?
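
One possible way to do that with the linkExtractor() function above is to filter the returned array on the file extension; the href=" prefix and the trailing " are not a problem there, because the function only captures the URL itself. A rough, untested sketch:

$links = linkExtractor($html);
$zipLinks = array();
foreach ($links as $link) {
    // $link[0] holds the URL; keep it only if it ends in .zip.
    if (preg_match('/\.zip$/i', $link[0])) {
        $zipLinks[] = $link[0];
    }
}
print_r($zipLinks);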
