Extract Keywords From A Text String With PHP

29th July 2009 - 5 minutes read time

A common issue I have come across in the past is that I have a CMS, or an old copy of WordPress, and I need to create a set of keywords to be used in the meta keywords field. To solve this I put together a simple function that runs through a string and returns the most commonly used words in it as an array. The number of words returned is currently set to 10, but you can change that quite easily.

The first thing the function defines is a list of "stop" words. These are words that occur so often in English text that they would otherwise dominate the results. The function then uses a variant of a slug function to strip out anything that isn't a letter, number, space or dash, and it ignores any word of three characters or fewer.

function extractCommonWords($string){
      // common words that should never be returned as keywords
      $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','www');
   
      $string = preg_replace('/\s+/', ' ', $string); // collapse runs of whitespace into a single space so words are not merged
      $string = trim($string); // trim the string
      $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too
      $string = strtolower($string); // make it lowercase
   
      preg_match_all('/[a-z0-9]+/', $string, $matchWords); // pick out each run of letters and digits as a word
      $matchWords = $matchWords[0];
      
      // drop stop words and anything of three characters or fewer
      foreach ( $matchWords as $key=>$item ) {
          if ( in_array($item, $stopWords) || strlen($item) <= 3 ) {
              unset($matchWords[$key]);
          }
      }   
      // count how many times each remaining word occurs
      $wordCountArr = array();
      foreach ( $matchWords as $val ) {
          if ( isset($wordCountArr[$val]) ) {
              $wordCountArr[$val]++;
          } else {
              $wordCountArr[$val] = 1;
          }
      }
      arsort($wordCountArr); // most frequent words first
      $wordCountArr = array_slice($wordCountArr, 0, 10, true); // keep the top 10; true preserves numeric words such as "2009" as keys
      return $wordCountArr;
}

The function returns the 10 most commonly occurring words as an array, with the word as the key and the number of times it occurs as the value. To get just the words, use the implode() function in conjunction with the array_keys() function. To change the number of words returned, alter the third parameter of the array_slice() call near the return statement, which is currently set to 10. Here is an example of the function in action.

$text = "This is some text. This is some text. Vending Machines are great.";
$words = extractCommonWords($text);
echo implode(',', array_keys($words));

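This produces the following output. Words with the same count may appear in a different order on PHP versions before 8.0, where sorting was not stable.

some,text,vending,machines,great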
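
Since the original aim was to fill the meta keywords field, the last step might be sketched as follows. The helper name buildKeywordsMetaTag() and its $limit parameter are assumptions made for illustration and are not part of the function above; it accepts any word => count array of the shape extractCommonWords() returns.

```php
<?php
// Hypothetical helper (not part of the original function): turns a
// word => count array into a meta keywords tag.
function buildKeywordsMetaTag(array $wordCounts, $limit = 10) {
    // Sort by count, highest first, then keep only the first $limit entries.
    // The fourth argument preserves keys, so numeric words are not renumbered.
    arsort($wordCounts);
    $top = array_slice($wordCounts, 0, $limit, true);

    // The keys are the words themselves.
    $keywords = implode(',', array_keys($top));

    // Escape the value before placing it inside an HTML attribute.
    return '<meta name="keywords" content="'
        . htmlspecialchars($keywords, ENT_QUOTES, 'UTF-8')
        . '" />';
}

// Example input in the shape returned by extractCommonWords().
echo buildKeywordsMetaTag(array('text' => 3, 'some' => 2, 'vending' => 1), 2);
// prints <meta name="keywords" content="text,some" />
```

Passing the limit as an argument avoids editing the array_slice() call each time a different number of keywords is needed.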

Update

After lots of versions of this code were submitted by users, I think the most reliable version is this one from Cenk, which also copes with multibyte (UTF-8) text.

function extractKeyWords($string) {
  mb_internal_encoding('UTF-8');
  $stopwords = array(); // add any words that should be ignored here
  // lowercase, collapse runs of whitespace into single spaces, then strip punctuation
  $string = preg_replace('/[\pP]/u', '', trim(preg_replace('/\s+/u', ' ', mb_strtolower($string))));
  // drop empty items, stop words, words of two characters or fewer, and numbers
  $matchWords = array_filter(explode(' ', $string), function ($item) use ($stopwords) {
    return !($item == '' || in_array($item, $stopwords) || mb_strlen($item) <= 2 || is_numeric($item));
  });
  $wordCountArr = array_count_values($matchWords);
  arsort($wordCountArr);
  return array_keys(array_slice($wordCountArr, 0, 10));
}

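This will produce the following sort of output. The order of words with equal counts depends on the PHP version (sorting only became stable in PHP 8.0), so ties may appear in a different order from the one shown.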

print implode(',', extractKeyWords("This is some text. This is some text. Vending Machines are great."));
// prints "this,text,some,great,are,vending,machines"

print implode(',', extractKeyWords('হো সয়না সয়না সয়না ওগো সয়না এত জ্বালা সয়না ঘরেতে আমার এ মন রয়না কেন রয়না রয়না'));
// prints "সয়না,রয়না,কেন,আমার,ঘরেতে,ওগো,জ্বালা"

If you use this then you should be aware that it requires PHP 5.3+ due to the use of a closure, along with the mbstring extension for the mb_* functions, but then that shouldn't be a problem :)
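
Note that $stopwords is left empty in the version above, which is why "this" shows up in the first result. One way to make that configurable is sketched below. The function name extractKeyWordsWith() and its $stopwords and $limit parameters are assumptions for illustration rather than part of Cenk's code, and the sketch assumes the mbstring extension is available.

```php
<?php
// Hypothetical variant of the function above: the stop word list and the
// number of keywords are passed in instead of being hard-coded.
function extractKeyWordsWith($string, array $stopwords = array(), $limit = 10) {
    // Lowercase, strip Unicode punctuation, then split on any run of whitespace.
    $string = preg_replace('/\pP/u', '', mb_strtolower($string, 'UTF-8'));
    $words = preg_split('/\s+/u', $string, -1, PREG_SPLIT_NO_EMPTY);

    // Drop stop words, words of two characters or fewer, and plain numbers.
    $words = array_filter($words, function ($item) use ($stopwords) {
        return !(in_array($item, $stopwords, true)
            || mb_strlen($item, 'UTF-8') <= 2
            || is_numeric($item));
    });

    // Count occurrences and sort with the most frequent words first.
    $counts = array_count_values($words);
    arsort($counts);

    // Preserve keys while slicing, then return the words as strings.
    return array_map('strval', array_keys(array_slice($counts, 0, $limit, true)));
}

print implode(',', extractKeyWordsWith(
    'Vending machines are great. Vending machines are everywhere. Vending is big business.',
    array('are'),
    2
));
// prints "vending,machines"
```

Stop words should be supplied in lowercase, since the text is lowercased before the comparison is made.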

Comments

Hello, can anyone help me with a problem? I have a variable containing an HTML fragment like this:

$str='<a href="http://d-grund.dk/" id="home" target="_blank" title="D-Grund.DK"><img alt="D-Grund.DK" border="1" height="90" id="home" src="http://d_grund.dk/images/homelogo.gif" width="166" /></a>';

I want to split it up so I can change the values and then put it back together to use with echo $str; Sorry for my bad English. I have searched the entire net and found nothing that does what I need, so I hope someone can help me. My own solution is not good because it is big and time consuming. I have found some code on the internet that looks like what I need, but I can't put the string back together without breaking something. Sincerely, LAT, D-Grund.dk Administrator. My email is [email protected]

Submitted by Anonymous on Sun, 06/20/2010 - 12:29
