Extract Keywords From A Text String With PHP

29th July 2009

A common issue I have come across is that I have a CMS, or an old copy of WordPress, and I need to create a set of keywords for the meta keywords field. To solve this I put together a simple function that runs through a string and returns the most commonly used words in it as an array. It currently returns the top 10 words, but you can change that quite easily.

The first thing the function defines is a list of "stop" words. These are words that occur so often in English text that they would otherwise crowd out the meaningful words in the result. The function also uses a variant of the slug function to remove any odd characters that might be in the text.

  1. function extractCommonWords($string){
  2. $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
  3.  
  4. $string = preg_replace('/\s\s+/i', ' ', $string); // collapse runs of whitespace into a single space
  5. $string = trim($string); // trim the string
  6. $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
  7. $string = strtolower($string); // make it lowercase
  8.  
  9. preg_match_all('/\b.*?\b/i', $string, $matchWords);
  10. $matchWords = $matchWords[0];
  11.  
  12. foreach ( $matchWords as $key=>$item ) {
  13. if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
  14. unset($matchWords[$key]);
  15. }
  16. }
  17. $wordCountArr = array();
  18. if ( is_array($matchWords) ) {
  19. foreach ( $matchWords as $key => $val ) {
  20. $val = strtolower($val);
  21. if ( isset($wordCountArr[$val]) ) {
  22. $wordCountArr[$val]++;
  23. } else {
  24. $wordCountArr[$val] = 1;
  25. }
  26. }
  27. }
  28. arsort($wordCountArr);
  29. $wordCountArr = array_slice($wordCountArr, 0, 10);
  30. return $wordCountArr;
  31. }

The function returns the 10 most commonly occurring words as an array, with the word as the key and the number of times it occurs as the value. To get just the words, use implode() together with array_keys(). To change the number of words returned, alter the third parameter of the array_slice() call near the return statement, which is currently set to 10. Here is an example of the function in action.

  1. $text = "This is some text. This is some text. Vending Machines are great.";
  2. $words = extractCommonWords($text);
  3. echo implode(',', array_keys($words));

This produces the following output.

some,text,machines,vending
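
Because the function returns the words as keys and their counts as values, you can also print the counts alongside the keywords. The following is only a rough sketch: the array is written out by hand to stand in for the return value of extractCommonWords(), and formatKeywordCounts() is a name I have made up for illustration.

```php
<?php
// Hypothetical helper (not part of the original function): turns an array of
// word => count pairs into a string such as "some (2), text (2)".
function formatKeywordCounts(array $wordCounts) {
    $parts = array();
    foreach ($wordCounts as $word => $count) {
        $parts[] = $word . ' (' . $count . ')';
    }
    return implode(', ', $parts);
}

// Stand-in for the array returned by extractCommonWords().
$words = array('some' => 2, 'text' => 2, 'machines' => 1, 'vending' => 1);

echo formatKeywordCounts($words);
// prints "some (2), text (2), machines (1), vending (1)"
```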

Update

After lots of versions of this code submitted by users I think the most reliable version is this one from Cenk.

  1. function extractKeyWords($string) {
  2. $stopwords = array();
  3. $string = preg_replace('/[\pP]/u', '', trim(preg_replace('/\s\s+/u', ' ', mb_strtolower($string))));
  4. $matchWords = array_filter(explode(' ',$string) , function ($item) use ($stopwords) { return !($item == '' || in_array($item, $stopwords) || mb_strlen($item) <= 2 || is_numeric($item));});
  5. $wordCountArr = array_count_values($matchWords);
  6. arsort($wordCountArr);
  7. return array_keys(array_slice($wordCountArr, 0, 10));
  8. }

This will produce the following sort of output.

  1. print implode(',', extractKeyWords("This is some text. This is some text. Vending Machines are great."));
  2. // prints "this,text,some,great,are,vending,machines"
  3.  
  4. print implode(',', extractKeyWords('হো সয়না সয়না সয়না ওগো সয়না এত জ্বালা সয়না ঘরেতে আমার এ মন রয়না কেন রয়না রয়না'));
  5. // prints "সয়না,রয়না,কেন,আমার,ঘরেতে,ওগো,জ্বালা"

If you use this then you should be aware that it requires PHP 5.3 or later due to the use of a closure, as well as the mbstring extension for the mb_* functions, but then that shouldn't be a problem :)
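
If you want to control the stop words and the number of keywords from the calling code, something along these lines might work. This is a sketch rather than a tested replacement: extractKeyWordsWithOptions(), $stopwords and $limit are names I have invented here, and it assumes UTF-8 input and the mbstring extension.

```php
<?php
// Sketch of a configurable variant. The function name and both optional
// parameters are assumptions made for this example, not part of Cenk's code.
function extractKeyWordsWithOptions($string, array $stopwords = array(), $limit = 10) {
    // Lowercase first, then strip Unicode punctuation (the "u" flag treats
    // the pattern and the subject as UTF-8).
    $string = preg_replace('/\pP/u', '', mb_strtolower($string, 'UTF-8'));

    // Split on any run of whitespace so repeated spaces never join two words.
    $words = preg_split('/\s+/u', $string, -1, PREG_SPLIT_NO_EMPTY);

    // Drop stop words, very short words and plain numbers.
    $words = array_filter($words, function ($item) use ($stopwords) {
        return !(in_array($item, $stopwords) || mb_strlen($item, 'UTF-8') <= 2 || is_numeric($item));
    });

    $wordCountArr = array_count_values($words);
    arsort($wordCountArr);

    // array_keys() returns the words themselves, most frequent first.
    return array_keys(array_slice($wordCountArr, 0, $limit, true));
}

$text = "This is some text. This is some text. This text mentions vending machines.";
print implode(',', extractKeyWordsWithOptions($text, array('this', 'some'), 1));
// prints "text"
```

Passing the stop words in as an argument means each site or language can supply its own list without editing the function.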

Comments

Hello, can anyone help me with a problem? I have a variable containing an HTML command like this: $str=""; and I want to split it up so I can change the values and assemble it again to work with echo $str; Sorry for my bad English. I have searched the entire net and nothing is what I need, so I hope someone can help me. My solution is not good because it gets big and time consuming. I have found some code on the internet that looks like what I need, but I can't assemble it again without destroying anything. Sincerely, LAT, D-Grund.dk Administrator

Submitted by Anonymous on Sun, 06/20/2010 - 12:29

I would love to help but I'm not sure what you are trying to do. Passing this string through the extractCommonWords() function probably won't produce any meaningful output as the string is just HTML containing a link and an image.

Submitted by philipnorton42 on Sun, 06/20/2010 - 17:28

  1. function extractCommonWords($string){
  2. $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
  3.  
  4. $string = preg_replace('/ss+/i', '', $string);
  5. $string = trim($string); // trim the string
  6. $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
  7. $string = strtolower($string); // make it lowercase
  8.  
  9. preg_match_all('/\b.*?\b/i', $string, $matchWords);
  10. $matchWords = $matchWords[0];
  11.  
  12. foreach ( $matchWords as $key=>$item ) {
  13. if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
  14. unset($matchWords[$key]);
  15. }
  16. }
  17. $wordCountArr = array();
  18. if ( is_array($matchWords) ) {
  19. foreach ( $matchWords as $key => $val ) {
  20. $val = strtolower($val);
  21. if ( isset($wordCountArr[$val]) ) {
  22. $wordCountArr[$val]++;
  23. } else {
  24. $wordCountArr[$val] = 1;
  25. }
  26. }
  27. }
  28. arsort($wordCountArr);
  29. $wordCountArr = array_slice($wordCountArr, 0, 10);
  30. return $wordCountArr;
  31. }
  32. }
  33.  
  34. $text = "This is some text. This is some text. Vending Machines are great.";
  35. $words = extractCommonWords($text);
  36. echo implode(',', array_keys($words));

Submitted by Jaco on Fri, 07/02/2010 - 12:21

Thanks for the input, after running the script it was clear that it wouldn't work as there was one too many curly braces in the function. I have removed this in the example so it will run now.

Submitted by philipnorton42 on Fri, 07/02/2010 - 14:15

Do you maybe know what the problem would be when I get my keywords with your script? I always get the letter "p" added right before the first keyword and right after the last keyword, so the keyword list looks like this: pkeyword1, keyword2 ... keyword10p. I don't know what causes this. Thanks

Submitted by Dusan on Tue, 07/20/2010 - 00:49

Are you passing it a HTML string with p tags at the start and end? Try using strip_tags() first.

Submitted by philipnorton42 on Tue, 07/20/2010 - 08:51

Your approach of stripping punctuation was a heck of a lot better than my method. I went the opposite approach and used a stock snippet of code with several preg_replace patterns, except for WordPress curly smart quotes. This is much more elegant and simple. Thanks for that, and here is how I would count word frequency: $wordCountArray = array_count_values( $matchWords ); No need of a foreach loop. The strtolower() call inside both foreach loops is really not needed either since you have already converted $string to lowercase beforehand.
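
To illustrate the point about array_count_values(), here is a small sketch comparing it with the foreach loop from the article. The sample $matchWords array is made up for the example and is assumed to be lowercased already.

```php
<?php
// Made-up sample data standing in for the filtered, lowercased $matchWords array.
$matchWords = array('some', 'text', 'some', 'text', 'vending', 'machines');

// Counting with the foreach loop, as in the original function.
$loopCounts = array();
foreach ($matchWords as $val) {
    if (isset($loopCounts[$val])) {
        $loopCounts[$val]++;
    } else {
        $loopCounts[$val] = 1;
    }
}

// Counting with the built-in function instead.
$builtinCounts = array_count_values($matchWords);

var_dump($loopCounts === $builtinCounts);
// prints "bool(true)"
```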

Submitted by Ronnie on Fri, 09/03/2010 - 18:00

Hey I’ve found your code very useful. Thanks a lot! However I’m finding an issue and that is that I want to divide the number of times a word repeats by the total number of words. The problem I have is that I cannot access each item of the array since the keys change every time I change the text. I was wondering if there is a way around this or if a new dimension can be added to the array so that things like this will work: echo $wordCountArr[2]/$totalwords; Any help will be much appreciated! Thanks in advance! :)

Submitted by Guillermo on Wed, 01/19/2011 - 04:17

By the way, I've also tried this with no luck... $numrepeats = print_r (array_values($words), true); for ($h=0; $h"; }

Submitted by Guillermo on Wed, 01/19/2011 - 04:23

@Guillermo I think to get the effect you are looking for you'll need to modify the function in the following way:
  1. function extractCommonWords($string){
  2. $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
  3.  
  4. $string = preg_replace('/\s\s+/i', ' ', $string); // collapse runs of whitespace into a single space
  5. $string = trim($string); // trim the string
  6. $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
  7. $string = strtolower($string); // make it lowercase
  8.  
  9. preg_match_all('/\b.*?\b/i', $string, $matchWords);
  10. $matchWords = $matchWords[0];
  11.  
  12. $totalWords = count(preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY)); // total number of words in the cleaned string
  13.  
  14. foreach ( $matchWords as $key=>$item ) {
  15. if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
  16. unset($matchWords[$key]);
  17. }
  18. }
  19. $wordCountArr = array();
  20. if ( is_array($matchWords) ) {
  21. foreach ( $matchWords as $key => $val ) {
  22. $val = strtolower($val);
  23. if ( !isset($wordCountArr[$val]) ) {
  24. $wordCountArr[$val] = array();
  25. }
  26. if ( isset($wordCountArr[$val]['count']) ) {
  27. $wordCountArr[$val]['count']++;
  28. } else {
  29. $wordCountArr[$val]['count'] = 1;
  30. }
  31. }
  32. arsort($wordCountArr);
  33. $wordCountArr = array_slice($wordCountArr, 0, 10, true);
  34. foreach ( $wordCountArr as $key => $val) {
  35. $wordCountArr[$key]['bytotal'] = $wordCountArr[$key]['count'] / $totalWords;
  36. }
  37. }
  38. return $wordCountArr;
  39. }
This is untested code, but it should get you what you need. :)

Submitted by philipnorton42 on Wed, 01/19/2011 - 10:18

Hey thanks for your super-fast response! I didn't quite get it to work (not saying the code is not right, I just didn’t figure out how to implement it). But I’ve found a way around the issue. Thank you very much!

Submitted by Guillermo on Wed, 01/19/2011 - 22:06

Excellent bit of code. I was looking for something to auto-suggest keywords for a content management system. I made the following changes:

1) Removed the strtolower call in the foreach loop because the string has already been forced to lower case so all the items will be lower case anyway.

  1. if ( $item == '' || in_array($item, $stopWords) || strlen($item) <= 3 ) {

2) Added a $count parameter to the function so that when it is called you can specify how many words to return.

  1. function extractCommonWords($string,$count){
  2.  
  3. ...
  4.  
  5. $wordCountArr = array_slice($wordCountArr, 0, $count);
  6. return $wordCountArr;
  7. }

The hardest thing to get right is the stop word list.

Submitted by Matt Hawkins on Thu, 01/27/2011 - 10:39

Brilliant. Thanks for this piece of code. I was after doing something very similar so have used this as a base and made a few tweaks here and there so it fits my needs. 

Submitted by Brit on Wed, 03/23/2011 - 10:52

Hi

Great code and is working a treat on my site but I have a problem with words containing 'ss'.  The double ss is being removed so a word such as 'password' is being seen as 'paword' in the keywords.

Regards

Submitted by Mkj on Sat, 05/21/2011 - 09:37

Hi, would you please let me know how I could include Spanish characters too?

For example I need to print 'tamaño' but instead I get 'tamao'.

Thanks in advance
Tolenca

 

Submitted by Tolenca on Wed, 06/22/2011 - 19:46

You need to add those special characters to the regular expression on line 6 of the examples above. You just need to make sure it takes alphanumeric characters as well as the spanish letters. I think that should work.
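
As a sketch of that idea, the character class can use the Unicode letter and number properties instead of listing each accented letter. This assumes the text is UTF-8 encoded, and the sample string is just for illustration.

```php
<?php
// Assumes UTF-8 input. \p{L} matches any letter and \p{N} any number, and the
// "u" flag makes PCRE treat the pattern and the subject as UTF-8.
$string = 'El tamaño del año: ¡100% más!';

// Keep letters, numbers, spaces and dashes; drop everything else.
$string = preg_replace('/[^\p{L}\p{N} -]/u', '', $string);

echo $string;
// prints "El tamaño del año 100 más"
```

Note that strtolower() does not reliably handle multibyte characters, so mb_strtolower() is the safer choice for the lowercase step when the text contains letters like these.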

Submitted by philipnorton42 on Wed, 06/22/2011 - 22:20

instead of

  1. $wordCountArr = array();
  2. if ( is_array($matchWords) ) {
  3. foreach ( $matchWords as $key => $val ) {
  4. $val = strtolower($val);
  5. if ( !isset($wordCountArr[$val]) ) {
  6. $wordCountArr[$val] = array();
  7. }
  8. if ( isset($wordCountArr[$val]['count']) ) {
  9. $wordCountArr[$val]['count']++;
  10. } else {
  11. $wordCountArr[$val]['count'] = 1;
  12. }
  13. }

I did:

  1. $ignoreOccur = array(1,2);
  2. $wordCountArr = array_diff(array_count_values($matchWords), $ignoreOccur);

To get an assoc array $wordCountArr of [word] => [occurrences], ignoring words that have occurred 1 or 2 (or a specified number of) times.

Submitted by Joe Bowman on Fri, 09/02/2011 - 14:09

Interesting take on it. I like it! :)

Submitted by philipnorton42 on Fri, 09/02/2011 - 14:11

I changed line 13 to not include 'keywords' that are just numbers using the is_numeric() function

if ( $item == '' || in_array($item, $stopWords) || strlen($item) <= 3 || is_numeric($item) ) {

Submitted by Gerhard Racter on Thu, 10/27/2011 - 15:49

Thanks, this is quite useful!

Submitted by HSMoore on Tue, 11/08/2011 - 09:43

Make sure this line:

  1. $string = preg_replace('/ss+/i', '', $string);

reads like this:

  1. $string = preg_replace('/\s\s+/i', '', $string);

Submitted by Anonymous on Thu, 05/03/2012 - 15:23

This script just returns a string of comma separated keywords and supports multibyte characters.
Adjust your stopwords for your locale.

  1. function extractCommonWords($string)
  2. {
  3. $stopwords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
  4. $string = mb_strtolower($string); // make it lowercase
  5. $string = trim(preg_replace('/\s\s+/i', '', $string;)); //remove multiple whitespace
  6. $string = preg_replace('/[\pP]/', '', $string); // remove punctuation
  7. $matchWords = array_filter(explode(" ",$string) , function ($item) use ($stopwords) { return !($item == '' || in_array($item, $stopwords) || mb_strlen($item) < 2 || is_numeric($item));});
  8. $wordCountArr = arsort(array_count_values($matchWords));
  9. return implode(',', array_keys(array_slice($wordCountArr, 0, 10)));
  10. }

 

Submitted by wookiester on Fri, 05/04/2012 - 10:53

Whoops.. Ignore my previous code block.. it was bugged.. try this.

  1. function extractKeyWords($string)
  2. {
  3. $stopwords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
  4. $string = preg_replace('/[\pP]/', '', trim(preg_replace('/\s\s+/i', '', mb_strtolower(utf8_encode($string)))));
  5. $matchWords = array_filter(explode(' ',$string) , function ($item) use ($stopwords) { return !($item == '' || in_array($item, $stopwords) || mb_strlen($item) < 2 || is_numeric($item));});
  6. $wordCountArr = array_count_values($matchWords);
  7. arsort($wordCountArr);
  8. return implode(',', array_keys(array_slice($wordCountArr, 0, 10)));
  9. }

Submitted by wookiester on Fri, 05/04/2012 - 11:11

Please help me: how do I use this for Arabic text such as "اب ت ث "? Thank you

Submitted by Mohamed saad on Thu, 05/17/2012 - 15:51

I'm afraid I can't answer that as I don't know enough about the Arabic language/alphabet. Sorry :(

Submitted by philipnorton42 on Thu, 05/17/2012 - 15:55

Heya. Great script thanks :) Just a note for the guys who want to handle non-anglo characters, if they replace: $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); with this: $string = preg_replace('/[^\w\d -]/', '', $string); It should support extended characters, not sure off the top of my head how much unicode is covered by \w (if any), but it works for a bunch of quick tests I've put through it.

Submitted by Bob Davies on Tue, 08/14/2012 - 07:14

To reduce runs of whitespace within the string, it's better to use preg_replace('/[ ]{2,}/sm', ' ', $text)

Submitted by summsel on Wed, 11/07/2012 - 16:57

@summsel - Seems like a good way of doing things, but why is it better? Is there a performance overhead using my method? Also you are only detecting the space character, not all whitespace characters, which might cause problems.
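
To make the difference concrete, here is a quick sketch with a made-up string containing a double space, tabs and newlines. It compares an empty replacement, the space-only pattern and a pattern that matches any whitespace.

```php
<?php
// Made-up input: a double space, two tabs and two newlines between words.
$text = "vending  machines\t\tare\n\ngreat";

// Replacing runs of whitespace with an empty string joins neighbouring words.
$joined = preg_replace('/\s\s+/', '', $text);
// $joined is "vendingmachinesaregreat"

// Matching only the space character leaves the tabs and newlines in place.
$spacesOnly = preg_replace('/[ ]{2,}/', ' ', $text);
// $spacesOnly is "vending machines\t\tare\n\ngreat"

// Matching any whitespace and replacing it with one space keeps words apart.
$collapsed = preg_replace('/\s+/', ' ', $text);
// $collapsed is "vending machines are great"

echo $collapsed;
// prints "vending machines are great"
```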

Submitted by philipnorton42 on Wed, 11/07/2012 - 22:32

I see that if $wordCountArr contains a number (e.g. a year), the key that is output for it is 0. To return the correct value you must change $wordCountArr = array_slice($wordCountArr, 0, 10); to $wordCountArr = array_slice($wordCountArr, 0, 10, true); so that the fourth argument (preserve_keys) is true.
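
A small sketch of why this happens: PHP converts numeric string keys such as '2013' to integers, and array_slice() renumbers integer keys unless its fourth argument (preserve_keys) is true. The sample counts below are made up.

```php
<?php
// Made-up word counts; the key "2013" is stored by PHP as the integer 2013.
$wordCountArr = array('text' => 3, '2013' => 2, 'vending' => 1);

// Without preserve_keys the integer key is renumbered from 0.
$reindexed = array_slice($wordCountArr, 0, 10);
echo implode(',', array_keys($reindexed)), "\n";
// prints "text,0,vending"

// With preserve_keys set to true the original key survives.
$preserved = array_slice($wordCountArr, 0, 10, true);
echo implode(',', array_keys($preserved)), "\n";
// prints "text,2013,vending"
```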

Submitted by weter on Tue, 01/01/2013 - 06:36

Thanks a lot for sharing this tutorial on how to extract keywords from a text string using php. This post really helped me big time. I am so glad that I came to this site. I just hope that these awesome posts will keep on coming.

Submitted by Keybo Wordsy on Sat, 10/19/2013 - 13:38

Found this useful, but I need some simple help. If there is a word "bmw-x666" it results in "bmw,-,x666", but what I want is that it shouldn't extract the word containing the hyphen. Could anyone help me?

Submitted by Roshan on Sun, 12/08/2013 - 08:26

How could i set the key phrase of matched words. eg. This is some, Vending Machines are

Submitted by jasa pembuatan web on Wed, 08/13/2014 - 23:13

great code Philip - thanks ! is there a way to group the resulting keywords to two or three words ? your eg: some,text,machines,vending to eg: some text, machines vending, some machines, text vending, some vending, text machines... as multiple keywords can give good results in meta working example: samsung, s3, battery, i9300, original, EB900F result: samsung battery, samsung s3 battery, original i9300 battery, EB900F original, battery samsung s3... google accepts up to 10 grouped keywords of one two or three keywords

Submitted by Gary on Wed, 12/17/2014 - 22:37

Hi, I tried to use the above code with the text 3-In-1. However, why does it return 3In1? It seems a preg_replace removed the - from the text. Can I skip the - removal within the function? Thanks

Submitted by M on Thu, 12/25/2014 - 14:13

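The line $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); (line 6 in the first code example above) already keeps dashes, because the trailing - in the character class is a literal dash. If you are using one of the versions that strips punctuation with /[\pP]/, that is what removes the dash; use a pattern that leaves it alone, such as /[^\P{P}-]/u, instead.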

Submitted by philipnorton42 on Fri, 12/26/2014 - 10:06

Output of code is some,text,machines,vending ie only keywords are displayed but if I need frequency count along with keywords for eg. some 3,text 3,machines 2,vending 1. What modifications should I make ?

Submitted by Kinley Chriatian on Thu, 01/29/2015 - 18:51

Thanks for your source code, it works well.

Submitted by oscar on Thu, 11/26/2015 - 07:52

Hello Philip, If I want to extract common words from different language (ex. Bengali - হো সয়না সয়না সয়না ওগো সয়না এত জ্বালা সয়না ঘরেতে আমার এ মন রয়না কেন রয়না রয়না) then what changes should I make? Currently with your code it's not working, it gives no output! Please help.

Submitted by NNO on Sun, 04/24/2016 - 08:15

Great code Philip - thanks! I just have a little problem: how can I make each keyword a link? Please help

Submitted by raphael on Thu, 04/28/2016 - 16:58

It depends on where you want the keywords link to.

Submitted by philipnorton42 on Thu, 04/28/2016 - 17:16

Great Code, and works! you can make me for simple life

Submitted by delatosca on Fri, 10/07/2016 - 09:20

Thank you, the article is very useful for me and I gained knowledge by reading this article

Submitted by Dzulfikar on Fri, 02/17/2017 - 02:37

Hi, I try to use above code with a text is 3-In-1. However, why it returns as 3In1? information is interesting and a great tutorial.. tnks

Submitted by Mugo on Thu, 03/09/2017 - 07:08

Thank you for sharing this incredible post with us.

Submitted by AJ on Tue, 03/28/2017 - 18:44

great code

Submitted by felipe on Wed, 10/04/2017 - 00:11

Very useful guide

Submitted by Mind Roaster on Mon, 11/20/2017 - 10:10

I want model no of laptop as keyword. plz help. Code is working great! "Dell Inspiron Core i3 6th Gen - (4 GB/1 TB HDD/Linux) 3467 Laptop (14 inch, Black, 1.956 kg)" doesnt extract 3467 from string. Thank you :)

Submitted by Sarthak Gophane on Wed, 01/03/2018 - 06:56

This code helped develop a WordPress plugin that generates tag and keywords for each post.

Submitted by David Maina on Sat, 10/06/2018 - 12:16
