A common issue I have come across in the past is that I have a CMS, or an old copy of WordPress, and I need to create a set of keywords to be used in the meta keywords field. To solve this I put together a simple function that runs through a string and returns the most commonly used words in it as an array. It currently returns 10 words, but you can change that quite easily.
The first thing the function defines is a list of "stop" words. These are words that occur so often in English text that they would otherwise dominate the outcome of the function. Words of three characters or fewer are ignored for the same reason. The function also uses a variant of the slug function to remove any odd characters that might be in the text.
function extractCommonWords($string){
    // Common words that would otherwise dominate the results
    $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','www');

    $string = preg_replace('/\s+/', ' ', $string); // collapse runs of whitespace into a single space
    $string = trim($string); // trim the string
    $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too
    $string = strtolower($string); // make it lowercase

    // Split the string on word boundaries
    preg_match_all('/\b.*?\b/', $string, $matchWords);
    $matchWords = $matchWords[0];

    // Drop empty matches, stop words and words of three characters or fewer
    foreach ( $matchWords as $key => $item ) {
        if ( $item == '' || in_array($item, $stopWords) || strlen($item) <= 3 ) {
            unset($matchWords[$key]);
        }
    }

    // Count how often each remaining word occurs
    $wordCountArr = array();
    foreach ( $matchWords as $val ) {
        if ( isset($wordCountArr[$val]) ) {
            $wordCountArr[$val]++;
        } else {
            $wordCountArr[$val] = 1;
        }
    }

    arsort($wordCountArr); // most frequent words first
    $wordCountArr = array_slice($wordCountArr, 0, 10);
    return $wordCountArr;
}
The function returns the 10 most commonly occurring words as an array, with the word as the key and the number of times it occurs as the value. To extract just the words, use the implode() function in conjunction with the array_keys() function. To change the number of words returned, alter the third parameter of the array_slice() call near the return statement, currently set to 10. Here is an example of the function in action.
$text = "This is some text. This is some text. Vending Machines are great.";
$words = extractCommonWords($text);
echo implode(',', array_keys($words));
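On PHP 8, where sorting is stable, this produces the following output. On older versions, words with the same count may come out in a different order.
some,text,vending,machines,great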
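If you would rather not edit the function each time, the limit can be passed in as an argument instead. The following is only a sketch of that idea: the name topWords() and its $stopWords, $limit and $minLength parameters are my own assumptions and are not part of the function above.

```php
<?php
// Hypothetical helper: returns the $limit most frequent words in $string.
function topWords($string, array $stopWords = array(), $limit = 10, $minLength = 4) {
    // Normalise: lowercase, collapse whitespace, strip odd characters
    $string = preg_replace('/\s+/', ' ', strtolower(trim($string)));
    $string = preg_replace('/[^a-z0-9 -]/', '', $string);

    // Split on spaces, discarding empty items
    $words = preg_split('/ +/', $string, -1, PREG_SPLIT_NO_EMPTY);

    // Keep words that are long enough and not in the stop word list
    $words = array_filter($words, function ($word) use ($stopWords, $minLength) {
        return strlen($word) >= $minLength && !in_array($word, $stopWords);
    });

    $counts = array_count_values($words);
    arsort($counts); // most frequent first
    return array_slice($counts, 0, $limit, true); // true keeps numeric keys intact
}

$text = "This is some text. This is some text. Vending Machines are great.";
print implode(',', array_keys(topWords($text, array('this'), 2)));
```

With a limit of 2 this returns the two words that occur twice in the sample text, "some" and "text".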
Update
After lots of versions of this code submitted by users I think the most reliable version is this one from Cenk.
function extractKeyWords($string) {
    mb_internal_encoding('UTF-8');
    $stopwords = array(); // add the stop words for your language here
    // Lowercase, collapse whitespace, trim, then strip all Unicode punctuation
    $string = preg_replace('/[\pP]/u', '', trim(preg_replace('/\s+/u', ' ', mb_strtolower($string))));
    // Drop empty items, stop words, words of two characters or fewer, and numbers
    $matchWords = array_filter(explode(' ', $string), function ($item) use ($stopwords) {
        return !($item == '' || in_array($item, $stopwords) || mb_strlen($item) <= 2 || is_numeric($item));
    });
    $wordCountArr = array_count_values($matchWords);
    arsort($wordCountArr);
    return array_keys(array_slice($wordCountArr, 0, 10));
}
This will produce the following sort of output.
print implode(',', extractKeyWords("This is some text. This is some text. Vending Machines are great."));
// prints "this,text,some,great,are,vending,machines"
print implode(',', extractKeyWords('হো সয়না সয়না সয়না ওগো সয়না এত জ্বালা সয়না ঘরেতে আমার এ মন রয়না কেন রয়না রয়না'));
// prints "সয়না,রয়না,কেন,আমার,ঘরেতে,ওগো,জ্বালা"
If you use this then you should be aware that it requires PHP 5.3+ due to the use of a closure, as well as the mbstring extension for the mb_* functions, but that shouldn't be a problem :)
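Whichever version you use, the words still have to end up in the meta keywords field mentioned at the start. Here is a minimal sketch of that last step. The keywordsMetaTag() helper is a hypothetical name of mine, and it assumes you pass it a plain array of words such as the one extractKeyWords() returns.

```php
<?php
// Hypothetical helper: builds a meta keywords tag from a list of words
function keywordsMetaTag(array $words) {
    // Escape the value so quotes or angle brackets cannot break the attribute
    $content = htmlspecialchars(implode(',', $words), ENT_QUOTES, 'UTF-8');
    return '<meta name="keywords" content="' . $content . '">';
}

print keywordsMetaTag(array('some', 'text', 'vending'));
// prints <meta name="keywords" content="some,text,vending">
```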
Comments
Hello, can anyone help me with a problem? I have a variable containing an HTML command like this:
I want to split it up so I can change the values and then put it back together so that it works with echo $str; Sorry for my bad English. I have searched the entire net and found nothing that does what I need, so I hope someone can help. My own solution is not good because it is big and time consuming. I have found some code on the internet that looks like what I need, but I can't put the string back together without destroying something. Sincerely, LAT, D-Grund.dk Administrator. My email is lat@ofir.dk
Submitted by Anonymous on Sun, 06/20/2010 - 12:29
Your approach of stripping punctuation was a heck of a lot better than my method. I went the opposite approach and used a stock snippet of code with several preg_replace patterns, except for WordPress curly smart quotes. This is much more elegant and simple. Thanks for that, and here is how I would count word frequency:
No need for a foreach loop. The strtolower() call inside both foreach loops is really not needed either, since you have already converted $string to lowercase beforehand.
Submitted by Ronnie on Fri, 09/03/2010 - 18:00
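The snippet from this comment is missing. One way to count word frequency without a loop, which may or may not be what was originally posted, is PHP's built-in array_count_values() function. The sample $words array below is made up for illustration.

```php
<?php
// Sample input: words that have already been lowercased and filtered
$words = array('some', 'text', 'some', 'vending', 'text', 'some');

// Build a word => count map in one call, then sort by count, highest first
$counts = array_count_values($words);
arsort($counts);

print_r($counts); // some => 3, text => 2, vending => 1
```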
@Guillermo I think to get the effect you are looking for you'll need to modify the function in the following way:
This is untested code, but it should get you what you need. :)
Submitted by giHlZp8M8D on Wed, 01/19/2011 - 10:18
Excellent bit of code. I was looking for something to auto-suggest keywords for a content management system. I made the following changes:
1) Removed the strtolower call in the foreach loop because the string has already been forced to lower case, so all the items will be lower case anyway.
2) Added a $count parameter to the function so that when it is called you can specify how many words to return.
The hardest thing to get right is the stop word list.
Submitted by Matt Hawkins on Thu, 01/27/2011 - 10:39
Brilliant. Thanks for this piece of code. I was after doing something very similar so have used this as a base and made a few tweaks here and there so it fits my needs.
Submitted by Brit on Wed, 03/23/2011 - 10:52
Hi
Great code, and it is working a treat on my site, but I have a problem with words containing 'ss'. The double s is being removed, so a word such as 'password' is being seen as 'paword' in the keywords.
Regards
Submitted by Mkj on Sat, 05/21/2011 - 09:37
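A likely cause of this, assuming the backslashes were lost when the code was copied, is that the whitespace pattern /\s\s+/ ended up as /ss+/, which matches a literal double s. The sketch below reproduces the symptom and shows a pattern that collapses whitespace instead. The variable names are only for illustration.

```php
<?php
// With the backslashes missing, the pattern matches two or more literal "s" characters
$broken = preg_replace('/ss+/i', '', 'my password');

// With the backslashes in place, runs of whitespace are collapsed to one space
$fixed = preg_replace('/\s+/', ' ', "my   password");

print $broken . "\n"; // prints "my paword"
print $fixed . "\n";  // prints "my password"
```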
Hi, would you please let me know how I could include Spanish characters too?
For example, I need to print 'tamaño' but instead I get 'tamao'.
Thanks in advance
Tolenca
Submitted by Tolenca on Wed, 06/22/2011 - 19:46
You need to add those special characters to the regular expression that strips the non-alphanumeric characters (the preg_replace() call with the [^a-zA-Z0-9 -] pattern in the example above). You just need to make sure it accepts the Spanish letters as well as the alphanumeric characters. I think that should work.
Submitted by giHlZp8M8D on Wed, 06/22/2011 - 22:20
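Rather than listing every accented letter by hand, one option is to use Unicode character classes. This is a sketch only, assuming UTF-8 input and the mbstring extension. The sample sentence is made up, and the rest of the function would also need mb_strlen() in place of strlen().

```php
<?php
mb_internal_encoding('UTF-8');

$string = 'El tamaño del CAMIÓN: ¡enorme!';
$string = mb_strtolower($string); // multibyte-safe lowercase

// \p{L} matches any letter and \p{N} any digit; the u modifier makes the
// pattern and the subject be treated as UTF-8
$string = preg_replace('/[^\p{L}\p{N} -]/u', '', $string);

print $string; // prints "el tamaño del camión enorme"
```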
Instead of
I did:
This gives an associative array $wordCountArr of [word] => [occurrences], ignoring words that have occurred only 1 or 2 (or a specified number of) times.
Submitted by Joe Bowman on Fri, 09/02/2011 - 14:09
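The code from this comment is missing, so the following is only a guess at the idea rather than what was posted: filter the counted words by a minimum number of occurrences. The sample counts and the $minOccurrences name are hypothetical.

```php
<?php
// Sample word => count map, as built by the counting loop in the function
$wordCountArr = array('some' => 5, 'text' => 3, 'vending' => 2, 'machines' => 1);

$minOccurrences = 3; // hypothetical threshold: drop words seen only once or twice

// Keep only the words that occur often enough
$wordCountArr = array_filter($wordCountArr, function ($count) use ($minOccurrences) {
    return $count >= $minOccurrences;
});
arsort($wordCountArr);

print_r($wordCountArr); // some => 5, text => 3
```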
Interesting take on it. I like it! :)
Submitted by giHlZp8M8D on Fri, 09/02/2011 - 14:11
I changed line 13 to not include 'keywords' that are just numbers, using the is_numeric() function.
Submitted by Gerhard Racter on Thu, 10/27/2011 - 15:49
Thanks, this is quite useful!
Submitted by HSMoore on Tue, 11/08/2011 - 09:43
Make sure this line:
reads like this:
Submitted by Anonymous on Thu, 05/03/2012 - 15:23
This script just returns a string of comma separated keywords and supports multibyte characters.
Adjust your stopwords for your locale.
Submitted by wookiester on Fri, 05/04/2012 - 10:53
Whoops.. Ignore my previous code block.. it was bugged.. try this.
Submitted by wookiester on Fri, 05/04/2012 - 11:11
Heya. Great script thanks :) Just a note for the guys who want to handle non-anglo characters, if they replace:
with this:
It should support extended characters, not sure off the top of my head how much unicode is covered by \w (if any), but it works for a bunch of quick tests I've put through it.
Submitted by Bob Davies on Tue, 08/14/2012 - 07:14
To reduce whitespace (one whitespace character followed by another) within the string, it's better to use
Submitted by summsel on Wed, 11/07/2012 - 16:57
To stop the function removing dashes you need to change the following line
To this:
Line 7 in the code example above.
Submitted by giHlZp8M8D on Fri, 12/26/2014 - 10:06
Now works perfectly for UTF-8
Submitted by Cenk on Sat, 01/09/2016 - 23:00
This code helped develop a WordPress plugin that generates tags and keywords for each post.
Submitted by David Maina on Sat, 10/06/2018 - 12:16
Great post. Thank you for sharing useful information. It is very beneficial to PHP learners.
Keep posting more information.
Submitted by PHP Course on Tue, 08/04/2020 - 15:35