A common issue I have come across is that I have a CMS, or an old copy of WordPress, and need to create a set of keywords for the meta keywords field. To solve this I put together a simple function that runs through a string and returns the most commonly used words as an array. The number of words returned is currently set to 10, but you can change that quite easily.
The first thing the function defines is a list of "stop" words. These are words that occur so often in English text that they would otherwise crowd out the useful keywords. Words of three characters or fewer are also discarded. The function uses a variant of the slug function to remove any odd characters that might be in the text.
function extractCommonWords($string){
    // words that are too common to be useful as keywords
    $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','www');
    $string = preg_replace('/\s+/', ' ', $string); // collapse whitespace into single spaces so words are not joined together
    $string = trim($string); // trim the string
    $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too
    $string = strtolower($string); // make it lowercase
    preg_match_all('/\b.*?\b/i', $string, $matchWords); // split on word boundaries (this also yields spaces and empty matches)
    $matchWords = $matchWords[0];
    foreach ( $matchWords as $key=>$item ) {
        // drop empty matches, stop words and anything of three characters or fewer
        if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
            unset($matchWords[$key]);
        }
    }
    // count how many times each remaining word occurs
    $wordCountArr = array();
    if ( is_array($matchWords) ) {
        foreach ( $matchWords as $key => $val ) {
            $val = strtolower($val);
            if ( isset($wordCountArr[$val]) ) {
                $wordCountArr[$val]++;
            } else {
                $wordCountArr[$val] = 1;
            }
        }
    }
    arsort($wordCountArr); // most frequent words first
    $wordCountArr = array_slice($wordCountArr, 0, 10); // keep the top 10
    return $wordCountArr;
}
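The cleaning steps at the top of the function can be tried on their own. The following is only a minimal sketch: the helper name normaliseForKeywords() is hypothetical and not part of the original code. It collapses every run of whitespace into a single space, so that words separated by tabs, newlines or double spaces stay separate.

```php
<?php
// Hypothetical helper (name is an assumption): mirrors the cleaning steps
// used before the words are counted.
function normaliseForKeywords($string)
{
    $string = preg_replace('/\s+/', ' ', $string);           // collapse whitespace runs to one space
    $string = trim($string);                                  // remove leading and trailing spaces
    $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // keep letters, digits, spaces and dashes
    return strtolower($string);                               // make it lowercase
}

echo normaliseForKeywords("Vending   Machines\nare great!"), "\n";
// prints "vending machines are great"
```

Because the character filter only allows ASCII letters and digits, accented or non-Latin text is stripped entirely by this step.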
extractCommonWords() returns up to 10 of the most commonly occurring words as an array, with the word as the key and the number of times it occurs as the value. To get just the words, use the implode() function together with the array_keys() function. To change the number of words returned, alter the third argument of the array_slice() call near the return statement, currently set to 10. Here is an example of the function in action.
$text = "This is some text. This is some text. Vending Machines are great.";
$words = extractCommonWords($text);
echo implode(',', array_keys($words));
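On PHP 8 and later this produces the following output. On older versions, words with equal counts may come out in a different order.
some,text,vending,machines,great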
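To go from the returned array to an actual meta keywords tag, something along these lines could work. This is a sketch under assumptions: buildKeywordsMetaTag() and its $limit parameter are hypothetical names, and the sample array simply imitates the word => count shape that the function returns.

```php
<?php
// Hypothetical helper (not part of the original code): turns a word => count
// array into a meta keywords tag, keeping at most $limit words.
function buildKeywordsMetaTag(array $wordCounts, $limit = 10)
{
    arsort($wordCounts); // most frequent words first
    // preserve_keys = true so purely numeric words are not renumbered
    $keywords = array_keys(array_slice($wordCounts, 0, $limit, true));
    // escape the list so it is safe inside an HTML attribute
    $content = htmlspecialchars(implode(',', $keywords), ENT_QUOTES, 'UTF-8');
    return '<meta name="keywords" content="' . $content . '">';
}

// Sample data in the same shape as the function's return value.
$words = array('text' => 3, 'some' => 2, 'vending' => 1);
echo buildKeywordsMetaTag($words, 2), "\n";
// prints <meta name="keywords" content="text,some">
```

Passing the limit as an argument avoids editing the array_slice() call each time a different number of keywords is needed.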
Update
After lots of versions of this code were submitted by users, I think the most reliable version is this one from Cenk.
function extractKeyWords($string) {
    mb_internal_encoding('UTF-8');
    $stopwords = array(); // add any words you want to exclude here
    // lowercase, collapse whitespace into single spaces, then strip punctuation
    $string = preg_replace('/[\pP]/u', '', trim(preg_replace('/\s+/u', ' ', mb_strtolower($string))));
    // drop empty items, stop words, words of two characters or fewer, and numbers
    $matchWords = array_filter(explode(' ', $string), function ($item) use ($stopwords) {
        return !($item == '' || in_array($item, $stopwords) || mb_strlen($item) <= 2 || is_numeric($item));
    });
    $wordCountArr = array_count_values($matchWords);
    arsort($wordCountArr); // most frequent words first
    return array_keys(array_slice($wordCountArr, 0, 10));
}
This will produce the following sort of output.
print implode(',', extractKeyWords("This is some text. This is some text. Vending Machines are great."));
// prints "this,text,some,great,are,vending,machines"
print implode(',', extractKeyWords('হো সয়না সয়না সয়না ওগো সয়না এত জ্বালা সয়না ঘরেতে আমার এ মন রয়না কেন রয়না রয়না'));
// prints "সয়না,রয়না,কেন,আমার,ঘরেতে,ওগো,জ্বালা"
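Cenk's version ships with an empty stop word list, which is why "this" and "are" show up in the English example. One way to filter them without editing the function is to post-process its result. The sketch below assumes a hypothetical helper called removeStopWords() and uses a hard-coded keyword list in place of a real call.

```php
<?php
// Hypothetical helper (name is an assumption): removes unwanted words
// from a list of keywords that has already been extracted.
function removeStopWords(array $keywords, array $stopWords)
{
    // array_diff() keeps the original keys, so reindex with array_values()
    return array_values(array_diff($keywords, $stopWords));
}

// The list below stands in for the result of the keyword function.
$keywords = array('this', 'text', 'some', 'great', 'are', 'vending', 'machines');
echo implode(',', removeStopWords($keywords, array('this', 'are'))), "\n";
// prints "text,some,great,vending,machines"
```

Filtering after the fact can leave fewer than 10 keywords, because the stop words have already used up some of the slots. Filling in the $stopwords array inside the function avoids that.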
If you use Cenk's version then you should be aware that it requires PHP 5.3+ due to the use of a closure, as well as the mbstring extension for the mb_* functions, but then that shouldn't be a problem :)
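If you are unsure about an older host, the requirements can be checked up front. This is just a sketch: keywordExtractorRequirements() is a hypothetical name, and it only tests for closure support (available from PHP 5.3) and the mbstring extension.

```php
<?php
// Hypothetical helper (name is an assumption): returns a list of unmet
// requirements; an empty array means nothing was found missing.
function keywordExtractorRequirements()
{
    $problems = array();
    if (version_compare(PHP_VERSION, '5.3.0', '<')) {
        $problems[] = 'PHP 5.3 or later is needed for closures';
    }
    if (!extension_loaded('mbstring')) {
        $problems[] = 'the mbstring extension is needed for the mb_* functions';
    }
    return $problems;
}

$problems = keywordExtractorRequirements();
echo $problems ? implode("\n", $problems) . "\n" : "All requirements met\n";
```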
Comments
The extractCommonWords function is just superb. Brilliant work, and thank you.
Submitted by haron taiko on Sun, 08/23/2020 - 18:47
I am so satisfied. My English is poor, sorry :). Thanks for approving my user. Greetings, Wally
Submitted by wallykew on Wed, 01/20/2021 - 17:47