Jack Kray

  • 0

Extract Keywords From A Text String With PHP

A common issue I have come across in the past is that I have a CMS system, or an old copy of Wordpress, and I need to create a set of keywords to be used in the meta keywords field. To solve this I put together a simple function that runs through a string and picks out the most commonly used words in that list as an array. This is currently set to be 10, but you can change that quite easily.

The first thing the function defines is a list of "stop" words. This is a list of words that occur quite a bit in English text and would therefore interfere with the outcome of the function. The function also uses a variant of the slug function to remove any odd characters that might be in the text.

function extractCommonWords($string){
      $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
   
      $string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
      $string = trim($string); // trim the string
      $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
      $string = strtolower($string); // make it lowercase
   
      preg_match_all('/\b.*?\b/i', $string, $matchWords);
      $matchWords = $matchWords[0];
      
      foreach ( $matchWords as $key=>$item ) {
          if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
              unset($matchWords[$key]);
          }
      }   
      $wordCountArr = array();
      if ( is_array($matchWords) ) {
          foreach ( $matchWords as $key => $val ) {
              $val = strtolower($val);
              if ( isset($wordCountArr[$val]) ) {
                  $wordCountArr[$val]++;
              } else {
                  $wordCountArr[$val] = 1;
              }
          }
      }
      arsort($wordCountArr);
      $wordCountArr = array_slice($wordCountArr, 0, 10);
      return $wordCountArr;
}

 

The function returns the 10 most commonly occurring words as an array, with the key as the word and the amount of times it occurs as the value. To extract the words just use the implode() function in conjunction with the array_keys() function. To change the number of words returned just alter the value in the third parameter of the array_slice() function near the return statement, currently set to 10. Here is an example of the function in action.

$text = "This is some text. This is some text. Vending Machines are great.";
$words = extractCommonWords($text);
echo implode(',', array_keys($words));

 

some,text,machines,vending

 

Another code

function extractCommonWords($string){
      $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
   
      $string = preg_replace('/ss+/i', '', $string);
      $string = trim($string); // trim the string
      $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
      $string = strtolower($string); // make it lowercase
   
      preg_match_all('/\b.*?\b/i', $string, $matchWords);
      $matchWords = $matchWords[0];
      
      foreach ( $matchWords as $key=>$item ) {
          if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
              unset($matchWords[$key]);
          }
      }   
      $wordCountArr = array();
      if ( is_array($matchWords) ) {
          foreach ( $matchWords as $key => $val ) {
              $val = strtolower($val);
              if ( isset($wordCountArr[$val]) ) {
                  $wordCountArr[$val]++;
              } else {
                  $wordCountArr[$val] = 1;
              }
          }
      }
      arsort($wordCountArr);
      $wordCountArr = array_slice($wordCountArr, 0, 10);
      return $wordCountArr;
  }
}
 
$text = "This is some text. This is some text. Vending Machines are great.";
$words = extractCommonWords($text);
echo implode(',', array_keys($words));

 

Source : hashbangcode.com

loading...

Related posts

Related posts