I have been given an assignment to find the frequencies of all words in a large text file. I have tried a program that does this for a sample string, by taking that string into an array. But for a text file spanning many pages with thousands of words, won't that array eat up a lot of space? I have been asked to treat performance as the prime criterion.
Any suggestions would be awesome.

I think this would work reasonably fast:

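<?php
$filename = "/path/to/file.txt";
$handle = fopen($filename, "r");
if ($handle === false) {
  exit("Could not open $filename\n");
}
$results = array();  // word => number of occurrences
$word = "";
while (false !== ($letter = fgetc($handle))) {
  if ($letter == ' ') {
    // End of a word: count it, skipping empty strings caused by repeated spaces
    if ($word !== "") {
      $results[$word] = isset($results[$word]) ? $results[$word] + 1 : 1;
      $word = "";
    }
  }
  else {
    $word .= $letter;
  }
}
// Count the final word, which has no trailing space
if ($word !== "") {
  $results[$word] = isset($results[$word]) ? $results[$word] + 1 : 1;
}
fclose($handle);
print_r($results);
?>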

Note: This assumes the file is in the format <word><space><word><space><word>.... etc.

You can always put your results in a MySQL table if you're worried that memory will become a problem.
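If memory is the main worry, a line-by-line read may be enough before reaching for a database: only the counts stay in memory, never the whole file. The sketch below is one possible variant, not the code above. It also relaxes the single-space assumption by pulling words out with a regex, so tabs, newlines and punctuation act as separators. The function name `count_words_streaming` is made up for illustration, and lowercasing is ASCII-only here.

```php
<?php
// Hypothetical helper (not from the thread): counts words one line at a time,
// so memory use grows with the number of distinct words, not the file size.
function count_words_streaming(string $filename): array
{
    $handle = fopen($filename, "r");
    if ($handle === false) {
        throw new RuntimeException("Could not open $filename");
    }
    $counts = array();
    while (($line = fgets($handle)) !== false) {
        // Grab runs of word characters, allowing internal hyphens (e.g. well-known)
        preg_match_all('/\w+(?:-\w+)*/', strtolower($line), $matches);
        foreach ($matches[0] as $word) {
            $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
        }
    }
    fclose($handle);
    arsort($counts); // highest counts first
    return $counts;
}
```

Calling `print_r(count_words_streaming("/path/to/file.txt"));` would print the counts with the most frequent words first. A hyphenated word broken across two lines would be counted as two words in this sketch.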

You came here for a suggestion and got a complete solution. Lucky guy.

I just re-read the original post, and now I realise it was for an assignment. I suppose I shouldn't have provided the complete solution. Oh well :)

Haha, he'll have to noobie-fy it if he wants to pass it off as his own though! ;)

You could also do this:

$str = file_get_contents("text.txt"); //get string from file - note this loads the whole file into memory
if ($str === false) {
	exit("Could not read text.txt"); //minimal error handling
}
preg_match_all("/\b\w+(?:-\w+)*\b/",$str,$r); //place words into array $r - this includes hyphenated words
$c = array_count_values(array_map("strtolower",$r[0])); //create new array - with case-insensitive count
foreach($c as $key => $val){
	echo $key . " [" . $val . "]<br />";  //output data
}

This will give you output for ASCII character words. However multibyte characters (â etc) won't work. That needs something more sophisticated.

I bet somebody could write a better regex than I've used though.
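On the multibyte point, one possible approach is to use PCRE's Unicode properties with the `/u` modifier and lowercase with `mb_strtolower`. This is only a sketch: it assumes the text is valid UTF-8 and that the mbstring extension is enabled, and the function name `count_words_utf8` is made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): Unicode-aware word count.
// Assumes $text is valid UTF-8 and that the mbstring extension is loaded.
function count_words_utf8(string $text): array
{
    // \p{L} = letters, \p{M} = combining marks, \p{N} = numbers, in any script.
    // The /u modifier makes PCRE treat the pattern and the subject as UTF-8.
    $pattern = '/[\p{L}\p{M}\p{N}_]+(?:-[\p{L}\p{M}\p{N}_]+)*/u';
    if (preg_match_all($pattern, $text, $matches) === false) {
        throw new RuntimeException("preg_match_all failed (possibly invalid UTF-8)");
    }
    // Case-insensitive count; mb_strtolower handles non-ASCII letters such as Â
    $words = array_map(function ($w) {
        return mb_strtolower($w, "UTF-8");
    }, $matches[0]);
    return array_count_values($words);
}
```

For example, `count_words_utf8(file_get_contents("text.txt"))` would count accented and non-Latin words as whole words. If the result is echoed into a web page, the page also needs to declare UTF-8 (for instance with `header('Content-Type: text/html; charset=utf-8');`) for the characters to display correctly.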

What can I say, edwinhermann is really kind!!

edwinhermann and ardav...thanks for your suggestions and answers !! :)

I have a text file written in UTF-8 containing Arabic script. Please suggest what to do in that case, and how to output the text in the same Arabic script.

Khan - please start a new thread. This thread is solved and it's not appropriate to post additional messages here.
