I have been given an assignment to find the frequencies of all words in a large text file. I have written a program that does this for a sample string by loading the string into an array. But for a text file spanning many pages with thousands of words, won't that array eat up a lot of space? I have been asked to treat performance as the prime criterion.
Any suggestions would be awesome.

I think this would work reasonably fast:

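$filename = "/path/to/file.txt";
$handle = fopen($filename, "r");
if ($handle === false) {
  exit("Could not open $filename");
}
$counts = array();
$word = "";
while (false !== ($letter = fgetc($handle))) {
  if ($letter == ' ') {
    if ($word !== "") { $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1; } // a space ends the word: count it
    $word = "";
  } else {
    $word .= $letter;
  }
}
if ($word !== "") { $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1; } // last word has no trailing space
fclose($handle);
print_r($counts);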

Note: This assumes the file is in the format <word><space><word><space><word>.... etc.
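
The loop above calls fgetc once per byte, which tends to be slow on big files. Here is a rough sketch of a line-by-line variant of the same idea. It assumes words are separated by any whitespace (spaces, tabs, newlines) and that punctuation stays attached to the word. The function name countWordsInFile is made up for illustration.

```php
<?php
// Hypothetical helper: counts whitespace-separated words in a file,
// holding only one line plus the counts array in memory at a time.
function countWordsInFile($filename) {
    $handle = fopen($filename, "r");
    if ($handle === false) {
        return false; // could not open the file
    }
    $counts = array();
    while (($line = fgets($handle)) !== false) {
        // Split on runs of whitespace and drop empty pieces.
        $words = preg_split('/\s+/', $line, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
        }
    }
    fclose($handle);
    arsort($counts); // most frequent words first
    return $counts;
}
```

Memory use here grows with the number of distinct words rather than with the size of the file, which is probably what matters for a text spanning many pages.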

You came here for a suggestion and got a complete solution. Lucky guy.

I just re-read the original post, and now I realise it was for an assignment. I suppose I shouldn't have provided the complete solution. Oh well :)


You could also do this:

$str = file_get_contents("text.txt"); // get string from file - no error handling here though
preg_match_all("/\b(\w+[-]\w+)|(\w+)\b/", $str, $r); // place words into array $r - this includes hyphenated words
$c = array_count_values(array_map("strtolower", $r[0])); // create new array - with case-insensitive count
foreach ($c as $key => $val) {
	echo $key . " [" . $val . "]<br />"; // output data
}

This will give you output for words made of ASCII characters. However, multibyte characters (â etc.) won't work. That needs something more sophisticated.

I bet somebody could write a better regex than I've used though.
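
For UTF-8 text (accented Latin, Arabic and so on), one option is the /u pattern modifier together with Unicode property classes. This is only a sketch: it assumes the input is valid UTF-8 and that the mbstring extension is installed, and countUnicodeWords is a name invented here.

```php
<?php
// Hypothetical helper: case-insensitive word counts for a UTF-8 string.
function countUnicodeWords($text) {
    // /u makes PCRE treat the pattern and the subject as UTF-8.
    // \p{L} = any letter, \p{M} = combining marks, \p{N} = any digit.
    $pattern = '/[\p{L}\p{M}\p{N}]+(?:-[\p{L}\p{M}\p{N}]+)*/u';
    $found = preg_match_all($pattern, $text, $matches);
    if ($found === false) {
        return false; // e.g. the input was not valid UTF-8
    }
    $counts = array();
    foreach ($matches[0] as $word) {
        $key = mb_strtolower($word, 'UTF-8'); // multibyte-safe lowercasing
        $counts[$key] = isset($counts[$key]) ? $counts[$key] + 1 : 1;
    }
    return $counts;
}

// Example call with mixed Latin and Arabic words.
foreach (countUnicodeWords("Café café مرحبا مرحبا") as $word => $n) {
    echo $word . " [" . $n . "]\n";
}
```

When echoing the result to a browser, the page also has to be declared as UTF-8 (for example with a Content-Type: text/html; charset=utf-8 header sent before any output), otherwise the words may show up garbled.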

What can I say, edwinhermann is really kind!


I have a text file encoded in UTF-8 that contains Arabic script. What should I do in that case? How can I output the words in the same Arabic script?


