Determining the number of unique words in a .txt file

Lerner, post #21, Dec 5th, 2008
Change this:

    while (inFile.getline(line, Num))
    {
        strncpy(array[i].word, line, wordLength);
        count++;
        i++;
    }

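To this:

    while (inFile.getline(line, Num))
    {
        // wordSearch() increments the stored count when the word is already present
        if (wordSearch(line, count, array))
            cout << "duplicate found" << endl;
        else
        {
            // First occurrence: store the word and start its count at 1
            strcpy(array[count].word, line);
            array[count].count = 1;
            count++;
        }
    }
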
Change wordSearch to this:

    // Returns true, and increments the stored count, if line is already
    // one of the first count entries of array.
    bool wordSearch(char line[], int count, words array[])
    {
        bool found = false;
        int i = count;
        while (i)   // scan backwards from the last stored word
        {
            if (strcmp(line, array[i-1].word) == 0)
            {
                array[i-1].count++;
                found = true;
                break;
            }
            i--;
        }
        return found;
    }

Eliminate wordSearchSetup() completely.

count should be the number of unique words found. If it reaches 100 before completely reading the file you will have to output the full error message. If you want to keep track of the number of total words found in the file in addition to the number of unique words in the file, you can do that too. Once you have completed the file reading you can display array with each unique word and the number of times it was found.

matt_570, post #22, Dec 7th, 2008
OK, thanks. Now I'm trying to determine the average occurrence of each word. I'm starting out by finding how many of each word there are. Here's my function:


    void averageOccurrence(words array[], int array_length)
    {
        int n;
        char cmp_array[wordLength];

        for( int i= 0; i< array_length; i++)
        {
            strcpy(cmp_array, array[i].word);

            for (int j=1; j<array_length; j++)
            {
                n = (strcmp(array[j].word, cmp_array));
                if(n == 0)
                    array[i].count++;
            }
        }
    }

It just gives me a large count like 150077, or 150079.

ArkM, post #23, Dec 7th, 2008
What's the initial value of the count member? (It must be zero.)

Lerner, post #24, Dec 7th, 2008
>>trying to determine the average occurence of each words

You don't need to know what each word is to do this; you only need to know how many unique words there are (that would be count in my last post) and how many words there were in the file. The number of words in the file can be calculated as a running total as you read through the file, as indicated in my last post, or by looping through the array of unique words and adding up each word's count. For example, if there are three unique words with frequencies of 3, 6 and 9 respectively, the total is 18 words and the average number of occurrences per unique word is 18 / 3 = 6. You can decide which approach you wish to take. However, the code you posted in post #22 above doesn't have a chance of coming up with the correct answer.

matt_570, post #25, Dec 7th, 2008
Thanks a lot, that helps a lot. I'm on my last stat: I have to find the most commonly occurring word(s). Here's my function:

    void commonWord(double count[], int array_length, words array[])
    {
        int commonCount[Num];
        int max;
        max = count[0];
        int j = 0;
        int i = 0;
        int k = 0;

        for(i = 1; i<array_length; i++)
        {
            if(count[i] > max)
                max = count[i];
            else if( count[i] == max)
            {
                commonCount[j] = i;
                j++;
            }
        }

        cout<< "The most commonly occuring words are: "<< endl;

        for( k = 0; k<array_length; k++)
            cout<< array[commonCount[k]].word<< endl;
    }

I don't get a compile-time error, but when I run the program I get a message telling me the .exe file stopped working.

matt_570, post #26, Dec 9th, 2008
I'm now trying to write my stats to a text file, but it's only writing two of the stats, and those stats come from the same function.

Here's the code:

    #include <iostream>
    #include <fstream>
    #include <cstdlib>
    #include <string>
    #include <iomanip>
    using namespace std;

    int const wordLength = 21;
    int const Num = 100;
    int const fileSize = 255;

    struct words
    {
        char word[wordLength];
        int count;
    };

    void storeFile(char[], char [], words []);
    void displayFile(char[], char[], words[]);
    int wordSearch(char word[], int array_size, words []);
    void sortSetup(char[], words [], int);
    int sort (words [], int, int);
    void averageLength(char[], words[], int);
    void Occurrence(char[], words[], int);
    void commonWord(char[], double[], int, words []);
    void averageOccurrence(char[], words[], int);

    int main ()
    {
        words array[Num];
        char fileName[fileSize];
        char out_file_name[fileSize];

        cout << "Please enter the name of the file you wish to open: "<< endl;
        cin.getline(fileName,fileSize);

        if (!cin.good() ) {
            cout << "Error reading cin..." << endl ;
            return -1 ;
        }

        cout<<"Please enter the name of the file you wish to send the data too" << endl;
        cin.getline(out_file_name,fileSize);

        displayFile(out_file_name, fileName, array);
        storeFile(out_file_name, fileName, array);

        cin.ignore();
    }

    void storeFile (char out_file_name[], char fileName[], words array[] )
    {
        ofstream outFile;
        outFile.open(out_file_name);

        int i = 0;
        ifstream inFile;
        char line [Num];
        int array_size = 0 ;
        inFile.open(fileName);

        while (inFile.getline(line,Num))
        {
            array_size = wordSearch(line, array_size, array);

            i++;
        }

        inFile.close();

        outFile<< "The number of unique words are: "<< array_size << endl;
        outFile<< "total number of words are: " << i << endl;
        outFile<< endl;
    }

    int wordSearch( char line[], int array_size, words array[])
    {
        int i = array_size ;

        while (i && array_size > 0)
        {
            if (strcmp(line, array[i-1].word) == 0)
            {
                array[i-1].count++;
                return array_size ;
            }
            i-- ;
        }

        strcpy(array[array_size].word, line) ;
        array[array_size].count = 1 ;
        return array_size+1;
    }

    void displayFile (char out_file_name[], char fileName[], words array[] )
    {
        int i = 0;
        char line [Num];

        ifstream inFile;
        inFile.open(fileName);

        while (inFile.getline(line,Num))
        {
            strncpy(array[i].word, line, wordLength);
            i++;
        }

        sortSetup(out_file_name, array, i);
        averageLength(out_file_name, array, i);
        Occurrence(out_file_name, array, i);
        averageOccurrence(out_file_name, array, i);

        inFile.close();
    }

    void sortSetup (char out_file_name[], words array [], int array_length)
    {
        ofstream outFile;
        outFile.open(out_file_name);

        char temp[wordLength];
        int position;
        int i = array_length;

        for (int loop = 0; loop < array_length - 1; loop++)
        {
            position = sort (array, loop, array_length - 1);
            if (position != loop)
            {
                strcpy(temp, array[position].word);
                strcpy(array[position].word, array[loop].word);
                strcpy(array[loop].word, temp);
            }
        }

        outFile << "The words in alphabetical order are:"<< endl;

        for (int j = 0; j< array_length; j++)
        {
            if(strcmp(array[j].word,array[j-1].word)!=0)
                outFile << array[j].word << endl;
        }

        outFile << endl;
    }

    int sort (words array[], int start, int stop)
    {
        int n;
        int loc = start;
        for (int pos = start + 1; pos <= stop; pos++)
        {
            n = (strcmp(array[pos].word, array[loc].word));

            if (n < 0)
                loc = pos;
        }
        return loc;
    }

    void averageLength(char out_file_name[], words array[], int i)
    {
        ofstream outFile;
        outFile.open(out_file_name);

        double average = 0;
        for(int j = 0; j<i; j++)
            average = average + strlen(array[j].word);

        average = average/i;

        outFile << "The average length of the words are: " << average <<endl;
        outFile << endl;
    }

    void Occurrence(char out_file_name[], words array[], int array_length)
    {
        ofstream outFile;
        outFile.open(out_file_name);

        int n;
        char cmp_array[wordLength];
        double count[Num];

        for( int i= 0; i< array_length; i++)
        {
            strcpy(cmp_array, array[i].word);
            count[i] = 0;
            for (int j=0; j<array_length; j++)
            {
                n = (strcmp(array[j].word, cmp_array));
                if(n == 0)
                    count[i]++;
            }
        }

        outFile<<"The unique words and the number of times they appear in the text file appears asthe following:"<< endl;
        outFile<<"word/times it appears:" << endl;
        outFile<< endl;
        for (int k = 0; k< array_length; k++)
        {
            if(strcmp(array[k].word,array[k-1].word)!=0)
                outFile <<array[k].word << " / " << count[k] << endl;
        }

        outFile<<endl;
        commonWord(out_file_name, count, array_length, array);
    }

    void commonWord(char out_file_name[], double count[], int array_length, words array[])
    {
        ofstream outFile;
        outFile.open(out_file_name);

        int count_max;
        count_max = count[0];
        int j = 0;
        int i = 0;

        for(i = 1; i<array_length; i++)
        {
            if(count[i] > count_max)
                count_max = count[i];
        }

        outFile<< "The word(s) that occur the most are: "<< endl;

        for( j = 0; j<array_length; j++)
        {
            if(strcmp(array[j].word,array[j-1].word)!=0)
            {
                if(count[j] == count_max)
                    outFile << array[j].word<< endl;
            }
        }

        outFile<< endl;
    }

    void averageOccurrence(char out_file_name[], words array[], int array_length)
    {
        ofstream outFile;
        outFile.open(out_file_name);

        int n;
        char cmp_array[wordLength];
        double count[Num];

        for( int i= 0; i< array_length; i++)
        {
            strcpy(cmp_array, array[i].word);
            count[i] = 0;
            for (int j=0; j<array_length; j++)
            {
                n = (strcmp(array[j].word, cmp_array));
                if(n == 0)
                    count[i]++;
            }
        }

        outFile<<"The average occurence of a word appears as the following:" << endl;
        outFile <<"word/average appearence:" << endl;
        outFile<< endl;
        for (int k = 0; k< array_length; k++)
        {
            if(strcmp(array[k].word,array[k-1].word)!=0)
                outFile <<array[k].word << " / " << count[k]/array_length << endl;
        }

        outFile<<endl;
    }

The "storeFile" function is the only one that prints to the file.
BTW I know this code is unorganized and not the best way to do it, but this project is due tomorrow (12-10)
Last edited by matt_570; Dec 9th, 2008 at 8:15 pm.

©2003 - 2009 DaniWeb® LLC