How can I count the number of unique words in a file in this program?

Question

andrew.mendonca.967

11 Years Ago

CSCI-15 Assignment #2, String processing. (60 points) Due 9/23/13

You MAY NOT use C++ string objects for anything in this program.

Write a C++ program that reads lines of text from a file using the ifstream getline() method, tokenizes the lines into words ("tokens") using strtok(), and keeps statistics on the data in the file. Your input and output file names will be supplied to your program on the command line, which you will access using argc and argv[].

You need to count the total number of words, the number of unique words, the count of each individual word, and the number of lines. Also, remember and print the longest and shortest words in the file. If there is a tie for longest or shortest word, you may resolve the tie in any consistent manner (e.g., use either the first one or the last one found, but use the same method for both longest and shortest). You may assume the lines comprise words (contiguous lower-case letters [a-z]) separated by spaces, terminated with a period. You may ignore the possibility of other punctuation marks, including possessives or contractions, like in "Jim's house". Lines before the last one in the file will have a newline ('\n') after the period. In your data files, omit the '\n' on the last line. You may assume that the lines will be no longer than 100 characters, the individual words will be no longer than 15 letters and there will be no more than 100 unique words in the file.

Read the lines from the input file, and echo-print them to the output file. After reaching end-of-file on the input file (or reading a line of length zero, which you should treat as the end of the input data), print the words with their occurrence counts, one word/count pair per line, and the collected statistics to the output file. You will also need to create other test files of your own. Also, your program must work correctly with an EMPTY input file – which has NO statistics.

Test file looks like this (exactly 4 lines, with NO NEWLINE on the last line):

the quick brown fox jumps over the lazy dog.
now is the time for all good men to come to the aid of their party.
all i want for christmas is my two front teeth.
the quick brown fox jumps over a lazy dog.

Copy and paste this into a small file for one of your tests.

Hints:

Use a 2-dimensional array of char, 100 rows by 16 columns (why not 15?), to hold the unique words, and a 1-dimensional array of ints with 100 elements to hold the associated counts. For each word, scan through the occupied lines in the array for a match (use strcmp()), and if you find a match, increment the associated count, otherwise (you got past the last word), add the word to the table and set its count to 1.

The separate longest word and the shortest word need to be saved off in their own C-strings. (Why can't you just keep a pointer to them in the tokenized data?)

Remember – put NO NEWLINE at the end of the last line, or your test for end-of-file might not work correctly. (This may cause the program to read a zero-length line before seeing end-of-file.)

This is not a long program – no more than about 2 pages of code.

Here is what I have so far:

#include<iostream>
#include<iomanip>
#include<fstream>
#include<string>
#include<cstring>
using namespace std;

void totalwordCount(ifstream &inputFile)
{
    char words[100][16]; // Holds the unique words.
    char *token;
    int totalCount = 0; // Counts the total number of words.
    // Read every word in the file.
    while(!inputFile.eof())
    {
        totalCount++; // Increment the total number of words.
        // Tokenize each word and remove spaces, periods, and newlines.
        token = strtok(words[99], " .\n"); 
        while(token != NULL)
        {
            token = strtok(NULL, " .\n");
            inputFile >> words[99];
        }
    }
    cout << "Total number of words in file: " << totalCount << endl;
}

void uniquewordCount(ifstream &inputFile)
{
    char words[100][16]; // Holds the unique words
    int counter[100];
    char *tok = "0";
    int uniqueCount = 0; // Counts the total number of unique words
    while(!inputFile.eof())
    {
        tok = strtok(words[99], " .\n");
        while(tok != NULL)
        {
            uniqueCount++;
            tok = strtok(NULL, " .\n");
            inputFile >> words[99];
            for(int i = 0; i < 100; i++)
            {
                if(strcmp(tok, words[i]) == 0)
                {
                    counter[i]++;
                }
                else
                {
                    strcpy(words[i], tok);
                    counter[i]++;
                }
            }
            tok = strtok(NULL, " .\n"); 
        }
    }
    for(int i = 0; i < 10; i++)
    {
        for(int j = 0; j < 10; j++)
        {
            cout << words[i][j] << " " << counter[i];
            cout << "\n";
        }
    }
}

int main(int argc, char *argv[])
{
    ifstream inputFile;
    char inFile[12] = "string1.txt";
    char outFile[16] = "word result.txt";

    // Get the name of the file from the user.
    cout << "Enter the name of the file: ";
    cin >> inFile;

    // Open the input file.
    inputFile.open(inFile);

    // If successfully opened, process the data.
    if(inputFile)
    {
        while(!inputFile.eof())
        {        
            totalwordCount(inputFile);
            uniquewordCount(inputFile);
        }
    }
    return 0;
}

I already figured out how to count the total number of words in the file in the totalwordCount() function, but in the uniquewordCount() function, I am having trouble counting the total number of unique words and counting the number of occurrences of each word. Is there something that I need to change in the uniquewordCount() function?

c++

nchy13 0 Light Poster

11 Years Ago

You can always use a hash table with a (key, value) pair for each word and count its occurrences.

rubberman 1,355 Nearly a Posting Virtuoso

11 Years Ago

Myself, I would use a map to map words to occurrences:

std::map<std::string,size_t> wordcount;

to store the counts of words. Simple enough. Just increment the count on each occurrence.

void addWord(const std::string& word)
{
    if (wordcount.find(word) == wordcount.end())
    {
        wordcount[word] = 1;
    }
    else
    {
        wordcount[word] += 1;
    }
}
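
For anyone who is allowed to use std::string, here is a rough sketch of how a map like this could be fed from a line of text. The function name countWords is made up for illustration, and stripping the trailing period is an assumption based on the test data in the question. It also uses ++wordcount[word] in place of the find()/else branches, which works because operator[] creates a missing entry with a count of 0.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: count the whitespace-separated words in `text`.
// A single trailing '.' is stripped from each word (assumption: the only
// punctuation is the period that ends each line, as in the test file).
std::map<std::string, std::size_t> countWords(const std::string &text)
{
    std::map<std::string, std::size_t> wordcount;
    std::istringstream in(text);
    std::string word;

    while (in >> word) {
        if (word[word.size() - 1] == '.') {
            word.erase(word.size() - 1);   // drop the sentence-ending period
        }
        if (word.empty()) {
            continue;                      // the token was just "."
        }
        ++wordcount[word];                 // missing entries start at 0
    }

    assert(in.eof());                      // the loop should stop only at end of input
    return wordcount;
}
```

With this, wordcount.size() is the number of distinct words, and iterating over the map visits each word with its count in alphabetical order.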

Edited 11 Years Ago by rubberman

rubberman 1,355 Nearly a Posting Virtuoso Featured Poster · Answer 1 · 2013-09-15T13:19:15+00:00

FWIW, my post above is NOT performance optimized, but will handle almost all situations adequately. I use such code to process tens of thousands of log records (up to 2K in size) per second in real-time systems, and it still only takes 1 or 2% of a single CPU core. This is a case where the KISS principle (Keep It Short and Simple) is a good protocol.

iamthwee · Answer 2 · 2013-09-15T17:01:15+00:00

You MAY NOT use C++ string objects for anything in this program.

Kinda implies not being able to use hash maps or the like?

iamthwee · Answer 3 · 2013-09-15T17:05:21+00:00

I already figured out how to count the total number of words in the file in the totalwordCount() function, but in the uniquewordCount() function, I am having trouble counting the total number of unique words and counting the number of occurrences of each word. Is there something that I need to change in the uniquewordCount() function?

Well, you could just loop through your array; any entry whose number of occurrences is exactly 1 is obviously a unique word.

Do this at the end.
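
A minimal sketch of that final pass, assuming the counts live in an int array with the first `used` rows occupied (both names are placeholders, not taken from the posted code):

```cpp
#include <cassert>

// Count the table entries whose occurrence count is exactly 1.
// `counts` holds one count per stored word; `used` is the number of
// occupied rows. Both names are assumptions made for this sketch.
int countSingleOccurrence(const int counts[], int used)
{
    assert(used >= 0);

    int singles = 0;
    for (int i = 0; i < used; i++) {
        if (counts[i] == 1) {
            singles++;
        }
    }
    return singles;
}
```

Note that this counts words that appear exactly once. If the assignment's "number of unique words" instead means the number of distinct words, which the hint about a table of unique words seems to suggest, then `used` itself would be the answer and no extra loop is needed.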

@the others: I doubt the OP can use maps. See line two of his original post.

phorce 131 Posting Whiz in Training Featured Poster · Answer 4 · 2013-09-15T17:21:33+00:00

Can I just ask: why are you reading the words in twice? This cannot be right. Could you not just read the words into an array once, and then perform the word count / unique word count in separate functions?
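
One way this could look within the assignment's rules (C-strings only, strtok() for tokenizing) is sketched below. It is only a sketch: WordTable, initTable, addWord and addLine are made-up names, the array sizes come from the hints in the question, and the code assumes each line has already been read into a char buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

const int MAX_WORDS = 100;          // at most 100 unique words (from the assignment)
const std::size_t MAX_LEN = 16;     // 15 letters plus the terminating '\0'

// Hypothetical container: the unique words and their parallel counts.
struct WordTable {
    char words[MAX_WORDS][MAX_LEN];
    int  counts[MAX_WORDS];
    int  used;                      // occupied rows = number of distinct words
    int  total;                     // total number of words seen
};

void initTable(WordTable &t)
{
    t.used = 0;
    t.total = 0;
}

// Record one word: increment its count if it is already in the table,
// otherwise append it with a count of 1.
void addWord(WordTable &t, const char *word)
{
    t.total++;
    for (int i = 0; i < t.used; i++) {       // scan only the occupied rows
        if (std::strcmp(t.words[i], word) == 0) {
            t.counts[i]++;
            return;
        }
    }
    // Not found: the assignment's limits are assumed to hold.
    assert(t.used < MAX_WORDS && std::strlen(word) < MAX_LEN);
    std::strcpy(t.words[t.used], word);
    t.counts[t.used] = 1;
    t.used++;
}

// Split one line in place with strtok() and record every word.
// strtok() overwrites the delimiters in `line`, so echo-print the line
// before calling this.
void addLine(WordTable &t, char *line)
{
    for (char *tok = std::strtok(line, " .\n");
         tok != NULL;
         tok = std::strtok(NULL, " .\n")) {
        addWord(t, tok);
    }
}
```

The file is then read only once: a single loop calls inputFile.getline() into a buffer large enough for the longest line plus the '\0', echoes the line, and passes it to addLine(). After the loop, t.total is the total word count, t.used is the number of distinct words, and the per-word counts are in t.counts. Words are copied into the table with strcpy() because the line buffer is overwritten by the next getline().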