Hi I'm just a newbie here and I hope someone might be so kind to help me with my stuff. I have a project regarding C. I'm already half-way done and all that are left to do are:

1. Parse a string into words and arrange these words in descending order according to their frequency/number of occurrence [all words are converted to lowercase].
2. If two different words have the same count of frequency, they must be ordered alphabetically.
3. Words are all case INsensitive and are all made of English alphabet.
4.Print the results in the form of ("word", "frequency count").
5. Also, write the results in a file and save it.

I already know how to do some of them but my problem is that I can't find a way in incorporating nodes and linked lists in the process. I am really trying to understand all the sample codes here in the forum.

The base string is suppose to be coming from an uploaded .txt file from the user. And I am already done with its code.

I am really desperate in solving this problem. :S

So if someone there might be so helpful to give some hints, tips, links or might as well the code itself, I would really appreciate it. Sorry I really don't have the time to browse all the posts here, because my deadline for this project is really coming. I am out of sleep for days.

Thanks in advance sirs/ma'ams!
-A Computer Eng'g Freshman

Recommended Answers

All 48 Replies

I'm affraid i cant help you with alot of that as i'm not an expert at C but i did recently do some work involving reading from files and i found this website helpfull.

http://www.mrx.net/c/readfunctions.html

This link might help you with the string functions you are after.

http://www.cs.cf.ac.uk/Dave/C/node19.html

I'm already done with the file reading/writing. But gee thanks for the quick response. Thanks! =)

Just to give some basic ideas, you might do with the following simple setup, without linked lists involved at all.
Use a structure like e.g.

typedef struct 
{
    const char * _pWord; // pointer to the word
    unsigned int _nFreq;  // its frequency
}Entry;

To make it easy, have a fixed size table of Entry structs e.g.

#define MAX_WORDS 123
Entry table[MAX_WORDS] = {0};

and track the amount of used entries by e.g.

size_t nWordsUsed; // Increment when a word is inserted to the table

Write a function that parses the input i.e. splits the string to words, e.g.

void  Parse(char * pWords)

Write a function that handles inserting a word into your table (called by Parse()).

void Insert(const char * pWord)

Finally, use qsort() for sorting the table, you need to write a sort function of your own to supply to the qsort(), e.g.

int mySort(const void * pl, const void * pr)
{
     // Get the pointers to the Entry structs and compare
     Entry * pLeft = (Entry *) pl;
     Entry * pRight = (Entry *) pr;
     // compare ...
     // int result = ...
     return result;
}

In your sort function, first compare the frequency, if that's equal, let strcmpi() do the comparison. For more information on qsort() http://www.cplusplus.com/reference/clibrary/cstdlib/qsort.html
So your main function could look like ...

// global variables .. (here only for simplicity's sake)
Entry table[MAX_WORDS] = {0};
size_t nWordsUsed = 0;
int main(int argc, char* argv[])
{
    // Read the whole input from somewhere ...
    char * pWords = Read();
    // Parse the whole input 
    Parse(pWords);
    // Sort the table ...
    qsort((void*)table, nWordsUsed, sizeof(Entry), mySort);
    // Display it on screen ...
    // Save it to a file ...
    return 0;
}
commented: Thoughtful post +1

Hey that was a lot. Thanks! I'll try to follow your algorithm as soon as I confirm with my classmates that incorporation of linked lists/nodes isn't necessary for our project.

One more thing, can you give me a tip regarding memory allocations during the parsing part? I always end up having a segmentation fault error. Thanks. =)

>> can you give me a tip regarding memory allocations during the parsing part?
I would suggest that you read the input from the file in one fread() function call. I.e. read the file in at once - for that you have to dynamically allocate enough memory.
After you have allocated the needed memory, you do not have to use any memory allocation after that. This can be achieved if
a) your Parse() function modifies the input buffer by replacing non-alpha's with a null character
b) your Insert() function just sets the word pointer to point to the allocated memory. So you don't need to do anything like "Entry._pWord = (char*)malloc(...)", instead just set the _pWord to point to the word in the input buffer.

Trying to make this a bit clearer .. say that you have a non-constant string
char MyInputData[] = "This is a test string";
No you parse the whole string,
Parse(MyInputData)
which renders the string to:
"This0is0a0test0string"
i.e. 0==NULL character. The Entry table would contain 5 entries pointing to the MyInputData string, e.g. table[0]._pWord == MyInputData and table[1]._pWord == &MyInputData[5] etc.
Finally, when you are done, remember to free() the memory you have allocated.

Wow that really helped. I'm already done with the parsing part. *Hurray!*

The parsed tokens are in string form,is there any way that I could manipulate them as arrays? I'm referring to the comparing of each character so I can do things like : making them all in lowercase, comparing them with each other, etc.

The parsed tokens are in string form,is there any way that I could manipulate them as arrays? I'm referring to the comparing of each character so I can do things like : making them all in lowercase,

If you read the whole input into one block of dynamically allocated memory, regarding lower case conversion, do it before you parse anything by applying tolower() on the whole input you have. So, you do not have to worry about character case after that anymore.

comparing them with each other, etc.

If you store the pointers to a table of the Entry structs, you can iterate over the table and compare the individual entries you have stored e.g.

for(unsigned nn = 0; nn < nWordsUsed; ++ nn)
{
   if(0 == strcmp(table[nn]._pWord, someString))
   {
        // someString is already in the table
        ....
   }
}

Probably that was what you were wondering about (I hope).
Another place you need to use strcmp(), is your sort function (assuming that you will use qsort()).

I'm now puzzled. Is there another way of converting parsed strings first before converting them to lowercase?

I can't use fread because I am working with cgic.h commands [that was the library I used in order to come up with the uploading/interface/environment,etc]. That's why it turned out to be a little bit more complicated.

Upon reading from the uploaded file[using a command in cgic.h], the result is a single string. Then I was able to parse it into words [which are also strings] ..then I was stuck there.

I need to convert them to lowercase and assign them into their corresponding structures[as u have said]. Can you suggest a code? I just keep on ending up with an infinite loop/program crash whenever I try messing up with the assigning to their corresponding structs.

Sorry if I'm such a pain in the ass. I swear, I am trying my best. =)

>> ... the result is a single string.
I suppose you are able to modify that 'single string'? If so, you should be able to apply tolower() to the whole string before parsing anything (I'm not familiar with cgi lib (sorry)).

>> Then I was able to parse it into words [which are also strings]
=> good.
So I take that you effectively modify the input string by replacing non-alphas with a '\0' character when you parse (?).
Note! When you iterate over the string parsing it, be sure not to go past the input string (I take that it is a C-style string, i.e. terminated by a '\0' character).

It might prove very helpful if you post the code which is doing the word insertions to the table.
Showing also how you have actually declared the table.

I made a different way in parsing the string using strtok. The function looks like this:

void Parse(char *buffer)
{
        char *token;
        char delim[] = " .\"";

        for(token = strtok(buffer, delim); token != NULL; token = strtok(NULL, delim)) {
                fprintf(cgiOut, "<br>\n");
                cgiHtmlEscape(token);
                Insert(token);
        }
}

cgiHtmlEscape merely prints the value of the variable.

And correct me if I'm wrong, as far as I know strings cannot be taken as separate characters to be manipulated by any command in the ctype.h library [where tolower is].

#define MAXWORDS 100
size_t nWordsUsed

typedef struct word {
       char *word;
       int wfreq;
} WORD[MAXWORDS];

void Insert (char *token)
{
     WORD word;
     int loop;
     
     for(loop = 1; loop <= wcount; ++loop) {
         word[loop].word = token;
     }
}

One more problem about this code is that I have no idea of how will i get the word count in the string [represented by the variable 'wcount']. I tried placing a tracking variable in the for loop of parse to count the words, but the program crashes, I don't know why. There aren't any errors or warnings upon compilation.

I'll get back to this in 15 mins or so ...

I'll get back to this in 15 mins or so ...

Thanks. =)

Hey there. I already figured out what's wrong. I am now able to assign the words to their corresponding structs.

All I am left to do is to:
-arrange the words in decreasing # of word frequency.
-arrange in alphabetical order those with the same number of word freq.

Can I deal with these with just a single function?

And correct me if I'm wrong, as far as I know strings cannot be taken as separate characters to be manipulated by any command in the ctype.h library [where tolower is].

You can write yourself a routine that converts the input to lower case character by character, e.g. like;

void strlower(char * buffer)
{
	while(*buffer)
	{
		*buffer = tolower(*buffer);
		buffer ++;
	}
}

About the Parse() function, it seems to be OK. I take that the cgiHtmlEscape(token) does not modify the input buffer, so using it is OK, too. (Based on what I found in http://www.koders.com/c/fid273B6AAF9CA572AEAC4D91193A4ED0052119532E.aspx?s=cgiHtmlEscape#L2465)

But when it comes to the words table + the Insert() function, there you are going wrong.

#define MAXWORDS 100
size_t nWordsUsed[B] = 0[/B]; // Initialize to zero!
// I would change the typedef to ...
typedef struct
{
   char *word;
    int wfreq;
} WORD;
// Then ... the table that you will be using ([B]outside any functions[/B]),
// also zero-initialized.
WORD table[MAXWORDS] = {0};

void Insert (char *token)
{
    // We have a valid token here, which may or may not be already in table,
    // loop through the existing entries, if any, and check whether to insert or
    // increment the freq count
    for(unsigned nn = 0; nn < nWordsUsed; nn ++)
    {
       // assuming token is in lower case
       if(0 == strcmp(token, table[nn].word))
       {
            // todo: increment the freq count and [B]return[/B] ...
       }
    }

     // no match, a new word into the table ...

     // todo: check that the maximum count of words does not exceed here

     // No need to copy anything, just set the pointer ...
     table[nWordsUsed].word = token;
     ++ nWordsUsed;
}

You should learn about variable scopes, you had a local (non-static) variable in your Insert() function - that just does not work. Everytime the function returns, data in your local variable is gone.

Here's how my code ended up. What do you think? It's working fine so far . However, I don't know how to insert the other codes regarding the comparing stuffs.

void Insert(char *token, int loop)
{
        WORD word;
        word[loop].word = token;
        cgiHtmlEscape(word[loop].word);
}

void Parse(char *buffer)
{
        int loop;
        char *token;
        char delim[] = " .\"";

        loop = 1;
        for(token = strtok(buffer, delim); token != NULL; token = strtok(NULL, delim)) {
                fprintf(cgiOut, "<br>\n");
                Insert(token, loop);
                loop++;
        }
}

Thanks for the reminder on initializing the struct to {0}. I thought it wasn't necessary.

I'll try the function of converting to lowercase. You really are a great help. :P

void Insert(char *token, int loop)
{
        // This won't work at all <=> a single, non-static 
        // local WORD structure !!!
        [B]WORD word;[/B]
        word[loop].word = token;
        cgiHtmlEscape(word[loop].word);
}

Please take a closer look at my previous post.
Notice that you only want to insert a word which does not pre-exist in the table.
This implies that you have to search the existing entries and operate on the basis of that search.

Oh sorry for that. I was overwhelmed outright. :$

Meaning, I have to place some more codes/incorporate some more functions in order to make sure that no similar words might have a double assignment. Am I getting it right?

You can write yourself a routine that converts the input to lower case character by character, e.g. like;

void strlower(char * buffer)
{
	while(*buffer)
	{
		*buffer = tolower(*buffer);
		buffer ++;
	}
}

One more question about this: What does the 'buffer++' do? If I apply this function, will it directly convert each word to lowercase? Thanks.

Meaning, I have to place some more codes/incorporate some more functions in order to make sure that no similar words might have a double assignment. Am I getting it right?

Yes, there are two todo:s in the above Insert() function, which you have to do.
The check for exceeding the maximum number of words used in that function is also essential.

One more question about this: What does the 'buffer++' do? If I apply this function, will it directly convert each word to lowercase? Thanks.

buffer++ increments the pointer (i.e. buffer) to point to the next character.

The condition "while(*buffer)" could be written as
while('\0' != *buffer)
i.e. loop until you encounter the terminating NULL character in the string.
(If the string is NOT terminated with a NULL character, then the loop is good at producing a nice program crash sooner or later.)

It directly converts each character to lowercase.

I am not really sure if my strings are null terminated or how I am going to check/place a null terminator.

I am working with the insert function. Could you please tell me what's wrong with this code? I don't get any errors but the program crashes.

void Insert(char *token, int loop)
{
        int i, n;
        WORD word;

        for(i = 0; i < loop; i++){
                n = CompareStr(token, word[i].word);
                if( n == 0 ) {
                        word[i].count++;
                }else{
                        word[loop].word = token;
                        cgiHtmlEscape(word[loop].word);
                }
        }
}

And here's the CompareStr function:

int CompareStr (const void * a, const void * b)
{
  return (strcmp(a, b));
}

Oh mine, you still have a local variable WORD word inside the function ... that just won't work.
void Insert(char *token, int loop)
{
int i, n;
WORD word;
}

You must have an array of WORD structures declared outside your functions. i.e.

#include <stdio.h>
#include <string.h>
#define MAX_WORDS 1000
WORD word[MAX_WORDS] = {0}; // The one and only WORD table that you declare and use
unsigned WordCount = 0;
int main(int argc, char * argv[])
{
     // the program code here ...
}

You can also replace the CompareStr() function with strcmp().
Based on the previous postings, I'd say that your input string (which you parse) IS terminated with a NULL character, so don't worry about that.

O crap I forgot! Sorry.

Regarding my Insert function [minus my crap forgetfulness], is the for loop there already enough to suffice the correct assignment to structs?

Sorry to say, but no. The loop should be like;

void Insert(char *token)
{

for(i = 0; i < Count_Of_Words_Inserted_To_Table; i++)
{
     // This comparison is OK
     n = CompareStr(token, word[i].word);
     if(n == 0)
     {
           // This is also OK, increment the count
           word[i].count++;
           // but RETURN STRAIGHT AWAY ... 
           [B]return;[/B]
        }
}

// Now, because we are here, it means one thing only,
// which is that the token was not found in the table.
// (we looped the whole table through)
// So, just insert the token to the table

word[Count_Of_Words_Inserted_To_Table].word = token;

// and keep track of the count of words in the table
Count_Of_Words_Inserted_To_Table = Count_Of_Words_Inserted_To_Table + 1;
}

I take that the variable 'int loop' originates from your Parse() function, please drop that variable altogether - you do not need it and must not use it in Insert().
Count_Of_Words_Inserted_To_Table is equal to the WordCount variable I've used in the previous postings. (I.e. a global variable accompanying your word[] array.)

And remember to check that Count_Of_Words_Inserted_To_Table will not exceed the size of your word[] array! If it does, you will be enjoying a program crash.

I made the changes:

WORD word[MAXWORDS] = {0};
size_t wcount = 0;  //word count entering the table

Is this right? There are no compiling errors, change some of the variables too. However, the program still crashes on the event of this function.

void Insert(char *token)
{
        int i, loop;

        for(loop = 1; loop <= wcount; loop++){
                i = strcmp(token, word[loop].word);
                if( i == 0 ) {
                        word[loop].count++;
                        return;
                }else{
                        word[wcount].word = token;
                        cgiHtmlEscape(word[wcount].word);
                        fprintf(cgiOut, " , ");
                        cgiHtmlEscape(word[wcount].count);
                        return;
                }
        }
        wcount++;
}

Regarding the checking if the wcount is still less than the MAXWORDS, what output should I make if ever the wcount is still right or vice versa?

Well it is getting closer ... but there are errors, I try to explain.

void Insert(char *token)
{
        int i, loop;

        // Array indexes are zero-based, hence you start from 0, 
        // and loop must be less than wcount, so ...
        for(loop = 0; loop < wcount; loop++)
        {
                i = strcmp(token, word[loop].word);
                if( i == 0 ) {
                        word[loop].count++;
                        return;  // This is perfectly OK
                }
                else{
                // This else part was completely wrong ...
                // Just think about it, if token is not at index pointed 
                // to by 'loop', it may very well be at the next index or 
                // somewhere after that ...
                // so, you have to go through all the existing entries you
                //  currently have, i.e. you must do nothing within this else 
                // block here. 
                }
        }

        // So, we are out of the loop here ...
        // Now that you've looped the whole table without finding a match, you
        // insert a new entry 

        word[wcount].word = token;
        cgiHtmlEscape(word[wcount].word);
        fprintf(cgiOut, " , ");
        // word[wcount].count is a number, so you do not need to/
        // cannot HTML-escape it
        // cgiHtmlEscape(word[wcount].count);
        // you can write it to file as follows, if you want to
        fprintf(cgiOut, "%d", word[wcount].count);

        wcount++;
}

Whoa! It worked! Thank you so much!

All that's left is the sorting part. Oh my I'm afraid my code will be messed up again.

I'm still bothered by the qsort, it seems that it arranges automatically given that all the arguments are complete and valid. But in what order this qsort sorts a certain list?

<< in what order this qsort sorts a certain list?
You don't actually have to think about it, leave it as an internal of qsort().
Just be sure that your own sort function works as planned and provide valid arguments to the qsort() call.

Maybe I didn't understand you right ... so the final order of the elements in your array after qsort(), is practically dictated by your own sort function.

Here look at this code:

void CompareFreq(int a, int b)
{
        if(a < b){
                exch(a, b);
        }

}

int SortFreq(WORD word)
{
        int n;
        qsort(word[].count, wcount, sizeof(int), CompareFreq);
        for(n = 0; n < wcount; n++){
                fprintf(cgiOut, "<br>\n");
                cgiHtmlEscape(word[n].word);
                fprintf(cgiOut, " ,  %d", word[n].count);
        }
        return 0;

}

It's a junk. It's not even compiling. I placed the SortFreq Function at the last part of the Parse Function [which calls Insert].

And oh I remembered I think I also have to incorporate a function call for the alphabetical arrangement stuff in the event that both word[n].count are equal for two words.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.