My boss has recommended a hash table, but I'd like some second opinions as to what you think the best way to accomplish this task is.

My problem is that I don't know C++, so I've been learning things as I needed them to finish this project. I'm kind of stuck right now though, so a push into the direction of which data structures would be best suited for me would be great.


My task is as follows:

I have a list of strings. The strings are are five or less characters long. They're obtained by breaking up a larger string based on some code I've already written. For the sake of this post, I'll refer to the following list as my sample data.

The original string is:

CC.Cl


When processed by my program, the following strings (lets call these pieces "grams" from now on to avoid confusion) are returned.

C
C
.
C
l
CC
C.
.C
Cl
CC.
C.C
.Cl
CC.C
C.Cl
CC.Cl


What I need to do now, is take this list of grams and eliminate all the duplicates, leaving behind a list of unique grams. Then I need to process another string and compare it to these grams eliminating duplicates and adding the uniques. the end result will be a large collection of unique grams.


If you have an understanding of what I need and can give me a few pointers, you can stop reading here. The rest is what I've tried thus far.


I've already accomplished this in MatLab, but my method is painstakingly slow. As I understand it, O(1) is constant time, O(log(n)) is similar to what you would expect from a binary tree, and O(n) is a linear increase in processing time. Going on that understanding, my method would be something like O(n^2), it takes longer and longer as the search goes on.

My method was very straightforward, it compared the first gram to the next one. If they matched, it deleted the second gram (which in Matlab causes the entire array to shift up) and then compared to the same index again. If they did not match, the index incremented by one and repeat.

Then I would process the next string. This time, instead of looking for unique grams within itself, I'd look for duplicates with these grams and the grams from the first string. Duplicates deleted, unique grams added to the list of unique grams from the first string. Then the next string would be processed and so on.

That is why it was so slow. As the number of strings passed grew and grew, the list it had to search through got bigger and bigger. On a small scale, this didn't matter. However, I have 250,000 strings and some of the strings can reach a length of up to 500 characters. Some extrapolation based on data from 12000 strings shows that I can expect about 90,000 unique grams when I'm done.

A hash table was proposed, but I think my lack of data structure knowledge is being problematic. I've been doing extensive reading for the past few days, but I feel that there might be a better way to go about this.

So before I commit myself to the hash table, I'd like to get some input from this board to see if there might be a better way of going about this.


Thanks in advance everyone. If any clarification is needed, please tell me and I'll do the best I can.

Hello

depending on how many "grams" will be processed, that is eliminating duplicates, comparing two sets of "grams", hashing method is best method, it is of O(1). And compared with other searching methods it is really easy to program.

Guess how many "grams" of each set need to be compared to eliminate duplicates.

-----
tesu

I'd recommend that you use the std::set class. It has O(log N) time complexity for search-based operations and the duplicate rules mean that you don't need any special logic for finding unique grams.

Ah, two possible solutions!

:) Very good, this will keep me busy for a while. Thanks tesuji and Narue for the quick replies, I'll be sure to post a follow-up when I get this done.

This article has been dead for over six months. Start a new discussion instead.