I am working on a project about data mining. My company has given me dummy info for 6 million Twitter customers, and I was assigned to find the similarity between any two users. Can anyone give me some ideas on how to deal with such large community data? Thanks in advance.

Problem: I use tweets and hashtag info (hashtags are the words highlighted by the user) as the two criteria to measure the similarity between two different users. There is a large number of users, and each user may have millions of hashtags and tweets. Can anyone tell me a good way to calculate the similarity between two users quickly? I have tried using TF-IDF to calculate the similarity between two different users, but it seems infeasible. Does anyone have a super algorithm or good ideas that would let me quickly find all the similarities between users?

For example:
user A's hashtags = (cat, bull, cow, chicken, duck)
user B's hashtags = (cat, chicken, cloth)
user C's hashtags = (lenovo, HP, Sony)

Clearly, C has no relation to A, so there is no need to waste time calculating that similarity; we can filter out all the unrelated users first, before calculating similarities. In fact, more than 90% of all users are unrelated to any particular user. How can I use hashtags as the criterion to quickly find the group of potentially similar users for A? Is this a good idea, or should we just directly calculate the relative similarity between A and all other users? What would be the fastest, most tailored algorithm for this problem?


I'd write a very short utility to get further info from the data first. I'd want to know how good the hashtags are at differentiating one user from another. Specifically, what percent of a large sampling (all 6 million) are unique? (100% - repeat%).

I'm not a Twitter user, so I'd want to know how good the hashtags are at measuring similarity - it sounds like they could be the most important key you have?

If that hashtag similarity association is valid, I'd want to create a string based on: hashtags first, then the user's name/handle.

Anything else that needs to be thrown into the ID stew we're about to cook? Shorter is better, but these strings must be able to tell us, upon quick analysis, the similarity ranking we're looking for.

I'm thinking I'd then use a two-pass sort of these strings, so everyone with "cat" in their hashtags would be grouped together (every word in the hashtags would be parsed out and sorted first, then the whole listing would be sorted). Now give a value for every matching word in the hashtags of the two users you're comparing. Since the closest ones should now be sorted near each other, it's easy, and when the first word changes, you don't make any more comparisons with that first person.

So everyone with the word "cat" is compared with the first person with "cat", then the second person with "cat", then the third, etc. When you run out of people with "cat" in their hashtags, you stop comparing them. Maybe the next hashtag is "dog", so now you start comparing the first person in the data file with "dog" in their hashtag, with all the "dog" hashtaggers, then the second person with "dog" is compared down the "dog" hashtaggers, etc.

If two words match up, the similarity score should increase by 2-3X what it was for the one-word match, etc.
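
Something like this sketch is what I have in mind (the records and the printout are just placeholders - it only shows the sort-then-compare-neighbors pattern):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Each record is "sorted-tags,name"; qsort on the whole string groups
   users with the same leading tags right next to each other. */
int cmpStr(const void *a, const void *b) {
   return strcmp(*(char * const *)a, *(char * const *)b);
}

int main(void) {
   char *recs[] = {
      "basketball dogs ducks,delta",
      "art bats cows,alfa",
      "art cows,bravo",
      "art bats cows,zulu",
      "baseball cows,charlie",
   };
   int n = sizeof recs / sizeof recs[0], i;

   qsort(recs, n, sizeof recs[0], cmpStr);

   /* After the sort, the candidates worth scoring are adjacent; when the
      leading word changes, stop comparing against the current user. */
   for (i = 0; i + 1 < n; i++)
      printf("%-28s next: %s\n", recs[i], recs[i + 1]);
   return 0;
}

Once sorted, both "art bats cows" users sit side by side, so most pairs of users never get compared at all.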

You might think that with all these records, you'd be swamped trying to sort and re-sort them all, etc. Not true. Don't worry about that part, for now. Think about the similarity ranking scores. How best to keep the scores "granular", and not "clumped" into either a big pile of dissimilar rankings, or all clumped into the closely similar ranking - either one.

I like these kinds of challenges! ;)

Dear Adak

Thanks for your reply and good suggestions. The hashtags are the words highlighted by users, which reflect a user's interests to some extent; if two users have some common hashtags, they have some hidden similarity of interest. However, in Twitter 90% of the users are not active, and most of them don't have any hashtags. So instead of directly calculating the similarity between all users, I would like to first filter out all the unrelated users (for a particular user, the users who have no hashtags or no common hashtags). To do this, I use an array to store all users, then for each user I use a linked list to chain together the related group (for example, all the users who have some common hashtag with user A are the related group of A). The second step is to calculate the similarity within the related group for user A, getting the relative similarity value between A and every other user in the group. I repeat this step for each user in Twitter, so that I get a relative similarity value between any two users. The next step is to use those similarity values to build a graph (similarity = 0 means no link between the two users); finally, I run this graph through the fast unfolding community detection algorithm to find the clusters of the social website.

For your suggestions, I am not very clear about some parts. Please give me a more detailed explanation. Thank you very much.

Q1: What do you mean by creating a string for the hashtags, and what is its purpose?

Q2: What do you mean by the two-pass sort? Can you give me more details? A short example would be much better to help me understand the concept.

Q3: How exactly would I implement a similarity scoring system? Please give me a short example to help me understand. Thanks.

I am a university student, and this project is really tough for me; I have no good idea of how to implement it. I'm struggling now, and I hope the programming pros can give me a hand with some good, feasible ideas. Thanks in advance.

First, I'm not a professional - just a hobbyist, but have a keen interest in high speed processing in C. And three times your age. ;)

Say we had some records - a record is a name and their hashtags:

name     hashtags
====================
alfa     art,bats,cows
bravo    art,cows
charlie  baseball,cows
delta    basketball,dogs,ducks
delta    basketball,dogs,ducks
echo     basketball,dogs,ducks
...
...
zulu     art,bats,cows

Here, our NATO phonetic users are sorted by name - but that's not what we want, or zulu will be very slow to be seen as similar to alfa. What we want is to sort each user's hashtags, and put them first into a string:

art bats cows,alfa
art bats cows,zulu
art cows,bravo
baseball cows,charlie
basketball dogs ducks,delta
basketball dogs ducks,delta
basketball dogs ducks,echo

No linked lists needed (or wanted). Linked lists are slow, use them only when you can't use an array.

We've sorted each user's hashtags into alphabetical order (maybe they're already listed that way?), and because the names were added to the string being sorted, duplicate records for the same user end up right next to each other, where they can be quickly and easily removed if you prefer.

Also, all the art lovers are easily compared, and their similarities scored. Maybe 10 points for one matching hashtag, 20-30 points for 2 matching hashtags, 70 points for 3 matching hashtags, etc. The score could be a flat percentage of all the matching hashtags, or it could take a more exponential curve upward as the matching hashtags increase in number.
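
As a sketch, those two scoring options might look like this (the exact numbers and curve are placeholders, not a fixed rule):

#include <stdio.h>

/* flat: the score rises linearly with the number of matching tags */
double flatScore(int matches, int total) {
   return 100.0 * matches / total;
}

/* steeper: many shared tags count for much more than a few */
double steepScore(int matches, int total) {
   double p = (double)matches / total;
   return 100.0 * p * p;
}

int main(void) {
   printf("%.0f %.0f\n", flatScore(2, 3), steepScore(2, 3));   /* 67 44 */
   return 0;
}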

For charlie, the search is very fast, since he matches with no one on "baseball". You only need to look down ONE more record to see that, thanks to the sort. To find the similarity with alfa and the others with "cows", I'd suggest making another array of hashtags (again, in sorted order). Then every hashtag from every user could be checked quickly with a binary search on each one. This is also quite fast. I checked every word in Dickens' "A Tale of Two Cities" this way in just a few seconds, against a dictionary of 40,000 words in a string array[][].

Another option would be to use strstr() in a similar way, working down the string array[][] of hashtags, one by one:

art
baseball
basketball
bats
cows
dogs
ducks

etc.

It might be worthwhile to use an array of structs for this. Somehow you'll want to keep a data item with the final score reached for each group, for each person.

If you scored the similarity between each two people, then the similarity between any two people could be calculated.

hashtags       name     score to next person
======================================================
art bats cows,  alfa     100%
art bats cows,  zulu      66%
art cows,       bravo     50%
baseball cows,  charlie   n/a

So alfa and zulu are 100, but 66 for the next one. What's the similarity score between alfa and bravo?

max score possible: 100 * 2, or 200
actual score: 100 + 66, or 166. Total similarity: since 100 is perfect, it's 66% - the same as the similarity from zulu to bravo.

It's late, and this is "off the cuff", but it should give you some ideas. In general, try to sort the data with keys that are the most useful for your goal. And arrays are blindingly fast on modern PCs (especially Intel i7's and up).

Don't be too hesitant to make large arrays on the heap (malloc). RAM is cheap. I was working on some sorting algorithms this weekend, working from large arrays. Just ints, so very quick, but still - 10 million random ints in less than half a second? That's the kind of raw speed you can't get with linked lists, imo.
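
For instance, a rough sketch of heap work at that scale (the timing will vary with the machine, and the plain library qsort shown here is slower than a tuned quicksort):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 10000000   /* 10 million ints: about 40 MB on the heap */

int cmpInt(const void *a, const void *b) {
   int x = *(const int *)a, y = *(const int *)b;
   return (x > y) - (x < y);   /* safe compare, no subtraction overflow */
}

int main(void) {
   int i, *arr = malloc(SIZE * sizeof *arr);
   clock_t timer;

   if(!arr) { printf("malloc failed!\n"); return 1; }
   for(i = 0; i < SIZE; i++)
      arr[i] = rand();

   timer = clock();
   qsort(arr, SIZE, sizeof *arr, cmpInt);
   timer = clock() - timer;

   printf("Sorted %d ints in %.3f seconds\n", SIZE, (double)timer/CLOCKS_PER_SEC);
   free(arr);
   return 0;
}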

(But there are times when linked lists really come into their own.)

Dear Adak

You are really a smart guy - thanks very much for your reply. Your way sounds like the most useful and feasible idea. In fact, hashtags are only one of the measurements of the similarity between users; I categorise the measurements into content similarity, link similarity and meta-data similarity.

Content similarity: hashtags and tweets

Link similarity: the follower-following relationship between two users, and the number of times they have retweeted, mentioned or replied to each other

Meta-data similarity: location, gender and age

content similarity = hashtag similarity + tweet similarity, and the same approach gives the link similarity and meta-data similarity

total similarity = content similarity + link similarity + meta-data similarity

The total similarity is probably a single numeric value, something like the weighted link I want for building the similarity graph of the Twitter website. Something like this:
Click Here

After I get this weighted graph, I will run it through my fast unfolding community detection algorithm to find the potential clusters, and show the number of clusters in Twitter.

For your hashtag similarity, I think your way is really good and feasible. To make sure I understand your idea, I will summarize it; if anything is not correct, please point it out.

  1. Use an array to store the records of all users (don't use a linked list)

  2. Sort each user's hashtags and join them into a string

  3. Compare the current user's hashtags with the next user's hashtags, and use the formula (number of common hashtags / total number of distinct hashtags of the two users) to get the score to the next person

  4. To calculate the similarity between alfa and bravo: 100% * 66% = 66%
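
For example, my tentative understanding of step 3 as a minimal C sketch (assuming both users' hashtag arrays are already sorted alphabetically, so a single merge-style walk counts the common tags with no nested loops):

#include <stdio.h>
#include <string.h>

/* (number of common hashtags) / (total distinct hashtags of the two users) */
double tagSimilarity(char *a[], int na, char *b[], int nb) {
   int i = 0, j = 0, common = 0, cmp;
   while (i < na && j < nb) {          /* merge walk: O(na + nb) */
      cmp = strcmp(a[i], b[j]);
      if (cmp == 0) { common++; i++; j++; }
      else if (cmp < 0) i++;
      else j++;
   }
   return (double)common / (na + nb - common);
}

int main(void) {
   char *alfa[]  = {"art", "bats", "cows"};
   char *bravo[] = {"art", "cows"};
   printf("%.2f\n", tagSimilarity(alfa, 3, bravo, 2));   /* 2 common / 3 distinct = 0.67 */
   return 0;
}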

The way sounds really good. However, I still have some doubts:

Q1: How do I quickly compare the sorted hashtags to get the score when the two users have a lot of hashtags? Two for-loops? A binary search?

Q2: As you mentioned: "To find the similarity with alfa and others with "cows", I'd suggest making another array of hashtags (again, in sorted order)." I am not quite clear here. Since you have sorted all the hashtags and got all the scores to the next user, we should have no problem getting the similarity scores of any two users (just use the same way: similarity between alfa and bravo = 100% * 66% = 66%).

Q3: For the other option you mentioned, strstr(), can you explain in more detail?

Q4: For the tweet similarity, can I use the same approach? I think a document similarity algorithm may also work for calculating tweet similarity.

I really appreciate that you discuss this with me just like my professor/mentor. Thank you, Adak!

All these are just ideas. You will find ways to refine them going forward, before you decide "OK, this is set". Using the strings of hashtags, for instance - you might find something better, but it's an idea that can work quickly, and should be considered.

Binary search is the kind of logic you want your fast program to use. In the world of speed-ups we have things like:

1) Do nothing - that's ultimately fast. If, early on (before a lot of other processing), you can rule out doing anything further with lots of records, that's what you want to do.

2) Well-designed programs that logically use the data to its fullest, and don't have to re-load, re-write, re-anything - if possible. You'd be amazed how often programs have computers doing work inefficiently.

3) Brilliant algorithms. Among them:

Binary search. With just a handful of comparisons, it completes a search for a value out of a million values! I'm making up a program to demonstrate it. I'll post it later today.

Lots of similarly brilliant algorithms - from Dancing Links to Dijkstra's shortest path, to optimized Quicksort/Mergesort. They blow the slower algorithms into the weeds.

4) I shouldn't mention bit fiddling tricks, but they are quick. I almost never use them however.

Re: hashtags. strstr() is OK, but the Boyer-Moore string searching algorithm may be worth using. strstr() is the standard C function for finding a word (a substring of any kind) inside a larger string. It returns a pointer to the first char of the match.
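
A tiny example of strstr() in use (the tag string here is made up):

#include <stdio.h>
#include <string.h>

int main(void) {
   char tags[] = "art,baseball,basketball,bats,cows";
   char *p = strstr(tags, "bats");     /* pointer to first char of the match, or NULL */

   if(p)
      printf("Found \"bats\" at offset %ld\n", (long)(p - tags));
   else
      printf("Not found\n");
   return 0;
}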

Seeing several lines of actual raw data that will be used by the program, would help. Can you post that?

So the goal is to assign similarity figures for users, AND also for their tweets? I have experience with the first part, but I've never studied document similarity.

Don't worry if you hear conflicting advice coming from me - I like to hear (and sometimes express), different possible solutions to problems, before I lock into one solution. I call it Lazy Compilation. ;)

Looking at the BIG picture, what is the expected number of users that will be processed by the program per run, and what are the run-time parameters? A specific goal time, or just ASAP, or what?

I'm reminded of a recent "project" to optimize a program that took 50 minutes to get the raw input set up for inclusion into a database.

So I worked on it, and got the input down to 3 seconds, with 175,000 strings of data. Massive inefficiencies, removed. Only to find out that the biggest bottleneck was in the output function to the database, and he didn't want to change that code. :(

I'll post up the binary search example program for you to enjoy, later today.

I'd like to see several lines of the raw input, so I can see just what the program will need to work with.

Just a little program to show the extreme usefulness of Binary Searching (aka "binsearch").

#include <stdio.h>
#include <string.h>
#include <time.h>

#define SIZE 250000

typedef struct {
   char name[30];
   char state[40];
}user;

user users[SIZE];    //file-scope array: static storage, not the function stack (and not the heap)

int binarySearch(char *target, char *states[56]);

int main(void) {
   clock_t timer;
   int i,count,ok;
   char target[40];
   char *states[56]={"Alabama","Alaska","American Samoa","Arizona","Arkansas","California",
   "Colorado","Connecticut","Delaware","District of Columbia", "Florida","Georgia","Guam",
   "Hawaii","Idaho","Illinois","Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine",
   "Maryland","Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana","Nebraska",
   "Nevada","New Hampshire","New Jersey","New Mexico","New York","North Carolina","North Dakota",
   "Northern Marianas Islands","Ohio","Oklahoma","Oregon","Pennsylvania","Puerto Rico",
   "Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah","Vermont","Virgin Islands",
   "Virginia","Washington","West Virginia","Wisconsin","Wyoming"};

   FILE *fpIn;
   fpIn=fopen("NamesStates.txt","r"); //250,000 names with states
   if(!fpIn) {
      printf("Error opening file!\n");
      return 1;
   }
   timer=clock();
   for(i=0;i<SIZE;i++) {
      fscanf(fpIn, "%29s %39[^\n]",users[i].name,users[i].state);   //width limits prevent overflow
   }

   fclose(fpIn);
   //match all the states from the users array
   //with the states in the states array, using the binary search.
   for(i=0,count=0;i<SIZE;i++) {
      strcpy(target,users[i].state);
      ok=binarySearch(target, states);
      if(ok>-1)
         count++;
      else if(ok<0) {
         printf("target: %s users[i].state: %s \n",target, users[i]->state);
         getchar();
      }
   }

   timer=clock()-timer;
   printf("Matches found: %d   Elapsed time: %.3f seconds\n",count,(double)timer/CLOCKS_PER_SEC);

   printf("\n");
   return 0;
}

int binarySearch(char *target, char *states[56]) {
   int lo, hi, mid, cmp;
   lo=0;
   hi = 56-1;

   while(lo <= hi) {
      mid=lo + (hi-lo)/2;
      cmp=strcmp(states[mid], target);   //compare just once per pass
      if(cmp < 0) {
         lo=mid+1;
      }
      else if(cmp > 0) {
         hi=mid-1;
      }
      else
         return mid;
   }
   return -1;
}

The above is not an optimized program - run of the mill. But it finds a quarter of a million matches against 56 states and territories in less than 4/10ths of a second, on my PC. Pretty impressive, this algorithm! ;)

This is the data file, which I simply copied until I had 250,000 lines of text.

AARON Alabama
ABDUL Alaska
ABE West Virginia
ABEL Colorado
ABRAHAM Delaware
ABRAM Massachusetts
ADALBERTO New Jersey
ADAM Michigan
ADAN Washington
ADOLFO Virgin Islands
ADOLPH Northern Marianas Islands
ADRIAN Oklahoma
AGUSTIN Rhode Island
AHMAD South Carolina
AHMED Wisconsin
AL Nevada
ALAN Hawaii
ALBERT Idaho
ALBERTO American Samoa
ALDEN New Hampshire
ALDO Ohio
ALEC California
ALEJANDRO New York
ALEX District of Columbia
ALEXANDER Maryland
ALEXIS Connecticut
ALFONSO Illinois
ALFONZO Arizona
ALFRED Vermont
ALFREDO Florida
ALI Georgia
ALLAN Louisiana
ALLEN Minnesota
ALONSO New Mexico
ALONZO Oregon
ALPHONSE South Dakota
ALPHONSO West Virginia
ALTON Virginia
ALVA Iowa
ALVARO Kansas
ALVIN Kentucky
AMADO North Carolina
AMBROSE Pennsylvania
AMOS Puerto Rico
ANDERSON North Dakota
ANDRE Wyoming
ANDREA Utah
ANDREAS Tennessee
ANDRES Mississippi
ANDREW Maine
ANDY Guam
ANGEL Texas
ANGELO Arkansas
ANIBAL Montana
ANTHONY Nebraska
ANTIONE Indiana

Dear Adak

Thanks for your reply. I know these are just ideas, which are exactly what I want. In fact, for the code writing, I should have no problem implementing it once the idea is complete and sound. The main problem for me now is that I don't know which algorithm to use for the different similarity measurements, to make the whole problem fast, with the lowest time complexity.

For example, what algorithm for each similarity measurement below:

hashtag similarity (which you have mentioned above)

tweet similarity? follower-following similarity? retweet similarity?

Assume this is my database, and all the data I could get; I have not gotten the actual data from my professor yet, but you can imagine that all the data in my database design is available in some kind of text file format. Twitter database design:
Click Here

The focus here is to design a good program which can quickly compute the (total) similarity value between users, filter out the unrelated users, and finally produce something like a similarity graph text file. For example:

usera_id userb_id similarity value

12 15 0.6
. . .
. . .
. . .

Each line contains a pair of integers a b, meaning there is an edge from a to b (some similarity). After I get a text file like the above, I will pass it to the fast unfolding algorithm to find the potential clusters in the website.

So, my problem is how to design good ideas that will make my program fast and reasonable. Once the ideas are finalized, feasible and clear to me, I can do the code implementation.

  1. my expected number of users = 300 million

  2. no specific goal time needed, just ASAP

  3. Thanks for your code for the binary search

  4. I hope you can help me design ideas for the project, which is my weak part.

Do you have an email? I could send you my documentation and PPT slides, which clearly state my problem and my current program design. However, my professor has rejected my ideas; he said my idea is too simple and too slow, and cannot possibly handle such a huge amount of data. If it's not convenient to show your email, you can send an email to me; my address is ldaneil4738@gmail.com. Dear Adak, I am grateful from the bottom of my heart - you are really my god.

Thanks for the kind words. I'll get you an email address. Not meaning to poke fun at a "god", but throw out my previous idea for measuring similarity - it won't work for what you need. The new idea is to create a boolean-type string, which has a 1 at a given index if the user has the hashtag associated with that index (the indexes coming from a table of all hashtags), or a 0 if the user does not have that particular hashtag.

I'm thinking structs - an array of structs - where every user has a similarity "string" in their struct. A simple example:

User has hashtags of: art, beaches, cats, cows

And the hashtags list is: aardvarks,art,bats,baseball,basketball,beaches,caps,cats,comics,cows.

So this user's hashtag string would be:

    0100010101
    0 for aardvarks
     1 for art
      0 - bats
       0 - baseball
        0 - basketball
         1 - beaches
          0 - caps
           1 - cats
            0 - comics
             1 - cows

Now we can make an indexed table of these hashtag strings, which would be created as the data is input, using either an int index array or an array of pointers. In effect, it gives us a way of searching through the hashtag strings AS IF they were all in sorted order, even if they aren't sorted at all by hashtag string.

Using binary search, we can also find the K closest hashtag strings for any user, very quickly. BUT the downside to the above scheme is that in any such sorting, the 1's on the left side of the string are MUCH more important than the ones on the right side.

So if the user's tag string was 0100010101, we could immediately locate all matching tag strings, BUT a very similar user who did not have the "art" tag (but everything else) would not be shown at all, since they would sort substantially farther away.
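
For what it's worth, building such a string is simple enough - a sketch, assuming the master tag list is kept in alphabetical order so bsearch() can map each of the user's tags to its index (the result for this user happens to be the same 0100010101):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTAGS 10

int cmpTag(const void *key, const void *elem) {
   return strcmp((const char *)key, *(char * const *)elem);
}

int main(void) {
   /* master list, sorted - each tag's position is its index in the string */
   char *master[NTAGS] = {"aardvarks","art","baseball","basketball","bats",
                          "beaches","caps","cats","comics","cows"};
   char *userTags[] = {"art","beaches","cats","cows"};
   char bits[NTAGS + 1];
   char **hit;
   int i;

   memset(bits, '0', NTAGS);
   bits[NTAGS] = '\0';
   for(i = 0; i < 4; i++) {
      hit = bsearch(userTags[i], master, NTAGS, sizeof master[0], cmpTag);
      if(hit)
         bits[hit - master] = '1';   /* pointer difference is the tag's index */
   }
   printf("%s\n", bits);   /* 0100010101 for this user */
   return 0;
}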

I'll have to think about this some more.

I need info on hashtags:

*how many different hashtags are there, in total? Any limit to this total number?

*is there a limit, in the number of hashtags any one user will have? Will some users have hundreds of them?

*is there a "strength" of the hashtags (for instance, is the 5th or 25th? hashtag as meaningful as the 1st one"?

*Can we limit the number of hashtags for any one user, without compromising the value of the hashtags?

The design of the Twitter database is good to see, but not nearly as helpful as what the actual input for the program, will be.

Oh! New post to read.

That helps! The cosine similarity article looks good.

While waiting to sort out an issue with my email, I'm putting together a data file with 3 thousand names, giving each name from 0 to 99 tags.

Get a better feel for what works well.

More on this, as it progresses.

Dear Adak

Thanks for the new idea. However, we have to assume there are millions of distinct hashtags, because there are a few million users in Twitter; even if each user produced just one distinct hashtag, that is already a large number. The hashtag list you mentioned would be very long, and finding each hashtag's bit in the binary string would be very time-consuming.

  1. There is no limit on the number of hashtags per user; yes, some active users may have hundreds or thousands of hashtags.

  2. Yes, there should be a strength of the hashtags, which is the frequency of a hashtag; if one hashtag is very frequent, it means someone is very, very interested in that topic.

  3. We may limit the number of hashtags for any one user - say, only pick the frequent hashtags.

I think it is better to let you see my PPT slides, which have a lot of diagrams that would give you a better understanding of my question. Please send an email to me. Thank you very much.

email address: ldaneil4738@gmail.com

E-mail will have to wait until Monday.

In the meantime, delete the bool array idea I mentioned above. This is my latest idea:

We get all the tags for an individual user and sort them by their frequency, in descending order (important). We also have a file with the most frequent tags from all the users - say the top 500 million. There will be lots more tags in total, but these will be the most frequent ones, so we're getting the meat of the data, since there will be tens of thousands of repeats in the set of all tags from every user.

In the user's data file, the abbreviated record might look something like this:

12,Addison,kinetic happier ejectors gales sated madam neurones garters lapsed 

Where 12 is Addison's ID, and each word after her name is a tag she's frequently used, sorted in descending order - so "kinetic" is her most common tag in this example, "happier" is her next most common tag, etc.

But we don't want to work through these words, because nothing is slower on a computer than doing a lot of string work. So we use the tag array and a binary search to replace those tags with the index numbers matching those tags.

Now Addison's abbreviated record looks like this:

12,Addison,2181 1614 1043 1294 3649 2480 2844 1405 2402 0 

Much nicer!

To make comparisons, we put those numbers into a string, including the spaces:

2181 1614 1043 1294 3649 2480 2844 1405 2402 0

And store that in a smaller struct with just Addison's ID number, and her tag index numbers.
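
A sketch of that replacement step, using bsearch() against a sorted master tag array (the tag words are the ones from Addison's record above; the index numbers that come out are just whatever this tiny master list produces, not the 2181 1614 ... values, which came from a much bigger list):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTAGS 6

int cmpTag(const void *key, const void *elem) {
   return strcmp((const char *)key, *(char * const *)elem);
}

int main(void) {
   /* master tag list, sorted, so bsearch can return each tag's index */
   char *master[NTAGS] = {"ejectors","gales","happier","kinetic","madam","sated"};
   /* one user's tags, already sorted by her own usage frequency */
   char *userTags[]    = {"kinetic","happier","ejectors","gales","sated","madam"};
   char record[128] = "";
   char buf[16], **hit;
   int i;

   for(i = 0; i < 6; i++) {
      hit = bsearch(userTags[i], master, NTAGS, sizeof master[0], cmpTag);
      sprintf(buf, "%ld ", hit ? (long)(hit - master) : -1L);
      strcat(record, buf);   /* build the space-separated index string */
   }
   printf("12,Addison,%s\n", record);   /* 12,Addison,3 2 0 1 5 4 */
   return 0;
}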

Now sort the array of these small structs, by tag index numbers. You might get something like this:

2181 1614 1040 1000 4649   80 3844 3878    0 101   20% matches with Addison's tags
2181 1614 1043 1110 2222 1234 3977 6241  402 0     30% matches with Addison's tags
2181 1614 1043 1294 3362 2400 2111 4805  333 1075  40% matches with Addison's tags
2181 1614 1043 1294 3649 2480 2844 1405 2402 0 \
2181 1614 1043 1294 3649 2480 2844 1405 2402 0   100% matches with Addison's tags
2181 1614 1043 1294 3649 2480 2844 1405 2402 0 / (Addison's + 2 other users)
2181 1614 1043 1294 3649 2480 2844 1488 2402 0     90% matches with Addison's tags
2185 1614 1043 1294 3649 2480 2844 1405 2402 0   - although this last one matches 90%, you won't see the user as similar, since the most important (leftmost) tags don't match up.

This I'd consider to be an "overview" or "shallow" view. An in-depth view would be done like this:

1) take each tag number for the user you want to find the most similar users for, and fill the rest of the string with zeros, to guarantee it will be #1 in that range of the search.

2) call the binary search, and when it returns the index, get the user IDs from the rows immediately below, until the tag number changes (2185 becomes 2186), and write them to a file ONLY if they have N matching tags, which we'd test and count with strstr() (which is even faster than a binary search, but does less work) - see the sketch after this list.

3) repeat #2 for each tag number, adding them in append mode, to the same file.

4) when all the tag numbers have been searched through in this way, that file will have a list of EVERYONE with an equal tag, if their count of matching tags is greater than N. We would input whatever N we wanted for that search.
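
Here's a sketch of the strstr() test from step 2, padding both strings with spaces so a tag number only matches as a whole token:

#include <stdio.h>
#include <string.h>

/* Does a space-separated index string contain this tag number as a whole
   token? The padding keeps "218" from matching inside "2181". */
int hasTag(const char *tagString, int tagNum) {
   char padded[256], needle[16];
   sprintf(padded, " %s ", tagString);   /* assumes tagString fits in padded */
   sprintf(needle, " %d ", tagNum);
   return strstr(padded, needle) != NULL;
}

int main(void) {
   const char *addison = "2181 1614 1043 1294 3649 2480 2844 1405 2402 0";
   printf("%d\n", hasTag(addison, 1614));   /* 1 - present */
   printf("%d\n", hasTag(addison, 161));    /* 0 - only part of a token */
   return 0;
}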

If I haven't made your eyes glaze over too much, let me know what you think. ;)

Can I view PPT files, without powerpoint software?

I'm sure there was a bit of preloading into the drive's buffer, but I had to give a satisfied smile.

Time to load 3K records, and build the two arrays, including writing out all the 5K tags with 10 tags per user to the files:

0.015 seconds.

;)

I've increased the tags to 32 per person, minimum. I have to increase the number of users - I need to see more changes in the run times for the sorting (which I'm experimenting with atm).

Dear Adak, The New Year is coming.

Wishing you a New Year
That brings luck and prosperity
Fills your home with joy and spirit.
Wish you Enjoy good health, and Everything goes well with your work.
Merry Christmas and Happy New Year!

Dear Adak, in fact I am very surprised that a person could have so many good ideas in such a short time. It means you really have a lot of experience. I admire you very much.

Thanks very much for your good ideas. I still have doubts about the speed.
Q1. How do I sort the hashtags by frequency efficiently? I need to count how many times a particular hashtag appears, and that counting process will be very time-consuming.

Q2. What do you mean by the hashtag array and binary search? If I am not wrong, you mean using an array to store all the hashtags, sorted alphabetically, so that we can quickly find a particular hashtag with a binary search. Am I right?

Thanks for the best wishes, and same to you and yours, in 2013.

The frequency of the hashtags would be for one individual only - at least one version of it needs to be - otherwise the similarity grouping goes out the window as far as speed goes, because a sort on everyone's hashtags won't mean much at all. We need the leftmost hashtag to be the most frequent hashtag for that ONE user.
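
Counting the frequencies for one user doesn't have to be expensive, though. A sketch (assuming the user's raw tags arrive as an array of strings): sort them alphabetically so equal tags sit together, count each run, then sort the (tag, count) pairs by count, descending - roughly O(k log k) for k tags:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *tag; int count; } TagCount;

int cmpStr(const void *a, const void *b) {
   return strcmp(*(char * const *)a, *(char * const *)b);
}
int cmpCountDesc(const void *a, const void *b) {
   return ((const TagCount *)b)->count - ((const TagCount *)a)->count;
}

int main(void) {
   char *raw[] = {"cat","iphone","cat","ipad","cat","iphone"};
   TagCount tc[6];
   int n = 6, i, m = 0;

   qsort(raw, n, sizeof raw[0], cmpStr);          /* equal tags now adjacent */
   for(i = 0; i < n; i++) {                       /* count each run */
      if(i == 0 || strcmp(raw[i], raw[i-1]) != 0) {
         tc[m].tag = raw[i];
         tc[m++].count = 0;
      }
      tc[m-1].count++;
   }
   qsort(tc, m, sizeof tc[0], cmpCountDesc);      /* most frequent first */
   for(i = 0; i < m; i++)
      printf("%s %d\n", tc[i].tag, tc[i].count);  /* cat 3, iphone 2, ipad 1 */
   return 0;
}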

I've done this kind of thing before, but usually it's on the CBoard forum, so there's lots of discussion and ideas floating around. With your permission I'd like to make a thread over there just on the hashtag sorting issue, and see what they think. I am hoping someone like Deceptikon will also pitch in with ideas on this forum.

I can't say too much about counting an individual's hashtag frequencies, because I don't know what the raw data will look like. I agree the hashtag sorting is the biggest problem for the program. How could it not be?

I'm starting out with an optimized Quicksort I have. It's my champion integer sorter for large quantities (up to 10 million only, though). I also have a Radix MSB function I thought I'd try, and I'm reading up on a couple of other excellent string sorters.

The hashtag storage I'm suggesting will have a file (and an array when needed) of the strings. The hashtags may be sorted, or just indexed (which is just like sorting, but without moving anything). So one array will have nothing but a list of all the hashtags, sorted (or indexed).

The users array won't have alphabetic hashtags - only indexes to them: a bunch of numbers put together into one long string per user, with a space between each index. That will save a lot of space compared to the full tag strings. I thought about bit packing, but I believe speed will be better without it.

I've moved away a bit from the binary search, preferring faster direct indexing. We may need the binary search later on for something, but I'm hoping not. It IS very fast - but there are some faster ways to search, and we appear to need speed!

Just finished putting together a 100K user list, and a 300K mock hashtag list (just words), so the testing will be ON! ;)

100 Million is looming quite large, just now. I believe you mentioned up to 300 Million records for the program, but I'm chalking that up to a bit too much holiday egg-nog?
< LOL >

It's now using the data for 100,000 records, with 52 hashtags for every user, from a set of 300,000 hashtags. The run time is long because I have an extra write-out of all the data (in one format) so it can be read back in, in a better one. This will be removed.

Please post how you sorted (or were going to sort) these records. I'm unsure if it's better to:

*sort the tag indices as a single string (what I'm doing now)

or

*sort the tags as an array of 52 ints per user (in effect, making it a very large 2D int array with one user's hashtag indices per row)

Once they see some code from you (in accordance with the forum's policies), I'm hoping a few regulars will reply and give us further ideas.

Also, I'd like to not propose things that were already not approved. No sense in that.

I'll have some preliminary sorting times late today.

I'm getting a new email addy. No one can remember the old one.

Sort times with the hashtags as strings were disappointing, at 16 seconds for 100K records with 52 tags per user. I'm re-configuring the program to use numbers instead of strings, and recoding the sorter function to work with records that have multiple columns.

It will be much faster.

< Happy new year! >

Eureka!

Records sorted: 1000000 with 52 columns each. Time:  1.467 seconds -- new

Dear Adak

Yes, I understand the frequency is for each individual, but we have to assume that everyone has the right to create thousands of hashtags. The main issue here is the sorting problem. In fact, there are a lot of similarity algorithms online, but the main problem is how to sort and compare efficiently. The data format is like this:

tid u_id tags date
1 22 cat,iphone,ipad... 12/8/2008
2 22 win8,Lenovo,Sony 13/8/2008
. . . .
. . . .
. . . .

In fact, I really feel that you understand my problem very well, and I am sure you know how to do it. Dear Adak, I really need your help; can you drop me an email? I will show you my three problems. Thank you very much.

I will, but I'd like to keep this in the public forum, to the extent that's possible without divulging proprietary info. You started it here, it is an interesting topic, and imo it's not well covered by search engines. There's a ton of info on sorting, but I couldn't find anything half-decent on sorting records by columns of an array.

There WERE a few with Bubblesort, and clumsy logic, though! < ROFL!! >

Bubblesort: the Pain in the Butt, that just keeps on giving.

The big problems I see with any project like this are:

1) Acquiring the raw data

2) Classifying the raw data, into usable data

3) Processing the usable data

You must have #1 planned or handled already, or there would be no point to work with #2 or #3.

I understood from the previous pic that the format of the data would be:

ID_number: a unique integer
handle (name) a string, may not be unique
tags (some variable number of them)

Although a user might have used hundreds of tags, we can't work with every tag. We need to select from those tags the top 50 or so (fewer is faster). That's where the binary search will be very useful. If a user has tags of:

art
beaches
cats
caves

Then those will be searched for in the list of all tags (yes, millions are fine here). Each tag "hit" is counted (by incrementing a counter at that matching tag's index), and at the end of that, the user will have:

152 (the index say of "art" in the tag list) 2,486 uses
229 (the index of "beaches" in the tag list) 3,221 uses
688 (the index of "cats" in the tag list)    3,877 uses
701 (the index of "caves" in the tag list)...1,025 uses

etc. for all the rest of the most popular tags for this user

So the record would be:

ID     Handle    Tags, sorted by frequency for this one user
1001,"bigTime1",688 229 152 701 etc. for all the tags

And then the users as a group are sorted by their tag columns, giving "bigTime1" a list of those whose tags are most like his own, including ALL the 100% matches of tag frequency. That would be a shallow search - quick and with merit. A deeper search would include finding those who have tag indices of 229, 152, 701, etc., say, but DON'T have bigTime1's most frequent tag index of 688 (for instance), and so wouldn't sort next to (be listed with) "bigTime1".

These "deep" searches would find all the less than 100% matchups, with "bigTime1", according to the parameters you wanted. Slower to MUCH slower, to generate them, however.

This is from the old mock-up records I made. The changes from here would be using integers instead of strings for the indices, and the number of tag indices. These are 3 records:

    ID_Num, Handle, Tags

    15586,Kiley8,000006 132491 087788 267085 099922 131047 298498 239581 095062 215047 216418 248025 267561 248143 248852 231187 026509 129680 212938 280977 176819 184012 242773 245082 037352 239834 166644 278049 075762 051887 191466 068101 064549 284924 294071 284652 196786 075815 209430 035475 017855 021510 251755 079182 171772 201358 031288 248045 085435 236643 061892 003400

    33134,Armani4,000015 160992 087892 015314 248883 025498 050903 234671 032283 024205 158333 149015 221993 166817 269272 081446 163882 248665 289154 076987 177827 186588 257206 098729 114887 149150 093845 000675 000847 138932 183237 280090 173362 275840 055326 212182 075129 259438 102663 120051 043807 218065 015736 079185 077510 006908 228418 257205 143261 085293 065414 187175

    70590,Dean9,000016 081608 263707 260785 185801 124811 001337 018290 129068 053988 057260 294179 246803 190935 210991 094844 048533 023148 117229 162968 082529 248340 066775 137946 158218 201145 222409 122719 240801 285711 264593 085920 024181 093387 114034 272244 145922 072794 068228 229350 095042 274719 200472 298128 255067 103013 051658 187219 221255 031318 092209 046113

A constraint on all of this is time. With a project this large, more than one computer, or multiple cores, will be very useful, if not essential:

1) to get the raw data, on an ongoing basis, hopefully.

2) to process the data from raw format, into usable data, including the significant work of counting the frequency of all the hashtags, by all the users.

3) processes the usable data from #2, and output the users that are most similar, in shallow searches.

4) processes any deep searches.

The computing power is certainly available. I'm looking at a server with 16 individual cores on each of its 4 CPUs, for example (64 cores total), BUT you need a LOT of disk space to handle the data and processed info on millions of records. And a fast internet connection. ;)

2, 3, and 4, could benefit from using multiple cores/threads.

Just thoughts on a holiday. I'll contact you, LD.

Check the profile info details - just sort out the list of users who have the same website link and information; that would be easy for you too...

I've been wrestling with the "Quicksort is not a stable sort" gorilla - for a holiday diversion. ;)

The Quicksort gorilla was bad enough, but the Radix gorilla was REALLY touchy. Get him in a good arm bar, and he wants to bite, or throw malformed integers at you!

The initial hashtag column of one million records is sorted by Quicksort in about half a second, but there the fun ends. All the other 51 columns of hashtag indices have to be gone through, and the out-of-order ones corrected. You can't have Quicksort do it, unless you really slow it down.

So in run times, we're back to about where we were last week, at 17.1 seconds for 1 million records with 52 hashtags per record. That is ONLY the sorting, btw.

The advantage is that the hashtag values remain as digits, and don't have to be translated into strings and then put back to integers before they can be used.

This is an example of the first 100 rows of hashtags:

 0:  0   1 474 753 478 945 456 255   1 198 847 945 551  24 396 986 242  85 154 
 1:  0   3 299 465 382 737 460 803 301 771 179  50 801 262 483 330 815 905 349 
 2:  0   3 345  44 796  23 873  36 883 551 419 744 744 659 336 470 891 699  16 
 3:  0   3 849 827 747 316 243 938 517 197 597 700 653 425 613 307 298 446 293 
 4:  0   3 925  53 709 784 756 745 423 178 670  93 311 837 321 362 612 460 455 
 5:  0   4 873  86 923 163 182 771 644 454 353 918 845 682 210 997 429 755 772 
 6:  0   6 417 387 904  42 848  60 345 464 642 849 576 143 770 716 331 564 186 
 7:  0   6 832 185 660 326 499 338 910  34 978 664 625 260 107 298 186 979 132 
 8:  0   8 348 612 127 559 495 254 805 122 677 394 373 613 243 810 686 111 285 
 9:  0  10 154 192 615 536 896  45  15 445 418 637 784 145 829 849 351 313 243 
10:  0  11 522 647 514 445 715 990  98 406 592 955 431 266 864 113 107 958 138 
11:  0  12 333 455 930 249 427 629 609 426 967 981 928 488 462  97 778 363 331 
12:  0  12 361  19 428 960 195  48 826 329 672 520 579 620 729 422 291 659 187 
13:  0  16 767 659 471 118 680 993 124  49 583 252 279 107 962 435 192 486 419 
14:  0  16 966 710 733 518 635 786 823  48 252 343 111 925 984 362 707 572 591 
15:  0  17 654 931 719 475 350 774 603 248  61 441 451 212 806 450 635 904 114 
16:  0  19 691 146 561 202 679   1 752 965 701  25 437 870 898 537 858 575 464 
17:  0  20 976 306 427 291 532 412  19 157 639 885 976 321  56 268 570 518 302 
18:  0  21 193 144 243 660  18 852 615 723 330 703 663 576  50 298 584 875 131 
19:  0  22 398  74 875 882  38 249 931 199 732  32 154 573 960 950 950 909 364 
20:  0  22 405 307 198 711 412 774 394 432 396 319  93 783 994 134 963 504 622 
21:  0  24 115 423 464 806   1 457 739 253 135 909 456 590 381 893 357 742 608 
22:  0  25 450 853 240 990 249 169 798  10 533 545 534 449 522 559 104 926 153 
23:  0  25 463 335 908 489 405 391  51 640 763 720 221 598 413 634 383 690 243 
24:  0  27 533 441 604 117 202  83  94 775 308 738 551 248 638 733 393 156 543 
25:  0  27 796 943 401 592 894 783 425 326 940 523 286   8 198 229 636 702 509 
26:  0  28 110 295  95 114  66 871  57 989 475 766 940 929 759 376 940 943 688 
27:  0  29 557 109 343 650 105 219 889 460 148  18 183 213 313 700 998 231 515 
28:  0  30 296 196 704 905 178 308 377  51 542 352 819 922 371 855 996 923 542 
29:  0  32 442 885  57 918 597 860 546 727 783 814  97  67 365 817 839 852  74 
30:  0  32 771 597 742  99 255 632 643 316 317 644 298 256 532  41 594 638 565 
31:  0  33 829 853 851 720 993 822   6 615 207 313 566 822 317  69 665 429  27 
32:  0  33 920 848 920 180 651 515 264 420 461 143  67 827 522 216 382 989 748 
33:  0  35 130 640 960 325 881 218 315 962 453 390 269 124 489 881 648 291 888 
34:  0  36 149 297 104 248 995 599 644 505 938 811 788 485 302 425 807 345 124 
35:  0  38 145 488  51 719 269 136 572 914 193 874 845 189 351 786 411 689   6 
36:  0  38 837 687 725 958 108 502 753 944 794 298 799 326 428 118 636 828 392 
37:  0  39  17 818 886  43 629 135 656 657 563 340 437 664 281 735 105 870 434 
38:  0  39 215 110 391 412 970 531 564 589 967 247 158 259 569 331  14 598 607 
39:  0  39 312 757 916 764  84 497 901 660  18  99 268 832 946 102 857 302 966 
40:  0  39 735 769 860  92  76 150 852 699 646 618 938 542 895 829 256 258 225 
41:  0  41 996 250 385 420  35 814  80 551 931 869 269 322 467 955 265 367 652 
42:  0  42 600  55 374 355 294 176 893 338 284  30 887 894 742 663 411 157 589 
43:  0  43 668 426 329 264 299 577 309 623 788 447 224  71 933 250 800 474   8 
44:  0  43 989  71 332 696 433 770 549 501 581 131 731  51  62 243 609 396 285 
45:  0  44 485 672 957  33 198 423 190 280  32 379 907 409 854 971 399 827 798 
46:  0  46 565 684 262 715 952 357 769 142   1 865 681 928 930 606 817 346 441 
47:  0  47 282 598 719  73 363  76 541 103  96  96 224 104 926 446 330  17 861 
48:  0  47 389 253 895 940 753  57 486 288 265 443 918 581 678 782 572 132 293 
49:  0  47 559 174 135 228 645 777 669 345 798 491 742 553 907 401 179 818 539 
50:  0  48 840 318  31  66 518 893 533 158 922 718 419 981 538 777 462 363 702 
51:  0  52 618 826 174 322 732 634 270 293  79 940 839 545 567 289 105 797 601 
52:  0  53 951 977  49  37 242 300 977 457 232  45 399 809 608  32 744 320 168 
53:  0  54 486 248 213 320  95 823  84 736 821 283 294 770 431 674 987 101 791 
54:  0  54 577 217 820  15 308 255 506 184 326 165 574 847 426 134 440 414 823 
55:  0  54 623 942 228 169  53  58 335 247 814 831 686 446 365  75 421 882 892 
56:  0  56 645 222 980  16 414 295 164 225 593 168 118 811 655 583  55 970 648 
57:  0  56 664 356 244 625 112  84 878 892 296 646 174  72 245  11 839 152 170 
58:  0  57 247 829 174 468 162 907 334 406 826 714 111 704 314 911 971 963 768 
59:  0  57 603 320 215 272 686 877  31 279 832 799 946 483 289 253 231 699 829 
60:  0  59 200 793 520 575 975 937 871 431 456 429 167 382 588 852 719   6 294 
61:  0  59 278 668 361 286 814 707   8 897 264 770 370 282 117 442 437 960 984 
62:  0  61  25 525 165 328 537 854 123 507 915 277 426 387  94  61  65 248  85 
63:  0  62 589 462 470 227 458 581 735 141 962 490 606 547 688 618 246 466 797 
64:  0  63 160  46 841   0 130 650 242 218 631  95 117 421 347 419 121 981 653 
65:  0  63 209 378 103 131 553 702   0  46 445 867 902 321 225  74 987 985   7 
66:  0  63 212 666 711 622 448 647 373 639 513 324 805  50 310 979 313 698 709 
67:  0  66  79 269 656 348 452 659 870 696  46 198  33 377  97 908  27 907 666 
68:  0  68  12 823 986 807 246 110 960 168 439 875  15  77 710 199 423 356 265 
69:  0  69  99 507 487 152 592 291 541   5   7 414 287 576 384 814 924 180 322 
70:  0  71  40 707 928 246 564 627 284 446 514 199  92 855 629 625 351  24 728 
71:  0  71 398 127 489 902 485 382  31 271 783 376 801 753 714 370 454 641 854 
72:  0  72 188  69 525 686 707 840 971 271 934 879 509 730 137 777 145 247 478 
73:  0  72 246 115 501 205 244 802 631 807 809 782  44 597  69 553 654  88 768 
74:  0  73 402 276 146 184 839 654 652 606 472  46 438 633 433 699 365 644 378 
75:  0  73 550 105 312 194 690  24 296 902 570 540 262  35 917 607 209 644 236 
76:  0  73 643 399 909 903 149  77 154  47 790 260 336 300 965 744 380 325 232 
77:  0  76 335 425 913 331 647 779 872 995 474 332  64 331   4 708 786 505 604 
78:  0  78  84 604 622 639 211 187 410 961 753 658 537 380  20 676 856 165 960 
79:  0  78 385 681 281  34 409  92 756 976 139  21 940 949  44 118 285 132 909 
80:  0  83 624 737 980 665 480 260 843 355 245 703 329  20 683 425 838 891 129 
81:  0  83 934 649 661 599 370 503 498 482 859  59 215 640 576 171  55 371 684 
82:  0  93  62 735 801  52 237  32 224 888 973 230 439 459 962 457 236 956 232 
83:  0  93 953 799 427 504 119 846  24 504 779 998 465 683 357  55 647 998 395 
84:  0  94 300 580 647 220 312 557 496 923  45 266 211 334 548 572  58 577 363 
85:  0  95  46 403 629 736  52 738   0 887 421 128 504 109 647 275 338 690 955 
86:  0  96 864 852 941 892 349  75 504 911 867 483  54  33 984 167 632 125 321 
87:  0  97 301 523 422 155 162 427 252 142 816 283 729 507 909 163 687 475 588 
88:  0  98 230 448 811  36 597   7 548 945 417 613 465 116 795 243 414 656 201 
89:  0  98 570 144 955 309 187 957 195 891 340 306 667 167 979 713 237 643 659 
90:  0 100 661  51 433 665 802 212 568 423 130 928 883 228  84 393 162 921 565 
91:  0 101 126 315 768 246 775 818 626 696 312 607 563 614 136 572 595  56 344 
92:  0 101 540 361 181 993 723 935 106 308 873 942 489 502 518 818 751 234 492 
93:  0 102 561  98 993 973 375 151 407 224 868 706 694 201 429 596 624 615 181 
94:  0 102 727 160 441 691 874 726 934 829 174 961 899 758 353 443 546 815 897 
95:  0 102 745 558 561 119 843  45 240  83 112 931 652 598 220 381 310   2 694 
96:  0 104 877 460 680  77 314 331 957 803 373 762 174 642 350 904  37  98 106 
97:  0 105 643 680 617 444 522 223 223 479 599 352 236 978 442 231 210  77  70 
98:  0 106 216 344  28 566 288 560 764 304 630 288 269 580 165 803 345 854 372 
99:  0 107 663 754  13 171 619 919 756  32 443 799 625 454 160 690 880   7 312 

Records sorted: 1000000 Columns: 52 Time: 17.191000 seconds

With random values between 0 and 999 (so I can fit them on this page and debug the program easier).

There may very well be a way to cut this time down, without eating into the huge memory requirements the program needs.

Speaking of time, how long before this is expected to be done?

Good news! This "column wise" record sorting, which I've been struggling with, is well known as multi-key sorting.

So the sort times will be able to be cut a GREAT deal. I should have seen that, but missed it.
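
For example, with each record's tag indices in one row of a 2D int array, all the keys can be handled by a single comparator, and one qsort call does the whole multi-key sort (a sketch - the row width is cut from 52 down to 4 here, to keep it readable):

#include <stdio.h>
#include <stdlib.h>

#define COLS 4   /* 52 in the real program */

/* Compare two records column by column, left to right - one comparator
   covers all the keys, so qsort makes a single pass over the rows. */
int cmpRow(const void *a, const void *b) {
   const int *ra = (const int *)a, *rb = (const int *)b;
   int i;
   for(i = 0; i < COLS; i++)
      if(ra[i] != rb[i])
         return (ra[i] > rb[i]) - (ra[i] < rb[i]);
   return 0;
}

int main(void) {
   int recs[3][COLS] = {
      { 0, 3, 925,  53 },
      { 0, 3, 299, 465 },
      { 0, 1, 474, 753 },
   };
   int i;

   qsort(recs, 3, sizeof recs[0], cmpRow);
   for(i = 0; i < 3; i++)
      printf("%d %d %3d %3d\n", recs[i][0], recs[i][1], recs[i][2], recs[i][3]);
   return 0;
}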

I am thinking of a simple and easy way. For example, just store each user as an object inside an array, then compute the similarity of each pair. Before computing the hashtag similarity, sort the hashtags alphabetically (search time complexity O(log n)); the total time complexity will then be about n * n * O(log n), which still seems very slow. I think I have to consider your way carefully, and try to understand it and use it. Thanks, Adak.

A struct in C corresponds (roughly) to an object in OOP programming. Each record would then be an "object", and other record fields can be added as needed.

If you make the hashtags in the record strings, instead of numbers (indexes) into a master hashtag list, then "deeper" searches will be much slower. You'll be making a simple number comparison in the first case, but several actual string comparisons will be needed in the second case, since you will no longer have the index of the string.

There would need to be a master list of hashtags kept, which would enumerate all the strings.

You're welcome. I'm working on a permutation program tonight, but tomorrow I'll get back into refining the sorting on those hashtags.
