Big speed up from multi-key sorting, etc. 2 seconds for a million records.

Uses a bit more memory than I'd like, but this is with 52 hashtag indices per record.

Dear Adak,

 Thanks a lot for your sincere help. You have listed some results and computation times, and I believe you can do it. From your many ideas, can you list the best and most feasible one, step by step in point form, so that I can fully understand your latest idea? The sorting problem is still not easy for me, and it seems very time-consuming and troublesome. Please use point form, and try to give short, simple examples so I can understand better. Thanks very much.

You can post my problem to any forum, since some pro there may be able to solve it. It's not confidential at all.

Looking at the bigger picture, it would be best to use database software, to handle this quantity of data, and provide you with the info from that data, in the report format you need. For an academic situation, maybe you are required to do this with your own program(s).

Moving on, then:

1) If you judge hashtag (tag) matching to be part of the similarity between users, then you need a way of gathering up the raw tag data quickly and putting it into your servers.

Each tag will go into a file, with no sorting, because its placement is fixed. New tags are appended to the end of the tag file. Each tag is given a number, matching its row number in the list of tags.

Each user you record will have their name in the user file, and in their own file, they will have the list of all the tags they have used - but not the tag text, just the tag row numbers (the indices, if you prefer). These numbers will be counted up for frequency of use, and the 50 or so most frequent tags will be recorded in a user's data file. Minimum data: just an ID number, a name/handle, and the 50 or so most frequently used tags, by index number.

You must have this before any of the steps below can work.
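Step 1)'s frequency count could be sketched like this in C (top_tags(), MAX_TAGS, and TOP_N are names and sizes I've assumed for illustration, not fixed by anything above):

```c
#include <stdio.h>
#include <string.h>

#define MAX_TAGS 100000   /* assumed upper bound on distinct tags */
#define TOP_N    50       /* keep a user's 50 most frequent tags  */

/* Count how often each tag index appears in one user's tag list,
   then record that user's TOP_N most frequent tag indices.
   Uses a static counting array, so it is not reentrant. */
void top_tags(const int *used, int n_used, int *top)
{
    static int freq[MAX_TAGS];
    memset(freq, 0, sizeof freq);

    for (int i = 0; i < n_used; i++)
        freq[used[i]]++;

    for (int k = 0; k < TOP_N; k++) {      /* simple selection of the TOP_N maxima */
        int best = 0;
        for (int t = 1; t < MAX_TAGS; t++)
            if (freq[t] > freq[best])
                best = t;
        top[k] = best;
        freq[best] = -1;                   /* don't pick the same tag again */
    }
}
```

For a million users you would run this per user as the raw data comes in, and write each user's TOP_N indices to the data file.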

2) Now open the data file made at the end of step 1), read the data into an array, and sort the user records by their index numbers, comparing from left (most frequently used) to right (least frequently used).

All users with 100% matching hashtags will now be right next to each other. Users whose first 5 tags match will be close, but not right next to each other. A similarity score could be made just like a batting average in baseball: out of 50 tags (times at bat), how many are matches (hits)? Number of matching tags divided by 50 equals the similarity score. A score of 1 is perfect; .500 would be 50% (remember that percent means hundredths, so 50 divided by 100 equals .50).
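The batting-average score could look like this in C (similarity() and TOP_N are names I made up for this sketch):

```c
#define TOP_N 50   /* assumed: each user keeps their 50 most frequent tag indices */

/* "Batting average" similarity: matching tags out of TOP_N.
   1.0 is a perfect match; .500 means half the tags match. */
double similarity(const int a[TOP_N], const int b[TOP_N])
{
    int matches = 0;
    for (int i = 0; i < TOP_N; i++)
        for (int j = 0; j < TOP_N; j++)
            if (a[i] == b[j]) {
                matches++;
                break;          /* count each of a's tags at most once */
            }
    return (double)matches / TOP_N;
}
```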

A shallow search would find all the highly matching users, as long as their most frequent tags matched. This could be done quickly, using the index numbers, either for a group (everyone with > 95% matches to anyone else), or for an individual user "Juan Li".
This would be a quick and PARTIAL report on similarity by comparing matching tags.

The binary search I posted earlier, can be adjusted to do this, as you wish.

A deeper search to find users who matched well but did NOT have their most frequent tags match (lots of other tags did match though), would take a lot more time. This would be the FULL report on similarity by comparing matching hashtags.

Making the deeper search for every user would require more time, software, and hardware. Not a job for a novice, imo. The similarities would change over time, of course, so it would need frequent updates.

In general, with everything you do with the data, you set up one record field that ties into what another function or program will need to do its job, without having to repeat the same work, in whole or in part, a second time.

I don't see sorting the tags as a bottleneck any more. You need to sort the huge amount of data in "buckets" anyway - small enough that you can view and edit them in an editor, and fit them into files without any file growing so big that the operating system fragments it into pieces stored across a large number of different sectors on the disk. :(

After you've sorted the large amount of data into "buckets", you will write out the data to a file whose name will include its number: 10sorted.txt might be the name your program makes up for the 10th file of sorted records and tags. The program will remember that number 10, so it knows it needs to merge files 0 through 10 (or 1 through 10, if you prefer).
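Building that file name is a one-liner with snprintf() (bucket_name() is a made-up helper; the pattern follows the "10sorted.txt" example above):

```c
#include <stdio.h>

/* Build the name of the Nth sorted bucket file, e.g. "10sorted.txt". */
void bucket_name(char *buf, size_t size, int n)
{
    snprintf(buf, size, "%dsorted.txt", n);
}
```

Calling bucket_name(buf, sizeof buf, 10) fills buf with "10sorted.txt".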

Merge sort is great for this. Whole files are merged in very short order. You might do a multi-tier merge - it's very elegant and fast:

I'll describe it in a follow-up post. This one is getting too long.

Use arrays, of course. The arrays can be either global (declared outside a function, usually above main()), or preferably, created with malloc(). Be sure to include stdlib.h for malloc().

Although I have preferred one style of Quicksort, I am starting to really warm up to qsort() for these sorting jobs. The only problem with qsort is the darn syntax for the parameters. It requires void pointers, which means casts are needed for both the call to qsort, and for the parameters to the compare function you make.

Just look it up, and write it down - one for strings, and another syntax example for integers, should get you started.

qsort(A, MAX, sizeof(A[0]), compare_int); //calling qsort

where A is the array name, MAX is the number of elements to sort (technically a size_t - note it is a count, not the highest index), sizeof(A[0]) is the size of one element of A, and compare_int is the name of the compare function that you want qsort() to use.


int compare_str(const void *a, const void *b) {
    /* a and b point at char * elements, so cast and dereference */
    return strcmp(*(const char **)a, *(const char **)b);
}

The compare functions are typically tiny, as above.


int compare_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);  /* safer than x - y, which can overflow */
}

For multi-key comparisons, the compare function looks like this:

int compare ( const void *a, const void *b ) {
  const struct user *pa = (const struct user *)a;
  const struct user *pb = (const struct user *)b;
  for ( int i = 0 ; i < LEN ; i++ ) {
    if ( pa->data[i] < pb->data[i] ) return -1;
    if ( pa->data[i] > pb->data[i] ) return +1;
  }
  return 0;
}
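Here is a complete, self-contained version of that multi-key compare (struct user and LEN = 3 are assumptions for the sketch - your record will have more fields):

```c
#include <stdlib.h>

#define LEN 3   /* keys per record - assumed for this sketch */

struct user {
    int data[LEN];   /* tag indices, most frequent first */
};

/* Compare on the first key, break ties with the second, then the third. */
int compare(const void *a, const void *b)
{
    const struct user *pa = (const struct user *)a;
    const struct user *pb = (const struct user *)b;
    for (int i = 0; i < LEN; i++) {
        if (pa->data[i] < pb->data[i]) return -1;
        if (pa->data[i] > pb->data[i]) return +1;
    }
    return 0;   /* all keys equal */
}
```

Then qsort(users, n, sizeof users[0], compare); sorts the records so that users with identical tag lists end up adjacent.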

When your input data is unsorted, and you need to sort just the part you have in memory at the moment, you'll find it easiest to write out all the data to one or more files. Try to keep the number of files written out an even number. Keep that last file number for the raw data - 4 in the example below. Keep the highest number and letter for every tier processed, so your program can find the proper file next.

Every two files are then merged with mergesort:

Raw Data tier1  tier2    more tiers may be needed

  1data \
         1Adata \
  2data /        \
                  \ 1Bdata   <====Final sorted output file
  3data \        /
         2Adata /
  4data /
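The two-file merge step in the diagram could be sketched like this (merge_files() is a made-up name; it assumes each line fits in the buffer and that both inputs are already sorted in strcmp() order):

```c
#include <stdio.h>
#include <string.h>

/* Merge two line-sorted text files into one sorted output file. */
void merge_files(const char *in1, const char *in2, const char *out)
{
    FILE *a = fopen(in1, "r"), *b = fopen(in2, "r"), *o = fopen(out, "w");
    char la[1024], lb[1024];

    if (!a || !b || !o) {               /* bail out if any file failed to open */
        if (a) fclose(a);
        if (b) fclose(b);
        if (o) fclose(o);
        return;
    }

    int ha = (fgets(la, sizeof la, a) != NULL);
    int hb = (fgets(lb, sizeof lb, b) != NULL);

    while (ha && hb) {                  /* copy the smaller line, advance that file */
        if (strcmp(la, lb) <= 0) { fputs(la, o); ha = (fgets(la, sizeof la, a) != NULL); }
        else                     { fputs(lb, o); hb = (fgets(lb, sizeof lb, b) != NULL); }
    }
    while (ha) { fputs(la, o); ha = (fgets(la, sizeof la, a) != NULL); }  /* drain leftovers */
    while (hb) { fputs(lb, o); hb = (fgets(lb, sizeof lb, b) != NULL); }

    fclose(a); fclose(b); fclose(o);
}
```

Merging 1data with 2data into 1Adata, then 3data with 4data into 2Adata, then 1Adata with 2Adata into 1Bdata, gives you the final sorted file - each merge pass only reads each line once.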

You can merge more than two files at a time, by using more than 1 thread and/or process.

You can see the obvious advantages of having an even number of files - and ideally a power of two, at that. Six is an even number, for instance, but because it breaks down into two odd file counts (3 and 3), it's more trouble to handle than, say, 4 or 8 files.

The number and letter combo tell your program:
1) What file should be opened next, and what name it should be given.

2) The tier level of the file - A is the first tier level, B is the second tier level, etc.
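Generating those number-and-letter names is simple with snprintf() (tier_name() is a made-up helper; tier 1 maps to 'A', tier 2 to 'B', following the naming in the diagram):

```c
#include <stdio.h>

/* Build tier file names like "1Adata" or "2Bdata". */
void tier_name(char *buf, size_t size, int file_no, int tier)
{
    snprintf(buf, size, "%d%cdata", file_no, 'A' + tier - 1);
}
```

Your program keeps the highest file_no and tier it has written, so the next merge pass can reconstruct the names it needs to open.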

If you have a million or so lines of data in a file - stop. You need to be able to open it with an editor, view it, and possibly edit it - and you can't do that with files that are just way too big.

Try your favorite editor and see what file size it can handle.

Compression would be good for archive purposes, but it will substantially slow down processing of current data, and should be avoided.

Thanks Adak. I am thinking of using a database for the hashtags, and Latent Dirichlet Allocation for tweet similarity. The reference is below; it seems a very good and feasible way to get the similarity.

The way you listed looks very complicated.

That's what database software (and the operating system itself), will do, behind the scenes.

Even more, actually. They may not do it as fast, depending on what you want, but they are very flexible, and generally have decades of programmer-years behind them, in making it right.

If you have a program and a large file of strings for instance, just enter this at a command prompt (in a console window):

sort <nameOfInputFile >nameOfOutputFile

In the case of Windows, that will launch multiple threads or processes, and give you a guaranteed amount of resources, also. Unless you know how to use those extra resources, even the best sorting algorithm can't beat Windows sort command.

You never have to use multi-tier merging, but it is elegant, and fast. Not as difficult as it looks - you keep the number of the highest file at each tier, in a simple int array or variable.

Good luck with your project.

In fact, I feel you are very smart, and I believe your way is feasible and would give better performance; it just looks a bit tedious, and is not very easy to grasp or implement. May I know your email? I'll send you my current ideas.

I'm still yakking with my ISP about a handle that I want. :(

If you want to run your own program to gather and make statistical reports on 300 million people - it's going to need some "tedious" code work. You can't fit 300 million records into memory, and still work with it, all at the same time.

Since you want to avoid that coding, then clearly a database is the way to go. I can't help you with database questions, however. Haven't used one in 20 years, and they bore me to tears. ;)

Go to the forum that specializes in your database. There, you'll find the most experienced users to answer your questions.
