The core of the core of the big data solutions -- Map

Question

pww71 0 Newbie Poster

10 Years Ago

Title: The core of the core of the big data solutions -- Map
Author: pengwenwei
Email: pww71@sina.com
Language: c++
Platform: Windows, linux
Technology: Perfect hash algorithm

Level:       Advanced
Title:       The core of the core of the big data solutions -- Map
Author:      pengwenwei
Email:       pww71@sina.com
Language:    c++
Platform:    Windows, linux
Technology:  Perfect hash algorithm
Level:       Advanced
Description: Map algorithm with high performance
Section      MFC c++ map stl
SubSection   c++ algorithm
License:     (GPLv3)

    Download demo project - 1070 Kb
    Download source - 1070 Kb

Introduction:
For the c++ program, map is used everywhere.And bottleneck of program performance is often the performance of map.Especially in the case of large data,and the business association closely and unable to realize the data distribution and parallel processing condition.So the performance of map becomes the key technology.

In the work experience with telecommunications industry and the information security industry, I was dealing with the big bottom data,especially the most complex information security industry data,all can’t do without map.

For example, IP table, MAC table, telephone number list, domain name resolution table, ID number table query, the Trojan horse virus characteristic code of cloud killing etc..

The map of STL library using binary chop, its has the worst performance.Google Hash map has the optimal performance and memory at present, but it has repeated collision probability.Now the big data rarely use a collision probability map,especially relating to fees, can’t be wrong.

Now I put my algorithms out here,there are three kinds of map,after the build is Hash map.We can test the comparison,my algorithm has the zero probability of collision,but its performance is also better than the hash algorithm, even its ordinary performance has no much difference with Google.

My algorithm is perfect hash algorithm,its key index and the principle of compression algorithm is out of the ordinary,the most important is a completely different structure,so the key index compression  is fundamentally different.The most direct benefit for program is that for the original map need ten servers for solutions but now I only need one server.
Declare: the code can not be used for commercial purposes, if for commercial applications,you can contact me with QQ 75293192.
Download:
https://sourceforge.net/projects/pwwhashmap/files/pwwHash.zip/download

Applications:
First,modern warfare can’t be without the mass of information query, if the query of enemy target information slows down a second, it could lead to the delaying fighter, leading to failure of the entire war. Information retrieval is inseparable from the map, if military products use pwwhashMap instead of the traditional map,you must be the winner.

Scond,the performance of the router determines the surfing speed, just replace open source router code map for pwwHashMap, its speed can increase ten times.
There are many tables to query and set in the router DHCP ptotocol,such as IP,Mac ,and all these are completed by map.But until now,all map are  using STL liabrary,its performance is very low,and using the Hash map has error probability,so it can only use multi router packet dispersion treatment.If using pwwHashMap, you can save at least ten sets of equipment.

Third,Hadoop is recognized as the big data solutions at present,and its most fundamental thing is super heavy use of the map,instead of SQL and table.Hadoop assumes the huge amounts of data so that the data is completely unable to move, people must carry on the data analysis in the local.But as long as the open source Hadoop code of the map changes into pwwHashMap, the performance will increase hundredfold without any problems.


Background to this article that may be useful such as an introduction to the basic ideas presented:
http://blog.csdn.net/chixinmuzi/article/details/1727195

algorithm c++ data-structure open-source windows-virus

3 Contributors
3 Replies
260 Views
3 Hours Discussion Span
Latest Post 10 Years Ago Latest Post by mike_2000_17

All 3 Replies

mike_2000_17 2,669 21st Century Viking

10 Years Ago

I command you for your enthusiasm with this project of yours. But Schol-R-LEA is right, you are getting dangerously close to spamming with these posts. I deleted the other duplicate posts you made.

About your project.... I'm afraid you have a lot to learn. It pains me to say that because you clearly put a lot of effort into it, but I just can't let you go on believing in your own illusions.

Nearly everything you said in your post is wrong.

For the c++ program, map is used everywhere.

That's wrong. The std::map class is one of the least used C++ containers in production code.

And bottleneck of program performance is often the performance of map.

Nobody would use std::map for heavy-lifting or high-performance code.

Especially in the case of large data

For large amounts of data, people use sophisticated data base engines that are highly tuned for performance. The std::map class is a simple class for simple, small and non-demanding applications, every professional knows that.

unable to realize the data distribution and parallel processing condition

To enable parallel processing, the key is to segment the data into partitions of elements that avoid false sharing on the memory architecture and find suitable ways to resolve data-races (such as fine-grained locks, lock-free schemes, or a hybrid like software transactional memory).

The map of STL library using binary chop, its has the worst performance.

That's not true. The typical std::map implementation uses a red-black tree, which is a binary search tree. There is a big difference between binary search and a binary search tree. And it's certainly not the "worst" performance. The std::map can be fairly competitive in terms of performance when you have between 50 and 500 elements, in general. Below about 50 elements, you would use std::vector instead (and using linear search, not binary search). Above about 500 elements, you would be using a hash-map. That's just a basic rule of thumb, and it will vary a lot depending on the types of elements your storing and the relative cost of comparisons versus hash value computations. But the point is that a map is not the "worst", but it would be more accurate to say that it's rarely the best.

Google Hash map has the optimal performance and memory at present, but it has repeated collision probability.

How do you know that it has optimal performance? I'm sure it has good performance, but you can't say it's optimal. Optimality is a theoretical thing, it doesn't apply to real life.

Now the big data rarely use a collision probability map,especially relating to fees, can't be wrong.

All hash functions will lead to collisions, and dealing with collisions in a hash-map is absolutely necessary. The google dense-hash implementation uses a quadratic probing into the table to resolve collisions, which is a very typical and performs well for moderately sized hash-maps. The standard C++ hash-map, called std::unordered_map uses single linked-lists as its buckets to handle collisions. These are the two most typical approaches used, but there are others.

And I don't get why you say "can't be wrong". Having a collision does not mean that you just return an error and give up. Collisions are resolved by placing colliding elements somewhere else, that's it.

My algorithm is perfect hash algorithm,its key index and the principle of compression algorithm is out of the ordinary,the most important is a completely different structure,so the key index compression is fundamentally different.

There is no such thing as a perfect hash function, except for when the input data is perfectly predictable in some way. You should read deceptikon's article on hashing for a better explanation of this.

I was curious about what was the hashing function you were using, so, I downloaded your code and checked it out. Your hashing function is actually not included in your code, it's missing the pwwHash.cpp file. So, I can't judge it. But in any case, a hash-map implementation that does not let you specify the hash function that should be used is completely useless. Choosing a proper hash function is highly dependent on the probability distribution of the data that you plan to store in it, as explained in deceptikon's article. And so, I can safely say that your hash function is far from being perfect. And the idea that you don't have collisions with your hash functions is just crazy, such a thing does not exist, it would break the laws of information theory. One way or another, you must have got this all wrong.

And you say it is "out of the ordinary", how? It is "fundamentally different", how?

As for the code that I have seen. I can say that it is very poorly written. I have spotted more bugs in 10 minutes of looking at the code than I could possibly list here. And from the looks of it, your implementation of the "search tree" is a single linked-list, which makes no sense whatsoever. First of all, the whole point of a hash-map is to not use a search tree. And second, you can't create a search tree with a single linked-list. And finally, if you are using your hash values to index into a linked-list, then the performance would be horrible.

Your code has a tremendous amount of unnecessary complexity. You use a lot of C'isms (like malloc / free). Way more auxiliary, dynamically-allocated memory than you could possibly need to implement such a simple data structure. You waste a lot of memory for things that could so obviously be avoided. There are a number of things that are incomprehensible, such as why you need m_pwwMap to be an array (since you only ever use one element of it). You have no concern for exception-safety. And finally, I did not see any re-hashing functions in your code, a dynamic-size hash-map without re-hashing is impossible.

Frankly, if I were reviewing this code and you were working for me, I'd fire you. Because this code is useless for production. Not to mention that the style is horrendous.

Also, I noticed that you say on your sourceforge page that your price for technology transfer is 2,000,000,000 dollars.... wow... 2 billion dollars... I would not even pay a penny for this.

If you wish to learn how hash-maps are properly written in C++, study the implementation of unordered containers from Boost.Intrusive. At the very least, you have to compete with the standard std::unordered_map class template, which is hard to beat.

Edited 10 Years Ago by mike_2000_17 because: note

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

Schol-R-LEA 1,446 Commie Mutant Traitor Featured Poster · Answer 1 · 2015-03-25T03:33:46+00:00

All well and good, and of some definite interest in general, but... why post this here, except as a way of advertising your blog and thesis? This forum is mostly for assisting others in finding solutions to problems; your neither ask a question nor (within the post) provide an answer. While a few of us will appreciate the contents of your thesis, simply posting the abstract and expecting us to go to your blog for the details is simply inappropriate.

Also, intentionally posting the same material repeatedly is a clear violation the forum rules against spamming. In a persistent message board like this, there is little need to keep sending the same message over and over again, and it is extremely rude to do so.

pww71 0 Newbie Poster · Answer 2 · 2015-03-25T03:55:20+00:00

pww71 0 Newbie Poster

10 Years Ago

Yes I should post it to the discussion.

The core of the core of the big data solutions -- Map

Recommended Answers Collapse Answers

All 3 Replies

Recommended Answers