I am trying to read a file which contains URLs and are 100 million in number. What I need to do is find out the different websites they are from. So, I am taking chunk of data in memory and reading it line by line. Also, I need to find out how many URLs does each website has in the file and what are those URLs. The way I figured it out is ro have a class domain which is:

class domain{
public:
	string domainname;
	int nolinks;
	string URLs;
	domain()
	{
		nolinks=1;
	}
};

and then declare a hash_set in which I insert domain objects. This is the declaration of hash_set

class hash_fnc:public stdext::hash_compare<std::string>{
public:
	/*enum{
		bucket_size=1024,
		min_buckets=8
	};*/
	size_t operator()(const domain& d)const
	{	 
	size_t h = 0;
    std::string::const_iterator p, p_end;
	for(p = d.domainname.begin(), p_end = d.domainname.end();    p != p_end; ++p)
    {
      h = 31 * h + (*p);
    }
	return h;
	}

	bool operator()(const domain& x,const domain& y) const
	{
		return x.domainname.compare(y.domainname)<0;
	}

};

hash_set<domain,hash_fnc>_domain;
pair<hash_set<domain,hash_fnc>::iterator,bool>ret;

While reading the file, for each URL ex. www.cleaned.be/forum/index.php?showuser=1, I take the website www.cleaned.be, create a domain object with domain.domainname="www.cleaned.be" and insert it to my hash_set. Whenver I encounter another URL of the same website, I try to check if it already exists and increase the count domain.nolinks by 1 and append the URL to domain.URLs. This is the code block for that:

domain X;
						X.domainname="www.somedomain.com";
						//X.URLs.assign("www.somedomain.com/index/____/x.html");
						ret=_domain.insert(X);
			if(ret.second==false) //it already exists
						{
				(ret.first)->nolinks++;
				(ret.first)->URLs.append("\n ");
(ret.first)->URLs.append("www.somedomain.com/index/____/x.html");

}
I do this for every line or URL in the file. Now the problem:

Although this worked out really well for an ordinary set<> , it is not for hash_set<>. The URLs are not getting added to the correct domain and my computer turns off while running this program sometimes. Also, the output is now missing almost half the domains and also messed up.Obviously I am making huge mistakes. So, please try to help me. I'll really be thankful.

Recommended Answers

All 8 Replies

Stupid question: if it works for set<>, why not use set<> instead of hash_set<>? I know it's the difference between a balanced BST and a hash table, but if set<> isn't too much slower you should use the way that works right now instead of fix something broken.

I would have done that long before had it not been slow. You see as the size of table increase, my insert() time increases and its slowing down the program. That's why I figured out, I needed a hash table.

Here, I am pasting a small part of the garbled output that I am getting:

www.mvbbs.net:26 

www.mvbbs.net/forumdisplay.php?fid=2
www.mvbbs.net/forumdisplay.php?fid=2
ladiesofnuevolaredo.com/index.php?action=profile;u=665
www.onepiecesp.com/foro/index.php?action=profile;u=1
 vsedlyavsex.com/forum/showthread.php?t=33188
 www.sendbiz.cn/forumdisplay.php?f=18
 www.themoneyforum.com/forumdisplay.php?f=60
 www.mvbbs.net/forumdisplay.php?fid=117
 www.mvbbs.net/forumdisplay.php?fid=117
 www.myoutdoortv.com/forum/forumdisplay.php?f=76
 www.benchracers.com/forum/forumdisplay.php?f=11
 www.almuhaphda.com/vb/member.php?u=849
 www.mvbbs.net/forumdisplay.php?fid=115
 www.mvbbs.net/forumdisplay.php?fid=115
 www.mvbbs.net/forumdisplay.php?fid=48
 www.mvbbs.net/forumdisplay.php?fid=48
 www.forum.foxit.ru/showthread.php?t=41
 www.type2chat.com/forumdisplay.php?f=176
 www.mvbbs.net/forumdisplay.php?fid=49
 www.mvbbs.net/forumdisplay.php?fid=49
 www.climatechangeideas.net/index.php?action=profile;u=32
 www.mvbbs.net/forumdisplay.php?fid=58
 www.mvbbs.net/forumdisplay.php?fid=58
 www.mvbbs.net/forumdisplay.php?fid=51
 www.mvbbs.net/forumdisplay.php?fid=51
 www.pojo.biz/board/showthread.php?t=365567

As we can see here, the following domains :

www.mvbbs.net
www.pojo.biz
www.type2chat.com
www.climatechangeideas.net
www.forum.foxit.ru
www.alumhaphda.com
www.themoneyforum.com
...
...
...

are all getting hashed to the same to the same value. But the thing to notice is the way the count of www.mvbbs.net is incremented to 26 whereas it should not have been the case. I hope the problem is clear. It'll be great if someone can point out the faulty part. Its the first time I am using STL and hash tables. I know I am wrong in the implementation but just need to know where.

bool operator()(const domain& x,const domain& y) const
{
return x.domainname.compare(y.domainname)<0;
}

This is equal operation for Hash Conflict. Maybe you need to change it as

x.domainname.compare(y.domainname)==0;

sorry my change is not properly for your application. But the issue is at that point. y equals x if and only if y is same as x in length of x. So your implementation does not match that rule.

sorry my change is not properly for your application. But the issue is at that point. y equals x if and only if y is same as x in length of x. So your implementation does not match that rule.

But isn't the function seem to be giving proper ordering based on the URL string. I guess matching lengths alone cannot conclude that two strings match. Also, this comparator functor worked absolutely fine with <set> giving proper ordering. Correct me if I am wrong.

no replies..........???????????


any suggestion , any help ...people

the issue is in hash function, the equal and hash code operations.
The rule is if two items are equal, their hash code must be same too!!!
you break the rule so you failed.
For example,
A is "www.hello1.com"
B is "www.hello2.com"
In your current logic A equals B. But their hash code are different.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.