vector duplicates removal

Question

heidik 0 Light Poster

14 Years Ago

Hello Everyone.

Could anyone please help me quickly find duplicates in a vector of over 60000 records. It takes a loooootttttt of time ;'(. Here's the code.

#include <string>
#include <vector>
#include <iostream>
#include <algorithm>
#include <sstream>

struct MyPred
{
	std::string x;
	std::string y;

	// add default values
	MyPred(const std::string& x = "", const std::string& y = ""): x(x), y(y) {}

	bool operator==(const MyPred& p) const
	{
		return x == p.x && y == p.y;
	}

	bool operator<(const MyPred& p) const
	{
		if(x < p.x) return true;
		if(x > p.x) return false;
		if(y < p.y) return true;
		if(y > p.y) return false;
		return false;
	}
};


int main()
{
	std::vector<MyPred>* vPred = new std::vector<MyPred>;
	MyPred pred;

	for(int i = 0; i < 60000; i++)
	{
		std::stringstream ss1;
		ss1 << i;
		pred.x = ss1.str();
		std::stringstream ss2;
		ss2 << i+1;
		pred.y = ss2.str();
		vPred->push_back(pred);
	}

	for(int i = 10; i < 25; i++)
	{
		std::stringstream ss1;
		ss1 << i;
		pred.x = ss1.str();
		std::stringstream ss2;
		ss2 << i+1;
		pred.y = ss2.str();
		vPred->push_back(pred);
	}

	for(int i = 35000; i < 35005; i++)
	{
		std::stringstream ss1;
		ss1 << i;
		pred.x = ss1.str();
		std::stringstream ss2;
		ss2 << i+1;
		pred.y = ss2.str();
		vPred->push_back(pred);
	}

	/*vPred->push_back(MyPred("a", "a"));
	vPred->push_back(MyPred("a", "b"));
	vPred->push_back(MyPred("a", "a"));
	vPred->push_back(MyPred("b", "a"));
	vPred->push_back(MyPred("a", "b"));
	vPred->push_back(MyPred("a", "c"));
	vPred->push_back(MyPred("b", "b"));
	vPred->push_back(MyPred("a", "a"));
	vPred->push_back(MyPred("a", "a"));*/

	// The values need to be in order for equal_range() to work
	std::sort(vPred->begin(), vPred->end());

	std::vector<MyPred> uPred; // values that were always unique
	std::vector<MyPred> dPred; // values that were duplicated

	std::pair<std::vector<MyPred>::iterator, std::vector<MyPred>::iterator> ret;

	for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); i = ret.second)
	{
		ret = std::equal_range(i, vPred->end(), *i);
		
		if(ret.second - ret.first != 1) // duplicates
		{
				for(std::vector<MyPred>::iterator j = ret.first; j != ret.second; ++j)
				{
					dPred.push_back(*i); //put each duplicate onto a new vector
				}
		}
		else if(ret.second - ret.first == 1)
		{
			uPred.push_back(*i);
		}
	}

	std::cout << "vPred: Sorted input\n";
	for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}

	std::cout << "dPred: Only the values that were duplicated\n";
	for(std::vector<MyPred>::iterator i = dPred.begin(); i != dPred.end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}

	std::cout << "uPred: Only the values that were unique\n";
	for(std::vector<MyPred>::iterator i = uPred.begin(); i != uPred.end(); ++i)
	{
		std::cout << "[" << i->x << ", " << i->y << "]" << '\n';
	}

	delete vPred;
}

c++

3 Contributors
6 Replies
111 Views
10 Hours Discussion Span
Latest Post 14 Years Ago Latest Post by heidik

daviddoria 334 Posting Virtuoso

14 Years Ago

Here is my initial feedback:
1) You should make variable names and class names as descriptive as possible. What is a "Pred"? What is 'vPred'? What is 'x' and 'y'? Why are they strings?

2) (1) is also aided significantly by the heavy use of comments!

3) What is your goal? You may be better off using another container besides std::vector (std::set comes to mind) if you are trying to simply store unique elements).

Give us some more clues and we'll try to help further :)

David

Ancient Dragon 5,243 Achieved Level 70

14 Years Ago

Use <map> and you won't have to be concerned about duplicates.

Or sort the vector first to make it easier to find duplicates.

The operator< on line 20 will not work correctly because it will never reach the two lines that test for y (lines 24, 25 and 26 are unreachable).

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

heidik 0 Light Poster · Answer 1 · 2010-10-24T19:40:20+00:00

Thanks for your reply David :). Actually it is just an example that's why I have used variable names like a, x, y, z :P. They are strings because I am reading from a file in the actual assignment which has more than 60000 records (lines). Those strings are a combination of both strings and ints. pred is a struct object. vPred is a struct type of vector.

Could you please help me complete this assignment. The first and last string elements of each line (in that file of 60000 records) are start date/time and end date/time respectively. I have to find the duplicates in original vector where all string elements in a line are the same except the first element i-e start date/time. One line of file can have many duplicates i-e 2 or even 3 e-g. I have to find the difference between the minits of the start date/time and the minits of the end date/time and push_back the element from the duplicates to the original vector where the time difference is the minimum. e-g

Start date/time of '1' element is 2006/06/01 16:34:43
End date/time of that element is 2006/06/01/ 16:55: 51

Start date/time of duplicate element (of '1') is 2006/06/01 16:24:43
End date/time of that element is 2006/06/01/ 16:55: 51 (same as '1')

I now have to find the difference between 34 (minits of '1' element's start date/time) and 55 (minits of '1' element's end date/time) which is equal to 21

and the difference between 24 (minits of duplicate (of '1') element's start date/time) and 55 (minits of duplicate (of '1') element's end date/time) which is equal to 31.

since 21 is less than 31 so I will push_back the element with time difference of 21 to the original vector and will discard all its duplicates.

like wise finding the time difference between all the duplicates of element '1' as it can have many duplicates and not just one.

Could you help me do this? I hope I have explained well. For the time being I have put random int+string combination to struct first (start date/time) and last (end date/time) member variables instead of proper date/time to find the difference between the int part of that variable.

#include <string>
#include <vector>
#include <iostream>
#include <algorithm>

struct MyPred
{
	std::string a;
	std::string x;
	std::string y;
	std::string z;

	MyPred(const std::string& a, const std::string& x, const std::string& y, const std::string& z): a(a), x(x), y(y), z(z) {}

	bool operator==(const MyPred& p) const
	{
		return x == p.x && y == p.y && z == p.z; // a == p.a && 
	}

	bool operator<(const MyPred& p) const
	{
		//if(a < p.a) return true;
		//if(a > p.a) return false;
		if(x < p.x) return true;
		if(x > p.x) return false;
		if(y < p.y) return true;
		if(y > p.y) return false;
		if(z < p.z) return true;
		if(z > p.z) return false;
		return false;
	}
};


int main()
{
	std::vector<MyPred>* vPred = new std::vector<MyPred>;
	vPred->push_back(MyPred("a2c", "1Gak", "c", "d4f"));
	vPred->push_back(MyPred("j4h", "b", "c", "j87h"));
	vPred->push_back(MyPred("d4f", "1Gak", "c", "d4f"));
	vPred->push_back(MyPred("n7s", "1Gak", "c", "d4f"));
	vPred->push_back(MyPred("l9m", "b", "c", "j87h"));
	vPred->push_back(MyPred("p24a", "x", "c", "p43a"));
	vPred->push_back(MyPred("q56r", "l", "m", "q90r"));
	vPred->push_back(MyPred("g11v", "8f", "h", "g63v"));
	vPred->push_back(MyPred("u3w", "v", "d", "u11w"));
	vPred->push_back(MyPred("k76l", "x", "c", "p43a"));
	vPred->push_back(MyPred("p24a", "g", "z", "p43a"));

	// The values need to be in order for equal_range() to work
	std::sort(vPred->begin(), vPred->end());

	std::vector<MyPred> uPred; // values that were always unique
	std::vector<MyPred>* dPred = new std::vector<MyPred>; // values that were duplicated

	std::pair<std::vector<MyPred>::iterator, std::vector<MyPred>::iterator> ret;

	for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); i = ret.second)
	{
		ret = std::equal_range(i, vPred->end(), *i);
		
		if(ret.second - ret.first != 1) // duplicates
		{
				for(std::vector<MyPred>::iterator j = ret.first; j != ret.second; ++j)
				{
					dPred->push_back(*j); //put each duplicate onto a new vector
				}
		}
		else if(ret.second - ret.first == 1)
		{
			uPred.push_back(*i);
		}
	}

	std::cout << "vPred: Sorted input\n";
	for(std::vector<MyPred>::iterator i = vPred->begin(); i != vPred->end(); ++i)
	{
		std::cout << "[" << i->a << ", " << i->x << ", " << i->y << ", " << i->z << "]" << '\n';
	}

	std::cout << "dPred: Only the values that were duplicated\n";
	for(std::vector<MyPred>::iterator i = dPred->begin(); i != dPred->end(); ++i)
	{
		std::cout << "[" << i->a << ", " << i->x << ", " << i->y << ", " << i->z << "]" << '\n';
	}

	std::cout << "uPred: Only the values that were unique\n";
	for(std::vector<MyPred>::iterator i = uPred.begin(); i != uPred.end(); ++i)
	{
		std::cout << "[" << i->a << ", " << i->x << ", " << i->y << ", " << i->z << "]" << '\n';
	}

	delete vPred;
	delete dPred;

	char a;
	std::cin >> a;
}

daviddoria 334 Posting Virtuoso Featured Poster · Answer 2 · 2010-10-24T20:58:07+00:00

Unfortunately I can't look into this enough to help you with the assignment.

pred is a struct object. vPred is a struct type of vector.

Of course, I read the code :). My point is that the word "pred" does not tell me what it is. Is it a "predator"? A "predominant something"? See what I mean? Then "vPred" could be called "predators" or something, though vPredator might be acceptable?

heidik 0 Light Poster · Answer 3 · 2010-10-24T21:20:08+00:00

Could you please have a look at this David :)

#include <string>
#include <vector>
#include <iostream>
#include <algorithm>
 
struct MyStruct // I was trying to use a predicate in equal_range() so I named the struct MyPred :P.
{
	std::string start_DateTime; // it represents the start date/time
	std::string x;
	std::string y;
	std::string end_DateTime; // represents the end date/time
 
	MyStruct(const std::string& strt_DT, const std::string& x, const std::string& y, const std::string& end_DT): start_DateTime(strt_DT), x(x), y(y), end_DateTime(end_DT) {}
 
	bool operator==(const MyStruct& p) const
	{
		return x == p.x && y == p.y && end_DateTime == p.end_DT; // start_DateTime == p.strt_DT && 
	}
 
	bool operator<(const MyStruct& p) const
	{
		//if(start_DateTime < p.strt_DT) return true;
		//if(start_DT > p.strt_DT) return false;
		if(x < p.x) return true;
		if(x > p.x) return false;
		if(y < p.y) return true;
		if(y > p.y) return false;
		if(end_DateTime < p.end_DT) return true;
		if(end_DateTime > p.end_DT) return false;
		return false;
	}
};
 
 
int main()
{
	std::vector<MyStruct>* vecStruct = new std::vector<MyStruct>;
	vecStruct->push_back(MyStruct("a2c", "1Gak", "c", "d4f"));
	vecStruct->push_back(MyStruct("j4h", "b", "c", "j87h"));
	vecStruct->push_back(MyStruct("d4f", "1Gak", "c", "d4f"));
	vecStruct->push_back(MyStruct("n7s", "1Gak", "c", "d4f"));
	vecStruct->push_back(MyStruct("l9m", "b", "c", "j87h"));
	vecStruct->push_back(MyStruct("p24a", "x", "c", "p43a"));
	vecStruct->push_back(MyStruct("q56r", "l", "m", "q90r"));
	vecStruct->push_back(MyStruct("g11v", "8f", "h", "g63v"));
	vecStruct->push_back(MyStruct("u3w", "v", "d", "u11w"));
	vecStruct->push_back(MyStruct("k76l", "x", "c", "p43a"));
	vecStruct->push_back(MyStruct("p24a", "g", "z", "p43a"));
 
	// The values need to be in order for equal_range() to work
	std::sort(vecStruct->begin(), vecStruct->end());
 
	std::vector<MyStruct> unqVecStruct; // values that were always unique
	std::vector<MyStruct>* dupVecStruct = new std::vector<MyStruct>; // values that were duplicated
 
	std::pair<std::vector<MyStruct>::iterator, std::vector<MyStruct>::iterator> ret;
 
	for(std::vector<MyStruct>::iterator i = vecStruct->begin(); i != vecStruct->end(); i = ret.second)
	{
		ret = std::equal_range(i, vecStruct->end(), *i);
 
		if(ret.second - ret.first != 1) // duplicates
		{
				for(std::vector<MyStruct>::iterator j = ret.first; j != ret.second; ++j)
				{
					dupVecStruct->push_back(*j); //put each duplicate onto a new vector
				}
		}
		else if(ret.second - ret.first == 1)
		{
			unqVecStruct.push_back(*i);
		}
	}
 
	std::cout << "vecStruct: Sorted input\n";
	for(std::vector<MyStruct>::iterator i = vecStruct->begin(); i != vecStruct->end(); ++i)
	{
		std::cout << "[" << i->start_DateTime << ", " << i->x << ", " << i->y << ", " << i->end_DateTime << "]" << '\n';
	}
 
	std::cout << "dupVecStruct: Only the values that were duplicated\n";
	for(std::vector<MyStruct>::iterator i = dupVecStruct->begin(); i != dupVecStruct->end(); ++i)
	{
		std::cout << "[" << i->start_DateTime << ", " << i->x << ", " << i->y << ", " << i->end_DateTime << "]" << '\n';
	}
 
	std::cout << "unqVecStruct: Only the values that were unique\n";
	for(std::vector<MyStruct>::iterator i = unqVecStruct.begin(); i != unqVecStruct.end(); ++i)
	{
		std::cout << "[" << i->start_DateTime << ", " << i->x << ", " << i->y << ", " << i->end_DateTime << "]" << '\n';
	}
 
	delete vecStruct;
	delete dupVecStruct;
}

heidik 0 Light Poster · Answer 4 · 2010-10-24T21:27:44+00:00

std::vector<MyStruct>* vecStruct = new std::vector<MyStruct>;
	vecStruct->push_back(MyStruct("2006/06/01 16:24:45", "1Gak", "c", "2006/06/01 16:14:45"));
	vecStruct->push_back(MyStruct("2006/06/01 13:13:45", "b", "c", "2006/06/01 13:15:45"));
	vecStruct->push_back(MyStruct("2006/06/01 16:14:45", "1Gak", "c", "2006/06/01 16:14:45"));
	vecStruct->push_back(MyStruct("2006/06/01 16:30:45", "1Gak", "c", "2006/06/01 16:14:45"));
	vecStruct->push_back(MyStruct("2006/06/01 13:17:45", "b", "c", "2006/06/01 13:15:45"));
	vecStruct->push_back(MyStruct("2006/06/02 13:14:45", "x", "c", "2006/06/02 13:15:45"));
	vecStruct->push_back(MyStruct("2006/06/03 15:11:45", "l", "m", "2006/06/03 15:15:45"));
	vecStruct->push_back(MyStruct("2006/06/03 15:11:45", "8f", "h", "2006/06/03 15:11:45"));
	vecStruct->push_back(MyStruct("2006/06/03 17:11:45", "v", "d", "2006/06/03 19:11:45"));
	vecStruct->push_back(MyStruct("2006/06/02 13:17:45", "x", "c", "2006/06/02 13:15:45"));
	vecStruct->push_back(MyStruct("2006/06/01 16:24:45", "g", "z", "2006/06/01 16:14:45"));