Hi, everybody.

The question is: if the user inputs a set of numbers and you don't know how long the set is, like 1, 2, 3, 4, 5, 5, 5, 6, 1, 3, -1, -12141, 123, -1...., you are asked to find the duplicate numbers in this set. I mean 1, 5, -1.

Actually, I can only find an O(n^2) algorithm, by scanning the set n times.

I think there must be a better way, maybe O(n) or less:)

Just for fun:)
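
For reference, the brute-force O(n^2) scan described above might look something like this rough sketch (the sample numbers are hard-coded for illustration):

#include <iostream>

int main()
{
  // sample input; in practice the numbers would come from the user
  int a[] = { 1, 2, 3, 4, 5, 5, 5, 6, 1, 3, -1, -12141, 123, -1 };
  const int n = sizeof(a) / sizeof(a[0]);

  std::cout << "Duplicates:\n";
  for ( int i = 0; i < n; i++ ) {
    int earlier = 0; // occurrences of a[i] before position i
    for ( int j = 0; j < i; j++ )
      if ( a[j] == a[i] )
        ++earlier;
    if ( earlier == 1 ) // second occurrence: report the value once
      std::cout << a[i] << '\n';
  }
}

Each duplicate value is reported exactly once, at its second occurrence; the two nested loops are what make this O(n^2).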


Sorry, dumplicate -> duplicate:(

I don't know how to edit the title:)

sorting the list should help.
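
For instance, a rough sketch of the sorting approach (reading everything into a vector first, and sorting a copy so the original order survives):

#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
  std::vector<int> v;
  int x;
  while ( std::cin >> x )      // read until EOF or a non-number
    v.push_back ( x );

  std::vector<int> s ( v );    // sort a copy; v keeps the input order
  std::sort ( s.begin(), s.end() );

  std::cout << "Duplicates:\n";
  for ( std::vector<int>::size_type i = 1; i < s.size(); i++ ) {
    // equal values are now adjacent; print only at the start of a run
    if ( s[i] == s[i-1] && ( i == 1 || s[i] != s[i-2] ) )
      std::cout << s[i] << '\n';
  }
}

The sort dominates, so this is O(N log N) overall, plus O(N) extra storage for the copy.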

if you are not allowed to reorder the original sequence, but are allowed to use auxiliary storage

#include <vector>
#include <set>
#include <iostream>
#include <string>
using namespace std;

template< typename CNTR >
void print_duplicates( const CNTR& cntr )
{
  set< typename CNTR::value_type > unique ;
  for( size_t i=0 ; i<cntr.size() ; ++i )
    if( !unique.insert( cntr[i] ).second )
      cout << "duplicate " << cntr[i] << '\n' ;
}

int main()
{
  string str ; cin >> str ; // reads a single whitespace-delimited token
  vector<int> v( str.begin(), str.end() ) ; // note: each char of str becomes an int (its character code)
  print_duplicates(v) ;
}

this would take O( N log N ). instead of a set, if you use a hashtable with a good hash function, you could do this in linear time.

>sorting the list should help.
Sorting the list with a general algorithm gives you (at best) O(N log N) performance just for the sort. Then you have to determine the duplicates. It also means that you have to store the list somewhere first, which means your storage cost is O(N). If there are a lot of duplicates, you're wasting storage for this problem. See below.

>this would take O( N log N ).
That's a curious solution. Why didn't you go with a map instead?

#include <cstdlib>
#include <iostream>
#include <map>

int main()
{
  std::map<int, int> freq;

  for ( int i = 0; i < 30; i++ )
    ++freq[std::rand() % 20];

  std::cout<<"Duplicates:\n";
  std::map<int, int>::iterator it = freq.begin();

  while ( it != freq.end() ) {
    if ( it->second > 1 )
      std::cout<< it->first <<" -- "<< it->second <<" times\n";

    ++it;
  }
}

Not only does this give you a unique list, it also marks each of the duplicates for later use. That's a great deal more flexible than what you have, which can only mark duplicates once.

Of course, it still has a storage problem because of the counter for each item. Even if you use the map as your primary data structure rather than the vector you're using now, it's still effectively a 2N storage cost for integers. Since the OP only seems to want the duplicates, you can remove all but the duplicates with fair ease and limit the storage penalty to when nearly every number has at least one duplicate:

#include <cstdlib>
#include <iostream>
#include <map>

void mark_sweep ( std::map<int, int>& mark )
{
  std::map<int, int>::iterator it = mark.begin();
  std::map<int, int> sweep;

  while ( it != mark.end() ) {
    if ( it->second > 1 )
      sweep.insert ( *it );
    ++it;
  }

  mark.swap ( sweep );
}

int main()
{
  std::map<int, int> freq;

  for ( int i = 0; i < 30; i++ )
    ++freq[std::rand() % 20];
  mark_sweep ( freq );

  std::cout<<"Duplicates:\n";
  std::map<int, int>::iterator it = freq.begin();

  while ( it != freq.end() ) {
    std::cout<< it->first <<" -- "<< it->second <<" times\n";
    ++it;
  }
}
iamthwee:

> if you are not allowed to reorder the original sequence, but are allowed to use auxiliary storage
> [vijayan121's set-based code quoted above]

What input are you using? I don't get anything intelligible for the output.

>What input are you using, I don't get anything intelligible for the output?
It's kind of goofy. He uses the characters in a string to initialize the vector of integers. You have to input a string with no spaces, and the output will be the integer code for each of the characters in the string. Try "012345123" as input. Assuming ASCII, you'll get 49, 50, and 51 as the duplicates.

iamthwee:

> It's kind of goofy. He uses the characters in a string to initialize the vector of integers.

Ah... :)

> why didn't you go with a map instead
because i was working under the premise:
if you are not allowed to reorder the original sequence, but are allowed to use auxiliary storage

> ...it also marks each of the duplicates for later use. That's a great deal
> more flexible than what you have ...
true.

here is an extract from the python cookbook (this is to remove, not just identify duplicates) [Raymond Hettinger, 2002]

def uniq(alist):    # Fastest order preserving
  set = {}
  return [set.setdefault(e,e) for e in alist if e not in set]

def uniq(alist):    # Fastest without order preserving
  set = {}
  map(set.__setitem__, alist, [])
  return set.keys()
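
a rough C++ analogue of the order-preserving version might be (just a sketch):

#include <iostream>
#include <set>
#include <vector>

// remove duplicates, keeping the first occurrence of each value
// in its original position (like the first uniq() above)
std::vector<int> uniq( const std::vector<int>& in )
{
  std::set<int> seen ;
  std::vector<int> out ;
  for( std::vector<int>::size_type i = 0 ; i < in.size() ; ++i )
    if( seen.insert( in[i] ).second ) // true only for a first occurrence
      out.push_back( in[i] ) ;
  return out ;
}

int main()
{
  int a[] = { 1, 2, 3, 1, 2, 5 } ;
  std::vector<int> v( a, a+6 ) ;
  std::vector<int> u = uniq(v) ;
  for( std::vector<int>::size_type i = 0 ; i < u.size() ; ++i )
    std::cout << u[i] << ' ' ; // prints: 1 2 3 5
  std::cout << '\n' ;
}
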
iamthwee:

Is that map solution something like the first page of this, Narue?

>because i was working under the premise:
>if you are not allowed to reorder the original sequence, but are allowed to use auxiliary storage
My question still stands. A set reorders the sequence just as much as a map. The order preserving behavior you get (assuming that's what you're talking about with the python example) isn't in your choice of a set, but the order in which you process the items from your vector.

>Is that map solution something like the first page of this narue.
Something like that. The map solution does save the first occurrence of the item and simply marks further occurrences as duplicates by incrementing the counter. However, since a map is an associative container, it changes the order of the items you put in it according to that association. In other words, by default the map sorts the items you insert.
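
A quick illustration of that point (made-up values):

#include <iostream>
#include <map>

int main()
{
  std::map<int, int> m;

  // inserted out of order...
  m[42] = 1;
  m[7] = 1;
  m[19] = 1;

  // ...but iterated in key order: prints 7, 19, 42
  std::map<int, int>::iterator it = m.begin();

  while ( it != m.end() ) {
    std::cout<< it->first <<'\n';
    ++it;
  }
}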

> My question still stands. A set reorders the sequence just as much as a map.

true. the set is used only to check whether an element is a duplicate (and is discarded on return from the function). the sequence is processed in the original order. it just finds (and in this example prints out) the duplicate values; it does not modify the original sequence at all (it is passed by reference to const).

So what is it that you basically require - an array with the duplicate integers removed, or printing the array once with no duplicates, but still containing those numbers which previously had a duplicate along with the remaining numbers of the array? Please clarify this doubt.
Maybe then even I can try to offer some aid, like a nice member of DaniWeb.

> So what is it that you basically require - an array with the duplicate integers removed, or printing the array once with no duplicates? Please clarify this doubt.

Thank you:)

I just want to find the duplicate numbers in the user's input and print them out as quickly as possible. The less time the better.

As you discussed, if I use a more complex data structure, like a map, would that take the least time?

>I just want to find the duplicate numbers in the user's input and print
>them out as quickly as possible.
Somehow I doubt that the user's input will be significant enough to be noticeably slow even with an O(N^2) algorithm. Not to mention that printing the duplicates will probably take more time than finding them. ;)

However, since there are no other requirements, you can combine suggestions:

#include <iostream>
#include <set>

int main()
{
  std::set<int> unique;
  std::set<int> dups;
  int input;

  while ( std::cin>> input ) {
    if ( !unique.insert ( input ).second )
      dups.insert ( input );
  }

  std::set<int>::iterator it = dups.begin();

  std::cout<<"Duplicates:\n";
  while ( it != dups.end() )
    std::cout<< *it++ <<'\n';
}

This prints only the first occurrence of the duplicate (vijayan121's original code gave you a hit for every occurrence of a duplicate). You didn't require any stability of the input order, so I just went with a straight set as the primary data structure. It's fast enough that you probably won't see any noticeable performance issues.
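
If input order did matter, here is a sketch of one way to keep the duplicates in the order they're first detected:

#include <iostream>
#include <set>
#include <vector>

int main()
{
  std::set<int> unique;
  std::set<int> reported;
  std::vector<int> dups; // duplicates, in first-detected order
  int input;

  while ( std::cin>> input ) {
    // record a value the first time it shows up as a duplicate
    if ( !unique.insert ( input ).second
      && reported.insert ( input ).second )
      dups.push_back ( input );
  }

  std::cout<<"Duplicates:\n";
  for ( std::vector<int>::size_type i = 0; i < dups.size(); i++ )
    std::cout<< dups[i] <<'\n';
}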

Why didn't I use a map? ;) Because I felt like it, but that solution is just as easy:

#include <iostream>
#include <map>

int main()
{
  std::map<int, int> freq;
  int input;

  while ( std::cin>> input ) {
    ++freq[input];
  }

  std::cout<<"Duplicates:\n";

  std::map<int, int>::iterator it;

  for ( it = freq.begin(); it != freq.end(); ++it ) {
    if ( it->second > 1 )
      std::cout<< it->first <<'\n';
  }
}

using a hash technique would be faster; linear time

#include <iostream>
#include <iterator>
#include <vector>
#include <unordered_set>
using namespace std;

//note: requires gcc 4.3; compile with -std=c++0x

int main()
{
  int a[] = { 1, 2, 3, 4, 5, 3, 6, 7, 5, 8, 9, 3, 5, 2, 8 };
  unordered_set<int> set ; // hash impl. with buckets (tr1)
  cout << "duplicates\n-------------\n" ;
  for( size_t i=0 ; i<sizeof(a)/sizeof(a[0]) ; ++i )
    if( set.find( a[i] ) == set.end() ) set.insert( a[i] ) ;
    else cout << a[i] << '\n' ; 
}

>//note: requires gcc 4.3; compile with -std=c++0x
This is my problem with your suggestion. If you don't have a compiler that supports TR1, you have to write your own hash table or use a third party library. This affects the performance. Not to mention that it's a dubious optimization. I honestly don't think the difference between a balanced tree and a hash table (the two common implementations of set/map and unordered_set) will be noticeable to the user in an I/O bound program.
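
To be fair, "write your own" isn't a huge burden for plain ints. A deliberately naive chained hash set, just to show the idea (fixed bucket count, no rehashing, nothing like production quality):

#include <iostream>
#include <list>
#include <vector>

// a toy chained hash set for ints: fixed bucket count, no rehashing
class int_hash_set {
  std::vector< std::list<int> > buckets;

  std::size_t index ( int value ) const
  {
    // map the value (including negatives) into the bucket range
    std::size_t u = value < 0 ? std::size_t ( -(value + 1) ) : std::size_t ( value );
    return u % buckets.size();
  }
public:
  int_hash_set(): buckets ( 1024 ) {}

  // true if newly inserted, false if the value was already present
  bool insert ( int value )
  {
    std::list<int>& b = buckets[index ( value )];

    for ( std::list<int>::iterator it = b.begin(); it != b.end(); ++it )
      if ( *it == value )
        return false;

    b.push_back ( value );
    return true;
  }
};

int main()
{
  int a[] = { 1, 2, 3, 2, 1, 4 };
  int_hash_set set;

  std::cout<<"Duplicates:\n";
  for ( int i = 0; i < 6; i++ )
    if ( !set.insert ( a[i] ) )
      std::cout<< a[i] <<'\n'; // prints 2, then 1
}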

>I honestly don't think the difference ... will be noticeable to the user in an I/O bound program.

true; it may not even be measurable. i/o will dominate everything else.
note: the current version of boost has an implementation of tr1 (in std::tr1) that would work with most compilers.
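
if memory serves, with Boost.TR1 the only change to the earlier snippet would be the include (assuming the boost headers are on the include path):

#include <iostream>
#include <boost/tr1/unordered_set.hpp> // Boost.TR1: provides std::tr1::unordered_set

int main()
{
  int a[] = { 1, 2, 3, 4, 5, 3, 6, 7, 5, 8, 9, 3, 5, 2, 8 } ;
  std::tr1::unordered_set<int> set ;
  std::cout << "duplicates\n-------------\n" ;
  for( std::size_t i=0 ; i<sizeof(a)/sizeof(a[0]) ; ++i )
    if( !set.insert( a[i] ).second ) std::cout << a[i] << '\n' ;
}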

here is a performance measure

#include <iostream>
#include <iterator>
#include <vector>
#include <unordered_set>
#include <set>
#include <google/dense_hash_set>
#include <random>
#include <boost/timer.hpp>
using namespace std;

//note: requires gcc 4.3; compile with -std=c++0x

void get_dup_set( const vector<int>& in, vector<int>& dups )
{
  dups.clear() ;
  set<int> set ; 
  for( size_t i=0 ; i<in.size() ; ++i )
    if( !set.insert( in[i] ).second ) dups.push_back( in[i] ) ;
}

void get_dup_hash( const vector<int>& in, vector<int>& dups )
{
  dups.clear() ;
  unordered_set<int> set ; // hash impl. with buckets (tr1)
  for( size_t i=0 ; i<in.size() ; ++i )
    if( set.find( in[i] ) == set.end() ) set.insert( in[i] ) ;
    else dups.push_back( in[i] ) ; 
}

void get_dup_google( const vector<int>& in, vector<int>& dups )
{
  dups.clear() ;
  google::dense_hash_set< int, hash<int> > set ; 
  set.set_empty_key(-1) ;
  for( size_t i=0 ; i<in.size() ; ++i )
    if( !set.insert( in[i] ).second ) dups.push_back( in[i] ) ;
}

int main()
{
  const int N = { 1024*1024*4 };
  vector<int> in(N), dups ;
  dups.reserve(N) ;
  mt19937 twister ;
  uniform_int<> distribution(1,N/2) ;
  variate_generator< mt19937, uniform_int<> > generator(twister,distribution) ;
  for( size_t i=0 ; i<in.size() ; ++i ) in[i] = generator() ;
  boost::timer timer ;
  get_dup_set(in,dups) ;
  cout << "using std::set: " << timer.elapsed() << " seconds\n" ;
  dups.reserve(N) ;
  timer.restart() ;
  get_dup_hash(in,dups) ;
  cout << "using tr1 hash: " << timer.elapsed() << " seconds\n" ;
  dups.reserve(N) ;
  timer.restart() ;
  get_dup_google(in,dups) ;
  cout << "using google hash: " << timer.elapsed() << " seconds\n" ;
}

g++43 -std=c++0x -O3 -I/usr/local duplicates.cpp ; ./a.out
using std::set: 37.8828 seconds
using tr1 hash: 10.3906 seconds
using google hash: 0.75 seconds

>here is a performance measure
You forgot to include displaying the duplicates in your measure.

since we are printing only the duplicates, the time taken to print them out will depend on the number of duplicate values. anyway, here are the results for the earlier code.
note: i have replaced gcc <random> with boost <random.hpp>; the gcc 4.3.0 stage 1 snapshot's implementation of the uniform_int<> distribution seems to be buggy; these results should be more accurate.

#include <iostream>
#include <iterator>
#include <vector>
#include <unordered_set>
#include <set>
#include <google/dense_hash_set>
#include <boost/random.hpp>
#include <boost/timer.hpp>
#include <algorithm>
#include <fstream>
using namespace std;

//note: requires gcc 4.3; compile with -std=c++0x

void get_dup_set( const vector<int>& in, vector<int>& dups )
{
  dups.clear() ;
  set<int> set ; 
  for( size_t i=0 ; i<in.size() ; ++i )
    if( !set.insert( in[i] ).second ) dups.push_back( in[i] ) ;
}

void get_dup_hash( const vector<int>& in, vector<int>& dups )
{
  dups.clear() ;
  unordered_set<int> set ; // hash impl. with buckets (tr1)
  for( size_t i=0 ; i<in.size() ; ++i )
    if( set.find( in[i] ) == set.end() ) set.insert( in[i] ) ;
    else dups.push_back( in[i] ) ; 
}

void get_dup_google( const vector<int>& in, vector<int>& dups )
{
  dups.clear() ;
  google::dense_hash_set< int, hash<int> > set ; 
  set.set_empty_key(-1) ;
  for( size_t i=0 ; i<in.size() ; ++i )
    if( !set.insert( in[i] ).second ) dups.push_back( in[i] ) ;
}

int main()
{
  const int N = { 1024*1024*4 };
  vector<int> in(N), dups ;
  dups.reserve(N) ;
  boost::mt19937 twister ;
  boost::uniform_int<> distribution(1,N/2) ;
  boost::variate_generator< boost::mt19937, boost::uniform_int<> > 
                                   generator(twister,distribution) ;
  for( size_t i=0 ; i<in.size() ; ++i ) in[i] = generator() ;
  boost::timer timer ;
  get_dup_set(in,dups) ;
  cout << "using std::set: " << timer.elapsed() << " seconds\n" ;
  cout << "\t#duplicates: " << dups.size() << '\n' ;
  dups.reserve(N) ;
  timer.restart() ;
  get_dup_hash(in,dups) ;
  cout << "using tr1 hash: " << timer.elapsed() << " seconds\n" ;
  cout << "\t#duplicates: " << dups.size() << '\n' ;
  dups.reserve(N) ;
  timer.restart() ;
  get_dup_google(in,dups) ;
  cout << "using google hash: " << timer.elapsed() << " seconds\n" ;
  cout << "\t#duplicates: " << dups.size() << '\n' ;
  ofstream file("duplicates.txt") ;
  timer.restart() ;
  copy( dups.begin(), dups.end(), ostream_iterator<int>(file,"\n") ) ;
  cout << "print (file): " << timer.elapsed() << " seconds\n" ;
  timer.restart() ;
  copy( dups.begin(), dups.end(), ostream_iterator<int>(cout,"\n") ) ;
  cout << "print (stdout): " << timer.elapsed() << " seconds\n" ;
}

extract of output:
> g++43 -std=c++0x -O3 -I/usr/local duplicates.cpp ; ./a.out
using std::set: 21.2031 seconds
#duplicates: 2381039
using tr1 hash: 5.42188 seconds
#duplicates: 2381039
using google hash: 0.53125 seconds
#duplicates: 2381039
print (file): 1.4375 seconds
...
<omitted>
...
print (stdout): 5.03906 seconds

for the original problem stated in this thread (numbers entered by a user; a hundred or so at most), the performance difference is not measurable - all times are at or near 0.0 seconds.

iamthwee:

That's all well and good, but I doubt the OP knows standard C++, let alone how to use it with boost or <insert other library here>.

IMO maybe you ought to consider this fact when answering other questions?

#include <google/dense_hash_set>

Off topic, but just curious - what's this google/dense_hash_set? Is there some open source implementation from Google? Any links for more info?
Asking because of the difference in performance I see.. :)

btw, I've already seen this: http://code.google.com/p/google-sparsehash/
I was wondering if there is more like this - any other containers' implementations?

documentation for the dense/sparse hash set/map containers is available at
http://google-sparsehash.googlecode.com/svn/trunk/doc/index.html
dense_hash_set/map trades memory for performance; the sparse versions are very memory efficient, but slow. the performance is good, but the difference between O(N log N) and O(N) [ a factor of log N ] should only be about 22 times for our data set (N = 2^22, so log2 N = 22). the gcc implementation is really a first beta; i think the focus was on getting out a working version, not performance.

you may also find this link interesting
http://code.google.com/p/google-perftools/wiki/GooglePerformanceTools
TCMalloc is really fast.

both are truly free (under the non-viral BSD license); you could use them in any way you want, without restrictions.

more code is available at http://code.google.com/
i've not tried any of the others.
