sorting the list should help.
thekashyap
Practically a Posting Shark
811 posts since Feb 2007
Reputation Points: 254
Solved Threads: 75
if you are not alowed to reorder the original sequence, but are allowed to use auxiliary storage
#include <vector>
#include <set>
#include <iostream>
#include <string>
using namespace std;
template< typename CNTR >
void print_duplicates( const CNTR& cntr )
{
set< typename CNTR::value_type > unique ;
for( size_t i=0 ; i<cntr.size() ; ++i )
if( !unique.insert( cntr[i] ).second )
cout << "duplicate " << cntr[i] << '\n' ;
}
int main()
{
string str ; cin >> str ;
vector<int> v( str.begin(), str.end() ) ;
print_duplicates(v) ;
}
this would take O( N log N ). instead of a set, if you use a hashtable with a good hash function, you could do this in linear time.
vijayan121
Posting Virtuoso
1,606 posts since Dec 2006
Reputation Points: 1,159
Solved Threads: 287
>sorting the list should help.
Sorting the list with a general algorithm gives you (at best) O(NlogN) performance just for the sort. Then you have to determine the duplicates. It also means that you have to store the list somewhere first, which means your storage cost is O(N). If there are a lot of duplicates, you're wasting storage for this problem. See below.
>this would take O( N log N ).
That's a curious solution. Why didn't you go with a map instead?
#include <cstdlib>
#include <iostream>
#include <map>
int main()
{
std::map<int, int> freq;
for ( int i = 0; i < 30; i++ )
++freq[std::rand() % 20];
std::cout<<"Duplicates:\n";
std::map<int, int>::iterator it = freq.begin();
while ( it != freq.end() ) {
if ( it->second > 1 )
std::cout<< it->first <<" -- "<< it->second <<" times\n";
++it;
}
}
Not only does this give you a unique list, it also marks each of the duplicates for later use. That's a great deal more flexible than what you have, which can only mark duplicates once.
Of course, it still has a storage problem because of the counter for each item. Even if you use the map as your primary data structure rather than the vector you're using now, it's still effectively a 2N storage cost for integers. Since the OP only seems to want the duplicates, you can remove all but the duplicates with fair ease and limit the storage penalty to when nearly every number has at least one duplicate:
#include <cstdlib>
#include <iostream>
#include <map>
void mark_sweep ( std::map<int, int>& mark )
{
std::map<int, int>::iterator it = mark.begin();
std::map<int, int> sweep;
while ( it != mark.end() ) {
if ( it->second > 1 )
sweep.insert ( *it );
++it;
}
mark.swap ( sweep );
}
int main()
{
std::map<int, int> freq;
for ( int i = 0; i < 30; i++ )
++freq[std::rand() % 20];
mark_sweep ( freq );
std::cout<<"Duplicates:\n";
std::map<int, int>::iterator it = freq.begin();
while ( it != freq.end() ) {
std::cout<< it->first <<" -- "<< it->second <<" times\n";
++it;
}
}
Narue
Bad Cop
15,460 posts since Sep 2004
Reputation Points: 6,464
Solved Threads: 1,401
if you are not alowed to reorder the original sequence, but are allowed to use auxiliary storage
#include <vector>
#include <set>
#include <iostream>
#include <string>
using namespace std;
template< typename CNTR >
void print_duplicates( const CNTR& cntr )
{
set< typename CNTR::value_type > unique ;
for( size_t i=0 ; i<cntr.size() ; ++i )
if( !unique.insert( cntr[i] ).second )
cout << "duplicate " << cntr[i] << '\n' ;
}
int main()
{
string str ; cin >> str ;
vector<int> v( str.begin(), str.end() ) ;
print_duplicates(v) ;
}
this would take O( N log N ). instead of a set, if you use a hashtable with a good hash function, you could do this in linear time.
What input are you using, I don't get anything intelligible for the output?
iamthwee
Posting Expert
5,950 posts since Aug 2005
Reputation Points: 1,543
Solved Threads: 439
>What input are you using, I don't get anything intelligible for the output?
It's kind of goofy. He uses the characters in a string to initialize the vector of integers. You have to input a string with no spaces, and the output will be the integer code for each of the characters in the string. Try "012345123" as input. Assuming ASCII, you'll get 49, 50, and 51 as the duplicates.
Narue
Bad Cop
15,460 posts since Sep 2004
Reputation Points: 6,464
Solved Threads: 1,401
>What input are you using, I don't get anything intelligible for the output?
It's kind of goofy. He uses the characters in a string to initialize the vector of integers. You have to input a string with no spaces, and the output will be the integer code for each of the characters in the string. Try "012345123" as input. Assuming ASCII, you'll get 49, 50, and 51 as the duplicates.
Ah... :)
iamthwee
Posting Expert
5,950 posts since Aug 2005
Reputation Points: 1,543
Solved Threads: 439
> why didn't you go with a map instead
because i was working under the premise:
if you are not alowed to reorder the original sequence, but are allowed to use auxiliary storage
> ...it also marks each of the duplicates for later use. That's a great deal
> more flexible than what you have ...
true.
here is an extract from the python cookbook (this is to remove, not just identify duplicates) [Raymond Hettinger, 2002]
def uniq(alist) # Fastest order preserving
set = {}
return [set.setdefault(e,e) for e in alist if e not in set]
def uniq(alist) # Fastest without order preserving
set = {}
map(set.__setitem__, alist, [])
return set.keys()
vijayan121
Posting Virtuoso
1,606 posts since Dec 2006
Reputation Points: 1,159
Solved Threads: 287
Is that map solution something like the first page of this narue.
iamthwee
Posting Expert
5,950 posts since Aug 2005
Reputation Points: 1,543
Solved Threads: 439
>because i was working under the premise:
>if you are not alowed to reorder the original sequence, but are allowed to use auxiliary storage
My question still stands. A set reorders the sequence just as much as a map. The order preserving behavior you get (assuming that's what you're talking about with the python example) isn't in your choice of a set, but the order in which you process the items from your vector.
Narue
Bad Cop
15,460 posts since Sep 2004
Reputation Points: 6,464
Solved Threads: 1,401
>Is that map solution something like the first page of this narue.
Something like that. The map solution does save the first occurrence of the item and simply marks further occurrences as duplicates by incrementing the counter. However, since a map is an associative container, it changes the order of the items you put in it according to that association. In other words, by default the map sorts the items you insert.
Narue
Bad Cop
15,460 posts since Sep 2004
Reputation Points: 6,464
Solved Threads: 1,401
My question still stands. A set reorders the sequence just as much as a map. The order preserving behavior you get (assuming that's what you're talking about with the python example) isn't in your choice of a set, but the order in which you process the items from your vector.
true. the set is used only to check if the element is a duplicate (and is discarded on return from the function). the sequence is processed in the original order. it just finds (and in this example prints out) the duplicate values; it does not modify the original sequence at all (it is a const).
vijayan121
Posting Virtuoso
1,606 posts since Dec 2006
Reputation Points: 1,159
Solved Threads: 287
So what is it that you basically require? - an array of removed duplicate integers , or printing the array once with no duplicates but containing those nos which previously had a duplicate along with the remaining numbers of the array . Please clarify this doubt.
May be then , even I can try to offer some aid like a nice member of Dani web.
hinduengg
Junior Poster in Training
88 posts since Jun 2007
Reputation Points: 36
Solved Threads: 4
>I just want to find the duplicate numbers in the user's input and print
>them out as quickly as posible.
Somehow I doubt that the user's input will be significant enough to be noticeably slow even with an O(N^2) algorithm. Not to mention that printing the duplicates will probably take more time than finding them. ;)
However, since there are no other requirements, you can combine suggestions:
#include <iostream>
#include <set>
int main()
{
std::set<int> unique;
std::set<int> dups;
int input;
while ( std::cin>> input ) {
if ( !unique.insert ( input ).second )
dups.insert ( input );
}
std::set<int>::iterator it = dups.begin();
std::cout<<"Duplicates:\n";
while ( it != dups.end() )
std::cout<< *it++ <<'\n';
}
This prints only the first occurrence of the duplicate (vijayan121's original code gave you a hit for every occurrence of a duplicate). You didn't require any stability of the input order, so I just went with a straight set as the primary data structure. It's fast enough that you probably won't see any noticeable performance issues.
Why didn't I use a map? ;) Because I felt like it, but that solution is just as easy:
#include <iostream>
#include <map>
int main()
{
std::map<int, int> freq;
int input;
while ( std::cin>> input ) {
++freq[input];
}
std::cout<<"Duplicates:\n";
std::map<int, int>::iterator it;
for ( it = freq.begin(); it != freq.end(); ++it ) {
if ( it->second > 1 )
std::cout<< it->first <<'\n';
}
}
Narue
Bad Cop
15,460 posts since Sep 2004
Reputation Points: 6,464
Solved Threads: 1,401
using a hash technique would be faster; linear time
#include <iostream>
#include <iterator>
#include <vector>
#include <unordered_set>
using namespace std;
//note: requires gcc 4.3; compile with -std=c++0x
int main()
{
int a[] = { 1, 2, 3, 4, 5, 3, 6, 7, 5, 8, 9, 3, 5, 2, 8 };
unordered_set<int> set ; // hash impl. with buckets (tr1)
cout << "duplicates\n-------------\n" ;
for( size_t i=0 ; i<sizeof(a)/sizeof(a[0]) ; ++i )
if( set.find( a[i] ) == set.end() ) set.insert( a[i] ) ;
else cout << a[i] << '\n' ;
}
vijayan121
Posting Virtuoso
1,606 posts since Dec 2006
Reputation Points: 1,159
Solved Threads: 287
>//note: requires gcc 4.3; compile with -std=c++0x
This is my problem with your suggestion. If you don't have a compiler that supports TR1, you have to write your own hash table or use a third party library. This affects the performance. Not to mention that it's a dubious optimization. I honestly don't think the difference between a balanced tree and a hash table (the two common implementations of set/map and unordered_set) will be noticeable to the user in an I/O bound program.
Narue
Bad Cop
15,460 posts since Sep 2004
Reputation Points: 6,464
Solved Threads: 1,401
>I honestly don't think the difference ... will be noticeable to the user in an I/O bound program.
true; it may not even be measurable. i/o will dominate everything else.
note: the current version of boost has an implementation of tr1 (in std::tr1) that would work with most compilers.
vijayan121
Posting Virtuoso
1,606 posts since Dec 2006
Reputation Points: 1,159
Solved Threads: 287
here is a performance measure
#include <iostream>
#include <iterator>
#include <vector>
#include <unordered_set>
#include <set>
#include <google/dense_hash_set>
#include <random>
#include <boost/timer.hpp>
using namespace std;
//note: requires gcc 4.3; compile with -std=c++0x
void get_dup_set( const vector<int>& in, vector<int>& dups )
{
dups.clear() ;
set<int> set ;
for( size_t i=0 ; i<in.size() ; ++i )
if( !set.insert( in[i] ).second ) dups.push_back( in[i] ) ;
}
void get_dup_hash( const vector<int>& in, vector<int>& dups )
{
dups.clear() ;
unordered_set<int> set ; // hash impl. with buckets (tr1)
for( size_t i=0 ; i<in.size() ; ++i )
if( set.find( in[i] ) == set.end() ) set.insert( in[i] ) ;
else dups.push_back( in[i] ) ;
}
void get_dup_google( const vector<int>& in, vector<int>& dups )
{
dups.clear() ;
google::dense_hash_set< int, hash<int> > set ;
set.set_empty_key(-1) ;
for( size_t i=0 ; i<in.size() ; ++i )
if( !set.insert( in[i] ).second ) dups.push_back( in[i] ) ;
}
int main()
{
const int N = { 1024*1024*4 };
vector<int> in(N), dups ;
dups.reserve(N) ;
mt19937 twister ;
uniform_int<> distribution(1,N/2) ;
variate_generator< mt19937, uniform_int<> > generator(twister,distribution) ;
for( size_t i=0 ; i<in.size() ; ++i ) in[i] = generator() ;
boost::timer timer ;
get_dup_set(in,dups) ;
cout << "using std::set: " << timer.elapsed() << " seconds\n" ;
dups.reserve(N) ;
timer.restart() ;
get_dup_hash(in,dups) ;
cout << "using tr1 hash: " << timer.elapsed() << " seconds\n" ;
dups.reserve(N) ;
timer.restart() ;
get_dup_google(in,dups) ;
cout << "using google hash: " << timer.elapsed() << " seconds\n" ;
}
g++43 -std=c++0x -O3 -I/usr/local duplicates.cpp ; ./a.out
using std::set: 37.8828 seconds
using tr1 hash: 10.3906 seconds
using google hash: 0.75 seconds
vijayan121
Posting Virtuoso
1,606 posts since Dec 2006
Reputation Points: 1,159
Solved Threads: 287