Remove duplicates preserve order

Question

iamthwee

14 Years Ago

Hi,

Let's say I have a CSV file like this:

part,text,true,quantity

371336959,-New DDM Part-,Y,1
449127604,-New DDM Part-,Y,1
808635064,-New DDM Part-,Y,2
189657,-New DDM Part-,Y,2
319330767,-New DDM Part-,Y,1
371336959,-New DDM Part-,Y,1
189657,-New DDM Part-,Y,1

Now I want to remove duplicate parts and add the quantities together whilst PRESERVING THE ORIGINAL ORDER so that it becomes:

371336959,-New DDM Part-,Y,2
449127604,-New DDM Part-,Y,1
808635064,-New DDM Part-,Y,2
189657,-New DDM Part-,Y,3
319330767,-New DDM Part-,Y,1

What is the best way to achieve this?

Code snippets or full code is appreciated!

c++



Ancient Dragon 5,243 Achieved Level 70

14 Years Ago

1. Read the lines into a vector of strings
2. When a line is read, search the vector to see if the line already exists. If it does then just add the quantity to the string that was found. If not then add the new string to the end of the vector
To do that you will have to extract the quantity from the vector string, convert it to a number, do the same with the new string, add them together, then put it back into the vector

Warning: Not compiled or tested. Some people will most likely object to my use of atol(), so substitute anything you wish.

vector<string> arry;
string newstring = "<whatever>";
string& s = arry[i]; // i = index of the duplicate string already in the vector
// extract value from the end of the string in the vector
size_t pos = s.find_last_of(',');
int n1 = atol(s.substr(pos+1).c_str());
s = s.substr(0,pos);

// extract value from the end of the string read from the file
pos = newstring.find_last_of(',');
int n2 = atol(newstring.substr(pos+1).c_str());
// add the two values and put back into the vector
stringstream str;
str << s << ',' << n1 + n2;
s = str.str();

3. Rewrite the file if necessary using the contents of the vector


Ancient Dragon 5,243 Achieved Level 70

14 Years Ago

Yes I suppose map could be made to work, so would a few other container classes.

Ancient Dragon 5,243 Achieved Level 70

14 Years Ago

I require complete and tested code snippets please!
Thank you.

Are you kidding??? You've been around just as long as I have and you can do that yourself. Besides, I don't have time to do it -- I'm playing the Torchlight game :)

Ancient Dragon 5,243 Achieved Level 70

14 Years Ago

I'm really surprised that you are unable to do that simple program yourself. Here it is compiled and tested. Now all you have to do is finish it with reading from the file into the vector and searching the vector for the string.

#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#include <sstream>
using namespace std;


int main()
{
vector<string> arry;
arry.push_back("371336959,-New DDM Part-,Y,1");
string newstring = "371336959,-New DDM Part-,Y,1";
int i = 0;
string& s = arry[i]; // duplicate string;
// extract value from the end of the string in the vector
size_t pos = s.find_last_of(',');
int n1 = atol(s.substr(pos+1).c_str());
s = s.substr(0,pos);

// extract value from the end of the string read from the file
pos = newstring.find_last_of(',');
int n2 = atol(newstring.substr(pos+1).c_str());
// add the two values and put back into the vector
stringstream str;
str << s << ',' << n1 + n2;
s = str.str();
cout << arry[i] << '\n';
}

Aranarth 126 Posting Whiz in Training

14 Years Ago

And here's another. This will be many, many times faster if the file contains a lot of entries.

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <map>
#include <sstream>

using namespace std;

struct Foo
{
  Foo() : quantity(0) {}
  Foo(const string& part,const string& a,const string& b,int quantity) : part(part), a(a), b(b), quantity(quantity) {}
  Foo(const Foo& src) : part(src.part), a(src.a), b(src.b), quantity(src.quantity) {}

  string part;
  string a;
  string b;
  int quantity;
};

typedef unsigned int uint;

int main()
{
   ifstream read ( "bom.csv" );

   vector<Foo> stuff;
   map<string,uint> parts;

   string line;
   while ( getline ( read, line, '\n' ) )
   {
      vector<string> chunks;
      string token;
      istringstream iss (line);

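      while ( getline ( iss, token, ',' ) )chunks.push_back ( token );

      if (chunks.size() < 4)continue; // skip blank or malformed lines
      if (chunks[3] == "REF")chunks[3] = "999";

      int x = 0; // stays 0 if the quantity is not numeric (e.g. the header row)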
      istringstream ins;
      ins.str ( chunks[3] );
      ins >> x;
      Foo test(chunks[0],chunks[1],chunks[2],x);

      if (parts.find(test.part)!=parts.end())stuff[parts[test.part]].quantity+=test.quantity;
      else
      {
        parts[test.part]=stuff.size();
        stuff.push_back ( test );
      }
   }

   for (uint i=0;i<stuff.size();i++)cout << stuff[i].part << ' ' << stuff[i].quantity << endl;

   cin.get();
}

You should also consider using boost's lexical_cast or write your own strToInt function. Five lines just to do a simple string->int conversion is quite ugly.
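A hand-written conversion along those lines might look like the sketch below. The name strToInt comes from the suggestion above, but the signature (a bool result plus an output parameter) is only an assumption made for illustration. It treats anything that is not entirely an integer, such as "REF", as a failure instead of silently producing a number.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not from the thread): parse a whole string as an int.
// Returns true and stores the value in `out` only if the entire string is a
// valid integer; otherwise returns false and leaves `out` unchanged.
bool strToInt(const std::string& text, int& out)
{
    std::istringstream in(text);
    int value = 0;
    char leftover;
    if (!(in >> value))   // no leading integer at all, e.g. "REF" or ""
        return false;
    if (in >> leftover)   // trailing non-space characters, e.g. "12x"
        return false;
    out = value;
    return true;
}
```

In the loop above, the "REF" special case could then become the fallback branch: if strToInt fails on chunks[3], use 999.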


Ancient Dragon commented: nice :) +28

iamthwee commented: green candy coming your way +11


abhimanipal 91 Master Poster · Answer 1 · 2010-06-01T21:34:55+00:00

Same solution as the one suggested by AD but you could use a map instead of a vector

iamthwee · Answer 2 · 2010-06-01T21:45:29+00:00

I require complete and tested code snippets please!

Thank you.

iamthwee · Answer 3 · 2010-06-01T21:55:58+00:00

I've got a brain fart lol.

Anyone else?

If anyone can do this ... WOW I'll be most grateful.

iamthwee · Answer 4 · 2010-06-01T22:58:50+00:00

Well I got this so far...

Just need to write the procedure to remove duplicate lines and total quantities.

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <sstream>

using namespace std;

class Foo
{
public:
   string part;
   string a;
   string b;
   int quantity;
};



int main()
{
   string line;

   ifstream read ( "bom.csv" );

   vector <Foo> stuff;


   while ( getline ( read, line, '\n' ) )
   {

      vector <string>chunks;
      Foo test;
      string token;
      istringstream iss ( line );

      /*split the line up using a comma as a delimiter*/
      while ( getline ( iss, token, ',' ) )
      {
         chunks.push_back ( token );
      }

      test.part = chunks[0];
      test.a = chunks[1];
      test.b = chunks[2];

      /*check for REF case
        Sometimes the file doesn't contain valid quantities
        If it doesn't it will always be REF
        Change this to an arbitrary 999*/

      if (chunks[3] == "REF")
      {
         chunks[3] = "999";
      }
      
      /*convert string to integer*/ 
      int x;
      istringstream ins;
      ins.str ( chunks[3] );
      ins >> x;

      test.quantity = x;

      stuff.push_back ( test );
   }

   /*now begin procedure to remove duplicate lines and total quantities*/

   read.close();
   cin.get();

}

iamthwee · Answer 5 · 2010-06-01T23:05:07+00:00

@AD

Hi, that's the easy part.

The bit I'm having difficulty with is preserving the order whilst removing duplicates.

Thanks.

[Bear in mind it might not just be duplicates, but triplicates or more]

Ancient Dragon 5,243 Achieved Level 70 Team Colleague Featured Poster · Answer 6 · 2010-06-01T23:05:24+00:00

As I mentioned before, don't add a new string to the vector unless the vector does not already contain the string. Instead of adding them just combine them. That should be done before line 64. And if you look at the code I posted you don't need that chunks vector.

Ancient Dragon 5,243 Achieved Level 70 Team Colleague Featured Poster · Answer 7 · 2010-06-01T23:24:52+00:00

Here is one way to do it. I changed stuff to vector<string>

#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <fstream>
using namespace std;



class Foo
{
public:
   string part;
   string a;
   string b;
   int quantity;
};



int main()
{
   string line;

   ifstream read ( "bom.csv" );

   vector <string> stuff;


   while ( getline ( read, line, '\n' ) )
   {

      vector <string>chunks;
      Foo test;
      string token;
      istringstream iss ( line );

      /*split the line up using a comma as a delimiter*/
      while ( getline ( iss, token, ',' ) )
      {
         chunks.push_back ( token );
      }

      test.part = chunks[0];
      test.a = chunks[1];
      test.b = chunks[2];

      /*check for REF case
        Sometimes the file doesn't contain valid quantities
        If it doesn't it will always be REF
        Change this to an arbitrary 999*/

      if (chunks[3] == "REF")
      {
         chunks[3] = "999";
      }
      vector<string>::iterator it = stuff.begin();
      size_t pos1 = line.find_last_of(',');
      string t = line.substr(0,pos1);
      bool found = false;
      for(; it != stuff.end(); it++)
      {
          size_t pos2 = it->find_last_of(',');
          if( it->substr(0, pos2) == t)
          {
              int n1 = atol(chunks[3].c_str());
              int n2 = atol(it->substr(pos2+1).c_str());
              stringstream str;
              str << t << ',' << n1+n2;
              *it = str.str();
              found = true;
              break;
          }
      }
      if( found == false)
        stuff.push_back ( t + ',' + chunks[3] ); // keeps the REF -> 999 substitution
   }

   /*now begin procedure to remove duplicate lines and total quantities*/

   read.close();
   cin.get();

}

iamthwee · Answer 8 · 2010-06-02T02:39:46+00:00

Thanks I'll test this out to see if the output is what I want.

I.e order preserved with duplicates removed and quantities totalled up later.

If you pass, you get some green candy. LOL

mrnutty 761 Senior Poster · Answer 9 · 2010-06-02T04:12:53+00:00

Use the property of sets. It only inserts elements that are unique. That means
no same element will exist in the container.

Do something like this. Note not compiled :

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <set>

struct MyData{
 //bunch of data
};
bool operator<(const MyData& lhs, const MyData& rhs){ return true; }
std::istream& operator >>(std::istream& istream, MyData& data){
  //read in data here
  return istream;
}

std::ostream& writeData(std::ostream& out, const std::set<MyData>& data){
 //write data here
 return out;
}
int main(){
 std::set<MyData> fileData;
 std::ifstream fileReader( "text.txt" );
 //read from file and insert into my set
 std::copy(std::istream_iterator<MyData>(fileReader), //begin reading
           std::istream_iterator<MyData>(),           //end reading
           std::inserter(fileData,fileData.begin()) );//insert the data that was read into the set
 writeData(std::cout,fileData); //print the data read
}

Ancient Dragon 5,243 Achieved Level 70 Team Colleague Featured Poster · Answer 10 · 2010-06-02T05:27:12+00:00

std::set sorts the data, which is not what the op wants. The order of the data within the container needs to be preserved.
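A quick way to check this is the minimal sketch below. The function name setIterationOrder is made up for illustration. It puts keys into a std::set<std::string> and reads them back; iteration follows the key ordering, not the order of insertion.

```cpp
#include <set>
#include <string>
#include <vector>

// Returns the distinct keys in the order std::set iterates over them,
// which is the comparison order (operator< on std::string here).
std::vector<std::string> setIterationOrder(const std::vector<std::string>& keys)
{
    std::set<std::string> unique(keys.begin(), keys.end());
    return std::vector<std::string>(unique.begin(), unique.end());
}
```

Fed the part numbers from the original post as strings, the result starts with 189657 and 319330767 even though 371336959 was inserted first.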

mrnutty 761 Senior Poster · Answer 11 · 2010-06-02T08:29:59+00:00

std::set sorts the data, which is not what the op wants. The order of the data within the container needs to be preserved.

Well, I'm not completely sure, but since sets are implemented as a binary tree, and
the compare function that I listed just returns true, is the order not preserved? Because
all the set is going to do is append the data to the end of the list, and thus act like
a linked list. But while appending it to the end of the list, it still preserves the
uniqueness property. I may be wrong though.

Ancient Dragon 5,243 Achieved Level 70 Team Colleague Featured Poster · Answer 12 · 2010-06-02T09:34:41+00:00

Nope -- set inserts the new data in sorted order. Using the program below and the file data in the original post the results are this:

189657          3
319330767               1
371336959               2
449127604               1
808635064               2
Press any key to continue . . .

IMHO a <MAP> would be better than <SET> because I had to typecast out the const in order to update the quantity field of MyData structure that is already in the set. I don't like doing that.

#include <cstdlib>
#include <iterator>
#include <iostream>
#include <string>
#include <sstream>
#include <fstream>
#include <iomanip>
#include <algorithm>
#include <set>
using namespace std;



struct MyData{
   string part;
   string a;
   string b;
   int quantity;
};


bool operator<(const MyData& lhs, const MyData& rhs)
{
    std::string lhsa, rhsa;
    lhsa = lhs.part + lhs.a + lhs.b;
    rhsa = rhs.part + rhs.a + rhs.b;

    return (lhsa < rhsa);
}
bool operator==(const MyData& lhs, const MyData& rhs)
{
    if( lhs.a == rhs.a && lhs.b == rhs.b)
        return true;
    return false; 
}




void display(std::set<MyData>& fileData)
{
   std::set<MyData>::iterator it = fileData.begin();
   for(; it != fileData.end(); it++)
   {
       cout << it->part << "\t\t" << it->quantity << '\n';

   }
}

int main()
{
    std::string line;
 std::set<MyData> fileData;
 std::ifstream fileReader( "text.txt" );
 if( !fileReader.is_open())
 {
     cout << "Open failed\n";
     return 1;
 }
 while( getline(fileReader,line) )
 {
     MyData d;
     std::string c;
     stringstream str(line);
     getline(str,d.part, ',');
     getline(str,d.a, ',');
     getline(str,d.b, ',');
     getline(str,c, ',');
     if( c == "REF" )
         c = "999";
     d.quantity = atol(c.c_str());
     std::set<MyData>::iterator it = fileData.find(d);
     if( it == fileData.end())
         fileData.insert(d);
     else
     {
         const_cast<int&>(it->quantity) += d.quantity; // quantity is not part of the ordering key
     }

    }
    display(fileData);
}
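For comparison, here is a rough sketch of the map alternative mentioned above. It is not code from the thread: the names PartQty, mergeByPart and firstSeen are invented for the example, and it assumes the rows have already been parsed into (part, quantity) pairs. A map's mapped value can be updated in place, so no cast is needed. Because the map itself is still ordered by key, a separate vector records the order in which parts first appear.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> PartQty;   // (part number, quantity)

// Sums quantities per part and returns one entry per part,
// in the order the parts were first seen.
std::vector<PartQty> mergeByPart(const std::vector<PartQty>& rows)
{
    std::map<std::string, int> totals;     // part -> running total
    std::vector<std::string> firstSeen;    // parts in order of first appearance
    for (std::size_t i = 0; i < rows.size(); ++i)
    {
        if (totals.find(rows[i].first) == totals.end())
            firstSeen.push_back(rows[i].first);
        totals[rows[i].first] += rows[i].second;   // operator[] starts new entries at 0
    }
    std::vector<PartQty> merged;
    for (std::size_t i = 0; i < firstSeen.size(); ++i)
        merged.push_back(PartQty(firstSeen[i], totals[firstSeen[i]]));
    return merged;
}
```

This is close in spirit to keeping a vector of records plus a map from part to vector index; the difference is only where the running total lives.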

vijayan121 1,152 Posting Virtuoso · Answer 13 · 2010-06-02T10:25:31+00:00

This is how I would do it. O(N log N) time.

a. Keep track of the line number as the lines are read into a vector.
b. Sort on the string (part that is to be checked for duplicates)
c. Partition into unique and non-unique lines (on the string)
d. Sort the unique lines on line number

#include <fstream>
#include <string>
#include <sstream>
#include <vector>
#include <algorithm>

struct csv
{
    int line_number ;
    std::string first ;
    std::string rest ;

    csv( const std::string& line )
    {
        static int n = 0 ;
        line_number = ++n ;
        std::istringstream stm( line ) ;
        std::getline( stm, first, ',' ) ;
        std::getline( stm, rest ) ;
    }

    bool operator == ( const csv& that ) const
    { return rest == that.rest ; }

    bool operator < ( const csv& that ) const
    { return rest < that.rest ; }
};

struct cmp_line_number
{
    bool operator() ( const csv& first, const csv& second ) const
    { return first.line_number < second.line_number ; }
};

int main()
 {
     std::vector< csv > csvs ;
     {
       std::string line ;
       std::ifstream file_in( "has_dups.csv" ) ;
       while( std::getline( file_in, line ) ) csvs.push_back(line) ;
     }

     std::stable_sort( csvs.begin(), csvs.end() ) ; // stable, so std::unique keeps the earliest of each run of duplicates
     typedef std::vector< csv >::iterator iterator ;
     iterator end = std::unique( csvs.begin(), csvs.end() ) ;
     std::sort( csvs.begin(), end, cmp_line_number() ) ;

     std::ofstream file_out( "no_dups.csv" ) ;
     for( iterator iter = csvs.begin() ; iter != end ; ++iter )
        file_out<< iter->first << ',' << iter->rest << '\n' ;
 }

Aranarth 126 Posting Whiz in Training · Answer 14 · 2010-06-02T12:15:31+00:00

This is how I would do it. O(N log N) time.
a. Keep track of the line number as the lines are read into a vector.
b. Sort on the string (part that is to be checked for duplicates)
c. Partition into unique and non-unique lines (on the string)
d. Sort the unique lines on line number

The idea was not to eliminate duplicate lines, but to merge lines with the same part number.
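The sort-based idea can still be bent to do the merging. The following is only a sketch under assumptions: the Row struct and the function names are invented for the example, and the rows are taken to be already parsed with their input position recorded in line. Sorting by part (ties broken by line) puts each part's rows next to each other with the earliest first, so the quantities can be accumulated into that earliest row before sorting back by line.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Row
{
    std::size_t line;    // position in the input, used to restore the order
    std::string part;
    int quantity;
};

bool byPartThenLine(const Row& a, const Row& b)
{
    if (a.part != b.part) return a.part < b.part;
    return a.line < b.line;
}

bool byLine(const Row& a, const Row& b) { return a.line < b.line; }

// Merges rows with the same part: quantities are summed into the earliest
// occurrence, and the result is returned in original input order.
std::vector<Row> mergeSorted(std::vector<Row> rows)
{
    std::sort(rows.begin(), rows.end(), byPartThenLine);
    std::vector<Row> merged;
    for (std::size_t i = 0; i < rows.size(); ++i)
    {
        if (!merged.empty() && merged.back().part == rows[i].part)
            merged.back().quantity += rows[i].quantity;   // same part: accumulate
        else
            merged.push_back(rows[i]);                    // earliest row for this part
    }
    std::sort(merged.begin(), merged.end(), byLine);
    return merged;
}
```

Both sorts are O(N log N) and the merge pass is linear, so the overall cost stays at O(N log N).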

iamthwee · Answer 15 · 2010-06-02T15:26:59+00:00

Aranarth you get the big prize.

Thanks.