Hello Everyone!

I am a beginner user of Python without IT background (I am interested in IT developing my knowledge step by step). So far I did simple scripts in Python and I encountered a problem. I was looking on Google, however I didn't find the satisfactory answers which would help me - Therfore, I decided to write here.

I'm struggling to identify duplicates in CSV file. My CSV file contains contacts from the database. Every column corresponds to particular data (name, surname, job title, company, email, contact ID etc.). In this database there are duplicates and I want to read through this database and identify duplicated emails. If script finds the duplicated email, then it writes email address, contact ID to a separate file - creating new file which will consist of duplicates only. Does anyone could help me with this task? I would much appreciate your help.

Regards,
mgunia

Recommended Answers

All 8 Replies

You should use a dictionary to store the e-mails in a dictionary if it is not already there. If it is there, then it is a duplicate. I would suggest that you store the original record and the duplicate because they may be different and you would want to keep the correct one, not the first one found.

this is not a problem because spotted duplicates will be marked as "email duplicate" in the database and accordingly skipped. So no information would be missed. The problem for me is to operate the dictionary. I don't know how I should write the script which says "if email duplicate add to the file - and add only contact ID and email address".

When you identify a duplicate, let's say there are 2, do both get written to the file or just the one that wasn't first?

Dear Rrashkin, the problem is I don't know how should I identify duplicates. As I wrote I'm quite new to Python and I have no IT background (so I can't compare it with the other programming languages). I assume I should make a loop and "tell" Python to compare every record with the whole database - similar like in Excel using countif function where there is Range and Criteria. But how should I clarify that my "criteria" would be the email address (which is a 26th column in my database) and it should be compared to the "Range" which would be the whole database?

My question really was after you identify a duplicate (never mind for now how you do that), then you have 2 records with the same, let's say, email field value. What next? Do you write both records to the new file? Just one (which one)?

This may give you a few hints:

''' csv_rread102.py
find duplicate email addresses in a csv file
'''

# csv type test data
# name, surname, job_title, company, email
csv_data = '''\
Arden,Adam,clerk,ACME Tools,aaden@gmail.com
Bison,Bert,manager,Ideal Plumbing,bertbison@idel.com
Clark,Clara,assistant,ACME Tools,clarkc@gmail.com
Arden,Adam,supervisor,ACME Tools,aaden@gmail.com
Clark,Clara,receptionist,ACME Homes,clarkc@gmail.com
'''

import csv

fname = "aaa_test7.csv"
# write the test data file
with open(fname, "w") as fout:
    fout.write(csv_data)

# read the test data file back in
with open(fname, "r") as fin:
    reader = csv.reader(fin)
    # create a dictionary of email:frequency pairs
    email_freq = {} 
    for row in reader:
        print(row)  # test
        name, surname, job_title, company, email = row 
        email_freq[email] = email_freq.get(email, 0) + 1

print('-'*70)

# refresh reader object
with open(fname, "r") as fin:
    reader = csv.reader(fin)
    # make a list of all rows that have duplicate emails
    duplicate_emails = []
    for row in reader:
        name, surname, job_title, company, email = row
        # add row to list if email frequency is above 1
        if email_freq[email] > 1:
            duplicate_emails.append(row)


# show results
import pprint
pprint.pprint(email_freq)
print('-'*70)
pprint.pprint(sorted(duplicate_emails))

''' my result >>>
['Arden', 'Adam', 'clerk', 'ACME Tools', 'aaden@gmail.com']
['Bison', 'Bert', 'manager', 'Ideal Plumbing', 'bertbison@idel.com']
['Clark', 'Clara', 'assistant', 'ACME Tools', 'clarkc@gmail.com']
['Arden', 'Adam', 'supervisor', 'ACME Tools', 'aaden@gmail.com']
['Clark', 'Clara', 'receptionist', 'ACME Homes', 'clarkc@gmail.com']
----------------------------------------------------------------------
{'aaden@gmail.com': 2, 'bertbison@idel.com': 1, 'clarkc@gmail.com': 2}
----------------------------------------------------------------------
[['Arden', 'Adam', 'clerk', 'ACME Tools', 'aaden@gmail.com'],
 ['Arden', 'Adam', 'supervisor', 'ACME Tools', 'aaden@gmail.com'],
 ['Clark', 'Clara', 'assistant', 'ACME Tools', 'clarkc@gmail.com'],
 ['Clark', 'Clara', 'receptionist', 'ACME Homes', 'clarkc@gmail.com']]
'''

Thank you for your post Ene Uran. The code works for me when it is as the original. However, I have the existing CSV file with the whole database which this script should read through. The problem is, when I replace typed values in csv_data to the path of a csv file it returns an error about not enough number of values (defined in line 23). I assume that this part - "name, surname, job_title, company, email = row" is creating this problem and it should be replaced for something appropriate.

My file is full of different columns. However, I need only a column number 1 (contact ID) and column number 25 (email address). How should I amend this code to get the final output only with those 2 columns?

Remember row index starts at zero, so use for instance ...
contact_ID = row[0]
email = row[25]
rather than unpacking the whole row.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.