Member Avatar for cyon

What is the most efficient way to:

  1. Check if a list has duplicate items (return a boolean)
  2. Get duplicate items in a list (return a list)

I'm looking for something quick and concise, as my solutions are rather cumbersome:

def list_has_duplicate_items( L ):
	for item in L:
		if L.count(item) > 1:
			return True
	return False

def get_duplicate_items( L ):
	new = set()
	for item in L:
		if L.count(item) > 1:
			new.add( item )
	return list(new)

L = [ 'oranges' , 'apples' , 'oranges' , 'grapes' ]
print 'List: ' , L
print 'Does list have duplicate item(s)? ' , list_has_duplicate_items( L )
print 'Redundant item(s) in list: ' , get_duplicate_items( L )

"""Ouput:
List:  ['oranges', 'apples', 'oranges', 'grapes']
Does list have duplicate item(s)?  True
Duplicate item(s) in list:  ['oranges']"""

I'm wondering why there isn't a set function that will pop out the duplicate items. Also, a side question: Am I using return statements in the proper way? Thanks.

Any help greatly appreciated!

Recommended Answers

All 9 Replies

I would use something like

from collections import defaultdict

def list_has_duplicate_items(L):
    return len(L) > len(set(L))

def get_duplicate_items(L):
    D = defaultdict(int)
    for k in L:
        D[k] += 1
    return [k for (k, v) in D.items() if v > 1]

Quite same as Gribouillis but here anyway (find_dup as generator, here put to list by list()):

L = [ 'oranges' , 'apples' , 'oranges' , 'grapes', 'oranges' ]

def has_dup(li):
    return len(li) != len(set(li))

def find_dup(li):
    s= set()
    shown = set()
    for item in li:
        if item in  s - shown:
            yield item
            shown.add(item)
        else:
            s.add(item)

print 'The list: ',L
print 'Does list have duplicate item(s)? ', has_dup(L)
print 'Redundant item(s) in list: ' , list(find_dup(L))

Here is a version of get_duplicate_items as a one liner. I don't know if it is more efficient

from itertools import groupby, islice

def get_duplicate_items(iterable):
    return [k for (k, v) in groupby(sorted(iterable)) if len(tuple(islice(v, 2))) == 2]

print(get_duplicate_items('"Give me bacon and eggs," said the other man.'))

""" My output --->
[' ', '"', 'a', 'd', 'e', 'g', 'h', 'i', 'm', 'n', 'o', 's', 't']
"""

Gribouillis one liner is at least out of order, my set solution gives solution in order of occurance of the double:

sent='"Give me bacon and eggs," said the other man.'
print sent
print list(find_dup(sent)) ## see earlier post

Output:

"Give me bacon and eggs," said the other man.
['e', ' ', 'a', 'n', 'g', '"', 's', 'i', 'd', 'o', 't', 'h', 'm']

Gribouillis one liner is at least out of order, my set solution gives solution in order of occurance:

sent='"Give me bacon and eggs," said the other man.'
print sent
print list(find_dup(sent)) ## see earlier post

Output:

"Give me bacon and eggs," said the other man.
['e', ' ', 'a', 'n', 'g', '"', 's', 'i', 'd', 'o', 't', 'h', 'm']

Actually, your result is ordered according to the position of the second occurrence of each repeated item (which is also the position of first repetition).

A one liner i came upp with for fun.

print 'List has duplicate item %s' % [item for item in set(L) if L.count(item) > 1]
'''List has duplicate item ['oranges']'''

For the sake of simplicity ...

def list_has_duplicate_items( mylist ):
    return len(mylist) > len(set(mylist))

def get_duplicate_items( mylist ):
    return [item for item in set(mylist) if mylist.count(item) > 1]

mylist = [ 'oranges' , 'apples' , 'oranges' , 'grapes' ]
print 'List: ' , mylist
print 'Does list have duplicate item(s)? ' , list_has_duplicate_items( mylist )
print 'Redundant item(s) in list: ' , get_duplicate_items( mylist )

"""Ouput:
List:  ['oranges', 'apples', 'oranges', 'grapes']
Does list have duplicate item(s)?  True
Duplicate item(s) in list:  ['oranges']
"""

Well, looks like we all run on the same wavelength.

I suspect count is heavy operation and came up one more variation of the theme:

def list_has_duplicate_items( mylist ):
    return len(mylist) > len(set(mylist))

## by "manual partition" for any sequence, without probably heavy count
## item which is in the set (to avoid multiple in case of more than two)
## and which occurs after the first occurance (can not use partition, it is only for strings)
## out of order because of set()

def get_duplicate_items( mylist ):  
    return [item for item in set(mylist) if item in mylist[mylist.index(item)+1:]]

mylist = [ 'oranges' , 'apples' , 'oranges' , 'grapes' ]
print 'List: ' , mylist
print 'Does list have duplicate item(s)? ' , list_has_duplicate_items( mylist )
print 'Redundant item(s) in list: ' , get_duplicate_items( mylist )

sent='"Give me bacon and eggs," said the other man.'
print sent
print get_duplicate_items(sent)

"""Ouput:
List:  ['oranges', 'apples', 'oranges', 'grapes']
Does list have duplicate item(s)?  True
Redundant item(s) in list:  ['oranges']
"Give me bacon and eggs," said the other man.
['a', ' ', '"', 'e', 'd', 'i', 'h', 'm', 'o', 'n', 'g', 's', 't']
"""

Finally after quite much false tries, here is my code for small sequences (for big ones use dict or set), which does not use count or set and keeps the sequence in order and gives item in the place of first occurrence (duplicates in place of first double).0

Also test of set solution little simpler way with big dictionary file.

import time
## probably more efficient previous answer
def list_has_duplicate_items(mylist):
    return len(mylist) != len(set(mylist))

## alternative without set, short cut return for first double
def isdupseq(seq):
    for i,item in enumerate(seq):
        if seq[i] in seq[:i]:
            return True
    else: return False
    
## item which is in the beginning of sequence, remove duplicates, in order
def get_duplicate_items(mylist):  
    return  nodups([ mylist[i] for i in range(1,len(mylist))
             if (mylist[i] in mylist[:i]) ])

## In order!
def nodups(seq):
    return [seq[i] for i in range(len(seq))
            if not i or seq[i] not in seq[:i]
            ]

def show_dups(seq):
    s= set()
    shown=[]
    for item in seq:
        if item in s and item not in shown:
            shown.append(item)
        else: s.add(item)
    return shown

## generator using set (suitable for big sequences)

def dups_out(seq):
    s= set()
    for item in seq:
        if item not in  s:
            yield item
            s.add(item)

mylist = [ 'apples' , 'oranges' , 'grapes', 'apples' , 'grapes', 'mangos','peaches', 'oranges']
print 'List: ' , mylist
print 'Does list have duplicate item(s)? ' , list_has_duplicate_items( mylist ),', my check:',isdupseq(mylist)
print 'Redundant item(s) in list in order: ' , get_duplicate_items( mylist )
print 'Without duplicates', nodups(mylist),'or',list(show_dups(mylist))
if not isdupseq(nodups(mylist)):
    print 'Duplicates successfully removed!'

sent='"Give me bacon and eggs," said the other man.'
print sent
print get_duplicate_items(sent)
print 'Without duplicates: ',''.join(nodups(sent))

## this needs set solution or similar
t=time.clock()
print 'Letters of very big dict file: ' ## see anagram code snippet
for i in dups_out(open('list.txt','r').read()) : print i,
print
print 'Took',time.clock()-t,'s'

t=time.clock()
print 'Double letters of very big dict file: ' ## see anagram code snippet
print show_dups(open('list.txt','r').read())
print 'Took',time.clock()-t,'s'

"""Output:
List:  ['apples', 'oranges', 'grapes', 'apples', 'grapes', 'mangos', 'peaches', 'oranges']
Does list have duplicate item(s)?  True , my check: True
Redundant item(s) in list in order:  ['apples', 'grapes', 'oranges']
Without duplicates ['apples', 'oranges', 'grapes', 'mangos', 'peaches'] or ['apples', 'grapes', 'oranges']
Duplicates successfully removed!
"Give me bacon and eggs," said the other man.
['e', ' ', 'a', 'n', 'g', '"', 's', 'i', 'd', 'o', 't', 'h', 'm']
Without duplicates:  "Give mbacondgs,thr.
Letters of very big dict file: 
a 
r d v k b c i u s f t l o n e m h y g j z - p w q x ' 2 I O
Took 0.268755945239 s
Double letters of very big dict file: 
['a', 'r', '\n', 'b', 'c', 'k', 'n', 'd', 'o', 'e', 't', 's', 'm', 'h', 'l', 'i', 'y', 'v', 'u', 'g', 'j', '-', 'f', 'p', 'z', 'w', 'q', 'x', "'", '2', 'I']
Took 0.470933596309 s
"""
Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.