Source: DaniWeb, 2013-10-05, https://www.daniweb.com/programming/software-development/code/463893/find-duplicate-words-in-a-text-python

Find duplicate words in a text (Python)

vegaseat

A simple way to find duplicate words in a text. The text is first preprocessed to strip punctuation marks and convert all words to lower case, so that "You" and "you" count as the same word.

''' Count_find_duplicate_words101.py
find duplicate words in a text (preprocessed)
using Counter() from the Python module collections and set()
following a tip from raymondh
tested with Python27, IronPython27 and Python33  by vegaseat  24sep2013
'''

from string import punctuation
from collections import Counter

# sample text for testing
text = """\
If you see a turn signal blinking on a car with a southern license plate,
you may rest assured that it was on when the car was purchased."""

# preprocess text, remove punctuation marks and change to lower case
text2 = ''.join(c for c in text.lower() if c not in punctuation)

# text2.split() splits text2 at whitespace and returns a list of words
word_list = text2.split()

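# Counter subtraction keeps only positive counts, so removing one
# occurrence of each distinct word leaves just the repeated words
duplicate_word_list = sorted(Counter(word_list) - Counter(set(word_list)))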

# show result
print("Original text:")
print(text)
print('-'*72)
print("A list of duplicate words in the text:")
print(duplicate_word_list)

''' result ...
Original text:
If you see a turn signal blinking on a car with a southern license plate,
you may rest assured that it was on when the car was purchased.
------------------------------------------------------------------------
A list of duplicate words in the text:
['a', 'car', 'on', 'was', 'you']
'''
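
To see why the subtraction trick works, here is a minimal sketch on a tiny list. The sample words and the variable names are made up for illustration. Counter subtraction drops any entry whose count falls to zero or below, so taking one occurrence away from every distinct word leaves only the words that occurred more than once.

```python
from collections import Counter

# hypothetical sample data, not from the original snippet
words = ["a", "b", "a", "c", "a", "b"]

counts = Counter(words)          # a: 3, b: 2, c: 1
once_each = Counter(set(words))  # every distinct word counted once

# subtraction keeps only positive counts, so 'c' (1 - 1 = 0) disappears
extra = counts - once_each

print(sorted(extra))   # ['a', 'b']
print(extra["a"])      # 2, the number of repeats beyond the first
```

Sorting the resulting Counter iterates over its keys, which is why the snippet above ends up with a plain alphabetical list of words.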

Neat.
Personally I would try:

duplicate_word_list = [word for word, count in Counter(word_list).most_common() if count > 1]

Just because I am used to using most_common with Counter.
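
One difference worth knowing: most_common() orders by count, so the result is not alphabetical like the sorted() version. Here is a rough sketch comparing the two on the sample text from the snippet. The helper names dupes_sorted and dupes_by_count are made up for this comparison.

```python
from collections import Counter
from string import punctuation

text = """\
If you see a turn signal blinking on a car with a southern license plate,
you may rest assured that it was on when the car was purchased."""

# same preprocessing as the original snippet
word_list = ''.join(c for c in text.lower() if c not in punctuation).split()

def dupes_sorted(words):
    """Alphabetical list of words that occur more than once."""
    return sorted(Counter(words) - Counter(set(words)))

def dupes_by_count(words):
    """Words that occur more than once, most frequent first."""
    return [word for word, count in Counter(words).most_common() if count > 1]

print(dupes_sorted(word_list))
print(dupes_by_count(word_list))
# both contain the same words, only the order may differ
print(set(dupes_sorted(word_list)) == set(dupes_by_count(word_list)))
```

If you need the counts as well, the most_common() version is the easier one to extend, since each word arrives together with its count.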
