Hey, I have a really big file on my computer; the file only has words in it. Because the file is so big, I'm sure there are repeated words in there. Is there a way of deleting every repeated word, leaving just one copy of each? Each word is on its own line, so the script will have to compare lines and delete the repeated ones, keeping one of each. Thanks, Dan08.

Something like this?

def uniquelines(lineslist):
    unique = {}    # stripped lines already seen
    result = []
    for item in lineslist:
        if item.strip() in unique: continue
        unique[item.strip()] = 1
        result.append(item)    # keep only the first occurrence
    return result

file1 = open("wordlist.txt","r")
filelines = file1.readlines()
file1.close()
output = open("wordlist_unique.txt","w")
output.writelines(uniquelines(filelines))
output.close()

This should do it. I did not test it on any files, so there might be an error.

infile = open("filename.txt", "r")
wordsDict = {}

for line in infile:
    word = line.strip()
    if word not in wordsDict:   # dict lookup instead of scanning every key
        wordsDict[word] = None
infile.close()

outfile = open("outfile.txt", "w")
for word in wordsDict:
    outfile.write(word + '\n')
outfile.close()

Fast method, using sets:

lines=open(myfile,'r').readlines()
uniquelines=set(lines)
open(outfile,'w').writelines(uniquelines)

which can be done in only one line:

open(outfile,'w').writelines(set(open(myfile,'r').readlines()))

The drawback of using sets this way is that the lines may not be written in the same order as they appear in the input file. There are two functions, unique_everseen and unique_justseen, in the itertools recipes (http://docs.python.org/py3k/library/itertools.html#recipes) which produce the items in order, for example:

# python 2.6
>>> from itertoolrecipes import unique_everseen
>>> list(unique_everseen("ABAHHRHHHJAKJKAJKLHKLDSHQIUHUI"))
['A', 'B', 'H', 'R', 'J', 'K', 'L', 'D', 'S', 'Q', 'I', 'U']
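
If you do not want to keep the recipes in a separate module, a minimal order-preserving generator built on the same idea could look like this sketch (the file names here are only placeholders):

def unique_everseen(iterable):
    # yield each element the first time it is seen, preserving order
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element

# apply it to the word list (file names are just examples)
infile = open("wordlist.txt", "r")
outfile = open("wordlist_unique.txt", "w")
outfile.writelines(unique_everseen(infile))
infile.close()
outfile.close()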
Without looking at the previously posted code in detail, this is my version:

inputfile = open("input.txt")
unique = []
for line in inputfile:
    line = line.strip()
    if line not in unique:
        unique.append(line)

inputfile.close()
# put the newlines back before writing
for i in range(0, len(unique)):
    unique[i] += "\n"

output = open("output.txt", "w")
output.writelines(unique)
output.close()

OK, here is my try, with a command-line interface. To save the output to a file, redirect stdout.

## sorted unique lines printer (redirect to file to save output)
## inside python use wordlist function
import os
from sys import argv
from collections import defaultdict

def wordlist(f):
    d = defaultdict(bool)
    if os.path.isfile(f):
        k = open(f,'r').readlines()
        for w in k:
            d[w] = True
    return sorted(d.keys()) ## if we want to sort the file to organize better

if len(argv)>1 and os.path.isfile(argv[1]):
    print "".join(wordlist(argv[1]))
else: print 'Usage:', os.path.basename(argv[0]), 'name_of_the_file'

""" Test in command line:
D:\Tony>sort test.txt| uniq | wc
   9187    9187  123757

D:\Tony>un test.txt >unique.txt

D:\Tony>wc unique.txt
  9187   9186 123757 unique.txt

D:\Tony>"""