Help needed, please. Desperate. I have a 3.something GB file that contains records that look like this:

<tag>
<number>1</number>
<info>blah blah</info>
<more>Lorem Ipsum</more>
<anothertag>The quick brown fox...</anothertag>
<id>32444</id>
<yetmore>blah blah</yetmore>
</tag>
<tag>
<number>2</number>
<info>qwerty</info>
<more>yada yada qwerty</more>
<anothertag>yada yada qwerty</anothertag>
<id>32344</id>
<yetmore>yada yada qwerty</yetmore>
</tag>
<tag>
<number>3</number>
<info>yada yada qwerty</info>
<more>yada yada qwerty</more>
<anothertag>yada yada</anothertag>
<whatever>yada</whatever>
<id>32444</id>
<yetmore>yada yada</yetmore>
</tag>

I need to find the records that contain duplicate <id> tags. I was thinking of reading all the <id> tags into a list, finding the duplicates somehow, then iterating over the file again and removing them that way, but basically I'm not sure what to do. I'll start something and post back later, but any help would be appreciated.

#!/usr/local/bin/python2.6
import re

# Compile the pattern once instead of on every line of the file.
id_re = re.compile(r'^<id>(.*?)</id>')

list_ = []
with open('dups.txt', 'r') as thefile_:
    for line_ in thefile_:
        match_id = id_re.search(line_)
        if match_id is not None:
            list_.append(match_id.group(1))

idcount={}
for i in list_:
    idcount[i] = idcount.get(i, 0) + 1

dups={}
for k, v in idcount.iteritems():
    if v > 1:
        dups[k] = v

for key, value in dups.iteritems():
    print key, value

Hacked this together. At least now I can tell which ids are duplicates and how often they appear. The above outputs "32444 2". Not sure how this helps me yet :(
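As an aside, the counting step can be written a bit more compactly with collections.defaultdict (available since Python 2.5); collections.Counter would do the same job on 2.7+. A quick sketch using the sample ids from above:

from collections import defaultdict

# defaultdict(int) starts every missing key at 0, so no .get(key, 0) is needed.
idcount = defaultdict(int)
for i in ['32444', '32344', '32444']:    # e.g. the ids collected above
    idcount[i] += 1

dups = dict((k, v) for k, v in idcount.iteritems() if v > 1)
print dups    # --> {'32444': 2}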

As your data file looks like XML, I obtained some results very easily using the xml.parsers.expat module. Here is the code. It uses the adaptstrings code snippet from http://www.daniweb.com/software-development/python/code/389929 (this could be optimized later).

from itertools import chain
from adaptstrings import adapt_as_file
import xml.parsers.expat

PREFIX = "<document>\n"
SUFFIX = "</document>\n"

def byte_index(p):
    return p.CurrentByteIndex - len(PREFIX)

class glo:
    IN_ID = False

# 3 handler functions
def start_element(name, attrs):
    if name in ('tag', 'id'):
        print '<%s>' % name, attrs,"at byte", byte_index(p)
    if name == 'id':
        glo.IN_ID = True
def end_element(name):
    if name in ('tag', 'id'):
        print '</%s>' % name, "at byte", byte_index(p)
    if name == 'id':
        glo.IN_ID = False
def char_data(data):
    if glo.IN_ID:
        print 'Data:', repr(data)

p = xml.parsers.expat.ParserCreate()

p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data

p.ParseFile(adapt_as_file(chain((PREFIX,), open("dups.xml"), (SUFFIX,))))

""" my output -->

<tag> {} at byte 0
<id> {} at byte 121
Data: u'32444'
</id> at byte 130
</tag> at byte 165
<tag> {} at byte 172
<id> {} at byte 289
Data: u'32344'
</id> at byte 298
</tag> at byte 340
<tag> {} at byte 347
<id> {} at byte 493
Data: u'32444'
</id> at byte 502
</tag> at byte 537
"""

The file "dups.xml" contained the exact data that you wrote above. As you can see, the parser gives the byte position in the file where the relevant tags are found. It can also give the line and column numbers. This code is only a first draft, but it could be a reasonable starting point.

I need to find the records that contain duplicate <id> tags.

Which duplicate records do you want to delete, the first or last? In any case, with a 3GB file you probably want to iterate over the file one record at a time instead of trying to read it into memory. And just for clarity, we are assuming that one record group = from <tag> to </tag>, so you first want to store one record group and either save the id number and write the group to a file, or ignore the group if you want to delete it.
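To make that concrete, here is a rough, untested sketch of such a loop. It assumes <tag>, </tag> and <id> only ever appear as record delimiters, that every record carries an <id>, and that keeping the first occurrence of each id is what you want; the file names are just placeholders:

import re

id_re = re.compile(r'<id>(.*?)</id>')

def record_groups(filename):
    """Yield (id, lines) for each <tag>...</tag> group, one group at a time."""
    group, rec_id = [], None
    with open(filename) as ifh:
        for line in ifh:
            group.append(line)
            m = id_re.search(line)
            if m is not None:
                rec_id = m.group(1)
            if '</tag>' in line:
                yield rec_id, group
                group, rec_id = [], None

seen = set()
with open('dedup.txt', 'w') as out:
    for rec_id, group in record_groups('dups.txt'):
        if rec_id not in seen:       # keep only the first record for each id
            seen.add(rec_id)
            out.writelines(group)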

Also notice that you could perhaps obtain better performance by reading larger chunks of data at a time (say 100 KB or 1 MB), because the previous code reads line by line. Here are the necessary changes:

from adaptstrings import adapt_as_opener

CHUNK = 2 ** 17 # 128 KB

@adapt_as_opener
def source(filename, mode='r'):
    yield PREFIX
    with open(filename, mode) as ifh:
        while True:
            s = ifh.read(CHUNK)
            if s:
                yield s
            else:
                break
    yield SUFFIX

p.ParseFile(source("dups.xml"))

I wonder about the same thing as woooee.
Using a parser such as BeautifulSoup or lxml can be better for this task.
A quick demo:

from BeautifulSoup import BeautifulStoneSoup

xml = '''\
<tag>
<number>1</number>
<info>blah blah</info>
<more>Lorem Ipsum</more>
<anothertag>The quick brown fox...</anothertag>
<id>32444</id>
<yetmore>blah blah</yetmore>
</tag>
<tag>
<number>2</number>
<info>qwerty</info>
<more>yada yada qwerty</more>
<anothertag>yada yada qwerty</anothertag>
<id>32344</id>
<yetmore>yada yada qwerty</yetmore>
</tag>
<tag>
<number>3</number>
<info>yada yada qwerty</info>
<more>yada yada qwerty</more>
<anothertag>yada yada</anothertag>
<whatever>yada</whatever>
<id>32444</id>
<yetmore>yada yada</yetmore>
</tag>'''

soup = BeautifulStoneSoup(xml)
tag = soup.findAll('id')
print tag
#--> [<id>32444</id>, <id>32344</id>, <id>32444</id>]

I don't think BeautifulSoup will work here, as it loads the whole parse tree into memory. I'm not sure that can be avoided.

Wow, thanks for all the input. I have to admit I'm quite new to Python and I don't really come from a programming background, so there are a few concepts here that are completely new to me; please be patient with me. At the moment I have about 10 different things I'm working on, which doesn't help either. A few things:

I was always under the impression (probably mistakenly) that XML parsers are more of a tool for validation. I'll investigate this nonetheless, as I'll be dealing with more XML in the future, so anything that makes my life easier is a bonus!

Yes, the records span from <tag> to </tag>. I have an idea which I will be testing out soon. I'll post back in a bit.

This works on my 544B file. I'm going to try it on the big one now.

<tag>
<number>1</number>
<info>blah blah</info>
<more>Lorem Ipsum</more>
<anothertag>The quick brown fox...</anothertag>
<id>32444</id>
<yetmore>blah blah</yetmore>
</tag>

<tag>
<number>2</number>
<info>qwerty</info>
<more>yada yada qwerty</more>
<anothertag>yada yada qwerty</anothertag>
<id>32344</id>
<yetmore>yada yada qwerty</yetmore>
</tag>

<tag>
<number>3</number>
<info>yada yada qwerty</info>
<more>yada yada qwerty</more>
<anothertag>yada yada</anothertag>
<whatever>yada</whatever>
<id>32444</id>
<yetmore>yada yada</yetmore>
</tag>

#!/usr/local/bin/python2.6
import re

# Compile the pattern once; thelist_ holds the lines of the current record,
# idlist_ the ids seen so far.
id_re = re.compile(r'^<id>(.*?)</id>')

thelist_ = []
idlist_ = []

with open('dups.txt', 'r') as thefile_:
    dedup_ = open('dedup.txt', 'w')
    recsremoved_ = open('removed.txt', 'w')
    for line_ in thefile_:
        match_id = id_re.search(line_)
        if match_id is not None:
            idtag_ = match_id.group(1)

        if '</tag>' in line_:
            thelist_.append(line_)
            if idtag_ not in idlist_:
                # First time this id appears: keep the whole record.
                for item in thelist_:
                    dedup_.write(item)
                idlist_.append(idtag_)
            else:
                # Duplicate id: divert the record to removed.txt.
                for item in thelist_:
                    recsremoved_.write(item)

            thelist_ = []
        else:
            thelist_.append(line_)
    dedup_.close()
    recsremoved_.close()

Results in dedup.txt:

<tag>
<number>1</number>
<info>blah blah</info>
<more>Lorem Ipsum</more>
<anothertag>The quick brown fox...</anothertag>
<id>32444</id>
<yetmore>blah blah</yetmore>
</tag>
<tag>
<number>2</number>
<info>qwerty</info>
<more>yada yada qwerty</more>
<anothertag>yada yada qwerty</anothertag>
<id>32344</id>
<yetmore>yada yada qwerty</yetmore>
</tag>

And removed.txt:

<tag>
<number>3</number>
<info>yada yada qwerty</info>
<more>yada yada qwerty</more>
<anothertag>yada yada</anothertag>
<whatever>yada</whatever>
<id>32444</id>
<yetmore>yada yada</yetmore>
</tag>

Does this look ok?

It won't work on the big file because you can't store that much data in memory. I'm going to post some code to create an index file first.

This code should run against the big file and create a shorter index file which you can use to find duplicate ids and select which items you want to rewrite.

from adaptstrings import adapt_as_opener
import xml.parsers.expat

KB = 1024
MB = 1024 * KB

class IndexCreator(object):
    PREFIX = "<document>\n"
    SUFFIX = "</document>"

    def __init__(self):
        self.parser = None
        self.in_id = None
        self.ofh = None
        self.state = None
        
    def byte_index(self):
        return self.parser.CurrentByteIndex - len(self.PREFIX)
    
    def run(self, input_filename, output_filename, chunk_size):
        self.parser = p = xml.parsers.expat.ParserCreate()
        self.state = 0
        with open(output_filename, "w") as ofh:
            self.ofh = ofh
            self.in_id = False
            p.StartElementHandler = self.start_element
            p.EndElementHandler = self.end_element
            p.CharacterDataHandler = self.char_data
            p.ParseFile(self.source(input_filename, chunk = chunk_size))
            self.eof_byte = self.byte_index() - len(self.SUFFIX)
            ofh.write("%d\teof\n" % self.eof_byte)



    def start_element(self, name, attrs):
        if name == 'tag':
            assert self.state == 0
            self.ofh.write(str(self.byte_index()))
            self.state = 1
        elif name == 'id':
            assert self.state == 1
            self.in_id = True
            self.state = 2
            
    def end_element(self, name):
        if name == 'tag':
            assert self.state == 4
            self.state = 0
        elif name == 'id':
            assert self.state == 3
            self.in_id = False
            self.state = 4
            
    def char_data(self, data):
        if self.in_id:
            assert self.state == 2
            self.ofh.write('\t%d\n' % int(data))
            self.state = 3

    @adapt_as_opener
    def source(self, filename, mode='r', chunk= 1 * MB):
        yield self.PREFIX
        with open(filename, mode) as ifh:
            while True:
                s = ifh.read(chunk)
                if s:
                    yield s
                else:
                    break
        yield self.SUFFIX

if __name__ == "__main__":
    ic = IndexCreator()
    ic.run("dups.txt", "index.txt", 8 * MB)

""" After running this, the content of index.txt is

0       32444
172     32344
347     32444
543     eof

The first column contains the byte offset of each item, the second column the item's id
"""

What is the size of index.txt after a run?

It occurred to me that if the previous code is too slow, we can try to speed it up by writing the index file in 1 MB chunks. For this you can use this class:

from cStringIO import StringIO

class Output(object):
    def __init__(self, ofh, chunk = 1 * MB):
        self.ofh = ofh
        self.io = StringIO()
        self.sz = 0
        self.chunk = chunk
        
    def write(self, s):
        self.io.write(s)
        self.sz += len(s)
        if self.sz > self.chunk:
            self.ofh.write(self.io.getvalue())
            self.io = StringIO()
            self.sz = 0
            
    def flush(self):
        if self.sz:
            self.ofh.write(self.io.getvalue())
            self.io = None
            self.sz = 0
        self.ofh.flush()
            
    def __enter__(self):
        return self
        
    def __exit__(self, *args):
        self.flush()

And replace the run() method in class IndexCreator with:

def run(self, input_filename, output_filename, chunk_size):
    self.parser = p = xml.parsers.expat.ParserCreate()
    self.state = 0
    with open(output_filename, "w") as output:
        with Output(output) as ofh:
            self.ofh = ofh
            p.StartElementHandler = self.start_element
            p.EndElementHandler = self.end_element
            p.CharacterDataHandler = self.char_data
            p.ParseFile(self.source(input_filename, chunk = chunk_size))
            self.eof_byte = self.byte_index() - len(self.SUFFIX)
            ofh.write("%d\teof\n" % self.eof_byte)

I really appreciate your help with this. I have yet to figure out how classes work. They make my head hurt. I'll give this a go and report back later. Thanks

Your code looks like it should work fine. Post back if you run into further problems.

Hi woooee. It does work, but it's grinding to a crawl. I've trialed it on an 880 MB file, and in the five and a half hours since I started it has churned out just over 660 MB. I've been distracted with other things today so I haven't looked at the parsing stuff. I'll have to fiddle with this on Monday again.

You must not read the file line by line. For 3 GB, if each line is 70 bytes long, that means about 50 million calls to read(). I don't know how many operations that means on your hard drive, but it is probably evil. You must read the input file in large chunks, and in the same way, write the output file in large chunks. The advantage of my file adapter is that it allows you to transparently read the file in arbitrarily sized chunks.

In addition, the slowdown also comes from using a list. Use a set or a dictionary to store and look up the ids; it is something like 1000 times faster.
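A minimal sketch of that change, with a hypothetical keep_record() helper standing in for the test in your loop:

# Membership tests on a set are roughly constant time, while
# "idtag_ not in idlist_" rescans the whole list for every record.
idset_ = set()

def keep_record(idtag_):
    """Return True the first time an id is seen, False for duplicates."""
    if idtag_ in idset_:
        return False
    idset_.add(idtag_)
    return True

print keep_record('32444')    # --> True  (first occurrence, keep the record)
print keep_record('32344')    # --> True
print keep_record('32444')    # --> False (duplicate, divert it to removed.txt)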

Here is a tested version of my code. It generated an 11 MB index file from a 136 MB input file on my machine in 30 seconds, meaning you can expect about 15 minutes for 4 GB. Note that the input data was obtained by duplicating the data you provided. In your actual file, the 'Lorem Ipsum' and 'blah blah' parts may be longer, which would mean a relatively smaller index file.

The input file was read by chunks of 32 MB. You can probably increase this value.

I think writing the index file is a very useful step. The index file will be much shorter than your input file, and it contains everything you need to sort and select items and prepare the output step without the need to read the big file every time you want to test the program.
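For instance, the output step could look roughly like this. It is an untested sketch that keeps the first occurrence of each id and copies only the selected byte ranges out of the big file with seek(); the file names are placeholders, and the index format is the one produced above (offset, a tab, then the id or 'eof'):

def read_index(index_filename):
    """Yield (offset, id_string) pairs; the last pair has the id 'eof'."""
    with open(index_filename) as ifh:
        for line in ifh:
            offset, ident = line.split('\t')
            yield int(offset), ident.strip()

def write_dedup(big_filename, index_filename, output_filename):
    entries = list(read_index(index_filename))   # the index is small enough for memory
    seen = set()
    with open(big_filename, 'rb') as big:
        with open(output_filename, 'wb') as out:
            # Each record spans from its own offset to the next entry's offset.
            for (offset, ident), (next_offset, _) in zip(entries, entries[1:]):
                if ident not in seen:
                    seen.add(ident)
                    big.seek(offset)
                    out.write(big.read(next_offset - offset))

write_dedup('dups.txt', 'index.txt', 'dedup.txt')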

This new code generates the index at about 13 MB/s under a few assumptions. It should handle the 4 GB in a little more than 5 minutes. It uses this code snippet: http://www.daniweb.com/software-development/python/code/418239/1783422#post1783422

#!/usr/bin/env python
# -*-coding: utf8-*-
# Title: dups2.py
# Author: Gribouillis

"""Generate the index file with regexes and chunked input and output

    This code does not parse xml, but it assumes that:
    * records are delimited by <tag> and </tag> items, and that these items
        are only used with this meaning in the file.
    * within <tag> and </tag> sections, record's identity is delimited by <id> and </id>
        tags containing an integer value, and these items are only used with this meaning in the file.
    
    The code contains a few assert statements to check these assumptions.
"""

import re
from writechunks import MB, ChunkedOutputFile

class State:
    BASE = 0
    TAG = 1
    ID = 2
    TAGEND = 3

expected_state = {
    '<tag>': State.BASE,
    '<id>': State.TAG,
    '</id>': State.ID,
    '</tag>': State.TAGEND,
}

def next_state(state):
    return (state + 1) % 4

tag = re.compile("</?(?:tag|id)>")

def main2(input_filename, input_chunk, ofh):
    with open(input_filename) as ifh:
        state = State.BASE
        offset = 0
        last_end = 0
        id_saved = ''
        tail = ''
        while True:
            s = ifh.read(input_chunk)
            if s:
                if tail:
                    s = tail + s
            else:
                ofh.write("%d\teof\n" % (offset + len(tail)))
                return
            size = len(s)
            for match in tag.finditer(s):
                t = match.group(0)
                assert expected_state[t] == state
                last_end = match.end()
                if state == State.TAG:
                    begin_id = last_end
                elif state == State.ID:
                    id = id_saved + s[begin_id:match.start()]
                    id_saved = ''   # reset any partial id carried over from the previous chunk
                    ofh.write("%s\t%d\n" % (begin_record, int(id)))
                elif state == State.BASE:
                    begin_record = offset + match.start()
                state = next_state(state)
            # carry at most 20 trailing bytes (enough for a split tag) into the next chunk
            tail = min(20, size - last_end)
            if state == State.ID:
                id_saved = s[begin_id:-tail] if tail else s[begin_id:]
                begin_id = 0
            else:
                id_saved = ''
            tail = s[-tail:] if tail else ''
            offset += len(s) - len(tail)

def main(input_filename, output_filename, input_chunk, output_chunk):
    with open(output_filename, "w") as ofh:
        with ChunkedOutputFile(ofh, output_chunk) as out:
            main2(input_filename, input_chunk, out)

if __name__ == "__main__":
    main("dups2.txt", "index2.txt", 32 * MB, 16 * MB)

Note the id_saved = '' reset right after the id is computed: it clears the partial id carried over from the previous chunk so that it is not prepended to the next record's id.

Gribouillis thank you for all your effort. At the moment I am snowed under but this has to be done by next week so I will definitely look at it. I'm told we use lxml here at work. I definitely have to dissect what you have done here. Thanks again

lxml is very good, but I think it needs to store the whole parse tree.
