Hi All

Hope everyone's doing well.

I have two tab-delimited ('\t') files with 3 columns each: file1 contains 600050 rows and file2 contains 11221133 rows.

I am comparing file2 with file1 to find entries whose first two columns match: if file1[0:2] is found in file2[0:2], write file2[0:2] plus column 3; otherwise write file1[0:2] + 5.

I did this by using two dictionaries, with the elements of columns [0:2] as the key and column 3 as the value, but loading file2 into a dictionary gave a memory error:

dict2[item1,item2] = item3
MemoryError

I also tried a list, but got a memory error; a set works, but there are many duplicates and it is unordered. I want to preserve the same order as file1.

Find attached test files.

f1 = open(file1)
f2 = open(file2)   # very very large file
f3 = open(file3,'w')

dict1 = {}
dict2 = {}
for line in f1:
    lstrip = line.strip('\n')
    item1,item2,item3 = lstrip.split()
    dict1[item1,item2] = item3

for line in f2:
    lstrip = line.strip('\n')
    item1,item2,item3 = lstrip.split()
    dict2[item1,item2] = item3        # here it's giving the memory error

for item in dict1.keys():
    if item in dict2:
        data = item[0] +'\t'+ item[1] + '\t' + dict2[item] + '\n'
        f3.write(data)
    else:
        data = item[0] +'\t'+ item[1] + '\t' + str(0) + '\n'
        f3.write(data)
f1.close()
f2.close()
f3.close()

A second dictionary is not required. Create a tuple containing item1 and item2 and look up in the dictionary. If found, you have a match. If you want to keep the keys in the same order as the file, use the ordered dictionary. For Python 2.7 it is in "collections".
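Here is a minimal sketch of that idea, assuming Python 2.7 and tab-separated 3-column files; the file names are placeholders, and the '0' default comes from your own code (your description says 5, so adjust as needed):

from collections import OrderedDict

od = OrderedDict()                       # keeps keys in file1's order
with open('file1') as f1:
    for line in f1:
        item1, item2, item3 = line.split()
        od[item1, item2] = '0'           # default value until a match is found

with open('file2') as f2:                # stream the big file; store nothing from it
    for line in f2:
        items = line.split()
        key = tuple(items[:2])
        if key in od:
            od[key] = items[2]           # remember file2's third column for this key

with open('file3', 'w') as f3:
    for (item1, item2), value in od.items():
        f3.write('\t'.join([item1, item2, value]) + '\n')

Only file1's 600050 keys are held in memory; file2 is read line by line and never stored.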

Umm. Too bad that the data you need to keep is in the shorter file. Still, if there is enough room to make a set of pairs from the large file, you can do it like this:

openfile2 = open(file2,'r')
matchset = set(tuple(x.split()[:2]) for x in openfile2)
openfile1 = open(file1,'r')
openout = open(outfile,'w')
for row in openfile1:
    items = row.split()
    if tuple(items[:2]) in matchset:
        openout.write(" ".join(items[:3]) + '\n')
    else:
        openout.write(" ".join([items[0], items[1], '0']) + '\n')   # '0' is the no-match default from the original code
openout.close()
openfile1.close()
openfile2.close()

If that still causes a memory error, then you will have to do it multiple times. Split file2 into some smaller parts and do this:

  • Treat each sub-file of file 2 as the entire file2 in the algorithm above
  • Instead of writing the final result on a non-match, write the whole line
  • On the next pass, use the modified outfile as the infile:
    • Already modified lines will either be recreated or untouched: OK
    • Some unmodified lines will be matched and modified: OK
  • On the last pass, use the algorithm above, with the mostly modified outfile playing the role of file1 and the last part of file2 providing the set.

Note that this only works because you want adjacent columns if matched. In other cases, you will have to write the whole line with a first character/column flag for matches (don't look at those lines in later passes). Then make a final pass through the file, rewriting it according to the marks.
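Since your files have only three columns (so a "finalized" line looks exactly like an untouched one), the flag variant just described is probably the one you need. A rough sketch, with every name a placeholder and the '0' default taken from your own code; like the set-based code above, it keeps the current line's own third column on a match:

# One marking pass: flag matched lines with a leading 'M' column, copy the rest through.
# Assumes 'M' never appears as a real first-column value.
def mark_pass(part_path, in_path, out_path):
    with open(part_path) as part:
        matchset = set(tuple(line.split()[:2]) for line in part)
    with open(in_path) as infile, open(out_path, 'w') as out:
        for line in infile:
            items = line.split()
            if items[0] == 'M':                                    # already matched in an earlier pass
                out.write(line)
            elif tuple(items[:2]) in matchset:
                out.write('\t'.join(['M'] + items[:3]) + '\n')     # flag it, keep its columns
            else:
                out.write(line)                                    # unresolved, try again next pass

# Final cleanup pass: strip the flags and apply the default to anything never matched.
def final_pass(in_path, out_path, default='0'):
    with open(in_path) as infile, open(out_path, 'w') as out:
        for line in infile:
            items = line.split()
            if items[0] == 'M':
                out.write('\t'.join(items[1:4]) + '\n')
            else:
                out.write('\t'.join([items[0], items[1], default]) + '\n')

# Hypothetical driver: one mark_pass per piece of file2, then the cleanup pass.
current = 'file1'
for i, part in enumerate(['file2_part1', 'file2_part2']):          # your split pieces
    nxt = 'tmp%d.txt' % i
    mark_pass(part, current, nxt)
    current = nxt
final_pass(current, 'outfile')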

It's still giving a memory error :-(.

I tried using the ordered dict, but it's still giving the memory error. I tried it on a small set where it works, but the order is not preserved.

My code is like:

for row in openfile2: 
    line = row.split()
    if tuple(line[:2]) in od.keys():
        print line
    else:
         "here i want to print the key,value pair from od for which no entry in file2"

Can anyone help with this?

I tried using the ordered dict, but it's still giving the memory error

That is too vague to be of any value. Post the actual error. Also, this line
if tuple(line[:2]) in od.keys():
should just be
if tuple(line[:2]) in od:

".keys" returns a list of the keys which means there is double the amount of memory for the number of keys. If you can not do this with the memory available, then you want to use an SQLite database on disk instead.

I can only suggest again that you split the work into more manageable pieces. I tried to read that many triples from a file (generated by file.write("%s\t%s\t%s\n" % (random.random(), random.random(), random.random()))) and was unable to get much past 8 million lines (it did proceed very slowly after that, but there was obviously a lot of thrashing going on: it took anything from 0.3 seconds to 850 seconds to read the next 1000 rows. Yes: nearly 15 minutes!). I'm running on OS X with 4 GB of memory.
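The generator was roughly this (a sketch of what that write call was wrapped in; the row count is taken from your post):

import random

with open('f2', 'w') as f:
    for _ in xrange(11221133):          # row count from the original post
        f.write("%s\t%s\t%s\n" % (random.random(), random.random(), random.random()))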

While I was doing this work, it occurred to me that you might have duplicate keys in your long file. What should happen in that case? There are three options:

  1. First key wins
  2. Last key wins
  3. The value from a randomly chosen occurrence is used

(Up to 11 million rows, my random data had no duplicate keys. I killed the program at that point since it was apparent it would not finish in a reasonable time.)
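For what it's worth, the first two options are one-liners with a plain dict (a tiny illustration, not code from this thread):

d = {}
key, value = ('a', 'b'), 'new'

d.setdefault(key, value)   # 1. first key wins: only stored if the key is not there yet
d[key] = value             # 2. last key wins: plain assignment overwrites any earlier value
# 3. a random winner needs all values kept (e.g. a list per key) and random.choice at the end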

For reference, here's my code for reading from the file:

#!/usr/bin/env python

from random import random
import time

data = {}
doublecounter = 0

def mumble(i, lp, s, ss, e):
    # Report progress: row count, duplicates seen so far, and timing for this chunk and overall.
    looptime = e - ss
    totaltime = e - s
    print "%8d: count: %d, (lp:%d) lptime: %2.2f, ttime: %2.2f" % (i, doublecounter, lp, looptime, totaltime)

def doit(f):
    when = 1000000                      # report every `when` rows (reduced as things slow down)
    global data, doublecounter
    start = time.time()
    count = 0
    lpstart = start
    end = start
    for line in f:
        if 0 == count % when:
            lpstart = end
            end = time.time()
            mumble(count, when, start, lpstart, end)
            if count == 7000000: when /= 10
            elif count == 8800000: when /= 10
            elif count == 11000000: when /= 10
        s = line.split()
        k = tuple(s[:2])                # first two columns are the key
        v = s[2]
        data.setdefault(k, [])
        data[k].append(v)
        if len(data[k]) > 1:            # more than one value means a duplicate key
            doublecounter += 1
        count += 1

with open('f2', 'r') as f:
    doit(f)

and the first several rows of my f2 file:

0.681726943412	0.317524601127	0.774220362723
0.960827529946	0.884868924006	0.805958559062
0.948431957255	0.654394548708	0.261958105771
0.790787661492	0.588754682813	0.784801700146
0.91496579649	0.65679730019	0.643389604304
0.410742283212	0.266691538578	0.251305611073
0.452187326938	0.537941526934	0.162800839411
0.298231566648	0.287904077361	0.553563473187
0.892003052642	0.483519506157	0.605940960314
0.118257450942	0.51597182572	0.868219791638

(the white spaces are tabs)
