I've built my own Counter (collections.Counter isn't available in Python 2.6) that reads a sequence file of strings into a dictionary and tries to return the sequences that are unique to that file (or to a number of files). I use sequences of X characters as keys in the dictionary and store how many times each key has been read (if key in mydictionary: mydictionary[key] += 1 else: mydictionary[key] = 1).
After I have read the whole file I check which keys have 1 as their value and save those entries to another dictionary of unique sequences. The problem is that the program consumes more than 2 GB of memory and keeps growing until everything has been put into the dictionary. Is this normal for dictionaries in Python, or could I have a memory leak in the code? The program consumes 2.5 GB for three files of 5.1 MB each.
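
Roughly, the counting logic I describe looks like this (a simplified sketch; the file name, the variable names and the key length of 3 are just examples):

X = 3                                      # example key length
mydictionary = {}
for line in open("sequences.txt"):         # example file name
    seq = line.strip()
    for i in range(len(seq) - X + 1):      # every X-character window in the line
        key = seq[i:i + X]
        if key in mydictionary:
            mydictionary[key] += 1
        else:
            mydictionary[key] = 1
# keep only the sequences that were seen exactly once
unique = dict((k, v) for k, v in mydictionary.iteritems() if v == 1)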

Which version of Python are you using?

I'm using ver 2.6

First, do not read the entire file at once with readlines(); instead:

mydictionary = {}
for rec in open(file_name, "r"):
    key = rec.strip()              # or build your X-character keys from rec
    if key in mydictionary:
        mydictionary[key] += 1
    else:
        mydictionary[key] = 1

If that doesn't help then you will have to switch to an SQL file on disk. SQLite is simple to use for something like this, so post back if you want some help.
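
If you do try SQLite, a minimal sketch with the standard sqlite3 module might look like this (the database file, table and column names are only examples):

import sqlite3

conn = sqlite3.connect("counts.db")        # example database file
conn.execute("CREATE TABLE IF NOT EXISTS counts (seq TEXT PRIMARY KEY, n INTEGER)")
for rec in open(file_name, "r"):
    key = rec.strip()                      # or however you build your keys
    conn.execute("INSERT OR IGNORE INTO counts VALUES (?, 0)", (key,))
    conn.execute("UPDATE counts SET n = n + 1 WHERE seq = ?", (key,))
conn.commit()
# sequences that occurred exactly once
unique = [row[0] for row in conn.execute("SELECT seq FROM counts WHERE n = 1")]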

The thing is that I have to compare all strings of, say, 3 characters in that file, e.g.
if the file contains AGTCG, then the entries in the dictionary would be: AGT, GTC and TCG. I read the file line by line into a string and load it into the dictionary afterwards. Shouldn't Python's GC remove the string I've read into memory once I've put the sequences into the dictionary? Isn't it the dictionary that grows with every entry?
I tried running a brute-force version using a list, where I read the whole file into a string and then append every 3 characters (same windows as above), and the memory used stays constant at 17.8% of 4 GB :S
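
For reference, the brute-force list version I mean is essentially this (the file name is just an example):

data = open("sequences.txt").read()        # whole file as one string
kmers = []
for i in range(len(data) - 2):             # every 3-character window
    kmers.append(data[i:i + 3])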

I'm afraid that the runtime for the program will increase if I don't read the whole file into RAM. Thanks for the advice about SQLite. I may have to check how it works in Python (I have very little experience with databases).

I'm afraid that the runtime for the program will increase if I don't read the whole file into RAM

"I'm afraid" doesn't hold much water in the real world. Try reading the file both ways and see if there is a significant time difference.

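Something as simple as this would show the difference (the two function names are placeholders for your two versions):

import time

start = time.time()
count_line_by_line(file_name)              # placeholder: streaming version
print "line by line: %.2f s" % (time.time() - start)

start = time.time()
count_whole_file(file_name)                # placeholder: whole-file version
print "whole file: %.2f s" % (time.time() - start)
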
You need a method/function that contains yield so it gives the values back one at a time. A generator releases each value as it produces it, so the whole result is never held in memory at once.

e.g.

def gen_values(d1):
    for value in d1.itervalues():
        yield value
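
Applied to this thread, a sketch might be to yield the X-character windows straight from the file, so the whole file is never held in memory (purely illustrative; note that the counting dictionary itself still grows with the number of distinct keys):

def kmers(file_name, k=3):
    # stream every k-character window without keeping the whole file around
    for line in open(file_name, "r"):
        seq = line.strip()
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k]

mydictionary = {}
for key in kmers("sequences.txt"):         # example file name
    mydictionary[key] = mydictionary.get(key, 0) + 1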

Spot on. ;)
