http://paste.pocoo.org/show/234632/

The error's in there and I explain what my thoughts are. Extra information: There is definitely enough RAM. I watched the RAM go down as the program ran (lol), but there was still 2.3GB of RAM before the program crashed, giving the MemoryError.

Here is an example of what the input file is like; it's not exactly like this, and there are occasions where lines beside the first line aren't the length of the "first line being read," but the length check should take care of that.

http://paste.pocoo.org/show/234638/

The problem is because the REPETITIONS_REQUIRED is set to 6, and that's not going to happen in an input file this small. If you change it to 1, it works. But that's not my problem. >_>

The purpose of the program is to find repetitions of KMER_length up to two lines of genome file? So in the case of example file it means that in the last lines the dictionary must have astronomical amount of alternatives.

I think you can do simpler program using Pythons own funcitions, fi you are able to save enough subsequences and full sequence in memory.

Is there possibility that end of KMER has same sequence os beginning and next KMER start inside previous one?

Here is the code I would use to count the existence of repeats of length KMER_LENGTH:

#========================================================================================
# Edit below here
#========================================================================================
GENOME_FILE = "hg18.fa"
OUTPUT_FILE = "dictionary.txt"
KMER_LENGTH = 25
REPETITIONS_REQUIRED = 6
#========================================================================================
# Edit above here
#========================================================================================
dictionary = {}
genome_sequence = open(GENOME_FILE, "r").read().replace('\n','') ## everything one line
file_being_written_to = open(OUTPUT_FILE, "w")

for i,_ in enumerate(genome_sequence):
    piece = genome_sequence[i:i+KMER_LENGTH]
    if piece in genome_sequence[i+KMER_LENGTH:]:
        count = genome_sequence.count(piece)
        if count>=REPETITIONS_REQUIRED: dictionary[piece]=count

if len(dictionary):
    print 'Repeating KMERS are:'
    for item in dictionary.items():
        print "%s : %i" % item
       
else:
    print "No enough repeats found."

The purpose of the program is to find strings of length "KMER_LENGTH" defined by the user in the file "GENOME_FILE." It takes two lines at a time to reduce the strain on the computer. It'll run a segment of code on the combination of those two lines, finding all the kmers from that combination.

Then, it proceeds with the rest of the file. It's kinda like if I had a file saying:
AAAAA
BBBBB
CCCCC
and I was looking for 4-mers. The program does this
AAAAABBBBB
List:
AAAA
AAAA
AAAB
AABB
ABBB

Now:
BBBBBCCCCC
List:
BBBB
BBBB
etc.

Since there are some parts that run into the next line. I hope you understand. ._. And putting on everything on one line is bad because having a 3GB file opened at one time is bad.

EDIT: New code - http://paste.pocoo.org/show/234712/
The problem was that the algorithm for finding the kmers in the combination string was wrong. There were too many repeats in such a small file.

Edited 6 Years Ago by mastermoo420: n/a

So there is my code alternative to print the sequences passing the test and their count (repetitions as two to get result):

#========================================================================================
# Edit below here
#========================================================================================
GENOME_FILE = "hg18.fa"
OUTPUT_FILE = "dictionary.txt"
KMER_LENGTH = 25
REPETITIONS_REQUIRED = 2
#========================================================================================
# Edit above here
#========================================================================================
dictionary = {}
genome_sequence = open(GENOME_FILE, "r")
file_being_written_to = open(OUTPUT_FILE, "w")
piece =' '+genome_sequence.read(KMER_LENGTH-1) ## first char will be dropped in loop

for line in genome_sequence:
    for base in line.rstrip():
        if base not in 'ACGT':
            print '**Bad line**:'
            print line
            break
        piece = piece[1:] + base
        dictionary[piece]= dictionary[piece]+1 if piece in dictionary else 1

for kmer,count in dictionary.items():
    if count>=REPETITIONS_REQUIRED:
        print >> file_being_written_to,"%s : %i" % (kmer,count)

file_being_written_to.close()
print open(OUTPUT_FILE).read()
raw_input('Ready')

Edited 6 Years Ago by pyTony: n/a

This article has been dead for over six months. Start a new discussion instead.