Hi,
I have a large file (11 GB) that I want to extract information from. I decided the file was too big to work with, so I ended up splitting it into 20 smaller files. Now I don't know what the smartest thing to do with these files is. Are they still too big? I want to open and read them, but I suspect that doing so that many times would take too long. I'm pretty new to Python and I'm not sure how to proceed.

It depends on what you want from the file. Generally, the way to work with a large file is to split and merge, but you may not need to split at all if you work with generator expressions, which do not load all the data into memory at once. Could you specify what processing you are doing on the data?

This is what the file looks like; it's a tab-delimited file.

Bgr  Pro     ID         ff   Aa   FIL    Info                              
2   14370  AT6054257  3.54  5.67  PASS  NS=3;DP=14;prob=0.5;DB;H2

This is only one row from the file, but I have many lines and want to extract the prob value from the Info column.

I know how to extract the data from the file, I just don't know how to do it when I have such a large file...

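If you are unsure how much memory you need, you can copy just the part you want from the file into another, much smaller file. Reading line by line keeps only one line in memory at a time, so the 11 GB size is not a problem: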

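def get_prob(line):
    # Return the text between 'prob=' and the next ';' (None if 'prob=' is absent).
    before, found, after = line.partition('prob=')
    if found:
        # strip() drops the trailing newline when prob is the last item on the line
        return after.partition(';')[0].strip()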

with open('result.txt', 'w') as out_file, open('data_sofia.txt') as in_file:
    for line in in_file:
        prob = get_prob(line)
        if prob:
            out_file.write(prob+'\n')

The main part of the program can also be written more concisely with a generator expression and writelines:

with open('result.txt', 'w') as out_file, open('data_sofia.txt') as in_file:
    out_file.writelines(prob+'\n' for prob in (get_prob(line) for line in in_file) if prob)
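Since the original file has already been split into 20 pieces, the same line-by-line approach can simply be run over each piece in turn. Here is a minimal sketch of that idea. The function name extract_probs and the file naming pattern mentioned below are my own assumptions, not something from the original post:

```python
def get_prob(line):
    """Return the text between 'prob=' and the next ';', or None if absent."""
    before, found, after = line.partition('prob=')
    if found:
        return after.partition(';')[0].strip()
    return None


def extract_probs(paths, out_path):
    """Stream every input file line by line and write one prob value per line.

    Only one line is held in memory at a time, so the size of the
    individual files does not matter. Returns the number of values written.
    """
    count = 0
    with open(out_path, 'w') as out_file:
        for path in paths:
            with open(path) as in_file:
                for line in in_file:
                    prob = get_prob(line)
                    if prob:
                        out_file.write(prob + '\n')
                        count += 1
    return count
```

Assuming the pieces are named something like data_sofia_part01.txt (a made-up pattern), you could call it as extract_probs(sorted(glob.glob('data_sofia_part*.txt')), 'result.txt') after an import glob. Sorting the names keeps the output in the same order as the original file, provided the part numbers are zero-padded.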

I have a large file (11 GB), that I want to extract information from.

At the risk of going off-topic: is using Python an absolute requirement? If not, and assuming you are on *nix, you can easily extract the probability with a one-liner (it assumes prob is always the third item in the Info column):

awk 'BEGIN{FS="\t"} NR>1 {split($7,arr1,";"); split(arr1[3], arr2, "="); print arr2[2] }' tab.txt
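For comparison, here is a rough Python counterpart of the one-liner above. It is only a sketch: the helper names info_value and iter_probs are made up for illustration, and it assumes the Info column is the seventh tab-separated field, as in the sample row. Unlike the awk version it looks prob up by name, so it does not matter where prob sits inside the Info column:

```python
def info_value(info, key):
    """Look up key in a ';'-separated field such as 'NS=3;DP=14;prob=0.5;DB;H2'."""
    for item in info.split(';'):
        name, sep, value = item.partition('=')
        if sep and name == key:
            return value
    return None  # key not present (flags like 'DB' have no '=' and are skipped)


def iter_probs(lines, info_index=6, key='prob'):
    """Lazily yield the prob value of each data row.

    lines can be an open file object or any iterable of strings.
    info_index=6 is the seventh column, matching $7 in the awk command.
    """
    rows = iter(lines)
    next(rows, None)  # skip the header row, like sed '1d'
    for line in rows:
        fields = line.rstrip('\r\n').split('\t')
        if len(fields) > info_index:
            value = info_value(fields[info_index], key)
            if value is not None:
                yield value
```

With an open file you would loop over it as for prob in iter_probs(in_file): and write or collect each value. Because iter_probs is a generator, the file is still read one line at a time.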

thanks!
