I'm trying to write a program that uses Google 5-gram data to compare against an English exam called TOEIC. There are 118 files of 4-grams; each file is approximately 300 MB and has 10,000,000 lines.

So, here is the point.
It takes about 4 or 5 seconds to read a chunk of lines from a file using
readlines(some value). Even though 5 seconds is quite a short time, it adds up to a few minutes if the program needs to check all the way to the last line.

I only started learning Python a few months ago, so I don't know how to reduce the time it takes to read lines from a file.

I know there is a way to read a specific line of a file, but it also takes time because the file is loaded into memory first. I have heard that in Java you do not need to load the file into memory to read specific lines. Is there a similar way in Python to read a specific line without the cost of loading the whole file?

This is the comparing part of my code.

while bcheck == 0:
        if bcheck == 1:

        #read a 5gram file match with first letter of input

        if nRange == maxLine:
            if str1stLetters == 'A':
                tx = open(r"d:\##DB\Google-4gram\4gm-0034")
            elif str1stLetters == 'B':
                tx = open(r"d:\##DB\Google-4gram\4gm-0036")
            elif str1stLetters == 'C':
                tx = open(r"d:\##DB\Google-4gram\4gm-0038")
            elif str1stLetters == 'D':

            # ... and so on

        lines       = tx.readlines(2000000)  # WANT TO REDUCE THE TIME THIS LINE TAKES
        countLine   = 0

        for x in range(len(lines)):
            str4gram    = lines[countLine]

            if strSearch in str4gram:
                strSplit4gram   = str4gram.split()
                print str4gram
                print nRange,"th line"
                tShow.insert(INSERT, str4gram + "\n")
                nRange      = nRange + 1

                if nRange == 10000000:
                    bcheck          = 1
                    str4gram        = lines[countLine]
                    strSplit4gram   = str4gram.split()

                    print nRange,"th line"
                    tShow.insert(INSERT, str4gram + "\n")
                    print nRange

                if strSearch in lines[countLine + 1]:
                    countLine = countLine + 1
                    bcheck          = 1
                nRange = nRange + 1
                countLine = countLine + 1

    tShow.insert(INSERT, "\n")
    tInput.delete(0, END), tInput_2.delete(0, END)

This should be faster.

lines = [line for line in tx.readlines(2000000)]

Cheers and Happy coding

Yes, indeed.
Thanks to your advice, the program is a bit faster than before.
Can't I just access a specific line without loading the file into memory?

Thank you so much, mate.

Can't I just access the specific line without loading the file to memory?

Yes, use linecache from the standard library.

'''--> l.txt
line 1.
line 2.
line 3.
'''

from linecache import getline
# Notice: linecache counts lines from 1
print getline('l.txt', 2).strip()
#--> line 2.

Thanks snippsat.
I had already applied your recommended method in my program.
It does help to access whichever line of a file I want.
However, before accessing the line, the file has to be loaded into memory first; only then
can we access any line using getline().

The thing is, in my program the computer has to compare more than 5 files to find the matching words because of the huge size of the n-gram data.
That makes it run out of memory.

I hope there is a dramatically better way to solve this problem.
Does anyone know how to access a line directly in the file without loading it into memory?

Thanks to all and happy coding

for line in open('my.txt'):
    print line,

This is a memory-efficient and fast way; it reads line by line,
not the whole file into memory.
I'm not sure there is a way to get one line without reading something into memory (I don't think so).

I think you must read each file once, generate an index of the seek positions of its lines, and save that index. Do use snippsat's suggested way of reading the file, not readlines. Once you have the index, finding a line is practically instantaneous.
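
A minimal sketch of that index idea, assuming the data files are plain text with one n-gram per line. The helper names `build_line_index` and `read_line` are hypothetical, not from any library. The file is opened in binary mode so that the stored offsets are exact byte positions that `seek` can use, and `print` is called with parentheses so the sketch also runs on Python 3.

```python
import os
import tempfile


def build_line_index(path):
    """Scan the file once and return the byte offset at which each line starts."""
    offsets = []
    position = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def read_line(path, offsets, line_number):
    """Return line `line_number` (counting from 1) by seeking straight to it."""
    with open(path, "rb") as f:
        f.seek(offsets[line_number - 1])
        return f.readline().decode("utf-8").rstrip("\r\n")


# Small demonstration file; a real 4-gram file would be indexed the same way.
sample = "USB Drive So Far 62\nUSB Drive Spec Sheet 53\nUSB Drive Specs : 81\n"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as demo:
    demo.write(sample.encode("utf-8"))

index = build_line_index(demo_path)
second_line = read_line(demo_path, index, 2)
print(second_line)
os.remove(demo_path)
```

Building the index still reads the whole file once, but only one line is held in memory at a time. For 10,000,000 lines the list of offsets is itself fairly large, so saving it to disk and reusing it between runs is probably worthwhile.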

Thanks again for your help.
I've used your suggested way to solve my problem.
This way should be faster.

Well, have you used a binary search algorithm in Python?
I'm trying to apply binary search in my program to reduce the time it takes.
There is no need to compare every line from the first to the last,
so I want to jump to a given line.

I've found this method to jump to a line; the code follows.

from itertools import islice

def seek_to_line(f, n):
    for ignored_line in islice(f, n - 1):
        pass   # skip n-1 lines

tx = open(r"d:\##DB\Google-4gram\4gm-0062")
k = seek_to_line(tx, 9000)    # seek to line 9000

print k, type(k)
print tx.readline()

As you might expect, there is an error at the end of the code, at print tx.readline().
The error message is: "Mixing iteration and read methods would lose data"
I couldn't find a way to solve this error yet.

Any idea how to solve this error?


Did you try to get the next line with next?

print next(tx)

Binary-search-style lookups are usually handled with the bisect module.
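
As a rough illustration, here is how bisect could locate a prefix in a list of lines that is already sorted. This assumes the lines are held in memory and sorted in the same order in which Python compares strings; `find_prefix` is a hypothetical helper name, and the sample lines are only a small stand-in for real data.

```python
from bisect import bisect_left


def find_prefix(sorted_lines, prefix):
    """Return every line in sorted_lines that starts with prefix."""
    i = bisect_left(sorted_lines, prefix)  # index of the first line >= prefix
    matches = []
    # In sorted input all lines sharing the prefix sit next to each other.
    while i < len(sorted_lines) and sorted_lines[i].startswith(prefix):
        matches.append(sorted_lines[i])
        i += 1
    return matches


lines = [
    "USB Drive SanDisk Cruzer 1072",
    "USB Drive Shuttle , 61",
    "USB Drive Shuttle / 43",
    "USB Drive So Far 62",
    "USB Drive Spec Sheet 53",
]
found = find_prefix(lines, "USB Drive Shuttle")
print(found)
```

Note that this only helps when the query is a prefix of the line. A substring test such as `strSearch in str4gram` cannot use the sort order and still has to look at every line.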

Very nice idea to use islice to go to a specific line, but I do not understand why you wrote it in such a complicated way.

>>> import itertools
>>> f = open('LICENSE.txt')
>>> print next(itertools.islice(f,10,11)),
in Reston, Virginia where he released several versions of the
>>> print next(f),

Thanks for your help.

It has been solved simply by using tx.next().
I believe you've been using Python for quite a long time.

If I understand your question correctly, the reason I wrote it that way is that I need to read a line, not a word.

The 4-gram data Google supplies has 10,000,000 lines per file, such as the following.

USB Drive SanDisk Cruzer 1072
USB Drive Shuttle , 61
USB Drive Shuttle / 43
USB Drive Shuttle </S> 46
USB Drive So Far 62
USB Drive Spec Sheet 53
USB Drive Specs : 81
USB Drive Speeds Up 42
USB Drive Starter Kit 330
USB Drive Stick 256 46

To use the binary search method, I have to jump to line 5,000,000 and check whether the query and the 4-gram match. So first I read the 1st line of the 4-gram file, then jump to line 5,000,000, read it, and compare the query with the 1st and the 5,000,000th lines. And so on.
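
One way this could be sketched without reading up to line 5,000,000 first is to bisect on byte offsets instead of line numbers: seek to the middle byte, skip to the start of the next full line, and compare that line with the query. This is a variant of the approach described above, not the exact same procedure. It assumes the file is sorted byte-wise (the sample lines above appear to be) and that the query is the beginning of the line; `find_lines_with_prefix` and its helper are hypothetical names.

```python
import os
import tempfile


def _line_start_at_or_after(f, pos):
    """Return the offset of the first line that starts at byte `pos` or later."""
    if pos == 0:
        return 0
    f.seek(pos - 1)
    f.readline()  # finish the line that contains byte pos - 1
    return f.tell()


def find_lines_with_prefix(path, prefix):
    """Binary-search a byte-sorted text file; return the lines starting with prefix."""
    key = prefix.encode("utf-8")
    matches = []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        lo, hi = 0, f.tell()
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(_line_start_at_or_after(f, mid))
            line = f.readline()
            if not line or line >= key:
                hi = mid      # the first candidate starts at or before this line
            else:
                lo = mid + 1  # this line sorts before the key; look further on
        # lo now points just before the first line that is >= key.
        f.seek(_line_start_at_or_after(f, lo))
        for line in iter(f.readline, b""):
            if not line.startswith(key):
                break
            matches.append(line.decode("utf-8").rstrip("\r\n"))
    return matches


# Demonstration on a few of the sample lines, written to a temporary file.
sample_lines = [
    "USB Drive SanDisk Cruzer 1072",
    "USB Drive Shuttle , 61",
    "USB Drive Shuttle / 43",
    "USB Drive Shuttle </S> 46",
    "USB Drive So Far 62",
    "USB Drive Spec Sheet 53",
]
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as demo:
    demo.write(("\n".join(sample_lines) + "\n").encode("utf-8"))

shuttle_lines = find_lines_with_prefix(demo_path, "USB Drive Shuttle")
no_match = find_lines_with_prefix(demo_path, "USB Drive Zip")
print(shuttle_lines)
os.remove(demo_path)
```

For a 300 MB file this needs roughly 28 seek-and-read steps per query, and nothing but the lines actually compared is loaded into memory.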

Anyway, thank you so much.
Finally I can speed up my program.

If you have another suggestion or question, don't hesitate.

Thanks again.

I meant the way the function worked; in my post I gave an example of reading the eleventh line (index 10) directly with islice and then reading the line after it.

I finally read the code in your first post and tried to understand what was going on there. Here is what I came up with, using the data from your last post:

import os

gpath = r"d:\##DB\Google-4gram"
str_1st_letters = 'U'

# file number for the first letter: 'A' -> 34, 'B' -> 36, 'C' -> 38, ...
fname = os.path.join(gpath, "4gm-00%s" % ((ord(str_1st_letters.upper()) - ord('A')) * 2 + 34))

# write the sample data from the last post as a small test file
# (careful: this overwrites any existing file with that name)
with open(fname, 'w') as out:
    out.write("""USB Drive SanDisk Cruzer 1072
USB Drive Shuttle , 61
USB Drive Shuttle / 43
USB Drive Shuttle </S> 46
USB Drive So Far 62
USB Drive Spec Sheet 53
USB Drive Specs : 81
USB Drive Speeds Up 42
USB Drive Starter Kit 330
USB Drive Stick 256 46
""")

strSearch = 'Shuttle'

# read the 4gram file that matches the first letter of the input
with open(fname) as tx:
    # enumerate from 1 so that nRange is the real line number in the file
    for nRange, str4gram in enumerate(tx, 1):
        if strSearch in str4gram:
            print "%r found in %r at line %i" % (strSearch, str4gram.rstrip(), nRange)