Hi all,

I have a large text file (3 million lines). I would like to use Python to parse the file so that it can be handled in Excel.
I am very new to programming and to Python, and I would like to learn it.

Thanks for your support...

input file format:

..
...
...
2014 Jul 23 08:15:16.675 ...s.s.lsllslls...slls
...
..

Name = x
Lastname = y
.
.

Age = 5
height = 1
..

..
...
...
2014 Jul 24 08:15:16.675 jkkl ...s.s.lsllslls...slls
...
..

Name = HHH
Lastname = BBSB
.
.

Age = 10
height = 2
..

2014 Jul 25 08:15:16.675...... ...s.s.lsllslls...slls
...
..

Name = SKSK
Lastname = SKSK
.
.

Age = 9
height = 3
..

..
...
...
2014 Jul 26 08:15:16.675.......lllll
...
..

Name = x
Lastname = y
.
.

Age = 8
height = 1.5

=============================================================

Wanted output format :

Date| Name |Last name|Age|Height <--- header
2014 Jul23 ,x,Y,5,1
.
.
.
.

Normally you would probably want to use a database for a file this large; in my opinion that would be more effective, especially for that much data. Of course, several other factors can work in favour of doing it differently.
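To give an idea of what the database route could look like, here is a minimal sketch using the sqlite3 module from the standard library. The table name "people", the column names and the two sample rows are only assumptions based on the sample data in the question, and it uses an in-memory database so that it runs without creating a file.

```python
import sqlite3

# In-memory database for the demo; a real run would pass a file name instead.
conn = sqlite3.connect(':memory:')

# Hypothetical table layout, modelled on the fields in the sample data.
conn.execute(
    'CREATE TABLE people '
    '(date TEXT, name TEXT, lastname TEXT, age INTEGER, height REAL)'
)

# Two rows copied from the sample input in the question.
rows = [
    ('2014 Jul 23 08:15:16.675', 'x', 'y', 5, 1.0),
    ('2014 Jul 24 08:15:16.675', 'HHH', 'BBSB', 10, 2.0),
]
conn.executemany('INSERT INTO people VALUES (?, ?, ?, ?, ?)', rows)
conn.commit()

# Once loaded, the data can be filtered and sorted with SQL.
older = conn.execute(
    'SELECT name, age FROM people WHERE age > ? ORDER BY age', (6,)
).fetchall()
print(older)
```

The point is only that, once the records are parsed, filtering and sorting happen in the database rather than in a spreadsheet.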

That being said, I think you should be able to get this done.
You can open the file with the built-in open() function, then use a for loop over the file object it returns. That loops through the file one line at a time.

I don't know whether your real data file contains those lines of dots, but going by the sample you posted: if the current line starts with a '.', skip to the next line. When you reach a line that doesn't, split it and check that the resulting list is not empty. If it isn't, you should be able to pull out the data you need.
Hope this helps.
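A rough sketch of that idea, assuming the interesting lines all look like "key = value" as in your sample. The function name iter_fields and the short sample list are made up for illustration; with the real file you would pass the open file object instead of the list.

```python
from __future__ import print_function  # lets the same code run on Python 2.7


def iter_fields(lines):
    """Yield (key, value) pairs from lines that look like 'key = value'."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and the filler lines that start with a dot.
        if not line or line.startswith('.'):
            continue
        parts = line.split('=', 1)
        # Keep the line only if the split produced a non-empty key and a value.
        if len(parts) == 2 and parts[0].strip():
            yield parts[0].strip(), parts[1].strip()


# Small in-memory sample standing in for the real file.
sample = ['..', 'Name = x', '', 'Lastname = y', '.', 'Age = 5']
for key, value in iter_fields(sample):
    print(key, value)
```

Because the function only loops over whatever it is given, passing a file object keeps just one line in memory at a time.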

Thank you for your help.

I managed to get the date/time as well as the line number :-)

#!/usr/local/bin/python2.7
f = open('input.txt', 'r')
line_num = 0
search_phrase = "Jul"
for line in f.readlines():
    line_num += 1
    if line.find(search_phrase) >= 0:
        print line_num,
        print line[2:26]

output:
7 14 Jul 24 23:23:07.109
37 14 Jul 24 23:23:07.119
67 14 Jul 24 23:23:07.129

Once it finds the reference line (e.g. line 7 in the above example), how can I get it to print the content of line "x", where x is relative to line_num?

Can I use something like this?

Print line (line_num+2,line_num+5,line_num+8)

Your help is appreciated.

Using readlines() is a bad idea, as it reads the whole file into memory. You could try something like the following code, which loads one line at a time:

#!/usr/local/bin/python2.7
from __future__ import print_function

search_phrase = "Jul"
with open('input.txt','r') as f:
    data = enumerate(f, 1)
    try:
        while True:
            line_num, line = next(data)
            if search_phrase in line:
                print(line_num, line[2:26])
                nums = set([line_num + 2, line_num + 5, line_num + 8])
                while nums:
                    line_num, line = next(data)
                    if line_num in nums:
                        print(line_num, line)
                        nums.discard(line_num)
    except StopIteration:
        pass
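If the matched lines should end up on a single row, one possible variation is to collect them in a list and join them with a separator. This is only a sketch: the function name joined_records and the sample list are made up for illustration, the offsets 2, 5 and 8 are the ones from the question, and it keeps the whole reference line rather than a slice of it.

```python
from __future__ import print_function


def joined_records(lines, search_phrase, offsets=(2, 5, 8), sep=','):
    """Yield one joined string per reference line containing search_phrase."""
    data = enumerate(lines, 1)
    for line_num, line in data:
        if search_phrase in line:
            parts = [line.strip()]
            wanted = set(line_num + off for off in offsets)
            last = max(wanted)
            # Keep consuming the same iterator until the last wanted line.
            for next_num, next_line in data:
                if next_num in wanted:
                    parts.append(next_line.strip())
                if next_num >= last:
                    break
            yield sep.join(parts)


# In-memory sample; with the real file, pass the open file object instead.
sample = ['a', '2014 Jul 23 08:15:16.109 xx', '..', 'Name = x', '.', '.',
          'Lastname = y', '.', '.', 'Age = 5', 'tail']
for row in joined_records(sample, 'Jul'):
    print(row)
```

Passing sep='\t' gives tab-separated rows instead of comma-separated ones.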

Edited 2 Years Ago by Gribouillis

Thanks for your help.

It works fine, except for one thing...

It prints each piece of output on its own line. Is there a way to combine the four lines into one single line, separated by a TAB or ","?

The output format that I am trying to get is:

2014 Jul 23 08:15:16.109 ,Name = x, Lastname = y,Age = 5, height = 1
2014 Jul 23 08:15:16.119 ,Name = x, Lastname = y,Age = 5, height = 1
2014 Jul 23 08:15:16.129 ,Name = x, Lastname = y,Age = 5, height = 1
.
.

If your large data file has data that appears as consistently as your example would suggest, then you might use this simple approach:

# processlargeFile.py #

FNAME = 'largeFile.txt'

try:
    with open( FNAME ) as f:
        count = 0
        data = [] # get an empty list
        line = f.readline()
        while line:
            line = line.rstrip()
            if line and line[0] != '.': # if not empty and ...
                count += 1
                #print( line )
                if count == 1: # get date/time data
                    data.append( line[0:24] ) #get the first 24 char's
                elif count == 2: #get name
                    data.append( line[7:] ) #get all the chars beginning at index 7
                elif count == 3: #get last name
                    data.append( line[11:] )
                elif count == 4: #get age
                    data.append( line[6:] )
                elif count == 5: #get height, so done this record, so ...
                    data.append( line[9:] )
                    count = 0
                    #print( data )
                    print( ','.join( data ) )
                    #print()
                    data = []

            line = f.readline()
except IOError: # catch only file errors, not every possible exception
    print( 'There was a problem opening/reading file', FNAME )

If your large data file lacks the regularly repeating data pattern that the code above relies on, perhaps you could pre-process the data first so that it complies.
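If the fields do not always sit at the same position, another option is to key on the field names instead of counting lines. The following is a sketch under some assumptions: it is written for Python 3, it assumes every record starts with a line beginning with a date/time like "2014 Jul 23 08:15:16.675", and the names DATE_RE, FIELDS, parse_records and write_csv are made up for illustration. It uses the csv module so that the result can be opened in Excel.

```python
import csv
import io
import re

# Assumed shape of the line that starts each record.
DATE_RE = re.compile(r'^\d{4} [A-Z][a-z]{2} \d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}')
FIELDS = ['Name', 'Lastname', 'Age', 'height']


def parse_records(lines):
    """Yield one dict per record; a date line starts a new record."""
    record = None
    for line in lines:
        line = line.strip()
        match = DATE_RE.match(line)
        if match:
            if record is not None:
                yield record
            record = {'Date': match.group(0)}
        elif record is not None and '=' in line:
            key, value = line.split('=', 1)
            key = key.strip()
            if key in FIELDS:
                record[key] = value.strip()
    if record is not None:
        yield record  # do not lose the last record in the file


def write_csv(records, out):
    """Write a header row, then one row per record (missing fields stay empty)."""
    writer = csv.writer(out)
    writer.writerow(['Date'] + FIELDS)
    for rec in records:
        writer.writerow([rec['Date']] + [rec.get(name, '') for name in FIELDS])


# In-memory sample standing in for the large file.
sample = """..
2014 Jul 23 08:15:16.675 ...s.s.lsllslls...slls
...

Name = x
Lastname = y
.

Age = 5
height = 1
..
2014 Jul 24 08:15:16.675 jkkl ...s
Name = HHH
Lastname = BBSB
Age = 10
height = 2
""".splitlines()

buf = io.StringIO()
write_csv(parse_records(sample), buf)
print(buf.getvalue())
```

For the real data you would pass the open input file to parse_records and an output file opened with newline='' to write_csv. Note that a single Excel worksheet holds at most 1,048,576 rows, so it matters how many records the 3 million lines reduce to.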

Edited 2 Years Ago by David W
