Parsing large text file in Python

Question

Needhelp2 0 Newbie Poster

10 Years Ago

Hi all,

I have a large text file (3 million lines). I would like to use Python to parse the file so that it can be managed in Excel.
I am very new to programming and Python, and would like to learn.

Thanks for your support...

input file format:

..
...
...
2014 Jul 23 08:15:16.675 ...s.s.lsllslls...slls
...
..

Name = x
Lastname = y
.
.

Age = 5
height = 1
..

..
...
...
2014 Jul 24 08:15:16.675 jkkl ...s.s.lsllslls...slls
...
..

Name = HHH
Lastname = BBSB
.
.

Age = 10
height = 2
..

2014 Jul 25 08:15:16.675...... ...s.s.lsllslls...slls
...
..

Name = SKSK
Lastname = SKSK
.
.

Age = 9
height = 3
..

..
...
...
2014 Jul 26 08:15:16.675.......lllll
...
..

Name = x
Lastname = y
.
.

Age = 8
height = 1.5

=============================================================

Wanted output format :

Date| Name |Last name|Age|Height <--- header
2014 Jul23 ,x,Y,5,1
.
.
.
.

2teez 43 Posting Whiz

10 Years Ago

Normally you would probably want to use a database for a file this large; in my opinion that would be more effective, especially for this much data. That said, several other factors could work in our favour without one.

That being said, I think you should be able to get this done.
You can open the file with the open function, then use a for loop over the file object it returns, looping through one line at a time.

I don't know whether your data file really contains lines made of several dots, but going by the data set you gave here: if the current line starts with a '.', skip to the next line of the file. When you find a line that doesn't, split it and check that the resulting list is not empty. If it isn't, you should be able to extract the data you need.
Hope this helps.
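
The steps described above could be sketched roughly as follows. This is only an illustration written for Python 3, not code from the thread: the helper names useful_lines and split_field are made up for the example, and the in-memory sample stands in for the real file on the assumption that field lines look like "Name = x".

```python
import io

def useful_lines(lines):
    """Yield the non-blank lines that do not start with a dot."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('.'):
            continue  # filler line, move on to the next one
        yield line

def split_field(line):
    """Split 'Name = x' into ('Name', 'x'); return None if that fails."""
    parts = [p.strip() for p in line.split('=', 1)]
    if len(parts) == 2 and parts[0]:
        return parts[0], parts[1]
    return None

# Small in-memory stand-in for the real file. With a real file you would
# use: with open('input.txt') as f: ... and pass f to useful_lines().
sample = io.StringIO("..\n...\nName = x\nLastname = y\n.\n\nAge = 5\n")
fields = [split_field(line) for line in useful_lines(sample)]
print(fields)
```

Because the file object is iterated line by line, only one line is held in memory at a time, which matters for a 3 million line file.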

Gribouillis 1,391 Programming Explorer

10 Years Ago

Using readlines() is a bad idea, as it reads the whole file into memory. You could try something like the following code, which loads one line at a time:

#!/usr/local/bin/python2.7
from __future__ import print_function

search_phrase = "Jul"
with open('input.txt','r') as f:
    data = enumerate(f, 1)
    try:
        while True:
            line_num, line = next(data)
            if search_phrase in line:
                print(line_num, line[2:26])
                nums = set([line_num + 2, line_num + 5, line_num + 8])
                while nums:
                    line_num, line = next(data)
                    if line_num in nums:
                        print(line_num, line)
                        nums.discard(line_num)
    except StopIteration:
        pass

Edited 10 Years Ago by Gribouillis

Needhelp2 0 Newbie Poster · Answer 1 · 2014-07-28T08:04:44+00:00

Thank you for your help.

I managed to get the date/time as well as the line number :-)

#!/usr/local/bin/python2.7
f=open('input.txt','r');
line_num = 0
search_phrase = "Jul"
for line in f.readlines():
    line_num += 1
    if line.find(search_phrase) >= 0:
      print line_num,
      print line [2:26]

output:
7 14 Jul 24 23:23:07.109
37 14 Jul 24 23:23:07.119
67 14 Jul 24 23:23:07.129

Once it finds the reference line (e.g. line 7 in the above example), how can I get it to print the content of line "x", where x is relative to line_num?

Can I use something like this:

Print line (line_num+2,line_num+5,line_num+8)

Your help is appreciated.

Needhelp2 0 Newbie Poster · Answer 2 · 2014-07-29T01:41:17+00:00

Thanks for your help.

It works fine, except for one thing...

It prints one line of output per input line. Is there a way to combine the four lines into one single line, separated by TAB or ","?

The output format that I am trying to get is:

2014 Jul 23 08:15:16.109 ,Name = x, Lastname = y,Age = 5, height = 1
2014 Jul 23 08:15:16.119 ,Name = x, Lastname = y,Age = 5, height = 1
2014 Jul 23 08:15:16.129 ,Name = x, Lastname = y,Age = 5, height = 1
.
.

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 3 · 2014-07-29T05:17:28+00:00

You can use string operations to create the lines of output.
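
As a minimal illustration of what that could look like (the four sample strings below are made up to resemble the lines printed earlier, and the 24-character cut is an assumption about the width of the date/time stamp):

```python
# Four lines as they come out of the file, each still ending in a newline.
lines = [
    "2014 Jul 23 08:15:16.109 some other text\n",
    "Name = x\n",
    "Lastname = y\n",
    "Age = 5\n",
]

date = lines[0][:24]                          # keep only the date/time part
rest = [line.strip() for line in lines[1:]]   # drop newlines and padding
row = ",".join([date] + rest)                 # use "\t".join(...) for tabs
print(row)
```

The idea is to collect the pieces in a list instead of printing them one by one, then join them once the last piece of the record has been read.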

Needhelp2 0 Newbie Poster · Answer 4 · 2014-07-29T05:46:49+00:00

Can you provide more detail? I have no experience in Python.

David W 131 Practically a Posting Shark · Answer 5 · 2014-07-30T04:35:02+00:00

If your large data file has data that appears as consistently as your example would suggest, then you might use this simple approach:

# processlargeFile.py #

FNAME = 'largeFile.txt'

try:
    with open( FNAME ) as f:
        count = 0
        data = [] # get an empty list
        line = f.readline()
        while line:
            line = line.rstrip()
            if line and line[0] != '.': # if not empty and ...
                count += 1
                #print( line )
                if count == 1: # get date/time data
                    data.append( line[0:24] ) #get the first 24 char's
                elif count == 2: #get name
                    data.append( line[7:] ) #get all the chars beginning at index 7
                elif count == 3: #get last name
                    data.append( line[11:] )
                elif count == 4: #get age
                    data.append( line[6:] )
                elif count == 5: #get height, so done this record, so ...
                    data.append( line[9:] )
                    count = 0
                    #print( data )
                    print( ','.join( data ) )
                    #print()
                    data = []

            line = f.readline()
except IOError: # catch file errors only; a bare except would also hide bugs
    print( 'There was a problem opening/reading file', FNAME )

If your large data file lacks the regularly repeating data pattern needed above, perhaps you could pre-process the data first so that it complies.
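
If the pattern is less regular, one possible alternative is to key on the content of each line rather than on its position. The sketch below is written for Python 3 and rests on assumptions taken from the sample data only: a record starts at a line beginning with a stamp like "2014 Jul 23 08:15:16.675", and the fields appear as "key = value" with the keys spelled exactly as in the question. The names DATE_RE, FIELDS and parse_records are invented for this example.

```python
import csv
import io
import re

# Assumption: every record starts with a stamp like "2014 Jul 23 08:15:16.675".
DATE_RE = re.compile(r'^(\d{4} [A-Z][a-z]{2} \d{1,2} \d{2}:\d{2}:\d{2}\.\d{3})')
FIELDS = ['Name', 'Lastname', 'Age', 'height']  # keys as spelled in the sample
COLUMNS = ['Date'] + FIELDS

def parse_records(lines):
    """Yield one [date, name, lastname, age, height] row per record."""
    record = None
    for line in lines:
        line = line.strip()
        match = DATE_RE.match(line)
        if match:
            if record is not None:  # a new stamp closes the previous record
                yield [record.get(k, '') for k in COLUMNS]
            record = {'Date': match.group(1)}
        elif record is not None and '=' in line:
            key, _, value = line.partition('=')
            if key.strip() in FIELDS:
                record[key.strip()] = value.strip()
    if record is not None:  # do not forget the last record
        yield [record.get(k, '') for k in COLUMNS]

# In-memory stand-in for the real file; pass an open file object instead.
sample = io.StringIO(
    "..\n2014 Jul 23 08:15:16.675 ...s.s\n...\n\n"
    "Name = x\nLastname = y\n.\n\nAge = 5\nheight = 1\n..\n"
    "2014 Jul 24 08:15:16.675 jkkl\n\n"
    "Name = HHH\nLastname = BBSB\nAge = 10\nheight = 2\n"
)
rows = list(parse_records(sample))

out = io.StringIO()  # for a real run, open('output.csv', 'w', newline='')
writer = csv.writer(out)
writer.writerow(['Date', 'Name', 'Last name', 'Age', 'Height'])
writer.writerows(rows)
print(out.getvalue())
```

A missing field simply becomes an empty cell, so an irregular record does not shift the ones that follow it.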

Needhelp2 0 Newbie Poster · Answer 6 · 2014-07-31T00:41:52+00:00

Thanks a lot, guys.

Problem solved with some learning :-)