How to parse a file and keep only the lines with the column values we want?

Question

joe82 0 Light Poster

15 Years Ago

hello everyone,

My file has many long lines with 12 columns, like:

gnl|dbS|13484118 gi|62750812 100 16 .......around 12 columns
gnl|dbS|13484888 gi|62750812 95 20 .......
gnl|dbS|22484118 gi|62750812 92 20 ..........

I want to grab the lines where the value in column 3 lies between 90 and 99.9, and store the result in a new file.

Please help me if you can...!!

Thanks in advance

python

jlm699 320 Veteran Poster

15 Years Ago

Here's how to open a file for reading and iterate over it:

f = open(my_file)
for line in f:
    # Do something to each line
f.close()

Here's how to split a string:

>>> my_text = "Here'ssomeTExt WithSomeMoreOver Here and then some more"
>>> my_text.split()
["Here'ssomeTExt", 'WithSomeMoreOver', 'Here', 'and', 'then', 'some', 'more']
>>>

As you can see, split returns a list so you can easily use slicing or indexing to access the third column (which would be index 2, btw)
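
To make the indexing concrete, here is a small sketch using a line shaped like the ones in the question. The values after the fourth column are made up for illustration, and the syntax is Python 3, whereas the rest of the thread uses Python 2.

```python
# Sample line modelled on the question; the trailing values are invented.
sample_line = "gnl|dbS|13484888 gi|62750812 95 20 1 0"

columns = sample_line.split()   # split on any run of whitespace
third_column = columns[2]       # column 3 lives at index 2

print(columns[:4])
print(third_column)             # still a string at this point
```

Note that third_column is still text here, which is why the next step converts it with float().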

Here's how to compare a string to a number by converting to float (you can also convert to int for integers, natch):

>>> my_number_text = '95.09'
>>> if float(my_number_text) < 100:
...     print 'That number was less than 100!'
...     
That number was less than 100!
>>>

And finally here's how you open a file for writing

f = open(my_file, 'w')
f.write('Some text here\n')
f.close()

There you go. Everything that you asked for. All in one place, now isn't that nice?

jlm699 320 Veteran Poster

15 Years Ago

I am not able to do it this way

Why not? Show me the code you've created and describe the error that you get and I will gladly help you resolve it.

jlm699 320 Veteran Poster

15 Years Ago

In the future please use code tags when pasting code in this forum as it makes your code more readable and thus more people will be willing to read your post and answer your question. Here's how to use code tags:

[code=python] # Put your code inside here

[/code]

Now here's what your code would look like had you used code tags:

with open ('C:\\Documents and Settings\\Desktop\\file2.txt') as fil:
    f = fil.readlines
    for line in f:
        line.split()
        for text in line[2]:
            text = '90.00'
            if float(text) <100:
                result = line

Now with the indentation and syntax highlighting I can immediately see that you forgot to actually call readlines. So change the line f = fil.readlines so that it reads f = fil.readlines().

Next, you forgot to assign the split line to a variable. You can assign it to itself, so that your split line looks like this: line = line.split(). Now line will contain the split list instead of the original string.

The rest of your code is confusing to me... I suggest using print statements liberally to see what type of data you have and its contents at each point. That should help you figure out what's wrong with your last bit of code.

Run this to give you an example of what's going on (it's your code with the fixes I suggested above plus some debugging output):

with open ('C:\\Documents and Settings\\Desktop\\file2.txt') as fil:
    f = fil.readlines()
    for line in f:
        line = line.split()
        print ' |LINE| ', line
        print ' |LINE[2]| ', line[2], ' <-- Is this the value I want?'
        for text in line[2]:
            print ' |TEXT| ', text, ' <-- Was this for loop a good idea?'
            text = '90.00'
            print ' |TEXT| ', text, ' <-- Why did I just assign 90?'
            if float(text) <100:
                result = line
                print ' |RESULT| ', result, ' <-- Is this what I want?'

jlm699 320 Veteran Poster

15 Years Ago

Yes, this is the column i want to access for range between (90, 100)
and print the lines having these values.
So after this please suggest something
I want my outputfile to be like:

gnl|dbS|13484888 gi|62750812 95 20 .......
gnl|dbS|22484118 gi|62750812 92 20 .........
lines with column 3 values between 90 - 100 (colored red)

Alright, so in that case you'll need to remove the for loop that iterates over line[2] and instead compare float(line[2]) to the desired value.

I hope you notice that I'm intentionally not hand feeding you this code because challenging yourself to learn how to put it all together is a stepping stone to becoming a programmer. So please don't take offense, as I'm only trying to help you help yourself.

jlm699 320 Veteran Poster

15 Years Ago

from __future__ import with_statement

with open ('C:\\Documents and Settings\\jguleria\\Desktop\\file2.txt') as fil:
    f = fil.readlines()
    for line in f:
        line = line.split()
        for line[2] in line:
            if i in range(90, 100):
                print line

I want to do something like the above,
but I am getting an error message: i is not defined..!!

Please suggest...Thanks

You never defined i. I believe you meant for i to contain the value of line[2]; however, you'll want to convert it with float() before assigning it to i.

Also, remove the for line[2] in line: loop. Python accepts it, but on every pass it overwrites line[2] with the next item of line, which is not what you want.

EDIT: Also keep in mind the endpoints of range. The result of range(a, b) covers [a, b), and only whole numbers, so range(90, 100) gives you the integers 90 through 99.

joe82 0 Light Poster · Answer 1 · 2009-07-02T23:57:59+00:00

hi jlm699,

I am not able to do it this way. Can you please give me the right code??
many thanks..!!

joe82 0 Light Poster · Answer 2 · 2009-07-03T01:53:02+00:00

Why not? Show me the code you've created and describe the error that you get and I will gladly help you resolve it.

from __future__ import with_statement

with open ('C:\\Documents and Settings\\Desktop\\file2.txt') as fil:
    f = fil.readlines
    for line in f:
        line.split()
        for text in line[2]:
            text = '90.00'
            if float(text) <100:
                result = line

with open('C:\\Documents and Settings\\jDesktop\\j3.txt','w')as resultfile:
    resultfile.write(result)

I don't understand how to read column 3 of each line, test whether the number is between 90 and 100, and keep those lines....:(

joe82 0 Light Poster · Answer 3 · 2009-07-03T02:38:42+00:00

Thank you very much for helping me.

with open ('C:\\Documents and Settings\\Desktop\\file2.txt') as fil:
    f = fil.readlines()
    for line in f:
        line = line.split()
        print ' |LINE| ', line
        print ' |LINE[2]| ', line[2], ' <-- Is this the value I want?'

Yes, this is the column i want to access for range between (90, 100)
and print the lines having these values.

So after this please suggest something

I want my outputfile to be like:

gnl|dbS|13484888 gi|62750812 95 20 .......
gnl|dbS|22484118 gi|62750812 92 20 .........
lines with column 3 values between 90 - 100 (colored red)

Many Thanks

joe82 0 Light Poster · Answer 4 · 2009-07-03T02:47:43+00:00

from __future__ import with_statement

with open ('C:\\Documents and Settings\\jguleria\\Desktop\\file2.txt') as fil:
    f = fil.readlines()
    for line in f:
        line = line.split()
        for line[2] in line:
            if i in range(90, 100):
                print line
                #result = line

I want to do something like the above,
but I am getting an error message: i is not defined..!!

Please suggest...Thanks

joe82 0 Light Poster · Answer 5 · 2009-07-03T03:19:17+00:00

Thank you very much for the help. I understand that you do not want to spoon-feed me but are helping me to learn as well..:)

My current code is:

from __future__ import with_statement

with open ('C:\\Documents and Settings\\Desktop\\file2.txt') as fil:
    f = fil.readlines()
    for line in f:
        line = line.split()
        i = float(line[2])
        if i in range(90, 99.99):
            print line

It gave me an error message:
IndexError: list index out of range

please suggest on this....Thanks

jlm699 320 Veteran Poster · Answer 6 · 2009-07-03T03:42:18+00:00

from __future__ import with_statement

with open ('C:\\Documents and Settings\\Desktop\\file2.txt') as fil:
    f = fil.readlines()
    for line in f:
        line = line.split()
        i = float(line[2])
        if i in range(90, 99.99):
            print line

It gave me an error message:
IndexError: list index out of range

please suggest on this....Thanks

Notice what happens in my interpreter when I use the range that you provided:

>>> range(90,99.99)
C:\Python25\Lib\site-packages\wx-2.8-msw-unicode\wx\py\PyCrust.py:1:
DeprecationWarning: integer argument expected, got float
  """PyCrust is a python shell and namespace browser application."""
[90, 91, 92, 93, 94, 95, 96, 97, 98]
>>>

range returns a list of integers, and is not the best tool for testing the bounds of a floating point number.

Your best bet is probably to use comparison operators. Luckily for you, Python provides an easy way to check whether a value lies between two bounds.

Consider the following example:

>>> my_numbers = [1,5,15,6,21,10]
>>> for each_number in my_numbers:
...     if 1 <= each_number < 10:
...         print each_number, 'is in the range [1,10)'
...     
1 is in the range [1,10)
5 is in the range [1,10)
6 is in the range [1,10)
>>>

Note that 10 was not included in the output. I used the 'less than' ( < ) operator instead of the 'less than or equal to' ( <= ) operator.
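
As a hedged illustration of why a membership test against range only ever matches whole numbers, the sketch below sets it beside a chained comparison. It assumes Python 3, where range(90, 99.99) raises a TypeError instead of the DeprecationWarning shown above. The helper name in_bounds and the sample values are made up for this example.

```python
def in_bounds(value, low=90.0, high=100.0):
    """Return True when low <= value < high (hypothetical helper)."""
    return low <= value < high

values = [95.0, 91.67, 92.0, 100.0, 89.99]

# range(90, 100) holds only the integers 90..99, so 91.67 is never "in" it.
by_range = [v for v in values if v in range(90, 100)]
by_compare = [v for v in values if in_bounds(v)]

print(by_range)
print(by_compare)
```

The range test silently drops 91.67, while the comparison keeps every value from 90 up to, but not including, 100.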

The_Kernel 33 Light Poster · Answer 7 · 2009-07-03T05:13:41+00:00

If you're getting an IndexError it means the location you're referencing in the list doesn't exist. Your code assumes each line has at least three columns; however, I'd bet there's an empty line in there that's tripping you up. It's easy to fix in any case: just check that the length of the list is at least three. Here's your code, plus the fix:

from __future__ import with_statement

with open ('C:\\Documents and Settings\\Desktop\\file2.txt') as fil:
    f = fil.readlines()
    for line in f:
        line = line.split()
        if len(line) >= 3:
            i = float(line[2])
            if i in range(90, 99.99):
                print line
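
The failure described above is easy to reproduce. A minimal sketch, assuming Python 3 syntax and made-up sample lines: a blank line splits into an empty list, so asking it for index 2 would raise IndexError, and the length check skips it instead.

```python
lines = ["gnl|dbS|13484888 gi|62750812 95 20\n", "\n", "too short\n"]

kept = []
for line in lines:
    fields = line.split()
    if len(fields) < 3:      # blank or short line: there is no third column
        continue
    kept.append(fields[2])

print("\n".split())          # a blank line yields an empty list
print(kept)
```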

joe82 0 Light Poster · Answer 8 · 2009-07-06T21:34:24+00:00

Thank you very much for helping me but there is one problem in result..!!

I am just getting lines with integers not decimals !!!

i was expecting my result to print 5 lines including ( 95, 95, 91.62, 91.165 and 92)

i just got 3 lines with 95, 95 and 92..

What do you suggest on it??

here is the code:

from __future__ import with_statement

with open ('C:\\Documents and Settings\\jguleria\\Desktop\\file2.txt') as fil:
    f = fil.readlines()
    for line in f:
        line = line.split()
        if len(line) >= 3:
            i = float(line[2])
            if i in range(90, 99.99):
                print line

and here is result:

Thanks a lot..!!

joe82 0 Light Poster · Answer 9 · 2009-07-06T21:41:09+00:00

I also tried:

from __future__ import with_statement

with open ('C:\\Documents and Settings\\Desktop\\file2.txt') as fil:
    f = fil.readlines()
    for line in f:
        line = line.split()
        if len(line) >= 3:
            i = line[2]
            for each_number in i:
                if 90 <= each_number < 100:
                    print line

Not printing anything...:(
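
A likely reason nothing is printed, sketched under the assumption that column 3 holds text such as '91.67': looping over a string yields single characters, and those characters are never converted to numbers. In Python 3, comparing them with 90 raises a TypeError. Python 2, as far as I know, orders numbers before strings, so the upper-bound test never succeeds and no error is shown. Converting the whole field once avoids both problems.

```python
field = "91.67"                  # what line[2] would contain

print(list(field))               # iterating gives one character at a time

value = float(field)             # convert the whole field once
matches = 90 <= value < 100
print(matches)
```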

joe82 0 Light Poster · Answer 10 · 2009-07-06T21:47:30+00:00

Thank you everyone...

I fixed my problem...

Thanks a lot...:)

joe82 0 Light Poster · Answer 11 · 2009-07-06T22:01:17+00:00

Hello there,

Still got another problem...:(

When I am printing the lines, I am getting all the correct ones, but when I am trying to save the result in a file, I am getting just one line saved in my result file..

Can you please suggest something on this??

here is the code

from __future__ import with_statement

with open ('C:\\Documents and Settings\\Desktop\\file2.txt') as fil:
    f = fil.readlines()
    for line in f:
        line = line.split()
        if len(line) >= 3:
            i = float(line[2])
            if 90 <= i < 100:
                line = ' '.join(line)
                result = line

with open('C:\\Documents and Settings\\Desktop\\j3.txt','w')as resultfile:
    resultfile.write(result)

The_Kernel 33 Light Poster · Answer 12 · 2009-07-06T22:57:33+00:00

The problem is that you're just assigning "line" to "result" each time, hence "result" will only contain the last line you want to print. What you want to do is make "result" a list, then append the matching line to it. Then when you want to write it all to a file you can just do outfile.write('\n'.join(result))

joe82 0 Light Poster · Answer 13 · 2009-07-06T23:16:26+00:00

It's not working this way... I got one long column in my result file instead of lines....:(

The_Kernel 33 Light Poster · Answer 14 · 2009-07-06T23:42:43+00:00

Well here's how I'd change your code. Let me know if this is what you have or doesn't work:

from __future__ import with_statement

with open ('C:\\Documents and Settings\\Desktop\\file2.txt') as fil:
    f = fil.readlines()
    result = []
    for line in f:
        line = line.split()
        if len(line) >= 3:
            i = float(line[2])
            if 90 <= i < 100:
                line = ' '.join(line)
                result.append(line)
            
with open('C:\\Documents and Settings\\Desktop\\j3.txt','w') as resultfile:
    resultfile.write('\n'.join(result))
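
For readers on Python 3, the same approach can be sketched as a small function. The name filter_by_third_column, its parameters, and the file names in the usage note are all made up for illustration. The logic follows the code above, except that each matching line is written as soon as it is found instead of being collected in a list, and a non-numeric third column is skipped.

```python
def filter_by_third_column(in_path, out_path, low=90.0, high=100.0):
    """Copy lines whose third column satisfies low <= value < high.

    Hypothetical helper; returns the number of lines written.
    """
    written = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:                 # iterate lazily, no readlines()
            fields = line.split()
            if len(fields) < 3:          # skip blank or short lines
                continue
            try:
                value = float(fields[2])
            except ValueError:           # third column is not numeric
                continue
            if low <= value < high:
                dst.write(" ".join(fields) + "\n")
                written += 1
    return written
```

It could be called as filter_by_third_column("file2.txt", "j3.txt"), with both file names standing in for the real paths.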

joe82 0 Light Poster · Answer 15 · 2009-07-06T23:57:16+00:00

Waoo..!!
It worked..Thanks a lot..:)

Thanks to everyone who helped me in writing my code..!!

Now i want to modify my result if possible....please help..it looks little challenging to me...

from the above code my result is:( modified to look simpler)

NC_005111.2|NC_005111 95 20 1 0 68 87 31017559 31017578 4.4 32.3
NC_005111.2|NC_005111 91.67 24 2 0 63 86 35247737 35247714 4.4 32.3
NC_005111.2|NC_005111 91.67 24 2 0 64 87 40549054 40549031 4.4 32.3
NC_005111.2|NC_005111 92 24 2 0 63 86 42462636 42462659 4.4 32.3

Here the numbers colored red are the ones I just parsed...
The problem is: because the parsed column mixes decimals and whole numbers, the columns after it are shifted... For example, in the first row "95" is followed directly by "20", but there should be enough space that all the columns line up, or you could say tab delimited...

IS IT POSSIBLE TO DO ???

Many Thanks...!!!
