How to delete unwanted lines from a file?

Question

joe82 0 Light Poster

15 Years Ago

Hello everyone,

My file1.txt has sequences as given below:

>1|62798264|rs8174605|T/C||dbSNP|T/C
AAGAGGAGAAAGCAAAGTTGCAAAAGGTGAAAGAGAAAGAAGAGCTAGAGAAGGGCAGGA
AGGAGCAGAGTAAGCAGAGGGAGCCTCAGAAGAGACCGGA_GAGGAGGTGTTGGTGCTCA
>1|100159271|ENSRNOSNP145|T/A||ENSEMBL:celera|T/A
TCTTATAATTAGTCATTGTGATAACTGCTACAAACAAAGTCACAGGATCTTGTGAGAGAA
>1|19033646|rs8173848|C/T||dbSNP|C/T
TTGCAAAAAAAAAAAAAAAAAAAAAAAGCCAGAATCCAGCATAAGTCAAGGAAATCCACT
>1|149643853|rs8173465|G/T||dbSNP|G/T
AACAGAGACAGCTGTGATGTACCCCATGAGCTGGAAAGAGCAGCCCAGCGGTGTCCCAGC
>1|101456015|ENSRNOSNP1318|G/C||ENSEMBL:celera|G/C
AACTCTTAGAAGTTAGAACCTGGGGTGGAGAGATGGCTTGGTGGTTGAGAGCATTGACTG

I want a result file that does not contain the sequences with an "rs" number (e.g. rs8177678) in their header line.

So my output file should have 2 sequences:

>1|100159271|ENSRNOSNP145|T/A||ENSEMBL:celera|T/A
TCTTATAATTAGTCATTGTGATAACTGCTACAAACAAAGTCACAGGATCTTGTGAGAGAA
>1|101456015|ENSRNOSNP1318|G/C||ENSEMBL:celera|G/C
AACTCTTAGAAGTTAGAACCTGGGGTGGAGAGATGGCTTGGTGGTTGAGAGCATTGACTG

Please help me.

Thanks in advance..

python


shadwickman 159 Posting Pro in Training

15 Years Ago

And make sure that your file isn't so large that you shouldn't be loading it into memory all at once. You can go through the lines one at a time by iterating over the file object instead (the older xreadlines() method did the same thing, but it is deprecated).

# Mode "a" creates the output file if it does not exist, but appends
# to it if it does, so make sure it is blank (or missing) beforehand.
# Note: this drops only the lines containing "rs", not the sequence
# lines that follow them.

fhi = open("input_file", "r")
fho = open("output_file", "a")
for line in fhi:
    if "rs" not in line:
        fho.write(line)
fhi.close()
fho.close()

But if you are loading it all into memory as a list like paulthom suggested, make sure to loop over it in a way that deleting items as you go won't throw off the loop, as it would with paulthom's idea. The example below shows the odd effect of deleting items during a for loop:

L = ['a', 'b', 'c', 'd']
for i, item in enumerate(L):
    del L[i]

"""resulting L:
['b', 'd']
"""

You could do it like this though:

fh = open("input_file", "r")
data = fh.readlines()
fh.close()

i = 0
while i < len(data):
    if "rs" in data[i]:
        # deleting shifts the next item into position i, so don't advance
        del data[i]
    else:
        i += 1
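
Two other common ways to sidestep the shifting-index problem, as a minimal Python 3 sketch. The sample list is made up for illustration; in practice it would be the list returned by readlines(). Note that both variants drop only the lines that contain "rs"; removing the sequence lines that follow needs extra state.

```python
# Sample data standing in for the lines read from the file (illustrative only).
data = [">a|rs1\n", "AAAA\n", ">b|ENS2\n", "CCCC\n", ">c|rs3\n", "GGGG\n"]

# Option 1: build a new list instead of deleting from the old one.
kept = [line for line in data if "rs" not in line]

# Option 2: delete in place, walking the indices from the end so that
# removing an item never shifts the items still to be visited.
in_place = list(data)
for i in range(len(in_place) - 1, -1, -1):
    if "rs" in in_place[i]:
        del in_place[i]

print(kept == in_place)
```

The list comprehension is usually the simpler choice; the backwards loop is mainly useful when the list really has to be modified in place.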

shadwickman 159 Posting Pro in Training

15 Years Ago

So have a variable that, when set, makes the loop delete each line it encounters until it finds one starting with a >.
Set it to True once you find a line you want to remove, and the lines after it that don't start with > (which would be the sequence lines) get removed too. Then set it back to False once you encounter a line starting with > again.
Sorry if that wasn't too clear :P This is what I meant:

from __future__ import with_statement

with open ('dna.txt') as fil:
    f = fil.readlines()
    delete_seq = False
    for line in f:
        if line[0] == ">":
            delete_seq = False
            
        if "rs" in line:
            delete_seq = True
        elif not delete_seq:
            print line,

It will set delete_seq to True if it finds an "rs" in the line, and while delete_seq is True, it'll ignore any following lines until one of them starts with ">", which will set it back to False. If you need me to clarify, just ask. Here's my output:

>1|100159271|ENSRNOSNP145|T/A||ENSEMBL:celera|T/A
TCTTATAATTAGTCATTGTGATAACTGCTACAAACAAAGTCACAGGATCTTGTGAGAGAA
>1|101456015|ENSRNOSNP1318|G/C||ENSEMBL:celera|G/C
AACTCTTAGAAGTTAGAACCTGGGGTGGAGAGATGGCTTGGTGGTTGAGAGCATTGACTG
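
For reference, the same flag idea can be written as a small Python 3 generator. This is only a sketch: the name drop_rs_records is hypothetical, and unlike the code above it looks for "rs" in header lines only, deciding once per record whether to skip it. The sample lines are the first three records from the question.

```python
def drop_rs_records(lines):
    """Yield only the records whose header line does not contain "rs".

    A record is a header line starting with ">" plus every line after it
    up to the next header.
    """
    skipping = False
    for line in lines:
        if line.startswith(">"):
            # A new record begins: decide once whether to skip it.
            skipping = "rs" in line
        if not skipping:
            yield line


sample = [
    ">1|62798264|rs8174605|T/C||dbSNP|T/C\n",
    "AAGAGGAGAAAGCAAAGTTGCAAAAGGTGAAAGAGAAAGAAGAGCTAGAGAAGGGCAGGA\n",
    "AGGAGCAGAGTAAGCAGAGGGAGCCTCAGAAGAGACCGGA_GAGGAGGTGTTGGTGCTCA\n",
    ">1|100159271|ENSRNOSNP145|T/A||ENSEMBL:celera|T/A\n",
    "TCTTATAATTAGTCATTGTGATAACTGCTACAAACAAAGTCACAGGATCTTGTGAGAGAA\n",
    ">1|19033646|rs8173848|C/T||dbSNP|C/T\n",
    "TTGCAAAAAAAAAAAAAAAAAAAAAAAGCCAGAATCCAGCATAAGTCAAGGAAATCCACT\n",
]

result = list(drop_rs_records(sample))
print("".join(result), end="")
```

Because it accepts any iterable of lines, it can be fed an open file directly, for example fout.writelines(drop_rs_records(fin)), so the whole file never has to sit in memory. A plain substring test is loose; on real data a stricter check (for instance, that the third |-separated field starts with "rs") may be safer.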

shadwickman 159 Posting Pro in Training

15 Years Ago

Like I showed in my previous post, use a boolean to track it. It's initially False, but when it encounters a line you want to remove, it becomes True. While it's True, it will disregard the lines it encounters until it finds one starting with ">". It will be set to False when it finds that again, but as it continues, any lines that have that "rs" will set it to True again, etc.

It basically starts skipping lines after a line it finds with "rs" until it finds a line starting with ">", and then it checks that one to see if it's good. If it is, it stops ignoring lines, but if it's a bad line, it'll keep skipping, etc.


wildgoose 420 Practically a Posting Shark · Answer 1 · 2009-07-03T02:02:38+00:00

A file is not elastic, so there are two ways to do this.

1) Copy file A to file B, but skip over the parts of file A you don't want copied!

2) Invalidate the data! Write a 'bad' character code over the fields in the data you don't like. The file size will remain the same, but the data will be invalidated, so your reader code, once taught to ignore the 'bad' data characters, effectively strips the data.

Write blank space 0x20, or other white space character (tab), etc. or even 0xff or 0x7f. Some 8-bit character you wouldn't normally find in your data!
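
A rough Python 3 sketch of option 2, assuming the records look like the ones in the question (a header line starting with ">" followed by sequence lines) and that a space is an acceptable filler byte. The function names and the temporary demo file are hypothetical, and the sequences are shortened for illustration.

```python
import os
import tempfile

FILLER = b" "  # assumed filler byte; any byte absent from the real data would do


def blank_rs_records(path):
    """Overwrite unwanted records with filler bytes; the file size stays the same."""
    with open(path, "r+b") as f:
        skipping = False
        while True:
            start = f.tell()
            line = f.readline()
            if not line:
                break
            if line.startswith(b">"):
                skipping = b"rs" in line
            if skipping:
                end = f.tell()
                body = line.rstrip(b"\r\n")
                f.seek(start)
                f.write(FILLER * len(body))  # the line ending is left untouched
                f.seek(end)


def read_valid_lines(path):
    """Reader that has been taught to ignore blanked-out lines."""
    with open(path, "rb") as f:
        return [line for line in f if line.strip()]


original = (
    b">1|62798264|rs8174605|T/C||dbSNP|T/C\n"
    b"AAGAGGAGAAAG\n"
    b">1|100159271|ENSRNOSNP145|T/A||ENSEMBL:celera|T/A\n"
    b"TCTTATAATTAG\n"
)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(original)
demo_path = tmp.name

blank_rs_records(demo_path)
size_after = os.path.getsize(demo_path)
valid = read_valid_lines(demo_path)
os.remove(demo_path)

print(size_after == len(original))
print(valid)
```

The trade-off is that the file never shrinks and every reader has to know about the filler convention, so copying to a new file (option 1) is usually the simpler route unless the file is too large to duplicate.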

lllllIllIlllI 178 Veteran Poster · Answer 2 · 2009-07-03T03:04:00+00:00

What I would do is read the whole file into one list, then you can do something like:

# open it; it is automatically in read mode
f = open("dnalookingfile.txt")

# get a list with all the lines
lines = f.readlines()
f.close()

# iterate through the lines
for line in lines:
    if "rs" in line:
        # delete this one from the lines list and the next one as well
        pass
    else:
        # it's all good, the line does not have rs numbers in it
        pass

Then you could just rewrite the file after you are done and it would be fixed :)

Hopefully :P

wildgoose 420 Practically a Posting Shark · Answer 3 · 2009-07-03T03:08:21+00:00

That will work as well. But close file then reopen it with the create upon open or you will merely be overwriting the existing file and have debris from the previous file at the end!

jlm699 320 Veteran Poster · Answer 4 · 2009-07-03T03:47:23+00:00

wildgoose wrote: "That will work as well. But close file then reopen it with the create upon open or you will merely be overwriting the existing file and have debris from the previous file at the end!"

What wildgoose is trying to say here is make sure you open your new file in 'write mode' instead of trying to write with your 'read mode' file handler or another 'append mode' handler.

See, when you open a file using f = open(my_file, 'w') it will either create a new file, or clear the contents of an existing file with the same name (as defined by my_file).
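
A small Python 3 sketch of the difference, using a throwaway temporary file (the file contents are made up). Opening with "r+" leaves the old contents in place, so a shorter write leaves debris behind it, while "w" truncates the file first.

```python
import os
import tempfile

# Temporary file standing in for the real data file.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

with open(path, "w") as f:
    f.write("line one\nline two\nline three\n")

# "r+" keeps the old contents: writing less than was there leaves debris.
with open(path, "r+") as f:
    f.write("short\n")
with open(path) as f:
    after_r_plus = f.read()

# "w" truncates first, so only the new text remains.
with open(path, "w") as f:
    f.write("short\n")
with open(path) as f:
    after_w = f.read()

os.remove(path)
print(repr(after_r_plus))
print(repr(after_w))
```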

joe82 0 Light Poster · Answer 5 · 2009-07-06T23:08:54+00:00

Thank you everyone...

My code:

from __future__ import with_statement

with open ('C:\\Documents and Settings\\jDesktop\\file_E.txt') as fil:
    f = fil.readlines()
    for line in f:
        if "rs" not in line:
            print line

is removing the line with "rs", but I also need to remove the sequence attached to that line.

My input is:

>1|62798264|rs8174605|T/C||dbSNP|T/C
AAGAGGAGAAAGCAAAGTTGCAAAAGGTGAAAGAGAAAGAAGAGCTAGAGAAGGGCAGGA
AGGAGCAGAGTAAGCAGAGGGAGCCTCAGAAGAGACCGGA_GAGGAGGTGTTGGTGCTCA
>1|100159271|ENSRNOSNP145|T/A||ENSEMBL:celera|T/A
TCTTATAATTAGTCATTGTGATAACTGCTACAAACAAAGTCACAGGATCTTGTGAGAGAA
>1|19033646|rs8173848|C/T||dbSNP|C/T
TTGCAAAAAAAAAAAAAAAAAAAAAAAGCCAGAATCCAGCATAAGTCAAGGAAATCCACT
>1|149643853|rs8173465|G/T||dbSNP|G/T
AACAGAGACAGCTGTGATGTACCCCATGAGCTGGAAAGAGCAGCCCAGCGGTGTCCCAGC
>1|101456015|ENSRNOSNP1318|G/C||ENSEMBL:celera|G/C
AACTCTTAGAAGTTAGAACCTGGGGTGGAGAGATGGCTTGGTGGTTGAGAGCATTGACTG

I want my OUTPUT to be:

>1|100159271|ENSRNOSNP145|T/A||ENSEMBL:celera|T/A
TCTTATAATTAGTCATTGTGATAACTGCTACAAACAAAGTCACAGGATCTTGTGAGAGAA
>1|101456015|ENSRNOSNP1318|G/C||ENSEMBL:celera|G/C
AACTCTTAGAAGTTAGAACCTGGGGTGGAGAGATGGCTTGGTGGTTGAGAGCATTGACTG

With the above code I am getting this result:

AAGAGGAGAAAGCAAAGTTGCAAAAGGTGAAAGAGAAAGAAGAGCTAGAGAAGGGCAGGA
AGGAGCAGAGTAAGCAGAGGGAGCCTCAGAAGAGACCGGA_GAGGAGGTGTTGGTGCTCA

>1|101456015|ENSRNOSNP1318|G/C||ENSEMBL:celera|G/C
AACTCTTAGAAGTTAGAACCTGGGGTGGAGAGATGGCTTGGTGGTTGAGAGCATTGACTG
CTCTTCCAGAGGTCCTGAGTTCAAATCCCAGCAACCGCAT_GTGGCTCACAACCATCTGT

>1|100159271|ENSRNOSNP145|T/A||ENSEMBL:celera|T/A
TCTTATAATTAGTCATTGTGATAACTGCTACAAACAAAGTCACAGGATCTTGTGAGAGAA
TAATAGTGTTTAATTTAGACT

TTGCAAAAAAAAAAAAAAAAAAAAAAAGCCAGAATCCAGCATAAGTCAAGGAAATCCACT
CCAACACCATACTGACAAAGT

AACAGAGACAGCTGTGATGTACCCCATGAGCTGGAAAGAGCAGCCCAGCGGTGTCCCAGC
AGTCACCTGAAGGGGTAAAGC

Please suggest something.

Many Thanks..!!!

joe82 0 Light Poster · Answer 6 · 2009-07-07T02:12:51+00:00

Hi,

Thanks for replying to my post.
Can you please explain by giving an example??

from __future__ import with_statement

with open ('C:\\Documents and Settings\\Desktop\\file_E.txt') as fil:
    f = fil.readlines()
    for line in f:
        if line[0] =='>':
            if "rs" not in line:
                print line

This way I am just getting:
>1|101456015|ENSRNOSNP1318|G/C||ENSEMBL:celera|G/C

>1|100159271|ENSRNOSNP145|T/A||ENSEMBL:celera|T/A

no sequence...Please help..!!

Thanks

joe82 0 Light Poster · Answer 7 · 2009-07-07T02:18:30+00:00

Ohh..SORRY...I didn't see your code...

Thank you very much..:)

That was great help..!!