Hi,
I have written a program that searches through a text file and, in the end, should give me some probabilities. For each line, it searches for a certain string and, if that string exists, extracts the probability that follows it. The problem is that sometimes the string is not found because the line just says DontExist, and in that case I want the probability to be zero. How can I accomplish this? Where do I put this check inside the for loop? My program looks like this:

f= open('filename', 'r')
data = f.readlines()
f.close()
f2= open('filename', 'r')
dat = f2.readlines()
f2.close()

for lines in data:
    if ('##' not in data):
        start=lines.index("Bgc")        
        end=lines.index(',',"Bgc")  
        prob=[start:end]  
        start2=lines.index("Bgc")        
        end2=lines.index(',',"Bgc")  
        prob2=[start:end] 
        lines = lines.split('\t')
        r1 = line[1]
        for line in dat:  
            line = line.split('\t')
            r2 = line[2]
            if r2 == r1:
               return prob/prob2

Where would I put the if statement so that the probability is returned as 0 when Bgc is not found, and prob/prob2 is calculated and returned otherwise?

Your code does not make sense to me.

Same with me. Post a sample of the data file so we know how your data is structured then try to explain exactly what you do more clearly and we can suggest a better code.

Sorry, maybe I was unclear. I'm opening 2 files (f and f2). I go into f and search for the string Bgc, and if Bgc exists I extract the probability (the lines in the file contain, among other things, entries such as Bgc=0.975). I also forgot one line in the code; after prob I do this:

Bgc = Bgc.replace("Bgc=" , '')

, so that I only end up with the probabilities.

Then I go into the other file, because I only want the probabilities if the r1 ID matches the r2 ID in the second file f2. If they match, I divide prob/prob2.

Did this make more sense?

The 1st file looks like this (with tab between each column)

#file  R       plp   trace  class   info
20    QRT43   1413     29   FIL   NS=3,DP=14,Bgc=0.5,DB,H2,GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5

The first row is a header. 2nd row data starts and it contains 6 columns and in the last column I want to extract Bgc.

2nd file f2 looks like this (also tab file)

13  ABG1324  QRT43  23455

f2 contains many rows and 4 columns. Here r2=QRT43

It's r1 and r2 that I want to compare in the if statement.

>>> prob=[start:end]
SyntaxError: invalid syntax
>>> 'asdfa, asfda , Bgc asfda'.index(',',"Bgc")

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    'asdfa, asfda , Bgc asfda'.index(',',"Bgc")
TypeError: slice indices must be integers or None or have an __index__ method
>>>

prob=[start:end]

This is meaningless, but you can write

>>> prob = slice(3, 7)
>>> L = list(i**i for i in range(10))
>>> L[prob]
[27, 256, 3125, 46656]

I doubt this use of slicing was the OP's intention.
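For what it's worth, the two errors above have simple causes: a slice needs the string in front of the brackets (lines[start:end]), and the second argument of index() must be an integer start position, not a string. Here is a minimal sketch of what was probably intended. It uses str.find instead of index so that a missing Bgc gives 0.0 rather than an exception. The helper name extract_prob and the choice to return 0.0 for unparsable values are my own assumptions, not something from the original code.

```python
def extract_prob(line, key="Bgc"):
    """Return the float that follows 'key=' in line, or 0.0 if it is absent.

    extract_prob is a hypothetical helper name used for illustration.
    """
    tag = key + "="
    start = line.find(tag)        # find() returns -1 when the tag is missing
    if start == -1:
        return 0.0
    start += len(tag)             # move past 'Bgc=' to the number itself
    end = line.find(",", start)   # look for the next comma from 'start' on
    if end == -1:
        end = len(line)           # the value runs to the end of the line
    try:
        return float(line[start:end])
    except ValueError:
        return 0.0                # assumption: treat unparsable text as 0.0


print(extract_prob("NS=3,DP=14,Bgc=0.5,DB,H2"))   # prints 0.5
print(extract_prob("dontExist"))                  # prints 0.0
```

Note that find() matches substrings, so a key such as "xBgc=" would also be picked up; the regular expression approach further down the thread has the same limitation.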

@sofia85 The problem (and this may be why your code is so unclear) is that crucial information is missing in your problem: do R values appear more than once in each file ? Do some of them appear in one file and not the other ? Before writing code, you should be able to write an algorithm which describes precisely the successive steps of your program: write pseudo-code. We still don't know what your program is supposed to do.

Sorry, I'm a beginner at both programming and python. I will try to write a pseudo code instead.

open f1
open f2

for each line in f1:
    search for Bgc & Bgd     #Bgc, Bgd is equal to a probability
    if Bgc & Bgd is found extract probability1 & probability2
    do a split and then extract the column containing all r's #they are ID nr's
    for each line in f2:
        extract column containing all r-IDs
        if (r-ID in f1) == (r-ID in f2):
            return prob1/prob2

My problem is that sometimes, instead of containing these lines of information (#file R plp trace class info), f1 just contains dontExist. This is where my program crashes, because it only looks for Bgc and Bgd on each line and does not handle the case where they don't exist.

If the line only has dontExist, I just want the program to return 0 for the probabilities, and prob1/prob2 otherwise.
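One way to handle the dontExist case is to check the number of tab-separated columns before looking for Bgc and Bgd. Below is a rough sketch of the pseudo-code above, under several assumptions that are not confirmed in the thread: a dontExist line has fewer than 6 columns, the ID is column 2 in f1 and column 3 in f2, a line whose ID is not in f2 is skipped, and a missing or zero Bgd gives 0.0 instead of a division error. The function names are made up for illustration.

```python
def parse_info(info):
    """Turn 'NS=3,Bgc=0.5,DB' into {'NS': '3', 'Bgc': '0.5'} (flags are dropped)."""
    return dict(item.split("=", 1) for item in info.split(",") if "=" in item)


def to_float(text):
    """Convert text to float; None or unparsable text becomes 0.0 (assumption)."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return 0.0


def ratios(f1_lines, f2_lines):
    """Return a list of (ID, Bgc/Bgd) pairs, with (None, 0.0) for dontExist lines."""
    # Collect the IDs of the second file once, instead of re-reading it per line.
    ids = set()
    for line in f2_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 2:
            ids.add(cols[2])

    results = []
    for line in f1_lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue                          # skip blank lines and the header
        cols = line.split("\t")
        if len(cols) < 6:                     # e.g. a line that only says dontExist
            results.append((None, 0.0))
            continue
        if cols[1] not in ids:
            continue                          # assumption: unmatched IDs are ignored
        info = parse_info(cols[5])
        bgc = to_float(info.get("Bgc"))
        bgd = to_float(info.get("Bgd"))
        results.append((cols[1], bgc / bgd if bgd else 0.0))
    return results


f1 = ["#file\tR\tplp\ttrace\tclass\tinfo\n",
      "20\tQRT43\t1413\t29\tFIL\tNS=3,Bgc=0.5,Bgd=0.25,DB\n",
      "dontExist\n"]
f2 = ["13\tABG1324\tQRT43\t23455\n"]
print(ratios(f1, f2))
```

With real files, the two lists would simply be replaced by the open file objects, since both are iterated line by line.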

I would first parse the lines with adapted datatypes like this

from collections import namedtuple
from itertools import islice
import re
import sys

file1 = iter("""#file  R       plp   trace  class   info
20\tQRT43\t1413\t29\tFIL\tNS=3,DP=14,Bgc=0.5,DB,H2,GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5
30\tQZEE43\t1413\t29\tFIL\tNS=3,DP=14,Bgc=0.2,DB,H2,GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5
15\tZZT43\t1413\t29\tFIL\tNS=3,DP=14,Bgd=0.7,DB,H2,GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5
""".strip().splitlines()) # replace this with your opened file 1

file2 = iter("""foo\tbar\tR\tbaz
13\tABG1324\tQRT43\t23455
13\tABG1324\tQRR43\t23887
13\tABG1324\tORT47\t20993
13\tABG1324\tQR453\t23455
13\tABG1324\tQROR3\t27466
""".strip().splitlines()) # replace this with your opened file 2

# Define a custom datatype to read file1

class LineA(namedtuple("LineA", "file R plp trace klass info")):    
    @property
    def bgc(self):
        return self._find_float("Bgc")
    
    @property
    def bgd(self):
        return self._find_float("Bgd")
    
    def _find_float(self, name):
        match = re.search(r"%s=([^,]*)" % name, self.info)
        if match:
            try:
                return float(match.group(1))
            except ValueError:
                return 0.0
        else:
            return 0.0

F1 = []
for line in islice(file1, 1, None):
    line = LineA(*line.split("\t", 5))
    F1.append(line)

# F1 is now a list of LineA objects, which have
# attributes .file, .R, .plp, .trace, .klass, .info, .bgc, .bgd
# the 2 last attributes are floats, with value 0.0 if nothing was found

for line in F1:
    print(line)

for line in F1:
    print(line.bgc, line.bgd)


LineB = namedtuple("LineB", "foo bar R baz")

F2 = []
for line in islice(file2, 1, None):
    F2.append(LineB(*line.split("\t", 3)))

# F2 is now a list of LineB objects, which have
# attributes .foo, .bar, .R, .baz

for line in F2:
    print(line)

# Now try to extract the information you want from these two lists F1 and F2

"""My output -->
LineA(file='20', R='QRT43', plp='1413', trace='29', klass='FIL', info='NS=3,DP=14,Bgc=0.5,DB,H2,GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5')
LineA(file='30', R='QZEE43', plp='1413', trace='29', klass='FIL', info='NS=3,DP=14,Bgc=0.2,DB,H2,GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5')
LineA(file='15', R='ZZT43', plp='1413', trace='29', klass='FIL', info='NS=3,DP=14,Bgd=0.7,DB,H2,GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5')
0.5 0.0
0.2 0.0
0.0 0.7
LineB(foo='13', bar='ABG1324', R='QRT43', baz='23455')
LineB(foo='13', bar='ABG1324', R='QRR43', baz='23887')
LineB(foo='13', bar='ABG1324', R='ORT47', baz='20993')
LineB(foo='13', bar='ABG1324', R='QR453', baz='23455')
LineB(foo='13', bar='ABG1324', R='QROR3', baz='27466')
"""

My suggestion is to take 2 files with reasonable numbers of lines to start with, let's say 20 or 30 lines, create 2 lists the way I wrote it above, and try to produce your desired information with these 2 lists. If this works, it should be easy to generalize to larger files.
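As a possible next step, here is a small sketch of how the two lists could be combined. It is only a guess at the desired result, since the thread never settles whether R values can repeat: it assumes each R appears at most once in F1, that only R values present in both lists are wanted, and that a zero bgd should give 0.0. Rec1 and Rec2 are simplified stand-ins for the LineA and LineB records above, not part of the original code.

```python
from collections import namedtuple

# Simplified, hypothetical stand-ins for the LineA / LineB records
Rec1 = namedtuple("Rec1", "R bgc bgd")
Rec2 = namedtuple("Rec2", "R")

F1 = [Rec1("QRT43", 0.5, 0.25), Rec1("QZEE43", 0.2, 0.0), Rec1("ZZT43", 0.0, 0.7)]
F2 = [Rec2("QRT43"), Rec2("QZEE43"), Rec2("ORT47")]

# A set of the R values in F2 makes each membership test cheap
# and avoids looping over F2 once per line of F1.
wanted = {rec.R for rec in F2}

ratios = {}
for rec in F1:
    if rec.R in wanted:
        # Guard against dividing by zero when bgd was not found (0.0).
        ratios[rec.R] = rec.bgc / rec.bgd if rec.bgd else 0.0

print(ratios)
```

If R values can repeat in F1, a dictionary keyed on R would keep only the last ratio, so a list of (R, ratio) pairs would be the safer container in that case.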

commented: inherit from namedtuple as factory, interesting +13