Compare 4 tab-delimited files

Question

biomed 0 Newbie Poster

15 Years Ago

Hi,
I am new to Python, but I want to use it instead of Perl because I think it is a better overall approach to programming. This is a bioinformatics problem.

I have four identically formatted tab-delimited files that contain genomic variation data. Each line looks like this:
chr1 11828655 152 uc001ati.1 * R ND ND ND -30 ND ND ND NPPA

The lines in the four files match on different subsets of attributes. I want to compare all four files and create a master file that uses the matching attribute as the new key and records, for each of the four files, whether it contains that key. So, say files 2, 3 and 4 have the same [1] item: I want to be able to report this.
Does anyone have an idea how best to approach this problem? I tried line and text manipulation, but these seem too limited for what I need to do. I would like to do this without a database, but it may come to that. Please let me know if you think that is the way to go and whether I should leave Python for this.

Thanks


Gribouillis 1,391 Programming Explorer

15 Years Ago

I agree with you that line and text manipulations are too limited to handle the problem. You can easily write Python functions that transform your lines into tuples, which are much easier to handle, compare, sort, etc. For example, here is a function that reads a file as a sequence of tuples:

def gen_items(fileobj):
    fileobj.seek(0)
    for line in fileobj:
        yield tuple(line.strip().split())

Then you could use it like this

infile = open("myfile.txt", "r")
for item in gen_items(infile):
    print(item)

And this should print a sequence of tuples like

('chr1', '11828655',  '152',  'uc001ati.1', '*',  'R',  'ND',  'ND',  'ND',  '-30',  'ND',  'ND', 'ND',  'NPPA')

So that item[7] would be the string 'ND' for example.
Conversely, you can write a function to convert a tuple to a line

def to_line(item):
    return "\t".join(item)

which you can use like this to produce an output file

outfile = open("output.txt", "w")  # "w" opens the file for writing
for item in mysequence:
    outfile.write(to_line(item))
    outfile.write("\n")
outfile.close()
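If some fields can be empty, splitting on arbitrary whitespace may shift the columns. One alternative is the standard `csv` module with `delimiter="\t"`. The following is only a sketch written for Python 3; the helper names `read_rows` and `write_rows` and the sample line are assumptions for illustration, not part of the reply above.

```python
import csv
import io

def read_rows(fileobj):
    """Yield each tab-delimited line of fileobj as a tuple of strings."""
    for row in csv.reader(fileobj, delimiter="\t"):
        yield tuple(row)

def write_rows(fileobj, rows):
    """Write an iterable of tuples to fileobj as tab-delimited lines."""
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)

# In-memory demonstration; a real script would pass open file objects instead.
source = io.StringIO("chr1\t11828655\t\tNPPA\n")
rows = list(read_rows(source))  # the empty third field is preserved

target = io.StringIO()
write_rows(target, rows)
```

With real files, opening them with `newline=""` is what the `csv` documentation recommends.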

The algorithm to compare the items in the different files should not be too complex. A question is the size of your files. If your files are not too big, you can load entire files as lists of items

mylist = list(gen_items(open("my_file.txt")))

and compare the lists of tuples for example.
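As one possible way to compare the lists, the sketch below records, for each value found in a chosen column, which of the files contain it. The function name `presence_by_key`, the `key_index` parameter, and the sample rows are illustrative assumptions, not something given in the thread.

```python
def presence_by_key(items_per_file, key_index):
    """Map each key to a list of booleans, one per file (True if the file has the key)."""
    # One set of key values per file, taken from the chosen column.
    key_sets = [set(item[key_index] for item in items) for items in items_per_file]
    all_keys = set().union(*key_sets)
    return dict((key, [key in keys for keys in key_sets]) for key in all_keys)

# Made-up rows standing in for four loaded files (only two columns shown).
file1 = [("chr1", "11828655"), ("chr2", "500")]
file2 = [("chr1", "11828655")]
file3 = [("chr3", "11828655")]
file4 = [("chr9", "42")]

table = presence_by_key([file1, file2, file3, file4], key_index=1)

# Print one tab-delimited master line per key: the key, then yes/no for each file.
for key in sorted(table):
    flags = "\t".join("yes" if present else "no" for present in table[key])
    print(key + "\t" + flags)
```

Because only the sets of keys are kept, this answers "which files have this value" but not "which lines"; keeping the full tuples per key would need a dictionary of lists instead.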


Gribouillis 1,391 Programming Explorer

15 Years Ago

You should replace lines 18-23 of your script (the final for loop and the two deneme lines) with

files = [openfile(p) for p in paths]
deneme = files[0].readline()
print deneme


tbone2sk 14 Junior Poster in Training · Answer 1 · 2010-04-02T05:46:30+00:00

Still a lot that you can add, but this should get you started.

class Line:
    
    def __init__(self,one,two,three,four):
        self.one = one
        self.two = two
        self.three = three
        self.four = four

class FindMatch:

    def __init__(self):
        self.dict = {}

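    def addFromFile(self, filename):
        # Key each line on its first column; a later line with the same
        # key overwrites an earlier one.
        with open(filename, 'r') as infile:
            for line in infile:
                splt = line.rstrip('\n').split('\t')
                lineObj = Line(splt[0], splt[1], splt[2], splt[3])
                self.dict[splt[0]] = lineObj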

    def printMatching(self, matchKey):
        # A direct dictionary lookup replaces the loop over every key.
        if matchKey in self.dict:
            obj = self.dict[matchKey]
            print("%s\t%s\t%s\t%s" % (obj.one, obj.two, obj.three, obj.four))

def main():

    find = FindMatch()

    """Add lines to dictionary from file"""
    find.addFromFile('FILENAME.txt')
    
    find.printMatching('KEY USED TO TEST FOR MATCH')
    
if __name__ == '__main__':
    main()
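One limitation of the class above is that `self.dict[splt[0]] = lineObj` keeps only the last line seen for each key and does not remember which file it came from. A possible extension, sketched here with a hypothetical function `index_lines` and made-up in-memory data, stores every matching line together with a label for its source:

```python
from collections import defaultdict

def index_lines(labelled_lines, key_index=0):
    """Group split lines by the value in column key_index, remembering the source label."""
    index = defaultdict(list)
    for label, lines in labelled_lines:
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            index[fields[key_index]].append((label, fields))
    return index

# Hypothetical sample data; in practice each list would be an open file object.
sample = [
    ("file1", ["chr1\t11828655\t152\n", "chr2\t500\t7\n"]),
    ("file2", ["chr1\t11828655\t152\n"]),
]
index = index_lines(sample)

# Labels of every source that contains the key "chr1".
sources = [label for label, fields in index["chr1"]]
```

Changing `key_index` selects a different column as the matching attribute, which is closer to the "master file keyed on the matching attribute" described in the question.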

biomed 0 Newbie Poster · Answer 2 · 2010-04-02T22:28:06+00:00

Thanks for your help. It got me started, and I am trying to write simpler steps to help me understand and move forward with the solution.
With this code I am trying to create two functions. One will open N files given their paths, and the other should iterate over all the files, read the first line of each, and print it.

#!/usr/bin/python2.6

path1 = "some path"
path2 = "some path"
path3 = "some path"
path4 = "some path"

paths=[path1, path2, path3, path4]

def openfile(path):
	infile=open(path,"r")

# I want this function to take the opened file as a parameter, read whatever lines I want, and then do something with them (like print them)
def addLineFromFile(openfile):
	line= openfile.readline()
	print line

for item in paths:
	openfile(item)
	print "File at "+item+" is opened."	

deneme = openfile(paths[0]).readline(0)
print deneme

So how can I make this work? Thanks.

biomed 0 Newbie Poster · Answer 3 · 2010-04-02T23:18:00+00:00

Thanks and I updated the code to

#!/usr/bin/python2.6

path1 = "some path"
path2 = "some path"
path3 = "some path"
path4 = "some path"

paths=[path1, path2, path3, path4]

def openfile(path):
	infile=open(path,"r")

files = [openfile(p) for p in paths]
deneme = files[0].readline()
print deneme

But this is the error I got:
deneme = files[0].readline()
AttributeError: 'NoneType' object has no attribute 'readline'

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 4 · 2010-04-03T00:31:19+00:00

You must rewrite openfile as

def openfile(path):
    return open(path,"r")
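The error arises because a Python function without a `return` statement returns `None`, so the list comprehension filled `files` with `None` values. Below is a small self-contained sketch of the corrected flow. The temporary files and their contents are made up so that the example runs anywhere; they stand in for the four real paths.

```python
import os
import tempfile

def openfile(path):
    # Return the file object so the caller can use it.
    return open(path, "r")

# Create two throwaway files for the demonstration.
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "chr1\t100\n"), ("b.txt", "chr2\t200\n")]:
    path = os.path.join(tmpdir, name)
    with open(path, "w") as handle:
        handle.write(text)
    paths.append(path)

files = [openfile(p) for p in paths]
first_lines = [f.readline() for f in files]  # first line of each file

# Close the files once they are no longer needed.
for f in files:
    f.close()
```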
