Count number of reads in each genomic coordinate

Question

Stackheuw 0 Newbie Poster

13 Years Ago

I am trying to count how many values from one file (a single column of positions) fall within each interval defined in another file (two columns: start and end).

I am completely stuck on how to map it.

I tried something like this:

for line in file1:
    if line[0] == line2[0] and line2[1] < line[1] < line2[2]:
        print(line)

I'm not sure if this is correct.

file 1:
elem1 39887
elem1 72111

file 2:
elem1 1 57898
elem1 57899 69887
elem2 69888 82111

In file1, elem1 is an element in my project, and the value 39887 is its start coordinate.

In file2 elem1 is still an element in my project, but the values are start and end coordinates. File2 is only a reference file.

For every line in file2, I want to see if the "elem#"=="elem#" in file 1. If the elem# in file1 is equal to elem# in file2, then I want to continue in this loop and see if the corresponding value in file1 is between the start and end positions in file2.

For instance, in the first line of file1, elem1==elem1 in the first line of file2. Since they are equal, is 39887 between 1 and 57898? Yes it is, therefore count it. I need to do this for every line in file2.

In the end, I want to see how many elements are within each group of coordinates from file2.

python

All 7 Replies

woooee 814 Nearly a Posting Maven

13 Years Ago

"I want to see how many elements are within each group of coordinates from file2"

Does this mean you want to count them or print/copy them?

You only have to store the first file in a container, and check each record from the second file against it. The following uses a dictionary and the test data submitted. To keep track of the number of records found, you can either change the dictionary to point to a list that also contains a counter, or if you think it is easier to understand, use a second dictionary using the same key pointing to a counter. Either way, post your code for more assistance.

file_1 = ["elem1 39887", "elem2 72111"]
file_1_dict = {}
for rec in file_1:
    rec_split = rec.split()
    key = rec_split[0]
    if key not in file_1_dict:  ## keep only the first entry if a key is duplicated
        ## convert to int, because strings compare character by character from left to right
        file_1_dict[key]=int(rec_split[1])

file_2 = ["elem1 1 57898", "elem1 57899 69887", "elem2 69888 82111"]
for rec in file_2:
    rec_split = rec.split()
    key = rec_split[0]
    if key in file_1_dict:
        low=int(rec_split[1])
        high=int(rec_split[2])
        print("testing key", key, low, high, end=" ")
        if low < file_1_dict[key] < high:
            print("Found")
        else:
            print("Not Found")
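
To actually count the records, the "second dictionary" idea described above could look roughly like the following. This is only a sketch written for Python 3, not a finished solution: the function name count_hits is made up for illustration, every position is kept (including repeated elements), and the interval bounds are assumed to be inclusive. Change the comparison to low < p < high if the end points should not count.

```python
from collections import defaultdict


def count_hits(position_lines, interval_lines):
    """Count, for each interval, the positions with the same element name inside it."""
    # Group every position by element name, keeping duplicates.
    positions = defaultdict(list)
    for rec in position_lines:
        name, pos = rec.split()
        positions[name].append(int(pos))

    counts = []
    for rec in interval_lines:
        name, low, high = rec.split()
        low, high = int(low), int(high)
        # Assumption: both end points belong to the interval.
        hits = sum(1 for p in positions.get(name, []) if low <= p <= high)
        counts.append((name, low, high, hits))
    return counts


file_1 = ["elem1 39887", "elem2 72111"]
file_2 = ["elem1 1 57898", "elem1 57899 69887", "elem2 69888 82111"]
for name, low, high, hits in count_hits(file_1, file_2):
    print(name, low, high, hits)
```

Each interval is checked against every position of its element, which is fine for small files. For large files, sorting the positions per element and using the bisect module would avoid the repeated scans.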

woooee 814 Nearly a Posting Maven

13 Years Ago

See this similar thread for a more complete solution (probably someone in the same class). It is pretty close to what was posted here. Also read this thread.

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 1 · 2011-09-07T00:24:43+00:00

Looks like you are putting data into bins: http://www.daniweb.com/software-development/python/code/373120, only with integer data and self-defined bins. NumPy should handle it with ease.
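
As a rough illustration of the binning idea with NumPy (not the code from the linked snippet), the sketch below counts positions per self-defined interval for a single element using np.searchsorted. The helper name count_in_intervals is hypothetical, the bounds are assumed to be inclusive, and the data are the elem1 rows from the question.

```python
import numpy as np


def count_in_intervals(positions, starts, ends):
    """Return, for each [start, end] interval, how many positions lie inside it."""
    sorted_pos = np.sort(np.asarray(positions))
    # side="right" gives the number of positions <= end,
    # side="left" gives the number of positions < start.
    upper = np.searchsorted(sorted_pos, ends, side="right")
    lower = np.searchsorted(sorted_pos, starts, side="left")
    # The difference is the number of positions in [start, end].
    return upper - lower


# elem1 rows from the question's file 1 and file 2
elem1_positions = [39887, 72111]
elem1_starts = [1, 57899]
elem1_ends = [57898, 69887]
counts = count_in_intervals(elem1_positions, elem1_starts, elem1_ends)
print(counts.tolist())
```

Because the same region can appear under different elements, the positions and intervals would have to be grouped by element name first, with this function called once per element.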

Stackheuw 0 Newbie Poster · Answer 2 · 2011-09-07T00:44:20+00:00

Thanks pyTony. I will definitely give this a shot. I'm going crazy trying to figure this out.

Stackheuw 0 Newbie Poster · Answer 3 · 2011-09-09T19:44:30+00:00

Sorry for the late reply; the notifications all went to my spam folder. I will post what I have, which is very similar to what you have. Yes, I want to track the number of records found in each region, if and only if the record belongs to that element. Some could have the same region but different elements.

Stackheuw 0 Newbie Poster · Answer 4 · 2011-09-09T19:46:54+00:00

Yes, count them and print the counts to a file.
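
One possible way to read both files from disk and write the counts out is sketched below for Python 3. The function name write_counts, the file names, and the tab-separated output layout are all assumptions made for illustration, and the interval bounds are again treated as inclusive. The demo writes the question's sample data to a temporary directory so that it runs on its own.

```python
import os
import tempfile
from collections import defaultdict


def write_counts(positions_path, intervals_path, out_path):
    """Write one line per interval: element, start, end, number of hits."""
    # Collect all positions per element name from the first file.
    positions = defaultdict(list)
    with open(positions_path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) == 2:  # skip blank or malformed lines
                positions[fields[0]].append(int(fields[1]))

    with open(intervals_path) as fh, open(out_path, "w") as out:
        for line in fh:
            fields = line.split()
            if len(fields) != 3:  # skip blank or malformed lines
                continue
            name, low, high = fields[0], int(fields[1]), int(fields[2])
            # Assumption: both end points belong to the interval.
            hits = sum(1 for p in positions.get(name, []) if low <= p <= high)
            out.write("%s\t%d\t%d\t%d\n" % (name, low, high, hits))


# Demo with the data from the question, written to a temporary directory.
workdir = tempfile.mkdtemp()
file1 = os.path.join(workdir, "file1.txt")
file2 = os.path.join(workdir, "file2.txt")
result = os.path.join(workdir, "counts.txt")
with open(file1, "w") as fh:
    fh.write("elem1 39887\nelem1 72111\n")
with open(file2, "w") as fh:
    fh.write("elem1 1 57898\nelem1 57899 69887\nelem2 69888 82111\n")

write_counts(file1, file2, result)
with open(result) as fh:
    print(fh.read(), end="")
```

With the question's data, only the first elem1 interval gets a hit, because 72111 is listed under elem1 in file 1 while the matching region belongs to elem2 in file 2.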

Stackheuw 0 Newbie Poster · Answer 5 · 2011-09-09T23:35:37+00:00

Lol, that's my post. I'm trying to get a new set of eyes on my problem. It's not for a class. I'm doing a personal mining task. Haven't been in school in. LOL, I'm hoping I can figure it out soon. :)
