Dear All

I am working with a tab-delimited file: the first two columns are pairs of identifiers and the third column is a floating point number (0-1) denoting the interaction strength between column1 and column2. A sample file is also attached.


e.g.

column1 column2 column3

john steve 0.67588
john matt 1.00
red blue 0.90
yellow steve 0.02

and so on...

From this file, for each unique identifier in the first column, I need to count how many connections it has in the second column (for example, john has two connections and red and yellow have one each) at different thresholds on column3 from 0 to 1, and write the output to a file.

The final output should look something like this:

for column3 >= 0.6

john 2
red 1
yellow 0

Thanks..

Attachments
column1	column2	strength
208991str	823str	0.325422
208991str	39402str	0.492476
208991str	38707_rstr	0.693147
208991str	38691_sstr	0.63908
208991str	38241str	0.63908
208991str	37892str	0.405465
208991str	37152str	0.549084
208991str	36711str	0.448025
208991str	33323_rstr	0.325422
208991str	32069str	0.448025
208991str	31845str	0.492476
208986str	243981str	0.750306
208986str	243140str	0.587787
208986str	242979str	0.325422
208986str	242172str	0.587787
208986str	241994str	0.325422
208986str	241716str	0.325422
208986str	241017str	0.693147
208986str	240437str	0.538997
208986str	240052str	0.587787
208986str	239629str	0.693147
208986str	239272str	0.325422
208986str	239132str	0.63908
208986str	238996_xstr	0.63908
208986str	238725str	0.587787
208986str	238520str	0.587787
208986str	238481str	0.364643
208986str	238029_sstr	0.511901
208986str	236561str	0.693147
208986str	236313str	0.63908
208986str	236207str	0.587787
208986str	235670str	0.587787
208986str	235591str	0.538997
208986str	235296str	0.664472
208986str	235044str	0.492476
208986str	232874str	0.405465
208986str	232231str	0.750306
208986str	231775str	0.63908
208986str	231768str	0.693147
208986str	231577_sstr	0.63908
208986str	231513str	0.538997
208986str	231240str	0.405465
208986str	231235str	0.538997
208986str	230337str	0.693147
208986str	230218str	0.538997
208986str	229659_sstr	0.538997
208986str	228964str	0.693147
208986str	228956str	0.492476
208986str	228915str	0.492476
208986str	228754str	0.747512
208986str	228707str	0.492476
208986str	228554str	0.325422
208986str	228498str	0.63908
208986str	228469str	0.448025
208986str	228121str	0.325422
208986str	227997str	0.563063
208986str	227961str	0.63908
208986str	227697str	0.492476
208986str	227677str	0.587787
208986str	227314str	0.587787
208998str	215446_sstr	0.492476
208998str	215049_xstr	0.799656
208998str	214974_xstr	0.419768
208998str	214879_xstr	0.63908
208998str	214830str	0.613837
208998str	214805str	0.405465
208998str	214770str	0.587787
208998str	214736_sstr	0.63908
208998str	214732str	0.364643
208998str	214710_sstr	0.448025
208998str	214617str	0.639812
208998str	214596str	0.693147
208998str	214476str	0.492476
208998str	214456_xstr	0.364643
208998str	214452str	0.750306
208998str	214440str	0.750306
208998str	214430str	0.63908
208998str	214428_xstr	0.492476
208998str	214329_xstr	0.613837
208998str	214321str	0.325422
208998str	214199str	0.63908
208998str	214146_sstr	0.325422
208998str	214091_sstr	0.405465
208998str	214079str	0.492476
208998str	214058str	0.63908
208998str	213931str	0.766415
208998str	213906str	0.651846
208998str	213905_xstr	0.492476
208998str	213844str	0.587787
208998str	213746_sstr	0.492476
208998str	213689_xstr	0.587787
208998str	213620_sstr	0.788855
208998str	213603_sstr	0.676952
208998str	213541_sstr	0.475373
208998str	213518str	0.587787
208998str	213503_xstr	0.63908
208998str	213479str	0.364643
208998str	213428_sstr	0.492476
208998str	213416str	0.652071
208998str	213373_sstr	0.875469
208998str	213326str	0.587787
208998str	213293_sstr	0.538997
208998str	213214_xstr	0.492476
208998str	213168str	0.688941
208998str	213139str	0.689298
208998str	213101_sstr	0.880754
208998str	213093str	0.492476
208998str	213067str	0.325422
208998str	212918str	0.688941
208998str	212816_sstr	0.693147
208998str	212670str	0.492476
208998str	212657_sstr	0.538997
208998str	212647str	0.587787
208998str	212592str	0.538997
208998str	212588str	0.794919
208998str	212549str	0.526388
208998str	212533str	0.538997
208998str	212501str	0.538997
208998str	212486_sstr	0.738888
208998str	212464_sstr	0.492476
208998str	212361_sstr	0.693147
208998str	212334str	0.83181
208998str	212298str	0.387374
208998str	212224str	0.448025
208998str	212190str	0.693147
208998str	212187_xstr	0.475373
208998str	211981str	0.492476
208998str	211965str	0.405465
208998str	211964str	0.538997
208998str	211959str	0.492476
208998str	211833_sstr	0.693147
208998str	211817_sstr	0.750306
208998str	211800_sstr	0.700947
208998str	211784_sstr	0.538997
208998str	211734_sstr	0.492476
208998str	211725_sstr	0.766412
208998str	211676_sstr	0.73612
209101str	206991_sstr	0.364643
209101str	206978str	0.364643
209101str	206929_sstr	0.364643
209101str	206888_sstr	0.405465
209101str	206844str	0.405465
209101str	206802str	0.448025
209101str	206789_sstr	0.405465
209101str	206726str	0.325422
209101str	206715str	0.287682
209101str	206698str	0.325422
209101str	206631str	0.448025
209101str	206628str	0.287682
209101str	206598str	0.287682
209101str	206561_sstr	0.405465
209101str	206545str	0.448025
209101str	206502_sstr	0.448025
209101str	206422str	0.364643
209101str	206410str	0.405465
209101str	206404str	0.448025
209101str	206398_sstr	0.216223
209101str	206353str	0.405465
209101str	206350str	0.538997
209101str	206337str	0.448025
209101str	206335str	0.448025
209101str	206291str	0.364643
209101str	206254str	0.448025
209101str	206239_sstr	0.325422
209101str	206211str	0.405465
209101str	206207str	0.364643
208991str	201328str	0.492476
208991str	201313str	0.448025
208991str	201292str	0.405465
208991str	201279_sstr	0.587787
208991str	201266str	0.587787
208991str	201251str	0.460119
208991str	201250_sstr	0.325422
208991str	201236_sstr	0.364643
208991str	201201str	0.538997
208991str	201151_sstr	0.63908
208991str	201147_sstr	0.526388
208991str	201137_sstr	0.538997
208991str	201130_sstr	0.492476
208991str	201110_sstr	0.613837
208991str	201069str	0.492476
208991str	201061_sstr	0.750306
208991str	201044_xstr	0.538997
208991str	201037str	0.492476
208991str	201015_sstr	0.492476
208991str	201005str	0.492476
208991str	201004str	0.587787
208991str	200989str	0.613837
208991str	200970_sstr	0.587787
208991str	200918_sstr	0.405465
208991str	200904str	0.613837
208991str	200832_sstr	0.448025
208991str	200800_sstr	0.405465
208991str	200796_sstr	0.538997
208991str	200771str	0.664478
208991str	200704str	0.511899
208991str	200697str	0.538997
208991str	200650_sstr	0.405465
208991str	200635_sstr	0.364643
208991str	200634str	0.448025
208991str	200632_sstr	0.364643
208991str	200629str	0.613837
208991str	200600str	0.750306
208986str	1569225_astr	0.727139
208986str	1568574_xstr	0.587787
208986str	1565868str	0.587787
208986str	1565717_sstr	0.538997
208986str	1565703str	0.364643
208986str	1562031str	0.538997
208986str	1560698_astr	0.587787
208986str	1559776str	0.407488
208986str	1557227_sstr	0.693147
208986str	1556821_xstr	0.587787
208986str	1555960str	0.63908
208986str	1555938_xstr	0.287682
208986str	1555935_sstr	0.63908
208986str	1555745_astr	0.448025
208986str	1555434_astr	0.492476
208986str	1555236_astr	0.492476
208986str	1555229_astr	0.693147
208986str	1554600_sstr	0.587787
208986str	1554390_sstr	0.63908
208986str	1554240_astr	0.63908
208986str	1553856_sstr	0.587787
208986str	1553678_astr	0.860565
208986str	1552519str	0.492476
208986str	1552263str	0.587787
208991str	1405_istr	0.448025

From your description, column2 seems to be unnecessary. If that is the case, you
can read and store your data in a dictionary in the structure below:

connections = {'john':[0.67588, 1.00],
               'red':[0.9],
              }

If you want to print the counts for column3 >= 0.6:

for name in connections:
    print name, len([x for x in connections[name] if x >= 0.6])
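
To build that dictionary straight from the tab-delimited file and write the counts to an output file, something along these lines should work (a minimal sketch; 'interactions.txt' and 'counts_0.6.txt' are placeholder names, and it assumes a header line like the one in your attachment):

connections = {}
infile = open('interactions.txt')        # placeholder name for your attached file
infile.readline()                        # skip the "column1 column2 strength" header
for line in infile:
    col1, col2, strength = line.split()
    connections.setdefault(col1, []).append(float(strength))
infile.close()

threshold = 0.6
outfile = open('counts_0.6.txt', 'w')    # placeholder output name
for name in connections:
    count = len([x for x in connections[name] if x >= threshold])
    outfile.write('%s\t%d\n' % (name, count))
outfile.close()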

There are several different ways to do this. One is a dictionary whose value is a list holding the number of records found for the key and the strengths to test against the threshold (if I am reading the question correctly). You could also use two dictionaries, one as the counter and one to hold the strengths, if that is easier to understand (a sketch of that variant follows the example below). A SQL database would be in order if this is a large data set. You could also create a class instance for each unique name, but that is probably more trouble and more confusing than the other solutions. A simple example:

test_list = [
"john steve 0.67588",
"john matt 1.00",
"red blue 0.90",
"yellow steve 0.02" ]

test_dict = {}
for rec in test_list:
    substrs = rec.split()
    key = substrs[0]
    if key not in test_dict:
        ## add the new key, counter=1, and a list containing the first strength
        test_dict[key] = [1, [float(substrs[2])]]
    else:
        test_dict[key][0] += 1     ## add one to the counter (index zero in the list)
        test_dict[key][1].append(float(substrs[2]))   ## interaction strength

for key in test_dict:
    ## print the strengths greater than 0.5 together with the record count for the key
    values = test_dict[key][1]
    for v in values:
        if v > 0.5:
            print key, v, test_dict[key][0]
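
And, as mentioned above, the same idea with two dictionaries, reusing the test_list from the example, which also prints one count per name the way your desired output shows it (a sketch only; the 0.6 threshold is taken from your example):

counter_dict = {}
strength_dict = {}
for rec in test_list:
    key, _, strength = rec.split()
    counter_dict[key] = counter_dict.get(key, 0) + 1            ## total records per name
    strength_dict.setdefault(key, []).append(float(strength))   ## strengths per name

threshold = 0.6
for key in strength_dict:
    ## number of connections at or above the threshold for this name
    print key, len([x for x in strength_dict[key] if x >= threshold])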


This would produce your desired output:

import itertools as it

data="""john steve 0.67588
john matt 1.00
red blue 0.90
yellow steve 0.02"""

data_pairs = sorted((first, value)
                    for first,second,value in (d.split()
                                               for d in data.splitlines()))
limit = 0.6
for name, group in it.groupby(data_pairs, lambda x: x[0]):
    print name, len([ value for _,value in group if float(value) >= limit])
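
If you need the counts at several thresholds between 0 and 1 written to a file rather than printed, the same groupby idea extends like this (a sketch; 'counts.txt' and the 0.1 step are assumptions, and in practice the data would come from your attached file instead of the inline string):

import itertools as it

data = """john steve 0.67588
john matt 1.00
red blue 0.90
yellow steve 0.02"""

data_pairs = sorted((first, float(value))
                    for first, second, value in (d.split()
                                                 for d in data.splitlines()))

outfile = open('counts.txt', 'w')               # placeholder output name
for limit in [t / 10.0 for t in range(11)]:     # thresholds 0.0, 0.1, ..., 1.0
    outfile.write('for column3 >= %.1f\n' % limit)
    for name, group in it.groupby(data_pairs, lambda x: x[0]):
        count = len([value for _, value in group if value >= limit])
        outfile.write('%s %d\n' % (name, count))
    outfile.write('\n')
outfile.close()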