Hi All

Hope you all are ok.

I have a question about calculating a frequency distribution. I have seven files like the one I have attached, each containing pairs of items as in the example below. I want to know how many times each pair occurs across all of these files combined.

Example:
208015_at 207042_at
208015_at 213168_at
208015_at 204790_at
208015_at 204653_at
208015_at 205312_at
208015_at 1565703_at

The output would look like this:
208015_at 207042_at 5 times
208015_at 213168_at 3 times
208015_at 204790_at 2 times
208015_at 204653_at 5 times


Thanks

Attachments
204236_at	224833_at	0.965609
201328_at	221875_x_at	0.944502
204314_s_at	201151_s_at	0.931058
201331_s_at	201266_at	0.929209
221558_s_at	203761_at	0.926578
38707_r_at	206978_at	0.926236
201331_s_at	217728_at	0.923373
213168_at	203132_at	0.921607
201710_at	210139_s_at	0.920628
200989_at	202464_s_at	0.920448
213541_s_at	204677_at	0.91896
201328_at	221485_at	0.9116
224833_at	238520_at	0.909135
211660_at	211800_s_at	0.906171
201328_at	215716_s_at	0.905691
224833_at	227677_at	0.903975
209969_s_at	204279_at	0.898333
231768_at	211660_at	0.893304
204236_at	204249_s_at	0.893134
208991_at	202446_s_at	0.890309
206802_at	203132_at	0.887232
37152_at	232231_at	0.88
213541_s_at	208982_at	0.879998
209967_s_at	202499_s_at	0.871389
211603_s_at	200904_at	0.867723
201694_s_at	201693_s_at	0.86676
214879_x_at	204583_x_at	0.86317
212501_at	226621_at	0.861995
231768_at	217254_s_at	0.861281
222103_at	204314_s_at	0.860718
207042_at	211800_s_at	0.860522
201746_at	206353_at	0.859957
201331_s_at	207826_s_at	0.857549
213541_s_at	203865_s_at	0.845874
200989_at	203553_s_at	0.840466
203749_s_at	223062_s_at	0.838962
204947_at	203553_s_at	0.838909
204790_at	221485_at	0.82869
206036_s_at	208944_at	0.828622
206067_s_at	215446_s_at	0.813193
201783_s_at	221485_at	0.809997
206929_s_at	215716_s_at	0.801835
1565703_at	213373_s_at	0.800939
203617_x_at	239629_at	0.794239
204798_at	208383_s_at	0.79132
216928_at	217254_s_at	0.78751
213541_s_at	203868_s_at	0.780786
209189_at	201531_at	0.78074
206067_s_at	205932_s_at	0.778787
208015_at	243140_at	0.777783
205312_at	209324_s_at	0.773702
204531_s_at	207165_at	0.768513
203010_at	219076_s_at	0.754842
203752_s_at	36711_at	0.7488
223438_s_at	228469_at	0.740277
203973_s_at	208944_at	0.739469
209967_s_at	206157_at	0.728062
208510_s_at	205633_s_at	0.72747
212549_at	201147_s_at	0.718415
209239_at	226621_at	0.706846
206789_s_at	204249_s_at	0.693295
204367_at	209974_s_at	0.685251
202431_s_at	203973_s_at	0.67857
209189_at	202672_s_at	0.677848
228554_at	227209_at	0.64641
204188_s_at	208864_s_at	0.630677
212501_at	211965_at	0.614614
201328_at	1555745_a_at	0.603157
215551_at	203553_s_at	0.588931
202431_s_at	202388_at	0.587513
204755_x_at	227209_at	0.581532
211621_at	212670_at	0.550604
214732_at	204404_at	0.543566
210828_s_at	223358_s_at	0.531163
211117_x_at	208965_s_at	0.519689
205397_x_at	212657_s_at	0.515443
201464_x_at	202672_s_at	0.515212
204653_at	225575_at	0.452502
203973_s_at	224215_s_at	0.45127
213168_at	214329_x_at	0.399432
209969_s_at	201853_s_at	0.385119
213168_at	204314_s_at	0.372605
205446_s_at	204440_at	0.361425
210265_x_at	211555_s_at	0.355639
204039_at	200635_s_at	0.347495
208510_s_at	210519_s_at	0.3255
222103_at	218395_at	0.320329
208991_at	206888_s_at	0.319534
211603_s_at	211800_s_at	0.319081
208530_s_at	205157_s_at	0.318909
213168_at	203302_at	0.318019
206067_s_at	227997_at	0.310028
213541_s_at	213620_s_at	0.309536
201328_at	205051_s_at	0.309354
37152_at	210401_at	0.30898
201328_at	204845_s_at	0.30779
221558_s_at	204440_at	0.306145
203973_s_at	202388_at	0.30301
206036_s_at	203887_s_at	0.301444
224833_at	211676_s_at	0.300808
203617_x_at	209555_s_at	0.297261
222103_at	218761_at	0.294649
204947_at	200771_at	0.293211
231768_at	210381_s_at	0.29307
231768_at	214329_x_at	0.28539
204790_at	224694_at	0.283657
201464_x_at	209189_at	0.283308
212549_at	204379_s_at	0.282856
221558_s_at	202718_at	0.282318
213541_s_at	212486_s_at	0.281758
208510_s_at	205910_s_at	0.281698
202431_s_at	204622_x_at	0.280544
209189_at	218127_at	0.280222
214879_x_at	201422_at	0.27989
200989_at	208991_at	0.278644
213168_at	212592_at	0.27732
1565703_at	1555960_at	0.276906
208991_at	202948_at	0.275568
205312_at	203535_at	0.272649
38707_r_at	204070_at	0.271845
202431_s_at	201236_s_at	0.271596
223438_s_at	204875_s_at	0.270155
203010_at	204933_s_at	0.270021
201328_at	201037_at	0.269237
206067_s_at	209504_s_at	0.268398
212501_at	201920_at	0.266814
205397_x_at	204284_at	0.266281
204790_at	236313_at	0.266141
223438_s_at	205999_x_at	0.265608
202672_s_at	212501_at	0.264666
203010_at	217127_at	0.264302
201746_at	206991_s_at	0.264257
204798_at	204105_s_at	0.263781
201783_s_at	209576_at	0.262773
38707_r_at	203186_s_at	0.261139
209239_at	206036_s_at	0.261035
204755_x_at	221453_at	0.260754
216928_at	205016_at	0.259221
204790_at	1554600_s_at	0.259007
213168_at	201847_at	0.258209
206802_at	208986_at	0.257095
208510_s_at	211122_s_at	0.256465
204790_at	212657_s_at	0.255273
204314_s_at	203132_at	0.254184
203010_at	209301_at	0.253862
202672_s_at	207978_s_at	0.253459
207042_at	200704_at	0.253439
206789_s_at	204863_s_at	0.252327
209969_s_at	203881_s_at	0.251646
209239_at	202826_at	0.251491
209969_s_at	203693_s_at	0.251092
228554_at	209242_at	0.249406
212549_at	212187_x_at	0.249094
221558_s_at	202437_s_at	0.246997
206067_s_at	205572_at	0.244302
231768_at	204314_s_at	0.242952

Hi tonijv

First I tried the code below; I can do it for a single file by storing the pairs in a list/set and running the snippet on it. I also have Perl code for this, which works, but I want to implement it in Python. In Perl I first open all the files by matching their names with a regular expression, then count each pair into a hash, and at the end print the count next to each pair and how many unique pairs are shared among the files.

counts = dict((v, 0) for v in set(s))
for element in s:
counts[element] += 1
print(counts)

It looks better if you push the (CODE) button first:

counts = dict((v, 0) for v in set(s))
for element in s:
    counts[element] += 1
print(counts)
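As a side note, `collections.Counter` (available since Python 2.7) does the same counting in one call. A minimal sketch, using a made-up sample list `s`:

```python
from collections import Counter

# Made-up sample data standing in for the real list of pairs.
s = ['208015_at', '207042_at', '208015_at']

# Counter subclasses dict and performs the same counting loop internally.
counts = Counter(s)
print(counts['208015_at'])  # -> 2
```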

This looks like quite a neat piece of code; I think I have coded something similar before, maybe I could (re)post it.

You can basically wrap this code in another for loop. Also, if you check my 'generating lowercase words' code snippet, you can see a maybe cleaner way of counting the words (defaultdict(int) instead of your dict generator).
You can also do, for example:

counts[element] = 1 if element not in counts else counts[element] + 1
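A runnable comparison of the two approaches mentioned here, the conditional expression and defaultdict(int), on a made-up sample list:

```python
from collections import defaultdict

# Made-up sample data.
s = ['a', 'b', 'a', 'c', 'a', 'b']

# Conditional-expression version: works on a plain dict.
counts = {}
for element in s:
    counts[element] = 1 if element not in counts else counts[element] + 1

# defaultdict(int) version: missing keys default to 0,
# so no membership test is needed.
dcounts = defaultdict(int)
for element in s:
    dcounts[element] += 1

# Both give the same result.
assert counts == dict(dcounts)
print(counts)
```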


Hi ToniJV, yes, I got it from Google. I'm just wondering how I could scan multiple files at a time and count the number of occurrences. Do you have any idea on this?

Show effort! Here is code for iterating over the files in the current directory, written 'newbie' style:

import os

myext = '.py'
count = 0
for filename in os.listdir(os.curdir):
    basename, ext = os.path.splitext(filename)
    if ext == myext:
        count += 1
        print(count, ':', filename)
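Putting the two pieces together: iterate over the matching files and feed every line into one shared Counter, so the counts are combined across all files. A self-contained sketch; the .txt extension, the sample file names, and the tab-separated two-column layout are assumptions about your data (the sketch creates its own throwaway files so it can run anywhere):

```python
import os
import tempfile
from collections import Counter

# Set up a throwaway directory with two small sample files,
# standing in for the real seven result files.
tmpdir = tempfile.mkdtemp()
samples = {
    'run1.txt': "208015_at\t207042_at\t0.86\n208015_at\t204653_at\t0.61\n",
    'run2.txt': "208015_at\t207042_at\t0.79\n",
}
for name, text in samples.items():
    with open(os.path.join(tmpdir, name), 'w') as f:
        f.write(text)

# One Counter shared by every file, so occurrences add up across files.
pair_counts = Counter()
for fname in os.listdir(tmpdir):
    if os.path.splitext(fname)[1] == '.txt':
        with open(os.path.join(tmpdir, fname)) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 2:
                    # Count the (item1, item2) pair; the third column
                    # (the correlation value) is ignored for counting.
                    pair_counts[(fields[0], fields[1])] += 1

for (a, b), n in sorted(pair_counts.items()):
    print(a, b, n, 'times')
```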