I am a new Python user and I am trying to code an implementation that calculates an empirical CDF. So far, I have some code (attached below) that returns a list of tuples [(datapoint, P(X>=x)), ...]. The problem I am trying to resolve is how to handle repeated data, e.g. [1,1,4,6,7..]. My implementation can't handle repeated numbers. Any ideas to improve my implementation would be welcome, thanks.

```python
class EmpiricalCDF:

    def __init__(self, datalist):
        '''
        class that holds a list of data and returns cdf

        defined as p(X>=x)
        '''
        self.datalist = datalist
        self.n = len(datalist)

    def cdf_data(self):
        data = self.datalist
        plotdata = []

        for i in range(len(data)):
            n = float(self.n)
            length = len(data)
            plotdata.append((data, length/n))
            data.pop(0)

        return plotdata
```

I am not sure I understand correctly.

The empirical distribution function is a function of two arguments: the dataset and a real number.

What is your cdf_data meant to return?
BTW, calling cdf_data loses all your data: data is just another name for self.datalist, so every data.pop(0) removes an element from the original list.
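
The data loss happens because assignment binds a second name to the same list instead of copying it. A minimal sketch of the effect, using a made-up sample list (the variable names here are only for illustration):

```python
# Hypothetical sample data; any list behaves the same way.
original = [1, 1, 4, 6, 7]

alias = original              # same list object, not a copy
alias.pop(0)                  # this also shortens `original`
print(len(original))          # 4

independent = list(original)  # shallow copy: a separate list object
independent.pop(0)            # `original` is left alone this time
print(len(original))          # still 4
```

Working on a copy (or not popping at all) would keep self.datalist intact between calls.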

My implementation would be:

```python
def cdf_data(self, t):
    # Fraction of data points strictly below t; use d <= t for the usual ECDF convention.
    return sum(1 for d in self.datalist if d < t) / float(self.n)
```
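
Evaluating the function at a single point like this copes with repeated values automatically, since each copy is simply counted. As a rough sketch of the same idea as a standalone function (the name `ecdf` and the sample list are assumptions, and it uses the common "less than or equal" convention rather than the strict inequality above), one could sort the data and use the standard `bisect` module:

```python
from bisect import bisect_right

def ecdf(data, t):
    """Fraction of points in `data` that are <= t (hypothetical helper).

    Assumes `data` is non-empty; an empty list would divide by zero.
    """
    ordered = sorted(data)   # sorted() returns a new list, `data` is untouched
    # bisect_right gives the number of elements <= t, duplicates included
    return bisect_right(ordered, t) / float(len(ordered))

sample = [1, 1, 4, 6, 7]     # made-up data with a repeated value
print(ecdf(sample, 1))       # 0.4
print(ecdf(sample, 5))       # 0.6
```

Sorting on every call is wasteful for many evaluations; sorting once and reusing the sorted list would be the obvious refinement.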

I was just taking a look at compacting your code slightly, and I noticed something.

self.n = len(datalist)
data = datalist
length = len(datalist)
...
n = float(self.n)

Therefore

n = length

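And then: ...(length/n)
This would be 1 on the first pass through the loop (afterwards length shrinks with each data.pop(0) while n stays the same).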

You can also change the function 'cdf_data' (including the above thing for the moment), to:

```python
def cdf_data(self):
    data = self.datalist
    plotdata = []

    for i in range(len(data)):
        # got rid of: "n = float(self.n)"
        length = len(data)
        plotdata.append((data, length/float(self.n)))  # this line
        data.pop(0)

    return plotdata
```
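
For the original question about repeated values, one possible approach (a sketch only, not the poster's method) is to count each distinct value once and remove all of its copies together. The function name `survival_points` and the sample list are assumptions; the output is the list of (datapoint, P(X>=x)) tuples described in the question, and the input list is not modified:

```python
from collections import Counter

def survival_points(datalist):
    """Return [(x, P(X >= x)), ...] with one entry per distinct value."""
    n = float(len(datalist))
    counts = Counter(datalist)   # value -> number of occurrences
    remaining = n                # how many points are >= the current value
    points = []
    for x in sorted(counts):
        points.append((x, remaining / n))
        remaining -= counts[x]   # drop every copy of x at once
    return points

sample = [1, 1, 4, 6, 7]         # made-up data with a repeated value
print(survival_points(sample))
# [(1, 1.0), (4, 0.6), (6, 0.4), (7, 0.2)]
```

The repeated 1 produces a single tuple, and the probability drops by 2/5 at the next value because both copies are removed together.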