Problem with duplicates

bgk111 0 Newbie Poster

15 Years Ago

Hello all,

This will hopefully be a quick and easy answer. I am having some difficulties with multiple lists that have certain characteristics, so here is what I have:

4 lists, all the same length, that signify: [row] [column] [VAR1] [VAR2]

So basically I have 2 variables at each location (row, col) within a larger dataset. What I need to do is find everywhere that row and column are the same, and where they are the same, add up all of VAR1 and VAR2. [row and column are actually lat/lon locations, and where I have multiple data points at the same location, I want to sum all of VAR1 and all of VAR2]

I am not posting any code because I am hoping that there is a simple Python function that I simply don't know about; besides, the code is hundreds of lines long and would be difficult to condense into something easily digestible. I have written some basic attempts, but because the lists are so long, they slow to a snail's crawl and are basically useless. I have tried various loops and permutations, but they are all too slow. Any ideas? Thanks!

-bk

dataset python

woooee 814 Nearly a Posting Maven

15 Years Ago

The easiest approach to understand, IMHO, is to create a dictionary with the row & col values as a tuple for the key, pointing to a list of the element number(s). Then, if the list has a length greater than one, you have duplicates and can add the corresponding Var1 or Var2 elements, or whatever. For the example given, the dictionary would look like
{ (52, 513) : [0, 3],
  (34, 22) : [1],
  (423, 421) : [2],
  etc. }
So a test for length would say that the (52, 513) pair has more than one value and you would add Var1[0] + Var1[3]. Then, it is safest to create a new list with the new, combined values or whatever you would like the output to be. And if I understand this correctly, if the dictionary had an entry like
(52, 513) : [0, 3, 6]
and you wanted to add [0]+[3], and then [3]+[6], you would have to keep track of pairs using two for() loops or some other method.
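
A minimal sketch of this index-list idea, assuming Python 3 and the sample lists bgk111 posted in this thread. Using `collections.defaultdict` is my own choice here, not something the post specifies, and the names `positions`, `duplicates` and `var1_sums` are made up for illustration.

```python
from collections import defaultdict

# Sample data from the thread: col/row give the location, Var1 holds the values.
col = [52, 34, 423, 52, 235, 34, 235]
row = [513, 22, 421, 513, 432, 21, 432]
Var1 = [24, 13, 534, 12, 532, 12, 322]

# Map each (col, row) location to the list of positions where it occurs.
positions = defaultdict(list)
for i, key in enumerate(zip(col, row)):
    positions[key].append(i)

# A location with more than one position is a duplicate.
duplicates = {key: idx for key, idx in positions.items() if len(idx) > 1}

# Sum Var1 over all positions that share a location.
var1_sums = {key: sum(Var1[i] for i in idx) for key, idx in positions.items()}
```

Because the lists are walked only once, the work grows linearly with the list length rather than with the number of pairs of elements.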

zachabesh 5 Junior Poster

15 Years Ago

I would suggest that instead of storing the data in separate lists, store all the data in a class.

example:

class Data:
   def __init__(self,row,col,var1,var2):
      self.row = row
      self.col = col
      self.var1 = var1
      self.var2 = var2

Call this for each row of data, put all of the instances in a list, and then you can iterate over them and do whatever you want.
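
As a rough sketch of how that could look in Python 3: the `Record` dataclass and the `make_records` helper below are hypothetical names of my own, and the three data points are a shortened version of the sample data from this thread.

```python
from dataclasses import dataclass

@dataclass
class Record:
    """Hypothetical record type: one data point at a (row, col) location."""
    row: int
    col: int
    var1: float
    var2: float

def make_records(row, col, var1, var2):
    """Bundle four parallel lists into a single list of Record objects."""
    return [Record(r, c, a, b) for r, c, a, b in zip(row, col, var1, var2)]

records = make_records([513, 22, 513], [52, 34, 52], [24, 13, 12], [10, 12, 10])

# Iterate over the records and total var1 for one particular location.
same_spot = [rec for rec in records if (rec.row, rec.col) == (513, 52)]
total_var1 = sum(rec.var1 for rec in same_spot)
```

Scanning the whole list once per location like this would be slow on long lists, so for the actual grouping a dictionary keyed on (row, col) is still likely to be needed.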

djidjadji 28 Light Poster · Answer 1 · 2009-08-28T05:19:14+00:00

It would have been better if you gave an example of your input and required output. The summing of VAR1 and VAR2 is not clear.

from itertools import izip
#row=[]
#col=[]
#var1=[]
#var2=[]
def getSum(data):
    if not isinstance(data,list):
        data=[data]
    return sum(data,0)

result = [(r,c,getSum(v1),getSum(v2)) for r,c,v1,v2 in izip(row,col,var1,var2) if r==c]

bgk111 0 Newbie Poster · Answer 2 · 2009-08-28T07:42:45+00:00

I'm at home now and don't have access to the data, but here is an example of the data

col = [52 , 34 ,423, 52, 235, 34, 235]
row = [513, 22, 421, 513, 432, 21, 432]
Var1 = [24, 13, 534, 12, 532, 12, 322]
Var2 = [10, 12, 163, 10, 53, 10, 343]

So then the output would then be

col = [52, 34, 423, 235, 34, 235]
row = [513, 22, 421, 432, 21, 432]
Var1 = [ 24+12, 13, 534, 532, 12, 322]
Var2 = [10 ,12, 163, 53, 10, 343]

The only col/row combo that had a duplicate was 52,513, and in all cases Var2 will be the same if the col and row are the same (but there are duplicates within Var2, as in the above). I hope this helps to answer the question...
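
For reference, here is a sketch of one way to produce that kind of output, assuming Python 3.7+ (where dicts keep insertion order). The function name `merge_duplicates` is made up for illustration. Note that in the sample as posted the (235, 432) pair also occurs twice, so this sketch merges it as well and keeps the first Var2 seen for it.

```python
col = [52, 34, 423, 52, 235, 34, 235]
row = [513, 22, 421, 513, 432, 21, 432]
Var1 = [24, 13, 534, 12, 532, 12, 322]
Var2 = [10, 12, 163, 10, 53, 10, 343]

def merge_duplicates(col, row, var1, var2):
    """Sum var1 over repeated (col, row) pairs; keep the first var2 seen."""
    merged = {}  # (col, row) -> [running var1 total, first var2]
    for c, r, v1, v2 in zip(col, row, var1, var2):
        if (c, r) in merged:
            merged[(c, r)][0] += v1
        else:
            merged[(c, r)] = [v1, v2]
    # Unpack the dict back into four parallel lists, in first-seen order.
    out_col = [c for c, _ in merged]
    out_row = [r for _, r in merged]
    out_var1 = [v[0] for v in merged.values()]
    out_var2 = [v[1] for v in merged.values()]
    return out_col, out_row, out_var1, out_var2

new_col, new_row, new_var1, new_var2 = merge_duplicates(col, row, Var1, Var2)
# new_col  -> [52, 34, 423, 235, 34]
# new_row  -> [513, 22, 421, 432, 21]
# new_var1 -> [36, 13, 534, 854, 12]
# new_var2 -> [10, 12, 163, 53, 10]
```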

djidjadji 28 Light Poster · Answer 3 · 2009-08-28T22:40:35+00:00

As woooee pointed out, the easiest thing is to use a dictionary with a tuple as the key. But instead of storing the indices, why not make a list of the VAR1 and VAR2 values that are found at each row-col position? After this search, process the lists of VAR1 and VAR2 values accordingly. Use exceptions to find out whether the dict has the key; it should be faster than using dict.get(key, default) or dict.setdefault().

mergeDict = {}
from itertools import izip
for r,c,v1,v2 in izip(row,col,Var1,Var2):
    key = (r,c)
    try:
        v1v2list = mergeDict[key]
    except KeyError:
        v1v2list = [[],[]]
        mergeDict[key] = v1v2list
    v1v2list[0].append(v1)
    v1v2list[1].append(v2)
# now you can loop over the dict and process the result lists
for k,v in mergeDict.iteritems():
    r,c = k
    v1List = v[0]
    v2List = v[1]
    # process and output results
    #v1Sum = sum(v1List,0)
    #.......
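
Whether try/except really beats `dict.setdefault()` depends on the Python version and on how often keys repeat, so it is worth measuring. Below is a sketch of such a comparison, assuming Python 3 (so `zip` stands in for `izip`); the function names and the generated test data are made up, and `collections.defaultdict` is added as a third option that the post does not mention.

```python
import timeit
from collections import defaultdict

def group_try(keys, values):
    """Group values by key, using try/except for the missing-key case."""
    out = {}
    for k, v in zip(keys, values):
        try:
            out[k].append(v)
        except KeyError:
            out[k] = [v]
    return out

def group_setdefault(keys, values):
    """Same grouping with dict.setdefault."""
    out = {}
    for k, v in zip(keys, values):
        out.setdefault(k, []).append(v)
    return out

def group_defaultdict(keys, values):
    """Same grouping with collections.defaultdict."""
    out = defaultdict(list)
    for k, v in zip(keys, values):
        out[k].append(v)
    return dict(out)

# Made-up test data: 100,000 points spread over 5,000 distinct locations.
keys = [(i % 100, (i // 100) % 50) for i in range(100_000)]
values = list(range(100_000))

for fn in (group_try, group_setdefault, group_defaultdict):
    seconds = timeit.timeit(lambda: fn(keys, values), number=5)
    print(fn.__name__, round(seconds, 3))
```

All three functions build the same dictionary, so the choice only affects speed and readability.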

Jackson William 0 Newbie Poster · Answer 4 · 2009-09-03T12:05:59+00:00

djidjadji's solution is perfect, but you might as well merge Var1s at the same time:

mergeDict = {}
from itertools import izip
for r,c,v1,v2 in izip(row, col, Var1, Var2):
    key = (r,c)
    try:
        # if this (row, col) location already has data, merge var1's
        mergeDict[key][0] += v1
    except KeyError:
        mergeDict[key] = [v1, v2]

bgk111 0 Newbie Poster · Answer 5 · 2009-09-03T21:51:59+00:00

I ended up using dj's solution, and will try to add in your fix in the next few days. I'm still unsure whether I want to just port it to FORTRAN. I wrote a very inelegant way to do it, and it is already comparable in speed. The main problem I am finding with Python is that it doesn't deal well with large data vectors (in the 50,000 - 150,000 range). But thanks all for the tips and tricks, I will be able to use the above in plenty of programs to come!
