Find duplicates (at least 3 instances) in a dataset

Question

giancan 0 Light Poster

11 Years Ago

Dear friends,
I have a set of values as follows (the filename is a string, the values are floats, and notimportant is self-explanatory):

filenameA, value1_1, value2_1, value3, value4, notimportant
filenameA, value1_2, value2_2, value3, value4, notimportant
filenameA, value1_3, value2_3, value3, value4, notimportant
filenameA, value1_5, value2_5, value3, value4, notimportant
filenameA, value1_7, value2_7, value3, value4, notimportant
...
filenameB, value1_1, value2_1, value3, value4, notimportant
filenameB, value1_5, value2_5, value3, value4, notimportant
filenameB, value1_7, value2_7, value3, value4, notimportant
...
filenameC, value1_1, value2_1, value3, value4, notimportant
filenameC, value1_7, value2_7, value3, value4, notimportant
filenameC, value1_9, value2_9, value3, value4, notimportant

From this huge list (I would also appreciate a suggestion on how to store this information temporarily), I need to find every "value1, value2" pair that occurs at least 3 times in the list, and then get the full rows for it.
So, in the example above, my ideal output would be:

filenameA, value1_1, value2_1, value3, value4, notimportant
filenameB, value1_1, value2_1, value3, value4, notimportant
filenameC, value1_1, value2_1, value3, value4, notimportant
filenameA, value1_7, value2_7, value3, value4, notimportant
filenameB, value1_7, value2_7, value3, value4, notimportant
filenameC, value1_7, value2_7, value3, value4, notimportant

(not
filenameA, value1_5, value2_5, value3, value4, notimportant
and
filenameB, value1_5, value2_5, value3, value4, notimportant

because "value1_5, value2_5" is not repeated AT LEAST 3 times.)

How would you suggest that I proceed?

Thanks a lot,
Gianluca

python

Gribouillis 1,391 Programming Explorer

11 Years Ago

I suggest that you sort the list on the pair of values and group the records. For example, assuming that the rows are tuples like

("filenameA", 3.14, 5.3, -2.0, 4.0, "notimportant")

You can work with

import itergadgets

@itergadgets.sorter_grouper
def by2values(item):
    return (item[1][1], item[1][2])

def extract_rows(sequence):
    """extract the desired rows from the
    initial sequence of rows"""
    L = (list(group) for group in by2values(enumerate(sequence)))
    L = (g for g in L if len(g) >= 3)
    L = (x for g in L for x in sorted(g))
    return [x[1] for x in L]

The module itergadgets is here. Of course, this method means that the whole list is held in memory at the same time. The sorted call on line 12 returns the rows of each selected group in their initial order.
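
If itergadgets is not at hand, roughly the same sort-and-group idea can be sketched with the standard library only. This assumes that sorter_grouper behaves like sorted followed by itertools.groupby on the same key, which I have not checked against the module; the names pair_key, extract_rows_stdlib and min_count are made up for this illustration.

```python
from itertools import groupby


def pair_key(row):
    """Grouping key: the (value1, value2) pair of a row."""
    return (row[1], row[2])


def extract_rows_stdlib(sequence, min_count=3):
    """Return the rows whose (value1, value2) pair occurs at least
    min_count times (hypothetical stand-in for the itergadgets version)."""
    # sorted() is stable, so rows sharing a key keep their initial relative order.
    ordered = sorted(sequence, key=pair_key)
    result = []
    for _, group in groupby(ordered, key=pair_key):
        members = list(group)
        if len(members) >= min_count:
            result.extend(members)
    return result


rows = [
    ("filenameA", 3.14, 5.3, -2.0, 4.0, "notimportant"),
    ("filenameB", 3.14, 5.3, -2.0, 4.0, "notimportant"),
    ("filenameC", 3.14, 5.3, -2.0, 4.0, "notimportant"),
    ("filenameD", 3.24, 5.3, -2.0, 4.0, "notimportant"),
]
selected = extract_rows_stdlib(rows)
print(selected)
```

The function only returns the list, so the result has to be printed or stored explicitly. Here the filenameD row is left out because its pair (3.24, 5.3) occurs only once.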

Gribouillis 1,391 Programming Explorer

11 Years Ago

Here is a complete working example

#!/usr/bin/env python
#-*-coding: utf8-*-
from __future__ import unicode_literals, print_function, division

import itergadgets
import pprint

@itergadgets.sorter_grouper
def by2values(item):
    return (item[1][1], item[1][2])

def extract_rows(sequence, preserve_order = False):
    """extract the desired rows from the
    initial sequence of rows"""
    L = (list(group) for group in by2values(enumerate(sequence)))
    L = (g for g in L if len(g) >= 3)
    L = (x for g in L for x in sorted(g))
    if preserve_order:
        L = sorted(L)
    return [x[1] for x in L]

if __name__ == "__main__":
    mylist = [
    ("filenameA", "value1_1", "value2_1", "value3", "value4", "noti"),
    ("filenameA", "value1_2", "value2_2", "value3", "value4", "noti"),
    ("filenameA", "value1_3", "value2_3", "value3", "value4", "noti"),
    ("filenameA", "value1_5", "value2_5", "value3", "value4", "noti"),
    ("filenameA", "value1_7", "value2_7", "value3", "value4", "noti"),
    ("filenameB", "value1_1", "value2_1", "value3", "value4", "noti"),
    ("filenameB", "value1_5", "value2_5", "value3", "value4", "noti"),
    ("filenameB", "value1_7", "value2_7", "value3", "value4", "noti"),
    ("filenameC", "value1_1", "value2_1", "value3", "value4", "noti"),
    ("filenameC", "value1_7", "value2_7", "value3", "value4", "noti"),
    ("filenameC", "value1_9", "value2_9", "value3", "value4", "noti"),
    ]
    res = extract_rows(mylist)
    pprint.pprint(res)


""" my output -->
[(u'filenameA', u'value1_1', u'value2_1', u'value3', u'value4', u'noti'),
 (u'filenameB', u'value1_1', u'value2_1', u'value3', u'value4', u'noti'),
 (u'filenameC', u'value1_1', u'value2_1', u'value3', u'value4', u'noti'),
 (u'filenameA', u'value1_7', u'value2_7', u'value3', u'value4', u'noti'),
 (u'filenameB', u'value1_7', u'value2_7', u'value3', u'value4', u'noti'),
 (u'filenameC', u'value1_7', u'value2_7', u'value3', u'value4', u'noti')]
"""
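
As for storing the data temporarily, one possibility is to read the file into a list of tuples and count the pairs in a single pass, which avoids sorting altogether. The following is only a sketch for Python 3: it assumes the rows are comma-separated text with six fields, as in the question, and load_rows, rows_with_repeated_pair and min_count are names invented here. The sample text stands in for the real file.

```python
import csv
import io
from collections import Counter


def load_rows(lines):
    """Parse comma-separated lines into tuples
    (filename, value1, value2, value3, value4, note)."""
    rows = []
    for rec in csv.reader(lines, skipinitialspace=True):
        if not rec:
            continue  # skip blank lines
        name, v1, v2, v3, v4, note = rec
        rows.append((name, float(v1), float(v2), float(v3), float(v4), note))
    return rows


def rows_with_repeated_pair(rows, min_count=3):
    """Keep the rows whose (value1, value2) pair occurs at least min_count times."""
    counts = Counter((r[1], r[2]) for r in rows)
    # The list comprehension walks the rows in their initial order.
    return [r for r in rows if counts[(r[1], r[2])] >= min_count]


# Sample data standing in for the real file (an assumption for this sketch).
sample = io.StringIO(
    "filenameA, 3.14, 5.3, -2.0, 4.0, notimportant\n"
    "filenameB, 3.14, 5.3, -2.0, 4.0, notimportant\n"
    "filenameC, 3.14, 5.3, -2.0, 4.0, notimportant\n"
    "filenameD, 3.24, 5.3, -2.0, 4.0, notimportant\n"
)
kept = rows_with_repeated_pair(load_rows(sample))
for row in kept:
    print(row)
```

With a real file, an open file object can be passed to load_rows in place of the StringIO. Note that the pairs are compared with exact float equality, so values that differ only by rounding noise would land in different groups.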

giancan 0 Light Poster · Answer 1 · 2013-09-01T13:12:22+00:00

Dear Gribouillis,
your code and module look nice, but I am not sure I can get the desired output with my current knowledge.
May I ask for some more help?
I create my list of tuples as follows:

mylist=[("filenameA", 3.14, 5.3, -2.0, 4.0, "notimportant"),("filenameB", 3.14, 5.3, -2.0, 4.0, "notimportant"),("filenameC", 3.14, 5.3, -2.0, 4.0, "notimportant"),("filenameD", 3.24, 5.3, -2.0, 4.0, "notimportant")]

right?
And how do I combine this with your code? If I run extract_rows(mylist), I don't get any output.
Thanks a lot,
Gianluca

giancan 0 Light Poster · Answer 2 · 2013-09-01T14:07:45+00:00

Perfect! Thanks a lot.
I will check it step by step and let you know if I need any more explanations, but everything looks crystal clear now.
Thanks again,
Gianluca

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 3 · 2013-09-01T14:20:40+00:00

Thanks!