Dear friends,
I have a set of values as follows (filename is a string; the values are floats; notimportant... is self-explanatory):

filenameA, value1_1, value2_1, value3, value4, notimportant
filenameA, value1_2, value2_2, value3, value4, notimportant
filenameA, value1_3, value2_3, value3, value4, notimportant
filenameA, value1_5, value2_5, value3, value4, notimportant
filenameA, value1_7, value2_7, value3, value4, notimportant
...
filenameB, value1_1, value2_1, value3, value4, notimportant
filenameB, value1_5, value2_5, value3, value4, notimportant
filenameB, value1_7, value2_7, value3, value4, notimportant
...
filenameC, value1_1, value2_1, value3, value4, notimportant
filenameC, value1_7, value2_7, value3, value4, notimportant
filenameC, value1_9, value2_9, value3, value4, notimportant

From this huge list (I would also appreciate a suggestion on how to store this information temporarily) I need to find every "value1, value2" pair that is repeated at least 3 times in the list, and then get the full rows.
So, in the example above, my ideal output would be:

filenameA, value1_1, value2_1, value3, value4, notimportant
filenameB, value1_1, value2_1, value3, value4, notimportant
filenameC, value1_1, value2_1, value3, value4, notimportant
filenameA, value1_7, value2_7, value3, value4, notimportant
filenameB, value1_7, value2_7, value3, value4, notimportant
filenameC, value1_7, value2_7, value3, value4, notimportant

(not
filenameA, value1_5, value2_5, value3, value4, notimportant
and
filenameB, value1_5, value2_5, value3, value4, notimportant

because "value1_5, value2_5" is not repeated AT LEAST 3 times.)

How would you suggest that I proceed?

Thanks a lot,
Gianluca

All 5 Replies

I suggest that you sort the list on the pair of values and group the records. For example, assuming that the rows are tuples like

("filenameA", 3.14, 5.3, -2.0, 4.0, "notimportant")

You can work with

import itergadgets

@itergadgets.sorter_grouper
def by2values(item):
    return (item[1][1], item[1][2])

def extract_rows(sequence):
    """extract the desired rows from the
    initial sequence of rows"""
    L = (list(group) for group in by2values(enumerate(sequence)))
    L = (g for g in L if len(g) >= 3)
    L = (x for g in L for x in sorted(g))
    return [x[1] for x in L]

The module itergadgets is here. Of course, this method means that the whole list is held in memory at once. The sorted call in the last generator expression of extract_rows returns the rows of each group in their initial order.
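If itergadgets is not at hand, the same sort-then-group idea can be sketched with the standard library alone. The version below is only a sketch, assuming that sorter_grouper sorts the items on the key and then groups equal keys together; the name extract_rows_stdlib and the min_count parameter are made up for this illustration and are not part of itergadgets.

```python
from itertools import groupby
from operator import itemgetter


def extract_rows_stdlib(rows, min_count=3):
    """Return the rows whose (value1, value2) pair occurs at least min_count times.

    Hypothetical stand-in for the itergadgets-based extract_rows.
    """
    # Columns 1 and 2 of each row tuple hold value1 and value2.
    pair = itemgetter(1, 2)
    # groupby only merges adjacent items, so sort on the same key first.
    ordered = sorted(rows, key=pair)
    result = []
    for _, group in groupby(ordered, key=pair):
        group = list(group)
        if len(group) >= min_count:
            result.extend(group)
    return result


mylist = [
    ("filenameA", 3.14, 5.3, -2.0, 4.0, "notimportant"),
    ("filenameB", 3.14, 5.3, -2.0, 4.0, "notimportant"),
    ("filenameC", 3.14, 5.3, -2.0, 4.0, "notimportant"),
    ("filenameD", 3.24, 5.3, -2.0, 4.0, "notimportant"),
]
selected = extract_rows_stdlib(mylist)
print(selected)
```

Because sorted is stable, the rows of each group stay in their initial relative order. Note that the float pairs are compared for exact equality, so values that differ only by rounding end up in different groups.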

Dear Gribouillis,
your code and module look nice, but I am not sure I can get the desired output with my current knowledge.
May I ask you for some more help?
So, I create my list of tuples as follows:

mylist=[("filenameA", 3.14, 5.3, -2.0, 4.0, "notimportant"),("filenameB", 3.14, 5.3, -2.0, 4.0, "notimportant"),("filenameC", 3.14, 5.3, -2.0, 4.0, "notimportant"),("filenameD", 3.24, 5.3, -2.0, 4.0, "notimportant")]

right?
And how do I join this with your code? If I run extract_rows(mylist) I don't get any output.
Thanks a lot,
Gianluca

Here is a complete working example

#!/usr/bin/env python
#-*-coding: utf8-*-
from __future__ import unicode_literals, print_function, division

import itergadgets
import pprint

@itergadgets.sorter_grouper
def by2values(item):
    return (item[1][1], item[1][2])

def extract_rows(sequence, preserve_order = False):
    """extract the desired rows from the
    initial sequence of rows"""
    L = (list(group) for group in by2values(enumerate(sequence)))
    L = (g for g in L if len(g) >= 3)
    L = (x for g in L for x in sorted(g))
    if preserve_order:
        L = sorted(L)
    return [x[1] for x in L]

if __name__ == "__main__":
    mylist = [
    ("filenameA", "value1_1", "value2_1", "value3", "value4", "noti"),
    ("filenameA", "value1_2", "value2_2", "value3", "value4", "noti"),
    ("filenameA", "value1_3", "value2_3", "value3", "value4", "noti"),
    ("filenameA", "value1_5", "value2_5", "value3", "value4", "noti"),
    ("filenameA", "value1_7", "value2_7", "value3", "value4", "noti"),
    ("filenameB", "value1_1", "value2_1", "value3", "value4", "noti"),
    ("filenameB", "value1_5", "value2_5", "value3", "value4", "noti"),
    ("filenameB", "value1_7", "value2_7", "value3", "value4", "noti"),
    ("filenameC", "value1_1", "value2_1", "value3", "value4", "noti"),
    ("filenameC", "value1_7", "value2_7", "value3", "value4", "noti"),
    ("filenameC", "value1_9", "value2_9", "value3", "value4", "noti"),
    ]
    res = extract_rows(mylist)
    pprint.pprint(res)


""" my output -->
[(u'filenameA', u'value1_1', u'value2_1', u'value3', u'value4', u'noti'),
 (u'filenameB', u'value1_1', u'value2_1', u'value3', u'value4', u'noti'),
 (u'filenameC', u'value1_1', u'value2_1', u'value3', u'value4', u'noti'),
 (u'filenameA', u'value1_7', u'value2_7', u'value3', u'value4', u'noti'),
 (u'filenameB', u'value1_7', u'value2_7', u'value3', u'value4', u'noti'),
 (u'filenameC', u'value1_7', u'value2_7', u'value3', u'value4', u'noti')]
"""
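On the question of how to store the data temporarily: one possibility is a plain list of tuples built from the text lines, as in the sketch below. It assumes that each record is one comma-separated line with exactly the six fields shown in the question; parse_line and rows_with_repeated_pair are hypothetical helper names, and the sample values are invented. Counting the pairs with collections.Counter avoids sorting and keeps the rows in their initial order, which is a different technique from the sort-and-group approach above.

```python
from collections import Counter


def parse_line(line):
    """Turn 'name, v1, v2, v3, v4, rest' into a tuple with four floats."""
    # At most 5 splits, so any commas in the last field are left alone.
    fields = [field.strip() for field in line.split(",", 5)]
    name, v1, v2, v3, v4, rest = fields
    return (name, float(v1), float(v2), float(v3), float(v4), rest)


def rows_with_repeated_pair(rows, min_count=3):
    """Keep the rows whose (value1, value2) pair occurs at least min_count times."""
    # rows must be a list (not a one-shot iterator): it is traversed twice.
    counts = Counter((row[1], row[2]) for row in rows)
    return [row for row in rows if counts[(row[1], row[2])] >= min_count]


# Invented sample data in the assumed comma-separated layout.
lines = [
    "filenameA, 1.0, 2.0, 0.5, 0.25, notimportant",
    "filenameA, 5.0, 6.0, 0.5, 0.25, notimportant",
    "filenameB, 1.0, 2.0, 0.5, 0.25, notimportant",
    "filenameB, 5.0, 6.0, 0.5, 0.25, notimportant",
    "filenameC, 1.0, 2.0, 0.5, 0.25, notimportant",
]
rows = [parse_line(line) for line in lines]
selected = rows_with_repeated_pair(rows)
for row in selected:
    print(row)
```

Like the sort-and-group version, this keeps every row in memory; only the counting step differs.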

Perfect! Thanks a lot.
I will check it step by step and let you know if I need some more explanations... but everything looks crystal clear now.
Thanks again,
Gianluca

Thanks!
