Hi all!

I've been working on a school/hobby project for some time now. It's a simple statistical tool for analysing data from psychological experiments. Not being a professional programmer, I've run into a couple of problems with data storage. Let me explain how the data is structured now and why I think it's wrong ;-)

- I have my main data organized in dictionaries (OrderedDict), where I store the data for each experiment part (with keys like "Exp1", "Exp2"). Each dictionary entry holds a list (one item per subject) of lists (that subject's results). It's called main_data.

- I also have extra data for each subject (like sex, age band, type of experimental treatment and many more). This data is stored in another OrderedDict, with each key being a variable name (like "sex") and each value being a list with the data for all subjects. It's called extra_data.

All data is sorted by subject id number. That way, when I check, for example, the fifth entry of the "Exp1" list, I can also check the fifth entry of the "sex" list in the extra data dictionary and know whether my fifth subject is a man or a woman.

One of the most important features I need in my program is the ability to perform calculations for only a subset of subjects (e.g. only men). Now I do it like this: the filtering function takes a dictionary like {"sex": [1,2], "age-band":[1,2,3,4]} and so on for all variables. All keys in this input dictionary are the same as the keys in extra_data (let's call this dict input_dict). When filtering, I iterate over the input_dict keys and values. For each subject I check whether the value for this key in extra_data is among the values in input_dict. If it is (e.g. I have 1 in "sex" in input_dict and 1 in "sex" in extra_data for a particular subject), I copy the main_data for each experiment part (Exp1, Exp2 and so on) for this subject to a new dictionary.
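
To make the description concrete, here is a minimal sketch of the layout and filter described above. The toy values and the function name filter_subjects are hypothetical, chosen only for illustration; the real program may differ in its details.

```python
from collections import OrderedDict

# Toy data (assumed values): one inner list of results per subject,
# with all lists in the same subject-id order.
main_data = OrderedDict([
    ("Exp1", [[1, 2], [3, 4], [5, 6]]),
    ("Exp2", [[7, 8], [9, 10], [11, 12]]),
])
extra_data = OrderedDict([
    ("sex", [1, 2, 1]),
    ("age-band", [1, 5, 3]),
])

def filter_subjects(main_data, extra_data, input_dict):
    """Collect the results of every subject that matches all criteria."""
    n_subjects = len(next(iter(extra_data.values())))
    # positions of the subjects whose extra data pass every filter
    keep = [
        i for i in range(n_subjects)
        if all(extra_data[key][i] in allowed
               for key, allowed in input_dict.items())
    ]
    # build a new dictionary holding only the matching subjects
    return OrderedDict(
        (exp, [rows[i] for i in keep]) for exp, rows in main_data.items()
    )

filtered = filter_subjects(main_data, extra_data,
                           {"sex": [1], "age-band": [1, 2, 3, 4]})
print(filtered["Exp1"])
```

Note that this variant works out the matching positions once and reuses them for every experiment part, rather than re-checking the criteria per experiment.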

The problem is that when my data sets get quite big (about 20 experiment parts, about 20 extra_data variables and about 300 subjects), this approach is very slow, because it involves a lot of copying of data.

So my question is: how do you think I should organize and filter the data to make it work faster? I'd be grateful for any ideas.
Sorry for such a long post.

Best regards
Yemu

I think you should work with a Subject class, each Subject instance containing the results to all the experiments and the extra data, like this

from random import randint

class Subject(object):

    def __init__(self, sid):
        self.sid = sid
        self.results = dict()
        self.sex = 1
        self.age_band = 1

class Experiment(object):
    def __init__(self, eid):
        self.eid = eid

def main():
    # create a list of 20 experiments
    experiments = [Experiment(i) for i in range(20)]
    # create a list of 300 subjects
    subjects = [Subject(i) for i in range(300)]

    # create a random result list for each experiment and each subject
    for exp in experiments:
        for sub in subjects:
            sub.results[exp] = [randint(0,20) for i in range(5)]

    # create random extra data for each subject
    for sub in subjects:
        sub.sex = randint(1, 2)
        sub.age_band = randint(1, 9)

    # function to select a subset of a subjects sequence
    def select(subjects, **kwd):
        for s in subjects:
            # keep s only if every keyword attribute has an allowed value
            if all(getattr(s, key) in value for key, value in kwd.items()):
                yield s

    # get a list of selected subjects
    selection = list(select(subjects, sex=[1], age_band=[1, 2, 3, 4]))
    print(len(selection))

    # The results of the selected subjects are available through the instances
    # print the results of the selected subjects for the 3rd experiment:
    print([sub.results[experiments[2]] for sub in selection])

if __name__ == "__main__":
    main()

You can also add extra data to Experiment instances.

All I can propose is that you use a database.
An ordered dictionary is only usable if you know the queries beforehand or you have hardly any data.

I made a diagram of how I understand your data structure, but Dia crashed and I lost it.
If you are still interested in solving this problem, let me know.

I would use an SQLite database with the following tables:

  • experiment, list of experiments
  • subject, list of possible subjects
  • subject_data, list of possible values of a subject, if they are not numerical
  • experiment_data, which subject got which subject_data or numerical value in which experiment

If this model is right, then a filter for men is roughly:

select something
from
  experiment_data as e,
  subject as s,
  subject_data as sd
where e.subject_id = s.id
  and sd.subject_id = s.id
  and e.subject_data_id = sd.id
  and s.name = 'sex'
  and sd.name = 'man'
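
For illustration, the same kind of filter can be tried with Python's built-in sqlite3 module. The sketch below uses a deliberately simplified, hypothetical schema (one subject table holding the extra data and one result table), not the four-table model above; all table names, column names and values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE subject (id INTEGER PRIMARY KEY, sex INTEGER, age_band INTEGER);
    CREATE TABLE result (
        subject_id INTEGER REFERENCES subject(id),
        experiment TEXT,
        value REAL
    );
""")
# made-up sample rows: (id, sex, age_band)
conn.executemany("INSERT INTO subject VALUES (?, ?, ?)",
                 [(1, 1, 2), (2, 2, 3), (3, 1, 7)])
# made-up sample rows: (subject_id, experiment, value)
conn.executemany("INSERT INTO result VALUES (?, ?, ?)",
                 [(1, "Exp1", 10.0), (2, "Exp1", 12.0), (3, "Exp1", 9.0),
                  (1, "Exp2", 4.0), (2, "Exp2", 6.0), (3, "Exp2", 5.0)])

# Exp1 results for subjects with sex 1 in age bands 1 to 4
rows = conn.execute("""
    SELECT r.subject_id, r.value
    FROM result AS r
    JOIN subject AS s ON s.id = r.subject_id
    WHERE r.experiment = ? AND s.sex = ? AND s.age_band IN (1, 2, 3, 4)
    ORDER BY r.subject_id
""", ("Exp1", 1)).fetchall()
print(rows)
conn.close()
```

The database does the filtering, so nothing has to be copied into a new dictionary on the Python side; only the matching rows come back.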

I second the use of SQLite; then you can select where sex=='Male', etc. If you want to keep your original structure, a dictionary of lists might work better, with each position in the list holding a specific data value.
exp_dict[Exp#] = [sex, age_band, type_of_treatment]

Otherwise, you would want a separate dictionary with "Male" and "Female" as keys, pointing to a list of experiment numbers (or whatever the key of the main dictionary is), so you don't have to iterate through all of the main dictionary's keys down to the sub-dictionary of gender.
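
A rough sketch of that second idea, assuming the main dictionary is keyed by subject id and each value is a list like [sex, age_band, treatment]. The names subject_info and by_sex and the sample values are made up for the example.

```python
from collections import defaultdict

# Hypothetical records keyed by subject id: [sex, age_band, treatment]
subject_info = {
    101: ["Male", 2, "A"],
    102: ["Female", 3, "B"],
    103: ["Male", 5, "A"],
}

# Build the lookup once: sex -> list of keys of the main dictionary
by_sex = defaultdict(list)
for sid, (sex, age_band, treatment) in sorted(subject_info.items()):
    by_sex[sex].append(sid)

# Later lookups need no scan over the main dictionary
males = by_sex["Male"]
print(males)
```

The index has to be rebuilt or updated whenever subject_info changes, which is the price for the fast lookup.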

Thank you very much for your solutions!
I have to take a close look at them and decide what would suit me best. And to do that I have to understand them well first ;-)
Best regards
y

Post back if you want some help with SQLite, or the dictionaries.

Python 3 offers the named tuple, which sounds like something you may be interested in. Here is an example ...

# named tuple instances require no more memory than regular tuples
# tested with Python 3.1.1

import collections as co

EmpRec = co.namedtuple('EmpRec', 'name, department, salary')

bob = EmpRec('Bob Zimmer', 'finance', 77123)
tim = EmpRec('Tim Bauer', 'shipping', 34231)

fred_list = ['Fred Flint', 'purchasing', 42350]
# create an instance from a list
fred = EmpRec._make(fred_list)

# create an instance from an existing instance
john = fred._replace(name='John Ward', salary=49200)

# create a default instance for hourly manufacturing workers
default = EmpRec('addname', 'manufacturing', 26000)
mike = default._replace(name='Mike Holz')
gary = default._replace(name='Gary Wood')
carl = default._replace(name='Carl Boor')

# access by named index
print(bob.name, bob.salary)  # Bob Zimmer 77123
# or access by numeric index
print(tim[0], tim[2])  # Tim Bauer 34231

print('-'*40)

# access from a list of instances
emp_list = [bob, fred, tim, john, mike, gary, carl]
for emp in emp_list:
    print( "%-15s works in %s" % (emp.name, emp.department) )

print('-'*40)

# convert an instance to a dictionary via OrderedDict
print( dict(bob._asdict()) )
"""
{'department': 'finance', 'salary': 77123, 'name': 'Bob Zimmer'}
"""

# list the fieldnames of an instance
print(bob._fields)  # ('name', 'department', 'salary')

Note: Python 2.6 includes the named tuple already.
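
Applied to the subject data from the original question, a named tuple might be used roughly like this. This is only a sketch; the Subject record type, its field names and the sample values are assumptions, not taken from the actual program.

```python
from collections import namedtuple

# Hypothetical record type; the field names are assumptions
Subject = namedtuple('Subject', 'sid sex age_band results')

subjects = [
    Subject(1, 1, 2, {'Exp1': [3, 4]}),
    Subject(2, 2, 3, {'Exp1': [5, 6]}),
    Subject(3, 1, 8, {'Exp1': [7, 8]}),
]

# select subjects with sex 1 in age bands 1 to 4, without copying results
selection = [s for s in subjects
             if s.sex == 1 and s.age_band in (1, 2, 3, 4)]
print([s.sid for s in selection])
```

Each record keeps a subject's extra data and results together, so the selection only collects references to existing records.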