
This post is a follow-up to an older discussion on record managing. I am writing it to share with colleagues, so it may be a bit wordy rather than straight to the point; I apologize to the experts.

Good programs start by managing data in a flexible and robust manner. Python has great built-in container datatypes (lists, tuples, dictionaries), but we often need to go beyond these and create data handlers custom-fitted to our analysis. Here the programmer really does become an architect, with a myriad of possible approaches. That is a double-edged sword, though, as inexperienced coders (like me) tend to go down the wrong avenues and implement poor solutions. Having emerged from such a trajectory, I want to share my experience and introduce what I feel is a very easy-to-use and broadly applicable data container.

PyTony, in the aforementioned link, really turned me on to an underused Python data container: the namedtuple. A namedtuple is an immutable array container just like a normal tuple, except that namedtuples have field designations. Elements can therefore be accessed by attribute lookup as well as item lookup (i.e. x.a or x[0]), whereas plain tuples have no concept of attribute lookup. Namedtuples are a really great option for storing custom datatypes for these reasons:

  • They are lightweight (they take up very little memory).
  • They allow for manual creation, but also interface seamlessly with file or SQL database input (see reference).
  • They have many basic utilities built in, such as the ability to instantiate directly from lists and dictionaries, and simple means for subclassing and prototyping.

Named tuples are therefore ideal for managing data that may come from various files, databases, or manual construction; they are not limited to a specific import domain. They also take up less memory than subclasses of Python's object (there is a great example right here on DaniWeb), and they have many built-in methods which object subclassing would require the programmer to write herself. One should note that if data mutability (e.g. changing data attributes directly in the program) is paramount to the analysis, object subclassing is probably the way to go. For all the advantages of namedtuples, I realized that they do have some shortcomings. My biggest gripes with namedtuples are:
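To make the mutability tradeoff concrete, here is a quick sketch (not from the original discussion) showing that a namedtuple's fields are read-only, and that the standard `_replace` method is the idiomatic way to get a modified copy:

```python
from collections import namedtuple

Person = namedtuple('Person', 'name age height')
bret = Person('bret', 15, 50)

# Attribute assignment fails: namedtuples are immutable, just like plain tuples
try:
    bret.age = 16
except AttributeError:
    pass  # this branch is always taken

# _replace builds a *new* record with the changed field; the original is untouched
older_bret = bret._replace(age=16)
```

If your analysis needs frequent in-place updates, creating a new record on every change can get awkward, which is exactly when a mutable object subclass pays off.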

  • Namedtuples don't inherently understand default field values.
  • Namedtuples don't typecheck field values.

Let me demonstrate this with an example. I'm going to define a namedtuple class called "Person" which has three fields: name, age and height. I will then make a Person instance from it.

In [1]: from collections import namedtuple

In [2]: Person=namedtuple('Person', 'name, age, height')

In [3]: bret=Person(name='bret', age=15, height=50)

In [4]: bret
Out[4]: Person(name='bret', age=15, height=50)

This is a nice record. I can access values by attribute lookup and I can use builtin methods to do nice things like return a dictionary without building any extra code.

In [5]: bret.age, bret.name, bret.height
Out[5]: (15, 'bret', 50)

In [6]: bret._asdict()
Out[6]: OrderedDict([('name', 'bret'), ('age', 15), ('height', 50)])

Ok, so this works nicely, but what if we want to read in records with no height column? This is the first place that namedtuple will fail you.

In [59]: ted=Person(name='ted', age=50)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-59-23acd5446d52> in <module>()
----> 1 ted=Person(name='ted', age=50)

TypeError: __new__() takes exactly 4 arguments (3 given)

There are many instances when it is desirable to have this behavior; however, there are also instances when it is not. For example, if we were storing data input from a survey and certain fields were left blank, do we really want that to crash the program? The alternative is to populate missing fields with null or default data manually, so wouldn't it be great if namedtuples understood this implicitly? One can think of many other instances where defaulting is important, and it is especially helpful when fields have very obscure or misleading datatypes, which may confuse anyone else using your codebase.
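For completeness, here are two common workarounds for default fields (neither is from the original discussion; both rely only on standard namedtuple machinery): setting defaults on the generated class's `__new__`, or keeping a fully-populated prototype record and using `_replace` on it.

```python
from collections import namedtuple

Person = namedtuple('Person', 'name age height')

# Workaround 1: attach default values to the generated __new__
# (right-aligned, so these cover name, age, and height in order)
Person.__new__.__defaults__ = ('unnamed', 0, 0.0)
ted = Person(name='ted', age=50)   # height silently falls back to 0.0

# Workaround 2: keep a prototype record and _replace into it
DEFAULT_PERSON = Person('unnamed', 0, 0.0)
sue = DEFAULT_PERSON._replace(name='sue')
```

Both idioms work, but neither does any typechecking, and the `__defaults__` trick is easy to miss when reading the code, which is part of what motivated the class below.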

The second thing namedtuples don't do is enforce field types. Consider again our Person class. The attribute "name" implies that a string should be entered, but there's nothing to enforce this. The same is true for height; that is, certain information is presumed on the user's part.

In [10]: kevin=Person(name=32, age='string input', height=['a', 'list', 'has been entered'])

In [11]: kevin
Out[11]: Person(name=32, age='string input', height=['a', 'list', 'has been entered'])

Because a namedtuple is a very basic container, it really doesn't care what types of objects you pass into the fields. Without getting into a philosophical argument about duck typing, I think we can all agree that there are times when this behavior is undesirable. Imagine you were going to share your codebase with someone unfamiliar with the subject; field names might not be so obvious. Additionally, if you built your analysis assuming the height attribute had a very particular format, e.g. (6 foot 9 inches), everyone's life would be easier if the namedtuple knew about it.
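A minimal ad-hoc fix is to wrap construction in a checking function. This sketch (the `checked_person` helper and `_types` mapping are hypothetical names, not part of any library) shows the kind of boilerplate you end up writing by hand, which the RecordManager class below packages up:

```python
from collections import namedtuple

Person = namedtuple('Person', 'name age height')
_types = {'name': str, 'age': int, 'height': float}  # assumed field types

def checked_person(**fields):
    """Hypothetical helper: reject any field whose value is not the expected type."""
    for key, expected in _types.items():
        if not isinstance(fields[key], expected):
            raise TypeError('%s must be %s, got %r'
                            % (key, expected.__name__, fields[key]))
    return Person(**fields)

alice = checked_person(name='alice', age=30, height=1.7)
```

This enforces types but has no notion of defaults or recasting, and you would need one such wrapper per record type.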

At the end of the day, I think all of these considerations fall under the umbrella of record keeping in Python. It is an interesting topic and certainly warrants further discussion.

Now let me get into my solution and why I think it's elegant. First, I should mention that one cannot directly subclass namedtuple; namedtuple is a function which builds classes, not a class in and of itself. Modifying what namedtuple returns requires altering the source code directly, which is rather messy (the previous discussion was actually in regard to this). My solution was pretty simple:

  • Write a class that natively understands default values and types.
  • Make sure the class can typecheck and fill in missing data in a light and syntactically nice way.
  • Pass the adjusted data into a namedtuple, and hide most of this under the hood.

This way, the new class does all of its field typechecking before initializing namedtuples. Default data is passed at instantiation, and the default values and types are stored from then on. I will demonstrate by example: let's create a Person namedtuple with the new class. We will define our stringent fields and pass them right into the class instantiation.

In [12]: from recordmanager import RecordManager

In [16]: personfields=[('name', 'unnamed',) , ('age', int()), ('height', float() )]

In [17]: personmanager=RecordManager('Person', personfields)

By passing in personfields, we are implicitly telling RecordManager the default value and type of each field. personmanager can now make named tuples from lists or dictionaries in much the same way as an ordinary namedtuple.

In [26]: bill=personmanager._make('Billy', 32, 10000.00)

In [27]: bill
Out[27]: Person(name='Billy', age=32, height=10000.0)

At first glance, this looks no different from the standard namedtuple _make() method; however, this _make() is being called on the RecordManager class, so it will typecheck fields. We can make the typechecking verbose with the keyword "warning".

In [29]: jill=personmanager._make('Jill', 40.0, 50, warning=True)
Recasting 40.0 to <type 'int'> as 40
Recasting 50 to <type 'float'> as 50.0

In [30]: jill
Out[30]: Person(name='Jill', age=40, height=50.0)

Of course, certain values can't be recast; in that case an error will come up showing exactly why.

In [31]: adam=personmanager._make('Adam', 'teststring', 40.0)
TypeError: Argument: teststring to <type 'int'>

All of the returns are still namedtuples, so all standard methods natively work.

In [33]: bill._asdict()
Out[33]: OrderedDict([('name', 'Billy'), ('age', 32), ('height', 10000.0)])

I've also incorporated an optional way to create records from incomplete lists (i.e. missing fields). A namedtuple will be returned with field defaults for the fields that were not entered; however, this ASSUMES one enters fields in their correct order from left to right.

In [35]: joe=personmanager._make('Joe', extend_defaults=True)

In [36]: joe
Out[36]: Person(name='Joe', age=0, height=0.0)

The namedtuple class is stored in an attribute, so it can still be accessed directly. The Person attribute on personmanager bypasses any typechecking and functions like an ordinary namedtuple: I can pass bad input in and it won't care.

In [38]: jenny=personmanager.Person(name='jenny', age='string input', height=30)

In [39]: jenny
Out[39]: Person(name='jenny', age='string input', height=30)

Therefore, one is by no means obligated to use these special class methods. Strictly typechecked namedtuples can be generated alongside default ones with no extra hassle.

Namedtuples have a really nice feature of instantiating from a dictionary. Let me demonstrate this first by accessing the standard namedtuple directly:

In [43]: d={'name':'Larry', 'age':50, 'height':90}

In [44]: Larry=personmanager.Person(**d)

In [45]: Larry
Out[45]: Person(name='Larry', age=50, height=90)

Again, this has no concept of defaults or typed fields. To incorporate those, personmanager has a method called dict_make, which lets users pass incomplete fields with type recasting.

In [55]: d={'name':'Fred', 'age':30.0}

In [56]: Fred=personmanager.dict_make(warning=True, **d)
Recasting 30.0 to <type 'int'> as 30

In [57]: Fred
Out[57]: Person(name='Fred', age=30, height=0.0)

Notice that the default height was filled in, and the age was recast.

Eventually I will add a method to subclass within this framework, and then I think this will completely mimic namedtuple functionality. I hope you found this useful, and I look forward to feedback.

### Adam Hughes 8/7/12
### Record class which returns named tuples with same fields, similar syntax
### and the added options of defaults and a typechecking
from collections import namedtuple

class RecordManager(object):

    def __init__(self, typename, strict_fields, verbose=False):
        ''' Store all of the field and type data as instance attributes so they aren't regenerated
            every time a new named tuple is required'''
        self.typename=typename 
    
        ### Store field type and default information in various formats for easy access by methods ###
        self.strict_fields=strict_fields
        self._strict_names=[v[0] for v in strict_fields]
        self._strict_types=[ type(v[1]) for v in strict_fields ]
        self.strict_defaults=[ v[1] for v in strict_fields]  
        
        vars(self)[typename]=namedtuple(typename, self._strict_names, verbose=verbose)  #Creates a namedtuple class from factory function

    def _typecheck(self, arg, fieldtype, warning=False):
        ''' Takes in an argument and a field type and tries to recast if necessary, then returns the recast argument'''
        if not isinstance(arg, fieldtype):   
            try:
                oldarg=arg            #Keep for error printout
                arg=fieldtype(arg)    #Attempt recast
            except (ValueError, TypeError):  #Recast failed
                raise TypeError("Argument: %s to %s" % (arg, fieldtype))
            else:
                if warning:
                    print ("Recasting %s to %s as %s" % (oldarg, fieldtype, arg) )        
        return arg
        
    def _make(self, *args, **kwargs):        
        '''Typechecks arguments and populates with defaults for non-entered fields.  Returns namedtuple. 
           The special keyword "warning" will make the _typecheck method alert the user of recasting.
           warning: If true and if recast is true, prints warning each time an input field is successfully type recasted.
   
           Another keyword "extend_defaults" can be used if the user wants to enter data of only a few fields.  For example,
           if the user passes in field 0, this will autofill field 1, field 2 etc.. with defaults.  This may not be a useful
           method since the dict_make method implements this robustly via keywords.
           '''        
        warning=kwargs.pop('warning', False)
        extend_defaults=kwargs.pop('extend_defaults', False)
        
        if len(args) > len(self.strict_defaults):
            raise ValueError('Too many arguments')
        
        ### If not enough args entered, fill in with strict defaults ###
        elif len(args) < len(self.strict_defaults) and extend_defaults==True: 
            args=list(args) 
            args.extend(self.strict_defaults[len(args):len(self.strict_defaults)] )       
            
        ### Typecheck arguments (store recast values back into the argument list) ###
        args=list(args)
        for i in range(len(args)):
            args[i]=self._typecheck(args[i], self._strict_types[i], warning)
        return vars(self)[self.typename](*args)


    def dict_make(self, **kwargs):
        ''' User can pass a dictionary of attributes in and they will be typechecked/recast.  Similar to passing
        a dictionary directly to a namedtuple using **d notation'''
        warning=kwargs.pop('warning', False)        

        for name, default in self.strict_fields:
            try:
                value=kwargs[name]
            except KeyError:
                kwargs[name]=default #Throw the default value in if missing
            else:
                value=self._typecheck(value, type(default), warning) #Typecheck if found
                kwargs[name]=value 
                
        return vars(self)[self.typename](**kwargs)