I am trying to sort a file by a quick sort comparison. The problem lies in parsing the string and comparing fields. I use the following to break up a line:

records = [line.split(spl) for line in file(filename)]

and the following to sort:

records.sort( lambda a,b: cmp(a[fieldNo],b[fieldNo]) )

This works fine on files delimited with "x", "y", "z". If I have X, Y, Z or X\tY\tZ, it blows up and cannot parse the line. I have tried defining 'spl' as a comma (','), a tab ('\t') and a space (' '), and none of them work. It only runs properly if the file is quote/comma delimited. Is there a library (besides CSV) that will work with a variety of delimiters, or a workaround? CSV does not have the split method.

It's not that difficult to roll your own. Check each character one at a time in each line and split and append to a list of lists as appropriate.

for ch in record:
    if ch.lower() in ["x", "y", "z"]:
        ## etc
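To make the character-by-character idea concrete, here is a minimal sketch. It assumes the delimiters are single characters (comma or tab here) and that fields are not quoted; the name split_fields and its parameters are made up for illustration and are not from the original program.

```python
def split_fields(record, delimiters=(",", "\t")):
    """Split one line into fields at any of the given single-character delimiters.

    Assumption: no quoting or escaping, so a delimiter inside quotes
    would still split the field.
    """
    fields = []
    current = []  # characters of the field being built
    for ch in record.rstrip("\r\n"):  # drop the trailing line ending first
        if ch in delimiters:
            # delimiter found: close the current field and start a new one
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))  # the last field has no trailing delimiter
    return fields


print(split_fields("X,Y,Z\n"))
print(split_fields("X\tY\tZ\n"))
```

Collecting characters in a list and joining once per field avoids rebuilding a string on every character, which matters on long records.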

That would work, but the records are approximately 5,000+ bytes long, and the record count is about 3.65 million. Speed is important here. When the program runs now, I get the 3.65 million records sorted in about 3.5 minutes, as opposed to nearly 20 with the tool they use currently. I think bit checking would slow it down significantly.

And how else would you do it? Every character has to be checked in the underlying code no matter what method is used. The time difference would be because of code efficiency or speed of using C versus Python, but would not change the logic. Finally, a program that is fast but doesn't work properly is of no value. The old adage, first you make it work, then you make it fast, then you make it pretty, holds in pretty much every case.

I think bit checking would slow it down significantly.

It would be best to check this first before making the statement.
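One way to check it is to time both approaches on a record of realistic size. The sketch below uses the standard timeit module; the sample record is made up (about 5,000 bytes, tab-delimited) and split_by_loop is a hypothetical hand-written splitter, so the numbers it prints only indicate the relative cost on your own machine.

```python
import timeit


def split_by_loop(record, delimiter):
    """Hand-written split: inspect each character in Python code."""
    fields, current = [], []
    for ch in record:
        if ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


# Made-up sample: 500 fields of 9 characters each, roughly 5,000 bytes in total
sample = "\t".join(["field%04d" % i for i in range(500)])

# Both approaches must agree before the timings mean anything
assert split_by_loop(sample, "\t") == sample.split("\t")

loop_time = timeit.timeit(lambda: split_by_loop(sample, "\t"), number=200)
builtin_time = timeit.timeit(lambda: sample.split("\t"), number=200)

print("character loop: %.4f s for 200 records" % loop_time)
print("str.split:      %.4f s for 200 records" % builtin_time)
```

Scaling the per-record difference up to 3.65 million records gives a rough idea of how much the choice would add to the total run time.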

This is true. I suppose I should have qualified my statement by saying my own skills are not nearly as sharp and accurate as a library that's already been defined and honed. Programming the parsing myself would be a fun exercise, but I think I would make it much less efficient.

Give the exact format of the input you need to handle. It looks like it is only a matter of a simple string.split() call on each line from open(file) to parse it.
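Something along these lines might be all that is needed. This is only a sketch under assumptions: the function name sort_records is made up, the delimiter is a single known string, fields are not quoted, and the comparison is a plain string comparison. It sorts with key= rather than a cmp function, which does the same job as the original cmp lambda and is also accepted by newer Python versions.

```python
def sort_records(lines, delimiter, field_no):
    """Split each line on delimiter and sort the records by one field."""
    # Strip the line ending first so the last field does not carry "\n"
    records = [line.rstrip("\r\n").split(delimiter) for line in lines]
    # Compare records by the chosen field only (string comparison)
    records.sort(key=lambda rec: rec[field_no])
    return records


# Made-up tab-delimited sample; in practice pass open(filename) instead
sample = ["b\t2\tx\n", "a\t3\ty\n", "c\t1\tz\n"]
print(sort_records(sample, "\t", 0))
# prints [['a', '3', 'y'], ['b', '2', 'x'], ['c', '1', 'z']]
```

Note that splitting a quote/comma delimited line on "," leaves the quote characters inside each field, and a comma inside a quoted field would split it in two, so that format needs separate handling.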

Could you also post your quick sort function and the expected output?

If you also include the input data from Advanced Editor -> Manage Attachments, it would be nice (zip if necessary).

Do use the [code] button to tag code or tabular input/output.

Edited 6 Years Ago by pyTony: n/a
