I'm using Python 2.5.2 on an Ubuntu box for a research project based on data from the Fatality Analysis Reporting System (FARS) database (1975-), available at
http://www-fars.nhtsa.dot.gov/Main/index.aspx.

So far I have found 115 characters of the form "\xzz" (hex escapes for single bytes) in the card-image records (4+ per incident) for about 300k incidents from 1975-1981. Each record is supposed to have up to 88 alphanumeric characters. The record/card/field layout was constant for those years.

This is (obviously) not a huge problem, since 115 oddballs are a tiny fraction of 80 (say) * 1.2m characters, but I'd like to figure out whether those characters have any other interpretation before replacing them with spaces, question marks, or something else.

My processing loop begins with:

import fileinput

for line in fileinput.input(path_file):

All other returned characters are in string.printable.
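
Roughly, the check that turned these up looks like this (a simplified sketch; path_file is my input-path variable, and the real loop also carves each record into fields):

import fileinput
import string

printable = set(string.printable)   # the characters I expect to see
bad_counts = {}                     # unexpected character -> occurrence count

for line in fileinput.input(path_file):   # path_file: defined earlier in my script
    for ch in line:
        if ch not in printable:
            bad_counts[ch] = bad_counts.get(ch, 0) + 1

for ch, count in sorted(bad_counts.items()):
    print repr(ch), count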

Each record ends with "\r", so I'm guessing they were created on a Macintosh.

93 of the unexpected characters are in the field for VIN (vehicle identification number) values. The VIN code is clear and well known; see
http://www.autoinsurancetips.com/decoding-your-vin,
for example. These 93 instances involve:
1 time each: \x01, \x08, \x10, \x12, \x9b, and \xf9
3 times: \x19
5 times: \xf2
79 times: \x1b

The other 22 are in two other records (#1144677 (1979) and #1452856 (1980), out of the 2.whatever million read), which return \x00 (10 times each) and \x01 (once each) in the same fields. Those fields report vehicle body type, truck characteristics (fuel, weight, series), and motorcycle engine displacement, not all of which apply to the same vehicle :).

Any thoughts on interpreting these strange character codes, other places to look, or should I conclude that they are random garbage?

Thanks very much!

HatGuy

All 4 Replies

The file may be encoded with a different codec than the one you're using. If this is the case, it's a simple matter of using

import codecs
codecs.open(filename, "rb", "codec")

If you don't know the codec, you may have to try a few until you get lucky. Start with the most common ones ("utf-8", "latin-1") and work your way down.
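
Something along these lines would do it (a rough sketch; swap in your own path and candidate list):

import codecs

candidates = ["utf-8", "latin-1", "cp1252", "ascii"]

for name in candidates:
    try:
        f = codecs.open(path_file, "rb", name)   # path_file: your data file
        try:
            f.read()                             # force a full decode
        finally:
            f.close()
    except (UnicodeDecodeError, LookupError):
        print name, "failed"
    else:
        print name, "read the whole file without complaint"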

Also, bear in mind that the problem may not be that the file is being read with the wrong codec; it may have been corrupted from the start (read and then re-written with the wrong codec).

Good luck.

Just looking at the standard ASCII table, you have a mix of control characters (for instance, \x1b is escape and \x08 is backspace), plus some values that map to accented Spanish characters in the extended tables.
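
A quick way to see what each of those bytes would be under Latin-1 (a small sketch; the control codes have no Unicode names, hence the fallback):

import unicodedata

for byte in ['\x01', '\x08', '\x10', '\x12', '\x19', '\x1b', '\x9b', '\xf2', '\xf9']:
    ch = byte.decode('latin-1')          # treat the raw byte as Latin-1
    try:
        name = unicodedata.name(ch)
    except ValueError:                   # control characters have no name
        name = '<control character>'
    print repr(byte), name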

I hope the file is not bad, but dealing with that has been my last resort from the beginning. I recognized the control characters but not the Spanish characters; thanks!

First, though, I'll get a list of codecs and see if they help.

Thanks, both of you!

HatGuy

Solved!

There are 114 files in .../Lib/encodings, but import encodings runs __init__.py, which pulls in aliases.py and only a few (platform-dependent?) of the 112 encodings available under Ubuntu and Python 2.5.2. The default encoding is None, which is available anyway.
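
In outline, the candidate list came from something like this (a sketch; the directory path is for a stock Ubuntu install and will vary):

import os

enc_dir = '/usr/lib/python2.5/encodings'   # assumed location of Lib/encodings

candidates = [None]                        # None = the platform default
for fname in sorted(os.listdir(enc_dir)):
    if fname.endswith('.py') and fname not in ('__init__.py', 'aliases.py'):
        candidates.append(fname[:-3])      # strip the .py extension

print len(candidates), 'encodings to try'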

Otherwise, I guess the assumption is that you know which encoding you're looking for. In that case, substitute the name you want for None in codecs.open(file, mode='rb', encoding=None, ...) and run your read loop.

Unfortunately, I didn't know whether my problem was in the files themselves (randomly dirty characters) or whether the default encoding simply didn't suit them.

Each data file has about 300,000 card-image records. Each record has from 50 to 88 characters, so perhaps 21 million characters per file. There are 30 files in all, of which I've tested 7.

Test process (a rough sketch of the harness follows the two steps):
1. Run all the encodings against one data file, to find those that would work.

2. Run all the encodings against all 7 data files, to find the non-printing characters each one returned.
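
The harness for those two steps looked roughly like this (a simplified sketch; the real run also logged results to a file):

import codecs
import string

printable = set(string.printable)

def nonprintables(path, encoding):
    # Return {character: count} for everything outside string.printable,
    # or None if this encoding cannot read the file at all.
    chars = {}
    try:
        f = codecs.open(path, 'rb', encoding)
        try:
            for line in f:
                for ch in line:
                    if ch not in printable:
                        chars[ch] = chars.get(ch, 0) + 1
        finally:
            f.close()
    except Exception:
        return None
    return chars

for encoding in candidates:        # candidates: built as sketched above
    for path in data_files:        # data_files: my list of the 7 FARS files
        result = nonprintables(path, encoding)
        if result is None:
            print encoding, path, 'FAILED'
        else:
            print encoding, path, sorted(result.keys()), sum(result.values())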

Of the 113 encodings tested (112 from the directory + None):

44 encodings raised one of several errors when the first record of a file was read. I haven't tried to figure out why.

62 encodings opened the data files and returned small sets (0, 1, 2, 4, or 5 distinct characters) of non-printables, none occurring more than 30 times per file. Seven other encodings processed the files but returned such a huge number of non-printing characters (on the order of 1.5 million) that I dropped them.

Within any one file, the sets of non-printable characters were the same size for every encoding, but which characters the sets contained varied from one group of encodings to another.

Eight of the 62 successful encodings seemed familiar/popular to me. They shared a set of non-printing characters with other, less familiar encodings, but one of them, cp1252, matched the others for only six of the seven years.

That left charmap, iso8859_1, latin_1, None, string_escape, raw_unicode_escape, and unicode_escape.

With these encodings (including cp1252), the maximum number of bad characters in a file is 30, a tiny fraction of the ~21 million characters each file holds. I'll replace them with question marks ('?'), a character that is not otherwise used.
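
The replacement itself will be a small helper in the read loop (a sketch; the rest of the field parsing stays the same):

import string

printable = set(string.printable)

def scrub(line):
    # Swap anything outside string.printable for '?'
    return ''.join(ch if ch in printable else '?' for ch in line)

# inside the processing loop:
#     line = scrub(line)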

Since None worked as well as any of the others, I'll use it; K.I.S.S.!

Hope this helps someone else, and thanks for the help!

HatGuy
