OK, I am fairly new to Python, but I have a pretty good understanding of it. Still, I'm up against something that I just can't figure out.

I have an archive file that I am trying to take apart into individual files. It isn't a zip file, but an ASCII file of individual files concatenated together with NULLs (hex 00) and CRs between them.

I wrote a function that rips through the file line by line and replaces the NULLs with CRs, and then I go through it again and remove the lines that contain only a CR. This works great, and I rip through a 1 MB file in 2 seconds. But some of the files have a stray hex 1A, which is treated as an EOF marker. I tried opening the file with "rb" instead of "r", but then I lose my carriage returns in the resulting files (either Python or Windows must be doing something to the files when I open them in binary). When I open a file with "r", it simply will not go past the first hex 1A.

Any ideas?

Different operating systems use different line endings: Linux uses '\n', Windows uses '\r\n', and the classic Mac OS uses '\r'. This might be causing your problem. Open your file in a hex editor and look at it. If it has two-byte line endings, you will have to replace them accordingly.
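If no hex editor is handy, the same check can be sketched in Python. This is only an illustration, assuming Python 3 and that the file has been read in binary mode; the function name count_line_endings is made up for this example.

```python
def count_line_endings(data: bytes) -> dict:
    """Count each line-ending style in a chunk of raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair also contains one LF and one CR, so subtract the pairs.
    lf_only = data.count(b"\n") - crlf
    cr_only = data.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": lf_only, "CR": cr_only}


if __name__ == "__main__":
    sample = b"first\r\nsecond\nthird\rfourth\r\n"
    print(count_line_endings(sample))
```

To run it on a real file, read the bytes with open(path, "rb") first so that nothing translates the line endings on the way in.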

Just as an idea: as far as I know, UNIX line endings are 0A (LF) and Windows line endings are 0D 0A (CR LF). Different editors also write different line endings in Python files; PyScripter, for example, writes 0D 0A. But x1A is quite special...

It's the Ctrl+Z ...

Thanks for the replies. Let me describe this in a little more detail:

The archive is mostly text. But there are NULLs embedded within the lines. For instance, let's say the original text is this:

This is a test.'\n'
This is the next test.'\n'
This is the third test.'\n'

After a third-party program archives the text, I might end up with something like this (I have no control over this; long story):

This is '\x00' '\x00' '\x00'a t'\x00' '\x00' est.'\n'
'\x00' '\x00' This is the next '\x00' '\x00' test.'\n'
This '\x00' '\x00' is the third test. '\n'

So, I wrote a script that rips through the archive line by line and replaces each '\x00' with a '\n'. I then rip through it again and delete each line that has only a '\n'. There are probably better ways to do it, but it works very fast for the files I have (around 1 MB, about 25000 lines).
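For readers following along, the two passes described above could be sketched roughly like this. It is not the original script, which was never posted; it assumes Python 3 and works on bytes read in binary mode, and the name nuls_to_lines is hypothetical.

```python
def nuls_to_lines(data: bytes) -> bytes:
    """Replace NUL bytes with newlines, then drop lines that are only a line ending."""
    # Pass 1: every NUL becomes a newline.
    replaced = data.replace(b"\x00", b"\n")
    # Pass 2: keep a line only if something other than CR/LF is left on it.
    lines = replaced.splitlines(keepends=True)
    kept = [line for line in lines if line.strip(b"\r\n")]
    return b"".join(kept)


if __name__ == "__main__":
    archived = b"This is \x00\x00\x00a t\x00\x00est.\n\x00\x00This is the next test.\n"
    print(nuls_to_lines(archived))
```

Working on the whole buffer at once avoids the per-character loop, since replace and splitlines run in C.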

But, occasionally I get a file that has a random '\x1A' in it instead of '\x00':

This is '\x00' '\x00' '\x00'a t'\x00' '\x00' est.'\n'
'\x00' '\x1A' This is the next '\x00' '\x00' test.'\n'
This '\x00' '\x00' is the third test. '\n'

And when I start reading the file line by line, it stops reading at the first instance of '\x1A', even though the file continues for thousands more lines. I can get around this by reading it in binary ('rb'), but then I lose all of the carriage returns at the end of each line. I also tried iterating through each character in the file before reading it by line, but that took a long time and I abandoned that approach.

I am probably missing something simple here. I read something about translating the file against a map; would that make sense? Any other suggestions?
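On the translation-map idea: in Python 3 a byte-level map is available through bytes.maketrans and bytes.translate. The sketch below assumes that a stray '\x1A' should be treated the same way as a '\x00' separator, which is a guess about the archive format rather than something confirmed in this thread; SEPARATOR_TABLE and translate_separators are illustrative names.

```python
# Map NUL (0x00) and Ctrl+Z (0x1A) to a newline; every other byte maps to itself.
SEPARATOR_TABLE = bytes.maketrans(b"\x00\x1a", b"\n\n")


def translate_separators(data: bytes) -> bytes:
    """Return a copy of data with NUL and 0x1A bytes turned into newlines."""
    return data.translate(SEPARATOR_TABLE)


if __name__ == "__main__":
    sample = b"This is the next \x00\x1a test.\n"
    print(translate_separators(sample))
```

The translation runs over the whole buffer in one call, so it should be much faster than iterating character by character in a Python loop.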

I think that as long as you have the x1A in the file, your parser will stop. That's normal, because x1A is an EOF for it; it's the one sign it knows as the end. The existence of the EOF is probably due to two files being concatenated in the archive, for God knows what reason.

Are you using Python modules for the archive (like zipfile), or parsing the file with file.read directly?

I ran across a post elsewhere with a module and an example of a zipped file containing x1A, but I have no time for it, so I leave it to you to skim through. Maybe it helps:

http://mail.python.org/pipermail/python-checkins/2007-February/058579.html

Sorry I can't do more

That link you pointed me to made me think of something: I'm opening the file in "rb" mode, but was writing in "w" mode. By changing the mode to "wb", I was able to strip out the random \x1A. I just tested several of the problematic files, and they came out ok. I can't believe I didn't think of that sooner.
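For anyone landing here later, the fix amounts to something like the sketch below: read in "rb", drop the '\x1A' bytes, write in "wb". It assumes Python 3 and a file small enough to read into memory at once; strip_ctrl_z and the paths passed to it are placeholders, not the code actually used.

```python
def strip_ctrl_z(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path without 0x1A bytes; return how many were removed."""
    # Binary read: no EOF handling and no line-ending translation.
    with open(src_path, "rb") as src:
        data = src.read()
    cleaned = data.replace(b"\x1a", b"")
    # Binary write: every remaining byte, including CR and LF, is stored as-is.
    with open(dst_path, "wb") as dst:
        dst.write(cleaned)
    return len(data) - len(cleaned)
```

A call such as strip_ctrl_z("archive.dat", "archive_clean.dat") would leave the original untouched and write the cleaned copy next to it; both file names are made up.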

Yei...:D
We ourselves always seem to be our best teachers most of the time...
You should mark it as solved then to let others trust your solution further;)

Yes, just bouncing ideas around here was worth hours of previous Google searches. This forum is great. Trust me, you'll be seeing more of me here. :)

And just for further edification for newbies like myself, what was happening in my original effort was that the carriage returns were being lost when opening the file in rb mode and writing in w mode. But by writing back in wb mode, each byte was copied exactly as in the original (except for what I was changing intentionally), and when I opened it back up in r mode again for further processing, everything worked again as it should.
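The mode mismatch can be reproduced on any platform. The sketch below is one plausible illustration of how a text-mode write alters line endings, not necessarily the exact symptom seen in this thread: it assumes Python 3 and uses newline="\r\n" to imitate Windows text mode, and roundtrip is a hypothetical helper.

```python
import os
import tempfile


def roundtrip(payload: bytes, binary: bool) -> bytes:
    """Write payload to a temp file and return the bytes that ended up on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        if binary:
            with open(path, "wb") as out:
                out.write(payload)
        else:
            # newline="\r\n" makes text mode expand every "\n", as Windows does.
            with open(path, "w", encoding="latin-1", newline="\r\n") as out:
                out.write(payload.decode("latin-1"))
        with open(path, "rb") as check:
            return check.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    original = b"first\r\nsecond\r\n"
    print(roundtrip(original, binary=True) == original)
    print(roundtrip(original, binary=False) == original)
```

The binary round trip returns the payload unchanged, while the text-mode one rewrites each "\n", so data read in "rb" and written in "w" no longer matches the original byte for byte.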
