Hi. I'm having some difficulty with a project that involves working with very large binary files. These are PCL files, where the byte value 12 (decimal) represents a Form Feed, but only if it's not embedded within a run of binary data.

In other words, I'm looking for a byte with the value 12, then looking at the next few bytes to make sure it's really a Form Feed. I want to record the byte position of each "real" Form Feed, for further processing down the line.

Here's what I have. It works, but it's extremely slow. The interesting part is in the "while" loop:

using System.Collections;
using System.IO;
using System.Text;

class PclPageFinder
{
	// In the original program these were declared outside Main.
	static ArrayList page_positions = new ArrayList();
	static Encoding ascii = Encoding.ASCII;
	static string bgn_of_page;	// 14-character "start of page" PCL sequence (value not shown in this post)

	static void Main(string[] args)
	{
		// need to initialize header and position of first page.
		string filename = @"C:\Statements-05-03-05.pcl";
		FileStream infile = new FileStream(filename, FileMode.Open, FileAccess.Read);

		byte[] test = new byte[1024];
		long counter = infile.Read(test, 0, test.Length);
		string asciiString = ascii.GetString(test, 0, (int)counter);

		string header = asciiString.Substring(0, asciiString.IndexOf("*b0M") + 4);
		page_positions.Add(header.Length);

		// "<" rather than "<=": once counter reaches Length there is nothing left to read.
		while (counter < infile.Length)
		{
			int pcl_char = infile.ReadByte();
			counter++;
			if (pcl_char == 12)
			{
				test = new byte[14];
				long curr_pos = infile.Position;	// position of the byte just after the 12

				// Read may return fewer than 14 bytes near the end of the file.
				int got = infile.Read(test, 0, test.Length);
				counter = counter + got;
				asciiString = ascii.GetString(test, 0, got);

				if (asciiString == bgn_of_page)
				{
					page_positions.Add(curr_pos);

				} // if (asciiString == bgn_of_page)

			} // if (pcl_char == 12)
		} // while (counter < infile.Length)
		infile.Close();
	}
}

This is slow because I'm reading a single byte at a time. I check to see if the byte is "12", if so, I read the next 14 bytes, convert them to a string, compare it to a target string, and if I get a match, record the byte position of the "12" into an ArrayList.

I would really like to speed this up. For example, through buffering. If I use a StreamReader, and its "Read()" method, and use char[] instead of byte[], I can loop through a file in SECONDS, rather than MINUTES.

However, the .Position property of the BaseStream (infile) is wildly off. This is because it's reporting the position in the source stream, which is no longer reading a byte at a time. It's buffering 1024 at a time. So the .Position property is useless to me.

That's why the "counter" is in the above code. In that code, "counter" and "infile.Position" are synchronized and I could use them interchangeably.

I had hoped that "counter" would still be accurate when I switched to StreamReader. It isn't!
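
One possible reason for the drift, sketched here under the assumption that the StreamReader was created with its default encoding (UTF-8): a StreamReader hands back decoded characters, not bytes, and in raster data a run of bytes can decode to fewer (or replacement) characters. The class and method names below are made up for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper, only to illustrate char counts vs. byte counts.
static class DecodeDrift
{
    // Number of chars a UTF-8 decoder produces for the given bytes.
    public static int Utf8CharCount(byte[] raw)
    {
        return Encoding.UTF8.GetString(raw).Length;
    }

    // ISO-8859-1 maps every byte to exactly one char, so the counts stay equal.
    public static int Latin1CharCount(byte[] raw)
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(raw).Length;
    }
}
```

For the four bytes { 0x0C, 0xC3, 0xA9, 0x41 }, the UTF-8 count is less than four because 0xC3 0xA9 decode to a single character, while the ISO-8859-1 count stays at four. So a counter that is incremented once per char can fall behind the real byte offset.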

Can anyone shed light on this? I need a way to combine the speed of StreamReader/buffering while keeping track of absolute byte positions in the file.

Thank you SO MUCH for reading this far!

No one answered, and I found no answer on the web, either.

What I did was implement my own buffer. Instead of reading a byte at a time, I read in 8k at a time, using the FileStream. Then I did my search for the "12" in the bytes read, using a "for" loop.

It's a little tricky if the 12 is near the end of the buffer, since I need to check the next 14 bytes. So I use a working 14-byte array to store the last 14 bytes of the buffer each loop, and do some logic to handle that situation. It wasn't too tough.
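
A minimal sketch of that kind of hand-rolled buffering, not the actual code from my program. It assumes the "start of page" sequence is passed in as a byte array (marker), and it carries the last marker.Length bytes of each chunk over to the next one instead of using a separate working array. ChunkedScanner and FindPageStarts are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: scans a stream in chunks for form feeds followed by a marker.
static class ChunkedScanner
{
    // Returns the absolute offset of the byte just after each form feed (12)
    // that is immediately followed by the bytes in marker.
    public static List<long> FindPageStarts(Stream input, byte[] marker, int chunkSize = 8192)
    {
        var positions = new List<long>();
        int keep = marker.Length;                  // tail bytes carried into the next chunk
        byte[] buf = new byte[keep + chunkSize];
        int filled = 0;                            // number of valid bytes in buf
        long bufStart = input.Position;            // absolute offset of buf[0]
        bool eof = false;

        while (!eof)
        {
            int n = input.Read(buf, filled, buf.Length - filled);
            if (n == 0) eof = true;
            filled += n;

            // Before EOF, leave the last `keep` bytes unscanned so that every
            // candidate has a full marker's worth of bytes after it.
            int limit = eof ? filled : Math.Max(0, filled - keep);
            for (int i = 0; i < limit; i++)
            {
                if (buf[i] == 12 && Matches(buf, i + 1, filled, marker))
                    positions.Add(bufStart + i + 1);
            }

            // Slide the unscanned tail to the front of the buffer.
            int tail = filled - limit;
            Array.Copy(buf, limit, buf, 0, tail);
            bufStart += limit;
            filled = tail;
        }
        return positions;
    }

    private static bool Matches(byte[] buf, int start, int filled, byte[] marker)
    {
        if (start + marker.Length > filled) return false;
        for (int j = 0; j < marker.Length; j++)
            if (buf[start + j] != marker[j]) return false;
        return true;
    }
}
```

Unlike the original loop, this checks every 12 it sees rather than skipping the 14 bytes that follow a candidate. The absolute position comes from bufStart plus the index in the buffer, so the stream's own Position is never consulted inside the loop.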

What's frustrating is that this has to be done at all. The StreamReader obviously maintains an internal point to its own buffer. If that was exposed, then it would be easy to calculate the needed byte position.

In Java there is a random-access file class (RandomAccessFile) which you can use to read and write at arbitrary points in a file. I am not sure if something similar is available in C#, but most of the stuff in the Java API is available in C# somewhere.

However, what the heck are you wanting to do this for?? Are you trying to decompile some application or extract its string tokens or something? Surely there is a better way of doing this?

But there is probably a better way of getting these strings out as well, depending on how the file is formatted. One idea would be to load the file as a string and then use regex pattern matching to extract all the strings that start with the \x0C character (decimal 12) and are 14 characters or less.

This is probably not the fastest mechanism, but it possibly makes up for it if you can drop the file into a string or byte array with one read, as opposed to reading every byte.

C# also provides a Seek() method. That does no good unless you know where to seek, which is what the posted code loop does.
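
For completeness, a small sketch of what Seek() is good for once the positions are known: jumping straight to a recorded offset and reading a few bytes there. SeekExample and ReadAt are made-up names, and the path is whatever file you are working with.

```csharp
using System;
using System.IO;

// Hypothetical helper showing FileStream.Seek with a known absolute offset.
static class SeekExample
{
    // Reads up to count bytes starting at the absolute offset position.
    public static byte[] ReadAt(string path, long position, int count)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(position, SeekOrigin.Begin);
            byte[] result = new byte[count];
            int total = 0;
            while (total < count)
            {
                // Read may return fewer bytes than requested, so loop until done or EOF.
                int n = fs.Read(result, total, count - total);
                if (n == 0) break;
                total += n;
            }
            Array.Resize(ref result, total);   // trim if the file ended early
            return result;
        }
    }
}
```

Something like ReadAt(filename, pagePosition, 14) would fetch the 14 bytes at a recorded page start, but it does nothing to help find pagePosition in the first place.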

Load the file as a string? A 2.3GB string? I don't think so!

The files in question are PCL files. Large "print" files. The files contain PCL Codes, as well as large amounts of raster data. The raster data may contain binary characters that could be of any value. So just searching for decimal "12" doesn't work. Each page ends with a FormFeed (decimal 12), and begins with a set of PCL codes. To determine if the "12" is a real FormFeed, I read the next few bytes to see if they are the "setup page" codes.

Why? Because the file has to be split into "statements", each of which can have an arbitrary number of pages. Then barcodes, sequence numbers, paper tray pull codes, etc. must be added.

The first step is to determine where, in the original file, each page begins and ends. That forms the basis for all other operations. Since those positions can be anywhere in the file, I have to read through the entire file to find them.

A buffered read, which is what StreamReader provides, is ideal. However, the StreamReader object provides no property or mechanism to relate its position back to the actual position of the underlying stream.

None that I could find, anyway.

The larger "why" is, why not work with the original data, and produce my own "finished" print stream? That's what I usually do, and what I prefer. However, some clients shop their print files around. It's nice to be able to say "no problem, I can handle that".

Loading one 2.3GB file may not be an option. What I typically do (in other file parsing problems) is load a byte array of a reasonably large size, one that can still be handled quickly but that covers most cases, or at least doesn't take too many reads to handle the max cases.

65,000 bytes, for example, which wouldn't take too long to get through 2.3GB. The problem with what you're thinking is that there is no guarantee that you have all the data you need in the window you are using, but that would be easily solved.

But that's what you're doing anyway...

Exactly, the program is working and is fast. If the target byte (12) is within the last 14 elements of the char[] array, then I store everything after it into a "holding pen" char[], set a switch, and let the loop fall through to get the next buffer. Then I check the switch, finish out the "holding pen" array, and perform the test with it. All very neat and fast.

But the original question still remains. I'd like to examine the StreamReader object in more detail. Perhaps one could derive a class from it that exposed its pointer to its internal buffer.

One could calculate the "virtual" position of the current character by taking the .Position of the underlying stream, subtracting the size of the buffer, and adding back in the value of the pointer.
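
Here is a rough sketch of that calculation, written as a small byte reader with its own buffer rather than as a class derived from StreamReader (as far as I can tell, StreamReader keeps its buffer index private, and its buffer holds decoded chars rather than raw bytes). PositionedByteReader is a made-up name, not something in the .NET class library.

```csharp
using System;
using System.IO;

// Hypothetical helper: buffered byte reader that can report its absolute position.
sealed class PositionedByteReader
{
    private readonly Stream _stream;
    private readonly byte[] _buffer;
    private int _filled;   // number of bytes currently held in _buffer
    private int _index;    // index of the next unread byte in _buffer

    public PositionedByteReader(Stream stream, int bufferSize = 8192)
    {
        _stream = stream;
        _buffer = new byte[bufferSize];
    }

    // Underlying position, minus what was read into the buffer,
    // plus how far into the buffer we have consumed.
    public long Position
    {
        get { return _stream.Position - _filled + _index; }
    }

    // Returns the next byte, or -1 at end of stream (same contract as Stream.ReadByte).
    public int ReadByte()
    {
        if (_index == _filled)
        {
            _filled = _stream.Read(_buffer, 0, _buffer.Length);
            _index = 0;
            if (_filled == 0) return -1;
        }
        return _buffer[_index++];
    }
}
```

Note that "the size of the buffer" has to be the number of bytes actually read into it on the last fill, not its capacity, or the result is wrong on the final, partly filled buffer.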

Hello,
Looks like you had this discussion back in 2005, but I have a similar situation where I need to read a PCL file and extract some information from it into a text file for my processing and reporting. I have used the StreamReader, but I am not able to see the actual contents of the PCL file when I read it from my app. Any help to point me in the right direction would be greatly appreciated...

Thank You
