Hello.

I am wondering how EOF is indicated to either the cstdio functions or the fstream functions. Mainly I am wondering how it is differentiated from ordinary data. For example, if EOF were a 2-byte code, wouldn't the following break the system by inserting a 'false' EOF flag:

for (int i=0; i<256; ++i)
    for (int ii=0; ii<256; ++ii)
        fout<<(unsigned char)i<<(unsigned char)ii;

And the same argument would obviously extend to an EOF flag of any length. I am wondering if there actually is a way to (accidentally or not) create a false EOF flag. I mainly ask because I am trying to define a file format for some of my classes to save their data to, and I notice that most 'official' file formats start off by listing their size in some way or another; I am wondering if that is to help define where the EOF is.


All 4 Replies

Let the operating system worry about what an EOF is.
Just concentrate on your file format.
The only reason I can think of for why some file formats start with size info is to work out in some way how many "records" are in the file. But the OS file system provides that info as well.

What makes you think that there is supposed to be a sequence of characters at the end of a file that indicates that the file has reached the end?

The EOF flag is artificial; it doesn't appear in the file. If you try to read from or write to a file and the OS determines that you cannot because you have reached the end of it, you get the EOF flag as a result of the operation (either by setting the eofbit in the stream, or by returning the EOF value from the character-read function, or both).
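To make that concrete, here is a minimal sketch of both signalling styles (the file name "data.txt" is just a placeholder): fgetc() returns the special EOF value (an int, which is why c below must be an int and not a char), and an fstream read sets the eofbit; in neither case is anything read out of the file to mark the end.

#include <cstdio>
#include <fstream>
#include <iostream>

int main()
{
    // cstdio style: fgetc() returns the EOF macro (an int, typically -1)
    // once the OS reports that no more data is available.
    std::FILE* f = std::fopen("data.txt", "rb");
    if (f) {
        int c;
        while ((c = std::fgetc(f)) != EOF) {
            // ... process c ...
        }
        std::fclose(f);
    }

    // fstream style: a failed read sets eofbit on the stream; the flag is
    // bookkeeping inside the stream object, not a character stored in the file.
    std::ifstream in("data.txt", std::ios::binary);
    char ch;
    while (in.get(ch)) {
        // ... process ch ...
    }
    if (in.eof())
        std::cout << "reached the end of the file\n";
}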

The other place where you might use the EOF character is in an ASCII file that you want to artificially split into different files. But this only works for text files that will be read and interpreted. If you read a text file (non-binary), the characters that are read are interpreted, and if an EOF character is found, the stream goes into a failed state, indicating that you have reached the end. If you expect more data to exist (i.e., another "file"), then you can clear the error bits and try to read again. This can be useful, for instance, to pipe multiple files between programs in a BASH shell.
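As a rough illustration of that "clear the error bits and read again" idea, the sketch below uses interactive input on a POSIX terminal, where Ctrl-D produces an end-of-input condition that can be cleared; whether a second batch of reads actually succeeds after clear() is platform-dependent.

#include <iostream>

int main()
{
    int value;
    // First "file": read integers until the terminal signals end-of-input
    // (Ctrl-D on a POSIX shell).
    while (std::cin >> value)
        std::cout << "first batch: " << value << '\n';

    // The stream is now in an eof/fail state. Clearing the error bits allows
    // further read attempts, as if a second "file" followed the first.
    std::cin.clear();
    while (std::cin >> value)
        std::cout << "second batch: " << value << '\n';
}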

Such a character is ignored in binary files, and it isn't a meaningful character that should appear for any other reason in a text (ASCII) file. When you do binary reading / writing, the operations are "unformatted", meaning the characters are not interpreted; thus, any bytes that happen to form an EOF character are not interpreted as an EOF, they are just passed along untouched (on read or write).
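Here is a sketch that ties this back to the original question: writing every possible byte value (including 0x1A, the old DOS end-of-file character, and 0xFF, which is what the usual EOF value of -1 looks like as a byte) in unformatted binary mode, then reading them all back; the file name "bytes.bin" is arbitrary.

#include <cstddef>
#include <fstream>
#include <iostream>

int main()
{
    // Write all 256 byte values with unformatted output.
    {
        std::ofstream fout("bytes.bin", std::ios::binary);
        for (int i = 0; i < 256; ++i) {
            char b = static_cast<char>(i);
            fout.write(&b, 1);
        }
    }

    // Read them back: every byte comes through untouched, and eofbit is set
    // only when the OS reports that the data has run out, never by a byte value.
    std::ifstream fin("bytes.bin", std::ios::binary);
    char b;
    std::size_t count = 0;
    while (fin.read(&b, 1))
        ++count;
    std::cout << "read " << count << " bytes\n";   // prints: read 256 bytes
}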

As far as knowing when you have reached the end of the file, that's something that is specific to the operating system, and it has nothing to do with the EOF character or any other "marker" in the file. This kind of information is usually stored in the file system's metadata for the file, along with things like modification dates, ownership, etc. The read/write operations simply use that special EOF value to signal the condition that you've reached the end, not to actually mark the end-of-file within the file.

And the moral of the story here is that if you want to read binary data, read it as binary data; don't "trick" the stream by reading/writing binary data as formatted characters.

The only reason I can think of for why some file formats start with size info is to work out in some way how many "records" are in the file. But the OS file system provides that info as well.

Generally, the file size provided by the OS is not reliable enough (it sometimes includes meta-data, padding, etc.). You can use the seekg( end ) / tellg() / seekg( beg ) method to find the exact size, but this is not guaranteed to be fast; it could amount to reading the entire file content.
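A minimal sketch of that seek-to-the-end technique (the file name "data.bin" is a placeholder; the result is a byte count for a stream opened in binary mode):

#include <fstream>
#include <iostream>

int main()
{
    std::ifstream in("data.bin", std::ios::binary);
    if (!in)
        return 1;

    // Jump to the end, read off the position, then jump back to the start.
    in.seekg(0, std::ios::end);
    std::streamoff size = in.tellg();
    in.seekg(0, std::ios::beg);

    std::cout << "file holds " << size << " bytes\n";
}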

Another reason to have the size in the header is for integrity verification. If you get to the end of the file before you expect to, or if, for some other reason, there is a mismatch in the size, you must assume that the file is corrupt.
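As a sketch of that integrity check, assuming a purely invented header layout where the first 8 bytes store the expected payload size:

#include <cstdint>
#include <fstream>
#include <iostream>

int main()
{
    std::ifstream in("record.bin", std::ios::binary);
    if (!in)
        return 1;

    // Hypothetical header: the first 8 bytes hold the expected payload size.
    std::uint64_t declared = 0;
    if (!in.read(reinterpret_cast<char*>(&declared), sizeof declared))
        return 1;

    // Measure how much data actually follows the header.
    std::streamoff payload_begin = in.tellg();
    in.seekg(0, std::ios::end);
    std::streamoff file_end = in.tellg();
    std::uint64_t actual = static_cast<std::uint64_t>(file_end - payload_begin);

    if (actual != declared)
        std::cerr << "size mismatch: the file is probably corrupt or truncated\n";
}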

And finally, many file formats are actually the result of serialization code, and generally, when writing serialization code, you want to save some meta-data for each chunk of data (object) that you write, usually including a version number, a unique ID, and the size of the chunk. The reason for the size is so that the object can be ignored if it is not supported by the application reading the file; this is a basic form of forward compatibility (making older software compatible with more recent versions of the file format / software by ignoring unsupported features). Also, if you just want to ignore some objects, you can simply skip ahead. So, since individual objects are usually saved this way, it is not uncommon for the entire file to contain one big object, with a similar "size" field at the start, which, of course, makes less sense since you wouldn't skip the entire file. But then again, there could be a collection of big objects in the file, so maybe that "size" you see at the start is just the size to skip to get to the next big object.
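A sketch of that per-chunk layout is below; the field names, their widths, and the on-disk order are invented for illustration (and endianness and struct padding are ignored), not taken from any real format:

#include <cstdint>
#include <fstream>
#include <iostream>

// Hypothetical chunk header: a type ID, a format version, and the number of
// payload bytes that follow the header.
struct ChunkHeader {
    std::uint32_t id;
    std::uint32_t version;
    std::uint64_t size;
};

int main()
{
    std::ifstream in("objects.bin", std::ios::binary);
    ChunkHeader h;
    while (in.read(reinterpret_cast<char*>(&h), sizeof h)) {
        if (h.id == 1 && h.version <= 2) {
            // A chunk this reader understands: parse the payload here
            // (omitted), then continue after it.
            in.seekg(static_cast<std::streamoff>(h.size), std::ios::cur);
        } else {
            // Unknown or too-new chunk: the size field lets us skip it
            // without understanding its contents (forward compatibility).
            in.seekg(static_cast<std::streamoff>(h.size), std::ios::cur);
        }
    }
}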

In other words, there is a bit of a tradition of always putting a "size" value somewhere in the header of the file, or of a chunk of the file, because it is a useful piece of information for both forward and backward compatibility.

Exactly what I was looking for. Thank you guys :)
