This is currently something in design phase, but I'd like to have the idea checked over.

The situation is roughly as follows:
I've got a display that can show some amount of structs with varying byte counts per struct.
I've got a big binary file, over a gigabyte in size, filled with data that is parsed into structs.


My task?
Make the entire file viewable with (seemingly) minimal process time.

My current solution is the following:
To both sides of the currently viewed structs (ie, previously viewed structs and the next structs to be viewed), buffer the structs in memory and only performing operations on the buffered structs.

As the view moves, there are defined 'buffer thresholds' where, if the view reaches the threshold, the program deallocates memory for some number of buffered structs that the view is 'moving away from' and then shifts the remaining structs into the 'cleaned space'. The buffer reads in structs from the file, and then waits for a threshold to be passed.

So far, so good. I'm thinking a vector would be a good choice for the buffer. Advancing through the file (scrolling down) is not a problem.

But scrolling up is.

Initially, I thought that it would be all right to rewind() and come back down from the start of the file.
However, as the files get longer, running from the start to the desired location is frankly insane and wastes a lot of time. There needs to be a way to move the buffer back through memory.

So, fseek.
Fseek looks good, but it requires a long (number of bytes between the location and the origin) to go back. No problem?
Problem: The number of bytes in each struct is not constant. Going on bytecount alone without struct reference means that the whole process gets mubar (gotta think of the kiddies).

That makes me think about a dynamically-allocated set of longs that are set with byte values at the end of particular numbers of structs, allowing for the file to return to different positions based on SEEK_SET.
But that limits file size to 2 gigabytes, and I need to be able to scroll through more information than that. I need to be able to open files of, let's say, 20 gigabytes and be able to transverse the whole thing (if it were possible) as easily going up as going down, at least, from the user's perspective.

That's my idea. Is it any good as a starting point?

If not, anyone have suggestions on how to accomplish this? For example, pointers to positions in the file after an arbitrary number of structs that conveniently fit buffer sizes?

Thank you for your time.

Recommended Answers

All 12 Replies

Just found something that made me happy.
fgetpos and fsetpos.

Thanks, K&R!

So that solved your problem?

This is still in design phase, mind you. But being able to just get pointers to positions in a file stream and storing them in another vector is a much better approach than trying to juggle numbers and who knows what else as a workaround for the maximum value of a signed long.

And, a very nice means of handling jump to lines falls out of it, to boot...

So, I think it'll do. I'm pretty sure there's no other way for a novice like me to do this.

If you are on a windows OS and dont need portability consider memory mapped files.

(Reads up on memory maps)
Interesting thought, but I can't be sure of the memory setup for each computer this will be running on. And I'm not willing to play around with that while I'm drawing a paycheck.

Thanks, though. I'll think about that if things get too strained in this setup... but the fgetpos and fsetpos should do for now.
There's been an unofficial saying that 4 gigs will be about as large as it needs to go, so that's a nice piece of info to have.

Soon after discovering those functions, the guy with the Schlidt book insinuates that it won't work, because the pointers to the position in the file will be overwritten or something.

How does a file open work (in c, as I'm using C functions)?
fopen returns type FILE *, then you do your gets and puts... but are you moving the whole file into active memory, or disk reading it, which causes no extra space to be used?

Hm. Now that I've thought about it aloud, it seems rather absurd that large binary files would be moved to RAM in a sliding buffer.
I'm going to have to check with him today, find out what the heck he meant to try to scare me about.

>How does a file open work (in c, as I'm using C functions)?
fopen returns a pointer to FILE (usually a structure), which is basically the bookkeeper for the low level information required to read and write the file from disk and maintain a buffer to avoid excessive device requests. A very simple example would be:

struct _iobuf {
  int fd;
  int offset;
  char *buff;
  char *curr;
  int bufsize;
  int left;
};

typedef struct _iobuf FILE;

The fd field is the file identifier, the offset field is the number of characters into the file we are, the buff field is the buffer, the curr field is the next character in the buffer, bufsize is the size of the buffer, and left is how many characters are left in the buffer. The left field is needed because the buffer might not be completely filled, such as if a read error occurs or end-of-file is reached.

Say you open a file with the contents "This is a test" and the call fopen ( "somefile", "r" ). If we use a buffer size of 6, the contents of the FILE object would be (leaving out fd and bufsize because they don't change):

offset = 0
buff = "This i"
curr = "This i"
left = 6

If you read a single character from the file using fgetc, the resulting object would be:

offset = 1
buff = "This i"
curr = "his i"
left = 5

This continues until the end of the buffer is reached by counting left down to 0:

offset = 6
buff = "This i"
curr = ""
left = 0

At this point the next request for input will require a device read starting at offset and reading bufsize characters into buff. Then curr is set to buff, and left is set to the number of chracters read:

offset = 6
buff = "s a te"
curr = "s a te"
left = 6

Once again, the buffer is completely filled and can be completely read, as above. But the next buffer fill is different because there are only two characters left. Supposedly, in this hypothetical implementation, the system call used to read from the file will signify end-of-file.

offset = 12
buff = "st"
curr = "st"
left = 2

Two characters are read, left goes to 0, and the next request for input fails due to end-of-file. Okay, this is all well and good in theory, but how does it really work? Here is a quick and dirty C implementation: :)

#include <stdio.h>
#include <stdlib.h>

struct jsw_iobuf {
  FILE *fd;
  char *buff;
  char *curr;
  int bufsize;
  int left;
};

typedef struct jsw_iobuf JSW_FILE;

JSW_FILE *jsw_fopen ( const char *path )
{
  JSW_FILE *ret = malloc ( sizeof *ret );

  if ( ret == NULL )
    goto error;

  ret->fd = fopen ( path, "r" );

  if ( ret->fd == NULL )
    goto error;

  ret->bufsize = 6;
  ret->buff = malloc ( 6 );

  if ( ret->buff == NULL )
    goto error;

  ret->left = fread ( ret->buff, 1, 6, ret->fd );
  ret->curr = ret->buff;

  return ret;

error:
  free ( ret );
  return NULL;
}

void jsw_fclose ( JSW_FILE *fp )
{
  fclose ( fp->fd );
  free ( fp->buff );
  free ( fp );
}

int jsw_fgetc ( JSW_FILE *in )
{
  if ( in->left == 0 ) {
    /* Refill */
    in->left = fread ( in->buff, 1, in->bufsize, in->fd );
    in->curr = in->buff;
  }

  if ( --in->left < 0 )
    return EOF;

  return *in->curr++;
}

int main ( void )
{
  JSW_FILE *in = jsw_fopen ( "test.txt" );
  char ch;

  while ( ( ch = jsw_fgetc ( in ) ) != EOF )
    printf ( "%c\n", ch );

  jsw_fclose ( in );
}

Not very magical, is it? Sure, I used a portable approach, so this implementation is very contrived, and somewhat silly, but (barring the really irritating details) that's the gist of what happens when you open and read from a file stream.

>but are you moving the whole file into active memory, or disk
>reading it, which causes no extra space to be used?
A combination of the two. Portions of the file are buffered in memory, so extra space is used, but not enough for the entire file if you're using a large file. Typically though, small files of less than a few kilobytes will be small enough to completely fit in the buffer.

>because the pointers to the position in the file will be overwritten or something
It depends on how dynamic the file is. If you can expect simulaneous writes to the file while using the sliding window (such as with a multi-threaded environment), then it's very possible that without locking the file, you risk having your offsets invalidated when the file changes. If the file isn't going to change, there isn't a problem, but the entire concept of a framed file viewer is tricky.

commented: Excellent answer to my question; very helpful to my understanding of C file I/O +1

Thanks for the info; that's really helpful.

I talked with my direct supervisor, and got him to explain the trouble again.The program reads structs one at a time, so it's reasonable that for large files, a pointer to a location in memory could very well be overwritten.

Further consulting with the overlords reveals that they don't expect even 2 gig files (more on the order of 100MBs), so the fseek offset approach will be acceptable, creating a struct index-byte offset pair map.

But I'd still like to know, just for the purpose of education:
Does fgetpos modify the pointer input to point to a location in the file on disk, or to the location in memory where the file is being buffered?

By the way: thanks for all the help thus far. It's really great to be able to understand how the language works.

>Does fgetpos modify the pointer input to point to a location
>in the file on disk, or to the location in memory where the file
>is being buffered?
fgetpos just saves the current location indicator (which would most likely be some form of integral offset). It doesn't modify anything.

Bad choice of words. I was trying to ask if that meant the fpos_t pointer is set to the location in the file itself, not in the buffer.

That way, reading one structure at a time does not cause a number of fpos_t pointers to unintentionally point to the same place.

>the fpos_t pointer is set to the location in the file itself, not in the buffer.
Correct. The buffer is simply a transparent optimization to avoid device accesses. It doesn't even have to exist for the system to work, so fpos_t is treated as if it were a direct index into the file. It may not be, but it still behaves that way. :)

That's what I thought... because it didn't make sense to have fgetpos and fsetpos operate on the buffer alone. Just nutty, really.

Thanks again!

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.