Hey,

This is a bit of a lengthy problem to explain, so please bear with me.

My goal:
Read in a data file (written in ASCII) and convert it to a smaller format (such as binary)

The Problem:
I start off with a proprietary format called Heim RawData. I currently don't have access to how they pack their data, and I don't think I'd be allowed to share it even if I did. Each raw file is approximately a gig in size; their proprietary software can take a sample of that (I usually take between 1 and 10 seconds' worth) and convert it into an ASCII data format. We're trying to do data analysis, so the more data we can look at at once the better, but in ASCII each digit or letter takes 8 bits, which increases memory use drastically. I'm trying to automate the system so that we don't have to use Excel for part of our analysis, since it is limited to roughly 65,000 values, or about 1,000,000 in Excel 2007.

When writing to a file, C generally prints in ASCII, correct? Is it possible to write in a binary mode with C? Is there a better language for this kind of data packaging? I'm only using C because the analysis team will most likely be using C as well. What kind of formatting would be best?

Even if I write in hex notation, doesn't it still write to the file in ASCII? I don't think that will substantially reduce my file size, and I'll have to convert back to decimal for analysis.

If someone can provide me with some insight or recommendations that would be greatly appreciated.

Best Regards,
Andrew C.

I don't think writing the values to the file in ASCII or in binary is going to make any big difference in terms of file size. It's all the same as far as I can see.

Yes, there are functions and routines you can use to write data to a file in binary mode. First you need to open the file for writing in binary mode. Then you use the fwrite and fread functions to write to and read from the file, respectively.

You might also use the fseek and ftell functions to position the file pointer at the right location in the file.
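
A minimal sketch of the calls mentioned above, assuming the values are plain ints and that the same machine writes and reads the file (the size and byte order of an int are not portable between platforms). The names write_ints, read_ints and file_size are made up for illustration.

```c
#include <stdio.h>

/* Write `count` ints to `path` as raw bytes.
   Returns 0 on success, -1 on failure. */
int write_ints(const char *path, const int *values, size_t count)
{
    FILE *fp = fopen(path, "wb");            /* "b" selects binary mode */
    if (fp == NULL)
        return -1;
    size_t written = fwrite(values, sizeof values[0], count, fp);
    if (fclose(fp) != 0 || written != count)
        return -1;
    return 0;
}

/* Read up to `max` ints from `path`.
   Returns the number of ints actually read (0 if the file cannot be opened). */
size_t read_ints(const char *path, int *values, size_t max)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return 0;
    size_t got = fread(values, sizeof values[0], max, fp);
    fclose(fp);
    return got;
}

/* Size of the file in bytes, found by seeking to the end and asking
   for the position. Returns -1 on error. */
long file_size(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    long size = -1;
    if (fseek(fp, 0, SEEK_END) == 0)
        size = ftell(fp);
    fclose(fp);
    return size;
}
```

Writing three ints this way produces a file of 3 * sizeof(int) bytes, with no digits or line endings in it. The C standard does not strictly guarantee SEEK_END on binary streams, but it works on the common desktop platforms.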

ssharish

Let's see if I can try to explain myself better with this small test code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
	int test_num = 216;
	int test_num2;
	char test_string[10];
	FILE *fp_asciiHeim, *fp_converted;

	//create a read/write dummy file
	if((fp_converted = fopen( "data01.txt", "w+" ))==NULL) {
		printf("cannot open file");
		exit(1);
	}

	//fscanf(fp_asciiHeim, "%d", test_num);
	//fscanf(fp_asciiHeim, "%s", test_string);

	// write the decimal value to the file
	fprintf(fp_converted, "%d\n", test_num);
	//fprintf(fp_converted, "%s\n", test_string);

	//reset the cursor to the beginning
	fseek (fp_converted, 0, SEEK_SET);

	//try reading the number either with a string or a decimal
	fscanf(fp_converted, "%d", test_num2);
	//fscanf(fp_converted, "%s", test_string);
	printf("%d\n", test_num2);
	//printf("%s", test_string);
	_fcloseall();

	return 0;
}

So in the code above, I basically write a decimal to a file and try reading it back. When you use fprintf, it prints the value in ASCII; when I try reading it back with fscanf using a %d specifier, I get some random value. However, using %s and a character array I can print out the correct output.

I guess I don't really understand how fscanf works in terms of scanning for a decimal or a float value. Ideally, I would just like to be able to use %d and get the 216 stored in an int. That would mean that I'm not using ASCII anymore to write to a data file. The gist is that I don't want to be storing the data values in ASCII, especially when I'm dealing with millions of data points.

Which functions/libraries/specifiers do I use to write in binary? Maybe I shouldn't even be writing to a .txt?

OK, some basics:
1) Whether you write out the values as binary or decimal, you'll be using ASCII if your OS uses ASCII (and yours almost certainly does). Binary files will be smaller than text files because they have no \r\n (carriage return, line feed) pairs added to them.

Binary is just a mode for output, like text mode. They're BOTH ASCII if you're on an ASCII OS.

You're not getting the right value back from the file because you're not using fscanf() properly. You need to give it the address of the variable (&test_num2), not just the name of the variable as you do with fprintf(). AHA!
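
A small sketch of that fix, reduced to one function. The name text_round_trip is hypothetical, and returning -1 on failure is a simplification for this example (it cannot be told apart from a stored -1).

```c
#include <stdio.h>

/* Write `value` as decimal text, rewind, and read it back.
   Returns the value read, or -1 on any failure. */
int text_round_trip(const char *path, int value)
{
    int result = -1;
    FILE *fp = fopen(path, "w+");
    if (fp == NULL)
        return -1;

    fprintf(fp, "%d\n", value);   /* fprintf takes the value itself */
    rewind(fp);                   /* back to the start before reading */

    /* fscanf takes the ADDRESS of the destination and returns the
       number of items it managed to convert. */
    if (fscanf(fp, "%d", &result) != 1)
        result = -1;

    fclose(fp);
    return result;
}
```

Calling text_round_trip("data01.txt", 216) should give back 216. Checking the return value of fscanf is what tells you whether the conversion actually happened.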

You could compress the data, but it still won't help the display of the data, because it will be unreadable to the human eye, until it's decompressed again. To really see a lot of data, I'd look into setting up two or three monitors and put a portion of the data on each of the monitors (I believe it's called a "Panoramic View", and software can do this for you, you don't have to write your own code.).

As I understand it, you have a Gig or so of data, and you want to be able to analyze a larger part of it than Excel will handle.

What I don't understand is why you need to view it all. Who can make sense of that much data at a time?

I'd look for ways to leave the raw data file strictly as it is. Write your program to do the analysis you want on that data, and display whatever part of the data or analysis you want, on request.

Well, we're sampling at 1 MHz, so within one second we get a million samples. Each recording runs for about half an hour or more, hence the gig in data size (using their RAW format). If I expanded that entirely to ASCII, it would be over 5 gigs' worth of data. Of course, we can't make sense of all of it, but my job is to reduce the noise, which (I'm converting an existing Excel macro) cuts it down by 75%.

Anything meaningful is more likely to happen over a 5-15 second block, which, while small compared to the overall amount of data, is still a lot to work with.

In the meantime, I'll mess around with the C code and I'll post back when I make some progress or run into more problems.

The other reason for cutting down the file size: we may eventually want to feed the data into a system to recreate the signal, and the buffer on the equipment just isn't that large. Also, I doubt it can read the proprietary format.

accho, as I said before, writing data to a file in either binary or ASCII is not going to make a difference apart from a few kilobytes. If you are really concerned about reducing the file size, you need to think about compressing the file.

If you still want to handle the data in binary, then look back at my previous post. If you are dealing with ASCII, then the functions you have been using are right, but you should go through the manuals for those functions to find out how to use them properly.

ssharish

So I managed to get my hands on a license for the proprietary program to convert the data files into a binary format, meaning a .bin file.

If I use a hex editor to look at the data, the number 21376 is stored as follows:
in ASCII format: (hex) 32 31 33 37 36 (5 bytes)
in binary format: (hex) 53 80 (2 bytes)

Hopefully, this clarifies what I've been trying to describe. I actually see ~50% file reduction size with a binary format in comparison to ASCII.
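
To make the comparison concrete, here is a small sketch that computes both representations for one sample. It assumes the samples fit in an unsigned 16-bit value and that the high byte comes first, as in the hex dump above; ascii_length and to_bytes are names invented for this example.

```c
#include <stdio.h>
#include <stdint.h>

/* Number of characters needed to print `value` as decimal text
   (not counting any line ending). */
int ascii_length(uint16_t value)
{
    char text[8];   /* a 16-bit value needs at most 5 digits plus '\0' */
    return snprintf(text, sizeof text, "%u", (unsigned)value);
}

/* Split `value` into its two raw bytes, high byte first. */
void to_bytes(uint16_t value, unsigned char out[2])
{
    out[0] = (unsigned char)(value >> 8);
    out[1] = (unsigned char)(value & 0xFF);
}
```

For 21376, ascii_length returns 5 and to_bytes fills in 0x53 and 0x80, so the text form costs 5 bytes against 2 for the raw form, before counting any line endings.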

So now I have successfully got the data into a binary format, but I still have the problem of not knowing how to manipulate the data in binary. I know you mentioned just looking up the library functions, but if you could help point me in the right direction that would be great.
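
One possible starting point, sketched under assumptions: the file holds nothing but unsigned 16-bit samples, high byte first (as the 53 80 dump suggests), and it was opened with "rb". The function name read_samples_be is hypothetical. Building each value from its two bytes avoids depending on the byte order of the machine running the code.

```c
#include <stdio.h>
#include <stdint.h>

/* Read up to `max` 16-bit samples stored high byte first.
   Returns the number of complete samples read. */
size_t read_samples_be(FILE *fp, uint16_t *samples, size_t max)
{
    unsigned char pair[2];
    size_t n = 0;

    /* fread returns how many items it actually read, so the loop
       stops cleanly at end-of-file or on a trailing odd byte. */
    while (n < max && fread(pair, 1, 2, fp) == 2) {
        samples[n++] = (uint16_t)((pair[0] << 8) | pair[1]);
    }
    return n;
}
```

Once the samples are in a uint16_t array they can be processed like any other integers. If the samples turn out to be signed, or low byte first, the conversion line is the only part that needs to change.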

In the binary file (viewed in a hex editor), each 16-bit word is followed by 0d 0a, which are the ASCII codes for carriage return and newline. I want to remove the carriage return and newline so that my binary file just has the 16-bit words one after the other. So basically I need to delete every other 16 bits in the binary file. I've tried reading one byte at a time with fread and using fwrite to write only the first 2 of every 4 bytes to a new .bin file, because I can't simply filter on the byte values: there is a possibility the data itself is 0d 0a.

Are there buffer issues I should be aware of when using fread()? I cycle through the opened binary file with a while(!feof(fp_heimData)) loop, but when it finishes I only see 1 kb worth of data when it should be closer to 3000 kb. The original file size is 6000 kb.

and here's the c code I've written:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
	//initialize variables
	int count = 0;
	char * hex_test;
	FILE *fp_readBin, *fp_writeBin;
	
	//open binary file to read and binary file to write
	if((fp_readBin = fopen( "Ftrans.Bin", "r" ))==NULL) {
		printf("cannot open file");
        exit(1);
	}
	if((fp_writeBin = fopen( "Ftrans2.Bin", "w" ))==NULL) {
		printf("cannot open file");
        exit(1);
	}
	
	//remove all character codes
	hex_test = (char*) malloc (sizeof(char));
	/*while(!feof(fp_readBin)){
		
		fread(hex_test,1,1,fp_readBin);
		if(count == 0 || count == 1){
			fwrite(hex_test,1,1,fp_writeBin);
		}

		if (count == 3)
			count = 0;
		count++;
	}*/

	//testing to see if it will cycle through the entirety of the file
	while(!feof(fp_readBin)){
		fread(hex_test,1,1,fp_readBin);
		if(*hex_test != 0x0d || *hex_test != 0x0a){
			fwrite(hex_test,1,sizeof(hex_test),fp_writeBin);
		}
	}
	printf("finished\n");
	fclose(fp_readBin);
	fclose(fp_writeBin);
	free(hex_test);
	return 0;
}

I've been told Perl works great for file manipulation, but I would prefer to stay in C to make it easier to work with other programmers once I finish up my internship.

Thanks again! Much appreciated.

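hex_test = (char*) malloc (sizeof(char));
You don't need to cast the return value of malloc in C. In fact, there's no need to allocate memory for a single character at all. If all you need is a char, declare a char.

while(!feof(fp_readBin))
feof should never be used as a loop condition. It doesn't work the way you expect: it only returns true after a read has already hit end-of-file, so the loop body runs one extra time on stale data. Test the return value of fread instead. Note also that the test needs && rather than || (with || the condition is always true), and that sizeof(hex_test) is the size of the pointer, not of the one byte you read.

So instead of:

while(!feof(fp_readBin)){
	fread(hex_test,1,1,fp_readBin);
	if(*hex_test != 0x0d || *hex_test != 0x0a){
		fwrite(hex_test,1,sizeof(hex_test),fp_writeBin);
	}
}

use:

char ch;
while( fread( &ch, 1, 1, fp_readBin ) ) {
    if (ch != 0x0d && ch != 0x0a)
        fwrite( &ch, 1, 1, fp_writeBin );
}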
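
The loop above drops bytes by value, so it would also drop any sample byte that happens to be 0d or 0a, which is the concern raised in the question. Below is a sketch that drops them by position instead, assuming every record is exactly 4 bytes (2 data bytes followed by 0d 0a). The function name strip_crlf_records is made up. Both files should be opened with "rb" and "wb": on Windows, text mode translates CR LF pairs and treats byte 0x1A as end-of-file, which may be why the output stopped after about 1 kb.

```c
#include <stdio.h>

/* Copy `in` to `out`, keeping the first 2 bytes of every 4-byte record
   and dropping the trailing 0d 0a.
   Returns the number of records copied, or -1 if a record does not end
   in 0d 0a or a write fails. */
long strip_crlf_records(FILE *in, FILE *out)
{
    unsigned char rec[4];
    long records = 0;

    /* Loop on the return value of fread, not on feof. */
    while (fread(rec, 1, sizeof rec, in) == sizeof rec) {
        if (rec[2] != 0x0d || rec[3] != 0x0a)
            return -1;              /* the layout is not what we assumed */
        if (fwrite(rec, 1, 2, out) != 2)
            return -1;
        records++;
    }
    return records;
}
```

Typical use would be to open "Ftrans.Bin" with "rb" and "Ftrans2.Bin" with "wb", call the function once, and close both files. A return of -1 is a signal to look at the file in the hex editor again before trusting the output.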
