Hi guys, it's me again..

I was wondering if anyone could tell me how to test whether a character is within the ASCII range, as in the following pseudocode:

ifstream in;
	in.open("file.xxx", ios::binary); 
	
	if(!in)
	{
		cerr << "file.xxx could not be opened. \n";
	}
		
	while (!in.eof())
	{
		char c = in.get();
		if (c > 127) // not working, not detecting if file is ascii or binary
		{
			cout<< "file.xxx is non-ASCII binary.\n";
			break; 
		}
		else
		{
			cout << "file.xxx is ASCII binary (text). \n";
			break; // how do i do it so that this break condition is only executed when end of file has been reached without any characters outside the ASCII range?
		}
	}

	in.close();

#include <ctype.h>
// or #include <cctype> in C++
if (isascii(c))
{...

It's the same as (c&~0x7F)==0...
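A note on why the original `c > 127` test never fires: on platforms where plain char is signed, bytes 0x80..0xFF come out as negative values, so they are never greater than 127. Also, isascii is a POSIX extension rather than part of standard C or C++, so it may not be available everywhere. A minimal portable sketch of the same bit test follows; the helper names is_ascii_byte and all_ascii are made up for illustration.

```cpp
#include <string>

// True when the byte's value is in 0..127. Converting to unsigned char first
// means bytes 0x80..0xFF are seen as 128..255 instead of negative numbers.
inline bool is_ascii_byte(char c)
{
    return (static_cast<unsigned char>(c) & ~0x7Fu) == 0;
}

// True when every byte of the string is ASCII (an empty string counts as ASCII).
inline bool all_ascii(const std::string& s)
{
    for (char c : s)
        if (!is_ascii_byte(c))
            return false;
    return true;
}
```

With this, the loop in the question could call is_ascii_byte(c) instead of comparing c against 127.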

Cool thanks I'll try it out tomorrow :)

So anyway I tried using the isascii function, with the following code:

.....
        ifstream inFile;
	inFile.open(fullpath.c_str());
	
	if(!inFile)
	{
		cerr << fullpath << " could not be opened. \n";
	}
		
	inFile.seekg(0,ios::end);
	long length = inFile.tellg();
	inFile.seekg(0,ios::beg);

        char* buffer = 0;
	buffer = new char[length];

	inFile.get(buffer,length);
	bool type = 1;
	int i = 0;

	do
	{
		type = isascii(buffer[i]);
		i++;
	} while ((i < length) && (type == 1));

	if (type == 1)
	{
		cout << fullpath << " is ASCII binary (text). \n";
	}
	else
	{
		cout << fullpath << " is non-ASCII binary. \n";
	}
....

which should work, right?

But no, it doesn't, because every time it encounters a newline character (0xCD), it decides that the file is non-ASCII.. =/

Here's code that compiles:

#include <iostream>
#include <fstream>
#include <cctype>

using namespace std;

int main()
{
	ifstream inFile;
	ofstream outFile;
	outFile.open("list.txt");

	outFile << "Line 1" << endl
			<< "Line 2.." << endl
			<< "This is" << endl
			<< "a simple text"<< endl
			<< "file.."<< endl;

	outFile.close();

	inFile.open("list.txt"); 
	
	if(!inFile)
	{
		cerr << "File could not be opened. \n";
	}
		
	inFile.seekg(0,ios::end);
	long length = inFile.tellg();
	inFile.seekg(0,ios::beg);

	char* buffer = 0;
	buffer = new char[length];

	inFile.get(buffer,length);
	bool type = 1;
	int i = 0;

	do
	{
		type = isascii(buffer[i]);
		cout << type << endl;
		i++;
	} while ((i < length) && (type == 1));

	if (type == 1)
	{
		cout << "File is ASCII binary (text). \n";
	}
	else
	{
		cout << "File is non-ASCII binary. \n";
	}
	inFile.close();
	delete[] buffer;
}

If you run a debugger on it, you'll see that when it gets to the character after "1" in "Line 1", the value pointed to by buffer is 0xCD, which is a weird character that I can't type in.. It then sets type to 0 and exits the do-while loop. This in turn reports the file as non-ASCII, because of the 0xCD character..

Update:

It was solved by a friend by inserting memset(buffer, 0x00, length); after the buffer declaration..

Solved?

inFile.get(buffer,length);
length is the length of the file, but get only reads a single line ?

0xCD is what the Microsoft debug run-time fills newly allocated heap memory with, so that you can easily spot use of uninitialised data.
Just filling the buffer with memset seems like sweeping the problem under the carpet, not a solution.
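To make the difference concrete, here is a small sketch that feeds the same data to get(buffer, n) and to read(buffer, n) and reports gcount() for each. It uses an in-memory std::istringstream instead of a file so it stays self-contained; the function names count_with_get and count_with_read are hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Returns how many characters get(buffer, n) stored. It stops at the first
// '\n' (left in the stream) and stores at most n - 1 characters plus a '\0'.
inline std::streamsize count_with_get(const std::string& data)
{
    std::istringstream in(data);
    std::vector<char> buffer(data.size() + 1);
    in.get(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    return in.gcount();
}

// Returns how many characters read(buffer, n) stored. Newlines are copied
// like any other byte and no terminator is appended.
inline std::streamsize count_with_read(const std::string& data)
{
    std::istringstream in(data);
    std::vector<char> buffer(data.size() + 1);
    in.read(buffer.data(), static_cast<std::streamsize>(data.size()));
    return in.gcount();
}
```

For the two-line input "Line 1\nLine 2..\n", get stores only the 6 characters of the first line, while read stores all 16 bytes. Everything past what get stored is whatever the allocator left in the buffer.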

Yes, I checked it out, and it turns out that inFile.get only reads characters until a newline is found, so everything after the first line never gets read into the buffer...

So, it's not solved yet.. I'm now trying inFile.read() to get the characters, with the file opened in binary mode, but now I have a new problem..

Allocating a dynamic array of [length] size seems wasteful, especially if I'm opening huge files just to check whether they have non-ASCII characters, so I was thinking of something along the lines of

{
	ifstream inFile;
	inFile.open("list.txt", ios::binary);
	
	if(inFile.good())
	{
		bool type = 1;
		const int streamsize = 100;
		char* buffer = 0;
		buffer = new char[streamsize];
		
		while (!inFile.eof())
		{
			memset(buffer,0x00,streamsize);
			inFile.read(buffer,streamsize);
			
			int i = 0;

			do
			{
				type = isascii(buffer[i]);
				i++;
			} while ((i < streamsize) && (type == 1) );

			if ((type == 1) && !inFile.eof())
			{
				buffer = buffer + streamsize;
			}
			
			else if ( type == 0 )
			{
				cout << "list.txt is non-ASCII binary file. \n";
				return;
			}
		}
		if ((type == 1))
		{
			cout << "list.txt is ASCII binary file (text).\n";
			return;
		}
	}
	else
	{
		cerr <<  "list.txt could not be opened. \n";
		return;
	}
	
}

How's this?

1. If you want to get/check every byte in the file, open it in binary mode then use read() member function:

inFile.open("list.txt",ios::binary);
...
if (!inFile.read(buffer,length))
{
    cerr << "Can\'t read a file" << endl;
    return 2;
}

2. The right condition after the loop:

if (i >= length) // not if (type == 1)

3. This is not a very good way to identify a text file. For example, a file consisting entirely of zero bytes is (obviously) not a text file, but your program classifies it as one. Think about a more reliable method...
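One possible stricter rule, offered only as an assumption about what "text" might mean here: accept printable ASCII plus tab, line feed and carriage return, and reject everything else, including NUL bytes. The names is_text_byte and looks_like_text are hypothetical, and a real classifier may well need different rules (other control characters, other encodings).

```cpp
#include <cstddef>

// One possible rule (an assumption, not a standard): a byte is "text" if it is
// printable ASCII (0x20..0x7E) or one of tab, line feed, carriage return.
inline bool is_text_byte(unsigned char b)
{
    return (b >= 0x20 && b <= 0x7E) || b == '\t' || b == '\n' || b == '\r';
}

// True when all n bytes pass is_text_byte. A block containing NUL bytes fails.
inline bool looks_like_text(const char* data, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        if (!is_text_byte(static_cast<unsigned char>(data[i])))
            return false;
    return true;
}
```

Under this rule a block of all zero bytes is rejected, while ordinary lines ending in "\r\n" or "\n" are accepted.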

1. Yes, I'm trying to use the read() function now and opening it in binary mode..

2. The (type == 1) condition is because I'm trying to check if a particular character is within the ascii range..

3. I think you may be right..


What I really need is a program that:

1.) Takes an input filename;
2.) Tries to open the file;
3.) Reads a block of memory (say, 100 bytes) and checks if there are any non-ASCII characters in it:
3.a.) If there are, stop reading immediately and mark the file (by pushing the bool value into a structure with the filename).
3.b.) If there are none, get the next 100 bytes of the file and go back to step 3 (read a block of memory). If end-of-file is reached without encountering non-ASCII characters, mark the file as ASCII, by pushing the bool value into a structure with the filename.

Anyone know how to implement this in a better way? Maybe I'm going about it all wrong..
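The steps listed above could be sketched roughly as follows. This is only one way to do it: the FileCheck struct, the check_file function and the default block size of 100 are assumptions made for illustration. It relies on gcount() to learn how many bytes each read() actually delivered, so the buffer never needs to be pre-filled.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical result record: the file name plus what the scan found.
struct FileCheck
{
    std::string name;
    bool opened;   // false if the file could not be opened
    bool ascii;    // meaningful only when opened is true
};

// Reads the file block by block and stops at the first byte outside 0..127.
inline FileCheck check_file(const std::string& name, std::size_t blockSize = 100)
{
    FileCheck result{name, false, false};

    std::ifstream in(name.c_str(), std::ios::binary);
    if (!in)
        return result;                       // step 2 failed
    result.opened = true;

    // Guard against a zero block size, which would read nothing at all.
    std::vector<char> buffer(blockSize > 0 ? blockSize : 1);
    for (;;)
    {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();   // bytes actually read
        if (got <= 0)
            break;                           // end of file, nothing left
        for (std::streamsize i = 0; i < got; ++i)
        {
            if (static_cast<unsigned char>(buffer[static_cast<std::size_t>(i)]) > 127)
                return result;               // step 3.a: ascii stays false
        }
    }
    result.ascii = true;                     // step 3.b: reached the end cleanly
    return result;
}
```

The returned FileCheck values can be collected in a container such as std::vector<FileCheck> if several files need to be marked.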

Okay I think I somewhat got it now..

{
	ifstream inFile;
	inFile.open("List.txt", ios::binary);
	
	bool type = 1;
	const int streamsize = 20;
	char* buffer = 0;
	buffer = new char[streamsize];
	memset(buffer,0x00,streamsize);

	if(inFile.good())
	{
		while (!inFile.eof())
		{
			inFile.read(buffer,streamsize);
			
			int i = 0;

			do
			{
				type = isascii(buffer[i]);
				i++;
			} while ((i < streamsize) && (type == 1) );
			
			if ( type == 0 )
			{
				cout << "List.txt is non-ASCII binary file. \n";
				return;
			}
		}

		if ((type == 1))
		{
			cout << "List.txt is ASCII binary file (text).\n";
			return;
		}
	}
	else
	{
		cerr << "List.txt could not be opened. \n";
		return;
	}
	delete[] buffer;
	inFile.close();
}

Some corrections and improvements.

1. Never read file data a teaspoon at a time. A reasonable minimum chunk is (obviously) the disk sector size (no less than 512 bytes). Don't use streamsize as a variable name: it's not an error, but streamsize is the name of the standard library's stream size type. As usual, the best strategy for reading data is "the more at once, the better".

2. Don't bother with cumbersome and unnecessary buffer filling. Look at this code skeleton:

    bool isAscii = true;
    ifstream inFile("readme.txt",ios::binary);
    // This is the proper use of streamsize:
    const streamsize gPortion = 512;
    streamsize gCount;
    char* buffer = new char[gPortion];
    
    while (!inFile.eof())
    {
        inFile.read(buffer,gPortion);
        // Get the last op byte counter:
        gCount = inFile.gcount();
        if (gCount <= 0)
            break;
         // process gCount bytes
        int i;
        for (i = 0; i < gCount && isascii(buffer[i]); ++i)
           ;
        if (i < gCount)
        {
            isAscii = false;
            break;
        }
    }
    // Report result in isAscii var here...
    delete [] buffer;

3. Get the filename from a command-line parameter or ask for it on the console:

int main(int argc, char* argv[])
{
    const char* fname;
    string sname;
    ...
    if (argc > 1)
       fname = argv[1];
    else
    {
        cout << "Enter filename...";
        getline(cin,sname);
        fname = sname.c_str();
    }
    ifstream infile(fname,ios::binary);
    ...

Thanks ArkM! :) You've been really helpful! I think I got it now! :)
