Hello everyone,

I am having a hard time reading and writing a UTF-8 file in Visual C++ 2010.

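#include <fstream>
#include <iostream>
#include <string>
using namespace std;

void ReadUTF8File()
{
	ifstream UTF8File("C:\\DaniWeb\\Desktop\\UTF8File.txt");
	/*
	UTF8File.txt:
	☺☻♥♦♣♠•◘○
	*/
	string UTF8FileStr;

	if(UTF8File.is_open())
	{
		// Test the extraction itself; looping on !eof() can print the last word twice.
		while(UTF8File >> UTF8FileStr)
		{
			cout << UTF8FileStr << endl;
			/*
			cout:
			☺☻♥♦♣♠•◘○
			*/
		}
	}
	UTF8File.close();
}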

The output did not match the file's text. Please help, thank you for your time and consideration.

Here is an explanation of the UTF-8 file format. Note that the first two bytes may be a binary integer -- see the chart at the end of that link.

Edited 4 Years Ago by Ancient Dragon: n/a

Comments
Thanks for the link.

Okay I've finished reading this. But I still don't quite understand how to read and write a UTF-8 file.

Well, from a C perspective, you could open the file and use the standard fgets function to read a line. Once a line is read in, you would convert the UTF-8 to Windows Unicode (UTF-16) using the MultiByteToWideChar function with a code page setting of CP_UTF8. At this point you must use the appropriate Windows Unicode functions to display the text. If you wanted to take it a step further and convert the Windows UTF-16 to plain ANSI in order to use printf etc., you would use the WideCharToMultiByte function for the conversion.
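
A minimal sketch of those steps, under a few assumptions: the helper names ReadRawLines and Utf8ToWide are made up for illustration, the file is read in binary mode so the bytes arrive untouched, and the conversion part only compiles on Windows (it is guarded by _WIN32). Error handling is kept deliberately thin.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: read every line of a file as raw bytes using fgets.
// The bytes are not decoded here, so UTF-8 sequences pass through unchanged.
std::vector<std::string> ReadRawLines(const char* path)
{
    assert(path != NULL);
    std::vector<std::string> lines;
    std::FILE* fp = std::fopen(path, "rb");
    if (!fp) return lines;                      // file missing: return no lines

    char buf[4096];
    std::string current;
    while (std::fgets(buf, sizeof buf, fp)) {
        current += buf;                         // a long line may span several reads
        if (!current.empty() && current[current.size() - 1] == '\n') {
            current.erase(current.size() - 1);  // drop '\n'
            if (!current.empty() && current[current.size() - 1] == '\r')
                current.erase(current.size() - 1);  // drop '\r' from CRLF files
            lines.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) lines.push_back(current);  // last line without newline
    std::fclose(fp);
    return lines;
}

#ifdef _WIN32
// Hypothetical helper: convert UTF-8 bytes to a UTF-16 std::wstring.
// Returns an empty string if the input is empty or the conversion fails.
std::wstring Utf8ToWide(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call: ask how many UTF-16 code units are needed.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), NULL, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(len, L'\0');
    // Second call: do the actual conversion into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}
#endif
```

On Windows, each string returned by ReadRawLines could be passed through Utf8ToWide and then handed to a wide-character output function such as WriteConsoleW. If the file starts with a byte order mark, it will show up as the first three bytes of the first line and may need to be skipped.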

That will not work with some languages that must be represented in two bytes.

If the file is in UTF-8 format then each character is stored in 1 to 4 bytes, and the value of the first byte tells how many bytes the character uses. The table in that link tells how to interpret those bytes. If a first byte does not fit one of the patterns in that table then the file is not in UTF-8 format. You can assume standard ASCII file format, or possibly UTF-16 or UTF-32 format.
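
As a rough illustration of that byte layout, here is a small portable sketch. It assumes the layout from the UTF-8 standard (RFC 3629): a lead byte announces a sequence of 1 to 4 bytes, every following byte has the form 10xxxxxx, and a file may optionally begin with the byte order mark EF BB BF. The function names are made up for this example, and the check is structural only (it does not reject every overlong or surrogate encoding).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` cannot start a sequence (continuation or invalid byte).
int Utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;                   // 0xxxxxxx: plain ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx
    return 0;
}

// True if `bytes` begins with the optional UTF-8 byte order mark EF BB BF.
bool HasUtf8Bom(const std::string& bytes)
{
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Count the code points in `bytes`, skipping a leading BOM if present.
// Returns -1 if the lead/continuation structure is not valid UTF-8.
long CountCodePoints(const std::string& bytes)
{
    std::size_t i = HasUtf8Bom(bytes) ? 3 : 0;
    long count = 0;
    while (i < bytes.size()) {
        int len = Utf8SequenceLength(static_cast<unsigned char>(bytes[i]));
        if (len == 0 || i + len > bytes.size()) return -1;  // bad lead or truncated
        assert(len >= 1 && len <= 4);
        for (int k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return -1;   // continuation must be 10xxxxxx
        }
        i += len;
        ++count;
    }
    return count;
}
```

A result of -1 from CountCodePoints is a hint that the file is in some other encoding. Pure ASCII text passes the check too, since ASCII is a subset of UTF-8.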

Edited 4 Years Ago by Ancient Dragon: n/a

Hmmm.. okay, I'll re-edit this post once I get the code working! Thanks everyone.

I'm not sure I quite understand your explanation. The fgets function is only used to get a string from the file, whether it contains gibberish or whatever, and pass it to the Unicode function that will translate it to a Windows-readable format. With CP_UTF8, the MultiByteToWideChar function accepts any valid UTF-8 sequence and produces UTF-16. Bottom line: Unicode is designed to handle every language on the planet, and MultiByteToWideChar is primarily designed to handle Unicode, so it works with characters that take up to four bytes in UTF-8.

Giovanni Dicanio, an MS MVP, wrote an excellent in-depth article with proof-of-concept code on this subject, which can explain it a lot better than I can. Unfortunately, I can't find a link to that article. I will post the link as soon as I find it.

Since you're using Visual Studio 2010, can you just use C++/CLI to read and write the file?

#include "stdafx.h"
using namespace System;
using namespace System::IO;
using namespace System::Text;

int main(array<System::String ^> ^args)
{
   String^ strFileName = "c:\\science\\test_utf8.txt";
   File::WriteAllText(strFileName, "The quick brown fox jumps over a lazy dog", Encoding::UTF8);
   String^ strData = File::ReadAllText(strFileName, Encoding::UTF8);
   Console::WriteLine(strData);
   return 0;
}

Edited 4 Years Ago by thines01: formatting

Comments
THANK YOU!