Reading UTF-8 File In C++?

Question

Tygawr -12 Newbie Poster

13 Years Ago

Hello everyone,

I am having a hard time reading and writing a UTF-8 file in Visual C++ 2010.

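#include <fstream>
#include <iostream>
#include <string>
using namespace std;

void ReadUTF8File()
{
	ifstream UTF8File("C:\\DaniWeb\\Desktop\\UTF8File.txt");
	/*
	UTF8File.txt:
	☺☻♥♦♣♠•◘○
	*/
	string UTF8FileStr;

	// Test the extraction itself instead of eof(), so the last token
	// is not printed a second time when the final read fails.
	// If the file could not be opened, the loop body never runs.
	while(UTF8File >> UTF8FileStr)
	{
		cout << UTF8FileStr << endl;
		/*
		cout:
		∩╗┐Γÿ║Γÿ╗ΓÖÑΓÖªΓÖúΓÖáΓÇóΓùÿΓùï
		*/
	}
	// The ifstream closes itself when it goes out of scope.
}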

The output does not match the file's text. Please help, thank you for your time and consideration.

c++


All 7 Replies

Ancient Dragon 5,243 Achieved Level 70

13 Years Ago

Here is an explanation of the UTF-8 file format. Note that the first two bytes may be a binary integer -- see the chart at the end of that link.
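
The leading bytes mentioned here are most likely the optional UTF-8 byte order mark, the three bytes EF BB BF. Those are the same bytes that show up as ∩╗┐ at the start of the console output in the question. A minimal sketch of dropping that marker from the first string read, assuming the file really is UTF-8; the helper name StripUtf8Bom is made up for illustration and is not part of any library:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the thread): removes a leading UTF-8
// byte order mark (EF BB BF) from text, if one is present.
// Returns true when a BOM was found and removed.
bool StripUtf8Bom(std::string& text)
{
    static const char kBom[] = "\xEF\xBB\xBF";
    assert(sizeof(kBom) == 4);  // three BOM bytes plus the terminating NUL

    // compare() clamps the range to the string's length, so strings
    // shorter than three bytes simply compare unequal.
    if (text.compare(0, 3, kBom) == 0)
    {
        text.erase(0, 3);
        return true;
    }
    return false;
}
```

It would be called once, on the first UTF8FileStr read from the file. The remaining bytes are still UTF-8, so a console that uses an OEM code page will keep showing them as Γÿ║ and so on until the text is converted or the console is switched to UTF-8.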


Tygawr commented: Thanks for the link. +0

Ancient Dragon 5,243 Achieved Level 70

13 Years Ago

That will not work with some languages that must be represented in two bytes.

If the file is in UTF-8 format then the first 4 bytes will represent a binary number between 0 and 0xffff. The table in that link tells how to interpret those bytes. If the first byte does not have a maximum value of 0x0f then the file is not in UTF-8 format. You can assume standard ASCII file format, or possibly UTF-16 or UTF-32 format.



Tygawr -12 Newbie Poster · Answer 1 · 2012-03-14T06:04:01+00:00

Okay I've finished reading this. But I still don't quite understand how to read and write a UTF-8 file.

BobS0327 24 Junior Poster in Training · Answer 2 · 2012-03-14T08:44:12+00:00

Tygawr wrote: Okay I've finished reading this. But I still don't quite understand how to read and write a UTF-8 file.

Well, from a C perspective, you could open the file and use the standard fgets function to read a line. Once a line is read in, you would convert the UTF-8 to Windows Unicode (UTF-16) using the MultiByteToWideChar function with a code page setting of CP_UTF8. At this point you must use the appropriate Windows Unicode functions to display the text. If you wanted to take it a step further and convert the Windows UTF-16 to plain ANSI in order to use printf etc., you would use the WideCharToMultiByte function for the conversion.
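
A rough sketch of that conversion step, for illustration only. The wrapper name Utf8ToWide is made up; the Windows branch uses the usual two-call MultiByteToWideChar pattern (first call for the length, second for the data). The non-Windows branch is a hand-written stand-in that exists only so the sketch builds elsewhere; it assumes a 32-bit wchar_t and does not reject overlong or surrogate encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical wrapper (name not from the thread): UTF-8 bytes -> wide string.
// Returns an empty string on empty or malformed input.
std::wstring Utf8ToWide(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
#ifdef _WIN32
    const int bytes = static_cast<int>(utf8.size());
    // First call: ask how many wchar_t units the result needs.
    const int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), bytes, NULL, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    // Second call: perform the conversion into the sized buffer.
    const int written =
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), bytes, &wide[0], len);
    assert(written == len);
    (void)written;
    return wide;
#else
    // Stand-in decoder, assuming wchar_t can hold a full code point.
    std::wstring wide;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        unsigned long cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else return std::wstring();   // stray continuation or invalid lead byte
        if (extra > utf8.size() - i - 1) return std::wstring();  // truncated
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) return std::wstring();  // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        wide.push_back(static_cast<wchar_t>(cp));
        i += extra + 1;
    }
    return wide;
#endif
}
```

Each line returned by fgets (or std::getline) would be passed through the wrapper, and the resulting wide string handed to a wide-character output function. Going back to ANSI with WideCharToMultiByte follows the same two-call shape.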

Tygawr -12 Newbie Poster · Answer 3 · 2012-03-14T09:08:54+00:00

Hmmm... okay, I'll re-edit this post once I get the code working! Thanks everyone.

BobS0327 24 Junior Poster in Training · Answer 4 · 2012-03-14T09:47:40+00:00

Ancient Dragon wrote:
That will not work with some languages that must be represented in two bytes.
If the file is in UTF-8 format then the first 4 bytes will represent a binary number between 0 and 0xffff. The table in that link tells how to interpret those bytes. If the first byte does not have a maximum value of 0x0f then the file is not in UTF-8 format. You can assume standard ASCII file format, or possibly UTF-16 or UTF-32 format.

I'm not sure I quite understand your explanation. The fgets function is only used to get a string from the file, whether it contains gibberish or not, and pass it to the Unicode function that will translate it into a Windows-readable format. The MultiByteToWideChar function can handle UTF-8, UTF-16 and UTF-32. Bottom line: Unicode is designed to handle every language on the planet, and MultiByteToWideChar is primarily designed to handle Unicode and can work with languages that use up to 32 bits.

Giovanni Dicanio, a Microsoft MVP, wrote an excellent in-depth article with proof-of-concept code on this subject, which can explain it a lot better than I can. Unfortunately, I just can't find a link to that article. I will post the link as soon as I find it.

thines01 401 Postaholic Team Colleague Featured Poster · Answer 5 · 2012-03-14T10:43:22+00:00

Since you're using Visual Studio 2010, can you just use C++/CLI to read and write the file?

#include "stdafx.h"
using namespace System;
using namespace System::IO;
using namespace System::Text;

int main(array<System::String ^> ^args)
{
   String^ strFileName = "c:\\science\\test_utf8.txt";
   File::WriteAllText(strFileName, "The quick brown fox jumps over a lazy dog", Encoding::UTF8);
   String^ strData = File::ReadAllText(strFileName, Encoding::UTF8);
   Console::WriteLine(strData);
   return 0;
}
