I am trying to calculate the number of code points present in a file, but my code always gives me the number of characters instead.

I am using a buffered reader to read the file, and this is the relevant part of the code:

while ((sCurrentLine = br.readLine()) != null) {
    System.out.println(sCurrentLine);
    numChar = numChar + sCurrentLine.length(); // Calculating number of characters on each line
    numCdpoints = numCdpoints + sCurrentLine.codePointCount(0, sCurrentLine.length()); // Calculating the number of code points
}

If anyone can suggest anything, it would be really helpful.

All 5 Replies

What kind of file is it? What kind of "text" does it contain? If it contains only ASCII-encoded text, the char count of a string will always be the same as the number of code points; the two differ only when the text contains characters stored as surrogate pairs. Read this, try to understand it and get back in case of more queries.
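To make the difference concrete, here is a minimal sketch. The class and method names are made up for illustration, and U+1D70B is just one example of a character outside the Basic Multilingual Plane. For plain ASCII the two counts agree; for a supplementary character, length() reports two chars (the surrogate pair) while codePointCount() reports one code point.

```java
public class CodePointDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "pi";
        // U+1D70B lies outside the BMP, so Java stores it as a surrogate pair.
        String supplementary = new String(Character.toChars(0x1D70B));

        System.out.println(charCount(ascii) + " " + codePointCount(ascii));                 // prints "2 2"
        System.out.println(charCount(supplementary) + " " + codePointCount(supplementary)); // prints "2 1"
    }
}
```

If both counts come out equal for your file, either the text has no supplementary characters or the file was decoded with the wrong charset before the counting started.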

In the text, there are some characters in a different font, so when I print the string I get something like this:

���5ܲ hile the ���5ܲ ecimal representation

So there are some characters that are surrogate pairs (code points).

OK, now that we have confirmed that you have some characters with surrogate pairs, the next point is to understand which charset is used when opening the file stream for reading. Make sure that you don't rely on the default OS charset (windows-1252, latin etc.) and explicitly pass UTF-8. Something like (not tested):

new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"));

Since UTF-8 is backwards compatible with ASCII (or ASCII is acceptable UTF-8), you'll be able to read regular characters along with characters having surrogate pairs.
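Putting the explicit charset together with the counting loop from the question, a sketch could look like the following. The class name, the count helper and the fallback file name are hypothetical, and it assumes the file really is UTF-8. Like the original loop, it does not count line terminators.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileCodePointCounter {
    // Returns {chars, codePoints} for the file, excluding line terminators,
    // mirroring the per-line counting in the question.
    static long[] count(Path file, Charset charset) throws IOException {
        long numChars = 0;
        long numCodePoints = 0;
        try (BufferedReader br = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = br.readLine()) != null) {
                numChars += line.length();
                numCodePoints += line.codePointCount(0, line.length());
            }
        }
        return new long[] { numChars, numCodePoints };
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass your own file as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "piblurb.txt");
        long[] counts = count(file, StandardCharsets.UTF_8);
        System.out.println("chars=" + counts[0] + ", code points=" + counts[1]);
    }
}
```

One reason to prefer Files.newBufferedReader here: it throws an exception on bytes that are not valid in the given charset instead of silently substituting replacement characters, so a wrong charset guess tends to show up early.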

If it still doesn't work, post/attach a small fragment of your text file.

Actually, I am using
br = new BufferedReader(new FileReader("C:\\piblurb.txt"));
to read the file, and I have to do UTF-16 encoding only.

I have attached the text file.

Assuming the text is from the Wikipedia description of PI, there are a few problems. First, the text is completely garbled; how did you generate the text file? Second, why UTF-16? Did you specifically encode the file as UTF-16? If yes, is it with or without a BOM?
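If you are not sure about the BOM, one way to check is to look at the first few bytes of the file. The sketch below is only an illustration: the class name, the detectBom helper and the fallback file name are made up, and it recognises only the UTF-8 and UTF-16 marks. Note also that FileReader(String) decodes with the platform default charset, so a UTF-16 file read that way will come out garbled; Java's "UTF-16" charset, when decoding, honours a BOM and assumes big-endian if there is none.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class BomSniffer {
    // Classifies the leading bytes of a file. UTF-32 marks are not handled,
    // so a UTF-32LE file would be reported as UTF-16LE by this sketch.
    static String detectBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8 BOM";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE BOM";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE BOM";
        }
        return "no BOM";
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass your own file as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "piblurb.txt");
        byte[] head = new byte[3];
        int n;
        try (InputStream in = Files.newInputStream(file)) {
            // A single read is normally enough for the first bytes of a local file.
            n = in.read(head);
        }
        System.out.println(detectBom(Arrays.copyOf(head, Math.max(n, 0))));
    }
}
```

If this reports a UTF-16 BOM, opening the file with new InputStreamReader(new FileInputStream(file), "UTF-16") instead of FileReader should decode it correctly; with no BOM you would need to name the byte order explicitly ("UTF-16LE" or "UTF-16BE").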
