I am trying to calculate the number of code points in a file, but it always shows me the number of characters instead.

I am using a BufferedReader to read the file, and I have come to this part of the code:

while ((sCurrentLine = br.readLine()) != null) {
    System.out.println(sCurrentLine);
    numChar = numChar + sCurrentLine.length(); // number of characters on this line
    numCdpoints = numCdpoints + sCurrentLine.codePointCount(0, sCurrentLine.length()); // number of code points on this line
}
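The distinction the loop is measuring can be seen in a short, self-contained example. `length()` counts UTF-16 `char` units, while `codePointCount()` counts actual Unicode code points; a character outside the Basic Multilingual Plane takes two `char`s (a surrogate pair) but is one code point. The string literal below (MATHEMATICAL ITALIC SMALL PI, U+1D70B) is just an illustration:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // "pi = 𝜋" — the pi symbol is a surrogate pair: \uD835\uDF0B
        String s = "pi = \uD835\uDF0B";
        System.out.println(s.length());                       // 7 char units
        System.out.println(s.codePointCount(0, s.length()));  // 6 code points
    }
}
```

If the two counts differ for your file, at least one line contains supplementary characters (or mis-decoded bytes that happen to form surrogates).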

If anyone can suggest anything, it would be really helpful.


What kind of file is it? What kind of "text" does it contain? If it contains ASCII encoded text, you'll always get the char count of a string same as the number of code-points. Read this, try to understand it and get back in case of more queries.

In the text, there are some characters in a different font, and thus when I print the string I get something like this:

���5ܲ hile the ���5ܲ ecimal representation

So, there are some characters which are surrogate pairs (supplementary code points).


OK, now that we have confirmed that you have some characters encoded as surrogate pairs, the next step is to check which Charset is used when opening the file stream for reading. Make sure that you don't rely on the default OS charset (windows-1252, Latin-1, etc.) and explicitly pass UTF-8. Something like (not tested):

new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"));

Since UTF-8 is backwards compatible with ASCII (any ASCII text is also valid UTF-8), you'll be able to read regular characters along with characters encoded as surrogate pairs.

If it still doesn't work, post/attach a small fragment of your text file.
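Putting the advice above together, a minimal sketch of the whole counting loop with an explicit charset might look like this (the file name is just a placeholder for whatever you are reading; `StandardCharsets.UTF_8` is the modern equivalent of the `"UTF8"` string):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class CountCodePoints {
    // Returns {charCount, codePointCount} summed over all lines of the reader.
    static long[] count(Reader in) throws IOException {
        long numChar = 0, numCodePoints = 0;
        BufferedReader br = new BufferedReader(in);
        String line;
        while ((line = br.readLine()) != null) {
            numChar += line.length();                              // UTF-16 char units
            numCodePoints += line.codePointCount(0, line.length()); // Unicode code points
        }
        return new long[] { numChar, numCodePoints };
    }

    public static void main(String[] args) throws IOException {
        // "piblurb.txt" stands in for whatever file you are reading.
        try (Reader in = new InputStreamReader(
                new FileInputStream(args.length > 0 ? args[0] : "piblurb.txt"),
                StandardCharsets.UTF_8)) {
            long[] c = count(in);
            System.out.println("chars: " + c[0] + ", code points: " + c[1]);
        }
    }
}
```

Note that line separators are not counted here, since `readLine()` strips them; add them back in if you need a byte-for-byte total.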


Actually, I am using
br = new BufferedReader(new FileReader("C:\\piblurb.txt"));
to read the file, and I have used UTF-16 encoding only.

I have attached the text file

Attachments
5hile the 5ecimal representation of 5 has been computed ... 5igits of the 5ecimal representation of 5 are available on many 5eb pages, and there is software for calculating the 5ecimal representation of 5 to billions of digits on any computer. 5x5{

Assuming the text is from the Wikipedia description of pi, there are a few problems. First, the text is completely garbled; how did you generate the text file? Second, why UTF-16? Did you specifically encode the file as UTF-16? If yes, is it with or without a BOM?
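If the file really is UTF-16, the fix is again to name the charset explicitly rather than using `FileReader` (which uses the platform default). Java's `"UTF-16"` decoder honors a leading BOM if one is present and assumes big-endian otherwise, so a sketch along these lines should handle both cases; the helper method and file path here are only illustrative:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Utf16Read {
    // Reads the whole file as UTF-16 text; a leading BOM is consumed by the decoder.
    static String readAll(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_16))) {
            int c;
            while ((c = br.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new File(args[0])));
    }
}
```

If the text prints cleanly with this reader but not with `FileReader`, the garbling was a charset mismatch, not a problem with `codePointCount`.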
