I am trying to calculate the number of code points in a file, but I always get the same value as the number of characters.

I am using a BufferedReader to read the file, and this is the relevant part of the code:

while ((sCurrentLine = br.readLine()) != null) {
    System.out.println(sCurrentLine);
    numChar = numChar + sCurrentLine.length(); // Calculating number of characters on each line
    numCdpoints = numCdpoints + sCurrentLine.codePointCount(0, sCurrentLine.length()); // Calculating the number of code points
}

If there is anything anyone can suggest, it would be really helpful


What kind of file is it? What kind of "text" does it contain? If it contains ASCII-encoded text, the char count of a string will always equal its number of code points. Read this (http://weblogs.java.net/blog/joconner/archive/2005/08/how_long_is_you.html), try to understand it, and get back if you have more questions.
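To make the difference concrete, here is a minimal sketch comparing the two counts. The class and method names are made up for illustration, and U+1D70B (MATHEMATICAL ITALIC SMALL PI) is just one example of a supplementary character that needs a surrogate pair:

```java
public class CodePointDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    public static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    public static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "pi";
        // Build a one-code-point string from a supplementary code point (hypothetical sample).
        String supplementary = new String(Character.toChars(0x1D70B));

        // prints "chars=2 codePoints=2"
        System.out.println("chars=" + charCount(ascii) + " codePoints=" + codePointCount(ascii));
        // prints "chars=2 codePoints=1"
        System.out.println("chars=" + charCount(supplementary) + " codePoints=" + codePointCount(supplementary));
    }
}
```

The two counts only differ when the string actually contains surrogate pairs, so identical totals usually mean either plain BMP text or a file that was decoded with the wrong charset.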



The text contains some characters in a different font, so when I print the string I get something like this:

���5ܲ hile the ���5ܲ ecimal representation

So there are some characters that are represented as surrogate pairs (supplementary code points).


OK, now that we have confirmed that you have some characters with surrogate pairs, the next step is to check which Charset is used when opening the file stream for reading. Make sure that you don't rely on the default platform charset (windows-1252, Latin-1, etc.) and explicitly pass UTF-8. Something like (not tested):

new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"));

Since UTF-8 is backwards compatible with ASCII (or ASCII is acceptable UTF-8), you'll be able to read regular characters along with characters having surrogate pairs.
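Putting the reader and the counting loop together, a rough sketch could look like the following. The class name CodePointFileCounter and its count method are hypothetical, and the charset is passed in as a parameter because the right one depends on how the file was actually saved:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CodePointFileCounter {
    // Returns {number of chars, number of code points}, excluding line terminators.
    public static long[] count(Path file, Charset charset) throws IOException {
        long numChars = 0;
        long numCodePoints = 0;
        // The charset is given explicitly instead of relying on the platform default.
        try (BufferedReader br = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = br.readLine()) != null) {
                numChars += line.length();
                numCodePoints += line.codePointCount(0, line.length());
            }
        }
        return new long[] {numChars, numCodePoints};
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java CodePointFileCounter <file>");
            return;
        }
        // Assumption: the file is UTF-8; swap in StandardCharsets.UTF_16 if it was saved as UTF-16.
        long[] result = count(Paths.get(args[0]), StandardCharsets.UTF_8);
        System.out.println("chars=" + result[0] + " codePoints=" + result[1]);
    }
}
```

Unlike a plain InputStreamReader, Files.newBufferedReader reports malformed input with an IOException rather than silently substituting replacement characters, which makes a wrong charset guess easier to spot.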

If it still doesn't work, post/attach a small fragment of your text file.


Actually, I am using
br = new BufferedReader(new FileReader("C:\\piblurb.txt"));
to read the file, and I have to use UTF-16 encoding only.

I have attached the text file

Attachments
5hile the 5ecimal representation of 5 has been computed ... 5igits of the 5ecimal representation of 5 are available on many 5eb pages, and there is software for calculating the 5ecimal representation of 5 to billions of digits on any computer. 5x5{

Assuming the text is from the Wikipedia description of pi, there are a few problems. First, the text is completely garbled; how did you generate the text file? Second, why UTF-16? Did you specifically encode the file as UTF-16? If yes, is it with or without a BOM?
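One way to answer the BOM question is to look at the first few bytes of the file. The sketch below is only an illustration: BomSniffer and detectBom are made-up names, and UTF-32 BOMs are deliberately ignored to keep it short (a UTF-32LE file would be misreported as UTF-16LE).

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class BomSniffer {
    // Returns the encoding implied by a leading byte order mark, or "none".
    public static String detectBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return "none";
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java BomSniffer <file>");
            return;
        }
        byte[] buf = new byte[3];
        try (InputStream in = new FileInputStream(args[0])) {
            int n = in.read(buf);
            // Only inspect the bytes that were actually read.
            byte[] head = Arrays.copyOf(buf, Math.max(n, 0));
            System.out.println("BOM: " + detectBom(head));
        }
    }
}
```

If a UTF-16 BOM is present, Java's "UTF-16" charset consumes it when decoding and picks the byte order from it; without a BOM it assumes big-endian, so a BOM-less little-endian file has to be opened as "UTF-16LE" explicitly.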
