I have a file that contains Unicode characters from the Latin-1 and Latin Extended-A sets (link to Unicode charts). How do I read it so that these Unicode escapes are converted to the proper characters?
I tried

public void readFile(File file) {
        try {
            BufferedReader in = new BufferedReader(new FileReader(file));
            String noneUTF;
            while ((noneUTF = in.readLine()) != null) {
                byte[] utfBytes = noneUTF.getBytes("UTF8");
                String str = new String(utfBytes, "UTF8");
                jta.append(str + "\n");
            }
            in.close();
        }
        catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }

and

public void readFile(File file) {
        
        StringBuffer buffer = new StringBuffer();
            try {
            FileInputStream fis = new FileInputStream(file);
            InputStreamReader isr = new InputStreamReader(fis,
                                      "UTF8");
            Reader in = new BufferedReader(isr);
            int ch;
            while ((ch = in.read()) > -1) {
                buffer.append((char)ch);
            }
            in.close();
            System.out.println( buffer.toString());
            } catch (IOException e) {
            e.printStackTrace();
            }
    }

but I haven't had much success :(

Success keeps avoiding me :'(

On my last attempt I used

public void readFile(File file) {
        Charset charset = Charset.forName("ISO-8859-1");
        CharsetDecoder decoder = charset.newDecoder();
        try {
            BufferedReader in = new BufferedReader(new FileReader(file));
            String noneUTF;
            while ((noneUTF = in.readLine()) != null) {
                byte[] utfBytes = noneUTF.getBytes();
                ByteBuffer buf = ByteBuffer.wrap(utfBytes);
                CharBuffer cbuf = decoder.decode(buf);
                String str = cbuf.toString();
                jta.append(str + "\n");
            }
            in.close();
        }
        catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }

but it did not work. Below I have attached a snippet of the text that I'm trying to read. The final words should have diacritical marks over some of the letters.

# common
common.today=Dnes
common.search=Vyhlada\u0165
common.login=Prihlasi\u0165
common.logout=Odhlasi\u0165
common.poweredBy=Powered by {0}
common.copyright=Content © {0}
common.home=Domov
common.yes=\u00c1no
common.no=Nie
common.backTo=Spa\u0165 k
common.summaryOfBlogs=S\u00fahrn blogov
common.commaSeparated=(odelen\u00e9 \u010diarkou)
common.previous=<< Predch\u00e1dzaj\u00faci
common.next=\u010eal\u0161\u00ed >>
common.readMore=\u010cita\u0165 \u010falej...
common.about=O mne
common.aboutAuthor=O autorovi

I played around with the native2ascii tool, but the results are mixed so far. I used the following command

C:\>native2ascii -encoding windows-1252 -reverse slovak.txt slovak2.txt

and the above snippet came out like

# common
common.today=Dnes
common.search=Vyhlada\u0165
common.login=Prihlasi\u0165
common.logout=Odhlasi\u0165
common.poweredBy=Powered by {0}
common.copyright=Content © {0}
common.home=Domov
common.yes=Áno
common.no=Nie
common.backTo=Spa\u0165 k
common.summaryOfBlogs=Súhrn blogov
common.commaSeparated=(odelené \u010diarkou)
common.previous=<< Predchádzajúci
common.next=\u010ealší >>
common.readMore=\u010cita\u0165 \u010falej...
common.about=O mne
common.aboutAuthor=O autorovi

You can see that some of the characters have been converted, but others are still escaped. I will continue to play around with it and see what I can get out of it.

If you look more closely, all the characters that were converted are from the Latin-1 set (0080-00FF), such as Á - \u00c1, ú - \u00fa, á - \u00e1, í - \u00ed. The only character that is not from the Latin-1 set but still gets converted is š - \u0161, which belongs to the Latin Extended-A set. That is because windows-1252 is not identical to ISO-8859-1: it fills the 0x80-0x9F range with extra printable characters, and š sits there at 0x9A. (ISO-8859-15 also contains š, at 0xA8, where ISO-8859-1 has the diaeresis ¨.)
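The coverage described here can be checked from Java itself. The following is only a minimal sketch: the class name CharsetCoverage and its helper method are made up for illustration, and it assumes the running JDK ships the windows-1252 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class CharsetCoverage {
    // True if the named charset has a byte representation for the character.
    static boolean canEncode(String charsetName, char c) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(c);
    }

    public static void main(String[] args) {
        // A-acute, e-acute, s-caron, c-caron, t-caron from the sample file
        char[] samples = {'\u00c1', '\u00e9', '\u0161', '\u010d', '\u0165'};
        for (char c : samples) {
            System.out.printf("U+%04X  ISO-8859-1=%b  windows-1252=%b  UTF-8=%b%n",
                    (int) c,
                    canEncode("ISO-8859-1", c),
                    canEncode("windows-1252", c),
                    canEncode("UTF-8", c));
        }
    }
}
```

The point of the check: š is encodable in windows-1252 but not in ISO-8859-1, while č and ť are encodable in neither, which matches the escapes that native2ascii left untouched.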

So which argument/parameter should I use for correct conversion?

Since you have characters from two different sets, I guess you will have to use two different charsets here and check the character code while reading from the file: if the character code is in the range 0080-00FF, i.e. Latin-1, use the charset ISO-8859-1; if it is from the set 0100-017F, i.e. Latin Extended-A, use the appropriate charset.

PS: I do not know which charset to use for Latin Extended-A, the way you use ISO-8859-1 for Latin-1.
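For checking which set a character belongs to, Character.UnicodeBlock.of can do the range test. A rough sketch follows (BlockCheck and its methods are hypothetical names, not anything from this thread). It also shows that a single Unicode encoding such as UTF-8 round-trips characters from both sets, so switching charsets per character may not be needed if the file can be stored in such an encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class BlockCheck {
    // Maps each non-ASCII character of the text to the name of its Unicode block.
    static Map<Character, String> blocksOf(String text) {
        Map<Character, String> result = new LinkedHashMap<>();
        for (char c : text.toCharArray()) {
            if (c > 0x7F) {
                result.put(c, String.valueOf(Character.UnicodeBlock.of(c)));
            }
        }
        return result;
    }

    // True if encoding to UTF-8 and decoding again gives back the same text.
    static boolean survivesUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(text);
    }

    public static void main(String[] args) {
        // Sample string from the thread, with one character from each set
        String sample = "odelen\u00e9 \u010diarkou";
        System.out.println(blocksOf(sample));
        System.out.println("UTF-8 round trip ok: " + survivesUtf8(sample));
    }
}
```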

http://65.110.21.43/jgloss/utf.html#WRITEUTF
Use the first lines of the method private static void testWriteUTF() in public final class UTF8. File.createTempFile( is used as a bridge.
/**
* Discover how Java's use of UTF-8 conforms with Unicode standards.
* <p/>
* composed with IntelliJ IDEA
*
* @author Roedy Green, Canadian Mind Products
* @version 1.0, 2006-02-25
*/

@quuba - that did not fly
I used the simple string odelen\u00e9 \u010diarkou with the code you linked. This string has characters from both Latin sets and you can see the result in the attached file

private static final String TEST = //"common.search=Vyhlada\u0165";

            "odelen\u00e9 \u010diarkou" +
            "common.today=Dnes \n" +
            "common.search=Vyhlada\u0165 \n" +
            "common.login=Prihlasi\u0165 \n" +
            "common.logout=Odhlasi\u0165 \n" +
            "common.poweredBy=Powered by {0} \n" +
            "common.copyright=Content &copy; {0} \n" +
            "common.home=Domov \n" +
            "common.yes=\u00c1no \n" +
            "common.no=Nie \n" +
            "common.backTo=Spa\u0165 k \n" +
            "common.summaryOfBlogs=S\u00fahrn blogov \n" +
            "common.commaSeparated=(odelen\u00e9 \u010diarkou) \n" +
            "common.previous=&lt;&lt; Predch\u00e1dzaj\u00faci \n" +
            "common.next=\u010eal\u0161\u00ed &gt;&gt; \n" +
            "common.readMore=\u010cita\u0165 \u010falej... \n" +
            "common.about=O mne \n" +
            "common.aboutAuthor=O autorovi";
    //
    private static JTextArea jta = new JTextArea();

    private static void testWriteUTF() throws IOException {//http://65.110.21.43/jgloss/utf.html#WRITEUTF
        File tempFile = File.createTempFile("temp", "tmp");
        // O P E N  for write
        FileOutputStream fos =
                new FileOutputStream(tempFile, false/* append */);
        DataOutputStream dos = new DataOutputStream(fos);
        // W R I T E
        dos.writeUTF(TEST);
        jta.append(TEST + "\n");
        // C L O S E
        dos.close();
        // O P E N  for read
        FileInputStream fis = new FileInputStream(tempFile);
        DataInputStream dis = new DataInputStream(fis);
        // R E A D
        String reconstitutedTest = dis.readUTF();
        jta.append(reconstitutedTest + "\n");
        if (!reconstitutedTest.equals(TEST)) {
            System.out.println("oops: readUTF differs from original");
        }
        // C L O S E
        dis.close();
    }

result

Yes, that piece of code works with the provided string. Let's see how it performs when reading from a file.

Still no luck. The line String reconstitutedTest = dis.readUTF(); gives me an EOFException.
At present the file reading method looks like this

public void readFile(File file) {
        String str = "";
        DataInputStream dis = null;
        try {
            dis = new DataInputStream(new
                    BufferedInputStream(new FileInputStream(file)));
            // R E A D

            try {
                while (true) {
                    str = dis.readUTF();
                }
            }
            finally{
                if(dis != null) dis.close();
            }
        }
        catch (FileNotFoundException fnfe) {
            System.out.println("FileNotFoundException occurred");
            fnfe.printStackTrace();
        }
        catch (IOException ioe) {
            System.out.println("IOException occurred");
            ioe.printStackTrace();
        }
        jta.append(str);
    }
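A note on the EOFException: readUTF only understands data produced by writeUTF, which starts with a two-byte length followed by modified UTF-8 bytes. A plain text file has no such length prefix, so the first two characters get read as a length and the stream runs out. Since the snippet looks like a standard .properties file, one option is to let java.util.Properties decode the escapes. This is only a sketch under that assumption, and EscapedPropertiesDemo and parse are made-up names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class EscapedPropertiesDemo {
    // Parses key=value text; Properties.load turns backslash-u escapes into real characters.
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Stand-in for the file content: the doubled backslash keeps the escape
        // as plain text in the string, exactly as it is stored on disk.
        String content = "common.search=Vyhlada\\u0165\ncommon.yes=\\u00c1no\n";
        Properties props = parse(content);
        System.out.println(props.getProperty("common.search"));
        System.out.println(props.getProperty("common.yes"));
    }
}
```

For a real file, props.load(new FileInputStream(file)) should behave the same way, since the stream variant of load reads ISO-8859-1 and decodes the same escapes. Whether the text area then shows the characters still depends on the font in use.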

OK, I know I should not be reopening an old thread, but I was once again looking at a similar problem and this time I figured out a solution.

To get text written with Unicode escapes converted to readable Slovak I had to use

native2ascii -encoding UTF-16 -reverse resources_sk.properties utf8.properties

However, if I have a file already written in Slovak and need to convert it to the \uXXXX-escaped form associated with the "properties" file type used in many Java applications, I need to use

native2ascii -encoding UTF-16 utf8.properties toUtf8.properties
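For anyone without the tool at hand, the two directions can be approximated in plain Java. This is an illustration of the idea only, not how native2ascii is actually implemented. UnicodeEscapes and its methods are hypothetical names, and the sketch ignores corner cases such as a doubled backslash in front of the u.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapes {
    // Forward direction: replace every non-ASCII char with a backslash-u escape.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c > 0x7F) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Reverse direction: turn backslash-u escapes back into characters.
    static String unescape(String text) {
        Matcher m = Pattern.compile("\\\\u([0-9a-fA-F]{4})").matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String slovak = "common.search=Vyhlada\u0165";
        String escaped = escape(slovak);
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals(slovak));
    }
}
```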

Shame I cannot award myself some rep points :twisted:

Now problem is solved :)

To finally finish this topic I enclose a short program, which was created as a side effect of working on this theme:

package peter_budo;

import java.util.ArrayList;
import java.util.List;

/**
 *
 * @author j3c 17 March 2009
 * Created on 2010-09-09, 12:01:11
 */
public class DetectUnicodeBlocks {

    //static final int LAST_UNICODE = 0x10FFFF;//1114112; j++) {//
    private List<ItemBlock> blockList = new ArrayList<ItemBlock>();
    private boolean DEBUG;

    public DetectUnicodeBlocks() {
        detectUnicodeBlocks();
        if (DEBUG) {
            for (ItemBlock inner : blockList) {
                System.out.println(inner);
            }
        }
    }

    public String[] getModel() {
        String[] model = new String[blockList.size()];
        int i = 0;
        for (ItemBlock inner : blockList) {
            model[i++] = inner.name;
        }
        return model;
    }

    public String[] getModelDescription() {
        String[] model = new String[blockList.size()];
        int i = 0;
        for (ItemBlock inner : blockList) {
            model[i++] = inner.toString();
        }
        return model;
    }

    int getFrom(String in) {
        for (ItemBlock inner : blockList) {
            if (in.equals(inner.name)) {
                return inner.from;
            }
        }
        return -1;
    }

    int getTo(String in) {
        for (ItemBlock inner : blockList) {
            if (in.equals(inner.name)) {
                return inner.to;
            }
        }
        return -1;
    }

    String getName(String in) {
        for (ItemBlock inner : blockList) {
            if (in.equals(inner.name)) {
                return inner.toString();
            }
        }
        return null;
    }

    class ItemBlock {

        String name;
        int from;
        int to;
        //Character ch;

        public ItemBlock(String name, int from, int to) {//, Character ch
            this.name = name;
            this.from = from;
            this.to = to;
            //this.ch = ch;
            if (DEBUG) {
                System.out.println("*" + toString());
            }
        }

        @Override
        public String toString() {
            return (name + "[" + Integer.toHexString(from) + "-" + Integer.toHexString(to) + "]" + (1 + to - from));// + "\t" + ch
        }
    }

    //18iii09j3c
    //transitions of Character.UnicodeBlock previous-current
    //five base transitions (! NO four)
    //A-A
    //A-null
    //A-B 5th.
    //null-A
    //null-null
    private void detectUnicodeBlocks() {
        Character.UnicodeBlock previous = null;//Character.UnicodeBlock.of(1114111);//start from the last permitted code point
        Character.UnicodeBlock current = null;
        String name = "";
        int from = 0;
        int to = -1;
        //Character chCounter = '\u0000';
        //Character ch = null;
        //
        for (int j = Character.MIN_CODE_POINT; j < Character.MAX_CODE_POINT; j++) {//'<' stops at 0x10FFFE, so the last code point 0x10FFFF is not visited
            current = Character.UnicodeBlock.of(j);
            if (previous == null) {
                if (current == null) {
                    // System.out.println("NN " + Integer.toHexString(j));
                } else {
                    if (to != -1) { // skip the first, invalid entry
                        blockList.add(new ItemBlock(name, from, to));//1   //, ch
                    }
                    if (DEBUG) {
                        System.out.println("NA " + Integer.toHexString(j) + "\t" + current);
                    }
                    name = current.toString();
                    from = j;
                    //ch = chCounter;
                }
            } else {
                if (current == null) {
                    //System.out.println("AN " + Integer.toHexString(j));
                } else {
                    if (current.equals(previous)) {
                        //System.out.println("AA " + Integer.toHexString(j));
                        to = j;//refreshed on every step
                    } else {
                        blockList.add(new ItemBlock(name, from, to));//2  //, ch
                        if (DEBUG) {
                            System.out.println("AB " + Integer.toHexString(j) + "\t" + current);
                        }
                        name = current.toString();
                        from = j;
                        //ch = chCounter;
                    }
                }
                //
                // System.out.println("AA"+j);
            }
            previous = current;
            //chCounter++;//
        }
        blockList.add(new ItemBlock(name, from, to));//3 //, ch
    }

    public static void main(String[] args) {
        DetectUnicodeBlocks detectUnicodeBlocks = new DetectUnicodeBlocks();
        String[] model = detectUnicodeBlocks.getModel();
        int k = 0;
        for (String inner : model) {
            System.out.println(k++ + "." + inner);
        }
//
        String[] modelDescr = detectUnicodeBlocks.getModelDescription();
        int j = 0;
        for (String inner : modelDescr) {
            System.out.println(j++ + "." + inner);
        }
    }
}
/* R E S U L T
run:
0.BASIC_LATIN[0-7f]128
1.LATIN_1_SUPPLEMENT[80-ff]128
2.LATIN_EXTENDED_A[100-17f]128
3.LATIN_EXTENDED_B[180-24f]208
4.IPA_EXTENSIONS[250-2af]96
5.SPACING_MODIFIER_LETTERS[2b0-2ff]80
6.COMBINING_DIACRITICAL_MARKS[300-36f]112
7.GREEK[370-3ff]144
8.CYRILLIC[400-4ff]256
9.CYRILLIC_SUPPLEMENTARY[500-52f]48
10.ARMENIAN[530-58f]96
11.HEBREW[590-5ff]112
12.ARABIC[600-6ff]256
13.SYRIAC[700-74f]80
14.ARABIC_SUPPLEMENT[750-77f]48
15.THAANA[780-7bf]64
16.NKO[7c0-7ff]64
17.DEVANAGARI[900-97f]128
18.BENGALI[980-9ff]128
19.GURMUKHI[a00-a7f]128
20.GUJARATI[a80-aff]128
21.ORIYA[b00-b7f]128
22.TAMIL[b80-bff]128
23.TELUGU[c00-c7f]128
24.KANNADA[c80-cff]128
25.MALAYALAM[d00-d7f]128
26.SINHALA[d80-dff]128
27.THAI[e00-e7f]128
28.LAO[e80-eff]128
29.TIBETAN[f00-fff]256
30.MYANMAR[1000-109f]160
31.GEORGIAN[10a0-10ff]96
32.HANGUL_JAMO[1100-11ff]256
33.ETHIOPIC[1200-137f]384
34.ETHIOPIC_SUPPLEMENT[1380-139f]32
35.CHEROKEE[13a0-13ff]96
36.UNIFIED_CANADIAN_ABORIGINAL_SYLLABICS[1400-167f]640
37.OGHAM[1680-169f]32
38.RUNIC[16a0-16ff]96
39.TAGALOG[1700-171f]32
40.HANUNOO[1720-173f]32
41.BUHID[1740-175f]32
42.TAGBANWA[1760-177f]32
43.KHMER[1780-17ff]128
44.MONGOLIAN[1800-18af]176
45.LIMBU[1900-194f]80
46.TAI_LE[1950-197f]48
47.NEW_TAI_LUE[1980-19df]96
48.KHMER_SYMBOLS[19e0-19ff]32
49.BUGINESE[1a00-1a1f]32
50.BALINESE[1b00-1b7f]128
51.SUNDANESE[1b80-1bbf]64
52.LEPCHA[1c00-1c4f]80
53.OL_CHIKI[1c50-1c7f]48
54.PHONETIC_EXTENSIONS[1d00-1d7f]128
55.PHONETIC_EXTENSIONS_SUPPLEMENT[1d80-1dbf]64
56.COMBINING_DIACRITICAL_MARKS_SUPPLEMENT[1dc0-1dff]64
57.LATIN_EXTENDED_ADDITIONAL[1e00-1eff]256
58.GREEK_EXTENDED[1f00-1fff]256
59.GENERAL_PUNCTUATION[2000-206f]112
60.SUPERSCRIPTS_AND_SUBSCRIPTS[2070-209f]48
61.CURRENCY_SYMBOLS[20a0-20cf]48
62.COMBINING_MARKS_FOR_SYMBOLS[20d0-20ff]48
63.LETTERLIKE_SYMBOLS[2100-214f]80
64.NUMBER_FORMS[2150-218f]64
65.ARROWS[2190-21ff]112
66.MATHEMATICAL_OPERATORS[2200-22ff]256
67.MISCELLANEOUS_TECHNICAL[2300-23ff]256
68.CONTROL_PICTURES[2400-243f]64
69.OPTICAL_CHARACTER_RECOGNITION[2440-245f]32
70.ENCLOSED_ALPHANUMERICS[2460-24ff]160
71.BOX_DRAWING[2500-257f]128
72.BLOCK_ELEMENTS[2580-259f]32
73.GEOMETRIC_SHAPES[25a0-25ff]96
74.MISCELLANEOUS_SYMBOLS[2600-26ff]256
75.DINGBATS[2700-27bf]192
76.MISCELLANEOUS_MATHEMATICAL_SYMBOLS_A[27c0-27ef]48
77.SUPPLEMENTAL_ARROWS_A[27f0-27ff]16
78.BRAILLE_PATTERNS[2800-28ff]256
79.SUPPLEMENTAL_ARROWS_B[2900-297f]128
80.MISCELLANEOUS_MATHEMATICAL_SYMBOLS_B[2980-29ff]128
81.SUPPLEMENTAL_MATHEMATICAL_OPERATORS[2a00-2aff]256
82.MISCELLANEOUS_SYMBOLS_AND_ARROWS[2b00-2bff]256
83.GLAGOLITIC[2c00-2c5f]96
84.LATIN_EXTENDED_C[2c60-2c7f]32
85.COPTIC[2c80-2cff]128
86.GEORGIAN_SUPPLEMENT[2d00-2d2f]48
87.TIFINAGH[2d30-2d7f]80
88.ETHIOPIC_EXTENDED[2d80-2ddf]96
89.CYRILLIC_EXTENDED_A[2de0-2dff]32
90.SUPPLEMENTAL_PUNCTUATION[2e00-2e7f]128
91.CJK_RADICALS_SUPPLEMENT[2e80-2eff]128
92.KANGXI_RADICALS[2f00-2fdf]224
93.IDEOGRAPHIC_DESCRIPTION_CHARACTERS[2ff0-2fff]16
94.CJK_SYMBOLS_AND_PUNCTUATION[3000-303f]64
95.HIRAGANA[3040-309f]96
96.KATAKANA[30a0-30ff]96
97.BOPOMOFO[3100-312f]48
98.HANGUL_COMPATIBILITY_JAMO[3130-318f]96
99.KANBUN[3190-319f]16
100.BOPOMOFO_EXTENDED[31a0-31bf]32
101.CJK_STROKES[31c0-31ef]48
102.KATAKANA_PHONETIC_EXTENSIONS[31f0-31ff]16
103.ENCLOSED_CJK_LETTERS_AND_MONTHS[3200-32ff]256
104.CJK_COMPATIBILITY[3300-33ff]256
105.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A[3400-4dbf]6592
106.YIJING_HEXAGRAM_SYMBOLS[4dc0-4dff]64
107.CJK_UNIFIED_IDEOGRAPHS[4e00-9fff]20992
108.YI_SYLLABLES[a000-a48f]1168
109.YI_RADICALS[a490-a4cf]64
110.VAI[a500-a63f]320
111.CYRILLIC_EXTENDED_B[a640-a69f]96
112.MODIFIER_TONE_LETTERS[a700-a71f]32
113.LATIN_EXTENDED_D[a720-a7ff]224
114.SYLOTI_NAGRI[a800-a82f]48
115.PHAGS_PA[a840-a87f]64
116.SAURASHTRA[a880-a8df]96
117.KAYAH_LI[a900-a92f]48
118.REJANG[a930-a95f]48
119.CHAM[aa00-aa5f]96
120.HANGUL_SYLLABLES[ac00-d7af]11184
121.HIGH_SURROGATES[d800-db7f]896
122.HIGH_PRIVATE_USE_SURROGATES[db80-dbff]128
123.LOW_SURROGATES[dc00-dfff]1024
124.PRIVATE_USE_AREA[e000-f8ff]6400
125.CJK_COMPATIBILITY_IDEOGRAPHS[f900-faff]512
126.ALPHABETIC_PRESENTATION_FORMS[fb00-fb4f]80
127.ARABIC_PRESENTATION_FORMS_A[fb50-fdff]688
128.VARIATION_SELECTORS[fe00-fe0f]16
129.VERTICAL_FORMS[fe10-fe1f]16
130.COMBINING_HALF_MARKS[fe20-fe2f]16
131.CJK_COMPATIBILITY_FORMS[fe30-fe4f]32
132.SMALL_FORM_VARIANTS[fe50-fe6f]32
133.ARABIC_PRESENTATION_FORMS_B[fe70-feff]144
134.HALFWIDTH_AND_FULLWIDTH_FORMS[ff00-ffef]240
135.SPECIALS[fff0-ffff]16
136.LINEAR_B_SYLLABARY[10000-1007f]128
137.LINEAR_B_IDEOGRAMS[10080-100ff]128
138.AEGEAN_NUMBERS[10100-1013f]64
139.ANCIENT_GREEK_NUMBERS[10140-1018f]80
140.ANCIENT_SYMBOLS[10190-101cf]64
141.PHAISTOS_DISC[101d0-101ff]48
142.LYCIAN[10280-1029f]32
143.CARIAN[102a0-102df]64
144.OLD_ITALIC[10300-1032f]48
145.GOTHIC[10330-1034f]32
146.UGARITIC[10380-1039f]32
147.OLD_PERSIAN[103a0-103df]64
148.DESERET[10400-1044f]80
149.SHAVIAN[10450-1047f]48
150.OSMANYA[10480-104af]48
151.CYPRIOT_SYLLABARY[10800-1083f]64
152.PHOENICIAN[10900-1091f]32
153.LYDIAN[10920-1093f]32
154.KHAROSHTHI[10a00-10a5f]96
155.CUNEIFORM[12000-123ff]1024
156.CUNEIFORM_NUMBERS_AND_PUNCTUATION[12400-1247f]128
157.BYZANTINE_MUSICAL_SYMBOLS[1d000-1d0ff]256
158.MUSICAL_SYMBOLS[1d100-1d1ff]256
159.ANCIENT_GREEK_MUSICAL_NOTATION[1d200-1d24f]80
160.TAI_XUAN_JING_SYMBOLS[1d300-1d35f]96
161.COUNTING_ROD_NUMERALS[1d360-1d37f]32
162.MATHEMATICAL_ALPHANUMERIC_SYMBOLS[1d400-1d7ff]1024
163.MAHJONG_TILES[1f000-1f02f]48
164.DOMINO_TILES[1f030-1f09f]112
165.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B[20000-2a6df]42720
166.CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT[2f800-2fa1f]544
167.TAGS[e0000-e007f]128
168.VARIATION_SELECTORS_SUPPLEMENT[e0100-e01ef]240
169.SUPPLEMENTARY_PRIVATE_USE_AREA_A[f0000-fffff]65536
170.SUPPLEMENTARY_PRIVATE_USE_AREA_B[100000-10fffe]65535
BUILD SUCCESSFUL (total time: 0 seconds)

 */

Useful for filling a combo box.
