Getting Hebrew page source code

Question

yosi501r 0 Newbie Poster

14 Years Ago

Hi everybody,

I want to get the HTML source code of a page like:
http://morfix.mako.co.il/default.aspx?q=connection&source=milon
(the page is in Hebrew)

But when I fetch a Hebrew page, the characters I get look like this:
� ׳�׳ ׳’׳�׳™ ׳¢׳‘׳¨׳™ ׳¢׳™׳‘׳¨׳™, ׳�׳™׳�׳•׳� ׳�׳ ׳’׳�׳™

I want to see the Hebrew as it is.

The current code:

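import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTextArea;

// The class declaration was missing from the post; the name is a placeholder.
public class PageSource {
    public static void main(String[] args) throws Exception {
        Scanner input = new Scanner(System.in);
        String a = input.next();
        URL yahoo = new URL(a);
        URLConnection yc = yahoo.openConnection();

        // No charset is passed here, so the bytes are decoded with the platform default.
        BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));

        String inputLine;

        JTextArea tt = new JTextArea();
        JFrame f = new JFrame();
        // The scroll pane wraps the text area (not the other way around).
        JScrollPane bar = new JScrollPane(tt);
        f.add(bar);
        f.setSize(800, 600);
        f.setVisible(true);

        while ((inputLine = in.readLine()) != null) {
            tt.append("\n" + inputLine);
        }

        a = input.next();
        System.out.print(a);
        in.close();
    }
}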

How can I solve this issue?

Any help will be appreciated!

java

All 5 Replies

mKorbel 274 Veteran Poster

14 Years Ago

If you are on Windows, you will run into a lot of problems with localization and character encoding. You only need to specify (maybe always) the correct code page (Charset) for files and streams.

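// Needs java.io.ByteArrayOutputStream, java.io.InputStream, java.nio.charset.Charset.
// conn is an already opened URLConnection.
String fileEncoding = System.getProperty("file.encoding");
String charEncoding = Charset.defaultCharset().name();
System.out.println("File Encoding: " + fileEncoding);
System.out.println("Char Encoding: " + charEncoding);
System.out.println("Available Charsets: " + Charset.availableCharsets());

// Read the raw bytes first, without decoding them.
ByteArrayOutputStream bos = new ByteArrayOutputStream();
InputStream in = conn.getInputStream();
int len;
byte[] buf = new byte[1024];
while ((len = in.read(buf)) != -1) {
  bos.write(buf, 0, len);
}
in.close();

// Decode the bytes with an explicit charset instead of the default.
charEncoding = "cp1250"; // Slovak code page
String ret = new String(bos.toByteArray(), charEncoding);
String ret1 = bos.toString(charEncoding); // same result as ret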

mKorbel 274 Veteran Poster

14 Years Ago

I think (I may be mistaken) that every File and Stream definition allows a second parameter for the character encoding. That covers pulling data from or pushing it to the web, getting or putting an HTML page, loading or saving file contents, and moving text to and from the GUI.

http://download.oracle.com/javase/tutorial/essential/io/index.html

http://www.java2s.com/Code/Java/File-Input-Output/CatalogFile-Input-Output.htm

Try searching Google for CharsetDecoder too.
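
As a rough illustration of the "second parameter" mentioned above: `InputStreamReader` accepts a charset as its second argument. The sketch below is not from the thread; the class name `DecodeDemo` and the helper `readAll` are made up for the example, and the bytes come from an in-memory array rather than a real page. It decodes the same UTF-8 bytes twice, once with the matching charset and once with the wrong one.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Hypothetical helper: reads every line, decoding bytes with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String hebrew = "\u05E9\u05DC\u05D5\u05DD"; // the word "shalom" in Hebrew letters
        byte[] bytes = hebrew.getBytes(StandardCharsets.UTF_8);

        // Same bytes, two different decoders.
        String right = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);

        System.out.println(right.trim().equals(hebrew)); // true
        System.out.println(wrong.trim().equals(hebrew)); // false: decoded as gibberish
    }
}
```

The point is only that the charset passed to the reader has to match the one the bytes were written in; which charset a given page actually uses has to be checked separately.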

yosi501r 0 Newbie Poster · Answer 1 · 2011-04-11T03:06:09+00:00

Thank you, but I don't know how to combine this with my code. I tried it but got errors.
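
One possible way to combine the suggestion with the original code, offered as a sketch rather than a confirmed fix: take the charset from the response's Content-Type header and pass it to `InputStreamReader`. The class name `HebrewPageReader` and the helper `charsetFrom` are hypothetical, the UTF-8 fallback is an assumption, and the sketch prints to the console instead of a `JTextArea` to stay short. A page may also declare its charset only in an HTML meta tag, which this does not handle.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class HebrewPageReader {
    // Hypothetical helper: extracts the charset from a Content-Type value such as
    // "text/html; charset=windows-1255"; returns the fallback if none is usable.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) throws Exception {
        URL url = new URL(args[0]); // page address passed on the command line
        URLConnection yc = url.openConnection();
        // Assumption: fall back to UTF-8 when the server names no charset.
        Charset cs = charsetFrom(yc.getContentType(), StandardCharsets.UTF_8);
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(yc.getInputStream(), cs))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

In the original program, the same `cs` value would go into the `InputStreamReader` constructor and the lines would be appended to the text area as before.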

yosi501r 0 Newbie Poster · Answer 2 · 2011-04-13T23:48:57+00:00

Still the same: for Hebrew sites I get gibberish.

mKorbel 274 Veteran Poster · Answer 3 · 2011-04-14T01:20:47+00:00

Don't be silly; rest assured that I was able to find it. Print the list:

System.out.println("Char Encoding: " + Charset.availableCharsets());

Then search (Google + h) and read some right-to-left characters. It works with both ISO-xxxx-x and windows-xxxx.

I'm out of this thread.
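
The full output of `Charset.availableCharsets()` is long, so a filter helps when looking for a candidate. The sketch below is an illustration only; `CharsetFinder` and `matching` are made-up names. ISO-8859-8 and windows-1255 are the Hebrew members of the two families mentioned above, but whether they are installed depends on the Java runtime, and whether the page in question uses either of them is an assumption to verify.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CharsetFinder {
    // Hypothetical helper: canonical names of installed charsets whose name
    // contains the given fragment, ignoring case.
    static List<String> matching(String fragment) {
        String needle = fragment.toLowerCase(Locale.ROOT);
        List<String> names = new ArrayList<>();
        for (String name : Charset.availableCharsets().keySet()) {
            if (name.toLowerCase(Locale.ROOT).contains(needle)) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Output depends on which charsets this runtime ships with.
        System.out.println(matching("8859"));
        System.out.println(matching("windows-125"));
    }
}
```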
