lostpenan 0 Newbie Poster

I would like to read an HTML file that uses an Asian encoding (GB2312), extract specific strings from it, and copy them into a new HTML file. But when I try to read the HTML file, I get gibberish. I'm using Python 2.4.3, and I know it supports GB2312, but I couldn't get it to work.
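For context, gibberish of this kind usually means the raw GB2312 bytes are being printed or written without ever being decoded to Unicode text. Below is a minimal sketch of the decode/encode round trip, not a confirmed fix for this exact file. The function names `read_html` and `write_html` are hypothetical, and the default codec name is taken from the page's `charset=gb2312` meta tag. It avoids newer syntax so that it should also suit older interpreters, but it has not been checked on 2.4.3.

```python
def read_html(path, encoding="gb2312"):
    """Return the file's contents as decoded (Unicode) text."""
    f = open(path, "rb")              # read raw bytes, no implicit decoding
    try:
        data = f.read()
    finally:
        f.close()
    # Decode explicitly. If this raises UnicodeDecodeError, the page may
    # contain characters outside GB2312; the superset codecs "gbk" or
    # "gb18030" are worth trying in that case.
    return data.decode(encoding)


def write_html(path, text, encoding="gb2312"):
    """Encode Unicode text and write it out as bytes."""
    f = open(path, "wb")
    try:
        f.write(text.encode(encoding))
    finally:
        f.close()
```

The point of working in binary mode is that all string processing happens on decoded text, and the bytes are only produced again at the moment of writing. Text that looks wrong when printed to a console may still be decoded correctly, since the console itself has to support the characters.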

Here's a snippet of the HTML file...

<!---->
<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "http://www.w3.org/tr/xhtml1/dtd/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<title>冰鱼BT发布页面</title>
<style type="text/css">
<!--
body {
margin-left: 0px;
margin-top: 0px;
margin-right: 0px;
margin-bottom: 0px;
font-family: Tahoma, "宋体", Verdana;
font-size: 12px;
}
.
.
.
.
<!---->
<tr style="background-color:#ECECEC;" onMouseOver="this.style.backgroundColor='#ffcc00';return true;" onMouseOut="this.style.backgroundColor='#ECECEC';">
<td align="center" height="22" class="lista">综艺</td>
<!---->
<td class="lista"><a href="/icefish/15028.htm" target="_blank"><font color=red>【冰鱼出品】周日八点党 2006-08-13</font></a>
<!---->
<a href="http://bbs2.icefish.org/read.php?tid=127207" target="_blank"><font color=green>相关讨论</font></a>&nbsp;<font color='#FF0033'><a href="http://fby.mdbchina.com/getSearchName.asp?sch=周日八点党" target="_blank"><font color='#FF0033'>相关资料</font></a></font>
<!---->
</td>
<td align="center" class="lista">816.12 MB</td>
<td align="center" width="55" class="lista"><img src="/digiimg/15028s.gif" width='52' height='22' border='0'></td>
<td align="center" width="55" class="lista"><img src="/digiimg/15028l.gif" width='52' height='22' border='0'></td>
<td align="center" class="lista"><a href=/download.php?url=torrents/38d788b6bd2f8be101c8663bc68e99e5c8549816.btf&f=%A1%BE%B1%F9%D3%E3%B3%F6%C6%B7%A1%BF%D6%DC%C8%D5%B0%CB%B5%E3%B5%B3%A1%A12006-08-13.torrent&fid=15028 onclick="sAD()">下载</a></td>
<!---->
.
.
.

Thanks in advance for any suggestions and walkthroughs.
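As a rough illustration of the extraction step, here is a sketch written for Python 3 that pulls the title links out of already-decoded text and builds a small new page. The regular expression is an assumption based only on the row markup in the snippet above, and `extract_entries`, `decode_filename`, and `build_page` are hypothetical names. On Python 2.4 the percent-decoding would instead be `urllib.unquote(...)` followed by `.decode("gb2312")`.

```python
import re
from urllib.parse import unquote

# Assumed pattern: matches the title cell as it appears in the snippet,
# e.g. <td class="lista"><a href="..." ...><font color=red>TITLE</font></a>
TITLE_LINK = re.compile(
    r'<td class="lista"><a href="([^"]+)"[^>]*>'
    r'<font color=red>(.*?)</font></a>'
)


def extract_entries(html_text):
    """Return a list of (href, title) pairs found in decoded HTML text."""
    return TITLE_LINK.findall(html_text)


def decode_filename(quoted, encoding="gb2312"):
    """Decode a percent-encoded value such as the f= parameter in the
    download link, whose bytes are GB2312 rather than UTF-8."""
    return unquote(quoted, encoding=encoding)


def build_page(entries, charset="gb2312"):
    """Build a minimal HTML page listing the extracted links.

    Titles are inserted unchanged because they were taken from HTML
    source and are therefore already in HTML-escaped form.
    """
    rows = ['<li><a href="%s">%s</a></li>' % (href, title)
            for href, title in entries]
    return (
        '<html><head>'
        '<meta http-equiv="Content-Type" content="text/html; charset=%s" />'
        '</head><body><ul>\n%s\n</ul></body></html>'
        % (charset, "\n".join(rows))
    )
```

The string returned by `build_page` is still Unicode text, so it needs to be encoded with the same codec named in its meta tag (here `gb2312`) when it is written to disk. A regular expression is brittle against markup changes; an HTML parser would be the sturdier choice if the page layout varies.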
