Hi,
I am trying to read an HTML page, convert it to XML, and write the content to a text file on the local drive. The code below reads the HTML page and converts it:
def cleaner = new HtmlCleaner()
def node = cleaner.clean(address)
// Convert from HTML to XML
def props = cleaner.getProperties()
def serializer = new SimpleXmlSerializer(props)
def xml = serializer.getXmlAsString(node)
// Parse the XML into a document we can work with
return new XmlSlurper(false,false).parseText(xml)
The code below writes the result to a local text file:
static writeXml(page, fname) {
    def d1 = new File(base + '/' + fname).parentFile
    d1.mkdirs()
    def fw = new FileWriter(base + '/' + fname)
    groovy.xml.XmlUtil.serialize(page, fw)
    fw.close()
}
The problem I'm having right now is that the &nbsp; entities in the HTML page show up as ? once the page has been converted to XML and written out. What I want to do is replace every &nbsp; with an empty string ''. How do I get access to them in the HTML and replace them?
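To make the goal concrete, here is a minimal sketch of the replacement I am after. It rests on an assumption I have not confirmed: that the ? is the non-breaking space character U+00A0 (what &nbsp; decodes to) being written in an encoding that cannot represent it. The helper name stripNbsp is hypothetical and is not part of HtmlCleaner or Groovy.

```groovy
// Hypothetical helper: removes non-breaking spaces from a serialized XML string.
// Assumption: the '?' in the output file is U+00A0 or one of its entity forms.
String stripNbsp(String xml) {
    xml.replace('\u00A0', '')   // the decoded non-breaking space character
       .replace('&nbsp;', '')   // the named entity, if it survives cleaning
       .replace('&#160;', '')   // the numeric entity form
}

def sample = '<p>Hello\u00A0world&nbsp;!</p>'
assert stripNbsp(sample) == '<p>Helloworld!</p>'
```

If that assumption holds, I would call something like this on the xml string before handing it to XmlSlurper.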
I would appreciate a reply.
Thanks in advance.