Hi Everyone, So i have been trying to figure this out for ages. Im just not able to to get the algorithm straight.

Say I am given a string containing markup tags ex. <html> </html> in a string like this:

<html>
<head> <title> Example </title> </head>
<body> quiet <CAPSLOCK> LOUD </CAPSLOCK> quiet </body>
</html>

I am trying to get python to return a list of nested lists embedding each tag and the contents inbetween it in a new nested list. so for the example above it would be

[['<html>', ['<head>', ['<title>', 'Example', '</title>'], '</head>'], ['<body>', 'quiet', ['<CAPSLOCK>', 'LOUD', '</CAPSLOCK>'], 'quiet', '</body>'], '</html>']] .

I know i have to use string.split to seperate my string by whitespaces. but have no idea where to go from there. I am a python newbie. Any help would be greatly appreciated!

Thanks in advanced

In an ideal world your html code could look like this ...

<html>
<head>
<title>My Title</title>
</head>
<body>
<b>Hello there</b>
</body>
</html>

However you may as well expect this ...

<html>
<Head>
	<title>My Title</title>
</head>

<body bgcolor="Yellow" text="#6600FF">

<b>Hello there</b>

</BODY>
</Html>

Putting this in a list of lists would be a major project.

There is module BeautifulSoup available that could help you a lot, take a look.

Edited 7 Years Ago by vegaseat: n/a

Looks like you're gonna have to program a HTML parser, not the funest but I've done something similar. You're gonna have to loop through and take every thing that starts with < and copy that and everything between the closing including the > into a object in a list, everything that is not surrounded by <> then put it in the list as its own object, when i get home i can write a small sample if you'd like.

ps if you split by whitespace and you have this "<P>hey, whats up</P>", then your going to get a list like this

This article has been dead for over six months. Start a new discussion instead.