import urllib2
from BeautifulSoup import BeautifulSoup

data = urllib2.urlopen('http://www.NotAvalidURL.com').read()
soup = BeautifulSoup(data)

# soup(...) is shorthand for soup.findAll(...)
table = soup("tr", {'class': 'index_table_in'})

print table[0]

the result is:

<tr id="index_table_12345" class="index_table_in">
<td><a href="/info/12345">string 1</a></td>
<td><a href="/info/12345">string 2</a></td>
<td><a href="/info/12345">string 3</a></td>
<td><a href="/info/12345">string 4</a></td>
<!--td></td-->
</tr>

The goal is to get only the strings and the index_table_12345 id, in separate variables, so I can work with them afterwards.

So far I haven't been able to do so; the class documentation is pretty dense...

... any suggestions?


thank you!
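For reference, the same extraction can be sketched with nothing but the standard library. This is a rough illustration using Python 3's `html.parser` rather than the BeautifulSoup the thread uses; the `RowParser` class and all names here are made up for the example:

```python
from html.parser import HTMLParser

class RowParser(HTMLParser):
    """Collect the row id and the anchor texts from one table row."""
    def __init__(self):
        super().__init__()
        self.row_id = None
        self.strings = []
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'tr' and attrs.get('class') == 'index_table_in':
            self.row_id = attrs.get('id')   # e.g. index_table_12345
        elif tag == 'a':
            self._in_a = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_a = False

    def handle_data(self, data):
        if self._in_a:                      # only text inside <a>...</a>
            self.strings.append(data)

SAMPLE = '''\
<tr id="index_table_12345" class="index_table_in">
<td><a href="/info/12345">string 1</a></td>
<td><a href="/info/12345">string 2</a></td>
<td><a href="/info/12345">string 3</a></td>
<td><a href="/info/12345">string 4</a></td>
</tr>'''

p = RowParser()
p.feed(SAMPLE)
print(p.row_id)   # index_table_12345
print(p.strings)  # ['string 1', 'string 2', 'string 3', 'string 4']
```

This gets both pieces the question asks for (the row id and the strings) into separate variables, without any third-party dependency.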


Something like this; if you want string 1, string 2, ..., just iterate over the content.

from BeautifulSoup import BeautifulSoup

html = '''\
<tr id="index_table_12345" class="index_table_in">
<td><a href="/info/12345">string 1</a></td>
<td><a href="/info/12345">string 2</a></td>
<td><a href="/info/12345">string 3</a></td>
<td><a href="/info/12345">string 4</a></td>
<!--td></td--></tr>'''

soup = BeautifulSoup(html)
tag = soup.findAll('td') # all <td> tags in a list
tag_a = tag[0].find('a')

print tag_a.text #string 1
print tag_a['href'] #/info/12345



Thanks man, I'm trying your suggestion plus some more data to test:

from BeautifulSoup import BeautifulSoup

html = '''\
<tr id="index_table_12345" class="index_table_in">
<td><a href="/info/12345">string 1</a></td>
<td><a href="/info/12345">string 2</a></td>
<td><a href="/info/12345">string 3</a></td>
<td><a href="/info/12345">string 4</a></td>
<!--td></td--></tr>
<tr id="index_table_12346" class="index_table_in">
<td><a href="/info/12346">string 5</a></td>
<td><a href="/info/12346">string 6</a></td>
<td><a href="/info/12346">string 7</a></td>
<td><a href="/info/12346">string 8</a></td>
<!--td></td--></tr>'''

soup = BeautifulSoup(html)
tag = soup.findAll('td') #all "td" tag in a list
#print tag
for id, tg in enumerate(tag):# i want to go through each piece of TR and print out the values
    tag_a = tg[id].find('a')
    for st in tag_a: #to get string 1, string 2, etc
        print st.text[0] #string 1
        print st.text[1] #string 2
        print st.text[2] #string 3
        print st['href'] #/info/12345

I get:

Traceback (most recent call last):
File "/Users/johnb/Documents/testing/testing2.py", line 21, in <module>
tag_a = tg[id].find('a')
File "build/bdist.macosx-10.7-intel/egg/BeautifulSoup.py", line 601, in __getitem__
KeyError: 0


some tests

>>> soup = BeautifulSoup(html)
>>> tag = soup.findAll('td') #all "td" tag in a list
>>> for id, tg in enumerate(tag):
...     print id, tg
... 
0 <td><a href="/info/12345">string 1</a></td>
1 <td><a href="/info/12345">string 2</a></td>
2 <td><a href="/info/12345">string 3</a></td>
3 <td><a href="/info/12345">string 4</a></td>
4 <td><a href="/info/12346">string 5</a></td>
5 <td><a href="/info/12346">string 6</a></td>
6 <td><a href="/info/12346">string 7</a></td>
7 <td><a href="/info/12346">string 8</a></td>
>>> for id, tg in enumerate(tag):
...     tag_a = tg[id].find['a']
...     print tag_a.text
...     print tag_a['href']
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "build/bdist.macosx-10.7-intel/egg/BeautifulSoup.py", line 601, in __getitem__
KeyError: 0
>>> type(tag)
<type 'list'>
>>> tag
[<td><a href="/info/12345">string 1</a></td>, <td><a href="/info/12345">string 2</a></td>, <td><a href="/info/12345">string 3</a></td>, <td><a href="/info/12345">string 4</a></td>, <td><a href="/info/12346">string 5</a></td>, <td><a href="/info/12346">string 6</a></td>,

Not sure why it isn't getting the data correctly as it should. Any pointers?

thanks!

from BeautifulSoup import BeautifulSoup

html = '''\
<tr id="index_table_12345" class="index_table_in">
<td><a href="/info/12345">string 1</a></td>
<td><a href="/info/12345">string 2</a></td>
<td><a href="/info/12345">string 3</a></td>
<td><a href="/info/12345">string 4</a></td>
<!--td></td--></tr>'''

soup = BeautifulSoup(html)
tag = soup.findAll('a') # all <a> tags in a list

print [i.text for i in tag] #[u'string 1', u'string 2', u'string 3', u'string 4']
print [i['href'] for i in tag] #[u'/info/12345', u'/info/12345', u'/info/12345', u'/info/12345']
#The code above uses list comprehensions

#As an ordinary loop it looks like this, without appending to a list
for i in tag:
    print i.text
'''Out-->
string 1
string 2
string 3
string 4
'''
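The two list comprehensions above can also be zipped together so each string stays paired with its href, instead of keeping two parallel lists. A plain-Python sketch, where the literal lists stand in for what the comprehensions return:

```python
# stand-ins for [i.text for i in tag] and [i['href'] for i in tag]
texts = [u'string 1', u'string 2', u'string 3', u'string 4']
hrefs = [u'/info/12345'] * 4

# pair each string with its link
pairs = list(zip(texts, hrefs))
print(pairs[0])  # ('string 1', '/info/12345')
```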



Thank you both for your suggestions. I managed to get it working on my dev box (Windows XP SP2), but when I put the same script on a Debian box with Python 2.5.2 it doesn't seem to work. The code is:

from BeautifulSoup import BeautifulSoup

html = '''\
<tr id="index_table_12345" class="index_table_in">
<td><a href="/info/12345">string 1</a></td>
<td><a href="/info/12345">string 2</a></td>
<td><a href="/info/12345">string 3</a></td>
<td><a href="/info/12345">string 4</a></td>
<!--td></td--></tr>
<tr id="index_table_12346" class="index_table_in">
<td><a href="/info/12346">string 5</a></td>
<td><a href="/info/12346">string 6</a></td>
<td><a href="/info/12346">string 7</a></td>
<td><a href="/info/12346">string 8</a></td>
<!--td></td--></tr>'''

soup = BeautifulSoup(html)
tag = soup.findAll('a') #all "a" tag in a list

#filename = "/root/dsi/dsi_secureless.txt"
#FILE = open(filename,"w")
count = 0
passx = 0
#As an ordinary loop, grouping every 4 <a> tags into one record
for i in tag:
	if count > 3:
		print "-------------------------------"
		#FILE.write("-------------------------------" + "\n")
		count = 0
		passx = 0
	if passx == 0:
		print i['href']
		#FILE.write(i['href'] + "\n")
		passx = 1
	print i.text
	#FILE.write(i.text + "\n")
	count = count + 1

#FILE.close()

debian linux output:

debian:~/dsi# python cra3.py
/info/12345
None
None
None
None
-------------------------------
/info/12346
None
None
None
None
debian:~/testing# python
Python 2.5.2 (r252:60911)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.


All the None lines are supposed to be the output of print i.text.

I thought Python was platform independent, but it seems I need to modify the script a little to make it work on Linux?


What are you trying to do? Explain better what you want to count.
Your code doesn't make much sense right now.


Well, basically the webpage is structured as table > tr > td, and that's where the data I want to extract lives.

So with the code above I get rid of the first 4 items, which give no useful info. Once I've done that I want to take every item, grouped in fours (1 tr x 4 tds = 1 record), and write it to a file.

like this:

C:\>python cra3.py
/info/12345
string 1
string 2
string 3
string 4
-------------------------------
/info/12346
string 5
string 6
string 7
string 8


The code above is actually working, but on XP only. I'd say there is something I need to change to make it work on Python 2.5; on the XP box I have 2.7.


thanks again
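The grouping described above (one href followed by its four strings) can be sketched independently of BeautifulSoup. Here the flat list of (href, text) pairs is a hypothetical stand-in for what iterating over the findAll('a') result would yield:

```python
# hypothetical flat extraction result, in document order
links = [
    ('/info/12345', 'string 1'), ('/info/12345', 'string 2'),
    ('/info/12345', 'string 3'), ('/info/12345', 'string 4'),
    ('/info/12346', 'string 5'), ('/info/12346', 'string 6'),
    ('/info/12346', 'string 7'), ('/info/12346', 'string 8'),
]

# group the texts under their href, preserving first-seen order
records = {}
order = []
for href, text in links:
    if href not in records:
        records[href] = []
        order.append(href)
    records[href].append(text)

for href in order:
    print(href)
    for text in records[href]:
        print(text)
    print('-' * 31)
```

Grouping by href avoids the manual count/pass flags entirely, as long as each record's links really do share one href.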


so I got it working:

import urllib2
from BeautifulSoup import BeautifulSoup

data = urllib2.urlopen('http://').read()
soup = BeautifulSoup(data)
tag = soup.findAll('a') #all "a" tag in a list

filename = "est.txt"
FILE = open(filename,"w")

count = 0
linked = 0

for i in tag:
	if count > 3:
		FILE.write("-------------------------------\n")# item separator
		count = 0
		linked = 0
	if "vulnerability" in i['href']:
		if linked == 0:#making sure the link gets printed only once.
			FILE.write("http://test" + i['href'] + "\n")
			linked = 1
		a = str(i).strip().split(">")[1]
		b = str(a).strip().split("<")[0]
		FILE.write(b +"\n")
	count += 1		

FILE.close()
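For the record, the split trick in the script above can be isolated as a tiny helper. It works only as long as the anchor text itself contains no '<' or '>' (illustrative sketch):

```python
def inner_text(tag_str):
    # take whatever sits between the first '>' and the following '<',
    # mirroring str(i).strip().split(">")[1].split("<")[0] in the script
    return tag_str.strip().split(">")[1].split("<")[0]

print(inner_text('<a href="/info/12346">string 5</a>'))  # string 5
```

Because this sidesteps the Tag.text attribute entirely, it behaves the same on both BeautifulSoup versions involved here, which is presumably why it fixed the Debian box.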