import urllib2
from BeautifulSoup import BeautifulSoup

data = urllib2.urlopen('http://www.NotAvalidURL.com').read()
soup = BeautifulSoup(data)

# soup(...) is shorthand for soup.findAll(...)
table = soup("tr", {'class': 'index_table_in'})

print table[0]

the result is:

<tr id="index_table_12345" class="index_table_in">
<td><a href="/info/12345">string 1</a></td>
<td><a href="/info/12345">string 2</a></td>
<td><a href="/info/12345">string 3</a></td>
<td><a href="/info/12345">string 4</a></td>
<!--td></td-->
</tr>

The goal is to get only the strings and the index_table_12345 id, in separate variables, so I can work with them afterwards.

So far I haven't been able to do so; the class documentation is pretty dense...

... any suggestions?


thank you!
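For reference, the same extraction can be sketched with nothing but the standard library. This is a rough illustration using Python 3's `html.parser` rather than the BeautifulSoup the thread uses; the `RowParser` class and all names here are made up for the example:

```python
from html.parser import HTMLParser

class RowParser(HTMLParser):
    """Collect the row id and the anchor texts from one table row."""
    def __init__(self):
        super().__init__()
        self.row_id = None
        self.strings = []
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'tr' and attrs.get('class') == 'index_table_in':
            self.row_id = attrs.get('id')   # e.g. index_table_12345
        elif tag == 'a':
            self._in_a = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_a = False

    def handle_data(self, data):
        if self._in_a:                      # only text inside <a>...</a>
            self.strings.append(data)

SAMPLE = '''\
<tr id="index_table_12345" class="index_table_in">
<td><a href="/info/12345">string 1</a></td>
<td><a href="/info/12345">string 2</a></td>
<td><a href="/info/12345">string 3</a></td>
<td><a href="/info/12345">string 4</a></td>
</tr>'''

p = RowParser()
p.feed(SAMPLE)
print(p.row_id)   # index_table_12345
print(p.strings)  # ['string 1', 'string 2', 'string 3', 'string 4']
```

This gets both pieces the question asks for (the row id and the strings) into separate variables, without any third-party dependency.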


Something like this; if you want string 1, string 2, ..., just iterate over the content.

from BeautifulSoup import BeautifulSoup

html = '''\
<tr id="index_table_12345" class="index_table_in">
<td><a href="/info/12345">string 1</a></td>
<td><a href="/info/12345">string 2</a></td>
<td><a href="/info/12345">string 3</a></td>
<td><a href="/info/12345">string 4</a></td>
<!--td></td--></tr>'''

soup = BeautifulSoup(html)
tag = soup.findAll('td') # all <td> tags in a list
tag_a = tag[0].find('a')

print tag_a.text #string 1
print tag_a['href'] #/info/12345



Thanks man, I'm trying your suggestion plus some more data to test:

from BeautifulSoup import BeautifulSoup

html = '''\
<tr id="index_table_12345" class="index_table_in">
<td><a href="/info/12345">string 1</a></td>
<td><a href="/info/12345">string 2</a></td>
<td><a href="/info/12345">string 3</a></td>
<td><a href="/info/12345">string 4</a></td>
<!--td></td--></tr>
<tr id="index_table_12346" class="index_table_in">
<td><a href="/info/12346">string 5</a></td>
<td><a href="/info/12346">string 6</a></td>
<td><a href="/info/12346">string 7</a></td>
<td><a href="/info/12346">string 8</a></td>
<!--td></td--></tr>'''

soup = BeautifulSoup(html)
tag = soup.findAll('td') #all "td" tag in a list
#print tag
for id, tg in enumerate(tag):# i want to go through each piece of TR and print out the values
    tag_a = tg[id].find('a')
    for st in tag_a: #to get string 1, string 2, etc
        print st.text[0] #string 1
        print st.text[1] #string 2
        print st.text[2] #string 3
        print st['href'] #/info/12345

I get:

Traceback (most recent call last):
File "/Users/johnb/Documents/testing/testing2.py", line 21, in <module>
tag_a = tg[id].find('a')
File "build/bdist.macosx-10.7-intel/egg/BeautifulSoup.py", line 601, in __getitem__
KeyError: 0


some tests

>>> soup = BeautifulSoup(html)
>>> tag = soup.findAll('td') #all "td" tag in a list
>>> for id, tg in enumerate(tag):
...     print id, tg
... 
0 <td><a href="/info/12345">string 1</a></td>
1 <td><a href="/info/12345">string 2</a></td>
2 <td><a href="/info/12345">string 3</a></td>
3 <td><a href="/info/12345">string 4</a></td>
4 <td><a href="/info/12346">string 5</a></td>
5 <td><a href="/info/12346">string 6</a></td>
6 <td><a href="/info/12346">string 7</a></td>
7 <td><a href="/info/12346">string 8</a></td>
>>> for id, tg in enumerate(tag):
...     tag_a = tg[id].find['a']
...     print tag_a.text
...     print tag_a['href']
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "build/bdist.macosx-10.7-intel/egg/BeautifulSoup.py", line 601, in __getitem__
KeyError: 0
>>> type(tag)
<type 'list'>
>>> tag
[<td><a href="/info/12345">string 1</a></td>, <td><a href="/info/12345">string 2</a></td>, <td><a href="/info/12345">string 3</a></td>, <td><a href="/info/12345">string 4</a></td>, <td><a href="/info/12346">string 5</a></td>, <td><a href="/info/12346">string 6</a></td>,

Not sure why it isn't getting the data correctly as it should. Any pointers?

thanks!

from BeautifulSoup import BeautifulSoup

html = '''\
<tr id="index_table_12345" class="index_table_in">
<td><a href="/info/12345">string 1</a></td>
<td><a href="/info/12345">string 2</a></td>
<td><a href="/info/12345">string 3</a></td>
<td><a href="/info/12345">string 4</a></td>
<!--td></td--></tr>'''

soup = BeautifulSoup(html)
tag = soup.findAll('a') # all <a> tags in a list

print [i.text for i in tag] #[u'string 1', u'string 2', u'string 3', u'string 4']
print [i['href'] for i in tag] #[u'/info/12345', u'/info/12345', u'/info/12345', u'/info/12345']
#The code above uses list comprehensions

#As an ordinary loop it looks like this, without appending to a list
for i in tag:
    print i.text
'''Out-->
string 1
string 2
string 3
string 4
'''
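The two list comprehensions above can also be zipped together so each string stays paired with its href, instead of keeping two parallel lists. A plain-Python sketch, where the literal lists stand in for what the comprehensions return:

```python
# stand-ins for [i.text for i in tag] and [i['href'] for i in tag]
texts = [u'string 1', u'string 2', u'string 3', u'string 4']
hrefs = [u'/info/12345'] * 4

# pair each string with its link
pairs = list(zip(texts, hrefs))
print(pairs[0])  # ('string 1', '/info/12345')
```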



Thank you both for your suggestions. I managed to get it working on my dev box (Windows XP SP2), but when I put the same script on a Debian box with Python 2.5.2 it doesn't seem to work. The code is:

from BeautifulSoup import BeautifulSoup

html = '''\
<tr id="index_table_12345" class="index_table_in">
<td><a href="/info/12345">string 1</a></td>
<td><a href="/info/12345">string 2</a></td>
<td><a href="/info/12345">string 3</a></td>
<td><a href="/info/12345">string 4</a></td>
<!--td></td--></tr>
<tr id="index_table_12346" class="index_table_in">
<td><a href="/info/12346">string 5</a></td>
<td><a href="/info/12346">string 6</a></td>
<td><a href="/info/12346">string 7</a></td>
<td><a href="/info/12346">string 8</a></td>
<!--td></td--></tr>'''

soup = BeautifulSoup(html)
tag = soup.findAll('a') #all "a" tag in a list

#filename = "/root/dsi/dsi_secureless.txt"
#FILE = open(filename,"w")
count = 0
passx = 0
#As an ordinary loop, grouping every 4 <a> tags into one record
for i in tag:
	if count > 3:
		print "-------------------------------"
		#FILE.write("-------------------------------" + "\n")
		count = 0
		passx = 0
	if passx == 0:
		print i['href']
		#FILE.write(i['href'] + "\n")
		passx = 1
	print i.text
	#FILE.write(i.text + "\n")
	count = count + 1

#FILE.close()

debian linux output:

debian:~/dsi# python cra3.py
/info/12345
None
None
None
None
-------------------------------
/info/12346
None
None
None
None
debian:~/testing# python
Python 2.5.2 (r252:60911)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.


All the None lines are supposed to be the output of print i.text.

I thought Python was platform independent, but it seems I need to modify the script a little to make it work on Linux?


What are you trying to do? Explain better what you want to count.
Your code doesn't make much sense right now.


Well, basically the webpage is structured as table > tr > td, and that's where the data I want to extract lives.

So with the code above I get rid of the first 4 items, which give no useful info. Once I've done that I want to take every item, grouped in fours (1 tr x 4 tds = 1 record), and write it to a file.

like this:

C:\>python cra3.py
/info/12345
string 1
string 2
string 3
string 4
-------------------------------
/info/12346
string 5
string 6
string 7
string 8


The code above is actually working, but on XP only. I'd say there is something I need to change to make it work on Python 2.5; on the XP box I have 2.7.


thanks again
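The grouping described above (one href followed by its four strings) can be sketched independently of BeautifulSoup. Here the flat list of (href, text) pairs is a hypothetical stand-in for what iterating over the findAll('a') result would yield:

```python
# hypothetical flat extraction result, in document order
links = [
    ('/info/12345', 'string 1'), ('/info/12345', 'string 2'),
    ('/info/12345', 'string 3'), ('/info/12345', 'string 4'),
    ('/info/12346', 'string 5'), ('/info/12346', 'string 6'),
    ('/info/12346', 'string 7'), ('/info/12346', 'string 8'),
]

# group the texts under their href, preserving first-seen order
records = {}
order = []
for href, text in links:
    if href not in records:
        records[href] = []
        order.append(href)
    records[href].append(text)

for href in order:
    print(href)
    for text in records[href]:
        print(text)
    print('-' * 31)
```

Grouping by href avoids the manual count/pass flags entirely, as long as each record's links really do share one href.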


so I got it working:

import urllib2
from BeautifulSoup import BeautifulSoup

data = urllib2.urlopen('http://').read()
soup = BeautifulSoup(data)
tag = soup.findAll('a') #all "a" tag in a list

filename = "est.txt"
FILE = open(filename,"w")

count = 0
linked = 0

for i in tag:
	if count > 3:
		FILE.write("-------------------------------\n")# item separator
		count = 0
		linked = 0
	if "vulnerability" in i['href']:
		if linked == 0:#making sure the link gets printed only once.
			FILE.write("http://test" + i['href'] + "\n")
			linked = 1
		a = str(i).strip().split(">")[1]
		b = str(a).strip().split("<")[0]
		FILE.write(b +"\n")
	count += 1		

FILE.close()
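For the record, the split trick in the script above can be isolated as a tiny helper. It works only as long as the anchor text itself contains no '<' or '>' (illustrative sketch):

```python
def inner_text(tag_str):
    # take whatever sits between the first '>' and the following '<',
    # mirroring str(i).strip().split(">")[1].split("<")[0] in the script
    return tag_str.strip().split(">")[1].split("<")[0]

print(inner_text('<a href="/info/12346">string 5</a>'))  # string 5
```

Because this sidesteps the Tag.text attribute entirely, it behaves the same on both BeautifulSoup versions involved here, which is presumably why it fixed the Debian box.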