Hello!
How can I ask my script to print every word on a web page that starts with the letter "A"?
This is my code:

from bs4 import BeautifulSoup
import urllib2

url = 'http://www.thefamouspeople.com/singers.php'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
for word in soup.text:
    if soup.text.startswith('A'):
        print soup.text

But it doesn't print any output.

Well, I changed it to this:

for word in soup.text:
    if word.startswith('A'):
        print word

But now the output is this (many "A" letters):

A
A
A
A
A
.
.

What is the type of soup.text? Try

print(type(soup.text))

if it is a string (type str), you could try

import re
for word in re.findall(r'\b\w+\b', soup.text):
    ...
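For example, here is a minimal sketch of that approach on a hard-coded sample string (just an illustration; with your page you would pass soup.text instead):

```python
import re

text = "Avril Lavigne and Axl Rose are singers; so is Barry White."

# \b\w+\b matches each run of word characters as a separate word
words = re.findall(r'\b\w+\b', text)

# keep only the words that begin with a capital A
a_words = [word for word in words if word.startswith('A')]
print(a_words)  # -> ['Avril', 'Axl']
```

This splits the text into whole words first, so the loop iterates over words rather than over single characters, which is what went wrong in your second attempt.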

I want the script to find all the names of singers that start with the letter "A" on a web page. First I asked my script to check the content of all <li> and <td> tags, but now I want to change it.
I want the script to check the URL I gave it and then find and save all the names of singers on that page that start with the letter "A". It's my homework and I don't know how to do that. Friends suggested using a dictionary to check words against, to find out whether they are human names or not, but the homework asks me not to do that.
So now I want my script to find every word that starts with "A" on that page and print it for me. Then I should find a way to make my crawler save only those words starting with "A" that are singers' names.
Very difficult!!!

Edited 1 Year Ago by Niloofar24

Go to Google and search for "google python class babynames"; they have a solution presented as well. The idea is to extract people's names from a table on a web page. It seems like much the same thing, so check it out.

Edited 1 Year Ago by Slavi

So now I want my script to find every word that starts with "A" on that page and print it for me. Then I should find a way to make my crawler save only those words starting with "A" that are singers' names.
Very difficult!!!

That would be a nightmare, and you would have to clean up a lot of rubbish text.
Ajax.Request( is just one "word" that starts with "A" that I saw when I quickly looked at the source you get from that page.

You have to find a tag that gives you the info you want.
The <img> tag with its title attribute will give you a fine list.

from bs4 import BeautifulSoup
import urllib2

url = 'http://www.thefamouspeople.com/singers.php'
html = urllib2.urlopen(url) #Do not use read()
soup = BeautifulSoup(html)
link_a = soup.find_all('img')
for link in link_a:
    try:
        print link['title']
    except KeyError:
        pass

"""Ouptput--> here just 3 names befor it changes to B
Ashlee Simpson
Avril Ramona Lavigne
Axl Rose
Barbra Streisand
Barry Manilow
Barry White
"""

Edited 1 Year Ago by snippsat

Thank you @snippsat. I have 2 questions:

About print link['title']: what is title? Is it the alt of the <img> tag?

About except KeyError: what does KeyError mean here? What is the keyword KeyError for?

Edited 1 Year Ago by Niloofar24

About print link['title']: what is title? Is it the alt of the <img> tag?

Have you looked at Firebug or Chrome DevTools?
I gave you links to them in an earlier post.
With those it is easy to see what the title of an <img> is.

About except KeyError: what does KeyError mean here? What is the keyword KeyError for?

Not all <img> tags on this page have a title attribute, so indexing them throws a KeyError.
This is the source code that gets called in BeautifulSoup:

def __getitem__(self, key):
    """tag[key] returns the value of the 'key' attribute for the tag,
       and throws an exception if it's not there."""
    return self.attrs[key]
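The same lookup behaviour can be reproduced with a plain dictionary, since (as the source above shows) a tag's attributes are stored in a dict underneath. This is just a toy illustration with made-up data:

```python
# two <img> tags modeled as attribute dicts: one with a title, one without
img_with_title = {'src': 'avril.jpg', 'title': 'Avril Ramona Lavigne'}
img_without_title = {'src': 'spacer.gif'}

for img in (img_with_title, img_without_title):
    try:
        print(img['title'])  # raises KeyError when 'title' is missing
    except KeyError:
        pass                 # skip images that have no title
```

Only "Avril Ramona Lavigne" is printed; the second image is silently skipped instead of crashing the loop.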

With except KeyError: pass, we just ignore those images.
We can be more specific with class="col-md-6",
so it only searches for names in the images we need.
Then it will not throw an error.

from bs4 import BeautifulSoup
import urllib2

url = 'http://www.thefamouspeople.com/singers.php'
html = urllib2.urlopen(url) #Do not use read()
soup = BeautifulSoup(html)
tag_row = soup.find_all('div', {'class':'col-md-6'})
for item in tag_row:
    print item.find('img')['title']
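To get back to the original question, the titles collected that way can then be filtered down to the ones starting with "A". Sketched here with a hard-coded list standing in for the scraped titles:

```python
# stand-in for the titles scraped from the page above
titles = ['Ashlee Simpson', 'Avril Ramona Lavigne', 'Axl Rose',
          'Barbra Streisand', 'Barry Manilow', 'Barry White']

# keep only the singers whose name starts with "A"
a_singers = [name for name in titles if name.startswith('A')]
print(a_singers)  # -> ['Ashlee Simpson', 'Avril Ramona Lavigne', 'Axl Rose']
```

Since the page's title attributes already hold clean singer names, filtering them is much more reliable than filtering every raw word in soup.text.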

Edited 1 Year Ago by snippsat
