need help for finding special NAMES in a url page (homework)

Question

Niloofar24 15 Posting Whiz

10 Years Ago

Hello.
I have a homework. I have asked to create a web crawler that be able to enter into a music website and then for the first step, collect the name of singers that their names starts with the letter "A".

Now i need a little help for this step. How my crawler should understand wich words in that page are the singers names?! The crawler should find their names in a special tag, correct?! But what kind of tag?! Their names could be in any tag like <h4></h4> for example or in a single <p></p> tag or in a <b></b> or in <ul></ul> or any other tag!
So i just need a hint to find the way, any idea?!

python seo

Edited 10 Years Ago by Niloofar24

5 Contributors
12 Replies
380 Views
2 Days Discussion Span
Latest Post 10 Years Ago Latest Post by iJunkie22

All 12 Replies

David W 131 Practically a Posting Shark

10 Years Ago

For starters, you might test each word to see if it begins with 'A' ?

If it does ...
then might look it up in a dictionary of just names to see if it is a name?

vegaseat 1,735 DaniWeb's Hypocrite

10 Years Ago

A dictionary in this case would be a list of special names you expect to find. Simply create a text file with each name on its own line that you can read into Python code and form a list or a set (removes any duplicates). You could assume that each name starts with a capital letter, be it the name of the composer, performer, group, title etc.

Start with a small list for testing and add more names as you find them.
You might be able to find these special names on the internet.

Edited 10 Years Ago by vegaseat

snippsat 661 Master Poster

10 Years Ago

David W it's about time to learn string formatting :)
Splitting up with , and + something + is not so nice.

print('"{}".istitle() is {}'.format(item, item.istitle()))

Python String Format Cookbook

string formatting specifications(Gribouillis)

Edited 10 Years Ago by snippsat

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

Niloofar24 15 Posting Whiz · Answer 1 · 2015-03-02T09:59:03+00:00

Well, @David W, i can check every word in that page to see if the words startwords that starts with latter "A", are the names of singers or any other word that starts with "A"??!!
How my crawler should recognize human names fom any other word starts with letter "A"?!

What do you mean by look it up in a dictionary of just names? In wich dictionary you mean?

David W 131 Practically a Posting Shark · Answer 2 · 2015-03-02T10:02:23+00:00

You may have to hunt the web for a big file of names ...

When you find suitable ones process and merge them into one big set (of unique names)

(Actually ...
just keep names that begin with 'A' in your set of processed names.)

Now just look up each word that beings with 'A' to see if it is in that set.

(Set look up times are very short.)

Niloofar24 15 Posting Whiz · Answer 3 · 2015-03-02T10:14:04+00:00

Well, ok i will try this way, but i think there should be an other more simple way that we have not find it yet.

iJunkie22 0 Newbie Poster · Answer 4 · 2015-03-04T01:04:39+00:00

You can also use compound regex and counters to filter the html. There is a nice feature in regex that allows you to match arbitrary strings.

self.smart_pattern = re.compile(r'(((?P<size>[0-9]+)x(?P=size))/(?P<context>(actions|animations|apps)))')

matches 24x24/apps, but does not match 24x23/apps, and it yields 24x24 in group 1 and apps in group 2. You can work similar magic with HTML blocks.

iJunkie22 0 Newbie Poster · Answer 5 · 2015-03-04T01:15:07+00:00

str.istitle() can also be used to verify that the string has each seperate word capitalized (how a name would be).

Niloofar24 15 Posting Whiz · Answer 6 · 2015-03-04T03:30:55+00:00

@iJunkie22, can you explain your last post please? about str.istitle(), and would be good if give me a little example.

David W 131 Practically a Posting Shark · Answer 7 · 2015-03-04T10:48:26+00:00

Take a look:

# str.istitle.py #

myStrs = [ "Sam S. Smith", "Ann a. anderson", "Anne Anderson" ]

mySet = set(myStrs) # pretend this is your big dictionary of names

for item in myStrs:
    print( "'" + item +"'.istitle() is", item.istitle() )

print()

for item in myStrs:
    print( "'" + item +"'[0] == A is", item[0] == 'A' )

print()    

for item in myStrs:
    if item[0] == 'A':
        if item in mySet:
            print( "'" + item +"'[0] == A is", item[0] == 'A' )
            print( item, "is in", mySet )

David W 131 Practically a Posting Shark · Answer 8 · 2015-03-04T14:55:42+00:00

No problem ... :)

There are many ways ...

Sometimes ... simple is easier to see for students.

But yes, I prefer using a format string in many contexts.

iJunkie22 0 Newbie Poster · Answer 9 · 2015-03-04T18:21:31+00:00

Also, the potential scale of the material to crawl makes this a great example of why order of comparisons is important. It all comes down to what is called "Big O".

Given that you want to check:

string is in your list of names
string starts with "A"
string is titlecase

Think of it like Big O refers to growth rate of cost...

The cost of running test1 = # of input strings * # of artist names

The cost of running test2 = # of input strings * 1 (it compares the first character and stops)

The cost of running test3 = # of input strings * # of words in input string

In theory, this makes test2 the least costly, with test1 and test3 tied for 2nd. In practice, however, names are rarely more than 4 parts, and the list of famous people is much larger than 4, so test1 is the most expensive. This means that you can make your crawler far more efficient by simply reordering the tests into this order:

string starts with "A"
string is titlecase
string is in your list of names

For a better explanation of "Big O", read this

need help for finding special NAMES in a url page (homework)

Recommended Answers Collapse Answers

All 12 Replies

Recommended Answers