Help with Python Threading Library with BeautifulSoup.

Question

John A.

11 Years Ago

Hey guys, I'm trying to get all links on a website using BeautifulSoup, Queue, Threading, and urllib2. I am specifically looking for links that lead to other pages of the same site. It runs for a few seconds, going through about 3 URLs before giving me the error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 808, in __bootstrap_inner
    self.run()
  File "/home/john/Desktop/Python Projects/QtProjects/ThreadedDataMine.py", line 51, in run
    if url[0:4] != "http" and url[0] != "/" and "#" not in url:
TypeError: 'NoneType' object has no attribute '__getitem__'

The program runs perfectly when it only goes over the main URL, but it starts giving this error as soon as I tell it to add the URLs it finds to the first thread's queue (if they haven't been added before).

Here's my code:

import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup

hosts = ["http://waoanime.tv"]

queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()

            #grabs urls of hosts and then grabs chunk of webpage
            req = urllib2.Request(host, headers={'User-Agent':"Anime Browser"})
            html = urllib2.urlopen(req)
            chunk = html.read()

            #place chunk into out queue
            self.out_queue.put(chunk)

            #signals to queue job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, out_queue, queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue
        self.queue = queue

    def run(self):
        while True:
            #grabs host from queue
            chunk = self.out_queue.get()

            soup = BeautifulSoup(chunk)
            #parse the chunk
            for line in soup.findAll('a'):
                url = (line.get('href'))
                if url[0:4] != "http" and "#" not in url:
                    new_url = ""
                    if url[0] == "/":
                        new_url = ("http://waoanime.tv%s" % url)
                    else:
                        new_url = ("http://waoanime.tv/%s" % url)
                    if new_url not in hosts:
                        hosts.append(new_url)
                        #self.queue.put(new_url)
                        print new_url #debug
                elif url[0:13] == "http://forums" and url not in hosts and "#" not in url:
                    hosts.append(url)
                    #put url in url queue
                    self.queue.put(url)
                    print url #debug
                else:
                    pass

            #signals to queue job is done
            self.out_queue.task_done()

start = time.time()
def main():

    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    #populate queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue, queue)
        dt.setDaemon(True)
        dt.start()


    #wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)

Because this is my first time using Threading, I just modified the code found here.

I would really appreciate any help fixing this bug, or if you know of a better way I can do this.

python queue

Edited 11 Years Ago by John A. because: changed Threaded to Threading

All 8 Replies

Sky Diploma 571 Practically a Posting Shark

11 Years Ago

To add to what snippsat said:

Consider what would happen if your code came across an anchor like the following: <a class='sd-link-color'></a>

Since it doesn't have an 'href' attribute, I am assuming that

url = (line.get('href')) would evaluate to None.

That's what is causing your error, since the get method returns None by default when the attribute is missing.
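
To make this concrete, here is a minimal sketch in Python 3 syntax. It uses the standard library's html.parser as a stand-in for BeautifulSoup (an assumption, so that the snippet runs without third-party packages), and LinkCollector is a hypothetical name. It shows that looking up a missing 'href' gives None, and that checking for None before slicing avoids the TypeError.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, skipping anchors without one."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.skipped = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; dict.get returns None
        # when the key is missing, much like line.get('href') in the thread.
        url = dict(attrs).get("href")
        if url is None:
            self.skipped += 1  # e.g. an anchor that only has a class attribute
            return
        self.links.append(url)


page = "<a class='sd-link-color'></a><a href='/about'>About</a>"
collector = LinkCollector()
collector.feed(page)
print(collector.links)    # hrefs that were actually present
print(collector.skipped)  # number of anchors that had no href
```

With BeautifulSoup itself, passing href=True to findAll should filter such anchors out before the loop ever sees them.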

Edited 11 Years Ago by Sky Diploma

Sky Diploma 571 Practically a Posting Shark

11 Years Ago

Hi John, glad that we could help.

Firstly, I would like to point out that this implementation ends up making many requests to waoanime.tv in rapid succession. This could potentially be harmful to the site. I hope you have prior permission from its webmaster for your experiment.

Secondly, I don't see how the program's behavior is wrong. Could you explain this in more detail?

snippsat 661 Master Poster · Answer 1 · 2014-04-02T05:04:40+00:00

url is set to None, so you get this error message.

>>> url = 'http://www.google.com'
>>> url[0:4]
'http'
>>> url = None
>>> url[0:4]
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
TypeError: 'NoneType' object has no attribute '__getitem__'

A tip is to just use print on url to see what you get.
If an anchor has no href, line.get('href') returns None.
You can catch the error and pass, so the loop carries on to the next link.

>>> url = None
>>> try:
...     url[0:4]
... except TypeError:
...     pass
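
Applied to a loop, the same idea might look like the sketch below. It is written in Python 3 syntax, and both the hrefs list and the relative_links helper are made up for illustration. The slice is the operation that raises TypeError, so it has to sit inside the try block.

```python
def relative_links(hrefs):
    """Return the hrefs that are relative and not in-page fragments."""
    found = []
    for url in hrefs:
        try:
            # Slicing None raises TypeError, which is caught below.
            if url[0:4] != "http" and "#" not in url:
                found.append(url)
        except TypeError:
            continue  # no href on this anchor; move on to the next one
    return found


# Hypothetical sample data: one entry is None, as for an anchor without href.
hrefs = ["/forum", None, "http://example.com/", "page.html#top", "faq.html"]
print(relative_links(hrefs))
```

An explicit "if url is None: continue" does the same job and avoids hiding unrelated TypeErrors raised elsewhere in the loop body.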

John A. · Answer 2 · 2014-04-02T16:27:56+00:00

Thank you guys so much! I added href=True to findAll, but now there's a new problem, and I'm pretty sure it's because of the way I'm using the threads. I put a print statement in the first thread to see what hosts it was going through, and it prints 5-6 before giving me an elapsed time, making the program stop, then it starts back up again and prints more before stopping again.
Sample Output:

>>> 
http://waoanime.tv
http://forums.waoanime.tv/arcade.phphttp://forums.waoanime.tv/http://forums.watchanimeon.com/register.phphttp://forums.waoanime.tv/f8/waoplayer-buffering-infinitely-79494/http://forums.waoanime.tv/f37/hello-everyone-79493/Elapsed Time: 1.78832912445





>>> http://forums.waoanime.tv/f38/google-japan-magic-hand-version-79492/
http://forums.waoanime.tv/f45/feedback-79491/
http://forums.waoanime.tv/f40/mini-tv-android-79490/
http://forums.waoanime.tv/f8/profile-picture-not-appearing-79489/
http://forums.waoanime.tv/f8/half-videos-arent-working-79488/
http://forums.waoanime.tv/f33/2k-thread-79487/
http://forums.waoanime.tv/f37/hello-everyone-79486/http://forums.waoanime.tv/f33/because-79485/

http://forums.waoanime.tv/register.php
http://forums.waoanime.tv/showgroups.php

I tried adding a few more ThreadUrl threads (there are now 10), and it did print more, but it still ended up giving the same kind of output:

>>> 
http://waoanime.tv
http://forums.waoanime.tv/arcade.php
 http://forums.watchanimeon.com/register.phphttp://forums.waoanime.tv/

 http://forums.waoanime.tv/f37/hello-everyone-79493/http://forums.waoanime.tv/f38/google-japan-magic-hand-version-79492/http://forums.waoanime.tv/f45/feedback-79491/http://forums.waoanime.tv/f40/mini-tv-android-79490/http://forums.waoanime.tv/f8/profile-picture-not-appearing-79489/http://forums.waoanime.tv/f8/half-videos-arent-working-79488/Elapsed Time: 2.28090500832http://forums.waoanime.tv/f8/waoplayer-buffering-infinitely-79494/

>>> http://forums.waoanime.tv/f33/2k-thread-79487/http://forums.waoanime.tv/f37/hello-everyone-79486/

http://forums.waoanime.tv/f33/because-79485/
http://forums.waoanime.tv/register.phphttp://forums.waoanime.tv/showgroups.php

So I think the problem is in def main(); I just don't know what I should do to fix it.
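
The run-together URLs in the output above look like what happens when several threads write to the console at the same time: one thread's text can land before another thread's newline. One way to keep each line intact is to guard printing with a lock. This is a minimal sketch, assuming Python 3; safe_print and worker are hypothetical names.

```python
import io
import threading

print_lock = threading.Lock()


def safe_print(text, stream=None):
    """Write one whole line while holding the lock (stdout if stream is None)."""
    with print_lock:
        print(text, file=stream)


def worker(name, stream):
    for i in range(50):
        safe_print("%s line %d" % (name, i), stream)


# An in-memory stream stands in for the console so the result can be checked.
buffer = io.StringIO()
threads = [threading.Thread(target=worker, args=("t%d" % n, buffer))
           for n in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

lines = buffer.getvalue().splitlines()
print(len(lines))  # one intact line per safe_print call
```

This only tidies the output. It does not change when the queues' join() calls return, so it would not by itself move the Elapsed Time message.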

John A. · Answer 3 · 2014-04-03T17:09:40+00:00

Hey Sky, I've already cleared it with the site admin, and he said it was fine.

Well, it's probably not semantically wrong, but it's not doing what I want. I'm trying to figure out a way for it to comb through the website, starting with the main page, and gather every URL within the site's domain. After the main page, it'll move to the next page in the queue, adding any new URLs it finds to the list and to the queue to be checked. By the end it will have gone through every page collecting URLs. This is just a precursor to what I want to do later on.

I said "I think the problem is in def main()" because I think it's not doing what I want due to the way the threads are being created. I just don't know enough about the threading library to know how to do it differently.

I hope this clarifies things, and I'm sorry for the confusion.

EDIT: If you know a better way for me to do this, that would also be great.
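
For the goal of collecting every URL within the site's domain, the standard library can do the URL arithmetic instead of string slicing. The sketch below assumes Python 3 (urllib.parse; in Python 2 the same functions live in the urlparse module). The same_site helper and the "/anime-list" path are hypothetical, and treating any host ending in "waoanime.tv" as part of the site is an assumption based on the forums subdomain in the sample output.

```python
from urllib.parse import urljoin, urldefrag, urlparse

BASE = "http://waoanime.tv"


def same_site(page_url, href, domain="waoanime.tv"):
    """Resolve href against page_url; return it only if it stays on domain."""
    if href is None:
        return None
    absolute = urljoin(page_url, href)         # handles "/x", "x" and full URLs
    absolute, _fragment = urldefrag(absolute)  # drop any "#..." part
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):
        return None                            # e.g. mailto: or javascript:
    host = parts.hostname or ""
    if host == domain or host.endswith("." + domain):
        return absolute
    return None


print(same_site(BASE, "/anime-list"))
print(same_site(BASE, "http://forums.waoanime.tv/register.php"))
print(same_site(BASE, "http://forums.watchanimeon.com/register.php"))
```

Returning None for anything off-site lets the caller use a single "if result is not None" check before queueing a URL.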

Sky Diploma 571 Practically a Posting Shark · Answer 4 · 2014-04-04T02:39:02+00:00

I think I've understood what you intend to say.

Let me suggest a solution to this. I'm not sure if it will work entirely.

# let's define another function that is separate from main.
def process_queues():
    queue.join()
    out_queue.join()
    print "Elapsed Time: %s" % (time.time() - start)

def main():
    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    #populate queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue, queue)
        dt.setDaemon(True)
        dt.start()

    process_queues()

What's essentially happening is that the .join() method blocks until every item that was put on the queue has been retrieved and marked with task_done().

This forces both queues to be drained before the Elapsed Time message is printed.
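
Because join() only watches the count of unfinished tasks, one hedged way to be sure a crawl is really over is to let a single worker handle a page from start to finish, and to call task_done() only after any newly found URLs are already back on the queue. The sketch below assumes Python 3 (the module is named queue there) and replaces the network with a small made-up in-memory site, FAKE_SITE, so it can run anywhere. It illustrates the pattern and is not the code from this thread.

```python
import queue
import threading

# Hypothetical stand-in for fetching and parsing: page -> links found on it.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}

todo = queue.Queue()
seen = {"/"}
seen_lock = threading.Lock()


def worker():
    while True:
        page = todo.get()
        try:
            for link in FAKE_SITE.get(page, []):
                with seen_lock:
                    is_new = link not in seen
                    seen.add(link)
                if is_new:
                    todo.put(link)  # queued before this page is marked done
        finally:
            todo.task_done()  # runs even if processing the page fails


for _ in range(3):
    threading.Thread(target=worker, daemon=True).start()

todo.put("/")
todo.join()  # returns only once every queued page has been fully processed
print(sorted(seen))
```

Since a page's links are queued before its own task_done() call, the unfinished count cannot reach zero while there is still work in flight.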

Do let me know if this works for you.

Thanks.

John A. · Answer 5 · 2014-04-04T02:59:18+00:00

Thank you so much! It works perfectly. Though if you don't mind me asking, how is this different from me having the .join() method in main()? What's the semantic difference?

Sky Diploma 571 Practically a Posting Shark · Answer 6 · 2014-04-04T04:30:43+00:00

There's really no major difference. This is exactly the same as:

start = time.time()
def main():

    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    #populate queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue, queue)
        dt.setDaemon(True)
        dt.start()


    #wait on the queue until everything has been processed
    queue.join()
    out_queue.join()
    print "Elapsed Time: %s" % (time.time() - start)

main()

Note that the only difference is that the print statement has moved inside main().

Putting those calls into a separate function was just a convention that I was used to.

The only thing we needed to do was print the elapsed time right after both queues have finished processing. :)

Just one suggestion: if we run this on a site with a large number of pages, the hosts list will keep growing, and Python may run out of memory and fail. So think of an alternative for that.
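
One possible alternative, sketched here under assumptions rather than taken from the thread, is to record visited URLs in a set (membership checks stay fast as it grows, unlike a list) and to cap how many URLs are accepted. The code is Python 3 syntax; SeenUrls, add_if_new and the limit parameter are hypothetical names.

```python
import threading


class SeenUrls:
    """Thread-safe record of visited URLs with an optional size cap."""

    def __init__(self, limit=None):
        self._urls = set()
        self._lock = threading.Lock()
        self._limit = limit

    def add_if_new(self, url):
        """Return True if url was recorded, False if known or cap reached."""
        with self._lock:
            if url in self._urls:
                return False
            if self._limit is not None and len(self._urls) >= self._limit:
                return False
            self._urls.add(url)
            return True

    def __len__(self):
        with self._lock:
            return len(self._urls)


visited = SeenUrls(limit=2)
print(visited.add_if_new("http://waoanime.tv"))                      # new
print(visited.add_if_new("http://waoanime.tv"))                      # repeat
print(visited.add_if_new("http://forums.waoanime.tv/"))              # new
print(visited.add_if_new("http://forums.waoanime.tv/register.php"))  # over cap
```

A worker would call add_if_new(url) and only put the URL on the queue when it returns True, which replaces the separate "not in hosts" check and hosts.append call.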

Oh yeah, and do mark the thread as "solved" if you have no further questions.
