Why doesn't the output contain all the URL links?

Question

Niloofar24 15 Posting Whiz

10 Years Ago

Hello my friends.
Look at this please:

>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> url = urllib2.urlopen('https://duckduckgo.com/?q=3D&t=canonical&ia=meanings')
>>> soup = BeautifulSoup(url)
>>> links = soup('a')
>>> print links
[<a class="header__logo-wrap" href="/?t=canonical" tabindex="-1"><span class="header__logo">DuckDuckGo</span></a>, <a class="search__dropdown" href="javascript:;" id="search_dropdown" tabindex="4"></a>, <a href="https://duckduckgo.com/html/?q=3D">here</a>]
>>>

I used https://duckduckgo.com/?q=3D&t=canonical&ia=meanings as the URL. I thought the code above should do this:

Find all the links on that page. But you can see the result! There are many links to different websites on that page, so why didn't it print the URL of each website in the output?!

python seo

snippsat 661 Master Poster

10 Years Ago

so why it didn't print the url of each website into output?!

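Because this is a dynamic site that builds its content with JavaScript (jQuery and so on).
The problem is that the JavaScript only gets evaluated by a browser, which then updates the DOM; urllib2 just fetches the initial HTML.
To get the links we need something that automates a browser, like Selenium.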
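To make the point concrete, here is a minimal sketch of what a static parser sees. It is written for Python 3 and uses the standard library's html.parser instead of BeautifulSoup so it runs without extra installs; the STATIC_HTML page and the AnchorCollector class are made up for illustration and are not DuckDuckGo's real markup.

```python
from html.parser import HTMLParser

# Hypothetical page: one anchor in the markup, plus a script that would
# add another link only after a browser executes it.
STATIC_HTML = """
<html><body>
<a href="https://duckduckgo.com/html/?q=3D">here</a>
<div id="results"></div>
<script>
  var a = document.createElement('a');
  a.href = 'http://3d.si.edu/';
  document.getElementById('results').appendChild(a);
</script>
</body></html>
"""


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


collector = AnchorCollector()
collector.feed(STATIC_HTML)
static_hrefs = collector.hrefs
print(static_hrefs)
```

Only the anchor that is literally in the HTML shows up. The link the script would create never appears, because nothing here runs the script.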

I can show you one way. Here I also use PhantomJS, so that no browser window has to be loaded.

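from bs4 import BeautifulSoup
from selenium import webdriver

# PhantomJS is a headless browser, so no window opens
driver = webdriver.PhantomJS(executable_path='C:/phantom/phantomjs')
driver.set_window_size(1120, 550)
driver.get('https://duckduckgo.com/?q=3D&t=canonical&ia=meanings')
# HTML of the page as the browser sees it, after the scripts have run
page_source = driver.page_source
driver.quit()
soup = BeautifulSoup(page_source, 'html.parser')
link_a = soup.find_all('a')
# set() drops duplicate tags; keep only tags that mention an http link
for link in set(link_a):
    if 'http' in repr(link):
        try:
            print(link['href'])
        except KeyError:
            # <a> tag without an href attribute
            pass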

Output: here are the first 6 links out of 100.

http://3d.si.edu/
https://en.wikipedia.org/wiki/3-D_film
http://3d.about.com/
http://www.boxofficemojo.com/genres/chart/?id=3d.htm
http://www.urbanfonts.com/fonts/3d-fonts.htm
http://www.3dcontentcentral.com/
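The loop above keeps the links that point to other websites and drops duplicates. As a rough sketch of the same filtering on plain href strings (Python 3, standard library only), something like this would also keep the links in page order, which a set does not. The external_links helper and the sample list are hypothetical, made up from hrefs that appear in this thread.

```python
from urllib.parse import urlparse


def external_links(hrefs):
    """Return absolute http(s) URLs from hrefs, without duplicates, in order."""
    seen = set()
    result = []
    for href in hrefs:
        # Skip relative paths and schemes such as javascript:
        if urlparse(href).scheme not in ('http', 'https'):
            continue
        if href in seen:
            continue
        seen.add(href)
        result.append(href)
    return result


# Sample hrefs for illustration only
sample = [
    '/?t=canonical',
    'javascript:;',
    'http://3d.si.edu/',
    'https://en.wikipedia.org/wiki/3-D_film',
    'http://3d.si.edu/',
]
links = external_links(sample)
for url in links:
    print(url)
```

With BeautifulSoup, the input list could come from link.get('href') for each tag in soup.find_all('a'), skipping the None values.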

This is more advanced web scraping;
you need to study (understand) the site and know which tools to use.

Edited 10 Years Ago by snippsat

Gribouillis commented: very good help +14

vegaseat 1,735 DaniWeb's Hypocrite Team Colleague · Answer 1 · 2015-03-14T14:41:15+00:00

10 Years Ago

@snippsat good explanation!