Hello my friends.
Look at this please:

>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> url = urllib2.urlopen('https://duckduckgo.com/?q=3D&t=canonical&ia=meanings')
>>> soup = BeautifulSoup(url)
>>> links = soup('a')
>>> print links
[<a class="header__logo-wrap" href="/?t=canonical" tabindex="-1"><span class="header__logo">DuckDuckGo</span></a>, <a class="search__dropdown" href="javascript:;" id="search_dropdown" tabindex="4"></a>, <a href="https://duckduckgo.com/html/?q=3D">here</a>]

I used https://duckduckgo.com/?q=3D&t=canonical&ia=meanings as the URL. I thought the code above should do this:

Find all the links on that page. But look at the result! There are many links to different websites on that page, so why didn't it print the URL of each website in the output?!

so why didn't it print the URL of each website in the output?!

Because this is a dynamic site that uses JavaScript (jQuery and so on).
The JavaScript is evaluated in the browser, where it builds the DOM, so the result links are not in the raw HTML that urllib2 downloads.
To get the links we need something that automates a browser, like Selenium.
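
To see why a plain parser misses those links, here is a small sketch. It assumes Python 3 and uses only the standard library's HTMLParser so it runs without extra packages. The page in STATIC_HTML is made up for illustration (the example.com link and the LinkCollector and static_links names are mine, not from DuckDuckGo). A parser only sees tags that are in the markup; a link that a script would create later never shows up.

```python
from html.parser import HTMLParser

# Hypothetical page: one link is in the markup, one would be added by a script.
STATIC_HTML = """
<html><body>
  <a href="https://duckduckgo.com/html/?q=3D">here</a>
  <script>
    var a = document.createElement('a');
    a.href = 'https://example.com/result';
    document.body.appendChild(a);
  </script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag present in the markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


def static_links(html):
    """Return the hrefs a non-JavaScript parser can see in html."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.hrefs


# The script is never executed, so only the link in the markup is found.
print(static_links(STATIC_HTML))
```

The script body is treated as plain text, so the link it would add in a real browser is not in the result. That is the same situation as the single "here" link in the question's output.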

I can show you one way. Here I also use PhantomJS, so that no browser window has to open.

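from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='C:/phantom/phantomjs')
driver.set_window_size(1120, 550)
# Load the page in the headless browser so its JavaScript runs.
driver.get('https://duckduckgo.com/?q=3D&t=canonical&ia=meanings')
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
link_a = soup.find_all('a')
for link in set(link_a):
    if 'http' in repr(link):
        try:
            print(link['href'])
        except KeyError:
            # Skip <a> tags that have no href attribute.
            pass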

Output: here are the first 6 links out of 100 links.


This is more advanced web scraping,
and you need to study (understand) the site and know which tools to use.
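
One more note on the filter in the loop above: checking for 'http' in repr(link) matches any tag whose text contains "http" anywhere, and it lets relative links slip through unresolved. Below is a minimal sketch of a stricter filter, assuming Python 3 and the standard urllib.parse module. The function name absolute_http_links is my own, and the sample hrefs are the ones from the question's output. It is one possible approach, not the only way to do it.

```python
from urllib.parse import urljoin, urlparse


def absolute_http_links(hrefs, base_url):
    """Resolve hrefs against base_url and keep unique http(s) URLs, in order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            # Skip missing or empty href values.
            continue
        url = urljoin(base_url, href)
        if urlparse(url).scheme not in ('http', 'https'):
            # Drop javascript:, mailto: and similar pseudo-links.
            continue
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


base = 'https://duckduckgo.com/?q=3D&t=canonical&ia=meanings'
hrefs = ['/?t=canonical', 'javascript:;',
         'https://duckduckgo.com/html/?q=3D', None, '/?t=canonical']
for url in absolute_http_links(hrefs, base):
    print(url)
```

With BeautifulSoup you could feed it the values of link.get('href') for each tag from soup.find_all('a'), which also avoids the KeyError handling.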


@snippsat good explanation!
