How to print only the content of all <li> tags from a url page?

Question

Niloofar24 15 Posting Whiz

9 Years Ago

Hi.
How can I ask my crawler to print only the text of all <li></li> tags on a web page?
I want to save the text of all <li></li> tags in a text file (without the <li></li> tags themselves).

python


Slavi 94 Master Poster

9 Years Ago

The problem is that your html variable is just a string containing the URL
https://www.daniweb.com/software-development/python/threads/492669/how-to-print-only-the-content-of-all-tags-from-a-url-page
and not the actual HTML of that page. You have already imported urllib2, so use it to fetch the page source.
Read the urllib2 documentation
and the example from there:

import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()

Also, should this >>> re.findall(r'<p>(.+),/p>', html)
be >>> re.findall(r'<p>(.+)</p>', html)?
I am not sure whether you read the link I gave you earlier about regular expressions, but the . matches any single character, including a space (though not a newline by default). The + means "one or more of the preceding pattern", which in our case is the ., so .+ captures all the characters between <li> and </li>. With only the . and no +, the pattern would return just a single character.
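A small sketch of the points above, written for Python 3. The sample strings are made up for illustration and are not taken from the page in question.

```python
import re

# Made-up sample input: two list items on separate lines.
html = "<ul><li>first</li>\n<li>second</li></ul>"

# "." alone matches exactly one character, so only one-character items match.
single = re.findall(r"<li>(.)</li>", "<li>a</li><li>bc</li>")

# ".+" matches one or more characters, but does not cross a newline by default.
items = re.findall(r"<li>(.+)</li>", html)

# The typo ",/p>" (instead of "</p>") never occurs in real HTML, so nothing matches.
typo = re.findall(r"<p>(.+),/p>", "<p>hello</p>")

print(single)
print(items)
print(typo)
```

The empty list for the typo pattern shows that a mistyped closing tag alone is enough to produce [], even when the string really contains HTML.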

Edited 9 Years Ago by Slavi

snippsat 661 Master Poster

9 Years Ago

Use regular expressions

No, no, no, just to make it clear :)
I have to post this link again.
Use a parser such as BeautifulSoup or lxml.

from bs4 import BeautifulSoup

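html = '''\
<html>
<head>
  <title>Page Title</title>
</head>
<body>
  <li>Text in li 1</li>
  <li>Text in li 2</li>
</body>
</html>'''

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
tag_li = soup.find_all('li')
# print(...) with a single argument works in both Python 2 and Python 3
print(tag_li)
for tag in tag_li:
    print(tag.text)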

"""Output-->
[<li>Text in li 1</li>, <li>Text in li 2</li>]
Text in li 1
Text in li 2
"""

Edited 9 Years Ago by snippsat
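To get from this example to the original goal (the text of every <li> on a page, saved to a text file), something like the following could work. It is only a sketch for Python 3, where urllib.request replaces urllib2. The function names fetch_html, li_texts and save_lines are made up for illustration, and "html.parser" is just one of the parsers bs4 can use.

```python
from urllib.request import urlopen  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw HTML as bytes (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def li_texts(html):
    """Return the stripped text of every <li> tag in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all("li")]


def save_lines(lines, path):
    """Write one item per line to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


# Offline demonstration on a literal string, so no network access is needed.
sample = "<ul><li>Text in li 1</li><li>Text in li 2</li></ul>"
texts = li_texts(sample)
print(texts)
```

For a real page, the pieces would be chained as save_lines(li_texts(fetch_html(url)), "li_text.txt"), where the URL and the file name are whatever you choose.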

Slavi commented: Great read, but this 'Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins'.. he's gone too far =D +6


Slavi 94 Master Poster Featured Poster · Answer 1 · 2015-03-03T11:57:49+00:00

Use regular expressions.

Here's a quick example:

>>> import re
>>> html = 'randomstuff<li>I am some text 12345</li>randomstuff'
>>> re.findall(r'<li>(.+)</li>',html)
['I am some text 12345']
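One caveat with this pattern, shown as a Python 3 sketch on a made-up string: .+ is greedy, so when several <li> tags sit on the same line it swallows everything up to the last </li>. Adding ? makes the match stop at the first closing tag.

```python
import re

# Made-up sample: two list items on a single line.
html = "<li>one</li><li>two</li>"

# Greedy: ".+" runs to the last "</li>" on the line.
greedy = re.findall(r"<li>(.+)</li>", html)

# Non-greedy: ".+?" stops at the first "</li>" it can.
lazy = re.findall(r"<li>(.+?)</li>", html)

print(greedy)
print(lazy)
```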

Niloofar24 15 Posting Whiz · Answer 2 · 2015-03-03T12:29:01+00:00

I tried this for testing:

>>> import urllib2
>>> import re
>>> html = 'https://www.daniweb.com/software-development/python/threads/492669/how-to-print-only-the-content-of-all-tags-from-a-url-page'
>>> re.findall(r'<p>(.+),/p>', html)

But the output was:

[]

I tried other tags too, but every output was []. What's the problem?

Niloofar24 15 Posting Whiz · Answer 3 · 2015-03-03T13:46:49+00:00

Thank you @snippsat. Your example was exactly what I was looking for.

And thank you @Slavi for your answer and explanation.

snippsat 661 Master Poster · Answer 4 · 2015-03-03T13:49:11+00:00

he's gone too far=D

Yes, of course, that's to make it a great, humorous read.
Regex can be OK to use sometimes, for example when you only need a single text or value.

Both BeautifulSoup and lxml have built-in support for regex.
Sometimes it is OK to use regex as a helper to the parser: when parsing dynamic websites
you can get a lot of rubbish text.
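As one illustration of regex used as a helper to the parser, BeautifulSoup's find_all accepts a compiled pattern as a filter. This is a Python 3 sketch on made-up HTML. The string= keyword is the name used by recent bs4 releases, so an older install may need text= instead.

```python
import re

from bs4 import BeautifulSoup

# Made-up sample data.
html = "<ul><li>Apple</li><li>Banana</li><li>Age</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Keep only the <li> tags whose text matches the pattern (here: starts with "A").
a_items = soup.find_all("li", string=re.compile(r"^A"))
a_texts = [tag.get_text() for tag in a_items]

print(a_texts)
```

The parser still does the work of finding the tags, and the regex only filters their text.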

Niloofar24 15 Posting Whiz · Answer 5 · 2015-03-03T14:40:41+00:00

I wanted to post this question in a new discussion, but since it is related to this one I will ask it here. My code:

from bs4 import BeautifulSoup
import urllib2

mylist = []

url = 'http://www.niloofar3d.ir/try.html'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
tag_li = soup.find_all('li')
for tag in tag_li:
    if tag.text.startswith('A'):
        mylist.append(tag.text)
if 'A' in mylist[0]:
    if 'A' in mylist[1]:
        if 'A' in mylist[2]:
            print mylist
else:
    'sorry!'

The output should be the else message, but it prints this output:

[u'Apple', u'Age', u'Am']

What is the problem? I want the script to check whether the first 3 words (indexes) of mylist start with the letter 'A' and, if so, print the list; if not, print 'sorry!'. But as you can see here, it has printed even the index[4]!

And one more question: how can I remove those u letters that are printed in the output?
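A possible reading of what happens in the code above: the loop appends only words that start with 'A', so every item in mylist passes the later checks no matter where it sat on the page. In addition, the else belongs only to the outermost if, and a bare 'sorry!' is an expression that is never printed. The u prefix is how Python 2 displays unicode strings inside a list. Printing the strings themselves (for example joined together) does not show it, and Python 3 does not show it at all. A Python 3 sketch of the intended check follows. The word list is made-up sample data and the function name is hypothetical.

```python
# Made-up sample data standing in for the <li> texts, unfiltered.
words = ["Apple", "Age", "Am", "Book"]


def first_three_start_with_a(items):
    """True only if there are at least three items and the first three start with 'A'."""
    first_three = items[:3]
    return len(first_three) == 3 and all(w.startswith("A") for w in first_three)


if first_three_start_with_a(words):
    # Joining the strings prints their text rather than the list's repr.
    message = ", ".join(words)
else:
    message = "sorry!"

print(message)
```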

Niloofar24 15 Posting Whiz · Answer 6 · 2015-03-03T14:50:45+00:00

Well, it seems my question is basically wrong.
Forget that question, sorry!