Hi.
How can I get my crawler to print only the text of all <li></li> tags on a page at a given URL?
I want to save the text of all <li></li> tags to a text file (without the <li></li> tags themselves).

Use regular expressions.

Here's a quick example:

>>> import re
>>> html = 'randomstuff<li>I am some text 12345</li>randomstuff'
>>> re.findall(r'<li>(.+)</li>',html)
['I am some text 12345']

The problem is that your html variable is just a string containing this value:
https://www.daniweb.com/software-development/python/threads/492669/how-to-print-only-the-content-of-all-tags-from-a-url-page
and not the actual HTML code. You have already imported the urllib2 library, so use it to fetch the code from that page.
Read the urllib2 documentation
and the example from there:

import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()

Also, should this
>>> re.findall(r'<p>(.+),/p>', html)
be
>>> re.findall(r'<p>(.+)</p>', html)?
I am not sure whether you read the link I gave you earlier about regular expressions, but the . matches any character, including a space (though not a newline by default). The + means "one or more of the preceding pattern", so .+ gets all the characters between <li> and </li>. With only . and no +, the pattern would match just a single character.
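To make the difference between . and .+ concrete, here is a small sketch. The sample strings are made up for illustration, and the lazy .+? variant is an addition that the example above does not use; it matters as soon as two <li> tags sit on the same line.

```python
import re

# Hypothetical sample: two <li> tags on a single line.
html = 'x<li>one</li>y<li>two</li>z'

# '.' alone matches exactly one character, so multi-character items do not match.
single = re.findall(r'<li>(.)</li>', html)

# '.+' is greedy: it runs on to the LAST </li> on the line.
greedy = re.findall(r'<li>(.+)</li>', html)

# '.+?' is lazy: it stops at the FIRST </li>, giving one match per tag.
lazy = re.findall(r'<li>(.+?)</li>', html)

# '.' does not match a newline unless re.DOTALL is passed.
multiline = '<li>a\nb</li>'
no_dotall = re.findall(r'<li>(.+?)</li>', multiline)
dotall = re.findall(r'<li>(.+?)</li>', multiline, re.DOTALL)

print(single)     # []
print(greedy)     # ['one</li>y<li>two']
print(lazy)       # ['one', 'two']
print(no_dotall)  # []
print(dotall)     # ['a\nb']
```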

Edited 1 Year Ago by Slavi

Use regular expressions.

No no no just to make it clear :)
Have to post this link again.
Use a parser: BeautifulSoup or lxml.

from bs4 import BeautifulSoup

html = '''\
<html>
<head>
  <title>Page Title</title>
</head>
<body>
  <li>Text in li 1</li>
  <li>Text in li 2</li>
</body>
</html>'''

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
tag_li = soup.find_all('li')
print tag_li
for tag in tag_li:
    print tag.text

"""Output-->
[<li>Text in li 1</li>, <li>Text in li 2</li>]
Text in li 1
Text in li 2
"""
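The original question also asked about saving the text to a file. A minimal sketch of that step, assuming bs4 is installed and Python 3 is used; the function name save_li_text and the file name li_text.txt are made up for this example:

```python
import io
import os
import tempfile

from bs4 import BeautifulSoup


def save_li_text(html, path):
    """Write the text of every <li> in html to path, one item per line."""
    soup = BeautifulSoup(html, 'html.parser')
    texts = [tag.get_text(strip=True) for tag in soup.find_all('li')]
    # io.open lets us state the encoding explicitly.
    with io.open(path, 'w', encoding='utf-8') as out:
        for text in texts:
            out.write(text + u'\n')
    return texts


# Hypothetical sample markup and a throwaway output location.
sample = '<ul><li>Text in li 1</li><li>Text in li 2</li></ul>'
out_path = os.path.join(tempfile.mkdtemp(), 'li_text.txt')
saved = save_li_text(sample, out_path)
print(saved)
```

For a live page, html would come from urllib2.urlopen(url).read() as shown earlier in the thread.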

Edited 1 Year Ago by snippsat

Comments
Great read, but this: 'Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins'... he's gone too far =D

Thank you @snippsat. Your example was exactly what I was looking for.

And thank you @Slavi for your answer and explanation.

he's gone too far=D

Yes, of course, to make it a great humorous read.
Regex can be OK to use sometimes, like when you only need a single text/value.

Both BeautifulSoup and lxml have built-in support for regex.
Sometimes it is OK to use regex as a helper to the parser; when parsing dynamic websites
you can get a lot of rubbish text.
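A rough sketch of regex as a helper to the parser, assuming a reasonably recent bs4 (the string argument is called text in older releases). The sample markup and the variable names are invented for illustration:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = '''<ul>
  <li>Apple</li>
  <li>Banana</li>
  <li>Price: 42 USD</li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')

# A compiled pattern as the string filter: only <li> tags whose text matches are returned.
a_items = [tag.get_text() for tag in soup.find_all('li', string=re.compile(r'^A'))]

# The parser finds the tag, then a regex pulls a single value out of its text.
price_tag = soup.find('li', string=re.compile(r'Price'))
price = re.search(r'\d+', price_tag.get_text()).group()

print(a_items)
print(price)
```

The parser still does the structural work here; the regex only filters or trims plain text that has already been extracted.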

Edited 1 Year Ago by snippsat

I wanted to post this question in a new discussion, but as it is related to this one I will ask it here. My code:

from bs4 import BeautifulSoup
import urllib2

mylist = []

url = 'http://www.niloofar3d.ir/try.html'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
tag_li = soup.find_all('li')
for tag in tag_li:
    if tag.text.startswith('A'):
        mylist.append(tag.text)
if 'A' in mylist[0]:
    if 'A' in mylist[1]:
        if 'A' in mylist[2]:
            print mylist
else:
    'sorry!' 

The output should be the else message, but it prints this instead:

[u'Apple', u'Age', u'Am']

What is the problem? I want the script to check whether the first 3 words (indexes) of mylist start with the letter 'A' and, if so, print the list; if not, print 'sorry!'. But as you can see here, it has printed even the index[4]!

And one more question: how can I remove the u prefixes that are printed in the output?

Edited 1 Year Ago by Niloofar24
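Three things in the script above stand out. The list is filtered with startswith('A') before the check, so every element already begins with 'A'. The else belongs only to the outermost if. And 'sorry!' is a bare string that is never printed. The u prefix is how Python 2 displays unicode strings inside a printed list; printing a joined string instead of the list avoids it. A possible sketch of the intended check, assuming Python 3 and made-up word lists in place of the page contents (the function names are hypothetical):

```python
def first_three_start_with_a(words):
    """True only if there are at least three words and the first three all start with 'A'."""
    first = words[:3]
    return len(first) == 3 and all(w.startswith('A') for w in first)


def report(words):
    """Return the joined list when the check passes, otherwise 'sorry!'."""
    if first_three_start_with_a(words):
        # Joining the items prints plain text rather than the list repr.
        return ', '.join(words)
    return 'sorry!'


# Hypothetical <li> texts, collected WITHOUT filtering them first.
print(report(['Apple', 'Age', 'Am', 'Book']))  # Apple, Age, Am, Book
print(report(['Apple', 'Book', 'Am']))         # sorry!
```

With BeautifulSoup, words would be [tag.text for tag in soup.find_all('li')], built without the startswith filter.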
