I am trying to extract blogs related to the economy from RSS feeds in Python. I don't know how to fetch a specific number of blog posts, or how to restrict them to a particular domain (such as the economy).
My project requires analysing these blogs using NLP techniques, but I'm stuck at the first step and don't know how to start.

An RSS feed is XML data, so you have to know how to parse XML. You can parse it using either ElementTree or minidom.
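As a rough illustration of the ElementTree route, here is a minimal sketch for Python 3. The feed text, the example.com links and the helper name list_items are made up for the example; a real feed would be downloaded first.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Economy Blog</title>
    <item>
      <title>Inflation explained</title>
      <link>http://example.com/inflation</link>
    </item>
    <item>
      <title>Interest rates in 2014</title>
      <link>http://example.com/rates</link>
    </item>
  </channel>
</rss>"""


def list_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    # iter("item") walks the whole tree, so the items are found
    # without spelling out the rss/channel path.
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


for title, link in list_items(SAMPLE_RSS):
    print(title, link)
```

Each blog post in a feed is one <item>, so the number of posts you get is simply the number of items you keep from this list.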

Thank you krystosan.
Umm, I don't know how to get the RSS feed for blogs in a particular field (like economics, sports, etc.).

Give an example of, or a link to, what you are trying to extract/parse.
As krystosan mentioned, it's XML data, and there are good tools for this in Python.
There is also a library that is only for parsing RSS: Universal Feed Parser.
I like both BeautifulSoup and lxml.
A quick demo with BeautifulSoup:

from bs4 import BeautifulSoup

rss = '''\
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<title>Python</title>
<link>http://www.reddit.com/r/Python/</link>
<description>
news about the dynamic, interpreted, interactive, object-oriented, extensible programming language Python
</description>
</channel>
</rss>'''

# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(rss, 'html.parser')
title_tag = soup.find('title')
description_tag = soup.find('description')
# print() with a single argument works in both Python 2 and Python 3.
print(title_tag.text)
print(description_tag.text)

"""Output-->
Python

news about the dynamic, interpreted, interactive, object-oriented, extensible programming language Python
"""
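For the original question about keeping only posts on one topic and limiting how many come back, one possible approach is to filter the items by keyword. This is only a sketch for Python 3: the sample feed, the function name matching_titles and the idea that a plain keyword match on title, category and description is good enough are all assumptions, not something a feed guarantees.

```python
import xml.etree.ElementTree as ET
from itertools import islice

# Hypothetical sample feed, used only for illustration.
SAMPLE_RSS = """<rss version="2.0">
<channel>
<title>Example Blog</title>
<item><title>Economy slows down</title><category>Economy</category>
<description>GDP growth fell this quarter.</description></item>
<item><title>Cup final recap</title><category>Sports</category>
<description>A late goal decided the match.</description></item>
<item><title>Budget day</title><category>Politics</category>
<description>What the new budget means for the economy.</description></item>
<item><title>Markets rally</title><category>Economy</category>
<description>Stocks rose on Friday.</description></item>
</channel>
</rss>"""


def matching_titles(rss_text, keywords, limit):
    """Return up to `limit` item titles whose text mentions any keyword."""
    keywords = [k.lower() for k in keywords]
    root = ET.fromstring(rss_text)

    def matches(item):
        # Search title, category and description, ignoring case.
        # A missing element yields None, which is replaced by "".
        text = " ".join(
            (item.findtext(tag) or "")
            for tag in ("title", "category", "description")
        ).lower()
        return any(k in text for k in keywords)

    hits = (item.findtext("title")
            for item in root.iter("item") if matches(item))
    # islice stops after `limit` hits instead of scanning the whole feed.
    return list(islice(hits, limit))


print(matching_titles(SAMPLE_RSS, ["economy"], 2))
```

Keyword matching is crude, so for a real project it may be simpler to start from feeds that are already about economics and use the limit only to cap the number of posts.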

Thanks snippsat!
It really helped.
Although, when I try to find the content of the blog using
content_tag = soup.find('content:encoded')
it gives only the first paragraph of the website (the first instance where "content:encoded" occurs).

Use find_all().

The find_all() method scans the entire document looking for results, whereas find() stops at the first match.
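To make the difference concrete, here is a small sketch with a made-up two-item feed. It assumes bs4 is installed and uses the built-in 'html.parser'; the feed text is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical two-item feed, used only for illustration.
rss = """<rss><channel>
<item><title>First post</title></item>
<item><title>Second post</title></item>
</channel></rss>"""

soup = BeautifulSoup(rss, "html.parser")

first = soup.find("title")       # a single Tag, or None if nothing matches
every = soup.find_all("title")   # a list-like result with all matching Tags

print(first.get_text())
print([tag.get_text() for tag in every])
```

Because find_all() returns a collection, the usual pattern is to loop over it and handle each tag in turn.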

I have done this:
The code to get only the text and remove the HTML tags (get_text()) generally works, but somehow it doesn't work here. What am I doing wrong?

import urllib2
from bs4 import BeautifulSoup, NavigableString

page = urllib2.urlopen("http://www.frugalrules.com")
soup = BeautifulSoup(page)
link = soup.find('link', type='application/rss+xml')
print(link['href'])
rss = urllib2.urlopen(link['href']).read()
souprss = BeautifulSoup(rss)
content_tag = souprss.find_all('content:encoded')
for row in content_tag:
    print(row.get_text())
for node in row.findAll('p'):
    print(''.join(node.findAll(text=True)))

The above code does not work, and neither does defining a list of invalid tags to strip (the tags are nested and there are too many invalid tags).
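One likely cause, assuming the feed wraps each post in CDATA as many blog feeds do: the body of content:encoded then reaches the parser as one string of HTML rather than as child tags, so searching the row for 'p' finds nothing and get_text() hands back the markup itself. The second loop also sits outside the first, so it only ever sees the last row. A possible way round both, sketched for Python 3 with the standard library only, is to read the feed as XML and then strip the tags in a second pass. The sample feed and the names TextOnly, strip_tags and post_texts are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespace URI that the "content:" prefix is bound to in RSS feeds.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"

# Hypothetical sample feed, used only for illustration.
SAMPLE_RSS = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item><title>Post one</title>
<content:encoded><![CDATA[<p>Save <b>more</b> now.</p>
<p>Spend less.</p>]]></content:encoded>
</item>
<item><title>Post two</title>
<content:encoded><![CDATA[<div><p>Budget <a href="#">tips</a></p></div>]]></content:encoded>
</item>
</channel>
</rss>"""


class TextOnly(HTMLParser):
    """Collect text nodes and ignore every tag, however deeply nested."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


def post_texts(rss_text):
    """Return the plain text of every content:encoded element."""
    root = ET.fromstring(rss_text)
    # The XML parser unwraps CDATA, so node.text is the post's HTML string.
    return [strip_tags(node.text or "")
            for node in root.iter(CONTENT_NS + "encoded")]


for text in post_texts(SAMPLE_RSS):
    print(text)
```

Since every tag is ignored rather than matched against a list, nesting and unknown tags stop being a problem. The same two-pass idea should carry over to BeautifulSoup: take the string from each content:encoded row, build a second soup from it, and call get_text() on that.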
