I am trying to extract blogs related to economy using the RSS feeds in python. I have no idea how to get a specific number of blogs and how to get those blogs in a particular domain (like economy).
My project requires analysing these blogs using NLP techniques, but I'm stuck in the first step and I don't know how to start.

3 Years
Discussion Span
Last Post by Remy the cook

the RSS feed is an XML data, so you have to know how to parse XML. You can either parse using elementTree or using minidom


Thank you krystosan.
Umm, I don't know how to get the RSS feed for the blogs of a particular field(like economics, sports, etc.)


Give an example or link og what you try to extract/parse.
As mention by krystosan it's XML data,and there are good tool for this in Python.
And library that is only for parsing RSS like Universal Feed Parser
I like both Beautifulsoup and lxml.
A quick demo with Beautifulsoup.

from bs4 import BeautifulSoup

rss = '''\
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
news about the dynamic, interpreted, interactive, object-oriented, extensible programming language Python

soup = BeautifulSoup(rss)
title_tag = soup.find('title')
description_tag = soup.find('description')
print title_tag.text
print description_tag.text


news about the dynamic, interpreted, interactive, object-oriented, extensible programming language Python

Thanks snippsat!
It really helped.
Although, when I try to find the content of the blog using
content_tag = soup.find('content:encoded')
it just gives the first paragraph of the website (the first instance when "content:encoded" occurs)


I have done this:
The code to get text only and remove the html tags (get_text()) works generally. But this code doesn't work somehow. What am I doing wrong?

import urllib2
page = urllib2.urlopen("http://www.frugalrules.com")
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup(page)
link = soup.find('link', type='application/rss+xml')
print link['href']
rss = urllib2.urlopen(link['href']).read()
souprss = BeautifulSoup(rss)
content_tag = souprss.find_all('content:encoded')
for row in content_tag:
for node in row.findAll('p'):

the above code and defining invalid tags also do not work (as the tags are nested and there are too many invalid tags)

This question has already been answered. Start a new discussion instead.
Have something to contribute to this discussion? Please be thoughtful, detailed and courteous, and be sure to adhere to our posting rules.