
I am trying to extract blogs related to the economy using RSS feeds in Python. I have no idea how to get a specific number of blogs, or how to get blogs from a particular domain (like economy).
My project requires analysing these blogs using NLP techniques, but I'm stuck at this first step and don't know how to start.


An RSS feed is XML data, so you have to know how to parse XML. You can parse it with either ElementTree or minidom.
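
For example, something along these lines (Python 2, like the rest of this thread; the feed URL is only a placeholder) is roughly how an ElementTree version could look:

import urllib2
import xml.etree.ElementTree as ET

# Placeholder feed URL -- replace with the feed you actually want
url = 'http://example.com/economy.rss'
xml_data = urllib2.urlopen(url).read()
root = ET.fromstring(xml_data)

# In RSS 2.0 every post is an <item> element inside <channel>
for item in root.iter('item'):
    print item.findtext('title')
    print item.findtext('link')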


Thank you krystosan.
Umm, I don't know how to get the RSS feed for the blogs of a particular field (like economics, sports, etc.).


Give an example or a link of what you are trying to extract/parse.
As mentioned by krystosan it's XML data, and there are good tools for this in Python.
There is also a library just for parsing RSS, Universal Feed Parser.
I like both BeautifulSoup and lxml.
A quick demo with BeautifulSoup:

from bs4 import BeautifulSoup

# A small, hard-coded slice of an RSS feed for the demo
rss = '''\
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<title>Python</title>
<link>http://www.reddit.com/r/Python/</link>
<description>
news about the dynamic, interpreted, interactive, object-oriented, extensible programming language Python
</description>
</channel>
</rss>'''

soup = BeautifulSoup(rss)
# find() returns the first matching tag
title_tag = soup.find('title')
description_tag = soup.find('description')
print title_tag.text
print description_tag.text

"""Output-->
Python

news about the dynamic, interpreted, interactive, object-oriented, extensible programming language Python
"""

Thanks snippsat!
It really helped.
Although, when I try to find the content of the blog using
content_tag = soup.find('content:encoded')
it just gives the first paragraph of the website (the first place "content:encoded" occurs).


I have done this:
The code below gets the text only and removes the HTML tags with get_text(). That approach works generally, but this code doesn't work somehow. What am I doing wrong?

import urllib2
from bs4 import BeautifulSoup

# Find the site's RSS feed from the <link> tag in the page head
page = urllib2.urlopen("http://www.frugalrules.com")
soup = BeautifulSoup(page)
link = soup.find('link', type='application/rss+xml')
print link['href']

# Fetch and parse the feed itself
rss = urllib2.urlopen(link['href']).read()
souprss = BeautifulSoup(rss)

# find_all() returns every content:encoded element in the feed
content_tag = souprss.find_all('content:encoded')
for row in content_tag:
    print(row.get_text())
I also tried:

for node in row.findAll('p'):
    print ''.join(node.findAll(text=True))

The above code, and defining invalid tags, also does not work (the tags are nested and there are too many invalid tags).
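
One thing worth checking: the markup inside content:encoded is often escaped, so after the first parse it is plain text rather than real <p> tags, which would explain why findAll('p') on row finds nothing. A rough sketch of a second parse pass over that text (same feed discovery as before), in case that is the cause:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.frugalrules.com")
link = BeautifulSoup(page).find('link', type='application/rss+xml')
souprss = BeautifulSoup(urllib2.urlopen(link['href']).read())

for row in souprss.find_all('content:encoded'):
    # row.text is still HTML markup, so parse it again before stripping tags
    inner = BeautifulSoup(row.text)
    print inner.get_text()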
