Read News Headlines Using Python

Question

abhi_cvx 0 Newbie Poster

14 Years Ago

Hi All,

I am really new to programming and I am trying to solve a simple problem for my Python course.

I have to write a script that retrieves Columbia University's webpage and prints
only the titles of the news stories on the main page. I have to use regular expressions and string operations.

I wrote this to start with, but I am really confused about what to do next:

import urllib
sock = urllib.urlopen('http://www.columbia.edu/')
htmlSource = sock.read()
sock.close()
print htmlSource

Any quick help would be highly appreciated. It would be great if somebody could send me some sample code that I can read through and understand.

I am really short on time, so a quick response would be appreciated!

Thanks,
Abhi

python


All 4 Replies

Beat_Slayer 17 Posting Pro in Training

14 Years Ago

Similar task

Try something and we'll help you.

Cheers and Happy coding


Beat_Slayer 17 Posting Pro in Training · Answer 1 · 2010-08-31T17:12:34+00:00

Some help.

import urllib

sock = urllib.urlopen('http://www.columbia.edu/')
htmlSource = sock.read()
sock.close()

before_html, tag_before, rest_html = str(htmlSource).partition('<!-- BEGIN COLUMBIA NEWS -->')
news, tag_after, after_html = rest_html.partition('<!-- END COLUMBIA NEWS -->')

for line in news.split('\r'):
    print line.strip()
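
For anyone who wants to try the same partition idea without hitting the live site, here is a minimal sketch in Python 3 (the snippets in this thread are Python 2). The helper name `extract_between` and the sample HTML are made up for illustration; only the two comment markers come from the code above, and the real page may look different.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or '' if either marker is missing."""
    _, found_start, rest = text.partition(start_marker)
    if not found_start:
        return ''
    section, found_end, _ = rest.partition(end_marker)
    if not found_end:
        return ''
    return section


# Hypothetical sample standing in for the downloaded page source.
sample_html = (
    '<html><body>\r\n'
    '<!-- BEGIN COLUMBIA NEWS -->\r\n'
    '  First headline\r\n'
    '  Second headline\r\n'
    '<!-- END COLUMBIA NEWS -->\r\n'
    '</body></html>'
)

news = extract_between(sample_html,
                       '<!-- BEGIN COLUMBIA NEWS -->',
                       '<!-- END COLUMBIA NEWS -->')

# splitlines() copes with '\r', '\n' and '\r\n' line endings alike.
headlines = [line.strip() for line in news.splitlines() if line.strip()]
for headline in headlines:
    print(headline)
```

Checking whether the separator was actually found avoids silently printing the rest of the page if the site ever drops one of the markers.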

Cheers and Happy coding

Beat_Slayer 17 Posting Pro in Training · Answer 2 · 2010-08-31T17:54:23+00:00

Sorry for all the consecutive posts.

I was bored so...

import urllib

sock = urllib.urlopen('http://www.columbia.edu/')
htmlSource = sock.read()
sock.close()

before_html, tag_before, rest_html = str(htmlSource).partition('<!-- BEGIN COLUMBIA NEWS -->')
news, tag_after, after_html = rest_html.partition('<!-- END COLUMBIA NEWS -->')

# Strip surrounding whitespace, then drop empty lines and bare line-break tags
news = [item.strip() for item in news.split('\r')]
news = [item for item in news if item and item != '<br /><br />']

for line in news:
    news_title, tag_before, rest_html = str(line).partition('<a href=')
    link, tag_after, after_html = rest_html.partition('>')
    print 'News:', news_title
    print 'Link:', link
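
Since the assignment asks for regular expressions, the same split into title and link could also be sketched with the `re` module. This is Python 3 and only a rough illustration: the pattern, the `find_links` helper and the sample markup are assumptions, not the actual structure of the Columbia page, and a regex like this will miss single-quoted or unquoted `href` values.

```python
import re

# Matches an anchor tag, capturing the href value and the link text.
# Assumes double-quoted href attributes and no nested tags inside the link.
ANCHOR_RE = re.compile(
    r'<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def find_links(html):
    """Return a list of (title, link) tuples for every anchor found in html."""
    return [(title.strip(), href) for href, title in ANCHOR_RE.findall(html)]


# Hypothetical news block; the real markup between the markers may differ.
sample_news = (
    '<a href="http://example.com/story1">First headline</a><br /><br />\r\n'
    '<a class="news" href="http://example.com/story2"> Second headline </a>'
)

for title, link in find_links(sample_news):
    print('News:', title)
    print('Link:', link)
```

For anything beyond coursework, an HTML parser is usually more robust than a regex, but this stays within the regex and string operations the task calls for.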

Cheers and Happy coding

Beat_Slayer 17 Posting Pro in Training · Answer 3 · 2010-08-31T22:53:19+00:00

Look here.

Text slice and split made easy.

Cheers and Happy coding
