Hi All,

I am new to programming and am trying to solve a simple problem for my Python course.

I have to write a script that retrieves Columbia University's web page and prints
only the titles of the news stories on the main page. I have to use regular expressions and string operations.

I wrote this to start with, but I am not sure how to take it any further:

import urllib
sock = urllib.urlopen('http://www.columbia.edu/')
htmlSource = sock.read()
sock.close()
print htmlSource

Any quick help would be highly appreciated. It would be great if somebody could post some sample code that I can read through to understand the approach.

I am really short of time, so a quick response would be appreciated!

Thanks,
Abhi

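Some help. This uses str.partition to cut out the part of the page between the two news comment markers, then prints it line by line.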

import urllib

sock = urllib.urlopen('http://www.columbia.edu/')
htmlSource = sock.read()
sock.close()

before_html, tag_before, rest_html = str(htmlSource).partition('<!-- BEGIN COLUMBIA NEWS -->')
news, tag_after, after_html = rest_html.partition('<!-- END COLUMBIA NEWS -->')

# splitlines() copes with \r, \n and \r\n line endings
for line in news.splitlines():
    print line.strip()
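
For anyone who wants to see the partition trick without hitting the network, here is a minimal sketch. It is written for Python 3 (the code in this thread is Python 2), and SAMPLE_HTML and between_markers are made-up names used only for illustration; the sample text merely stands in for the real page source.

```python
# Hypothetical sample standing in for the downloaded page source.
SAMPLE_HTML = (
    "<html><body>header stuff\n"
    "<!-- BEGIN COLUMBIA NEWS -->\n"
    "First story\n"
    "Second story\n"
    "<!-- END COLUMBIA NEWS -->\n"
    "footer stuff</body></html>"
)


def between_markers(text, start, end):
    """Return the text between the first `start` and the next `end`.

    Returns an empty string if `start` is missing. If `end` is missing,
    everything after `start` is returned.
    """
    # partition() returns (before, separator, after); the separator is
    # empty when the marker was not found.
    _, found_start, rest = text.partition(start)
    if not found_start:
        return ""
    middle, _, _ = rest.partition(end)
    return middle


news = between_markers(
    SAMPLE_HTML,
    "<!-- BEGIN COLUMBIA NEWS -->",
    "<!-- END COLUMBIA NEWS -->",
)

# Trim each line and drop the blank ones.
lines = [line.strip() for line in news.splitlines() if line.strip()]
for line in lines:
    print(line)
```

Checking the separator that partition() returns is what lets the helper tell "marker not found" apart from "marker found with nothing after it".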

Cheers and Happy coding

Sorry for all the consecutive posts.

I was bored so...

import urllib

sock = urllib.urlopen('http://www.columbia.edu/')
htmlSource = sock.read()
sock.close()

before_html, tag_before, rest_html = str(htmlSource).partition('<!-- BEGIN COLUMBIA NEWS -->')
news, tag_after, after_html = rest_html.partition('<!-- END COLUMBIA NEWS -->')

# Split on any line ending, trim whitespace, then drop blank lines and bare <br /> separators
news = [item.strip() for item in news.splitlines()]
news = [item for item in news if item and item != '<br /><br />']

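for line in news:
    # Text before the anchor tag is treated as the title
    news_title, tag_before, rest_html = line.partition('<a href=')
    # Everything up to the next '>' is the link, including any quotes or extra attributes
    link, tag_after, after_html = rest_html.partition('>')
    print 'News:', news_title
    print 'Link:', link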
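
Since the assignment also asks for regular expressions, here is a rough sketch of the same idea with the re module. It is Python 3, and it assumes each story is an anchor tag whose visible text is the title; the real markup on the page may well differ. SAMPLE_NEWS, ANCHOR_RE, TAG_RE and extract_titles are hypothetical names, and the example.edu URLs are placeholders.

```python
import re

# Hypothetical fragment; not copied from the real page.
SAMPLE_NEWS = """
<a href="http://example.edu/story-one">First <b>story</b> title</a><br /><br />
<a href="http://example.edu/story-two">Second story title</a><br /><br />
"""

# Group 1 is the href value, group 2 is everything inside the anchor.
ANCHOR_RE = re.compile(
    r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
# Matches any single tag, used to strip markup nested inside the title.
TAG_RE = re.compile(r'<[^>]+>')


def extract_titles(html):
    """Return a list of (title, link) pairs found in `html`."""
    results = []
    for link, inner in ANCHOR_RE.findall(html):
        title = TAG_RE.sub('', inner).strip()
        if title:
            results.append((title, link))
    return results


stories = extract_titles(SAMPLE_NEWS)
for title, link in stories:
    print('News:', title)
    print('Link:', link)
```

A regex like this is fine for a small, known fragment such as the news block, but it is brittle against markup changes, so for anything beyond coursework an HTML parser is usually the safer choice.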

Cheers and Happy coding

Edited 6 Years Ago by Beat_Slayer: n/a
