
Two little functions to help with slicing and splitting text.

The code comments say it all.

# Slicer takes as arguments a tuple containing the string before,
# the string after and the string to truncate. It returns the string
# between the two given strings
Slicer = lambda((b, a, t)): t.partition(b)[2].partition(a)[0]
# Spliter takes as arguments a tuple containing the string before,
# the string after and the string to truncate. It returns a tuple
# containing the string before, the string between and the string
# after the given strings
Spliter = lambda((b, a, t)): ((t.partition(b)[0]),) + t.partition(b)[2].partition(a)[0::2]
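
The tuple-parameter syntax used in the two lambdas above only works on Python 2; it was removed in Python 3 (PEP 3113). As a rough sketch of the same idea for Python 3, the helpers could be written with ordinary arguments. The names `slice_between` and `split_around` and the sample string are made up for illustration and are not part of the original snippet.

```python
def slice_between(before, after, text):
    """Return the part of text between the first `before` and the next `after`."""
    return text.partition(before)[2].partition(after)[0]


def split_around(before, after, text):
    """Return (head, between, tail) around the first before/after pair."""
    head, _, remainder = text.partition(before)
    between, _, tail = remainder.partition(after)
    return head, between, tail


# Hypothetical sample line, only for demonstration.
sample = 'Campus news <a href="/news/1">read more'

middle = slice_between('<a href=', '>', sample)
parts = split_around('<a href=', '>', sample)

print(middle)  # "/news/1" (including the double quotes)
print(parts)   # ('Campus news ', '"/news/1"', 'read more')
```

Note that the separators themselves are dropped from the result, just as in the lambda versions.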

Cheers and Happy coding

import urllib

sock = urllib.urlopen('http://www.columbia.edu/')
htmlSource = sock.read()
sock.close()

# Slicer takes as arguments a tuple containing the string before,
# the string after and the string to truncate. It returns the string
# between the two given strings
Slicer = lambda((b, a, t)): t.partition(b)[2].partition(a)[0]

tag_before = '<!-- BEGIN COLUMBIA NEWS -->'
tag_after = '<!-- END COLUMBIA NEWS -->'

newsSource = Slicer((tag_before, tag_after, htmlSource)).split('\r')

stripwhite = lambda x: x.strip()

newsSource = [stripwhite(item.strip('\n')) for item in newsSource if stripwhite(item.strip('\n')) and stripwhite(item.strip('\n')) != '<br /><br />']

# Spliter takes as arguments a tuple containing the string before,
# the string after and the string to truncate. It returns a tuple
# containing the string before, the string between and the string
# after the given strings
Spliter = lambda((b, a, t)): ((t.partition(b)[0]),) + t.partition(b)[2].partition(a)[0::2]

tag_before = '<a href='
tag_after = '>'

for line in newsSource:
    news_title, link, rest = Spliter((tag_before, tag_after, line))
    print 'News:', news_title
    print 'Link:', link
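
The script above depends on a live page and on Python 2 (`urllib.urlopen` lives in `urllib.request` on Python 3). To try the same steps without a network connection, here is a minimal Python 3 sketch that runs them on a small invented HTML fragment. The fragment, the helper names and the `results` list are assumptions for illustration; the real page may look different.

```python
# Hypothetical fragment standing in for the downloaded page.
html_source = (
    'header<!-- BEGIN COLUMBIA NEWS -->\r\n'
    'First story <a href="/a.html">more</a>\r\n'
    '<br /><br />\r\n'
    'Second story <a href="/b.html">more</a>\r\n'
    '<!-- END COLUMBIA NEWS -->footer'
)


def slice_between(before, after, text):
    # Text between the first `before` and the next `after`.
    return text.partition(before)[2].partition(after)[0]


def split_around(before, after, text):
    # (head, between, tail) around the first before/after pair.
    head, _, remainder = text.partition(before)
    between, _, tail = remainder.partition(after)
    return head, between, tail


# 1. Cut out the news block, 2. split it on carriage returns,
# 3. strip whitespace and drop empty lines and bare line breaks.
block = slice_between('<!-- BEGIN COLUMBIA NEWS -->',
                      '<!-- END COLUMBIA NEWS -->', html_source)
lines = [line.strip() for line in block.split('\r')]
lines = [line for line in lines if line and line != '<br /><br />']

# 4. Split each remaining line around its link.
results = []
for line in lines:
    title, link, _rest = split_around('<a href=', '>', line)
    results.append((title.strip(), link))
    print('News:', title.strip())
    print('Link:', link)
```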
Last Post by Beat_Slayer

I did a little cleanup of your code; it looks like you have compressed my between function. The lambdas looked a little out of place, so I changed them to normal defs to be more understandable for people without Lisp or similar experience. I hope I did not break anything:

import urllib

sock = urllib.urlopen('http://www.columbia.edu/')
htmlSource = sock.read()
sock.close()

# Slicer takes as arguments a tuple containing the string before,
# the string after and the string to truncate. It returns the string
# between the two given strings
def slicer((before, after, text)):
    return (text.partition(before)[2].partition(after)[0])

# Spliter takes as arguments a tuple containing the string before,
# the string after and the string to truncate. It returns a tuple
# containing the string before, the string between and the string
# after the given strings
def spliter((b, a, t)):
    return ((t.partition(b)[0]),) + t.partition(b)[2].partition(a)[0::2]

def stripwhite(x):
    return x.strip()

tag_before = '<!-- BEGIN COLUMBIA NEWS -->'
tag_after = '<!-- END COLUMBIA NEWS -->'

newsSource = slicer((tag_before, tag_after, htmlSource)).split('\r')
newsSource = [stripwhite(item)
              for item in newsSource
              if '' != stripwhite(item) != '<br /><br />' ]

tag_before = '<a href='
tag_after = '>'

for line in newsSource:
    news_title, link, rest = spliter((tag_before, tag_after, line))
    print 'News:', news_title
    print 'Link:', link
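
The filter condition `'' != stripwhite(item) != '<br /><br />'` in the cleaned-up version relies on Python's chained comparisons: `a != b != c` means `a != b and b != c`, with the middle expression evaluated only once. A small sketch with made-up input lines, comparing the chained form with the spelled-out one:

```python
# Invented sample lines, only for demonstration.
raw_lines = ['  First story  ', '\n', '<br /><br />', '\nSecond story']

# Chained comparison: item.strip() in the condition is evaluated once.
kept = [item.strip()
        for item in raw_lines
        if '' != item.strip() != '<br /><br />']

# Equivalent explicit form with two separate tests.
explicit = [item.strip()
            for item in raw_lines
            if item.strip() != '' and item.strip() != '<br /><br />']

print(kept)      # ['First story', 'Second story']
print(explicit)  # ['First story', 'Second story']
```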

Edited by pyTony: n/a


Yes, it follows the same principle of chaining multiple partition calls to slice and split a string.

I always use similar methods, and decided to turn my current versions into one-liners to make things easier.
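
One thing to keep in mind with chained partition calls: when the separator is not found, `str.partition` returns the whole string followed by two empty strings. A short sketch of what that means for the slicing helper (written here with plain arguments so it also runs on Python 3; the function name and the sample text are only illustrative):

```python
def slice_between(before, after, text):
    # Text between the first `before` and the next `after`.
    return text.partition(before)[2].partition(after)[0]


text = 'start [middle] end'

found = slice_between('[', ']', text)      # both markers present
no_before = slice_between('{', ']', text)  # opening marker missing
no_after = slice_between('[', '}', text)   # closing marker missing

print(repr(found))      # 'middle'
print(repr(no_before))  # ''
print(repr(no_after))   # 'middle] end'
```

So a missing opening marker gives an empty string, while a missing closing marker gives everything after the opening marker.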

Cheers and Happy coding
