Text slice and split made easy.

13 Years Ago Beat_Slayer 1 563 Views

Two little functions to help with slicing and splitting text.

The code comments say it all.

# Python 2 only: tuple parameters such as lambda((b, a, t)) were removed in Python 3.

# Slicer takes a single tuple argument: the string before, the string
# after and the text to slice. It returns the text between the first
# occurrence of the before string and the next occurrence of the after string.
Slicer = lambda((b, a, t)): t.partition(b)[2].partition(a)[0]

# Spliter takes the same tuple. It returns a tuple containing the text
# before the before string, the text between the two strings and the
# text after the after string.
Spliter = lambda((b, a, t)): ((t.partition(b)[0]),) + t.partition(b)[2].partition(a)[0::2]
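
For readers on Python 3, where tuple parameters are no longer accepted, here is a minimal sketch of the same two helpers. The names `slice_between` and `split_around` and the sample string are made up for illustration, and the functions take three separate arguments instead of one tuple.

```python
def slice_between(before, after, text):
    """Return the text between the first `before` and the next `after`."""
    return text.partition(before)[2].partition(after)[0]


def split_around(before, after, text):
    """Return (text before `before`, text in between, text after `after`)."""
    head, _, tail = text.partition(before)
    middle, _, rest = tail.partition(after)
    return head, middle, rest


# Made-up sample input, only for demonstration
sample = 'intro <b>bold words</b> outro'
print(slice_between('<b>', '</b>', sample))  # bold words
print(split_around('<b>', '</b>', sample))   # ('intro ', 'bold words', ' outro')
```

Unpacking the two `partition` results by name does the same work as the `[0]`, `[2]` and `[0::2]` indexing in the one-liners, just spelled out.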

Cheers and Happy coding

# Python 2 script: urllib.urlopen and the print statement do not exist in Python 3.
import urllib

sock = urllib.urlopen('http://www.columbia.edu/')
htmlSource = sock.read()
sock.close()

# Slicer takes a single tuple argument: the string before, the string
# after and the text to slice. It returns the text between the two
# given strings.
Slicer = lambda((b, a, t)): t.partition(b)[2].partition(a)[0]

tag_before = '<!-- BEGIN COLUMBIA NEWS -->'
tag_after = '<!-- END COLUMBIA NEWS -->'

newsSource = Slicer((tag_before, tag_after, htmlSource)).split('\r')

stripwhite = lambda x: x.strip()

# strip() already removes newlines, so one call per item is enough
newsSource = [item for item in map(stripwhite, newsSource) if item and item != '<br /><br />']

# Spliter takes the same kind of tuple. It returns a tuple containing
# the text before the first string, the text between the two strings
# and the text after the second string.
Spliter = lambda((b, a, t)): ((t.partition(b)[0]),) + t.partition(b)[2].partition(a)[0::2]

tag_before = '<a href='
tag_after = '>'

for line in newsSource:
    news_title, link, rest = Spliter((tag_before, tag_after, line))
    print 'News:', news_title
    print 'Link:', link
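
The script above needs Python 2 and a live connection. As a rough sketch of the same pipeline that runs offline on Python 3, the block below works on a made-up HTML fragment (an assumption, not the real columbia.edu markup) and uses plain three-argument versions of the helpers. On Python 3 the download itself would go through `urllib.request.urlopen`, and the bytes it returns would need decoding before `partition` is applied with string arguments.

```python
# Hypothetical stand-in for the downloaded page; NOT the real site markup.
html_source = (
    '<html><!-- BEGIN COLUMBIA NEWS -->\r\n'
    'First story <a href="/news/1">more</a>\r\n'
    '<br /><br />\r\n'
    '\r\n'
    'Second story <a href="/news/2">more</a>\r\n'
    '<!-- END COLUMBIA NEWS --></html>'
)


def slicer(before, after, text):
    # Text between the first `before` and the next `after`
    return text.partition(before)[2].partition(after)[0]


def spliter(before, after, text):
    # (text before `before`, text in between, text after `after`)
    head, _, tail = text.partition(before)
    middle, _, rest = tail.partition(after)
    return head, middle, rest


section = slicer('<!-- BEGIN COLUMBIA NEWS -->',
                 '<!-- END COLUMBIA NEWS -->', html_source)

# Split on carriage returns, strip whitespace, drop empty and spacer lines
lines = [item.strip() for item in section.split('\r')]
lines = [item for item in lines if item and item != '<br /><br />']

results = []
for line in lines:
    news_title, link, rest = spliter('<a href=', '>', line)
    results.append((news_title, link))
    print('News:', news_title)
    print('Link:', link)
```

Note that `link` keeps the surrounding quote characters, because the split happens right after `<a href=` and right before `>`.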

TrustyTony 888 pyMod

13 Years Ago

I did a little cleanup of your code; it looks like you have compressed my between function. The lambdas looked a little out of place, so I changed them to normal defs to make the code more understandable for people without Lisp or similar experience. I hope I did not break anything:

import urllib

sock = urllib.urlopen('http://www.columbia.edu/')
htmlSource = sock.read()
sock.close()

# slicer takes a single tuple argument: the string before, the string
# after and the text to slice. It returns the text between the two
# given strings.
def slicer((before, after, text)):
    return text.partition(before)[2].partition(after)[0]

# spliter takes the same kind of tuple. It returns a tuple containing
# the text before the first string, the text between the two strings
# and the text after the second string.
def spliter((b, a, t)):
    return (t.partition(b)[0],) + t.partition(b)[2].partition(a)[0::2]

def stripwhite(x):
    return x.strip()

tag_before = '<!-- BEGIN COLUMBIA NEWS -->'
tag_after = '<!-- END COLUMBIA NEWS -->'

newsSource = slicer((tag_before, tag_after, htmlSource)).split('\r')
newsSource = [stripwhite(item)
              for item in newsSource
              if '' != stripwhite(item) != '<br /><br />' ]

tag_before = '<a href='
tag_after = '>'

for line in newsSource:
    news_title, link, rest = spliter((tag_before, tag_after, line))
    print 'News:', news_title
    print 'Link:', link
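
The filter in the cleaned-up list comprehension relies on a chained comparison: `'' != x != '<br /><br />'` is evaluated as `'' != x and x != '<br /><br />'`. A small Python 3 sketch of that equivalence follows; the helper names `keep` and `keep_explicit` and the sample list are invented for illustration.

```python
def keep(item):
    # Chained form, as in the comprehension above
    stripped = item.strip()
    return '' != stripped != '<br /><br />'


def keep_explicit(item):
    # The same test written as two separate comparisons
    stripped = item.strip()
    return stripped != '' and stripped != '<br /><br />'


# Made-up sample lines
samples = ['  headline  ', '\n', '', ' <br /><br /> ', '<br />']
kept = [s for s in samples if keep(s)]
print(kept)  # ['  headline  ', '<br />']
```

Both helpers agree on every sample: only blank lines and the exact `<br /><br />` spacer are dropped.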

Edited 13 Years Ago by TrustyTony because: n/a

Beat_Slayer 17 Posting Pro in Training

13 Years Ago

Yes, it follows the same principle: multiple partition calls to slice and split a string.
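
To make the principle concrete, here is a short Python 3 sketch of what `str.partition` returns and what chaining two calls does when a marker is missing. The function name `between` and the sample text are assumptions for the example.

```python
text = 'key=value;rest'

# partition always returns a 3-tuple: (head, separator, tail)
print(text.partition('='))  # ('key', '=', 'value;rest')
# If the separator is absent, the whole string lands in the first slot
print(text.partition('#'))  # ('key=value;rest', '', '')


def between(before, after, text):
    # Two chained partition calls: cut after `before`, then cut at `after`
    return text.partition(before)[2].partition(after)[0]


print(repr(between('=', ';', text)))  # 'value'
print(repr(between('#', ';', text)))  # '' because the start marker is missing
print(repr(between('=', '#', text)))  # 'value;rest' because the end marker is missing
```

So a missing start marker yields an empty string, while a missing end marker yields everything after the start marker; neither case raises an exception.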

I always use similar methods, and decided to turn my current versions into one-liners to make things easier.

Cheers and Happy coding
