Hello,

I've been downloading my news for offline use with this script:

wget -r --no-parent -Q4096m -U Mozilla -A.stm -erobots=off http://news.bbc.co.uk/2/hi/business/default.stm

But it dumps everything into one folder, and the file names are just numbers. Is there a regular expression or some other command I can use to separate the files into folders by date?

Last Post by slate
I would check the wget manual first.
If that does not help, you can open each file, extract the publication date from it, and then put the file wherever you want.

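import os
from os.path import join
import re

# Each downloaded page stores its publication date in a <meta> tag.
s = re.compile(r'<meta name="OriginalPublicationDate" content="(.*?)"')

_dir = "news.bbc.co.uk/2/hi/business/"
for f in os.listdir(_dir):
    if f.endswith(".stm"):
        # errors="replace" keeps odd bytes in a page from stopping the loop
        with open(join(_dir, f), encoding="utf-8", errors="replace") as fh:
            st = fh.read()
        # search() finds the tag anywhere in the page
        ob = s.search(st)
        if ob:
            print(f, ob.group(1))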

This prints:

8348437.stm 2009/11/07 16:34:40
8317828.stm 2009/10/21 08:09:31
8253047.stm 2009/09/13 08:55:16
7879565.stm 2009/02/09 16:31:28
8375969.stm 2009/11/24 22:15:41
8388133.stm 2009/12/01 10:42:11
8370035.stm 2009/11/20 13:26:04
4372794.stm 2006/01/31 21:07:28
8063149.stm 2009/05/22 09:35:28
8365018.stm 2009/11/23 23:44:45
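
To actually sort the files into folders, the same date can be used to build a folder name. Here is a rough sketch, not a tested solution for your setup: the function name sort_by_date and the YYYY-MM-DD folder layout are my own choices, and it assumes every page carries the OriginalPublicationDate meta tag in the format shown in the output above. Files without that tag are left where they are.

```python
import os
import re
import shutil

# Same meta tag as above, but capturing year, month and day separately.
DATE_RE = re.compile(
    r'<meta name="OriginalPublicationDate" content="(\d{4})/(\d{2})/(\d{2})'
)


def sort_by_date(src_dir):
    """Move each .stm file in src_dir into a YYYY-MM-DD subfolder.

    Returns a dict mapping each moved file name to its folder name.
    (Hypothetical helper; the name and folder layout are assumptions.)
    """
    moved = {}
    for name in os.listdir(src_dir):
        path = os.path.join(src_dir, name)
        # Skip subfolders and anything that is not a downloaded page.
        if not (name.endswith(".stm") and os.path.isfile(path)):
            continue
        with open(path, encoding="utf-8", errors="replace") as fh:
            match = DATE_RE.search(fh.read())
        if not match:
            continue  # no date found, leave the file in place
        folder = "-".join(match.groups())  # e.g. 2009-11-07
        target = os.path.join(src_dir, folder)
        os.makedirs(target, exist_ok=True)
        shutil.move(path, os.path.join(target, name))
        moved[name] = folder
    return moved
```

You would call it as sort_by_date("news.bbc.co.uk/2/hi/business/"). Keep in mind that moving the pages will break any relative links between them, so try it on a copy of the download first.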

