import urllib
from urllib2 import urlopen
from gzip import GzipFile
from cStringIO import StringIO
import re
import urllib2

def download(url):
    s = urlopen(url).read()
    if s[:2] == '\x1f\x8b': # assume it's gzipped data
        with GzipFile(mode='rb', fileobj=StringIO(s)) as ifh:
            s = ifh.read()
    return s

s = download('http://www.locationary.com/place/en/US/Virginia/Richmond-page20/?ACTION_TOKEN=NumericAction')

findLoc = re.compile('http://www\.locationary\.com/place/en/US/Virginia/Richmond/.{1,100}\.jsp')

findLocL = re.findall(findLoc,s)

for i in range(0,25):

    def download(url):
        s = urlopen(url).read()
        if s[:2] == '\x1f\x8b': # assume it's gzipped data
            with GzipFile(mode='rb', fileobj=StringIO(s)) as ifh:
                s = ifh.read()
        return s

    b = download(findLocL[i])
  
    findYP = re.compile('http://www\.yellowpages\.com/.{1,100}\d{1,100}')

    findYPL = re.findall(findYP,b)
    
    for c in range(1):
        
        print findYPL[c]


For some reason, the program won't print findYPL[c] until I've actually gone into my browser and opened the link to that specific page (the findLocL[i] URL that gets downloaded into "b").

Hey woooee...if you read this, could you please help me? I would really appreciate it! Thanks!

Or anybody else for that matter...

Can someone please help?

Hi Daniweb! Can someone help me out? I would really appreciate it!

Your code is doing some stupid things, like redefining the download function 25 times inside the loop (and it still has two unnecessary imports). You also got your basic code from Gribouillis, so we would really appreciate some honest effort.

I wrote almost all of that code myself. You could just be nice. I've been trying to figure this stuff out, so I figured I would ask on here. But obviously there are a couple of jerks everywhere.

Your re seems to bring back only the same result from multiple pages. Also, re is not a good way to deal with HTML data; you should consider BeautifulSoup or another HTML/XML module.
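
The suggestion above to use an HTML parser instead of a regex could look roughly like the sketch below. It is written for Python 3 and uses the standard library's html.parser rather than BeautifulSoup, so nothing needs installing. The LinkCollector class, the extract_links helper, and the SAMPLE markup are all made up for illustration, and the sketch assumes the yellowpages URLs appear as href attributes of <a> tags, which has not been checked against the real locationary.com pages.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose URL starts with a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value can be None for attributes written without a value
            if name == "href" and value and self.pattern.match(value):
                self.links.append(value)


def extract_links(html, pattern):
    """Return every matching link found in the given HTML text."""
    parser = LinkCollector(pattern)
    parser.feed(html)
    parser.close()
    return parser.links


# Invented sample markup, only here to show the behaviour.
SAMPLE = """
<html><body>
<a href="http://www.yellowpages.com/richmond-va/mip/example-cafe-123">YP</a>
<a href="http://www.example.com/other">other</a>
<p>http://www.yellowpages.com/not-a-link-999 appears only as text</p>
</body></html>
"""

print(extract_links(SAMPLE, r"http://www\.yellowpages\.com/"))
```

Unlike a regex run over the whole page, this only looks at real link targets, so a URL that merely shows up in the page text is ignored.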

The title re (commented out in the code below) returns more interesting info, as it basically contains the address on the page:

from urllib2 import urlopen
from gzip import GzipFile
import re
from cStringIO import StringIO

#import webbrowser
def download(url):
    page = urlopen(url).read()
    if page[:2] == '\x1f\x8b': # assume it's gzipped data
        with GzipFile(mode='rb', fileobj=StringIO(page)) as ifh:
            return ifh.read()
    return page

page = download('http://www.locationary.com/place/en/US/Virginia/Richmond-page20/?ACTION_TOKEN=NumericAction')
#target = re.compile('<title>(.*)</title>')
target = re.compile(r'http://www\.yellowpages\.com/.{1,100}\d{1,100}')
location_re = re.compile(r'http://www\.locationary\.com/place/en/US/Virginia/Richmond/.{1,100}\.jsp')

for location in location_re.findall(page):
    for info in download(location).splitlines():
        res = '\n'.join(target.findall(info))
        if res:
            # only one per location
            print res
            break
    
"""Output:
http://www.yellowpages.com/berkeley-ca/mip/claremont-hotel-club-spa-10695838
http://www.yellowpages.com/berkeley-ca/mip/claremont-hotel-club-spa-10695838
http://www.yellowpages.com/berkeley-ca/mip/claremont-hotel-club-spa-10695838
http://www.yellowpages.com/berkeley-ca/mip/claremont-hotel-club-spa-10695838
http://www.yellowpages.com/berkeley-ca/mip/claremont-hotel-club-spa-10695838
http://www.yellowpages.com/berkeley-ca/mip/claremont-hotel-club-spa-10695838
"""

You should also incrementally test your code and not just post a load of crap and demand that we fix it for you, or otherwise we are jerks. I get that your web page is zipped in some odd form. Note that urlopen().read() returns a string or bytes, not a file object, so you would use zlib to decompress it. The only way I was able to do this was to save the zipped string and use gzip with the force option to unzip it. Gzip is standard on all Linux and Mac systems, but if you are using MS Windows, you will have to experiment with what is available.

import zlib
import urllib2
import subprocess
 
def download(url):
    response=urllib2.urlopen(url)
    print 'RESPONSE:', response
    print 'URL     :', response.geturl()

    headers = response.info()
    print 'DATE    :', headers['date']
    print 'HEADERS :'
    print '---------'
    print headers

    data = response.read()
    print 'LENGTH  :', len(data)
    print 'DATA    :'
    print '---------'
    print type(data)

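    # Save the raw (still compressed) bytes, then let the gzip command unpack them.
    name_write = "./test_1.gz"
    with open(name_write, "wb") as fp:
        fp.write(data)

    # -d decompresses, -f forces it; the result is written to ./test_1
    subprocess.call(["gzip", "-df", name_write])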

download('http://www.locationary.com/place/en/US/Virginia/Richmond-page20/?ACTION_TOKEN=NumericAction')

"""----------  The first few lines of what I got  ----------
		<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> 
		

<html>
			<head>	  				
				<meta http-equiv="content-type" content="text/html; charset=UTF-8">
				
				
				<title>Richmond (Virginia, United States)</title>	
				<link href='http://www.locationary.com/css/style.css' rel="stylesheet" type="text/css"> 
"""

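The post above mentions zlib but ends up shelling out to the gzip command. A minimal sketch of the in-memory route follows, assuming the response body is an ordinary gzip stream (that has not been verified against this site). It is written for Python 3, where urllib2 became urllib.request; the maybe_gunzip name and the sample bytes are invented for illustration.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(data):
    """Return data decompressed if it looks gzipped, otherwise unchanged."""
    if data[:2] == GZIP_MAGIC:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(data, 16 + zlib.MAX_WBITS)
    return data


# Stand-in for the bytes returned by response.read()
html = b"<html><title>Richmond (Virginia, United States)</title></html>"
packed = gzip.compress(html)

print(maybe_gunzip(packed) == html)  # compressed input is unpacked
print(maybe_gunzip(html) == html)    # plain input passes through
```

This avoids the temporary file and the dependency on an external gzip binary, so it should behave the same on MS Windows.
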
@ Tony
Is it me, or is this year's group of questioners worse than in years past? Two incidents in one day where someone who only takes and does not contribute demands that we solve their problem for them. Both of them are now on my "do not respond" list.

I hate myself. I'm going to get a dagger and kill myself.

Well, that's off-topic and more of a personal problem.

Perhaps a reply acknowledging the suggestions pyTony and woooee made above would be more appropriate.
