Something is wrong. It won't let me do anything until I open the URL with my browser!

Question

jacob501 0 Light Poster

13 Years Ago

import urllib
from urllib2 import urlopen
from gzip import GzipFile
from cStringIO import StringIO
import re
import urllib2

def download(url):
    s = urlopen(url).read()
    if s[:2] == '\x1f\x8b': # assume it's gzipped data
        with GzipFile(mode='rb', fileobj=StringIO(s)) as ifh:
            s = ifh.read()
    return s

s = download('http://www.locationary.com/place/en/US/Virginia/Richmond-page20/?ACTION_TOKEN=NumericAction')

findLoc = re.compile('http://www\.locationary\.com/place/en/US/Virginia/Richmond/.{1,100}\.jsp')

findLocL = re.findall(findLoc,s)

for i in range(0,25):

    def download(url):
        s = urlopen(url).read()
        if s[:2] == '\x1f\x8b': # assume it's gzipped data
            with GzipFile(mode='rb', fileobj=StringIO(s)) as ifh:
                s = ifh.read()
        return s

    b = download(findLocL[i])
  
    findYP = re.compile('http://www\.yellowpages\.com/.{1,100}\d{1,100}')

    findYPL = re.findall(findYP,b)
    
    for c in range(1):
        
        print findYPL[c]

python


All 10 Replies

TrustyTony 888 ex-Moderator

13 Years Ago

Your code does some silly things, like redefining download() 25 times inside the loop (and it still has two unnecessary imports). You also got your basic code from Gribouillis, so we would really appreciate some honest effort.
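
For readers following along, here is a minimal sketch of the restructuring suggested above: compile the patterns once, define the helper once, and leave only the per-page work inside the loop. The two patterns come from the question; the `first_match_per_page` name, the `fetch` parameter and the stub pages are hypothetical, introduced only so the example runs without network access. It is written for Python 3, unlike the Python 2 code in the thread.

```python
import re

# Patterns copied from the question; compiled once, outside any loop.
LOCATION_RE = re.compile(r'http://www\.locationary\.com/place/en/US/Virginia/Richmond/.{1,100}\.jsp')
YP_RE = re.compile(r'http://www\.yellowpages\.com/.{1,100}\d{1,100}')

def first_match_per_page(index_html, fetch, limit=25):
    """Follow each location link found in index_html and return the first
    Yellow Pages URL on each linked page, or None when a page has none."""
    results = []
    for location in LOCATION_RE.findall(index_html)[:limit]:
        matches = YP_RE.findall(fetch(location))
        # Guard against an empty result instead of indexing blindly.
        results.append(matches[0] if matches else None)
    return results

# Hypothetical stand-in for a real download(url) function.
fake_pages = {
    'http://www.locationary.com/place/en/US/Virginia/Richmond/a.jsp':
        'see http://www.yellowpages.com/richmond-va/mip/example-123 here',
    'http://www.locationary.com/place/en/US/Virginia/Richmond/b.jsp':
        'no link on this page',
}
index = '\n'.join(fake_pages)  # one location URL per line
print(first_match_per_page(index, fake_pages.get))
```

Passing the fetch function in as an argument is only a convenience for testing; in the real script it would be the single module-level download() function.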

woooee 814 Nearly a Posting Maven

13 Years Ago

You should also test your code incrementally rather than post a load of crap and demand that we fix it for you, or else we are jerks. I gather that your web page is zipped in some odd form. Note that urlopen().read() returns a string (bytes), not a file object, so you would use zlib to decompress it. The only way I was able to do this was to save the zipped string to a file and run gzip with the force option to unzip it. Gzip is standard on all Linux and Mac systems, but if you are using MS Windows, you will have to experiment with what is available.

import zlib
import urllib2
import subprocess
 
def download(url):
    response=urllib2.urlopen(url)
    print 'RESPONSE:', response
    print 'URL     :', response.geturl()

    headers = response.info()
    print 'DATE    :', headers['date']
    print 'HEADERS :'
    print '---------'
    print headers

    data = response.read()
    print 'LENGTH  :', len(data)
    print 'DATA    :'
    print '---------'
    print type(data)

    name_write="./test_1.gz"
    fp=open(name_write, "wb")
    fp.write(data)
    fp.close()

    subprocess.call("gzip -df %s" % (name_write), shell=True)

download('http://www.locationary.com/place/en/US/Virginia/Richmond-page20/?ACTION_TOKEN=NumericAction')

"""----------  The first few lines of what I got  ----------
		<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> 
		

<html>
			<head>	  				
				<meta http-equiv="content-type" content="text/html; charset=UTF-8">
				
				
				<title>Richmond (Virginia, United States)</title>	
				<link href='http://www.locationary.com/css/style.css' rel="stylesheet" type="text/css"> 
"""

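As a possible alternative to writing a file and shelling out to gzip, the zlib route mentioned above can be sketched in memory. This assumes the payload is ordinary gzip data, which may not hold for the page in question; the `maybe_gunzip` name and the sample bytes are hypothetical, and the sketch targets Python 3.

```python
import gzip
import zlib

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of any gzip stream

def maybe_gunzip(data):
    """Return data decompressed if it starts with the gzip magic bytes,
    otherwise return it unchanged."""
    if data[:2] == GZIP_MAGIC:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(data, 16 + zlib.MAX_WBITS)
    return data

# Made-up payloads standing in for what urlopen(url).read() returns.
compressed = gzip.compress(b'<html>hello</html>')
print(maybe_gunzip(compressed))
print(maybe_gunzip(b'<html>plain</html>'))
```

This runs the same way on any platform, since it does not depend on a gzip executable being installed.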
@ Tony
Is it me, or is this year's group of questioners worse than in years past? Two incidents in one day where someone who only takes and does not contribute demands that we solve their problem for them. Both of them are now on my "do not respond" list.


Ezzaral 2,714 Posting Sage

13 Years Ago

Well, that's off-topic and more of a personal problem.

Perhaps a reply acknowledging the suggestions pyTony and woooee made above would be more appropriate.


jacob501 0 Light Poster · Answer 1 · 2011-12-14T06:40:17+00:00

For some reason, the program won't return "print findYPL[c]" until I've actually gone into my browser and opened the link to that specific page, which is "b" or "findLocL".

jacob501 0 Light Poster · Answer 2 · 2011-12-14T09:07:25+00:00

Hey woooee...if you read this, could you please help me? I would really appreciate it! Thanks!

Or anybody else for that matter...

jacob501 0 Light Poster · Answer 3 · 2011-12-14T09:31:08+00:00

Can someone please help?

jacob501 0 Light Poster · Answer 4 · 2011-12-14T23:16:26+00:00

Hi Daniweb! Can someone help me out? I would really appreciate it!

jacob501 0 Light Poster · Answer 5 · 2011-12-15T01:48:28+00:00

I wrote almost all of that code myself. You could just be nice. I've been trying to figure this stuff out, so I figured I would ask on here. But obviously there are a couple of jerks everywhere.

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 6 · 2011-12-15T04:26:19+00:00

Your regex seems to bring back the same result from every page. A regex is also not a good way to deal with HTML data; you should consider BeautifulSoup or another HTML/XML module.

The title regex returns more interesting info, as it basically contains the address on the page.

from urllib2 import urlopen
from gzip import GzipFile
import re
from cStringIO import StringIO

#import webbrowser
def download(url):
    page = urlopen(url).read()
    if page[:2] == '\x1f\x8b': # assume it's gzipped data
        with GzipFile(mode='rb', fileobj=StringIO(page)) as ifh:
            return ifh.read()
    return page

page = download('http://www.locationary.com/place/en/US/Virginia/Richmond-page20/?ACTION_TOKEN=NumericAction')
#target = re.compile(r'<title>(.*)</title>')
target = re.compile(r'http://www\.yellowpages\.com/.{1,100}\d{1,100}')
location_re = re.compile(r'http://www\.locationary\.com/place/en/US/Virginia/Richmond/.{1,100}\.jsp')

# page is already a string, so it can be searched directly
for location in location_re.findall(page):
    for info in download(location).splitlines():
        res = '\n'.join(target.findall(info))
        if res:
            # only one per location
            print res
            break
    
"""Output:
http://www.yellowpages.com/berkeley-ca/mip/claremont-hotel-club-spa-10695838
http://www.yellowpages.com/berkeley-ca/mip/claremont-hotel-club-spa-10695838
http://www.yellowpages.com/berkeley-ca/mip/claremont-hotel-club-spa-10695838
http://www.yellowpages.com/berkeley-ca/mip/claremont-hotel-club-spa-10695838
http://www.yellowpages.com/berkeley-ca/mip/claremont-hotel-club-spa-10695838
http://www.yellowpages.com/berkeley-ca/mip/claremont-hotel-club-spa-10695838
"""

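To illustrate the advice about using an HTML module rather than a regex, here is a rough sketch that collects links and the page title with the standard library's html.parser. Note the swap: this is not BeautifulSoup, just a dependency-free stand-in for the same idea. The `LinkTitleParser` class and the sample markup are hypothetical, and the code is written for Python 3.

```python
from html.parser import HTMLParser

class LinkTitleParser(HTMLParser):
    """Collect every <a href> value and the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ''
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)
        elif tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self._in_title:
            self.title += data

# Made-up markup standing in for a downloaded location page.
sample = ('<html><head><title>Richmond (Virginia, United States)</title></head>'
          '<body><a href="http://www.yellowpages.com/richmond-va/mip/example-123">YP</a>'
          '<a href="/about">About</a></body></html>')

parser = LinkTitleParser()
parser.feed(sample)
yp_links = [u for u in parser.links if u.startswith('http://www.yellowpages.com/')]
print(parser.title)
print(yp_links)
```

Filtering parsed href values by prefix avoids the greedy `.{1,100}` matching that lets a regex run past the end of the URL.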
jacob501 0 Light Poster · Answer 7 · 2011-12-15T05:09:47+00:00

I hate myself. I'm going to get a dagger and kill myself.
