Hello everyone,

I have a question for you.

I have a script that grabs URLs and so on.

Here is an example of it:

Sorry for the very messy code... I'm just testing.

while (a < 10) :
    if a == 2 :
        f = urllib.urlopen("****" % params1).read()
        linkai = re.compile('</a> -     <a href="(.*?)"')
        surasti = re.findall(linkai, f)
        for link in surasti:
            u = urllib.urlopen(link).read()
            urlas = re.compile('ne"><a href="(.*?)"')
            miau = re.compile('(?s)<pre.*?>http://(.+?)</pre>')
            surasti2 = re.findall(urlas , u)
            miau2 = re.findall(miau, u)

            time.sleep(3)
            for i in surasti2:
                print i
                l.write(i + '\n')
            
            for b in miau2:
                print (b + '\n\n')
                l2.write(b + '\n\n')
                g +=1
            dic[b]=i
            
        a = a + 1        
     
    else:
        f = urllib.urlopen("*****" % params2).read()
        linkai = re.compile('</a> -     <a href="(.*?)"')
        surasti = re.findall(linkai, f)
        
        
        for link in surasti:
            u = urllib.urlopen(link).read()
            urlas = re.compile('ne"><a href="(.*?)"')
            miau = re.compile('(?s)<pre.*?>http://(.+?)</pre>')
            surasti2 = re.findall(urlas , u)
            miau2 = re.findall(miau, u)
            
            time.sleep(3)

            for i in surasti2:
                print i
                l.write(i + '\n')
                
            for b in miau2:
                print (b + '\n\n')
                l2.write(b + '\n\n')
                g +=1

            dic[b]=i
        
        a = a + 1

Every time I run this over, say, 10 pages (each page has another 10 URLs), the total should be 100 URLs, so the dictionary should be 100 entries long.

Here is an example of a few dictionary entries:

{ 'http://sdfsdfasdfs.com/sdfsdf' : 'http:sdfasdfasdf.com/asdfasdfas' , 'http://sdfsdfasdfs.com/sdfsdf' : 'http:sdfasdfasdf.com/asdfasdfas' and so on}

But the most entries I ever get is 20! I could set it to 100 pages (which should give 1000 entries (pairs) in the dictionary), but I get 20 entries no matter what!

I did a test and changed the key from the variable "b" to "g" (which is the counter, g += 1), and it works just fine: running through 10 pages (100 entries) gives me 100 entries (pairs) in the dictionary.

Please help me with this :)

Please do not handle HTML with regular expressions; use proper tools like Beautiful Soup:
http://www.crummy.com/software/BeautifulSoup/
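
As a rough illustration of what parser-based link extraction looks like, here is a minimal sketch. Beautiful Soup is a separate install, so this sketch swaps in the standard library's HTMLParser instead; the LinkCollector class and the sample HTML are made up for the example and are not from the original script.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


# Made-up sample page, standing in for the result of urlopen(...).read()
sample = ('<p><a href="http://example.com/one">1</a> - '
          '<a href="http://example.com/two">2</a></p>')

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

Unlike a hand-written regex, the parser does not depend on the exact spacing or attribute order around each link.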

A dictionary is limited only by the amount of available memory, I think. Here is the proof:

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> big = dict((a,a+1) for a in range(100000))
>>> len(big)
100000
>>>

Thanks for your reply. Yes, I should use Beautiful Soup instead of regex :) I need to learn it first.

Yes, I know the limit is much more than 20 entries, but WHY do I hit a limit of 20 every time when I'm collecting url : url? Because with int : url everything works just fine...

I tried appending one URL to one list and the other URL to another list. Printed out, they had 80 items each. Then I tried writing every item from those lists into a dictionary, and got 20 pairs max... :) So strange :)

Dictionary keys have to be unique, so if you add a key that is already in the dictionary, it does not create a new entry but overwrites the value stored under the existing key. Add some code to test whether the key already exists in the dictionary and print a message if it does. Finally, this code does not add to the dictionary on each pass through the inner loops; the assignment runs only after they finish, using the last values of b and i. I can't tell from the code whether that is what you want, so I can't give any more help.

for b in miau2:
    print (b + '\n\n')
    l2.write(b + '\n\n')
    g +=1
 
dic[b]=i
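
To make both points concrete, here is a small sketch. The two lists are made-up stand-ins for miau2 and surasti2, and pairing them by position with zip is only an assumption about what the script is meant to do.

```python
# Made-up stand-ins for miau2 (keys) and surasti2 (values)
keys = ['a.com/1', 'a.com/2', 'a.com/1', 'a.com/3']
values = ['b.com/w', 'b.com/x', 'b.com/y', 'b.com/z']

dic = {}
duplicates = []

# Assign inside the loop so every pair is stored, not just the last one
for b, i in zip(keys, values):
    if b in dic:
        # Same key seen before: its old value is about to be overwritten
        duplicates.append(b)
        print('duplicate key: ' + b)
    dic[b] = i

print(len(keys))  # number of items collected
print(len(dic))   # number of unique keys actually stored
```

If the second number is smaller than the first, some keys repeated. When every pair has to be kept, a list of (b, i) tuples would avoid the overwriting altogether.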

Solved!! Thank you guys ! :)
