
Getting/Merging Data Into TSV File Help

jellyworms
Newbie Poster
1 post since Mar 2013

Hi, I just started learning Python a few days ago... and I'm already stuck on something easy TT.TT

I have a tab-separated-values data.tsv file that contains 3 columns (country name, area, and population).
Here's a snippet of my current TSV file:

country area    population 
MACAU   28.2    578025 
MONACO  2   30510 
SINGAPORE   697 5353494 
HONG KONG   1104    7153519 
GAZA STRIP  360 1710257 
GIBRALTAR   6.5 29034 
HOLY SEE (VATICAN CITY) 0.44    836 
BAHRAIN 760 1248348 
MALDIVES    298 394451 
MALTA   316 409836 
BERMUDA 54  69080 
SINT MAARTEN    34  39088 
BANGLADESH  143998  161083804
..........

I would like to aggregate the data by geographic region (North America, South America, etc.). Since the region info is not in the file, I need to pull it from www.indexmundi.com/factbook/regions and merge the region names into the file so each one pairs with the correct country, producing this output
(what I want my final TSV file to look like):

country region  area    population
AFGHANISTAN Asia    652230  30419928
ALBANIA Europe  28748   3002859
ALGERIA Africa  2381741 37367226
AMERICAN SAMOA  Oceania 199 54947
ANDORRA Europe  468 85082
ANGOLA  Africa  1246700 18056072
ANGUILLA    Central America & the Caribbean 91  15423
ANTIGUA AND BARBUDA Central America & the Caribbean 442.6   89018
ARGENTINA   South America   2780400 42192494
ARMENIA Asia    29743   2970495
ARUBA   Central America & the Caribbean 180 107635
AUSTRALIA   Oceania 7741220 22015576
AUSTRIA Europe  83871   8219743
AZERBAIJAN  Asia    86600   9493600
.............

this is my code right now:

import urllib2, re
from bs4 import BeautifulSoup

response = urllib2.urlopen('http://www.indexmundi.com/factbook/regions').read()
soup = BeautifulSoup(response)
row = soup.findAll('li')
for link in row:
    href = link.find('a')['href']
    url = "http://www.indexmundi.com"
    countryurl = url + href
    response = urllib2.urlopen(countryurl).read()
    soup = BeautifulSoup(response)
    data_table = soup.findAll('td')
    for data in data_table:
        region = data.find('a').text
        print region

This only prints out names like the list below:

Algeria
Angola
Benin
Botswana
Burkina Faso
Burundi
Cameroon
Cape Verde
Central African Republic
Chad
Comoros
Congo, Democratic Republic of the
Congo, Republic of the
Cote d'Ivoire
Djibouti
Egypt
etc....

I'd like to get the result using only BeautifulSoup4 and urllib2 (which I have already incorporated), so I don't need other complicated modules (again, newbie).

I don't think I need to keep following links from where I'm at, right? But then I'm not sure how to merge the regions into the file with the correct country. I think I would somehow need to save the country name first, so that when I write the region names to my current TSV file, each one lines up with the country it belongs to.

Any help would be greatly appreciated

slate
Posting Whiz
375 posts since Jun 2008

Your countryurl variable does not contain a country, but a region. That is what misleads you.
You produce excellent quality code, btw.

response = urllib2.urlopen('http://www.indexmundi.com/factbook/regions').read()
soup = BeautifulSoup(response)
row = soup.findAll('li')
for link in row:
    href = link.find('a')['href']
    url = "http://www.indexmundi.com"
    countryurl = url + href
    #-----
    regionname = link.find('a').text
    #-----
    response = urllib2.urlopen(countryurl).read()
    soup = BeautifulSoup(response)
    data_table = soup.findAll('td')
    for data in data_table:
        region = data.find('a').text  # this is actually the country name
        #-----
        print regionname, region
        #-----
bumsfeld
Posting Virtuoso
1,537 posts since Jul 2005

Added a few more hints to slate's code:

import urllib2, re
from bs4 import BeautifulSoup

url = 'http://www.indexmundi.com/factbook/regions'
response = urllib2.urlopen(url).read()
soup = BeautifulSoup(response)
row = soup.findAll('li')

# build a dictionary of country:regionname pairs
# (the variable named region actually holds the country name)
region_dict = {}
for link in row:
    href = link.find('a')['href']
    url = "http://www.indexmundi.com"
    countryurl = url + href
    regionname = link.find('a').text
    response = urllib2.urlopen(countryurl).read()
    soup = BeautifulSoup(response)
    data_table = soup.findAll('td')
    for data in data_table:
        region = data.find('a').text
        #print(regionname, region)  # test
        # make region upper case
        region_dict[region.upper()] = regionname

# tab separated values (country\tarea\tpopulation)
data_line = "BERMUDA\t54\t69080"
data_list = data_line.split('\t')
print(data_list)  # ['BERMUDA', '54', '69080']

if data_list[0] in region_dict:
    # insert regionname at index 1
    data_list.insert(1, region_dict[data_list[0]])

print(data_list)  # ['BERMUDA', u'North America', '54', '69080'] 
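To apply the same idea to the whole file, here is a minimal sketch that reads data.tsv line by line and writes a new file with the region column inserted. The file names and the hard-coded region_dict sample are placeholders; in practice region_dict comes from the scraping loop above. It uses plain string splitting, so no extra modules are needed, and it runs on both Python 2 and Python 3:

```python
# Write a small sample input file (a subset of the poster's data.tsv),
# so this sketch runs standalone
sample = "country\tarea\tpopulation\nBERMUDA\t54\t69080\nMALTA\t316\t409836\n"
with open('data.tsv', 'w') as f:
    f.write(sample)

# Placeholder: normally built by the scraping loop above
region_dict = {'BERMUDA': 'North America', 'MALTA': 'Europe'}

def merge_region(line, region_dict):
    """Insert the region as the second column of one tab-separated line."""
    fields = line.rstrip('\n').split('\t')
    if fields[0] == 'country':                        # header row
        fields.insert(1, 'region')
    else:
        fields.insert(1, region_dict.get(fields[0], 'UNKNOWN'))
    return '\t'.join(fields)

# Rewrite the whole file with the extra column
with open('data.tsv') as infile, open('data_regions.tsv', 'w') as outfile:
    for line in infile:
        outfile.write(merge_region(line, region_dict) + '\n')

print(open('data_regions.tsv').read())
```

Countries that the scrape did not find get an UNKNOWN placeholder instead of crashing, so you can spot spelling mismatches (e.g. "CONGO, REPUBLIC OF THE") and fix them by hand.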