Hi, I just started learning Python a few days ago... and I'm already stuck on something easy TT.TT
I have a Tab-Separated-Values data.tsv file that contains 3 columns (country name, area, and population).
Here's a snippet of my current TSV file:
    country                  area    population
    MACAU                    28.2    578025
    MONACO                   2       30510
    SINGAPORE                697     5353494
    HONG KONG                1104    7153519
    GAZA STRIP               360     1710257
    GIBRALTAR                6.5     29034
    HOLY SEE (VATICAN CITY)  0.44    836
    BAHRAIN                  760     1248348
    MALDIVES                 298     394451
    MALTA                    316     409836
    BERMUDA                  54      69080
    SINT MAARTEN             34      39088
    BANGLADESH               143998  161083804
    ..........
I would like to aggregate the data by geographic region (North America, South America, etc.). Since the region info is not in the file, I need to pull it from www.indexmundi.com/factbook/regions and merge the region names into the file so that each region is paired with the correct country, producing this output:
(what I want my final tsv file to look like)
    country              region                           area     population
    AFGHANISTAN          Asia                             652230   30419928
    ALBANIA              Europe                           28748    3002859
    ALGERIA              Africa                           2381741  37367226
    AMERICAN SAMOA       Oceania                          199      54947
    ANDORRA              Europe                           468      85082
    ANGOLA               Africa                           1246700  18056072
    ANGUILLA             Central America & the Caribbean  91       15423
    ANTIGUA AND BARBUDA  Central America & the Caribbean  442.6    89018
    ARGENTINA            South America                    2780400  42192494
    ARMENIA              Asia                             29743    2970495
    ARUBA                Central America & the Caribbean  180      107635
    AUSTRALIA            Oceania                          7741220  22015576
    AUSTRIA              Europe                           83871    8219743
    AZERBAIJAN           Asia                             86600    9493600
    .............
This is my code right now:
    import urllib2, re
    from bs4 import BeautifulSoup

    response = urllib2.urlopen('http://www.indexmundi.com/factbook/regions').read()
    soup = BeautifulSoup(response)

    row = soup.findAll('li')

    for link in row:
        href = link.find('a')['href']
        url = "http://www.indexmundi.com"
        countryurl = url + href
        response = urllib2.urlopen(countryurl).read()
        soup = BeautifulSoup(response)
        data_table = soup.findAll('td')
        for data in data_table:
            region = data.find('a').text
            print region
This only prints out the country names, without the region each one belongs to, like below:
    Algeria
    Angola
    Benin
    Botswana
    Burkina Faso
    Burundi
    Cameroon
    Cape Verde
    Central African Republic
    Chad
    Comoros
    Congo, Democratic Republic of the
    Congo, Republic of the
    Cote d'Ivoire
    Djibouti
    Egypt
    etc....
The result I want should be achievable using only BeautifulSoup4 and urllib2 (which I'm already using), so I'd rather not pull in other complicated modules (again, newbie).
I don't think I need to keep following the links from where I am, right? But I'm not sure how to merge the regions into the file next to the correct country. I think I would somehow need to save each country name together with its region first, so that when I write the region names to my current TSV file, each one lands on the row of the country it belongs to.
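To make the merging idea concrete, here is a rough sketch of what I have in mind. It is only a guess at the approach, not working scraper code: `merge_regions` and `region_by_country` are names I made up, the lookup is typed by hand here instead of being filled in from the scraped pages, and I'm assuming the first column of every row is the country name.

```python
# Sketch only: merge a country -> region lookup into tab-separated rows.
# `merge_regions` and `region_by_country` are made-up names; in the real
# script the lookup would be built while looping over the region pages.

def merge_regions(tsv_lines, region_by_country):
    """Return the lines with a 'region' column inserted after 'country'."""
    # Upper-case the scraped names so "Algeria" matches "ALGERIA" in the file.
    lookup = dict((name.upper(), region)
                  for name, region in region_by_country.items())
    merged = []
    for i, line in enumerate(tsv_lines):
        fields = line.rstrip("\n").split("\t")
        if i == 0:
            # Header row: just add the new column name.
            fields.insert(1, "region")
        else:
            # Countries missing from the lookup get an empty region.
            fields.insert(1, lookup.get(fields[0].upper(), ""))
        merged.append("\t".join(fields))
    return merged


# Tiny hand-made example using two rows from the desired output.
region_by_country = {"Algeria": "Africa", "Albania": "Europe"}
lines = [
    "country\tarea\tpopulation",
    "ALBANIA\t28748\t3002859",
    "ALGERIA\t2381741\t37367226",
]
for row in merge_regions(lines, region_by_country):
    print(row)
```

A country whose name on the site is not spelled exactly like the name in my file (ignoring case) would end up with an empty region here, so those rows would still need to be matched some other way.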
Any help would be greatly appreciated.