» Stack Overflow Survey 2018: Respondents World Map
In this notebook I create a choropleth map that shows how many people responded to the Stack Overflow Developer Survey 2018 in relation to their countries' populations.
First load the result file into a pandas DataFrame and create a series of value counts from the Country
column.
%matplotlib inline import os import pandas as pd df_public = pd.read_csv(os.path.expanduser('~/data/kaggle.com/stackoverflow/stack-overflow-2018-developer-survey/survey_results_public.csv'), usecols=['Country'], dtype=str) df_countries = pd.DataFrame(df_public.Country.value_counts()) df_countries.head(5)
Country | |
---|---|
United States | 20309 |
India | 13721 |
Germany | 6459 |
United Kingdom | 6221 |
Canada | 3393 |
Next create a country index to map the country names used in the developer survey to the ISO 3166-1 country codes that identify countries in the geographic data used later for plotting the map. Then show the names that could not be mapped with the iso3166
package.
from iso3166 import countries country_index = {name: countries.get(name).alpha3 for name in df_countries.index if name in countries} set(df_countries.index) - set(country_index)
{'Bolivia', 'Cape Verde', 'Congo, Republic of the...', 'Czech Republic', "Democratic People's Republic of Korea", 'Democratic Republic of the Congo', 'Hong Kong (S.A.R.)', 'Iran, Islamic Republic of...', 'Libyan Arab Jamahiriya', 'Micronesia, Federated States of...', 'North Korea', 'Other Country (Not Listed Above)', 'Republic of Korea', 'Republic of Moldova', 'South Korea', 'The former Yugoslav Republic of Macedonia', 'United Kingdom', 'United Republic of Tanzania', 'Venezuela, Bolivarian Republic of...'}
Add the missing ISO codes manually to the index.
from collections import Counter country_index.update({ 'Bolivia': 'BOL', 'Cape Verde': 'CPV', 'Congo, Republic of the...': 'COG', 'Czech Republic': 'CZE', "Democratic People's Republic of Korea": 'PRK', 'Democratic Republic of the Congo': 'COD', 'Hong Kong (S.A.R.)': 'HKG', 'Iran, Islamic Republic of...': 'IRN', 'Libyan Arab Jamahiriya': 'LBY', 'Micronesia, Federated States of...': 'FSM', 'North Korea': 'PRK', 'Republic of Korea': 'KOR', 'Republic of Moldova': 'MDA', 'South Korea': 'KOR', 'The former Yugoslav Republic of Macedonia': 'MKD', 'United Kingdom': 'GBR', 'United Republic of Tanzania': 'TZA', 'Venezuela, Bolivarian Republic of...': 'VEN' }) pd.Series(country_index).value_counts().head()
PRK 2 KOR 2 ISL 1 HRV 1 ECU 1 dtype: int64
In the output above we see, that the two Koreas have two name to ISO code mappings in the index each. So next group by the iso
column, summing up the respondent counts and show the top entries.
df_countries['iso'] = df_countries.index.map(lambda x: country_index.get(x)) iso_index = df_countries.groupby('iso').sum() iso_index.sort_values('Country', ascending=False).head()
| |Country | |iso | | |:----:|:----:| |USA|20309| |IND|13721| |DEU|6459| |GBR|6221| |CAN|3393|
In the next cell we import GeoPandas, create a GeoDataFrame containing data from naturalearthdata.com and remove Antarctica so it doesn't take up unnecessary space. Then add columns containing the total number of respondents per country and the ratio of respondents to 1 million inhabitants.
import geopandas as gpd world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')).to_crs('+proj=robin') world = world[world.name != 'Antarctica'] world['respondents'] = world['iso_a3'].apply(lambda x: int(iso_index.loc[x]) if x in iso_index.index else None) world['respondent_ratio'] = world['respondents'] / world['pop_est'] * 1_000_000 world.sort_values('respondent_ratio', ascending=False).head(10)
pop_est | continent | name | iso_a3 | gdp_md_est | geometry | respondents | respondent_ratio | |
---|---|---|---|---|---|---|---|---|
77 | 306694.0 | Europe | Iceland | ISL | 12710.0 | POLYGON ((-1025302.196888561 6952546.77166883,... | 45.0 | 146.726053 |
50 | 1299371.0 | Europe | Estonia | EST | 27410.0 | POLYGON ((1872088.407724589 6118324.758626858,... | 189.0 | 145.454993 |
78 | 7233701.0 | Asia | Israel | ISR | 201400.0 | POLYGON ((3209703.584131608 3498387.789316183,... | 1003.0 | 138.656547 |
28 | 7604467.0 | Europe | Switzerland | CHE | 316700.0 | POLYGON ((799805.1118756525 5069666.509551148,... | 1010.0 | 132.816672 |
120 | 4213418.0 | Oceania | New Zealand | NZL | 116700.0 | (POLYGON ((14993207.4601813 -4373933.456521292... | 557.0 | 132.196711 |
74 | 4203200.0 | Europe | Ireland | IRL | 188400.0 | POLYGON ((-493519.5353766397 5723612.199234703... | 554.0 | 131.804340 |
151 | 9059651.0 | Europe | Sweden | SWE | 344300.0 | POLYGON ((1580026.141474696 6884344.137279728,... | 1164.0 | 128.481770 |
118 | 4676305.0 | Europe | Norway | NOR | 276400.0 | (POLYGON ((1885031.797202721 7380292.033532551... | 565.0 | 120.821888 |
97 | 491775.0 | Europe | Luxembourg | LUX | 39370.0 | POLYGON ((495064.0699557118 5340139.039437959,... | 59.0 | 119.973565 |
43 | 5500510.0 | Europe | Denmark | DNK | 203600.0 | (POLYGON ((995956.0168443418 5899888.876027877... | 653.0 | 118.716264 |
Now plot the map. We treat countries with and without data separately, add annotations and a legend, so the graphic can be interpreted without additional context. See this notebook on creating choropleth maps with GeoPandas for more details.
known = world.dropna(subset=['respondent_ratio']) unknown = world[world['respondent_ratio'].isna()] ax = known.plot(column='respondent_ratio', cmap='viridis_r', figsize=(20, 12), scheme='fisher_jenks', k=7, legend=True, edgecolor='#aaaaaa') unknown.plot(ax=ax, color='#ffffff', hatch='//', edgecolor='#aaaaaa') ax.set_title('Stack Overflow Developer Survey 2018 Respondents per 1 Million People', fontdict={'fontsize': 20}, loc='left') descripton = ''' Survey data: kaggle.com/stackoverflow/stack-overflow-2018-developer-survey • Population estimates: naturalearthdata.com • Source code: kaggle.com/ramirogomez/stack-overflow-survey-2018-respondents-world-map • Author: Ramiro Gómez - ramiro.org'''.strip() ax.annotate(descripton, xy=(0.065, 0.12), size=12, xycoords='figure fraction') ax.set_axis_off() legend = ax.get_legend() legend.set_bbox_to_anchor((.11, .4)) legend.prop.set_size(12)
Conclusion
While the USA is by far the country with the most respondents, we see that Iceland, Estonia, Israel, Switzerland and New Zealand have the highest ratios of developer survey respondents in relation to population. A map just showing the total numbers would certainly look very different and tell a different story.
%load_ext signature %signature
Author: Ramiro Gómez • Last edited: July 03, 2018Linux 4.15.0-24-generic - CPython 3.6.5 - IPython 6.4.0 - matplotlib 2.2.2 - numpy 1.14.2 - pandas 0.23.1
Shirts for Python Programmers
Published:July 03, 2018 by Ramiro Gómez. If you want to be notified about new content, click here to subscribe to the newsletter or RSS feed.
Disclosure: External links on this page may contain affiliate IDs, which means that I earn a commission if you make a purchase via such a link. This allows me to operate this site and offer hopefully valuable content that is freely accessible. More information about affiliate programs.
© Ramiro Gómez. Berlin, Germany.
Be informed about new content
Share this page
Hit ESC or click X on top right to close this dialog.