What Software Engineers Earn Compared to the General Population
In this notebook we'll compare the median annual income of software engineers to the average annual income (GDP per capita) in 50 countries. It shows how to scrape the data from a web page using lxml, turn it into a Pandas dataframe, clean it up, and create scatter and bar plots with matplotlib. The plots visualize the general trend and show which countries are the best and worst for software engineers, based on how much they earn compared to the average person.
The data comes from PayScale and the International Monetary Fund and was published by Bloomberg in May 2014. It includes figures for 50 countries for which data was most available to PayScale. The software engineer figures represent income data collected from May 1, 2013, to May 1, 2014, and use exchange rates from May 5, 2014. The median years of work experience for survey respondents from each country range from two to five years.
First, we load the necessary libraries, set the plot style and define some variables, including a geonamescache object for adding country codes. These codes are used to link geographic data with income figures in this interactive map.
    %load_ext signature
    %matplotlib inline

    import os

    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import numpy as np
    import geonamescache
    from lxml import html

    mpl.style.use('ramiro')

    data_dir = os.path.expanduser('~/data')
    gc = geonamescache.GeonamesCache()
    chartinfo = '''Figures represent income data from May 2013 to May 2014 using exchange rates from May 2014. Average annual income figures are for 2014.
    Author: Ramiro Gómez - ramiro.org • Data: Bloomberg/PayScale - bloomberg.com/visual-data/best-and-worst/highest-paid-software-engineers-countries'''
Data retrieval and cleanup
To scrape the data from the web page, I looked at the source to determine a way to identify the data table. It is the only table on the page with a class of hid, so the XPath expression below can be used to extract the table from the HTML source after it has been loaded.
    url = 'http://www.bloomberg.com/visual-data/best-and-worst/highest-paid-software-engineers-countries'
    xpath = '//table[@class="hid"]'

    tree = html.parse(url)
    # xpath() returns a list of matching elements; we want the single table
    table = tree.xpath(xpath)[0]
    raw_html = html.tostring(table)
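The class-based selection can be tried offline on a small snippet. The following is only a sketch: the markup is a made-up, well-formed stand-in for the page (not the real Bloomberg source), and it swaps lxml for the standard library's xml.etree.ElementTree, which supports a limited XPath subset and requires well-formed input.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page source (assumption, not the real markup)
page = """
<html><body>
  <table class="nav"><tr><td>Menu</td></tr></table>
  <table class="hid">
    <tr><th>Rank</th><th>Country</th><th>Ratio</th></tr>
    <tr><td>1</td><td>Pakistan</td><td>5.56</td></tr>
    <tr><td>2</td><td>India</td><td>3.91</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(page)

# Attribute predicates such as [@class='hid'] are part of the supported XPath subset.
# The match is exact: a table with class="hid wide" would not be selected.
tables = root.findall(".//table[@class='hid']")
table = tables[0]

# Collect the text of every cell, row by row
rows = [[cell.text for cell in tr] for tr in table.findall('tr')]
header, records = rows[0], rows[1:]
print(header)
print(records)
```

Checking that exactly one table matches before indexing into the result is a cheap guard against the page layout changing.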
Pandas makes it easy to turn this raw HTML string into a dataframe. We instruct it to use the Rank column as the index and the first row as the header. The read_html function returns a list, in our case containing one dataframe object, so we just grab it and print the first few rows.
    # read_html returns a list of dataframes; take the only one
    df = pd.read_html(raw_html, header=0, index_col=0)[0]
    df.head()
| Rank | Country | Ratio of median software engineer pay to average income | Median annual pay for software engineers | Average annual income |
|:----:|:----:|:----:|:----:|:----:|
| 1 | Pakistan t | 5.56 | $7,200 | $1,296 |
| 2 | India t | 3.91 | $6,200 | $1,584 |
| 3 | South Africa t | 3.64 | $24,000 | $6,595 |
| 4 | Bulgaria t | 3.28 | $25,200 | $7,682 |
| 5 | China t | 3.15 | $23,100 | $7,333 |
We also look at the data types that Pandas determined automatically.
    Country                                                     object
    Ratio of median software engineer pay to average income    float64
    Median annual pay for software engineers                    object
    Average annual income                                       object
    dtype: object
The output above shows that we need to do some cleanup before continuing with further exploration. The values in the Country column all end in a space followed by a t, which we simply remove. We also need to turn the dollar amounts into a numeric type for use in calculations and plots.
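    # Drop the trailing " t" marker from the country names
    df['Country'] = df['Country'].apply(lambda x: x.rstrip('t').strip())

    # Strip "$" and thousands separators, then convert to numbers
    for col in df.columns[2:]:
        df[col] = pd.to_numeric(df[col].apply(lambda x: x.lstrip('$').replace(',', '')))

    df.dtypes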
    Country                                                     object
    Ratio of median software engineer pay to average income    float64
    Median annual pay for software engineers                     int64
    Average annual income                                        int64
    dtype: object
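The rstrip('t') call in the cleanup works here because every value ends in a space followed by a t. Since rstrip treats its argument as a set of characters, a name without the marker that happens to end in t would lose that letter. The sketch below shows a stricter variant using anchored regular expressions on three rows from the table above. It assumes a reasonably recent pandas version in which str.replace accepts regex=True, which is newer than the version this notebook was written with.

```python
import pandas as pd

# Three rows as they appear before cleanup (values taken from the table above)
raw = pd.DataFrame({
    'Country': ['Pakistan t', 'India t', 'South Africa t'],
    'Median annual pay for software engineers': ['$7,200', '$6,200', '$24,000'],
    'Average annual income': ['$1,296', '$1,584', '$6,595'],
})

clean = raw.copy()

# Remove only a trailing whitespace-plus-"t" marker; other names stay untouched
clean['Country'] = clean['Country'].str.replace(r'\s+t$', '', regex=True)

# Drop "$" and "," everywhere in the income columns, then convert to numbers
for col in clean.columns[1:]:
    clean[col] = pd.to_numeric(clean[col].str.replace(r'[$,]', '', regex=True))

print(clean)
```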
The income columns are now integers, which looks better. As a sanity check, we can test whether the ratio contained in the data agrees with the income numbers. To do so, we calculate the ratio ourselves and compare it to the existing one.
    ratio = round(df['Median annual pay for software engineers'] / df['Average annual income'], 2)
    all(ratio == df['Ratio of median software engineer pay to average income'])
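Exact equality on rounded floating point numbers can be fragile when a value sits close to a rounding boundary. As a sketch of a more tolerant check, the following plain Python version accepts any computed ratio within half a unit of the last published decimal. It uses the five rows shown earlier; the helper name ratio_matches is made up for this illustration.

```python
import math

# (country, median engineer pay, average income, published ratio), from the first five rows
rows = [
    ('Pakistan', 7200, 1296, 5.56),
    ('India', 6200, 1584, 3.91),
    ('South Africa', 24000, 6595, 3.64),
    ('Bulgaria', 25200, 7682, 3.28),
    ('China', 23100, 7333, 3.15),
]


def ratio_matches(pay, income, published, decimals=2):
    """Return True if pay / income rounds to the published value."""
    # Half a unit of the last decimal place, plus a tiny margin for float noise
    tolerance = 0.5 * 10 ** -decimals + 1e-9
    return math.isclose(pay / income, published, rel_tol=0, abs_tol=tolerance)


mismatches = [country for country, pay, income, published in rows
              if not ratio_matches(pay, income, published)]
print(mismatches)  # an empty list means every row agrees
```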
Exploration and visualization
To get a holistic view we first create a scatter plot with all 50 records showing the median software engineer income on the X axis and the annual average of the whole population on the Y axis. We also draw a quadratic polynomial fitting curve to see if we can spot a trend.
    x = 'Median annual pay for software engineers'
    y = 'Average annual income'
    title = 'Median annual income of software engineers vs. general average in 50 countries'

    fig = plt.figure(figsize=(11, 9))
    ax = fig.add_subplot(111)
    ax.plot(x, y, '.', data=df, ms=10, alpha=.7)
    ax.set_title(title, fontsize=20, y=1.04)
    ax.set_xlabel(x)
    ax.set_ylim(bottom=0, top=101000)
    ax.set_ylabel(y)

    # Polynomial curve fitting
    # http://docs.scipy.org/doc/numpy/reference/routines.polynomials.classes.html
    polynomial = np.polynomial.Polynomial.fit(df[x], df[y], 2)
    xp = np.linspace(0, 120000, 100)
    yp = polynomial(xp)
    ax.plot(xp, yp, '-', lw=1, alpha=.5)

    fig.text(0, -.04, chartinfo, fontsize=11)
    plt.show()
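One detail of Polynomial.fit is easy to miss: by default it maps the data range onto the window [-1, 1] before fitting, so the coefficients stored on the returned object refer to the scaled variable. Evaluating the object, as done for the curve above, handles this transparently. To read coefficients in the original units, call convert(). The sketch below demonstrates this on synthetic data, not on the income figures.

```python
import numpy as np

# Synthetic, noise-free quadratic: y = 1 + 2x + 3x^2 (made-up data for illustration)
x = np.linspace(0, 10, 20)
y = 1 + 2 * x + 3 * x ** 2

fitted = np.polynomial.Polynomial.fit(x, y, 2)

# Coefficients in the scaled domain; these are generally NOT 1, 2, 3
print(fitted.coef)

# Coefficients in the original units, lowest degree first
unscaled = fitted.convert()
print(unscaled.coef)

# Calling the fitted object works on the original x values either way
print(np.allclose(fitted(x), y))
```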
Before interpreting this and the following plots, I'll point out a few issues with the data. We do not know how many respondents took part in the PayScale survey; we only know that the median work experience of respondents in each country ranges from two to five years. Whether this sample is a good representation of the income of software engineers in the respective countries is questionable.
Moreover, we compare median annual values for software engineers with mean annual values for the general population. Considering how large the income share of top earners is in several countries, annual median values for the general population might well show a different picture.
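To illustrate how much the choice between mean and median can matter, here is a small sketch with made-up numbers that are not taken from the dataset: nine people earn 20,000 each, one person earns 1,000,000, and a hypothetical software engineer earns 40,000.

```python
from statistics import mean, median

# Hypothetical incomes (assumption for illustration, NOT from the dataset)
incomes = [20000] * 9 + [1000000]
engineer_pay = 40000  # equally hypothetical

average_income = mean(incomes)
median_income = median(incomes)

# The same engineer compared against both reference values
ratio_to_mean = round(engineer_pay / average_income, 2)
ratio_to_median = round(engineer_pay / median_income, 2)

print(average_income, median_income)
print(ratio_to_mean, ratio_to_median)
```

In this toy population a single top earner pulls the mean far above what a typical person earns, so the engineer appears to earn about a third of the average while earning twice the median.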
With this in mind, software engineering looks like a good career choice in the majority of the 50 countries. In some of the countries we see in the lower left, mid-level software engineers earn multiples of what the average person does. But there are also countries where software engineers earn a lot less than the average person. To find out which are the best and worst countries for software engineers with respect to income, we'll plot rankings in the form of bar plots next.
In the following bar plot countries are ranked by the ratio of median software engineer pay to average income. Since our dataframe is already ordered by the ratio from high to low, we can simply use the head and tail methods to get the slices we want to show.
    col = 'Ratio of median software engineer pay to average income'
    title = 'Best and worst countries ranked by ratio of median software engineer pay to average income'
    limit = 10

    # Reverse the best slice so the highest ratio ends up at the top of the bar chart
    best = df.head(limit)[::-1]
    worst = df.tail(limit)
    ticks = np.arange(limit)
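The head and tail slices rely on the dataframe already being sorted by ratio. If that were not guaranteed, a sketch along the following lines could select the same rows with nlargest and nsmallest. It uses the five rows shown earlier in shuffled order and a limit of 2 to keep the example small.

```python
import pandas as pd

col = 'Ratio of median software engineer pay to average income'

# The five rows shown earlier, deliberately out of order
unsorted = pd.DataFrame({
    'Country': ['China', 'Pakistan', 'Bulgaria', 'India', 'South Africa'],
    col: [3.15, 5.56, 3.28, 3.91, 3.64],
})

limit = 2

# Reverse both slices so that each ends with its most extreme value,
# matching the order produced by head(limit)[::-1] and tail(limit)
best = unsorted.nlargest(limit, col)[::-1]
worst = unsorted.nsmallest(limit, col)[::-1]

print(best['Country'].tolist())
print(worst['Country'].tolist())
```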
Now we create a plot consisting of two bar charts showing the best countries for software engineers on the left and the worst on the right based on income ratio.
    fig = plt.figure(figsize=(14, 5))
    fig.suptitle(title, fontsize=20)

    # Left: countries with the highest ratio
    ax1 = fig.add_subplot(1, 2, 1)
    ax1.barh(ticks, best[col], alpha=.5, color='#00ff00')
    ax1.set_yticks(ticks)
    ax1.set_yticklabels(best['Country'].values, fontsize=15, va='bottom')

    # Right: countries with the lowest ratio
    ax2 = fig.add_subplot(1, 2, 2)
    ax2.barh(ticks, worst[col], alpha=.5, color='#ff0000')
    ax2.set_yticks(ticks)
    ax2.set_yticklabels(worst['Country'].values, fontsize=15, va='bottom')

    fig.text(0, -.07, chartinfo, fontsize=12)
    plt.show()
This chart again shows that the income differences are huge in some countries and that software engineers are likely to earn more than the average person. In our dataset, Pakistan and India are the two countries with the lowest average annual income, and Qatar has the second highest average income after Norway. I assume that the income distribution in countries with a high ratio is skewed towards lower incomes and in countries with low ratios towards higher incomes. Again, it would be interesting to compare medians to medians and not medians to means, but we would have to find another data source for such a comparison.
Adding country codes
    df_map = df.copy()
    names = gc.get_countries_by_names()
    df_map['iso3'] = df_map['Country'].apply(lambda x: names[x]['iso3'])
    df.head(5)
| Rank | Country | Ratio of median software engineer pay to average income | Median annual pay for software engineers | Average annual income |
|:----:|:----:|:----:|:----:|:----:|
| 1 | Pakistan | 5.56 | 7200 | 1296 |
| 2 | India | 3.91 | 6200 | 1584 |
| 3 | South Africa | 3.64 | 24000 | 6595 |
| 4 | Bulgaria | 3.28 | 25200 | 7682 |
| 5 | China | 3.15 | 23100 | 7333 |
Here we use the geonamescache object initialized at the beginning to get a dictionary of countries keyed by name. The values are dictionaries as well, which, among other things, contain three-letter ISO 3166-1 alpha-3 country codes. Finally, to save a few bytes, the Country column is removed and the dataframe is saved as a CSV file.
    del df_map['Country']
    df_map.to_csv(data_dir + '/economy/income-software-engineers-countries.csv', encoding='utf-8', index=False)
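The direct dictionary access in the lookup raises a KeyError as soon as one country name in the data is not a key in the names dictionary, for example because of a spelling variant. The following sketch shows a more forgiving lookup that collects unmatched names instead. The small names dictionary is a hand-written stand-in shaped like the structure described above, not actual geonamescache output, and lookup_iso3 is a hypothetical helper.

```python
# Hand-written stand-in for the name-keyed dictionary (assumption for illustration)
names = {
    'Pakistan': {'iso3': 'PAK'},
    'India': {'iso3': 'IND'},
    'South Africa': {'iso3': 'ZAF'},
}

# The last entry is deliberately unknown to show the failure case
countries = ['Pakistan', 'India', 'South Africa', 'Atlantis']


def lookup_iso3(name, names):
    """Return the ISO 3166-1 alpha-3 code for name, or None if it is unknown."""
    record = names.get(name)
    return record['iso3'] if record else None


codes = {country: lookup_iso3(country, names) for country in countries}
missing = [country for country, code in codes.items() if code is None]

print(codes)
print(missing)
```

Names that end up in the missing list can then be corrected by hand before the codes are written to the CSV file.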
In this notebook we looked at income data for software engineers in 50 countries and compared their earnings to the general population. In the process we scraped the dataset from the source web page, cleaned it up, visualized it and interpreted the results pointing out potential issues with the dataset and methodology.
While software engineering seems to be a good career choice in most of the countries, keep the caveats in mind before you start making emigration plans. Also, income should certainly not be your only criterion for choosing a profession or a place to live.
Author: Ramiro Gómez • Last edited: April 08, 2016
Linux 4.2.0-35-generic - CPython 3.5.1 - IPython 4.1.2 - matplotlib 1.5.1 - numpy 1.10.4 - pandas 0.18.0
Published: April 08, 2016 by Ramiro Gómez.