To demonstrate some basic statistics with Pandas, this article uses a real-life data set, as concepts are often easier to follow when applied in context. Since financial data is so readily available from Yahoo Finance, let's use some stock data to demonstrate calculating statistics with Pandas. Where possible, a brief explanation of each statistic is given for those unfamiliar with the methods. The stocks used in this article are known as the FAANG stocks, which consist of Facebook, Amazon, Apple, Netflix and Google.

Pandas functions covered:

mean
pct_change
standard deviation
mean absolute deviation
skew

Import the necessary libraries and download the FAANG data from the start of 2015 to the start of 2020:

import pandas_datareader.data as web
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')


start = dt.datetime(2015, 1, 1)
end = dt.datetime(2020, 1, 1)

##ticker symbols for FAANG stocks
symbols = ['FB', 'AMZN', 'AAPL', 
           'NFLX', 'GOOG']

source = 'yahoo'

## take only adjusted close prices
data = web.DataReader(symbols, source, start, end)['Adj Close']

data.head()

Out[226]: 
Symbols            FB        AMZN        AAPL       NFLX        GOOG
Date                                                                
2015-01-02  78.449997  308.519989  100.216454  49.848572  523.373108
2015-01-05  77.190002  302.190002   97.393181  47.311428  512.463013
2015-01-06  76.150002  295.290009   97.402374  46.501427  500.585632
2015-01-07  76.150002  298.420013   98.768150  46.742859  499.727997
2015-01-08  78.180000  300.459991  102.563072  47.779999  501.303680

That looks as expected.

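The formulas in this article are the population versions, which divide by N. Pandas applies sample corrections by default (for example, std() divides by N - 1), so its output will differ slightly from what the formulas give. With more than a thousand daily observations per stock, the difference is negligible for the purposes of this article.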
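As a small sketch of the gap between the population and sample versions, the snippet below uses a handful of made-up returns (not the FAANG data). It relies on the `ddof` argument of `Series.std`, where `ddof=1` is the pandas default and `ddof=0` gives the population formula.

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns, made up purely for illustration
returns = pd.Series([0.01, -0.02, 0.015, 0.003, -0.007])

n = len(returns)
mu = returns.mean()
squared_devs = (returns - mu) ** 2

# Population standard deviation: divide by N
pop_std = np.sqrt(squared_devs.sum() / n)

# Sample standard deviation: divide by N - 1 (the pandas default)
sample_std = np.sqrt(squared_devs.sum() / (n - 1))

print(pop_std, returns.std(ddof=0))   # the two values agree
print(sample_std, returns.std())      # the two values agree
```

With only five observations the two results differ noticeably; as N grows, the ratio between them approaches one.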

Mean

Let's calculate the arithmetic mean \(\mu\) for each stock:

\(\mu=\ \frac{1}{N}\sum\limits_{i=1}^{N }x_i\)

data.mean()

Out[229]: 
Symbols
FB       143.073386
AMZN    1115.047337
AAPL     149.094641
NFLX     201.427671
GOOG     913.613379
dtype: float64
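To connect the formula to the function call, here is a minimal sketch that computes the mean by hand for the first five FB and AMZN prices shown in the table above and compares it with `DataFrame.mean()`. The small `prices` frame is an illustrative stand-in for the full data set.

```python
import pandas as pd

# First five adjusted closes for FB and AMZN, copied from the table above
prices = pd.DataFrame({
    "FB":   [78.449997, 77.190002, 76.150002, 76.150002, 78.180000],
    "AMZN": [308.519989, 302.190002, 295.290009, 298.420013, 300.459991],
})

# The formula: sum the observations, then divide by their count N
manual_mean = prices.sum() / len(prices)

# DataFrame.mean() applies the same calculation to each column
builtin_mean = prices.mean()

print(manual_mean)
print(builtin_mean)
```

Both print one value per column, which is why `data.mean()` returns one mean per stock.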

For our specific problem, the mean of the raw prices is not very informative, because the stocks trade at very different price levels. Let's use Pandas' pct_change function to convert the prices into percentage changes, which are comparable across stocks:

\(pct\_change\ =\ \frac{Price\ Today\ -\ Price\ t\ periods\ ago}{Price\ t\ periods\ ago}\)

The default t in the formula above is 1, but you can set the periods argument to any lag you wish. Note that the first t rows will always be NaN, because there is no earlier price to compare against, so we chain dropna() onto the call to clean the data up.

data = data.pct_change(periods=1).dropna()

data.head()

Out[236]: 
Symbols           FB      AMZN      AAPL      NFLX      GOOG
Date                                                        
2015-01-05 -0.016061 -0.020517 -0.028172 -0.050897 -0.020846
2015-01-06 -0.013473 -0.022833  0.000094 -0.017121 -0.023177
2015-01-07  0.000000  0.010600  0.014022  0.005192 -0.001713
2015-01-08  0.026658  0.006836  0.038423  0.022188  0.003153
2015-01-09 -0.005628 -0.011749  0.001072 -0.015458 -0.012951
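As a quick check of what pct_change computes, the sketch below reproduces the first two AAPL values by hand from the adjusted closes in the earlier table. The change is divided by the earlier price, and `shift` is used here only as an illustrative way to line up each price with the one before it.

```python
import pandas as pd

# First three AAPL adjusted closes, copied from the price table above
prices = pd.Series([100.216454, 97.393181, 97.402374])

# Price t periods ago (t = 1), aligned with today's price
previous = prices.shift(1)

# Change relative to the earlier price
manual = (prices - previous) / previous

# The same calculation using the built-in method
builtin = prices.pct_change(periods=1)

print(manual.round(6).tolist())
print(builtin.round(6).tolist())
```

The first element is NaN in both cases, which is the row that dropna() removes. The other two round to the AAPL values shown for 2015-01-05 and 2015-01-06.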

Let's make a bar chart to compare the stocks' average daily change:

data.mean().plot.bar()

Figure: average daily percentage change for the FAANG stocks

It looks like Netflix has had the highest mean daily percentage return over the sample period.

Standard Deviation

The standard deviation is the most commonly used measure of dispersion around the mean. Consider the graph below constructed with mock data for illustrative purposes, in which all three distributions have exactly the same mean (zero). Clearly the red and green curves exhibit more dispersion around the mean, which is due to higher standard deviations.

\(\sigma =\sqrt{\frac{1}{N} \sum\limits_{i=1}^N(x_i-\mu)^2}\)

Figure: three mock distributions with the same mean (zero) and different standard deviations

It may be useful to mention that the standard deviation only describes numerical data. If you had categorical or nominal data, it wouldn't make sense to use the standard deviation to describe it. For our particular problem, standard deviation is a widely used measure of the riskiness of a stock; see this Investopedia article for more information.

Let's calculate the daily standard deviation for each of our FAANG stocks and compare them in a bar chart.

data.std().plot.bar()

Figure: daily standard deviation for the FAANG stocks

Looks like Netflix has the highest standard deviation of the stocks in our sample.

Mean Absolute Deviation (MAD)

The mean absolute deviation formula below is more intuitive than the standard deviation, and arguably a better measure of dispersion. See this interesting discussion on Stack Exchange regarding the differences between standard deviation and MAD.

\(\frac{1}{N} \sum\limits_{i=1}^N |x_i-\mu|\)

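# DataFrame.mad() was removed in pandas 2.0; this is the equivalent calculation
(data - data.mean()).abs().mean()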

Out[303]: 
Symbols
FB      0.011882
AMZN    0.012194
AAPL    0.011016
NFLX    0.017820
GOOG    0.010189
dtype: float64
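To illustrate how MAD and standard deviation react differently to extreme values, here is a minimal sketch on made-up returns. The helper `mean_abs_dev` is a hypothetical name introduced for this example; it applies the formula above column by column and works on pandas versions where `DataFrame.mad()` is no longer available.

```python
import pandas as pd


def mean_abs_dev(df):
    """Mean absolute deviation around the mean, one value per column."""
    return (df - df.mean()).abs().mean()


# Made-up returns: the two columns differ only in the last observation
df = pd.DataFrame({
    "calm":  [0.01, -0.01, 0.01, -0.01, 0.00],
    "spiky": [0.01, -0.01, 0.01, -0.01, 0.10],
})

mad = mean_abs_dev(df)
std = df.std(ddof=0)  # population standard deviation, to match the formulas

# Ratio of standard deviation to MAD for each column
ratio = std / mad
print(mad)
print(std)
print(ratio)
```

Because the standard deviation squares each deviation, the single large value in the "spiky" column pushes it up proportionally more than the MAD, so the ratio is larger for that column.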

Skew

Skew, the third standardized moment of a distribution, is a measure of asymmetry in the distribution. Notice the cubed term in the formula below, which preserves the sign of each deviation and so allows for both negative and positive values. The plot below, built with mock data, contrasts distributions with positive and negative skew. Negatively skewed distributions have a heavier left tail, meaning they have more extreme negative values, with the opposite being true for positively skewed distributions.

\(Skew = {\frac{1}{N}}\ \sum\limits_{i=1}^N \frac{(x_i-\mu)^3}{\sigma^3}\)

Figure: mock distributions with positive and negative skew

Let's compare the skewness measure for the stocks in our sample:

data.skew().plot.barh()

Figure: skew for the FAANG stocks
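The sketch below works through the skew formula on a few made-up returns with one large loss. One caveat, stated as my understanding of pandas rather than something from this article: `skew()` returns a bias-adjusted estimate, so it matches the population formula only after multiplying by \(\sqrt{N(N-1)}/(N-2)\).

```python
import numpy as np
import pandas as pd

# Made-up returns with one large negative value (a long left tail)
x = pd.Series([0.01, 0.02, -0.01, 0.015, 0.005, -0.08])

n = len(x)
mu = x.mean()
sigma = x.std(ddof=0)  # population standard deviation

# Population skew: the average of the cubed standardized deviations
pop_skew = (((x - mu) / sigma) ** 3).mean()

# Bias adjustment that brings the population value in line with pandas
adj_skew = pop_skew * np.sqrt(n * (n - 1)) / (n - 2)

print(pop_skew)   # negative, because of the single large loss
print(adj_skew)
print(x.skew())   # agrees with adj_skew
```

For a sample as large as the daily FAANG returns, the adjustment factor is very close to one, so the two versions are nearly identical.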

Full script for this article

import pandas_datareader.data as web
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

start = dt.datetime(2015, 1, 1)
end = dt.datetime(2020, 1, 1)

symbols = ['FB', 'AMZN', 'AAPL', 
           'NFLX', 'GOOG']

source = 'yahoo'

data = web.DataReader(symbols, source, start, end)['Adj Close']

data.head()

data  = data.pct_change(periods=1).dropna()

data.mean().plot.bar()

data.std().plot.bar()

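# DataFrame.mad() was removed in pandas 2.0; this is the equivalent calculation
(data - data.mean()).abs().mean()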


data.skew().plot.barh()
