Often when doing data analysis it becomes necessary to change the frequency of data. Pandas provides a relatively simple way to do this.
Take the following example of a business that has daily sales and expenses data for 20 years. Create the example dataframe as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['date'] = pd.date_range(start=pd.datetime(2000,1,1),
end=pd.datetime(2020,1,1),freq='D')
df.set_index('date', inplace =True)
df['sales'] = np.random.choice([100,50,200,300,600],len(df))
df['expenses'] = df.sales * np.random.choice([0.4,0.5,0.33,0.66],len(df))
print(df)
out:
sales expenses
date
2000-01-01 300 150.0
2000-01-02 100 66.0
2000-01-03 200 100.0
2000-01-04 100 33.0
2000-01-05 600 300.0
... ...
2019-12-28 200 100.0
2019-12-29 200 100.0
2019-12-30 300 99.0
2019-12-31 600 240.0
2020-01-01 200 66.0
[7306 rows x 2 columns]
The data is currently in daily increments, let's say we wanted to change it to weekly, monthly and annual frequencies. In the context of the example we are using, it seems like a good idea to take the sum of all sales and expenses during a week/month or year.
Note If the following error appears:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
This means your date column is not in the correct format. To fix this use the following command:
df['your date column name'] = pd.to_datetime(df['your date column name'])
df.set_index('your date column name', inplace=True)
Weekly resample by sum
df2 = df.resample('W').sum()
print(df2)
out:
sales expenses
date
2000-01-02 400 216.0
2000-01-09 1500 664.0
2000-01-16 1900 891.5
2000-01-23 1950 983.0
2000-01-30 1950 898.0
... ...
2019-12-08 1400 677.5
2019-12-15 2050 792.0
2019-12-22 1250 740.0
2019-12-29 2150 1281.0
2020-01-05 1100 405.0
[1045 rows x 2 columns]
Monthly resample by sum
#monthly resample
df3 = df.resample('M').sum()
print(df3)
Out:
sales expenses
date
2000-01-31 8300 3952.5
2000-02-29 7650 3586.5
2000-03-31 7850 3951.5
2000-04-30 7200 3334.5
2000-05-31 7500 3648.0
... ...
2019-09-30 6650 3254.5
2019-10-31 8650 4203.5
2019-11-30 7900 3879.5
2019-12-31 7850 3862.5
2020-01-31 200 66.0
[241 rows x 2 columns]
Annual resample by sum
#annual resample
df4 = df.resample('A').sum()
print(df4)
out:
sales expenses
date
2000-12-31 87150 40936.0
2001-12-31 93200 44804.0
2002-12-31 95150 46329.5
2003-12-31 92450 43830.5
2004-12-31 89050 41126.0
2005-12-31 91400 42414.5
2006-12-31 93900 44679.5
2007-12-31 94500 45281.0
2008-12-31 98650 47353.5
2009-12-31 88700 41289.5
2010-12-31 95600 43527.0
2011-12-31 99200 46249.5
2012-12-31 95250 44568.0
2013-12-31 90450 42426.0
2014-12-31 91200 42306.5
2015-12-31 93850 43962.0
2016-12-31 91000 42780.0
2017-12-31 93850 42945.5
2018-12-31 88300 40470.0
2019-12-31 93400 44987.5
2020-12-31 200 66.0
Notice here that since we only have one value for the year beginning on 2020, the value is much smaller. It is possible to remove this small value with the following command:
df4 = df4[:-1] #removes the last entry
Resampling data by different methods
In the previous example we only used summuation as the resampling method. However, often it is necessary to resample different columns by different methods.
Keeping the example from the previous section, but adding another column called expense ratio = \(\frac{expenses}{sales}\) to which could possibly be useful to determine how high expenses are over time relative to sales.
df = pd.DataFrame()
df['date'] = pd.date_range(start=pd.datetime(2000,1,1),
end=pd.datetime(2020,1,1),freq='D')
df.set_index('date', inplace =True)
df['sales'] = np.random.choice([100,50,200,300,600],len(df))
df['expenses'] = df.sales * np.random.choice([0.4,0.5,0.33,0.66],len(df))
df['expense_ratio'] = df.expenses/df.sales
print(df)
Out:
sales expenses expense_ratio
date
2000-01-01 50 20.0 0.40
2000-01-02 200 132.0 0.66
2000-01-03 100 50.0 0.50
2000-01-04 300 150.0 0.50
2000-01-05 600 240.0 0.40
... ... ...
2019-12-28 200 66.0 0.33
2019-12-29 600 240.0 0.40
2019-12-30 300 120.0 0.40
2019-12-31 50 25.0 0.50
2020-01-01 300 99.0 0.33
[7306 rows x 3 columns]
Let's say we wanted to resample on a weekly basis by taking the sum of both sales and expenses, but taking the average of the expense ratio. In order to do this we can pass in a dictionary to to Pandas .agg method
df2 = df.resample('W').agg({'sales':'sum', 'expenses':'sum', 'expense_ratio': 'mean'})
print(df2)
out:
sales expenses expense_ratio
date
2000-01-02 500 264.0 0.495000
2000-01-09 1350 449.0 0.340000
2000-01-16 1900 921.0 0.484286
2000-01-23 1250 522.0 0.421429
2000-01-30 950 426.0 0.455714
... ... ...
2019-12-08 2450 1179.0 0.505714
2019-12-15 1800 925.0 0.482857
2019-12-22 1600 835.5 0.552857
2019-12-29 1600 644.5 0.408571
2020-01-05 700 454.0 0.606667
[1045 rows x 3 columns]