Getting data into Pandas is the first step in beginning your analysis. This article will show how to load csv files and Excel spreadsheets into Python. I will also show some common errors in loading data into the Python, and provide a run-through of simple solutions that can get you up and running.
CSV Files
Csv stands for comma-separated values. You can create a csv file to follow along with this article if you don’t have one already.
Paste the following values into your notepad:
Date,Sales,Expenses,Location
01/01/2016,1052,673.28,South America
02/01/2016,599,383.36,North America
03/01/2016,1133,725.12,Asia
04/01/2016,810,518.4,Asia
05/01/2016,1133,725.12,North America
06/01/2016,1216,778.24,North America
07/01/2016,1238,792.32,Asia
08/01/2016,1088,696.32,Europe
09/01/2016,1474,943.36,South America
10/01/2016,1307,836.48,North America
Now press File > Save As > example.csv make sure to add the .csv extension to the file. Perhaps saving to your Desktop will be easiest if you are new to Python.
Now open your Python editor and write the following commands:
import pandas as pd
df = pd.read_csv('example.csv')
df.head()
You should see the following output:
|
Date |
Sales |
Expenses |
Location |
0 |
01/01/2016 |
1052 |
673.28 |
South America |
1 |
02/01/2016 |
599 |
383.36 |
North America |
2 |
03/01/2016 |
1133 |
725.12 |
Asia |
3 |
04/01/2016 |
810 |
518.4 |
Asia |
4 |
05/01/2016 |
1133 |
725.12 |
North America |
If you get the following error:
FileNotFoundError: [Errno 2] File example.csv does not exist: 'example.csv'
This means that your current working directory is different from where the file is located. In Jupyter a quick fix for this is just to click on the Jupyter logo and then navigate to your Desktop (or location the file is stored) and reopen the notebook.
Alternatively, you can use the following command:
df = pd.read_csv(r'Desktop\example.csv')
Another example is if the data is located in a folder named ‘data’ on your desktop:
df = pd.read_csv(r'Desktop\data\example.csv')
Sypder makes this particularly easy, all you need to do is click the folder icon on the top right of the screen > Desktop and press Select Folder and navigate to the folder the file is stored in.
What if your csv file is not separated by commas?
For example if you open your csv file with notepad and it looks something like this:
Date;Sales;Expenses;Location
01/01/2016;1052;673.28;South America
02/01/2016;599;383.36;North America
03/01/2016;1133;725.12;Asia
04/01/2016;810;518.4;Asia
05/01/2016;1133;725.12;North America
06/01/2016;1216;778.24;North America
07/01/2016;1238;792.32;Asia
08/01/2016;1088;696.32;Europe
09/01/2016;1474;943.36;South America
10/01/2016;1307;836.48;North America
If you try to load the data in the way detailed previously you will likely get an error that looks something like this:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 10, saw 2
This can be very frustrating!! As the above error can be confusing to understand. The issue here is the character which separates each data point known as the delimiter. Lucky the folks over at Pandas make fixing this error very straightforward, simply add the following to your script replacing 'semicolon_format' with the desired filename.
df = pd.read_csv('semicolon_format.csv',sep=';')
Let's take an extreme example to illustrate how the sep option works. Say, for example you have the misfortune of being given a data-set that looks something like this:
Date;Sales#Expenses;Location
01/01/2016:1052;673.28:South America
02/01/2016;599-383.36;North America
03/01/2016#1133;725.12:Asia
04/01/2016:810;518.4;Asia
05/01/2016#1133;725.12-North America
06/01/2016;1216;778.24;North America
07/01/2016#1238;792.32;Asia
08/01/2016;1088;696.32;Europe
09/01/2016;1474;943.36#South America
10/01/2016;1307;836.48;North America
In the example above there are multiple delimiters : ; - # which looks like it will be difficult to work with. Again Pandas makes this very easy to work around.
df = pd.read_csv('crazy_format.csv',
sep=";|:|#|-",
engine='python')
Notice here that alll you need to do is add each delimiter separated with a | and the data will load in as normal. The engine='python' line in the snippet above isn't strictly necessary, however it avoids a ParserWarning regarding C being unable to support regular expressions.
More options
Load certain columns
Continuing with the example.csv file from above. If you require only certain columns of the file to be loaded you can specify this in the read_csv method. For example if you want to load only columns 1, 2, & 4 you can use the following command:
df = pd.read_csv('example.csv',
usecols=[0,1,3],
engine='python')
df.head()
Returns the following dataframe:
Date | Sales | Location | |
0 | 01/01/2016 | 1052 | South America |
1 | 02/01/2016 | 599 | North America |
2 | 03/01/2016 | 1133 | Asia |
3 | 04/01/2016 | 810 | Asia |
4 | 05/01/2016 | 1133 | North America |
5 | 06/01/2016 | 1216 | North America |
It may be worth repeating here, that Python/Pandas uses 0 indexing which can often cause confusion. The first column is 0, the second 1, third 2 and so on.
If you had a very large dataset and you want to load in the first 50 columns you can pass the following into the usecols option:
cols = [col for col in range(50)]
df = pd.read_csv('your_filename.csv',
usecols=cols)
Set Index Column
Instead of getting the default range index when loading your dataframe into Python you can specify which column to use as the index:
df = pd.read_csv('example.csv',
index_col=0)
Returns the following dataframe:
Sales | Expenses | Location | |
Date | |||
01/01/2016 | 1052 | 673.28 | South America |
02/01/2016 | 599 | 383.36 | North America |
03/01/2016 | 1133 | 725.12 | Asia |
04/01/2016 | 810 | 518.4 | Asia |
05/01/2016 | 1133 | 725.12 | North America |
06/01/2016 | 1216 | 778.24 | North America |
Parse Dates
The index_col option above is often very useful in conjunction with the parse_dates option to automatically parse the date from the csv file and set it as the index as follows:
df = pd.read_csv('example.csv',
index_col=0,
parse_dates=['Date'])
You can change the 'Date' column to whatever column your datetime data is named in your data-set.
That looks nice but notice how Pandas parses the dates as 2016-01-01 which corresponds to year-month-day, what if the dates represented in our example 01/01/2016 corresponds to day/month/year as opposed to the way Pandas is currently interpreting it?
Well yet again Pandas makes this relatively straightforward to deal with. We can use a custom date_parser method as shown below:
def custom_dateparser(raw_dates):
return pd.to_datetime(raw_dates,format='%d/%m/%Y')
df = pd.read_csv('example.csv',
index_col=0,
parse_dates=['Date'],
date_parser=custom_dateparser)
Which returns the following dataframe.
Sales Expenses Location
Date
2016-01-01 1052 673.28 South America
2016-01-02 599 383.36 North America
2016-01-03 1133 725.12 Asia
2016-01-04 810 518.40 Asia
2016-01-05 1133 725.12 North America
For a full list of options for the read_csv method take a look at the documentation here