import pandas as pd
39 An Introduction to Pandas
Let’s talk about Pandas.
And by that, we of course mean the beautiful, glorious, intelligent… open source Python library that provides powerful data structures and analysis tools.
(You can judge the level of geekiness of someone fairly easily by simply making the statement “I love Pandas” and seeing if they reply “Me too, and NumPy!”)
Pandas is very powerful for manipulating data in large arrays (unlike some spreadsheet software…), and allows for indexing of data. NumPy and Pandas are often used together, with NumPy used for mathematical functions applied to the data, and Pandas used to manipulate the data.
As with NumPy, there is a conventional alias under which Pandas is imported: pd, as in import pandas as pd.
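As a rough sketch of the two libraries working together, the snippet below imports both under their conventional aliases, uses NumPy for the calculation and Pandas for the filtering. The ages are illustrative values borrowed from the sample table later in this chapter, and the DataFrame structure itself is introduced in the next section.

```python
import numpy as np
import pandas as pd

# Illustrative values only: a single "Age" column
df = pd.DataFrame({"Age": [42, 27, 84, 57, 35]})

# NumPy supplies the mathematical function...
mean_age = np.mean(df["Age"])

# ...while Pandas handles selecting and filtering the data
over_40 = df[df["Age"] > 40]

print(mean_age)
print(over_40["Age"].tolist())
```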
39.1 The Pandas DataFrame
One of the most useful structures in Pandas is the Pandas DataFrame. A DataFrame is like a table with different columns (which can have names) for different data fields, and different rows for each entry in the data.
Imagine a table in Excel. Only much more powerful.
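To make the table analogy concrete, here is a minimal sketch that builds a small DataFrame by hand from a dictionary. The three rows are illustrative and the variable name df is simply a common convention, not a requirement.

```python
import pandas as pd

# Each dictionary key becomes a named column; each list position becomes a row
df = pd.DataFrame(
    {
        "Name": ["Bob", "Nigel", "Florence"],
        "Age": [42, 27, 84],
        "County": ["Cornwall", "Devon", "Somerset"],
    }
)

print(df.columns.tolist())  # the column names
print(df.shape)             # (number of rows, number of columns)
print(df["Age"].tolist())   # all the values in one column
```

Selecting a column by name, as in the last line, is one of the things that sets a DataFrame apart from a plain grid of cells.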
39.2 Creating a new DataFrame
We have two choices when we want to create a new Pandas DataFrame:
- We can create a new DataFrame and then build the DataFrame manually
- We can read in existing data from a .csv file
Given that you’ll be doing the latter most of the time in real-world applications, let’s just focus on that. And it’s also really easy!
Pandas has a fantastic function that allows us to read in data from a .csv file, and it will automatically stuff it into a new DataFrame for us!
In one line of code!!
Don’t believe me? Observe…
One fantastic feature of pandas is that we can point to either a local file (one stored on our machine) or a remote file (one stored somewhere else - like the web).
In the example below, we are accessing a file called input_data.csv.
If this file were stored in the same folder as our Python file, we would just do
df = pd.read_csv("input_data.csv")
We can then show the first 5 rows of the data with the .head() method.
df.head()
|   | Patient ID | Name | Flu Vaccine | Age | County |
|---|---|---|---|---|---|
| 0 | 65192 | Bob | Yes | 42 | Cornwall |
| 1 | 84568 | Nigel | No | 27 | Devon |
| 2 | 93765 | Florence | Yes | 84 | Somerset |
| 3 | 97865 | Martha | Yes | 57 | Somerset |
| 4 | 12451 | Simon | No | 35 | Somerset |
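If you don't have the original file to hand, the following sketch recreates it: it writes the five rows from the table above to a temporary input_data.csv and reads them back. The temporary folder is only there to keep the example self-contained; with a real file you would just pass its path to read_csv.

```python
import tempfile
from pathlib import Path

import pandas as pd

# The five rows from the table above, written out as CSV text
csv_text = """Patient ID,Name,Flu Vaccine,Age,County
65192,Bob,Yes,42,Cornwall
84568,Nigel,No,27,Devon
93765,Florence,Yes,84,Somerset
97865,Martha,Yes,57,Somerset
12451,Simon,No,35,Somerset
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "input_data.csv"
    csv_path.write_text(csv_text)

    # One line reads the file into a new DataFrame
    df = pd.read_csv(csv_path)

print(df.head())   # the first 5 rows (the default)
print(df.head(2))  # pass a number to see fewer (or more) rows
```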
However, we could also use a web location. We’re going to use this method in the rest of the examples to allow us to interact with pandas dataframes in this book.
You can see that the outputs are identical!
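The sketch below illustrates the idea that read_csv treats a path and a URL the same way. Because a real web address can't be assumed here, a file:// URL pointing at a temporary file stands in for the remote location; an https:// address would be passed to read_csv in exactly the same position. The data and file name are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

csv_text = "Patient ID,Name,Age\n65192,Bob,42\n84568,Nigel,27\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "input_data.csv"
    csv_path.write_text(csv_text)

    # Reading from a local path
    df_local = pd.read_csv(csv_path)

    # Reading from a URL: same function, same argument position.
    # A file:// URL is used as a stand-in for a web address.
    df_url = pd.read_csv(csv_path.as_uri())

# .equals() compares the two DataFrames value by value
print(df_local.equals(df_url))
```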
39.3 The index
39.4 Specifying an existing column as the index
Our data already has a unique identifier: “Patient ID”. When we read in the .csv, we can tell Pandas to use this column as the index (rather than creating a new one) by passing the index_col argument to read_csv.
IMPORTANT - Pandas does not check that the column you specify truly has unique values when it reads the file. Duplicate index values are allowed and no exception is raised, so if you need a unique index, check it yourself (for example, df.index.is_unique is False when values repeat).
Be careful when importing data where multiple records refer to the same patient - patient ID would not be a unique identifier in that case.
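A minimal sketch of setting the index at read time, followed by an explicit uniqueness check. The CSV text is illustrative and is read from an in-memory string (via io.StringIO) purely to keep the example self-contained; the repeated patient record is hypothetical.

```python
import io

import pandas as pd

csv_text = """Patient ID,Name,Age
65192,Bob,42
84568,Nigel,27
93765,Florence,84
"""

# index_col tells read_csv which existing column to use as the index
df = pd.read_csv(io.StringIO(csv_text), index_col="Patient ID")
print(df.index.name)
print(df.index.is_unique)

# Hypothetical data in which one patient appears twice
csv_with_repeat = csv_text + "65192,Bob,43\n"
df_repeat = pd.read_csv(io.StringIO(csv_with_repeat), index_col="Patient ID")

# Check explicitly whether any index value appears more than once
if not df_repeat.index.is_unique:
    repeated_ids = df_repeat.index[df_repeat.index.duplicated()].unique().tolist()
    print(f"Patient ID is not unique; repeated values: {repeated_ids}")
```

Running a check like this straight after loading is a cheap way to catch data where several records refer to the same patient.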