42 Dropping Data from Dataframes
To allow all the exercises in this section to work, please run this code cell first!
This will import pandas and load the dataframe we’ll be working with.
42.1 Dropping rows and columns
Pandas allows us to easily drop (remove) columns and rows of a DataFrame.
42.1.1 Dropping columns
Let’s first remind ourselves of what our dataframe looks like. We’ll just print the first five rows by using the head()
method.
To drop a single column (multiple columns is the same, just add them to the list of column names):
Note - in the above example, we store a new DataFrame that has the Name column dropped, leaving the original DataFrame intact. This is normally what you want to do.
If we want to actually remove it from the original DataFrame, we can either simply assign the above to df (the name of the original DataFrame), thereby overwriting it with a copy that doesn’t have the “Name” column.
OR we can use the parameter flag inplace=True to specify the changes should be made to the DataFrame we’re working on.
42.1.2 Dropping Rows
We can also drop a row or rows by specifying the index / indices of the row (or rows) we want to drop :
If you wanted to drop multiple rows that met a condition, we could use conditional indexing from before.
We use !=
(not equal to) to get rid of any rows where the value of county is “Cornwall”.
42.2 Removing Duplicates
Pandas provides us with some powerful tools to drop duplicate rows. First, we can easily drop any rows where all the values (other than the index) are identical.
Let’s set up a new DataFrame to show this in action. We’ll create one manually here, using a dictionary to store the column names and associated data for each record. We won’t use patient ID as the index here.
In this new DataFrame, we have three records (rows) that are duplicates, because they have the same info for the same patient.
Let’s drop the duplicate rows in one line of code :
By default, Pandas will keep the first duplicate (index 0 in the above). But we may want to keep the last duplicate instead (eg if our data related to individual encounters with a service for the same patient - keeping the most recent entry may make the most sense). To do this :
We may also want to change our definition of a “duplicate”. It may be that we don’t need all the values to be the same, just a certain one (or ones) to be considered a duplicate.
Imagine a reconfigured version of our prior table, in which the number of admissions has been updated for patient 105 in the final entry.
We may only want to keep records for a patient where there was a new admission. So we’d want to keep either 0 or 2, and also 3. This will do the trick :