41 Extending and Joining Dataframes
To allow all the exercises in this section to work, please run this code cell first!
This will import pandas and load the dataframe we’ll be working with.
41.1 Adding data to a dataframe
We can add additional rows (and columns) to a DataFrame. To add a row, we must specify the values for all of the columns for this new row. Similarly, to add a column, we must specify the value for this column for all of the rows.
41.1.1 Adding rows
To add a row, we specify the index value for the new row (that doesn’t yet exist in our DataFrame) and assign a list of values for the other columns :
Be careful - you can overwrite an existing row this way if you provide an index that already exists.
41.1.2 Adding columns
We can add a new column as follows.
However - note that you’ll need a value for every single row in your dataset.
The code below makes a new dataframe that is just the first 5 rows of the existing dataframe, then adds the new column
41.2 Concatenation
Another way of adding to our DataFrame is by joining two DataFrames together to create a new one.
We can do this using the Pandas concat() method.
axis = 0 joins by adding rows
axis = 1 joins by adding columns
First, let’s make two new dataframes.
This shows you another way you can create a dataframe - by passing a list of dictionaries, where every dictionary in the list represents one row.
The keys of the dictionary will be the column names, and the values will be the values in the row.
We can then join these with the concat method.
Note : in this example, we have entries with duplicate index values. We’d likely want to do something about this, but in practice, the DataFrames you’ll be concatenating will normally have their own unique index that will hold when both are joined.
If you do need to reset the index, you can use the .reset_index()
method.
Note that the old index becomes a new column! We can avoid this by passing drop=True
.
Let’s have a quick look at concatenating on the other axis.
You may notice that concatenating in this direction feels a bit risky!
You may prefer to use the merge method of pandas. The kind of joins available with that method will feel very familiar if you have used SQL.
Using merge, you can specify columns or an index that must match, ensuring that the correct rows are matched up.
Let’s join this to df_a
. Run the cell below to remind yourself of what was in that dataframe.
Note that because we haven’t got a value in dataframe a for Hugo, we don’t have a record of whether he is a trainer, so in the merged dataframe, we get NaN
. This stands for ‘not a number’ and effectively means the cell in the dataframe is blank.
You can read more about the types of joins available in merge
here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html