10 Exercise Solution: Regression with Trees (LOS Dataset)

If using colab, run this cell first. Otherwise, skip this cell.

!pip install ydata_profiling
!pip install catboost
!pip install --upgrade scikit-learn
!git clone https://github.com/hsma-programme/h6_4e_boosted_trees.git
%cd /content/h6_4e_boosted_trees/exercises_colab

10.1 Core

We’re going to work with a dataset to try to predict patient length of stay.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# import the relevant models from Sklearn, XGBoost, CatBoost and LightGBM
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from ydata_profiling import ProfileReport

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.ensemble import HistGradientBoostingRegressor
# import any other libraries you need
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, \
                            r2_score, root_mean_squared_error

Open the data dictionary in the los_dataset folder and take a look at what data is available.

Next, load in the dataframe containing the LOS data.

los_df = pd.read_csv("../datasets/los_dataset/LengthOfStay.csv", index_col="eid")

View the dataframe.

los_df.head()

	vdate	rcount	gender	dialysisrenalendstage	asthma	irondef	pneum	substancedependence	psychologicaldisordermajor	depress	...	sodium	glucose	bloodureanitro	creatinine	bmi	pulse	respiration	secondarydiagnosisnonicd9	facid	lengthofstay
eid
1	8/29/2012	0	F	0	0	0	0	0	0	0	...	140.361132	192.476918	12.0	1.390722	30.432418	96	6.5	4	B	3
2	5/26/2012	5+	F	0	0	0	0	0	0	0	...	136.731692	94.078507	8.0	0.943164	28.460516	61	6.5	1	A	7
3	9/22/2012	1	F	0	0	0	0	0	0	0	...	133.058514	130.530524	12.0	1.065750	28.843812	64	6.5	2	B	3
4	8/9/2012	0	F	0	0	0	0	0	0	0	...	138.994023	163.377028	12.0	0.906862	27.959007	76	6.5	1	A	1
5	12/20/2012	0	F	0	0	0	1	0	1	0	...	138.634836	94.886654	11.5	1.242854	30.258927	67	5.6	2	E	4

5 rows × 26 columns

ProfileReport(los_df)

Consider what columns to remove.

HINT: Is there a column in the dataset that doesn’t really make much sense to predict from? If you’re not sure, use the full dataset for now and come back to this later.

NOTE: For now, we’re going to assume that all of the included measures will be available to us at the point we need to make a prediction - they’re not things that will be calculated later in the patient’s stay.

los_df = los_df.drop(columns="vdate")
los_df.head(1)

	rcount	gender	dialysisrenalendstage	asthma	irondef	pneum	substancedependence	psychologicaldisordermajor	depress	psychother	...	sodium	glucose	bloodureanitro	creatinine	bmi	pulse	respiration	secondarydiagnosisnonicd9	facid	lengthofstay
eid
1	0	F	0	0	0	0	0	0	0	0	...	140.361132	192.476918	12.0	1.390722	30.432418	96	6.5	4	B	3

1 rows × 25 columns

Convert categories with only two options into a boolean value (e.g. for a gender column in which gender has only been provided as M or F, you could encode M as 0 and F as 1).

los_df.gender.unique()

array(['F', 'M'], dtype=object)

los_df['gender'].replace('M', 0, inplace=True)
los_df['gender'].replace('F', 1, inplace=True)

los_df.gender.unique()

array([1, 0], dtype=int64)

Convert columns with multiple options per category into multiple columns using one-hot encoding.

los_df.facid.unique()

# Bonus - astype('int') will convert the true/false values to 0/1
# not necessary - it will work regardless
one_hot = pd.get_dummies(los_df['facid']).astype('int')
los_df = los_df.drop('facid', axis=1)
los_df = los_df.join(one_hot)

los_df.head()

	rcount	gender	dialysisrenalendstage	asthma	irondef	pneum	substancedependence	psychologicaldisordermajor	depress	psychother	...	bmi	pulse	respiration	secondarydiagnosisnonicd9	lengthofstay	A	B	C	D	E
eid
1	0	1	0	0	0	0	0	0	0	0	...	30.432418	96	6.5	4	3	0	1	0	0	0
2	5+	1	0	0	0	0	0	0	0	0	...	28.460516	61	6.5	1	7	1	0	0	0	0
3	1	1	0	0	0	0	0	0	0	0	...	28.843812	64	6.5	2	3	0	1	0	0	0
4	0	1	0	0	0	0	0	0	0	0	...	27.959007	76	6.5	1	1	1	0	0	0	0
5	0	1	0	0	0	1	0	1	0	0	...	30.258927	67	5.6	2	4	0	0	0	0	1

5 rows × 29 columns

los_df.rcount.value_counts()

# Bonus - astype('int') will convert the true/false values to 0/1
# not necessary - it will work regardless
one_hot = pd.get_dummies(los_df['rcount'], prefix="rcount").astype('int')
los_df = los_df.drop('rcount', axis=1)
los_df = los_df.join(one_hot)
los_df.head()

	gender	dialysisrenalendstage	asthma	irondef	pneum	substancedependence	psychologicaldisordermajor	depress	psychother	fibrosisandother	...	B	C	D	E	rcount_0	rcount_1	rcount_2	rcount_3	rcount_4	rcount_5+
eid
1	1	0	0	0	0	0	0	0	0	0	...	1	0	0	0	1	0	0	0	0	0
2	1	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	1
3	1	0	0	0	0	0	0	0	0	0	...	1	0	0	0	0	1	0	0	0	0
4	1	0	0	0	0	0	0	0	0	0	...	0	0	0	0	1	0	0	0	0	0
5	1	0	0	0	1	0	1	0	0	0	...	0	0	0	1	1	0	0	0	0	0

5 rows × 34 columns

Train a decision tree model to predict length of stay based on the variables in this dataset.

X = los_df.drop(columns='lengthofstay')
y = los_df['lengthofstay']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size = 0.25,
    random_state=42
    )

regr_dt = DecisionTreeRegressor(random_state=42)

# Train the model using the training sets
regr_dt.fit(X_train, y_train)

# Make predictions using the testing set
y_pred_train = regr_dt.predict(X_train)
y_pred_test = regr_dt.predict(X_test)

y_pred_test

array([3., 1., 3., ..., 8., 2., 5.])

Assess the performance of this model.

print("TRAINING DATA")
print(f"Mean absolute error: {mean_absolute_error(y_train, y_pred_train):.2f}")
print(f"Mean absolute percentage error: {mean_absolute_percentage_error(y_train, y_pred_train):.2%}" )
print("Root Mean squared error: %.2f" % root_mean_squared_error(y_train, y_pred_train))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(y_train, y_pred_train))

TRAINING DATA
Mean absolute error: 0.00
Mean absolute percentage error: 0.00%
Root Mean squared error: 0.00
Coefficient of determination: 1.00

print("TESTING DATA")
print(f"Mean absolute error: {mean_absolute_error(y_test, y_pred_test):.2f}")
print(f"Mean absolute percentage error: {mean_absolute_percentage_error(y_test, y_pred_test):.2%}" )
print("Root Mean squared error: %.2f" % root_mean_squared_error(y_test, y_pred_test))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(y_test, y_pred_test))

TRAINING DATA
Mean absolute error: 0.50
Mean absolute percentage error: 12.94%
Root Mean squared error: 0.93
Coefficient of determination: 0.84

def plot_residuals(actual, predicted):
    residuals = actual - predicted

    plt.figure(figsize=(10, 5))
    plt.hist(residuals, bins=20)
    plt.axvline(x = 0, color = 'r')
    plt.xlabel('Residual')
    plt.ylabel('Frequency')
    plt.title('Distribution of Residuals')
    plt.show()

plot_residuals(y_test, y_pred_test)

Note that here the alpha parameter has been set to 0.05 - this means that the points will be transparent. Higher numbers of points in a given space will appear more opaque.

def plot_actual_vs_predicted(actual, predicted):
    fig, ax = plt.subplots(figsize=(6, 6))

    ax.scatter(actual, predicted, color="black", alpha=0.05)
    ax.axline((1, 1), slope=1)
    plt.xlabel('True Values')
    plt.ylabel('Predicted Values')
    plt.title('True vs Predicted Values')
    plt.show()

plot_actual_vs_predicted(y_test, y_pred_test)

Train a boosting model to predict length of stay based on the variables in this dataset.

X = los_df.drop(columns='lengthofstay')
y = los_df['lengthofstay']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size = 0.25,
    random_state=42
    )

regr_xgb = XGBRegressor(random_state=42)

# Train the model using the training sets
regr_xgb.fit(X_train, y_train)

# Make predictions using the testing set
y_pred_train = regr_xgb.predict(X_train)
y_pred_test = regr_xgb.predict(X_test)

y_pred_test

array([3.6313417, 0.8304186, 2.4800673, ..., 5.5354204, 1.5881324,
       5.249632 ], dtype=float32)

Assess the performance of this model and compare it with your decision tree model.

print("TRAINING DATA")
print(f"Mean absolute error: {mean_absolute_error(y_train, y_pred_train):.2f}")
print(f"Mean absolute percentage error: {mean_absolute_percentage_error(y_train, y_pred_train):.2%}" )
print("Root Mean squared error: %.2f" % root_mean_squared_error(y_train, y_pred_train))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(y_train, y_pred_train))

TRAINING DATA
Mean absolute error: 0.29
Mean absolute percentage error: 10.66%
Root Mean squared error: 0.37
Coefficient of determination: 0.98

print("TESTING DATA")
print(f"Mean absolute error: {mean_absolute_error(y_test, y_pred_test):.2f}")
print(f"Mean absolute percentage error: {mean_absolute_percentage_error(y_test, y_pred_test):.2%}" )
print("Root Mean squared error: %.2f" % root_mean_squared_error(y_test, y_pred_test))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(y_test, y_pred_test))

TESTING DATA
Mean absolute error: 0.33
Mean absolute percentage error: 11.60%
Root Mean squared error: 0.44
Coefficient of determination: 0.96

plot_residuals(y_test, y_pred_test)

plot_actual_vs_predicted(y_test, y_pred_test)

10.2 Extension

Continue to refine your model.

You could - tune hyperparameters - try additional model types (e.g. different boosting trees, linear regression) - try dropping some features - calculate additional metrics - add some visualisations of the features in the dataset - store the results from your different model iterations and display this as a table of metrics - turn this table of metrics into a graph - try using k-fold cross validation

# NO SOLUTION GIVEN

11 Challenge

Explore additional aspects of regression modelling using Google, sklearn and other library documentation, Stack Overflow, AI tools, or other avenues.

Some suggestions to get you started: - Look up the assumptions of linear regression. Would our dataset violate any of these assumptions? - What about multiple linear regression? - Investigate the performance of your model on different subsets of the data. Are the errors lower for male and female patients? How about across the different facilities?

# NO SOLUTION GIVEN