5  Exercise Solution: Decision Trees (Stroke Thrombolysis Dataset)

The data loaded in this exercise covers seven acute stroke units and records whether each patient received clot-busting (thrombolysis) treatment for their stroke. There are lots of features; a description of each one can be found in the file stroke_data_feature_descriptions.csv.

Train a decision tree model to try to predict whether or not a stroke patient receives clot-busting treatment. Use the prompts below to write each section of code.

5.1 Core - Fitting and Evaluating a Decision Tree

Run the code below to import the dataset and the libraries we need.

import pandas as pd
import numpy as np

# import preprocessing functions
from sklearn.model_selection import train_test_split

# Import machine learning model of interest
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Import package to investigate our loaded dataframe
from ydata_profiling import ProfileReport

# Import functions for evaluating model
from sklearn.metrics import recall_score, precision_score

# Imports relating to logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Imports relating to plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Download data
# (not required if running locally and have previously downloaded data)

download_required = True

if download_required:

    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '2004_titanic/master/jupyter_notebooks/data/hsma_stroke.csv'
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory = './data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data to data subfolder
    data.to_csv(data_directory + 'hsma_stroke.csv', index=False)

# Load data
data = pd.read_csv('data/hsma_stroke.csv')
# Make all data 'float' type
data = data.astype(float)
# Show data
data.head()
Clotbuster given Hosp_1 Hosp_2 Hosp_3 Hosp_4 Hosp_5 Hosp_6 Hosp_7 Male Age ... S2NihssArrivalFacialPalsy S2NihssArrivalMotorArmLeft S2NihssArrivalMotorArmRight S2NihssArrivalMotorLegLeft S2NihssArrivalMotorLegRight S2NihssArrivalLimbAtaxia S2NihssArrivalSensory S2NihssArrivalBestLanguage S2NihssArrivalDysarthria S2NihssArrivalExtinctionInattention
0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 63.0 ... 3.0 4.0 0.0 4.0 0.0 0.0 0.0 0.0 1.0 1.0
1 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 85.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 1.0 0.0
2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 91.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
3 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 90.0 ... 1.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0
4 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 69.0 ... 2.0 0.0 4.0 1.0 4.0 0.0 1.0 2.0 2.0 1.0

5 rows × 51 columns

Look at an overview of the data by running the code below.

We’re going to use a library we haven’t covered before to give a quick summary of the dataframe.

You used this data last week, so it should feel familiar to you.

Do you prefer this method or the code you used last week in the logistic regression exercise?

profile = ProfileReport(data)

profile.to_notebook_iframe()

Load in the ‘stroke_data_feature_descriptions’ dataframe and view that too - you can just view the whole dataframe with pandas rather than using the ProfileReport.

Hint: it’s in the same folder as the hsma_stroke.csv dataset we imported above.

stroke_data_feature_descriptions_df = pd.read_csv('data/stroke_data_feature_descriptions.csv')

stroke_data_feature_descriptions_df
Feature Description
0 # Comorbidities Number of comorbidities
1 2+ comorbidotes If the patient had at least two comorbidities
2 80+ If the patient is aged 80 or over
3 Age Age of patient
4 Anticoag before stroke_0 Did not take anticoagulants before stroke
5 Anticoag before stroke_1 Did take anticoagulants before stroke
6 Anticoag before stroke_NK Not known if was taking anticoagulants before ...
7 Antiplatelet_0 Did not receive antiplatelet treatment
8 Antiplatelet_1 Did receive antiplatelet treatment
9 Antiplatelet_NK Not known if received antiplatelet treatment
10 Atrial Fib Patient has atrial fibrillation
11 Co-mordity If the patient has any comorbidities at all
12 Congestive HF Patient has congestive heart failure
13 Diabetes Patient has diabetes
14 Hosp_1 Taken to hospital 1
15 Hosp_2 Taken to hospital 2
16 Hosp_3 Taken to hospital 3
17 Hosp_4 Taken to hospital 4
18 Hosp_5 Taken to hospital 5
19 Hosp_6 Taken to hospital 6
20 Hosp_7 Taken to hospital 7
21 Hypertension Patient has hypertension
22 Male Patient is male
23 Onset Time Known Type_BE Onset time type is Best Estimate
24 Onset Time Known Type_NK Onset time type is Not Known
25 Onset Time Known Type_P Onset time type is Precise
26 S2NihssArrival Stroke severity (NIHSS score) on arrival : tot...
27 S2NihssArrivalBestGaze Stroke severity (NIHSS score) on arrival : eye...
28 S2NihssArrivalBestLanguage Stroke severity (NIHSS score) on arrival : com...
29 S2NihssArrivalDysarthria Stroke severity (NIHSS score) on arrival : slu...
30 S2NihssArrivalExtinctionInattention Stroke severity (NIHSS score) on arrival : abi...
31 S2NihssArrivalFacialPalsy Stroke severity (NIHSS score) on arrival : fac...
32 S2NihssArrivalLimbAtaxia Stroke severity (NIHSS score) on arrival : lim...
33 S2NihssArrivalLocCommands Stroke severity (NIHSS score) on arrival : lev...
34 S2NihssArrivalLocQuestions Stroke severity (NIHSS score) on arrival : lev...
35 S2NihssArrivalMotorArmLeft Stroke severity (NIHSS score) on arrival : mov...
36 S2NihssArrivalMotorArmRight Stroke severity (NIHSS score) on arrival : mov...
37 S2NihssArrivalMotorLegLeft Stroke severity (NIHSS score) on arrival : mov...
38 S2NihssArrivalMotorLegRight Stroke severity (NIHSS score) on arrival : mov...
39 S2NihssArrivalSensory Stroke severity (NIHSS score) on arrival : sen...
40 S2NihssArrivalVisual Stroke severity (NIHSS score) on arrival : bli...
41 S2RankinBeforeStroke Pre-stroke disability level (Modified Rankin S...
42 Stroke severity group_1. No stroke symtpoms Stroke severity 1 - no symptoms
43 Stroke severity group_2. Minor Stroke severity 2 - minor
44 Stroke severity group_3. Moderate Stroke severity 3 - moderate
45 Stroke severity group_4. Moderate to severe Stroke severity 4 - moderate to severe
46 Stroke severity group_5. Severe Stroke severity 5 - severe
47 Stroke Type_I Ischemic stroke
48 Stroke Type_PIH Pregnancy-induced Hypertension stroke
49 TIA Stroke was a transient ischaemic attack ("mini...

Divide the main stroke dataset into features and labels.

Remember - we’re trying to predict whether patients are given clotbusting treatment or not.

What column contains that information?

X = data.drop('Clotbuster given', axis=1)
y = data['Clotbuster given']

Split the data into training and testing sets.

Start with a train/test split of 80/20.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=127)
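
One optional refinement (not required for the solution shown here): passing stratify=y keeps the proportion of treated and untreated patients the same in both halves of the split, which can help when the outcome classes are imbalanced. A sketch of the alternative call (note that all the recorded outputs in this notebook come from the unstratified split above):

# Optional alternative: a stratified split preserves the class balance
# of 'Clotbuster given' in both the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=127
)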

Fit a Decision Tree model.

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
DecisionTreeClassifier()

Use the trained model to predict labels in both training and test sets, and calculate and compare accuracy.

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)

# Note that the lecture slides show this displayed as a float using :.3f
# We can instead use :.3% to format the number as a percentage to 3 decimal places.
print(f"Accuracy of predicting training data = {accuracy_train:.3%}")
print(f"Accuracy of predicting testing data = {accuracy_test:.3%}")
Accuracy of predicting training data = 100.000%
Accuracy of predicting testing data = 74.799%

Calculate the additional model metrics for the test data only.

  • precision
  • specificity
  • recall (sensitivity)

Return the ‘micro’ average in each case.

(With ‘micro’ averaging on a single-label problem like this one, precision and recall both collapse to overall accuracy - so expect those numbers to match the test accuracy above.)

precision_score_test = precision_score(y_test, y_pred_test, average='micro')
recall_sensitivity_score_test = recall_score(y_test, y_pred_test, average='micro')
# Specificity is the recall of the negative class (TN / (TN + FP))
specificity_score_test = recall_score(y_test, y_pred_test, pos_label=0)

print(f"Precision score for testing data = {precision_score_test:.3%}")
print(f"Recall (sensitivity) score for testing data = {recall_sensitivity_score_test:.3%}")
print(f"Specificity score for testing data = {specificity_score_test:.3%}")
Precision score for testing data = 74.799%
Recall (sensitivity) score for testing data = 74.799%
Specificity score for testing data = 78.404%
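
As a sanity check, specificity can also be read straight off the confusion matrix, since it is the true negative rate TN / (TN + FP). A minimal sketch using the test-set predictions from above:

from sklearn.metrics import confusion_matrix

# For labels ordered 0, 1 the matrix is laid out as:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_test).ravel()
print(f"Specificity (TN / (TN + FP)) = {tn / (tn + fp):.3%}")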
# We cover classification_report in a later session - but it's a nice way to compare and contrast
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_test))
              precision    recall  f1-score   support

         0.0       0.78      0.78      0.78       215
         1.0       0.70      0.71      0.70       158

    accuracy                           0.75       373
   macro avg       0.74      0.74      0.74       373
weighted avg       0.75      0.75      0.75       373

Plot the decision tree.

fig, ax = plt.subplots(figsize=(14,10))

plot_tree(
    model,
    feature_names = X.columns.tolist(),
    class_names=["Not Given Clotbuster", "Given Clotbuster"],
    filled = True,
    ax=ax
)

plt.show()
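
If the plotted tree is hard to read at this size, scikit-learn can also print the same tree as indented text, which can be easier to scan. A quick sketch (an unpruned tree produces a lot of lines, so this is most useful on shallower trees):

from sklearn.tree import export_text

# Text rendering of the fitted tree (same splits as the plot above)
print(export_text(model, feature_names=X.columns.tolist()))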

5.2 Extension - Refining Your Decision Tree

Let’s experiment by changing a few parameters.

5.2.1 Maximum Depth

Try changing the value of the ‘max_depth’ parameter when setting up your DecisionTreeClassifier.

Output the following for this new classifier:

  • accuracy (train and test)
  • precision (test)
  • specificity (test)
  • recall (sensitivity) (test)

The next two examples use the ‘macro’ average - take a look at the second half of session 4D for more detail on this. (Note that the helper function defined further down reverts to the ‘micro’ average.)

model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)
precision_score_test = precision_score(y_test, y_pred_test, average='macro')
recall_sensitivity_score_test = recall_score(y_test, y_pred_test, average='macro')
# Specificity is the recall of the negative class (TN / (TN + FP))
specificity_score_test = recall_score(y_test, y_pred_test, pos_label=0)

# Note that the lecture slides show this displayed as a float using :.3f
# We can instead use :.3% to format the number as a percentage to 3 decimal places.
print(f"Accuracy of predicting training data = {accuracy_train:.3%}")
print(f"Accuracy of predicting testing data = {accuracy_test:.3%}")
print(f"Precision score for testing data = {precision_score_test:.3%}")
print(f"Recall (sensitivity) score for testing data = {recall_sensitivity_score_test:.3%}")
print(f"Specificity score for testing data = {specificity_score_test:.3%}")
Accuracy of predicting training data = 80.591%
Accuracy of predicting testing data = 80.965%
Precision score for testing data = 80.767%
Recall (sensitivity) score for testing data = 81.475%
Specificity score for testing data = 87.500%
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)
precision_score_test = precision_score(y_test, y_pred_test, average='macro')
recall_sensitivity_score_test = recall_score(y_test, y_pred_test, average='macro')
# Specificity is the recall of the negative class (TN / (TN + FP))
specificity_score_test = recall_score(y_test, y_pred_test, pos_label=0)

# Note that the lecture slides show this displayed as a float using :.3f
# We can instead use :.3% to format the number as a percentage to 3 decimal places.
print(f"Accuracy of predicting training data = {accuracy_train:.3%}")
print(f"Accuracy of predicting testing data = {accuracy_test:.3%}")
print(f"Precision score for testing data = {precision_score_test:.3%}")
print(f"Recall (sensitivity) score for testing data = {recall_sensitivity_score_test:.3%}")
print(f"Specificity score for testing data = {specificity_score_test:.3%}")
Accuracy of predicting training data = 78.308%
Accuracy of predicting testing data = 77.748%
Precision score for testing data = 77.338%
Recall (sensitivity) score for testing data = 77.845%
Specificity score for testing data = 83.000%

This is getting very tiresome - let’s write a function!

def fit_dt_model(model):
    # Note: relies on X_train, X_test, y_train and y_test from the notebook's global scope
    model.fit(X_train, y_train)
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    accuracy_train = np.mean(y_pred_train == y_train)
    accuracy_test = np.mean(y_pred_test == y_test)
    precision_score_test = precision_score(y_test, y_pred_test, average='micro')
    recall_sensitivity_score_test = recall_score(y_test, y_pred_test, average='micro')
    # Specificity is the recall of the negative class (TN / (TN + FP))
    specificity_score_test = recall_score(y_test, y_pred_test, pos_label=0)

    # Note that the lecture slides show this displayed as a float using :.3f
    # We can instead use :.3% to format the number as a percentage to 3 decimal places.
    print(f"Accuracy of predicting training data = {accuracy_train:.3%}")
    print(f"Accuracy of predicting testing data = {accuracy_test:.3%}")
    print(f"Precision score for testing data = {precision_score_test:.3%}")
    print(f"Recall (sensitivity) score for testing data = {recall_sensitivity_score_test:.3%}")
    print(f"Specificity score for testing data = {specificity_score_test:.3%}")

fit_dt_model(model = DecisionTreeClassifier(max_depth=8, random_state=42))
Accuracy of predicting training data = 86.367%
Accuracy of predicting testing data = 78.820%
Precision score for testing data = 78.820%
Recall (sensitivity) score for testing data = 78.820%
Specificity score for testing data = 83.010%
fit_dt_model(model = DecisionTreeClassifier(max_depth=4, random_state=42))
Accuracy of predicting training data = 79.449%
Accuracy of predicting testing data = 80.161%
Precision score for testing data = 80.161%
Recall (sensitivity) score for testing data = 80.161%
Specificity score for testing data = 80.786%
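
Rather than editing max_depth by hand each time, a loop makes the overfitting pattern easy to see: training accuracy keeps climbing with depth while test accuracy levels off or falls. A sketch (the exact numbers will depend on your split):

# Compare train and test accuracy across a range of tree depths
depth_results = []
for depth in range(1, 16):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    depth_results.append({
        'max_depth': depth,
        'train_accuracy': tree.score(X_train, y_train),
        'test_accuracy': tree.score(X_test, y_test),
    })

pd.DataFrame(depth_results).set_index('max_depth').plot(
    title="Decision tree accuracy by max_depth"
)
plt.show()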

5.2.2 Minimum Samples

Try changing the values of ‘min_samples_split’ (the default value is 2).

fit_dt_model(model = DecisionTreeClassifier(min_samples_split=6, random_state=42))
Accuracy of predicting training data = 95.097%
Accuracy of predicting testing data = 74.531%
Precision score for testing data = 74.531%
Recall (sensitivity) score for testing data = 74.531%
Specificity score for testing data = 76.549%
fit_dt_model(model = DecisionTreeClassifier(min_samples_split=4, random_state=42))
Accuracy of predicting training data = 97.381%
Accuracy of predicting testing data = 76.408%
Precision score for testing data = 76.408%
Recall (sensitivity) score for testing data = 76.408%
Specificity score for testing data = 79.535%

Now try adjusting ‘min_samples_leaf’ (the default is 1).

fit_dt_model(model = DecisionTreeClassifier(min_samples_leaf=6, random_state=42))
Accuracy of predicting training data = 87.710%
Accuracy of predicting testing data = 77.748%
Precision score for testing data = 77.748%
Recall (sensitivity) score for testing data = 77.748%
Specificity score for testing data = 77.049%
# Note: this cell revisits min_samples_split (rather than min_samples_leaf) for comparison
fit_dt_model(model = DecisionTreeClassifier(min_samples_split=3, random_state=42))
Accuracy of predicting training data = 99.127%
Accuracy of predicting testing data = 73.727%
Precision score for testing data = 73.727%
Recall (sensitivity) score for testing data = 73.727%
Specificity score for testing data = 76.233%
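
If you would rather search these parameters systematically than one at a time, scikit-learn's GridSearchCV cross-validates every combination for you. A sketch, with grid values chosen arbitrarily for illustration:

from sklearn.model_selection import GridSearchCV

# The grid below is illustrative, not a tuned recommendation
param_grid = {
    'max_depth': [3, 5, 8, None],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 3, 6],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring='accuracy',
    cv=5
)
search.fit(X_train, y_train)
print(search.best_params_)
print(f"Best cross-validated accuracy = {search.best_score_:.3%}")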

5.2.3 Split Criterion

Compare the performance when using

  • Gini Impurity
  • Entropy
  • Log Loss

print("**Gini**")
fit_dt_model(model = DecisionTreeClassifier(criterion="gini", random_state=42))
**Gini**
Accuracy of predicting training data = 100.000%
Accuracy of predicting testing data = 75.335%
Precision score for testing data = 75.335%
Recall (sensitivity) score for testing data = 75.335%
Specificity score for testing data = 78.605%
print("**Entropy**")
fit_dt_model(model = DecisionTreeClassifier(criterion="entropy", random_state=42))
**Entropy**
Accuracy of predicting training data = 100.000%
Accuracy of predicting testing data = 77.212%
Precision score for testing data = 77.212%
Recall (sensitivity) score for testing data = 77.212%
Specificity score for testing data = 79.545%
print("**Log Loss**")
fit_dt_model(model = DecisionTreeClassifier(criterion="log_loss", random_state=42))
**Log Loss**
Accuracy of predicting training data = 100.000%
Accuracy of predicting testing data = 77.212%
Precision score for testing data = 77.212%
Recall (sensitivity) score for testing data = 77.212%
Specificity score for testing data = 79.545%
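
The three criteria look very similar here (entropy and log_loss are mathematically equivalent criteria, which is why their results match exactly), and a single train/test split can flatter one of them by chance. Cross-validation gives a steadier comparison - a minimal sketch:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for each split criterion
for criterion in ['gini', 'entropy', 'log_loss']:
    scores = cross_val_score(
        DecisionTreeClassifier(criterion=criterion, random_state=42),
        X, y, cv=5
    )
    print(f"{criterion}: mean accuracy = {scores.mean():.3%}")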

5.3 Comparing Performance with a Logistic Regression Model

Copy in your code from last week’s logistic regression exercise (or write it from scratch - there isn’t much that’s different from the decision tree model!).

Remember - you will need to standardise the data for the logistic regression model!

Look at these additional metrics as well:

  • precision
  • specificity
  • recall (sensitivity)

scaler = StandardScaler()

X_train_stand = scaler.fit_transform(X_train)
# Use transform (not fit_transform) on the test set so it is scaled using
# the parameters learned from the training data
X_test_stand = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_stand, y_train)
y_pred_train = model.predict(X_train_stand)
y_pred_test = model.predict(X_test_stand)

accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)
precision_score_test = precision_score(y_test, y_pred_test, average='micro')
recall_sensitivity_score_test = recall_score(y_test, y_pred_test, average='micro')
# Specificity is the recall of the negative class (TN / (TN + FP))
specificity_score_test = recall_score(y_test, y_pred_test, pos_label=0)

# Note that the lecture slides show this displayed as a float using :.3f
# We can instead use :.3% to format the number as a percentage to 3 decimal places.
print(f"Accuracy of predicting training data = {accuracy_train:.3%}")
print(f"Accuracy of predicting testing data = {accuracy_test:.3%}")
print(f"Precision score for testing data = {precision_score_test:.3%}")
print(f"Recall (sensitivity) score for testing data = {recall_sensitivity_score_test:.3%}")
print(f"Specificity score for testing data = {specificity_score_test:.3%}")
Accuracy of predicting training data = 81.531%
Accuracy of predicting testing data = 82.574%
Precision score for testing data = 82.574%
Recall (sensitivity) score for testing data = 82.574%
Specificity score for testing data = 83.784%
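
Before writing up the interpretation prompt below, it can help to pull out the raw false positive and false negative counts for each model. A minimal sketch (refitting both models so the cell is self-contained; max_depth=5 for the tree is an arbitrary choice here):

from sklearn.metrics import confusion_matrix

dt = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
lr = LogisticRegression().fit(X_train_stand, y_train)

for name, preds in [("Decision tree", dt.predict(X_test)),
                    ("Logistic regression", lr.predict(X_test_stand))]:
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    print(f"{name}: false positives = {fp}, false negatives = {fn}")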

Use the cell below to write out an interpretation of the performance of the logistic regression model and the decision tree.

Think about the presence of false positives and false negatives.

Which might you be more interested in minimising in this model?

Hint - giving thrombolysis to good candidates for it can lead to less disability after stroke and improved outcomes. However, there is a risk that giving thrombolysis to the wrong person could lead to additional bleeding on the brain and worse outcomes. What might you want to balance?

No answer given

5.4 Challenge Exercises

5.4.1 Bonus Exercise 1

Have a read of this article on feature importance in decision trees: Article Link

In particular, make sure you read the section “Pros and cons of using Gini importance” so you can understand some of the things you need to keep in mind when looking at feature importance in trees.

We can access the feature importance by running the following code:

# modify this code to point towards your decision tree model object (make sure that object
# was fitted using the Gini index as its criterion)
model_dt = DecisionTreeClassifier(criterion="gini", random_state=42)
model_dt = model_dt.fit(X_train, y_train)

feature_importances_dt = model_dt.feature_importances_

feature_importances_dt
array([0.01364131, 0.00456272, 0.00586152, 0.01439673, 0.00110657,
       0.01397118, 0.01249914, 0.00889   , 0.08702842, 0.00186827,
       0.05481699, 0.        , 0.01165528, 0.01608714, 0.00869807,
       0.00445253, 0.01247073, 0.        , 0.00643961, 0.01070718,
       0.01356352, 0.0064616 , 0.00336373, 0.00394369, 0.00672135,
       0.01659303, 0.        , 0.        , 0.00157772, 0.        ,
       0.00463295, 0.        , 0.        , 0.1380051 , 0.0602852 ,
       0.2124907 , 0.01367527, 0.01015513, 0.00578212, 0.01669722,
       0.02725587, 0.02481184, 0.02105913, 0.02376351, 0.02466254,
       0.01465469, 0.00871107, 0.02615918, 0.003855  , 0.02196545])
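
The bare array is hard to interpret on its own - each value lines up with a column of X, so pairing them makes it readable. A quick sketch showing the ten most important features:

# Pair each importance with its feature name and show the top 10
importance_series = pd.Series(feature_importances_dt, index=X.columns)
importance_series.sort_values(ascending=False).head(10)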

How does this compare to the feature importance for your logistic regression?

HINT: Feature importance here works quite differently from the coefficient-based approach used for the logistic regression model.

You’ll need to look back at the exercises from last week.

# modify this code to point towards your logistic regression model object
scaler = StandardScaler()

X_train_stand = scaler.fit_transform(X_train)
# As before, only transform (not fit) the scaler on the test set
X_test_stand = scaler.transform(X_test)

model_lr = LogisticRegression()
model_lr = model_lr.fit(X_train_stand, y_train)

# Examine feature weights and sort by most influential
co_eff = model_lr.coef_[0]

co_eff_df = pd.DataFrame()
co_eff_df['feature'] = list(X)
co_eff_df['co_eff'] = co_eff
co_eff_df['abs_co_eff'] = np.abs(co_eff)
co_eff_df.sort_values(by='abs_co_eff', ascending=False, inplace=True)

co_eff_df
feature co_eff abs_co_eff
33 Stroke Type_PIH -1.138691 1.138691
32 Stroke Type_I 1.138691 1.138691
28 Stroke severity group_2. Minor -0.655217 0.655217
29 Stroke severity group_3. Moderate 0.558470 0.558470
34 S2RankinBeforeStroke -0.507321 0.507321
47 S2NihssArrivalBestLanguage 0.458131 0.458131
10 Onset Time Known Type_BE -0.290595 0.290595
25 Anticoag before stroke_1 -0.274249 0.274249
12 Onset Time Known Type_P 0.266564 0.266564
27 Stroke severity group_1. No stroke symtpoms -0.248087 0.248087
8 Age -0.244158 0.244158
24 Anticoag before stroke_0 0.235502 0.235502
16 Hypertension 0.235309 0.235309
17 Atrial Fib -0.230893 0.230893
42 S2NihssArrivalMotorArmRight 0.219366 0.219366
49 S2NihssArrivalExtinctionInattention 0.195696 0.195696
37 S2NihssArrivalLocCommands -0.189860 0.189860
40 S2NihssArrivalFacialPalsy 0.187638 0.187638
30 Stroke severity group_4. Moderate to severe 0.185069 0.185069
44 S2NihssArrivalMotorLegRight -0.181887 0.181887
36 S2NihssArrivalLocQuestions 0.168651 0.168651
0 Hosp_1 0.168301 0.168301
21 Antiplatelet_0 0.161005 0.161005
35 S2NihssArrival -0.142778 0.142778
19 TIA -0.138909 0.138909
43 S2NihssArrivalMotorLegLeft 0.131229 0.131229
14 2+ comorbidotes -0.129622 0.129622
20 Co-mordity -0.124540 0.124540
41 S2NihssArrivalMotorArmLeft 0.120298 0.120298
5 Hosp_6 0.116469 0.116469
46 S2NihssArrivalSensory 0.114362 0.114362
45 S2NihssArrivalLimbAtaxia -0.099983 0.099983
3 Hosp_4 -0.093776 0.093776
2 Hosp_3 -0.093303 0.093303
23 Antiplatelet_NK -0.092551 0.092551
18 Diabetes 0.090665 0.090665
11 Onset Time Known Type_NK 0.069936 0.069936
7 Male 0.068703 0.068703
38 S2NihssArrivalBestGaze 0.065874 0.065874
22 Antiplatelet_1 -0.059578 0.059578
4 Hosp_5 -0.057283 0.057283
9 80+ -0.053481 0.053481
31 Stroke severity group_5. Severe -0.047326 0.047326
48 S2NihssArrivalDysarthria 0.044316 0.044316
6 Hosp_7 -0.044271 0.044271
39 S2NihssArrivalVisual 0.042245 0.042245
15 Congestive HF 0.030292 0.030292
26 Anticoag before stroke_NK 0.021401 0.021401
1 Hosp_2 0.010540 0.010540
13 # Comorbidities -0.001275 0.001275

Can you create two graphs showing feature importance for the two models?

Instead of using the plot code used in the linked article, try looking up the barh function from matplotlib.

Try ordering your plot so that the features with the most importance are at the top.

# Sort the feature importances from greatest to least using the sorted indices
sorted_indices = feature_importances_dt.argsort()[::-1]

sorted_feature_names =[X.columns[i] for i in sorted_indices]

sorted_importances = feature_importances_dt[sorted_indices]

# Create a bar plot of the feature importances
fig, ax = plt.subplots(figsize=(15,10))
ax.barh(width=sorted_importances, y=sorted_feature_names)
ax.invert_yaxis()
ax = plt.title("Feature Importance - Stroke Dataset - Decision Tree")

fig, ax = plt.subplots(figsize=(15,10))
ax.barh(width=co_eff_df['abs_co_eff'], y=co_eff_df['feature'])
ax.invert_yaxis()
ax = plt.title("Absolute Feature Importance - Stroke Dataset - Logistic Regression")

fig, ax = plt.subplots(figsize=(15,10))
ax.barh(width=co_eff_df['co_eff'], y=co_eff_df['feature'])
ax.invert_yaxis()
ax = plt.title("Feature Importance - Stroke Dataset - Logistic Regression")

5.4.2 Bonus Exercise 2

Can you improve the accuracy of your decision tree model by changing the size of your train / test split?

NOTE - the examples below just show the impact of changing the train/test split.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

fit_dt_model(model = DecisionTreeClassifier(random_state=42))
Accuracy of predicting training data = 100.000%
Accuracy of predicting testing data = 74.866%
Precision score for testing data = 74.866%
Recall (sensitivity) score for testing data = 74.866%
Specificity score for testing data = 81.250%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

fit_dt_model(model = DecisionTreeClassifier(random_state=42))
Accuracy of predicting training data = 100.000%
Accuracy of predicting testing data = 76.923%
Precision score for testing data = 76.923%
Recall (sensitivity) score for testing data = 76.923%
Specificity score for testing data = 80.588%
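
As with max_depth, a loop saves retyping. A sketch looping over several split sizes (each call reshuffles which patients land in the test set, so some of the variation is just noise):

# Test accuracy for a range of train/test splits
for size in [0.1, 0.2, 0.3, 0.4]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=size, random_state=42
    )
    tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
    print(f"test_size={size}: accuracy = {tree.score(X_te, y_te):.3%}")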

5.4.3 Bonus Exercise 3

Try dropping some features from your data.

Can you improve the performance of your model this way?

NOTE: This solution just shows selecting a subset of features - not the best ones necessarily!

X.columns
Index(['Hosp_1', 'Hosp_2', 'Hosp_3', 'Hosp_4', 'Hosp_5', 'Hosp_6', 'Hosp_7',
       'Male', 'Age', '80+', 'Onset Time Known Type_BE',
       'Onset Time Known Type_NK', 'Onset Time Known Type_P',
       '# Comorbidities', '2+ comorbidotes', 'Congestive HF', 'Hypertension',
       'Atrial Fib', 'Diabetes', 'TIA', 'Co-mordity', 'Antiplatelet_0',
       'Antiplatelet_1', 'Antiplatelet_NK', 'Anticoag before stroke_0',
       'Anticoag before stroke_1', 'Anticoag before stroke_NK',
       'Stroke severity group_1. No stroke symtpoms',
       'Stroke severity group_2. Minor', 'Stroke severity group_3. Moderate',
       'Stroke severity group_4. Moderate to severe',
       'Stroke severity group_5. Severe', 'Stroke Type_I', 'Stroke Type_PIH',
       'S2RankinBeforeStroke', 'S2NihssArrival', 'S2NihssArrivalLocQuestions',
       'S2NihssArrivalLocCommands', 'S2NihssArrivalBestGaze',
       'S2NihssArrivalVisual', 'S2NihssArrivalFacialPalsy',
       'S2NihssArrivalMotorArmLeft', 'S2NihssArrivalMotorArmRight',
       'S2NihssArrivalMotorLegLeft', 'S2NihssArrivalMotorLegRight',
       'S2NihssArrivalLimbAtaxia', 'S2NihssArrivalSensory',
       'S2NihssArrivalBestLanguage', 'S2NihssArrivalDysarthria',
       'S2NihssArrivalExtinctionInattention'],
      dtype='object')
fit_dt_model(model = DecisionTreeClassifier(max_depth=5, random_state=42))
Accuracy of predicting training data = 80.583%
Accuracy of predicting testing data = 80.680%
Precision score for testing data = 80.680%
Recall (sensitivity) score for testing data = 80.680%
Specificity score for testing data = 89.895%
# This isn't necessarily the best subset of features to use - it's just an example!
X_reduced = X[['S2NihssArrival', 'Stroke Type_PIH', 'Age', 'S2RankinBeforeStroke']]

X_reduced_train, X_reduced_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.1, random_state=42)

model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_reduced_train, y_train)
y_pred_train = model.predict(X_reduced_train)
y_pred_test = model.predict(X_reduced_test)

accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)
precision_score_test = precision_score(y_test, y_pred_test, average='micro')
recall_sensitivity_score_test = recall_score(y_test, y_pred_test, average='micro')
# Specificity is the recall of the negative class (TN / (TN + FP))
specificity_score_test = recall_score(y_test, y_pred_test, pos_label=0)

# Note that the lecture slides show this displayed as a float using :.3f
# We can instead use :.3% to format the number as a percentage to 3 decimal places.
print(f"Accuracy of predicting training data = {accuracy_train:.3%}")
print(f"Accuracy of predicting testing data = {accuracy_test:.3%}")
print(f"Precision score for testing data = {precision_score_test:.3%}")
print(f"Recall (sensitivity) score for testing data = {recall_sensitivity_score_test:.3%}")
print(f"Specificity score for testing data = {specificity_score_test:.3%}")
Accuracy of predicting training data = 79.821%
Accuracy of predicting testing data = 77.540%
Precision score for testing data = 77.540%
Recall (sensitivity) score for testing data = 77.540%
Specificity score for testing data = 86.408%
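
Instead of hand-picking columns, one option is to let the Gini importances from Bonus Exercise 1 choose, say, the ten highest-scoring features. A sketch (this reuses feature_importances_dt from earlier and inherits all the caveats about Gini importance discussed there):

# Keep the ten features the earlier tree found most important
top_features = (
    pd.Series(feature_importances_dt, index=X.columns)
    .sort_values(ascending=False)
    .head(10)
    .index
)

X_top = X[top_features]
X_top_train, X_top_test, y_train, y_test = train_test_split(
    X_top, y, test_size=0.2, random_state=42
)
tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_top_train, y_train)
print(f"Test accuracy with top-10 features = {tree.score(X_top_test, y_test):.3%}")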