3  Stroke Thrombolysis Dataset: Logistic Regression Exercise (Solution)

The data loaded in this exercise covers patients at seven acute stroke units, and records whether each patient received clot-busting treatment (thrombolysis) for stroke. There are many features; a description of each can be found in the file stroke_data_feature_descriptions.csv.

Train a Logistic Regression model to try to predict whether or not a stroke patient receives clot-busting treatment. Use the prompts below to write each section of code.

What do you conclude are the most important features for predicting whether a patient receives clot-busting treatment? Can you improve accuracy by changing the size of your train / test split? If you have time, consider dropping some features from your data based on your outputs (in the same way you dropped passengerID in the Titanic example). Don’t forget that you’ll need to rerun all subsequent cells if you make changes like that.

import pandas as pd
import numpy as np
# Import machine learning methods
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Download data
# (not required if running locally and have previously downloaded data)

download_required = True

if download_required:

    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '2004_titanic/master/jupyter_notebooks/data/hsma_stroke.csv'
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory ='./data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data to data subfolder
    data.to_csv(data_directory + 'hsma_stroke.csv', index=False)

# Load data
data = pd.read_csv('data/hsma_stroke.csv')
# Make all data 'float' type
data = data.astype(float)
# Show data
data.head()
Clotbuster given Hosp_1 Hosp_2 Hosp_3 Hosp_4 Hosp_5 Hosp_6 Hosp_7 Male Age ... S2NihssArrivalFacialPalsy S2NihssArrivalMotorArmLeft S2NihssArrivalMotorArmRight S2NihssArrivalMotorLegLeft S2NihssArrivalMotorLegRight S2NihssArrivalLimbAtaxia S2NihssArrivalSensory S2NihssArrivalBestLanguage S2NihssArrivalDysarthria S2NihssArrivalExtinctionInattention
0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 63.0 ... 3.0 4.0 0.0 4.0 0.0 0.0 0.0 0.0 1.0 1.0
1 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 85.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 1.0 0.0
2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 91.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
3 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 90.0 ... 1.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0
4 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 69.0 ... 2.0 0.0 4.0 1.0 4.0 0.0 1.0 2.0 2.0 1.0

5 rows × 51 columns

# Look at overview of data
data.describe()
Clotbuster given Hosp_1 Hosp_2 Hosp_3 Hosp_4 Hosp_5 Hosp_6 Hosp_7 Male Age ... S2NihssArrivalFacialPalsy S2NihssArrivalMotorArmLeft S2NihssArrivalMotorArmRight S2NihssArrivalMotorLegLeft S2NihssArrivalMotorLegRight S2NihssArrivalLimbAtaxia S2NihssArrivalSensory S2NihssArrivalBestLanguage S2NihssArrivalDysarthria S2NihssArrivalExtinctionInattention
count 1862.000000 1862.000000 1862.000000 1862.000000 1862.000000 1862.000000 1862.000000 1862.000000 1862.000000 1862.000000 ... 1862.000000 1862.000000 1862.000000 1862.000000 1862.000000 1862.000000 1862.000000 1862.000000 1862.000000 1862.000000
mean 0.403330 0.159506 0.142320 0.154672 0.165414 0.055854 0.113319 0.208915 0.515575 74.553706 ... 1.114930 1.002148 0.963480 0.963480 0.910849 0.216971 0.610097 0.944146 0.739527 0.566595
std 0.490698 0.366246 0.349472 0.361689 0.371653 0.229701 0.317068 0.406643 0.499892 12.280576 ... 0.930527 1.479211 1.441594 1.406501 1.380606 0.522643 0.771932 1.121379 0.731083 0.794000
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 40.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 67.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 76.000000 ... 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
75% 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 83.000000 ... 2.000000 2.000000 2.000000 2.000000 2.000000 0.000000 1.000000 2.000000 1.000000 1.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100.000000 ... 3.000000 4.000000 4.000000 4.000000 4.000000 2.000000 2.000000 3.000000 2.000000 2.000000

8 rows × 51 columns

# Look at mean feature values for those who were given a clotbuster vs those
# that weren't
mask = data['Clotbuster given'] == 1
given = data[mask]

mask = data['Clotbuster given'] == 0
not_given = data[mask]

summary = pd.DataFrame()
summary['given'] = given.mean()
summary['not given'] = not_given.mean()

summary
given not given
Clotbuster given 1.000000 0.000000
Hosp_1 0.203728 0.129613
Hosp_2 0.122503 0.155716
Hosp_3 0.182423 0.135914
Hosp_4 0.137150 0.184518
Hosp_5 0.067909 0.047705
Hosp_6 0.123835 0.106211
Hosp_7 0.162450 0.240324
Male 0.515313 0.515752
Age 73.303595 75.398740
80+ 0.346205 0.393339
Onset Time Known Type_BE 0.149134 0.352835
Onset Time Known Type_NK 0.010652 0.015302
Onset Time Known Type_P 0.840213 0.631863
# Comorbidities 1.053262 1.258326
2+ comorbidotes 0.298269 0.393339
Congestive HF 0.041278 0.047705
Hypertension 0.464714 0.470747
Atrial Fib 0.186418 0.239424
Diabetes 0.151798 0.172817
TIA 0.209055 0.327633
Co-mordity 0.676431 0.732673
Antiplatelet_0 0.097204 0.144014
Antiplatelet_1 0.079893 0.079208
Antiplatelet_NK 0.822903 0.776778
Anticoag before stroke_0 0.122503 0.083708
Anticoag before stroke_1 0.046605 0.129613
Anticoag before stroke_NK 0.830892 0.786679
Stroke severity group_1. No stroke symtpoms 0.002663 0.041404
Stroke severity group_2. Minor 0.061252 0.396040
Stroke severity group_3. Moderate 0.619174 0.333033
Stroke severity group_4. Moderate to severe 0.182423 0.105311
Stroke severity group_5. Severe 0.134487 0.124212
Stroke Type_I 1.000000 0.803780
Stroke Type_PIH 0.000000 0.196220
S2RankinBeforeStroke 0.360852 0.791179
S2NihssArrival 12.515313 9.032403
S2NihssArrivalLocQuestions 0.850866 0.605761
S2NihssArrivalLocCommands 0.394141 0.360936
S2NihssArrivalBestGaze 0.521971 0.341134
S2NihssArrivalVisual 0.809587 0.495050
S2NihssArrivalFacialPalsy 1.407457 0.917192
S2NihssArrivalMotorArmLeft 1.215712 0.857786
S2NihssArrivalMotorArmRight 1.073236 0.889289
S2NihssArrivalMotorLegLeft 1.165113 0.827183
S2NihssArrivalMotorLegRight 0.977364 0.865887
S2NihssArrivalLimbAtaxia 0.214381 0.218722
S2NihssArrivalSensory 0.762983 0.506751
S2NihssArrivalBestLanguage 1.181092 0.783978
S2NihssArrivalDysarthria 0.902796 0.629163
S2NihssArrivalExtinctionInattention 0.762983 0.433843
# Divide into features and labels
X = data.drop('Clotbuster given', axis=1)
y = data['Clotbuster given']
# Divide into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Standardise data
def standardise_data(X_train, X_test):

    # Initialise a new scaling object for normalising input data
    sc = StandardScaler()

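    # Fit the scaler on the training set only, then apply the same
    # transformation to both sets (refitting on the test set would scale
    # the two sets differently and leak test-set statistics)
    train_std=sc.fit_transform(X_train)
    test_std=sc.transform(X_test)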

    return train_std, test_std

X_train_std, X_test_std = standardise_data(X_train, X_test)
# Fit (train) Logistic Regression model
model = LogisticRegression()
model.fit(X_train_std, y_train)
LogisticRegression()
# Predict training and test labels, and calculate accuracy
y_pred_train = model.predict(X_train_std)
y_pred_test = model.predict(X_test_std)

accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)

print(f'Accuracy of predicting training data = {accuracy_train}')
print(f'Accuracy of predicting test data = {accuracy_test}')
Accuracy of predicting training data = 0.8202005730659025
Accuracy of predicting test data = 0.7939914163090128
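On the question of whether the train / test split size changes accuracy: a single random split gives a noisy estimate, so one option is to repeat the split several times for each test size and compare the mean test accuracy. The sketch below is illustrative only. It uses synthetic data from make_classification as a stand-in (an assumption, so that the example runs on its own), and the helper name mean_test_accuracy is hypothetical. In the notebook you would pass in the X and y defined above instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, used only to keep this runnable);
# in the notebook, use the X and y defined above instead
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=42)


def mean_test_accuracy(X, y, test_size, n_repeats=10):
    """Return mean and std of test accuracy over repeated random splits."""
    accuracies = []
    for seed in range(n_repeats):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        # Fit the scaler on the training set only, then apply to both sets
        sc = StandardScaler()
        X_train_std = sc.fit_transform(X_train)
        X_test_std = sc.transform(X_test)
        model = LogisticRegression()
        model.fit(X_train_std, y_train)
        accuracies.append(np.mean(model.predict(X_test_std) == y_test))
    return np.mean(accuracies), np.std(accuracies)


# Compare a few test set sizes
results = {}
for test_size in [0.1, 0.25, 0.4]:
    results[test_size] = mean_test_accuracy(X_demo, y_demo, test_size)
    mean_acc, std_acc = results[test_size]
    print(f'test_size = {test_size}: '
          f'accuracy = {mean_acc:.3f} (std {std_acc:.3f})')
```

If the differences between test sizes are smaller than the spread across repeats, the split size is probably not what limits accuracy.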
# Examine feature weights and sort by most influential
co_eff = model.coef_[0]

co_eff_df = pd.DataFrame()
co_eff_df['feature'] = list(X)
co_eff_df['co_eff'] = co_eff
co_eff_df['abs_co_eff'] = np.abs(co_eff)
co_eff_df.sort_values(by='abs_co_eff', ascending=False, inplace=True)

co_eff_df
feature co_eff abs_co_eff
32 Stroke Type_I 1.121156 1.121156
33 Stroke Type_PIH -1.121156 1.121156
28 Stroke severity group_2. Minor -0.709358 0.709358
29 Stroke severity group_3. Moderate 0.594452 0.594452
35 S2NihssArrival -0.468261 0.468261
34 S2RankinBeforeStroke -0.458139 0.458139
47 S2NihssArrivalBestLanguage 0.407136 0.407136
8 Age -0.351431 0.351431
10 Onset Time Known Type_BE -0.311359 0.311359
17 Atrial Fib -0.304925 0.304925
12 Onset Time Known Type_P 0.296688 0.296688
16 Hypertension 0.284943 0.284943
41 S2NihssArrivalMotorArmLeft 0.270736 0.270736
27 Stroke severity group_1. No stroke symtpoms -0.269664 0.269664
24 Anticoag before stroke_0 0.268555 0.268555
25 Anticoag before stroke_1 -0.265632 0.265632
36 S2NihssArrivalLocQuestions 0.259637 0.259637
40 S2NihssArrivalFacialPalsy 0.249051 0.249051
49 S2NihssArrivalExtinctionInattention 0.216644 0.216644
30 Stroke severity group_4. Moderate to severe 0.194853 0.194853
19 TIA -0.193125 0.193125
42 S2NihssArrivalMotorArmRight 0.186265 0.186265
38 S2NihssArrivalBestGaze 0.180682 0.180682
37 S2NihssArrivalLocCommands -0.175957 0.175957
3 Hosp_4 -0.171598 0.171598
21 Antiplatelet_0 0.155566 0.155566
14 2+ comorbidotes -0.145232 0.145232
0 Hosp_1 0.122579 0.122579
5 Hosp_6 0.116022 0.116022
23 Antiplatelet_NK -0.110145 0.110145
39 S2NihssArrivalVisual 0.093206 0.093206
1 Hosp_2 0.071554 0.071554
45 S2NihssArrivalLimbAtaxia 0.067437 0.067437
43 S2NihssArrivalMotorLegLeft 0.065901 0.065901
9 80+ 0.064134 0.064134
13 # Comorbidities -0.062878 0.062878
4 Hosp_5 -0.047460 0.047460
6 Hosp_7 -0.046642 0.046642
46 S2NihssArrivalSensory 0.044662 0.044662
11 Onset Time Known Type_NK 0.042662 0.042662
2 Hosp_3 -0.034359 0.034359
7 Male 0.032534 0.032534
48 S2NihssArrivalDysarthria 0.031809 0.031809
20 Co-mordity -0.029874 0.029874
22 Antiplatelet_1 -0.026909 0.026909
31 Stroke severity group_5. Severe -0.019006 0.019006
15 Congestive HF 0.016522 0.016522
18 Diabetes 0.013792 0.013792
44 S2NihssArrivalMotorLegRight -0.009994 0.009994
26 Anticoag before stroke_NK -0.006123 0.006123
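
One way to read these weights: because the features were standardised, each coefficient is the change in the log-odds of receiving a clotbuster for a one standard deviation increase in that feature, with the other features held fixed. Taking the exponential gives an odds ratio. The sketch below applies this to a few coefficients copied from the table above. The exact values will differ between runs, because the train / test split is random, and scikit-learn's default regularisation shrinks the coefficients, so treat the ratios as a rough guide rather than clinical effect sizes.

```python
import math

# Coefficients copied from the sorted table above (per standard deviation,
# because the features were standardised before fitting)
coefficients = {
    'Stroke Type_I': 1.121156,
    'Stroke severity group_2. Minor': -0.709358,
    'S2RankinBeforeStroke': -0.458139,
    'Age': -0.351431,
}

# Odds ratio = exp(coefficient); above 1 raises the odds, below 1 lowers them
odds_ratios = {name: math.exp(c) for name, c in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f'{name}: odds ratio per standard deviation = {ratio:.2f}')
```

Note that Stroke Type_I and Stroke Type_PIH have equal and opposite weights. In the summary table their means sum to 1 in both groups, so the two columns carry the same information, and one of them is a natural candidate when experimenting with dropping features.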