3 Stroke Thrombolysis Dataset: Logistic Regression Exercise (Solution)

The data loaded in this exercise covers patients from seven acute stroke units and records whether each patient received clot-busting treatment for stroke. There are many features; a description of each can be found in the file stroke_data_feature_descriptions.csv.

Train a Logistic Regression model to try to predict whether or not a stroke patient receives clot-busting treatment. Use the prompts below to write each section of code.

What do you conclude are the most important features for predicting whether a patient receives clot-busting treatment? Can you improve accuracy by changing the size of your train / test split? If you have time, consider dropping some features from your data based on your outputs (in the same way you dropped passengerID in the Titanic example). Don’t forget you’ll need to rerun all subsequent cells if you make changes like that.

import pandas as pd
import numpy as np
# Import machine learning methods
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Download data
# (not required if running locally and have previously downloaded data)

download_required = True

if download_required:

    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '2004_titanic/master/jupyter_notebooks/data/hsma_stroke.csv'
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory = './data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data to data subfolder
    data.to_csv(data_directory + 'hsma_stroke.csv', index=False)

# Load data
data = pd.read_csv('data/hsma_stroke.csv')
# Make all data 'float' type
data = data.astype(float)
# Show data
data.head()

	Clotbuster given	Hosp_2	Age	...	S2NihssArrivalFacialPalsy	S2NihssArrivalMotorArmLeft	S2NihssArrivalMotorArmRight	S2NihssArrivalMotorLegLeft	S2NihssArrivalMotorLegRight	S2NihssArrivalLimbAtaxia	S2NihssArrivalSensory	S2NihssArrivalBestLanguage	S2NihssArrivalDysarthria	S2NihssArrivalExtinctionInattention
0	1.0	1.0	63.0	...	3.0	4.0	0.0	4.0	0.0	0.0	0.0	0.0	1.0	1.0
1	1.0	1.0	85.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2.0	1.0	0.0
2	0.0	1.0	91.0	...	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0
3	0.0	1.0	90.0	...	1.0	1.0	0.0	1.0	0.0	0.0	1.0	0.0	1.0	0.0
4	1.0	1.0	69.0	...	2.0	0.0	4.0	1.0	4.0	0.0	1.0	2.0	2.0	1.0

5 rows × 51 columns

# Look at overview of data
data.describe()

	Clotbuster given	Hosp_1	Hosp_2	Hosp_3	Hosp_4	Hosp_5	Hosp_6	Hosp_7	Male	Age	...	S2NihssArrivalFacialPalsy	S2NihssArrivalMotorArmLeft	S2NihssArrivalMotorArmRight	S2NihssArrivalMotorLegLeft	S2NihssArrivalMotorLegRight	S2NihssArrivalLimbAtaxia	S2NihssArrivalSensory	S2NihssArrivalBestLanguage	S2NihssArrivalDysarthria	S2NihssArrivalExtinctionInattention
count	1862.000000	1862.000000	1862.000000	1862.000000	1862.000000	1862.000000	1862.000000	1862.000000	1862.000000	1862.000000	...	1862.000000	1862.000000	1862.000000	1862.000000	1862.000000	1862.000000	1862.000000	1862.000000	1862.000000	1862.000000
mean	0.403330	0.159506	0.142320	0.154672	0.165414	0.055854	0.113319	0.208915	0.515575	74.553706	...	1.114930	1.002148	0.963480	0.963480	0.910849	0.216971	0.610097	0.944146	0.739527	0.566595
std	0.490698	0.366246	0.349472	0.361689	0.371653	0.229701	0.317068	0.406643	0.499892	12.280576	...	0.930527	1.479211	1.441594	1.406501	1.380606	0.522643	0.771932	1.121379	0.731083	0.794000
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	40.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	67.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	76.000000	...	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000
75%	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	83.000000	...	2.000000	2.000000	2.000000	2.000000	2.000000	0.000000	1.000000	2.000000	1.000000	1.000000
max	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	100.000000	...	3.000000	4.000000	4.000000	4.000000	4.000000	2.000000	2.000000	3.000000	2.000000	2.000000

8 rows × 51 columns
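
The mean of 'Clotbuster given' in the overview above (0.403330) is the proportion of patients who were treated, so a model that always predicts "not given" would already be right for the remaining patients. A minimal sketch of that majority-class baseline follows; the value is copied by hand from the table and the variable names are illustrative only.

```python
# Proportion of patients given a clotbuster, copied from the mean of the
# 'Clotbuster given' column in the describe() output
proportion_given = 0.403330

# A model that always predicts the majority class is correct this often
baseline_accuracy = max(proportion_given, 1 - proportion_given)

print(f'Majority-class baseline accuracy = {baseline_accuracy:.3f}')
```

Any accuracy reported for a fitted model on this data is best read against that baseline rather than against 0.5.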

# Look at mean feature values for those who were given a clotbuster vs those
# that weren't
mask = data['Clotbuster given'] == 1
given = data[mask]

mask = data['Clotbuster given'] == 0
not_given = data[mask]

summary = pd.DataFrame()
summary['given'] = given.mean()
summary['not given'] = not_given.mean()

summary

	given	not given
Clotbuster given	1.000000	0.000000
Hosp_1	0.203728	0.129613
Hosp_2	0.122503	0.155716
Hosp_3	0.182423	0.135914
Hosp_4	0.137150	0.184518
Hosp_5	0.067909	0.047705
Hosp_6	0.123835	0.106211
Hosp_7	0.162450	0.240324
Male	0.515313	0.515752
Age	73.303595	75.398740
80+	0.346205	0.393339
Onset Time Known Type_BE	0.149134	0.352835
Onset Time Known Type_NK	0.010652	0.015302
Onset Time Known Type_P	0.840213	0.631863
# Comorbidities	1.053262	1.258326
2+ comorbidotes	0.298269	0.393339
Congestive HF	0.041278	0.047705
Hypertension	0.464714	0.470747
Atrial Fib	0.186418	0.239424
Diabetes	0.151798	0.172817
TIA	0.209055	0.327633
Co-mordity	0.676431	0.732673
Antiplatelet_0	0.097204	0.144014
Antiplatelet_1	0.079893	0.079208
Antiplatelet_NK	0.822903	0.776778
Anticoag before stroke_0	0.122503	0.083708
Anticoag before stroke_1	0.046605	0.129613
Anticoag before stroke_NK	0.830892	0.786679
Stroke severity group_1. No stroke symtpoms	0.002663	0.041404
Stroke severity group_2. Minor	0.061252	0.396040
Stroke severity group_3. Moderate	0.619174	0.333033
Stroke severity group_4. Moderate to severe	0.182423	0.105311
Stroke severity group_5. Severe	0.134487	0.124212
Stroke Type_I	1.000000	0.803780
Stroke Type_PIH	0.000000	0.196220
S2RankinBeforeStroke	0.360852	0.791179
S2NihssArrival	12.515313	9.032403
S2NihssArrivalLocQuestions	0.850866	0.605761
S2NihssArrivalLocCommands	0.394141	0.360936
S2NihssArrivalBestGaze	0.521971	0.341134
S2NihssArrivalVisual	0.809587	0.495050
S2NihssArrivalFacialPalsy	1.407457	0.917192
S2NihssArrivalMotorArmLeft	1.215712	0.857786
S2NihssArrivalMotorArmRight	1.073236	0.889289
S2NihssArrivalMotorLegLeft	1.165113	0.827183
S2NihssArrivalMotorLegRight	0.977364	0.865887
S2NihssArrivalLimbAtaxia	0.214381	0.218722
S2NihssArrivalSensory	0.762983	0.506751
S2NihssArrivalBestLanguage	1.181092	0.783978
S2NihssArrivalDysarthria	0.902796	0.629163
S2NihssArrivalExtinctionInattention	0.762983	0.433843
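
Raw differences in group means are hard to compare across features on different scales (Age in years versus 0/1 flags, for example). One option, sketched here on a toy DataFrame with made-up values, is to divide each difference by the feature's overall standard deviation and sort by absolute size. The names toy, group_means and std_diff are illustrative and not part of the exercise.

```python
import pandas as pd

# Toy data with made-up values; column names mimic the stroke dataset
toy = pd.DataFrame({
    'Clotbuster given': [1, 1, 0, 0, 0],
    'Age': [63, 69, 91, 90, 85],
    'S2RankinBeforeStroke': [0, 1, 2, 1, 3],
})

# Mean of each feature within each label group (index: 0 and 1)
group_means = toy.groupby('Clotbuster given').mean()

# Difference between groups, scaled by each feature's overall standard
# deviation so that features on different scales can be compared
features = toy.drop('Clotbuster given', axis=1)
std_diff = (group_means.loc[1] - group_means.loc[0]) / features.std()

# Order features by the absolute size of the standardised difference
order = std_diff.abs().sort_values(ascending=False).index
std_diff = std_diff.reindex(order)

print(std_diff)
```

The same steps should apply to the full data by replacing toy with data, although this is only a descriptive summary and not a substitute for the model weights.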

# Divide into features and labels
X = data.drop('Clotbuster given', axis=1)
y = data['Clotbuster given']

# Divide into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Standardise data
def standardise_data(X_train, X_test):

    # Initialise a new scaling object for normalising input data
    sc = StandardScaler()

    # Fit the scaler on the training set only, then apply the same
    # transformation to both sets (avoids leaking test-set statistics)
    train_std = sc.fit_transform(X_train)
    test_std = sc.transform(X_test)

    return train_std, test_std

X_train_std, X_test_std = standardise_data(X_train, X_test)
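
Scaling parameters are usually learned from the training set alone, so that no information from the test set influences the model. The sketch below illustrates this on synthetic single-feature data (made up for illustration, with hypothetical variable names): the training data ends up centred exactly on zero, while the test data is only approximately centred because it reuses the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (made up for illustration)
rng = np.random.default_rng(42)
X_train_demo = rng.normal(loc=75, scale=12, size=(300, 1))
X_test_demo = rng.normal(loc=75, scale=12, size=(100, 1))

# Learn the mean and standard deviation from the training set only
sc_demo = StandardScaler()
sc_demo.fit(X_train_demo)

# Apply the same training-set parameters to both sets
train_demo_std = sc_demo.transform(X_train_demo)
test_demo_std = sc_demo.transform(X_test_demo)

# Training data is centred exactly; test data only approximately
print(f'Train mean = {train_demo_std.mean():.3f}')
print(f'Test mean = {test_demo_std.mean():.3f}')
```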

# Fit (train) Logistic Regression model
model = LogisticRegression()
model.fit(X_train_std, y_train)

LogisticRegression()


# Predict training and test labels, and calculate accuracy
y_pred_train = model.predict(X_train_std)
y_pred_test = model.predict(X_test_std)

accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)

print(f'Accuracy of predicting training data = {accuracy_train}')
print(f'Accuracy of predicting test data = {accuracy_test}')

Accuracy of predicting training data = 0.8202005730659025
Accuracy of predicting test data = 0.7939914163090128
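
The accuracy values above come from a single random split, so they will change from run to run. To judge whether a different train / test split size really helps, one approach is to repeat the split several times for each size and compare the mean and spread of test accuracy. The sketch below does this on synthetic data from make_classification, used here as a stand-in and not the stroke dataset; the list of test sizes and the number of repeats are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stroke data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42)

results = {}
for test_size in [0.1, 0.25, 0.5]:
    accuracies = []
    # Repeat the split with different seeds to see run-to-run variation
    for seed in range(10):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo, y_demo, test_size=test_size, random_state=seed)
        # Fit the scaler on the training portion only
        sc_demo = StandardScaler().fit(X_tr)
        model_demo = LogisticRegression()
        model_demo.fit(sc_demo.transform(X_tr), y_tr)
        y_pred = model_demo.predict(sc_demo.transform(X_te))
        accuracies.append(np.mean(y_pred == y_te))
    results[test_size] = (np.mean(accuracies), np.std(accuracies))

for test_size, (mean_acc, std_acc) in results.items():
    print(f'test_size={test_size}: accuracy = {mean_acc:.3f} +/- {std_acc:.3f}')
```

If the differences between split sizes are smaller than the spread across repeats, the apparent improvement from changing the split is probably noise.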

# Examine feature weights and sort by most influential
co_eff = model.coef_[0]

co_eff_df = pd.DataFrame()
co_eff_df['feature'] = list(X)
co_eff_df['co_eff'] = co_eff
co_eff_df['abs_co_eff'] = np.abs(co_eff)
co_eff_df.sort_values(by='abs_co_eff', ascending=False, inplace=True)

co_eff_df

	feature	co_eff	abs_co_eff
32	Stroke Type_I	1.121156	1.121156
33	Stroke Type_PIH	-1.121156	1.121156
28	Stroke severity group_2. Minor	-0.709358	0.709358
29	Stroke severity group_3. Moderate	0.594452	0.594452
35	S2NihssArrival	-0.468261	0.468261
34	S2RankinBeforeStroke	-0.458139	0.458139
47	S2NihssArrivalBestLanguage	0.407136	0.407136
8	Age	-0.351431	0.351431
10	Onset Time Known Type_BE	-0.311359	0.311359
17	Atrial Fib	-0.304925	0.304925
12	Onset Time Known Type_P	0.296688	0.296688
16	Hypertension	0.284943	0.284943
41	S2NihssArrivalMotorArmLeft	0.270736	0.270736
27	Stroke severity group_1. No stroke symtpoms	-0.269664	0.269664
24	Anticoag before stroke_0	0.268555	0.268555
25	Anticoag before stroke_1	-0.265632	0.265632
36	S2NihssArrivalLocQuestions	0.259637	0.259637
40	S2NihssArrivalFacialPalsy	0.249051	0.249051
49	S2NihssArrivalExtinctionInattention	0.216644	0.216644
30	Stroke severity group_4. Moderate to severe	0.194853	0.194853
19	TIA	-0.193125	0.193125
42	S2NihssArrivalMotorArmRight	0.186265	0.186265
38	S2NihssArrivalBestGaze	0.180682	0.180682
37	S2NihssArrivalLocCommands	-0.175957	0.175957
3	Hosp_4	-0.171598	0.171598
21	Antiplatelet_0	0.155566	0.155566
14	2+ comorbidotes	-0.145232	0.145232
0	Hosp_1	0.122579	0.122579
5	Hosp_6	0.116022	0.116022
23	Antiplatelet_NK	-0.110145	0.110145
39	S2NihssArrivalVisual	0.093206	0.093206
1	Hosp_2	0.071554	0.071554
45	S2NihssArrivalLimbAtaxia	0.067437	0.067437
43	S2NihssArrivalMotorLegLeft	0.065901	0.065901
9	80+	0.064134	0.064134
13	# Comorbidities	-0.062878	0.062878
4	Hosp_5	-0.047460	0.047460
6	Hosp_7	-0.046642	0.046642
46	S2NihssArrivalSensory	0.044662	0.044662
11	Onset Time Known Type_NK	0.042662	0.042662
2	Hosp_3	-0.034359	0.034359
7	Male	0.032534	0.032534
48	S2NihssArrivalDysarthria	0.031809	0.031809
20	Co-mordity	-0.029874	0.029874
22	Antiplatelet_1	-0.026909	0.026909
31	Stroke severity group_5. Severe	-0.019006	0.019006
15	Congestive HF	0.016522	0.016522
18	Diabetes	0.013792	0.013792
44	S2NihssArrivalMotorLegRight	-0.009994	0.009994
26	Anticoag before stroke_NK	-0.006123	0.006123
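
In the table above, Stroke Type_I and Stroke Type_PIH have equal and opposite weights (1.121156 and -1.121156). That is the pattern expected when two columns are complementary one-hot encodings of the same variable, which appears to be the case here, since their group means sum to 1 in the earlier summary. Such columns carry the same information, so one of each pair is a natural candidate for dropping. The sketch below shows the check on a toy DataFrame with made-up values; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy one-hot columns with made-up values; names mimic the stroke dataset
toy = pd.DataFrame({
    'Stroke Type_I':   [1, 1, 0, 1, 0, 1],
    'Stroke Type_PIH': [0, 0, 1, 0, 1, 0],
    'Age':             [63, 85, 91, 90, 69, 72],
})

# Complementary one-hot columns are perfectly negatively correlated
corr = toy['Stroke Type_I'].corr(toy['Stroke Type_PIH'])
print(f'Correlation = {corr:.3f}')

# Dropping one column of the pair removes the redundancy
toy_reduced = toy.drop('Stroke Type_PIH', axis=1)
print(list(toy_reduced))
```

More generally, when features are strongly correlated with one another (as the total S2NihssArrival score and its component scores are likely to be), the model can share weight between them in ways that are hard to interpret, so individual weights should be read with some caution.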