The data loaded in this exercise covers patients at seven acute stroke units, and records whether each patient received clot-busting treatment for stroke. There are many features; a description of each can be found in the file stroke_data_feature_descriptions.csv.
Train a Logistic Regression model to try to predict whether or not a stroke patient receives clot-busting treatment. Use the prompts below to write each section of code.
What do you conclude are the most important features for predicting whether a patient receives clot-busting treatment? Can you improve accuracy by changing the size of your train / test split? If you have time, consider dropping some features from your data based on your outputs (in the same way you dropped passengerID in the Titanic example). Don’t forget that you’ll need to rerun all subsequent cells if you make changes like that.
import pandas as pd
import numpy as np

# Import machine learning methods
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Download data
# (not required if running locally and have previously downloaded data)

download_required = True

if download_required:

    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
        '2004_titanic/master/jupyter_notebooks/data/hsma_stroke.csv'

    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory = './data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data to data subfolder
    data.to_csv(data_directory + 'hsma_stroke.csv', index=False)

# Load data
data = pd.read_csv('data/hsma_stroke.csv')

# Make all data 'float' type
data = data.astype(float)

# Show data
data.head()
| | Clotbuster given | Hosp_1 | Hosp_2 | Hosp_3 | Hosp_4 | Hosp_5 | Hosp_6 | Hosp_7 | Male | Age | ... | S2NihssArrivalFacialPalsy | S2NihssArrivalMotorArmLeft | S2NihssArrivalMotorArmRight | S2NihssArrivalMotorLegLeft | S2NihssArrivalMotorLegRight | S2NihssArrivalLimbAtaxia | S2NihssArrivalSensory | S2NihssArrivalBestLanguage | S2NihssArrivalDysarthria | S2NihssArrivalExtinctionInattention |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 63.0 | ... | 3.0 | 4.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 85.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 91.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 90.0 | ... | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 69.0 | ... | 2.0 | 0.0 | 4.0 | 1.0 | 4.0 | 0.0 | 1.0 | 2.0 | 2.0 | 1.0 |

5 rows × 51 columns
# Look at overview of data
data.describe()
| | Clotbuster given | Hosp_1 | Hosp_2 | Hosp_3 | Hosp_4 | Hosp_5 | Hosp_6 | Hosp_7 | Male | Age | ... | S2NihssArrivalFacialPalsy | S2NihssArrivalMotorArmLeft | S2NihssArrivalMotorArmRight | S2NihssArrivalMotorLegLeft | S2NihssArrivalMotorLegRight | S2NihssArrivalLimbAtaxia | S2NihssArrivalSensory | S2NihssArrivalBestLanguage | S2NihssArrivalDysarthria | S2NihssArrivalExtinctionInattention |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1862.000000 | 1862.000000 | 1862.000000 | 1862.000000 | 1862.000000 | 1862.000000 | 1862.000000 | 1862.000000 | 1862.000000 | 1862.000000 | ... | 1862.000000 | 1862.000000 | 1862.000000 | 1862.000000 | 1862.000000 | 1862.000000 | 1862.000000 | 1862.000000 | 1862.000000 | 1862.000000 |
| mean | 0.403330 | 0.159506 | 0.142320 | 0.154672 | 0.165414 | 0.055854 | 0.113319 | 0.208915 | 0.515575 | 74.553706 | ... | 1.114930 | 1.002148 | 0.963480 | 0.963480 | 0.910849 | 0.216971 | 0.610097 | 0.944146 | 0.739527 | 0.566595 |
| std | 0.490698 | 0.366246 | 0.349472 | 0.361689 | 0.371653 | 0.229701 | 0.317068 | 0.406643 | 0.499892 | 12.280576 | ... | 0.930527 | 1.479211 | 1.441594 | 1.406501 | 1.380606 | 0.522643 | 0.771932 | 1.121379 | 0.731083 | 0.794000 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 40.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 67.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 76.000000 | ... | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 75% | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 83.000000 | ... | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 0.000000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 100.000000 | ... | 3.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 2.000000 |

8 rows × 51 columns
# Look at mean feature values for those who were given a clotbuster vs those
# that weren't
mask = data['Clotbuster given'] == 1
given = data[mask]

mask = data['Clotbuster given'] == 0
not_given = data[mask]

summary = pd.DataFrame()
summary['given'] = given.mean()
summary['not given'] = not_given.mean()
summary
| | given | not given |
|---|---|---|
| Clotbuster given | 1.000000 | 0.000000 |
| Hosp_1 | 0.203728 | 0.129613 |
| Hosp_2 | 0.122503 | 0.155716 |
| Hosp_3 | 0.182423 | 0.135914 |
| Hosp_4 | 0.137150 | 0.184518 |
| Hosp_5 | 0.067909 | 0.047705 |
| Hosp_6 | 0.123835 | 0.106211 |
| Hosp_7 | 0.162450 | 0.240324 |
| Male | 0.515313 | 0.515752 |
| Age | 73.303595 | 75.398740 |
| 80+ | 0.346205 | 0.393339 |
| Onset Time Known Type_BE | 0.149134 | 0.352835 |
| Onset Time Known Type_NK | 0.010652 | 0.015302 |
| Onset Time Known Type_P | 0.840213 | 0.631863 |
| # Comorbidities | 1.053262 | 1.258326 |
| 2+ comorbidotes | 0.298269 | 0.393339 |
| Congestive HF | 0.041278 | 0.047705 |
| Hypertension | 0.464714 | 0.470747 |
| Atrial Fib | 0.186418 | 0.239424 |
| Diabetes | 0.151798 | 0.172817 |
| TIA | 0.209055 | 0.327633 |
| Co-mordity | 0.676431 | 0.732673 |
| Antiplatelet_0 | 0.097204 | 0.144014 |
| Antiplatelet_1 | 0.079893 | 0.079208 |
| Antiplatelet_NK | 0.822903 | 0.776778 |
| Anticoag before stroke_0 | 0.122503 | 0.083708 |
| Anticoag before stroke_1 | 0.046605 | 0.129613 |
| Anticoag before stroke_NK | 0.830892 | 0.786679 |
| Stroke severity group_1. No stroke symtpoms | 0.002663 | 0.041404 |
| Stroke severity group_2. Minor | 0.061252 | 0.396040 |
| Stroke severity group_3. Moderate | 0.619174 | 0.333033 |
| Stroke severity group_4. Moderate to severe | 0.182423 | 0.105311 |
| Stroke severity group_5. Severe | 0.134487 | 0.124212 |
| Stroke Type_I | 1.000000 | 0.803780 |
| Stroke Type_PIH | 0.000000 | 0.196220 |
| S2RankinBeforeStroke | 0.360852 | 0.791179 |
| S2NihssArrival | 12.515313 | 9.032403 |
| S2NihssArrivalLocQuestions | 0.850866 | 0.605761 |
| S2NihssArrivalLocCommands | 0.394141 | 0.360936 |
| S2NihssArrivalBestGaze | 0.521971 | 0.341134 |
| S2NihssArrivalVisual | 0.809587 | 0.495050 |
| S2NihssArrivalFacialPalsy | 1.407457 | 0.917192 |
| S2NihssArrivalMotorArmLeft | 1.215712 | 0.857786 |
| S2NihssArrivalMotorArmRight | 1.073236 | 0.889289 |
| S2NihssArrivalMotorLegLeft | 1.165113 | 0.827183 |
| S2NihssArrivalMotorLegRight | 0.977364 | 0.865887 |
| S2NihssArrivalLimbAtaxia | 0.214381 | 0.218722 |
| S2NihssArrivalSensory | 0.762983 | 0.506751 |
| S2NihssArrivalBestLanguage | 1.181092 | 0.783978 |
| S2NihssArrivalDysarthria | 0.902796 | 0.629163 |
| S2NihssArrivalExtinctionInattention | 0.762983 | 0.433843 |
# Divide into features and labels
X = data.drop('Clotbuster given', axis=1)
y = data['Clotbuster given']
# Divide into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
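Because `train_test_split` shuffles the data randomly, accuracy figures will vary a little from run to run. The sketch below, on a small made-up array rather than the stroke data, shows how the `random_state` argument makes a split repeatable and how `stratify` keeps the class balance similar in both sets. The names `X_demo` and `y_demo` are illustrative assumptions only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data for illustration: 20 samples, 2 features, balanced labels
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Using the same random_state twice gives the same split
a_train, a_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)
b_train, b_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)
splits_match = np.array_equal(a_test, b_test)
print(f'Splits identical: {splits_match}')

# stratify=y_demo keeps the proportion of each class similar in both sets
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)
print(f'Proportion of class 1 in training set: {y_tr_s.mean():.2f}')
print(f'Proportion of class 1 in test set: {y_te_s.mean():.2f}')
```

Fixing `random_state` is useful when comparing the effect of other changes, since differences in accuracy are then not caused by a different shuffle.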
# Standardise data
def standardise_data(X_train, X_test):

    # Initialise a new scaling object for normalising input data
    sc = StandardScaler()

    # Fit the scaler on the training set only, then apply the same
    # scaling to both the training and test sets
    train_std = sc.fit_transform(X_train)
    test_std = sc.transform(X_test)

    return train_std, test_std

X_train_std, X_test_std = standardise_data(X_train, X_test)
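The scaler should learn its means and standard deviations from the training set alone, so that no information from the test set influences the model. One possible way to enforce this is scikit-learn's `make_pipeline`, sketched below on synthetic data (an assumption for illustration, not the stroke data). The sketch also checks that the pipeline gives the same predictions as scaling by hand with a scaler fitted on the training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stroke data (illustrative assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

# The pipeline fits the scaler on the training data only, then reuses
# those training statistics whenever it predicts
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# Equivalent manual steps: fit the scaler on train, transform both sets
sc = StandardScaler().fit(X_tr)
manual = LogisticRegression().fit(sc.transform(X_tr), y_tr)

same = np.array_equal(
    pipe.predict(X_te), manual.predict(sc.transform(X_te)))
print(f'Pipeline and manual scaling agree: {same}')
```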
# Fit (train) Logistic Regression model
model = LogisticRegression()
model.fit(X_train_std, y_train)
LogisticRegression()
# Predict training and test labels, and calculate accuracy
y_pred_train = model.predict(X_train_std)
y_pred_test = model.predict(X_test_std)

accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)

print(f'Accuracy of predicting training data = {accuracy_train}')
print(f'Accuracy of predicting test data = {accuracy_test}')
Accuracy of predicting training data = 0.8202005730659025
Accuracy of predicting test data = 0.7939914163090128
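A single train / test split gives a noisy estimate of accuracy, so when comparing split sizes it can help to average over several random splits. The following is a minimal sketch of that idea on synthetic data; the helper `mean_test_accuracy`, the synthetic dataset, and the chosen test sizes are all illustrative assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stroke data (illustrative assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=5,
    n_clusters_per_class=1, random_state=0)


def mean_test_accuracy(X, y, test_size, n_repeats=20):
    """Return mean and standard deviation of test accuracy over
    several random train / test splits (hypothetical helper)."""
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        # Fit the scaler on the training set only
        sc = StandardScaler().fit(X_tr)
        model = LogisticRegression().fit(sc.transform(X_tr), y_tr)
        y_pred = model.predict(sc.transform(X_te))
        scores.append(np.mean(y_pred == y_te))
    return np.mean(scores), np.std(scores)


results = {}
for test_size in [0.1, 0.25, 0.5]:
    results[test_size] = mean_test_accuracy(X_demo, y_demo, test_size)
    mean_acc, std_acc = results[test_size]
    print(f'test_size={test_size}: '
          f'mean accuracy = {mean_acc:.3f} (sd {std_acc:.3f})')
```

If the spread between repeats is as large as the difference between split sizes, an apparent improvement from changing the split is probably just noise.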
# Examine feature weights and sort by most influential
co_eff = model.coef_[0]

co_eff_df = pd.DataFrame()
co_eff_df['feature'] = list(X)
co_eff_df['co_eff'] = co_eff
co_eff_df['abs_co_eff'] = np.abs(co_eff)
co_eff_df.sort_values(by='abs_co_eff', ascending=False, inplace=True)

co_eff_df
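Once the weights have been examined, one way to try dropping features is to remove the chosen columns, then repeat the split, scaling and fitting steps and compare test accuracy. The sketch below does this on a small made-up DataFrame; the column names (`severity`, `age`, `noise_1`, `noise_2`, `label`) and the helper `fit_and_score` are hypothetical and are not taken from the stroke data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data: two informative columns and two pure-noise columns
rng = np.random.default_rng(42)
n = 400
demo = pd.DataFrame({
    'severity': rng.normal(size=n),
    'age': rng.normal(size=n),
    'noise_1': rng.normal(size=n),
    'noise_2': rng.normal(size=n),
})
signal = demo['severity'] - 0.5 * demo['age']
demo['label'] = (signal + rng.normal(scale=0.5, size=n) > 0).astype(float)


def fit_and_score(df, drop_cols=()):
    """Drop the given columns, refit, and return the features used
    and the test accuracy (hypothetical helper)."""
    X = df.drop(columns=['label', *drop_cols])
    y = df['label']
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=42)
    sc = StandardScaler().fit(X_tr)
    model = LogisticRegression().fit(sc.transform(X_tr), y_tr)
    accuracy = np.mean(model.predict(sc.transform(X_te)) == y_te)
    return list(X), accuracy


features_all, acc_all = fit_and_score(demo)
features_kept, acc_kept = fit_and_score(
    demo, drop_cols=['noise_1', 'noise_2'])

print(f'All features {features_all}: accuracy = {acc_all:.3f}')
print(f'Reduced features {features_kept}: accuracy = {acc_kept:.3f}')
```

Using the same `random_state` for both fits means any change in accuracy comes from the dropped columns rather than from a different split.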