HSMA - Machine Learning Notebooks
Preface
This is a collection of the notebooks making up module 4 in the HSMA programme.
Links to the lecture videos and slides can be found below, including for sessions 4A, 4C and 4H, which did not have notebooks.
4A - An Introduction to Machine Learning
In this session we introduce some of the core concepts of AI and Machine Learning, including features and labels, overfitting and underfitting, and assessing model performance. We also explore some of the different types of machine learning, and practice our understanding of these new concepts by seeing if we can unpick patterns in Dan’s film preferences for “Dan’s Desert Island DVDs”.
4B - Logistic Regression
In this session we’ll begin exploring some of the Machine Learning approaches that we can use, starting with Logistic Regression - a way of fusing together traditional linear regression models with a logistic function to create a powerful classifier model. You’ll see how these models can be implemented in Python, and practice using both the Titanic dataset you saw earlier in the course and a new stroke patient dataset.
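As a flavour of what the notebooks cover, here is a minimal sketch of fitting a logistic regression classifier with sklearn; the file name and column names below are placeholders for illustration, not the actual course datasets.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data: 'data.csv' and the column names are illustrative only.
df = pd.read_csv("data.csv")
X = df.drop(columns="survived")   # features
y = df["survived"]                # label (0 = did not survive, 1 = survived)

# Hold back a test set so we can assess performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

print(accuracy_score(y_test, model.predict(X_test)))
```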
4C - Ethics in AI
In this session we’ll explore some of the key ethical considerations that are fundamental to any machine learning work, as we explore what can (and will) go wrong.
4D - Decision Trees and Random Forests
In this session we’ll begin looking at how decision trees are built and how we can use the sklearn implementation of decision trees on our own datasets.
We also recap sensitivity (recall), specificity and precision, and how to calculate these in sklearn.
Next, we’ll take a look at how we can avoid some of the problems of decision trees by using an ensemble method - random forests.
We also find an easier way of calculating sensitivity, specificity and precision with a single function, hear about a new metric called the F1 score, and learn about the confusion matrix, a powerful tool for breaking down how different models perform.
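A minimal sketch of the sklearn side of this, assuming X_train, X_test, y_train and y_test have already been created with train_test_split:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Assumes X_train, X_test, y_train, y_test already exist.
tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

y_pred = forest.predict(X_test)

# Precision, recall (sensitivity) and F1 score for each class in one call...
print(classification_report(y_test, y_pred))

# ...and the confusion matrix to break down exactly where the model goes wrong.
print(confusion_matrix(y_test, y_pred))
```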
4E - Boosted Trees for Classification and Regression
In this session we’ll take a look at a family of models called boosted trees. These are a very powerful type of algorithm that perform extremely well on tabular datasets. The session touches on XGBoost, AdaBoost, CatBoost, Histogram-based gradient boosting classifiers and LightGBM.
Next, we take a look at how decision trees, random forests and boosted trees can also be used when you want to predict a numeric value instead of classifying a sample as a member of one group or another.
We also touch on some key parts of data preprocessing so we can work with a new patient length of stay dataset in the final exercise, covering how to one-hot encode categorical variables to make the data usable with machine learning libraries.
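A rough sketch of this kind of regression workflow, using one of the boosted tree implementations built into sklearn; the file and column names here are invented purely for illustration.

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical length-of-stay style data; file and column names are placeholders.
df = pd.read_csv("los_data.csv")

# One-hot encode categorical columns so the model receives purely numeric input.
df = pd.get_dummies(df, columns=["admission_type", "specialty"])

X = df.drop(columns="length_of_stay")
y = df["length_of_stay"]          # a numeric target, so this is a regression problem
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

reg = HistGradientBoostingRegressor().fit(X_train, y_train)
print(mean_absolute_error(y_test, reg.predict(X_test)))
```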
4F - Neural Networks
In this session we’ll be looking at Deep Learning, the subfield of AI that has dominated many of the biggest advances of recent years, as we introduce Neural Networks.
4G - Explainable AI
In the first part of the Explainable AI session, we explore
- why explainability is important in AI models
- what we mean by explainability
- the difference between correlation and causation
Next, we explore
- how we can extract feature importance from a logistic regression model
- how to interpret the coefficients from logistic regression models
- the relationship between odds, log odds and probability
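For example, the coefficients of a fitted sklearn LogisticRegression sit on the log-odds scale, so exponentiating them gives odds ratios. A small sketch, assuming a fitted model and a DataFrame of training features already exist:

```python
import numpy as np
import pandas as pd

# Assumes 'model' is a fitted sklearn LogisticRegression and X_train is a DataFrame.
coefs = pd.Series(model.coef_[0], index=X_train.columns)

# Coefficients are log odds; exponentiating gives odds ratios, i.e. the
# multiplicative change in the odds of the outcome for a one-unit increase
# in the feature (holding the others constant).
odds_ratios = np.exp(coefs)
print(odds_ratios.sort_values(ascending=False))
```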
Then, we move on to
- feature importance for tree-based models with Mean Decrease in Impurity (MDI)
- model-agnostic feature importance with permutation feature importance (PFI)
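Permutation feature importance is available directly in sklearn; a minimal sketch, assuming a fitted model and a held-out test set:

```python
from sklearn.inspection import permutation_importance

# Assumes 'model' is any fitted estimator and X_test, y_test are a held-out set.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=42)

# The mean drop in score when each feature is shuffled - a bigger drop means
# the model relies more heavily on that feature.
for name, importance in zip(X_test.columns, result.importances_mean):
    print(f"{name}: {importance:.3f}")
```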
In the next part of the Explainable AI session, we explore
- the partial dependence plot (PDP)
- the individual conditional expectation plot (ICE)
- ways of enhancing these plots
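Both plots can be produced with sklearn's PartialDependenceDisplay; a small sketch, assuming a fitted model and training features, with an illustrative feature name:

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Assumes 'model' is a fitted estimator and 'age' is a column in X_train
# (the feature name is a placeholder).
PartialDependenceDisplay.from_estimator(
    model, X_train, features=["age"],
    kind="both")  # "average" gives the PDP, "individual" the ICE lines, "both" overlays them
plt.show()
```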
In the following part of the Explainable AI session, we explore
- what Shapley values are
- how the shap library allows us to look at global and local feature importance
- how to create and interpret different shap plots
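A minimal sketch of the shap workflow for a tree-based regression model, assuming the model and test features already exist:

```python
import shap

# Assumes 'model' is a fitted tree-based regression model (e.g. a random forest
# regressor) and X_test is a pandas DataFrame of features.
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_test)

shap.plots.beeswarm(shap_values)       # global view: feature impact across all rows
shap.plots.waterfall(shap_values[0])   # local view: contributions for a single prediction
```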
In the final part of the Explainable AI session, we explore
- why calculating prediction uncertainty may be useful
- how to calculate and show prediction uncertainty
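One simple way of illustrating the idea (a sketch only, not necessarily the approach used in the session) is to look at how much the individual trees in a fitted random forest regressor disagree for each sample:

```python
import numpy as np

# Assumes 'forest' is a fitted RandomForestRegressor and X_test already exists.
# Each tree in the ensemble gives its own prediction for every sample.
per_tree = np.stack([tree.predict(X_test) for tree in forest.estimators_])

mean_pred = per_tree.mean(axis=0)                        # the usual ensemble prediction
lower, upper = np.percentile(per_tree, [5, 95], axis=0)  # rough 90% interval per sample

print(mean_pred[:5])
print(lower[:5])
print(upper[:5])
```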
4H - Reinforcement Learning
In this session we take a look at Reinforcement Learning, in a format that will be very different to anything you’ve experienced so far.
Note that we’d recommend not looking at the slides until after you’ve played the reinforcement learning game manually for the first time.
App Link: https://bergam0t.github.io/ReinforcementLearningGame/
App Github Repository: https://github.com/Bergam0t/ReinforcementLearningGame
4I - Synthetic Data using SMOTE
In this session we take a look at synthetic data: how to create our own fake but realistic data, whether to augment an underrepresented class or to use in place of our real data altogether.
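A minimal sketch of SMOTE from the imbalanced-learn (imblearn) library, assuming X_train and y_train already exist and one class is underrepresented:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# Class counts before resampling (assumes X_train and y_train already exist).
print(Counter(y_train))

# SMOTE creates new synthetic samples of the minority class by interpolating
# between existing minority-class neighbours.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Class counts after resampling - the classes are now balanced.
print(Counter(y_resampled))
```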
4J - Optimising ML: Imputation, Feature Engineering & Selection, Hyperparameters
Unfortunately, the first 5 minutes or so of the lecture were not recorded.
This session covers a range of ways to improve your model’s performance, including:
- Missing Data Imputation with SimpleImputer and IterativeImputer
- Feature Selection with SequentialFeatureSelector (forward and backward selection) and SelectFromModel (feature importance selection with model coefficients or mean decrease in impurity)
- Feature Engineering
- Dataset Splits (train/test/validation, k-fold)
- Dealing with Imbalanced Datasets with model parameters
- Hyperparameter tuning with exhaustive grid search, randomised search, and the Optuna framework
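As a flavour, here is a small sketch combining two of the ideas above: imputing missing values with SimpleImputer and tuning a model with an exhaustive grid search. The parameter grid is a placeholder, and X_train and y_train are assumed to exist.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV

# Fill missing values with the median of each column (assumes X_train, y_train exist).
imputer = SimpleImputer(strategy="median")
X_train_imputed = imputer.fit_transform(X_train)

# Exhaustive grid search over a small, illustrative hyperparameter grid,
# using 5-fold cross-validation.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="f1")
search.fit(X_train_imputed, y_train)

print(search.best_params_)
print(search.best_score_)
```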
Additional areas in the slides, but not covered in the video, are:
- ensemble models
- sklearn pipelines
- automatic model selection with the flaml library
- model calibration curves (reliability plots)