HSMA - Machine Learning Notebooks
Preface
This is a collection of the notebooks making up module 4 in the HSMA programme.
Links to the lecture videos and slides can be found below, including for sessions 4A, 4C and 4H, which did not have notebooks.
4A - An Introduction to Machine Learning
In this session we introduce some of the core concepts of AI and Machine Learning, including features and labels, overfitting and underfitting, and assessing model performance. We also explore some of the different types of machine learning, and practice our understanding of these new concepts by seeing if we can unpick patterns in Dan’s film preferences for “Dan’s Desert Island DVDs”.
4B - Logistic Regression
In this session we’ll begin exploring some of the Machine Learning approaches that we can use, starting with Logistic Regression - a way of fusing together traditional linear regression models with a logistic function to create a powerful classifier model. You’ll see how these models can be implemented in Python, and practice using both the Titanic dataset you saw earlier in the course and a new stroke patient dataset.
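As a flavour of what the notebooks cover, here is a minimal sketch of fitting a logistic regression classifier with sklearn; the file name and column names below are placeholders for illustration, not the actual course datasets.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data: 'data.csv' and the column names are illustrative only.
df = pd.read_csv("data.csv")
X = df.drop(columns="survived")   # features
y = df["survived"]                # label (0 = did not survive, 1 = survived)

# Hold back a test set so we can assess performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

print(accuracy_score(y_test, model.predict(X_test)))
```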
4C - Ethics in AI
In this session we’ll explore some of the key ethical considerations that are fundamental to any machine learning work, as we explore what can (and will) go wrong.
4D - Decision Trees and Random Forests
In this session we’ll begin looking at how decision trees are built and how we can use the sklearn implementation of decision trees on our own datasets.
We also recap sensitivity (recall), specificity and precision, and how to calculate these in sklearn.
Next, we’ll take a look at how we can avoid some of the problems of decision trees by using an ensemble method - random forests.
We also find an easier way of calculating sensitivity, specificity and precision with a single function, hear about a new metric called the F1 score, and learn about the confusion matrix, a powerful tool for breaking down how different models perform.
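A minimal sketch of the sklearn side of this, assuming X_train, X_test, y_train and y_test have already been created with train_test_split:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Assumes X_train, X_test, y_train, y_test already exist.
tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

y_pred = forest.predict(X_test)

# Precision, recall (sensitivity) and F1 score for each class in one call...
print(classification_report(y_test, y_pred))

# ...and the confusion matrix to break down exactly where the model goes wrong.
print(confusion_matrix(y_test, y_pred))
```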
4E - Boosted Trees for Classification and Regression
In this session we’ll take a look at a family of models called boosted trees. These are a very powerful type of algorithm that perform extremely well on tabular datasets. The session touches on XGBoost, AdaBoost, CatBoost, Histogram-based gradient boosting classifiers and LightGBM.
Next, we take a look at how decision trees, random forests and boosted trees can also be used when you want to predict a numeric value instead of classifying a sample as a member of one group or another.
We also touch on some key parts of data preprocessing so we can work with a new patient length of stay dataset in the final exercise, covering how to one-hot encode categorical variables to make the data usable with machine learning libraries.
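A rough sketch of this kind of regression workflow, using one of the boosted tree implementations built into sklearn; the file and column names here are invented purely for illustration.

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical length-of-stay style data; file and column names are placeholders.
df = pd.read_csv("los_data.csv")

# One-hot encode categorical columns so the model receives purely numeric input.
df = pd.get_dummies(df, columns=["admission_type", "specialty"])

X = df.drop(columns="length_of_stay")
y = df["length_of_stay"]          # a numeric target, so this is a regression problem
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

reg = HistGradientBoostingRegressor().fit(X_train, y_train)
print(mean_absolute_error(y_test, reg.predict(X_test)))
```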
4F - Neural Networks
In this session we’ll be looking at Deep Learning, the subfield of AI that has dominated many of the biggest advances of recent years, as we introduce Neural Networks.
4G - Explainable AI
In the first part of the Explainable AI session, we explore
- why explainability is important in AI models
- what we mean by explainability
- the difference between correlation and causation
Next, we explore
- how we can extract feature importance from a logistic regression model
- how to interpret the coefficients from logistic regression models
- the relationship between odds, log odds and probability
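For example, the coefficients of a fitted sklearn LogisticRegression sit on the log-odds scale, so exponentiating them gives odds ratios. A small sketch, assuming a fitted model and a DataFrame of training features already exist:

```python
import numpy as np
import pandas as pd

# Assumes 'model' is a fitted sklearn LogisticRegression and X_train is a DataFrame.
coefs = pd.Series(model.coef_[0], index=X_train.columns)

# Coefficients are log odds; exponentiating gives odds ratios, i.e. the
# multiplicative change in the odds of the outcome for a one-unit increase
# in the feature (holding the others constant).
odds_ratios = np.exp(coefs)
print(odds_ratios.sort_values(ascending=False))
```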
Then, we move on to
- feature importance for tree-based models with Mean Decrease in Impurity (MDI)
- model-agnostic feature importance with permutation feature importance (PFI)
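Permutation feature importance is available directly in sklearn; a minimal sketch, assuming a fitted model and a held-out test set:

```python
from sklearn.inspection import permutation_importance

# Assumes 'model' is any fitted estimator and X_test, y_test are a held-out set.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=42)

# The mean drop in score when each feature is shuffled - a bigger drop means
# the model relies more heavily on that feature.
for name, importance in zip(X_test.columns, result.importances_mean):
    print(f"{name}: {importance:.3f}")
```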
In the next part of the Explainable AI session, we explore
- the partial dependence plot (PDP)
- the individual conditional expectation plot (ICE)
- ways of enhancing these plots
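Both plots can be produced with sklearn's PartialDependenceDisplay; a small sketch, assuming a fitted model and training features, with an illustrative feature name:

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Assumes 'model' is a fitted estimator and 'age' is a column in X_train
# (the feature name is a placeholder).
PartialDependenceDisplay.from_estimator(
    model, X_train, features=["age"],
    kind="both")  # "average" gives the PDP, "individual" the ICE lines, "both" overlays them
plt.show()
```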
In the following part of the Explainable AI session, we explore
- what Shapley values are
- how the shap library allows us to look at global and local feature importance
- how to create and interpret different shap plots
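A minimal sketch of the shap workflow for a tree-based regression model, assuming the model and test features already exist:

```python
import shap

# Assumes 'model' is a fitted tree-based regression model (e.g. a random forest
# regressor) and X_test is a pandas DataFrame of features.
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_test)

shap.plots.beeswarm(shap_values)       # global view: feature impact across all rows
shap.plots.waterfall(shap_values[0])   # local view: contributions for a single prediction
```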
In the final part of the Explainable AI session, we explore
- why calculating prediction uncertainty may be useful
- how to calculate and show prediction uncertainty
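One simple way of illustrating the idea (a sketch only, not necessarily the approach used in the session) is to look at how much the individual trees in a fitted random forest regressor disagree for each sample:

```python
import numpy as np

# Assumes 'forest' is a fitted RandomForestRegressor and X_test already exists.
# Each tree in the ensemble gives its own prediction for every sample.
per_tree = np.stack([tree.predict(X_test) for tree in forest.estimators_])

mean_pred = per_tree.mean(axis=0)                        # the usual ensemble prediction
lower, upper = np.percentile(per_tree, [5, 95], axis=0)  # rough 90% interval per sample

print(mean_pred[:5])
print(lower[:5])
print(upper[:5])
```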
4H - Reinforcement Learning
In this session we take a look at Reinforcement Learning, in a format that will be very different to anything you’ve experienced so far.
Note that we’d recommend not looking at the slides until after you’ve played the reinforcement learning game manually for the first time.
App Link: https://bergam0t.github.io/ReinforcementLearningGame/
App Github Repository: https://github.com/Bergam0t/ReinforcementLearningGame
4I - Synthetic Data using SMOTE
In this session we take a look at synthetic data: how to create our own fake but realistic data, whether to augment an underrepresented class or to use in place of our real data altogether.
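A minimal sketch of SMOTE from the imbalanced-learn (imblearn) library, assuming X_train and y_train already exist and one class is underrepresented:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# Class counts before resampling (assumes X_train and y_train already exist).
print(Counter(y_train))

# SMOTE creates new synthetic samples of the minority class by interpolating
# between existing minority-class neighbours.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Class counts after resampling - the classes are now balanced.
print(Counter(y_resampled))
```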
4J - Optimising ML: Imputation, Feature Engineering & Selection, Hyperparameters
Unfortunately, the first 5 minutes or so of the lecture were not recorded.
This session covers a range of ways to improve your model’s performance, including:
- Missing Data Imputation with SimpleImputer and IterativeImputer
- Feature Selection with SequentialFeatureSelector (forward and backward selection) and SelectFromModel (feature importance selection with model coefficients or mean decrease in impurity)
- Feature Engineering
- Dataset Splits (train/test/validation, k-fold)
- Dealing with Imbalanced Datasets with model parameters
- Hyperparameter tuning with exhaustive grid search, randomised search, and the Optuna framework
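As a flavour, here is a small sketch combining two of the ideas above: imputing missing values with SimpleImputer and tuning a model with an exhaustive grid search. The parameter grid is a placeholder, and X_train and y_train are assumed to exist.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV

# Fill missing values with the median of each column (assumes X_train, y_train exist).
imputer = SimpleImputer(strategy="median")
X_train_imputed = imputer.fit_transform(X_train)

# Exhaustive grid search over a small, illustrative hyperparameter grid,
# using 5-fold cross-validation.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="f1")
search.fit(X_train_imputed, y_train)

print(search.best_params_)
print(search.best_score_)
```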
Additional areas in the slides, but not covered in the video, are:
- ensemble models
- sklearn pipelines
- automatic model selection with the flaml library
- model calibration curves (reliability plots)