30 K-fold validation (Titanic Dataset)

We have mentioned k-fold validation a few times throughout the exercises.

We split the data into a number of ‘folds’ - let’s say 10. This gives us 10 subsets, each containing 10% of the whole dataset.

We train on 90% of the data and use the remaining 10% as our validation dataset.

We record the validation score, then choose a different 10% as the validation dataset and retrain the model on the remaining 90%.

We repeat this until every ‘fold’ has been used as the validation dataset exactly once.
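
As a rough sketch of what this loop looks like in code (using a hypothetical toy dataset purely for illustration; the Titanic data is loaded further below), sklearn’s KFold object hands us the train/validation indices for each fold:

import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy dataset for illustration only: 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

kf = KFold(n_splits=5)

for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy)):
    # Each iteration: 80% of samples for training, 20% held out for validation
    print(f"Fold {fold}: train={train_idx}, validate={val_idx}")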


Note: Here, we apply k-fold validation after splitting out a final testing dataset to use later.

K-fold validation replaces the train/validate data split.


There are various helper functions within sklearn to allow us to undertake k-fold validation.

from sklearn.model_selection import train_test_split, cross_validate, cross_val_predict, \
                                    StratifiedKFold, KFold
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, classification_report
import matplotlib.pyplot as plt
try:
    data = pd.read_csv("data/processed_data.csv")

except FileNotFoundError:
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'

    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory = './data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

data = data.astype(float)

# Drop PassengerId (axis=1 indicates we are removing a column rather than a row)
# We drop PassengerId as it is an arbitrary identifier rather than informative data

data.drop('PassengerId', inplace=True, axis=1)

X = data.drop('Survived', axis=1) # X = all 'data' except the 'Survived' column
y = data['Survived'] # y = 'Survived' column from 'data'

feature_names = X.columns.tolist()

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training/Validation Dataset Samples: {len(X_train_val)}")
print(f"Testing Dataset Samples: {len(X_test)}")
Training/Validation Dataset Samples: 712
Testing Dataset Samples: 179

Now we can use the cross_validate function to set up and score our models.

For a classifier, it will default to using ‘stratified’ folds, where the percentage of samples within each class is kept as consistent as possible across all folds.
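
As a quick check of what stratification means here (a minimal sketch, assuming the X_train_val and y_train_val created above), we can count the class balance in each validation fold:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10)

for fold, (_, val_idx) in enumerate(skf.split(X_train_val, y_train_val)):
    # Proportion of each class within this fold's validation portion
    proportions = y_train_val.iloc[val_idx].value_counts(normalize=True)
    print(f"Fold {fold}: {proportions.round(2).to_dict()}")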

model = XGBClassifier(random_state=42)

scores = cross_validate(
    model, X_train_val, y_train_val,
    scoring=['accuracy', 'f1', 'roc_auc', 'precision_macro', 'recall_macro'],
    n_jobs=-1,
    cv=10
    )

scores_df = pd.DataFrame(scores)
scores_df.drop(columns=["fit_time", "score_time"]).plot(kind="box", figsize=(15, 6))
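
Alongside the box plot, it can be helpful to summarise each metric numerically; one possible approach (not part of the original code) is:

# Mean and standard deviation of each metric across the 10 folds
scores_df.drop(columns=["fit_time", "score_time"]).agg(['mean', 'std']).round(3)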

Let’s repeat this with a smaller number of folds.

model = XGBClassifier(random_state=42)

scores = cross_validate(
    model, X_train_val, y_train_val,
    scoring=['accuracy', 'f1', 'roc_auc', 'precision_macro', 'recall_macro'],
    n_jobs=-1,
    cv=5
    )

scores_df_5_fold = pd.DataFrame(scores)

scores_df_5_fold
   fit_time  score_time  test_accuracy   test_f1  test_roc_auc  test_precision_macro  test_recall_macro
0  0.146062    0.042184       0.797203  0.707071      0.852164              0.791950           0.767894
1  0.195811    0.051646       0.797203  0.743363      0.847170              0.784504           0.793383
2  0.244153    0.046421       0.845070  0.784314      0.882552              0.838271           0.826797
3  0.241028    0.049274       0.816901  0.754717      0.862943              0.804325           0.804325
4  0.261768    0.046865       0.816901  0.754717      0.854167              0.806838           0.802189
scores_df_5_fold.drop(columns=["fit_time", "score_time"]).plot(kind="box", figsize=(15, 6))

Let’s compare what we see.

scores_df_5_fold['folds'] = 5
scores_df['folds'] = 10

pd.concat([scores_df_5_fold, scores_df]).drop(columns=["fit_time", "score_time"]).plot(
    kind="box", figsize=(20, 6), by='folds'
    )
[Figure: box plots of test_accuracy, test_f1, test_precision_macro, test_recall_macro and test_roc_auc, grouped by number of folds (5 vs 10)]

Roughly how many datapoints do we get in our validation portion of the dataset when we do this?

len(X_train_val) / 10
71.2

Roughly how many examples of each class would there be in our resulting datasets?

y_train_val.value_counts()/10
0.0    44.4
1.0    26.8
Name: Survived, dtype: float64

Let’s compare this with 5-fold validation.

len(X_train_val) / 5
142.4
y_train_val.value_counts()/5
0.0    88.8
1.0    53.6
Name: Survived, dtype: float64

31 Prediction

We can still create predictions when using cross-validation: cross_val_predict returns, for each sample, the prediction made when that sample was in the validation fold.

y_pred_10 = cross_val_predict(
    model,
    X_train_val,
    y_train_val,
    cv=10,
)

y_pred_5 = cross_val_predict(
    model,
    X_train_val,
    y_train_val,
    cv=5,
)

This allows us to create outputs like confusion matrices.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 7))

confusion_matrix_10 = ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(
        y_true=y_train_val,
        y_pred=y_pred_10
        ),
    display_labels=["Died", "Survived"]
)

confusion_matrix_10.plot(ax=ax1)
ax1.set_title("10-fold CV")

confusion_matrix_5 = ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(
        y_true=y_train_val,
        y_pred=y_pred_5
        ),
    display_labels=["Died", "Survived"]
)

confusion_matrix_5.plot(ax=ax2)
ax2.set_title("5-fold CV")
[Figure: side-by-side confusion matrices for 10-fold CV and 5-fold CV]

Note the slight variation in performance here.
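
To put a number on that variation, we could (for example) compare the pooled out-of-fold accuracy for the two settings; accuracy_score here is an extra import, not used elsewhere in this section:

from sklearn.metrics import accuracy_score

# Accuracy of the pooled out-of-fold predictions for each CV setting
print(f"10-fold CV accuracy: {accuracy_score(y_train_val, y_pred_10):.3f}")
print(f"5-fold CV accuracy:  {accuracy_score(y_train_val, y_pred_5):.3f}")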

We can also generate performance reports.

pd.DataFrame(
    classification_report(y_train_val, y_pred_5, output_dict=True)
).round(3)
                0.0      1.0  accuracy  macro avg  weighted avg
precision     0.844    0.764     0.815      0.804         0.813
recall        0.863    0.735     0.815      0.799         0.815
f1-score      0.853    0.749     0.815      0.801         0.814
support     444.000  268.000     0.815    712.000       712.000

32 Further Control

By constructing the k-fold splitter ourselves (here StratifiedKFold), we have further control over the process.

Here, we are going to ensure the data is shuffled before use, and set a random seed so the process is replicable.

scores = cross_validate(
    model, X_train_val, y_train_val,
    scoring=['accuracy', 'f1', 'roc_auc', 'precision_macro', 'recall_macro'],
    n_jobs=-1,
    cv=StratifiedKFold(5, shuffle=True, random_state=42)
    )

scores_df_5_fold_shuffled_42 = pd.DataFrame(scores)

scores_df_5_fold_shuffled_42
   fit_time  score_time  test_accuracy   test_f1  test_roc_auc  test_precision_macro  test_recall_macro
0  1.316582    0.346691       0.783217  0.704762      0.831149              0.770354           0.763941
1  3.018813    0.269090       0.874126  0.823529      0.898772              0.874342           0.855181
2  1.302990    0.355363       0.823944  0.757282      0.853297              0.813913           0.806127
3  0.987926    0.197368       0.781690  0.704762      0.837821              0.766880           0.764787
4  0.566683    0.074188       0.830986  0.750000      0.828283              0.838571           0.799242

Let’s take a look at the impact of changing the random seed.

scores = cross_validate(
    model, X_train_val, y_train_val,
    scoring=['accuracy', 'f1', 'roc_auc', 'precision_macro', 'recall_macro'],
    n_jobs=-1,
    cv=StratifiedKFold(5, shuffle=True, random_state=101)
    )

scores_df_5_fold_shuffled_101 = pd.DataFrame(scores)

scores_df_5_fold_shuffled_101
   fit_time  score_time  test_accuracy   test_f1  test_roc_auc  test_precision_macro  test_recall_macro
0  0.221575    0.176622       0.776224  0.698113      0.815543              0.762363           0.758323
1  1.539527    0.315225       0.825175  0.752475      0.866729              0.820922           0.801290
2  1.212714    0.612245       0.760563  0.679245      0.810473              0.744117           0.744117
3  1.664313    0.329696       0.802817  0.740741      0.896544              0.788924           0.793089
4  1.536362    0.269400       0.838028  0.772277      0.872475              0.835946           0.815657
scores_df_5_fold_shuffled_42['seed'] = 42
scores_df_5_fold_shuffled_101['seed'] = 101

pd.concat([scores_df_5_fold_shuffled_42, scores_df_5_fold_shuffled_101]) \
    .drop(columns=["fit_time", "score_time"]) \
    .plot(kind="box", figsize=(20, 6), by='seed')
[Figure: box plots of test_accuracy, test_f1, test_precision_macro, test_recall_macro and test_roc_auc, grouped by random seed (42 vs 101)]

32.1 Leave-one-out

An extreme version of cross-validation is leave-one-out validation, where a single datapoint at a time is held out to evaluate the model, and the rest of the data is used for training.

This can be useful for very small datasets, but it is much more computationally intensive, as one model is fitted per sample.

Leave-one-out tends to give us a more realistic idea of performance in the real world, but the individual fold scores can have high variance.
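
sklearn also provides a dedicated LeaveOneOut splitter, which is equivalent to KFold with n_splits equal to the number of samples; a minimal sketch (using cross_val_score, an extra import for this example):

from sklearn.model_selection import LeaveOneOut, cross_val_score

# Each 'fold' is a single sample, so the mean per-fold accuracy
# equals the overall accuracy of the pooled predictions
loo_scores = cross_val_score(model, X_train_val, y_train_val,
                             scoring='accuracy', cv=LeaveOneOut(), n_jobs=-1)
print(f"Leave-one-out accuracy: {loo_scores.mean():.3f}")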

scores = cross_validate(
    model, X_train_val, y_train_val,
    scoring=['accuracy', 'f1', 'roc_auc', 'precision_macro', 'recall_macro'],
    n_jobs=-1,
    cv=KFold(n_splits=len(X_train_val))  # one fold per sample = leave-one-out
    )

scores_df_leave_one_out = pd.DataFrame(scores)

scores_df_leave_one_out
     fit_time  score_time  test_accuracy  test_f1  test_roc_auc  test_precision_macro  test_recall_macro
0    1.777401    2.717442            1.0      0.0           NaN                   1.0                1.0
1    3.770272    2.818617            0.0      0.0           NaN                   0.0                0.0
2    3.055812    0.985970            1.0      0.0           NaN                   1.0                1.0
3    4.052561    1.256033            1.0      0.0           NaN                   1.0                1.0
4    3.003047    2.907689            0.0      0.0           NaN                   0.0                0.0
..        ...         ...            ...      ...           ...                   ...                ...
707  0.062687    0.026082            1.0      1.0           NaN                   1.0                1.0
708  0.117362    0.037119            1.0      0.0           NaN                   1.0                1.0
709  0.160492    0.026065            1.0      0.0           NaN                   1.0                1.0
710  0.156961    0.022564            1.0      1.0           NaN                   1.0                1.0
711  0.063164    0.018045            1.0      0.0           NaN                   1.0                1.0

712 rows × 7 columns

Note that test_roc_auc is NaN for every fold: with only a single sample in each validation set, a ROC curve cannot be computed.

y_pred_leave_one_out = cross_val_predict(
    model,
    X_train_val,
    y_train_val,
    cv=KFold(n_splits=len(X_train_val)),  # one fold per sample = leave-one-out
)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 7))

confusion_matrix_loo = ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(
        y_true=y_train_val,
        y_pred=y_pred_leave_one_out
        ),
    display_labels=["Died", "Survived"]
)

confusion_matrix_loo.plot(ax=ax1)
ax1.set_title("Leave One Out")

confusion_matrix_5 = ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(
        y_true=y_train_val,
        y_pred=y_pred_5
        ),
    display_labels=["Died", "Survived"]
)

confusion_matrix_5.plot(ax=ax2)
ax2.set_title("5-fold CV")
[Figure: side-by-side confusion matrices for leave-one-out and 5-fold CV]

We can also generate other reports here.

pd.DataFrame(
    classification_report(y_train_val, y_pred_leave_one_out,
                          output_dict=True)
)
                  0.0         1.0  accuracy   macro avg  weighted avg
precision    0.839912    0.761719  0.811798    0.800816      0.810480
recall       0.862613    0.727612  0.811798    0.795112      0.811798
f1-score     0.851111    0.744275  0.811798    0.797693      0.810897
support    444.000000  268.000000  0.811798  712.000000    712.000000
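
Finally, remember the testing dataset we split off at the start of this section has not been touched yet. A minimal sketch of that last step (not shown in the original) is to fit once on all of the training/validation data and score on the held-out test set:

# Fit a final model on all training/validation data and
# evaluate once on the held-out test set
final_model = XGBClassifier(random_state=42)
final_model.fit(X_train_val, y_train_val)

y_pred_test = final_model.predict(X_test)
print(classification_report(y_test, y_pred_test))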