Comparing Machine Learning Algorithms

Machine learning models are chosen based on their mean performance, often calculated using k-fold cross-validation. The algorithm with the best mean performance is expected to be better than those algorithms with worse mean performance. But what if the difference in the mean performance is caused by a statistical due to an unlikely chance or occurrence?

The solution is to use a statistical hypothesis test to evaluate whether the difference in the mean performance between any two algorithms is valid or not.

Hypothesis Test for Comparing Algorithms

One approach is to evaluate each model on the same k-fold cross-validation split of the data (e.g. using the same random number seed to split the data in each case) and calculate a score for each split. This would give a sample of $10$ scores for $10$-fold cross-validation. The scores can then be compared using a paired statistical hypothesis test because the same treatment (rows of data) was used for each algorithm to come up with each score. The Paired Student’s t-Test could be used.

A problem with using the Paired Student’s t-Test, in this case, is that each evaluation of the model is not independent. This is because the same rows of data are used to train the data multiple times — actually, each time, except for the time a row of data is used in the hold-out test fold. This lack of independence in the evaluation means that the Paired Student’s t-Test is biased.

How to test models with the lack of independence?

This statistical test can be adjusted to take the lack of independence into account. Additionally, the number of folds and repeats of the procedure can be configured to achieve a good sampling of model performance that generalizes well to a wide range of problems and algorithms. Specifically two-fold cross-validation with five repeats, so-called 5×2-fold cross-validation.

This approach was proposed by Thomas Dietterich in his 1998 paper titled “Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms.”

5×2 Procedure With MLxtend

The MLxtend library provides an implementation via the paired_ttest_5x2cv() function.

You can then call the paired_ttest_5x2cv() function and pass in your data and models and it will report the t-statistic value and the p-value as to whether the difference in the performance of the two algorithms is significant or not.

The smaller the alpha value, the better, and a common value is $5\%$ $(0.05)$.

Comparing Classifier Algorithms

Here, we compare the performance of two machine learning algorithms on a binary classification task, then check if the observed difference is statistically significant or not.

First, we can use the make_classification() function to create a synthetic dataset with $1,000$ samples and $10$ input variables.

In [1]:
# create classification dataset
from sklearn.datasets import make_classification

# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)

# summarize the dataset
print(X.shape, y.shape)
(1000, 10) (1000,)

We will compare the performance of two linear algorithms on this dataset. Specifically, a logistic regression algorithm and a support vector classifier (SVC) algorithm.

The procedure I like is to use repeated stratified k-fold cross-validation with $2$ folds and $5$ repeats. We will use this procedure to evaluate each algorithm and return and report the mean classification accuracy.

In [2]:
# compare logistic regression and svc for binary classification
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
import matplotlib.pyplot as plt

# evaluate model 1
model1 = LogisticRegression()
cv1 = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=1)
scores1 = cross_val_score(model1, X, y, scoring='accuracy', cv=cv1, n_jobs=-1)
print('LogisticRegression Mean Accuracy: %.3f (+/-) %.3f' % (np.mean(scores1), np.std(scores1)))

# evaluate model 2
model2 = SVC()
cv2 = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=1)
scores2 = cross_val_score(model2, X, y, scoring='accuracy', cv=cv2, n_jobs=-1)
print('SupportVectorClassifier Mean Accuracy: %.3f (+/-) %.3f' % (np.mean(scores2), np.std(scores2)))
LogisticRegression Mean Accuracy: 0.894 (+/-) 0.012
SupportVectorClassifier Mean Accuracy: 0.945 (+/-) 0.011
In [3]:
# plot the results
plt.boxplot([scores1, scores2], labels=['LR', 'SVC'], showmeans=True)

Now we can use a hypothesis test to see if the observed results are statistically significant.

We can then interpret the p-value using an alpha of $5\%$ percent.

In [4]:
from mlxtend.evaluate import paired_ttest_5x2cv

# check if difference between algorithms is real
t, p = paired_ttest_5x2cv(estimator1=model1, estimator2=model2, X=X, y=y, scoring='accuracy', random_seed=1)

# summarize
print('P-value: %.3f, t-Statistic: %.3f' % (p, t))

# interpret the result
if p <= 0.05:
    print('Difference between mean performance is probably real')
    print('Algorithms probably have the same performance')
P-value: 0.003, t-Statistic: -5.318
Difference between mean performance is probably real

In this case, we can see that the p-value is about $0.003$, which is lower than $0.05$. This leads us to reject the null hypothesis, suggesting that any observed difference between the algorithms is probably real.


  1. Independent Data Samples: If you have near unlimited data, gather $k$ separate train and test datasets to calculate $10$ truly independent skill scores for each method. You may then correctly apply the paired Student’s t-test. This is most unlikely as we are often working with small data samples.
  1. Accept the Problems of 10-fold CV: The naive $10$-fold cross-validation can be used with an unmodified paired Student t-test can be used. It has good repeatability relative to other methods and a modest type II error, but is known to have a high type I error.
  1. Use a Nonparametric Paired Test: We can use a nonparametric test that makes fewer assumptions, such as not assuming that the distribution of the skill scores is Gaussian. One example is the Wilcoxon signed-rank test, which is the nonparametric version of the paired Student’s t-test. This test has less statistical power than the paired t-test, although more power when the expectations of the t-test are violated, such as independence.

    This statistical hypothesis test is recommended for comparing algorithms different datasets by Janez Demsar in his 2006 paper “Statistical Comparisons of Classifiers over Multiple Data Sets“.

  1. Use Estimation Statistics Instead: Instead of statistical hypothesis tests, estimation statistics can be calculated, such as confidence intervals. These would suffer from similar problems where the assumption of independence is violated given the resampling methods by which the models are evaluated.

    Statistical methods such as the bootstrap can be used to calculate defensible nonparametric confidence intervals that can be used to both present results and compare classifiers. This is a simple and effective approach that you can always fall back upon and that I recommend in general.

McNemar’s Test to Compare Two Machine Learning Classifiers

Thomas Dietterich recommended the McNemar’s test in cases where it is expensive or impractical to train multiple copies of classification models.

This describes the current situation with deep learning models that are both very large and are trained and evaluated on large datasets, often requiring days or weeks to train a single model. Specifically, the test is recommended in those cases where the algorithms that are being compared can only be evaluated once, e.g. on one test set, as opposed to repeated evaluations via a resampling technique, such as k-fold cross-validation.

Contingency Table

The McNemar’s test operates upon a contingency table.

Before we dive into the test, let’s understand how the contingency table for two classifiers is calculated.

A contingency table is a count of two categorical variables. In the case of the McNemar’s test, we are interested in binary variables correct/incorrect or yes/no for a control and a treatment or $2$ cases. This is called a $2×2$ contingency table.

Consider that we have two trained classifiers. Each classifier makes binary class prediction for each of the $10$ examples in a test dataset. The predictions are evaluated and determined to be correct or incorrect.

We can then summarize these results in a table, as follows:

Instance Classifier1 Correct Classifier2 Correct
1 Yes No
2 No No
3 No Yes
4 No No
5 Yes Yes
6 Yes Yes
7 Yes Yes
8 No No
9 Yes No
10 Yes Yes

We can see that Classifier1 got $6$ correct, or an accuracy of $60\%$, and Classifier2 got $5$ correct, or $50\%$ accuracy on the test set.

The table can now be reduced to a contingency table.

The contingency table relies on the fact that both classifiers were trained on exactly the same training data and evaluated on exactly the same test data instances.

The contingency table has the following structure:

Classifier2 Correct Classifier2 Incorrect
Classifier1 Correct ?? ??
Classifier1 Incorrect ?? ??

In the case of the first cell in the table, we must sum the total number of test instances that Classifier1 got correct and Classifier2 got correct. For example, the first instance that both classifiers predicted correctly was instance number $5$. The total number of instances that both classifiers predicted correctly was $4$.

Another more programmatic way to think about this is to sum each combination of Yes/No in the results table above.

Classifier2 Correct Classifier2 Incorrect
Classifier1 Correct yes/yes yes/no
Classifier1 Incorrect no/yes no/no

The results organized into a contingency table are as follows:

Classifier2 Correct Classifier2 Incorrect
Classifier1 Correct 4 2
Classifier1 Incorrect 1 3

McNemar’s Test Statistic

McNemar’s test is a paired nonparametric or distribution-free statistical hypothesis test. It is also less intuitive than some other statistical hypothesis tests.

The McNemar’s test is checking if the disagreements between two cases match. Technically, this is referred to as the homogeneity (being all the same or all of the same kind) of the contingency table (specifically the marginal homogeneity). Therefore, the McNemar’s test is a type of homogeneity test for contingency tables.

The test is widely used in medicine to compare the effect of a treatment against a control.

In terms of comparing two binary classification algorithms, the test is commenting on whether the two models disagree in the same way (or not). It is not commenting on whether one model is more or less accurate or error prone than another. This is clear when we look at how the statistic is calculated.

The McNemar’s test statistic is calculated as:

statistic = (Yes/No - No/Yes)^2 / (Yes/No + No/Yes)

Where Yes/No is the count of test instances that Classifier1 got correct and Classifier2 got incorrect, and No/Yes is the count of test instances that Classifier1 got incorrect and Classifier2 got correct.

This calculation of the test statistic assumes that each cell in the contingency table used in the calculation has a count of at least $25$. The test statistic has a Chi-Squared distribution with $1$ degree of freedom.

We can see that only two elements of the contingency table are used, specifically that the Yes/Yes and No/No elements are not used in the calculation of the test statistic. As such, we can see that the statistic is reporting on the different correct or incorrect predictions between the two models, not the accuracy or error rates. This is important to understand when making claims about the finding of the statistic.

The default assumption, or null hypothesis, of the test is that the two cases disagree to the same amount. If the null hypothesis is rejected, it suggests that there is evidence to suggest that the cases disagree in different ways, that the disagreements are skewed.

Given the selection of a significance level, the p-value calculated by the test can be interpreted as follows:

  • p > alpha: fail to reject H0, no difference in the disagreement (e.g. treatment had no effect).
  • p <= alpha: reject H0, significant difference in the disagreement (e.g. treatment had an effect).

Interpret the McNemar’s Test for Classifiers

The two terms used in the calculation of the McNemar’s Test capture the errors made by both models. Specifically, the No/Yes and Yes/No cells in the contingency table. The test checks if there is a significant difference between the counts in these two cells. That is all.

If these cells have counts that are similar, it shows us that both models make errors in much the same proportion, just on different instances of the test set. In this case, the result of the test would not be significant and the null hypothesis would not be rejected.

If these cells have counts that are not similar, it shows that both models not only make different errors, but in fact have a different relative proportion of errors on the test set. In this case, the result of the test would be significant and we would reject the null hypothesis.

We can summarize this as follows:

  • Fail to Reject Null Hypothesis: Classifiers have a similar proportion of errors on the test set.
  • Reject Null Hypothesis: Classifiers have a different proportion of errors on the test set.

After performing the test and finding a significant result, it may be useful to report an effect statistical measure in order to quantify the finding. For example, a natural choice would be to report the odds ratios, or the contingency table itself, although both of these assume a technical reader.

It may be useful to report the difference in error between the two classifiers on the test set. In this case, be careful with your claims as the significant test does not report on the difference in error between the models, only the relative difference in the proportion of error between the models.

Finally, in using the McNemar’s test, Dietterich has two important limitations that must be considered. They are:

  1. No Measure of Training Set or Model Variability.

    Generally, model behavior varies based on the specific training data used to fit the model.

    This is due to both the interaction of the model with specific training instances and the use of randomness during learning. Fitting the model on multiple different training datasets and evaluating the skill, as is done with resampling methods, provides a way to measure the variance of the model.

  1. Less Direct Comparison of Models.

    The two classifiers are evaluated on a single test set, and the test set is expected to be smaller than the training set.

    This is different from hypothesis tests that make use of resampling methods as more, if not all, of the dataset is made available as a test set during evaluation (which introduces its own problems from a statistical perspective).

    This provides less of an opportunity to compare the performance of the models. It requires that the test set is appropriately representative of the domain, often meaning that the test dataset is large.

McNemar’s Test in Python

The McNemar’s test can be implemented in Python using the mcnemar() Statsmodels function.

The function takes the contingency table as an argument and returns the calculated test statistic and p-value.

There are two ways to use the statistic depending on the amount of data.

If there is a cell in the table that is used in the calculation of the test statistic that has a count of less than $25$, then a modified version of the test is used that calculates an exact p-value using a binomial distribution. This is the default usage of the test:

stat, p = mcnemar(table, exact=True)

Alternately, if all cells used in the calculation of the test statistic in the contingency table have a value of 25 or more, then the standard calculation of the test can be used.

stat, p = mcnemar(table, exact=False, correction=True)

McNemar’s Test Example

Here we use a very similar example to the previous example demonstrating the $5×2$ Procedure. Because of how computationally expensive and time consuming it would be to simulate the correct use of the test in a deep learning application we compare the same classifiers on a similar dataset. The only difference here is we use $5000$ samples and implement a train test split to evaluate on unseen data to give a soft implementation of the McNemar’s test.

In [5]:
from sklearn.model_selection import train_test_split

# define dataset
X, y = make_classification(n_samples=5000, n_features=10, n_informative=10, n_redundant=0, random_state=1)

# split dataset into train and test data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
In [6]:
# evaluate model 1
model1 = LogisticRegression(), y_train)
model1_pred = model1.predict(X_test)
scores1 = model1.score(X_test, y_test)
print('LogisticRegression Test Score Accuracy: %.3f' % (scores1))

# evaluate model 2
model2 = SVC(), y_train)
model2_pred = model2.predict(X_test)
scores2 = model2.score(X_test, y_test)
print('SupportVectorClassifier Test Score Accuracy: %.3f' % (scores2))
LogisticRegression Test Score Accuracy: 0.860
SupportVectorClassifier Test Score Accuracy: 0.964

Now we make use of our test results to make datframe for each model with a column indicating true or false comparing the models predictions to the actual test data. Then we take the value counts of the true false column for each model and store it into a variable to help create out contingency table for the McNemar’s test.

In [7]:
import pandas as pd

model1_table = pd.DataFrame(dict(y_true=y_test, y_pred=model1_pred))
model1_table['model1_tf'] = model1_table.y_true == model1_table.y_pred
model1_tf = model1_table.model1_tf.value_counts()
True     860
False    140
Name: model1_tf, dtype: int64
In [8]:
model2_table = pd.DataFrame(dict(y_true=y_test, y_pred=model2_pred))
model2_table['model2_tf'] = model2_table.y_true == model2_table.y_pred
model2_tf = model2_table.model2_tf.value_counts()
True     964
False     36
Name: model2_tf, dtype: int64

Here, we use the true false value counts for each model to create the contingency table for the McNemar’s test and visualize the table as a heatmap.

In [9]:
import seaborn as sns 

table = [[model1_tf[1]+model2_tf[1], model1_tf[1]+model2_tf[0]],
         [model1_tf[0]+model2_tf[1], model1_tf[0]+model2_tf[0]]]

table = pd.DataFrame(table, index=['Classifier1 Correct', 'Classifier1 Incorrect'], 
                     columns=['Classifier2 Correct', 'Classifier2 Incorrect'])

sns.heatmap(table, annot=True, fmt=".1f", annot_kws={"size": 16})
b, t = plt.ylim() 
b += 0.5 
t -= 0.5 
plt.ylim(b, t)

Finally, we use the mcnemar() function from statsmodels on our table to get a p-value to determine if a significant difference exist in the proportion of errors between each classifier.

In [10]:
# Example of calculating the mcnemar test
from statsmodels.stats.contingency_tables import mcnemar

# calculate mcnemar test
result = mcnemar(table.values, exact=False, correction=True)

# summarize the finding
print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue))

# interpret the p-value
alpha = 0.05
if result.pvalue > alpha:
    print('Same proportions of errors (fail to reject H0)')
    print('Different proportions of errors (reject H0)')
statistic=21.424, p-value=0.000
Different proportions of errors (reject H0)

In this case, we can see that the p-value is about $0.000$, which is lower than $0.05$. This leads us to reject the null hypothesis, suggesting that any observed difference between the algorithms is probably real.

In [ ]: