## Implement conformal prediction for classification without the need for bespoke packages

This blog post is inspired by Christoph Molnar's book *Introduction to Conformal Prediction with Python*. Christoph is brilliant at making new machine learning techniques accessible to others. I would also particularly recommend his books on explainable machine learning.

A GitHub repository with the complete code can be found here: Conformal prediction.

Conformal prediction is both a method for quantifying uncertainty and a method for classifying instances (which can be refined for classes or subgroups). Uncertainty is expressed by classifying into sets of potential classes rather than making single predictions.

Conformal prediction specifies a *coverage*: the probability that the prediction region contains the true outcome. The interpretation of prediction regions depends on the task: for classification we get prediction sets, while for regression we get prediction intervals.

Below is an example of the difference between “traditional” classification (picking the most likely class) and conformal prediction (prediction sets).

The advantages of this method are:

**Guaranteed coverage**: The prediction sets generated by conformal prediction come with coverage guarantees for the true outcome; that is, they will contain the true value at least as often as the minimum target coverage you set. Conformal prediction does not depend on a well-calibrated model; the only requirement is that, as with all machine learning, the new samples being classified come from a data distribution similar to that of the training and calibration data. Coverage can also be guaranteed across classes or subgroups, although this requires an extra step in the method, which we will cover.

**Easy to use**: Conformal prediction approaches can be implemented from scratch with just a few lines of code, as we will do here.

**Model agnostic**: Conformal prediction works with any machine learning model. It uses the normal outputs of whichever model you prefer.

**Distribution-free**: Conformal prediction makes no assumptions about the underlying distributions of the data; it is a non-parametric method.

**No retraining required**: Conformal prediction can be used without retraining your model. It is simply another way to examine and use the model's outputs.

**Broad application**: Conformal prediction works for tabular data classification, image or time-series classification, regression, and many other tasks, although we will only demonstrate classification here.

Quantifying uncertainty is essential in many situations:

- When we use a model’s predictions to make decisions. Are we sure of these predictions? Is using “most likely class” sufficient for the task we have?
- When we want to communicate the uncertainty associated with our predictions to stakeholders, without talking about probabilities or odds, or even log-odds!

*Coverage* is key to conformal prediction. In classification, it is the normal region of data that a particular class inhabits. Coverage is equivalent to *sensitivity* or *recall*; it is the proportion of observed values that are identified in the prediction sets. We can tighten or loosen coverage by adjusting 𝛼 (coverage = 1 − 𝛼).
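To make the relation between 𝛼 and coverage concrete, here is a minimal, self-contained sketch using made-up uniform non-conformity scores (the `demo_` variable names are purely illustrative, not part of the method we build below):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical non-conformity scores for a calibration set (illustration only)
demo_scores = rng.uniform(0, 1, 1000)

# A larger alpha gives a lower score threshold, and therefore tighter coverage
for alpha in (0.01, 0.05, 0.20):
    demo_threshold = np.percentile(demo_scores, (1 - alpha) * 100)
    demo_coverage = np.mean(demo_scores <= demo_threshold)
    print(f"alpha={alpha:.2f}  threshold={demo_threshold:.2f}  "
          f"coverage={demo_coverage:.2f}")
```

The achieved coverage on the calibration scores tracks 1 − 𝛼, which is exactly the dial we will turn later in this post.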

## Import packages

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
```

**Create synthetic data for classification**

Sample data will be produced using scikit-learn’s `make_blobs` method.

```python
n_classes = 3

# Make train and test data
X, y = make_blobs(n_samples=10000, n_features=2, centers=n_classes,
                  cluster_std=3.75, random_state=42)

# Reduce the size of the first class to create an imbalanced dataset

# Set NumPy random seed
np.random.seed(42)

# Get index of when y is class 0
class_0_idx = np.where(y == 0)[0]

# Get 30% of the class 0 indices
class_0_idx = np.random.choice(class_0_idx, int(len(class_0_idx) * 0.3),
                               replace=False)

# Get the index for all other classes
rest_idx = np.where(y != 0)[0]

# Combine the indices
idx = np.concatenate((class_0_idx, rest_idx))

# Shuffle the indices
np.random.shuffle(idx)

# Split the data
X = X[idx]
y = y[idx]

# Split off model training set
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=42)

# Split rest into calibration and test
X_Cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

# Set class labels
class_labels = ['blue', 'orange', 'green']
```

```python
# Plot the data
fig = plt.figure(figsize=(5, 5))
ax = plt.subplot(111)
for i in range(n_classes):
    ax.scatter(X_test[y_test == i, 0], X_test[y_test == i, 1],
               label=class_labels[i], alpha=0.5, s=10)
legend = ax.legend()
legend.set_title("Class")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
plt.show()
```

**Building a classifier**

We will use a simple logistic regression model here, but the method works with any model, from logistic regression on tabular data to 3D ConvNets for image classification.

```python
# Build and train the classifier
classifier = LogisticRegression(random_state=42)
classifier.fit(X_train, y_train)

# Test the classifier
y_pred = classifier.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print(f"Accuracy: {accuracy:0.3f}")

# Test recall for each class
for i in range(n_classes):
    recall = np.mean(y_pred[y_test == i] == y_test[y_test == i])
    print(f"Recall for class {class_labels[i]}: {recall:0.3f}")
```

```
Accuracy: 0.930
Recall for class blue: 0.772
Recall for class orange: 0.938
Recall for class green: 0.969
```

Note how the recall of the minority class is lower than that of the other classes. Recall, also known as sensitivity, is the proportion of instances of a class that the classifier correctly identifies.

## The *s_i*, or *non-conformity*, score

In conformal prediction, the non-conformity score, often denoted *s_i*, is a measure of how different a new instance is from the existing instances in the calibration set. It is used to determine whether or not a new instance belongs to a particular class.

In classification, the most common non-conformity measure is *1 − predicted class probability* for the given label. Thus, if the predicted probability that a new instance belongs to a certain class is high, the non-conformity score will be low, and vice versa.

For conformal prediction, we obtain *s_i* scores for all calibration instances (note: we only look at the model output for the true class of an instance, even if the model assigns a higher probability to another class). We then find a score threshold that contains (or *covers*) 95% of the data. Prediction sets built with this threshold will then cover the true class for 95% of new instances (provided our new data is similar to our training data).
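As a toy illustration with invented probabilities (not from the model built below): if an instance's true class is class 1 and the model assigns that class probability 0.85, then:

```python
import numpy as np

# Invented predicted probabilities for a single instance over three classes
probs = np.array([0.10, 0.85, 0.05])
true_class = 1  # the observed label

# Non-conformity score: 1 minus the probability assigned to the true class
s_i = 1 - probs[true_class]
print(round(s_i, 2))  # 0.15
```

Note that we compute the score from the true class's probability even though, in other examples, the model might rank a different class highest.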

**Calculate the conformal prediction threshold**

We will now predict the classification probabilities of the calibration set. This will be used to set a classification threshold for new data.

```python
# Get predictions for calibration set
y_pred = classifier.predict(X_Cal)
y_pred_proba = classifier.predict_proba(X_Cal)

# Show first 5 instances
y_pred_proba[0:5]
```

```
array([[4.65677826e-04, 1.29602253e-03, 9.98238300e-01],
       [1.73428257e-03, 1.20718182e-02, 9.86193899e-01],
       [2.51649788e-01, 7.48331668e-01, 1.85434981e-05],
       [5.97545130e-04, 3.51642214e-04, 9.99050813e-01],
       [4.54193815e-06, 9.99983628e-01, 1.18300819e-05]])
```

**Calculate non-conformity scores:**

We will calculate *s_i* scores based solely on the probabilities associated with the observed class. For each instance, we take the predicted probability for that instance's true class. The *s_i* (non-conformity) score is *1 − probability*. The higher the *s_i* score, the less that example conforms to its class compared with the other classes.

```python
si_scores = []

# Loop through all calibration instances
for i, true_class in enumerate(y_cal):
    # Get predicted probability for observed/true class
    predicted_prob = y_pred_proba[i][true_class]
    si_scores.append(1 - predicted_prob)

# Convert to NumPy array
si_scores = np.array(si_scores)

# Show first 5 instances
si_scores[0:5]
```

```
array([1.76170035e-03, 1.38061008e-02, 2.51668332e-01, 9.49187344e-04,
       1.63720201e-05])
```

**Obtain the 95th percentile threshold:**

The threshold determines what *coverage* our classification will have. Coverage refers to the proportion of predictions that actually contain the true outcome.

The threshold is the percentile corresponding to 1 − 𝛼. To achieve 95% coverage, we set 𝛼 to 0.05.

When used in real life, the quantile level (based on 𝛼) requires a finite-sample correction to calculate the corresponding quantile. We multiply 1 − 𝛼 by (n + 1)/n, which means the quantile level would be 0.951 for n = 1000.

```python
number_of_samples = len(X_Cal)
alpha = 0.05
qlevel = (1 - alpha) * ((number_of_samples + 1) / number_of_samples)
threshold = np.percentile(si_scores, qlevel * 100)
print(f'Threshold: {threshold:0.3f}')
```

```
Threshold: 0.598
```

Display the graph of *s_i* values, with the cut-off threshold.

```python
x = np.arange(len(si_scores)) + 1
sorted_si_scores = np.sort(si_scores)
index_of_95th_percentile = int(len(si_scores) * 0.95)

# Colour by cut-off
conform = 'g' * index_of_95th_percentile
nonconform = 'r' * (len(si_scores) - index_of_95th_percentile)
color = list(conform + nonconform)

fig = plt.figure(figsize=(6, 4))
ax = fig.add_subplot()

# Add bars
ax.bar(x, sorted_si_scores, width=1.0, color=color)

# Add lines for 95th percentile
ax.plot([0, index_of_95th_percentile], [threshold, threshold],
        c='k', linestyle='--')
ax.plot([index_of_95th_percentile, index_of_95th_percentile], [threshold, 0],
        c='k', linestyle='--')

# Add text
txt = '95th percentile conformality threshold'
ax.text(5, threshold + 0.04, txt)

# Add axis labels
ax.set_xlabel('Sample instance (sorted by $s_i$)')
ax.set_ylabel('$s_i$ (non-conformity)')
plt.show()
```

## Get samples/classes from the test set classified as positive

Now we can find all the model outputs that fall below the threshold.

It is possible that an individual example has no values, or several values, below the threshold.

```python
# A class is in the prediction set if its si score is below the threshold
prediction_sets = (1 - classifier.predict_proba(X_test) <= threshold)

# Show first ten instances
prediction_sets[0:10]
```

```
array([[ True, False, False],
       [False, False,  True],
       [ True, False, False],
       [False, False,  True],
       [False,  True, False],
       [False,  True, False],
       [False,  True, False],
       [ True,  True, False],
       [False,  True, False],
       [False,  True, False]])
```
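To see how a single threshold can yield sets of size two, one, or zero, here is a small self-contained sketch with made-up probabilities (the 0.59 threshold and `toy_` names are illustrative, chosen to be close to the threshold computed above):

```python
import numpy as np

# Made-up predicted probabilities for three instances over three classes
toy_probs = np.array([[0.50, 0.45, 0.05],   # two classes conform -> set of 2
                      [0.98, 0.01, 0.01],   # one clear winner    -> set of 1
                      [0.40, 0.35, 0.25]])  # no class conforms   -> empty set

toy_threshold = 0.59  # illustrative threshold
toy_sets = (1 - toy_probs) <= toy_threshold
print(toy_sets.sum(axis=1))  # set sizes: [2 1 0]
```

Ambiguous instances near a class border can therefore receive more than one label, while instances unlike anything in calibration can receive none.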

**Obtain prediction set labels and compare them to standard classification.**

```python
# Get standard predictions
y_pred = classifier.predict(X_test)

# Function to get set labels
def get_prediction_set_labels(prediction_sets, class_labels):
    # Get set of class labels for each instance in prediction sets
    prediction_set_labels = [
        set(class_labels[i] for i, x in enumerate(prediction_set) if x)
        for prediction_set in prediction_sets]
    return prediction_set_labels

# Collate predictions
results_sets = pd.DataFrame()
results_sets['observed'] = [class_labels[i] for i in y_test]
results_sets['labels'] = get_prediction_set_labels(prediction_sets, class_labels)
results_sets['classifications'] = [class_labels[i] for i in y_pred]
results_sets.head(10)
```

```
  observed          labels classifications
0     blue          {blue}            blue
1    green         {green}           green
2     blue          {blue}            blue
3    green         {green}           green
4   orange        {orange}          orange
5   orange        {orange}          orange
6   orange        {orange}          orange
7   orange  {blue, orange}            blue
8   orange        {orange}          orange
9   orange        {orange}          orange
```

Note that instance 7 is actually class orange but was classified by the simple classifier as blue. Conformal prediction places it in the set {blue, orange}.

**Plot the data, showing instance 7, whose prediction set contains two classes:**

```python
# Plot the data
fig = plt.figure(figsize=(5, 5))
ax = plt.subplot(111)
for i in range(n_classes):
    ax.scatter(X_test[y_test == i, 0], X_test[y_test == i, 1],
               label=class_labels[i], alpha=0.5, s=10)

# Add instance 7
set_label = results_sets['labels'].iloc[7]
ax.scatter(X_test[7, 0], X_test[7, 1], color='k', s=100, marker='*',
           label='Instance 7')
legend = ax.legend()
legend.set_title("Class")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
txt = f"Prediction set for instance 7: {set_label}"
ax.text(-20, 18, txt)
plt.show()
```

**Show Coverage and Average Set Size**

*Coverage* is the proportion of prediction sets that actually contain the true result.

*Average set size* is the average number of classes predicted per instance.

We will define some functions to calculate the results.

```python
# Get class counts
def get_class_counts(y_test):
    class_counts = []
    for i in range(n_classes):
        class_counts.append(np.sum(y_test == i))
    return class_counts

# Get coverage for each class
def get_coverage_by_class(prediction_sets, y_test):
    coverage = []
    for i in range(n_classes):
        coverage.append(np.mean(prediction_sets[y_test == i, i]))
    return coverage

# Get average set size for each class
def get_average_set_size(prediction_sets, y_test):
    average_set_size = []
    for i in range(n_classes):
        average_set_size.append(
            np.mean(np.sum(prediction_sets[y_test == i], axis=1)))
    return average_set_size

# Get weighted coverage (weighted by class size)
def get_weighted_coverage(coverage, class_counts):
    total_counts = np.sum(class_counts)
    weighted_coverage = np.sum((coverage * class_counts) / total_counts)
    weighted_coverage = round(weighted_coverage, 3)
    return weighted_coverage

# Get weighted set size (weighted by class size)
def get_weighted_set_size(set_size, class_counts):
    total_counts = np.sum(class_counts)
    weighted_set_size = np.sum((set_size * class_counts) / total_counts)
    weighted_set_size = round(weighted_set_size, 3)
    return weighted_set_size
```

Show results for each class.

```python
results = pd.DataFrame(index=class_labels)
results['Class counts'] = get_class_counts(y_test)
results['Coverage'] = get_coverage_by_class(prediction_sets, y_test)
results['Average set size'] = get_average_set_size(prediction_sets, y_test)
results
```

```
        Class counts  Coverage  Average set size
blue             241  0.817427          1.087137
orange           848  0.954009          1.037736
green            828  0.977053          1.016908
```

View overall results.

```python
weighted_coverage = get_weighted_coverage(
    results['Coverage'], results['Class counts'])
weighted_set_size = get_weighted_set_size(
    results['Average set size'], results['Class counts'])
print(f'Overall coverage: {weighted_coverage}')
print(f'Average set size: {weighted_set_size}')
```

```
Overall coverage: 0.947
Average set size: 1.035
```

NOTE: Although our overall coverage is as desired, being very close to 95%, coverage varies across classes and is lowest (82%) for our smallest class. If the coverage of individual classes is important, we can set thresholds for each class independently, which we will do now.

When we want to be sure of coverage of all classes, we can set thresholds for each class independently.

Note: We could also do this for subgroups of data, for example ensuring equal coverage for a diagnosis across racial groups, if we found that coverage using a shared threshold caused problems.

## Get thresholds for each class independently

```python
# Set alpha (1 - coverage)
alpha = 0.05
thresholds = []

# Get predicted probabilities for calibration set
y_cal_prob = classifier.predict_proba(X_Cal)

# Get 95th percentile score for each class's s-scores
for class_label in range(n_classes):
    mask = y_cal == class_label
    y_cal_prob_class = y_cal_prob[mask][:, class_label]
    s_scores = 1 - y_cal_prob_class
    q = (1 - alpha) * 100
    class_size = mask.sum()
    correction = (class_size + 1) / class_size
    q *= correction
    threshold = np.percentile(s_scores, q)
    thresholds.append(threshold)

print(thresholds)
```

```
[0.9030202125697161, 0.6317149025299887, 0.26033562285411]
```

**Apply the class-specific thresholds to the classifications**

```python
# Get si scores for test set
predicted_proba = classifier.predict_proba(X_test)
si_scores = 1 - predicted_proba

# For each class, check whether each instance is below the threshold
prediction_sets = []
for i in range(n_classes):
    prediction_sets.append(si_scores[:, i] <= thresholds[i])
prediction_sets = np.array(prediction_sets).T

# Get prediction set labels and show first 10
prediction_set_labels = get_prediction_set_labels(prediction_sets, class_labels)

# Get standard predictions
y_pred = classifier.predict(X_test)

# Collate predictions
results_sets = pd.DataFrame()
results_sets['observed'] = [class_labels[i] for i in y_test]
results_sets['labels'] = get_prediction_set_labels(prediction_sets, class_labels)
results_sets['classifications'] = [class_labels[i] for i in y_pred]

# Show first 10 results
results_sets.head(10)
```

```
  observed          labels classifications
0     blue          {blue}            blue
1    green         {green}           green
2     blue          {blue}            blue
3    green         {green}           green
4   orange        {orange}          orange
5   orange        {orange}          orange
6   orange        {orange}          orange
7   orange  {blue, orange}            blue
8   orange        {orange}          orange
9   orange        {orange}          orange
```

**Check coverage and set size in all classes**

We now have approximately 95% coverage across all classes. The conformal prediction method gives us better coverage of the minority class than the standard classification method.

```python
results = pd.DataFrame(index=class_labels)
results['Class counts'] = get_class_counts(y_test)
results['Coverage'] = get_coverage_by_class(prediction_sets, y_test)
results['Average set size'] = get_average_set_size(prediction_sets, y_test)
results
```

```
        Class counts  Coverage  Average set size
blue             241  0.954357          1.228216
orange           848  0.956368          1.139151
green            828  0.942029          1.006039
```

```python
weighted_coverage = get_weighted_coverage(
    results['Coverage'], results['Class counts'])
weighted_set_size = get_weighted_set_size(
    results['Average set size'], results['Class counts'])
print(f'Overall coverage: {weighted_coverage}')
print(f'Average set size: {weighted_set_size}')
```

```
Overall coverage: 0.95
Average set size: 1.093
```

Conformal prediction was used to classify instances into sets rather than single predictions. Instances located on the border between two classes were labeled with both classes rather than choosing the class with the highest probability.

When it is important that all classes are detected with the same coverage, the classification threshold of instances can be set separately (this method could also be used for subgroups of data, e.g. to guarantee the same coverage between different ethnic groups).

Conformal prediction does not modify the model's predictions; it simply uses them differently from traditional classification. It can be used alongside more traditional methods.

(All images are by the author)