Conformal Prediction: Simple Model-Agnostic Prediction with Statistically Guaranteed Error Rates

This Blog Post about Conformal Prediction © 2023 by Lynd Bacon and Loma Buena Associates, is licensed under CC BY-SA 4.0 

 

TL;DR? Here’s the BLUF: 

"Conformal prediction methods can provide you with statistically valid certainty guarantees regarding the likely values of predictions using “new” data, data that you haven’t observed outcomes for, or that you just didn’t use for training and validating models."

A simple classification example with Python code snippets follows below, along with some suggested follow-up resources.

 


Suppose you’ve trained, tuned, and validated a predictive model that you will use to predict outcomes for new “objects,” new cases or observations. Your model might be a simple regression or classification model, an ensemble model of some sort, or a deep neural network. Your objects might be customers and their responses or their types, drug molecules and their efficacy, or any other sort of thing that you use a model to predict. Your data are of the form (Xi, Yi) for i = 1, …, n objects, or cases, where Xi are measures on “features,” or predictor variables, and the Yi to be predicted may be univariate or multivariate, continuous or discrete.

After you’ve trained your model, you have some new objects to predict Yi values for. You have Xi for these objects, but not Yi , which is why you need to make predictions. So you apply your model to each of the objects’ Xi’s to get a Yi prediction. For each of your new objects, perhaps you get a real number estimate, or perhaps a set of softmax probability estimates. You may take the probabilities as indicating a "true" category or class.

How confident should you be about these estimates? Are they close to the “ground truth” values (assuming they exist) for your new objects?

What if you could not just get a prediction for each new object, but also a measure of confidence or certainty with a guaranteed error rate that’s based on a range, or a region, of likely Yi values? What if this measure was agnostic with respect to the kind of model you used to make a prediction? What if it worked for any classifier, or for any regression model?

It turns out that there is a class of machine learning methods for doing this sort of inference. These methods have been under development by machine learning researchers for almost two decades. They enable what is called conformal prediction, or conformal inference. Conformal prediction can be used for classification and regression, as well as for other kinds of prediction problems, e.g., time series prediction, or survival analysis. Seminal work and important contributions to the extensive literature have been provided by Vovk et al. (2022) and a number of other scientists.

 Useful Features of Conformal Prediction Methods

Conformal prediction methods have some useful features, including the following.

  • Conformal inference is model agnostic in the sense that no assumptions about model features (e.g., functional form, assumed data generating mechanisms, etc.) are required. The methods do vary by the type of prediction or supervised learning problem, e.g., classification, or regression.

  • Prediction regions, or prediction sets, corresponding to different levels of certainty are nested: lower certainty sets are contained by higher certainty sets.

  • The better the predictive accuracy of a model, the smaller its prediction sets will be at a given level of certainty.

  • Data exchangeability is assumed in the sense that data for new observations could in theory be exchanged with observations’ data used to characterize prediction regions. IID data are exchangeable. No particular kind of distribution need be assumed.

It's Really Pretty Simple To Do

Conformal inference comes in several flavors that vary in their details. A common form is what’s called inductive conformal inference. This approach requires two data sets: one for model training and validation, and a separate calibration data set. The two data sets should be samples from the same population. The underlying predictive model may make its own assumptions about the data (e.g., that the objects are distributed IID according to some particular distribution), but the conformal procedure itself only requires exchangeability.

A Groaningly Simple Example: Conformal Prediction of Segment Memberships for New Customers

The Example Data and the Classification Model

A gradient boosting classifier was fit and tuned to predict segment membership using the Customer Segmentation data available on Kaggle https://www.kaggle.com/datasets/vetrirah/customer.

This is a “smallish” data set for many machine learning/AI purposes. The predefined training data set (n=8,068) was used for model training, cross-validation, and final assessment. Two of the original nine features were dropped due to large numbers of missing values. Non-continuous features were one-hot encoded. Four segment membership label values, A, B, C, and D, are specified in the data, and their distribution across the customer records is relatively balanced. It's not known whether any of the nine features were segmentation "basis" variables.

The test data set (n=2,627) includes the same features as the training data set. It was randomly split into a conformal prediction calibration data set of 2,000 customers, leaving 627 customers to be “new customers” for which predictions can be examined.
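Here’s a minimal sketch of that split, using scikit-learn’s train_test_split and hypothetical names (X_test and y_test for the Kaggle test set features and integer-coded labels; the actual split may have been done differently):

from sklearn.model_selection import train_test_split

# Hold out 2,000 records for conformal calibration; treat the remaining 627 as "new" customers.
# y_test is assumed to hold segment labels coded as integers 0-3 for segments A-D.
X_calib, X_new, y_calib, y_new = train_test_split(X_test, y_test, train_size=2000, random_state=0)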

Model tuning was very basic. It consisted of doing a simple grid search to choose some hyperparameter values that produced better cross-validated predictive accuracy results. The final (“best”) classifier model predicted segment membership for held-out data with about 55% accuracy, a relatively modest performance. Comments on the Kaggle site indicate obtained accuracies up to 60%. It’s possible that more feature engineering or transformations, or use of a different classifier model, might have increased accuracy by a few points.
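For orientation only, here’s a minimal sketch of that kind of grid search, assuming scikit-learn’s GradientBoostingClassifier and GridSearchCV, a hypothetical hyperparameter grid, and X_train, y_train holding the preprocessed training data (the actual grid and preprocessing details aren’t reproduced here):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid; the values actually searched for the example may have differed.
param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

# Cross-validated grid search on the preprocessed (one-hot encoded) training data.
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid,
                      scoring="accuracy", cv=5)
search.fit(X_train, y_train)
best_model = search.best_estimator_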

The final "best" model was used to predict segment membership probabilities using the 2,000 calibration customer data records. The predicted values for each customer consisted of four softmax probability estimates, one for each of the segment labels A, B, C, and D. The “true” segment labels for these customers are known, as they also were for the data used for model training, of course. For the present example, we’re “pretending” we don’t know the labels for the 627 “new” customers. We generate and assess segment prediction sets for these customers in what follows.
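In code, and reusing the hypothetical best_model, X_calib, and y_calib names from the sketches above, the calibration inputs for conformal prediction are simply:

# Per-class probability estimates for the 2,000 calibration customers: shape (2000, 4),
# with columns ordered by the integer-coded labels 0-3 (segments A-D).
z_calib = best_model.predict_proba(X_calib)
# y_calib holds the corresponding "true" integer-coded segment labels.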

Customer Segment Membership Conformal Prediction

What follows is an example of what’s called adaptive conformal prediction.

The first thing we need to do is to decide on a level of certainty that we want our prediction sets to reflect. Let’s say we want to have two to one odds, on average, that our prediction sets actually include the “true” (but, in this example, actually known) segment labels for new customers. In other words, we want our prediction sets to have at least 67% marginal coverage of “ground truth” label values. The certainty level that you choose for a particular application is a decision for you and your stakeholders to make.
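Stated a bit more formally (with C(X_new) denoting the prediction set produced for a new customer, notation not used elsewhere in this post), the guarantee we’re after is marginal coverage of at least 1 − alpha:

Pr( Y_new ∈ C(X_new) ) ≥ 1 − alpha

where alpha is the error rate we’re willing to tolerate (0.33 here), and the probability is marginal, i.e., averaged over new customers and over the calibration data, rather than guaranteed for any single customer.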

What follows includes some Python numpy code snippets, with a wee bit of pandas thrown in. You can do conformal prediction in pretty much any programming language that you can do some math in. Or, you can make use of the functionality of various open source libraries. The code snippets you see here reflect examples you can find online at GitHub (e.g., https://github.com/aangelopoulos/conformal-prediction) and elsewhere.

In the following code snippets, the softmax predictions for the 2,000 calibration customers are in a numpy array z_calib with 2,000 rows and four columns. The columns are the four softmax probability estimates for membership in the customer segments. These estimates were obtained by applying the final classifier model to the calibration customers’ data. The “ground truth” segment labels corresponding to these 2,000 predictions are in a 1D array y_calib, coded as integers 0 through 3 for segments A through D.

Step 1 – Calculate Conformity Scores Using Calibration Data

AKA, “non-conformity” scores. Conformal methods use a score that reflects how typical a new object is relative to the objects used to train and calibrate the model that makes the predictions.

Different conformity scores could be used. Here we’ll use a score that makes use of the information in all four predicted segment membership probabilities.

import numpy as np

# Create an index array that sorts each row of z_calib into descending probability order:
prob_idx = z_calib.argsort(axis=1)[:, ::-1]  # argsort is ascending; [:, ::-1] reverses to descending
# Apply the index, and accumulate the sum of probabilities across each row:
z_calib_sorted = np.take_along_axis(z_calib, prob_idx, axis=1).cumsum(axis=1)
# Calculate calibration scores using the label values in y_calib:
# after un-sorting back to the original column order, y_calib selects the column
# holding the cumulative probability corresponding to each "true" label
N = 2000  # size of the calibration data
calib_scores = np.take_along_axis(z_calib_sorted, prob_idx.argsort(axis=1), axis=1)[range(N), y_calib]

In the code above, we first sort each row of our softmax probability array z_calib in descending order. Then we compute the cumulative sums across each row. Next, we select the sum in each row that’s in the column corresponding to the “true” segment label value in the array y_calib. This amounts to including in each row’s score every segment with a predicted probability at least as large as the “true” (known) label’s, up to and including that label. All of our conformity scores have values in [0,1]. The larger the value, the less “conforming.” So you might say our scores are nonconformity scores.

Step 2 – Calculate adjusted quantile reflecting desired certainty level

Now that we have our conformity scores, we use them to calculate the quantile of their distribution that corresponds to our desired 67% marginal coverage, adjusting for the finite sample size. qhat is the desired quantile.

# Get the adjusted quantile qhat, the score threshold for 67% confidence prediction sets
# (67% certainty, or odds of about 2 to 1 that a set includes the "ground truth" label).
# The ceil((N+1)*(1-alpha))/N level adjusts the quantile for the finite calibration sample size.
alpha = 0.33  # error rate = 1 - certainty
qhat = np.quantile(calib_scores, np.ceil((N + 1) * (1 - alpha)) / N, method="higher")
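To make the adjustment concrete: with N = 2000 and alpha = 0.33, the adjusted level works out to ceil(2001 × 0.67)/2000 = 1341/2000 = 0.6705, so qhat is roughly the 67.05th percentile of the 2,000 calibration scores, just a hair above the unadjusted 67th percentile.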

Given the conformity scores calculated for this example, qhat is approximately 0.936.

Step 3 – Use the adjusted quantile to calculate prediction sets for “new” customers

We can now compute a 67% segment membership prediction set for each of our “new” customers, either individually or in a batch. These prediction sets will vary in the number of segments they contain, depending on how certain our model is about each new customer. A set can range from empty to all four segments; when it contains all four, our classifier model is uninformative for that customer given the level of certainty we want.

We include in each new customer’s prediction set every segment for which the cumulative predicted probability of membership, accumulated over segments in decreasing probability order, is less than or equal to the quantile we calculated in Step 2.

In the following code snippet, new_preds holds our model’s softmax predictions for the 627 “new” customers. We sort the probability estimates in each customer’s row in descending order, and then accumulate them across columns. A segment is included in a customer’s prediction set if its cumulative probability does not exceed the quantile value qhat we computed.

# Sort each new customer's softmax probabilities into descending order, then accumulate them:
new_preds_prob_idx = new_preds.argsort(axis=1)[:, ::-1]
preds_sort = np.take_along_axis(new_preds, new_preds_prob_idx, axis=1).cumsum(axis=1)
# Keep every segment whose cumulative probability is <= qhat, restoring the original column order:
pred_sets = np.take_along_axis(preds_sort <= qhat, new_preds_prob_idx.argsort(axis=1), axis=1)

New Customer Prediction Sets

Using the example code above, the prediction sets for each of the 627 new customers are defined in the numpy array pred_sets by four Boolean values, one for each of the segments. Here’s what 10 new customers’ prediction sets look like in this array. The columns from left to right correspond to the segments A, B, C, and D. The rows are in no particular order of the new customers.


array([[False,  True,  True, False],
       [ True,  True,  True, False],
       [ True, False, False,  True],
       [False,  True,  True, False],
       [False, False,  True, False],
       [ True,  True, False,  True],
       [ True,  True, False,  True],
       [ True, False, False,  True],
       [ True,  True,  True, False],
       [ True, False,  True,  True]])
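If it’s handier to read the sets as segment letters rather than Booleans, a small bit of numpy (assuming the A, B, C, D column order shown above) converts each row to a list of labels:

segment_labels = np.array(["A", "B", "C", "D"])  # column order of pred_sets
# e.g., the first row shown above, [False, True, True, False], becomes ['B', 'C']
pred_set_labels = [list(segment_labels[row]) for row in pred_sets]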
 

Prediction Set Sizes

Recall that the "best" gradient boosting classifier predicting segment memberships had an accuracy of just under 55% when tested on held-out data. Given this, it may not surprise you that many of the new customers have prediction sets that contain more than one segment. Below is a simple prediction set size frequency distribution, produced using a convenient bit of pandas, for the 627 new customers. The “0” row indicates that 12 customers received empty prediction sets; with the score and the <= qhat rule used here, that happens when a customer’s largest predicted probability already exceeds qhat. (Some implementations avoid empty sets by always including the most probable label.) Note that only 57 customers have prediction sets containing a single segment.

import pandas as pd

pred_set_size = pred_sets.sum(axis=1)  # number of segments in each customer's prediction set
pd.Series(pred_set_size).value_counts()

Counts of prediction set sizes produced by the above code:

3    353
2    205
1     57
0     12

Do Prediction Sets Contain "Ground Truth" Labels As Expected?

Or, as hoped?

Our 627 customers are not really “new” in the sense that we actually do know their segment membership labels. (We were pretending we didn't, right?) Since this is the case, we can examine whether their prediction sets actually do provide 67% marginal coverage of their “true” labels.

For the example at hand, the calculated rate at which the prediction sets of the 627 customers actually include their "true" segment labels is approximately 67.3%. We chose to have 67% certainty regarding our prediction sets. This exemplifies the marginal coverage guarantee that conformal methods provide.
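That empirical coverage rate is one line of numpy to compute. The sketch below assumes the known labels for the 627 “new” customers are integer-coded 0 through 3 in an array y_new, as in the split sketch earlier:

# Fraction of "new" customers whose known segment label falls inside their prediction set
empirical_coverage = pred_sets[np.arange(len(y_new)), y_new].mean()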

Summary

You can see that basic conformal prediction can be pretty straightforward. Principled extensions can make it less conservative, i.e., it can provide somewhat smaller prediction sets.

Conformal prediction is model agnostic, and it doesn’t require assuming any particular data distribution. It does require having some data. It also requires a decision about the degree of certainty desired with respect to use of the results.

You might wonder about the choice of 67% certainty for the above example. A larger degree of certainty would result in larger prediction sets. A smaller degree would produce smaller ones. The choice of certainty level is up to the user or to other stakeholders. Conformal prediction won’t make that choice for you. In the case of the example above, if the classifier were more predictively accurate, the 67% prediction sets would, on average, be smaller.

The keys to useful conformal prediction include choosing an appropriate conformity score for the kind of prediction problem you have, and having enough data. 

Some Resources

There's a lot of literature about conformal prediction.  Here are a few suggested bits:

Angelopoulos, A. & Bates, S. (2022) “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” https://arxiv.org/abs/2107.07511

Angelopoulos, A., Bates, S., Malik, J. & Jordan, M. (2022) “Uncertainty Sets for Image Classifiers Using Conformal Prediction.” https://arxiv.org/abs/2009.14193v5

Balasubramanian, V., Ho, S., & Vovk, V. “Conformal Prediction for Reliable Machine Learning: Theory, Adaptations and Applications.” Morgan Kaufmann, 2014.

Molnar, C. "Introduction To Conformal Prediction With Python." Munich, DE: MUCBOOK, 2023.

Toccaceli, P. (2022) “Introduction to conformal predictors.” Pattern Recognition, 124, 1-11.

Vovk, V., Gammerman, A. & Shafer, G. “Algorithmic Learning in a Random World.” Springer Nature Switzerland AG, 2022.


