SHAP (SHapley Additive exPlanations) is one of the most popular model-agnostic techniques for explaining individual predictions.
Its Shapley values combine concepts from cooperative game theory with local, additive explanations.
Mathematical and Algorithmic Foundations
Shapley Values
Shapley values come from cooperative game theory and were introduced by Lloyd Shapley. They were designed to give a fair answer to the following question:
Question
If we have a coalition C that collaborates to produce a value V, how much did each individual member contribute to that final value?
The way we assess each individual member's contribution is to remove that member, form the smaller coalition, and compare the value produced with and without them.
For member 1, for example, we collect every pair of coalitions that differ only in whether member 1 is included.
Subtracting the value of the coalition without member 1 from the value of the coalition with member 1 gives a marginal contribution for each pair, and we then take a (weighted) mean of these differences.
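Formally, this average is the classical Shapley value: for a game with player set $N$ and value function $v$, the Shapley value of member $i$ is

$$
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}\,\Bigl(v\bigl(S \cup \{i\}\bigr) - v(S)\Bigr)
$$

The factorial weight counts how many orderings of the members place exactly the coalition $S$ before member $i$, so averaging over random orderings gives the same result as this weighted sum.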
Shapley Additive Explanations
We first need to know what "additive" means here. Lundberg and Lee define an additive feature attribution method as an explanation model $g$ that is a linear function of simplified inputs:

$$
g(x') \;=\; \phi_0 + \sum_{i=1}^{M} \phi_i x'_i
$$

Here $x' \in \{0,1\}^M$ is the simplified local input: the original feature vector is turned into a discrete binary vector in which each feature is either included (1) or excluded (0).
- $\phi_0$ is the null output of the model, that is, the average output of the model.
- $\phi_i$ is the feature effect (attribution) of feature $i$: how much including that feature changes the output of the model.
Lundberg and Lee then describe three desirable properties of such an additive feature attribution method: local accuracy, missingness, and consistency.
Local accuracy
The explanation model must match the original model at the instance being explained: $f(x) = g(x') = \phi_0 + \sum_{i=1}^{M} \phi_i x'_i$ when $x$ corresponds to the simplified input $x'$.
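As a quick empirical check of local accuracy, the minimal sketch below (assuming the `shap` package and scikit-learn are installed; the dataset and model are purely illustrative) verifies that the base value plus the sum of the SHAP values reproduces the model's prediction:

```python
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative data and model (any regressor with SHAP support would do).
X, y = make_regression(n_samples=200, n_features=4, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
phi = explainer.shap_values(X[:1])   # attributions phi_i for the first sample
phi_0 = explainer.expected_value     # the base (average) model output

# Local accuracy: phi_0 + sum_i phi_i should reproduce the model's prediction.
print(np.allclose(phi_0 + phi.sum(), model.predict(X[:1])[0]))  # expected: True
```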
Missingness
If a feature is excluded from the simplified input (that is, $x'_i = 0$), its attribution must be zero: $x'_i = 0 \Rightarrow \phi_i = 0$. In other words, the only thing that can affect the output of the explanation model is the inclusion of features, never their exclusion.
Consistency
If a model changes so that a feature's marginal contribution increases or stays the same for every coalition, that feature's attribution cannot change in the opposite direction (i.e., it cannot decrease).
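Written out (in the form used in the SHAP paper), with $f_x(x')$ denoting the model output when only the features marked in $x'$ are present, and $x' \setminus i$ denoting $x'$ with $x'_i$ set to 0:

$$
f'_x(x') - f'_x(x' \setminus i) \;\ge\; f_x(x') - f_x(x' \setminus i) \;\text{ for all } x' \;\;\Longrightarrow\;\; \phi_i(f', x) \ge \phi_i(f, x)
$$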
Why SHAP
Lundberg and Lee argue in their paper that the only additive explanation model satisfying all three properties is the one whose feature attributions are the Shapley values of those features, which is exactly what SHAP uses.
SHAP, Step-by-Step Process (the same idea behind shap.Explainer)
As an example, consider an ice cream shop in an airport; we have four features we can use to predict its business.
Say we want the Shapley value of the temperature feature (value 80) for the sample [80 1 100 4]. Here are the steps:
- Step 1. Draw a random permutation of the features, and put a bracket around the feature we care about and everything to its right.
- Step 2. Pick a random sample from the dataset.
For example, [200 5 70 8], in the form [F D T H].
- Step 3. Form two synthetic vectors.
The first vector is partly from the original sample and partly from the randomly chosen one: the features inside the bracket take their values from the random sample, except the feature we care about, which keeps its original value.
The second vector is identical, except that the feature we care about is also replaced with the random sample's value.
Then run the model on both vectors, calculate the difference between the two outputs, and record it.
- Step 4. Return to Step 1 and repeat many times; the mean of the recorded differences is the estimate of the feature's Shapley value (a code sketch of this loop follows below).
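Here is a minimal sketch of this sampling loop in plain NumPy (the function and variable names are illustrative, not part of the shap package); `model` stands for any fitted model with a `.predict` method and `X` for the dataset we sample from:

```python
import numpy as np

def sampled_shapley_value(model, X, x, feature, n_iter=1000, rng=None):
    """Monte Carlo estimate of one feature's Shapley value for the sample x."""
    rng = np.random.default_rng(rng)
    n_features = X.shape[1]
    diffs = []
    for _ in range(n_iter):
        # Step 1: random permutation; the "bracket" is the feature of interest
        # plus everything to its right in this permutation.
        perm = rng.permutation(n_features)
        pos = np.where(perm == feature)[0][0]
        bracket = perm[pos:]

        # Step 2: pick a random sample from the dataset.
        w = X[rng.integers(X.shape[0])]

        # Step 3: form the two synthetic vectors.
        x_with = x.copy()
        x_with[bracket] = w[bracket]      # bracketed features from the random sample...
        x_with[feature] = x[feature]      # ...except the feature we care about
        x_without = x_with.copy()
        x_without[feature] = w[feature]   # now replace that feature as well

        # Step 4: record the difference in model output.
        diffs.append(model.predict(x_with[None, :])[0]
                     - model.predict(x_without[None, :])[0])
    return float(np.mean(diffs))
```

Because a fresh permutation is drawn on every iteration, each ordering of the features is equally likely, so the simple mean of the recorded differences is an unbiased estimate of the Shapley value.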
Shapley kernel
Too many coalitions need to be sampled
As we saw when introducing Shapley values above, for each feature we need to evaluate many coalitions to compute the marginal differences.
The number of coalitions grows exponentially with the number of features: with 4 features there are $2^4 = 16$ coalitions, but with 32 features there are $2^{32}$, roughly 4.3 billion.
Evaluating all of them is entirely untenable.
To get around this difficulty, Lundberg and Lee devise a Shapley kernel.
Detail
Most ML models will not simply let you omit a feature. Instead, we define a background dataset $B$ containing a set of representative data points from the data the model was trained on. We fill in the omitted features with values drawn from the background dataset, while holding the included features fixed at their original values. We then take the average of the model output over all of these synthetic data points as our model output for that coalition, which we can write as

$$
f_x(x') \;\approx\; \frac{1}{|B|} \sum_{b \in B} f\bigl(x' \odot x + (1 - x') \odot b\bigr)
$$

where $\odot$ is element-wise multiplication, so included features keep their original values and excluded features take their values from the background point $b$.
Then we have a number of coalition samples computed in this way.
We can formulate this as a weighted linear regression, with each feature assigned a coefficient.
And it can be shown that, for one particular choice of sample weights, the fitted coefficients are exactly the Shapley values. This weighting scheme is the basis of the Shapley kernel, and the weighted linear regression procedure as a whole is Kernel SHAP.
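The sketch below illustrates the idea (the helper names `shapley_kernel_weight` and `kernel_shap` are made up for this example, not the shap library's API): sample coalition vectors, build background-filled synthetic points, weight each coalition with the Shapley kernel, and solve a weighted least-squares problem.

```python
import itertools
from math import comb

import numpy as np

def shapley_kernel_weight(M, s):
    """Shapley kernel weight for a coalition that includes s of the M features."""
    if s == 0 or s == M:
        return 1e6  # empty/full coalitions act as constraints; give them a huge weight
    return (M - 1) / (comb(M, s) * s * (M - s))

def kernel_shap(model, x, background, n_coalitions=2048, rng=None):
    """Estimate (phi_0, phi_1..phi_M) for one sample x via weighted linear regression."""
    rng = np.random.default_rng(rng)
    M = x.shape[0]

    # Coalition vectors z in {0,1}^M: enumerate all of them if feasible, else sample.
    if 2 ** M <= n_coalitions:
        Z = np.array(list(itertools.product([0, 1], repeat=M)))
    else:
        Z = rng.integers(0, 2, size=(n_coalitions, M))
        Z[0, :], Z[1, :] = 0, 1  # always include the empty and the full coalition

    weights = np.array([shapley_kernel_weight(M, int(z.sum())) for z in Z])

    # For each coalition, average the model output over background-filled synthetic points.
    y = np.empty(len(Z))
    for k, z in enumerate(Z):
        synthetic = background.astype(float).copy()
        synthetic[:, z == 1] = x[z == 1]        # included features fixed at original values
        y[k] = model.predict(synthetic).mean()  # excluded features come from the background

    # Weighted least squares:  y ~= phi_0 + Z @ phi
    A = np.column_stack([np.ones(len(Z)), Z])
    W = np.diag(weights)
    coef, *_ = np.linalg.lstsq(A.T @ W @ A, A.T @ W @ y, rcond=None)
    return coef[0], coef[1:]  # phi_0 and the Shapley values
```

Real implementations (including the shap package's Kernel SHAP) add refinements such as enforcing the empty- and full-coalition constraints exactly and smarter coalition sampling, but the weighted-regression structure is the same.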
Different types of SHAP
- Kernel SHAP
- Low-order SHAP
- Linear SHAP
- Max SHAP
- Deep SHAP
- Tree SHAP
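For reference, a rough (and non-exhaustive) mapping from some of these variants to explainer classes in the shap package looks like this; treat it as a pointer to the documentation rather than a complete list:

```python
import shap

# Kernel SHAP: model-agnostic; takes a prediction function and background data.
#   explainer = shap.KernelExplainer(predict_fn, background_data)
# Linear SHAP: fast values for linear models.
#   explainer = shap.LinearExplainer(linear_model, background_data)
# Deep SHAP: for deep learning models (TensorFlow / PyTorch).
#   explainer = shap.DeepExplainer(deep_model, background_data)
# Tree SHAP: fast exact values for tree ensembles (XGBoost, LightGBM, sklearn trees).
#   explainer = shap.TreeExplainer(tree_model)
```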
Things to note
In the end we estimate Shapley values through a (sampled) weighted linear regression, so there is inevitably approximation error. Some Python packages do not report an error bound, which makes it hard to know whether the error comes from the regression approximation, the data, or the model itself.
References
Shapley Additive Explanations (SHAP)
SHAP: A reliable way to analyze your model interpretability
Python's Interpretable Machine Learning Library SHAP (Chinese-language tutorial)
Shapley Values : Data Science Concepts
Appendix
Other methods to interpret models: