Berkeley Andrus

One advantage of many machine learning tasks is the ease with which we can measure model performance. We use BLEU scores, classification accuracy, error margins, etc. to compare models to one another, decide what is worthy of publication, and get an estimate of how close a model is to being production-ready.

Despite – or perhaps because of – how easy it is to compare accuracy scores, we often overlook another form of model evaluation: analyzing failure modes. This is formally referred to as Failure Mode and Effects Analysis (FMEA). Though FMEA can be applied to any discipline, it is especially important in machine learning, a field that already suffers from a lack of interpretability. We can create incredible neural networks that perform well 99% of the time, but until we understand why and how our systems fail in the other 1% of cases, it's hard to be sure they are ready for field work.

FMEA in machine learning usually means figuring out which types of test cases your model struggles with and identifying patterns in what goes wrong. For example, does your image classifier label every owl as a tiger or does it sometimes guess airplane? Does your chat-bot swear at users only when they say something insulting to it or does it happen unpredictably? How does your speech recognizer perform when the speaker has an accent or when an utterance is unusually long?

These kinds of questions take our analysis a step deeper than looking at a single accuracy metric or error rate. Taking time to understand why and how our model is failing provides a number of benefits:

  • It may show us low-hanging fruit where we can improve our model. For example, if we notice that our text classifier can distinguish between poetry and romance but not between science fiction and mystery, we might decide to augment our dataset or add new discriminating features to our input vectors that specifically address our model’s weaknesses.
  • Knowing the circumstances under which our model breaks down may help us prepare for the errors that will inevitably come. If we know that our self-driving car is twice as likely to get in an accident after dusk, we may choose not to use it during high-risk times, or to enforce a lower speed limit when we do.
  • In some rare cases, failure mode analysis illuminates fatal errors that should prevent us from using our model, such as errors based on race or gender biases that we did not realize appeared in our training data. These discoveries can send us back to the drawing board and lead to better models in the future.

But how do we go about finding these types of weaknesses? It is admittedly more difficult than just measuring accuracy, but that doesn't mean our failure mode analysis needs to be elaborate. When it comes to finding a model's failure modes, a little bit of testing goes a long way.

The type of testing you do will depend largely on the task your model is trying to accomplish, but one tool that I have found particularly useful is the confusion matrix.

Confusion Matrices

A confusion matrix is a table (like the one below) that visualizes the performance of a classification model, though it can be adapted to other types of tasks. The y-axis labels show which class the model predicted for each test case. The x-axis labels are the correct labels we wanted the model to predict. Each numeric value in the table is the number of times the model guessed the row label when the correct answer was the column label. A confusion matrix for a perfectly accurate classifier has high numbers along the diagonal and zeros everywhere else.
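
To make that row/column convention concrete, here is a minimal sketch of how such a matrix can be tallied by hand with NumPy. The labels and predictions are made up for illustration; the only thing that matters is that rows correspond to the model's guesses and columns to the correct answers, as described above.

```python
import numpy as np

# Hypothetical labels and model outputs, purely for illustration.
labels = ["Attack", "Protect", "Enter", "Exit"]
index = {name: i for i, name in enumerate(labels)}

y_true = ["Attack", "Protect", "Enter", "Exit", "Exit"]   # correct answers
y_pred = ["Attack", "Attack",  "Enter", "Enter", "Exit"]  # model's guesses

# Row = predicted label, column = correct label, matching the convention above.
matrix = np.zeros((len(labels), len(labels)), dtype=int)
for truth, guess in zip(y_true, y_pred):
    matrix[index[guess], index[truth]] += 1

print(matrix)
```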

Confusion matrices are easy enough to generate while measuring accuracy on a test set or benchmark. Once the data is collected, I like to display confusion matrices as heat maps using matplotlib (they have a tutorial and examples here).
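
For reference, here is roughly how the matrix from the sketch above could be rendered as a heat map. This is not the exact code behind the figures in this post, just a minimal example in the spirit of the matplotlib heat-map documentation, reusing the `matrix` and `labels` variables from the earlier sketch.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
im = ax.imshow(matrix, cmap="Blues")   # darker cells = higher counts

# Label both axes with the class names, matching the convention above.
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel("Correct label")
ax.set_ylabel("Predicted label")

# Print the raw count inside each cell so exact numbers stay readable.
for i in range(len(labels)):
    for j in range(len(labels)):
        color = "white" if matrix[i, j] > matrix.max() / 2 else "black"
        ax.text(j, i, str(matrix[i, j]), ha="center", va="center", color=color)

fig.colorbar(im)
plt.show()
```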

Examples

I want to demonstrate how effective confusion matrices can be with some examples from my own research. Let’s start by looking back at the table above, which I generated while working on a forthcoming paper on flexible speech interfaces for video games.

The first thing I noticed in this table was that the rows for 'spaceship' and 'NULL' were darker than the others. Dark rows often indicate that the associated labels are being over-predicted by the model. In this case, though, the 'NULL' column was also darker than the other columns. That was a clue to me that the model wasn't just over-predicting 'NULL'; it was generally confused about that particular label.

Hopefully you can already see why this type of analysis is helpful. If we know which labels are being over- or under-predicted, or are otherwise giving our model trouble, we have a better idea of how to adjust our training data, our network architecture, or any other part of our system to get better performance in the future. In this case, noticing the difficulties with the 'NULL' label led me to discover a bug in how I was representing that specific label, and I was able to adjust and move forward accordingly.
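
If you'd rather not rely on eyeballing the shading, the same over-/under-prediction check can be scripted. Here is one possible sketch, again reusing the `matrix` and `labels` from the earlier example; the 1.5x and 0.5x thresholds are arbitrary choices for illustration, not part of any standard recipe.

```python
row_totals = matrix.sum(axis=1)   # how often each label was predicted
col_totals = matrix.sum(axis=0)   # how often each label was the correct answer

for name, guessed, actual in zip(labels, row_totals, col_totals):
    if guessed > 1.5 * actual:
        print(f"'{name}' looks over-predicted ({guessed} guesses vs. {actual} true cases)")
    elif guessed < 0.5 * actual:
        print(f"'{name}' looks under-predicted ({guessed} guesses vs. {actual} true cases)")
```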

But we can do better than looking at rows or columns independently. Take a look at the table below, which came from another iteration of my classification model. What jumps out to you?

Aside from the fact that the ‘Follow’ row is over-represented, I noticed two clusters. The first cluster is the four cells in the upper left-hand corner, the ones where both the prediction and the correct label are ‘Attack’ or ‘Protect’. The second cluster I noticed is the group of six cells where the correct label is ‘Enter’ or ‘Exit’ and the prediction is ‘Enter’, ‘Exit’, or ‘Follow’.

(As a side note, it is important to remember that the spatial proximity of cells in a confusion matrix isn't usually meaningful. For this project I deliberately ordered the rows and columns so that related concepts would be near each other. If I hadn't done that, it wouldn't make sense to look for 'clusters', as any two cells in the same row or column would be just as related as two adjacent cells.)
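
(If your labels come out of your pipeline in an arbitrary order, reordering the matrix before plotting is a couple of lines of NumPy. The `desired_order` below is a hypothetical grouping, and the snippet reuses the `matrix` and `labels` from the earlier sketch.)

```python
# Put related concepts next to each other before plotting.
desired_order = ["Attack", "Protect", "Enter", "Exit"]   # hypothetical grouping
perm = [labels.index(name) for name in desired_order]

matrix = matrix[np.ix_(perm, perm)]   # permute rows and columns together
labels = desired_order
```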

Both of these clusters include antonym pairs, namely {‘Attack’, ‘Protect’} and {‘Enter’, ‘Exit’}. In other words, my model was frequently guessing the opposite of what was intended. In my application (interpreting closed-domain user commands), predicting the antonym of the correct label would be a big problem. The fact that many of the incorrect predictions were antonyms of the correct labels told me that my model’s performance was not as good as the accuracy score led me to believe — when it guessed wrong, it guessed really wrong. Some investigation revealed that the antonym problem was largely a product of the underlying language representation model I was using, which was outside of my control. However, thinking about antonyms led me to new ideas about downstream processing that could complement my model and detect these types of mistakes. In other words, taking the time to understand my model’s weaknesses helped me know how to use it more effectively and protect hypothetical end users from the worst parts of my model.
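
One way to surface this kind of problem automatically is to rank label pairs by how often they are confused with each other in either direction; pairs like {'Attack', 'Protect'} would float to the top. This is just a sketch reusing the `matrix` and `labels` from the earlier example, not the analysis from the paper.

```python
pairs = []
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        # Off-diagonal counts in both directions:
        # label i guessed when j was correct, and vice versa.
        mutual = int(matrix[i, j] + matrix[j, i])
        if mutual:
            pairs.append((mutual, labels[i], labels[j]))

for count, a, b in sorted(pairs, reverse=True):
    print(f"{a} <-> {b}: confused {count} times")
```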

Let’s look at one more example that I found surprising. Here’s the confusion matrix:

This matrix has two main clusters, one involving the labels {‘enemies’, ‘spaceship’, ‘vehicle’} (and ‘turret’ if you count the 178 in the ‘turret’ row and ‘enemies’ column), and the other with the labels {‘turret’, ‘tunnels’, ‘building’}.

The connection between the labels in these clusters wasn't apparent to me at first, but after digging around in my training data I realized that, from the perspective of the training data, the labels in the first cluster were entailed. According to what my model was learning, spaceships, vehicles, and turrets were all subsets of 'enemies', because that's what the training data conveyed. For example, my model was trained on phrases like 'Shoot down that aircraft!' but not on phrases like 'Protect the UFO at all costs!'. After realizing my mistake, I augmented my training data with example phrases that treated spaceships, vehicles, and turrets as allies and neutral agents in addition to enemies. I also disambiguated between vehicles and spaceships (which actually are entailed) by making sure all the vehicle phrases referred to land vehicles and all the spaceship phrases referred to aircraft. This made my model's task more meaningful and also boosted its classification accuracy.

Conclusion

As helpful as it is to have a highly accurate model, we miss opportunities for improvement when we don’t take the time to analyze why and how our models are failing. In some cases failure mode analysis helps us improve our models, and in other cases it helps us know how to use them effectively. Whatever your application is, I hope that these examples have been helpful and that they inspire you to find the methods that work for you.
