Learning hides bugs

I’ve got an assertion to make. Let me give it a try:

“It’s difficult to write bug-free machine learning code.”

That’s a bit of a platitude. There are plenty of bugs one could encounter in ML code: misaligned tensors, floating point error explosion, improper dataset segmentation… the list goes on. None of these seem to be a uniquely difficult class of bug - harder to find and fix than those we might find in a program like a webserver or syntax parser. Even the tricky problems around parallel processing or offloading computation to GPUs aren’t much harder than what you can find with device drivers and multithreading elsewhere. There is the argument that ML has some very difficult algorithms to understand correctly, but I’ll throw that in a bin labeled “Computer Science” and ignore it for this analysis. So what is my assertion about - giving me the benefit of the doubt here, and assuming that I actually have something worthwhile to say?

Another whack at stating this little thesis:

“The failure modes of machine learning make bug free code difficult to write.”

Okay, this is a bit better, but vague. What are the failure modes of machine learning programs? In particular, we want to know what the unique failure modes are. At one of the spectrum, we have coding problems like tensor misalignment, or filling up our device memory and getting a driver error. As we agreed earlier, these aren’t meaningfully unique to machine learning programs. And in fact, the failure modes for these problems are pretty good, in the sense that they help us fix the bugs. When a tensor is misaligned, we get an exception and the program stops. We can debug the cause and fix it. When the GPU starts filling up, we can analyze how much data we’re putting on it, and try to lower our batch size or unrolled model size. It's not difficult to write code free of these bugs (though it is, at times, exhausting).

We also have problems like statistical bias, and poorly selected or optimized algorithms that take too long to run. These sometimes have good failure cases - if our algorithm is slow, we can analyze it with profiling tools, look at the literature for more appropriate alternative algorithms, and so on. Statistical bias is harder - in some cases, we can examine, manually or statistically, the outputs of our models for characteristic bias. In other cases, the possibility of indiscernible bias is arguably a fundamental limitation for the appropriate applications of machine learning. However, I’m also going to throw this problem, for the course of this article, in the bucket labeled “Computer ScienceComputer Science and Statistics.” These aren’t coding problems, and not what I’m trying to assert something about right now [1].

What is left are the classes of bug where we do not have a dramatic software failure mode like our code crashing or the model failing to descend - and we also assume that we do not have “bugs” in our assumptions about the model we have decided to implement. In other words, we have not set down a flawed model (like a neural network with no activations) on paper before beginning to implement it. What are these failure modes? They are the cases where the model is specified correctly, our implementation runs, it completes the task to some degree of performance, and yet has something wrong with it.

For instance, it might be the case that the model is a neural network, and is missing an activation - though we have it on the diagram of the model we drew before coding. The layers sandwiching the “missing” activation can only learn a linear transform. What happens in this case? Nothing dramatic, often. The model descends as usual, and perhaps converges. The performance, however, is potentially lower than if our specified model was implemented correctly. I claim that this kind of failure is somewhat unique to machine learning code.

The failure mode is “soft.” The bug does not make itself known. The impact of the bug is quantitative, not qualitative. There are, to be fair, other quantitative bugs. Performance bugs in non-learning algorithms are quantitative, but also quite accessible to analysis, such as through profiling and program structure analysis. Furthermore, the performance topology of a given non-learning algorithm is often well-defined, and we can reasonably expect relatively tight bounds on the performance of many non-learning algorithms.

Bugs that produce higher-than-optimal error in floating point and linear algebra libraries are qualitative, often similar to (and indeed, often the causes of) bugs in learning algorithms. Even this notion of performance, however, is analytically tractable: the accuracy numerical programs is an entire field with powerful tools of analysis and formalization. The performance of learning algorithms is not so easily predicted.

I argue that there are two dynamics at work that interact to cause these soft failure modes. The first is the nature of gradient-based optimization, which is the dominant paradigm for ML model training (proponents of GA and similar techniques not withstanding). We can formalize a gradient-descended model as a tuple $<W, L, G>$ (weights $W$, loss function $L:W \to Z$, gradient function $G:Z \to L$). If everything is specified and implemented correctly, then we expect that $L(W) < L(W+\lambda (G(L(W)))$, for $\lambda$ very small. This is simply the definition of the gradient.

What if the elements don’t do exactly what we want them to? Specifically, what if the gradient is computed incorrectly? We now have a tuple $<W, L, G’>$, where $G \neq G’$. We have no relationship between $L(W)$ and $L(W+\lambda (G(L(W)))$. GD shouldn’t do anything at all - just a random walk.

But what if $G$ is correlated with $G’$? It seems reasonable that GD might do something useful here, and indeed it does. This is the concept behind stochastic gradient descent, of course: approximate the true gradient with a minibatch gradient, and descend as if it was the real one. Here, we actually have a different tuple: $<W, L’, G>$. $L’$ is the loss function taken over the batch, not the sample population. The gradient, presumably, is correct, but taken over the wrong points. We can recover the second tuple $<W, L, G’>$ by defining $G’ = G(L’(W))$ - now we approximate the true loss with a different gradient, which we assert is correlated with the true gradient. Note that we can never actually compute the true loss L for most problems.

Let’s assume, boldly, that this result generalizes to all tuples $<W, L, G’>$. We can descend, with some degree of effectiveness, with a gradient that is merely correlated with the true gradient. This means that if our gradient function is implemented incorrectly, but the function it implements is correlated with the gradient, the model will descend and possibly converge! The failure mode is soft. If we don’t know what our model should be able to achieve in terms of loss, we may never know that this bug exists.

Today, with a few battle-tested autograd implementations, it may be quite unlikely for us to encounter bugs in our gradient. The loss function is easier to err in, especially with complex tasks that don't fit easily into the common classification/regression/reconstruction use cases. However, there is another way that SGD can have a "soft" failure. Let’s replace the model in our tuple with $W’$ - an incorrectly implemented model. This could be the model with a missing activation we discussed earlier. Again, we will often find our model descending and converging, through possibly above the loss a correctly specified model could achieve. In this case, however, we are not encountering a problem with SGD.

Consider a toy problem that consists of computing an XOR and performing a linear regression. The loss is a weighted sum of the two task-specific losses. We will model the problem with a neural network that contains one hidden layer, using a nonlinear activation. When we forget to implement the activation, our model is linearized, and cannot learn an XOR, but still descends and minimizes one component of the loss. When we correctly implement the specified model, the loss after training will be lower than that for the incorrect model. However, there was never any indication during training the model without an activation that the model was not correct. The model failed silently.

The second dynamic of the two I mentioned earlier, and the one exposed by this toy problem, is the lack of strong expectations around the performance of machine learning techniques. If this was not a toy problem, and the loss was more opaque, we might never know that the model was underperforming.

Consider how many papers introduce techniques with "unexpected" performance. We may have a case of practice outrunning theory. In general, it is difficult, if not impossible, to estimate the performance a given model will achieve on a given task. For this reason, we are unaware when a model is not living up to its performance potential. Not only can bugs be silent, they can be invisible.

In sum effect, it is very hard to analyze the loss performance of most machine learning models in the way that we can analyze the hot spots and other speed or memory performance characteristics of more formalizable programs. There are ways of doing this - looking at learned activations and training loss profiles and so on - but they are not formalized or reliable.

The clearest statement I can make of these ideas is:

"Gradient descent, when applied to models with unclear performance properties, tends to mask bugs in implementation code."

We can see this as the cousin of a well-established tendency of learning algorithms to identify and exploit bugs or loopholes in their experimental environments [2]. In addition to exploiting bugs, learning algorithms can perform well despite them, though potentially not as well as they could unburdened.

Okay, I've made my complaint. Now what do we do about this? The problem doesn't seem particularly tractable. We have a notion of how a model functions and sometimes a theory as to why it functions well - based on information theory, probability, biomimicry, etc - and we have a implementation of it, to some degree of correctness, that may function in very different ways and may achieve the performance it does for very different reasons than our theory. How do we reduce the last two possibilities [3]?

One strategy is to strengthen the ties between theoretical specification and practical implementation. This could take the form of libraries to generate human-readable representations of models from code, or even symbolic equations. This output could be compared to model specs or papers to verify correctness.

This isn't a trivial task, especially in complex training setups involving multiple models and dynamic model structures like RNNs. However, incremental progress is possible. Of course, these libraries themselves could be incorrectly implemented, but we decrease the attack surface for bugs with each layer of verification (at the cost of productivity).

Another strategy is increasing the use of toy problems that are simple enough to verify correctness of a model. Synthetic datasets are key here. However, in cases where practical methods frontrun proofs of theoretical performance expectations, it may be difficult to tightly bound the performance of a model architecture or technique using test datasets.

My last suggestion is less a scientific or programming technique than an organizational one: run a bakeoff. If several individuals or teams attempt to implement a model correctly in isolation, there is a chance that each of them can discover issues with another's model.



I do not, however, think that bias is a problem that anyone who works in or with learning algorithms can ignore. Bias is not "somebody else's problem". Everyone involved in offloading decision-making to opaque statistical engines needs to responsibly and sincerely grapple with the societal implications of that process. Even if you think your use case is benign, it may not be: consider how even a recommender system can amplify socioeconomic disparities along racial, sexual, or other inappropriate decision boundaries.


This exploitative behavior is perhaps most known to occur in genetic algorithms, but I have personally observed it in SGD training as well - I suspect it is simply the complexity of simulations that GA algorithms inhabit (in contrast to the simplicity of most SGD loss functions) that exacerbates this tendency, and that SGD-driven RL in particular will experience it to the same extent.


We cannot hope to eliminate them - to do so we would need complete confidence in the algorithms and systems underlying our model implementation's computation. The customary reference to "Reflections on Trusting Trust" applies - with the note that we are not worried about a deliberate attacker here but simply the possibility human error at all levels of the stack.

The sources of this blog are available at Github. If you have any comments, corrections, or questions, please get in touch with me - I'd love to chat.