Recent results in deep and convolutional neural networks with powerful industrial applications have led to widespread adoption of these techniques outside of academia. One key advance has been the introduction of automatic differentiation, which allows developers and scientists to write code using standard numerical routines and functions that can be differentiated and used in training neural networks or other machine learning models. Many developers working with neural networks today, however, do not have a strong understanding of how automatic differentiation functions "under the hood," as well as how it differs from symbolic and numerical differentiation. In this article, using the style of literate programming, I will build some toy domain-specific languages that model how automatic differentiation functions in popular libraries like Tensorflow and PyTorch.

Please note that this is not a primer on automatic differentiation itself or the underlying theory. There are many high-quality sources on these topics, which I will not hope to outdo. The purpose of this article is to show, to those familiar with the concepts of AD but not the implementation, how simply automatic differentiation can be implemented.

For this reason I will present minimal background on AD and the alternatives and invite readers to search out these topics themselves.

There are roughly three ways to perform mathematical differentiation that most people are aware of:

This is the application of rules on abstract symbols to analytically calculate formulas for the derivative of $f$ at $x$ (where $x$ may be high dimensional).

Symbolic differentiation involves a *computer algebra system* (CAS). Some people differentiate between symbolic differentiation and "manual" differentiation, where the latter is the same essential procedure as the former, but is performed, generally on paper, by a human.

In practice, this would be something like a CAS looking up a rule in a table to convert an expression $y = x^2$, consisting of exponentiation of a scalar value, to a derivative expression $\frac{dy}{dx} = 2x$, without ever considering the value of x itself. Often, this would occur without even considering the value of *2*. Frege, at that point, is rolling in his grave. Alternatively, this would consist of a very bored 16-year-old sitting in math class looking up this rule in a table to generate the second expression.

This is the computation of an approximate derivative of $f$ at $x$ using numerical algorithms.

The most common techniques are forms of finite difference estimation, which consists of perturbing the value of $x$ and analyzing the resultant change in $f(x)$. Again, this can be done by massive linear algebra algorithms, or bored 16-year-olds.

While this technique seems hacky, it's commonly used in engineering and science. It's important to keep in mind that we are working with floating point numbers of finite precision in most cases, and sometimes a "hacky" method can max out our precision. At the same time, numerical differentiation requires significant attention to floating point arithmetic and error explosion.

Here, the derivative is computed from concrete values, and we do not obtain any structured expression.

This is the computation of an exact (to machine arithmetic precision) derivative of $f$ at $x$ by analyzing the *computation* of $f(x)$. Rather than analyzing symbols that express mathematical operations (as in symbolic differentiation), or perturbing computations that represent mathematical operations, we analyze a program that computes a function $f$. This analysis is far simpler than that of symbolic algebra expressions, and can be formalized into predictable algorithms and procedures.

The concept behind automatic differentiation is simple: every operation that our program is allowed to perform is shadowed by an operation that can compute the derivative of that operation. In practice, we record operations on values, and then use the "shadow" operations to compute derivatives of these operations using the chain rule. You can imagine all of the functions in your programming environment being doubled in this way. With the right abstraction layer, a normally written program can be made differentiable by an interpreter.

We could generalizing formalize this notion using Turing machines and differentiating operations on the tape, but this wouldn't be a very useful kind of differentiation in practice. In most cases, automatic differentiation is defined over a subset of a common programming language or library, or for a domain-specific language. By restricting the domain of AD to operations for which we can define efficient derivative operations, we gain a lot of expressive power and efficiency. An example is `autograd`

, which replaces large portions of `numpy`

and `scipy`

with differentiable operations.

To demonstrate the mechanics of AD, we are going to build several languages.

Each of these languages is going to a "lsp" - or not quite a Lisp. Don't be fooled: they are missing some key features that most Lisps have. They aren't even all Turing complete. Lisp syntax is used, because it's easy to parse, and parsing is not the main interest here.

Each lsp is defined by the tuple $l = <\mathcal V, \mathcal O, \mathcal A, \mathcal E>$ which includes:

- A value type $V$, the set of valid computational results in the language.
- An operator type $\mathcal O$, the set of valid transformations $F: \mathcal V \to \mathcal V$
- An atomizer $\mathcal A:\mathcal S \to \mathcal V \cup \mathcal O$, converting atomic tokens from the set of valid tokens $\mathcal S$ (generally a set of strings without parens or whitespace) into value and operator type instances
- An environment $\mathcal E$ that contains predefined items in $\mathcal V \cup \mathcal O$. By default the environment is static, but it could be made dynamic in some lsps (though not all of those defined in this article).

The syntax rules are very similar to standard Lisp:

f the atomizer accepts an expression, that expression is valid.

If an expression $e$

- consists of a pair of matched parens with at least two subexpressions
`(a, b ...)`

separated by whitespace, and - every subexpression $e$ contains is valid, and
- the first subexpression of $e$,
`a`

, evaluates to an operator type for this lsp,

then the whole expression is valid.

- consists of a pair of matched parens with at least two subexpressions

All other expressions are invalid.

We can parse a lsp quite easily, by tokenizing the input string and recursively processing parentheses contents. Parsing a tokenized string consists of converting all tokens except parens to the lsp's operator and value types, and then using the paren structure to create a nested list with identical structure.

In [1]:

```
import operator
from functools import reduce
import re
import math
from pprint import pprint
import numpy as np
```

In [2]:

```
class Lsp(object):
open_paren = re.compile("\(")
close_paren = re.compile("\)")
splitter = re.compile("\s+")
def __init__(self, value_type, operator_type, atomizer, environment):
self.value_type = value_type
self.operator_type = operator_type
self.atomizer = atomizer
self.environment = environment
def parse_helper(self, tokens):
token = tokens.pop(0)
if token == "(":
els = []
while tokens[0] != ")":
next_expr, tokens = self.parse_helper(tokens)
els += [next_expr]
return els, tokens[1:]
elif token == ")":
raise ValueError()
else:
return self.atomize(token), tokens
def parse(self, tokens):
parsed, _ = self.parse_helper(tokens)
return parsed
def atomize(self, token):
if token in self.environment.keys():
return self.environment[token]
return self.atomizer(token)
def tokenize(self, code_str):
return list(filter(
lambda x: x is not '',
Lsp.splitter.split(Lsp.close_paren.sub(" ) ", Lsp.open_paren.sub(" ( ", code_str)))
))
```

We can evaluate an lsp expression by recursively evaluating the subexpressions from the "bottom up".

In [3]:

```
class EvalLsp(Lsp):
"""A lisp that may be evaluated."""
def eval_stack(self, stack):
if isinstance(stack, self.value_type) or isinstance(stack, self.operator_type):
# atom
return stack
fn = stack[0]
if isinstance(fn, list):
# nested function
return self.eval_stack(self.eval_stack(stack[0]) + stack[1:])
if isinstance(fn, self.operator_type):
# eval this function on the args
return fn(*map(self.eval_stack, stack[1:]))
raise ValueError(f"Unparsable value {fn}")
def eval_str(self, code_str):
raise NotImplemented()
```

`Lsplsp`

is about the simplest lsp langauge that we can make from this definition that isn't degenerate. It can run a few functions on Python ints.

In [4]:

```
class SimpleLsp(EvalLsp):
"""A DoubleLsp is an Lsp that returns the same value in the forward and backward results."""
def eval_str(self, code_str):
stack = self.parse(self.tokenize(code_str))
evaled = self.eval_stack(stack)
return (evaled)
```

In [5]:

```
LspLspValue = int
class LspLspOperator(object):
"""An lsplsp operator is just a Python function annotated with a symbol."""
def __init__(self, symbol, function):
self.symbol = symbol
self.function = function
def __call__(self, *args):
return self.function(*args)
```

In [6]:

```
int_token = re.compile("^\d+$")
def lsplsp_atomizer(atom):
"""Lsplsp attempts to evaluate anything it can't find in it's environment as a Python int."""
if int_token.match(atom):
return int(atom)
raise ValueError()
```

In [7]:

```
lsplsp_sum = LspLspOperator('+', lambda *args: sum(args))
lsplsp_diff = LspLspOperator('-', lambda *args: args[0] - sum(args[1:]))
```

In [8]:

```
lsplsp = SimpleLsp(
LspLspValue,
LspLspOperator,
lsplsp_atomizer,
{ f.symbol: f for f in [lsplsp_sum, lsplsp_diff] }
)
```

In [9]:

```
lsplsp.eval_str("(+ 5 2 (- 6 3))")
```

Out[9]:

The rest of the lsps we're going to take a look at are "forward-backward lsps," or fb-lsps for short.

There are two special characteristics of fb-lsps:

- Evaluation of an fb-lsp expression returns all the intermediate results in a nested structure.
- Evaluation of an fb-lsp expression results in two results: a 'forward' results and a 'backward' result.

Exactly what these two results will be used for is an inscrutable mystery for now...

We can recycle `lsplsp`

as `fblsplsp`

, an fb-lsp that simply returns the same value in F and B results.
It's cheating a little to call this an fb-lsp, and it doesn't provide us with anything new or useful, but we'll do it anyway.

In [10]:

```
class FBCheatLsp(EvalLsp):
"""An Lsp that that returns the same value in the forward and backward results."""
def eval_str(self, code_str):
stack = self.parse(self.tokenize(code_str))
evaled = self.eval_stack(stack)
return (
evaled,
evaled
)
```

In [11]:

```
FBLspLspValue = int
class FBLspLspOperator(object):
"""An fblsplsp function returns the argument values in a tuple with the result."""
def __init__(self, symbol, function):
self.symbol = symbol
self.function = function
def __call__(self, *args):
return (self.function(*(arg if isinstance(arg, FBLspLspValue) else arg[0] for arg in args)), args)
```

In [12]:

```
fblsplsp_sum = FBLspLspOperator('+', lambda *args: sum(args))
fblsplsp_diff = FBLspLspOperator('-', lambda *args: args[0] - sum(args[1:]))
```

In [13]:

```
fblsplsp = FBCheatLsp(
FBLspLspValue,
FBLspLspOperator,
lsplsp_atomizer,
{ f.symbol: f for f in [fblsplsp_sum, fblsplsp_diff] }
)
```

When we evaluate an `fblsplsp`

expression, we get a tuple of the forward and backwards results. Each element of the tuple is a nested tuple, forming a tree of intermediate results.

In [14]:

```
pprint(fblsplsp.eval_str("(+ 5 2 (- 6 3 3))"))
```

Generally, our fb-lsps will compute two different values for the forward and backward result, with some shared computation. We can implement these as follows:

In [15]:

```
class FBHookLsp(EvalLsp):
"""An Lsp that computes the forward and backward results using a hook from stack evaluation."""
def eval_str(self, code_str):
stack = self.parse(self.tokenize(code_str))
evaled = self.eval_stack(stack)
return (
evaled.forward(),
evaled.backward()
)
```

`gradlsp`

¶Time to get back on topic: automatic differentiation.

`gradlsp`

is an fb-lsp that returns a graph of conventional computation (just like `fblsplsp`

) in the forward result, and returns a graph of gradients taken with respect to the final result in the backward result. We will define `gradlsp`

to work over the domain of real numbers (approximated as Python floating-point values).

In [16]:

```
int_token = re.compile("^\d+$")
float_token = re.compile("^\d+(\.\d*)$")
#string_token = re.compile("^\"([^\"]*)\"")
def gradlsp_atomizer(token):
if int_token.match(token):
return GradLspValue(float(token))
if float_token.match(token):
return GradLspValue(float(token))
raise ValueError(token)
```

`gradlsp`

implements "reverse-mode" automatic differentiation.
Any operation in `gradlsp`

that takes arguments $a$ and returns a value $v$ records, in $v$,
the necessary information to compute the gradients of each of $a$ (and $a$'s ancestor arguments) with respect to $v$ or a value computed by operations on $v$.

The information stored in each value $v$ is the complete graph of previous values leading to $v$'s computation, as well as the gradient of the function $f$ that produced $v$ at the point of evaluation.

In theory, gradient computation of all subresults from one of these graphs requires a (computationally expensive) topological sort.

However, `gradlsp`

doesn't have any variables. Every operation on a value in gradlsp is pure with respect to the boxed value - it does not modify the original value. Furthermore, all operations are pure with respect to the value's gradient graph. Even the detach operation simply returns a new value with a degenerate graph. This ensures that the gradient graph is a DAG: a cycle in the graph would require a changed (impure) value. Because there are no cycles, we can treat the gradient graph as if it were a tree. This is very inefficient: if the DAG isn't treelike, we are performing wasted computations.

In [17]:

```
class GradLspValue(object):
def __init__(self, value, gradient = [], children = [], symbol = None, operator_symbol=None):
self.gradient = gradient
self.children = children
self.value = value
self.symbol = symbol
self.operator_symbol = operator_symbol
def __repr__(self):
return f"<GLV {self.value}, gradient={self.gradient}, with {len(self.children)} children>"
def forward(self):
return (self.value, self.operator_symbol or '?') + ((
[child.forward() for child in self.children],
) if len(self.children) > 0 else tuple())
def backward(self, adjoint=1):
if len(self.children) == 0:
return adjoint
return (adjoint, self.operator_symbol or '?') + ((
[child.backward(adjoint * derivative)
for child, derivative in zip(self.children, self.gradient)],
) if len(self.children) > 0 else tuple())
```

Furthermore, in `gradlsp`

, we require every operator be scalar-valued. For this reason, the compute DAG for any value is by necessity a tree: every operation has only one parent. This decision reduces the power of `gradlsp`

, as we will see soon.

In [18]:

```
class GradLspOperator(object):
def __init__(self, symbol, function, grad_function):
self.function = function
self.grad_function = grad_function
self.symbol = symbol
def __repr__(self):
return f"<GLF {self.symbol}>"
def __call__(self, *args):
result = GradLspValue(
self.function(*map(lambda glv: glv.value, args)),
self.grad_function(*map(lambda glv: glv.value, args)),
args,
operator_symbol=self.symbol
)
return result
```

When computing the backward result, we propagate the gradient of the final value $y$ backwards. This consists of repeated vector multiplication between the original value and the intermediate gradients. At each intermediate value $v$, we obtain an adjoint value $\frac{dy}{dv}$, the derivative of $v$ with respect to the final result $y$.

In [19]:

```
plus = GradLspOperator(
'+',
lambda *args: sum(args),
lambda *args: [1,] * len(args)
)
```

In [20]:

```
minus = GradLspOperator(
'-',
lambda *args: args[0] - sum(args[1:]),
lambda *args: [1,] + [-1,] * (len(args) - 1)
)
```

In [21]:

```
prod = GradLspOperator(
'*',
lambda *args: reduce(operator.mul, args, 1),
lambda *args: [reduce(operator.mul, (args[:i] + args[i+1:]), 1) for i in range(len(args))]
)
```

In [22]:

```
quot = GradLspOperator(
'/',
lambda x, y: x / y,
lambda x, y: [1/y, -(x/(math.pow(y, 2)))]
)
```

In [23]:

```
sin = GradLspOperator(
'sin',
lambda x: math.sin(x),
lambda x: [math.cos(x)]
)
```

In [24]:

```
cos = GradLspOperator(
'cos',
lambda x: math.cos(x),
lambda x: [-math.sin(x)]
)
```

In [25]:

```
detach = GradLspOperator(
'$',
lambda x: x,
[]
)
```

The detach operation `($ foo)`

detaches the compute graph of `foo`

from the returned result (which has the same *value* as foo), resulting in a new compute graph consisting of `foo`

only.
The (singular) node of the gradient DAG rooted at `value`

, in any case, will be zero.
This value, though it may depend on other values in actuality, is "held constant" from the standpoint of the gradient computation.

In [26]:

```
pi = GradLspValue(math.pi, symbol='pi')
```

In [27]:

```
gradlsp = FBHookLsp(
GradLspValue,
GradLspOperator,
gradlsp_atomizer,
{ f.symbol: f for f in [plus, minus, prod, quot, sin, cos, pi, detach] }
)
```

In [28]:

```
pprint(gradlsp.eval_str("(* (* 3 2) (cos pi))"))
```

`gradlsp`

isn't very useful.
It works on scalar valued functions only.
It's hard to implement more than a best fit line with it.

`jacoblsp`

is our next step.
Values in `jacoblsp`

are tensors of arbitrary size (though many functions only accept 1D or 2D tensors).

An operator in `jacoblsp`

can express a function $\mathbb R^n \to \mathbb R^m$.
Correspondingly, instead of the gradient, `jacoblsp`

computes the Jacobian tensor for each operation.
Instead of storing the local Jacobian itself in each value, and multiplying by the appropriate adjoint in the backward pass, we store a function that takes the adjoint and returns adjoint matrices for each of the generating function's arguments. This allows us to minimize space use in some operators, and concisely express others.

Each value in `jacoblsp`

is backed by an n-dimensional `numpy`

array.
An array is expressed as a literal `1,2,3`

. Trailing commas are allowed, but spaces are not.
For instance, `(*. 1,2,3 4,5,6)`

will compute the dot product of $[1,2,3]$ with $[4,5,6]$
Note that every value in `jacoblsp`

is an array: scalar values are simply arrays of size one.

Note that `jacoblsp`

is still very inefficient. We compute every subtree of the compute DAG for every backwards operation. If our DAG isn't treelike, we are performing wasted computations. It's easier to express combinatorically complex programs in `jacoblsp`

than it was in `jacoblsp`

, so this problem is more worrying. However, we still have a treelike DAG, assuming that individual gradient functions do not branch.

In [29]:

```
class JacobLspValue(object):
def __init__(self, value, adjoint_fns = None, children=[], shape=None, symbol = None, operator_symbol=None):
self.adjoint_fns = adjoint_fns if adjoint_fns else [(lambda x: x) for c in children]
self.children = children
self.value = value
self.symbol = symbol
self.operator_symbol = operator_symbol
self.shape = shape if shape else self.value.shape
def __repr__(self):
return f"<JLV {self.value}, with {len(self.children)} children>"
def forward(self):
if not self.children:
return (self.value,)
return (
self.value,
self.operator_symbol or '?',
[child.forward() for child in self.children]
)
def backward(self, adjoint=None):
if adjoint is None:
adjoint = self.value
if not self.children:
return (adjoint,)
return (
adjoint,
self.operator_symbol or '?',
[child.backward(adjoint_fn(adjoint))
for adjoint_fn, child
in zip(self.adjoint_fns, self.children)
]
)
```

In [30]:

```
class JacobLspOperator(object):
def __init__(self, symbol, function, max_args=None, max_dims=None):
self.function = function
self.symbol = symbol
self.max_args = max_args
self.max_dims = max_dims
def __repr__(self):
return f"<JLF {self.symbol}>"
def __call__(self, *args):
if self.max_args is not None and len(args) > self.max_args:
raise ArgumentException()
if self.max_dims and any(np.ndim(arg) > max_dim for arg, max_dim in zip(args, self.max_dims)):
raise ArgumentException()
arg_values = list(map(lambda jlv: jlv.value, args))
output_value, adjoint_fns = self.function(*arg_values)
result = JacobLspValue(
output_value,
adjoint_fns,
args if adjoint_fns else [],
operator_symbol=self.symbol
)
return result
```

In [31]:

```
int_arr_token= re.compile("^-?(\d+,)*-?\d+,?$")
float_arr_token= re.compile("^(-?\d+(\.\d*)?,)*-?\d+(\.\d*)?,?$")
def jacoblsp_atomizer(token):
if int_arr_token.match(token):
return JacobLspValue(np.array(list(map(int, filter(bool, token.split(',')))), dtype='int'))
if int_arr_token.match(token):
return JacobLspValue(np.array(list(map(float, filter(bool, token.split(',')))), dtype='float'))
raise ValueError(token)
```

In [32]:

```
jplus = JacobLspOperator(
'+',
lambda *args: (
np.stack(args).sum(axis=0),
[(lambda g: g) for a in args]
)
)
```

In [33]:

```
jminus = JacobLspOperator(
'-',
lambda *args: (
args[0] - np.stack(args[1:]).sum(axis=0),
[lambda g: g] + [(lambda g: -g) for a in args[1:]]
)
)
```

In [34]:

```
jprod = JacobLspOperator(
'*',
lambda *args: (
np.stack(args).prod(axis=0),
[(lambda g: g.dot(np.stack(args[:i] + args[i+1:]))) for i in range(len(args))]
)
)
```

In [35]:

```
matvecdot = (lambda A, x: [
lambda g: np.tensordot(g, x, 0),
lambda g: np.tensordot(g, A, [-1, 0])
])
vecmatdot = (lambda a, B: [
lambda g: np.tensordot(g, B.T, 1),
lambda g: np.tensordot(g, a, 0)
])
vecvecdot = (lambda a, b: [
lambda g: b,
lambda g: a,
])
matmatdot = (lambda A, B: [
lambda g: np.tensordot(g, B.T, 1),
lambda g: np.tensordot(g, A, [-2, 0]).T
])
selectdot = ({
(1,1): vecvecdot,
(2,1): matvecdot,
(1,2): vecmatdot,
(2,2): matmatdot,
})
```

In [36]:

```
jdot = JacobLspOperator(
'*.',
lambda A, B: (
np.dot(A, B),
selectdot[A.ndim, B.ndim](A,B)
),
max_args = 2,
max_dims = [2, 2]
)
```

In [37]:

```
jquot = JacobLspOperator(
'/',
lambda x, y: (
x / y,
[lambda g: g/y, lambda g: g.dot(-(x/(math.pow(y, 2))))],
),
max_args = 2
)
```

In [38]:

```
jcos = JacobLspOperator(
'cos',
lambda x: (
np.cos(x),
[lambda g: g.dot(-np.sin(x))]
),
max_args = 1
)
```

In [39]:

```
jsin = JacobLspOperator(
'sin',
lambda x: (
np.sin(x),
[lambda g: g.dot(np.cos(x))],
),
max_args = 1
)
```

In [40]:

```
jfill = JacobLspOperator(
'full',
lambda fill, *shp: (
np.full(shp[0] if len(shp) == 1 else shp, fill),
None,
)
)
```

In [41]:

```
relu = JacobLspOperator(
'relu',
lambda x: (
np.where(x < 0, 0, x),
[lambda g: (g * np.where(x < 0, 0, 1)),],
),
max_args=1
)
```

In [42]:

```
stack = JacobLspOperator(
'stack',
lambda *args: (np.stack(args), [lambda g: g[i] for i in range(len(args))])
)
```

In [43]:

```
transpose = JacobLspOperator(
'T',
lambda arg: (arg.T, [lambda g: g.T]) if len(arg.shape) == 2 else (arg.reshape((1,-1)), [lambda g: g.reshape(1,-1)]),
max_args = 1,
max_dims = [2]
)
```

In [44]:

```
jacoblsp = FBHookLsp(
JacobLspValue,
JacobLspOperator,
jacoblsp_atomizer,
{ f.symbol: f for f in [jplus, jminus, jprod, jquot, jsin, jcos, jdot, jfill, relu, stack, transpose] }
)
```

We now have all the building blocks to implement a neural network in `jacoblsp`

, and take the network's gradient.

As an example, we can build a classification network, which will compute the binary XOR of two inputs (with an appended bias term). It will consist of two layers with a single nonlinear activation. Using the inputs $(1,1), (0,1), (1,0), (0,0)$, we would expect to obtain the outputs $0, 1, 1, 0$. This is a toy problem, for which layer weights can be given a priori, with no training necessary. The network is implemented as follows:

In [45]:

```
pprint(jacoblsp.eval_str('(*. (stack -2,1) (relu (*. (stack 1,1,-1 1,1,0) (T (stack 1,1,1 0,1,1 1,0,1 0,0,1)))))'))
```

We can see that `jacoblsp`

has computed the correct result and a gradient for it with each of the intermediate functions. In particular, we can see that it has computed the gradients for the layer weights. Appending a loss function:

In [46]:

```
pprint(jacoblsp.eval_str('(- (T 0,1,1,0) (*. (stack -2,1) (relu (*. (stack 1,1,-1 1,1,0) (T (stack 1,1,1 0,1,1 1,0,1 0,0,1))))))'))
```

Here, the gradient is zero, because the loss is zero. Perturbing the layer weights results in nonzero gradients.

We have seen how the operators of a language may be incrementally extended to include gradient operators, and how this results in an automatically differentiable language. In the follow-up to this article, I will show how `jacoblsp`

can be extended to include gradient descent with learnable parameters, and explore how to compute efficient gradients on non-treelike compute graphs.

The sources of this blog are available at Github.
If you have any comments, corrections, or questions, please get in touch with me - I'd love to chat.