Demystifying Automatic Differentiation, Part 1

Recent results in deep and convolutional neural networks with powerful industrial applications have led to widespread adoption of these techniques outside of academia. One key advance has been the introduction of automatic differentiation, which allows developers and scientists to write code using standard numerical routines and functions that can then be differentiated, for instance to train neural networks or other machine learning models. Many developers working with neural networks today, however, do not have a strong understanding of how automatic differentiation works "under the hood," or how it differs from symbolic and numerical differentiation. In this article, using the style of literate programming, I will build some toy domain-specific languages that model how automatic differentiation works in popular libraries like TensorFlow and PyTorch.

Background

Please note that this is not a primer on automatic differentiation itself or the underlying theory. There are many high-quality sources on these topics, which I will not hope to outdo. The purpose of this article is to show, to those familiar with the concepts of AD but not the implementation, how simple an implementation of automatic differentiation can be.

For this reason I will present minimal background on AD and the alternatives and invite readers to search out these topics themselves.

There are roughly three ways to perform mathematical differentiation that most people are aware of:

Symbolic differentiation

This is the application of rules on abstract symbols to analytically calculate formulas for the derivative of $f$ at $x$ (where $x$ may be high dimensional).

Symbolic differentiation involves a computer algebra system (CAS). Some people differentiate between symbolic differentiation and "manual" differentiation, where the latter is the same essential procedure as the former, but is performed, generally on paper, by a human.

In practice, this would be something like a CAS looking up a rule in a table to convert an expression $y = x^2$, consisting of exponentiation of a scalar value, to a derivative expression $\frac{dy}{dx} = 2x$, without ever considering the value of x itself. Often, this would occur without even considering the value of 2. Frege, at that point, is rolling in his grave. Alternatively, this would consist of a very bored 16-year-old sitting in math class looking up this rule in a table to generate the second expression.
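
If you want to see this rule lookup in action, the SymPy library performs exactly this kind of table-driven rewriting. A small sketch, assuming SymPy is installed:

import sympy

x = sympy.symbols('x')
sympy.diff(x ** 2, x)  # returns the expression 2*x, never touching a concrete value of x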

Numerical differentiation

This is the computation of an approximate derivative of $f$ at $x$ using numerical algorithms.

The most common techniques are forms of finite difference estimation, which consists of perturbing the value of $x$ and analyzing the resultant change in $f(x)$. Again, this can be done by massive linear algebra algorithms, or bored 16-year-olds.

While this technique seems hacky, it's commonly used in engineering and science. It's important to keep in mind that we are working with floating point numbers of finite precision in most cases, and sometimes a "hacky" method can max out our precision. At the same time, numerical differentiation requires significant attention to floating point arithmetic and error explosion.

Here, the derivative is computed from concrete values, and we do not obtain any structured expression.
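
To make this concrete, here is a minimal central-difference estimator; the function name and step size are my own choices, not a standard API:

def central_difference(f, x, h=1e-6):
    # Perturb x in both directions and measure the change in f.
    return (f(x + h) - f(x - h)) / (2 * h)

central_difference(lambda x: x ** 2, 3.0)  # approximately 6.0, exact only up to floating point error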

Automatic differentiation

This is the computation of an exact (to machine arithmetic precision) derivative of $f$ at $x$ by analyzing the computation of $f(x)$. Rather than analyzing symbols that express mathematical operations (as in symbolic differentiation), or perturbing computations that represent mathematical operations, we analyze a program that computes a function $f$. This analysis is far simpler than that of symbolic algebra expressions, and can be formalized into predictable algorithms and procedures.

The concept behind automatic differentiation is simple: every operation that our program is allowed to perform is shadowed by an operation that can compute the derivative of that operation. In practice, we record operations on values, and then use the "shadow" operations to compute derivatives of these operations using the chain rule. You can imagine all of the functions in your programming environment being doubled in this way. With the right abstraction layer, a normally written program can be made differentiable by an interpreter.
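
As a tiny illustration of this "shadowing" idea, here is a forward-mode sketch using dual numbers (my own toy class, distinct from the reverse-mode approach built later in this article):

class Dual:
    """Each value carries its derivative; each operation shadows its derivative rule."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv
    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)
    def __mul__(self, other):
        # The product rule is the "shadow" of multiplication.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

x = Dual(3.0, 1.0)       # seed dx/dx = 1
y = x * x + Dual(2.0)    # y = x^2 + 2
print(y.value, y.deriv)  # 11.0 6.0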

We could formalize this notion in full generality using Turing machines and differentiating operations on the tape, but this wouldn't be a very useful kind of differentiation in practice. In most cases, automatic differentiation is defined over a subset of a common programming language or library, or for a domain-specific language. By restricting the domain of AD to operations for which we can define efficient derivative operations, we gain a lot of expressive power and efficiency. An example is autograd, which replaces large portions of numpy and scipy with differentiable operations.

Lsps

To demonstrate the mechanics of AD, we are going to build several languages.

Each of these languages is going to be an "lsp": not quite a Lisp. Don't be fooled: they are missing some key features that most Lisps have. They aren't even all Turing complete. Lisp syntax is used because it's easy to parse, and parsing is not the main interest here.

Each lsp is defined by the tuple $l = \langle \mathcal V, \mathcal O, \mathcal A, \mathcal E \rangle$, which includes:

  • A value type $\mathcal V$, the set of valid computational results in the language.
  • An operator type $\mathcal O$, the set of valid transformations $F: \mathcal V \to \mathcal V$.
  • An atomizer $\mathcal A:\mathcal S \to \mathcal V \cup \mathcal O$, converting atomic tokens from the set of valid tokens $\mathcal S$ (generally a set of strings without parens or whitespace) into value and operator type instances
  • An environment $\mathcal E$ that contains predefined items in $\mathcal V \cup \mathcal O$. By default the environment is static, but it could be made dynamic in some lsps (though not all of those defined in this article).

The syntax rules are very similar to standard Lisp:

  1. If the atomizer accepts an expression, that expression is valid.

  2. If an expression $e$

    • consists of a pair of matched parens with at least two subexpressions (a, b ...) separated by whitespace, and
    • every subexpression contained in $e$ is valid, and
    • the first subexpression of $e$, a, evaluates to an operator type for this lsp,

    then the whole expression is valid.

All other expressions are invalid.

We can parse an lsp quite easily, by tokenizing the input string and recursively processing the contents of parentheses. Parsing a tokenized string consists of converting all tokens except parens to the lsp's operator and value types, and then using the paren structure to create a nested list with identical structure.

In [1]:
import operator
from functools import reduce
import re
import math
from pprint import pprint
import numpy as np
In [2]:
class Lsp(object):
    open_paren = re.compile(r"\(")
    close_paren = re.compile(r"\)")
    splitter = re.compile(r"\s+")

    def __init__(self, value_type, operator_type, atomizer, environment):
        self.value_type = value_type
        self.operator_type = operator_type
        self.atomizer = atomizer
        self.environment = environment

    def parse_helper(self, tokens):
        token = tokens.pop(0)
        if token == "(":
            # Recursively collect subexpressions until the matching close paren.
            els = []
            while tokens[0] != ")":
                next_expr, tokens = self.parse_helper(tokens)
                els += [next_expr]

            return els, tokens[1:]  # drop the close paren
        elif token == ")":
            raise ValueError("Unexpected close paren")
        else:
            return self.atomize(token), tokens

    def parse(self, tokens):
        parsed, _ = self.parse_helper(tokens)
        return parsed

    def atomize(self, token):
        # Environment lookups take precedence over the atomizer.
        if token in self.environment.keys():
            return self.environment[token]
        return self.atomizer(token)

    def tokenize(self, code_str):
        # Pad parens with spaces, then split on whitespace.
        return list(filter(
            lambda x: x != '',
            Lsp.splitter.split(Lsp.close_paren.sub(" ) ", Lsp.open_paren.sub(" ( ", code_str)))
        ))

We can evaluate an lsp expression by recursively evaluating the subexpressions from the "bottom up".

In [3]:
class EvalLsp(Lsp):
    """An lsp that may be evaluated."""
    def eval_stack(self, stack):
        if isinstance(stack, self.value_type) or isinstance(stack, self.operator_type):
            # atom
            return stack

        fn = stack[0]

        if isinstance(fn, list):
            # nested function: evaluate the head expression first
            return self.eval_stack([self.eval_stack(stack[0])] + stack[1:])

        if isinstance(fn, self.operator_type):
            # eval this function on the args
            return fn(*map(self.eval_stack, stack[1:]))

        raise ValueError(f"Unparsable value {fn}")

    def eval_str(self, code_str):
        raise NotImplementedError()

lsplsp

Lsplsp is about the simplest lsp language that we can make from this definition that isn't degenerate. It can run a few functions on Python ints.

In [4]:
class SimpleLsp(EvalLsp):
    """An Lsp that simply evaluates an expression and returns the result."""
    def eval_str(self, code_str):
        stack = self.parse(self.tokenize(code_str))
        evaled = self.eval_stack(stack)
        return evaled
In [5]:
LspLspValue = int
class LspLspOperator(object):
    """An lsplsp operator is just a Python function annotated with a symbol."""
    def __init__(self, symbol, function):
        self.symbol = symbol
        self.function = function
    def __call__(self, *args):
        return self.function(*args)
    
In [6]:
int_token = re.compile(r"^\d+$")

def lsplsp_atomizer(atom):
    """Lsplsp attempts to evaluate anything it can't find in its environment as a Python int."""
    if int_token.match(atom):
        return int(atom)
    raise ValueError(atom)
In [7]:
lsplsp_sum = LspLspOperator('+', lambda *args: sum(args))
lsplsp_diff = LspLspOperator('-', lambda *args: args[0] - sum(args[1:]))
In [8]:
lsplsp = SimpleLsp(
    LspLspValue,
    LspLspOperator,
    lsplsp_atomizer,
    { f.symbol: f for f in [lsplsp_sum, lsplsp_diff] }
)
In [9]:
lsplsp.eval_str("(+ 5 2 (- 6 3))")
Out[9]:
10
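
To see the parsing stages separately (the operator objects have no custom repr, so I show them schematically in the comments):

lsplsp.tokenize("(+ 5 2 (- 6 3))")
# ['(', '+', '5', '2', '(', '-', '6', '3', ')', ')']

lsplsp.parse(lsplsp.tokenize("(+ 5 2 (- 6 3))"))
# [<+ operator>, 5, 2, [<- operator>, 6, 3]] -- a nested list mirroring the parens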

Forward-backward lsps

The rest of the lsps we're going to take a look at are "forward-backward lsps," or fb-lsps for short.

There are two special characteristics of fb-lsps:

  1. Evaluation of an fb-lsp expression returns all the intermediate results in a nested structure.
  2. Evaluation of an fb-lsp expression produces two results: a 'forward' result and a 'backward' result.

Exactly what these two results will be used for is an inscrutable mystery for now...

We can recycle lsplsp as fblsplsp, an fb-lsp that simply returns the same value in F and B results. It's cheating a little to call this an fb-lsp, and it doesn't provide us with anything new or useful, but we'll do it anyway.

In [10]:
class FBCheatLsp(EvalLsp):
    """An Lsp that returns the same value in the forward and backward results."""
    def eval_str(self, code_str):
        stack = self.parse(self.tokenize(code_str))
        evaled = self.eval_stack(stack)
        return (
            evaled,
            evaled
        )
In [11]:
FBLspLspValue = int

class FBLspLspOperator(object):
    """An fblsplsp function returns the argument values in a tuple with the result."""
    def __init__(self, symbol, function):
        self.symbol = symbol
        self.function = function
    def __call__(self, *args):
        return (self.function(*(arg if isinstance(arg, FBLspLspValue) else arg[0] for arg in args)), args)
In [12]:
fblsplsp_sum = FBLspLspOperator('+', lambda *args: sum(args))
fblsplsp_diff = FBLspLspOperator('-', lambda *args: args[0] - sum(args[1:]))
In [13]:
fblsplsp = FBCheatLsp(
    FBLspLspValue,
    FBLspLspOperator,
    lsplsp_atomizer,
    { f.symbol: f for f in [fblsplsp_sum, fblsplsp_diff] }
)

When we evaluate an fblsplsp expression, we get a tuple of the forward and backward results. Each element of the tuple is a nested tuple, forming a tree of intermediate results.

In [14]:
pprint(fblsplsp.eval_str("(+ 5 2 (- 6 3 3))"))
((7, (5, 2, (0, (6, 3, 3)))), (7, (5, 2, (0, (6, 3, 3)))))

Generally, our fb-lsps will compute two different values for the forward and backward result, with some shared computation. We can implement these as follows:

In [15]:
class FBHookLsp(EvalLsp):
    """An Lsp that computes the forward and backward results using a hook from stack evaluation."""
    def eval_str(self, code_str):
        stack = self.parse(self.tokenize(code_str))
        evaled = self.eval_stack(stack)
        return (
            evaled.forward(),
            evaled.backward()
        )

gradlsp

Time to get back on topic: automatic differentiation.

gradlsp is an fb-lsp that returns a graph of conventional computation (just like fblsplsp) in the forward result, and returns a graph of gradients taken with respect to the final result in the backward result. We will define gradlsp to work over the domain of real numbers (approximated as Python floating-point values).

In [16]:
int_token = re.compile(r"^\d+$")
float_token = re.compile(r"^\d+(\.\d*)$")

def gradlsp_atomizer(token):
    if int_token.match(token):
        return GradLspValue(float(token))
    if float_token.match(token):
        return GradLspValue(float(token))
    raise ValueError(token)

gradlsp implements "reverse-mode" automatic differentiation. Any operation in gradlsp that takes arguments $a$ and returns a value $v$ records, in $v$, the information necessary to compute the gradient of $v$ (or of any value computed by further operations on $v$) with respect to each argument in $a$, as well as $a$'s ancestor arguments.

The information stored in each value $v$ is the complete graph of previous values leading to $v$'s computation, as well as the gradient of the function $f$ that produced $v$ at the point of evaluation.

In general, computing the gradients of all subresults from one of these graphs requires a topological sort, so that each value's adjoint is fully accumulated before it is propagated to the value's ancestors.

However, gradlsp doesn't have any variables. Every operation on a value in gradlsp is pure with respect to the boxed value - it does not modify the original value. Furthermore, all operations are pure with respect to the value's gradient graph. Even the detach operation simply returns a new value with a degenerate graph. This ensures that the gradient graph is a DAG: a cycle in the graph would require a changed (impure) value. Because there are no cycles, we can treat the gradient graph as if it were a tree. This is very inefficient: if the DAG isn't treelike, we are performing wasted computations.

In [17]:
class GradLspValue(object):
    def __init__(self, value, gradient = [], children = [], symbol = None, operator_symbol=None):
        self.gradient = gradient
        self.children = children
        self.value = value
        self.symbol = symbol
        self.operator_symbol = operator_symbol
        
    def __repr__(self):
        return f"<GLV {self.value}, gradient={self.gradient}, with {len(self.children)} children>"
        
    def forward(self):
        return (self.value, self.operator_symbol or '?') + ((
            [child.forward() for child in self.children],
        ) if len(self.children) > 0 else tuple())
    
    def backward(self, adjoint=1):
        if len(self.children) == 0:
            return adjoint
        return (adjoint, self.operator_symbol or '?') + ((
            [child.backward(adjoint * derivative) 
             for child, derivative in zip(self.children, self.gradient)],
        ) if len(self.children) > 0 else tuple())

Furthermore, in gradlsp, we require every operator be scalar-valued. For this reason, the compute DAG for any value is by necessity a tree: every operation has only one parent. This decision reduces the power of gradlsp, as we will see soon.

In [18]:
class GradLspOperator(object):
    def __init__(self, symbol, function, grad_function):
        self.function = function
        self.grad_function = grad_function
        self.symbol = symbol
        
    def __repr__(self):
        return f"<GLF {self.symbol}>"
        
    def __call__(self, *args):
        result = GradLspValue(
            self.function(*map(lambda glv: glv.value, args)),
            self.grad_function(*map(lambda glv: glv.value, args)),
            args,
            operator_symbol=self.symbol
        )
        
        return result

When computing the backward result, we propagate the gradient of the final value $y$ backwards. This consists of repeatedly multiplying the incoming adjoint by each operation's local derivatives. At each intermediate value $v$, we obtain an adjoint value $\frac{dy}{dv}$: the derivative of the final result $y$ with respect to $v$.
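
Concretely, for $y = u \cdot w$ with $u = a \cdot b$: the adjoint of $u$ is $\frac{dy}{du} = w$, and the adjoint of $a$ follows by one more multiplication along the chain, $\frac{dy}{da} = \frac{dy}{du} \cdot \frac{du}{da} = w \cdot b$. This is exactly the recursion performed by backward above.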

In [19]:
plus = GradLspOperator(
    '+',
    lambda *args: sum(args),
    lambda *args: [1,] * len(args)
)
In [20]:
minus = GradLspOperator(
    '-',
    lambda *args: args[0] - sum(args[1:]),
    lambda *args: [1,] + [-1,] * (len(args) - 1)
)
In [21]:
prod = GradLspOperator(
    '*',
    lambda *args: reduce(operator.mul, args, 1),
    lambda *args: [reduce(operator.mul, (args[:i] + args[i+1:]), 1) for i in range(len(args))]
)
In [22]:
quot = GradLspOperator(
    '/',
    lambda x, y: x / y,
    lambda x, y: [1/y, -(x/(math.pow(y, 2)))]
)
In [23]:
sin = GradLspOperator(
    'sin',
    lambda x: math.sin(x),
    lambda x: [math.cos(x)]
)
In [24]:
cos = GradLspOperator(
    'cos',
    lambda x: math.cos(x),
    lambda x: [-math.sin(x)]
)
In [25]:
detach = GradLspOperator(
    '$',
    lambda x: x,
    lambda x: [0]  # propagate a zero adjoint: the subexpression is held constant
)

The detach operation ($ foo) returns a value equal to foo, but cuts foo's compute graph off from gradient propagation: during the backward pass, the adjoint passed into foo's subgraph is zero. The detached value, though it may depend on other values in actuality, is "held constant" from the standpoint of the gradient computation. (We will check this behavior once gradlsp is assembled below.)

In [26]:
pi = GradLspValue(math.pi, symbol='pi')
In [27]:
gradlsp = FBHookLsp(
    GradLspValue, 
    GradLspOperator, 
    gradlsp_atomizer, 
    { f.symbol: f for f in [plus, minus, prod, quot, sin, cos, pi, detach] }
)
In [28]:
pprint(gradlsp.eval_str("(* (* 3 2) (cos pi))"))
((-6.0,
  '*',
  [(6.0, '*', [(3.0, '?'), (2.0, '?')]),
   (-1.0, 'cos', [(3.141592653589793, '?')])]),
 (1, '*', [(-1.0, '*', [-2.0, -3.0]), (6.0, 'cos', [-7.347880794884119e-16])]))
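
As promised, we can check that a detached subexpression is held constant: with the zero-adjoint detach defined above, everything under $ receives a zero gradient (output shown as I computed it by hand):

pprint(gradlsp.eval_str("(* 3 ($ 2))"))
# ((6.0, '*', [(3.0, '?'), (2.0, '$', [(2.0, '?')])]),
#  (1, '*', [2.0, (3.0, '$', [0.0])]))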

jacoblsp

gradlsp isn't very useful. It works on scalar-valued functions only. It's hard to implement more than a best-fit line with it.

jacoblsp is our next step. Values in jacoblsp are tensors of arbitrary size (though many functions only accept 1D or 2D tensors).

An operator in jacoblsp can express a function $\mathbb R^n \to \mathbb R^m$. Correspondingly, instead of the gradient, jacoblsp computes the Jacobian tensor for each operation. Instead of storing the local Jacobian itself in each value, and multiplying by the appropriate adjoint in the backward pass, we store a function that takes the adjoint and returns adjoint matrices for each of the generating function's arguments. This allows us to minimize space use in some operators, and concisely express others.
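
This is the familiar vector-Jacobian product trick: for an elementwise operation like $y = \sin(x)$, rather than materializing the diagonal Jacobian $\mathrm{diag}(\cos x)$, we store the closure $g \mapsto g \odot \cos x$, which applies that Jacobian to an adjoint $g$ directly.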

Each value in jacoblsp is backed by an n-dimensional numpy array. An array is expressed as a literal 1,2,3. Trailing commas are allowed, but spaces are not. For instance, (*. 1,2,3 4,5,6) will compute the dot product of $[1,2,3]$ with $[4,5,6]$. Note that every value in jacoblsp is an array: scalar values are simply arrays of size one.

Note that jacoblsp is still very inefficient. We compute every subtree of the compute DAG for every backwards operation. If our DAG isn't treelike, we are performing wasted computations. It's easier to express combinatorially complex programs in jacoblsp than it was in gradlsp, so this problem is more worrying. However, we still have a treelike DAG, assuming that individual gradient functions do not branch.

In [29]:
class JacobLspValue(object):
    def __init__(self, value, adjoint_fns = None, children=[], shape=None, symbol = None, operator_symbol=None):
        self.adjoint_fns = adjoint_fns if adjoint_fns else [(lambda x: x) for c in children]
        self.children = children
        self.value = value
        self.symbol = symbol
        self.operator_symbol = operator_symbol
        self.shape = shape if shape else self.value.shape
    def __repr__(self):
        return f"<JLV {self.value}, with {len(self.children)} children>"
    
    def forward(self):
        if not self.children:
            return (self.value,)
        return (
            self.value, 
            self.operator_symbol or '?', 
            [child.forward() for child in self.children]
        )
    
    def backward(self, adjoint=None):
        if adjoint is None:
            adjoint = self.value
        
        if not self.children:
            return (adjoint,)
        
        return (
            adjoint, 
            self.operator_symbol or '?', 
            [child.backward(adjoint_fn(adjoint)) 
             for adjoint_fn, child 
             in zip(self.adjoint_fns, self.children)
            ]
        )
               
In [30]:
class JacobLspOperator(object):
    def __init__(self, symbol, function, max_args=None, max_dims=None):
        self.function = function
        self.symbol = symbol
        self.max_args = max_args
        self.max_dims = max_dims
        
    def __repr__(self):
        return f"<JLF {self.symbol}>"
        
    def __call__(self, *args):
        if self.max_args is not None and len(args) > self.max_args:
            raise ValueError(f"{self.symbol}: too many arguments")
        if self.max_dims and any(np.ndim(arg.value) > max_dim for arg, max_dim in zip(args, self.max_dims)):
            raise ValueError(f"{self.symbol}: argument has too many dimensions")
        arg_values = list(map(lambda jlv: jlv.value, args))
        output_value, adjoint_fns = self.function(*arg_values)
        result = JacobLspValue(
            output_value,
            adjoint_fns,
            args if adjoint_fns else [],
            operator_symbol=self.symbol
        )
        
        return result
In [31]:
int_arr_token = re.compile(r"^-?(\d+,)*-?\d+,?$")
float_arr_token = re.compile(r"^(-?\d+(\.\d*)?,)*-?\d+(\.\d*)?,?$")

def jacoblsp_atomizer(token):
    if int_arr_token.match(token):
        return JacobLspValue(np.array(list(map(int, filter(bool, token.split(',')))), dtype='int'))
    if float_arr_token.match(token):
        return JacobLspValue(np.array(list(map(float, filter(bool, token.split(',')))), dtype='float'))
    raise ValueError(token)
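
A couple of examples of the literal syntax, with the outputs shown via JacobLspValue's repr:

jacoblsp_atomizer("1,2,3")   # <JLV [1 2 3], with 0 children>
jacoblsp_atomizer("2.5,")    # <JLV [2.5], with 0 children> -- float dtype, trailing comma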
In [32]:
jplus = JacobLspOperator(
    '+',
    lambda *args: (
        np.stack(args).sum(axis=0),
        [(lambda g: g) for a in args]
    )
)
In [33]:
jminus = JacobLspOperator(
    '-',
    lambda *args: (
        args[0] - np.stack(args[1:]).sum(axis=0),
        [lambda g: g] + [(lambda g: -g) for a in args[1:]]
    )
)
In [34]:
jprod = JacobLspOperator(
    '*',
    lambda *args: (
        np.stack(args).prod(axis=0),
        # Bind i eagerly (as a default argument) to avoid Python's late-binding closure pitfall;
        # the adjoint of each argument is g times the elementwise product of the other arguments.
        [(lambda g, i=i: g * np.stack(args[:i] + args[i+1:]).prod(axis=0)) for i in range(len(args))]
    )
)
In [35]:
# Adjoint (vector-Jacobian) rules for each dot-product case, dispatched on ndim.
matvecdot = (lambda A, x: [
    lambda g: np.tensordot(g, x, 0),
    lambda g: np.tensordot(g, A, [-1, 0])
])

vecmatdot = (lambda a, B: [
    lambda g: np.tensordot(g, B.T, 1),
    lambda g: np.tensordot(a, g, 0)
])

vecvecdot = (lambda a, b: [
    lambda g: g * b,
    lambda g: g * a,
])

matmatdot = (lambda A, B: [
    lambda g: np.tensordot(g, B.T, 1),
    lambda g: np.tensordot(g, A, [-2, 0]).T
])

selectdot = ({
    (1,1): vecvecdot,
    (2,1): matvecdot,
    (1,2): vecmatdot,
    (2,2): matmatdot,
})
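
These adjoint functions follow the standard matrix-calculus identities for dot products: for $Y = AB$, the adjoints are $\bar A = \bar Y B^\top$ and $\bar B = A^\top \bar Y$; the vector cases are the corresponding specializations (for $y = a \cdot b$, $\bar a = \bar y\, b$ and $\bar b = \bar y\, a$).
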
In [36]:
jdot = JacobLspOperator(
    '*.',
    lambda A, B: (
        np.dot(A, B),
        selectdot[A.ndim, B.ndim](A,B)
    ),
    max_args = 2,
    max_dims = [2, 2]
)
In [37]:
jquot = JacobLspOperator(
    '/',
    lambda x, y: (
        x / y,
        # Elementwise division: the adjoints scale g elementwise (np.power, not math.pow, on arrays).
        [lambda g: g / y, lambda g: -g * x / np.power(y, 2)],
    ),
    max_args = 2
)
In [38]:
jcos = JacobLspOperator(
    'cos',
    lambda x: (
        np.cos(x),
        [lambda g: g * -np.sin(x)]  # elementwise, so the adjoint scales g elementwise
    ),
    max_args = 1
)
In [39]:
jsin = JacobLspOperator(
    'sin',
    lambda x: (
        np.sin(x),
        [lambda g: g * np.cos(x)],
    ),
    max_args = 1
)
In [40]:
jfill = JacobLspOperator(
    'full',
    lambda fill, *shp: (
        np.full(shp[0] if len(shp) == 1 else shp, fill),
        None,
    )
)
In [41]:
relu = JacobLspOperator(
    'relu',
    lambda x: (
        np.where(x < 0, 0, x),
        [lambda g: (g * np.where(x < 0, 0, 1)),],
    ),
    max_args=1
)
In [42]:
stack = JacobLspOperator(
    'stack',
    # Bind i eagerly so each argument receives its own slice of the adjoint.
    lambda *args: (np.stack(args), [lambda g, i=i: g[i] for i in range(len(args))])
)
In [43]:
transpose = JacobLspOperator(
    'T',
    lambda arg: (arg.T, [lambda g: g.T]) if len(arg.shape) == 2 else (arg.reshape((1,-1)), [lambda g: g.reshape(1,-1)]),
    max_args = 1,
    max_dims = [2]
)
In [44]:
jacoblsp = FBHookLsp(
    JacobLspValue,
    JacobLspOperator,
    jacoblsp_atomizer,
    { f.symbol: f for f in [jplus, jminus, jprod, jquot, jsin, jcos, jdot, jfill, relu, stack, transpose] }
)

We now have all the building blocks to implement a neural network in jacoblsp, and take the network's gradient.

As an example, we can build a classification network, which will compute the binary XOR of two inputs (with an appended bias term). It will consist of two layers with a single nonlinear activation. Using the inputs $(1,1), (0,1), (1,0), (0,0)$, we would expect to obtain the outputs $0, 1, 1, 0$. This is a toy problem, for which layer weights can be given a priori, with no training necessary. The network is implemented as follows:
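
In matrix form, writing the four bias-augmented inputs as the rows of $X$, the expression below computes

$$y = w_2 \,\mathrm{relu}(W_1 X^\top), \qquad W_1 = \begin{pmatrix} 1 & 1 & -1 \\ 1 & 1 & 0 \end{pmatrix}, \qquad w_2 = \begin{pmatrix} -2 & 1 \end{pmatrix}.$$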

In [45]:
pprint(jacoblsp.eval_str('(*. (stack -2,1) (relu (*. (stack 1,1,-1 1,1,0) (T (stack 1,1,1 0,1,1 1,0,1 0,0,1)))))'))
((array([[0, 1, 1, 0]]),
  '*.',
  [(array([[-2,  1]]), 'stack', [(array([-2,  1]),)]),
   (array([[1, 0, 0, 0],
       [2, 1, 1, 0]]),
    'relu',
    [(array([[ 1,  0,  0, -1],
       [ 2,  1,  1,  0]]),
      '*.',
      [(array([[ 1,  1, -1],
       [ 1,  1,  0]]),
        'stack',
        [(array([ 1,  1, -1]),), (array([1, 1, 0]),)]),
       (array([[1, 0, 1, 0],
       [1, 1, 0, 0],
       [1, 1, 1, 1]]),
        'T',
        [(array([[1, 1, 1],
       [0, 1, 1],
       [1, 0, 1],
       [0, 0, 1]]),
          'stack',
          [(array([1, 1, 1]),),
           (array([0, 1, 1]),),
           (array([1, 0, 1]),),
           (array([0, 0, 1]),)])])])])]),
 (array([[0, 1, 1, 0]]),
  '*.',
  [(array([[0, 2]]), 'stack', [(array([0, 2]),)]),
   (array([[ 0, -2, -2,  0],
       [ 0,  1,  1,  0]]),
    'relu',
    [(array([[ 0, -2, -2,  0],
       [ 0,  1,  1,  0]]),
      '*.',
      [(array([[-2, -2, -4],
       [ 1,  1,  2]]),
        'stack',
        [(array([-2, -2, -4]),), (array([1, 1, 2]),)]),
       (array([[ 0, -1, -1,  0],
       [ 0, -1, -1,  0],
       [ 0,  2,  2,  0]]),
        'T',
        [(array([[ 0,  0,  0],
       [-1, -1,  2],
       [-1, -1,  2],
       [ 0,  0,  0]]),
          'stack',
          [(array([0, 0, 0]),),
           (array([-1, -1,  2]),),
           (array([-1, -1,  2]),),
           (array([0, 0, 0]),)])])])])]))

We can see that jacoblsp has computed the correct result, along with a gradient for each of the intermediate values. In particular, we can see that it has computed the gradients for the layer weights. Appending a loss term (here, simply the residual between the target and the output):

In [46]:
pprint(jacoblsp.eval_str('(- (T 0,1,1,0) (*. (stack -2,1) (relu (*. (stack 1,1,-1 1,1,0) (T (stack 1,1,1 0,1,1 1,0,1 0,0,1))))))'))
((array([[0, 0, 0, 0]]),
  '-',
  [(array([[0, 1, 1, 0]]), 'T', [(array([0, 1, 1, 0]),)]),
   (array([[0, 1, 1, 0]]),
    '*.',
    [(array([[-2,  1]]), 'stack', [(array([-2,  1]),)]),
     (array([[1, 0, 0, 0],
       [2, 1, 1, 0]]),
      'relu',
      [(array([[ 1,  0,  0, -1],
       [ 2,  1,  1,  0]]),
        '*.',
        [(array([[ 1,  1, -1],
       [ 1,  1,  0]]),
          'stack',
          [(array([ 1,  1, -1]),), (array([1, 1, 0]),)]),
         (array([[1, 0, 1, 0],
       [1, 1, 0, 0],
       [1, 1, 1, 1]]),
          'T',
          [(array([[1, 1, 1],
       [0, 1, 1],
       [1, 0, 1],
       [0, 0, 1]]),
            'stack',
            [(array([1, 1, 1]),),
             (array([0, 1, 1]),),
             (array([1, 0, 1]),),
             (array([0, 0, 1]),)])])])])])]),
 (array([[0, 0, 0, 0]]),
  '-',
  [(array([[0, 0, 0, 0]]), 'T', [(array([[0, 0, 0, 0]]),)]),
   (array([[0, 0, 0, 0]]),
    '*.',
    [(array([[0, 0]]), 'stack', [(array([0, 0]),)]),
     (array([[0, 0, 0, 0],
       [0, 0, 0, 0]]),
      'relu',
      [(array([[0, 0, 0, 0],
       [0, 0, 0, 0]]),
        '*.',
        [(array([[0, 0, 0],
       [0, 0, 0]]),
          'stack',
          [(array([0, 0, 0]),), (array([0, 0, 0]),)]),
         (array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]),
          'T',
          [(array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]]),
            'stack',
            [(array([0, 0, 0]),),
             (array([0, 0, 0]),),
             (array([0, 0, 0]),),
             (array([0, 0, 0]),)])])])])])]))

Here, the gradient is zero, because the loss is zero. Perturbing the layer weights results in nonzero gradients.

We have seen how the operators of a language may be incrementally extended to include gradient operators, and how this results in an automatically differentiable language. In the follow-up to this article, I will show how jacoblsp can be extended to include gradient descent with learnable parameters, and explore how to compute efficient gradients on non-treelike compute graphs.

The sources of this blog are available on GitHub. If you have any comments, corrections, or questions, please get in touch with me - I'd love to chat.