# Reductions in representation learning with rate-distortion theory

In lab meeting this week, we discussed unsupervised learning in the context of deep generative models, namely $\beta$-variational auto-encoders ($\beta$-VAEs), drawing from the original, Higgins et al. 2017 (ICLR), and its follow-up, Burgess et al. 2018. The classic VAE represents a clever approach to learning highly expressive generative models, defined by a deep neural network that transforms samples from a standard normal distribution to some distribution of interest (e.g., natural images).  Technically, VAE training seeks to maximize a lower bound on the likelihood $p_\theta(x) = \int p_\theta(x\mid z) p(z) dz$, where $p_\theta(x|z)$ defines the generative mapping from latents $z$ to data $x$. This “evidence lower bound” (ELBO) depends on a variational approximation to the posterior, $q_\phi(z\mid x)$, which is also parametrized by a deep neural network (the so-called “encoder”).

A crucial drawback to the classic VAE, however, is that the learned latent representations tend to lack interpretability. The $\beta$-VAE seeks to overcome this limitation by learning “disentangled” representations, in which single latents are sensitive to single generative factors in the data and relatively invariant to others (Bengio et al. 2013). I would call these “intuitively robust” — rotating an apple (orientation) shouldn’t make its latent representation any less red (color) or any less fruity (type). To overcome this challenge, $\beta$-VAEs optimize a modified ELBO given by:

$\underset{\theta,\phi}{\text{maximize}}\:\:\mathbb{E}_{q_\phi(z\mid x)}\left[\log p_\theta(x\mid z)\right]-\beta D_{KL}(q_\phi(z\mid x)\Vert\, p(z))$

with and standard VAEs corresponding to $\beta=1$. The new hyperparameter $\beta$ controls the optimization’s tension between maximizing the data likelihood and limiting the expressiveness of the variational posterior relative to a fixed latent prior $p(z)=\mathcal{N}(0,I)$.

Recent work has been interested in tuning the latent representations of deep generative models (Adversarial Autoencoders (Makhzani et al. 2016), InfoGANS (Chen et al. 2016), Total Correlation VAEs (Chen et al. 2019), among others), but the generalization used by $\beta$-VAEs in particular looked somehow familiar to me. This is because $\beta$-VAEs recapitulate the classical rate-distortion theory problem. This was observed briefly also in recent work by Alemi et al. 2018, but I would like to elaborate and show explicitly how $\beta$-VAEs are reducible to a distortion-rate minimization using deep generative models.

Rate-distortion theory is a theoretical framework for lossy data compression through a noisy channel. This fundamental problem in information theory balances the minimum permissible amount of information (in bits) transmitted across the channel, the “rate”, against the corruption of the original signal, a penalty measured by a “distortion” function $d(x,z)$. Our terminology changes, but the fundamental problem is the same; I made that comparison as obvious as possible in the figure below.

Derivation. Given a dataset $\mathcal{D}$ with a distribution $p^*(x)$, define any statistical mapping $q_\phi(z\mid x)$ that encodes $x$ into a code $z$. Note that $q_\phi$ is just an encoder, and together they induce a joint distribution $p(x,z)=q_\phi(z\mid x)p^*(x)$ with a marginal $p(z)=\int dx\, q_\phi(z\mid x)p^*(x)$. The distortion-rate optimization would minimize distortion $d(\cdot,\cdot)$ subject to a maximum rate $R$, i.e.

$\underset{q_\phi(z\mid x),p(z)}{\text{minimize}}\:\:\mathbb{E}_{p(x,z)}[d(x,z)]\:\:\text{subject to}\:\: I(x,z)\le R$

$\Longrightarrow \underset{q_\phi(z\mid x),p(z)}{\text{minimize}}\:\:\mathbb{E}_{p(x,z)}[d(x,z)]-\beta I(x,z)$

Consider first the mutual information. We leverage a more tractable upper bound with

$I(x,z)=\int dx\,p^*(x)\int dz\,q_\phi(z\mid x)\log\frac{q_\phi(z\mid x)}{p(z)}\le \int dx\,p^*(x)\int dz\,q_\phi(z\mid x)\log\frac{q_\phi(z\mid x)}{m(z)}$

$\text{since}\:\: D_{KL}\left(p(z)\Vert\, m(z)\right)\Longrightarrow -\int dz\,p(z)\log p(z) \le -\int dz\,p(z)\log m(z)$

We’ve replaced the marginal $p(z)$ induced by our choice of encoder $q_\phi$ with another distribution $m(z)$ that makes the optimization more tractable, e.g. $\mathcal{N}(0,I)$ in the VAE. Our objective can be rewritten as

$\underset{q_\phi(z\mid x),m(z)}{\text{minimize}}\:\:\mathbb{E}_{p(x,z)}[d(x,z)]-\beta\, \mathbb{E}_{x\sim\mathcal{D}}\left[D_{KL}(q_\phi(z\mid x)\Vert\, m(z))\right]$

Suppose the distortion of interest is posterior density (mis)estimation, $d(x,z)=-\log p_\theta(x\mid z)$. Such a function penalizes representations $z$ from which we cannot regenerate an observed data vector $x$ through the decoding network $p_\theta$ with high probability. A typical distortion-rate problem would fix the distortion function, but we choose to learn this decoder. We can optimize the objective for each $x$ to eliminate the outer expectation over the data $\mathcal{D}$, fix $m(z)=\mathcal{N}(0,I)$, and recover the $\beta$-VAE objective precisely:

$\underset{q_\phi(z\mid x),p_\theta(x\mid z)}{\text{maximize}}\:\:\mathbb{E}_{q_\phi(z\mid x)}\left[\log p_\theta(x\mid z)\right]-\beta\, D_{KL}(q_\phi(z\mid x)\Vert\, m(z))$

When $\beta> 1$, our optimization prioritizes minimizing the second term (rate) over maximizing the first one (distortion). In this sense, the authors’ argument for large $\beta$ can be reinterpreted as an argument for higher-distortion, lower-rate codes (read: latent representations) to encourage interpretability. I edited a figure below from Alemi et al. 2018 to clarify this.

Information-theoretic hypotheses abound. Perhaps enforcing optimization in this region could discourage solutions that depend on learning an ultra-powerful decoder (VAE: generator) $p_\theta(x\mid z)$, in other words solutions that depend on a good code, not necessarily a good decode. Does eliminating this possibility simply make room to fish out an ad-hoc interpretable representation, or is there a more sophisticated explanation waiting to be found? We’ll see.

# Machine Theory of Mind

In lab meeting last week, we read Machine Theory of Mind, a recent paper from Neil Rabinowitz and his collaborators at DeepMind & Google Brain (a trimmed version of the paper was presented at ICML 2018). Here, Theory of Mind (ToM) is broadly defined as the ability to represent the mental states of others. This paper aims to demonstrate ToM in an artificial agent. While designing & training such an agent constitutes one challenge, the authors must first devise a scenario in which ToM can be convincingly shown. Inspired by the Sally-Anne test — a classic test of ToM from developmental psychology that evaluates whether a child understands that others can hold false beliefs — the authors construct an analogous test, then train an agent to successfully pass it.

The paper is composed of a series of experiments that build in complexity to this final test. Within each experiment are three key parts. First, the environment: a simple 11×11 grid-world containing walls and 4 colored boxes that are all randomly located within each new world. Second, the agents: an individual agent belongs to a particular “species” according to its policy for acting within an environment. Agents can behave randomly, algorithmically, or with a learned policy (via deep RL). The trajectory of a particular agent within a particular environment constitutes an episode. Reward within an episode is generally maximized by navigating to a box of a particular color as fast as possible. However, limitations on the sightedness and statefulness of the agents, as well as the inclusion of more complex subgoals, are adjusted per the needs of each experiment. Finally, the observer: a meta-learning agent, called ToMnet, that parses the episodes of many agents in many environments so as to learn a prior over the behavior of an agent species. At test time, ToMnet uses a novel agent’s recent episodes & its trajectory on a current episode thus far to infer a posterior and make predictions regarding the agent’s future behavior.

To probe ToM in ToMnet, the authors introduce a species of agent with both a limited field of view and a subgoal. For example, an agent that can only see the squares adjacent to it must first navigate to a “star” in the grid-world before finally navigating to the blue box to achieve maximum reward. In certain environments, the agent passes the blue box early in its initial search and so knows directly where to go after finding the star, even if the blue box is not visible to the agent from the star’s location. The test comes when the experimenter now swaps the locations of the boxes while the agent is on the star and the boxes are out of view. While the agent is blind to the swap, ToMnet is not. And so, the analogous Sally-Anne test arises: Will ToMnet not recognize that the swap occurred outside of the agent’s field of view, and thus mistakenly predict that it will move toward the new location of the blue box? Or, will ToMnet recognize that the agent maintains the false belief that no swap has occurred, and thus correctly predict that it will move toward the old location of the blue box?

ToMnet predicts behavior reflecting the agent’s false belief, successfully passing the test. Importantly, this finding is supplemented with results that show that ToMnet is sensitive to how different fields of view make an agent “differentially vulnerable to acquire false beliefs,” and that ToMnet still passes the test even if it had never seen swap events during training. Thus, ToMnet “learns that agents can act based on false beliefs,” providing a compelling proof-of-concept for Machine Theory of Mind.

# Insights on representational similarity in neural networks with canonical correlation

For this week’s journal club, we covered “Insights on representational similarity in neural networks with canonical correlation” by Morcos, Raghu, and Bengio, NeurIPS, 2018.  To date, many different convolutional neural networks (CNNs) have been proposed to tackle the object recognition problem, including Inception (Szegedy et al., 2015), ResNet (He et al., 2016), and VGG (Simonyan and Zisserman, 2015). These networks have vastly different architectures but all achieve high accuracy. How can this be the case? One possibility is that although the architectures vary, the representations (i.e., the way these networks encode information about the objects of natural images) are very similar.

To test this, we first need a metric of similarity. One approach has been “representation similarity analysis” (RSA) (Kriegskorte et al., 2008) which relies on distance matrices to test if two representations are similar. One potential problem with RSA is that some dimensions of the representations may be “noisy” (i.e., dimensions that do not pertain to encoding the input information). For example, during training, some dimensions of the activity of CNN neurons may vary substantially across epochs but are not relevant to encoding object information. These dimensions could mask the signal of relevant dimensions when analyzing a distance matrix.

One way to avoid this is to try to directly identify the relevant dimensions, allowing us to ignore the noisy dimensions. The authors relied on an old but trusted method called canonical correlation analysis (CCA), which was developed way back in the 1930s (Hotelling, 1936)! CCA has been a handy tool in computational neuroscience, relating the activity of neurons across two populations (Semedo et al., 2014) as well as relating population activity to the output of model neurons (Susillo et al., 2015). Newer methods have been developed that are more appropriate for various problems. These include partial least squares (Höskuldsson, 1988), kernel CCA (Hardoon et al., 2004), as well as a method I developed for my own work called distance covariance analysis (DCA) (Cowley et al., 2017).  The common thread among all of these methods is that they identify dimensions that encode similar information among two or more datasets.

Overview of CCA. CCA is a close relative to linear regression, but whereas linear regression aims at prediction, CCA focuses on correlation—and thus is most suitable for cases in which the investigator seeks intuition of the data.  Given two datasets (e.g., $\mathbf{X} \in \mathcal{R}^{k \times N} \textrm{ and } \mathbf{Y} \in \mathcal{R}^{p \times N}$, both centered, where $N$ is the number of samples), CCA seeks to identify a pair of dimensions $\mathbf{u} \in \mathcal{R}^k \textrm{ and } \mathbf{v} \in \mathcal{R}^p$ such that the Pearson’s correlation between the projections $\mathbf{u}^T \mathbf{X} \textrm{ and } \mathbf{v}^T \mathbf{Y}$ is the largest. In other words, CCA identifies linear combinations of the variables in $\mathbf{X} \textrm{ and } \mathbf{Y}$ that are the most linearly-related. CCA need not stop there—it can identify pairs of dimensions that monotonically decrease in correlation. In this way, we can ignore the dimensions with the smallest correlations (which likely are spurious). One fun fact about CCA is that any two identified dimensions in $\mathbf{X}$ are uncorrelated: $\textrm{corr}(\mathbf{u}_i^T \mathbf{X}, \mathbf{u}_j^T \mathbf{X}) = 0 \textrm{ for } i \neq j$ (and the same for $\mathbf{v}_i, \mathbf{v}_j$). This is different from PCA, whose identified dimensions are both uncorrelated and orthogonal.  The uncorrelatedness of CCA dimensions ensures that we do not include dimensions that contain redundant information. (Implementation details: CCA is solved with singular-value decomposition, but be sure to use a regularized form akin to ridge regression—it was unclear if the authors used regularization).

Figure 1. Generalizing networks converge to more similar solutions than memorizing networks.

Onto the results. The authors proposed a distance metric of CCA to uncover some intuitive characteristics about deep neural networks. First, they found that different initializations of generalizing networks (i.e., networks trained on labeled natural images) were more similar than different initializations of memorizing networks (i.e., networks trained on the same dataset with randomly-shuffled labels). This is expected, as natural labels likely put a constraint on generalizing networks. Interestingly, when comparing generalizing and memorizing networks (Fig. 1, yellow line, ‘Inter’), they found that generalizing and memorizing networks were as similar as different memorizing networks trained on the same fixed dataset. This suggests that overfitted networks converge on very different solutions for the same problem. Also interesting was that earlier layers of both generalizing and memorizing networks seem to converge on similar solutions, while the later layers diverged. This suggests that earlier layers rely more on the structure of natural images while the later layers rely more on the structure of the labels. Second, they found that wider networks (i.e., networks with more filters per layer) converge to more similar solutions than those of narrower networks.  They argue that this supports the “lottery-ticket” hypothesis that wider networks are more likely to have a sub-network that fits the desired function.  Finally, they found that networks trained with different initializations and learning rates on the same problem converge to different groups of solutions. This highlights the need to try different initializations when training neural networks.

This paper left me thinking a lot about representation in the visual cortex of the brain. Does visual cortical population activity have stable and “noisy” dimensions?  If we reduced the number of visual cortical neurons per visual cortical area (either via lesion or pharmacological intervention) in a developing animal, would these animals have severe perceptual deficits (i.e., their visual system did not have the right lottery ticket when developing)?  Lastly, it seems plausible that humans start out with different initializations of their visual cortices—does that suggest different humans have converged on different solutions to solving visual perception?  If so, it suggests that inter-subject variability may be larger than previously thought.

# The Loss-Calibrated Bayesian

By Farhan Damani

In lab meeting this week, we discussed loss-calibrated approximate inference in the context of Bayesian decision theory (Lacoste-Julien et. al. 2011, Cobb et. al. 2018). For many applications, the cost of an incorrect prediction can vary depending on the nature of the mistake. Suppose you are in charge of controlling a nuclear power plant with an unknown temperature $\theta$. We observe indirect measurements of the temperature $D$, and we use Bayesian inference to infer a posterior distribution over the temperature given the observations $p(\theta|D)$. The plant is in danger of over-heating and as the operator, you can either keep the plant running or shut it down. Keeping the plant running while the plant’s temperature exceeds a critical threshold $T_{\text{critic}}$ will cause a nuclear meltdown, incurring a huge loss $L(\theta > T_{\text{critic}}, \text{'on'})$ while shutting off the plant for benign temperatures incurs a minor loss $L(\theta < T_{\text{critic}}, \text{'off'})$

In figure 1 we observe the true posterior $p(\theta|D)$ is multi-modal. Our suite of approximate inference techniques characterize general properties of the posterior, attempting to match either the first or second moment of $p$. Both strategies underestimate the posterior mass for the safety-critical region. Instead, the dash-dotted line, while failing to characterize typical properties of the posterior, results in the same decision as the true posterior by optimizing for task-specific utility. The point is the “best” approximate posterior is subjective, and therefore, we should tailor our inferential resources to find an approximation that is well suited for the decision task at hand.

Bayesian decision theory extends the Bayesian paradigm by including a task-specific utility function $U(\theta, a)$, which tells us the utility of taking action $a \in \mathcal{A}$ when the world is in state $\theta$. According to this view, the optimal action minimizes the posterior risk: $\underset{a}{\arg \min} \text{ } \mathcal{R}(a) = \mathbb{E}_{p(\theta|D)}[U(\theta, a)]$. Typically, this is computed using a 2-step procedure. First approximate the posterior $p(\theta|D)$ with a $q(\theta|D)$ and then minimize the risk under $q$. This approach, however, assumes our approximate $q$ measures properties of the posterior that we care about. This by definition requires our utility function, so therefore, we should jointly optimize the approximate posterior with the action that minimizes the posterior risk. Cobb et. al. 2018 show how to derive a variational lower bound that depends on a task-specific utility function. In their setup, they show that minimizing the KL divergence between an approximate posterior q and a calibrated posterior scaled by the utility function results in the standard ELBO loss plus an additional utility-dependent regularization term. This formulation is amenable to stochastic optimization, allowing for the practical deployment of this framework to supervised learning.

# An orderly single-trial organization of population dynamics in premotor cortex predicts behavioral variability

This week we read some new work from Shaul Druckmann and Karel Svoboda’s groups (https://www.biorxiv.org/content/early/2018/07/25/376830). They analyzed simultaneously recorded activity from the anterior lateral motor cortex (ALM; perhaps homologous to a premotor area in primates) from mice performing a delayed discrimination task (either a somatosensory pole detection task or an auditory tone discrimination task). They analyzed 55 sessions with 6-31 units on each session. Given the strong task epoch dependent responsiveness of most cells in the population, they fit a switching linear dynamical system (sLDS) to the data using expectation-maximazation. However, in their model, the switch times were dictated by the task structure, making the model significantly easier to fit than a sLDS with unconstrained switch times. They called their model a epoch-dependent linear dynamical system (EDLDS).

They used leave-one-neuron-out cross validation (i.e. compute the posterior on test trials without including one neuron’s activity, and then predict that neuron’s activity from the posterior) to test the model fit and found that it often fit the data about as well a sLDS that could flexibly assign the timing of the same number of switch events and significantly outperformed Gaussian process factor analysis.

The model defines a low-dimensional latent space to which they apply several analyses. First, they applied linear discriminate analysis (LDA) to decode the animal’s choice on each trial and show that it outperforms LDA applied to the activity of the full neural population, even when regularization is included.

Next, they applied principal components analysis (PCA) to the latent activity and the full neural population activity to visualize the dominant temporal trajectories within each space. PC projections of the full population activity showed sharp temporal transitions between task epochs and a random spatial ordering across trials, while PC projections of the latent activity showed smooth temporal transitions and strongly ordered dynamics.

They quantified the “orderliness” of each representation by computing the consistency of the trial-ranked value of the LDA projection across time to confirm greater orderliness within the latent space than the full neural activity. They also found that decode analyses to previous trial outcome or choice on error trials using the latent activity outperformed the same analyses applied to the full neural activity.

In summary, the dynamical nature of the EDLDS provides a smooth, de-noised portrait of the temporal dynamics present in the data but that might not easily reveal itself with standard analyses.

# Motor Cortex Embeds Muscle-like Commands in an Untangled Population Response

This week we discussed a recent publication by Abigail Russo from Mark Churchland’s lab. The authors examined primary motor cortex (M1) population responses and EMG activity from primates performing a novel cycling task. The task required the animal to rotate a pedal (like riding a bicycle, except with one’s arm) a fixed number of rotations. There were several task conditions, including pedaling forward and backward.

The authors found that while M1 neural activity contained components of muscle activity (e.g. trial-averaged EMG activity could be accurately predicted by linear combinations of the trial-averaged neural activity) the dominant structure present in the neural population response was not muscle-like. They came to this conclusion by examining the top principal component projections of the neural activity and the EMG activity, where they found that the former co-rotated for forward and backward pedaling while the later counter-rotated. This discrepancy in rotation direction is inconsistent with the notation that neural activity encodes force or kinematic commands.

Based on this observation, the authors proposed a novel hypothesis: the dominant, non-muscle-like activity patterns in M1 exist so as to “detangle” the representation of the muscle-like activity patterns. A rough analogy would be something like a phonograph, whose dominant dynamics are rotating, but which only serve to lay out a coding direction (normal to the rotation) which can be “read-out” in a simple way by the phonograph needle. The authors show with network models that a “detangled” response has the desirable property of noise-robustness.

The following toy-model from the paper illustrates the idea of “tangled-ness.” Imagine that a population of neurons must generate output 1 and output 2 depicted below. If the population represented those signals directly (depicted in the leftmost phase portrait) in a 2-dimensional space, the trajectory would trace out a “figure-8,” a highly tangled trajectory and one that cannot be generated by an autonomous dynamical system (which the authors assume more-or-less accurately caricatures the dynamical properties of M1). In order to untangle the neural representation (depicted in the rightmost phase portrait), the neural activity needs to add an extra, third dimension which resides in the null space of the output. Now, these dynamics can be generated autonomously and a linear projection of them can generate the output.

The authors directly compute a measure of tangling within the neural data and the EMG data. The metric is the following:

$Q(t) = \text{max}_{t^\prime} \frac{|| \dot x_t - \dot x_{t^\prime}||}{|| x_t - x_{t^\prime} || + \epsilon}$.

It can be summarized in the following way: identify two moments in time where the state is very similar, but where the derivative of the state is very different. Such points are exemplified by the intersection of the “figure-8” trajectory above, since the intersection is two identical states with very different derivatives. Across multiple animals, species and motor tasks the authors found a consistent relationship: neural activity is less tangled than EMG activity (as shown below). The authors note that a tangled EMG response is acceptable or perhaps even desirable, since EMG reflects incoming commands and therefore does not need to abide by the requirements that an autonomous dynamical system (like M1) does.

Based on these analyses, the authors conclude that the dominant signals present in M1 population activity principally perform a computational role, by untangling the representation of muscle-like signals that can be read-out approximately linearly by the muscles.

# Automatic Differentiation in Machine Learning: a Survey

 In lab meeting this week we discussed Automatic Differentiation in Machine Learning: a Survey, which addresses the general technique of autodifferentiation for functions expressed in computer programs. The paper initially outlines three main approaches to the computation of derivatives in computer programs: 1) manually working out derivates by hand and coding the result. 2) using numerical approximations to derivates in the form of assessing function values at small discrete steps, and 3) computer manipulation of symbolic mathematic expressions that automatically generates differential expressions via standard calculus rules. The limitations of approach 1 are that derivatives can involve the calculation of a large number of terms (expression swell) that are both manually cumbersome to deal with and lend themselves to easy algebraic mistakes. The disadvantage of approach 2 is a finite difference approach to numerical differentiation inevitably involves round-off errors and is sensitive to step-size. The disadvantage to 3 is that this approach is computationally cumbersome and often standard calculus rules involve forms of derivates that still need to be manually reduced. In contrast to these methods, the paper introduces autodifferentiation as a set of techniques that operates at the elementary computation level via step-wise implementation of the chain rule. The assessment of a function in a computer program involves an execution of procedures that can be represented via a computational graph where each node in the graph is an intermediary temporary variable. Computers execute this trace of elementary operations via intermediary variables and this procedure can be utilized efficiently to additionally concurrently calculate a derivative. The paper outlines two general approaches to autodifferentiation: forward mode and backward mode. In forward mode, a derivative trace is evaluated alongside the computers execution of the functional trace, which intermediary values from the functional evaluation trace are used in concert with the derivative trace calculation. This execution can be conveniently realized by extending the representation of the real numbers to include a dual component. Analogous to imaginary numbers, this dual component is denoted with an $\epsilon$, where $\epsilon^2 = 0$. That is, each number in the execution of the function on a computer is now extended to include tuple which includes this additional dual component whose expressions evaluate using intermediate variables alongside the evaluation of the function.  Using this simple rule of the dual number ($\epsilon^2 = 0$), evaluations implement the notion of the chain rule in differentiation to automatically calculate derivatives at every computational step along the evaluation trace. In executing this forward mode autodifferentiation, a derivative trace is initially ‘seeded’ with an input derivative vector as a function is evaluated with an initial input. Whereas symbolic differentiation can be thought of as the mapping of $f(x)$ to $J_{f}(x)$, forward autodifferentiation is the mapping of $f(c)$ to $J_{f}(c) \cdot \vec{x}$. As such, considering a functional mapping from dimension $n$ to $m$, forward mode requires n trace evaluations to calculate the entire Jacobian, each time the seeded derivative vector is an index for a particular dimension in n. Thus, the calculation of the Jacobian can be completed in approximately n times the total number of operations in an evaluation of $f(c)$. Backward mode, by contrast, computes $f(c)$ to $\vec{y} \cdot J_{f}(c)$. That is, one evaluation of the derivative trace in backward mode can similarly compute a Jacobian vector dot product, in this case extracting a row of the Jacobian (as opposed to a column as in forward mode). In this case, the calculation of the full Jacobian can be completed in approximately m times the total number of operations in an evaluation of $f(c)$.  A disadvantage to backward mode autodifferentiation is that each intermediate variable along a function evaluation trace must be stored in memory for use during the backward execution of the derivative trace.