For those unaware of what automatic differentiation is: It's a close-to-magical tool which turns code for evaluating a function f into code for evaluating its derivative f'.
It uses a special sort of invented numbers which square to zero even though they are not themselves zero.
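Concretely, you carry along a pair (value, derivative) and define the arithmetic so that the extra part acts like an ε with ε² = 0. A toy sketch in Julia (just to show the idea, not any particular library):

```julia
# Toy dual number: represents val + der*ε with ε^2 = 0.
struct Dual
    val::Float64
    der::Float64
end

Base.:+(x::Dual, y::Dual) = Dual(x.val + y.val, x.der + y.der)
Base.:*(x::Dual, y::Dual) = Dual(x.val * y.val, x.der * y.val + x.val * y.der)  # the ε^2 term vanishes
Base.sin(x::Dual) = Dual(sin(x.val), cos(x.val) * x.der)

f(x) = x * x + sin(x)

# Seed der = 1 to ask for df/dx; the ε part of the result is f'(3) = 2*3 + cos(3).
f(Dual(3.0, 1.0))
```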
It seems to me that most of the AD frameworks used for deep learning implement a backward function that returns the Jacobian for every primitive function, and then chain those backward functions together.
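For example, a hand-rolled sketch of that pattern (in Julia; the names are made up, not any framework's actual API):

```julia
# Each primitive returns (value, backward), where backward maps the gradient
# with respect to the output to the gradient with respect to the input.
square(x) = (x^2, dy -> dy * 2x)
sine(x)   = (sin(x), dy -> dy * cos(x))

# "Chaining the backward functions" for f(x) = sin(x^2):
function f_and_grad(x)
    y1, back1 = square(x)     # forward through the first primitive
    y2, back2 = sine(y1)      # forward through the second primitive
    y2, back1(back2(1.0))     # run the backwards in reverse order, seeded with 1
end

f_and_grad(2.0)  # (sin(4), 4cos(4))
```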
I used it for non-ML tasks in geophysics, which made my life a lot easier. However, I think most scientists and engineers aren't aware of it. It has been described as "criminally underused."
Yes, definitely, there are even battle-tested implementations for Fortran available.
Though I have never seen AD frameworks used in production contexts for neural networks/backpropagation. As you say, the code for this seems to be mostly hand-rolled. Please take this negative statement with a grain of salt; I don't actually work in machine learning.
Backpropagation is literally reverse-mode AD applied to neural nets. That said, I wish I had a paper I could point to that shows this explicitly. From what I can tell, the AD community has known this basically since the start, but for whatever reason backpropagation is taught rather than the more general technique.
To clarify, your statement that "It uses a special sort of invented numbers which square to zero even though they are not themselves zero" is not entirely true. In case someone else is looking at this: what the OP is referring to is dual numbers. That is one way to implement things, but not the most common way for fast tools.
Fundamentally, automatic differentiation is the methodical application of the chain rule. Forward mode results from the rules for directional derivatives; reverse mode results from the total derivative. The reason that we have a reverse pass in reverse mode can be seen from this perspective. The directional derivative is `(f o g)'(x) dx = f'(g(x)) g'(x) dx`. Note that we compute `g(x)` before `f(g(x))` and the derivative `g'(x)` before `f'(g(x))`. Therefore, we can compute the derivatives as we compute the answer.

If we want the gradient, which results from the total derivative, we have `grad (f o g)(x) = g'(x)* grad f(g(x))`. Although we still compute `g(x)` before `f(g(x))` during our computation, the gradient requires computing `grad f(g(x))` before applying the adjoint operator `g'(x)*`. We do the evaluations on a first pass, caching extra values, and then compute the gradient on a reverse pass because we need the adjoints of the total derivatives in the reverse order.
Or, at least that's my bias in how to derive things.
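To make that ordering concrete, take `g(x) = sin(x)` and `f(y) = y^2` (a toy calculation, nothing library-specific):

```julia
x = 0.7

# Forward mode: the derivative rides along with the evaluation, in the same order.
gx  = sin(x);  dgx  = cos(x) * 1.0   # g(x) and g'(x) * dx, with dx seeded to 1
fgx = gx^2;    dfgx = 2gx * dgx      # f(g(x)) and the directional derivative

# Reverse mode: evaluate (and cache gx) on the forward pass...
gx  = sin(x)
fgx = gx^2
# ...but the gradient has to start from grad f(g(x)) and only then apply the
# adjoint g'(x)*, i.e. the reverse of the evaluation order, hence the reverse pass.
gbar = 2gx            # grad f(g(x))
xbar = cos(x) * gbar  # g'(x)* applied last
```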
Besides efficiency concerns (which I don't have a clue about), a disadvantage of the dual-number point of view is that, to my knowledge, it can only be used to derive the forward mode of automatic differentiation. Still, I take pleasure in appreciating the slightly mystic aura of the dual numbers. :-)
Dual numbers don't have efficiency concerns in a language like Julia, where the dispatch occurs at compile time, so you basically get the same compiled code you'd want from a source-to-source transformation. The issue, though, is that all functions have to be generic. Dual numbers do make it very easy for the user to add their own overloads and take advantage of how the AD works; this is much harder with source-to-source, since you'd have to actually modify the AST transform. So there's a trade-off here in what is flexible.
But yes, dual numbers only do forward mode. However, slightly related are tracker types, like ReverseDiff.jl or Flux.jl's Tracker, where you essentially push a type forward through the code similarly to a dual number and use it to build the computational graph, which you then run backwards with the chain rule.
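A stripped-down version of that idea (made-up types, nothing like ReverseDiff's or Tracker's actual internals): each tracked value remembers how to push a gradient back to whatever produced it, and the backward pass just walks that record.

```julia
# Minimal "tracked" scalar: forward evaluation records, for each operation,
# the parent values and a pullback closure for the chain rule.
mutable struct Tracked
    val::Float64
    grad::Float64
    parents::Vector{Tuple{Tracked,Function}}  # (parent, local pullback)
end
Tracked(v) = Tracked(v, 0.0, Tuple{Tracked,Function}[])

Base.:*(a::Tracked, b::Tracked) =
    Tracked(a.val * b.val, 0.0, [(a, g -> g * b.val), (b, g -> g * a.val)])
Base.sin(a::Tracked) = Tracked(sin(a.val), 0.0, [(a, g -> g * cos(a.val))])

# Naive backward pass (real libraries sort the graph and visit each node once).
function backward!(t::Tracked, g = 1.0)
    t.grad += g
    for (p, pull) in t.parents
        backward!(p, pull(g))
    end
end

x = Tracked(2.0)
y = sin(x * x)   # the forward pass builds the graph
backward!(y)     # the reverse pass runs the chain rule backwards
x.grad           # cos(4) * 2 * 2
```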
Presumably, source-to-source "AD" could do some symbolic reasoning (whether any systems do, and whether it still counts as AD, I don't actually know). Specifically, I have in mind a source-to-source system that could look at `sqrt(x^2)` and produce a valid sub-gradient at x = 0. With AD (whether forward-mode or reverse), you get NaN = Inf * 0, because the derivative of x^2 at 0 is 0 and the derivative of `sqrt` at 0 is Inf.
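Just to spell out the failure at x = 0 with plain floats (this is the rule-by-rule computation a dual number or a tape would do):

```julia
x = 0.0
inner   = x^2                # 0.0
d_inner = 2x                 # 0.0  (derivative of x^2 at 0)
d_outer = 0.5 / sqrt(inner)  # Inf  (derivative of sqrt at 0)
d_outer * d_inner            # NaN = Inf * 0, although 0 is a valid subgradient of sqrt(x^2) = |x| here
```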
Hmmm, I don’t know. If you’re allowed to skim over edge cases, the statement of the chain rule is pretty obvious: the composition of two linear functions is another linear function, with the coefficients multiplied.
It's not, and I also find it a bit frustrating when the default answer is just "the chain rule." For me, the key insights into how AD is derived are the following:
1. There is a fundamental difference between normed spaces and inner product spaces, and this affects what algorithms we can derive. Specifically, forward mode corresponds to a version of the chain rule that only requires a normed space and not an inner product space. If we assume that we have an inner product, then we can apply the Riesz representation theorem. This is important because it means that there's a concrete element in the space that corresponds to the derivative: this is precisely the gradient. Further, we have a concrete way of finding this element. Because we have a (complete) inner product space, we have a Hilbert adjoint. The gradient, and ultimately reverse mode, can be calculated by combining the chain rule with the Hilbert adjoint. Specifically, `grad (f o g)(x) = g'(x)* grad f(g(x))`. Here, * represents the Hilbert adjoint. Anyway, we get away with this because the derivatives are really linear operators, which we get from the total derivative.
2. Eventually, we feed the gradients `grad f(g(x))` into the adjoint of `g'(x)`. However, it turns out that we can delay sending values on to that node's parent until we have accumulated all of the gradient information being fed into it. Essentially, this means that we can follow a computational graph and accumulate all of the intermediate derivative information at a node before moving on. This means that we can traverse the graph a single time in reverse mode, which is efficient. How we traverse this graph corresponds to a topological sort of the graph, and this ordering can be generated ahead of time using what's called a tape (a toy sketch follows below). This may or may not be more efficient, modulo a bunch of reasons which aren't all that important right now.
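Here's what I mean, as a toy (my own sketch, not how any real tool lays out its tape): the forward pass appends each operation in evaluation order, which is already a topological order, and the reverse pass is a single backwards sweep that accumulates every contribution into a node before anything upstream reads it.

```julia
# Toy tape: the forward pass records (inputs, pullback) for each operation in
# evaluation order, which is already a topological order of the graph.
tape = []

function record!(val, inputs, pullback)
    push!(tape, (inputs = inputs, pullback = pullback))
    (val, length(tape))               # return the value and the node's tape index
end

mul(a, ia, b, ib) = record!(a * b, [ia, ib], g -> [g * b, g * a])
sqr(a, ia)        = record!(a^2,   [ia],     g -> [2a * g])

# Forward pass for f(x) = x^2 * x, with index 0 standing for the input x.
x = 3.0
y1, i1 = sqr(x, 0)
y2, i2 = mul(y1, i1, x, 0)

# Single reverse sweep: each node's gradient is fully accumulated before any
# node earlier on the tape reads it.
grads = zeros(length(tape) + 1)       # slot 1 holds the gradient of the input
grads[i2 + 1] = 1.0                   # seed with d y2 / d y2 = 1
for k in length(tape):-1:1
    node = tape[k]
    for (j, g) in zip(node.inputs, node.pullback(grads[k + 1]))
        grads[j + 1] += g             # the accumulation step
    end
end
grads[1]                              # d/dx x^3 at 3.0 == 27.0
```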
Anyway, this is more to say that automatic differentiation is not an obvious result from basic calculus. At least, not to me. For me, it required knowledge of real and functional analysis, graph theory, and programming languages. It's not impossible to learn, but it's not trivial, in my opinion.
> It uses a special sort of invented numbers which square to zero even though they are not themselves zero.
Here is one of the many tutorials on automatic differentiation: https://pizzaseminar.speicherleck.de/automatic-differentiati...