Special Case: Implicit Equations
The above discussion shows that there are good reasons to differentiate solvers directly, and good reasons to instead write derivative rules for solvers which use forward/adjoint equations. For time series equations, this always has a trade-off. There is an important special case here though that for methods which iterate to convergence, automatic differentiation of the solver is essentially never a good idea. The reason is because the implicit function theorem gives that the derivative of the solution is directly defined at the solution point. For example, for solving
Note of course a similar performance vs stability trade-off does apply here. Since this derivation assumes you have
This does lead to some counter-intuitive results. For example, we had a paper where we exploited this to note that differentiating and ODE solve which goes to infinity (steady state) is faster than a “long ODE”, since steady states have a similar implicit definition. It’s quicker to go to infinity than it is to go to 1000, who would’ve thought? Math is fun.
Does Differentiation of Solver Internals Make Sense or Have a Meaning?
“ODE solvers” have all sorts of things in there, like adaptivity parameters and heuristics. One of the things that happens when you do automatic differentiation of the solver is that you aren’t just differentiating the solver’s states and parameters, but the process will differentiate everything. It turns out that AD of a solver can thus be useful in some tricky ways which put this to use. For example, at ICML we had a paper which regularized the parameters of a neural ODE by the sum of the computed error estimates of the adaptivity heuristics. This would then push the learned equation towards an area of parameter space where the adaptivity gives the largest time steps possible, and thus the learned equation is the “fastest possible learned equation that fits the time series”. Such a trick is only possible if you are doing automatic differentiation of the solver since you’d need to differentiate the solver’s internals in order to have access to those values in the loss function! This just shows one of many ways in which AD’s “extra information” which analytical continuous derivative definitions don’t have could potentially be useful for some applications.
Automatic Differentiation in Continuous Sensitivity Methods
Finally, I want to note that even when you attempt to avoid automatic differentiation of the solver by using continuous sensitivity methods, it turns out that the optimal way to build the extended equations is to use automatic differentiation!
Summary: there are many good reasons to do automatic differentiation of solvers, but there are also many good reasons to use some analytical derivative techniques. But even if you do analytical derivative techniques, you still want to automatic differentiate something in order to do it optimally!
For example, let’s return to the forward sensitivity equations:
It turns out that
since it is equivalent to the directional derivative. This is explained in more detail in these lectures (or accompanying video).
In the same vein, continuous adjoints of ODE solves boil down to defining a differential equation which is solved backwards and that differential equation which is solved backwards has a term which is
The takeaway from this plot is that using these AD tricks results in a few orders of magnitude performance improvements (by avoiding the Jacobian construction, which are the “seeding” versions on the left, the right shows the difference that different AD techniques make, which itself is another few orders of magnitude). When people note that the Julia differential equation adjoint solvers are much faster than the adjoints from Sundials COVDES and IDAS on large equations, this part right here is one of the major factors because Sundials does not embed a reverse AD engine into its adjoint code to do the vjp definitions, and instead falls back to using a numerical formulation unless the user provides a vjp override, which is seemingly to be uncommon to do but from these plots clearly should be done more often.
Summary
In total, what can we takeaway so far about differentiating solvers?
- There are some advantages to differentiating solvers, but there are also some advantages to mixing in analytical continuous adjoints. It’s context-dependent which is better.
- Even when mixing in analytical continuous derivative rules, these are best defined with automatic differentiation within their constructed equations, so one cannot avoid AD completely if one wishes to achieve full performance on arbitrary models.
- For cases which converge to some kind of implicitly defined solution, using special adjoint tricks will be much better than direct differentiation of the solver.
There’s still a lot more to mention, especially as stochastic simulation gets involved, but I’ll cut this here for now. As you can see, there’s still some open questions that are being investigated in the field, so if you find this interesting please feel free to get in touch.
- Forward-mode outperforms reverse-mode / adjoint techniques when the equations are “sufficiently small”. For modern implementations this seems to be at around 100.
- For forward-mode cases, “good” automatic differentiation libraries can make use of structure between the primal and derivative constructions to better CSE/SIMD the generated code for the derivative term, thus forward-mode AD of the solver can be much faster than forward sensitivity analysis even though the two are mathematically the same operation.
- For reverse-mode cases, the continuous adjoints seem to be faster with current implementations.
But that last bit then has many caveats to put on it. For one, there seems to be a trade-off between performance and stability here. This is noted in the appendix of the paper “Universal Differential Equations for Scientific Machine Learning, which states:
Previous research has shown that the discrete adjoint approach is more stable than continuous adjoints in some cases [53, 47, 94, 95, 96, 97] while continuous adjoints have been demonstrated to be more stable in others [98, 95] and can reduce spurious oscillations [99, 100, 101]. This trade-off between discrete and continuous adjoint approaches has been demonstrated on some equations as a trade-off between stability and computational efficiency [102, 103, 104, 105, 106, 107, 108, 109, 110]. Care has to be taken as the stability of an adjoint approach can be dependent on the chosen discretization method [111, 112, 113, 114, 115]
with the references pointing to those in the manuscript.
then you get
However, there are cases where the “analytical” way of writing the derivative is not equivalent to its automatic differentiation counterpart. For example, the adjoint method is a different way to get
So What is Better, Continuous Derivative Rules or Discrete Derivatives of the Solver?
Automatic differentiation of a “solver” is a subject with many details for doing it in the most effective form. For this reason, there are a lot of talks and courses that go into lots of depth on the topic. I recently gave a talk on some of the latest stuff in differentiable simulation with the American Statistical Association, and have some detailed notes on such adjoint derivations as part of the 18.337 Parallel Computing and Scientific Machine Learning graduate course at MIT. And there are entire organizations like my SciML Open Source Software Organization which work day-in and day-out on the development of new differentiable solvers.
I’ll give a brief summary of all my materials here below.
Continuous vs Discrete Differentiation of Solvers
AD of a solver can be done in essentially two different ways: either directly performing automatic differentiation to the steps of the solver, or by defining higher level adjoint rules that will compute the derivative. In some cases these can be mathematically equivalent. For example, forward sensitivity analysis of an ODE
Thus if you solve the extended system of equations:
Christian Meunier, who started as chairman of Nissan Americas in January, said the tariffs are accelerating already needed plans for the automaker to increase domestic production to assist in a turnaround of its embattled U.S. operations.
Return to School Today.
More information about what we collect and how we share your personal information is available in our Privacy Policy.
