ODEs · intermediate · 45 min read

Numerical Methods for ODEs

From Euler's method to adaptive solvers — error analysis, stability regions, stiff systems, and the convergence theory that guarantees discrete approximations track continuous solutions. The computational engine behind neural ODEs.

Abstract. We can prove solutions to y' = f(t,y) exist (Picard-Lindelöf) and analyze their stability (Lyapunov theory) — but how do we actually compute them? Euler's method replaces the derivative with a finite difference: y_{n+1} = y_n + h·f(t_n,y_n). This is first-order accurate — global error grows as O(h). Runge-Kutta methods evaluate f at intermediate points to achieve higher accuracy: the classical RK4 is O(h⁴). But accuracy is not enough. For stiff systems — where the Jacobian eigenvalues span many orders of magnitude — explicit methods require impossibly small step sizes for stability. Implicit methods (backward Euler, trapezoidal rule) sacrifice some computational simplicity for unconditional stability. Adaptive solvers (Dormand-Prince/RK45) adjust step sizes on the fly, concentrating computational effort where the solution changes rapidly. The Lax equivalence theorem unifies the theory: consistency plus stability equals convergence. These ideas have direct ML implications — neural ODEs use adaptive ODE solvers as their forward pass, ResNets are Euler discretizations of continuous dynamics, and the condition number of the Hessian is the stiffness ratio of the gradient flow.

Where this leads → formalML

  • formalML The convergence theory developed here — consistency, stability, and the Lax equivalence theorem — provides the analytical framework for understanding when discrete random walks approximate continuous diffusions, and at what rate.

1. The Computational Question

We’ve spent three topics building the theory of ordinary differential equations. We proved that solutions to $y' = f(t, y)$ exist and are unique whenever $f$ is Lipschitz (Picard-Lindelöf, Topic 21). We classified linear systems via the eigenvalues of the system matrix and computed exact solutions through the matrix exponential $e^{At}$ (Topic 22). We analyzed nonlinear systems through linearization and Lyapunov functions, predicting long-term behavior without ever solving the equation (Topic 23).

But we haven’t asked the most practical question of all: how do we actually compute solutions?

For most ODEs that arise in practice — neural network training dynamics, planetary motion, chemical kinetics, fluid flow — there is no closed-form solution. Picard-Lindelöf says a solution exists, but it doesn’t tell us what that solution looks like at $t = 1$. Lyapunov theory says trajectories converge to an equilibrium, but it doesn’t tell us how fast, or what they look like along the way. To get actual numbers, we need numerical methods.

A numerical method discretizes the continuous problem: instead of computing $y(t)$ for all $t \in [t_0, T]$, we compute approximations $y_n \approx y(t_n)$ on a discrete grid $t_0 < t_1 < \cdots < t_N = T$. The simplest such scheme is Euler’s method, which we already met in Topic 21: $$y_{n+1} = y_n + h \, f(t_n, y_n), \qquad h = t_{n+1} - t_n.$$
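The scheme is short enough to write out in full. Here is a minimal sketch of forward Euler as a reusable function (the helper name `euler` and the test problem are my choices, not notation from the text):

```python
def euler(f, t0, y0, T, h):
    """Forward Euler: march y_{n+1} = y_n + h*f(t_n, y_n) from t0 to T."""
    t, y = t0, y0
    ts, ys = [t], [y]
    while t < T - 1e-12:          # small slack guards against float drift in t
        y = y + h * f(t, y)
        t = t + h
        ts.append(t)
        ys.append(y)
    return ts, ys

# Solve y' = y, y(0) = 1 on [0, 1]; the exact solution is e^t.
ts, ys = euler(lambda t, y: y, 0.0, 1.0, 1.0, 0.1)
print(ys[-1])   # (1.1)^10 ≈ 2.5937, vs. e ≈ 2.7183
```

The gap between 2.5937 and $e$ is exactly the kind of error the rest of the topic quantifies.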

It’s the most natural thing to write down — replace the derivative $y'$ with a forward difference $(y_{n+1} - y_n)/h$, solve for $y_{n+1}$, and march forward. But Euler’s method opens up a Pandora’s box of questions that the rest of this topic answers:

  1. How accurate is it? If we use a step size $h = 0.01$, how close is our computed $y_n$ to the true $y(t_n)$? What if $h = 0.001$? Does the error shrink linearly with $h$, quadratically, or at some other rate?
  2. Can we do better? Is there a computationally cheap way to get higher-order accuracy — error that shrinks faster as we refine the grid?
  3. When does it fail? Are there equations where Euler’s method gives garbage no matter how small we make $h$? (Spoiler: yes — they’re called stiff equations.)
  4. How do we know when our answer is good enough? Is there a way to estimate the error of a numerical method without knowing the true solution?
  5. What does “convergence” even mean for a numerical method? And when can we prove it?

The answers to these questions form the field of numerical analysis of ODEs, and they have direct payoffs in machine learning. Neural ODEs (Chen et al., 2018) use adaptive ODE solvers as their forward pass — so the convergence rate of the underlying solver determines training accuracy and computational cost. ResNets implement $h_{l+1} = h_l + f_\theta(h_l)$ — exactly Euler’s method with step size 1, where $\theta$ is the network’s learned parameters. Gradient descent itself, $\theta_{n+1} = \theta_n - \eta \nabla L(\theta_n)$, is forward Euler applied to the gradient flow ODE $\dot\theta = -\nabla L(\theta)$, and the learning rate $\eta$ is a step size with all the stability constraints we’re about to derive.
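The gradient-descent-as-Euler identification is not a metaphor — the two updates are literally the same arithmetic. A minimal sketch, using a hypothetical quadratic loss $L(\theta) = \tfrac12 \theta^2$ so that $\nabla L(\theta) = \theta$:

```python
# Hypothetical loss L(theta) = 0.5 * theta^2, so grad L(theta) = theta.
grad_L = lambda theta: theta

def gradient_descent(theta0, eta, steps):
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad_L(theta)    # the GD update rule
    return theta

def euler_gradient_flow(theta0, h, steps):
    theta = theta0
    for _ in range(steps):
        theta = theta + h * (-grad_L(theta))   # forward Euler on theta' = -grad L
    return theta

# Same trajectory: GD with learning rate eta IS Euler with step size h = eta.
print(gradient_descent(1.0, 0.1, 20) == euler_gradient_flow(1.0, 0.1, 20))  # True
```

Every stability constraint derived below for Euler’s $h$ therefore transfers directly to the learning rate $\eta$.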

This is the topic that closes the loop from theory to computation. Let’s begin where we always begin: with the simplest method, and an honest accounting of its errors.

2. Euler’s Method — Error Analysis

We met Euler’s method in Topic 21 as a way to draw approximate solution curves on a direction field. We never asked how accurate those curves were. Now we will.

The first thing to understand is that there are two different kinds of error in any numerical method, and they accumulate differently. The local truncation error is the error introduced by a single step, assuming the previous step was exact. The global error is the cumulative error after many steps. Local error tells us how good the recipe is; global error tells us how the recipe behaves over a long simulation.

📐 Definition 1 (Local Truncation Error)

The local truncation error of a one-step method $y_{n+1} = y_n + h \, \Phi(t_n, y_n, h)$ at step $n$ is the residual $$\tau_n = y(t_{n+1}) - y(t_n) - h \, \Phi(t_n, y(t_n), h),$$ where $y(t)$ is the exact solution of the IVP. This measures how closely a single step of the method matches the true Taylor expansion of $y$ — independent of any errors that have already accumulated in $y_n$.

For forward Euler, $\Phi(t, y, h) = f(t, y)$, so the residual is $$\tau_n = y(t_{n+1}) - y(t_n) - h f(t_n, y(t_n)).$$

We can compute this by Taylor-expanding $y(t_{n+1}) = y(t_n + h)$ around $t_n$: $$y(t_n + h) = y(t_n) + h y'(t_n) + \tfrac{h^2}{2} y''(\xi_n)$$ for some $\xi_n \in (t_n, t_n + h)$, by Taylor’s theorem with the Lagrange form of the remainder. Since $y'(t_n) = f(t_n, y(t_n))$, the first-order terms cancel: $$\tau_n = \tfrac{h^2}{2} y''(\xi_n).$$

So the local truncation error of Euler’s method is $O(h^2)$. Each individual step makes an error of order $h^2$. Over $N = (T - t_0)/h$ steps, naive accounting suggests the global error grows like $N \cdot h^2 = (T - t_0) h$ — that is, $O(h)$. We say Euler is a first-order method: global error is one power lower than local error, because the method takes $\sim 1/h$ steps, and each step contributes an error of $\sim h^2$.

But this naive accounting hides a subtlety: errors don’t just accumulate — they also get amplified by the method itself. If $f$ is sensitive to its $y$ argument (large $\partial f / \partial y$), then a small error in $y_n$ can grow into a large error in $y_{n+1}$. The right tool for this analysis is the discrete Grönwall inequality, which bounds error accumulation in linear recurrences.

🔷 Theorem 1 (Global Error of Euler's Method)

Let $f: [t_0, T] \times \mathbb{R} \to \mathbb{R}$ be continuous and Lipschitz in $y$ with constant $L$, and assume the exact solution $y(t)$ is twice continuously differentiable with $|y''(t)| \leq M$ on $[t_0, T]$. Let $y_0, y_1, \ldots, y_N$ be the Euler approximations with step size $h$, and let $e_n = y_n - y(t_n)$ be the error at step $n$ (so $e_0 = 0$). Then $$|e_n| \leq \frac{Mh}{2L}\bigl(e^{L(t_n - t_0)} - 1\bigr) \quad \text{for all } n.$$ In particular, $|e_n| = O(h)$ as $h \to 0$, with the constant depending exponentially on $L$ and the integration interval.

Proof.

We track how the error $e_n = y_n - y(t_n)$ evolves from one step to the next. By the definitions of the Euler step and the local truncation error, $$y_{n+1} = y_n + h f(t_n, y_n), \qquad y(t_{n+1}) = y(t_n) + h f(t_n, y(t_n)) + \tau_n.$$ Subtracting, $$e_{n+1} = e_n + h\bigl[f(t_n, y_n) - f(t_n, y(t_n))\bigr] - \tau_n.$$ Apply the Lipschitz condition $|f(t_n, y_n) - f(t_n, y(t_n))| \leq L|y_n - y(t_n)| = L|e_n|$ and the bound $|\tau_n| \leq Mh^2/2$ from the Taylor expansion above: $$|e_{n+1}| \leq |e_n| + hL|e_n| + \tfrac{Mh^2}{2} = (1 + hL)|e_n| + \tfrac{Mh^2}{2}.$$

This is a linear recurrence in $|e_n|$ with growth factor $(1 + hL)$ and forcing term $Mh^2/2$. Solving by induction (or by the discrete Grönwall lemma), starting from $e_0 = 0$: $$|e_n| \leq \tfrac{Mh^2}{2} \sum_{k=0}^{n-1} (1 + hL)^k = \tfrac{Mh^2}{2} \cdot \frac{(1 + hL)^n - 1}{hL} = \tfrac{Mh}{2L}\bigl[(1 + hL)^n - 1\bigr].$$

The geometric sum used the formula $\sum_{k=0}^{n-1} r^k = (r^n - 1)/(r - 1)$ with $r = 1 + hL$.

Now use the elementary inequality $1 + x \leq e^x$ for all real $x$, applied to $x = hL$: $$(1 + hL)^n \leq (e^{hL})^n = e^{nhL} = e^{L(t_n - t_0)},$$ since $nh = t_n - t_0$. Substituting, $$|e_n| \leq \frac{Mh}{2L}\bigl(e^{L(t_n - t_0)} - 1\bigr).$$

This bound has three pieces. The factor $Mh/(2L)$ shows the linear-in-$h$ scaling that defines first-order convergence. The factor $e^{L(t_n - t_0)} - 1$ is the error amplification — it grows exponentially in both the Lipschitz constant $L$ and the integration time. And the dependence on $M = \max|y''|$ shows that highly curved solutions accumulate error faster than nearly-linear ones, since the local truncation error $\tau_n = (h^2/2)\,y''(\xi_n)$ depends on the second derivative. $\blacksquare$

The exponential factor $e^{L(T - t_0)}$ is the punchline. For long integration intervals or steep right-hand sides, even a perfectly linear-in-$h$ method can give catastrophically large errors unless $h$ is correspondingly small. This is the first hint that accuracy alone is not enough — we also need to think about how errors amplify.

📝 Example 1 (Euler on $y' = y$ — error grows exponentially)

Take $y' = y$, $y(0) = 1$, with exact solution $y(t) = e^t$. Here $f(t, y) = y$, so $L = 1$ and $y'' = y$, giving $M = e^T$ on $[0, T]$. The Euler iteration is $y_{n+1} = y_n + h y_n = (1+h) y_n$, so $y_n = (1+h)^n$.

At $T = 1$ with $h = 0.1$: $y_{10} = (1.1)^{10} \approx 2.594$, while the true value is $e \approx 2.718$. The error is $\approx 0.124$, which is consistent with the bound $|e_n| \leq (Mh/2L)(e^L - 1) = (e \cdot 0.1 / 2)(e - 1) \approx 0.234$ (the bound is loose by roughly a factor of 2 — Grönwall is conservative).

Halving the step to $h = 0.05$: $y_{20} = (1.05)^{20} \approx 2.653$, error $\approx 0.065$ — almost exactly half. This is the linear-in-$h$ scaling that defines a first-order method. To get an error of $10^{-6}$, you’d need $h \approx 10^{-6}$ — one million steps on the unit interval. Euler’s method is honest about its accuracy, but it makes you pay for it.
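The halving claim is easy to check numerically. A short sketch that reproduces the two errors above and their ratio:

```python
import math

def euler_final(h):
    # Integrate y' = y, y(0) = 1 up to t = 1 with fixed step h.
    # For f(t, y) = y, each Euler step is just multiplication by (1 + h).
    n = round(1.0 / h)
    y = 1.0
    for _ in range(n):
        y *= (1.0 + h)
    return y

err1 = abs(math.e - euler_final(0.1))    # ≈ 0.1245
err2 = abs(math.e - euler_final(0.05))   # ≈ 0.0650
print(err1 / err2)                       # ≈ 1.92 — halving h (nearly) halves the error
```

The ratio is slightly below 2 because the $O(h)$ error constant itself shifts a little with $h$; in the limit $h \to 0$ the ratio tends to exactly 2.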

📝 Example 2 (Euler on $y' = -10y$ — accuracy vs. stability)

Now try $y' = -10y$, $y(0) = 1$, with exact solution $y(t) = e^{-10t}$. This is a fast-decaying equation: by $t = 1$, the true solution is $e^{-10} \approx 4.5 \times 10^{-5}$.

The Euler iteration is $y_{n+1} = (1 - 10h) y_n$, so $y_n = (1 - 10h)^n$. The behavior depends critically on $|1 - 10h|$:

  • If $h = 0.01$: $1 - 10h = 0.9$, so $y_{100} = 0.9^{100} \approx 2.7 \times 10^{-5}$. Reasonably close to the true answer.
  • If $h = 0.1$: $1 - 10h = 0$, so $y_n = 0$ for all $n \geq 1$. Wildly wrong (infinitely off in relative terms, but at least bounded).
  • If $h = 0.2$: $1 - 10h = -1$, so $y_n$ oscillates between $1$ and $-1$ forever. The numerical solution doesn’t even decay.
  • If $h = 0.3$: $1 - 10h = -2$, so $y_n = (-2)^n$ — the numerical solution blows up, even though the true solution is decaying to zero!

This is shocking: making the step size larger didn’t just reduce accuracy — at $h = 0.3$, it sent the numerical solution off to $\pm\infty$ while the true solution is going to zero. The problem isn’t truncation error: it’s stability. Euler’s method multiplies the solution (and any error in it) by a factor of $|1 - 10h|$ each step, and this amplification factor exceeds 1 once $h > 0.2$. We’ll formalize this as the “stability region” of Euler’s method in Section 5.
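All four regimes fall out of the one-line recurrence $y_n = (1 - 10h)^n$. A quick sketch of the two extreme cases:

```python
def euler_decay(h, steps):
    # Euler on y' = -10y: each step multiplies y by (1 - 10h).
    y = 1.0
    for _ in range(steps):
        y *= (1.0 - 10.0 * h)
    return y

print(euler_decay(0.01, 100))       # ≈ 2.7e-5: stable, tracks e^{-10} reasonably
print(abs(euler_decay(0.3, 10)))    # 1024.0: |1 - 10h| = 2, so |y_10| = 2^10 — blow-up
```

Ten steps at $h = 0.3$ already produce an error eight orders of magnitude larger than the true value — with the sign flipping every step.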

Euler's method on y' = y diverging from the exact solution; log-log error vs. step size showing slope 1 (first-order convergence)

💡 Remark 1 (The two pressures on $h$)

Examples 1 and 2 reveal the fundamental tension in numerical methods. Truncation error pushes us to make $h$ small: Euler is first-order, so halving $h$ halves the error. But stability also constrains $h$: if $h$ is too large, errors in past steps get amplified into runaway oscillations or blow-up. For Euler on the test equation $y' = \lambda y$ with $\lambda < 0$, stability requires $h \leq 2/|\lambda|$.

For “well-behaved” equations these constraints don’t conflict — accuracy demands a smaller $h$ than stability does, and you just pick the smaller of the two. But for stiff equations, where $|\lambda|$ is large even though the solution itself varies slowly, the stability constraint dominates and forces $h$ to be far smaller than accuracy requires. Section 6 develops this in detail. The cure is implicit methods, which are unconditionally stable and break the tension — at the cost of solving a nonlinear equation at each step rather than merely evaluating $f$. Section 7 develops them.

3. Runge-Kutta Methods

Euler’s method is first-order. To get more accuracy from a given step size, we need a higher-order method — one whose local truncation error is $O(h^{p+1})$ for some $p > 1$, giving global error $O(h^p)$.

The naive way to increase order is to use higher derivatives of $y$ in the Taylor expansion. Since $y' = f(t, y)$, the chain rule gives $y'' = f_t + f_y \cdot f$ (where $f_t$ and $f_y$ denote the partial derivatives of $f$ with respect to $t$ and $y$), with progressively more complicated expressions for $y'''$, $y^{(4)}$, etc. We could write down a “Taylor method” that uses these expressions directly: $$y_{n+1} = y_n + h y'_n + \tfrac{h^2}{2} y''_n + \tfrac{h^3}{6} y'''_n + \cdots$$ But this requires the user to symbolically differentiate $f$ — impractical for complicated right-hand sides, and a non-starter for ML applications where $f$ is a neural network.

The brilliant idea of Runge-Kutta methods (Carl Runge 1895, Wilhelm Kutta 1901) is to mimic the higher-order Taylor expansion by evaluating ff at multiple intermediate points within a single step. The values of ff at carefully chosen “stages” implicitly compute the higher derivatives. We never have to differentiate ff ourselves — only evaluate it at a few extra locations.

A general $s$-stage Runge-Kutta method has the form $$k_i = f\Bigl(t_n + c_i h,\; y_n + h \sum_j a_{i,j} k_j\Bigr), \qquad y_{n+1} = y_n + h \sum_i b_i k_i,$$ where the inner sum is a linear combination of the other stages. The coefficients $c_i$, $a_{i,j}$, and $b_i$ define the method. If $a_{i,j} = 0$ whenever $j \geq i$, the stages can be computed sequentially ($k_1$ first, then $k_2$ using $k_1$, etc.) and the method is explicit. If the coefficient matrix has nonzero entries on or above the diagonal, each $k_i$ can depend on itself or on later stages, and the method is implicit — each step requires solving a nonlinear equation for the stages. We’ll meet implicit methods in Section 7.

📐 Definition 2 (Butcher Tableau)

A Runge-Kutta method is uniquely specified by its coefficients, conventionally arranged in a Butcher tableau (after John Butcher, who systematized the theory in the 1960s). The tableau has three pieces: a column vector $\mathbf c$ listing the time fractions of each stage, an $s \times s$ matrix $A$ holding the stage-coupling coefficients $a_{i,j}$, and a row vector $\mathbf b$ holding the weights used to combine the stages into the final update. The conventional layout writes $\mathbf c$ to the left of $A$, with a horizontal rule separating the stage rows from the row of weights $\mathbf b$. For an explicit method, the matrix $A$ is strictly lower triangular ($a_{i,j} = 0$ for $j \geq i$), which means the stages can be computed in order $k_1, k_2, \ldots, k_s$ without any nonlinear solve.

Forward Euler corresponds to the trivial case $s = 1$, $c_1 = 0$, $a_{1,1} = 0$, $b_1 = 1$. The midpoint method uses $s = 2$ stages with $c_1 = 0$, $c_2 = 1/2$, $a_{2,1} = 1/2$, and weights $b_1 = 0$, $b_2 = 1$ to achieve order 2. The classical RK4 uses $s = 4$ stages with $c = (0, 1/2, 1/2, 1)$, with the matrix $A$ filled in so that $k_2$ uses $k_1$ at the half-step, $k_3$ uses $k_2$ at the half-step, and $k_4$ uses $k_3$ at the full step. The weights are $b_1 = 1/6$, $b_2 = 1/3$, $b_3 = 1/3$, $b_4 = 1/6$ — Simpson’s rule applied to the four stage slopes.

Butcher tableaux for forward Euler, midpoint method, classical RK4, and Dormand-Prince RK45 — visual layouts showing the c-vector, A-matrix, and b-vector structure

The classical RK4 is so important it deserves its formula spelled out. Reading off the tableau, $$k_1 = f(t_n, y_n),$$ $$k_2 = f(t_n + h/2,\; y_n + (h/2) k_1),$$ $$k_3 = f(t_n + h/2,\; y_n + (h/2) k_2),$$ $$k_4 = f(t_n + h,\; y_n + h k_3),$$ $$y_{n+1} = y_n + (h/6)(k_1 + 2k_2 + 2k_3 + k_4).$$ Each step uses 4 evaluations of $f$. The pattern is unmistakable: $k_1$ is the slope at the left endpoint (Euler’s slope), $k_2$ and $k_3$ are slopes at the midpoint computed in two slightly different ways, and $k_4$ is the slope at the right endpoint computed using $k_3$. The final update is a weighted average — Simpson’s rule applied to slopes.
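The five formulas above translate line for line into code. A minimal sketch (function names are mine):

```python
import math

def rk4_step(f, t, y, h):
    """One classical RK4 step, straight from the tableau."""
    k1 = f(t, y)
    k2 = f(t + h/2, y + (h/2) * k1)
    k3 = f(t + h/2, y + (h/2) * k2)
    k4 = f(t + h,   y + h * k3)
    return y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)

def rk4(f, t0, y0, T, n):
    """Integrate from t0 to T in n fixed steps."""
    h = (T - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = rk4_step(f, t, y, h)
        t += h
    return y

# y' = y, y(0) = 1, to t = 1 with ten steps of h = 0.1:
print(rk4(lambda t, y: y, 0.0, 1.0, 1.0, 10))   # agrees with e ≈ 2.7182818 to ~6 digits
```

Ten RK4 steps already beat a million Euler steps’ worth of accuracy on this problem.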

How do we know this combination achieves order 4? The answer comes from systematically matching the Taylor expansion of y(tn+h)y(t_n + h) against the RK formula and equating coefficients.

🔷 Theorem 2 (The Classical RK4 Method Has Order 4)

The classical Runge-Kutta method defined above has local truncation error $O(h^5)$ and global error $O(h^4)$. Equivalently, it satisfies all eight of the order conditions required for a 4-stage explicit Runge-Kutta method to achieve order 4 in the Butcher theory.

A full proof of Theorem 2 requires the algebraic theory of order conditions — equations among the Butcher coefficients that must hold for the method’s Taylor expansion to match the true solution’s Taylor expansion to a given order. For order $p$, the number of order conditions grows quickly: 1 condition for order 1, 2 for order 2, 4 for order 3, 8 for order 4, 17 for order 5, 37 for order 6, and so on. (The growth is governed by rooted trees — each tree corresponds to a way of assembling powers of $f$ and its derivatives by the chain rule, and each tree gives one equation.) The classical RK4 was discovered by Kutta in 1901 by solving all 8 order-4 equations simultaneously; it’s an algebraic miracle that an explicit 4-stage method can satisfy all of them. Butcher’s Numerical Methods for Ordinary Differential Equations is the canonical reference for this theory.

What matters for us is the consequence: RK4 is fourth-order accurate. Halving $h$ reduces global error by a factor of $2^4 = 16$, not just 2 as Euler does. To achieve a target accuracy of $\varepsilon$, RK4 needs about $h \sim \varepsilon^{1/4}$, while Euler needs $h \sim \varepsilon$. To get $\varepsilon = 10^{-6}$, RK4 needs $h \sim 0.03$ ($\sim 30$ steps on the unit interval, $\sim 120$ function evaluations); Euler needs $h \sim 10^{-6}$ ($\sim 10^6$ steps and function evaluations). RK4 is roughly four orders of magnitude more efficient — and that’s for a “smooth” problem where stability isn’t the binding constraint.

📝 Example 3 (RK4 on $y' = y$, $y(0) = 1$ — one step)

Let’s compute one step of RK4 from $t_0 = 0$ with $h = 0.1$. The right-hand side is $f(t, y) = y$.

$$k_1 = f(0, 1) = 1,$$ $$k_2 = f(0.05,\; 1 + 0.05 \cdot 1) = f(0.05, 1.05) = 1.05,$$ $$k_3 = f(0.05,\; 1 + 0.05 \cdot 1.05) = f(0.05, 1.0525) = 1.0525,$$ $$k_4 = f(0.1,\; 1 + 0.1 \cdot 1.0525) = f(0.1, 1.10525) = 1.10525,$$ $$y_1 = 1 + (0.1/6)\bigl(1 + 2(1.05) + 2(1.0525) + 1.10525\bigr) = 1 + 0.10517 = 1.10517.$$

The exact value is $e^{0.1} = 1.10517091808\ldots$, so RK4 nailed it to 5 decimal places in a single step. Compare to Euler’s $y_1 = 1 + 0.1 \cdot 1 = 1.1$, which is correct only to about two digits. RK4 used four function evaluations to get five extra digits — a staggering improvement per evaluation.
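The hand computation above can be replayed verbatim:

```python
import math

f = lambda t, y: y
h = 0.1
k1 = f(0.0,  1.0)
k2 = f(0.05, 1.0 + 0.05 * k1)
k3 = f(0.05, 1.0 + 0.05 * k2)
k4 = f(0.1,  1.0 + 0.1  * k3)
y1 = 1.0 + (h / 6) * (k1 + 2*k2 + 2*k3 + k4)

print(y1)                        # ≈ 1.10517083
print(abs(y1 - math.exp(0.1)))   # ≈ 8.5e-8: the single-step (local) error
```

The single-step error of $\approx 8.5 \times 10^{-8}$ is consistent with the $O(h^5)$ local truncation error: $h^5 = 10^{-5}$ times a modest constant.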

📝 Example 4 (RK4 vs Euler — equal-accuracy step-size comparison)

Suppose we want global error $\leq 10^{-3}$ on the IVP $y' = y$, $y(0) = 1$, integrated to $t = 1$. From the analysis above, Euler’s global error is $\sim h$ (with a constant of order 1 here), so we need $h \approx 10^{-3}$, i.e., 1000 steps.

RK4’s global error is $\sim h^4$, so we need $h^4 \approx 10^{-3}$, i.e., $h \approx 0.18$. That’s about 6 steps. Each RK4 step costs 4 function evaluations, so RK4 uses $\approx 24$ function evaluations total. Euler uses $\approx 1000$. RK4 is roughly 40 times faster for the same accuracy on this problem.

The crossover where Euler and RK4 cost the same per unit accuracy is roughly $h \approx 1$ — so for any reasonable accuracy target, RK4 wins. The story changes for stiff equations (Section 6), where stability, not accuracy, sets the step size; there the per-step cost of RK4 becomes a liability rather than an asset.
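The back-of-envelope counts above use order-1 constants; the exact numbers can be found by brute force. A sketch that searches for the smallest step count meeting the tolerance (for $y' = y$, an Euler step multiplies by $1 + h$ and an RK4 step by the degree-4 Taylor polynomial of $e^h$, so no ODE loop is needed):

```python
import math

def euler_evals_for(tol):
    n = 1
    while abs(math.e - (1 + 1.0/n) ** n) > tol:
        n += 1
    return n              # one f-evaluation per Euler step

def rk4_evals_for(tol):
    n = 1
    while True:
        h = 1.0 / n
        g = 1 + h + h**2/2 + h**3/6 + h**4/24   # RK4 growth factor on y' = y
        if abs(math.e - g ** n) <= tol:
            return 4 * n  # four f-evaluations per RK4 step
        n += 1

print(euler_evals_for(1e-3), rk4_evals_for(1e-3))
```

The precise counts (roughly 1400 vs. fewer than 10 evaluations) differ somewhat from the rounded estimate in the example — the error constants aren’t exactly 1 — but the two-orders-of-magnitude gap is the same story.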

💡 Remark 2 (Higher order isn't always better)

RK4 is the workhorse of explicit methods for moderate-accuracy problems. Going higher — RK5, RK8 — gives diminishing returns. The number of order conditions grows combinatorially, so an order-8 method might need 11 stages to satisfy them all, costing 11 function evaluations per step. This pays off only if the savings in step count exceed the per-step overhead. For very high accuracy ($\varepsilon < 10^{-12}$) on smooth problems, methods like Dormand-Prince 8(7) or extrapolation methods become competitive. For low accuracy ($\varepsilon > 10^{-3}$), Euler or RK2 may suffice. RK4 hits the sweet spot for typical problems.

The right way to think about it: the “order vs. cost” trade-off depends on (a) the accuracy you need, (b) the cost of one evaluation of $f$, and (c) the smoothness of $f$. For neural ODEs, $f$ is a forward pass through a network — expensive — and the desired accuracy is moderate, so RK4 and RK45 are both popular choices. For molecular dynamics with millions of atoms, $f$ is even more expensive, and methods specialized for Hamiltonian structure (symplectic integrators) take precedence over generic high-order methods.

Final error at $t = 5$: Euler 2.96e-3, RK4 5.31e-7, RK45 3.55e-5 (7 steps, 0 rejected).

The simplest test problem. Smooth, well-conditioned, with eigenvalue λ = -1 well inside every method's stability region. All methods converge cleanly.

4. Adaptive Step-Size Control

Both Euler and RK4 above used a fixed step size $h$. But that’s wasteful: the local truncation error $\tau_n$ depends on $|y''|$ (or higher derivatives), which can vary by orders of magnitude over the integration interval. In regions where the solution is nearly linear, we’d be happy with a large $h$. In regions of rapid change — near a transient, an oscillation, a near-singularity — we need a small $h$. A fixed-step method either wastes computation in smooth regions (using a step size suited to the worst case) or undershoots accuracy in tough regions (using a step size suited to the best case).

The fix is adaptive step-size control: estimate the local error after each step, and grow or shrink $h$ based on whether the estimate is below or above a user-supplied tolerance. The trick is doing this cheaply — we’d like to estimate the error without doubling the number of function evaluations.

The standard approach uses an embedded Runge-Kutta pair: two RK methods of orders $p$ and $p+1$ that share the same stages. From the same $\{k_1, \ldots, k_s\}$, you compute two updates $y_{n+1}$ (order $p$) and $\hat y_{n+1}$ (order $p+1$). The difference $\hat y_{n+1} - y_{n+1}$ estimates the local error of the lower-order solution to leading order — essentially for free, since the stages were going to be computed anyway.

📐 Definition 3 (Embedded Runge-Kutta Pair)

An embedded RK pair of orders $p$ and $p+1$ consists of two sets of weights $\{b_i\}$ and $\{\hat b_i\}$ applied to the same $s$ stages $k_1, \ldots, k_s$: $$y_{n+1} = y_n + h \sum_i b_i k_i \quad (\text{order } p), \qquad \hat y_{n+1} = y_n + h \sum_i \hat b_i k_i \quad (\text{order } p+1).$$ The difference $\hat y_{n+1} - y_{n+1} = h \sum_i (\hat b_i - b_i) k_i$ is an estimate of the local truncation error of the order-$p$ method, accurate to leading order. The most widely used embedded pair is the Dormand-Prince RK4(5) method (Dormand & Prince, 1980), which has $s = 7$ stages and is the default solver in scipy.integrate.solve_ivp (method='RK45') and torchdiffeq ('dopri5').
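The Dormand-Prince tableau is too long to reproduce here, but the embedding idea fits in a few lines with the simplest possible pair — forward Euler (order 1) embedded in Heun’s method (order 2), which share their two stages. This toy pair is my illustration, not the Dormand-Prince coefficients:

```python
import math

def embedded_euler_heun_step(f, t, y, h):
    """A minimal embedded pair: Euler (order 1) and Heun (order 2) share stages."""
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1)
    y_low  = y + h * k1                  # order 1 (forward Euler), weights b = (1, 0)
    y_high = y + (h / 2) * (k1 + k2)     # order 2 (Heun), weights b_hat = (1/2, 1/2)
    err_est = abs(y_high - y_low)        # leading-order local error of y_low — free
    return y_high, err_est

# On y' = y from (t, y) = (0, 1): the true one-step error of Euler is
# e^h - (1 + h) ≈ h^2/2; the estimate should land close to it.
h = 0.1
_, est = embedded_euler_heun_step(lambda t, y: y, 0.0, 1.0, h)
print(est)                      # 0.005 — i.e. h^2/2
print(math.exp(h) - (1 + h))    # ≈ 0.00517 — the true local error
```

The same mechanism, with seven stages and fifth-order accuracy, is what RK45 runs on every step.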

Once we have an error estimate $\text{errEst}_n = |\hat y_{n+1} - y_{n+1}|$, we adjust the step size to drive it toward a target tolerance $\text{tol}$. The optimal new step size comes from balancing two competing requirements: we want $\text{errEst} \approx \text{tol}$ (so we’re not wasting effort), but we also want a safety margin so the next step is likely to be accepted on the first try.

🔷 Theorem 3 (Optimal Step-Size Update)

For an embedded RK pair where the lower-order method has order $p$, the local truncation error scales like $\text{errEst} \approx C h^{p+1}$ for some constant $C$. To achieve a target tolerance $\text{tol}$ on the next step, we want $\text{tol} = C \tilde h^{p+1}$, where $\tilde h$ is the new step size. Dividing the two equations and solving: $$\tilde h = h \cdot \left(\frac{\text{tol}}{\text{errEst}}\right)^{1/(p+1)}.$$ In practice this is multiplied by a safety factor of about $0.9$ (so the new step is slightly conservative) and clamped to a range $[h_{\min}, h_{\max}]$ to prevent runaway shrinkage or growth. If $\text{errEst} > \text{tol}$, the current step is rejected (the computed $y_{n+1}$ is discarded) and the step is retried with the smaller $\tilde h$. If $\text{errEst} \leq \text{tol}$, the step is accepted and the larger $\tilde h$ is used for the next step.

The exponent $1/(p+1)$ comes from the order of the local truncation error, not the global error. For Dormand-Prince RK4(5), the lower-order method is order 4, so the exponent is $1/5$. Halving the error requires multiplying $h$ by $2^{-1/5} \approx 0.87$ — a fairly modest change. Doubling $\text{tol}$ allows growing $h$ by the reciprocal factor, $2^{1/5} \approx 1.15$.
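The update rule, with safety factor and clamping, is a few lines of code. A sketch (the defaults here are illustrative choices, not a particular library’s):

```python
def new_step_size(h, err_est, tol, p=4, safety=0.9, h_min=1e-10, h_max=1.0):
    """Standard controller for an embedded pair whose lower-order method has order p."""
    if err_est == 0.0:
        return min(2.0 * h, h_max)      # error below measurement: grow cautiously
    h_new = safety * h * (tol / err_est) ** (1.0 / (p + 1))
    return max(h_min, min(h_new, h_max))

# Error exactly on target: only the 0.9 safety factor bites.
print(new_step_size(0.1, 1e-6, 1e-6))    # ≈ 0.09
# Error 32x too large: shrink by 0.9 * 32^(-1/5) = 0.9 * 0.5.
print(new_step_size(0.1, 32e-6, 1e-6))   # ≈ 0.045
```

Note how gentle the response is: a 32-fold error overshoot only halves the step, thanks to the fifth root.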

📝 Example 5 (RK45 on $y' = -50(y - \cos t) - \sin t$)

Consider the forced linear ODE $$y' = -50(y - \cos t) - \sin t, \qquad y(0) = 0,$$ which has the closed-form solution $y(t) = \cos t - e^{-50t}$ — a sinusoid that “rings up” to its steady-state amplitude over a transient of length $\sim 1/50$. For $t \lesssim 0.1$ the solution changes rapidly (rising from 0 to nearly $\cos t$); for $t \gtrsim 0.1$ it follows the smooth cosine.

A fixed-step RK4 with $h = 0.01$ would take $100$ steps to reach $t = 1$, mostly in the smooth region where $h = 0.05$ would have sufficed. RK45 with $\text{tol} = 10^{-6}$ uses small steps in the transient ($h \sim 0.005$ for the first few accepted steps) and then expands rapidly: by $t = 0.5$ it’s using $h \sim 0.1$. Total step count: perhaps $\sim 30$ — three times fewer than the fixed-step version, with comparable accuracy. The savings come from concentrating computational effort where the solution actually demands it.

📝 Example 6 (Step-size adaptation near a finite-time blow-up)

The IVP $y' = y^2$, $y(0) = 1$ has the explicit solution $y(t) = 1/(1-t)$, which blows up as $t \to 1^-$. A fixed-step method marches blindly toward the singularity: either it stops short of $t = 1$ and misses the blow-up, or it steps across it and produces nonsense.

RK45 handles this gracefully: as $y$ grows, $|f(t, y)| = y^2$ grows even faster, and the local truncation error explodes proportionally. The step-size formula $\tilde h = h \cdot (\text{tol}/\text{errEst})^{1/5}$ then shrinks $h$ rapidly. Near the singularity, RK45 takes geometrically smaller steps until $h$ hits the floor $h_{\min}$, at which point the integration halts. The user gets a reasonable solution up to $t \approx 0.99$ or so, with a clear signal (extreme step-size shrinkage, or hitting $h_{\min}$) that something is going wrong. This is the “graceful degradation” that adaptive methods provide and fixed-step methods don’t.

💡 Remark 3 (Absolute and relative tolerances)

In practice, scientific ODE solvers use a combined tolerance: $\text{tol}_n = \text{atol} + \text{rtol} \cdot |y_n|$. The absolute tolerance $\text{atol}$ kicks in when $|y_n|$ is small (close to the noise floor), and the relative tolerance $\text{rtol}$ kicks in when $|y_n|$ is large (so the error is bounded as a fraction of the value). Typical defaults are $\text{atol} = 10^{-6}$ and $\text{rtol} = 10^{-3}$.

This dual scheme avoids two failure modes: a pure relative tolerance breaks down when the solution crosses zero ($\text{tol} \to 0$ would force $h \to 0$); a pure absolute tolerance is too coarse for small solutions and too tight for large ones. In scipy.integrate.solve_ivp, you pass atol and rtol as arguments; in torchdiffeq, the same conventions apply. ML practitioners using neural ODEs should be aware of these — the wrong choice of tolerance can either slow training to a crawl (over-tight) or introduce solver-induced bias into gradients (over-loose).
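The formula makes the two regimes explicit — a minimal sketch with the default values quoted above:

```python
def combined_tol(y, atol=1e-6, rtol=1e-3):
    """Per-step error target: tol_n = atol + rtol * |y_n|."""
    return atol + rtol * abs(y)

print(combined_tol(0.0))      # 1e-06  — atol dominates: no h -> 0 at zero crossings
print(combined_tol(1000.0))   # ≈ 1.0  — rtol dominates: error capped at 0.1% of y
```

Reading the two printed values against each other shows why neither tolerance alone suffices: the first would be absurdly tight for $y \sim 1000$, and the second would vanish at a zero crossing.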

accepted: 17 · rejected: 2

Sigmoidal growth saturating at the carrying capacity y = 1. Mildly nonlinear. Tests how methods handle a transition between near-zero and near-saturation regimes. The bar chart below the solution shows the step size h the adaptive solver chose at each step — wider bars where the solution is smooth, narrower bars (and a redder tint) where it changes rapidly. Red triangles mark steps that were rejected and retried with a smaller h.

5. Numerical Stability and A-Stability

Example 2 in Section 2 hinted at it: Euler’s method can blow up on a stable equation if the step size is too large. The numerical method introduces its own dynamics, and those dynamics can be unstable even when the underlying ODE isn’t.

To analyze this carefully, we restrict to the simplest possible test problem: the linear scalar equation $y' = \lambda y$ with $\lambda \in \mathbb{C}$. The exact solution is $y(t) = y_0 e^{\lambda t}$, which decays to zero iff $\text{Re}(\lambda) < 0$. We then ask: for which step sizes $h$ does the numerical method also produce a sequence that decays to zero?

The answer depends only on the product z=hλCz = h\lambda \in \mathbb{C}. For each numerical method, applying it to y=λyy' = \lambda y produces a recurrence of the form yn+1=R(z)yny_{n+1} = R(z) y_n, where R(z)R(z) is a rational (or polynomial) function called the stability function of the method. The numerical solution decays iff R(z)1|R(z)| \leq 1.

📐 Definition 4 (Stability Region)

The stability region of a numerical method is the set $S = \{ z \in \mathbb{C} : |R(z)| \leq 1 \}$, where $R(z)$ is the method’s stability function obtained by applying the method to $y' = \lambda y$ with $z = h\lambda$. The numerical solution remains bounded for an eigenvalue $\lambda$ if and only if $h\lambda \in S$ (and decays whenever $|R(h\lambda)| < 1$).

For a scalar real $\lambda < 0$, this gives the constraint $h \leq h_{\max}(\lambda)$, where $h_{\max}$ is determined by where the boundary of $S$ crosses the real axis. For a system $y' = Ay$, every eigenvalue of $hA$ must lie in $S$ — this is the constraint that determines the maximum stable step size for a multi-mode system.

📐 Definition 5 (A-Stability)

A numerical method is A-stable if its stability region $S$ contains the entire closed left half-plane $\mathbb{C}^- = \{z : \text{Re}(z) \leq 0\}$. Equivalently, $|R(z)| \leq 1$ for all $z$ with $\text{Re}(z) \leq 0$.

A-stability is a very strong property: an A-stable method gives a non-growing numerical solution for any stable scalar test equation, regardless of step size. There are no step-size restrictions for stability — only for accuracy.

A-stability is the holy grail of stability properties. If your method is A-stable, you never have to worry about choosing $h$ small enough to avoid blow-up; you only have to choose $h$ small enough to get the desired accuracy. For stiff problems, this is a huge deal — explicit methods can require $h$ smaller than accuracy demands by factors of $10^6$ or more, while A-stable implicit methods can use $h$ governed purely by accuracy requirements.

The bad news, due to Germund Dahlquist in 1963: A-stability is much harder to achieve than you’d hope.

🔷 Theorem 4 (Dahlquist's Second Barrier)

No explicit Runge-Kutta method is A-stable, and no explicit linear multistep method is A-stable. Moreover, an A-stable linear multistep method — which must therefore be implicit — can have order at most $2$. (The trapezoidal rule, which is implicit and second-order, is A-stable — it sits exactly at the barrier.)

This is a structural obstruction, not a technical one. The stability function of any explicit RK method is a polynomial in $z$ (since the method takes finitely many additions and multiplications), and polynomials grow without bound as $|z| \to \infty$ — so they can’t satisfy $|R(z)| \leq 1$ in the unbounded left half-plane. Implicit methods can have rational stability functions whose denominators tame the growth at infinity, but Dahlquist showed that requiring both A-stability AND order $> 2$ leads to coefficient constraints that have no solution. The barrier is why A-stable linear multistep methods are essentially limited to implicit Euler, the trapezoidal rule, and a handful of close cousins; for stiff problems at higher order one turns to BDF methods (which relax A-stability to A($\alpha$)-stability) or to implicit Runge-Kutta families like Radau IIA (which escape the barrier entirely, since it applies only to linear multistep methods). All of these are implicit.

Let’s see the stability functions of the methods we’ve met.

📝 Example 7 (Stability region of forward Euler)

Apply forward Euler to $y' = \lambda y$: $y_{n+1} = y_n + h \lambda y_n = (1 + h\lambda) y_n$. So $R(z) = 1 + z$, where $z = h\lambda$. The stability region is $S = \{z \in \mathbb{C} : |1 + z| \leq 1\}$. This is the closed disk of radius 1 centered at $z = -1$. On the real axis, it’s the interval $[-2, 0]$. So for a real eigenvalue $\lambda < 0$, forward Euler is stable iff $h \leq 2/|\lambda|$. This recovers the constraint from Example 2 in Section 2: on $y' = -10y$, stability requires $h \leq 0.2$ — exactly where we saw the numerical solution start oscillating.
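The boundary $h = 2/|\lambda|$ is easy to see numerically. A small plain-Python sketch on the $y' = -10y$ example above: a step just below the threshold decays, a step just above it blows up.

```python
def forward_euler_decay(lam, h, steps=100, y0=1.0):
    """Iterate y_{n+1} = (1 + h*lam) * y_n and return the final |y_n|."""
    y = y0
    for _ in range(steps):
        y *= 1.0 + h * lam
    return abs(y)

lam = -10.0                              # y' = -10y, a stable ODE
below = forward_euler_decay(lam, 0.19)   # |1 + h*lam| = 0.9 -> decays
above = forward_euler_decay(lam, 0.21)   # |1 + h*lam| = 1.1 -> blows up
```

At exactly $h = 0.2$ the amplification factor is $-1$: the iterates oscillate with constant magnitude, which is the oscillation seen in Example 2.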

📝 Example 8 (Stability region of RK4)

Applying RK4 to $y' = \lambda y$ and computing the four stages, the stability function turns out to be exactly the truncated Taylor series of $e^z$ to fourth order: $R(z) = 1 + z + \tfrac{z^2}{2} + \tfrac{z^3}{6} + \tfrac{z^4}{24}$. This is no accident: applying an order-$p$ method to the test equation forces $R$ to match the exact one-step amplification $e^z = e^{h\lambda}$ to as many Taylor terms as the method’s order — for RK4 that’s 5 terms (orders 0 through 4).

The stability region $\{z : |R(z)| \leq 1\}$ is now a four-lobed shape that extends along the real axis to about $z \approx -2.785$. So RK4 is stable for real $\lambda < 0$ when $h \lesssim 2.785/|\lambda|$ — a roughly 40% larger stable interval than forward Euler’s, in addition to RK4’s much higher accuracy. RK4 is not A-stable: its region is bounded, so for very stiff systems it still needs tiny step sizes for stability.
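The real-axis crossing can be checked by bisecting $|R(x)| = 1$ on the negative axis — a quick pure-Python verification of the $\approx -2.785$ figure:

```python
def rk4_stability_function(x):
    """R(x) = 1 + x + x^2/2 + x^3/6 + x^4/24 for real x."""
    return 1 + x + x**2 / 2 + x**3 / 6 + x**4 / 24

# Bisect |R(x)| = 1 on [-2.9, -2.5]: |R| > 1 at -2.9, |R| < 1 at -2.5.
lo, hi = -2.9, -2.5
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if abs(rk4_stability_function(mid)) > 1.0:
        lo = mid
    else:
        hi = mid
boundary = 0.5 * (lo + hi)   # the left edge of RK4's stability interval
```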

Proof.

We compute the stability region directly. Forward Euler applied to $y' = \lambda y$ gives $y_{n+1} = (1 + h\lambda) y_n$, so iteratively $y_n = (1 + h\lambda)^n y_0 = R(z)^n y_0$ where $z = h\lambda$.

The numerical solution remains bounded iff $|R(z)| \leq 1$, i.e. iff $|1 + z| \leq 1$. Writing $z = x + iy$ for real $x, y$, this condition reads $(1 + x)^2 + y^2 \leq 1$. This is the closed disk of radius $1$ centered at $z = -1$. It is a compact subset of $\mathbb{C}$, contained entirely in $\{z : -2 \leq \text{Re}(z) \leq 0\}$.

The disk does not contain the entire left half-plane: for example, $z = -10$ gives $|1 + z| = 9 > 1$. Therefore forward Euler is not A-stable. For any real eigenvalue $\lambda < 0$, the constraint $h\lambda \in S$ becomes $-2 \leq h\lambda \leq 0$, i.e., $h \leq 2/|\lambda|$. As $|\lambda|$ grows, the maximum stable step size shrinks proportionally — this is the source of stiffness pain for explicit methods. $\blacksquare$

Stability regions of forward Euler, backward Euler, classical RK4, and the trapezoidal rule plotted in the complex plane, with the left half-plane shaded for reference

The implicit methods of Section 7 will have stability regions that contain the entire left half-plane — they’ll be A-stable. For now, just note the qualitative picture: explicit methods have bounded stability regions (small islands in the complex plane), while A-stable implicit methods have unbounded stability regions (the entire left half-plane and often more). The boundedness vs. unboundedness is the structural difference Dahlquist’s theorem exploits.

6. Stiff Systems

We now have all the tools to define stiffness precisely. A system of ODEs is stiff if its dynamics happen on widely separated time scales: some components decay (or oscillate) very fast, while others evolve very slowly. The fast components force any explicit method to use a step size suited to the fastest time scale, even if you only care about the slow dynamics.

The right way to quantify this is via the eigenvalues of the Jacobian.

📐 Definition 6 (Stiffness Ratio)

For a system $\dot{\mathbf{y}} = \mathbf{f}(\mathbf{y})$, the stiffness ratio at a point $\mathbf{y}$ is $\kappa(\mathbf{y}) = \dfrac{\max_i |\lambda_i(J(\mathbf{y}))|}{\min_i |\lambda_i(J(\mathbf{y}))|}$, where $\lambda_i$ are the eigenvalues of the Jacobian $J(\mathbf{y}) = \partial \mathbf{f}/\partial \mathbf{y}$. A system is stiff at $\mathbf{y}$ if $\kappa(\mathbf{y}) \gg 1$ — typically $\kappa > 100$ or so. Stiffness also depends on the integration interval: if the fast modes decay on a time scale of $10^{-6}$, then an interval $T = 1$ spans $\sim 10^6$ fast-decay times and the system is stiff, while over $T = 10^{-7}$ the fast dynamics are still active, must be resolved anyway, and the system is not stiff in any practical sense.

Note the connection to Topic 23: the Jacobian eigenvalues are exactly what we computed for linearization stability analysis. A stable equilibrium has $\text{Re}(\lambda_i) < 0$ for all eigenvalues; a stiff stable equilibrium has some eigenvalues with very large $|\text{Re}(\lambda_i)|$ alongside others with small $|\text{Re}(\lambda_i)|$.

📝 Example 9 (The Robertson chemical kinetics system)

A canonical stiff test problem: three coupled chemical reactions $\dot y_1 = -0.04\, y_1 + 10^4\, y_2 y_3$, $\dot y_2 = 0.04\, y_1 - 10^4\, y_2 y_3 - 3 \times 10^7\, y_2^2$, $\dot y_3 = 3 \times 10^7\, y_2^2$, with $\mathbf{y}(0) = (1, 0, 0)$. The three rate constants span 9 orders of magnitude: $0.04$, $10^4$, $3 \times 10^7$. Shortly after the initial transient, the Jacobian eigenvalues along the trajectory span moduli from near zero up to $\sim 10^7$, giving an enormous stiffness ratio.

For an explicit method on this problem, stability requires $h \cdot 10^7 \lesssim 1$, i.e., $h \lesssim 10^{-7}$. To integrate to $t = 100$ (a typical time scale for the slow $y_1$ component), you’d need $\sim 10^9$ steps. With each step requiring 4 function evaluations for RK4, that’s $\sim 4 \times 10^9$ function evaluations — completely intractable. An implicit method (which we develop in Section 7) can use $h \sim 0.1$ throughout, requiring only $\sim 1000$ steps. The speedup is six orders of magnitude.
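To make the claim concrete, here is a self-contained sketch (plain Python, hand-rolled $3 \times 3$ linear solve — not production code) of backward Euler with a Newton solve on the Robertson system, using the analytic Jacobian. The step $h = 0.01$ would be hopeless for an explicit method here; note also that $\dot y_1 + \dot y_2 + \dot y_3 = 0$, so $y_1 + y_2 + y_3$ should stay at 1:

```python
def f(y):
    y1, y2, y3 = y
    r1, r2, r3 = 0.04 * y1, 1e4 * y2 * y3, 3e7 * y2 * y2
    return [-r1 + r2, r1 - r2 - r3, r3]

def jac(y):
    y1, y2, y3 = y
    return [[-0.04, 1e4 * y3, 1e4 * y2],
            [0.04, -1e4 * y3 - 6e7 * y2, -1e4 * y2],
            [0.0, 6e7 * y2, 0.0]]

def solve3(A, b):
    """Gaussian elimination with partial pivoting for a 3x3 system."""
    A = [row[:] for row in A]
    b = b[:]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(A[r][c]))
        A[c], A[p], b[c], b[p] = A[p], A[c], b[p], b[c]
        for r in range(c + 1, 3):
            m = A[r][c] / A[c][c]
            b[r] -= m * b[c]
            for k in range(c, 3):
                A[r][k] -= m * A[c][k]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (b[r] - sum(A[r][k] * x[k] for k in range(r + 1, 3))) / A[r][r]
    return x

def backward_euler_step(y, h, newton_iters=20):
    """Solve u = y + h*f(u) by Newton on G(u) = u - y - h*f(u) = 0."""
    u = y[:]
    for _ in range(newton_iters):
        G = [u[i] - y[i] - h * fi for i, fi in enumerate(f(u))]
        if max(abs(g) for g in G) < 1e-12:
            break
        J = jac(u)
        M = [[(1.0 if i == j else 0.0) - h * J[i][j] for j in range(3)]
             for i in range(3)]                    # Newton matrix I - h*J
        d = solve3(M, [-g for g in G])
        u = [ui + di for ui, di in zip(u, d)]
    return u

y = [1.0, 0.0, 0.0]
for _ in range(1000):            # integrate to t = 10 with a "large" h = 0.01
    y = backward_euler_step(y, 0.01)
```

No blow-up, positive concentrations, and mass conserved to solver tolerance — with a step size seven orders of magnitude larger than the explicit stability limit.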

📝 Example 10 (Van der Pol oscillator at large $\mu$)

The Van der Pol oscillator $\ddot x - \mu(1 - x^2)\dot x + x = 0$, written as a first-order system with $y_1 = x$, $y_2 = \dot x$: $\dot y_1 = y_2$, $\dot y_2 = \mu(1 - y_1^2) y_2 - y_1$. For small $\mu$ this is a smooth oscillator. For large $\mu$ — say $\mu = 1000$ — it becomes a relaxation oscillator: long stretches where the solution is nearly constant, punctuated by very rapid jumps when $|y_1|$ crosses $1$ and the damping coefficient $\mu(1 - y_1^2)$ flips sign.

During the slow stretches the Jacobian’s eigenvalues are small. During the rapid jumps, one eigenvalue becomes very large (roughly $-2\mu$ at the stiffest points). The stiffness ratio varies from $O(1)$ during smooth stretches to $O(\mu^2) = O(10^6)$ during jumps. An explicit method has to use a step size suited to the worst stretches throughout, even though most of the trajectory is smooth. Adaptive step-size control helps somewhat (it can take big steps during smooth stretches and shrink during jumps), but even RK45 spends most of its computation in the rapid-jump regions and is still much slower than an implicit method. The Van der Pol oscillator at $\mu = 1000$ is a textbook case where you should reach for solve_ivp(method='Radau') or 'BDF' rather than the default 'RK45'.

💡 Remark 4 (Stiffness is not a property of the equation alone)

A subtle but important point: stiffness is a combination of the equation, the initial condition, and the desired integration interval. A system with $\kappa = 10^6$ is stiff only if you want to integrate through many fast-decay periods. If your interval is short enough that the fast modes are still active, you actually need to resolve them — and an explicit method with small $h$ is appropriate.

This is why “stiff” is not a binary classification but a regime. Most modern ODE solvers (solve_ivp, DifferentialEquations.jl) include automatic stiffness detection: they monitor the Jacobian eigenvalues (or proxies like the step-size rejection rate of an explicit method) and switch between explicit and implicit methods on the fly. The field of ODE solving has matured to the point where users mostly don’t need to make this choice manually — but understanding why stiffness matters lets you debug when the automation gets it wrong.


Diagonal linear system. The fast component y₂ decays in t ≈ 0.001 but forces explicit methods to use h < 0.002 for stability. The slow component y₁ evolves on a time scale 1000× longer. The right panel shows the stability region of the selected method in the complex z = hλ plane (blue = stable region). The green/red dot is the current value of h·λ_max — green if it lies inside the stability region (the method is stable), red if it lies outside (the method will blow up). Move the h slider to see the dot enter or leave the region.

7. Implicit Methods

If explicit methods can’t handle stiffness gracefully, what can? The answer is implicit methods — schemes where the unknown $y_{n+1}$ appears on both sides of the update equation, requiring an algebraic solve at each step. The cost is higher per step, but the payoff is dramatic: A-stability, meaning step sizes can be chosen for accuracy alone, regardless of stiffness.

📐 Definition 7 (Implicit Euler and Trapezoidal Rule)

Implicit (backward) Euler: $y_{n+1} = y_n + h \cdot f(t_{n+1}, y_{n+1})$. The unknown $y_{n+1}$ appears inside $f$ on the right-hand side — this is an implicit equation that must be solved at each step. For linear $f$, the solve is a matrix inversion; for nonlinear $f$, it’s a Newton iteration. This method is first-order accurate (same order as forward Euler) but A-stable.

Trapezoidal rule: $y_{n+1} = y_n + (h/2)[f(t_n, y_n) + f(t_{n+1}, y_{n+1})]$. The average of forward and backward Euler. Second-order accurate and A-stable — sitting exactly at the Dahlquist barrier of “highest-order A-stable linear multistep method.” Known as the Crank-Nicolson method in the PDE literature.

What does it mean to “solve” the implicit equation? Take backward Euler. We need to find $y_{n+1}$ such that $g(y) := y - y_n - h f(t_{n+1}, y) = 0$. For nonlinear $f$ this requires a root-finding method, and the standard choice is Newton’s iteration: $y^{(k+1)} = y^{(k)} - \dfrac{g(y^{(k)})}{g'(y^{(k)})}$, where $g'(y) = 1 - h \cdot \frac{\partial f}{\partial y}(t_{n+1}, y)$. The iteration converges quadratically when $y^{(k)}$ is close to the true root. A typical implementation initializes with $y^{(0)} = y_n + h f(t_n, y_n)$ (one explicit Euler step) and converges in 2–4 iterations per timestep. Each iteration requires one evaluation of $f$ and one of $\partial f/\partial y$ — so the per-step cost is $\sim 4$ function evaluations and $\sim 4$ Jacobian evaluations. That’s somewhat more expensive than RK4’s 4 function evaluations, but per-accuracy it can be vastly cheaper because the step size isn’t constrained by stability.
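For a scalar example, here is backward Euler with exactly this Newton iteration on $y' = -y^3$, whose exact solution from $y(0) = 1$ is $y(t) = 1/\sqrt{1 + 2t}$ — a plain-Python sketch with the explicit-Euler predictor described above:

```python
def backward_euler(f, dfdy, y0, h, steps):
    """Backward Euler; each step solves g(u) = u - y - h f(u) = 0 by Newton."""
    y = y0
    for _ in range(steps):
        u = y + h * f(y)                  # predictor: one explicit Euler step
        for _ in range(20):               # Newton iteration
            g = u - y - h * f(u)
            u -= g / (1.0 - h * dfdy(u))  # g'(u) = 1 - h * df/dy
            if abs(g) < 1e-14:
                break
        y = u
    return y

f = lambda y: -y**3
dfdy = lambda y: -3.0 * y**2
y_num = backward_euler(f, dfdy, 1.0, 0.1, 50)   # integrate to t = 5
y_exact = (1.0 + 2.0 * 5.0) ** -0.5             # 1/sqrt(11)
```

The first-order error is visible but modest; the point is that the Newton solve converges in a handful of iterations from the predictor, exactly as described.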

🔷 Theorem 5 (A-Stability of the Implicit Euler Method)

The implicit Euler method has stability function $R(z) = \dfrac{1}{1 - z}$, and is A-stable: $|R(z)| \leq 1$ for all $z \in \mathbb{C}$ with $\text{Re}(z) \leq 0$.

Proof.

Apply implicit Euler to the test equation $y' = \lambda y$: $y_{n+1} = y_n + h\lambda y_{n+1}$. Solving algebraically for $y_{n+1}$ (this is one of the rare cases where the implicit equation has a closed-form solution): $(1 - h\lambda) y_{n+1} = y_n \implies y_{n+1} = \frac{1}{1 - h\lambda} y_n = \frac{1}{1 - z} y_n$, where $z = h\lambda$. So the stability function is $R(z) = 1/(1 - z)$.

To check A-stability, we need $|R(z)| \leq 1$ for all $z$ with $\text{Re}(z) \leq 0$. Write $z = x + iy$ with $x \leq 0$. Then $1 - z = (1 - x) - iy$, so $|1 - z|^2 = (1 - x)^2 + y^2$. Since $x \leq 0$, we have $1 - x \geq 1$, so $(1 - x)^2 \geq 1$. Therefore $|1 - z|^2 = (1 - x)^2 + y^2 \geq 1 + y^2 \geq 1$, which gives $|1 - z| \geq 1$ and so $|R(z)| = \frac{1}{|1 - z|} \leq 1$. Equality $|R(z)| = 1$ holds iff $x = 0$ and $y = 0$ — that is, only at $z = 0$. Everywhere else in the closed left half-plane, $|R(z)| < 1$ strictly. So implicit Euler is A-stable, and in fact strongly damping for any $\text{Re}(\lambda) < 0$. $\blacksquare$

The geometric picture is illuminating. Forward Euler’s stability region was a unit disk centered at $z = -1$ (a small island in the left half-plane). Implicit Euler’s stability region is everything outside the disk $|1 - z| < 1$ — that is, the complement of the open unit disk centered at $z = +1$. This complement contains the entire left half-plane, and most of the right half-plane as well. Implicit Euler is in fact L-stable: beyond A-stability, $R(z) \to 0$ as $|z| \to \infty$, so the fastest-decaying stiff modes are damped the most aggressively — exactly what you want for stiff transients. The flip side of the oversized stability region is that implicit Euler also damps numerical solutions for some unstable continuous problems (those with $\text{Re}(\lambda) > 0$ but $h\lambda$ inside the region), which can mask physical instabilities.

📝 Example 11 (Implicit Euler on the Robertson system: stable with large step sizes)

On the Robertson kinetics from Example 9, implicit Euler (with Newton iteration on the $3 \times 3$ Jacobian) can use $h = 10^{-2}$ to $h = 1$ throughout the integration. To reach $t = 100$, that’s $\sim 100$ to $10^4$ steps — five to seven orders of magnitude fewer than the $\sim 10^9$ steps an explicit method would need. Each implicit step is more expensive (Newton iteration costs $\sim 4\times$ what one Euler step does), but $4 \cdot 10^4$ is still vastly less than $10^9$.

This is the qualitative payoff of implicit methods: they trade per-step computational complexity for the freedom to take large step sizes. For stiff problems, that trade is overwhelmingly worth it.
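The per-step amplification factors make the contrast vivid. On the scalar stiff problem $y' = -1000y$ with $h = 0.01$ (so $z = h\lambda = -10$), forward Euler multiplies by $1 + z = -9$ each step, while backward Euler multiplies by $1/(1 - z) = 1/11$; a plain-Python sketch:

```python
lam, h, steps = -1000.0, 0.01, 20
y_explicit = y_implicit = 1.0
for _ in range(steps):
    y_explicit *= 1.0 + h * lam          # forward Euler factor: -9
    y_implicit *= 1.0 / (1.0 - h * lam)  # backward Euler factor: 1/11
# Same ODE, same step size: the explicit iterate has exploded by a factor
# of 9^20, while the implicit iterate has decayed toward zero.
```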

📝 Example 12 (The trapezoidal rule — second-order and A-stable)

Apply the trapezoidal rule to $y' = \lambda y$: $y_{n+1} = y_n + (h\lambda/2)(y_n + y_{n+1}) \implies (1 - h\lambda/2) y_{n+1} = (1 + h\lambda/2) y_n$. The stability function is $R(z) = (1 + z/2)/(1 - z/2)$. For $z = x + iy$ with $x \leq 0$, numerator and denominator have the same imaginary-part magnitude $|y/2|$, but $|1 + x/2| \leq |1 - x/2|$ (since $x \leq 0$). So $|R(z)| \leq 1$ on the closed left half-plane — A-stable.

The trapezoidal rule is the highest-order A-stable linear multistep method (Dahlquist’s barrier theorem). It’s second-order accurate, A-stable, and requires only one implicit solve per step. For mildly stiff problems where you want better than first-order accuracy without giving up A-stability, it’s the natural choice. (For higher-order A-stable methods, you have to abandon linear multistep and use implicit Runge-Kutta methods like Radau IIA.)
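A quick numerical check of the A-stability claim, sampling $R(z) = (1 + z/2)/(1 - z/2)$ over a grid of left-half-plane points (plain Python with complex arithmetic):

```python
def R_trap(z: complex) -> float:
    """|R(z)| for the trapezoidal rule's stability function."""
    return abs((1 + z / 2) / (1 - z / 2))

# Sample the closed left half-plane: |R| <= 1 everywhere.
left = [R_trap(complex(x, y)) for x in range(-50, 1) for y in range(-50, 51)]

# A point with Re(z) > 0, where the exact solution grows: |R| > 1 there too.
right = R_trap(complex(1.0, 0.0))   # (3/2)/(1/2) = 3
```

On the imaginary axis $|R(z)| = 1$ exactly (numerator and denominator are conjugates), which matches the fact that the trapezoidal rule neither damps nor amplifies purely oscillatory modes.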

💡 Remark 5 (The cost of implicitness — Newton iteration)

Each implicit step requires solving a nonlinear (or linear, for linear ODEs) equation. For systems of dimension $d$, the cost per Newton iteration is dominated by the cost of factoring the Jacobian matrix $I - h \cdot \partial \mathbf{f}/\partial \mathbf{y}$, which is $O(d^3)$ for dense matrices and $O(d^{1.5})$ to $O(d \log d)$ for sparse ones (banded, tridiagonal, etc.). A typical Newton iteration converges in 2–4 sweeps; if it doesn’t converge, the step is reduced and retried.

For neural ODEs, the implicit-method cost can be prohibitive: $d$ might be $10^5$ (the number of hidden units), making the $O(d^3)$ factorization $\sim 10^{15}$ flops per step. This is why neural ODE training almost always uses explicit adaptive methods (Dormand-Prince RK45) and accepts the stability constraints — the savings from avoiding implicit solves dominate. The exception is in physics-informed neural networks for stiff PDEs, where implicit solvers are essential despite the cost. Modern implementations (diffrax, torchdiffeq) provide both explicit and implicit options, and the user picks based on the stiffness regime.

Comparison of explicit and implicit Euler on a stiff system: explicit blows up while implicit remains stable

8. Multistep Methods

Both Euler and the Runge-Kutta methods are single-step: each new value $y_{n+1}$ depends only on $y_n$ and possibly on intermediate evaluations of $f$ within the current step. Multistep methods instead reuse $y_{n-1}, y_{n-2}, \ldots$ from previous steps and the corresponding $f$-values $f_{n-1}, f_{n-2}, \ldots$. This is a fundamentally different way to gain order: rather than computing more stages within one step, you average more historical information.

📐 Definition 8 (Linear Multistep Methods)

A linear $k$-step method has the form $\sum_{j=0}^{k} \alpha_j y_{n+j} = h \sum_{j=0}^{k} \beta_j f_{n+j}$, where $f_{n+j} = f(t_{n+j}, y_{n+j})$ and $\alpha_k \neq 0$. If $\beta_k = 0$ the method is explicit (the new $y_{n+k}$ doesn’t appear on the right-hand side); otherwise it’s implicit.

Two important families:

  • Adams-Bashforth (explicit): $y_{n+k} = y_{n+k-1} + h \sum_{j=0}^{k-1} \beta_j f_{n+j}$. The $k$-step Adams-Bashforth method achieves order $k$ by polynomial interpolation through past $f$-values.
  • Adams-Moulton (implicit): $y_{n+k} = y_{n+k-1} + h \sum_{j=0}^{k} \beta_j f_{n+j}$. Achieves order $k+1$ by including the new $f_{n+k}$ in the interpolation.

📝 Example 13 (Adams-Bashforth 2-step)

The 2-step Adams-Bashforth method is $y_{n+2} = y_{n+1} + \frac{h}{2}(3 f_{n+1} - f_n)$. This is exact when $f(t, y(t))$ is linear in $t$, and second-order accurate in general. The derivation: approximate $f(t, y(t))$ on $[t_{n+1}, t_{n+2}]$ by the linear extrapolation of $f_n, f_{n+1}$, then integrate that linear function exactly. The factors $3/2$ on $f_{n+1}$ and $-1/2$ on $f_n$ are exactly the coefficients of $\int_{t_{n+1}}^{t_{n+2}} L_n(s)\,ds$ where $L_n$ is the linear Lagrange polynomial through $(t_n, f_n)$ and $(t_{n+1}, f_{n+1})$.

Adams-Bashforth methods are very efficient per step: only one new $f$-evaluation, since $f_n$ is reused from the previous step. To get started, you need a single-step method (e.g., one RK4 step) to compute $y_1$ from $y_0$. After that, the multistep recurrence takes over.
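A minimal AB2 implementation (plain Python), bootstrapped with one RK4 step as suggested above; halving $h$ should cut the global error by about $4\times$, confirming second order:

```python
import math

def rk4_step(f, t, y, h):
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def ab2(f, y0, t0, T, h):
    """2-step Adams-Bashforth: y_{n+2} = y_{n+1} + h/2 (3 f_{n+1} - f_n)."""
    y = rk4_step(f, t0, y0, h)    # bootstrap y_1 with one RK4 step
    f_prev = f(t0, y0)            # f_0, reused on the next step
    t = t0 + h
    for _ in range(round((T - t0) / h) - 1):
        f_cur = f(t, y)           # the only new f-evaluation this step
        y = y + h / 2 * (3 * f_cur - f_prev)
        f_prev, t = f_cur, t + h
    return y

f = lambda t, y: -y
err = lambda h: abs(ab2(f, 1.0, 0.0, 1.0, h) - math.exp(-1.0))
order = math.log2(err(0.05) / err(0.025))   # empirical order, near 2
```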

🔷 Theorem 6 (Convergence of Multistep Methods (Dahlquist Equivalence))

A linear multistep method is convergent (the numerical solution converges to the exact solution as $h \to 0$) if and only if it is both:

  1. Consistent: the local truncation error tends to zero as $h \to 0$.
  2. Zero-stable: the roots of the characteristic polynomial $\rho(\zeta) = \sum \alpha_j \zeta^j$ all lie in the closed unit disk $|\zeta| \leq 1$, and any roots on the unit circle are simple.

This is Dahlquist’s equivalence theorem for linear multistep methods, the multistep analog of the Lax equivalence theorem we’ll prove for one-step methods in Section 9.

The zero-stability condition is the discrete analog of continuous stability: it requires that the trivial equation $y' = 0$ (with $f = 0$) is solved correctly by the method. A multistep method with a root $\zeta$ outside the unit disk would have a parasitic numerical mode that grows like $\zeta^n$, drowning out the true solution even on the trivial problem. Zero-stability prevents this — it’s a discrete echo of the linearization stability theorem from Topic 23.
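The parasitic-mode blow-up is easy to exhibit. A classic example from standard numerical-analysis texts (a sketch, not one of the methods above) is $y_{n+2} + 4y_{n+1} - 5y_n = h(4f_{n+1} + 2f_n)$: it is consistent of order 3, but $\rho(\zeta) = \zeta^2 + 4\zeta - 5 = (\zeta - 1)(\zeta + 5)$ has a root at $-5$ outside the unit disk. On $y' = -y$ it diverges even with an exact bootstrap value:

```python
import math

# y_{n+2} = -4 y_{n+1} + 5 y_n + h (4 f_{n+1} + 2 f_n), with f(y) = -y.
# Order 3 locally, but the parasitic root -5 amplifies every tiny
# truncation/roundoff error like 5^n.
h, steps = 0.01, 100
ys = [1.0, math.exp(-h)]      # bootstrap y_1 with the exact solution value
for _ in range(steps):
    ys.append(-4 * ys[-1] + 5 * ys[-2] + h * (4 * (-ys[-1]) + 2 * (-ys[-2])))
# ys[-1] should approximate e^{-1.01} ≈ 0.364; instead it is astronomically large.
```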

💡 Remark 6 (Multistep vs. multistage — a cost/memory trade-off)

Runge-Kutta methods (multistage) and Adams methods (multistep) both achieve high order, but they trade off differently. RK methods compute multiple $f$-evaluations per step but use only the previous $y$. Adams methods compute one $f$-evaluation per step but maintain a history of past $f$-values in memory.

For neural ODEs, multistage is preferred because each $f$-evaluation is a forward pass through a network — expensive but cheap to checkpoint. For Adams methods, every retained $f_{n-j}$ is a snapshot of network activations, and over many steps the memory becomes prohibitive.

For traditional scientific computing where $f$ is cheap (a polynomial or a finite-element matrix-vector product) but the integration interval is long, Adams methods can be more efficient: one $f$-evaluation per step beats four for RK4. The choice depends on what you’re paying for — flops, memory, or something else.

Adams-Bashforth stencil and accuracy comparison across orders 2, 3, and 4

9. Convergence Theory — The Lax Equivalence Theorem

We’ve now seen many methods, each with its own analysis of accuracy and stability. We’ve appealed to “consistency” (local truncation error → 0) and “stability” (errors don’t amplify) as the two conditions a good method must satisfy. Now we make this precise and prove the unifying theorem.

The result is due to Peter Lax (1956) for finite-difference methods on PDEs, but the same statement holds for one-step methods on ODEs. It’s the most important theorem in numerical analysis.

🔷 Theorem 7 (The Lax Equivalence Theorem (One-Step Version))

Let $\Phi(t, y, h)$ be the increment function of a one-step method $y_{n+1} = y_n + h \Phi(t_n, y_n, h)$ applied to the IVP $y' = f(t, y)$, $y(t_0) = y_0$, with $f$ Lipschitz in $y$. Then the method is convergent — meaning $\max_n |y_n - y(t_n)| \to 0$ as $h \to 0$ over any fixed interval $[t_0, T]$ — if and only if it is both:

  1. Consistent: the local truncation error satisfies $\tau_n \to 0$ uniformly in $n$ as $h \to 0$.
  2. Stable: the increment function $\Phi$ is Lipschitz in its second argument, with a constant $L_\Phi$ that does not depend on $h$.

Moreover, if the method has order $p$ (meaning $\tau_n = O(h^{p+1})$), then the global error satisfies $|y_n - y(t_n)| = O(h^p)$ uniformly on $[t_0, T]$.

Proof.

We track how the global error $e_n = y_n - y(t_n)$ propagates from step to step, then sum the contributions. The argument is a generalization of the Euler-specific proof from Theorem 1.

By the definitions of the numerical step and the local truncation error, $y_{n+1} = y_n + h\, \Phi(t_n, y_n, h)$ and $y(t_{n+1}) = y(t_n) + h\, \Phi(t_n, y(t_n), h) + \tau_n$. Subtracting, $e_{n+1} = e_n + h\bigl[\Phi(t_n, y_n, h) - \Phi(t_n, y(t_n), h)\bigr] - \tau_n$.

Apply the stability condition (Lipschitz $\Phi$ with constant $L_\Phi$): $|\Phi(t_n, y_n, h) - \Phi(t_n, y(t_n), h)| \leq L_\Phi |y_n - y(t_n)| = L_\Phi |e_n|$. Combined with $|\tau_n| \leq C h^{p+1}$ (consistency, with constant $C$ depending on $\max|y^{(p+1)}|$): $|e_{n+1}| \leq |e_n| + h L_\Phi |e_n| + C h^{p+1} = (1 + h L_\Phi) |e_n| + C h^{p+1}$.

This is a linear recurrence with growth factor $(1 + h L_\Phi)$ and forcing $C h^{p+1}$. Iterating from $e_0 = 0$: $|e_n| \leq C h^{p+1} \sum_{k=0}^{n-1} (1 + h L_\Phi)^{n-1-k} = C h^{p+1} \cdot \frac{(1 + h L_\Phi)^n - 1}{h L_\Phi}$.

Using $(1 + x) \leq e^x$, we have $(1 + h L_\Phi)^n \leq e^{n h L_\Phi} = e^{L_\Phi (t_n - t_0)}$. So $|e_n| \leq C h^{p+1} \cdot \frac{e^{L_\Phi (t_n - t_0)} - 1}{h L_\Phi} = \frac{C h^p}{L_\Phi}\bigl(e^{L_\Phi (t_n - t_0)} - 1\bigr)$.

This bound has three pieces:

  • The factor $C h^p$ shows the order-$p$ convergence: the global error is $p$ powers of $h$, one less than the local truncation order $p+1$, because we summed $\sim 1/h$ steps.
  • The factor $e^{L_\Phi (t_n - t_0)} - 1$ is the error amplification, growing exponentially in both the Lipschitz constant and the integration time. This is the “stability cost” — even a stable method amplifies errors over a long interval.
  • Both factors are independent of $n$, so the bound is uniform on $[t_0, T]$, and it is $O(h^p)$ as $h \to 0$.

Conversely, if either consistency or stability fails, convergence fails. If the method is inconsistent ($\tau_n \not\to 0$), then even at $t = t_0 + h$ the single-step error doesn’t shrink, so we can’t have $|e_1| \to 0$. If the method is unstable (there’s no uniform Lipschitz constant for $\Phi$), then small perturbations get amplified by an unbounded factor, and any roundoff or initial error grows without bound as $h \to 0$.

So consistency $+$ stability $\iff$ convergence. This is the equivalence theorem: convergence is equivalent to the conjunction of two simpler, locally checkable conditions. $\blacksquare$

Lax’s theorem is the unifying result of numerical analysis. It says: to verify that a method converges (a global property of solutions over an entire interval), you only need to check two local properties — that one step is asymptotically correct (consistency), and that the per-step amplification factor is bounded (stability). This is a massive simplification: you don’t have to track the full numerical solution; you just check the recipe and the amplification.

For RK methods, consistency comes from the order conditions (pure algebraic identities on the coefficients), and stability comes from the Lipschitz continuity of the increment function (which follows from the Lipschitz continuity of $f$, since $\Phi$ is built out of finitely many evaluations of $f$). So every RK method applied to a Lipschitz problem is convergent at its theoretical order — no matter how complicated the method.

📝 Example 14 (Verifying the Lax hypotheses for RK4)

For RK4, the increment function is $\Phi(t, y, h) = \tfrac{1}{6}\bigl(k_1 + 2k_2 + 2k_3 + k_4\bigr)$, where each $k_i$ is itself a finite combination of $f$-evaluations. Since $f$ is Lipschitz with constant $L$, each $k_i$ is Lipschitz in $y$ with constant bounded by a polynomial in $L$ and $h$. As $h \to 0$, this constant tends to $L$, so $\Phi$ is Lipschitz in $y$ with a uniform constant. Stability checked.

Consistency follows from the order conditions: RK4 satisfies all 8 conditions for order 4, so its local truncation error is $\tau_n = O(h^5)$. Consistency checked.

By Lax, RK4 converges at rate $O(h^4)$ on any Lipschitz IVP. Theorems 1 and 2 are now special cases of Theorem 7 — Theorem 1 verified the conditions for forward Euler ($p = 1$), and Theorem 2 verified the order conditions for RK4 ($p = 4$).

📝 Example 15 (Richardson extrapolation for empirical order verification)

Lax’s theorem gives a beautiful experimental tool: measure the convergence order of any method by running it at successively halved step sizes. If the global error scales like $|e(h)| \approx C h^p$, then $\frac{|e(h)|}{|e(h/2)|} \approx \frac{C h^p}{C (h/2)^p} = 2^p$, so $\log_2(|e(h)|/|e(h/2)|) \approx p$. Plotting $\log|e(h)|$ vs $\log h$ gives a straight line whose slope is the empirical order. This is the idea behind Richardson extrapolation, and it converts an abstract theorem about convergence rates into a quick numerical check.

In practice: take a problem with a known exact solution (e.g., $y' = -y$, $y(t) = e^{-t}$), run your method at step sizes $h, h/2, h/4, h/8, h/16$, compute the error at $t = T$ for each, and fit a line in log-log coordinates. For Euler the slope should be $\approx 1$; for midpoint $\approx 2$; for RK4 $\approx 4$. This is a beautiful experimental verification of theory — and it’s what convergenceOrder in our odes.ts shared module does. We verified empirically: $1.007$ for Euler, $4.014$ for RK4, both within rounding of theory.
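The same check can be sketched in a few lines of plain Python under the assumptions of the paragraph above ($y' = -y$, error measured at $T = 1$); the helper names are illustrative, not the odes.ts API:

```python
import math

def euler_solve(h):
    y = 1.0
    for _ in range(round(1.0 / h)):
        y += h * (-y)
    return y

def rk4_solve(h):
    y = 1.0
    for _ in range(round(1.0 / h)):
        k1 = -y
        k2 = -(y + h / 2 * k1)
        k3 = -(y + h / 2 * k2)
        k4 = -(y + h * k3)
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

def observed_order(solver, h=0.1):
    """log2 of the error ratio between step sizes h and h/2."""
    e1 = abs(solver(h) - math.exp(-1.0))
    e2 = abs(solver(h / 2) - math.exp(-1.0))
    return math.log2(e1 / e2)

p_euler = observed_order(euler_solve)   # close to 1
p_rk4 = observed_order(rk4_solve)       # close to 4
```

With more than two step sizes one would fit the log-log line by least squares, as the text describes; the two-point ratio is the minimal version of that fit.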

Log-log convergence plot showing global error vs. step size for Euler (slope 1), midpoint (slope 2), and RK4 (slope 4)

10. Computational Notes

Some practical guidance for using ODE solvers in Python and PyTorch.

scipy.integrate.solve_ivp is the standard SciPy interface. It supports several methods via the method argument:

  • 'RK45' (default) — Dormand-Prince adaptive 5(4) embedded pair. Best for non-stiff problems with moderate accuracy needs ($10^{-3}$ to $10^{-9}$).
  • 'RK23' — lower-order adaptive (2/3). Faster per step but takes more steps. Use when you only need rough accuracy.
  • 'DOP853' — Dormand-Prince 8(5,3). Higher-order; best for very high accuracy ($10^{-9}$ or smaller) on smooth problems.
  • 'Radau' — implicit Runge-Kutta of Radau IIA family, order 5, A-stable. Use for stiff problems.
  • 'BDF' — implicit multistep (Backward Differentiation Formulas). Use for stiff problems with moderate accuracy needs. The standard solver in many engineering codes.
  • 'LSODA' — automatic stiffness detection, switching between Adams and BDF. Use when you don’t know if your problem is stiff.

Pass dense_output=True to get an interpolation function $y(t)$ defined on the entire interval (not just at the solver’s chosen grid points). This uses the solver’s internal interpolant, whose accuracy matches the solver’s own error control, and is far cheaper than re-solving on a finer grid.

Initial step size estimation: if you don’t pass first_step, scipy estimates it via $h_0 \approx \|f(t_0, y_0)\|^{-1} \cdot \text{tol}^{1/(p+1)}$. This is a rough formula but usually good enough; the adaptive controller will refine $h$ within the first few steps.
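A minimal solve_ivp sketch tying these options together (the test equation and tolerance values are illustrative choices, assuming a recent SciPy):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Test problem y' = -y, y(0) = 1, with exact solution e^{-t}.
sol = solve_ivp(
    lambda t, y: -y,        # right-hand side f(t, y)
    (0.0, 5.0),             # integration interval
    [1.0],                  # initial condition
    method="RK45",          # the default; switch to "Radau"/"BDF"/"LSODA" if stiff
    rtol=1e-8, atol=1e-10,  # tighter than the defaults (1e-3, 1e-6)
    dense_output=True,      # attach a continuous interpolant as sol.sol
)

# Evaluate the dense interpolant anywhere in [0, 5], not just at sol.t.
ts = np.linspace(0.0, 5.0, 7)
err = np.max(np.abs(sol.sol(ts)[0] - np.exp(-ts)))
print(err)  # small, on the order of the requested tolerance
```

Note that sol.sol returns an array of shape (n_states, n_times); indexing [0] selects the single state component here.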

torchdiffeq is the PyTorch library for differentiable ODE solvers (Chen et al., 2018). It exposes odeint(func, y0, t, method=..., rtol=..., atol=...) with methods including 'dopri5' (Dormand-Prince), 'rk4' (fixed-step RK4), 'euler', and adaptive implicit methods. It supports both direct backpropagation through solver steps and the adjoint method (odeint_adjoint), which solves a backward-in-time ODE for the gradients — using $O(1)$ memory instead of $O(N)$, where $N$ is the number of solver steps. The adjoint method is what makes neural ODEs scalable.

diffrax is the JAX equivalent (Patrick Kidger), with a richer API and support for stochastic differential equations, controlled differential equations, and PDEs via the method of lines.

Diagnosing stiffness in production code: if your explicit-method solve is taking many small steps and you suspect stiffness, compute the Jacobian eigenvalues at a few representative points along the trajectory. If the spectral radius times the typical step size exceeds the stability region radius (around 2.78 for RK4), you’re hitting the stability boundary, not the accuracy boundary. Switch to an implicit method.
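That diagnostic can be sketched in a few lines of NumPy. The helper name and linear test system are illustrative; 2.785 is the approximate real-axis stability boundary of RK4 mentioned above:

```python
import numpy as np

RK4_STABILITY_RADIUS = 2.785  # approximate real-axis stability boundary for RK4

def is_stability_limited(jacobian, y_samples, h):
    """Heuristic stiffness check: does h * spectral_radius(J) exceed the
    explicit method's stability region at sampled trajectory points?"""
    for y in y_samples:
        rho = np.max(np.abs(np.linalg.eigvals(jacobian(y))))
        if h * rho > RK4_STABILITY_RADIUS:
            return True   # step size is constrained by stability, not accuracy
    return False

# Example: linear system y' = A y with eigenvalues -1 and -1000 (stiff).
A = np.diag([-1.0, -1000.0])
jac = lambda y: A  # Jacobian of a linear system is just A

print(is_stability_limited(jac, [np.zeros(2)], h=0.01))   # True: 0.01 * 1000 > 2.785
print(is_stability_limited(jac, [np.zeros(2)], h=0.001))  # False: 0.001 * 1000 < 2.785
```

A True result at your working step size is the signal to switch to 'Radau' or 'BDF'.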

A common pitfall: training a neural ODE with too-loose tolerance can introduce biased gradients. The adjoint method computes gradients by solving a backward ODE, and if the forward solver was inaccurate, the backward solver inherits that error. Set rtol and atol tight enough that the gradient noise from solver error is smaller than the gradient noise from the data. A good starting point: $\text{rtol} = 10^{-5}$, $\text{atol} = 10^{-7}$.

11. Connections to ML

The numerical methods of this topic are not just mathematical curiosities — they are the computational engine of several active areas of deep learning. Four connections, each substantive enough to be its own research program.

📝 Example 16 (Neural ODEs — the forward pass as an IVP)

A neural ODE (Chen, Rubanova, Bettencourt & Duvenaud, 2018) defines a continuous-depth network: instead of stacking discrete layers $h_{l+1} = \sigma(W_l h_l + b_l)$, the network is parameterized as $\frac{d\mathbf{h}}{dt} = f_\theta(\mathbf{h}(t), t)$, $\mathbf{h}(0) = \mathbf{x}$, $\mathbf{h}(T) = \text{output}$, where $f_\theta$ is a neural network and $T$ is a fixed integration time (the analog of “depth”). The forward pass is literally an IVP solve: compute $\mathbf{h}(T)$ from $\mathbf{h}(0)$ using an adaptive ODE solver, typically Dormand-Prince RK45 (torchdiffeq.odeint(f, h0, t=[0, T], method='dopri5')).

The backward pass is even more interesting. Naive backprop through solver steps would require storing all intermediate $\mathbf{h}(t_i)$ — memory that grows with the number of steps the adaptive solver takes. Chen et al.’s adjoint method instead solves a backward-in-time ODE for the gradients $d\mathcal{L}/d\theta$, using a result from optimal control theory (Pontryagin’s maximum principle). The backward solve uses $O(1)$ memory because it only needs $\mathbf{h}(T)$ and the loss gradient at $T$ — everything else is reconstructed by integrating backwards.

The choice of solver matters. 'dopri5' (adaptive) concentrates computation in regions of rapid change in $\mathbf{h}$ — good for problems with localized features. 'rk4' (fixed-step) is simpler and gives reproducible step counts but wastes effort on smooth regions. 'euler' (Euler’s method, fixed-step) reproduces the discrete ResNet exactly when the step size equals 1 — making explicit the “ResNet = discretized neural ODE” correspondence.
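The last correspondence is easy to verify numerically. Here is a NumPy sketch (the shared tanh layer standing in for $f_\theta$ is a toy assumption): ten residual-block updates and ten forward Euler steps of size 1 produce identical states.

```python
import numpy as np

# Toy shared "layer" used as the vector field f; the weights are arbitrary.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))
f = lambda h_state: np.tanh(W @ h_state)

x = rng.normal(size=4)

# ResNet view: ten residual blocks h_{l+1} = h_l + f(h_l).
h_res = x.copy()
for _ in range(10):
    h_res = h_res + f(h_res)

# Neural-ODE view: forward Euler on dh/dt = f(h), t in [0, 10], step size 1.
h_ode = x.copy()
for _ in range(10):
    h_ode = h_ode + 1.0 * f(h_ode)

print(np.allclose(h_res, h_ode))  # True: the updates are identical
```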

Already referenced in First-Order ODEs & Existence Theorems Section 11. Now you understand why the ODE solver theory matters: the solver’s order $p$ determines the gradient accuracy of training, and the solver’s stability region determines whether you can train networks with stiff dynamics at all.

📝 Example 17 (Gradient descent as forward Euler)

Recall the gradient flow ODE from Topic 23: $\dot{\boldsymbol\theta} = -\nabla L(\boldsymbol\theta)$, $\boldsymbol\theta(0) = \boldsymbol\theta_0$. This is the continuous-time analog of an optimizer that always moves in the steepest-descent direction. Its equilibria are the critical points of $L$, and its asymptotic stability properties (which we analyzed via Hessian eigenvalues in Topic 23) match the convergence properties of optimization methods.

Now apply forward Euler with step size $\eta$: $\boldsymbol\theta_{n+1} = \boldsymbol\theta_n + \eta \cdot (-\nabla L(\boldsymbol\theta_n)) = \boldsymbol\theta_n - \eta \nabla L(\boldsymbol\theta_n)$. This is gradient descent with learning rate $\eta$. Gradient descent is literally forward Euler applied to the gradient flow. Every theorem we proved about Euler’s method applies directly:

  • Local truncation error $\tau = (\eta^2/2)\ddot{\boldsymbol\theta} = (\eta^2/2)\nabla^2 L \cdot \nabla L$ (since $\ddot{\boldsymbol\theta} = -\nabla^2 L \cdot \dot{\boldsymbol\theta} = \nabla^2 L \cdot \nabla L$ along the flow) — error per step depends on the Hessian times the gradient.
  • Stability condition for the test problem $\dot\theta = -\lambda \theta$ requires $\eta < 2/\lambda$. Translating: at a quadratic minimum where the loss looks like $L(\theta) \approx (1/2) \lambda (\theta - \theta^*)^2$, gradient descent is stable iff $\eta < 2/\lambda$. This is the well-known “learning rate must be less than 2/L, where L is the smoothness constant” rule from the optimization literature — it’s just the forward Euler stability constraint.
  • Higher-order methods correspond to richer optimizers. Heavy-ball / Polyak momentum is a discretization of the second-order ODE $\ddot\theta + \mu\dot\theta = -\nabla L$ — a damped harmonic oscillator with the gradient as a forcing term. Nesterov’s accelerated method is also a discretization of a second-order ODE, specifically $\ddot\theta + (3/t)\dot\theta + \nabla L(\theta) = 0$ (Su, Boyd & Candès 2014) — the $3/t$ damping term is the continuous-time origin of Nesterov’s accelerated $O(1/k^2)$ convergence rate, compared to plain GD’s $O(1/k)$. Adam is more elaborate: its per-coordinate adaptive step size is closely related to adaptive step-size control in ODE solvers (though Adam adapts based on gradient magnitudes, not error estimates).

This was already set up in Stability & Dynamical Systems Section 9. Now we have the full picture: the entire field of optimization is the field of numerical ODE methods applied to one specific equation — the gradient flow.
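A few lines of Python make the stability claim concrete (the one-dimensional quadratic with curvature $\lambda = 10$ is chosen purely for illustration):

```python
# Quadratic loss L(θ) = (λ/2)·θ², so the gradient flow is θ' = -λθ.
# Forward Euler with step η gives θ_{n+1} = (1 - ηλ)·θ_n,
# which contracts iff |1 - ηλ| < 1, i.e. η < 2/λ.
lam = 10.0  # curvature (Hessian eigenvalue) of the toy loss

def gd(eta, theta0=1.0, steps=100):
    theta = theta0
    for _ in range(steps):
        theta -= eta * lam * theta  # gradient step: ∇L(θ) = λθ
    return theta

print(abs(gd(0.19)))  # η = 1.9/λ < 2/λ: contracts toward the minimum
print(abs(gd(0.21)))  # η = 2.1/λ > 2/λ: oscillates and blows up
```

Crossing $\eta = 2/\lambda = 0.2$ flips the iteration from geometric decay to geometric divergence, exactly as the forward Euler stability analysis predicts.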

💡 Remark 7 (Stiffness in training — condition number as stiffness ratio)

A loss landscape with a poorly-conditioned Hessian — eigenvalue ratio $\lambda_{\max}/\lambda_{\min} \gg 1$ — is a stiff problem for gradient descent. The fast eigendirection requires a small learning rate ($\eta < 2/\lambda_{\max}$) for stability, but the slow eigendirection then takes $\sim \lambda_{\max}/\lambda_{\min}$ iterations to converge. The condition number of the Hessian is exactly the stiffness ratio of the gradient flow.

This is why ill-conditioned losses are slow to train. Standard SGD is forward Euler — an explicit method — and inherits all of the explicit-method limitations on stiff problems. The remedies parallel those of stiff ODE solving:

  • Preconditioning (as in natural gradient, KFAC, Shampoo) reshapes the Hessian to reduce its condition number. Mathematically, this is changing variables in the gradient flow to make the system non-stiff. Computationally, it costs more per step (you have to invert or factor a preconditioner) but the steps can be much larger.
  • Implicit methods for optimization (proximal gradient, mirror descent with implicit updates) take large stable steps even on ill-conditioned losses. The cost is solving an implicit equation per step — typically a small inner optimization. On strongly convex problems, implicit gradient descent has no learning-rate stability constraint, just like implicit Euler.
  • Adam and related adaptive methods approximate per-coordinate preconditioning. Their effective stiffness ratio is the condition number after the per-coordinate rescaling, often much better than the raw Hessian’s. This is why Adam works well on transformer training, where the raw Hessian condition number can exceed $10^{10}$.

The field of optimization is rediscovering numerical analysis lessons from the 1960s. Hairer-Wanner Vol II (the stiff-ODE bible) is, in a real sense, the textbook for designing optimizers that handle ill-conditioned losses. The ML and numerical analysis communities are slowly converging on this realization.
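A small NumPy experiment illustrates the stiffness claim. The diagonal Hessian with condition number 1000 and the exact inverse-Hessian preconditioner are idealized assumptions, standing in for a real loss and a real preconditioner like KFAC:

```python
import numpy as np

# Ill-conditioned quadratic L(θ) = ½ θᵀHθ with condition number κ(H) = 1000.
H = np.diag([1.0, 1000.0])
theta0 = np.array([1.0, 1.0])

def steps_to_converge(precond, eta, tol=1e-6, max_steps=100_000):
    """Run preconditioned GD θ ← θ - η·P·∇L(θ) and count iterations to ‖θ‖ < tol."""
    theta = theta0.copy()
    for k in range(max_steps):
        theta = theta - eta * precond @ (H @ theta)
        if np.linalg.norm(theta) < tol:
            return k + 1
    return max_steps

# Plain GD: η is capped near 2/λ_max for stability, so the slow
# eigendirection needs on the order of κ iterations.
plain = steps_to_converge(np.eye(2), eta=1.9 / 1000.0)

# Idealized Newton-like preconditioning with H⁻¹: the system is no longer stiff.
pre = steps_to_converge(np.linalg.inv(H), eta=1.0)

print(plain, pre)  # plain takes thousands of steps; preconditioned converges immediately
```

The per-step cost of applying $H^{-1}$ is what real preconditioners trade against this dramatic reduction in step count.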


Two interleaved spirals representing two classes. The learned vector field rotates them at different rates depending on radius, untangling them into linearly separable half-planes by t = T. This is the canonical demo from Chen et al. 2018. The left panel shows the vector field and how sample points flow through it from time 0 (open circles) to time t (filled circles). The right panel shows the empirical convergence order of Euler and RK4 on the y-component of the first trajectory — a direct verification of the theoretical orders 1 and 4, and an experimental illustration of the Lax equivalence theorem.

12. Connections & Further Reading

This is the fourth and final topic in Track 6 — Ordinary Differential Equations. With this topic the entire ODE track is published: existence (Picard-Lindelöf), structure (matrix exponential), qualitative behavior (Lyapunov, bifurcation), and now computation (Runge-Kutta, adaptive solvers, implicit methods, convergence theory). The shared odes.ts utility module reaches its final form with 39 exports — 17 interfaces and 22 functions — that powered every interactive visualization across all four topics.

The conceptual arc of the track: Topic 21 asked does a solution exist? Topic 22 asked what does it look like for linear systems? Topic 23 asked how does it behave qualitatively for nonlinear systems? Topic 24 asks how do we actually compute it? Each question depends on the previous ones, and each answer brings us closer to the computational reality that ML practitioners actually face.

Within formalCalculus:

  • First-Order ODEs & Existence Theorems — Euler’s method and RK4 were introduced here as computational tools. This topic provides the full error analysis (Theorems 1 and 2), explains why RK4 is order 4, and proves convergence via the Lax equivalence theorem.
  • Linear Systems & Matrix Exponential — exact solutions $e^{At} \mathbf{y}_0$ serve as benchmarks for measuring numerical accuracy (Examples 14 and 15), and the test equation $y' = \lambda y$ is the foundation of all stability analysis (Section 5).
  • Stability & Dynamical Systems — numerical stability mirrors continuous-time stability. The stiffness ratio is computed from the Jacobian eigenvalues using the same tools as the linearization stability theorem; the stiffnessDetector function in odes.ts literally reuses eigenDecomposition2x2 from Topic 22.
  • Mean Value Theorem & Taylor Expansion — Taylor expansion is the fundamental engine for deriving the local truncation error of every method, including the explicit calculation $\tau_n = (h^2/2) y''(\xi_n)$ for Euler (Section 2) and the order conditions for RK4 (Section 3).
  • Metric Spaces & Topology — the abstract Banach Contraction Mapping Theorem, which generalizes the Newton-iteration convergence argument used for implicit methods in this topic. Topic 29’s Example 10 is precisely the implicit-method calculation from Section 7 viewed from the abstract side.

Forward to formalml.com:

  • Random Walks — discrete-time approximation of continuous diffusion processes. The convergence theory developed here (consistency, stability, Lax equivalence) is the analytical framework for understanding when discrete random walks approximate continuous SDEs and at what rate. The numerical analysis of SDEs (Euler-Maruyama, Milstein, stochastic Runge-Kutta) is a direct generalization of the deterministic theory in this topic, with extra care for the fact that Brownian increments scale like $\sqrt{h}$ rather than $h$.

References

  1. book Hairer, Nørsett & Wanner (1993). Solving Ordinary Differential Equations I: Nonstiff Problems. The definitive reference on Runge-Kutta methods, order conditions, and Butcher's algebraic theory. The standard text for everything in Sections 2–4.
  2. book Hairer & Wanner (1996). Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems. Companion volume covering stiff systems, A-stability, implicit methods, and DAEs. The reference for Sections 5–7.
  3. book Iserles (2008). A First Course in the Numerical Analysis of Differential Equations. Excellent undergraduate-level introduction to multistep methods and stability theory. Best for a first encounter with the Lax equivalence theorem.
  4. book Butcher (2016). Numerical Methods for Ordinary Differential Equations. The original source for the Butcher tableau formalism and order conditions. The combinatorial theory of trees underlying Section 3.
  5. book Burden & Faires (2015). Numerical Analysis. Standard undergraduate text — Euler, RK4, error analysis. Accessible treatment of the foundational material.
  6. paper Chen, Rubanova, Bettencourt & Duvenaud (2018). “Neural Ordinary Differential Equations”. NeurIPS 2018 best paper. Introduced continuous-depth networks via adaptive ODE solvers and the adjoint method for backpropagation. The paper Section 11 builds on.
  7. paper He, Zhang, Ren & Sun (2016). “Deep Residual Learning for Image Recognition”. ResNets as Euler discretizations of continuous dynamics — the architectural connection between numerical ODE methods and deep learning.