ODEs · intermediate · 50 min read

Linear Systems & Matrix Exponential

From scalar ODEs to coupled systems — eigenvalue methods, phase portraits, and the matrix exponential that unifies them. The linear algebra engine of dynamical systems.

Abstract. A system of first-order ODEs y' = Ay couples multiple evolving quantities through a matrix A. The solution is y(t) = e^{At}y₀, where the matrix exponential e^{At} = Σ (At)^k/k! is a matrix-valued power series that always converges. When A is diagonalizable with eigendecomposition A = PDP⁻¹, the matrix exponential factors as e^{At} = P·diag(e^{λ₁t}, ..., e^{λₙt})·P⁻¹, reducing the system to n independent scalar equations — each solved by the methods of Topic 21. For 2×2 systems, the eigenvalues of A completely classify the phase portrait: real eigenvalues of the same sign produce nodes, opposite signs produce saddles, and complex eigenvalues produce spirals or centers. The trace-determinant plane organizes this classification into a single diagram. Non-homogeneous systems y' = Ay + g(t) are solved by variation of parameters: y(t) = e^{At}y₀ + ∫₀ᵗ e^{A(t-s)}g(s)ds — the convolution of the matrix exponential with the forcing term. The Picard-Lindelöf theorem extends immediately to vector IVPs because any matrix A is Lipschitz with constant ‖A‖. In machine learning, linear systems appear everywhere: gradient descent on a quadratic loss L(θ) = ½θᵀHθ yields the system θ̇ = -Hθ whose convergence rate is controlled by the eigenvalues of the Hessian H. Structured state-space models (S4, Mamba) parameterize sequence models as linear systems with learned matrices A, B, C — discretized via the matrix exponential for efficient training.

Where this leads → formalML

  • formalML Gradient descent on a quadratic loss L(θ) = ½θᵀHθ is the linear system θ̇ = −Hθ with solution θ(t) = e^{−Ht}θ₀. The eigenvalues of the Hessian H determine convergence rates: each eigen-component decays as e^{−λᵢt}. The condition number κ(H) = λ_max/λ_min is the ratio of fastest to slowest decay, explaining why ill-conditioned problems converge slowly.
  • formalML The spectral decomposition A = PDP⁻¹ is the computational engine of the matrix exponential. The spectral theorem guarantees that symmetric matrices have real eigenvalues and orthogonal eigenvectors, yielding e^{At} = Σ e^{λᵢt}vᵢvᵢᵀ — a sum of rank-one exponentials. This spectral representation appears in kernel methods (Mercer's theorem), PCA, and spectral graph theory.
  • formalML Natural gradient descent θ̇ = −F⁻¹∇L defines a linear system in the tangent space of the statistical manifold. The Fisher information matrix F acts as the metric tensor, and the matrix exponential e^{−F⁻¹∇²L·t} describes local training dynamics in natural gradient coordinates.

1. From Scalar to System — The Vector ODE

In First-Order ODEs, we solved scalar equations y' = f(t, y) — one unknown function evolving according to one rule. But almost nothing interesting is one-dimensional. Two masses connected by a spring give a system of two coupled equations. A circuit with an inductor and a capacitor gives another. A neural network with p parameters evolving under gradient descent gives a system of p coupled equations. The passage from scalar to system is the passage from one-dimensional calculus to linear algebra.

A system of n first-order ODEs can always be written as a single vector equation \mathbf{y}' = \mathbf{f}(t, \mathbf{y}), where \mathbf{y}(t) \in \mathbb{R}^n is a vector-valued function. For a linear system with constant coefficients, this becomes \mathbf{y}' = A\mathbf{y}, where A \in \mathbb{R}^{n \times n} is a constant matrix. The direction field from Topic 21 — a slope at every point — becomes a vector field: at each point \mathbf{y}, the matrix A assigns a velocity vector A\mathbf{y}.

Why this matters for ML: Neural network parameters form a vector \theta \in \mathbb{R}^p, and gradient descent \dot{\theta} = -\nabla L(\theta) is a system of p coupled ODEs. Near a minimum, linearizing gives \dot{\theta} \approx -H\theta — a linear system whose behavior is governed by the Hessian eigenvalues. Understanding linear systems is understanding the local behavior of gradient descent.

📐 Definition 1 (Linear System of ODEs)

A linear system of first-order ordinary differential equations with constant coefficients is an equation

\mathbf{y}'(t) = A\mathbf{y}(t),

where A \in \mathbb{R}^{n \times n} is a constant matrix and \mathbf{y}: I \to \mathbb{R}^n is the unknown vector-valued function. The initial value problem is \mathbf{y}'(t) = A\mathbf{y}(t), \mathbf{y}(0) = \mathbf{y}_0. The system is homogeneous if there is no additional forcing term; the non-homogeneous version is \mathbf{y}'(t) = A\mathbf{y}(t) + \mathbf{g}(t).

📝 Example 1 (The harmonic oscillator as a system)

A mass on a spring, with position y_1 = y and velocity y_2 = y', satisfies y_1' = y_2, y_2' = -\omega^2 y_1. In matrix form:

\mathbf{y}' = \begin{pmatrix} 0 & 1 \\ -\omega^2 & 0 \end{pmatrix} \mathbf{y}.

The eigenvalues are \pm i\omega, producing oscillatory solutions y_1(t) = A\cos(\omega t + \phi) — the system rotates in the phase plane. This is the harmonic oscillator, the most fundamental dynamical system in physics and engineering.

📝 Example 2 (Higher-order to first-order conversion)

The second-order equation y'' + 3y' + 2y = 0 converts to a first-order system via y_1 = y, y_2 = y':

\mathbf{y}' = \begin{pmatrix} 0 & 1 \\ -2 & -3 \end{pmatrix} \mathbf{y}.

Every n-th order linear ODE becomes an n \times n first-order system. This reduction is not just a notational convenience — it places all linear ODEs (regardless of order) into a single framework where eigenvalue methods apply uniformly.
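As a quick numerical sketch (the matrix is from Example 2; everything else here is illustrative), the companion matrix's eigenvalues should be the roots of the characteristic polynomial \lambda^2 + 3\lambda + 2 = (\lambda + 1)(\lambda + 2):

```python
import numpy as np

# Companion matrix of y'' + 3y' + 2y = 0 (Example 2)
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])

# Eigenvalues = roots of lambda^2 + 3*lambda + 2
eigs = np.sort(np.linalg.eigvals(A).real)
print(eigs)  # [-2. -1.]
```

The same recipe works for any order: put 0/1 on the superdiagonal and the negated coefficients in the last row.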

💡 Remark 1 (Autonomous vs. non-autonomous systems)

When A is constant, the system is autonomous: the velocity field A\mathbf{y} depends only on the state \mathbf{y}, not on time. This is the typical case in ML — the loss landscape does not change during training (for fixed data). Time-dependent systems \mathbf{y}' = A(t)\mathbf{y} are harder and appear in §7 via the non-homogeneous formulation.

Scalar ODE direction field compared with 2D vector field for a coupled system, and higher-order to first-order conversion diagram

2. The Matrix Exponential

The scalar ODE y' = ay has the solution y(t) = e^{at}y_0. The system \mathbf{y}' = A\mathbf{y} has the solution \mathbf{y}(t) = e^{At}\mathbf{y}_0, where the matrix exponential is defined by the power series:

e^{At} = I + At + \frac{(At)^2}{2!} + \frac{(At)^3}{3!} + \cdots = \sum_{k=0}^{\infty} \frac{(At)^k}{k!}

This series always converges for any square matrix A and any t \in \mathbb{R}. The proof uses the submultiplicativity of the operator norm: \|A^k\| \leq \|A\|^k, so the series is dominated by \sum \|A\|^k |t|^k / k! = e^{\|A\||t|} < \infty.

The matrix exponential is a power series in the matrix variable At, with infinite radius of convergence. Everything we proved about scalar power series in Topic 18 — absolute convergence, term-by-term differentiation — extends to matrix-valued series.

📐 Definition 2 (Matrix Exponential)

For any n \times n matrix A and scalar t \in \mathbb{R}, the matrix exponential is

e^{At} = \sum_{k=0}^{\infty} \frac{(At)^k}{k!},

where A^0 = I (the identity matrix) and A^k denotes the k-th matrix power. The series converges absolutely in any matrix norm.

🔷 Theorem 1 (Convergence of the Matrix Exponential)

For any A \in \mathbb{R}^{n \times n} and t \in \mathbb{R}, the series \sum_{k=0}^{\infty} \frac{(At)^k}{k!} converges absolutely, and \|e^{At}\| \leq e^{\|A\||t|}.

Proof.

For any submultiplicative norm \|\cdot\|, we have

\left\|\frac{(At)^k}{k!}\right\| \leq \frac{\|A\|^k |t|^k}{k!}.

The scalar series \sum_{k=0}^{\infty} \frac{(\|A\||t|)^k}{k!} = e^{\|A\||t|} converges (it is the exponential function evaluated at \|A\||t|). Therefore the matrix series converges absolutely by comparison.

Absolute convergence in a finite-dimensional normed space implies convergence (because \mathbb{R}^{n \times n} with any matrix norm is a complete normed space), so e^{At} is well-defined.

The bound follows from the triangle inequality applied to the partial sums:

\|e^{At}\| = \left\|\sum_{k=0}^{\infty} \frac{(At)^k}{k!}\right\| \leq \sum_{k=0}^{\infty} \frac{\|A\|^k |t|^k}{k!} = e^{\|A\||t|}. \qquad \blacksquare

🔷 Theorem 2 (Properties of the Matrix Exponential)

Let A, B \in \mathbb{R}^{n \times n} and t, s \in \mathbb{R}. Then:

(a) e^{A \cdot 0} = I.

(b) \displaystyle\frac{d}{dt}e^{At} = Ae^{At} = e^{At}A.

(c) e^{A(t+s)} = e^{At}e^{As} for all t, s.

(d) e^{At} is invertible with (e^{At})^{-1} = e^{-At}.

(e) If AB = BA, then e^{(A+B)t} = e^{At}e^{Bt}.

(f) If AB \neq BA, then e^{(A+B)t} \neq e^{At}e^{Bt} in general.

💡 Remark 2 (The commutativity caveat)

Property (e) fails without commutativity. The Baker-Campbell-Hausdorff formula gives the correction: e^A e^B = e^{A + B + \frac{1}{2}[A,B] + \cdots}, where [A, B] = AB - BA is the commutator. This non-commutativity is why matrix exponentials are harder than scalar exponentials, and why Lie theory is the natural language for matrix groups. In ML, this matters when composing transformations that don't commute — a rotation followed by a non-uniform scaling is not the same as the scaling followed by the rotation.
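Properties (e) and (f) are easy to check numerically with `scipy.linalg.expm`; the matrices below are illustrative choices, not from the text:

```python
import numpy as np
from scipy.linalg import expm

# A classic non-commuting pair: [A, B] = AB - BA != 0
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0, 0.0], [1.0, 0.0]])
commutator = A @ B - B @ A

# Property (f): e^{A+B} differs from e^A e^B when AB != BA
gap = np.linalg.norm(expm(A + B) - expm(A) @ expm(B))

# Property (e): diagonal matrices commute, so equality holds
C = np.diag([1.0, 2.0])
D = np.diag([3.0, -1.0])
ok = np.linalg.norm(expm(C + D) - expm(C) @ expm(D))

print(gap, ok)  # gap is order 1, ok is numerically zero
```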

📝 Example 3 (Diagonal and nilpotent exponentials)

If A = \text{diag}(\lambda_1, \ldots, \lambda_n), then e^{At} = \text{diag}(e^{\lambda_1 t}, \ldots, e^{\lambda_n t}) — each component evolves independently.

If N is nilpotent (N^m = 0 for some m), then e^{Nt} = I + Nt + \frac{(Nt)^2}{2!} + \cdots + \frac{(Nt)^{m-1}}{(m-1)!} — a matrix polynomial in t, since all higher powers vanish. This finite truncation makes nilpotent exponentials easy to compute.
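Both special cases can be verified against the general-purpose `scipy.linalg.expm` (a minimal sketch; the specific eigenvalues are arbitrary):

```python
import numpy as np
from scipy.linalg import expm

t = 0.7

# Diagonal case: e^{At} = diag(e^{lambda_i t})
lams = np.array([1.0, -2.0, 3.0])
D = np.diag(lams)
diag_ok = np.allclose(expm(D * t), np.diag(np.exp(lams * t)))

# Nilpotent case: N @ N == 0, so the series truncates to e^{Nt} = I + Nt
N = np.array([[0.0, 1.0], [0.0, 0.0]])
nil_ok = np.allclose(expm(N * t), np.eye(2) + N * t)

print(diag_ok, nil_ok)
```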


Partial sums of the matrix exponential converging, with entry-wise error on a log scale

3. Eigenvalue Methods — Diagonalizable Systems

If A has n linearly independent eigenvectors — i.e., A is diagonalizable with A = PDP^{-1} where D = \text{diag}(\lambda_1, \ldots, \lambda_n) — then the matrix exponential factors cleanly:

e^{At} = Pe^{Dt}P^{-1} = P \begin{pmatrix} e^{\lambda_1 t} & & \\ & \ddots & \\ & & e^{\lambda_n t} \end{pmatrix} P^{-1}

This reduces the system to n independent scalar equations. The general solution is \mathbf{y}(t) = c_1 e^{\lambda_1 t}\mathbf{v}_1 + \cdots + c_n e^{\lambda_n t}\mathbf{v}_n, where \mathbf{v}_i are the eigenvectors and c_i are determined by the initial condition. Each term c_i e^{\lambda_i t}\mathbf{v}_i is a scalar ODE solution (Topic 21) dressed in the direction of its eigenvector.

The Jacobian introduced eigendecomposition as a change of basis that diagonalizes a linear map. Here the same idea solves a differential equation: in the eigenbasis, the coupled system decouples into n independent exponentials.

🔷 Theorem 3 (Solution via Eigendecomposition)

Let A \in \mathbb{R}^{n \times n} be diagonalizable with eigenvalues \lambda_1, \ldots, \lambda_n and corresponding linearly independent eigenvectors \mathbf{v}_1, \ldots, \mathbf{v}_n. The general solution of \mathbf{y}' = A\mathbf{y} is

\mathbf{y}(t) = c_1 e^{\lambda_1 t}\mathbf{v}_1 + \cdots + c_n e^{\lambda_n t}\mathbf{v}_n,

where c_i \in \mathbb{R} (or \mathbb{C}) are constants determined by \mathbf{y}(0) = \mathbf{y}_0.

Proof.

Each e^{\lambda_i t}\mathbf{v}_i satisfies

\frac{d}{dt}\bigl[e^{\lambda_i t}\mathbf{v}_i\bigr] = \lambda_i e^{\lambda_i t}\mathbf{v}_i = A\bigl(e^{\lambda_i t}\mathbf{v}_i\bigr)

because A\mathbf{v}_i = \lambda_i \mathbf{v}_i. By linearity of differentiation, any linear combination \sum c_i e^{\lambda_i t}\mathbf{v}_i is also a solution.

Since \{\mathbf{v}_1, \ldots, \mathbf{v}_n\} is a basis for \mathbb{R}^n, every initial condition \mathbf{y}_0 = \sum c_i \mathbf{v}_i can be uniquely represented, so the general solution captures all solutions. \blacksquare

📝 Example 4 (A 2×2 system with real eigenvalues)

Consider A = \begin{pmatrix} -1 & 2 \\ 0 & -3 \end{pmatrix}. The eigenvalues are \lambda_1 = -1, \lambda_2 = -3, with eigenvectors \mathbf{v}_1 = (1, 0)^T, \mathbf{v}_2 = (1, -1)^T. The general solution is

\mathbf{y}(t) = c_1 e^{-t}\begin{pmatrix} 1 \\ 0 \end{pmatrix} + c_2 e^{-3t}\begin{pmatrix} 1 \\ -1 \end{pmatrix}.

Both eigenvalues are negative, so all solutions decay to the origin — a stable node. The fast component (e^{-3t}) dies out first, and trajectories ultimately align with the slow eigendirection \mathbf{v}_1.
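The eigenbasis solution can be checked directly against the matrix exponential; the initial condition below is an arbitrary illustration:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[-1.0, 2.0], [0.0, -3.0]])    # Example 4
y0 = np.array([2.0, 1.0])                    # arbitrary initial condition

# Expand y0 in the eigenbasis: y0 = c1*v1 + c2*v2
V = np.array([[1.0, 1.0],
              [0.0, -1.0]])                  # columns v1 = (1,0), v2 = (1,-1)
c = np.linalg.solve(V, y0)

t = 1.3
y_eig = c[0] * np.exp(-1.0 * t) * V[:, 0] + c[1] * np.exp(-3.0 * t) * V[:, 1]
y_exp = expm(A * t) @ y0
match = np.allclose(y_eig, y_exp)
print(match)
```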

📝 Example 5 (Complex eigenvalues)

The rotation matrix A = \begin{pmatrix} 0 & -\omega \\ \omega & 0 \end{pmatrix} has eigenvalues \pm i\omega. The solution is \mathbf{y}(t) = \mathbf{c}_1 \cos(\omega t) + \mathbf{c}_2 \sin(\omega t) — pure rotation in the phase plane (a center). With damping, A = \begin{pmatrix} -\alpha & -\omega \\ \omega & -\alpha \end{pmatrix} (\alpha > 0), the eigenvalues become -\alpha \pm i\omega, and solutions spiral inward — a stable spiral.

💡 Remark 3 (Real solutions from complex eigenvalues)

When A is real and \lambda = \alpha + i\beta is a complex eigenvalue with eigenvector \mathbf{v} = \mathbf{a} + i\mathbf{b}, the two real-valued solutions are

e^{\alpha t}\bigl(\cos(\beta t)\,\mathbf{a} - \sin(\beta t)\,\mathbf{b}\bigr) \quad\text{and}\quad e^{\alpha t}\bigl(\sin(\beta t)\,\mathbf{a} + \cos(\beta t)\,\mathbf{b}\bigr).

The factor e^{\alpha t} controls growth or decay; \cos(\beta t) and \sin(\beta t) produce rotation. When \alpha < 0, we get spirals that decay inward; when \alpha > 0, spirals that grow outward; when \alpha = 0, closed ellipses.

Eigendecomposition visualization showing eigenvectors determining solution structure

4. Phase Portraits for 2×2 Systems

For a 2 \times 2 system \mathbf{y}' = A\mathbf{y}, the qualitative behavior is completely determined by the eigenvalues of A. The trace-determinant plane organizes all possibilities into a single diagram — one of the most useful classification tools in dynamical systems.

The eigenvalues of a 2 \times 2 matrix are \lambda = \frac{\tau \pm \sqrt{\tau^2 - 4\Delta}}{2}, where \tau = \operatorname{tr}(A) and \Delta = \det(A). Every qualitative feature of the phase portrait — stability, oscillation, node vs. spiral — is encoded in these two numbers.

📐 Definition 3 (Phase Portrait)

The phase portrait of \mathbf{y}' = A\mathbf{y} is the collection of all solution trajectories \mathbf{y}(t) plotted in the phase plane (y_1, y_2). The origin \mathbf{0} is the only equilibrium point (assuming A is invertible, i.e., \det(A) \neq 0).

🔷 Theorem 4 (Trace-Determinant Classification)

Let A be a 2 \times 2 real matrix with \tau = \operatorname{tr}(A) and \Delta = \det(A), and define the discriminant D = \tau^2 - 4\Delta. The phase portrait of \mathbf{y}' = A\mathbf{y} is classified as follows:

(i) \Delta < 0: saddle (eigenvalues of opposite sign).

(ii) \Delta > 0, D > 0, \tau < 0: stable node (distinct negative real eigenvalues).

(iii) \Delta > 0, D > 0, \tau > 0: unstable node (distinct positive real eigenvalues).

(iv) \Delta > 0, D < 0, \tau < 0: stable spiral (complex eigenvalues with negative real part).

(v) \Delta > 0, D < 0, \tau > 0: unstable spiral (complex eigenvalues with positive real part).

(vi) \Delta > 0, \tau = 0: center (pure imaginary eigenvalues).

(vii) \Delta = 0: degenerate (at least one zero eigenvalue; line of equilibria).

📝 Example 6 (The trace-determinant plane as a map)

Place a dot at (\tau, \Delta) for any 2 \times 2 matrix, and the diagram immediately tells you the phase portrait type. The parabola \Delta = \tau^2/4 separates nodes from spirals. The vertical line \tau = 0 separates stable (\tau < 0) from unstable (\tau > 0). The horizontal line \Delta = 0 separates saddles from non-saddles. Every 2 \times 2 linear system lives at exactly one point on this map.
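Theorem 4 translates directly into a small classifier (a sketch; the tolerance handling and the borderline D = 0 case, lumped in with nodes here, are my choices):

```python
import numpy as np

def classify(A, tol=1e-9):
    """Phase-portrait type of y' = Ay from trace and determinant (Theorem 4)."""
    tau, delta = np.trace(A), np.linalg.det(A)
    D = tau**2 - 4 * delta
    if abs(delta) < tol:
        return "degenerate"
    if delta < 0:
        return "saddle"
    if abs(tau) < tol:
        return "center"
    stability = "stable" if tau < 0 else "unstable"
    shape = "node" if D >= 0 else "spiral"   # D == 0: defective (borderline) node
    return f"{stability} {shape}"

print(classify(np.array([[-1.0, 2.0], [0.0, -3.0]])))  # stable node
```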

💡 Remark 4 (Phase portraits and the Hessian)

Compare with the second-derivative test: the Hessian's eigenvalues classify critical points of a function as minima (\lambda_1, \lambda_2 > 0), maxima (\lambda_1, \lambda_2 < 0), or saddle points (mixed signs). The ODE system matrix's eigenvalues classify equilibria as stable nodes, unstable nodes, or saddles — the same eigenvalue analysis, governing dynamics instead of statics. Gradient descent connects them: \dot{\theta} = -\nabla^2 f \cdot \theta near a critical point.

Interactive trace-determinant explorer. For A = \begin{pmatrix} -1 & 2 \\ 0 & -3 \end{pmatrix}: stable node with \lambda_1 = -1, \lambda_2 = -3, \tau = -4, \Delta = 3, D = 4.

The six canonical 2×2 phase portraits: stable node, unstable node, saddle, stable spiral, unstable spiral, center

5. Repeated and Complex Eigenvalues — Beyond Diagonalizability

What happens when A is defective — a repeated eigenvalue \lambda with only one linearly independent eigenvector? The matrix is not diagonalizable, and the Jordan normal form provides the solution. We find a generalized eigenvector \mathbf{w} satisfying (A - \lambda I)\mathbf{w} = \mathbf{v} (where \mathbf{v} is the eigenvector), and the solution involves terms te^{\lambda t} — the same polynomial-times-exponential that appeared in the scalar repeated-root case.

📐 Definition 4 (Generalized Eigenvector)

A vector \mathbf{w} is a generalized eigenvector of rank 2 for A with eigenvalue \lambda if (A - \lambda I)\mathbf{w} = \mathbf{v}, where \mathbf{v} is an eigenvector ((A - \lambda I)\mathbf{v} = \mathbf{0}, \mathbf{v} \neq \mathbf{0}). More generally, a generalized eigenvector of rank k satisfies (A - \lambda I)^k \mathbf{w} = \mathbf{0} but (A - \lambda I)^{k-1}\mathbf{w} \neq \mathbf{0}.

🔷 Theorem 5 (Solution for Defective 2×2 Systems)

If A has a repeated eigenvalue \lambda with a single eigenvector \mathbf{v} and generalized eigenvector \mathbf{w} (with (A - \lambda I)\mathbf{w} = \mathbf{v}), then the general solution is

\mathbf{y}(t) = c_1 e^{\lambda t}\mathbf{v} + c_2 e^{\lambda t}(t\mathbf{v} + \mathbf{w}).

📝 Example 7 (A defective node)

A = \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}. The repeated eigenvalue is \lambda = 2, with a single eigenvector \mathbf{v} = (1, 0)^T. The generalized eigenvector satisfies (A - 2I)\mathbf{w} = \mathbf{v}, giving \mathbf{w} = (0, 1)^T. The solution is

\mathbf{y}(t) = c_1 e^{2t}\begin{pmatrix} 1 \\ 0 \end{pmatrix} + c_2 e^{2t}\left(t\begin{pmatrix} 1 \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ 1 \end{pmatrix}\right).

The phase portrait is a degenerate unstable node — all trajectories are tangent to the eigenvector direction at the origin.
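The Jordan-block structure \lambda I + N makes the exponential computable by hand, which we can verify numerically (a sketch using the matrix from Example 7):

```python
import numpy as np
from scipy.linalg import expm

lam = 2.0
A = np.array([[2.0, 1.0], [0.0, 2.0]])   # Jordan block: A = lam*I + N
N = A - lam * np.eye(2)                   # nilpotent part, N @ N == 0

t = 0.9
# lam*I commutes with N, so e^{At} = e^{lam t} * e^{Nt} = e^{lam t} (I + N t)
jordan_ok = np.allclose(expm(A * t), np.exp(lam * t) * (np.eye(2) + N * t))
print(jordan_ok)
```

The off-diagonal entry t e^{\lambda t} is exactly the polynomial-times-exponential term of Theorem 5.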

🔷 Proposition 1 (Jordan Normal Form and the Matrix Exponential)

If A = PJP^{-1} where J is in Jordan normal form with blocks J_k = \lambda_k I + N_k (N_k nilpotent), then e^{At} = P e^{Jt} P^{-1}, where

e^{J_k t} = e^{\lambda_k t} e^{N_k t} = e^{\lambda_k t}\left(I + N_k t + \frac{N_k^2 t^2}{2!} + \cdots\right).

This is a finite sum because N_k is nilpotent (N_k^m = 0 for some m). Each Jordan block of size m contributes terms t^j e^{\lambda t} for j = 0, 1, \ldots, m-1.

💡 Remark 5 (Defective matrices are rare but important)

Generically, matrices are diagonalizable — defective matrices form a set of measure zero in matrix space. But in applications, defective matrices arise naturally at bifurcation points — the boundary between qualitatively different behaviors. The transition from a stable spiral to a stable node passes through a defective node at \tau^2 = 4\Delta on the trace-determinant diagram. Stability & Dynamical Systems will examine these transitions systematically.

Defective node phase portrait with te^λt growth, and Jordan block exponential showing polynomial times exponential terms

6. Fundamental Matrices & the Wronskian

A fundamental matrix \Phi(t) is an n \times n matrix whose columns are n linearly independent solutions. For the system \mathbf{y}' = A\mathbf{y}, the standard choice is \Phi(t) = e^{At}, satisfying \Phi(0) = I. Abel's formula connects the determinant of the fundamental matrix (the Wronskian) to the trace of A, telling us exactly how volumes evolve under the flow.

📐 Definition 5 (Fundamental Matrix)

A fundamental matrix for the system \mathbf{y}' = A\mathbf{y} is a matrix-valued function \Phi(t) satisfying \Phi'(t) = A\Phi(t) whose columns \boldsymbol{\phi}_1(t), \ldots, \boldsymbol{\phi}_n(t) are linearly independent solutions. The standard fundamental matrix is \Phi(t) = e^{At}, satisfying \Phi(0) = I.

🔷 Theorem 6 (Abel's Formula (Liouville's Formula))

If \Phi(t) is a fundamental matrix for \mathbf{y}' = A\mathbf{y}, then

\det \Phi(t) = \det \Phi(0) \cdot e^{\operatorname{tr}(A) \cdot t}.

For the standard fundamental matrix, \det e^{At} = e^{\operatorname{tr}(A) \cdot t}.

Proof.

We show that \frac{d}{dt}\det \Phi(t) = \operatorname{tr}(A) \cdot \det \Phi(t) — a scalar linear ODE with solution \det \Phi(t) = \det \Phi(0) \cdot e^{\operatorname{tr}(A)t}.

For the derivative of the determinant: by the multilinearity of the determinant in its rows,

\frac{d}{dt}\det \Phi = \sum_{i=1}^{n} \det \Phi_i(t),

where \Phi_i replaces the i-th row of \Phi with its derivative (the i-th row of \Phi' = A\Phi).

The i-th row of A\Phi is \sum_j a_{ij} (\text{row } j \text{ of } \Phi). By the alternating property of the determinant, a matrix with two identical rows has determinant zero. Therefore, only the diagonal term a_{ii} survives in each \det \Phi_i(t):

\frac{d}{dt}\det \Phi = \sum_{i=1}^{n} a_{ii} \det \Phi = \operatorname{tr}(A) \det \Phi.

This is the scalar ODE W' = \operatorname{tr}(A) \cdot W with W(0) = \det \Phi(0), whose unique solution is W(t) = \det \Phi(0) \cdot e^{\operatorname{tr}(A)t}. \blacksquare

💡 Remark 6 (Volume evolution)

Abel's formula says that the flow of \mathbf{y}' = A\mathbf{y} preserves volume if and only if \operatorname{tr}(A) = 0. Hamiltonian systems (from classical mechanics) have \operatorname{tr}(A) = 0 and are volume-preserving — this is Liouville's theorem. Dissipative systems (\operatorname{tr}(A) < 0) contract volumes, which is why stable systems have trajectories that converge to lower-dimensional attractors.
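Abel's identity \det e^{At} = e^{\operatorname{tr}(A)t} holds for any matrix, which makes it a cheap numerical check (a sketch on a random matrix; the seed is arbitrary):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
t = 0.6

# det(e^{At}) = e^{tr(A) t}  (Abel / Liouville)
abel_ok = np.isclose(np.linalg.det(expm(A * t)), np.exp(np.trace(A) * t))

# A traceless matrix generates a volume-preserving flow: det(e^{A0 t}) = 1
A0 = A - (np.trace(A) / 4) * np.eye(4)
vol_ok = np.isclose(np.linalg.det(expm(A0 * t)), 1.0)
print(abel_ok, vol_ok)
```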

Columns of the fundamental matrix as evolving basis vectors, with volume evolution showing expansion, contraction, and preservation

7. Non-Homogeneous Systems & Variation of Parameters

The non-homogeneous system \mathbf{y}' = A\mathbf{y} + \mathbf{g}(t) adds a forcing term — an external input that drives the system. The solution method is variation of parameters: replace the constant vector \mathbf{c} in the homogeneous solution e^{At}\mathbf{c} by a time-dependent vector \mathbf{c}(t), and solve for \mathbf{c}(t) from \mathbf{c}'(t) = e^{-At}\mathbf{g}(t).

🔷 Theorem 7 (Variation of Parameters for Linear Systems)

The solution of the IVP \mathbf{y}' = A\mathbf{y} + \mathbf{g}(t), \mathbf{y}(0) = \mathbf{y}_0 is:

\mathbf{y}(t) = e^{At}\mathbf{y}_0 + \int_0^t e^{A(t-s)}\mathbf{g}(s)\,ds.

The first term is the homogeneous solution (free response); the integral is the convolution of the matrix exponential with the forcing (forced response).

Proof.

Let \mathbf{y}(t) = e^{At}\mathbf{c}(t) (the variation of parameters ansatz). Differentiating using the product rule:

\mathbf{y}' = Ae^{At}\mathbf{c} + e^{At}\mathbf{c}' = A\mathbf{y} + e^{At}\mathbf{c}'.

Comparing with \mathbf{y}' = A\mathbf{y} + \mathbf{g}(t) gives e^{At}\mathbf{c}'(t) = \mathbf{g}(t), so \mathbf{c}'(t) = e^{-At}\mathbf{g}(t).

Integrating with the initial condition \mathbf{c}(0) = \mathbf{y}_0:

\mathbf{c}(t) = \mathbf{y}_0 + \int_0^t e^{-As}\mathbf{g}(s)\,ds.

Substituting back: \mathbf{y}(t) = e^{At}\mathbf{y}_0 + \int_0^t e^{A(t-s)}\mathbf{g}(s)\,ds. \blacksquare
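Theorem 7 can be checked numerically by evaluating the convolution integral with a quadrature rule and comparing against a direct time-stepping of the ODE (a sketch; the oscillator parameters, quadrature resolution, and Euler step count are illustrative choices):

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0], [-4.0, 0.0]])           # oscillator, omega = 2
y0 = np.array([1.0, 0.0])
g = lambda s: np.array([0.0, np.cos(2.5 * s)])    # forcing at Omega = 2.5

def vop_solution(t, n=2001):
    """y(t) = e^{At} y0 + integral_0^t e^{A(t-s)} g(s) ds, trapezoidal rule."""
    s = np.linspace(0.0, t, n)
    vals = np.array([expm(A * (t - si)) @ g(si) for si in s])
    ds = s[1] - s[0]
    forced = (vals[0] + vals[-1]) * 0.5 * ds + vals[1:-1].sum(axis=0) * ds
    return expm(A * t) @ y0 + forced

# Cross-check with a fine explicit Euler integration of y' = Ay + g(t)
t_end, steps = 2.0, 50_000
dt = t_end / steps
y = y0.copy()
for k in range(steps):
    y = y + dt * (A @ y + g(k * dt))

err = np.linalg.norm(vop_solution(t_end) - y)
print(err)  # small: the two computations agree
```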

📝 Example 8 (Forced harmonic oscillator)

The system \mathbf{y}' = \begin{pmatrix} 0 & 1 \\ -\omega^2 & 0 \end{pmatrix}\mathbf{y} + \begin{pmatrix} 0 \\ \cos(\Omega t) \end{pmatrix} models a harmonic oscillator driven by an external force at frequency \Omega. The convolution integral produces resonance when \Omega = \omega — the forced response grows without bound because the driving frequency matches the natural frequency.

📝 Example 9 (Input-output systems and the transfer function)

Writing \mathbf{y}' = A\mathbf{y} + B\mathbf{u}, \mathbf{z} = C\mathbf{y} (state-space form), the output is

\mathbf{z}(t) = Ce^{At}\mathbf{y}_0 + C\int_0^t e^{A(t-s)}B\mathbf{u}(s)\,ds.

The matrix Ce^{At}B is the impulse response — the system's output when the input is a delta function. This is exactly the structure of state-space models in ML (S4/Mamba): the matrices A, B, C are learned, and the matrix exponential mediates between continuous and discrete representations.

💡 Remark 7 (Convolution and the Laplace transform)

The variation of parameters formula is a matrix convolution. The Laplace transform converts this to an algebraic equation: \hat{\mathbf{y}}(s) = (sI - A)^{-1}\mathbf{y}_0 + (sI - A)^{-1}\hat{\mathbf{g}}(s). The transfer matrix (sI - A)^{-1} encodes the system's frequency response. The poles of the transfer function are the eigenvalues of A — connecting back to the phase portrait classification.

Free response (homogeneous), forced response (particular), and total response for an undamped oscillator (\omega = 2) driven near resonance (\Omega = 2.5). The beating pattern reveals the frequency mismatch.

Decomposition of forced system response into homogeneous and particular components

8. Existence, Uniqueness & the Vector Picard-Lindelöf Theorem

The Picard-Lindelöf theorem (Topic 21) extends directly to vector-valued IVPs. For linear systems, the Lipschitz condition is automatically satisfied: \|A\mathbf{y}_1 - A\mathbf{y}_2\| \leq \|A\| \|\mathbf{y}_1 - \mathbf{y}_2\|, so the Lipschitz constant is simply L = \|A\|. Moreover, linear systems never blow up in finite time — solutions exist for all t \in \mathbb{R}.

🔷 Theorem 8 (Picard-Lindelöf for Vector IVPs)

Let \mathbf{f}: D \to \mathbb{R}^n be continuous on an open set D \subseteq \mathbb{R} \times \mathbb{R}^n and Lipschitz in \mathbf{y}:

\|\mathbf{f}(t, \mathbf{y}_1) - \mathbf{f}(t, \mathbf{y}_2)\| \leq L\|\mathbf{y}_1 - \mathbf{y}_2\| \quad \text{for all } (t, \mathbf{y}_1), (t, \mathbf{y}_2) \in D.

Then the IVP \mathbf{y}' = \mathbf{f}(t, \mathbf{y}), \mathbf{y}(t_0) = \mathbf{y}_0 has a unique local solution.

🔷 Corollary 1 (Global Existence for Linear Systems)

If A \in \mathbb{R}^{n \times n} is constant and \mathbf{g}: \mathbb{R} \to \mathbb{R}^n is continuous, then the IVP \mathbf{y}' = A\mathbf{y} + \mathbf{g}(t), \mathbf{y}(0) = \mathbf{y}_0 has a unique solution on all of \mathbb{R}.

This follows because \mathbf{f}(\mathbf{y}) = A\mathbf{y} is globally Lipschitz with constant \|A\|, and the Grönwall inequality prevents blow-up: \|\mathbf{y}(t)\| \leq e^{\|A\|t}\|\mathbf{y}_0\| + \int_0^t e^{\|A\|(t-s)}\|\mathbf{g}(s)\|\,ds.

💡 Remark 8 (Linear systems never blow up)

Compare with Topic 21's blow-up example y' = y^2, where the nonlinearity causes finite-time escape to infinity. For linear systems, the solution \mathbf{y}(t) = e^{At}\mathbf{y}_0 grows at most exponentially: \|\mathbf{y}(t)\| \leq e^{\|A\|t}\|\mathbf{y}_0\|. Exponential growth is fast, but it never reaches infinity in finite time. This is why linearization is always locally valid — the linear approximation never predicts blow-up.

💡 Remark 9 (Uniqueness implies no trajectory crossing)

In the phase plane, solution trajectories of \mathbf{y}' = A\mathbf{y} never cross (except at the origin, which is an equilibrium). If two trajectories crossed at a point \mathbf{y}^* at time t^*, that point would have two distinct solutions to the same IVP — contradicting uniqueness. This topological consequence constrains the geometry of phase portraits: trajectories can approach or depart from the origin, but they cannot intersect.

Trajectories that never cross in the phase plane, with exponential growth bound envelope

9. ML Connections — State-Space Models, Training Dynamics, and Continuous-Time Architectures

Linear systems are not just a theoretical stepping stone — they are the mathematical backbone of several active research areas in machine learning. The matrix exponential, eigenvalue decomposition, and phase portrait analysis appear directly in the design and analysis of modern architectures.

9.1: Gradient descent as a linear system

Near a critical point \theta^* of a smooth loss L(\theta), the gradient flow \dot{\theta} = -\nabla L(\theta) linearizes to \dot{\delta} = -H\delta, where H = \nabla^2 L(\theta^*) is the Hessian and \delta = \theta - \theta^*. The solution \delta(t) = e^{-Ht}\delta_0 decays along each eigendirection of H at rate e^{-\lambda_i t}.

The slowest mode (\lambda_{\min}) dominates: convergence is governed by e^{-\lambda_{\min}t}, and the condition number \kappa(H) = \lambda_{\max}/\lambda_{\min} measures how anisotropic the convergence is. This is why preconditioning (replacing H with I, as in Newton's method) speeds up training — it makes all eigenvalues equal, so all directions converge at the same rate.

📝 Example 10 (Quadratic loss and eigen-decomposition of convergence)

Consider L(\theta) = \frac{1}{2}\theta^T H \theta with H = \text{diag}(1, 10, 100). The gradient flow \dot{\theta} = -H\theta has solution \theta_i(t) = \theta_{i,0} e^{-\lambda_i t}.

After time t, the residual error in each direction is e^{-\lambda_i t}: direction 3 converges 100× faster than direction 1. The loss L(t) = \frac{1}{2}\sum \lambda_i \theta_{i,0}^2 e^{-2\lambda_i t} shows multi-scale decay — a fast initial drop (the large-eigenvalue components) followed by a long tail (the small-eigenvalue component).
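The multi-scale decay is easy to tabulate from the closed-form solution (a sketch; the initial point \theta_0 = (1, 1, 1) is an arbitrary choice):

```python
import numpy as np

lams = np.array([1.0, 10.0, 100.0])    # eigenvalues of H = diag(1, 10, 100)
theta0 = np.array([1.0, 1.0, 1.0])

def residual(t):
    # theta_i(t) = theta_{i,0} * exp(-lambda_i * t)
    return theta0 * np.exp(-lams * t)

def loss(t):
    # L(t) = 1/2 * sum_i lambda_i * theta_{i,0}^2 * exp(-2 lambda_i t)
    return 0.5 * np.sum(lams * residual(t) ** 2)

# At t = 0.1, the fast direction (lambda = 100) has decayed to e^{-10}
# while the slow direction (lambda = 1) has barely moved
ratio = residual(0.1)[0] / residual(0.1)[2]
print(loss(0.0), loss(0.1), loss(1.0), ratio)
```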

Gradient flow on a nearly isotropic loss landscape (condition number \kappa = 2, initial point \theta_0 = (3.0, 2.0)). Both eigencomponents converge at similar rates, and gradient descent is efficient even with a fixed learning rate.

9.2: Structured state-space models (S4/Mamba)

The S4 architecture (Gu et al. 2022) parameterizes a sequence model as a continuous-time linear system:

$$\mathbf{h}'(t) = A\mathbf{h}(t) + B\mathbf{x}(t), \qquad \mathbf{z}(t) = C\mathbf{h}(t).$$

The matrices $A$, $B$, $C$ are learned parameters. The system is discretized for training: $\mathbf{h}_{k+1} = \bar{A}\mathbf{h}_k + \bar{B}\mathbf{x}_k$, where $\bar{A} = e^{A\Delta t}$ is the matrix exponential from this topic.

📝 Example 11 (S4 discretization via matrix exponential)

Given continuous parameters $A, B$, the discrete parameters are $\bar{A} = e^{A\Delta t}$ and $\bar{B} = (e^{A\Delta t} - I)A^{-1}B$ (assuming $A$ is invertible). This zero-order-hold discretization is exact when the input is held constant over each step. The bilinear (Tustin) approximation $\bar{A} = (I + A\Delta t/2)(I - A\Delta t/2)^{-1}$ is a Padé approximant that preserves stability, mapping eigenvalues with negative real part to eigenvalues inside the unit circle.
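Both discretizations can be compared numerically. A sketch with illustrative matrices (not the learned S4 parameters, which use the HiPPO initialization):

```python
import numpy as np
from scipy.linalg import expm

# Zero-order hold: A_bar = e^{A dt}, B_bar = (e^{A dt} - I) A^{-1} B.
# Bilinear (Tustin): A_bar = (I + A dt/2)(I - A dt/2)^{-1}.
A = np.array([[-1.0, 2.0], [0.0, -3.0]])   # Hurwitz: eigenvalues -1 and -3
B = np.array([[1.0], [1.0]])
dt = 0.1
I = np.eye(2)

A_zoh = expm(A * dt)
B_zoh = (A_zoh - I) @ np.linalg.inv(A) @ B
A_bil = (I + A * dt / 2) @ np.linalg.inv(I - A * dt / 2)

# Both maps send eigenvalues from the open left half-plane into the unit disk,
# so the discrete system inherits the stability of the continuous one.
for Ad in (A_zoh, A_bil):
    assert np.all(np.abs(np.linalg.eigvals(Ad)) < 1.0)

# For small dt the two discretizations agree to O(dt^3):
print(np.linalg.norm(A_zoh - A_bil))
```

The bilinear map is exactly the $(1,1)$ Padé approximant of $e^{z}$ applied to $A\Delta t$, which is why it matches the exact exponential through second order while still preserving stability for any step size.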

9.3: Continuous-time RNNs and neural CDEs

A continuous-time RNN replaces the discrete update $\mathbf{h}_{k+1} = \sigma(W\mathbf{h}_k + U\mathbf{x}_k)$ with the ODE $\dot{\mathbf{h}} = -\mathbf{h} + \sigma(W\mathbf{h} + U\mathbf{x}(t))$. Near a fixed point, this is a linear system whose dynamics are governed by the Jacobian $J = -I + \mathrm{diag}(\sigma'(\cdot))\,W$. The eigenvalues of $J$ determine whether the hidden state remembers (real parts near $0$) or forgets (real parts $\ll 0$): the vanishing/exploding gradient problem in its continuous-time form.
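A hedged sketch of this linearization (random weight matrices for illustration; the input term is dropped and $\sigma = \tanh$ is assumed, so $\mathbf{h} = \mathbf{0}$ is a fixed point):

```python
import numpy as np

# Linearize h' = -h + tanh(W h) at a fixed point h*:
# J = -I + diag(tanh'(W h*)) W, with tanh'(z) = 1 - tanh(z)^2.
# Eigenvalues with real part near 0 => slow modes (long memory);
# real parts << 0 => fast forgetting; real parts > 0 => instability.
rng = np.random.default_rng(0)
n = 8

def jacobian(W, h):
    s = 1.0 - np.tanh(W @ h) ** 2            # elementwise tanh'
    return -np.eye(len(h)) + np.diag(s) @ W

h0 = np.zeros(n)                              # fixed point: -0 + tanh(0) = 0
W_small = 0.1 * rng.standard_normal((n, n))   # weak recurrence
W_large = 3.0 * rng.standard_normal((n, n))   # strong recurrence

# Small weights: spectrum clusters near -1 (fast forgetting).
# Large weights: eigenvalues spread toward (and past) Re = 0.
for W in (W_small, W_large):
    print(np.max(np.linalg.eigvals(jacobian(W, h0)).real))
```

At $\mathbf{h} = \mathbf{0}$ the Jacobian reduces to $-I + W$, so the spectrum is the spectrum of $W$ shifted left by one; scaling up $W$ pushes the slowest mode toward the imaginary axis, trading stability for memory.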

9.4: Spectral analysis of training dynamics

The loss Hessian's eigenvalue distribution controls training: the bulk spectrum determines the average learning speed, while outlier eigenvalues (from batch normalization, skip connections, or specific data directions) create fast modes that converge first. The matrix exponential $e^{-Ht}$ encodes the full training dynamics: each eigencomponent is an independent exponential decay, and the superposition of all $p$ components determines the loss trajectory. Understanding this spectral picture is the gateway to spectral methods in ML.
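The bulk-plus-outlier picture can be simulated directly. A sketch with a synthetic spectrum (not a real network Hessian): under gradient flow the loss decomposes as $L(t) = \frac{1}{2}\sum_i c_i^2 e^{-2\lambda_i t}$, so a few large outlier eigenvalues produce the fast initial drop while the bulk sets the slow tail.

```python
import numpy as np

# Synthetic Hessian spectrum: a bulk of small eigenvalues plus a few outliers.
rng = np.random.default_rng(1)
bulk = rng.uniform(0.01, 0.1, size=100)      # slow modes (bulk spectrum)
outliers = np.array([10.0, 25.0, 50.0])      # fast modes (outlier eigenvalues)
lam = np.concatenate([bulk, outliers])
c2 = np.ones_like(lam)                       # equal initial energy per mode

def loss(t):
    # Superposition of independent exponential decays, one per eigencomponent.
    return 0.5 * np.sum(c2 * np.exp(-2 * lam * t))

# Outlier modes die within t ~ 1/lambda_outlier; the bulk dominates the tail.
for t in (0.0, 0.1, 1.0, 10.0):
    print(t, loss(t))
```

The printed trajectory shows the characteristic two-phase shape: the outlier contribution vanishes almost immediately, after which the curve flattens into the slow decay set by the bulk eigenvalues.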

ML connections: quadratic loss contours with gradient flow, condition number effect, S4 discretization, and continuous-time RNN dynamics

Connections & Further Reading

Cross-track connections:

  • Power & Taylor Series — the matrix exponential $e^{At} = \sum_{k=0}^{\infty} (At)^k/k!$ is a power series with infinite radius of convergence
  • The Hessian & Second-Derivative Test — eigenvalue classification of equilibria mirrors the second-derivative test for critical points
  • The Derivative & Chain Rule — $\frac{d}{dt}e^{At} = Ae^{At}$ is the matrix analog of $\frac{d}{dt}e^{at} = ae^{at}$
  • Completeness & Compactness — convergence of the matrix exponential series requires completeness of the matrix algebra

Looking ahead in this track:

  • Stability & Dynamical Systems — phase portrait classification becomes the linearization stability theorem; Lyapunov functions and bifurcation theory
  • Numerical Methods for ODEs — numerical computation of the matrix exponential (scaling-and-squaring, Padé), stiff systems, and adaptive methods

Forward links to formalML:

  • Gradient Descent — training dynamics on quadratic loss as the linear system $\dot{\theta} = -H\theta$
  • Spectral Theorem — spectral decomposition is the computational engine of the matrix exponential
  • Information Geometry — natural gradient descent as a linear system in the statistical manifold’s tangent space

References

  1. Arnold (1992). Ordinary Differential Equations. Chapters 5–6 develop the matrix exponential and phase portrait classification from the geometric viewpoint; the primary reference for our visualization-first approach to 2×2 systems.
  2. Teschl (2012). Ordinary Differential Equations. Chapter 3 provides the rigorous treatment of linear systems, fundamental matrices, and variation of parameters; clear and free online.
  3. Hirsch, Smale & Devaney (2013). Differential Equations, Dynamical Systems, and an Introduction to Chaos. Chapters 2–6 are the standard reference for phase portrait classification and the trace-determinant diagram; excellent for building geometric intuition.
  4. Meyer (2000). Matrix Analysis and Applied Linear Algebra. Chapter 10 covers the matrix exponential, Jordan normal form, and matrix functions with full rigor; the computational reference for matrix exponential properties.
  5. Gu, Goel & Ré (2022). "Efficiently Modeling Long Sequences with Structured State Spaces". The S4 paper: introduces structured state-space models (SSMs) as discretized linear systems y' = Ay + Bu with learned matrices. The matrix exponential is central to the discretization and efficient computation via the HiPPO framework.
  6. Gu & Dao (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces". Extends S4 with input-dependent (selective) state-space parameters, achieving transformer-competitive performance. The linear system structure y' = A(x)y + B(x)u connects linear systems theory to modern sequence modeling.
  7. Moler & Van Loan (2003). "Nineteen Dubious Ways to Compute the Exponential of a Matrix, Twenty-Five Years Later". The classic survey of matrix exponential computation methods: eigendecomposition, Padé approximation, scaling-and-squaring, ODE solvers. Essential context for the computational notes section.