Multivariable Differential · intermediate · 50 min read

The Jacobian & Multivariate Chain Rule

Derivatives of vector-valued functions as matrices — the Jacobian as the best linear approximation, the chain rule as matrix multiplication, and the determinant as volume scaling

Abstract. The Jacobian matrix extends differentiation from scalar-valued functions f: ℝⁿ → ℝ to vector-valued functions f: ℝⁿ → ℝᵐ. Where the gradient is a single row of partial derivatives (the derivative of a function with one output), the Jacobian stacks m such rows — one for each output component. The Jacobian J_f(a) is the matrix representation of the total derivative Df(a): ℝⁿ → ℝᵐ, the best linear approximation to f near a. The central result is the multivariate chain rule: if g: ℝⁿ → ℝᵏ and f: ℝᵏ → ℝᵐ are differentiable, then the Jacobian of the composition f ∘ g is the matrix product J_{f∘g}(a) = J_f(g(a)) · J_g(a). This is the chain rule from single-variable calculus (multiply the derivatives along the chain) generalized to arbitrary dimensions — and it is exactly backpropagation. Every layer in a neural network computes a function fₖ: ℝⁿₖ → ℝⁿₖ₊₁, and the end-to-end Jacobian is the product J_L · J_fₖ · ⋯ · J_f₁. Reverse-mode automatic differentiation (backprop) computes this product right-to-left, one vector-Jacobian product at a time, at a cost independent of the number of parameters, which is why deep learning works at scale. For square Jacobians (n = m), the Jacobian determinant det J_f(a) measures how f distorts local volumes: areas near a are scaled by |det J_f(a)|, and the sign encodes orientation. This appears in the change-of-variables formula for integration (the r in r dr dθ for polar coordinates is the Jacobian determinant) and in normalizing flows, where the density of a transformed variable requires dividing by |det J_f|.

Where this leads → formalML

  • formalML Backpropagation computes ∇L by applying the multivariate chain rule through each layer of the network. The Jacobian of each layer is multiplied in sequence: J_{L∘fₖ∘⋯∘f₁} = J_L · J_fₖ ⋯ J_f₁. Reverse-mode AD evaluates this product right-to-left via vector-Jacobian products, computing the full gradient in O(1) backward passes regardless of parameter count.
  • formalML The Jacobian of a smooth map f: M → N between manifolds is the pushforward map df_p: T_pM → T_{f(p)}N between tangent spaces. The chain rule d(g∘f)_p = dg_{f(p)} ∘ df_p is the functoriality of the tangent functor.
  • formalML The Fisher information matrix transforms under reparametrization φ via I(φ(η)) = J_φᵀ I(η) J_φ — a Jacobian sandwich expressing how the natural Riemannian metric on a statistical manifold changes under coordinate transforms.

Overview & Motivation

In the previous topic, we differentiated functions $f: \mathbb{R}^n \to \mathbb{R}$ — one output. The gradient $\nabla f$ told us how that single output changes in every input direction. But a neural network layer is not a scalar-valued function. A layer $f_k: \mathbb{R}^{n_k} \to \mathbb{R}^{n_{k+1}}$ takes a vector of activations and produces a vector of activations — multiple inputs, multiple outputs. To differentiate such a function, we need a partial derivative for every input-output pair: how does the $i$-th output change when the $j$-th input is nudged?

Arranging all of these into a matrix gives the Jacobian. And the chain rule — the engine that makes backpropagation possible — says that the Jacobian of a composition is the product of the Jacobian matrices. If that sounds like matrix multiplication, it is. The single-variable chain rule $(f \circ g)'(a) = f'(g(a)) \cdot g'(a)$ from Topic 5 was multiplying $1 \times 1$ matrices. Now we see the general picture.

This topic develops three ideas: the Jacobian matrix (the derivative of a vector-valued function), the multivariate chain rule (matrix multiplication along a computation graph), and the Jacobian determinant (how a differentiable map distorts volumes). Together, these are the calculus that makes deep learning — and much of mathematical physics — work.

Vector-Valued Functions & the Jacobian Matrix

We begin with the conceptual step from scalar to vector. A function $f: \mathbb{R}^2 \to \mathbb{R}^2$ maps a region in the plane to another region in the plane. Each output component $f_1, f_2$ is a scalar-valued function of two variables, and each has its own gradient. The Jacobian stacks these gradients as rows.

📐 Definition 1 (Vector-Valued Function)

A vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$ is defined by $m$ component functions $f = (f_1, f_2, \ldots, f_m)$, where each $f_i: \mathbb{R}^n \to \mathbb{R}$ is a scalar-valued function. The function maps a point $a \in \mathbb{R}^n$ to the vector

$$f(a) = (f_1(a), f_2(a), \ldots, f_m(a)) \in \mathbb{R}^m.$$

When $m = 1$, this reduces to the scalar-valued functions of Topic 9.

📐 Definition 2 (The Jacobian Matrix)

Let $f: \mathbb{R}^n \to \mathbb{R}^m$ and suppose all partial derivatives $\frac{\partial f_i}{\partial x_j}(a)$ exist. The Jacobian matrix (or simply Jacobian) of $f$ at $a$ is the $m \times n$ matrix

$$J_f(a) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(a) & \cdots & \frac{\partial f_1}{\partial x_n}(a) \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(a) & \cdots & \frac{\partial f_m}{\partial x_n}(a) \end{pmatrix}.$$

Row $i$ of $J_f(a)$ is $\nabla f_i(a)^T$ — the gradient of the $i$-th component function. Column $j$ is the vector of all partial derivatives with respect to $x_j$. Common notations include $J_f(a)$, $Df(a)$, and $\frac{\partial(f_1, \ldots, f_m)}{\partial(x_1, \ldots, x_n)}$.

📝 Example 1 (Jacobian of a linear function)

Let $f(x) = Ax + b$ where $A$ is a constant $m \times n$ matrix. Then $J_f(a) = A$ for all $a$ — the Jacobian of a linear function is the matrix itself. The derivative of a linear map is the linear map. This is the multivariable analog of "$f(x) = ax + b$ has derivative $a$."
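As a quick numerical check, differentiating an affine map with JAX recovers the constant matrix exactly; the matrix, offset, and evaluation point below are illustrative:

```python
import jax
import jax.numpy as jnp

# Illustrative 3x2 affine map f(x) = Ax + b.
A = jnp.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = jnp.array([0.5, -0.5, 1.0])

def f(x):
    return A @ x + b

x = jnp.array([0.7, -1.3])
J = jax.jacobian(f)(x)  # a 3x2 matrix, independent of x: exactly A
```

The Jacobian is the same at every point, which is the defining property of a linear (affine) map.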

📝 Example 2 (Jacobian of polar-to-Cartesian)

Let $f(r, \theta) = (r\cos\theta, r\sin\theta)$. The Jacobian is

$$J_f(r, \theta) = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}.$$

At $(r, \theta) = (2, \pi/4)$, we get $J_f = \begin{pmatrix} 1/\sqrt{2} & -\sqrt{2} \\ 1/\sqrt{2} & \sqrt{2} \end{pmatrix}$. The first column tells us how $(x, y)$ changes when we increase $r$ (move radially outward); the second column tells us how $(x, y)$ changes when we increase $\theta$ (move tangentially).
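The matrix at $(2, \pi/4)$ can be confirmed with automatic differentiation; a minimal JAX sketch:

```python
import jax
import jax.numpy as jnp

def polar_to_cartesian(p):
    r, theta = p
    return jnp.array([r * jnp.cos(theta), r * jnp.sin(theta)])

a = jnp.array([2.0, jnp.pi / 4])
J = jax.jacobian(polar_to_cartesian)(a)

# Analytic Jacobian: [[cos t, -r sin t], [sin t, r cos t]] at (2, pi/4)
expected = jnp.array([[1 / jnp.sqrt(2.0), -jnp.sqrt(2.0)],
                      [1 / jnp.sqrt(2.0),  jnp.sqrt(2.0)]])
```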

📝 Example 3 (Jacobian of a neural network layer)

A fully connected layer with activation is $f(x) = \sigma(Wx + b)$ where $\sigma$ is applied component-wise. By the chain rule (which we will prove shortly),

$$J_f(x) = \mathrm{diag}(\sigma'(Wx + b)) \cdot W.$$

When $\sigma$ is the identity (linear layer), $J_f = W$ — recovering Example 1. When $\sigma$ is the sigmoid, the diagonal matrix of $\sigma'$ values modulates each row of $W$: the partial derivative $\frac{\partial f_i}{\partial x_j}$ is $\sigma'(z_i) \cdot W_{ij}$, where $z = Wx + b$ is the pre-activation vector.
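The factorization $J_f = \mathrm{diag}(\sigma'(z))\,W$ is easy to verify against autodiff; the weights, bias, and input below are illustrative, and $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ for the sigmoid:

```python
import jax
import jax.numpy as jnp

W = jnp.array([[0.2, -0.5, 0.1],
               [1.0,  0.3, -0.7]])   # illustrative 2x3 weights
b = jnp.array([0.1, -0.2])

def layer(x):
    return jax.nn.sigmoid(W @ x + b)

x = jnp.array([0.5, -1.0, 2.0])
J_auto = jax.jacobian(layer)(x)

# Analytic form: diag(sigma'(z)) @ W, with sigma'(z) = sigma(z) * (1 - sigma(z))
z = W @ x + b
s = jax.nn.sigmoid(z)
J_manual = jnp.diag(s * (1 - s)) @ W
```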

💡 Remark 1 (Gradient as a special case)

When $m = 1$, the Jacobian is a $1 \times n$ row matrix: $J_f(a) = \begin{pmatrix} \frac{\partial f}{\partial x_1}(a) & \cdots & \frac{\partial f}{\partial x_n}(a) \end{pmatrix} = \nabla f(a)^T$. The gradient is the transpose of the single-row Jacobian. When $n = 1$, the Jacobian is an $m \times 1$ column matrix — just the vector of ordinary derivatives of each component. When $m = n = 1$, the Jacobian is a $1 \times 1$ matrix containing the single-variable derivative $f'(a)$ from Topic 5. The Jacobian unifies all of these cases.


Construction of the Jacobian matrix: each row is the gradient of a component function, assembled into the full m × n matrix of partial derivatives.

The Jacobian as Linear Approximation

Just as the gradient was the best linear approximation for $f: \mathbb{R}^n \to \mathbb{R}$ (Topic 9, §7), the Jacobian is the best linear approximation for $f: \mathbb{R}^n \to \mathbb{R}^m$. The total derivative is a linear map $Df(a): \mathbb{R}^n \to \mathbb{R}^m$, and its matrix representation is the Jacobian.

📐 Definition 3 (Differentiability (Vector-Valued))

A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $a$ if there exists a linear map $Df(a): \mathbb{R}^n \to \mathbb{R}^m$ such that

$$\lim_{h \to 0} \frac{\|f(a+h) - f(a) - Df(a)(h)\|}{\|h\|} = 0.$$

When $Df(a)$ exists, it is unique and represented by the Jacobian matrix: $Df(a)(h) = J_f(a) \cdot h$ for all $h \in \mathbb{R}^n$. This extends Definition 4 of Topic 9 from $m = 1$ to arbitrary $m$.

🔷 Theorem 1 (Differentiability ⟺ Component-wise Differentiability)

$f: \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $a$ if and only if each component $f_i: \mathbb{R}^n \to \mathbb{R}$ is differentiable at $a$ (in the sense of the total derivative from Topic 9). When $f$ is differentiable, the rows of the Jacobian are the gradients of the component functions.

Proof.

($\Rightarrow$) If $f$ is differentiable at $a$, then $\frac{\|f(a+h) - f(a) - J_f(a)h\|}{\|h\|} \to 0$. For each component $f_i$, the absolute value of a single component is bounded by the norm of the full vector:

$$\frac{|f_i(a+h) - f_i(a) - \nabla f_i(a) \cdot h|}{\|h\|} \le \frac{\|f(a+h) - f(a) - J_f(a)h\|}{\|h\|} \to 0.$$

So each $f_i$ is differentiable at $a$ with total derivative $\nabla f_i(a) \cdot h$.

($\Leftarrow$) If each $f_i$ is differentiable at $a$, then for each $i$: $|f_i(a+h) - f_i(a) - \nabla f_i(a) \cdot h| = o(\|h\|)$. We square and sum over all components:

$$\|f(a+h) - f(a) - J_f(a)h\|^2 = \sum_{i=1}^m |f_i(a+h) - f_i(a) - \nabla f_i(a) \cdot h|^2.$$

Each summand is $o(\|h\|^2)$ by hypothesis, so the sum is $o(\|h\|^2)$. Taking the square root: $\|f(a+h) - f(a) - J_f(a)h\| = o(\|h\|)$. $\square$

🔷 Proposition 1 (The Jacobian Approximation)

If $f: \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $a$, then

$$f(a + h) \approx f(a) + J_f(a) \cdot h$$

with error $o(\|h\|)$. This is the multivariable Taylor expansion to first order for vector-valued functions. Geometrically: the affine map $h \mapsto f(a) + J_f(a)h$ is the best linear approximation to $f$ near $a$.

📝 Example 4 (Linear approximation of polar-to-Cartesian)

Near $(r, \theta) = (2, \pi/4)$, the polar-to-Cartesian map is approximated by

$$f\begin{pmatrix} 2 + \Delta r \\ \pi/4 + \Delta\theta \end{pmatrix} \approx \begin{pmatrix} \sqrt{2} \\ \sqrt{2} \end{pmatrix} + \begin{pmatrix} 1/\sqrt{2} & -\sqrt{2} \\ 1/\sqrt{2} & \sqrt{2} \end{pmatrix} \begin{pmatrix} \Delta r \\ \Delta\theta \end{pmatrix}.$$

A small change in $(r, \theta)$ maps to a change in $(x, y)$ via matrix multiplication — the Jacobian acts on the perturbation vector. The interactive visualization below lets you see how the linear approximation (an ellipse) converges to the true image (a deformed curve) as the perturbation shrinks.


The Jacobian as linear approximation: a circle in input space maps to a deformed curve under f and to an ellipse under the linear approximation J_f, with the gap shrinking as the radius decreases.
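The shrinking-error behavior of Proposition 1 can also be checked numerically: the ratio of approximation error to $\|h\|$ should go to zero as the perturbation shrinks. A small JAX sketch, with an arbitrarily chosen perturbation direction:

```python
import jax
import jax.numpy as jnp

def f(p):
    r, theta = p
    return jnp.array([r * jnp.cos(theta), r * jnp.sin(theta)])

a = jnp.array([2.0, jnp.pi / 4])
J = jax.jacobian(f)(a)

# ||f(a+h) - f(a) - J h|| / ||h|| should vanish as ||h|| -> 0.
ratios = []
for eps in [1e-1, 1e-2, 1e-3]:
    h = eps * jnp.array([1.0, 1.0])   # illustrative direction
    err = jnp.linalg.norm(f(a + h) - f(a) - J @ h)
    ratios.append(float(err / jnp.linalg.norm(h)))
```

Each tenfold reduction in the perturbation roughly reduces the error ratio tenfold, consistent with an $o(\|h\|)$ remainder.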

The Multivariate Chain Rule

This is the central result of the topic. The chain rule says: the derivative of a composition is the product of the derivatives. In single-variable calculus (Topic 5), this meant multiplying numbers: $(f \circ g)'(a) = f'(g(a)) \cdot g'(a)$. Now the "derivatives" are matrices, and the "product" is matrix multiplication.

🔷 Theorem 2 (The Multivariate Chain Rule)

Let $g: \mathbb{R}^n \to \mathbb{R}^k$ be differentiable at $a \in \mathbb{R}^n$, and let $f: \mathbb{R}^k \to \mathbb{R}^m$ be differentiable at $g(a) \in \mathbb{R}^k$. Then the composition $f \circ g: \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $a$, and

$$J_{f \circ g}(a) = J_f(g(a)) \cdot J_g(a).$$

The Jacobian of the composition is the matrix product of the Jacobians, evaluated at the appropriate points. The dimensions are consistent: $J_f$ is $m \times k$, $J_g$ is $k \times n$, so $J_f \cdot J_g$ is $m \times n$ — matching the expected size for a function from $\mathbb{R}^n$ to $\mathbb{R}^m$.

Proof.

This is the most important proof in this topic. We track the error terms through two layers of the total derivative.

By differentiability of $g$ at $a$, we can write

$$g(a + h) = g(a) + J_g(a)h + \varepsilon_g(h)\|h\|$$

where $\varepsilon_g(h) \to 0$ as $h \to 0$ (here $\varepsilon_g(h)$ is a vector in $\mathbb{R}^k$). By differentiability of $f$ at $b = g(a)$, we can write

$$f(b + \ell) = f(b) + J_f(b)\ell + \varepsilon_f(\ell)\|\ell\|$$

where $\varepsilon_f(\ell) \to 0$ as $\ell \to 0$. Set $\ell = g(a+h) - g(a) = J_g(a)h + \varepsilon_g(h)\|h\|$. Then

$$(f \circ g)(a+h) = f(g(a) + \ell) = f(b) + J_f(b)\ell + \varepsilon_f(\ell)\|\ell\|.$$

Substituting the expression for $\ell$:

$$= f(b) + J_f(b)\bigl[J_g(a)h + \varepsilon_g(h)\|h\|\bigr] + \varepsilon_f(\ell)\|\ell\|$$

$$= f(b) + J_f(b)J_g(a)h + \underbrace{J_f(b)\varepsilon_g(h)\|h\| + \varepsilon_f(\ell)\|\ell\|}_{\text{must show this is } o(\|h\|)}.$$

First error term: $\|J_f(b)\varepsilon_g(h)\|h\|\| \le \|J_f(b)\| \cdot \|\varepsilon_g(h)\| \cdot \|h\|$. The operator norm $\|J_f(b)\|$ is a fixed constant, and $\|\varepsilon_g(h)\| \to 0$ as $h \to 0$, so this entire term is $o(\|h\|)$.

Second error term: We need to bound $\|\ell\|$. From the definition: $\|\ell\| = \|J_g(a)h + \varepsilon_g(h)\|h\|\| \le \|J_g(a)\| \cdot \|h\| + \|\varepsilon_g(h)\| \cdot \|h\|$. For $h$ sufficiently small, $\|\varepsilon_g(h)\| < 1$, so $\|\ell\| \le (\|J_g(a)\| + 1)\|h\| = C\|h\|$ for a constant $C$. As $h \to 0$, we have $\ell \to 0$, so $\varepsilon_f(\ell) \to 0$. Therefore $\|\varepsilon_f(\ell)\| \cdot \|\ell\| \le \|\varepsilon_f(\ell)\| \cdot C\|h\| = o(\|h\|)$.

Combining both error terms, we have shown

$$(f \circ g)(a+h) = (f \circ g)(a) + J_f(g(a)) \cdot J_g(a) \cdot h + o(\|h\|),$$

which proves that $f \circ g$ is differentiable at $a$ with $J_{f \circ g}(a) = J_f(g(a)) \cdot J_g(a)$. $\square$

📝 Example 5 (Chain rule with explicit matrices)

Let $g: \mathbb{R}^2 \to \mathbb{R}^2$ by $g(x, y) = (x^2 + y, xy)$ and $f: \mathbb{R}^2 \to \mathbb{R}$ by $f(u, v) = u^2 + v^2$. Their Jacobians are

$$J_g(x,y) = \begin{pmatrix} 2x & 1 \\ y & x \end{pmatrix}, \qquad J_f(u,v) = \begin{pmatrix} 2u & 2v \end{pmatrix}.$$

At $(x,y) = (1,1)$: $g(1,1) = (2, 1)$, so $J_g(1,1) = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$ and $J_f(2,1) = \begin{pmatrix} 4 & 2 \end{pmatrix}$.

Chain rule:

$$J_{f \circ g}(1,1) = \begin{pmatrix} 4 & 2 \end{pmatrix} \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 10 & 6 \end{pmatrix}.$$

Verification by direct computation: $(f \circ g)(x,y) = (x^2+y)^2 + (xy)^2$. Then $\frac{\partial}{\partial x}(f \circ g) = 2(x^2+y)(2x) + 2(xy)(y) = 4x(x^2+y) + 2xy^2$. At $(1,1)$: $4(1)(2) + 2(1)(1) = 10$. Confirmed.
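The same verification can be done mechanically with JAX: differentiate the composition directly, and separately multiply the two Jacobians; both should give $(10, 6)$ at $(1,1)$:

```python
import jax
import jax.numpy as jnp

def g(p):
    x, y = p
    return jnp.array([x**2 + y, x * y])

def f(u):
    return u[0]**2 + u[1]**2

a = jnp.array([1.0, 1.0])

# Jacobian of the composition, computed directly...
J_composed = jax.jacobian(lambda p: f(g(p)))(a)

# ...and as the product J_f(g(a)) @ J_g(a) from the chain rule.
J_product = jax.jacobian(f)(g(a)) @ jax.jacobian(g)(a)
```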

📝 Example 6 (Three-fold composition)

For a three-layer chain $h \circ g \circ f$:

$$J_{h \circ g \circ f}(a) = J_h(g(f(a))) \cdot J_g(f(a)) \cdot J_f(a).$$

The chain rule applies recursively — each Jacobian is evaluated at the appropriate intermediate value. This is the prototype for backpropagation through $K$ layers: the end-to-end Jacobian is $J_L \cdot J_{f_K} \cdot J_{f_{K-1}} \cdots J_{f_1}$, with each factor evaluated at the activation from the forward pass.

💡 Remark 2 (Why matrix multiplication?)

The chain rule is matrix multiplication because derivatives are linear maps, and composition of linear maps corresponds to matrix multiplication. The single-variable chain rule $(f \circ g)' = f' \cdot g'$ multiplies $1 \times 1$ matrices — it just happens to look like scalar multiplication. In $\mathbb{R}^n$, the matrix structure is exposed, and the product $J_f \cdot J_g$ captures how sensitivities propagate through layers of computation.


The chain rule as matrix multiplication: a computation graph with Jacobian matrices at each node and the accumulated product shown step by step.

The Jacobian Determinant

When $f: \mathbb{R}^n \to \mathbb{R}^n$ (square Jacobian), the determinant $\det J_f(a)$ has a geometric meaning: it measures how $f$ scales volumes near $a$.

📐 Definition 4 (The Jacobian Determinant)

For $f: \mathbb{R}^n \to \mathbb{R}^n$ with all partial derivatives existing at $a$, the Jacobian determinant is

$$\det J_f(a) = \det \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_n}{\partial x_1} & \cdots & \frac{\partial f_n}{\partial x_n} \end{pmatrix}(a).$$

The Jacobian determinant is defined only when the Jacobian is square ($m = n$).

🔷 Theorem 3 (Volume Distortion)

Let $f: \mathbb{R}^n \to \mathbb{R}^n$ be continuously differentiable near $a$, and let $R$ be a small rectangular region containing $a$. Then the image $f(R)$ has volume approximately

$$\mathrm{vol}(f(R)) = |\det J_f(a)| \cdot \mathrm{vol}(R) + o(\mathrm{vol}(R))$$

as the diameter of $R$ shrinks to zero. The absolute value handles orientation: $\det J_f(a) > 0$ means $f$ preserves orientation; $\det J_f(a) < 0$ means it reverses orientation (like a reflection).

📝 Example 7 (Polar coordinates)

For $f(r, \theta) = (r\cos\theta, r\sin\theta)$:

$$\det J_f = \det \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix} = r\cos^2\theta + r\sin^2\theta = r.$$

This is the $r$ in "$r\,dr\,d\theta$" for polar integration. A small $dr \times d\theta$ rectangle in polar coordinates maps to a region of area approximately $r\,dr\,d\theta$ in Cartesian coordinates. Far from the origin ($r$ large), the same angular increment covers more Cartesian area; near the origin ($r$ small), less.
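The identity $\det J_f = r$ holds at every point, independent of $\theta$; a quick JAX check at a few arbitrary sample points:

```python
import jax
import jax.numpy as jnp

def polar(p):
    r, theta = p
    return jnp.array([r * jnp.cos(theta), r * jnp.sin(theta)])

# det J_f should equal r at every point, independent of theta.
points = [(0.5, 0.3), (2.0, 1.1), (3.7, 5.0)]   # illustrative (r, theta) pairs
dets = [float(jnp.linalg.det(jax.jacobian(polar)(jnp.array(p)))) for p in points]
```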

📝 Example 8 (Scaling transformation)

For $f(x, y) = (2x, 3y)$: $J_f = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}$, $\det J_f = 6$. Every region's area is multiplied by 6 — stretching by 2 in $x$ and by 3 in $y$ scales areas by $2 \times 3 = 6$. The determinant of a diagonal matrix is the product of the diagonal entries, and each entry is a stretch factor along one axis.

🔷 Proposition 2 (Jacobian Determinant of a Composition)

For $f, g: \mathbb{R}^n \to \mathbb{R}^n$,

$$\det J_{f \circ g}(a) = \det J_f(g(a)) \cdot \det J_g(a).$$

Proof.

This follows from the chain rule (Theorem 2) and the multiplicativity of determinants:

$$\det J_{f \circ g}(a) = \det\bigl(J_f(g(a)) \cdot J_g(a)\bigr) = \det J_f(g(a)) \cdot \det J_g(a).$$

Geometrically: volume distortions compose multiplicatively. If $g$ scales volume by factor 3 and $f$ scales volume by factor 2, then $f \circ g$ scales volume by factor 6. $\square$

💡 Remark 3 (When the Jacobian determinant is zero)

$\det J_f(a) = 0$ means the linear approximation $J_f(a)$ is not invertible — it "collapses" a neighborhood of $a$ into a lower-dimensional set. This is the boundary between the Inverse Function Theorem applying (nonzero determinant implies $f$ is locally invertible) and failing (zero determinant implies a critical point). Inverse & Implicit Function Theorems develops this connection.


The Jacobian determinant as area distortion: a small square mapped through scaling, rotation, and polar-to-Cartesian transformations, with area ratios matching |det J|.

Coordinate Transformations

Coordinate transformations are the most concrete application of the Jacobian determinant. We have already seen polar coordinates (Example 7). Here we systematize the pattern.

Polar coordinates: $(r, \theta) \mapsto (r\cos\theta, r\sin\theta)$, with $\det J = r$. The area element becomes $dA = r\,dr\,d\theta$. This is why the "extra $r$" appears in polar integrals — it is the Jacobian determinant, accounting for the non-uniform distortion of the polar grid.

Spherical coordinates: $(r, \theta, \phi) \mapsto (r\sin\phi\cos\theta,\; r\sin\phi\sin\theta,\; r\cos\phi)$. The full $3 \times 3$ Jacobian computation yields $|\det J| = r^2\sin\phi$ (the sign depends on the ordering of the coordinates), giving the volume element $dV = r^2\sin\phi\,dr\,d\theta\,d\phi$.

Affine transformations: $f(x) = Ax + b$ with $A$ a constant matrix. Then $J_f = A$ everywhere, and $\det J_f = \det A$. Rotations have $|\det A| = 1$ (volume-preserving), reflections have $\det A = -1$ (orientation-reversing), and dilations have $|\det A| \neq 1$ (volume-changing).
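The spherical-coordinate determinant is tedious by hand but immediate with autodiff; the sample point below is illustrative, and the absolute value is taken because the sign of the raw determinant depends on how the three inputs are ordered:

```python
import jax
import jax.numpy as jnp

def spherical(p):
    r, theta, phi = p
    return jnp.array([r * jnp.sin(phi) * jnp.cos(theta),
                      r * jnp.sin(phi) * jnp.sin(theta),
                      r * jnp.cos(phi)])

p = jnp.array([2.0, 0.7, 1.2])   # illustrative (r, theta, phi)
det = jnp.linalg.det(jax.jacobian(spherical)(p))

# |det J| should match the volume-element factor r^2 sin(phi).
expected = p[0] ** 2 * jnp.sin(p[2])
```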

💡 Remark 4 (Preview of the change-of-variables formula)

The pattern we have seen — $\int_{f(U)} h(x)\,dx = \int_U h(f(u)) \cdot |\det J_f(u)|\,du$ — is formalized as the Change of Variables theorem in the Multivariable Integral Calculus track. That topic provides the rigorous justification. This topic gives you the differential machinery: the Jacobian determinant is the local volume scaling factor, and integration sums up these local contributions.

Coordinate transformations: polar grid lines mapped to Cartesian, with a heatmap showing |det J| = r — the non-uniform area distortion across the domain.

Computational Notes

Numerical Jacobian. Compute $J_f$ by applying central differences to each input dimension: column $j$ of $J_f$ is

$$\frac{f(a + h\mathbf{e}_j) - f(a - h\mathbf{e}_j)}{2h}.$$

This costs $2n$ evaluations of $f$ for the full $m \times n$ Jacobian — each evaluation produces all $m$ outputs at once, so the cost scales with the input dimension $n$, not the output dimension $m$.
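The central-difference scheme can be sketched directly; a minimal NumPy version, with the polar map as an illustrative test function:

```python
import numpy as np

def numerical_jacobian(f, a, h=1e-6):
    """Central-difference Jacobian: column j from perturbing input j."""
    a = np.asarray(a, dtype=float)
    m = len(f(a))
    J = np.zeros((m, len(a)))
    for j in range(len(a)):
        e = np.zeros_like(a)
        e[j] = h
        J[:, j] = (f(a + e) - f(a - e)) / (2 * h)   # O(h^2) accurate
    return J

def polar(p):
    r, theta = p
    return np.array([r * np.cos(theta), r * np.sin(theta)])

a = np.array([2.0, np.pi / 4])
J_num = numerical_jacobian(polar, a)
J_true = np.array([[np.cos(a[1]), -a[0] * np.sin(a[1])],
                   [np.sin(a[1]),  a[0] * np.cos(a[1])]])
```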

Automatic differentiation. jax.jacfwd(f)(x) computes the full Jacobian via forward-mode AD: it runs $n$ forward passes, one per input dimension, which is efficient when $n$ is small relative to $m$. Its reverse-mode counterpart jax.jacrev (the default behind jax.jacobian) instead uses one backward pass per output dimension.

JVPs and VJPs. Computing the full Jacobian matrix is often unnecessary — and for a neural network with millions of parameters, storing it is impossible. Instead:

  • A Jacobian-vector product (JVP) $J_f(a) \cdot v$ costs one forward pass (forward-mode AD). This computes how a perturbation $v$ in the input propagates to the output.
  • A vector-Jacobian product (VJP) $w^T \cdot J_f(a)$ costs one backward pass (reverse-mode AD). This computes how a sensitivity $w$ in the output traces back to the input.

For backpropagation with a scalar loss, we need the gradient $\nabla L$ — a single row vector. This is one VJP propagated through the entire network, at a cost independent of the number of parameters. That asymmetry is why reverse-mode AD dominates deep learning.

import jax
import jax.numpy as jnp

def f(x):                        # example map R^3 -> R^2
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, 0.0])   # input-space perturbation
w = jnp.array([1.0, -1.0])       # output-space sensitivity

# Full Jacobian (2 x 3 matrix)
J = jax.jacobian(f)(x)

# JVP: forward-mode, one pass — computes J @ v
primals, tangents = jax.jvp(f, (x,), (v,))

# VJP: reverse-mode, one pass — computes w @ J
primals, vjp_fn = jax.vjp(f, x)
(grad,) = vjp_fn(w)

Comparison of Jacobian computation methods: numerical central differences, forward-mode AD (JVPs), and reverse-mode AD (VJPs), with cost annotations.

Connections to ML — Backpropagation & Normalizing Flows

Backpropagation is the multivariate chain rule

A $K$-layer neural network computes $\hat{y} = f_K \circ f_{K-1} \circ \cdots \circ f_1(x)$, where each $f_k(z) = \sigma_k(W_k z + b_k)$. The loss is $L(\hat{y}, y)$. By the chain rule (Theorem 2, applied recursively):

$$J_{L \circ f_K \circ \cdots \circ f_1}(x) = J_L \cdot J_{f_K} \cdot J_{f_{K-1}} \cdots J_{f_1},$$

where each $J_{f_k}$ is evaluated at the activation from the forward pass. For a scalar loss ($L: \mathbb{R}^{n_{K+1}} \to \mathbb{R}$), $J_L$ is a $1 \times n_{K+1}$ row vector — the gradient of $L$ with respect to the network output. The gradient with respect to any intermediate layer is obtained by multiplying Jacobians from the loss backward through the chain.

This is not an analogy. Backpropagation is literally the multivariate chain rule, evaluated right-to-left.

Forward mode vs. reverse mode

The product $J_L \cdot J_{f_K} \cdots J_{f_1}$ can be evaluated in two orders:

  • From the input end (forward mode): compute $J_{f_1}$, then $J_{f_2} \cdot J_{f_1}$, then $J_{f_3} \cdot (J_{f_2} \cdot J_{f_1})$, and so on. Each step is a JVP. For a scalar loss and $p$ parameters, this requires $p$ forward passes — one per parameter.
  • From the loss end (reverse mode, backprop): compute $J_L$, then $J_L \cdot J_{f_K}$, then $(J_L \cdot J_{f_K}) \cdot J_{f_{K-1}}$, and so on. Each step is a VJP. For a scalar loss, this requires one backward pass, regardless of $p$.

The asymmetry is dramatic: reverse mode is $O(1)$ backward passes for a scalar loss, while forward mode is $O(p)$. This is why we can train networks with billions of parameters.
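Both evaluation orders produce the same gradient; only the cost differs. A small sketch with an illustrative two-layer network and scalar loss, computing the input gradient once via a single VJP and once via one JVP per input dimension:

```python
import jax
import jax.numpy as jnp

# Illustrative two-layer network with a scalar loss.
W1 = jnp.ones((4, 3)) * 0.1
W2 = jnp.ones((2, 4)) * 0.2

def loss(x):
    return jnp.sum(jnp.tanh(W2 @ jnp.tanh(W1 @ x)) ** 2)

x = jnp.array([1.0, -0.5, 2.0])

# Reverse mode: one VJP, seeded with the scalar loss's output sensitivity 1.0.
y, vjp_fn = jax.vjp(loss, x)
(g_vjp,) = vjp_fn(1.0)

# Forward mode: one JVP per input direction — same answer, n times the passes.
g_jvp = jnp.stack([jax.jvp(loss, (x,), (e,))[1] for e in jnp.eye(3)])
```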

Jacobians in normalizing flows

Normalizing flows transform a simple base density $p_z(z)$ (e.g., a Gaussian) through an invertible function $f$ to produce a complex density $p_x(x)$. By the change-of-variables formula:

$$p_x(x) = p_z(f^{-1}(x)) \cdot |\det J_{f^{-1}}(x)|,$$

or equivalently, $\log p_x(f(z)) = \log p_z(z) - \log|\det J_f(z)|$. The Jacobian determinant is the "volume correction" that accounts for how $f$ stretches or compresses probability mass. Architectures like RealNVP and GLOW use coupling layers with triangular Jacobians, so that $\det J_f$ is the product of diagonal entries — cheap to compute.
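The triangular-Jacobian trick can be seen in a toy coupling layer (illustrative scale and shift functions, not the exact RealNVP parameterization): half the input passes through unchanged, so the Jacobian is block-triangular and its log-determinant is just the sum of the log scale factors.

```python
import jax
import jax.numpy as jnp

def coupling(z):
    z1, z2 = z[:2], z[2:]
    s = jnp.tanh(z1)        # illustrative "scale network"
    t = z1 ** 2             # illustrative "shift network"
    return jnp.concatenate([z1, z2 * jnp.exp(s) + t])

z = jnp.array([0.3, -1.2, 0.8, 0.5])

# Triangular shortcut: log|det J| = sum of log-scales = sum(s).
logdet_cheap = jnp.sum(jnp.tanh(z[:2]))

# Brute force for comparison: full 4x4 Jacobian and its determinant.
logdet_full = jnp.log(jnp.abs(jnp.linalg.det(jax.jacobian(coupling)(z))))
```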

Jacobian regularization

Adding $\|J_f\|_F^2$ (the squared Frobenius norm of the Jacobian) as a regularization term encourages smoothness: a small Jacobian norm means small sensitivity to input perturbations. From the linear approximation (Proposition 1), for $x'$ near $x$,

$$\|f(x) - f(x')\| \lesssim \|J_f\| \cdot \|x - x'\|,$$

so bounding the Jacobian norm bounds the local Lipschitz constant. This appears in adversarial robustness (small perturbations to $x$ produce small changes in $f(x)$), generative models, and physics-informed neural networks.
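A minimal sketch of such a penalty, with an illustrative map; because the penalty is itself a differentiable function of the input (and, in practice, of the weights), it can be differentiated and added to a training loss:

```python
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative smooth map R^2 -> R^2.
    return jnp.tanh(jnp.array([[1.0, -2.0], [0.5, 0.3]]) @ x)

def jacobian_penalty(x):
    # Squared Frobenius norm of the Jacobian: a smoothness regularizer.
    J = jax.jacobian(f)(x)
    return jnp.sum(J ** 2)

x = jnp.array([0.2, -0.4])
penalty = jacobian_penalty(x)
penalty_grad = jax.grad(jacobian_penalty)(x)   # differentiable, hence trainable
```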

Forward links to formalML:

  • Gradient Descent — backpropagation as the chain rule through computation graphs
  • Smooth Manifolds — the Jacobian as the pushforward between tangent spaces
  • Information Geometry — the Fisher information matrix transformation under reparametrization via the Jacobian

Backpropagation as the chain rule: forward pass computes activations (top), backward pass multiplies Jacobians right-to-left via VJPs (bottom).

Normalizing flows: a Gaussian base density transformed through an invertible map f, with the Jacobian determinant providing the volume correction for the output density.

Connections & Further Reading

Within formalCalculus:

  • Partial Derivatives & the Gradient — The gradient is the $m = 1$ special case of the Jacobian. The total derivative, $C^1$ criterion, and tangent plane from Topic 9 extend directly to vector-valued functions.
  • The Derivative & Chain Rule — The single-variable chain rule $(f \circ g)' = f' \cdot g'$ is the $1 \times 1$ case of $J_{f \circ g} = J_f \cdot J_g$.
  • Epsilon-Delta & Continuity — The total derivative definition uses a multivariable limit, and the chain rule proof requires careful handling of error terms.
  • Completeness & Compactness — Compactness in $\mathbb{R}^n$ ensures continuous functions on closed bounded sets achieve their extrema.
  • The Hessian & Second-Order Analysis — The Hessian is the Jacobian of the gradient: $H_f = J(\nabla f)$. Second-order analysis builds directly on the chain rule and Jacobian framework developed here.
  • Inverse & Implicit Function Theorems — $\det J_f(a) \neq 0$ implies $f$ is locally invertible near $a$. The Implicit Function Theorem expresses part of the Jacobian as a function of the rest.
  • Change of Variables & the Jacobian Determinant — The Jacobian determinant $|\det J_\phi|$ appears as the volume scaling factor in the substitution formula for multiple integrals.
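The identity $H_f = J(\nabla f)$ from the Hessian bullet can be checked directly, with an illustrative scalar function:

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0]**2 * x[1] + jnp.sin(x[1])

x = jnp.array([1.0, 2.0])
H_via_jac = jax.jacobian(jax.grad(f))(x)   # Hessian = Jacobian of the gradient
H_direct = jax.hessian(f)(x)               # JAX's built-in Hessian
```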

A gallery of grid deformations: four vector fields applied to a regular grid, each with the Jacobian matrix and determinant at a sample point.

References

  1. Munkres (1991). Analysis on Manifolds. Chapters 2–3 develop the total derivative and the chain rule in ℝⁿ with full rigor — the primary reference for this topic's proof of the multivariate chain rule.
  2. Spivak (1965). Calculus on Manifolds. Chapter 2 treats the derivative as a linear map and proves the chain rule in the most general finite-dimensional setting — elegant and minimal.
  3. Rudin (1976). Principles of Mathematical Analysis. Chapter 9 on multivariable differentiation — Theorem 9.15 is the chain rule with a clean proof via the contraction mapping characterization of the derivative.
  4. Axler (2024). Linear Algebra Done Right. Determinants, linear maps, and matrix multiplication — the linear algebra foundation for understanding the Jacobian as a matrix representation of a linear map.
  5. Goodfellow, Bengio & Courville (2016). Deep Learning. Section 6.5 on backpropagation — the chain rule applied to computation graphs, forward-mode vs. reverse-mode AD.
  6. Rezende & Mohamed (2015). "Variational Inference with Normalizing Flows." Normalizing flows use the change-of-variables formula with the Jacobian determinant to compute densities of transformed distributions — the probabilistic application of the Jacobian determinant.
  7. Baydin, Pearlmutter, Radul & Siskind (2018). "Automatic Differentiation in Machine Learning: a Survey." Comprehensive survey of forward-mode and reverse-mode AD — the computational realization of the chain rule in modern ML frameworks.