Multivariable Differential · intermediate · 50 min read

The Hessian & Second-Order Analysis

Second-order partial derivatives, the Hessian matrix as the Jacobian of the gradient, critical point classification via eigenvalues, the second-order Taylor expansion, and Newton's method — the curvature machinery behind second-order optimization

Abstract. The Hessian matrix H_f(a) collects all second-order partial derivatives of a scalar-valued function f: ℝⁿ → ℝ into an n × n symmetric matrix. It is precisely the Jacobian of the gradient: H_f = J(∇f) — the first-order derivative of the first-order derivative. Where the gradient tells us which direction is uphill (first-order, slope), the Hessian tells us how the surface curves (second-order, curvature). The eigenvalues of the Hessian are the principal curvatures: all positive indicate a bowl (local minimum), all negative indicate a dome (local maximum), and mixed signs indicate a saddle point. The second-order Taylor expansion f(a+h) ≈ f(a) + ∇f(a)·h + ½ hᵀ H_f(a) h approximates the function by a paraboloid, and classifying critical points reduces to analyzing this quadratic form. Newton's method x_{k+1} = x_k - H_f(x_k)⁻¹ ∇f(x_k) exploits the Hessian to take curvature-corrected optimization steps, achieving quadratic convergence where gradient descent converges only linearly. In machine learning, the Hessian governs loss surface geometry: its condition number κ(H) = λ_max/λ_min determines how much gradient descent struggles with elongated valleys, its spectrum reveals whether critical points are minima or saddle points (in high-dimensional problems, saddle points dominate), and second-order methods — from Newton to L-BFGS to natural gradient — all use the Hessian or its approximations to improve convergence. The Hessian is the bridge between the first-order world (gradient descent) and the second-order world (curvature-aware optimization).

Where this leads → formalML

  • formalML Second-order optimization methods (Newton, quasi-Newton, L-BFGS) use the Hessian to precondition gradient descent. The condition number κ(H_f) = λ_max/λ_min determines the convergence rate — gradient descent takes O(κ) iterations, while Newton's method converges quadratically. Adaptive optimizers like Adam implicitly approximate diagonal Hessian elements.
  • formalML A C² function f is convex if and only if H_f(x) ⪰ 0 for all x. Strong convexity (H_f ⪰ μI) provides the convergence rate guarantee μ/L for gradient descent, where L is the Lipschitz constant of the gradient (equivalently, the largest eigenvalue of the Hessian).
  • formalML The Fisher information matrix I(θ) equals the expected Hessian of the negative log-likelihood E[-H_{log p(x|θ)}] under regularity conditions. The natural gradient I(θ)⁻¹ ∇L(θ) is Newton's method adapted to the Riemannian geometry of the statistical manifold — the curvature here is the Hessian of the KL divergence.

Overview & Motivation

Gradient descent moves “downhill” — in the direction of steepest descent, -\nabla L(\theta). But it ignores curvature. In a narrow valley of the loss landscape, the gradient points roughly across the valley rather than along it, causing the optimizer to zigzag back and forth between the valley walls. The gradient tells you which direction is steepest, but it says nothing about how the steepness changes as you move.

The Hessian captures this missing information. Where the gradient is a vector of first-order partial derivatives (slope in each direction), the Hessian is a matrix of second-order partial derivatives (how the slope changes in each direction). Its eigenvalues tell you how steeply the surface curves in each principal direction, and its eigenvectors tell you which directions those are. A narrow valley has one large eigenvalue (steep walls) and one small eigenvalue (gentle slope along the floor) — and the ratio between them, the condition number \kappa(H_f) = \lambda_{\max}/\lambda_{\min}, quantifies exactly how much trouble gradient descent will have.

Second-order methods like Newton’s method use the Hessian to take smarter steps — correcting for curvature so the optimizer moves along the valley floor instead of bouncing between the walls. The price is computing (or approximating) an n \times n matrix of second derivatives, which is why most deep learning still uses first-order methods. But understanding the Hessian is essential for understanding why gradient descent behaves the way it does — and the Hessian is literally the Jacobian of the gradient: H_f = J(\nabla f). Everything we built in Topics 9 and 10 converges here.

Second-Order Partial Derivatives & Clairaut’s Theorem

Partial derivatives are themselves functions, so we can differentiate them again. If f: \mathbb{R}^2 \to \mathbb{R} has partial derivatives f_x and f_y, then f_x is itself a function of (x, y), and we can take its partial derivative with respect to x or y. This gives us four second-order partial derivatives: f_{xx}, f_{xy}, f_{yx}, and f_{yy}.

📐 Definition 1 (Second-Order Partial Derivative)

Let f: \mathbb{R}^n \to \mathbb{R} have partial derivative \frac{\partial f}{\partial x_j} in a neighborhood of a. If \frac{\partial f}{\partial x_j} is itself differentiable with respect to x_i at a, the second-order partial derivative is

\frac{\partial^2 f}{\partial x_i \partial x_j}(a) = \frac{\partial}{\partial x_i}\left(\frac{\partial f}{\partial x_j}\right)(a).

When i = j, this is the unmixed (or pure) second partial derivative \frac{\partial^2 f}{\partial x_i^2}. When i \neq j, this is a mixed partial derivative.

Notation: f_{x_i x_j}(a), \partial_{ij} f(a), D_i D_j f(a).

The order in the notation \frac{\partial^2 f}{\partial x_i \partial x_j} matters: we differentiate first with respect to x_j (innermost), then with respect to x_i (outermost). This raises a natural question: does the order of differentiation matter? For mixed partials, is f_{xy} always equal to f_{yx}?

🔷 Theorem 1 (Clairaut's Theorem (Symmetry of Mixed Partials))

Let f: \mathbb{R}^n \to \mathbb{R} and suppose the mixed partial derivatives \frac{\partial^2 f}{\partial x_i \partial x_j} and \frac{\partial^2 f}{\partial x_j \partial x_i} both exist and are continuous in a neighborhood of a. Then

\frac{\partial^2 f}{\partial x_i \partial x_j}(a) = \frac{\partial^2 f}{\partial x_j \partial x_i}(a).

In short: if f \in C^2 (all second partial derivatives exist and are continuous), the order of differentiation does not matter. This ensures the Hessian matrix is symmetric.

Proof.

We prove the n = 2 case. Let f: \mathbb{R}^2 \to \mathbb{R} with continuous mixed partials f_{xy} and f_{yx} near (a, b). Define the second-difference quotient:

\Delta(h, k) = f(a+h, b+k) - f(a+h, b) - f(a, b+k) + f(a, b).

Let \phi(x) = f(x, b+k) - f(x, b). Then \Delta(h,k) = \phi(a+h) - \phi(a). By the Mean Value Theorem, there exists \xi between a and a+h such that:

\Delta(h,k) = h\,\phi'(\xi) = h\left[f_x(\xi, b+k) - f_x(\xi, b)\right].

Applying the Mean Value Theorem again to g(y) = f_x(\xi, y), there exists \eta between b and b+k such that:

\Delta(h,k) = hk \cdot f_{xy}(\xi, \eta).

By the same argument with the roles of x and y reversed — define \psi(y) = f(a+h, y) - f(a, y) and apply the MVT twice — we obtain:

\Delta(h,k) = hk \cdot f_{yx}(\xi', \eta')

for some \xi' between a and a+h, \eta' between b and b+k.

As (h, k) \to (0, 0), we have (\xi, \eta) \to (a, b) and (\xi', \eta') \to (a, b). Since f_{xy} and f_{yx} are continuous at (a, b):

f_{xy}(a, b) = \lim_{(h,k) \to (0,0)} \frac{\Delta(h,k)}{hk} = f_{yx}(a, b).

\square

📝 Example 1 (Second-order partials of f(x,y) = x²y + sin(xy))

We compute all four second-order partial derivatives:

First partials:

f_x = 2xy + y\cos(xy), \qquad f_y = x^2 + x\cos(xy).

Second partials:

f_{xx} = 2y - y^2\sin(xy), \qquad f_{yy} = -x^2\sin(xy),

f_{xy} = 2x + \cos(xy) - xy\sin(xy), \qquad f_{yx} = 2x + \cos(xy) - xy\sin(xy).

Observe f_{xy} = f_{yx} — Clairaut’s theorem confirmed. Both mixed partials are continuous, so the order of differentiation is interchangeable. The four second partials, assembled into a 2 \times 2 matrix, form the Hessian.

The four second-order partial derivatives of a sample function, visualized as surface plots. The two mixed partials f_xy and f_yx are identical — Clairaut's theorem in action.
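Clairaut’s theorem is also easy to check numerically: differentiate the analytic first partials from Example 1 with a central difference and compare. A minimal sketch in plain Python (the step size 1e-6 and the test point are arbitrary choices):

```python
import math

# Analytic first partials of f(x, y) = x^2*y + sin(x*y), from Example 1.
def f_x(x, y): return 2*x*y + y*math.cos(x*y)
def f_y(x, y): return x*x + x*math.cos(x*y)

def central_diff(g, x, y, wrt, h=1e-6):
    """Central difference of g with respect to one variable."""
    if wrt == "x":
        return (g(x + h, y) - g(x - h, y)) / (2*h)
    return (g(x, y + h) - g(x, y - h)) / (2*h)

x0, y0 = 0.7, -0.3
f_xy = central_diff(f_x, x0, y0, "y")   # differentiate f_x with respect to y
f_yx = central_diff(f_y, x0, y0, "x")   # differentiate f_y with respect to x
analytic = 2*x0 + math.cos(x0*y0) - x0*y0*math.sin(x0*y0)

print(f_xy, f_yx, analytic)  # all three agree to high precision
```

The two numerical mixed partials agree with each other and with the closed form, exactly as the theorem predicts for this C² function.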

💡 Remark 1 (From second derivative to Hessian matrix)

In single-variable calculus (Topic 5), the second derivative f''(a) is a single number. For f: \mathbb{R}^n \to \mathbb{R}, the second-order information consists of n^2 second partial derivatives. Clairaut’s theorem reduces this to \frac{n(n+1)}{2} independent values (the upper triangle of a symmetric matrix). For n = 2: 3 values (f_{xx}, f_{xy}, f_{yy}). For n = 100 (a small neural network): 5,050 values. For n = 10^8 (a large language model): storing the full Hessian is infeasible — motivating the approximations discussed in Section 8.

The Hessian Matrix

📐 Definition 2 (The Hessian Matrix)

Let f: \mathbb{R}^n \to \mathbb{R} be twice differentiable at a. The Hessian matrix of f at a is the n \times n matrix

H_f(a) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2}(a) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(a) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(a) \\[6pt] \frac{\partial^2 f}{\partial x_2 \partial x_1}(a) & \frac{\partial^2 f}{\partial x_2^2}(a) & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n}(a) \\[6pt] \vdots & \vdots & \ddots & \vdots \\[6pt] \frac{\partial^2 f}{\partial x_n \partial x_1}(a) & \frac{\partial^2 f}{\partial x_n \partial x_2}(a) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(a) \end{pmatrix}.

If f \in C^2, then H_f(a) is symmetric by Clairaut’s theorem.

Notation: H_f(a), \nabla^2 f(a), D^2 f(a).

The key insight is that the Hessian is not a new concept — it is the Jacobian applied to the gradient.

🔷 Proposition 1 (Hessian is the Jacobian of the Gradient)

Let f: \mathbb{R}^n \to \mathbb{R} be C^2. The gradient \nabla f: \mathbb{R}^n \to \mathbb{R}^n is a vector-valued function with component functions (\nabla f)_i = \frac{\partial f}{\partial x_i}. Its Jacobian matrix is

J(\nabla f)(a)_{ij} = \frac{\partial}{\partial x_j}\left(\frac{\partial f}{\partial x_i}\right)(a) = \frac{\partial^2 f}{\partial x_j \partial x_i}(a) = H_f(a)_{ij}.

Therefore H_f(a) = J(\nabla f)(a). The Hessian is the Jacobian of the gradient. (The last equality uses the symmetry \frac{\partial^2 f}{\partial x_j \partial x_i} = \frac{\partial^2 f}{\partial x_i \partial x_j} guaranteed by Clairaut’s theorem for C^2 functions — without symmetry, J(\nabla f) would be the transpose of the Hessian as defined above.)

Proof.

Direct verification. The i-th component of \nabla f is g_i(x) = \frac{\partial f}{\partial x_i}(x). By the definition of the Jacobian (Topic 10), the Jacobian of g = \nabla f has entries:

J_g(a)_{ij} = \frac{\partial g_i}{\partial x_j}(a) = \frac{\partial}{\partial x_j}\frac{\partial f}{\partial x_i}(a) = \frac{\partial^2 f}{\partial x_j \partial x_i}(a).

By Clairaut’s theorem (Theorem 1), \frac{\partial^2 f}{\partial x_j \partial x_i} = \frac{\partial^2 f}{\partial x_i \partial x_j} = H_f(a)_{ij}. Therefore J(\nabla f)(a) = H_f(a).

\square

📝 Example 2 (Hessian of a paraboloid)

f(x,y) = x^2 + 4y^2.

Gradient: \nabla f = (2x, 8y).

Hessian:

H_f = \begin{pmatrix} 2 & 0 \\ 0 & 8 \end{pmatrix}

— constant, independent of (x,y). The eigenvalues are 2 and 8. The surface curves 4 times more steeply in the y-direction than in the x-direction. This is the prototypical ill-conditioned loss surface: gradient descent zigzags because the curvatures are unequal. The condition number is \kappa = 8/2 = 4.
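The zigzag is easy to reproduce: run plain gradient descent on this paraboloid and watch the y-coordinate flip sign each step while x decays monotonically. A sketch (the step size 0.2 is an illustrative choice, inside the stability limit 2/\lambda_{\max} = 0.25 but large enough to oscillate in the steep direction):

```python
def grad(p):
    x, y = p
    return (2*x, 8*y)   # gradient of f(x, y) = x^2 + 4y^2

eta = 0.2               # stable (eta < 2/lambda_max = 0.25) but oscillatory in y
p = (1.0, 1.0)
ys = []
for _ in range(20):
    gx, gy = grad(p)
    p = (p[0] - eta*gx, p[1] - eta*gy)
    ys.append(p[1])

# x shrinks by the factor 1 - 2*eta = 0.6 each step;
# y flips sign each step (factor 1 - 8*eta = -0.6):
print(p, ys[:4])  # ys starts -0.6, 0.36, -0.216, ...
```

The y-iterates bounce across the valley floor exactly as the condition-number discussion predicts; both coordinates still converge because |−0.6| < 1.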

📝 Example 3 (Hessian of a saddle function)

f(x,y) = x^2 - y^2.

Gradient: \nabla f = (2x, -2y).

Hessian:

H_f = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}.

One positive eigenvalue (2, curves upward in x) and one negative eigenvalue (-2, curves downward in y). The origin is a saddle point — a critical point that is neither a minimum nor a maximum. The surface looks like a horse saddle: it rises in one direction and descends in the other.

📝 Example 4 (Hessian of a neural network loss)

Consider a loss L(\theta_1, \theta_2) = (y - \sigma(\theta_1 x_1 + \theta_2 x_2))^2 where \sigma is the sigmoid function. The Hessian H_L(\theta) has entries involving \sigma' and \sigma'' — it depends on the data (x_1, x_2, y) and the current parameters \theta. Unlike the paraboloid, this Hessian is not constant — the curvature of the loss surface changes as the parameters move. Computing this 2 \times 2 matrix is cheap; computing a 10^8 \times 10^8 matrix for a real network is not. This is why practical deep learning relies on Hessian approximations rather than the full matrix.


Construction of the Hessian matrix: the gradient vector field is differentiated again, and the resulting second-order partials are assembled into a symmetric matrix. Numerical Hessians shown for the paraboloid, saddle, and Rosenbrock functions.

Critical Point Classification

At a critical point a where \nabla f(a) = 0, the first-order Taylor term vanishes. The function near a is governed by the second-order term:

f(a + h) \approx f(a) + \frac{1}{2} h^T H_f(a) h.

The sign behavior of the quadratic form h^T H_f(a) h — does it produce all positive values, all negative values, or both? — determines whether a is a local min, local max, or saddle. This is where eigenvalue analysis becomes essential.

📐 Definition 3 (Positive Definite, Negative Definite, Indefinite)

A symmetric n \times n matrix A is:

  • Positive definite (A \succ 0) if h^T A h > 0 for all h \neq 0. Equivalently, all eigenvalues of A are positive.
  • Negative definite (A \prec 0) if h^T A h < 0 for all h \neq 0. Equivalently, all eigenvalues are negative.
  • Positive semidefinite (A \succeq 0) if h^T A h \ge 0 for all h. Equivalently, all eigenvalues are nonnegative.
  • Negative semidefinite (A \preceq 0) if h^T A h \le 0 for all h. Equivalently, all eigenvalues are nonpositive.
  • Indefinite if h^T A h takes both positive and negative values. Equivalently, A has both positive and negative eigenvalues.

🔷 Proposition 2 (Eigenvalue Criterion for Definiteness)

Let A be a symmetric n \times n matrix with eigenvalues \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n. Then:

(a) A \succ 0 \iff \lambda_1 > 0.

(b) A \prec 0 \iff \lambda_n < 0.

(c) A is indefinite \iff \lambda_1 < 0 < \lambda_n.

For n = 2: A = \begin{pmatrix} a & b \\ b & c \end{pmatrix} is positive definite iff a > 0 and \det A = ac - b^2 > 0. It is indefinite iff \det A < 0.

📐 Definition 4 (Saddle Point)

A point a is a saddle point of f if \nabla f(a) = 0 and H_f(a) is indefinite — i.e., f curves upward in some directions and downward in others. Near a saddle point, f(a + h) > f(a) for some directions h and f(a + h) < f(a) for others. The origin of f(x,y) = x^2 - y^2 is the canonical example.

🔷 Theorem 2 (The Second Derivative Test (Multivariable))

Let f: \mathbb{R}^n \to \mathbb{R} be C^2 near a, and suppose \nabla f(a) = 0 (so a is a critical point). Then:

  1. If H_f(a) \succ 0 (positive definite), then a is a strict local minimum.
  2. If H_f(a) \prec 0 (negative definite), then a is a strict local maximum.
  3. If H_f(a) is indefinite, then a is a saddle point.
  4. If H_f(a) is positive or negative semidefinite (has a zero eigenvalue), the test is inconclusive — higher-order analysis is needed.

For n = 2, write H_f = \begin{pmatrix} f_{xx} & f_{xy} \\ f_{xy} & f_{yy} \end{pmatrix} and let D = f_{xx} f_{yy} - f_{xy}^2 = \det H_f. Then:

  • D > 0 and f_{xx} > 0: local minimum.
  • D > 0 and f_{xx} < 0: local maximum.
  • D < 0: saddle point.
  • D = 0: inconclusive.

Proof.

We sketch the proof via the second-order Taylor expansion (proven in Section 5). At a critical point where \nabla f(a) = 0:

f(a+h) = f(a) + \frac{1}{2} h^T H_f(a) h + o(\|h\|^2).

Case 1: H_f(a) \succ 0. By the spectral theorem for symmetric matrices, h^T H_f(a) h \ge \lambda_{\min} \|h\|^2 where \lambda_{\min} > 0 is the smallest eigenvalue. For \|h\| sufficiently small, the o(\|h\|^2) error is dominated by \frac{1}{2}\lambda_{\min}\|h\|^2, so f(a+h) > f(a) for all h \neq 0 in a neighborhood — a is a strict local minimum.

Case 2: H_f(a) \prec 0. Analogous, with all inequalities reversed.

Case 3: H_f(a) indefinite. Let v_+ be an eigenvector with positive eigenvalue \lambda_+ > 0 and v_- an eigenvector with negative eigenvalue \lambda_- < 0. Along h = tv_+ for small t > 0: f(a + tv_+) \approx f(a) + \frac{1}{2}\lambda_+ t^2 > f(a). Along h = tv_-: f(a + tv_-) \approx f(a) + \frac{1}{2}\lambda_- t^2 < f(a). So a is neither a local min nor a local max — it is a saddle point.

\square

📝 Example 5 (Critical point classification: f(x,y) = x⁴ + y⁴ − 2x² − 2y² + 1)

Gradient: \nabla f = (4x^3 - 4x,\; 4y^3 - 4y). Setting both components to zero: 4x(x^2 - 1) = 0 and 4y(y^2 - 1) = 0. The critical points are all combinations of x \in \{-1, 0, 1\} and y \in \{-1, 0, 1\} — nine points total.

Hessian:

H_f = \begin{pmatrix} 12x^2 - 4 & 0 \\ 0 & 12y^2 - 4 \end{pmatrix}.

At (0,0): H_f = \begin{pmatrix} -4 & 0 \\ 0 & -4 \end{pmatrix} \prec 0 — local maximum.

At (1,0): H_f = \begin{pmatrix} 8 & 0 \\ 0 & -4 \end{pmatrix} — indefinite — saddle point. Similarly for (-1,0), (0,1), (0,-1).

At (1,1): H_f = \begin{pmatrix} 8 & 0 \\ 0 & 8 \end{pmatrix} \succ 0 — local minimum. Similarly for (-1,1), (1,-1), (-1,-1).

Nine critical points: 1 local max, 4 saddle points, 4 local minima.
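The classification in Example 5 can be mechanized with the n = 2 determinant test from Theorem 2. A short sketch (the Hessian of this particular function is diagonal, so f_{xy} = 0):

```python
# Classify the nine critical points of f(x, y) = x^4 + y^4 - 2x^2 - 2y^2 + 1
# using the n = 2 determinant test: D = f_xx*f_yy - f_xy^2.
def classify(x, y):
    fxx, fyy, fxy = 12*x*x - 4, 12*y*y - 4, 0.0
    D = fxx*fyy - fxy*fxy
    if D > 0:
        return "min" if fxx > 0 else "max"
    return "saddle" if D < 0 else "inconclusive"

counts = {"min": 0, "max": 0, "saddle": 0, "inconclusive": 0}
for x in (-1, 0, 1):
    for y in (-1, 0, 1):
        counts[classify(x, y)] += 1

print(counts)  # {'min': 4, 'max': 1, 'saddle': 4, 'inconclusive': 0}
```

The counts reproduce the tally above: one local max, four saddles, four local minima.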

💡 Remark 2 (When the second derivative test is inconclusive)

The function f(x,y) = x^4 + y^4 has \nabla f(0,0) = 0 and H_f(0,0) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix} — the zero matrix. The test is inconclusive. Yet the origin is clearly a local (and global) minimum: f(x,y) = x^4 + y^4 > 0 = f(0,0) for all (x,y) \neq (0,0).

The issue is that the second-order Taylor expansion is not informative when the quadratic term vanishes — the fourth-order terms dominate. Compare with g(x,y) = x^4 - y^4: same zero Hessian at the origin, but now the origin is a saddle point (of fourth order). The zero Hessian tells us the second-order analysis has nothing to say — we need higher-order derivatives to classify the point.


Critical point classification on three surfaces: a bowl (all eigenvalues positive), a standard saddle (mixed eigenvalue signs), and a monkey saddle (degenerate — zero Hessian). Critical points are colored by type with Hessian eigenvalue annotations.

The Second-Order Taylor Expansion

Just as the first-order Taylor expansion approximates f by a hyperplane (the tangent plane from Topic 9), the second-order expansion approximates f by a paraboloid. The Hessian provides the curvature information that the gradient approximation misses.

🔷 Theorem 3 (Second-Order Taylor Expansion)

Let f: \mathbb{R}^n \to \mathbb{R} be C^2 in a neighborhood of a. Then for all h with a + h in that neighborhood:

f(a + h) = f(a) + \nabla f(a) \cdot h + \frac{1}{2} h^T H_f(a)\, h + R_2(h)

where the remainder satisfies \lim_{h \to 0} \frac{R_2(h)}{\|h\|^2} = 0, i.e., R_2(h) = o(\|h\|^2).

In expanded form:

f(a+h) = f(a) + \sum_{i=1}^n \frac{\partial f}{\partial x_i}(a)\, h_i + \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \frac{\partial^2 f}{\partial x_i \partial x_j}(a)\, h_i h_j + o(\|h\|^2).

This is the multivariable generalization of f(a+h) = f(a) + f'(a)h + \frac{1}{2}f''(a)h^2 + o(h^2) from Topic 6.

Proof.

Apply the single-variable Taylor expansion to the function g(t) = f(a + th) at t = 0:

g(t) = g(0) + g'(0)\,t + \frac{1}{2}g''(0)\,t^2 + o(t^2).

We compute the derivatives of g:

  • g(0) = f(a).
  • g'(t) = \nabla f(a + th) \cdot h by the chain rule. So g'(0) = \nabla f(a) \cdot h.
  • g''(t) = \sum_{i,j} \frac{\partial^2 f}{\partial x_i \partial x_j}(a + th)\, h_i h_j = h^T H_f(a + th)\, h. So g''(0) = h^T H_f(a)\, h.

Setting t = 1:

f(a+h) = g(1) = f(a) + \nabla f(a) \cdot h + \frac{1}{2} h^T H_f(a)\, h + o(\|h\|^2).

The error bound conversion from o(t^2) at t = 1 to o(\|h\|^2) uses the continuity of the second partial derivatives.

\square

📝 Example 6 (Quadratic approximation of f(x,y) = eˣ⁺ʸ near the origin)

At (0, 0): f(0,0) = 1, \nabla f(0,0) = (1, 1), H_f(0,0) = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}.

The second-order Taylor expansion:

e^{x+y} \approx 1 + (x + y) + \frac{1}{2}(x^2 + 2xy + y^2) = 1 + (x+y) + \frac{1}{2}(x+y)^2.

This matches the known Taylor series e^u \approx 1 + u + \frac{1}{2}u^2 with u = x + y. The Hessian captures the curvature of e^{x+y} at the origin: the surface curves along the direction (1, 1) (eigenvalue 2) and is flat along the perpendicular direction (1, -1) (eigenvalue 0).
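The accuracy gain from the quadratic term can be checked directly by comparing the first- and second-order approximations at a nearby point (a plain-Python sketch; the evaluation point (0.1, 0.1) is an arbitrary choice):

```python
import math

# First- vs second-order Taylor approximation of f(x, y) = e^(x+y) at the origin.
def f(x, y):        return math.exp(x + y)
def taylor1(x, y):  return 1 + (x + y)
def taylor2(x, y):  return 1 + (x + y) + 0.5*(x + y)**2

x, y = 0.1, 0.1
err1 = abs(f(x, y) - taylor1(x, y))   # error of the tangent plane
err2 = abs(f(x, y) - taylor2(x, y))   # error of the paraboloid
print(err1, err2)  # ~2.1e-2 vs ~1.4e-3 — the quadratic term absorbs the O(h^2) error
```

At h = (0.1, 0.1), the paraboloid is more than an order of magnitude closer to the true surface than the tangent plane, as the o(\|h\|^2) remainder predicts.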

Three panels showing the original surface, the first-order Taylor approximation (tangent plane), and the second-order Taylor approximation (paraboloid). The quadratic approximation captures the curvature that the linear approximation misses.

Eigenvalue Analysis & Curvature

The eigenvalues and eigenvectors of H_f(a) have direct geometric meaning. The eigenvalues are the principal curvatures — the maximum and minimum curvatures of f at a. The eigenvectors are the principal curvature directions. This section connects the algebraic classification (Definition 3) to geometric visualization.

📝 Example 7 (Curvature of the elliptic paraboloid)

For f(x,y) = ax^2 + by^2 with a, b > 0:

H_f = \begin{pmatrix} 2a & 0 \\ 0 & 2b \end{pmatrix}.

Eigenvalues: \lambda_1 = 2a, \lambda_2 = 2b, with eigenvectors along the coordinate axes. The condition number \kappa = \lambda_{\max}/\lambda_{\min} = \max(a,b)/\min(a,b) measures eccentricity.

When a = b: \kappa = 1 (circular contours, isotropic curvature — gradient descent converges in a straight line).

When a \gg b or a \ll b: \kappa \gg 1 (elliptical contours, anisotropic curvature — gradient descent zigzags). The severity of the zigzag grows with \kappa.

📝 Example 8 (The monkey saddle: f(x,y) = x³ − 3xy²)

Gradient: \nabla f = (3x^2 - 3y^2,\; -6xy), so (0,0) is a critical point.

Hessian at origin:

H_f(0,0) = \begin{pmatrix} 6x & -6y \\ -6y & -6x \end{pmatrix}\bigg|_{(0,0)} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}.

The second derivative test is inconclusive. Yet the surface has three “valleys” meeting at the origin (picture a monkey sitting with both legs and a tail hanging down). This is a degenerate critical point requiring third-order analysis — the Hessian’s silence here tells us that the curvature is zero in all directions, so the second-order Taylor expansion provides no useful local information.

💡 Remark 3 (The condition number and optimization)

For a quadratic f(x) = \frac{1}{2}x^T A x - b^T x + c with A \succ 0, gradient descent with optimal step size converges in O(\kappa \log(1/\epsilon)) iterations, where \kappa = \lambda_{\max}(A)/\lambda_{\min}(A) is the condition number. Newton’s method converges in O(\log\log(1/\epsilon)) iterations (quadratic convergence).

The gap is dramatic: for \kappa = 1000, gradient descent needs roughly 7,000 steps while Newton needs roughly 10. The condition number \kappa(H_f) at a local minimum of a non-quadratic loss measures the local version of this problem — it tells you how much the loss landscape stretches gradient descent steps in the worst direction relative to the best.
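The gap is easy to see on a toy quadratic, where Newton is not merely fast but exact in one step (the quadratic model is the function). A sketch with \kappa = 100 (an illustrative value):

```python
# GD vs Newton on f(x, y) = (x^2 + kappa*y^2)/2, whose Hessian is diag(1, kappa).
kappa = 100.0
grad = lambda p: (p[0], kappa * p[1])

# Gradient descent with the optimal fixed step 2/(lambda_min + lambda_max):
eta = 2.0 / (1.0 + kappa)
p, gd_steps = (1.0, 1.0), 0
while p[0]**2 + p[1]**2 > 1e-16:
    gx, gy = grad(p)
    p = (p[0] - eta*gx, p[1] - eta*gy)
    gd_steps += 1

# Newton: the step is -H^{-1} grad = -(x, y), landing exactly at the minimum.
q = (1.0, 1.0)
gx, gy = grad(q)
q = (q[0] - gx / 1.0, q[1] - gy / kappa)

print(gd_steps, q)  # hundreds of GD steps vs a single exact Newton step
```

Per-coordinate, GD contracts by the factor (\kappa - 1)/(\kappa + 1) per step, so the iteration count scales linearly with \kappa; Newton divides each gradient component by its curvature and lands on the minimizer immediately.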

Contour plots with eigenvector arrows scaled by eigenvalue magnitude and colored by sign: blue for positive curvature, red for negative curvature. Shown for the paraboloid (all blue, uniform) and Rosenbrock (highly anisotropic near the valley).

Newton’s Method

Newton’s method uses the quadratic approximation from the second-order Taylor expansion to take curvature-corrected steps. At each iterate, we approximate f by a paraboloid, find the minimum of the paraboloid (which has a closed-form solution), and move there.

📝 Example 9 (Newton's method derivation)

At the current iterate x_k, approximate f by its second-order Taylor expansion:

f(x_k + h) \approx f(x_k) + \nabla f(x_k) \cdot h + \frac{1}{2} h^T H_f(x_k)\, h.

This is a quadratic function of h. Its gradient with respect to h is \nabla f(x_k) + H_f(x_k)\, h, which vanishes when:

h^* = -H_f(x_k)^{-1}\, \nabla f(x_k).

The Newton update is:

x_{k+1} = x_k + h^* = x_k - H_f(x_k)^{-1}\, \nabla f(x_k).

Compare with gradient descent: x_{k+1} = x_k - \eta\, \nabla f(x_k). Newton replaces the scalar step size \eta with the matrix H_f^{-1} — a direction- and curvature-dependent step that accounts for the local shape of the surface.

📝 Example 10 (Newton on the Rosenbrock function)

f(x,y) = (1-x)^2 + 100(y-x^2)^2.

The minimum is at (1, 1). Gradient descent converges slowly because the loss landscape is a narrow, curved valley with condition number \kappa \approx 2500 at the minimum. The gradient points nearly perpendicular to the valley floor, so GD bounces between the walls with each step.

Newton’s method converges in a few iterations because it uses the Hessian to navigate the curvature. The Hessian-inverse premultiplication rotates the gradient to align with the valley floor and scales the step to match the local curvature — exactly the correction that gradient descent is missing.
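A pure Newton iteration on Rosenbrock fits in a few lines, using the analytic gradient and Hessian and a hand-coded 2×2 solve. This is an undamped sketch (no line search or Levenberg-Marquardt safeguard), started from the classic point (-1.2, 1):

```python
# Pure Newton iteration on f(x, y) = (1-x)^2 + 100*(y-x^2)^2.
def grad(x, y):
    return (-2*(1 - x) - 400*x*(y - x*x), 200*(y - x*x))

def hess(x, y):
    # (f_xx, f_xy, f_yy): f_xx = 2 - 400y + 1200x^2, f_xy = -400x, f_yy = 200.
    return (2 - 400*y + 1200*x*x, -400*x, 200.0)

x, y = -1.2, 1.0
for k in range(100):
    gx, gy = grad(x, y)
    if gx*gx + gy*gy < 1e-24:
        break
    a, b, c = hess(x, y)
    det = a*c - b*b                 # 2x2 inverse: H^{-1} = [[c, -b], [-b, a]] / det
    hx = -( c*gx - b*gy) / det      # Newton step h = -H^{-1} grad
    hy = -(-b*gx + a*gy) / det
    x, y = x + hx, y + hy

print(x, y)  # converges to the minimum (1, 1)
```

From this start the undamped iteration reaches (1, 1) to machine precision in a handful of steps, though as Remark 4 warns, pure Newton carries no descent guarantee in general.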

💡 Remark 4 (Newton's method can fail)

Newton’s method requires H_f(x_k) to be positive definite — otherwise the quadratic model has no minimum (it has a maximum or saddle). At a saddle point, H_f is indefinite, and the Newton step may move toward the saddle rather than away from it.

This is why second-order methods in practice often use modified Newton methods that ensure the step direction is always a descent direction. Common modifications include:

  • Levenberg-Marquardt damping: Replace H_f^{-1} with (H_f + \mu I)^{-1} for sufficiently large \mu > 0, shifting all eigenvalues to be positive.
  • Trust region methods: Constrain \|h\| \le \Delta and solve the constrained quadratic subproblem.
  • Eigenvalue modification: Replace negative eigenvalues with their absolute values before inverting.

Side-by-side contour trajectories: gradient descent zigzags on an ill-conditioned surface while Newton's method converges in a few direct steps. The convergence curve comparison shows linear (GD) vs quadratic (Newton) convergence on a log scale.

Connections to ML

Loss surface curvature

The Hessian of the loss function L(\theta) at the current parameters \theta determines how the loss surface curves. The eigenvalues of H_L(\theta) are the curvatures in each principal direction. A large condition number \kappa(H_L) = \lambda_{\max}/\lambda_{\min} means the surface is an elongated valley — gradient descent overshoots in steep directions and barely moves in flat directions.

This is why learning rate tuning is difficult: no single scalar \eta works well for all directions simultaneously. If \eta is small enough not to overshoot in the steepest direction, it makes negligible progress in the flattest direction. If \eta is large enough to make progress in the flat direction, it oscillates wildly in the steep direction.

Batch normalization, layer normalization, and careful weight initialization schemes all implicitly improve the Hessian’s condition number by making the loss surface more isotropic — reducing the gap between the largest and smallest curvatures.

Saddle points in high dimensions

Dauphin et al. (2014) showed that in high-dimensional optimization, saddle points are exponentially more common than local minima. At a random critical point of a generic function on \mathbb{R}^n, each eigenvalue of the Hessian is independently positive or negative with roughly equal probability.

A local minimum requires all n eigenvalues to be positive — probability roughly 2^{-n}. A saddle point (mixed eigenvalue signs) has probability roughly 1 - 2^{1-n}. For n = 1000: the probability of a random critical point being a local minimum is approximately 2^{-1000} \approx 0.

The practical implication: when gradient descent appears to “get stuck” in high-dimensional optimization, it is almost certainly near a saddle point, not a local minimum. Gradient descent can escape saddle points (slowly, along directions of negative curvature), but Newton’s method can actually be attracted to saddle points (because H_f^{-1} amplifies components along eigenvectors with small eigenvalues, which near saddle points includes directions with negative curvature).
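A Monte Carlo sketch makes the dominance of saddles concrete: sample random symmetric matrices as stand-ins for Hessians at random critical points and classify them by eigenvalue signs. This assumes numpy is available; note the independent-sign coin-flip model in the text is itself a simplification — for random symmetric (GOE-style) matrices, all-positive spectra are rarer still:

```python
import numpy as np

# Fraction of random symmetric matrices that are indefinite (saddle-like).
rng = np.random.default_rng(0)

def fraction_saddle(n, trials=2000):
    saddles = 0
    for _ in range(trials):
        A = rng.standard_normal((n, n))
        H = (A + A.T) / 2                 # symmetrize to get a "Hessian"
        lam = np.linalg.eigvalsh(H)       # eigenvalues in ascending order
        if lam[0] < 0 < lam[-1]:          # mixed signs => saddle
            saddles += 1
    return saddles / trials

f2, f8 = fraction_saddle(2), fraction_saddle(8)
print(f2, f8)  # the saddle fraction climbs toward 1 as n grows
```

Already at n = 8 essentially every sampled "critical point" is a saddle; definite spectra become vanishingly rare as the dimension grows.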

Histogram of Hessian eigenvalue distributions at random critical points for increasing dimension n. As n grows, essentially all critical points are saddle points — local minima become exponentially rare.

Second-order optimizers

  • Full Newton: θk+1=θkHL1L\theta_{k+1} = \theta_k - H_L^{-1} \nabla L. Quadratic convergence but O(n2)O(n^2) memory and O(n3)O(n^3) computation per step. Impractical for n>104n > 10^4.

  • Quasi-Newton (L-BFGS): Approximates HL1H_L^{-1} from gradient differences across recent steps — stores O(mn)O(mn) where m10m \sim 102020 is the memory parameter. Widely used in scientific computing and small-scale ML.

  • Hessian-free optimization: Uses conjugate gradient to solve H_f(x_k)\, h = -\nabla f(x_k) without forming H_f explicitly — only requires Hessian-vector products H_f v, which can be computed via automatic differentiation in O(n) time (one forward + one backward pass per product).

  • Natural gradient: \theta_{k+1} = \theta_k - I(\theta_k)^{-1} \nabla L(\theta_k) where I(\theta) is the Fisher information matrix. This is Newton’s method on the KL divergence rather than the loss — it adapts the step to the geometry of the probability distribution. See Information Geometry on formalML.

  • Adam and diagonal approximations: Adam’s per-parameter adaptive learning rate \eta / (\sqrt{v_t} + \epsilon) implicitly approximates the diagonal of |H_L|^{1/2}. The second-moment estimate v_t tracks the mean square of gradients, which under stationary conditions approximates the diagonal Fisher information.
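The Hessian-vector products that Hessian-free optimization relies on can be sketched in JAX with forward-over-reverse differentiation (a standard autodiff pattern; the test function here is an arbitrary example):

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x**4) + jnp.dot(x[:-1], x[1:])

def hvp(fun, x, v):
    """Hessian-vector product H_fun(x) @ v without forming the Hessian:
    a JVP (forward mode) through the gradient (reverse mode)."""
    return jax.jvp(jax.grad(fun), (x,), (v,))[1]

x = jnp.array([1.0, 2.0, 3.0, 4.0])
v = jnp.ones(4)

print(hvp(f, x, v))            # matrix-free product, O(n) memory
print(jax.hessian(f)(x) @ v)   # same numbers, but O(n^2) memory
```

Each call costs roughly one forward plus one backward pass, which is what makes conjugate-gradient solves with the Hessian feasible at scale.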

Hessian spectrum in practice

Empirical studies of neural network loss surfaces show that the Hessian spectrum is highly structured: most eigenvalues are near zero (flat directions corresponding to redundant parameters), a few are large and positive (high-curvature directions), and at saddle points, a few are negative. The “effective dimension” of the optimization problem is much smaller than n, which explains why first-order methods work surprisingly well despite the theoretical superiority of second-order methods.

The Gauss-Newton approximation G = J^T J (where J is the Jacobian of the residuals) is always positive semidefinite and provides a useful approximation to the Hessian for least-squares problems, avoiding the indefiniteness issue of the full Hessian.
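A small JAX sketch (with toy residuals chosen for illustration) makes the difference concrete: J^T J is positive semidefinite by construction, while the full Hessian adds the residual-weighted curvature term \sum_i r_i \nabla^2 r_i and can be indefinite.

```python
import jax
import jax.numpy as jnp

def residuals(x):
    # toy nonlinear least-squares residuals
    return jnp.array([x[0]**2 - x[1], x[0] - 1.0, jnp.sin(x[1])])

def loss(x):
    r = residuals(x)
    return 0.5 * jnp.dot(r, r)

x = jnp.array([0.0, 2.0])
J = jax.jacobian(residuals)(x)
G = J.T @ J                  # Gauss-Newton approximation: always PSD
H = jax.hessian(loss)(x)     # full Hessian = G + sum_i r_i * Hess(r_i)

print("eig(G):", jnp.linalg.eigvalsh(G))   # all >= 0
print("eig(H):", jnp.linalg.eigvalsh(H))   # clearly negative eigenvalue here
```

At this point the residual r_1 = x_1^2 - x_2 is large and negative, so its curvature term drags the full Hessian indefinite while G stays positive semidefinite.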

Three panels showing the full Hessian H, the Gauss-Newton approximation J^T J, and their difference H − J^T J, for a nonlinear least-squares problem.

Forward links:

  • Gradient Descent on formalML — second-order optimization, convergence rate analysis, adaptive methods
  • Convex Analysis on formalML — positive semidefinite Hessian as the second-order certificate of convexity
  • Information Geometry on formalML — Fisher information as the expected Hessian of the negative log-likelihood


Heatmap of the condition number κ(H_f) over the domain for the Rosenbrock function, showing the narrow high-κ valley where gradient descent struggles most.

Connections & Further Reading

This topic connects to the full prerequisite chain of the Multivariable Differential Calculus track:

  • The Derivative: The single-variable f''(a) is the 1 \times 1 Hessian. The second derivative test generalizes to eigenvalue analysis.
  • Taylor’s Theorem: The single-variable Taylor expansion f(a+h) = f(a) + f'h + \frac{1}{2}f''h^2 + O(h^3) generalizes to the multivariable second-order expansion with the quadratic form h^T H_f h.
  • Partial Derivatives & the Gradient: The gradient provides first-order information (slope). The Hessian provides second-order information (curvature). At critical points, the gradient vanishes and the Hessian determines the local structure.
  • The Jacobian & Multivariate Chain Rule: The Hessian is the Jacobian of the gradient: H_f = J(\nabla f). The chain rule enables computing Hessians of composed functions.
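The second-order expansion can be checked numerically (a sketch using an arbitrary test function, with float64 enabled so roundoff does not mask the truncation error): the error of the quadratic model should shrink like ‖h‖³, i.e. about a thousandfold per decade of step size.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # avoid float32 roundoff floor

def f(x):
    return jnp.exp(x[0]) * jnp.sin(x[1])

a = jnp.array([0.3, 0.7])
g = jax.grad(f)(a)
H = jax.hessian(f)(a)
direction = jnp.array([1.0, -2.0])

for s in (1e-1, 1e-2, 1e-3):
    h = s * direction
    taylor2 = f(a) + g @ h + 0.5 * h @ H @ h   # quadratic model
    err = float(jnp.abs(f(a + h) - taylor2))
    print(f"s = {s:.0e}  error = {err:.2e}")   # drops ~1000x per decade
```

The cubic decay of the error is exactly the O(‖h‖³) remainder of the second-order Taylor formula.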
Computing the gradient, the Hessian, its eigenvalues, and a Newton step for the Rosenbrock function with JAX:

import jax
import jax.numpy as jnp

def f(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

x = jnp.array([0.5, 0.5])

# Gradient
grad_f = jax.grad(f)
print("∇f:", grad_f(x))

# Hessian
hessian_f = jax.hessian(f)
H = hessian_f(x)
print("H_f:\n", H)

# Eigenvalues of the Hessian (ascending order; one is negative here,
# since the Rosenbrock Hessian is indefinite at (0.5, 0.5))
eigenvalues = jnp.linalg.eigvalsh(H)
print("Eigenvalues:", eigenvalues)

# Condition number via absolute eigenvalues — the raw ratio
# eigenvalues[-1] / eigenvalues[0] would be negative for an
# indefinite matrix
abs_eigs = jnp.abs(eigenvalues)
print("Condition number:", jnp.max(abs_eigs) / jnp.min(abs_eigs))

# Newton step: x_new = x - H⁻¹ ∇f
newton_dir = -jnp.linalg.solve(H, grad_f(x))
print("Newton direction:", newton_dir)

References

  1. book Munkres (1991). Analysis on Manifolds. Chapter 3 develops higher-order derivatives and the second-order Taylor formula in ℝⁿ.
  2. book Spivak (1965). Calculus on Manifolds. Chapter 2 treats higher derivatives and the Taylor expansion for multilinear maps.
  3. book Rudin (1976). Principles of Mathematical Analysis. Chapter 9 on second-order differentiation — Theorem 9.41 is Clairaut's theorem on symmetry of mixed partials.
  4. book Nocedal & Wright (2006). Numerical Optimization. Chapters 2–3 on Newton's method, quasi-Newton methods, and the role of the Hessian in optimization convergence theory.
  5. book Boyd & Vandenberghe (2004). Convex Optimization. Chapter 9 on Newton's method for unconstrained optimization — convergence analysis via the Hessian's condition number.
  6. book Goodfellow, Bengio & Courville (2016). Deep Learning. Section 8.2 on challenges in optimization, including saddle points, ill-conditioning, and the role of the Hessian spectrum.
  7. paper Dauphin, Pascanu, Gulcehre, Cho, Ganguli & Bengio (2014). “Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization”. Shows that in high-dimensional optimization, saddle points dominate local minima — the Hessian spectrum at critical points has a mixture of positive and negative eigenvalues with high probability.
  8. paper Kingma & Ba (2015). “Adam: A Method for Stochastic Optimization”. The Adam optimizer's per-parameter learning rates implicitly approximate diagonal Hessian elements via second-moment estimates.