Multivariable Differential · foundational · 45 min read

Partial Derivatives & the Gradient

Extending differentiation to functions of several variables — partial derivatives as single-variable slices, the gradient as the direction of steepest ascent, directional derivatives, and the total derivative as the correct notion of multivariable differentiability

Abstract. Partial derivatives extend single-variable differentiation to functions of several variables. Given f: ℝⁿ → ℝ, the partial derivative ∂f/∂xᵢ at a point a is computed by holding all variables except xᵢ fixed and taking the ordinary derivative — it measures the rate of change of f in the xᵢ-direction. Geometrically, for f: ℝ² → ℝ, ∂f/∂x is the slope of the curve obtained by slicing the surface z = f(x,y) with the plane y = a₂. Assembling all partial derivatives into a vector gives the gradient ∇f(a) = (∂f/∂x₁, ..., ∂f/∂xₙ). The gradient connects to directional derivatives via D_u f(a) = ∇f(a) · u: the rate of change of f in direction u is the dot product of the gradient with u. Since |∇f · u| ≤ ‖∇f‖ with equality when u = ∇f/‖∇f‖, the gradient points in the direction of steepest ascent, and its magnitude is the maximum rate of change. This is why gradient descent works: moving in the direction −∇L(θ) decreases the loss as rapidly as possible (locally). The gradient is perpendicular to level sets — contour lines on a topographic map run at right angles to the direction of steepest climb. But partial derivatives alone do not tell the whole story. The function f(x,y) = xy/(x² + y²) has both partial derivatives at the origin, yet is not even continuous there. The correct notion of differentiability in ℝⁿ is the total derivative: a linear map Df(a) satisfying lim_{h→0} |f(a+h) − f(a) − Df(a)·h| / ‖h‖ = 0. When f is differentiable, the total derivative is represented by the gradient (for scalar-valued f) or the Jacobian matrix (for vector-valued f, covered in the next topic). A sufficient condition for differentiability is that all partial derivatives exist and are continuous — the C¹ criterion. In machine learning, the gradient is the engine of optimization: SGD, Adam, and every gradient-based optimizer compute ∇L(θ) and step in the direction −∇L. 
Gradient magnitude |∂L/∂xᵢ| measures feature importance, saliency maps visualize which input pixels matter most to a classifier, and the geometry of loss landscapes — contour shapes, saddle points, local minima — is understood through the gradient and its higher-order relatives.

Where this leads → formalML

  • formalML Gradient descent updates parameters by stepping in the direction −∇L(θ) — the steepest descent direction. The gradient's orthogonality to level sets explains why gradient descent cuts across contour lines. The directional derivative inequality D_u f ≤ ‖∇f‖ quantifies why no other direction decreases the loss faster (locally).
  • formalML For convex functions, the gradient provides a global lower bound: f(y) ≥ f(x) + ∇f(x)ᵀ(y − x). This first-order convexity condition is the foundation of convex optimization — the gradient at any point gives a tangent hyperplane that lies entirely below the graph.
  • formalML The gradient of a constraint function g is normal to the level set g⁻¹(c). This geometric fact — that ∇g is perpendicular to the constraint surface — is the starting point for Lagrange multipliers and the theory of smooth submanifolds of ℝⁿ.

Overview & Motivation

In single-variable calculus, a function $f: \mathbb{R} \to \mathbb{R}$ has one input and one direction to differentiate. The derivative $f'(a)$ captures everything about the local behavior of $f$ near $a$: the rate of change, the tangent slope, the best linear approximation. We built that theory carefully in The Derivative & Chain Rule.

But a neural network's loss $L(\theta_1, \theta_2, \ldots, \theta_n)$ depends on thousands or millions of parameters. To minimize $L$, we need to know how $L$ changes when we nudge each parameter $\theta_i$ independently — that is a partial derivative — and then assemble those rates of change into a single direction for the next step — that is the gradient. The gradient $\nabla L$ tells us the direction of steepest ascent; negating it gives steepest descent; and the update rule $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$ is gradient descent.

This topic builds the theory that makes gradient-based optimization precise. We start with the geometric picture — slicing surfaces to extract ordinary derivatives — then assemble partial derivatives into the gradient, connect the gradient to directional derivatives and level sets, and confront the subtle question of what "differentiable" really means in $\mathbb{R}^n$.

Partial Derivatives

We begin with the simplest multivariable setup: a function $f: \mathbb{R}^2 \to \mathbb{R}$, which we can visualize as a surface $z = f(x, y)$ in three-dimensional space. To compute the partial derivative $\frac{\partial f}{\partial x}$ at a point $(a_1, a_2)$, we slice the surface with the plane $y = a_2$. This slice produces a curve in the $xz$-plane — an ordinary single-variable function of $x$. The slope of this curve at $x = a_1$ is the partial derivative $\frac{\partial f}{\partial x}(a_1, a_2)$.

The key idea: a partial derivative is just an ordinary derivative, with all other variables held fixed.

📐 Definition 1 (Partial Derivative)

Let $f: \mathbb{R}^n \to \mathbb{R}$ and let $a = (a_1, \ldots, a_n) \in \mathbb{R}^n$. The partial derivative of $f$ with respect to $x_i$ at $a$ is

$$\frac{\partial f}{\partial x_i}(a) = \lim_{h \to 0} \frac{f(a_1, \ldots, a_i + h, \ldots, a_n) - f(a)}{h},$$

provided this limit exists. In plain English: hold every input fixed except $x_i$, and take the ordinary derivative with respect to $x_i$.

Notation: $\frac{\partial f}{\partial x_i}(a) = f_{x_i}(a) = \partial_i f(a) = D_i f(a)$.

This is exactly the limit definition of the derivative from The Derivative & Chain Rule, applied to the single-variable function $g(t) = f(a_1, \ldots, a_{i-1}, t, a_{i+1}, \ldots, a_n)$.
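The limit definition can be watched converging numerically: freeze every coordinate except the $i$-th and take an ordinary difference quotient with shrinking $h$. A minimal sketch (the function and step sizes are illustrative choices, not from the text above):

```python
# The limit definition in action: the difference quotient for ∂f/∂x_i
# approaches the partial derivative as h shrinks.
def quotient(f, a, i, h):
    ah = list(a)
    ah[i] += h                        # perturb only the i-th coordinate
    return (f(ah) - f(a)) / h

f = lambda v: v[0]**2 * v[1]          # f(x, y) = x^2 y, so ∂f/∂x = 2xy
a = [1.0, 2.0]
for h in (1e-1, 1e-3, 1e-6):
    print(h, quotient(f, a, 0, h))    # approaches 2 * 1 * 2 = 4
```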

📝 Example 1 (Partial derivatives of f(x,y) = x²y + sin(y))

To compute $\frac{\partial f}{\partial x}$: treat $y$ as a constant and differentiate with respect to $x$.

$$\frac{\partial f}{\partial x} = 2xy.$$

To compute $\frac{\partial f}{\partial y}$: treat $x$ as a constant and differentiate with respect to $y$.

$$\frac{\partial f}{\partial y} = x^2 + \cos(y).$$

At the point $(1, \pi/2)$: $f_x(1, \pi/2) = 2 \cdot 1 \cdot \pi/2 = \pi$ and $f_y(1, \pi/2) = 1 + \cos(\pi/2) = 1$.
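The hand computation in Example 1 can be spot-checked with central differences, one coordinate at a time (the step size is an illustrative choice):

```python
import math

# Central-difference check of Example 1: f(x, y) = x^2 y + sin(y)
# at (1, π/2); the numerics should land near f_x = π and f_y = 1.
f = lambda x, y: x**2 * y + math.sin(y)
h = 1e-6
x0, y0 = 1.0, math.pi / 2

fx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
fy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)
print(fx)  # ≈ 3.14159 (π)
print(fy)  # ≈ 1.0
```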

📝 Example 2 (Partial derivatives of f(x,y) = e^{x² + y²})

The chain rule from Topic 5 applies directly. Treating $y$ as constant:

$$\frac{\partial f}{\partial x} = 2x \, e^{x^2 + y^2}.$$

Treating $x$ as constant:

$$\frac{\partial f}{\partial y} = 2y \, e^{x^2 + y^2}.$$

At any point, $f_x$ and $f_y$ have the same exponential factor — they differ only in the leading coefficient ($2x$ vs. $2y$). This symmetry reflects the rotational structure of $x^2 + y^2$.


Partial derivative slices: the surface f(x,y) = x² + y² with y-fixed and x-fixed slices highlighted

Tangent Planes & Linear Approximation

In The Derivative & Chain Rule, the tangent line $y = f(a) + f'(a)(x - a)$ was the best linear approximation to $f$ near $a$. In two variables, the tangent line becomes a tangent plane.

🔷 Proposition 1 (Tangent Plane as Linear Approximation)

If $f: \mathbb{R}^2 \to \mathbb{R}$ is differentiable at $a = (a_1, a_2)$, then

$$f(a_1 + h_1, a_2 + h_2) \approx f(a) + f_x(a) \, h_1 + f_y(a) \, h_2,$$

with error that vanishes faster than $\|(h_1, h_2)\|$ as $(h_1, h_2) \to (0, 0)$. The surface $z = f(a) + f_x(a)(x - a_1) + f_y(a)(y - a_2)$ is the tangent plane to $z = f(x,y)$ at $(a_1, a_2, f(a))$.

The tangent plane touches the surface at one point and "hugs" it locally — just as the tangent line does in one dimension. Each partial derivative contributes one slope: $f_x$ is the slope in the $x$-direction, $f_y$ is the slope in the $y$-direction, and together they tilt the plane to match the surface.
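The "error vanishes faster than $\|(h_1, h_2)\|$" claim is observable numerically: the ratio error$/\|h\|$ should shrink roughly linearly in the step size. A sketch, reusing the function from Example 1 (the step sizes are illustrative):

```python
import math

# Tangent-plane error for f(x, y) = x^2 y + sin(y) at a = (1, π/2):
# the error should vanish faster than ‖h‖, i.e. error / ‖h‖ → 0.
f = lambda x, y: x**2 * y + math.sin(y)
a1, a2 = 1.0, math.pi / 2
fx, fy = 2 * a1 * a2, a1**2 + math.cos(a2)     # analytic partials: π and 1

ratios = []
for t in (1e-1, 1e-2, 1e-3):
    h1 = h2 = t
    linear = f(a1, a2) + fx * h1 + fy * h2     # tangent-plane prediction
    err = abs(f(a1 + h1, a2 + h2) - linear)
    ratios.append(err / math.hypot(h1, h2))
print(ratios)   # decreasing toward 0
```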

💡 Remark 1 (From tangent line to tangent plane to tangent hyperplane)

In $\mathbb{R}^1$, the best linear approximation is a line. In $\mathbb{R}^2$, it is a plane. In $\mathbb{R}^n$, it is a hyperplane:

$$f(a + h) \approx f(a) + \sum_{i=1}^n \frac{\partial f}{\partial x_i}(a) \, h_i.$$

The pattern is the same — the derivative provides the coefficients of the linear approximation. In the next topic (The Jacobian & Multivariate Chain Rule), the derivative of a vector-valued function becomes a matrix, and the tangent hyperplane becomes an affine subspace.

Tangent plane: the surface with its tangent plane at a point, showing how the plane approximates the surface locally

The Gradient Vector

The individual partial derivatives tell us rates of change along coordinate directions. But there is nothing special about the coordinate axes — they are an arbitrary choice of basis for $\mathbb{R}^n$. The gradient collects all partial derivatives into a single vector that encodes the rate of change in every direction simultaneously.

📐 Definition 2 (The Gradient)

Let $f: \mathbb{R}^n \to \mathbb{R}$ be a function whose partial derivatives all exist at $a$. The gradient of $f$ at $a$ is the vector

$$\nabla f(a) = \left(\frac{\partial f}{\partial x_1}(a), \frac{\partial f}{\partial x_2}(a), \ldots, \frac{\partial f}{\partial x_n}(a)\right) \in \mathbb{R}^n.$$

Notation: $\nabla f(a) = \operatorname{grad} f(a) = Df(a)^T$. We follow the column-vector convention throughout: $\nabla f$ is a column vector in $\mathbb{R}^n$.

The gradient transforms the partial derivatives — a collection of $n$ separate numbers — into a single geometric object: a vector that lives in the same space as the input. This shift from "list of rates" to "direction in input space" is the conceptual leap that makes gradient-based optimization geometric.

📝 Example 3 (Gradient of f(x,y) = x² + y²)

$$\nabla f(x, y) = (2x, 2y).$$

At any point, the gradient points radially outward from the origin — directly "uphill" on the paraboloid. At $(1, 1)$: $\nabla f = (2, 2)$, pointing toward the northeast. The magnitude $\|\nabla f(1,1)\| = \sqrt{8} = 2\sqrt{2}$ measures how steep the surface is at that point.

📝 Example 4 (Gradient of f(x,y,z) = xyz)

$$\nabla f = (yz, \, xz, \, xy).$$

At $(1, 2, 3)$: $\nabla f = (6, 3, 2)$. The function is most sensitive to changes in $x$ (rate $= 6$), less sensitive to $y$ (rate $= 3$), and least sensitive to $z$ (rate $= 2$). In an ML context, this is feature importance: the gradient magnitude per coordinate tells you which inputs matter most.
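The sensitivity ranking in Example 4 can be reproduced with a generic numeric gradient (central differences per coordinate); the helper below is an illustrative sketch, not a library function:

```python
# Per-coordinate sensitivity for f(x, y, z) = xyz at (1, 2, 3):
# the gradient (yz, xz, xy) = (6, 3, 2) ranks the inputs by influence.
def grad_numeric(f, a, h=1e-6):
    g = []
    for i in range(len(a)):
        ah, al = list(a), list(a)
        ah[i] += h
        al[i] -= h
        g.append((f(ah) - f(al)) / (2 * h))   # central difference
    return g

f = lambda v: v[0] * v[1] * v[2]
g = grad_numeric(f, [1.0, 2.0, 3.0])
print(g)                                       # close to [6, 3, 2]
ranking = sorted(range(3), key=lambda i: -abs(g[i]))
print(ranking)                                 # x first, then y, then z
```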


Gradient vector field: contour plot with gradient arrows perpendicular to contours

Directional Derivatives

Partial derivatives measure the rate of change along the coordinate axes $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_n$. But we often need the rate of change in a direction $\mathbf{u}$ that is not aligned with any axis — for instance, the direction a gradient descent step actually takes.

📐 Definition 3 (Directional Derivative)

Let $f: \mathbb{R}^n \to \mathbb{R}$, $a \in \mathbb{R}^n$, and $\mathbf{u} \in \mathbb{R}^n$ a unit vector ($\|\mathbf{u}\| = 1$). The directional derivative of $f$ at $a$ in the direction $\mathbf{u}$ is

$$D_\mathbf{u}f(a) = \lim_{t \to 0} \frac{f(a + t\mathbf{u}) - f(a)}{t},$$

provided this limit exists. In plain English: stand at $a$, walk in direction $\mathbf{u}$, and measure how fast $f$ changes.

Notice that choosing $\mathbf{u} = \mathbf{e}_i$ (the $i$-th standard basis vector) recovers the partial derivative $\frac{\partial f}{\partial x_i}(a)$. Partial derivatives are special cases of directional derivatives — the special cases where the direction is along a coordinate axis.

The remarkable fact is that when $f$ is differentiable, the directional derivative in any direction can be computed from the gradient alone:

🔷 Theorem 1 (Gradient–Directional Derivative Relationship)

If $f: \mathbb{R}^n \to \mathbb{R}$ is differentiable at $a$, then for every unit vector $\mathbf{u}$, the directional derivative exists and equals

$$D_\mathbf{u}f(a) = \nabla f(a) \cdot \mathbf{u}.$$

Proof.

Define $\varphi(t) = f(a + t\mathbf{u})$. This is a function $\varphi: \mathbb{R} \to \mathbb{R}$ — a single-variable function obtained by restricting $f$ to the line through $a$ in direction $\mathbf{u}$. By definition, $D_\mathbf{u}f(a) = \varphi'(0)$.

Since $f$ is differentiable at $a$, the chain rule (The Derivative & Chain Rule, Theorem 6) applies to the composition $\varphi(t) = f(a + t\mathbf{u})$. The "inner function" is $\gamma(t) = a + t\mathbf{u}$, with $\gamma'(0) = \mathbf{u}$. The chain rule gives:

$$\varphi'(0) = \sum_{i=1}^n \frac{\partial f}{\partial x_i}(a) \cdot u_i = \nabla f(a) \cdot \mathbf{u}.$$

The chain rule is valid here because $f$ is differentiable at $a$ — not merely having partial derivatives. This distinction is critical and we return to it in Section 7. ∎

📝 Example 5 (Directional derivative of f(x,y) = x² + y² in direction u = (1,1)/√2)

At $(1, 1)$: $\nabla f = (2, 2)$. In the direction $\mathbf{u} = \frac{1}{\sqrt{2}}(1, 1)$:

$$D_\mathbf{u}f(1, 1) = (2, 2) \cdot \frac{1}{\sqrt{2}}(1, 1) = \frac{2 + 2}{\sqrt{2}} = 2\sqrt{2} \approx 2.83.$$

Compare this with the partial derivatives: $f_x(1,1) = 2$ and $f_y(1,1) = 2$. The directional derivative in the diagonal direction is larger than either partial derivative individually — because the gradient points diagonally, and we are measuring the rate of change in the gradient's own direction.
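Theorem 1 gives two routes to the same number: the limit definition of $D_\mathbf{u}f$ and the dot product $\nabla f \cdot \mathbf{u}$. A quick numeric sketch of Example 5 (the step size $t$ is an illustrative choice):

```python
import math

# Two routes to D_u f for f(x, y) = x^2 + y^2 at (1, 1),
# direction u = (1, 1)/√2: the limit definition vs. ∇f · u.
f = lambda x, y: x**2 + y**2
a = (1.0, 1.0)
u = (1 / math.sqrt(2), 1 / math.sqrt(2))

t = 1e-6
limit_route = (f(a[0] + t * u[0], a[1] + t * u[1]) - f(*a)) / t
dot_route = 2 * a[0] * u[0] + 2 * a[1] * u[1]    # ∇f = (2x, 2y)
print(limit_route, dot_route)                     # both ≈ 2√2 ≈ 2.828
```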


Directional derivative: contour plot with gradient and direction vectors, plus polar rose plot

The Gradient as Steepest Ascent

We now arrive at the geometric crown jewel of this topic: the gradient points in the direction of steepest ascent. This is why gradient descent works — it follows $-\nabla L$, which is the direction of steepest descent.

🔷 Theorem 2 (Gradient as Direction of Steepest Ascent)

Let $f: \mathbb{R}^n \to \mathbb{R}$ be differentiable at $a$ with $\nabla f(a) \neq 0$. Then among all unit vectors $\mathbf{u} \in \mathbb{R}^n$:

  1. The maximum of $D_\mathbf{u}f(a)$ is $\|\nabla f(a)\|$, achieved when $\mathbf{u} = \frac{\nabla f(a)}{\|\nabla f(a)\|}$.
  2. The minimum of $D_\mathbf{u}f(a)$ is $-\|\nabla f(a)\|$, achieved when $\mathbf{u} = -\frac{\nabla f(a)}{\|\nabla f(a)\|}$.
  3. $D_\mathbf{u}f(a) = 0$ if and only if $\mathbf{u}$ is orthogonal to $\nabla f(a)$.

Proof.

By Theorem 1, $D_\mathbf{u}f = \nabla f \cdot \mathbf{u}$. By the Cauchy–Schwarz inequality:

$$|\nabla f \cdot \mathbf{u}| \le \|\nabla f\| \cdot \|\mathbf{u}\| = \|\nabla f\|,$$

since $\|\mathbf{u}\| = 1$. Equality holds if and only if $\mathbf{u}$ is a scalar multiple of $\nabla f$. Since $\|\mathbf{u}\| = 1$, this means $\mathbf{u} = \pm \nabla f / \|\nabla f\|$.

The positive sign gives $\nabla f \cdot (\nabla f / \|\nabla f\|) = \|\nabla f\|^2 / \|\nabla f\| = \|\nabla f\|$ (the maximum). The negative sign gives $-\|\nabla f\|$ (the minimum). If $\mathbf{u} \perp \nabla f$, then $\nabla f \cdot \mathbf{u} = 0$. ∎

This theorem has an immediate and profound consequence for optimization: to decrease $f$ as fast as possible from the current point $a$, walk in the direction $-\nabla f(a) / \|\nabla f(a)\|$. The maximum rate of decrease is $\|\nabla f(a)\|$. No other direction does better.

The gradient is not merely "some direction of increase" — it is the unique optimal direction, and its magnitude tells you the maximum possible rate of change. Directions perpendicular to the gradient produce zero change — they trace out the level set.
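Theorem 2 can be stress-tested by brute force: sweep many unit directions and check that $D_\mathbf{u}f$ peaks exactly at $\nabla f / \|\nabla f\|$. A sketch for the running example $f(x,y) = x^2 + y^2$ at $(1,1)$ (the sampling resolution is an arbitrary choice):

```python
import math

# Brute-force check of Theorem 2: among sampled unit directions,
# D_u f = ∇f · u is maximized at u = ∇f / ‖∇f‖.
gx, gy = 2.0, 2.0                          # ∇f(1, 1) for f = x^2 + y^2
norm = math.hypot(gx, gy)                  # 2√2 ≈ 2.828

best_val, best_u = -float("inf"), None
for k in range(360):                       # unit vectors at 1° spacing
    th = 2 * math.pi * k / 360
    u = (math.cos(th), math.sin(th))
    d = gx * u[0] + gy * u[1]              # directional derivative
    if d > best_val:
        best_val, best_u = d, u

print(best_val, norm)                      # best_val ≈ ‖∇f‖
print(best_u)                              # ≈ (1/√2, 1/√2), i.e. ∇f/‖∇f‖
```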

🔷 Theorem 3 (Gradient is Orthogonal to Level Sets)

Let $f: \mathbb{R}^n \to \mathbb{R}$ be differentiable at $a$ with $\nabla f(a) \neq 0$. Let $S = \{x \in \mathbb{R}^n : f(x) = f(a)\}$ be the level set through $a$. If $\gamma: (-\varepsilon, \varepsilon) \to \mathbb{R}^n$ is a smooth curve with $\gamma(0) = a$ and $\gamma(t) \in S$ for all $t$, then

$$\nabla f(a) \cdot \gamma'(0) = 0.$$

In words: the gradient is perpendicular to every curve that stays on the level set.

Proof.

Since $f(\gamma(t)) = f(a) = c$ for all $t$ (the curve stays on the level set), differentiating with respect to $t$ at $t = 0$ gives:

$$\frac{d}{dt} f(\gamma(t))\bigg|_{t=0} = \nabla f(\gamma(0)) \cdot \gamma'(0) = \nabla f(a) \cdot \gamma'(0) = 0,$$

by the chain rule. This holds for every tangent vector $\gamma'(0)$ to $S$ at $a$, so $\nabla f(a)$ is orthogonal to the tangent space of $S$ at $a$. ∎

💡 Remark 2 (Contour maps and topographic intuition)

On a topographic map, contour lines connect points at the same elevation. The gradient points in the direction of steepest climb — straight up the hillside, perpendicular to the contour lines. A hiker following the gradient path reaches the summit as fast as possible; a hiker walking along a contour line (perpendicular to the gradient) stays at the same elevation. Gradient descent reverses this: it follows $-\nabla f$, going downhill as steeply as possible.

For a loss function with circular contours (isotropic loss, like $L = x^2 + y^2$), the negative gradient points radially inward and gradient descent takes a straight path to the minimum. For elongated contours (anisotropic loss, like $L = x^2 + 10y^2$), the gradient does not point toward the minimum — it points perpendicular to the contour, so the descent path may oscillate back and forth across the narrow valley. This is exactly why momentum and adaptive methods like Adam were invented.
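The valley-crossing oscillation is easy to reproduce. On $L = x^2 + 10y^2$, a moderately large step size makes the $y$-coordinate overshoot and flip sign every step while $x$ converges smoothly (the start point and learning rate below are illustrative choices):

```python
# Gradient descent on the anisotropic loss L = x^2 + 10 y^2:
# the y-coordinate overshoots and alternates in sign (zig-zag across
# the valley) while x decays smoothly toward the minimum.
eta = 0.09
x, y = 2.0, 1.0
ys = []
for _ in range(8):
    gx, gy = 2 * x, 20 * y            # ∇L = (2x, 20y)
    x, y = x - eta * gx, y - eta * gy
    ys.append(y)
print(ys)   # signs alternate: y is multiplied by (1 − 20η) = −0.8 each step
```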


Gradient-contour orthogonality: circular vs. elliptical contours

Differentiability in $\mathbb{R}^n$ — The Total Derivative

We have been using the word "differentiable" in our theorems, and it is time to be precise about what it means. In single-variable calculus, differentiability meant that the limit $\lim_{h \to 0} [f(a+h) - f(a)]/h$ exists. The multivariable situation is more subtle.

The subtle point: having all partial derivatives is not the same as being differentiable. Partial derivatives probe a function only along coordinate axes — the $n$ coordinate lines through $a$. A function can have well-defined partial derivatives at a point but behave wildly when approached from other directions. The correct notion of differentiability requires the function to be well-approximated by a single linear map in all directions simultaneously.

📐 Definition 4 (Differentiability (Total Derivative))

A function $f: \mathbb{R}^n \to \mathbb{R}$ is differentiable at $a$ if there exists a linear map $Df(a): \mathbb{R}^n \to \mathbb{R}$ such that

$$\lim_{h \to 0} \frac{|f(a+h) - f(a) - Df(a)(h)|}{\|h\|} = 0.$$

When $f$ is differentiable, $Df(a)$ is unique and is represented by the gradient: $Df(a)(h) = \nabla f(a) \cdot h$. For $f: \mathbb{R}^n \to \mathbb{R}^m$, $Df(a)$ is represented by the Jacobian matrix — that is the next topic.

The limit says: the linear approximation $f(a) + \nabla f(a) \cdot h$ matches $f(a + h)$ so well that the error $|f(a+h) - f(a) - \nabla f(a) \cdot h|$ shrinks faster than $\|h\|$ itself. Compare this with the single-variable case from The Derivative & Chain Rule: $|f(a+h) - f(a) - f'(a)h| / |h| \to 0$. The multivariable version is the same idea, but the limit is in $\mathbb{R}^n$ — the approach direction is unrestricted.

🔷 Theorem 4 (Differentiability Implies Partial Derivatives Exist)

If $f: \mathbb{R}^n \to \mathbb{R}$ is differentiable at $a$, then all partial derivatives exist at $a$, and $Df(a)(\mathbf{e}_i) = \frac{\partial f}{\partial x_i}(a)$.

Proof.

Set $h = t\mathbf{e}_i$ (approach along the $i$-th coordinate axis) in the total derivative definition:

$$\frac{|f(a + t\mathbf{e}_i) - f(a) - Df(a)(t\mathbf{e}_i)|}{|t|} \to 0 \quad \text{as } t \to 0.$$

Since $Df(a)$ is linear, $Df(a)(t\mathbf{e}_i) = t \cdot Df(a)(\mathbf{e}_i)$. Rearranging:

$$\left|\frac{f(a + t\mathbf{e}_i) - f(a)}{t} - Df(a)(\mathbf{e}_i)\right| \to 0.$$

The left side of the subtraction is exactly the difference quotient defining $\frac{\partial f}{\partial x_i}(a)$. So the partial derivative exists and equals $Df(a)(\mathbf{e}_i)$. ∎

💡 Remark 3 (The converse is false)

Partial derivatives existing does not imply differentiability. This echoes the single-variable situation (continuity does not imply differentiability), but in a stronger form: in $\mathbb{R}^n$, we can have all partial derivatives at a point and still fail to be continuous there. The following example demonstrates this dramatically.

📝 Example 6 (The critical counterexample)

Define

$$f(x, y) = \begin{cases} \dfrac{xy}{x^2 + y^2} & (x, y) \neq (0, 0) \\ 0 & (x, y) = (0, 0). \end{cases}$$

Partial derivatives at the origin exist and equal zero:

$$f_x(0, 0) = \lim_{h \to 0} \frac{f(h, 0) - f(0, 0)}{h} = \lim_{h \to 0} \frac{0}{h} = 0.$$

Similarly $f_y(0, 0) = 0$. Both partial derivatives exist because along the coordinate axes, $f$ is identically zero.

But $f$ is not continuous at the origin: along the line $y = x$,

$$f(x, x) = \frac{x \cdot x}{x^2 + x^2} = \frac{x^2}{2x^2} = \frac{1}{2} \neq 0 = f(0, 0).$$

The function approaches $1/2$ along the diagonal but equals $0$ at the origin. Since $f$ is not continuous, it certainly cannot be differentiable — yet both partial derivatives exist. The moral: partial derivatives probe only the coordinate directions; differentiability requires consistent behavior from all directions simultaneously.
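The counterexample is worth poking at numerically: along either axis the function is exactly zero, yet along the diagonal it is exactly $1/2$ at every scale, no matter how close to the origin:

```python
# The counterexample f(x, y) = xy/(x^2 + y^2): identically zero along
# both axes, but constant 1/2 along the diagonal y = x.
def f(x, y):
    if (x, y) == (0.0, 0.0):
        return 0.0
    return x * y / (x**2 + y**2)

for t in (1e-1, 1e-4, 1e-8):
    print(f(t, 0.0), f(0.0, t), f(t, t))   # 0.0, 0.0, 0.5 at every scale
```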

Differentiability counterexample: the surface f(x,y) = xy/(x² + y²)

The positive result — a sufficient condition for differentiability — requires the partial derivatives to be continuous, not merely to exist:

🔷 Theorem 5 (C¹ Criterion)

If all partial derivatives $\frac{\partial f}{\partial x_i}$ exist in a neighborhood of $a$ and are continuous at $a$, then $f$ is differentiable at $a$.

Proof.

We sketch the proof for $n = 2$; the general case is analogous. Write:

$$f(a + h) - f(a) = [f(a_1 + h_1, a_2 + h_2) - f(a_1, a_2 + h_2)] + [f(a_1, a_2 + h_2) - f(a_1, a_2)].$$

Apply the single-variable Mean Value Theorem (Mean Value Theorem & Taylor Expansion, Theorem 1) to each bracket:

  • The first bracket equals $f_x(c_1, a_2 + h_2) \cdot h_1$ for some $c_1$ between $a_1$ and $a_1 + h_1$.
  • The second bracket equals $f_y(a_1, c_2) \cdot h_2$ for some $c_2$ between $a_2$ and $a_2 + h_2$.

So:

$$f(a + h) - f(a) = f_x(c_1, a_2 + h_2) \cdot h_1 + f_y(a_1, c_2) \cdot h_2.$$

Now subtract the linear approximation $f_x(a) h_1 + f_y(a) h_2$:

$$f(a+h) - f(a) - [f_x(a) h_1 + f_y(a) h_2] = [f_x(c_1, a_2+h_2) - f_x(a)] h_1 + [f_y(a_1, c_2) - f_y(a)] h_2.$$

By continuity of $f_x$ and $f_y$ at $a$: as $h \to 0$, we have $c_1 \to a_1$ and $c_2 \to a_2$, so $f_x(c_1, a_2 + h_2) \to f_x(a)$ and $f_y(a_1, c_2) \to f_y(a)$. Call these differences $\alpha(h)$ and $\beta(h)$, both tending to zero. Then:

$$|f(a+h) - f(a) - \nabla f(a) \cdot h| \le |\alpha(h)| \cdot |h_1| + |\beta(h)| \cdot |h_2| \le (|\alpha(h)| + |\beta(h)|) \|h\|.$$

Dividing by $\|h\|$: the error ratio is at most $|\alpha(h)| + |\beta(h)| \to 0$. ∎

💡 Remark 4 (Differentiability implies continuity (the chain of implications))

Just as in The Derivative & Chain Rule, differentiability at $a$ implies continuity at $a$:

$$|f(a + h) - f(a)| = |Df(a)(h) + o(\|h\|)| \le \|Df(a)\| \cdot \|h\| + o(\|h\|) \to 0.$$

The chain of implications is:

$$C^1 \;\Rightarrow\; \text{differentiable} \;\Rightarrow\; \text{continuous}, \qquad \text{differentiable} \;\Rightarrow\; \text{partial derivatives exist},$$

but none of the converses hold in general. The $C^1$ criterion is the practical workhorse: in virtually every function arising in machine learning (polynomials, exponentials, compositions of smooth functions), the partial derivatives are continuous, so differentiability is automatic.

Computational Notes

Computing gradients is the central computational task in gradient-based optimization. There are three approaches:

Analytical gradients. When we know the formula for $f$, we compute $\nabla f$ by hand (or symbolically). For $f(x, y) = x^2 + y^2$, the gradient is $(2x, 2y)$ — exact, fast, no approximation error. This is always preferred when available.

Finite difference gradients. When we don't have a formula (or want to verify one), we approximate each partial derivative numerically:

$$\frac{\partial f}{\partial x_i}(a) \approx \frac{f(a + h\mathbf{e}_i) - f(a - h\mathbf{e}_i)}{2h} \quad \text{(central difference)}.$$

This is the same central difference from The Derivative & Chain Rule, applied coordinate by coordinate. The tradeoff is the same: $h$ too large gives truncation error; $h$ too small gives floating-point cancellation error. For $n$ variables, we need $2n$ function evaluations — one forward and one backward per coordinate.
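The $h$ tradeoff is visible in a few lines: the central-difference error first falls like $h^2$ (truncation), then climbs again as cancellation dominates. A sketch using $f(x) = e^x$, whose derivative is known exactly (the particular step sizes are illustrative):

```python
import math

# Step-size tradeoff in the central difference, for f(x) = e^x at x = 1:
# error falls with h (truncation ~ h^2), then rises again once
# floating-point cancellation dominates.
f, x0, exact = math.exp, 1.0, math.exp(1.0)    # f'(1) = e

errors = []
for h in (1e-1, 1e-3, 1e-5, 1e-11, 1e-13):
    approx = (f(x0 + h) - f(x0 - h)) / (2 * h)
    errors.append(abs(approx - exact))
    print(f"h={h:.0e}  error={errors[-1]:.2e}")
```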

In Python, scipy.optimize.approx_fprime(x, f, epsilon) computes the finite-difference gradient. NumPy’s numpy.gradient handles gridded data.

Automatic differentiation. The jax.grad function computes exact gradients (to machine precision) without finite differences, using reverse-mode automatic differentiation — the same algorithm as backpropagation. A preview:

import jax
import jax.numpy as jnp

f = lambda x: x[0]**2 + x[1]**2
grad_f = jax.grad(f)
grad_f(jnp.array([1.0, 1.0]))  # → [2.0, 2.0]

This computes the exact gradient at cost proportional to a single function evaluation, regardless of the number of variables $n$. Automatic differentiation is the computational backbone of modern deep learning — we will see how it works in detail when we reach the Jacobian and the multivariate chain rule.

Gradient computation comparison: analytical vs. finite-difference, with error analysis

Connections to ML

The gradient is the most operationally important object in machine learning optimization. Every concept in this topic has a direct counterpart in ML practice.

Gradient descent. The update rule $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$ is the literal definition of gradient descent: evaluate the gradient of the loss, step in the opposite direction (steepest descent), and repeat. The learning rate $\eta$ controls the step size. Theorem 2 explains why this is locally optimal: no other direction decreases $L$ faster than $-\nabla L / \|\nabla L\|$.
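The update rule is only a few lines of code. A minimal sketch on the isotropic loss $L(\theta) = \theta_1^2 + \theta_2^2$ (the start point, learning rate, and iteration count are illustrative choices):

```python
# Minimal gradient descent: θ_{t+1} = θ_t − η ∇L(θ_t)
# on L(θ) = θ1^2 + θ2^2, whose minimum is at the origin.
theta = [2.0, -1.5]
eta = 0.1
for t in range(50):
    grad = [2 * theta[0], 2 * theta[1]]              # ∇L = (2θ1, 2θ2)
    theta = [theta[i] - eta * grad[i] for i in range(2)]
print(theta)   # both coordinates contracted toward 0 by a factor 0.8 per step
```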

Loss landscape geometry. The geometry of the loss function $L(\theta)$ determines how gradient descent behaves. Contour plots reveal this geometry: circular contours (isotropic loss) mean gradient descent takes a direct path to the minimum. Elongated contours (anisotropic loss) mean gradient descent oscillates — the gradient's perpendicularity to contour lines (Theorem 3) causes the path to zig-zag across the narrow valley. This oscillation motivates momentum, RMSProp, and Adam, which adapt the effective learning rate per coordinate. Second-order methods using the Hessian (The Hessian & Second-Order Analysis) address this more directly.

Feature importance and saliency. The magnitude $|\frac{\partial L}{\partial x_i}|$ measures how sensitive the loss is to feature $x_i$. If $\partial L / \partial x_3$ is large while $\partial L / \partial x_7$ is small, then feature $x_3$ has more influence on the prediction. In deep learning, saliency maps compute $\frac{\partial L}{\partial \text{input pixel}}$ to visualize which pixels most affect the classifier's output — a direct application of partial derivatives.

Critical points. A critical point has $\nabla f(a) = 0$: the loss surface is flat. But $\nabla f = 0$ does not distinguish minima from maxima from saddle points — that requires the Hessian (the matrix of second-order partial derivatives, covered in The Hessian & Second-Order Analysis). In high-dimensional loss landscapes, saddle points are far more common than local minima, and the gradient alone cannot tell them apart.


Gradient descent on three loss landscapes: isotropic, anisotropic, and Rosenbrock

Learning rate effects: too small, just right, and too large

Connections & Further Reading

Within formalCalculus

  • The Derivative & Chain Rule — Partial derivatives are single-variable derivatives applied coordinate by coordinate. The proof of Theorem 1 above directly uses the single-variable chain rule from Topic 5.
  • Epsilon-Delta & Continuity — The total derivative definition uses a multivariable limit ($\|h\| \to 0$ in $\mathbb{R}^n$). The $C^1$ criterion uses continuity of partial derivatives in the $\varepsilon$-$\delta$ sense.
  • Completeness & Compactness — Compactness in $\mathbb{R}^n$ (Heine–Borel generalizes) ensures continuous functions on compact domains achieve extrema — the multivariable Extreme Value Theorem that motivates gradient-based search.
  • The Jacobian & Multivariate Chain Rule — The Jacobian matrix generalizes the gradient to vector-valued functions $f: \mathbb{R}^n \to \mathbb{R}^m$. Where the gradient is a single row of partials, the Jacobian stacks $m$ rows. The multivariate chain rule $J_{f \circ g}(a) = J_f(g(a)) \cdot J_g(a)$ is the backbone of backpropagation.
  • The Hessian & Second-Order Analysis — The Hessian matrix $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ classifies critical points (where $\nabla f = 0$) as minima, maxima, or saddle points. The second-order Taylor expansion in $\mathbb{R}^n$ builds on the gradient as its first-order term.
  • Inverse & Implicit Function Theorems — The Inverse Function Theorem requires the Jacobian (built from partial derivatives) to be invertible. The Implicit Function Theorem uses $\nabla g$ at the constraint surface.

Forward to formalML

  • Gradient Descent — The gradient is the engine: $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$. Everything in this topic — the gradient's definition, steepest ascent, orthogonality to level sets — explains why gradient descent works geometrically.
  • Convex Analysis — For convex $f$, the first-order condition $f(y) \ge f(x) + \nabla f(x)^T(y - x)$ says the tangent hyperplane lies below the graph — the gradient provides a global lower bound.
  • Smooth Manifolds — The gradient of a constraint function $g$ is normal to the level set $g^{-1}(c)$, the starting point for Lagrange multipliers and the theory of smooth submanifolds.

References

  1. Abbott (2015). Understanding Analysis. Single-variable foundation — Topic 5 of formalCalculus follows Abbott's Chapter 5; this topic extends that foundation to several variables.
  2. Munkres (1991). Analysis on Manifolds. Chapters 2–3 develop partial derivatives, the total derivative, and the chain rule in ℝⁿ with full rigor and exceptional clarity — the primary reference for our multivariable treatment.
  3. Rudin (1976). Principles of Mathematical Analysis. Chapter 9 on multivariable differentiation — a compact, definitive treatment of the total derivative and the inverse function theorem.
  4. Spivak (1965). Calculus on Manifolds. Chapter 2 develops the derivative as a linear map, the multivariate chain rule, and partial derivatives — an elegant, minimalist treatment.
  5. Goodfellow, Bengio & Courville (2016). Deep Learning. Section 4.3 on gradient-based optimization — the gradient as the engine of deep learning training.
  6. Li, Xu, Taylor, Studer & Goldstein (2018). "Visualizing the Loss Landscape of Neural Nets". Loss landscape geometry — contour plots, saddle points, and the role of gradient direction in navigating high-dimensional optimization surfaces.