Single-Variable Calculus · foundational · 45 min read

The Derivative & Chain Rule

Rates of change as limits of difference quotients — the tangent line as the best linear approximation, differentiation rules from first principles, and the chain rule that makes backpropagation possible

Abstract. The derivative of f at a is the limit f'(a) = lim_{h→0} [f(a+h) - f(a)] / h — the instantaneous rate of change, the slope of the tangent line, and the best linear approximation to f near a, all in one definition. Geometrically, the derivative is what you get when secant lines through (a, f(a)) and (a+h, f(a+h)) converge as h → 0: the limiting slope is f'(a), and the limiting line is the tangent. This limit need not exist — |x| is continuous at 0 but has no derivative there because the left and right secant slopes disagree, and the Weierstrass function is continuous everywhere but differentiable nowhere. When the derivative does exist, differentiability implies continuity (but not the converse), and differentiation obeys algebraic rules derived directly from the limit definition: the sum rule, the product rule (Leibniz rule), the quotient rule, and — most importantly — the chain rule (f ∘ g)'(a) = f'(g(a)) · g'(a). The chain rule says that the derivative of a composition is the product of the derivatives along the chain. This is not a coincidence — it reflects the fact that the derivative at each point is a linear map, and composing linear maps means multiplying them. In machine learning, the chain rule is used in backpropagation: given a loss L = L(f(g(h(x)))), the gradient ∂L/∂x is computed by multiplying local derivatives and backpropagating them through the computation graph, which is exactly the chain rule applied layer by layer. Automatic differentiation (reverse mode = backprop, forward mode = tangent propagation) mechanizes this process, and every gradient-based optimizer — SGD, Adam, RMSProp — depends on the chain rule to compute the gradients it needs.

Where this leads → formalML

  • formalML Single-variable gradient descent θ_{t+1} = θ_t - η f'(θ_t) is the prototype for all gradient methods. The chain rule enables backpropagation — computing gradients through compositions of functions (network layers) — which is the computational engine of gradient-based optimization.
  • formalML Derivatives of entropy H(p) = -Σ pᵢ log pᵢ, KL divergence, and cross-entropy loss are central to information-theoretic ML objectives. The derivative d/dp(-p log p) = -log p - 1 connects information content to optimization.
  • formalML The single-variable derivative is the 1D prototype for the pushforward map between tangent spaces. Differentiable functions are the morphisms of smooth manifolds — the chain rule becomes functoriality of the tangent functor.

Overview & Motivation

Every time you train a neural network, you watch a loss curve drop. The optimizer adjusts each parameter θ by a small step — but in which direction, and how far? The answer is the derivative: L'(θ) tells you the rate at which the loss changes with respect to θ, and that rate determines both the direction (the sign of L') and the magnitude (|L'|) of the update.

But a modern network doesn’t have one parameter — it has millions, composed in layers. The input x passes through f₁, then f₂, then f₃, and so on. To compute dL/dθ for a parameter buried in f₁, we need to differentiate through the entire chain of compositions. This is the chain rule: the derivative of a composition is the product of the derivatives at each stage. And backpropagation — the algorithm that makes deep learning computationally feasible — is simply the chain rule applied systematically to a computation graph.

This topic makes both the derivative and the chain rule precise. We start with the geometric picture — secant lines rotating into the tangent — define the derivative as a limit, derive the differentiation rules from first principles, and prove the chain rule using the linear approximation characterization. By the end, you’ll see exactly why backpropagation works and what “taking the gradient” means at the level of individual arithmetic operations.

Prerequisites: We use the limit definition from Sequences, Limits & Convergence — specifically, the algebra of limits (Theorem 3) — and the ε-δ framework from Epsilon-Delta & Continuity for the function limit that defines f'(a).

The Derivative as a Limit

Two points, one slope

Consider a function f and two points on its graph: (a, f(a)) and (a+h, f(a+h)), where h ≠ 0. The straight line through these two points is a secant line, and its slope is the difference quotient:

\frac{f(a+h) - f(a)}{h}.

This ratio measures the average rate of change of f between a and a+h. When h is large, the secant line is a crude summary — it ignores everything the function does between the two points. But as h shrinks toward 0, the secant line pivots around (a, f(a)) and settles into a unique position: the tangent line. The slope of that tangent is the instantaneous rate of change — the derivative.

📐 Definition 1 (The Derivative)

Let f be defined on an open interval containing a. The derivative of f at a is

f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h},

provided this limit exists. Equivalently, using the ε-δ framework from Epsilon-Delta & Continuity: for every ε > 0, there exists δ > 0 such that

0 < |h| < \delta \implies \left|\frac{f(a+h) - f(a)}{h} - f'(a)\right| < \varepsilon.

When this limit exists, we say f is differentiable at a. The function f': x ↦ f'(x), defined at every point where the limit exists, is the derivative function of f.

The notation f'(a) is Lagrange’s. We also write df/dx|_{x=a} (Leibniz) or Df(a) (Euler/operator notation). The Leibniz notation is suggestive — it looks like a ratio of infinitesimals — and in practice we often manipulate it as if it were one. But the definition above is what makes it rigorous: f'(a) is a single number, the limit of a ratio, not a ratio of limits.

📐 Definition 2 (Left and Right Derivatives)

The left derivative and right derivative of f at a are

f'_-(a) = \lim_{h \to 0^-} \frac{f(a+h) - f(a)}{h}, \qquad f'_+(a) = \lim_{h \to 0^+} \frac{f(a+h) - f(a)}{h}.

The derivative f'(a) exists if and only if both one-sided derivatives exist and f'_-(a) = f'_+(a).

This connects directly to one-sided limits from Epsilon-Delta & Continuity — the derivative is defined by a function limit as h → 0, and that limit exists if and only if both one-sided limits agree.

📝 Example 1 (Derivative of f(x) = x² at a = 3)

We compute f'(3) directly from the definition:

f'(3) = \lim_{h \to 0} \frac{(3+h)^2 - 9}{h} = \lim_{h \to 0} \frac{9 + 6h + h^2 - 9}{h} = \lim_{h \to 0} \frac{6h + h^2}{h} = \lim_{h \to 0} (6 + h) = 6.

Every step is algebra followed by a limit. The cancellation of h in the numerator and denominator is the key move — it removes the 0/0 indeterminate form and leaves a polynomial in h that we can evaluate at h = 0.
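The computation can be mirrored numerically. A minimal sketch in plain Python: the difference quotient for f(x) = x² at a = 3 equals 6 + h exactly, so it approaches 6 as h shrinks:

```python
def diff_quotient(f, a, h):
    """Average rate of change of f between a and a + h (the secant slope)."""
    return (f(a + h) - f(a)) / h

f = lambda x: x * x
for h in [1.0, 0.1, 0.01, 0.001]:
    # each value is 6 + h (up to floating-point roundoff)
    print(h, diff_quotient(f, 3.0, h))
```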

📝 Example 2 (Derivative of f(x) = √x at a > 0)

For f(x) = √x at a > 0:

f'(a) = \lim_{h \to 0} \frac{\sqrt{a+h} - \sqrt{a}}{h}.

The direct approach gives 0/0, so we rationalize — multiply numerator and denominator by the conjugate √(a+h) + √a:

\frac{\sqrt{a+h} - \sqrt{a}}{h} \cdot \frac{\sqrt{a+h} + \sqrt{a}}{\sqrt{a+h} + \sqrt{a}} = \frac{(a+h) - a}{h(\sqrt{a+h} + \sqrt{a})} = \frac{1}{\sqrt{a+h} + \sqrt{a}}.

As h → 0, this converges to 1/(2√a). So f'(a) = 1/(2√a), which is defined for a > 0 — the domain of the derivative is smaller than the domain of f (which includes a = 0). At a = 0, the difference quotient √h/h = 1/√h → ∞, so f is not differentiable there (vertical tangent).

Try dragging the h slider toward zero in the explorer below — watch how the secant line (amber) rotates into the tangent line (red dashed), and how the secant slope converges to f'(a).


Secant lines approaching tangent on f(x) = x²

The Derivative as Linear Approximation

The tangent line is more than a geometric artifact — it is the best linear approximation to f near a. The tangent line at a is

T(x) = f(a) + f'(a)(x - a),

and the quality of this approximation is measured by the error f(x) − T(x). The defining property of the derivative is that this error vanishes faster than the distance from a:

🔷 Proposition 1 (Derivative as Best Linear Approximation)

f is differentiable at a with derivative f'(a) if and only if there exists a number L such that

f(a+h) = f(a) + Lh + r(h),

where lim_{h→0} r(h)/h = 0. The number L is unique and equals f'(a). The remainder r(h) is said to be “little-o of h”, written r(h) = o(h).

This characterization is the conceptual bridge to multivariable calculus. In one dimension, the “linear map” h ↦ f'(a)·h is just multiplication by a scalar. But the same definition works in ℝⁿ: the derivative of f: ℝⁿ → ℝᵐ at a is the linear map L: ℝⁿ → ℝᵐ such that f(a+h) = f(a) + Lh + o(‖h‖). That linear map is represented by the Jacobian matrix — and the chain rule becomes matrix multiplication. The single-variable case is the prototype.
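Proposition 1 is easy to probe numerically. A minimal sketch in plain Python: for f(x) = x² the remainder is r(h) = (a+h)² − a² − 2ah = h², so r(h)/h = h vanishes linearly:

```python
def remainder(f, fprime, a, h):
    """r(h) = f(a+h) - f(a) - f'(a) h, the error of the linear approximation."""
    return f(a + h) - f(a) - fprime(a) * h

f = lambda x: x * x
fp = lambda x: 2 * x
for h in [0.1, 0.01, 0.001]:
    # r(h)/h equals h here, confirming r(h) = o(h)
    print(h, remainder(f, fp, 3.0, h) / h)
```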

💡 Remark 1 (Linear maps preview)

In one dimension, a linear map L: ℝ → ℝ is multiplication by a scalar: L(h) = ch. The derivative f'(a) is that scalar. In ℝⁿ, the derivative at a will be a matrix (the Jacobian J_f(a)), and “multiplication by a scalar” becomes “multiplication by a matrix.” The single-variable derivative is the 1 × 1 case of the Jacobian.

Explore the tangent line below — drag the point along the curve and notice how the error (right panel) decays quadratically near a, confirming that the tangent is truly the best linear approximation.


Tangent line as local approximation with error plot

Differentiability and Continuity

A natural question: if f is differentiable at a, must it be continuous there? And conversely: if f is continuous at a, must it be differentiable? The answers are “yes” and “no,” respectively — and both answers are important.

🔷 Theorem 1 (Differentiability Implies Continuity)

If f is differentiable at a, then f is continuous at a.

Proof.

We need to show that lim_{h→0} f(a+h) = f(a), which is continuity at a (Epsilon-Delta & Continuity, Definition 5). Write

f(a+h) - f(a) = \frac{f(a+h) - f(a)}{h} \cdot h.

As h → 0, the first factor converges to f'(a) (by differentiability) and the second factor converges to 0. By the algebra of limits (Sequences, Limits & Convergence, Theorem 3), the product converges to f'(a) · 0 = 0. So lim_{h→0} [f(a+h) − f(a)] = 0, which means lim_{h→0} f(a+h) = f(a). ∎

💡 Remark 2 (The converse is false)

Continuity does not imply differentiability. The function f(x) = |x| is continuous everywhere — including at x = 0 — but is not differentiable at 0. This is not a pathological edge case: every ReLU unit in a neural network has a corner at x = 0 with the same non-differentiability as |x|. In practice, ML frameworks define the “derivative” at the kink by convention (typically 0 or 1), which works because the set of inputs landing exactly on the kink has measure zero.

📝 Example 3 (f(x) = |x| is not differentiable at 0)

The left derivative at 0:

f'_-(0) = \lim_{h \to 0^-} \frac{|0+h| - |0|}{h} = \lim_{h \to 0^-} \frac{|h|}{h} = \lim_{h \to 0^-} \frac{-h}{h} = -1.

The right derivative at 0:

f'_+(0) = \lim_{h \to 0^+} \frac{|h|}{h} = \lim_{h \to 0^+} \frac{h}{h} = 1.

Since f'_-(0) = −1 ≠ 1 = f'_+(0), the derivative f'(0) does not exist. Geometrically, the graph of |x| has a corner at 0 — there is no single tangent line, because the secant from the left settles on slope −1 while the secant from the right settles on slope +1.
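The corner can be seen numerically. A minimal sketch in plain Python: the left and right secant slopes of |x| at 0 are −1 and +1 at every step size h:

```python
def one_sided_slopes(f, a, h):
    """Left and right difference quotients of f at a, with step h > 0."""
    left = (f(a - h) - f(a)) / (-h)
    right = (f(a + h) - f(a)) / h
    return left, right

for h in [0.1, 0.01, 0.001]:
    # every step size gives (-1.0, 1.0): the one-sided limits disagree
    print(h, one_sided_slopes(abs, 0.0, h))
```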

Use the tabs below to compare smooth, corner, cusp, and Weierstrass-type non-differentiability. Drag the h slider toward zero and watch whether the left and right secant slopes agree.


f(x) = x² is differentiable everywhere. Both left and right difference quotients converge to f'(0) = 0.

Four types of (non-)differentiability

Differentiation Rules

Computing derivatives from the limit definition every time would be tedious. Fortunately, the limit definition yields general rules that handle sums, products, quotients, and powers. We derive each rule from the definition — no shortcuts, no “it can be shown.”

🔷 Theorem 2 (Sum Rule)

If f and g are differentiable at a, then f + g is differentiable at a and

(f + g)'(a) = f'(a) + g'(a).

Proof.

\frac{(f+g)(a+h) - (f+g)(a)}{h} = \frac{f(a+h) - f(a)}{h} + \frac{g(a+h) - g(a)}{h}.

Taking the limit as h → 0 and applying the sum rule for limits (Sequences, Limits & Convergence, Theorem 3), we get f'(a) + g'(a). ∎

🔷 Theorem 3 (Constant Multiple Rule)

If f is differentiable at a and c ∈ ℝ, then (cf)'(a) = c · f'(a).

The proof is identical — factor c out of the difference quotient and use the constant multiple rule for limits.

🔷 Theorem 4 (Product Rule (Leibniz Rule))

If f and g are differentiable at a, then fg is differentiable at a and

(fg)'(a) = f'(a)\,g(a) + f(a)\,g'(a).

Proof.

The key is the add-and-subtract trick. Write

f(a+h)\,g(a+h) - f(a)\,g(a) = f(a+h)\,g(a+h) - f(a+h)\,g(a) + f(a+h)\,g(a) - f(a)\,g(a).

Factor:

= f(a+h)\bigl[g(a+h) - g(a)\bigr] + g(a)\bigl[f(a+h) - f(a)\bigr].

Divide by h:

\frac{(fg)(a+h) - (fg)(a)}{h} = f(a+h) \cdot \frac{g(a+h) - g(a)}{h} + g(a) \cdot \frac{f(a+h) - f(a)}{h}.

As h → 0: the first term converges to f(a) · g'(a) (using Theorem 1: f is differentiable hence continuous, so f(a+h) → f(a)), and the second term converges to g(a) · f'(a). By the algebra of limits, the sum converges to f(a) g'(a) + g(a) f'(a). ∎
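The product rule admits a quick numerical sanity check. A minimal sketch in plain Python, using a symmetric difference quotient; f = sin and g = exp are arbitrary smooth test functions:

```python
import math

def central_diff(f, a, h=1e-6):
    """Symmetric difference quotient: approximates f'(a) to O(h^2)."""
    return (f(a + h) - f(a - h)) / (2 * h)

f, fp = math.sin, math.cos   # f and its known derivative
g, gp = math.exp, math.exp   # g and its known derivative
a = 0.7

lhs = central_diff(lambda x: f(x) * g(x), a)   # (fg)'(a), numerically
rhs = fp(a) * g(a) + f(a) * gp(a)              # f'(a) g(a) + f(a) g'(a)
assert abs(lhs - rhs) < 1e-8                   # the two sides agree
```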

🔷 Theorem 5 (Quotient Rule)

If f and g are differentiable at a and g(a) ≠ 0, then f/g is differentiable at a and

\left(\frac{f}{g}\right)'(a) = \frac{f'(a)\,g(a) - f(a)\,g'(a)}{g(a)^2}.

Proof.

Write the difference quotient:

\frac{\frac{f(a+h)}{g(a+h)} - \frac{f(a)}{g(a)}}{h} = \frac{f(a+h)\,g(a) - f(a)\,g(a+h)}{h \cdot g(a+h) \cdot g(a)}.

Add and subtract f(a) g(a) in the numerator:

= \frac{[f(a+h) - f(a)]\,g(a) - f(a)\,[g(a+h) - g(a)]}{h \cdot g(a+h) \cdot g(a)}.

Split the fraction and take the limit. The numerator converges to f'(a) g(a) − f(a) g'(a). The denominator converges to g(a)² (since g is continuous at a by Theorem 1, so g(a+h) → g(a), and g(a) ≠ 0 ensures the limit is well-defined). ∎

📝 Example 4 (The power rule via the binomial theorem)

For f(x) = xⁿ with n a positive integer:

\frac{(a+h)^n - a^n}{h} = \frac{1}{h}\sum_{k=0}^{n}\binom{n}{k}a^{n-k}h^k - \frac{a^n}{h} = \sum_{k=1}^{n}\binom{n}{k}a^{n-k}h^{k-1}.

As h → 0, every term with k ≥ 2 vanishes (it contains h to a positive power), leaving only the k = 1 term:

f'(a) = \binom{n}{1}a^{n-1} = n\,a^{n-1}.

This is the power rule: d/dx x^n = n x^{n-1}.

Differentiation rules: product, quotient, and power rule

The Chain Rule

The chain rule is the most important theorem in this topic — and arguably the most consequential single result for machine learning. It tells us how to differentiate a composition f ∘ g, and the answer is beautifully simple: the derivative of the composition is the product of the derivatives.

Geometric intuition

Think of g as a function that stretches the input near a by a factor of g'(a), and f as a function that stretches the input near g(a) by a factor of f'(g(a)). When we compose them — first g stretches, then f stretches — the total stretching is the product: f'(g(a)) · g'(a).

This is exactly what “the derivative of a composition is the product of the derivatives” means geometrically: each function contributes its own local stretching factor, and stretching factors multiply.

🔷 Theorem 6 (The Chain Rule)

If g is differentiable at a and f is differentiable at g(a), then f ∘ g is differentiable at a and

(f \circ g)'(a) = f'(g(a)) \cdot g'(a).

Proof.

We use the linear approximation characterization (Proposition 1). Since f is differentiable at g(a), we can write

f(g(a) + k) = f(g(a)) + f'(g(a)) \cdot k + \varphi(k) \cdot k,

where φ(k) → 0 as k → 0 (and we define φ(0) = 0).

The subtle point: A naive proof would divide by g(a+h) − g(a) and recombine, but this fails when g(a+h) = g(a) for h ≠ 0 (which can happen — for example, g(x) = x² sin(1/x) with g(0) = 0). The linear approximation approach avoids division entirely.

Set k = g(a+h) − g(a). Then

f(g(a+h)) = f(g(a) + k) = f(g(a)) + f'(g(a)) \cdot k + \varphi(k) \cdot k.

Subtract f(g(a)) and divide by h:

\frac{f(g(a+h)) - f(g(a))}{h} = f'(g(a)) \cdot \frac{k}{h} + \varphi(k) \cdot \frac{k}{h} = \bigl[f'(g(a)) + \varphi(k)\bigr] \cdot \frac{g(a+h) - g(a)}{h}.

As h → 0: the factor (g(a+h) − g(a))/h → g'(a) by differentiability of g, and k = g(a+h) − g(a) → 0 (by continuity of g, which follows from differentiability via Theorem 1), so φ(k) → 0. By the algebra of limits:

(f \circ g)'(a) = \bigl[f'(g(a)) + 0\bigr] \cdot g'(a) = f'(g(a)) \cdot g'(a). \quad \text{∎}

📝 Example 5 (Chain rule: d/dx sin(x²))

Let g(x) = x² and f(u) = sin(u), so f(g(x)) = sin(x²).

  • Inner derivative: g'(x) = 2x
  • Outer derivative: f'(u) = cos(u), evaluated at u = g(x) = x²: f'(g(x)) = cos(x²)
  • Chain rule: (f ∘ g)'(x) = cos(x²) · 2x

At x = a: d/dx sin(x²)|_{x=a} = 2a cos(a²).

📝 Example 6 (Three-layer chain rule: d/dx e^{sin(x²)})

This is a composition of three functions: h₁(x) = x², h₂(u) = sin(u), h₃(v) = e^v. The chain rule applies iteratively:

\frac{d}{dx}e^{\sin(x^2)} = e^{\sin(x^2)} \cdot \cos(x^2) \cdot 2x.

Three derivatives, multiplied in order from outermost to innermost. This is a three-layer “network” — the chain rule applies at each layer. In a neural network with L layers, we would multiply L local derivatives.
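The three-factor product can be verified numerically. A minimal sketch in plain Python, comparing the chain-rule derivative of e^{sin(x²)} against a symmetric difference quotient:

```python
import math

def F(x):
    """The three-layer composition: x -> x^2 -> sin -> exp."""
    return math.exp(math.sin(x * x))

def F_prime(x):
    """Chain rule: product of the three local derivatives, outermost first."""
    return math.exp(math.sin(x * x)) * math.cos(x * x) * 2 * x

a, h = 1.3, 1e-6
numeric = (F(a + h) - F(a - h)) / (2 * h)
assert abs(numeric - F_prime(a)) < 1e-6
```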

💡 Remark 3 (The chain rule as composition of linear maps)

Each derivative f'(a) is a linear map ℝ → ℝ (multiplication by f'(a)). The chain rule says: the derivative of a composition is the composition of the derivatives. In one dimension, composing two linear maps means multiplying two scalars: f'(g(a)) · g'(a).

In ℝⁿ, “multiplication” becomes matrix multiplication:

J_{f \circ g}(a) = J_f(g(a)) \cdot J_g(a),

where J_f and J_g are Jacobian matrices. The chain rule is functorial — it respects the compositional structure of functions. This perspective is the starting point for differential geometry, where the derivative is the pushforward map between tangent spaces. (→ formalML: Smooth Manifolds)

Explore the chain rule visually below — the left panel shows the inner function g with its tangent, and the right panel shows the outer function f with its tangent at u = g(x). The derivative readout shows the product of local derivatives.


Chain rule: inner and outer functions with derivative multiplication

Higher-Order Derivatives

If f' is itself a function, we can ask: is f' differentiable? If so, its derivative is the second derivative of f.

📐 Definition 3 (Higher-Order Derivatives)

The second derivative of f at a is f''(a) = (f')'(a), provided f' is differentiable at a. Inductively, the n-th derivative is f⁽ⁿ⁾(a) = (f⁽ⁿ⁻¹⁾)'(a).

Notation: f'' or d²f/dx² for the second derivative; f⁽ⁿ⁾ or dⁿf/dxⁿ for the n-th derivative.

💡 Remark 4 (Concavity and the second derivative)

The sign of f''(a) determines the concavity of f at a:

  • f''(a) > 0: f is concave up at a — the graph curves upward, and the tangent line lies below the graph. Together with f'(a) = 0, this is the 1D second-derivative condition for a local minimum.
  • f''(a) < 0: f is concave down at a — the graph curves downward, and the tangent line lies above the graph. Together with f'(a) = 0, this is the 1D condition for a local maximum.

In higher dimensions, the second derivative becomes the Hessian matrix H_f(a), and “concave up” becomes “positive definite” — the multi-dimensional criterion for a local minimum. This is the foundation of Newton’s method and second-order optimization. (The Hessian & Second-Order Analysis (coming soon))
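Remark 4 can be checked numerically for the running example f(x) = x³ − 3x. A minimal sketch in plain Python using the standard central second-difference formula (the step h = 1e-4 is an arbitrary choice):

```python
def second_diff(f, x, h=1e-4):
    """Central approximation to f''(x): (f(x+h) - 2 f(x) + f(x-h)) / h^2."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

f = lambda x: x**3 - 3 * x   # f'(x) = 3x^2 - 3, f''(x) = 6x

# critical points at x = -1 and x = +1: the sign of f'' classifies them
assert second_diff(f, -1.0) < 0   # concave down: local maximum at x = -1
assert second_diff(f, 1.0) > 0    # concave up: local minimum at x = +1
```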

Higher-order derivatives: f, f', f'' for x³ - 3x

Non-Differentiable Functions

Differentiability is a strong condition — stronger than continuity. There are several ways it can fail:

1. Corners. The function f(x) = |x| is continuous everywhere but has a corner at x = 0 where the left and right derivatives disagree (−1 and +1). This is exactly the non-differentiability of ReLU at zero.

2. Cusps. The function f(x) = |x|^{2/3} is continuous at 0, but the difference quotient |h|^{2/3}/h has absolute value |h|^{−1/3} → ∞ as h → 0, with opposite signs from the two sides. The curve comes to a sharp point where the “tangent line” would be vertical — but vertical lines have undefined slope.

3. Vertical tangents. The function f(x) = x^{1/3} at x = 0 has a difference quotient h^{1/3}/h = h^{−2/3} → +∞. The tangent exists visually (it’s the y-axis), but its slope is infinite, so the derivative does not exist as a real number.

4. Everywhere non-differentiable. The Weierstrass function

W(x) = \sum_{n=0}^{\infty} a^n \cos(b^n \pi x), \qquad 0 < a < 1, \quad ab > 1,

is continuous everywhere (by the Weierstrass M-test, which we proved in Uniform Convergence) but differentiable nowhere. The partial sums are smooth, but each additional term adds higher-frequency oscillations. No matter how far you zoom in, the graph never smooths out — there is no tangent line at any point.

💡 Remark 5 (Non-differentiability in ML)

ReLU(x) = max(0, x) has a corner at x = 0 — the same kind of non-differentiability as |x|. In practice, ML frameworks assign a conventional “derivative” at the kink (0 or 1), and this works because:

  1. The set of inputs landing exactly on the kink has Lebesgue measure zero — almost every input avoids it. (Sigma-Algebras & Measures (coming soon))
  2. The difference quotient from either side is well-defined (it’s 0 or 1), so the “gradient” used in practice is a one-sided derivative — which is fine for optimization.

The Weierstrass function shows that pathology can be extreme — continuous everywhere, differentiable nowhere — but such functions don’t arise as loss landscapes in practice. Real neural network losses are piecewise smooth (smooth between the ReLU kinks), and that’s smooth enough for gradient descent.

Four types of non-differentiability: corner, cusp, vertical tangent, Weierstrass

Connections to ML — Backpropagation & Automatic Differentiation

This is where the derivative and chain rule connect directly to modern machine learning practice. Backpropagation is not a separate algorithm — it is the chain rule, applied systematically to a computation graph.

Backpropagation is the chain rule

Consider a minimal network: input x, weight w, bias b, activation σ, target y, and loss L:

z = wx + b, \qquad a = \sigma(z), \qquad L = (a - y)^2.

The forward pass computes z, then a, then L — evaluating the composition left to right. The backward pass computes gradients right to left using the chain rule:

\frac{\partial L}{\partial w} = \underbrace{\frac{\partial L}{\partial a}}_{= 2(a-y)} \cdot \underbrace{\frac{\partial a}{\partial z}}_{= \sigma'(z)} \cdot \underbrace{\frac{\partial z}{\partial w}}_{= x}.

Each factor is a local derivative — the derivative of one node with respect to its immediate input. The chain rule says the total derivative is the product of the locals. This is exactly Theorem 6 applied to the composition L = ℓ(σ(wx + b)).
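The forward and backward passes above fit in a few lines. A minimal sketch in plain Python; the values w = 0.5, b = 0.2, x = 3.0, y = 1.0 are illustrative choices, not taken from the text:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, x, y = 0.5, 0.2, 3.0, 1.0   # illustrative parameter and data values

# forward pass: evaluate the composition left to right
z = w * x + b
a = sigmoid(z)
L = (a - y) ** 2

# backward pass: multiply local derivatives right to left (chain rule)
dL_da = 2 * (a - y)
da_dz = a * (1 - a)        # sigma'(z) = sigma(z) (1 - sigma(z))
dz_dw, dz_db = x, 1.0
dL_dw = dL_da * da_dz * dz_dw
dL_db = dL_da * da_dz * dz_db
```

Note that both gradients share the prefix dL_da * da_dz; reverse mode computes that shared product once and reuses it for every parameter.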

Automatic differentiation

Forward mode (tangent propagation): set ẋ = 1 and propagate forward: ż = w, ȧ = σ'(z) · ż, L̇ = 2(a − y) · ȧ. Cost: one forward sweep per input variable.

Reverse mode (backpropagation): set L̄ = 1 and propagate backward: ā = 2(a − y), z̄ = σ'(z) · ā, w̄ = x · z̄. Cost: one backward sweep for all parameters.

For a network with millions of parameters and a scalar loss, reverse mode wins by a factor of millions — this is why backpropagation, not forward-mode AD, is the standard in deep learning.
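Forward mode can be mechanized with dual numbers: every quantity carries a value together with its derivative, and each arithmetic operation propagates both. A minimal sketch in plain Python; only + and * are implemented, which is enough to differentiate polynomials:

```python
class Dual:
    """A (value, derivative) pair for forward-mode automatic differentiation."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def _coerce(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._coerce(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = self._coerce(other)
        # the derivative component follows the product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

x = Dual(3.0, 1.0)        # seed the tangent: dx/dx = 1
y = x * x + 2 * x + 1     # y = x^2 + 2x + 1
# y.val == 16.0 and y.dot == 8.0, matching dy/dx = 2x + 2 at x = 3
```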

Common activation derivatives

Activation | σ(z)           | σ'(z)
Sigmoid    | 1/(1 + e⁻ᶻ)    | σ(z)(1 − σ(z))
Tanh       | tanh(z)        | 1 − tanh²(z)
ReLU       | max(0, z)      | 1_{z > 0}

The sigmoid and tanh derivatives both compress toward zero for large |z| — this is the vanishing gradient problem. ReLU avoids it (gradient is 1 for z > 0) at the cost of the non-differentiability at z = 0. (→ formalML: Gradient Descent)
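The table's derivative formulas, sketched in plain Python; the ReLU convention at the kink z = 0 follows Remark 5, taking the value 0 there:

```python
import math

def sigmoid(z):   return 1.0 / (1.0 + math.exp(-z))
def d_sigmoid(z): return sigmoid(z) * (1.0 - sigmoid(z))
def d_tanh(z):    return 1.0 - math.tanh(z) ** 2
def d_relu(z):    return 1.0 if z > 0 else 0.0   # convention at the kink z = 0

# saturating activations: derivatives vanish for large |z| (vanishing gradients)
assert d_sigmoid(10.0) < 1e-4 and d_tanh(10.0) < 1e-4
# ReLU keeps gradient 1 everywhere on the active side
assert d_relu(10.0) == 1.0
```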

Explore the computation graph below — toggle between forward and backward pass to see how values and gradients flow through the network.


Computation graph with forward and backward pass

Computational Notes

When we don’t have a closed-form derivative, we can approximate f'(x) numerically using finite differences:

Forward difference: (f(x+h) − f(x))/h = f'(x) + O(h). Truncation error is O(h), but catastrophic cancellation occurs at very small h (around h ≈ 10⁻⁸ for double precision).

Central difference: (f(x+h) − f(x−h))/(2h) = f'(x) + O(h²). The odd-order error terms cancel, giving O(h²) accuracy — a quadratic improvement over forward differences. Optimal h ≈ 10⁻⁵.

Richardson extrapolation: combine central differences at h and h/2 to cancel the leading error term: (4d(h/2) − d(h))/3 = f'(x) + O(h⁴). Each level of extrapolation eliminates one more error term.

The tradeoff: smaller h reduces truncation error but increases floating-point cancellation error. There is an optimal h that balances both — roughly h* ≈ ε_mach^{1/(p+1)} for a method of order p, where ε_mach ≈ 2.2 × 10⁻¹⁶ for double precision.
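The three schemes can be compared directly. A minimal sketch in plain Python, measuring each method's error against the known derivative cos(1) of sin at x = 1; the test function and step h = 1e-3 are arbitrary choices:

```python
import math

def forward(f, x, h):
    return (f(x + h) - f(x)) / h                   # O(h) truncation error

def central(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)         # O(h^2)

def richardson(f, x, h):
    # combine central differences at h and h/2 to cancel the O(h^2) term
    return (4 * central(f, x, h / 2) - central(f, x, h)) / 3   # O(h^4)

x, h = 1.0, 1e-3
exact = math.cos(x)
for name, method in [("forward", forward), ("central", central),
                     ("richardson", richardson)]:
    print(f"{name:10s} error = {abs(method(math.sin, x, h) - exact):.2e}")
```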

Numerical differentiation: error vs h for forward, central, and Richardson

What Comes Next

The derivative gives us rates of change and the chain rule gives us gradients through compositions. Here is where these ideas lead:

Within formalCalculus:

  • Mean Value Theorem & Taylor Expansion — If f is differentiable on [a, b], there exists c ∈ (a, b) with f'(c) = (f(b) − f(a))/(b − a). Taylor’s theorem extends linear approximation to polynomial approximation of arbitrary degree.
  • The Riemann Integral & FTC — The Fundamental Theorem of Calculus ties differentiation and integration: if F'(x) = f(x), then ∫ₐᵇ f(x) dx = F(b) − F(a).
  • Partial Derivatives & the Gradient — The gradient ∇f = (∂f/∂x₁, …, ∂f/∂xₙ) generalizes the single-variable derivative to ℝⁿ.
  • The Jacobian & Multivariate Chain Rule — The Jacobian matrix generalizes the scalar derivative, and the multivariate chain rule J_{f∘g} = J_f · J_g is what backpropagation actually computes in multi-dimensional networks.

Forward to formalML:

  • Gradient Descent — The derivative is the engine of optimization. Single-variable gradient descent θ_{t+1} = θ_t − η f'(θ_t) is the prototype for SGD, Adam, and all gradient methods.
  • Shannon Entropy — Derivatives of entropy H(p) = −Σ pᵢ log pᵢ and cross-entropy loss drive information-theoretic ML objectives.
  • Smooth Manifolds — The derivative as a linear map between tangent spaces. The chain rule becomes functoriality of the tangent functor.

References

  1. Abbott (2015). Understanding Analysis. Chapter 5 develops the derivative from the limit definition through the Mean Value Theorem — the primary reference for our rigorous-but-accessible approach.
  2. Rudin (1976). Principles of Mathematical Analysis. Chapter 5 on differentiation — the compact, definitive treatment of single-variable derivatives.
  3. Spivak (2008). Calculus. Chapters 9–11 develop differentiation with unusual care for geometric intuition alongside full rigor — an exceptional treatment of the chain rule proof.
  4. Goodfellow, Bengio & Courville (2016). Deep Learning. Section 6.5 on back-propagation — the chain rule as the computational engine of deep learning.
  5. Baydin, Pearlmutter, Radul & Siskind (2018). “Automatic Differentiation in Machine Learning: a Survey.” Comprehensive survey of forward-mode and reverse-mode AD — mechanizing the chain rule for gradient computation.