Single-Variable Calculus · intermediate · 55 min read

Mean Value Theorem & Taylor Expansion

Local approximation theory — the theorems that connect a function's derivatives to its global behavior, and the polynomial approximations that power convergence analysis in optimization

Abstract. The Mean Value Theorem says that if f is continuous on [a,b] and differentiable on (a,b), there exists c ∈ (a,b) where f'(c) = [f(b) - f(a)]/(b - a) — the instantaneous rate of change at some interior point equals the average rate of change over the whole interval. Geometrically, this means every secant line has a parallel tangent line somewhere between the endpoints. This single theorem has sweeping consequences: it proves that functions with zero derivative are constant, that the sign of f' determines monotonicity, and that bounded derivatives imply Lipschitz continuity. The proof goes through Rolle's theorem — the special case where f(a) = f(b), so the guaranteed interior point is a local extremum with f'(c) = 0 — which itself depends on the Extreme Value Theorem from compactness. Taylor's theorem extends the Mean Value Theorem from first-order to arbitrary-order approximation: near any point a, a sufficiently smooth function is approximated by its degree-n Taylor polynomial Tₙ(x) = Σ f⁽ᵏ⁾(a)/k! (x - a)ᵏ, with the Lagrange remainder Rₙ = f⁽ⁿ⁺¹⁾(c)/(n+1)! (x - a)ⁿ⁺¹ providing a quantitative error bound. Taylor expansion is the analytical engine behind convergence rate proofs in optimization: the descent lemma f(y) ≤ f(x) + ∇f(x)ᵀ(y - x) + (L/2)‖y - x‖² is a second-order Taylor bound, and Newton's method achieves quadratic convergence by optimizing the local quadratic Taylor model at each step.

Where this leads → formalML

  • formalML The descent lemma — the cornerstone inequality in gradient descent convergence proofs — is a direct consequence of second-order Taylor expansion with Lipschitz gradient. Newton's method replaces f with its local quadratic Taylor model at each step, achieving quadratic convergence where GD achieves only linear.
  • formalML Proximal operators minimize f(y) + (1/2t)‖y - x‖² — the function plus a quadratic penalty anchored at x, the same local quadratic structure that Taylor models exploit. The Moreau envelope is built from this Taylor-inspired quadratic regularization.
  • formalML Taylor expansion on manifolds generalizes via the exponential map. The MVT extends to curves on manifolds, and Taylor approximation in local coordinates is the foundation for Riemannian optimization.

Overview & Motivation

You’re training a neural network and observe that the loss drops from $L(\theta_0) = 2.4$ to $L(\theta_{100}) = 0.03$ over 100 gradient descent steps. Can you guarantee that at some point during training, the loss was decreasing at a rate of exactly $\frac{0.03 - 2.4}{100} = -0.0237$ per step? The Mean Value Theorem says yes — and this kind of “existence of an intermediate rate” argument is exactly what convergence proofs exploit.

Moreover, how well can you approximate the loss near the current parameter? The tangent line approximation from The Derivative & Chain Rule gives a first-order answer: $L(\theta + h) \approx L(\theta) + L'(\theta) h$. But this approximation degrades as $h$ grows. Taylor expansion provides higher-order corrections — the second-order approximation $L(\theta + h) \approx L(\theta) + L'(\theta) h + \frac{1}{2} L''(\theta) h^2$ is what Newton’s method uses, and the quality of these approximations determines whether your optimizer converges linearly or quadratically.

This topic develops both ideas rigorously. We start with Rolle’s theorem and the Mean Value Theorem — the existence results that connect local derivatives to global behavior — then build Taylor’s theorem, which extends linear approximation to polynomial approximation of arbitrary degree. The MVT earns its central place through three sweeping consequences: zero derivative implies constant, derivative sign determines monotonicity, and bounded derivatives imply Lipschitz continuity. Taylor’s theorem earns its keep by powering every convergence rate proof in optimization.

Prerequisites: We use the derivative definition and differentiation rules from The Derivative & Chain Rule, and the Extreme Value Theorem from Completeness & Compactness — specifically, that a continuous function on a closed bounded interval $[a, b]$ achieves its maximum and minimum.

Rolle’s Theorem

The geometric picture

Imagine a continuous curve that starts and ends at the same height — say, you drive from home, travel some distance, and return to the same elevation. At some point during the trip, your altitude was neither increasing nor decreasing: you were at a turning point. In the language of calculus, there must be at least one point where the tangent to the curve is horizontal.

This is Rolle’s theorem in a sentence: if $f(a) = f(b)$ and $f$ is smooth enough in between, then $f'(c) = 0$ for some interior point $c$.

🔷 Theorem 1 (Rolle's Theorem)

Let $f$ be continuous on $[a, b]$ and differentiable on $(a, b)$, with $f(a) = f(b)$. Then there exists $c \in (a, b)$ such that $f'(c) = 0$.

Proof.

By the Extreme Value Theorem — which we proved in Completeness & Compactness using the Heine-Borel theorem — the continuous function $f$ on the compact set $[a, b]$ achieves its maximum value $M$ and its minimum value $m$.

Case 1: If $M = m$, then $f$ is constant on $[a, b]$, so $f'(c) = 0$ for every $c \in (a, b)$.

Case 2: If $M \neq m$, then since $f(a) = f(b)$, at least one of $M$ or $m$ is achieved at an interior point $c \in (a, b)$ (both endpoints have the same value, so if the max and the min were both attained at endpoints, we’d have $M = f(a) = f(b) = m$, contradicting $M \neq m$).

Without loss of generality, suppose $f(c) = M$ with $c \in (a, b)$. Since $c$ is a local maximum and $f$ is differentiable at $c$, we have:

  • For $h > 0$ sufficiently small: $\frac{f(c + h) - f(c)}{h} \leq 0$ (since $f(c + h) \leq f(c) = M$), so $f'(c) \leq 0$.
  • For $h < 0$ sufficiently small: $\frac{f(c + h) - f(c)}{h} \geq 0$ (since $f(c + h) \leq f(c)$ and $h < 0$), so $f'(c) \geq 0$.

Together: $f'(c) = 0$.

📝 Example 1 (Rolle's on f(x) = x² − 4x + 3 on [1, 3])

We verify: $f(1) = 1 - 4 + 3 = 0$ and $f(3) = 9 - 12 + 3 = 0$, so $f(a) = f(b)$. The derivative is $f'(x) = 2x - 4$, which vanishes at $c = 2 \in (1, 3)$. Geometrically, the parabola dips below the $x$-axis and turns around at its vertex $x = 2$.
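This verification is easy to replicate numerically. A minimal sketch (the function and interval come from the example; the grid resolution is an arbitrary implementation choice):

```python
# Verify Rolle's hypotheses and conclusion for f(x) = x^2 - 4x + 3 on [1, 3].
f = lambda x: x**2 - 4*x + 3
df = lambda x: 2*x - 4          # exact derivative

a, b = 1.0, 3.0
assert f(a) == f(b) == 0.0      # equal endpoints, so Rolle's theorem applies

# Scan the open interval for the point where f' is closest to zero.
candidates = [a + k * (b - a) / 1000 for k in range(1, 1000)]
c = min(candidates, key=lambda x: abs(df(x)))
print(c, df(c))                 # the grid contains c = 2 exactly, where f'(c) = 0
```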

💡 Remark 1 (Why all three hypotheses matter)

Each hypothesis in Rolle’s theorem is essential — removing any one produces a counterexample:

(i) Not continuous: Define $f$ on $[0, 1]$ by $f(x) = x$ for $0 \leq x < 1$ and $f(1) = 0$. Then $f(0) = f(1) = 0$, and $f$ is differentiable on $(0, 1)$ with $f'(x) = 1$ for all $x \in (0, 1)$ — so there is no $c$ with $f'(c) = 0$. The discontinuity at $x = 1$ breaks Rolle’s conclusion.

(ii) Not differentiable: $f(x) = |x|$ on $[-1, 1]$ has $f(-1) = f(1) = 1$, but $f'(0)$ does not exist (the left and right derivatives disagree — we saw this in The Derivative & Chain Rule, Example 3). The minimum of $|x|$ is at $x = 0$, exactly where differentiability fails.

(iii) Not $f(a) = f(b)$: The function $f(x) = x$ on $[0, 1]$ has $f'(x) = 1 \neq 0$ everywhere. Without the equal-endpoint condition, there’s no reason to expect a horizontal tangent.

Try toggling “Break hypotheses” in the explorer below to see each failure mode.


Rolle's theorem on three different functions, showing f(a) = f(b) and the horizontal tangent at c

The Mean Value Theorem

From horizontal to parallel

Rolle’s theorem requires $f(a) = f(b)$ — a strong condition. The Mean Value Theorem removes it. The key insight is geometric: tilt your head until the secant line connecting $(a, f(a))$ and $(b, f(b))$ is horizontal, and you’re back to Rolle’s setup. More precisely, subtract the secant line from $f$, and the resulting function has equal endpoints.

🔷 Theorem 2 (The Mean Value Theorem)

Let $f$ be continuous on $[a, b]$ and differentiable on $(a, b)$. Then there exists $c \in (a, b)$ such that

$$f'(c) = \frac{f(b) - f(a)}{b - a}.$$

In words: the instantaneous rate of change at some interior point $c$ equals the average rate of change over $[a, b]$.

Proof.

Define the auxiliary function

$$g(x) = f(x) - \frac{f(b) - f(a)}{b - a}(x - a).$$

This is $f$ minus the secant line through $(a, f(a))$ and $(b, f(b))$.

Check the hypotheses of Rolle’s theorem:

  • $g$ is continuous on $[a, b]$ (difference of continuous functions).
  • $g$ is differentiable on $(a, b)$ (difference of differentiable functions).
  • $g(a) = f(a) - 0 = f(a)$ and $g(b) = f(b) - \frac{f(b) - f(a)}{b - a}(b - a) = f(b) - (f(b) - f(a)) = f(a)$, so $g(a) = g(b)$.

By Rolle’s theorem, there exists $c \in (a, b)$ with $g'(c) = 0$. Since

$$g'(x) = f'(x) - \frac{f(b) - f(a)}{b - a},$$

we get $f'(c) = \frac{f(b) - f(a)}{b - a}$.

The proof is a one-line reduction to Rolle’s theorem — but it’s the choice of the auxiliary function $g$ that makes it work. The idea is to subtract exactly the right linear function so that the endpoints align.

📝 Example 2 (MVT on f(x) = x³ on [0, 2])

The secant slope is $\frac{f(2) - f(0)}{2 - 0} = \frac{8}{2} = 4$. We need $f'(c) = 3c^2 = 4$, so $c = \frac{2}{\sqrt{3}} \approx 1.155 \in (0, 2)$.
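A quick numeric check of this example, comparing the secant slope against the derivative at the claimed MVT point:

```python
import math

# MVT on f(x) = x^3 over [0, 2]: secant slope equals f'(c) at c = 2/sqrt(3).
f = lambda x: x**3
df = lambda x: 3*x**2

a, b = 0.0, 2.0
secant = (f(b) - f(a)) / (b - a)   # = 4.0
c = 2 / math.sqrt(3)               # the MVT point computed above
print(secant, df(c))               # both equal 4 up to floating-point rounding
assert abs(df(c) - secant) < 1e-12
assert a < c < b
```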

Drag the $a$ and $b$ sliders in the explorer below to see how the secant line and the parallel tangent change for different intervals.


The Mean Value Theorem: secant line and parallel tangent for two functions

Consequences of the Mean Value Theorem

The MVT is the bridge between local information (the derivative at a point) and global behavior (the function’s values over an interval). Three consequences demonstrate its power — and each one is a workhorse in analysis and optimization.

🔷 Corollary 1 (Zero Derivative Implies Constant)

If $f$ is differentiable on $(a, b)$ and $f'(x) = 0$ for all $x \in (a, b)$, then $f$ is constant on $(a, b)$.

Proof.

For any $x_1, x_2 \in (a, b)$ with $x_1 < x_2$, the MVT gives

$$f(x_2) - f(x_1) = f'(c)(x_2 - x_1) = 0 \cdot (x_2 - x_1) = 0$$

for some $c \in (x_1, x_2)$. Hence $f(x_1) = f(x_2)$ for all $x_1, x_2 \in (a, b)$.

🔷 Corollary 2 (Monotonicity from Derivative Sign)

Let $f$ be differentiable on $(a, b)$.

  • If $f'(x) > 0$ for all $x \in (a, b)$, then $f$ is strictly increasing on $(a, b)$.
  • If $f'(x) < 0$ for all $x \in (a, b)$, then $f$ is strictly decreasing on $(a, b)$.

Proof.

Suppose $f'(x) > 0$ on $(a, b)$ and take $x_1 < x_2$ in $(a, b)$. By the MVT, there exists $c \in (x_1, x_2)$ with

$$f(x_2) - f(x_1) = f'(c)(x_2 - x_1).$$

Since $f'(c) > 0$ and $x_2 - x_1 > 0$, we get $f(x_2) - f(x_1) > 0$, i.e., $f(x_1) < f(x_2)$. The decreasing case is symmetric.

🔷 Corollary 3 (Bounded Derivative Implies Lipschitz)

If $f$ is differentiable on $(a, b)$ and $|f'(x)| \leq M$ for all $x \in (a, b)$, then

$$|f(x) - f(y)| \leq M|x - y| \quad \text{for all } x, y \in (a, b).$$

Proof.

By the MVT, for any $x \neq y$ in $(a, b)$ there is a $c$ between them with $f(x) - f(y) = f'(c)(x - y)$, so $|f(x) - f(y)| = |f'(c)| \cdot |x - y| \leq M|x - y|$.

💡 Remark 2 (Lipschitz continuity and ML)

The Lipschitz constant $M$ bounds how fast $f$ can change. In machine learning, the Lipschitz gradient condition $\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|$ is the single most important regularity assumption in convergence proofs for gradient descent. It comes directly from applying the MVT (or its multivariable analog) to $\nabla f$: if $\nabla^2 f$ exists and $\|\nabla^2 f\| \leq L$, then the gradient is $L$-Lipschitz. This $L$ determines the maximum safe learning rate $\eta \leq 1/L$ and the convergence rate $O(1/k)$ for gradient descent on convex functions. (→ formalML: Gradient Descent)
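Corollary 3 can be sanity-checked numerically. A minimal sketch for $f(x) = \sin x$, where $|f'(x)| = |\cos x| \leq 1$ gives Lipschitz constant $M = 1$ (the interval and grid size are arbitrary choices):

```python
import itertools
import math

# sin is 1-Lipschitz by Corollary 3: |sin'(x)| = |cos(x)| <= 1 everywhere.
f = math.sin
M = 1.0

xs = [-2 + 4 * k / 200 for k in range(201)]        # grid on [-2, 2]
worst = max(abs(f(x) - f(y)) / abs(x - y)
            for x, y in itertools.combinations(xs, 2))
print(worst)            # largest observed slope ratio; never exceeds M = 1
assert worst <= M
```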

Consequences of MVT: zero derivative implies constant, positive derivative implies increasing, bounded derivative implies Lipschitz

Cauchy’s Mean Value Theorem and L’Hôpital’s Rule

The standard MVT compares $f$ to the identity function $g(x) = x$. Cauchy’s generalization compares two functions simultaneously.

🔷 Theorem 3 (Cauchy's Mean Value Theorem)

Let $f$ and $g$ be continuous on $[a, b]$ and differentiable on $(a, b)$, with $g'(x) \neq 0$ for all $x \in (a, b)$. Then there exists $c \in (a, b)$ such that

$$\frac{f'(c)}{g'(c)} = \frac{f(b) - f(a)}{g(b) - g(a)}.$$

Proof.

First note that $g(b) \neq g(a)$ (otherwise, Rolle’s theorem applied to $g$ would give $g'(c) = 0$ for some $c$, contradicting our hypothesis). Define

$$h(x) = f(x)(g(b) - g(a)) - g(x)(f(b) - f(a)).$$

Then $h$ is continuous on $[a, b]$, differentiable on $(a, b)$, and $h(a) = f(a)g(b) - g(a)f(b) = h(b)$. By Rolle’s theorem, there exists $c \in (a, b)$ with $h'(c) = 0$:

$$f'(c)(g(b) - g(a)) = g'(c)(f(b) - f(a)).$$

Since $g'(c) \neq 0$ and $g(b) - g(a) \neq 0$, dividing gives the result.

The standard MVT is the special case $g(x) = x$. The power of Cauchy’s version is that it lets us evaluate limits of ratios — which brings us to L’Hôpital’s Rule.

🔷 Theorem 4 (L'Hôpital's Rule (0/0 form))

Suppose $f$ and $g$ are differentiable near $a$ (except possibly at $a$), with $g'(x) \neq 0$ near $a$. If

$$\lim_{x \to a} f(x) = 0 \quad \text{and} \quad \lim_{x \to a} g(x) = 0,$$

and the limit $\lim_{x \to a} \frac{f'(x)}{g'(x)} = L$ exists (finite or $\pm\infty$), then

$$\lim_{x \to a} \frac{f(x)}{g(x)} = L.$$

Proof.

Define $f(a) = g(a) = 0$ so that $f$ and $g$ are continuous at $a$. For $x$ near $a$ with $x \neq a$, Cauchy’s MVT on $[a, x]$ (or $[x, a]$) gives a point $c_x$ between $a$ and $x$ with

$$\frac{f(x)}{g(x)} = \frac{f(x) - f(a)}{g(x) - g(a)} = \frac{f'(c_x)}{g'(c_x)}.$$

As $x \to a$, $c_x \to a$ (squeezed between $x$ and $a$), so $\frac{f'(c_x)}{g'(c_x)} \to L$.

📝 Example 3 (L'Hôpital's applied: lim sin(x)/x)

$\lim_{x \to 0} \frac{\sin x}{x}$ is the $0/0$ form. Applying L’Hôpital’s:

$$\lim_{x \to 0} \frac{\sin x}{x} = \lim_{x \to 0} \frac{\cos x}{1} = 1.$$

We previously established this limit geometrically (a squeeze argument involving areas). Now we have a second, purely analytical proof via Cauchy’s MVT. Both proofs are valid, though the geometric argument remains the more fundamental one: the standard derivation of $(\sin x)' = \cos x$ itself uses this limit, so the L’Hôpital’s argument is only non-circular once the derivative of $\sin$ has been established independently.

💡 Remark 3 (L'Hôpital's caveats)

L’Hôpital’s Rule is a sufficient condition, not a necessary one. The limit $\lim_{x \to 0} \frac{x^2 \sin(1/x)}{x} = \lim_{x \to 0} x \sin(1/x) = 0$ exists, but $\lim_{x \to 0} \frac{2x \sin(1/x) - \cos(1/x)}{1}$ does not exist (the $\cos(1/x)$ term oscillates). So L’Hôpital’s fails to apply, yet the original limit exists.

The rule also applies to the $\infty/\infty$ form, to limits as $x \to \infty$, and (via algebraic rearrangement) to $0 \cdot \infty$ and $\infty - \infty$ indeterminate forms.

L'Hôpital's Rule applied to sin(x)/x and (eˣ − 1)/x

Taylor Polynomials

From linear to polynomial approximation

In The Derivative & Chain Rule, we saw that the tangent line $T_1(x) = f(a) + f'(a)(x - a)$ is the best linear approximation to $f$ near $a$ — “best” meaning that the error $f(x) - T_1(x)$ vanishes faster than $|x - a|$ as $x \to a$. But what if we allow quadratic approximation?

The best quadratic that matches $f$ in value, slope, and curvature at $a$ is

$$T_2(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2}(x - a)^2.$$

The coefficient $f''(a)/2$ is chosen so that $T_2''(a) = f''(a)$ — the curvatures agree. The pattern continues: the degree-$n$ Taylor polynomial matches $f$ through its first $n$ derivatives at $a$.

📐 Definition 1 (Taylor Polynomial)

Let $f$ be $n$-times differentiable at $a$. The degree-$n$ Taylor polynomial of $f$ centered at $a$ is

$$T_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x - a)^k = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots + \frac{f^{(n)}(a)}{n!}(x - a)^n.$$

The special case $a = 0$ gives the Maclaurin polynomial.

🔷 Proposition 1 (Taylor Polynomial as Best Polynomial Approximation)

Among all polynomials $p$ of degree $\leq n$, the Taylor polynomial $T_n$ is the unique one satisfying $p^{(k)}(a) = f^{(k)}(a)$ for $k = 0, 1, \ldots, n$. Equivalently, the error $f(x) - T_n(x)$ vanishes to order $n$ at $a$:

$$\lim_{x \to a} \frac{f(x) - T_n(x)}{(x - a)^n} = 0.$$

📝 Example 4 (Taylor polynomials of eˣ at a = 0)

Since all derivatives of $e^x$ equal $e^x$, we have $f^{(k)}(0) = 1$ for all $k$:

$$T_n(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots + \frac{x^n}{n!}.$$

As $n$ increases, $T_n$ approximates $e^x$ over a wider and wider interval. Try increasing the degree in the explorer below to watch the polynomial “hug” the exponential.

📝 Example 5 (Taylor polynomials of sin(x) at a = 0)

The derivatives of $\sin x$ cycle: $\sin, \cos, -\sin, -\cos, \sin, \ldots$ At $x = 0$: $f(0) = 0$, $f'(0) = 1$, $f''(0) = 0$, $f'''(0) = -1$, $f^{(4)}(0) = 0$, $f^{(5)}(0) = 1, \ldots$

Only odd terms survive:

$$T_{2n+1}(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots + \frac{(-1)^n x^{2n+1}}{(2n+1)!}.$$

The pattern reflects the odd symmetry of $\sin$ — an odd function’s Taylor series has only odd powers.


Taylor polynomials of degrees 1, 3, and 7 approximating eˣ

Taylor’s Theorem (with Remainder)

The Taylor polynomial $T_n$ approximates $f$ — but how well? Taylor’s theorem gives a quantitative error bound via the remainder term $R_n(x) = f(x) - T_n(x)$.

🔷 Theorem 5 (Taylor's Theorem (Lagrange Remainder))

Let $f$ be $(n+1)$-times differentiable on an open interval containing $a$ and $x$. Then

$$f(x) = T_n(x) + R_n(x), \quad \text{where} \quad R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!}(x - a)^{n+1}$$

for some $c$ between $a$ and $x$.

Proof.

We prove this by applying Rolle’s theorem to a carefully constructed auxiliary function. Fix $x \neq a$ and define

$$\phi(t) = f(t) - T_n(t) - K(t - a)^{n+1},$$

where $T_n(t) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(t - a)^k$ is the degree-$n$ Taylor polynomial of $f$ centered at $a$ (evaluated at the variable $t$), and $K$ is a constant chosen so that $\phi(x) = 0$.

First note that $T_n(a) = f(a)$, so $\phi(a) = f(a) - f(a) - 0 = 0$.

Next, the condition $\phi(x) = 0$ gives

$$0 = f(x) - T_n(x) - K(x - a)^{n+1}, \quad \text{so} \quad K = \frac{f(x) - T_n(x)}{(x - a)^{n+1}}.$$

Our goal is to show that $K = \frac{f^{(n+1)}(c)}{(n+1)!}$ for some $c$ between $a$ and $x$.

Since $\phi(a) = \phi(x) = 0$, Rolle’s theorem gives some $c_1$ between $a$ and $x$ with $\phi'(c_1) = 0$. We differentiate $\phi$ a total of $n+1$ times. At each stage, the polynomial terms from $T_n(t)$ lose one degree, and after $n+1$ differentiations the Taylor polynomial terms vanish entirely:

$$\phi^{(n+1)}(t) = f^{(n+1)}(t) - K \cdot (n+1)!$$

(since $(t - a)^{n+1}$ differentiates $n+1$ times to $(n+1)!$, and $T_n(t)$ is a degree-$n$ polynomial whose $(n+1)$-th derivative is $0$).

The key observation is that $\phi^{(k)}(a) = 0$ for $k = 0, 1, \ldots, n$: the derivatives of $T_n$ at $a$ match those of $f$ up to order $n$, and the $K(t - a)^{n+1}$ term contributes nothing to derivatives of order $\leq n$ at $t = a$. By applying Rolle’s theorem successively — $\phi(a) = \phi(x) = 0$ gives a zero $c_1$ of $\phi'$, which combined with $\phi'(a) = 0$ gives a zero $c_2$ of $\phi''$, and so on — we obtain a point $c$ between $a$ and $x$ with $\phi^{(n+1)}(c) = 0$. Setting this to zero:

$$0 = f^{(n+1)}(c) - K(n+1)! \quad \Rightarrow \quad K = \frac{f^{(n+1)}(c)}{(n+1)!}.$$

Combining with the expression $K = \frac{f(x) - T_n(x)}{(x - a)^{n+1}}$ gives the Lagrange remainder:

$$f(x) - T_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!}(x - a)^{n+1}.$$

Note: The base case $n = 0$ gives $f(x) = f(a) + f'(c)(x - a)$, which is exactly the Mean Value Theorem. Taylor’s theorem is the higher-order generalization of the MVT.

💡 Remark 4 (The integral form of the remainder)

An alternative expression for the remainder is

$$R_n(x) = \int_a^x \frac{f^{(n+1)}(t)}{n!}(x - t)^n \, dt.$$

This avoids the unknown point $c$ and is sometimes more useful for estimation. The integral form is proved using integration by parts in The Riemann Integral & FTC, §9.

📝 Example 6 (Taylor remainder for eˣ)

For $f(x) = e^x$ with $a = 0$: all derivatives are $e^x$, so $f^{(n+1)}(c) = e^c$ for some $c$ between $0$ and $x$. The Lagrange remainder satisfies

$$|R_n(x)| = \frac{e^c}{(n+1)!}|x|^{n+1} \leq \frac{e^{|x|}}{(n+1)!}|x|^{n+1}.$$

For any fixed $x$, $R_n(x) \to 0$ as $n \to \infty$ because the factorial $(n+1)!$ grows faster than any power $|x|^{n+1}$. This proves that $e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}$ converges for all $x \in \mathbb{R}$.
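The Lagrange bound can be checked against the actual truncation error. A minimal sketch at $x = 2$ (the evaluation point and degrees are arbitrary choices):

```python
import math

# Actual Taylor error of e^x at a = 0 vs. the Lagrange bound e^|x| |x|^{n+1}/(n+1)!.
x = 2.0
for n in [1, 3, 5, 10]:
    T_n = sum(x**k / math.factorial(k) for k in range(n + 1))
    actual = abs(math.exp(x) - T_n)
    bound = math.exp(abs(x)) * abs(x)**(n + 1) / math.factorial(n + 1)
    assert actual <= bound          # the Lagrange bound is never violated
    print(f"n={n:2d}  error={actual:.3e}  bound={bound:.3e}")
```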

📝 Example 7 (Taylor remainder for sin(x))

For $f(x) = \sin x$ with $a = 0$: all derivatives of $\sin$ are bounded by $1$ in absolute value, so

$$|R_n(x)| \leq \frac{1}{(n+1)!}|x|^{n+1} \to 0 \quad \text{as } n \to \infty$$

for any fixed $x$. The Maclaurin series of $\sin x$ converges to $\sin x$ for all $x$.


Actual Taylor error vs. Lagrange bound for eˣ and sin(x)

When Taylor Series Fail

Not every smooth function equals its Taylor series. The fact that Taylor polynomials approximate well locally (small $|x - a|$, fixed $n$) does not guarantee that the Taylor series converges to $f$ (fixed $x$, $n \to \infty$).

📝 Example 8 (Smooth but not analytic: f(x) = e^{−1/x²})

Define

$$f(x) = \begin{cases} e^{-1/x^2} & \text{if } x \neq 0, \\ 0 & \text{if } x = 0. \end{cases}$$

This function is $C^\infty$ (infinitely differentiable) everywhere. At $x = 0$, one can show by induction (using L’Hôpital’s Rule from §5 at each step) that $f^{(k)}(0) = 0$ for all $k \geq 0$. So the Taylor series at $0$ is

$$T(x) = 0 + 0 + 0 + \cdots = 0,$$

but $f(x) > 0$ for all $x \neq 0$. The Taylor series converges (to $0$), but it does not converge to $f(x)$. The function “flattens” against zero so aggressively at $x = 0$ that no polynomial can detect the non-zero values away from the origin.

📐 Definition 2 (Analytic Function)

A function $f$ is analytic at $a$ if its Taylor series at $a$ converges to $f(x)$ in some neighborhood of $a$: there exists $r > 0$ such that $f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!}(x - a)^k$ for all $|x - a| < r$.

Most functions encountered in practice — $e^x$, $\sin x$, $\cos x$, polynomials, rational functions away from poles — are analytic. The function $e^{-1/x^2}$ is the standard example of smooth but not analytic.

💡 Remark 5 (Smooth vs. analytic in ML)

In practice, loss functions in ML are typically compositions of analytic functions (exponentials, logarithms, polynomials), so the Taylor expansion is a reliable local model. ReLU is piecewise linear (hence piecewise analytic). The distinction between smooth and analytic matters more in theory — the existence of smooth bump functions (which are $C^\infty$ with compact support and are not analytic) is essential for partition-of-unity arguments in differential geometry. (→ formalML: Smooth Manifolds)

The function exp(-1/x squared) and its Taylor series (identically zero)

Connections to ML — Taylor Expansion in Optimization

Taylor expansion is not just used in ML proofs — it is the primary analytical technique for understanding optimization algorithms.

The descent lemma

Suppose $f$ is twice differentiable and $\nabla f$ is $L$-Lipschitz: $\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|$. By the second-order Taylor expansion with Lagrange remainder (in one dimension, this is Theorem 5 with $n = 1$):

$$f(y) = f(x) + f'(x)(y - x) + \frac{f''(c)}{2}(y - x)^2$$

for some $c$ between $x$ and $y$. Since $|f''(c)| \leq L$ (an $L$-Lipschitz $f'$ has difference quotients bounded by $L$, hence $|f''| \leq L$ — the converse direction of Corollary 3), we get the descent lemma:

$$f(y) \leq f(x) + f'(x)(y - x) + \frac{L}{2}(y - x)^2.$$

Setting $y = x - \frac{1}{L}f'(x)$ (a gradient descent step with learning rate $\eta = 1/L$) and simplifying:

$$f(y) \leq f(x) - \frac{1}{2L}(f'(x))^2.$$

Each GD step decreases $f$ by at least $\frac{1}{2L}(f'(x))^2$. This is the foundation of every gradient descent convergence proof.
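The inequality above can be verified numerically. A minimal sketch on the test function $f(x) = x^2 + \cos x$ (our choice, not from the text; here $f''(x) = 2 - \cos x$, so $L = 3$ is a valid Lipschitz constant for $f'$):

```python
import math

# Descent lemma check: f(y) <= f(x) - f'(x)^2/(2L) for the step y = x - f'(x)/L.
f  = lambda x: x**2 + math.cos(x)      # f''(x) = 2 - cos(x), so |f''| <= 3
df = lambda x: 2*x - math.sin(x)
L  = 3.0

for x in [-2.0, -0.5, 0.3, 1.7]:
    y = x - df(x) / L                  # GD step with eta = 1/L
    guaranteed = f(x) - df(x)**2 / (2 * L)
    assert f(y) <= guaranteed + 1e-12  # the guaranteed decrease is achieved
    print(f"x={x:+.1f}  f(x)={f(x):.4f}  f(y)={f(y):.4f}  bound={guaranteed:.4f}")
```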

Newton’s method

Newton’s method approximates $f$ by its local quadratic Taylor model:

$$m(x) = f(x_k) + f'(x_k)(x - x_k) + \frac{1}{2}f''(x_k)(x - x_k)^2.$$

Minimizing $m$ by solving $m'(x) = 0$ yields the Newton update:

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}.$$

For root-finding (solve $g(x) = 0$), the linear Taylor model $g(x_k) + g'(x_k)(x - x_k) = 0$ gives $x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}$. Near a root, the Taylor remainder analysis shows the error satisfies $|x_{k+1} - x^*| \leq C|x_k - x^*|^2$ — quadratic convergence. Each iteration roughly doubles the number of correct digits.
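The digit-doubling is easy to observe directly. A sketch using root-finding on $g(x) = x^2 - 2$, so $x^* = \sqrt{2}$ (the starting point is an arbitrary choice):

```python
import math

# Newton's method on g(x) = x^2 - 2: the error roughly squares at each step.
g  = lambda x: x**2 - 2
dg = lambda x: 2*x

x, root = 1.5, math.sqrt(2)
errors = []
for _ in range(5):
    errors.append(abs(x - root))
    x = x - g(x) / dg(x)           # Newton update from the linear Taylor model

for e in errors:
    print(f"{e:.2e}")              # each error is near the square of the previous
```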

Convergence rate comparison

Using Taylor expansion, we can compare:

  • Gradient descent (first-order, uses the $T_1$ model): linear convergence $|x_k - x^*| \leq C r^k$ for some $r < 1$.
  • Newton’s method (second-order, uses the $T_2$ model): quadratic convergence $|x_k - x^*| \leq C \cdot r^{2^k}$, which is doubly exponential decay.

The practical trade-off: Newton needs $f''$ (the Hessian in higher dimensions), which is expensive to compute. This is why quasi-Newton methods like L-BFGS exist — they approximate the second-order Taylor model cheaply. (→ formalML: Gradient Descent)


Convergence rate comparison: gradient descent (linear) vs Newton's method (quadratic)

Computational Notes

Taylor approximation is straightforward to implement numerically. In Python:

import numpy as np
from math import factorial

def taylor_polynomial(f_derivs, a, n, x):
    """Evaluate the degree-n Taylor polynomial of f at x.
    f_derivs: list of derivative functions [f, f', f'', ...]
    """
    return sum(f_derivs[k](a) / factorial(k) * (x - a)**k for k in range(n + 1))

# Taylor polynomials of e^x at a = 0
x = np.linspace(-3, 3, 1000)
for n in [1, 3, 5, 10]:
    T_n = sum((x**k) / factorial(k) for k in range(n + 1))
    max_error = np.max(np.abs(np.exp(x) - T_n))
    print(f"T_{n}: max error on [-3,3] = {max_error:.6e}")

For Newton’s method in practice, SciPy provides scipy.optimize.minimize(f, x0, method='Newton-CG'), which implements a Newton-conjugate gradient method that approximates the Hessian-vector product without forming the full Hessian matrix.

Taylor polynomial accuracy vs. degree and distance from center

Connections & Further Reading

Within formalCalculus

This topic builds directly on:

  • The Derivative & Chain Rule — the derivative definition, differentiation rules, and the tangent-line approximation that Taylor polynomials generalize.
  • Completeness & Compactness — the Extreme Value Theorem, which powers the proof of Rolle’s theorem.

Coming next

  • The Riemann Integral & FTC — the Fundamental Theorem of Calculus uses the MVT in its proof, and provides the integral form of the Taylor remainder $R_n(x) = \int_a^x \frac{f^{(n+1)}(t)}{n!}(x-t)^n \, dt$.
  • Improper Integrals & Special Functions — Taylor expansion of integrands near singularities, Stirling’s approximation for the Gamma function.
  • Power Series & Taylor Series — Taylor polynomials are finite truncations of Taylor series. This topic provided the approximation theory; the series topic addresses convergence of the infinite sum.
  • Partial Derivatives & the Gradient — the MVT generalizes to the multivariable setting. The descent lemma in $\mathbb{R}^n$ uses the multivariable second-order Taylor expansion.
  • The Hessian & Second-Order Analysis (coming soon) — Newton’s method in $\mathbb{R}^n$ uses the Hessian matrix $\nabla^2 f$, the multivariable analog of $f''$, in the local quadratic Taylor model.
  • Gradient Descent → formalML — descent lemma, convergence rates, and the role of Taylor expansion in optimization analysis.
  • Proximal Methods → formalML — proximal operators as quadratic Taylor models with regularization.
  • Smooth Manifolds → formalML — Taylor expansion on manifolds, smooth bump functions, and partition of unity.

References

  1. Abbott (2015). Understanding Analysis. Chapter 5 develops the MVT from Rolle's theorem with exceptional clarity; Chapter 6 covers Taylor's theorem — the primary reference for our rigorous-but-accessible approach.
  2. Rudin (1976). Principles of Mathematical Analysis. Chapter 5 on differentiation covers the MVT and Taylor's theorem in the definitive compact style.
  3. Spivak (2008). Calculus. Chapters 11–12 develop the MVT and Taylor's theorem with unmatched geometric intuition alongside full rigor — exceptional treatment of the Taylor remainder.
  4. Boyd & Vandenberghe (2004). Convex Optimization. Section 9.1 on unconstrained minimization — the descent lemma and convergence rate proofs that directly use Taylor expansion.
  5. Nesterov (2004). Introductory Lectures on Convex Optimization. Chapter 1 develops the Lipschitz gradient framework and descent lemma — the canonical application of Taylor expansion to optimization convergence analysis.