Limits & Continuity · foundational · 45 min read

Epsilon-Delta & Continuity

From sequence limits to function limits — the ε-δ definition, continuity, the Intermediate and Extreme Value Theorems, and the Lipschitz condition that controls gradient descent

Abstract. A function f has limit L at a if for every ε > 0, there exists δ > 0 such that 0 < |x - a| < δ implies |f(x) - L| < ε. This ε-δ definition extends the ε-N framework for sequences to arbitrary functions and serves as the foundation for all subsequent calculus. Continuity at a point means the limit equals the function value — equivalently, small perturbations in the input produce small perturbations in the output. The Intermediate Value Theorem guarantees that continuous functions on intervals cannot 'skip' values, which implies the existence of roots and decision boundaries. The Extreme Value Theorem guarantees that continuous functions on closed, bounded intervals attain their maximum and minimum — the theoretical foundation for the existence of optimal parameters in constrained optimization. Beyond pointwise continuity, Lipschitz continuity — |f(x) - f(y)| ≤ K|x - y| — quantifies how 'well-behaved' a function is. In machine learning, the Lipschitz constant of the loss gradient directly controls gradient descent convergence rates, spectral normalization enforces Lipschitz constraints on neural network layers for Wasserstein GANs, and the continuity properties of activation functions (ReLU is continuous but not differentiable; sigmoid is smooth with K = 1/4) determine whether gradient-based training is possible at all.

Where this leads → formalML

  • formalML Convex functions on open sets are automatically continuous — a non-trivial theorem. Semicontinuity generalizes continuity for optimization, and the Lipschitz gradient condition (∇f is L-Lipschitz) is the key assumption in gradient descent convergence proofs.
  • formalML The convergence rate of gradient descent on smooth functions depends on the Lipschitz constant of the gradient: step size η must satisfy η < 2/L for L-smooth functions. The continuity of the loss landscape is the minimal requirement for gradient-based optimization.
  • formalML Continuous hypothesis classes and the uniform convergence of empirical risk to true risk rely on continuity and compactness arguments developed here.

Overview & Motivation

Picture a loss function landscape — a surface where each point represents a set of model parameters and its height represents the loss. When we nudge the parameters slightly, does the loss change by a small, predictable amount? Or could it jump discontinuously, making gradient-based optimization impossible? If we know one set of parameters gives positive loss and another gives negative loss, must there be parameters that give exactly zero loss?

These are questions about continuity, and the ε-δ framework we develop here makes them precise. Continuity is the minimal requirement for gradient-based optimization: if the loss function is not continuous, we cannot even define a meaningful gradient, let alone follow it downhill.

In Sequences, Limits & Convergence, we made “arbitrarily close” precise for sequences with the ε-N definition: $|a_n - L| < \varepsilon$ for all $n \geq N$. Now we extend this framework to functions. Instead of asking “what happens as $n$ gets large?”, we ask “what happens as $x$ approaches a point $a$?” The machinery is analogous — we still quantify “for every ε, there exists…” — but the ε-δ definition for functions is subtler because δ can depend on both ε and the point $a$.

The ε-δ Definition of a Function Limit

From intuition to precision

We want to say “$f(x)$ approaches $L$ as $x$ approaches $a$.” Informally, this means: as we take $x$ values closer and closer to $a$, the corresponding $f(x)$ values get closer and closer to $L$.

Our first attempt at making this precise: “for $x$ close to $a$, $f(x)$ is close to $L$.” But how close is “close”? We need to quantify both closeness conditions — and the key insight is that the first one (how close $x$ is to $a$) should be a response to the second one (how close we want $f(x)$ to be to $L$).

This gives us the quantifier structure: for every closeness requirement on $f(x)$, there exists a sufficient closeness requirement on $x$.

📐 Definition 1 (Limit of a Function)

Let $f$ be defined on an open interval containing $a$, except possibly at $a$ itself. We say $\lim_{x \to a} f(x) = L$ if for every $\varepsilon > 0$, there exists $\delta > 0$ such that

$$0 < |x - a| < \delta \implies |f(x) - L| < \varepsilon.$$

We write $f(x) \to L$ as $x \to a$.

The condition $0 < |x - a|$ is critical: we require $x \neq a$. The limit describes the behavior of $f$ near $a$, not at $a$. The function need not even be defined at $a$ — all that matters is what happens in every punctured neighborhood of $a$.

Geometrically, the ε-δ definition says: draw a horizontal band of width $2\varepsilon$ centered at $L$. No matter how narrow this band, we can find a vertical band of width $2\delta$ centered at $a$ such that the function graph inside the vertical band stays inside the horizontal band. The intersection of these bands forms an “ε-δ box,” and the function graph must stay inside the box.

The ε-δ definition

Worked examples

📝 Example 1 (Proof that lim(2x + 1) = 3 as x → 1)

Claim: $\lim_{x \to 1} (2x + 1) = 3$.

Proof. Let $\varepsilon > 0$ be given. We need to find $\delta > 0$ such that $0 < |x - 1| < \delta$ implies $|(2x + 1) - 3| < \varepsilon$.

We work backward from the conclusion:

$$|(2x + 1) - 3| = |2x - 2| = 2|x - 1|.$$

So $|(2x + 1) - 3| < \varepsilon$ is equivalent to $2|x - 1| < \varepsilon$, which is $|x - 1| < \varepsilon/2$.

Choose $\delta = \varepsilon/2$. Then $0 < |x - 1| < \delta$ implies

$$|(2x + 1) - 3| = 2|x - 1| < 2\delta = 2 \cdot \frac{\varepsilon}{2} = \varepsilon.$$

$\square$

For linear functions, the algebra is clean: $\delta$ is simply $\varepsilon$ divided by the absolute value of the slope. But for nonlinear functions, we need to be more careful.
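The choice $\delta = \varepsilon/2$ can be sanity-checked numerically — a sketch, not a proof: sample the punctured δ-neighborhood of 1 and confirm every sampled point lands inside the ε-band.

```python
import numpy as np

def check_linear_limit(eps, n=1001):
    """Check that delta = eps/2 works for lim_{x->1} (2x + 1) = 3."""
    delta = eps / 2
    x = np.linspace(1 - delta, 1 + delta, n)
    x = x[(np.abs(x - 1) < delta) & (x != 1)]  # punctured delta-neighborhood
    return bool(np.all(np.abs((2 * x + 1) - 3) < eps))

assert all(check_linear_limit(eps) for eps in [1.0, 0.1, 1e-4])
```

The check passes for every ε we try, exactly as the proof predicts: $2|x - 1| < 2\delta = \varepsilon$ inside the band.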

📝 Example 2 (Proof that lim x² = 4 as x → 2)

Claim: $\lim_{x \to 2} x^2 = 4$.

Proof. Let $\varepsilon > 0$ be given. We need $\delta > 0$ such that $0 < |x - 2| < \delta$ implies $|x^2 - 4| < \varepsilon$.

Factor: $|x^2 - 4| = |x - 2| \cdot |x + 2|$. The challenge is that $|x + 2|$ depends on $x$, so we need to bound it. If we restrict $\delta \leq 1$, then $|x - 2| < 1$ means $1 < x < 3$, so $|x + 2| < 5$.

Under this restriction: $|x^2 - 4| = |x - 2| \cdot |x + 2| < |x - 2| \cdot 5$.

Choose $\delta = \min(1, \varepsilon/5)$. Then $0 < |x - 2| < \delta$ implies:

  • $\delta \leq 1$, so $|x + 2| < 5$
  • $\delta \leq \varepsilon/5$, so $|x - 2| < \varepsilon/5$

Therefore $|x^2 - 4| = |x - 2| \cdot |x + 2| < \frac{\varepsilon}{5} \cdot 5 = \varepsilon$. $\square$

The “restrict δ first to bound a factor, then choose δ as a minimum” technique is the standard approach for polynomial and rational limits. The key step — bounding $|x + 2|$ by restricting to a neighborhood of $a$ — is where the real work happens.
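The same numerical sanity check works for Example 2 (again a sketch, not a proof): with $\delta = \min(1, \varepsilon/5)$, every sampled point of the punctured δ-neighborhood of 2 stays inside the ε-band.

```python
import numpy as np

def check_quadratic_limit(eps, n=10001):
    """Check that delta = min(1, eps/5) works for lim_{x->2} x^2 = 4."""
    delta = min(1.0, eps / 5)
    x = np.linspace(2 - delta, 2 + delta, n)
    x = x[(np.abs(x - 2) < delta) & (x != 2)]  # punctured delta-neighborhood
    return bool(np.all(np.abs(x**2 - 4) < eps))

assert all(check_quadratic_limit(eps) for eps in [10.0, 1.0, 0.01])
```

For large ε the min is capped at 1 (the preliminary restriction), and for small ε it is $\varepsilon/5$ — both branches of the proof get exercised.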

A non-example: Consider $f(x) = \sin(1/x)$ at $a = 0$. As $x \to 0$, the function oscillates between $-1$ and $+1$ infinitely often. For any candidate limit $L$ and any $\varepsilon < 1$, no matter how small we make $\delta$, the interval $(0, \delta)$ contains points where $f(x) = 1$ and points where $f(x) = -1$. No single $L$ can be within $\varepsilon$ of both. Therefore $\lim_{x \to 0} \sin(1/x)$ does not exist.

🔷 Proposition 1 (Uniqueness of Limits)

If $\lim_{x \to a} f(x)$ exists, it is unique. That is, if $\lim_{x \to a} f(x) = L_1$ and $\lim_{x \to a} f(x) = L_2$, then $L_1 = L_2$.

Proof.

Suppose $\lim_{x \to a} f(x) = L_1$ and $\lim_{x \to a} f(x) = L_2$ with $L_1 \neq L_2$. Set $\varepsilon = |L_1 - L_2|/2 > 0$. By the limit definition, there exist $\delta_1, \delta_2 > 0$ such that:

  • $0 < |x - a| < \delta_1 \implies |f(x) - L_1| < \varepsilon$
  • $0 < |x - a| < \delta_2 \implies |f(x) - L_2| < \varepsilon$

Let $\delta = \min(\delta_1, \delta_2)$ and pick any $x$ with $0 < |x - a| < \delta$. By the triangle inequality:

$$|L_1 - L_2| \leq |f(x) - L_1| + |f(x) - L_2| < \varepsilon + \varepsilon = |L_1 - L_2|.$$

This gives $|L_1 - L_2| < |L_1 - L_2|$, a contradiction. Therefore $L_1 = L_2$. $\square$

Try dragging the ε slider below to see the ε-δ box shrink around the limit. As ε decreases, δ must decrease too — the box tightens, but the function graph always stays inside it.


One-Sided Limits and Limits at Infinity

Sometimes we care about the behavior of $f$ as $x$ approaches $a$ from only one side. This is essential for functions with different behavior on the left and right — like the floor function at integers, or ReLU at 0.

📐 Definition 2 (Left-Hand Limit)

We say $\lim_{x \to a^-} f(x) = L$ if for every $\varepsilon > 0$, there exists $\delta > 0$ such that

$$a - \delta < x < a \implies |f(x) - L| < \varepsilon.$$

📐 Definition 3 (Right-Hand Limit)

We say $\lim_{x \to a^+} f(x) = L$ if for every $\varepsilon > 0$, there exists $\delta > 0$ such that

$$a < x < a + \delta \implies |f(x) - L| < \varepsilon.$$

🔷 Proposition 2 (Two-Sided Limit from One-Sided Limits)

$\lim_{x \to a} f(x) = L$ if and only if $\lim_{x \to a^-} f(x) = L$ and $\lim_{x \to a^+} f(x) = L$.

Proof.

Forward direction. If $\lim_{x \to a} f(x) = L$, then for every $\varepsilon > 0$ there exists $\delta > 0$ such that $0 < |x - a| < \delta$ implies $|f(x) - L| < \varepsilon$. This condition covers both $x \in (a - \delta, a)$ and $x \in (a, a + \delta)$, so both one-sided limits equal $L$.

Reverse direction. Suppose both one-sided limits equal $L$. Given $\varepsilon > 0$, there exist $\delta_1, \delta_2 > 0$ such that:

  • $a - \delta_1 < x < a \implies |f(x) - L| < \varepsilon$
  • $a < x < a + \delta_2 \implies |f(x) - L| < \varepsilon$

Set $\delta = \min(\delta_1, \delta_2)$. Then $0 < |x - a| < \delta$ means either $a - \delta < x < a$ or $a < x < a + \delta$, and in either case $|f(x) - L| < \varepsilon$. $\square$

📐 Definition 4 (Limit at Infinity)

We say $\lim_{x \to \infty} f(x) = L$ if for every $\varepsilon > 0$, there exists $M > 0$ such that

$$x > M \implies |f(x) - L| < \varepsilon.$$

Similarly, $\lim_{x \to -\infty} f(x) = L$ if for every $\varepsilon > 0$, there exists $M > 0$ such that $x < -M$ implies $|f(x) - L| < \varepsilon$.

One-sided limits

ML context: One-sided limits appear at ReLU’s kink at 0. The left-hand limit of $\text{ReLU}'(x)$ as $x \to 0^-$ is 0, and the right-hand limit as $x \to 0^+$ is 1. Since both one-sided limits of the function itself exist and equal 0, ReLU is continuous at 0 — even though it is not differentiable there. Limits at infinity describe the asymptotic behavior of sigmoid ($\sigma(x) \to 1$ as $x \to \infty$) and tanh ($\tanh(x) \to 1$ as $x \to \infty$).
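The limit-at-infinity definition is constructive for sigmoid: since $1 - \sigma(x) = 1/(e^x + 1)$, the threshold $M = \ln(1/\varepsilon - 1)$ works for any $\varepsilon < 1/2$. A quick numerical check of that algebra (a sketch, not a proof):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# |sigmoid(x) - 1| = 1/(e^x + 1) < eps  whenever  x > M = ln(1/eps - 1)
eps = 1e-6
M = np.log(1.0 / eps - 1.0)
x = np.linspace(M + 1e-3, M + 50.0, 1000)
assert np.all(np.abs(sigmoid(x) - 1.0) < eps)
```

This is the ε-M game made explicit: given ε, we produce a concrete M rather than merely asserting one exists.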

Continuity at a Point

A function is continuous at a point if there is no “break” in its graph there. The formal definition captures three conditions that must all hold.

📐 Definition 5 (Continuity at a Point)

A function $f$ is continuous at $a$ if:

  1. $f(a)$ is defined,
  2. $\lim_{x \to a} f(x)$ exists, and
  3. $\lim_{x \to a} f(x) = f(a)$.

Equivalently: for every $\varepsilon > 0$, there exists $\delta > 0$ such that $|x - a| < \delta$ implies $|f(x) - f(a)| < \varepsilon$.

We say $f$ is continuous on an interval $I$ if $f$ is continuous at every point of $I$.

Notice the subtle difference from the limit definition: we write $|x - a| < \delta$ (not $0 < |x - a| < \delta$). For continuity, we include $x = a$ itself — and condition 3 ensures that $f(a)$ is the “right” value.

The sequential characterization provides a powerful bridge back to Topic 1.

🔷 Theorem 1 (Sequential Characterization of Continuity)

A function $f$ is continuous at $a$ if and only if for every sequence $(x_n)$ with $x_n \to a$, we have $f(x_n) \to f(a)$.

Proof.

Forward direction. Assume $f$ is continuous at $a$, and let $(x_n)$ be a sequence with $x_n \to a$. Given $\varepsilon > 0$, by continuity there exists $\delta > 0$ such that $|x - a| < \delta$ implies $|f(x) - f(a)| < \varepsilon$. Since $x_n \to a$, there exists $N$ such that $n \geq N$ implies $|x_n - a| < \delta$. For such $n$, we have $|f(x_n) - f(a)| < \varepsilon$. This proves $f(x_n) \to f(a)$.

Reverse direction (contrapositive). Assume $f$ is not continuous at $a$. Then there exists $\varepsilon_0 > 0$ such that for every $\delta > 0$, there is some $x$ with $|x - a| < \delta$ but $|f(x) - f(a)| \geq \varepsilon_0$. For each $n \in \mathbb{N}$, apply this with $\delta = 1/n$: there exists $x_n$ with $|x_n - a| < 1/n$ but $|f(x_n) - f(a)| \geq \varepsilon_0$. Then $x_n \to a$ (since $|x_n - a| < 1/n \to 0$) but $f(x_n) \not\to f(a)$ (since $|f(x_n) - f(a)| \geq \varepsilon_0$ for all $n$). $\square$
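The sequential characterization can be watched numerically (a small sketch): a sequence alternating around $a = 1$ is pushed through the continuous $x^2$, while one-sided sequences expose the jump of the sign function at 0.

```python
import numpy as np

n = np.arange(1, 10001)
x_n = 1.0 + (-1.0) ** n / n        # x_n -> 1, alternating around a = 1

f = lambda x: x**2                  # continuous at 1, so f(x_n) -> f(1) = 1
assert abs(f(x_n[-1]) - 1.0) < 1e-3

g = np.sign                         # jump discontinuity at 0
assert g(-1.0 / n)[-1] == -1.0 and g(1.0 / n)[-1] == 1.0  # images converge to different values
```

One sequence converging to $a$ with $f(x_n) \not\to f(a)$ is enough to refute continuity — exactly the contrapositive used in the proof above.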

💡 Remark 1 (Bridge Between Topics 1 and 2)

The sequential characterization is the bridge between sequence convergence (Topic 1) and function continuity (this topic). It lets us apply all the sequence-convergence theorems — Monotone Convergence, Squeeze, Bolzano-Weierstrass, the Algebra of Limits — in the function setting. When we need to prove something about a continuous function, we can often reduce it to a statement about convergent sequences, where we already have powerful tools.

Algebra of Continuous Functions

🔷 Theorem 2 (Algebra of Continuous Functions)

If $f$ and $g$ are continuous at $a$, then so are:

  1. $f + g$ and $f - g$
  2. $f \cdot g$
  3. $f/g$ (provided $g(a) \neq 0$)
  4. $f \circ g$ (if $g$ is continuous at $a$ and $f$ is continuous at $g(a)$)

Proof.

We prove this via the sequential characterization. Let $(x_n)$ be any sequence with $x_n \to a$. Since $f$ and $g$ are continuous at $a$, we have $f(x_n) \to f(a)$ and $g(x_n) \to g(a)$.

Sum: By the Algebra of Limits for sequences (Theorem 3 from Sequences, Limits & Convergence), $(f + g)(x_n) = f(x_n) + g(x_n) \to f(a) + g(a) = (f + g)(a)$.

Product: Similarly, $f(x_n) \cdot g(x_n) \to f(a) \cdot g(a)$.

Quotient: If $g(a) \neq 0$, then $f(x_n)/g(x_n) \to f(a)/g(a)$ by the quotient rule for sequences.

Composition: If $g$ is continuous at $a$, then $g(x_n) \to g(a)$. If $f$ is continuous at $g(a)$, then $f(g(x_n)) \to f(g(a))$. This shows $(f \circ g)(x_n) \to (f \circ g)(a)$. $\square$

🔷 Corollary 1 (Continuity of Polynomials and Rational Functions)

Every polynomial is continuous on $\mathbb{R}$. Every rational function $p(x)/q(x)$ is continuous on its domain (wherever $q(x) \neq 0$).

💡 Remark 2 (Continuity of Neural Networks)

In machine learning, this theorem is why compositions of continuous layers produce continuous networks. If each activation function $\sigma_i$ is continuous and each affine map $x \mapsto W_i x + b_i$ is continuous (which it is, as a polynomial in the coordinates), then the full network

$$f = \sigma_L \circ W_L \circ \cdots \circ \sigma_1 \circ W_1$$

is continuous by iterated application of the composition rule. This is the minimal requirement for the network to have well-defined gradients almost everywhere.
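The composition argument can be made quantitative (a sketch with random weights for illustration): a two-layer ReLU network is Lipschitz with constant at most $\|W_2\|_2 \cdot \|W_1\|_2$, since ReLU is 1-Lipschitz and an affine map is $\|W\|_2$-Lipschitz — and Lipschitz implies continuous.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)

def net(x):
    """Two-layer ReLU network: W2 relu(W1 x + b1) + b2."""
    return W2 @ np.maximum(0.0, W1 @ x + b1) + b2

# Composition bound: Lipschitz constant <= product of the layers' spectral norms
K = np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2)

x = rng.normal(size=8)
for _ in range(100):
    y = x + rng.normal(scale=0.1, size=8)
    assert np.linalg.norm(net(x) - net(y)) <= K * np.linalg.norm(x - y) + 1e-9
```

The bound is usually loose (ReLU zeros out coordinates), but it certifies continuity of the whole composition from the continuity of each layer.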

Types of Discontinuities

Not all discontinuities are created equal. Understanding the classification tells us what can be “fixed” and what cannot.

📐 Definition 6 (Removable Discontinuity)

A function $f$ has a removable discontinuity at $a$ if $\lim_{x \to a} f(x)$ exists, but either $f(a)$ is undefined or $f(a) \neq \lim_{x \to a} f(x)$. Redefining $f(a) = \lim_{x \to a} f(x)$ “removes” the discontinuity.

📐 Definition 7 (Jump Discontinuity)

A function $f$ has a jump discontinuity at $a$ if both one-sided limits $\lim_{x \to a^-} f(x)$ and $\lim_{x \to a^+} f(x)$ exist, but $\lim_{x \to a^-} f(x) \neq \lim_{x \to a^+} f(x)$.

📐 Definition 8 (Essential Discontinuity)

A function $f$ has an essential discontinuity at $a$ if at least one of the one-sided limits fails to exist (either diverges to infinity or oscillates).

Concrete examples:

  • Removable: $f(x) = \sin(x)/x$ at $x = 0$ — the limit is 1, but the function is undefined at 0.
  • Jump: $f(x) = \lfloor x \rfloor$ (floor function) at any integer — left and right limits differ by 1.
  • Essential: $f(x) = \sin(1/x)$ at $x = 0$ — infinite oscillation, no limit exists.
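The three examples can be probed by shrinking $h$ (a heuristic sketch — numerics can suggest, but never prove, a limit):

```python
import numpy as np

h = 10.0 ** -np.arange(3, 8)                   # h -> 0+

# Removable: sin(x)/x -> 1 from both sides, but is undefined at 0
sinc = lambda x: np.sin(x) / x
assert np.allclose(sinc(h), 1.0, atol=1e-5) and np.allclose(sinc(-h), 1.0, atol=1e-5)

# Jump: floor has one-sided limits 1 and 2 at x = 2
assert np.all(np.floor(2 - h) == 1.0) and np.all(np.floor(2 + h) == 2.0)

# Essential: sin(1/x) keeps oscillating as x -> 0+; the probes never settle
osc = np.sin(1.0 / h)
assert osc.max() - osc.min() > 0.5
```

For the removable case the probes agree from both sides; for the jump they stabilize at two different values; for the essential case they refuse to stabilize at all.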

Continuity types

Explore the four types below. The ε-δ box shows why continuity fails for each type: for the continuous case, a valid δ always exists. For removable discontinuities, the limit exists but misses f(a). For jump discontinuities, no single δ-band can contain both sides. For essential discontinuities, the oscillation defeats any ε.

f(x) = x² at x = 1: All three conditions hold: f(1) = 1, the limit exists and equals 1, and f(1) equals the limit.

The Intermediate Value Theorem

The IVT is one of the most intuitive theorems in analysis — a continuous function cannot “jump” over a value — but its proof requires the completeness of $\mathbb{R}$.

🔷 Theorem 3 (Intermediate Value Theorem)

Let $f: [a, b] \to \mathbb{R}$ be continuous. If $y$ is any value strictly between $f(a)$ and $f(b)$, then there exists $c \in (a, b)$ such that $f(c) = y$.

Proof.

Without loss of generality, assume $f(a) < y < f(b)$ (the case $f(a) > y > f(b)$ follows by applying the argument to $-f$).

Define $S = \{x \in [a, b] : f(x) \leq y\}$. This set is non-empty (it contains $a$, since $f(a) < y$) and bounded above by $b$. By the completeness of $\mathbb{R}$ (the least upper bound property), $c = \sup S$ exists.

We claim $f(c) = y$. We rule out the other two possibilities:

Case 1: $f(c) < y$. Set $\varepsilon = y - f(c) > 0$. By continuity, there exists $\delta > 0$ such that $|x - c| < \delta$ implies $|f(x) - f(c)| < \varepsilon$, i.e., $f(x) < f(c) + \varepsilon = y$. Since $c < b$ (because $f(b) > y > f(c)$), the interval $(c, c + \delta')$ with $\delta' = \min(\delta, b - c)$ is non-empty, and every $x$ in this interval satisfies $f(x) < y$, hence $x \in S$. But then $c$ is not an upper bound of $S$ — contradiction with $c = \sup S$.

Case 2: $f(c) > y$. Set $\varepsilon = f(c) - y > 0$. By continuity, there exists $\delta > 0$ such that $|x - c| < \delta$ implies $f(x) > f(c) - \varepsilon = y$. So no point of $(c - \delta, c]$ belongs to $S$ (in particular $c \notin S$), which makes $c - \delta$ an upper bound for $S$ — contradicting $c = \sup S$.

Both cases lead to contradictions, so $f(c) = y$. $\square$

🔷 Corollary 2 (Root Existence)

If $f$ is continuous on $[a, b]$ and $f(a) \cdot f(b) < 0$ (i.e., $f$ changes sign), then there exists $c \in (a, b)$ with $f(c) = 0$.

💡 Remark 3 (IVT and Decision Boundaries)

The IVT is the theoretical foundation of the bisection method for root-finding. More broadly, it guarantees the existence of decision boundaries: if a continuous classifier assigns a positive score to one data point and a negative score to another, the IVT guarantees that somewhere between them, the classifier outputs exactly zero — a decision boundary exists. Every continuous path between oppositely classified points must cross the boundary; a continuous score cannot jump from one class to the other.

IVT demonstration

Drag the horizontal target line to see IVT in action. Toggle bisection mode to watch the algorithm narrow in on the root step by step.


The Extreme Value Theorem

🔷 Theorem 4 (Extreme Value Theorem)

If $f: [a, b] \to \mathbb{R}$ is continuous, then $f$ attains its maximum and minimum on $[a, b]$. That is, there exist $c, d \in [a, b]$ such that $f(c) \leq f(x) \leq f(d)$ for all $x \in [a, b]$.

Proof.

We prove the maximum case; the minimum case follows by applying the argument to f-f.

Step 1: $f$ is bounded above on $[a, b]$. Suppose not. Then for each $n \in \mathbb{N}$, there exists $x_n \in [a, b]$ with $f(x_n) > n$. The sequence $(x_n)$ is bounded (all terms lie in $[a, b]$), so by the Bolzano-Weierstrass Theorem (from Sequences, Limits & Convergence), there is a convergent subsequence $x_{n_k} \to x^* \in [a, b]$ (the interval is closed, so the limit stays in $[a, b]$). By continuity, $f(x_{n_k}) \to f(x^*)$. But $f(x_{n_k}) > n_k \to \infty$, which contradicts convergence. Therefore $f$ is bounded above.

Step 2: The supremum is attained. Let $M = \sup_{x \in [a,b]} f(x)$, which exists by Step 1. By the definition of supremum, for each $n$ there exists $x_n \in [a, b]$ with $f(x_n) > M - 1/n$. Again by Bolzano-Weierstrass, extract a convergent subsequence $x_{n_k} \to d \in [a, b]$. By continuity, $f(d) = \lim f(x_{n_k})$. Since $M - 1/n_k < f(x_{n_k}) \leq M$ for all $k$, we get $f(d) = M$ by the Squeeze Theorem. $\square$

💡 Remark 4 (EVT and Optimization)

The EVT is the existence theorem for optimization: it guarantees that $\min_{x \in [a,b]} f(x)$ has a solution when $f$ is continuous and the domain $[a, b]$ is compact (closed and bounded). In machine learning, regularization and weight clipping create compact parameter spaces — for example, constraining $\|\theta\| \leq C$ — precisely to ensure that minimizers exist. Without compactness, a continuous function may approach but never reach its infimum (think of $e^{-x}$ on $(0, \infty)$, which approaches 0 but never reaches it).
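The compactness caveat is easy to see numerically (a sketch): on the compact interval $[0, 10]$ the minimum of $e^{-x}$ is attained at the endpoint, while on $(0, \infty)$ the infimum 0 is approached but never reached.

```python
import numpy as np

x = np.linspace(0.0, 10.0, 100001)            # compact domain [0, 10]
fx = np.exp(-x)
assert np.isclose(fx.min(), np.exp(-10.0))    # minimum value e^{-10} ...
assert x[np.argmin(fx)] == 10.0               # ... attained at the endpoint

# On (0, inf): values get arbitrarily close to 0 but stay strictly positive
assert np.all(np.exp(-np.linspace(1.0, 700.0, 1000)) > 0.0)
```

Clip the domain and the minimizer exists; remove the clip and the "minimizer" escapes to infinity — the numerical analogue of why constrained parameter spaces are used.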

EVT demonstration

Uniform Continuity and Lipschitz Continuity

So far, our notion of continuity is pointwise: at each point $a$, we find a δ that works for that specific $a$. But δ might shrink as $a$ moves — different points might need different δ values. Uniform continuity asks for a single δ that works everywhere simultaneously.

📐 Definition 9 (Uniform Continuity)

A function $f$ is uniformly continuous on a set $D$ if for every $\varepsilon > 0$, there exists $\delta > 0$ such that for all $x, y \in D$:

$$|x - y| < \delta \implies |f(x) - f(y)| < \varepsilon.$$

The key difference: in pointwise continuity, $\delta$ depends on both $\varepsilon$ and the point $a$. In uniform continuity, $\delta$ depends only on $\varepsilon$ — the same δ works for every pair of points in $D$.

Lipschitz continuity goes further: it demands a linear relationship between input perturbation and output perturbation.

📐 Definition 10 (Lipschitz Continuity)

A function $f$ is Lipschitz continuous with constant $K \geq 0$ on a set $D$ if

$$|f(x) - f(y)| \leq K|x - y| \quad \text{for all } x, y \in D.$$

The smallest such $K$ is called the Lipschitz constant of $f$ on $D$.

Geometrically, $f$ is $K$-Lipschitz if and only if at every point $(x_0, f(x_0))$, the graph of $f$ stays inside a cone with slopes $\pm K$. The Lipschitz constant is the steepest the function is allowed to be — anywhere.

🔷 Proposition 3 (The Continuity Hierarchy)

Lipschitz continuity $\Rightarrow$ uniform continuity $\Rightarrow$ (pointwise) continuity.

Both converses are false.

Proof.

Lipschitz $\Rightarrow$ uniform continuity. If $K = 0$, then $f$ is constant and any $\delta$ works. Otherwise, given $\varepsilon > 0$, choose $\delta = \varepsilon/K$. Then $|x - y| < \delta$ implies $|f(x) - f(y)| \leq K|x - y| < K \cdot \varepsilon/K = \varepsilon$. This $\delta$ depends only on $\varepsilon$ (not on $x$ or $y$), so $f$ is uniformly continuous.

Uniform continuity $\Rightarrow$ continuity. Immediate: fix $a \in D$ and apply the uniform-continuity condition with $y = a$; the same $\delta$ in particular works at the single point $a$.

Counterexample for the first converse: $f(x) = \sqrt{x}$ on $[0, 1]$ is uniformly continuous (continuous on a compact set — Heine-Cantor, Theorem 5 below) but not Lipschitz: $|f(x) - f(0)|/|x - 0| = 1/\sqrt{x} \to \infty$ as $x \to 0^+$.

Counterexample for the second converse: $f(x) = x^2$ on $\mathbb{R}$ is continuous but not uniformly continuous. For any $\delta > 0$, take $x = 1/\delta$ and $y = x + \delta/2$. Then $|x - y| = \delta/2 < \delta$ but $|f(x) - f(y)| = |x^2 - y^2| = |x - y| \cdot |x + y| > (\delta/2)(2/\delta) = 1$. So $\varepsilon = 1$ has no valid $\delta$. $\square$
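Both counterexamples show up directly in secant slopes (a numerical sketch): for $\sqrt{x}$ the slope from 0 blows up, and for $x^2$ the slope over a fixed-width interval grows without bound as $x$ moves right.

```python
import numpy as np

h = 10.0 ** -np.arange(2, 13, 2)              # 1e-2, 1e-4, ..., 1e-12
sqrt_slopes = np.sqrt(h) / h                  # |sqrt(h) - sqrt(0)| / |h - 0| = 1/sqrt(h)
assert np.all(np.diff(sqrt_slopes) > 0)       # slopes keep growing: no finite K
assert sqrt_slopes[-1] > 1e5

x = np.array([1.0, 10.0, 100.0, 1000.0])
sq_slopes = ((x + 0.01) ** 2 - x**2) / 0.01   # ~ 2x: same input gap, growing output gap
assert np.all(np.diff(sq_slopes) > 0)         # no single delta serves all points
assert sq_slopes[-1] > 1000.0
```

An unbounded secant slope near a point kills Lipschitz continuity; an unbounded slope across the domain kills uniform continuity.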

🔷 Theorem 5 (Heine-Cantor Theorem)

If $f$ is continuous on a compact set $K$ (closed and bounded in $\mathbb{R}$), then $f$ is uniformly continuous on $K$.

💡 Remark 5 (Compactness Upgrades Continuity)

The Heine-Cantor theorem explains why compactness is so important for optimization. On compact domains, continuity automatically gives us uniform continuity — uniform control over function behavior everywhere. This is the machinery behind many existence and approximation theorems in analysis and ML, and we develop it fully in Completeness & Compactness.

Lipschitz continuity

Drag the point along the function to see the Lipschitz cone follow. Adjust $K$ to see how the constraint tightens. For $\sqrt{x}$, notice how the cone fails near $x = 0$ — the slope blows up, so no finite $K$ works.


Connections to ML

Activation function continuity

The continuity properties of activation functions determine whether gradient-based training is possible at all.

  • ReLU $\sigma(x) = \max(0, x)$: Continuous everywhere (both one-sided limits agree at 0). Lipschitz with $K = 1$. Not differentiable at 0, but a subgradient exists.
  • Sigmoid $\sigma(x) = 1/(1 + e^{-x})$: Smooth (infinitely differentiable). Lipschitz with $K = 1/4$ (maximum slope at $x = 0$).
  • Tanh $\sigma(x) = \tanh(x)$: Smooth. Lipschitz with $K = 1$.
  • GELU $\sigma(x) = x \cdot \Phi(x)$: Smooth. Lipschitz with $K \approx 1.13$.
  • Step function $\sigma(x) = \mathbf{1}_{x > 0}$: Discontinuous at 0 (jump). Cannot be trained with gradient descent — its gradient is zero almost everywhere.

The step function illustrates why continuity matters: its discontinuity makes gradient-based optimization impossible, which is why smooth approximations (sigmoid, softmax) replaced it in neural network training. See formalML Gradient Descent for the full convergence analysis.
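The stated Lipschitz constants for sigmoid and tanh can be spot-checked from secant slopes (a numerical sketch; the true constants are the maximum values of the derivatives, attained at 0):

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 200001)

sig = 1.0 / (1.0 + np.exp(-x))
sig_K = np.abs(np.diff(sig) / np.diff(x)).max()
assert abs(sig_K - 0.25) < 1e-4               # sigmoid: K = 1/4

tanh_K = np.abs(np.diff(np.tanh(x)) / np.diff(x)).max()
assert abs(tanh_K - 1.0) < 1e-4               # tanh: K = 1
```

Secant slopes of a differentiable function never exceed the maximum derivative, so the largest observed slope converges to the Lipschitz constant as the grid refines.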

Activation functions

Loss landscape smoothness

Different loss functions have different continuity properties, which determines which optimization methods apply:

  • MSE $\ell(y, \hat{y}) = (y - \hat{y})^2$: Smooth, Lipschitz gradient. Gradient descent converges.
  • Hinge $\ell(y, \hat{y}) = \max(0, 1 - y\hat{y})$: Continuous but not differentiable at $y\hat{y} = 1$. Subgradient methods work.
  • Cross-entropy $\ell(p) = -\log(p)$: Smooth on $(0, 1)$ but singular at $p = 0$ (loss $\to \infty$). Requires clipping or label smoothing.
  • 0-1 loss $\ell(y, \hat{y}) = \mathbf{1}_{y \neq \hat{y}}$: Discontinuous. Not trainable with gradients — this is why we use continuous surrogates.

See formalML Convex Analysis for the full treatment of loss function properties.

Loss landscape continuity

Lipschitz constraints in GANs

The Wasserstein GAN (WGAN) critic $D$ must be 1-Lipschitz for the Wasserstein distance $W_1(P_r, P_g) = \sup_{\|D\|_L \leq 1} \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{x \sim P_g}[D(x)]$ to be well-defined (Kantorovich-Rubinstein duality). The original paper enforced this via weight clipping: $w \in [-c, c]$. Spectral normalization (Miyato et al., 2018) provides a tighter enforcement by normalizing each weight matrix $W$ by its spectral norm $\sigma(W)$:

$$\bar{W} = W/\sigma(W)$$

This ensures the Lipschitz constant of the network is at most 1, since the Lipschitz constant of a composition is bounded by the product of the layer-wise constants. The trade-off: smaller $K$ means a more constrained (less expressive) critic, but more stable training.
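A minimal sketch of the normalization step (the paper estimates $\sigma(W)$ with a cheap power iteration during training; here we use the exact SVD-based norm for clarity):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))                 # illustrative random layer weights

sigma_W = np.linalg.norm(W, 2)                # spectral norm = largest singular value
W_bar = W / sigma_W
assert np.isclose(np.linalg.norm(W_bar, 2), 1.0)

# The normalized linear layer x -> W_bar x is 1-Lipschitz:
for _ in range(100):
    x, y = rng.normal(size=32), rng.normal(size=32)
    assert np.linalg.norm(W_bar @ x - W_bar @ y) <= np.linalg.norm(x - y) + 1e-9
```

Stacking such layers with 1-Lipschitz activations keeps the product-of-constants bound at 1 for the whole critic.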

IVT and decision boundaries

If a continuous classifier $f: \mathbb{R}^d \to \mathbb{R}$ assigns a positive score to one data point and a negative score to another, the IVT guarantees that along any continuous path between them, there exists a point where $f = 0$ — a decision boundary. Continuous classifiers therefore always have well-defined boundaries separating oppositely classified points; discontinuous classifiers can jump between classes with no boundary at all. See formalML PAC Learning for how continuity assumptions on hypothesis classes affect learnability.

Computational Notes

Evaluating limits numerically

import numpy as np

def numerical_limit(f, a, n_steps=20):
    """Evaluate lim_{x->a} f(x) by approaching from both sides."""
    h_values = [10**(-k * 0.5) for k in range(1, n_steps + 1)]
    left = [f(a - h) for h in h_values]
    right = [f(a + h) for h in h_values]
    print(f"Left:  {left[-1]:.12f}")
    print(f"Right: {right[-1]:.12f}")
    return (left[-1] + right[-1]) / 2

# Example: lim_{x->0} sin(x)/x = 1
numerical_limit(lambda x: np.sin(x) / x, 0)

Estimating Lipschitz constants

def estimate_lipschitz(f, a, b, n=10000):
    """Estimate the Lipschitz constant of f on [a, b]."""
    x = np.linspace(a, b, n)
    fx = f(x)
    max_ratio = 0
    for i in range(n - 1):
        ratio = abs(fx[i+1] - fx[i]) / abs(x[i+1] - x[i])
        max_ratio = max(max_ratio, ratio)
    return max_ratio

# sin(x) on [-pi, pi]: K should be close to 1
K = estimate_lipschitz(np.sin, -np.pi, np.pi)
print(f"Lipschitz constant of sin(x): {K:.4f}")

Bisection root-finding

def bisection(f, a, b, tol=1e-10, max_iter=100):
    """Find root of f in [a, b] using the bisection method (IVT)."""
    assert f(a) * f(b) < 0, "f must change sign on [a, b]"
    for i in range(max_iter):
        mid = (a + b) / 2
        if abs(f(mid)) < tol or (b - a) / 2 < tol:
            return mid, i + 1
        if f(a) * f(mid) < 0:
            b = mid
        else:
            a = mid
    return (a + b) / 2, max_iter

# Find root of x^3 - 2x - 2 in [0, 3]
root, iters = bisection(lambda x: x**3 - 2*x - 2, 0, 3)
print(f"Root: {root:.10f} (found in {iters} iterations)")

Connections & Further Reading

Where this leads within formalCalculus

  • Sequences, Limits & Convergence — the prerequisite. The ε-N definition that this topic extends, and the Bolzano-Weierstrass Theorem that powers the EVT proof.
  • Completeness & Compactness — reproves EVT and Heine-Cantor as consequences of compactness, revealing the structural reason these theorems hold.
  • Uniform Convergence — the Uniform Limit Theorem uses ε-δ continuity to prove that uniform convergence preserves continuity, and equicontinuity generalizes uniform continuity to families of functions.
  • The Derivative & Chain Rule — differentiability implies continuity (but not conversely — ReLU is continuous but not differentiable at 0). The derivative is itself a limit.
  • The Riemann Integral & FTC — continuous functions on $[a, b]$ are Riemann integrable. Continuity is the “easy” sufficient condition.
  • Mean Value Theorem & Taylor Expansion — MVT requires continuity on $[a, b]$ and differentiability on $(a, b)$.

Where this leads in machine learning

  • formalML Convex Analysis — continuity of convex functions, semicontinuity, Lipschitz gradient conditions.
  • formalML Gradient Descent — the Lipschitz gradient assumption controls convergence rates.
  • formalML PAC Learning — continuous hypothesis classes and uniform convergence of empirical processes.

References

  1. Abbott (2015). Understanding Analysis. Chapter 4 develops continuity with the ε-δ definition and proves IVT and EVT — the primary reference for our exposition.
  2. Rudin (1976). Principles of Mathematical Analysis. Chapter 4 on continuity — the definitive compact treatment of uniform continuity and compactness.
  3. Tao (2016). Analysis I. Chapter 9 on continuous functions and Chapter 13 on uniform continuity — first-principles construction with careful scaffolding.
  4. Boyd & Vandenberghe (2004). Convex Optimization. Section 2.2 on convex function continuity, Section 9.3 on Lipschitz gradient conditions for convergence.
  5. Arjovsky, Chintala & Bottou (2017). “Wasserstein Generative Adversarial Networks.” The 1-Lipschitz constraint on the WGAN critic is a direct application of Lipschitz continuity to generative modeling.
  6. Miyato, Kataoka, Koyama & Yoshida (2018). “Spectral Normalization for Generative Adversarial Networks.” Spectral normalization enforces the Lipschitz constraint by normalizing weight matrices by their spectral norm.