Limits & Continuity · intermediate · 45 min read

Uniform Convergence

When do limits of functions inherit the properties of the functions themselves? The sup-norm, the ε/3 trick, the Weierstrass M-test, Arzelà-Ascoli, and the uniform convergence framework that makes PAC learning possible

Abstract. A sequence of functions (f_n) converges pointwise to f on D if for each x in D, f_n(x) converges to f(x) as n tends to infinity. But pointwise convergence is deceptively weak: the limit of continuous functions can be discontinuous (f_n(x) = x^n on [0,1] converges pointwise to a step function), integration and limits cannot be interchanged (a moving bump with constant integral can converge pointwise to zero), and derivatives of the limit may bear no relation to the limits of the derivatives. Uniform convergence fixes these pathologies by requiring that the convergence rate be independent of x: (f_n) converges uniformly to f if for every epsilon > 0, there exists N such that |f_n(x) - f(x)| < epsilon for ALL x in D simultaneously — equivalently, the sup-norm ||f_n - f||_infinity tends to zero. The Uniform Limit Theorem guarantees that the uniform limit of continuous functions is continuous (the epsilon/3 argument), and uniform convergence permits interchange of limits with Riemann integration. The Weierstrass M-test provides a practical criterion: if |g_k(x)| <= M_k for all x and the sum of M_k converges, then the series of g_k converges uniformly. The Arzela-Ascoli theorem characterizes compact subsets of C([a,b]) via equicontinuity and pointwise boundedness — it is the compactness theorem for function spaces, extending the Heine-Borel theorem from R^n to infinite-dimensional spaces. In machine learning, uniform convergence is the mathematical backbone of generalization theory. PAC learning guarantees that the empirical risk R_hat(h) converges uniformly to the true risk R(h) over a hypothesis class H, and the VC dimension controls the rate of this uniform convergence. The Glivenko-Cantelli theorem — the empirical CDF converges uniformly to the true CDF — is the foundational uniform convergence result in statistics.

Where this leads → formalML

  • formalML PAC learning is fundamentally a uniform convergence result: the empirical risk R_hat_n(h) converges uniformly to the true risk R(h) over the hypothesis class H as the sample size n grows. The VC dimension quantifies the complexity of H and controls the rate of uniform convergence — larger VC dimension means slower convergence and more data required for generalization.
  • formalML Uniform convergence bounds like the Glivenko-Cantelli theorem and the Dvoretzky-Kiefer-Wolfowitz inequality are concentration inequalities applied uniformly over function classes. The DKW inequality bounds ||F_hat_n - F||_infinity — the sup-norm distance between empirical and true CDFs — at rate O(1/sqrt(n)).
  • formalML Modes of convergence in probability — almost sure convergence, convergence in probability, convergence in distribution — extend the pointwise/uniform distinction from deterministic function sequences to sequences of random variables. Uniform integrability, the probabilistic analog of equicontinuity, controls interchange of limits and expectations.

Overview & Motivation

You’re training a neural network with increasing width: $f_1, f_2, f_4, f_8, \ldots$, where $f_n$ has $n$ neurons per hidden layer. Each $f_n$ is a continuous function — a composition of affine maps and continuous activations like ReLU or sigmoid. As $n \to \infty$, does $f_n$ converge to some function $f$? And if so, is $f$ still continuous? Can you integrate $f$ by integrating the $f_n$? Can you differentiate $f$ by differentiating the $f_n$?

The answers depend on how $f_n$ converges — and the wrong notion of convergence gives the wrong answers to all three questions.

In Sequences, Limits & Convergence, we studied convergence of sequences of numbers: given a sequence $a_1, a_2, a_3, \ldots$ in $\mathbb{R}$, does $a_n \to L$? In Epsilon-Delta & Continuity, we studied continuity of individual functions: is $f$ continuous at $a$? In Completeness & Compactness, we studied structural properties that guarantee limits exist and optima are attained. Now we combine all three: when does a sequence of functions converge, and when does the limit inherit the properties of the functions themselves?

The answer will require a new notion of convergence — uniform convergence — that is strictly stronger than the naive “pointwise” approach. The distinction between these two notions is not a technicality. It is the mathematical backbone of generalization theory in machine learning.

Pointwise Convergence of Function Sequences

The simplest way to define convergence of a function sequence is to check convergence at each point separately.

📐 Definition 1 (Pointwise Convergence of Function Sequences)

A sequence of functions $(f_n)$ with $f_n: D \to \mathbb{R}$ converges pointwise to $f: D \to \mathbb{R}$ if for every $x \in D$,

$$\lim_{n \to \infty} f_n(x) = f(x).$$

Equivalently, for every $x \in D$ and every $\varepsilon > 0$, there exists $N \in \mathbb{N}$ (which may depend on both $x$ and $\varepsilon$) such that

$$n \geq N \implies |f_n(x) - f(x)| < \varepsilon.$$

The crucial phrase is “which may depend on both $x$ and $\varepsilon$.” We write $N = N(x, \varepsilon)$ to emphasize this dependence. At each fixed $x$, we’re just asking for ordinary sequence convergence — the $\varepsilon$-$N$ definition from Sequences, Limits & Convergence. But the $N$ required might be very different at different points $x$.

The canonical example makes this concrete:

📝 Example 1

Consider $f_n(x) = x^n$ on $[0, 1]$. The pointwise limit is

$$f(x) = \begin{cases} 0 & \text{if } x \in [0, 1) \\ 1 & \text{if } x = 1 \end{cases}$$

Proof. For $x \in [0, 1)$, we have $|x| < 1$, so $x^n \to 0$ as $n \to \infty$ (geometric sequence with ratio $|r| < 1$). For $x = 1$, we have $1^n = 1$ for all $n$. $\blacksquare$

Notice what happened: each $f_n(x) = x^n$ is continuous on $[0, 1]$, but the pointwise limit $f$ has a jump discontinuity at $x = 1$. The limit of continuous functions is not continuous.

A second example where things go right: $g_n(x) = x^n / n$ on $[0, 1]$ also converges pointwise to $f \equiv 0$ (the same limit for $x \in [0,1)$; at $x = 1$, $g_n(1) = 1/n \to 0$). And this time the limit is continuous. What’s different?

Pointwise vs. uniform convergence

The difference is the convergence rate. For $f_n(x) = x^n$, points near $x = 1$ require enormous $N$ to get close to the limit — $N(x, \varepsilon) = \lceil \log \varepsilon / \log x \rceil \to \infty$ as $x \to 1^-$. For $g_n(x) = x^n/n$, the sup-norm $\|g_n\|_\infty = 1/n$ is independent of $x$, so a single $N$ works everywhere.
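This rate blow-up is easy to see numerically. A minimal sketch (the helper `N_pointwise` is ours, not standard notation): for $f_n(x) = x^n$ and a fixed tolerance, the smallest admissible $N$ explodes as $x \to 1^-$, while for $g_n(x) = x^n/n$ one choice of $N$ works at every $x$ simultaneously.

```python
import math

def N_pointwise(x, eps):
    """Smallest n with x**n < eps for 0 < x < 1: ceil(log eps / log x)."""
    return math.ceil(math.log(eps) / math.log(x))

eps = 0.01
for x in [0.5, 0.9, 0.99, 0.999]:
    print(f"x = {x}: need n >= {N_pointwise(x, eps)} to get x^n < {eps}")

# For g_n(x) = x^n / n we have ||g_n||_inf = 1/n, so N = ceil(1/eps) = 100
# works for ALL x in [0, 1] at once -- that is uniform convergence.
print("uniform N for g_n:", math.ceil(1 / eps))
```

The printed $N$ values grow without bound as $x$ approaches $1$, which is exactly why no single $N$ can serve the whole interval for $x^n$.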

The Three Pathologies of Pointwise Convergence

Pointwise convergence fails to preserve the three fundamental properties of functions:

1. Continuity failure. We’ve just seen this: $f_n(x) = x^n$ is continuous for every $n$, but $f_n \to f$ pointwise where $f$ is discontinuous. The limit of continuous functions need not be continuous under pointwise convergence.

2. Integration failure. Consider the “moving bump” sequence $f_n(x) = n^2 x (1 - x)^n$ on $[0, 1]$. Each $f_n$ has a bump near $x = 0$ that grows taller and narrower. One can verify that $\int_0^1 f_n(x)\,dx = \frac{n^2}{(n+1)(n+2)} \to 1$ as $n \to \infty$. But $f_n(x) \to 0$ for each fixed $x > 0$ (and $f_n(0) = 0$), so the pointwise limit is $f \equiv 0$, and $\int_0^1 f(x)\,dx = 0$. We get

$$\lim_{n \to \infty} \int_0^1 f_n(x)\,dx = 1 \neq 0 = \int_0^1 \left(\lim_{n \to \infty} f_n(x)\right) dx.$$

The limit and integral do not commute.

3. Differentiation failure. Let $f_n(x) = \frac{\sin(nx)}{\sqrt{n}}$ on $\mathbb{R}$. Then $|f_n(x)| \leq 1/\sqrt{n} \to 0$ uniformly, so $f_n \to 0$ everywhere. But $f_n'(x) = \sqrt{n}\cos(nx)$, and $|f_n'(0)| = \sqrt{n} \to \infty$. The derivatives of the $f_n$ don’t converge at all, even though the $f_n$ themselves converge nicely.

The three pathologies of pointwise convergence
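The third pathology is easy to observe numerically. A quick grid-evaluation sketch (not a proof): the sup-norm of $f_n(x) = \sin(nx)/\sqrt{n}$ shrinks like $1/\sqrt{n}$ while the sup-norm of its derivative grows like $\sqrt{n}$.

```python
import numpy as np

# f_n(x) = sin(nx)/sqrt(n) -> 0 uniformly, but f_n'(x) = sqrt(n) cos(nx) diverges.
x = np.linspace(-np.pi, np.pi, 2001)   # grid includes x = 0, where |f_n'| = sqrt(n)
for n in [1, 10, 100, 1000]:
    fn = np.sin(n * x) / np.sqrt(n)
    dfn = np.sqrt(n) * np.cos(n * x)
    print(f"n={n:4d}: ||f_n||_inf ~ {np.max(np.abs(fn)):.4f}, "
          f"||f_n'||_inf ~ {np.max(np.abs(dfn)):.1f}")
```

The two columns move in opposite directions: uniform convergence of the functions coexists with total divergence of the derivatives.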

💡 Remark 1

These three failures are not pathological edge cases. They arise naturally whenever convergence is “fast in some places and slow in others,” which is typical in machine learning: a neural network may fit training data well in dense regions and poorly in sparse regions, so the convergence of $f_n$ to $f$ is inherently non-uniform across the input domain. The distinction between pointwise and uniform convergence is the mathematical formalization of this problem.

Uniform Convergence — The Fix

If we require the convergence rate to be the same at every $x$ — that is, the $N$ depends only on $\varepsilon$, not on $x$ — then all three pathologies disappear.

📐 Definition 2 (Uniform Convergence)

A sequence of functions $(f_n)$ converges uniformly to $f$ on $D$ if for every $\varepsilon > 0$, there exists $N \in \mathbb{N}$ such that

$$n \geq N \implies |f_n(x) - f(x)| < \varepsilon \quad \text{for all } x \in D \text{ simultaneously.}$$

Equivalently, $\|f_n - f\|_\infty = \sup_{x \in D} |f_n(x) - f(x)| \to 0$ as $n \to \infty$.

The geometric picture is the ε-band characterization: $f_n \to f$ uniformly if and only if for every $\varepsilon > 0$, eventually the entire graph of $f_n$ lies within an $\varepsilon$-band around the graph of $f$. Not just at each point separately, but everywhere at once.

📐 Definition 3 (Sup-Norm / Uniform Metric)

For bounded functions $f, g: D \to \mathbb{R}$, the sup-norm (or uniform norm) is

$$\|f - g\|_\infty = \sup_{x \in D} |f(x) - g(x)|.$$

This is a genuine metric on the space of bounded functions on $D$. Uniform convergence is convergence in this metric.

The comparison is now sharp:

  • Pointwise: “for each $x$, the vertical distance $|f_n(x) - f(x)| \to 0$.”
  • Uniform: “the maximum vertical distance over all $x$ simultaneously $\to 0$.”

The ε-band characterization

🔷 Proposition 1 (Uniform ⟹ Pointwise)

If $f_n \to f$ uniformly on $D$, then $f_n \to f$ pointwise on $D$. The converse is false.

Proof.

If $\|f_n - f\|_\infty < \varepsilon$ for $n \geq N$, then in particular $|f_n(x) - f(x)| \leq \|f_n - f\|_\infty < \varepsilon$ for each fixed $x \in D$, so $f_n(x) \to f(x)$.

The converse fails: $f_n(x) = x^n$ on $[0, 1]$ converges pointwise but not uniformly — the sup-norm $\sup_{x \in [0,1)} |x^n - 0|$ does not tend to $0$, because for any $n$ we can find $x$ close to $1$ with $x^n$ close to $1$. $\blacksquare$

📝 Example 2

$g_n(x) = x^n / n$ on $[0, 1]$. Then $\|g_n\|_\infty = \sup_{x \in [0,1]} x^n / n = 1/n \to 0$, so $g_n \to 0$ uniformly. Compare with $f_n(x) = x^n$, where $\sup_{x \in [0,1)} x^n$ approaches $1$, not $0$.

Use the explorer below to feel the difference. Drag the $n$ slider and watch whether the function’s graph stays inside the ε-band. The moment when $x^n$ escapes the band near $x = 1$ — but $x^n/n$ stays inside — is the difference between pointwise and uniform convergence.

For $x^n$, $\|f_n - f\|_\infty \to 1$ (never $\to 0$): the graph escapes the ε-band near $x = 1$ at every $n$ — pointwise only, not uniform. Analogous to an overfit model converging fast in-distribution but failing at the boundary.

The Uniform Limit Theorem

The most important theorem in this topic: uniform convergence preserves continuity.

🔷 Theorem 1 (Uniform Limit Theorem)

If each $f_n: D \to \mathbb{R}$ is continuous at $a \in D$ and $f_n \to f$ uniformly on $D$, then $f$ is continuous at $a$.

Proof.

Fix $\varepsilon > 0$. We need $\delta > 0$ such that $|x - a| < \delta \implies |f(x) - f(a)| < \varepsilon$. Apply the triangle inequality:

$$|f(x) - f(a)| \leq |f(x) - f_n(x)| + |f_n(x) - f_n(a)| + |f_n(a) - f(a)|.$$

Step 1 (Uniform convergence). Choose $N$ such that $\|f_N - f\|_\infty < \varepsilon/3$. Then $|f(x) - f_N(x)| < \varepsilon/3$ for all $x$, and $|f_N(a) - f(a)| < \varepsilon/3$.

Step 2 (Continuity of $f_N$). Since $f_N$ is continuous at $a$, choose $\delta > 0$ such that $|x - a| < \delta \implies |f_N(x) - f_N(a)| < \varepsilon/3$.

Step 3 (Combine). For $|x - a| < \delta$:

$$|f(x) - f(a)| < \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \varepsilon. \quad \blacksquare$$

The key insight: uniform convergence provides a single $N$ that simultaneously controls the first and third terms for all $x$. Pointwise convergence would give a different $N$ for different $x$, and we couldn’t fix $N$ before choosing $\delta$.

💡 Remark 2

The ε/3 trick is a paradigm: when bounding $|f(x) - f(a)|$ through an intermediary $f_n$, split the triangle inequality into three terms and allocate $\varepsilon/3$ to each. This proof strategy recurs throughout analysis — and in ML, it appears in decomposing generalization error into approximation error + estimation error + optimization error, each bounded by $\varepsilon/3$.

The Uniform Limit Theorem

Click on the graph to probe continuity at a point

Uniform convergence (x^n/n → 0): the limit is continuous. The ε/3 argument works because one N controls all x.

Interchange Theorems — Integration and Differentiation

Uniform convergence enables two crucial interchange theorems, but with different hypotheses.

🔷 Theorem 2 (Interchange of Limit and Integral)

If $f_n \to f$ uniformly on $[a, b]$ and each $f_n$ is Riemann integrable on $[a, b]$, then $f$ is Riemann integrable and

$$\lim_{n \to \infty} \int_a^b f_n(x)\, dx = \int_a^b f(x)\, dx.$$

Proof.

$$\left|\int_a^b f_n(x)\, dx - \int_a^b f(x)\, dx\right| = \left|\int_a^b [f_n(x) - f(x)]\, dx\right| \leq \int_a^b |f_n(x) - f(x)|\, dx \leq \|f_n - f\|_\infty \cdot (b - a).$$

Since $\|f_n - f\|_\infty \to 0$ by uniform convergence, the right side tends to $0$. $\blacksquare$

The proof is clean and direct — the sup-norm bound converts a pointwise inequality into a uniform one, and the integral of a uniform bound is just the bound times the interval length.

📝 Example 3

The interchange fails for pointwise convergence. Our “moving bump” sequence has $\int_0^1 f_n \to 1$ but $f_n \to 0$ pointwise, so $\int_0^1 (\lim f_n) = 0 \neq 1 = \lim \left(\int_0^1 f_n\right)$.

Interchange of limit and integral
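The failure is easy to check numerically. A small sketch using `scipy.integrate.quad` on the same moving-bump sequence (complementing the uniform case verified in the Computational Notes below):

```python
import numpy as np
from scipy import integrate

def bump(x, n):
    """Moving bump f_n(x) = n^2 x (1 - x)^n."""
    return n**2 * x * (1 - x)**n

for n in [10, 50, 100]:
    integral, _ = integrate.quad(bump, 0, 1, args=(n,))
    exact = n**2 / ((n + 1) * (n + 2))
    # The integrals tend to 1, yet every fixed pointwise value tends to 0.
    print(f"n={n:3d}: ∫f_n = {integral:.6f} (exact {exact:.6f}), "
          f"f_n(0.5) = {bump(0.5, n):.2e}")
```

The integral column climbs toward $1$ while the pointwise column collapses to $0$: the limit and the integral genuinely disagree.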

The differentiation interchange is harder and requires a stronger hypothesis:

🔷 Theorem 3 (Interchange of Limit and Derivative)

If $(f_n)$ is a sequence of differentiable functions on $[a, b]$ such that:

(i) $f_n(x_0) \to L$ for at least one $x_0 \in [a, b]$, and

(ii) $f_n' \to g$ uniformly on $[a, b]$,

then $f_n \to f$ uniformly for some differentiable $f$, and $f' = g = \lim f_n'$.

💡 Remark 3

The differentiation interchange requires uniform convergence of the derivatives, not of the functions themselves. This asymmetry is fundamental: integration is a smoothing operation (it improves convergence), while differentiation is a roughening operation (it can destroy convergence). In ML terms: backpropagation computes derivatives, so uniform convergence of loss landscapes $\mathcal{L}_n(\theta) \to \mathcal{L}(\theta)$ does not automatically imply well-behaved gradients $\nabla \mathcal{L}_n \to \nabla \mathcal{L}$.

The Cauchy Criterion for Uniform Convergence

Just as numerical sequences have a Cauchy criterion that certifies convergence without knowing the limit (Sequences, Limits & Convergence, Definition 6), function sequences have an analogous criterion.

🔷 Proposition 2 (Cauchy Criterion for Uniform Convergence)

A sequence $(f_n)$ converges uniformly on $D$ if and only if for every $\varepsilon > 0$, there exists $N$ such that

$$m, n \geq N \implies \|f_m - f_n\|_\infty = \sup_{x \in D} |f_m(x) - f_n(x)| < \varepsilon.$$

Proof.

(⟹) If $f_n \to f$ uniformly, then $\|f_m - f_n\|_\infty \leq \|f_m - f\|_\infty + \|f_n - f\|_\infty$. Given $\varepsilon > 0$, choose $N$ such that $\|f_k - f\|_\infty < \varepsilon/2$ for $k \geq N$. Then $\|f_m - f_n\|_\infty < \varepsilon$ for $m, n \geq N$.

(⟸) For each fixed $x$, the numerical sequence $(f_n(x))$ satisfies $|f_m(x) - f_n(x)| \leq \|f_m - f_n\|_\infty < \varepsilon$, so $(f_n(x))$ is Cauchy in $\mathbb{R}$. By completeness of $\mathbb{R}$ (from Completeness & Compactness), $f_n(x) \to f(x)$ for some value $f(x)$.

To show the convergence is uniform: fix $\varepsilon > 0$ and choose $N$ from the hypothesis. For any $m, n \geq N$ and any $x$, $|f_m(x) - f_n(x)| < \varepsilon$. Taking $m \to \infty$ gives $|f(x) - f_n(x)| \leq \varepsilon$ for all $x \in D$, which is $\|f - f_n\|_\infty \leq \varepsilon$. $\blacksquare$

💡 Remark 4

The Cauchy criterion shows that $C([a,b])$ with the sup-norm is a complete space — every uniformly Cauchy sequence of continuous functions converges uniformly to a continuous function. This completeness is the function-space analog of the completeness of $\mathbb{R}$ from Completeness & Compactness. It will be formalized as a Banach space structure in Normed & Banach Spaces.

Cauchy criterion for function sequences
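The criterion is checkable without knowing the limit. A grid-based sketch: for the uniformly convergent $g_n(x) = x^n/n$ the Cauchy gaps $\|g_n - g_{2n}\|_\infty$ die out, while for $f_n(x) = x^n$ the gap $\|f_n - f_{2n}\|_\infty$ stays near $1/4$ (substitute $t = x^n$ and maximize $t - t^2$).

```python
import numpy as np

x = np.linspace(0, 1, 10001)
g = lambda n: x**n / n    # uniformly convergent: gaps shrink
f = lambda n: x**n        # pointwise only: gaps stuck near 1/4

for n in [10, 100, 500]:
    gap_g = np.max(np.abs(g(n) - g(2 * n)))
    gap_f = np.max(np.abs(f(n) - f(2 * n)))
    print(f"n={n:3d}: ||g_n - g_2n||_inf = {gap_g:.5f}, "
          f"||f_n - f_2n||_inf = {gap_f:.5f}")
```

The stuck gap for $f_n$ is the Cauchy criterion failing, with no mention of the limit function at all.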

Series of Functions and the Weierstrass M-Test

A series of functions $\sum_{k=1}^\infty g_k(x)$ is just a sequence of partial sums. The Weierstrass M-test provides a practical, checkable criterion for uniform convergence.

📐 Definition 4 (Uniform Convergence of a Series)

A series $\sum_{k=1}^\infty g_k(x)$ converges uniformly on $D$ if the sequence of partial sums $S_n(x) = \sum_{k=1}^n g_k(x)$ converges uniformly on $D$.

🔷 Theorem 4 (Weierstrass M-Test)

Let $g_k: D \to \mathbb{R}$ and suppose there exist constants $M_k \geq 0$ such that $|g_k(x)| \leq M_k$ for all $x \in D$ and all $k$. If $\sum_{k=1}^\infty M_k$ converges, then $\sum_{k=1}^\infty g_k$ converges uniformly (and absolutely) on $D$.

Proof.

For $m > n$:

$$\|S_m - S_n\|_\infty = \sup_{x \in D} \left|\sum_{k=n+1}^m g_k(x)\right| \leq \sum_{k=n+1}^m \sup_{x \in D} |g_k(x)| \leq \sum_{k=n+1}^m M_k.$$

Since $\sum M_k$ converges, its tail $\sum_{k=n+1}^\infty M_k \to 0$, so $(S_n)$ is uniformly Cauchy. By the Cauchy criterion (Proposition 2), $(S_n)$ converges uniformly. $\blacksquare$

📝 Example 4

The series $\sum_{k=1}^\infty \frac{\sin(kx)}{k^2}$ converges uniformly on $\mathbb{R}$. Take $M_k = 1/k^2$; since $|\sin(kx)| \leq 1$ and $\sum 1/k^2 = \pi^2/6 < \infty$, the M-test applies. Since each $\sin(kx)/k^2$ is continuous and the convergence is uniform, the Uniform Limit Theorem (applied to partial sums) guarantees the sum defines a continuous function.

The Weierstrass M-test

$M_k = 1/k^2$ ($\sum M_k = \pi^2/6$). Left: partial sums $S_n(x)$ vs. full sum. Right: $M_k$ bound bars (blue = included, gray = tail). The actual tail error $\|S - S_n\|_\infty$ stays below the M-test bound.

Equicontinuity and the Arzelà-Ascoli Theorem

The Heine-Borel theorem from Completeness & Compactness says that a subset of $\mathbb{R}^n$ is compact if and only if it is closed and bounded. What is the analog for subsets of the function space $C([a,b])$? Boundedness alone is not enough — we need a condition that controls how much the functions in the family can oscillate. That condition is equicontinuity.

📐 Definition 5 (Equicontinuity)

A family of functions $\mathcal{F} \subseteq C([a,b])$ is equicontinuous if for every $\varepsilon > 0$, there exists $\delta > 0$ such that

$$|x - y| < \delta \implies |f(x) - f(y)| < \varepsilon \quad \text{for ALL } f \in \mathcal{F}.$$

Compare this with uniform continuity from Epsilon-Delta & Continuity (Definition 9): uniform continuity says a single function has one $\delta$ that works at every point; equicontinuity says a family of functions has one $\delta$ that works for every function and every point simultaneously.
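A shared Lipschitz constant gives equicontinuity directly: if $|f'| \leq L$ for every member, then $\delta = \varepsilon/L$ works for the whole family at once. A numerical sketch with the illustrative family $f_c(x) = \sin(cx)$, $|c| \leq L$ (our choice of example, not a canonical one):

```python
import numpy as np

L, eps = 3.0, 0.1
delta = eps / L                       # ONE delta for every member of the family
cs = np.linspace(-L, L, 25)           # 25 family members f_c(x) = sin(c x)
x = np.linspace(0, 2 * np.pi, 2000)
shift = delta / 2                     # probe a pair x, y with |x - y| < delta

# Worst oscillation over the whole family; the MVT bound is L * |x - y| < eps.
worst = max(np.max(np.abs(np.sin(c * x) - np.sin(c * (x + shift)))) for c in cs)
print(f"delta = {delta:.4f}: sup over family of |f(x) - f(y)| = {worst:.4f} "
      f"< eps = {eps}")
```

One $\delta$, twenty-five functions, every point: that is equicontinuity in action.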

📐 Definition 6 (Pointwise Boundedness)

A family $\mathcal{F}$ is pointwise bounded if for each $x \in [a,b]$, $\sup_{f \in \mathcal{F}} |f(x)| < \infty$.

🔷 Theorem 5 (Arzelà-Ascoli Theorem)

A subset $\mathcal{F} \subseteq C([a,b])$ is relatively compact (every sequence in $\mathcal{F}$ has a uniformly convergent subsequence) if and only if $\mathcal{F}$ is equicontinuous and pointwise bounded.

Proof.

(⟸, main direction.) Let $(f_n) \subseteq \mathcal{F}$.

Step 1 (Diagonal extraction on the rationals). Let $\{r_1, r_2, \ldots\}$ be an enumeration of the rationals in $[a,b]$. By pointwise boundedness, $(f_n(r_1))$ is a bounded sequence in $\mathbb{R}$, so by the Bolzano-Weierstrass theorem (Sequences, Limits & Convergence, Theorem 4), we can extract a subsequence that converges at $r_1$. From that subsequence, extract a further subsequence converging at $r_2$. Continue. The diagonal subsequence — the $k$-th term of the $k$-th subsequence — converges at every rational $r_j$.

Step 2 (Extend from rationals to all of $[a,b]$). By equicontinuity, for any $\varepsilon > 0$, choose $\delta$ from the definition. For any $x \in [a,b]$, find a rational $r$ with $|x - r| < \delta$. Then for indices $n_k, n_j$ in our diagonal subsequence:

$$|f_{n_k}(x) - f_{n_j}(x)| \leq |f_{n_k}(x) - f_{n_k}(r)| + |f_{n_k}(r) - f_{n_j}(r)| + |f_{n_j}(r) - f_{n_j}(x)| < \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \varepsilon$$

for $k, j$ sufficiently large. The first and third terms use equicontinuity ($\delta$ works for all $f$); the middle term uses convergence at $r$.

Step 3 (Convergence is uniform). Since $[a,b]$ is compact, finitely many rationals $r^{(1)}, \ldots, r^{(m)}$ suffice so that every $x$ is within $\delta$ of one of them; choose $k, j$ large enough that the middle term is below $\varepsilon/3$ at all $m$ centers simultaneously. The bound above is then independent of $x$. This gives $\|f_{n_k} - f_{n_j}\|_\infty < \varepsilon$, so the diagonal subsequence is uniformly Cauchy and hence uniformly convergent.

(⟹, sketch.) If $\mathcal{F}$ is relatively compact, a finite $\varepsilon$-net argument shows equicontinuity: cover $\mathcal{F}$ with finitely many $\varepsilon/3$-balls, use the uniform continuity of each center function, and take the minimum $\delta$. Pointwise boundedness follows similarly. $\blacksquare$

💡 Remark 5

The Arzelà-Ascoli proof uses the same diagonal argument that appears in Cantor’s proof and in many ML contexts — extracting convergent subsequences from parameter sequences in optimization. The equicontinuity condition has a direct ML interpretation: a family of neural networks with bounded Lipschitz constants (achieved by spectral normalization or gradient clipping) is equicontinuous, and Arzelà-Ascoli guarantees a convergent subsequence — the function-space analog of compactness in parameter space.

The Arzelà-Ascoli theorem


Click on the graph to probe equicontinuity at a point. Blue curves = extracted subsequence. Dashed red = limit approximation.

Connections to ML

Uniform convergence is not merely a theoretical curiosity that happens to appear in machine learning. It is the mathematical foundation of generalization theory — the question of why models that perform well on training data also perform well on unseen data.

Uniform convergence of empirical processes

Given a hypothesis class $\mathcal{H}$ (e.g., the set of all neural networks with a fixed architecture), the empirical risk of a hypothesis $h \in \mathcal{H}$ on $n$ training examples is

$$\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^n \ell(h(x_i), y_i),$$

while the true risk is $R(h) = \mathbb{E}[\ell(h(X), Y)]$. The law of large numbers guarantees that for any fixed $h$, $\hat{R}_n(h) \to R(h)$ as $n \to \infty$. This is pointwise convergence — one hypothesis at a time.

But learning requires more: we need $\hat{R}_n(h)$ to be close to $R(h)$ for all $h \in \mathcal{H}$ simultaneously, so that the hypothesis $\hat{h}$ that minimizes $\hat{R}_n$ also approximately minimizes $R$. This is uniform convergence:

$$\sup_{h \in \mathcal{H}} \left|\hat{R}_n(h) - R(h)\right| \to 0.$$

This is exactly the sup-norm convergence $\|\hat{R}_n - R\|_\infty \to 0$ over the function class $\mathcal{H}$. → formalML PAC Learning
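Here is a toy numeric sketch of this distinction (threshold classifiers on $[0,1]$, our own construction): for $h_t(x) = \mathbf{1}\{x > t\}$ with labels $y = \mathbf{1}\{x > 0.5\}$, the true risk is $R(h_t) = |t - 0.5|$, and the sup-gap over all thresholds shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypotheses h_t(x) = 1{x > t}; data X ~ Uniform[0,1], labels y = 1{X > 0.5}.
# True risk R(h_t) = P(h_t(X) != y) = |t - 0.5| (the mislabeled mass).
thresholds = np.linspace(0, 1, 101)
true_risk = np.abs(thresholds - 0.5)

for n in [100, 1000, 10000]:
    X = rng.uniform(0, 1, n)
    y = X > 0.5
    preds = X[None, :] > thresholds[:, None]   # row i: predictions of h_{t_i}
    emp_risk = np.mean(preds != y, axis=1)     # R_hat_n(h_t) for every t at once
    sup_gap = np.max(np.abs(emp_risk - true_risk))
    print(f"n={n:5d}: sup_h |R_hat_n(h) - R(h)| = {sup_gap:.4f}")
```

The printed sup-gap is exactly the quantity in the display above, evaluated over a finite hypothesis grid; its decay with $n$ is the uniform convergence that makes empirical risk minimization work.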

The Glivenko-Cantelli theorem

The oldest and most fundamental uniform convergence result in statistics: the empirical CDF $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{X_i \leq x}$ converges uniformly to the true CDF $F(x)$:

$$\|\hat{F}_n - F\|_\infty = \sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F(x)| \to 0 \quad \text{almost surely.}$$

The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality gives the rate: $P\left(\|\hat{F}_n - F\|_\infty > t\right) \leq 2e^{-2nt^2}$, yielding $\|\hat{F}_n - F\|_\infty = O(1/\sqrt{n})$. The Kolmogorov-Smirnov test is built on this sup-norm. → formalML Concentration Inequalities

VC dimension and complexity control

The VC dimension $d$ of a hypothesis class $\mathcal{H}$ quantifies how “complex” $\mathcal{H}$ is — it is the largest number of points that $\mathcal{H}$ can shatter (classify in all $2^d$ possible ways). The fundamental theorem of statistical learning: $\mathcal{H}$ is PAC-learnable if and only if it has finite VC dimension. The generalization bound

$$\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| \leq O\left(\sqrt{\frac{d \log n}{n}}\right)$$

is a rate of uniform convergence — it tells you how fast the sup-norm between empirical and true risk shrinks with sample size. → formalML PAC Learning

Arzelà-Ascoli and neural network approximation

The universal approximation theorem says that neural networks are dense in $C([a,b])$ under the sup-norm: for any continuous $f$ and any $\varepsilon > 0$, there exists a neural network $g$ with $\|f - g\|_\infty < \varepsilon$. Arzelà-Ascoli constrains which families of networks are compact (and hence tractable): a family with bounded Lipschitz constants is equicontinuous, and equicontinuity + pointwise boundedness gives compactness. This is why Lipschitz constraints (spectral normalization, gradient penalty) appear in generative models — they enforce the compactness that makes optimization well-posed.

ML connections

Computational Notes

Uniform convergence can be verified computationally by evaluating the sup-norm on a dense grid.

Computing the sup-norm. Given a function sequence fnf_n and its limit ff, we approximate fnf\|f_n - f\|_\infty by sampling:

import numpy as np

def sup_norm(fn, f, domain, grid_size=1000):
    """Approximate ||fn - f||_∞ on domain."""
    x = np.linspace(domain[0], domain[1], grid_size)
    return np.max(np.abs(fn(x) - f(x)))

# x^n on [0, 1]: not uniform (sup norm → 1)
for n in [10, 50, 100, 500]:
    fn = lambda x, n=n: x**n
    f = lambda x: np.where(x < 1, 0.0, 1.0)
    print(f"n={n:3d}: ||f_n - f||_∞ = {sup_norm(fn, f, (0, 1)):.6f}")

# x^n/n on [0, 1]: uniform (sup norm = 1/n → 0)
for n in [10, 50, 100, 500]:
    gn = lambda x, n=n: x**n / n
    g = lambda x: np.zeros_like(x)
    print(f"n={n:3d}: ||g_n - g||_∞ = {sup_norm(gn, g, (0, 1)):.6f}")

Verifying the integration interchange. Compare fn\int f_n with f\int f:

from scipy import integrate

def check_interchange(fn, f, domain, n_values):
    """Compare lim ∫f_n with ∫(lim f_n)."""
    a, b = domain
    integral_f = integrate.quad(f, a, b)[0]
    for n in n_values:
        integral_fn = integrate.quad(lambda x: fn(x, n), a, b)[0]
        print(f"n={n:3d}: ∫f_n = {integral_fn:.6f}, ∫f = {integral_f:.6f}, "
              f"gap = {abs(integral_fn - integral_f):.6f}")

# Uniform case: gap → 0
check_interchange(
    fn=lambda x, n: x**n / n,
    f=lambda x: 0.0,
    domain=(0, 1),
    n_values=[5, 20, 100]
)

Implementing the Weierstrass M-test. Check the bound and compare with the actual tail error:

def weierstrass_m_test(term, Mk, domain, n_terms, max_k=200, grid=1000):
    """Apply the M-test: compare actual tail error with M-bound."""
    x = np.linspace(domain[0], domain[1], grid)
    partial = sum(term(x, k) for k in range(1, n_terms + 1))
    full = sum(term(x, k) for k in range(1, max_k + 1))
    actual_error = np.max(np.abs(full - partial))
    m_bound = sum(Mk(k) for k in range(n_terms + 1, max_k + 1))
    print(f"n={n_terms}: actual ||S-S_n||_∞ = {actual_error:.6f}, "
          f"M-bound = {m_bound:.6f}")

# ∑ sin(kx)/k²
weierstrass_m_test(
    term=lambda x, k: np.sin(k * x) / k**2,
    Mk=lambda k: 1 / k**2,
    domain=(0, 2 * np.pi),
    n_terms=10
)

Computing the Kolmogorov-Smirnov statistic. The sup-norm between empirical and true CDF:

def kolmogorov_smirnov(samples, cdf):
    """Compute ||F_hat_n - F||_∞, checking both sides of each ECDF jump."""
    n = len(samples)
    sorted_samples = np.sort(samples)
    cdf_vals = cdf(sorted_samples)
    d_plus = np.max(np.arange(1, n + 1) / n - cdf_vals)   # ECDF above F
    d_minus = np.max(cdf_vals - np.arange(0, n) / n)      # ECDF below F
    return max(d_plus, d_minus)

# Standard normal: Glivenko-Cantelli in action
from scipy.stats import norm
for n in [100, 1000, 10000]:
    samples = np.random.randn(n)
    ks = kolmogorov_smirnov(samples, norm.cdf)
    dkw_bound = np.sqrt(np.log(2 / 0.05) / (2 * n))
    print(f"n={n:5d}: KS = {ks:.4f}, DKW 95% bound = {dkw_bound:.4f}")

Connections & Further Reading

Within formalCalculus

  • Sequences, Limits & Convergence — the ε-N convergence framework for numerical sequences. Pointwise convergence of $f_n(x)$ at each fixed $x$ is literally a collection of numerical sequence convergence problems. The Cauchy criterion for uniform convergence (Proposition 2) is the function-space analog of the Cauchy sequence criterion. The Bolzano-Weierstrass theorem is the engine of the diagonal argument in the Arzelà-Ascoli proof.
  • Epsilon-Delta & Continuity — the ε-δ continuity framework used in the Uniform Limit Theorem proof (the ε/3 argument). Uniform continuity from Topic 2 is the single-function version of the equicontinuity concept. The Lipschitz hierarchy reappears: a family with bounded Lipschitz constants is equicontinuous.
  • Completeness & Compactness — the Heine-Borel theorem is the finite-dimensional analog of Arzelà-Ascoli. Equicontinuity + pointwise boundedness plays the role that closed + bounded plays in Heine-Borel. Completeness of $\mathbb{R}$ underlies the Cauchy criterion proof.
  • Series Convergence & Tests — the Weierstrass M-test applied to specific series; absolute vs. conditional convergence for function series.
  • Power Series & Taylor Series — power series converge uniformly on compact subsets of the interval of convergence; term-by-term differentiation and integration are direct applications of the interchange theorems.
  • Metric Spaces & Topology — $C([a,b])$ with the sup-norm is proved complete, and the Arzelà–Ascoli theorem is proved in full via the diagonal subsequence argument, closing both promises from this topic.
  • Normed & Banach Spaces — $C([a,b])$ is a Banach space (complete normed vector space); completeness is the Cauchy criterion for uniform convergence.
  • Approximation Theory & Convergence Rates — Stone-Weierstrass theorem (polynomials are dense in C([a,b])C([a,b]) under the sup-norm); rates of uniform approximation.

Forward to formalML

  • formalML PAC Learning — uniform convergence of empirical risk to true risk; VC dimension controls the rate.
  • formalML Concentration Inequalities — Glivenko-Cantelli, DKW inequality, uniform bounds over function classes.
  • formalML Measure-Theoretic Probability — modes of convergence in probability extend the pointwise/uniform distinction to random functions.

References

  1. book Abbott (2015). Understanding Analysis. Chapter 6 develops uniform convergence with the ε-band characterization and proves the Uniform Limit Theorem — the primary reference for our geometric-first approach.
  2. book Rudin (1976). Principles of Mathematical Analysis. Chapter 7 on sequences and series of functions — the definitive treatment of uniform convergence, the Weierstrass M-test, and equicontinuity.
  3. book Tao (2016). Analysis I. Chapter 14 on uniform convergence — a careful first-principles development distinguishing pointwise and uniform convergence at every step.
  4. book Kreyszig (1989). Introductory Functional Analysis with Applications. Chapter 1 on normed spaces and Chapter 5 on compact operators — the sup-norm as a genuine norm on C([a,b]) and Arzelà-Ascoli as a compactness criterion.
  5. book Shalev-Shwartz & Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Chapters 4–6 develop PAC learning and VC dimension as uniform convergence results over hypothesis classes — the direct ML application of the theory developed here.
  6. paper Vapnik & Chervonenkis (1971). “On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities.” The foundational paper introducing VC dimension and establishing the connection between uniform convergence and learnability.