Limits & Continuity · intermediate · 45 min read

Uniform Convergence

When do limits of functions inherit the properties of the functions themselves? The sup-norm, the ε/3 trick, the Weierstrass M-test, Arzelà-Ascoli, and the uniform convergence framework that makes PAC learning possible

Abstract. A sequence of functions (f_n) converges pointwise to f on D if for each x in D, f_n(x) converges to f(x) as n tends to infinity. But pointwise convergence is deceptively weak: the limit of continuous functions can be discontinuous (f_n(x) = x^n on [0,1] converges pointwise to a step function), integration and limits cannot be interchanged (a moving bump with constant integral can converge pointwise to zero), and derivatives of the limit may bear no relation to the limits of the derivatives. Uniform convergence fixes these pathologies by requiring that the convergence rate be independent of x: (f_n) converges uniformly to f if for every epsilon > 0, there exists N such that |f_n(x) - f(x)| < epsilon for ALL x in D simultaneously — equivalently, the sup-norm ||f_n - f||_infinity tends to zero. The Uniform Limit Theorem guarantees that the uniform limit of continuous functions is continuous (the epsilon/3 argument), and uniform convergence permits interchange of limits with Riemann integration. The Weierstrass M-test provides a practical criterion: if |g_k(x)| <= M_k for all x and the sum of M_k converges, then the series of g_k converges uniformly. The Arzela-Ascoli theorem characterizes compact subsets of C([a,b]) via equicontinuity and pointwise boundedness — it is the compactness theorem for function spaces, extending the Heine-Borel theorem from R^n to infinite-dimensional spaces. In machine learning, uniform convergence is the mathematical backbone of generalization theory. PAC learning guarantees that the empirical risk R_hat(h) converges uniformly to the true risk R(h) over a hypothesis class H, and the VC dimension controls the rate of this uniform convergence. The Glivenko-Cantelli theorem — the empirical CDF converges uniformly to the true CDF — is the foundational uniform convergence result in statistics.

Where this leads → formalML

  • formalML PAC learning is fundamentally a uniform convergence result: the empirical risk R_hat_n(h) converges uniformly to the true risk R(h) over the hypothesis class H as the sample size n grows. The VC dimension quantifies the complexity of H and controls the rate of uniform convergence — larger VC dimension means slower convergence and more data required for generalization.
  • formalML Uniform convergence bounds like the Glivenko-Cantelli theorem and the Dvoretzky-Kiefer-Wolfowitz inequality are concentration inequalities applied uniformly over function classes. The DKW inequality bounds ||F_hat_n - F||_infinity — the sup-norm distance between empirical and true CDFs — at rate O(1/sqrt(n)).
  • formalML Modes of convergence in probability — almost sure convergence, convergence in probability, convergence in distribution — extend the pointwise/uniform distinction from deterministic function sequences to sequences of random variables. Uniform integrability, the probabilistic analog of equicontinuity, controls interchange of limits and expectations.

Overview & Motivation

You’re training a neural network with increasing width: $f_1, f_2, f_4, f_8, \ldots$, where $f_n$ has $n$ neurons per hidden layer. Each $f_n$ is a continuous function — a composition of affine maps and continuous activations like ReLU or sigmoid. As $n \to \infty$, does $f_n$ converge to some function $f$? And if so, is $f$ still continuous? Can you integrate $f$ by integrating the $f_n$? Can you differentiate $f$ by differentiating the $f_n$?

The answers depend on how $f_n$ converges — and the wrong notion of convergence gives the wrong answers to all three questions.

In Sequences, Limits & Convergence, we studied convergence of sequences of numbers: given a sequence $a_1, a_2, a_3, \ldots$ in $\mathbb{R}$, does $a_n \to L$? In Epsilon-Delta & Continuity, we studied continuity of individual functions: is $f$ continuous at $a$? In Completeness & Compactness, we studied structural properties that guarantee limits exist and optima are attained. Now we combine all three: when does a sequence of functions converge, and when does the limit inherit the properties of the functions themselves?

The answer will require a new notion of convergence — uniform convergence — that is strictly stronger than the naive “pointwise” approach. The distinction between these two notions is not a technicality. It is the mathematical backbone of generalization theory in machine learning.

Pointwise Convergence of Function Sequences

The simplest way to define convergence of a function sequence is to check convergence at each point separately.

📐 Definition 1 (Pointwise Convergence of Function Sequences)

A sequence of functions $(f_n)$ with $f_n: D \to \mathbb{R}$ converges pointwise to $f: D \to \mathbb{R}$ if for every $x \in D$,

$$\lim_{n \to \infty} f_n(x) = f(x).$$

Equivalently, for every $x \in D$ and every $\varepsilon > 0$, there exists $N \in \mathbb{N}$ (which may depend on both $x$ and $\varepsilon$) such that

$$n \geq N \implies |f_n(x) - f(x)| < \varepsilon.$$

The crucial phrase is “which may depend on both $x$ and $\varepsilon$.” We write $N = N(x, \varepsilon)$ to emphasize this dependence. At each fixed $x$, we’re just asking for ordinary sequence convergence — the $\varepsilon$-$N$ definition from Sequences, Limits & Convergence. But the $N$ required might be very different at different points $x$.

The canonical example makes this concrete:

📝 Example 1

Consider $f_n(x) = x^n$ on $[0, 1]$. The pointwise limit is

$$f(x) = \begin{cases} 0 & \text{if } x \in [0, 1) \\ 1 & \text{if } x = 1 \end{cases}$$

Proof. For $x \in [0, 1)$, we have $|x| < 1$, so $x^n \to 0$ as $n \to \infty$ (geometric sequence with ratio $|r| < 1$). For $x = 1$, we have $1^n = 1$ for all $n$. $\blacksquare$

Notice what happened: each $f_n(x) = x^n$ is continuous on $[0, 1]$, but the pointwise limit $f$ has a jump discontinuity at $x = 1$. The limit of continuous functions is not continuous.

A second example where things go right: $g_n(x) = x^n / n$ on $[0, 1]$ also converges pointwise to $f \equiv 0$ (the same limit for $x \in [0,1)$; at $x = 1$, $g_n(1) = 1/n \to 0$). And this time the limit is continuous. What’s different?

Pointwise vs. uniform convergence

The difference is the convergence rate. For $f_n(x) = x^n$, points near $x = 1$ require enormous $N$ to get close to the limit — $N(x, \varepsilon) = \lceil \log \varepsilon / \log x \rceil \to \infty$ as $x \to 1^-$. For $g_n(x) = x^n/n$, the sup-norm $\|g_n\|_\infty = 1/n$ is independent of $x$, so a single $N$ works everywhere.
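This rate blow-up is easy to see numerically. A minimal sketch (the helper `N_pointwise` is ours, not standard notation): for $f_n(x) = x^n$ and a fixed tolerance, the smallest admissible $N$ explodes as $x \to 1^-$, while for $g_n(x) = x^n/n$ one choice of $N$ works at every $x$ simultaneously.

```python
import math

def N_pointwise(x, eps):
    """Smallest n with x**n < eps for 0 < x < 1: ceil(log eps / log x)."""
    return math.ceil(math.log(eps) / math.log(x))

eps = 0.01
for x in [0.5, 0.9, 0.99, 0.999]:
    print(f"x = {x}: need n >= {N_pointwise(x, eps)} to get x^n < {eps}")

# For g_n(x) = x^n / n we have ||g_n||_inf = 1/n, so N = ceil(1/eps) = 100
# works for ALL x in [0, 1] at once -- that is uniform convergence.
print("uniform N for g_n:", math.ceil(1 / eps))
```

The printed $N$ values grow without bound as $x$ approaches $1$, which is exactly why no single $N$ can serve the whole interval for $x^n$.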

The Three Pathologies of Pointwise Convergence

Pointwise convergence fails to preserve the three fundamental properties of functions:

1. Continuity failure. We’ve just seen this: $f_n(x) = x^n$ is continuous for every $n$, but $f_n \to f$ pointwise where $f$ is discontinuous. The limit of continuous functions need not be continuous under pointwise convergence.

2. Integration failure. Consider the “moving bump” sequence $f_n(x) = n^2 x (1 - x)^n$ on $[0, 1]$. Each $f_n$ has a bump near $x = 0$ that grows taller and narrower. One can verify that $\int_0^1 f_n(x)\,dx = \frac{n^2}{(n+1)(n+2)} \to 1$ as $n \to \infty$. But $f_n(x) \to 0$ for each fixed $x > 0$ (and $f_n(0) = 0$), so the pointwise limit is $f \equiv 0$, and $\int_0^1 f(x)\,dx = 0$. We get

$$\lim_{n \to \infty} \int_0^1 f_n(x)\,dx = 1 \neq 0 = \int_0^1 \left(\lim_{n \to \infty} f_n(x)\right) dx.$$

The limit and integral do not commute.

3. Differentiation failure. Let $f_n(x) = \frac{\sin(nx)}{\sqrt{n}}$ on $\mathbb{R}$. Then $|f_n(x)| \leq 1/\sqrt{n} \to 0$ uniformly, so $f_n \to 0$ everywhere. But $f_n'(x) = \sqrt{n}\cos(nx)$, and $|f_n'(0)| = \sqrt{n} \to \infty$. The derivatives of the $f_n$ don’t converge at all, even though the $f_n$ themselves converge nicely.

The three pathologies of pointwise convergence
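The third pathology is easy to observe numerically. A quick grid-evaluation sketch (not a proof): the sup-norm of $f_n(x) = \sin(nx)/\sqrt{n}$ shrinks like $1/\sqrt{n}$ while the sup-norm of its derivative grows like $\sqrt{n}$.

```python
import numpy as np

# f_n(x) = sin(nx)/sqrt(n) -> 0 uniformly, but f_n'(x) = sqrt(n) cos(nx) diverges.
x = np.linspace(-np.pi, np.pi, 2001)   # grid includes x = 0, where |f_n'| = sqrt(n)
for n in [1, 10, 100, 1000]:
    fn = np.sin(n * x) / np.sqrt(n)
    dfn = np.sqrt(n) * np.cos(n * x)
    print(f"n={n:4d}: ||f_n||_inf ~ {np.max(np.abs(fn)):.4f}, "
          f"||f_n'||_inf ~ {np.max(np.abs(dfn)):.1f}")
```

The two columns move in opposite directions: uniform convergence of the functions coexists with total divergence of the derivatives.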

💡 Remark 1

These three failures are not pathological edge cases. They arise naturally whenever convergence is “fast in some places and slow in others,” which is typical in machine learning: a neural network may fit training data well in dense regions and poorly in sparse regions, so the convergence of $f_n$ to $f$ is inherently non-uniform across the input domain. The distinction between pointwise and uniform convergence is the mathematical formalization of this problem.

Uniform Convergence — The Fix

If we require the convergence rate to be the same at every $x$ — that is, the $N$ depends only on $\varepsilon$, not on $x$ — then all three pathologies disappear.

📐 Definition 2 (Uniform Convergence)

A sequence of functions $(f_n)$ converges uniformly to $f$ on $D$ if for every $\varepsilon > 0$, there exists $N \in \mathbb{N}$ such that

$$n \geq N \implies |f_n(x) - f(x)| < \varepsilon \quad \text{for all } x \in D \text{ simultaneously.}$$

Equivalently, $\|f_n - f\|_\infty = \sup_{x \in D} |f_n(x) - f(x)| \to 0$ as $n \to \infty$.

The geometric picture is the ε-band characterization: $f_n \to f$ uniformly if and only if for every $\varepsilon > 0$, eventually the entire graph of $f_n$ lies within an $\varepsilon$-band around the graph of $f$. Not just at each point separately, but everywhere at once.

📐 Definition 3 (Sup-Norm / Uniform Metric)

For bounded functions $f, g: D \to \mathbb{R}$, the sup-norm (or uniform norm) is

$$\|f - g\|_\infty = \sup_{x \in D} |f(x) - g(x)|.$$

This is a genuine metric on the space of bounded functions on $D$. Uniform convergence is convergence in this metric.

The comparison is now sharp:

  • Pointwise: “for each $x$, the vertical distance $|f_n(x) - f(x)| \to 0$.”
  • Uniform: “the maximum vertical distance over all $x$ simultaneously $\to 0$.”

The ε-band characterization

🔷 Proposition 1 (Uniform ⟹ Pointwise)

If $f_n \to f$ uniformly on $D$, then $f_n \to f$ pointwise on $D$. The converse is false.

Proof.

If $\|f_n - f\|_\infty < \varepsilon$ for $n \geq N$, then in particular $|f_n(x) - f(x)| \leq \|f_n - f\|_\infty < \varepsilon$ for each fixed $x \in D$, so $f_n(x) \to f(x)$.

The converse fails: $f_n(x) = x^n$ on $[0, 1]$ converges pointwise but not uniformly — the sup-norm $\sup_{x \in [0,1)} |x^n - 0|$ does not tend to $0$, because for any $n$ we can find $x$ close to $1$ with $x^n$ close to $1$. $\blacksquare$

📝 Example 2

$g_n(x) = x^n / n$ on $[0, 1]$. Then $\|g_n\|_\infty = \sup_{x \in [0,1]} x^n / n = 1/n \to 0$, so $g_n \to 0$ uniformly. Compare with $f_n(x) = x^n$, where $\sup_{x \in [0,1)} x^n$ approaches $1$, not $0$.

Use the explorer below to feel the difference. Drag the $n$ slider and watch whether the function’s graph stays inside the ε-band. The moment when $x^n$ escapes the band near $x = 1$ — but $x^n/n$ stays inside — is the difference between pointwise and uniform convergence.

For $x^n$, $\|f_n - f\|_\infty \to 1$ (never $\to 0$): the graph escapes the ε-band near $x = 1$ at every $n$ — pointwise only, not uniform. Analogous to an overfit model converging fast in-distribution but failing at the boundary.

The Uniform Limit Theorem

The most important theorem in this topic: uniform convergence preserves continuity.

🔷 Theorem 1 (Uniform Limit Theorem)

If each $f_n: D \to \mathbb{R}$ is continuous at $a \in D$ and $f_n \to f$ uniformly on $D$, then $f$ is continuous at $a$.

Proof.

Fix $\varepsilon > 0$. We need $\delta > 0$ such that $|x - a| < \delta \implies |f(x) - f(a)| < \varepsilon$. Apply the triangle inequality:

$$|f(x) - f(a)| \leq |f(x) - f_n(x)| + |f_n(x) - f_n(a)| + |f_n(a) - f(a)|.$$

Step 1 (Uniform convergence). Choose $N$ such that $\|f_N - f\|_\infty < \varepsilon/3$. Then $|f(x) - f_N(x)| < \varepsilon/3$ for all $x$, and $|f_N(a) - f(a)| < \varepsilon/3$.

Step 2 (Continuity of $f_N$). Since $f_N$ is continuous at $a$, choose $\delta > 0$ such that $|x - a| < \delta \implies |f_N(x) - f_N(a)| < \varepsilon/3$.

Step 3 (Combine). For $|x - a| < \delta$:

$$|f(x) - f(a)| < \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \varepsilon. \quad \blacksquare$$

The key insight: uniform convergence provides a single $N$ that simultaneously controls the first and third terms for all $x$. Pointwise convergence would give a different $N$ for different $x$, and we couldn’t fix $N$ before choosing $\delta$.

💡 Remark 2

The ε/3 trick is a paradigm: when bounding $|f(x) - f(a)|$ through an intermediary $f_n$, split the triangle inequality into three terms and allocate $\varepsilon/3$ to each. This proof strategy recurs throughout analysis — and in ML, it appears in decomposing generalization error into approximation error + estimation error + optimization error, each bounded by $\varepsilon/3$.

The Uniform Limit Theorem

Click on the graph to probe continuity at a point

Uniform convergence (x^n/n → 0): the limit is continuous. The ε/3 argument works because one N controls all x.

Interchange Theorems — Integration and Differentiation

Uniform convergence enables two crucial interchange theorems, but with different hypotheses.

🔷 Theorem 2 (Interchange of Limit and Integral)

If $f_n \to f$ uniformly on $[a, b]$ and each $f_n$ is Riemann integrable on $[a, b]$, then $f$ is Riemann integrable and

$$\lim_{n \to \infty} \int_a^b f_n(x)\, dx = \int_a^b f(x)\, dx.$$

Proof.

$$\left|\int_a^b f_n(x)\, dx - \int_a^b f(x)\, dx\right| = \left|\int_a^b [f_n(x) - f(x)]\, dx\right| \leq \int_a^b |f_n(x) - f(x)|\, dx \leq \|f_n - f\|_\infty \cdot (b - a).$$

Since $\|f_n - f\|_\infty \to 0$ by uniform convergence, the right side tends to $0$. $\blacksquare$

The proof is clean and direct — the sup-norm bound converts a pointwise inequality into a uniform one, and the integral of a uniform bound is just the bound times the interval length.

📝 Example 3

The interchange fails for pointwise convergence. Our “moving bump” sequence has $\int_0^1 f_n \to 1$ but $f_n \to 0$ pointwise, so $\int_0^1 (\lim f_n) = 0 \neq 1 = \lim \left(\int_0^1 f_n\right)$.

Interchange of limit and integral
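The failure is easy to check numerically. A small sketch using `scipy.integrate.quad` on the same moving-bump sequence (complementing the uniform case verified in the Computational Notes below):

```python
import numpy as np
from scipy import integrate

def bump(x, n):
    """Moving bump f_n(x) = n^2 x (1 - x)^n."""
    return n**2 * x * (1 - x)**n

for n in [10, 50, 100]:
    integral, _ = integrate.quad(bump, 0, 1, args=(n,))
    exact = n**2 / ((n + 1) * (n + 2))
    # The integrals tend to 1, yet every fixed pointwise value tends to 0.
    print(f"n={n:3d}: ∫f_n = {integral:.6f} (exact {exact:.6f}), "
          f"f_n(0.5) = {bump(0.5, n):.2e}")
```

The integral column climbs toward $1$ while the pointwise column collapses to $0$: the limit and the integral genuinely disagree.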

The differentiation interchange is harder and requires a stronger hypothesis:

🔷 Theorem 3 (Interchange of Limit and Derivative)

If $(f_n)$ is a sequence of differentiable functions on $[a, b]$ such that:

(i) $f_n(x_0) \to L$ for at least one $x_0 \in [a, b]$, and

(ii) $f_n' \to g$ uniformly on $[a, b]$,

then $f_n \to f$ uniformly for some differentiable $f$, and $f' = g = \lim f_n'$.

💡 Remark 3

The differentiation interchange requires uniform convergence of the derivatives, not of the functions themselves. This asymmetry is fundamental: integration is a smoothing operation (it improves convergence), while differentiation is a roughening operation (it can destroy convergence). In ML terms: backpropagation computes derivatives, so uniform convergence of loss landscapes $\mathcal{L}_n(\theta) \to \mathcal{L}(\theta)$ does not automatically imply well-behaved gradients $\nabla \mathcal{L}_n \to \nabla \mathcal{L}$.

The Cauchy Criterion for Uniform Convergence

Just as numerical sequences have a Cauchy criterion that certifies convergence without knowing the limit (Sequences, Limits & Convergence, Definition 6), function sequences have an analogous criterion.

🔷 Proposition 2 (Cauchy Criterion for Uniform Convergence)

A sequence $(f_n)$ converges uniformly on $D$ if and only if for every $\varepsilon > 0$, there exists $N$ such that

$$m, n \geq N \implies \|f_m - f_n\|_\infty = \sup_{x \in D} |f_m(x) - f_n(x)| < \varepsilon.$$

Proof.

(⟹) If $f_n \to f$ uniformly, then $\|f_m - f_n\|_\infty \leq \|f_m - f\|_\infty + \|f_n - f\|_\infty$. Given $\varepsilon > 0$, choose $N$ such that $\|f_k - f\|_\infty < \varepsilon/2$ for $k \geq N$. Then $\|f_m - f_n\|_\infty < \varepsilon$ for $m, n \geq N$.

(⟸) For each fixed $x$, the numerical sequence $(f_n(x))$ satisfies $|f_m(x) - f_n(x)| \leq \|f_m - f_n\|_\infty < \varepsilon$, so $(f_n(x))$ is Cauchy in $\mathbb{R}$. By completeness of $\mathbb{R}$ (from Completeness & Compactness), $f_n(x) \to f(x)$ for some value $f(x)$.

To show the convergence is uniform: fix $\varepsilon > 0$ and choose $N$ from the hypothesis. For any $m, n \geq N$ and any $x$, $|f_m(x) - f_n(x)| < \varepsilon$. Taking $m \to \infty$ gives $|f(x) - f_n(x)| \leq \varepsilon$ for all $x \in D$, which is $\|f - f_n\|_\infty \leq \varepsilon$. $\blacksquare$

💡 Remark 4

The Cauchy criterion shows that $C([a,b])$ with the sup-norm is a complete space — every uniformly Cauchy sequence of continuous functions converges uniformly to a continuous function. This completeness is the function-space analog of the completeness of $\mathbb{R}$ from Completeness & Compactness. It will be formalized as a Banach space structure in Normed & Banach Spaces.

Cauchy criterion for function sequences
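The criterion is checkable without knowing the limit. A grid-based sketch: for the uniformly convergent $g_n(x) = x^n/n$ the Cauchy gaps $\|g_n - g_{2n}\|_\infty$ die out, while for $f_n(x) = x^n$ the gap $\|f_n - f_{2n}\|_\infty$ stays near $1/4$ (substitute $t = x^n$ and maximize $t - t^2$).

```python
import numpy as np

x = np.linspace(0, 1, 10001)
g = lambda n: x**n / n    # uniformly convergent: gaps shrink
f = lambda n: x**n        # pointwise only: gaps stuck near 1/4

for n in [10, 100, 500]:
    gap_g = np.max(np.abs(g(n) - g(2 * n)))
    gap_f = np.max(np.abs(f(n) - f(2 * n)))
    print(f"n={n:3d}: ||g_n - g_2n||_inf = {gap_g:.5f}, "
          f"||f_n - f_2n||_inf = {gap_f:.5f}")
```

The stuck gap for $f_n$ is the Cauchy criterion failing, with no mention of the limit function at all.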

Series of Functions and the Weierstrass M-Test

A series of functions $\sum_{k=1}^\infty g_k(x)$ is just a sequence of partial sums. The Weierstrass M-test provides a practical, checkable criterion for uniform convergence.

📐 Definition 4 (Uniform Convergence of a Series)

A series $\sum_{k=1}^\infty g_k(x)$ converges uniformly on $D$ if the sequence of partial sums $S_n(x) = \sum_{k=1}^n g_k(x)$ converges uniformly on $D$.

🔷 Theorem 4 (Weierstrass M-Test)

Let $g_k: D \to \mathbb{R}$ and suppose there exist constants $M_k \geq 0$ such that $|g_k(x)| \leq M_k$ for all $x \in D$ and all $k$. If $\sum_{k=1}^\infty M_k$ converges, then $\sum_{k=1}^\infty g_k$ converges uniformly (and absolutely) on $D$.

Proof.

For $m > n$:

$$\|S_m - S_n\|_\infty = \sup_{x \in D} \left|\sum_{k=n+1}^m g_k(x)\right| \leq \sum_{k=n+1}^m \sup_{x \in D} |g_k(x)| \leq \sum_{k=n+1}^m M_k.$$

Since $\sum M_k$ converges, its tail $\sum_{k=n+1}^\infty M_k \to 0$, so $(S_n)$ is uniformly Cauchy. By the Cauchy criterion (Proposition 2), $(S_n)$ converges uniformly. $\blacksquare$

📝 Example 4

The series $\sum_{k=1}^\infty \frac{\sin(kx)}{k^2}$ converges uniformly on $\mathbb{R}$. Take $M_k = 1/k^2$; since $|\sin(kx)| \leq 1$ and $\sum 1/k^2 = \pi^2/6 < \infty$, the M-test applies. Since each $\sin(kx)/k^2$ is continuous and the convergence is uniform, the Uniform Limit Theorem (applied to partial sums) guarantees the sum defines a continuous function.

The Weierstrass M-test

$M_k = 1/k^2$ ($\sum M_k = \pi^2/6$). Left: partial sums $S_n(x)$ vs. full sum. Right: $M_k$ bound bars (blue = included, gray = tail). The actual tail error $\|S - S_n\|_\infty$ stays below the M-test bound.

Equicontinuity and the Arzelà-Ascoli Theorem

The Heine-Borel theorem from Completeness & Compactness says that a subset of $\mathbb{R}^n$ is compact if and only if it is closed and bounded. What is the analog for subsets of the function space $C([a,b])$? Boundedness alone is not enough — we need a condition that controls how much the functions in the family can oscillate. That condition is equicontinuity.

📐 Definition 5 (Equicontinuity)

A family of functions $\mathcal{F} \subseteq C([a,b])$ is equicontinuous if for every $\varepsilon > 0$, there exists $\delta > 0$ such that

$$|x - y| < \delta \implies |f(x) - f(y)| < \varepsilon \quad \text{for ALL } f \in \mathcal{F}.$$

Compare this with uniform continuity from Epsilon-Delta & Continuity (Definition 9): uniform continuity says a single function has one $\delta$ that works at every point; equicontinuity says a family of functions has one $\delta$ that works for every function and every point simultaneously.
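A shared Lipschitz constant gives equicontinuity directly: if $|f'| \leq L$ for every member, then $\delta = \varepsilon/L$ works for the whole family at once. A numerical sketch with the illustrative family $f_c(x) = \sin(cx)$, $|c| \leq L$ (our choice of example, not a canonical one):

```python
import numpy as np

L, eps = 3.0, 0.1
delta = eps / L                       # ONE delta for every member of the family
cs = np.linspace(-L, L, 25)           # 25 family members f_c(x) = sin(c x)
x = np.linspace(0, 2 * np.pi, 2000)
shift = delta / 2                     # probe a pair x, y with |x - y| < delta

# Worst oscillation over the whole family; the MVT bound is L * |x - y| < eps.
worst = max(np.max(np.abs(np.sin(c * x) - np.sin(c * (x + shift)))) for c in cs)
print(f"delta = {delta:.4f}: sup over family of |f(x) - f(y)| = {worst:.4f} "
      f"< eps = {eps}")
```

One $\delta$, twenty-five functions, every point: that is equicontinuity in action.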

📐 Definition 6 (Pointwise Boundedness)

A family $\mathcal{F}$ is pointwise bounded if for each $x \in [a,b]$, $\sup_{f \in \mathcal{F}} |f(x)| < \infty$.

🔷 Theorem 5 (Arzelà-Ascoli Theorem)

A subset $\mathcal{F} \subseteq C([a,b])$ is relatively compact (every sequence in $\mathcal{F}$ has a uniformly convergent subsequence) if and only if $\mathcal{F}$ is equicontinuous and pointwise bounded.

Proof.

(⟸, main direction.) Let $(f_n) \subseteq \mathcal{F}$.

Step 1 (Diagonal extraction on the rationals). Let $\{r_1, r_2, \ldots\}$ be an enumeration of the rationals in $[a,b]$. By pointwise boundedness, $(f_n(r_1))$ is a bounded sequence in $\mathbb{R}$, so by the Bolzano-Weierstrass theorem (Sequences, Limits & Convergence, Theorem 4), we can extract a subsequence that converges at $r_1$. From that subsequence, extract a further subsequence converging at $r_2$. Continue. The diagonal subsequence — the $k$-th term of the $k$-th subsequence — converges at every rational $r_j$.

Step 2 (Extend from rationals to all of $[a,b]$). By equicontinuity, for any $\varepsilon > 0$, choose $\delta$ from the definition. For any $x \in [a,b]$, find a rational $r$ with $|x - r| < \delta$. Then for indices $n_k, n_j$ in our diagonal subsequence:

$$|f_{n_k}(x) - f_{n_j}(x)| \leq |f_{n_k}(x) - f_{n_k}(r)| + |f_{n_k}(r) - f_{n_j}(r)| + |f_{n_j}(r) - f_{n_j}(x)| < \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \varepsilon$$

for $k, j$ sufficiently large. The first and third terms use equicontinuity ($\delta$ works for all $f$); the middle term uses convergence at $r$.

Step 3 (Convergence is uniform). Since $[a,b]$ is compact, finitely many rationals $r^{(1)}, \ldots, r^{(m)}$ suffice so that every $x$ is within $\delta$ of one of them; choose $k, j$ large enough that the middle term is below $\varepsilon/3$ at all $m$ centers simultaneously. The bound above is then independent of $x$. This gives $\|f_{n_k} - f_{n_j}\|_\infty < \varepsilon$, so the diagonal subsequence is uniformly Cauchy and hence uniformly convergent.

(⟹, sketch.) If $\mathcal{F}$ is relatively compact, a finite $\varepsilon$-net argument shows equicontinuity: cover $\mathcal{F}$ with finitely many $\varepsilon/3$-balls, use the uniform continuity of each center function, and take the minimum $\delta$. Pointwise boundedness follows similarly. $\blacksquare$

💡 Remark 5

The Arzelà-Ascoli proof uses the same diagonal argument that appears in Cantor’s proof and in many ML contexts — extracting convergent subsequences from parameter sequences in optimization. The equicontinuity condition has a direct ML interpretation: a family of neural networks with bounded Lipschitz constants (achieved by spectral normalization or gradient clipping) is equicontinuous, and Arzelà-Ascoli guarantees a convergent subsequence — the function-space analog of compactness in parameter space.

The Arzelà-Ascoli theorem


Click on the graph to probe equicontinuity at a point. Blue curves = extracted subsequence. Dashed red = limit approximation.

Connections to ML

Uniform convergence is not merely a theoretical curiosity that happens to appear in machine learning. It is the mathematical foundation of generalization theory — the question of why models that perform well on training data also perform well on unseen data.

Uniform convergence of empirical processes

Given a hypothesis class $\mathcal{H}$ (e.g., the set of all neural networks with a fixed architecture), the empirical risk of a hypothesis $h \in \mathcal{H}$ on $n$ training examples is

$$\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^n \ell(h(x_i), y_i),$$

while the true risk is $R(h) = \mathbb{E}[\ell(h(X), Y)]$. The law of large numbers guarantees that for any fixed $h$, $\hat{R}_n(h) \to R(h)$ as $n \to \infty$. This is pointwise convergence — one hypothesis at a time.

But learning requires more: we need $\hat{R}_n(h)$ to be close to $R(h)$ for all $h \in \mathcal{H}$ simultaneously, so that the hypothesis $\hat{h}$ that minimizes $\hat{R}_n$ also approximately minimizes $R$. This is uniform convergence:

$$\sup_{h \in \mathcal{H}} \left|\hat{R}_n(h) - R(h)\right| \to 0.$$

This is exactly the sup-norm convergence $\|\hat{R}_n - R\|_\infty \to 0$ over the function class $\mathcal{H}$. → formalML PAC Learning
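Here is a toy numeric sketch of this distinction (threshold classifiers on $[0,1]$, our own construction): for $h_t(x) = \mathbf{1}\{x > t\}$ with labels $y = \mathbf{1}\{x > 0.5\}$, the true risk is $R(h_t) = |t - 0.5|$, and the sup-gap over all thresholds shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypotheses h_t(x) = 1{x > t}; data X ~ Uniform[0,1], labels y = 1{X > 0.5}.
# True risk R(h_t) = P(h_t(X) != y) = |t - 0.5| (the mislabeled mass).
thresholds = np.linspace(0, 1, 101)
true_risk = np.abs(thresholds - 0.5)

for n in [100, 1000, 10000]:
    X = rng.uniform(0, 1, n)
    y = X > 0.5
    preds = X[None, :] > thresholds[:, None]   # row i: predictions of h_{t_i}
    emp_risk = np.mean(preds != y, axis=1)     # R_hat_n(h_t) for every t at once
    sup_gap = np.max(np.abs(emp_risk - true_risk))
    print(f"n={n:5d}: sup_h |R_hat_n(h) - R(h)| = {sup_gap:.4f}")
```

The printed sup-gap is exactly the quantity in the display above, evaluated over a finite hypothesis grid; its decay with $n$ is the uniform convergence that makes empirical risk minimization work.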

The Glivenko-Cantelli theorem

The oldest and most fundamental uniform convergence result in statistics: the empirical CDF $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{X_i \leq x}$ converges uniformly to the true CDF $F(x)$:

$$\|\hat{F}_n - F\|_\infty = \sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F(x)| \to 0 \quad \text{almost surely.}$$

The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality gives the rate: $P\left(\|\hat{F}_n - F\|_\infty > t\right) \leq 2e^{-2nt^2}$, yielding $\|\hat{F}_n - F\|_\infty = O(1/\sqrt{n})$. The Kolmogorov-Smirnov test is built on this sup-norm. → formalML Concentration Inequalities

VC dimension and complexity control

The VC dimension $d$ of a hypothesis class $\mathcal{H}$ quantifies how “complex” $\mathcal{H}$ is — it is the largest number of points that $\mathcal{H}$ can shatter (classify in all $2^d$ possible ways). The fundamental theorem of statistical learning: $\mathcal{H}$ is PAC-learnable if and only if it has finite VC dimension. The generalization bound

$$\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| \leq O\left(\sqrt{\frac{d \log n}{n}}\right)$$

is a rate of uniform convergence — it tells you how fast the sup-norm between empirical and true risk shrinks with sample size. → formalML PAC Learning

Arzelà-Ascoli and neural network approximation

The universal approximation theorem says that neural networks are dense in $C([a,b])$ under the sup-norm: for any continuous $f$ and any $\varepsilon > 0$, there exists a neural network $g$ with $\|f - g\|_\infty < \varepsilon$. Arzelà-Ascoli constrains which families of networks are compact (and hence tractable): a family with bounded Lipschitz constants is equicontinuous, and equicontinuity + pointwise boundedness gives compactness. This is why Lipschitz constraints (spectral normalization, gradient penalty) appear in generative models — they enforce the compactness that makes optimization well-posed.

ML connections

Computational Notes

Uniform convergence can be verified computationally by evaluating the sup-norm on a dense grid.

Computing the sup-norm. Given a function sequence fnf_n and its limit ff, we approximate fnf\|f_n - f\|_\infty by sampling:

import numpy as np

def sup_norm(fn, f, domain, grid_size=1000):
    """Approximate ||fn - f||_∞ on domain."""
    x = np.linspace(domain[0], domain[1], grid_size)
    return np.max(np.abs(fn(x) - f(x)))

# x^n on [0, 1]: not uniform (sup norm → 1)
for n in [10, 50, 100, 500]:
    fn = lambda x, n=n: x**n
    f = lambda x: np.where(x < 1, 0.0, 1.0)
    print(f"n={n:3d}: ||f_n - f||_∞ = {sup_norm(fn, f, (0, 1)):.6f}")

# x^n/n on [0, 1]: uniform (sup norm = 1/n → 0)
for n in [10, 50, 100, 500]:
    gn = lambda x, n=n: x**n / n
    g = lambda x: np.zeros_like(x)
    print(f"n={n:3d}: ||g_n - g||_∞ = {sup_norm(gn, g, (0, 1)):.6f}")

Verifying the integration interchange. Compare fn\int f_n with f\int f:

from scipy import integrate

def check_interchange(fn, f, domain, n_values):
    """Compare lim ∫f_n with ∫(lim f_n)."""
    a, b = domain
    integral_f = integrate.quad(f, a, b)[0]
    for n in n_values:
        integral_fn = integrate.quad(lambda x: fn(x, n), a, b)[0]
        print(f"n={n:3d}: ∫f_n = {integral_fn:.6f}, ∫f = {integral_f:.6f}, "
              f"gap = {abs(integral_fn - integral_f):.6f}")

# Uniform case: gap → 0
check_interchange(
    fn=lambda x, n: x**n / n,
    f=lambda x: 0.0,
    domain=(0, 1),
    n_values=[5, 20, 100]
)

Implementing the Weierstrass M-test. Check the bound and compare with the actual tail error:

def weierstrass_m_test(term, Mk, domain, n_terms, max_k=200, grid=1000):
    """Apply the M-test: compare actual tail error with M-bound."""
    x = np.linspace(domain[0], domain[1], grid)
    partial = sum(term(x, k) for k in range(1, n_terms + 1))
    full = sum(term(x, k) for k in range(1, max_k + 1))
    actual_error = np.max(np.abs(full - partial))
    m_bound = sum(Mk(k) for k in range(n_terms + 1, max_k + 1))
    print(f"n={n_terms}: actual ||S-S_n||_∞ = {actual_error:.6f}, "
          f"M-bound = {m_bound:.6f}")

# ∑ sin(kx)/k²
weierstrass_m_test(
    term=lambda x, k: np.sin(k * x) / k**2,
    Mk=lambda k: 1 / k**2,
    domain=(0, 2 * np.pi),
    n_terms=10
)

Computing the Kolmogorov-Smirnov statistic. The sup-norm between empirical and true CDF:

def kolmogorov_smirnov(samples, cdf):
    """Compute ||F_hat_n - F||_∞, checking both sides of each ECDF jump."""
    n = len(samples)
    sorted_samples = np.sort(samples)
    cdf_vals = cdf(sorted_samples)
    d_plus = np.max(np.arange(1, n + 1) / n - cdf_vals)   # ECDF above F
    d_minus = np.max(cdf_vals - np.arange(0, n) / n)      # ECDF below F
    return max(d_plus, d_minus)

# Standard normal: Glivenko-Cantelli in action
from scipy.stats import norm
for n in [100, 1000, 10000]:
    samples = np.random.randn(n)
    ks = kolmogorov_smirnov(samples, norm.cdf)
    dkw_bound = np.sqrt(np.log(2 / 0.05) / (2 * n))
    print(f"n={n:5d}: KS = {ks:.4f}, DKW 95% bound = {dkw_bound:.4f}")

Connections & Further Reading

Within formalCalculus

  • Sequences, Limits & Convergence — the ε-N convergence framework for numerical sequences. Pointwise convergence of $f_n(x)$ at each fixed $x$ is literally a collection of numerical sequence convergence problems. The Cauchy criterion for uniform convergence (Proposition 2) is the function-space analog of the Cauchy sequence criterion. The Bolzano-Weierstrass theorem is the engine of the diagonal argument in the Arzelà-Ascoli proof.
  • Epsilon-Delta & Continuity — the ε-δ continuity framework used in the Uniform Limit Theorem proof (the ε/3 argument). Uniform continuity from Topic 2 is the single-function version of the equicontinuity concept. The Lipschitz hierarchy reappears: a family with bounded Lipschitz constants is equicontinuous.
  • Completeness & Compactness — the Heine-Borel theorem is the finite-dimensional analog of Arzelà-Ascoli. Equicontinuity + pointwise boundedness plays the role that closed + bounded plays in Heine-Borel. Completeness of $\mathbb{R}$ underlies the Cauchy criterion proof.
  • Series Convergence & Tests — the Weierstrass M-test applied to specific series; absolute vs. conditional convergence for function series.
  • Power Series & Taylor Series — power series converge uniformly on compact subsets of the interval of convergence; term-by-term differentiation and integration are direct applications of the interchange theorems.
  • Metric Spaces & Topology — $C([a,b])$ with the sup-norm is proved complete, and the Arzelà–Ascoli theorem is proved in full via the diagonal subsequence argument, closing both promises from this topic.
  • Normed & Banach Spaces — $C([a,b])$ is a Banach space (complete normed vector space); completeness is the Cauchy criterion for uniform convergence.
  • Approximation Theory & Convergence Rates — Stone-Weierstrass theorem (polynomials are dense in C([a,b])C([a,b]) under the sup-norm); rates of uniform approximation.

Forward to formalML

  • formalML PAC Learning — uniform convergence of empirical risk to true risk; VC dimension controls the rate.
  • formalML Concentration Inequalities — Glivenko-Cantelli, DKW inequality, uniform bounds over function classes.
  • formalML Measure-Theoretic Probability — modes of convergence in probability extend the pointwise/uniform distinction to random functions.

References

  1. book Abbott (2015). Understanding Analysis. Chapter 6 develops uniform convergence with the ε-band characterization and proves the Uniform Limit Theorem — the primary reference for our geometric-first approach.
  2. book Rudin (1976). Principles of Mathematical Analysis. Chapter 7 on sequences and series of functions — the definitive treatment of uniform convergence, the Weierstrass M-test, and equicontinuity.
  3. book Tao (2016). Analysis I. Chapter 14 on uniform convergence — a careful first-principles development distinguishing pointwise and uniform convergence at every step.
  4. book Kreyszig (1989). Introductory Functional Analysis with Applications. Chapter 1 on normed spaces and Chapter 5 on compact operators — the sup-norm as a genuine norm on C([a,b]) and Arzelà-Ascoli as a compactness criterion.
  5. book Shalev-Shwartz & Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Chapters 4–6 develop PAC learning and VC dimension as uniform convergence results over hypothesis classes — the direct ML application of the theory developed here.
  6. paper Vapnik & Chervonenkis (1971). “On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities.” The foundational paper introducing VC dimension and establishing the connection between uniform convergence and learnability.