Functional Analysis · intermediate · 45 min read

Metric Spaces & Topology

Generalizing distance beyond Euclidean space — from metric axioms through completeness and compactness to the contraction mapping theorem and Arzelà–Ascoli, with applications to fixed-point iteration, embedding spaces, and metric learning.

Abstract. Every machine learning algorithm that computes a distance — every k-nearest-neighbor lookup, every gradient descent step, every embedding similarity score — is operating in a metric space. This topic generalizes the completeness and compactness theory from Topic 3 beyond Euclidean space to arbitrary metric spaces, where the notion of distance is axiomatized by three properties: positivity, symmetry, and the triangle inequality. The payoff is immediate. The Banach Contraction Mapping Theorem guarantees that any contraction on a complete metric space has a unique fixed point — this single result explains why Picard iteration solves ODEs, why Newton's method converges, why value iteration in reinforcement learning finds the optimal policy, and why many iterative algorithms in machine learning are guaranteed to converge. The Arzelà–Ascoli theorem characterizes compact subsets of function spaces and explains when a sequence of functions must have a uniformly convergent subsequence. Along the way, we develop the language of topology — open and closed sets, continuity via preimages, and the distinction between metric and general topological spaces — that will power the remaining three topics in this track.

Where this leads → formalML

  • formalML The contraction mapping theorem guarantees convergence of fixed-point iterations. Lipschitz continuity of gradient maps, formalized here, controls convergence rates of gradient descent.
  • formalML The Bellman operator is a contraction in the sup-norm on bounded value functions. Banach's fixed-point theorem guarantees that value iteration converges to the unique optimal value function V*.
  • formalML Metric spaces formalize learned similarity. Mahalanobis distance, Siamese networks, and triplet loss all learn metrics on embedding spaces — the axioms here are exactly the constraints those losses enforce.
  • formalML Positive-definite kernels induce metrics on feature spaces. The RKHS norm distance is a metric-space distance, and kernel methods exploit metric-space completeness.
  • formalML Word2Vec, BERT, and graph embeddings live in metric spaces where cosine or Euclidean distance measures semantic proximity. The triangle inequality constrains embedding geometry.

1. Why Metric Spaces?

Every machine learning algorithm that computes a distance is quietly operating in a metric space. The Euclidean distance in $\mathbb{R}^n$ is the obvious example — it is what $k$-nearest neighbors uses, what gradient descent steps respect, and what the word “similarity” means for Word2Vec embeddings. But Euclidean distance is not the only game in town, and for several of the questions we are about to ask, it is not even the right tool.

Three motivating puzzles set the stage.

Puzzle 1: Comparing functions. Given two continuous functions $f, g : [0, 1] \to \mathbb{R}$, how do we measure how far apart they are? One natural answer is the sup-norm distance $\|f - g\|_\infty = \sup_{x \in [0,1]} |f(x) - g(x)|$, which we saw already in Uniform Convergence — it is the metric that makes “convergence” mean “uniform convergence.” But there are other answers: the integral-based $L^1$ distance $\int_0^1 |f - g|$, the $L^2$ distance $\sqrt{\int_0^1 (f - g)^2}$, and many more. Each is a legitimate metric, and each gives $C([0, 1])$ a different geometric character. The framework we develop in this topic will explain why.

Puzzle 2: When does iteration converge? Suppose $T$ is a map from some space to itself and we iterate: start with $x_0$, set $x_1 = T(x_0)$, $x_2 = T(x_1)$, and so on. When does the sequence $(x_n)$ converge, and when it does, to what? We are going to prove a theorem — the Banach Contraction Mapping Theorem — that answers this completely for any map that “shrinks distances.” The same theorem is why Picard iteration solves ordinary differential equations, why Newton’s method converges in a neighborhood of a simple root, and why reinforcement-learning value iteration finds the optimal policy. A single abstract result unifies half a dozen concrete convergence arguments.

Puzzle 3: Which function families are “small” enough to extract a convergent subsequence? In Topic 3 we proved the Bolzano–Weierstrass theorem: every bounded sequence in $\mathbb{R}^n$ has a convergent subsequence. In $C([0, 1])$ this is simply false — the sequence $f_n(x) = \sin(n\pi x)$ is bounded in the sup-norm (each $\|f_n\|_\infty = 1$), yet no subsequence converges uniformly. We flagged this counterexample in Topic 3 and promised to explain it here. The explanation is that “bounded” is the wrong condition in infinite dimensions; the right one involves equicontinuity, and the theorem that packages it is Arzelà–Ascoli.

These three puzzles are really one question asked three ways: what structure does a space need for the usual analytic intuition to work? The answer — the metric space axioms — is small, geometric, and surprisingly powerful. By the end of this topic we will have generalized completeness, compactness, and continuity from $\mathbb{R}^n$ to arbitrary metric spaces, and we will have proved the two theorems that turn this machinery into concrete convergence guarantees for machine learning algorithms.

2. Metric Space Axioms and Examples

The metric space abstraction compresses the idea of “distance” into three axioms. Everything we prove from here down follows from these three statements and nothing else — which is what gives the theory its reach.

📐 Definition 1 (Metric Space)

A metric space is a pair $(X, d)$ consisting of a set $X$ and a function $d : X \times X \to [0, \infty)$ called a metric (or distance function) satisfying:

  1. Positivity: $d(x, y) \geq 0$ for all $x, y$, and $d(x, y) = 0$ if and only if $x = y$.
  2. Symmetry: $d(x, y) = d(y, x)$ for all $x, y \in X$.
  3. Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$ for all $x, y, z \in X$.

The three axioms are exactly the properties the word “distance” should have. Positivity says distance is never negative and vanishes only when the two points coincide. Symmetry says the distance from $x$ to $y$ does not depend on which one you started at. The triangle inequality says that going via an intermediate point $y$ is never a shortcut. If any of these fail, the machinery we are about to build breaks — which is why metric learning papers spend so much effort ensuring their learned “distances” actually satisfy these properties. Examples build intuition faster than axioms; here are four canonical ones.

📝 Example 1 (The ℓᵖ Metrics on Rⁿ)

For $1 \leq p < \infty$, the $\ell^p$ metric on $\mathbb{R}^n$ is

$$d_p(x, y) = \left(\sum_{i=1}^n |x_i - y_i|^p\right)^{1/p}.$$

For $p = \infty$ the formula degenerates and we use

$$d_\infty(x, y) = \max_{1 \leq i \leq n} |x_i - y_i|.$$

All three axioms are satisfied for every $p \in [1, \infty]$ (the triangle inequality is Minkowski’s inequality). Three values of $p$ are worth remembering: $p = 2$ gives the usual Euclidean distance, $p = 1$ gives the “Manhattan” or taxicab distance, and $p = \infty$ gives the Chebyshev (max-coordinate) distance. In $\mathbb{R}^2$, the “unit ball” $\{x : d_p(x, 0) \leq 1\}$ is a diamond for $p = 1$, a circle for $p = 2$, and a square for $p = \infty$ — the same set $\mathbb{R}^2$ acquires three visibly different geometries depending on which metric we choose.
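A quick numerical sanity check makes the axioms concrete. The sketch below is our own illustration (the helper name `lp_dist` is not from the text): it computes $\ell^p$ distances and spot-checks the triangle inequality — Minkowski’s inequality — on random triples.

```python
import numpy as np

def lp_dist(x, y, p):
    """l^p distance on R^n; p = np.inf gives the Chebyshev (max) metric."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return diff.max() if p == np.inf else float((diff ** p).sum() ** (1.0 / p))

rng = np.random.default_rng(0)
for p in (1, 2, np.inf):
    for _ in range(1000):
        x, y, z = rng.standard_normal((3, 5))
        # triangle inequality d(x,z) <= d(x,y) + d(y,z), up to float round-off
        assert lp_dist(x, z, p) <= lp_dist(x, y, p) + lp_dist(y, z, p) + 1e-12
```

On the pair $(0,0)$ and $(3,4)$ the three distances are $7$, $5$, and $4$ — the diamond, circle, and square geometries in one computation.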

📝 Example 2 (The Sup-Norm Metric on C([a,b]))

The space $C([a, b])$ of continuous real-valued functions on $[a, b]$ becomes a metric space under

$$d_\infty(f, g) = \|f - g\|_\infty = \sup_{x \in [a,b]} |f(x) - g(x)|.$$

Positivity and symmetry are immediate. The triangle inequality follows from the pointwise inequality $|f(x) - h(x)| \leq |f(x) - g(x)| + |g(x) - h(x)|$ by taking suprema on both sides. This is the same sup-norm that appeared in Uniform Convergence: a sequence $f_n$ converges to $f$ in this metric if and only if $f_n \to f$ uniformly. Section 5 will prove that $(C([a, b]), d_\infty)$ is complete — every Cauchy sequence converges — which is why the sup-norm is the right tool for function-space fixed-point arguments.

📝 Example 3 (The Discrete Metric)

On any set $X$, define

$$d(x, y) = \begin{cases} 0 & \text{if } x = y, \\ 1 & \text{if } x \neq y. \end{cases}$$

All three axioms are trivial to verify. Under this metric every subset is open, every convergent sequence is eventually constant, and the “unit ball” $B(x, 1)$ contains only the single point $x$. The discrete metric shows that metric-space theory applies to exotic settings — there is no requirement that $X$ look anything like $\mathbb{R}^n$ — and it is a useful source of counterexamples whenever you suspect a theorem secretly needs continuity.

📝 Example 4 (Hamming Distance)

On the set $\{0, 1\}^n$ of binary strings of length $n$, the Hamming distance is

$$d_H(x, y) = \#\{\, i : x_i \neq y_i \,\},$$

the number of coordinates in which $x$ and $y$ differ. Symmetry and positivity are clear, and the triangle inequality holds because “coordinates where $x$ and $z$ differ” is a subset of “coordinates where $x$ and $y$ differ” $\cup$ “coordinates where $y$ and $z$ differ.” Hamming distance is the metric behind error-correcting codes and is directly relevant to discrete ML: training a classifier to produce outputs within Hamming distance $k$ of a target is a metric-space constraint.
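The subset argument for the triangle inequality can be verified exhaustively for small $n$. A minimal check (our own illustration) over all binary strings of length 4:

```python
import itertools

def hamming(x, y):
    """Number of coordinates in which two equal-length tuples differ."""
    return sum(a != b for a, b in zip(x, y))

strings = list(itertools.product((0, 1), repeat=4))  # all 16 length-4 strings
for x, y in itertools.product(strings, repeat=2):
    assert (hamming(x, y) == 0) == (x == y)      # positivity
    assert hamming(x, y) == hamming(y, x)        # symmetry
for x, y, z in itertools.product(strings, repeat=3):
    assert hamming(x, z) <= hamming(x, y) + hamming(y, z)  # triangle inequality
```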

💡 Remark 1 (Different Metrics, Different Topologies)

The same underlying set can carry many different metrics, and the choice shapes everything else: which sequences converge, which sets are open, which functions are continuous. In $\mathbb{R}^2$ the $\ell^1$, $\ell^2$, and $\ell^\infty$ metrics all generate the same notion of “open set” (they are topologically equivalent — we will define this precisely in Section 8), so they all agree on which sequences converge. But the Hamming metric on $\{0, 1\}^n$ differs from the Euclidean metric on the same set viewed as a subset of $\mathbb{R}^n$: since each coordinate difference is $0$ or $1$, the Euclidean distance between two strings is exactly $\sqrt{d_H(x, y)}$ — the two metrics rank pairs the same way but measure separation on different scales. The metric determines the geometry, and the geometry determines the analysis.

Four-panel figure: the ℓ¹ diamond, ℓ² disk, and ℓ∞ square unit balls in R², plus the discrete-metric "ball" containing a single point.

3. Open Sets, Closed Sets, and the Language of Topology

With the metric in hand we can start building the vocabulary of topology: balls, open sets, closed sets, interiors, closures, boundaries. This is the language that every subsequent theorem in the topic will speak, and it is the same vocabulary Topic 3 used for $\mathbb{R}^n$ — only now it works in any metric space.

📐 Definition 2 (Open and Closed Balls)

Let $(X, d)$ be a metric space, let $x \in X$, and let $r > 0$. The open ball of radius $r$ centered at $x$ is

$$B(x, r) = \{\, y \in X : d(x, y) < r \,\}.$$

The closed ball of radius $r$ centered at $x$ is

$$\overline{B}(x, r) = \{\, y \in X : d(x, y) \leq r \,\}.$$

The open ball has a strict inequality; the closed ball includes the boundary sphere. In $\mathbb{R}^n$ with $d_2$ these are the familiar solid balls without and with their bounding spheres. In $C([a, b])$ with the sup-norm, the open ball $B(f, \varepsilon)$ is the set of all continuous functions whose graph lies strictly within an $\varepsilon$-tube around the graph of $f$ — exactly the “uniform tube” that Topic 4 used to define uniform convergence.

📐 Definition 3 (Open and Closed Sets)

A set $U \subseteq X$ is open if for every point $x \in U$ there exists some $r > 0$ such that the open ball $B(x, r)$ is entirely contained in $U$:

$$\forall x \in U \ \exists r > 0 : B(x, r) \subseteq U.$$

A set $F \subseteq X$ is closed if its complement $X \setminus F$ is open.

A set is open if every point has some breathing room: a small open ball around it that stays inside the set. A set is closed if its complement has that property. Note that “closed” does not mean “not open” — the empty set and the whole space $X$ are both open and closed in every metric space (check the definitions directly), and in the discrete metric every subset is both open and closed. Openness and closedness are more subtle than they look.

🔷 Proposition 1 (Properties of Open and Closed Sets)

In any metric space $(X, d)$:

  1. $\emptyset$ and $X$ are both open and closed.
  2. Arbitrary unions of open sets are open.
  3. Finite intersections of open sets are open.
  4. Arbitrary intersections of closed sets are closed.
  5. Finite unions of closed sets are closed.

The asymmetry between “arbitrary” and “finite” matters. Arbitrary intersections of open sets need not be open — the classic example is $\bigcap_{n \geq 1} (-1/n, 1/n) = \{0\}$, an intersection of open intervals whose limit is a single point, which is not open. Similarly, arbitrary unions of closed sets need not be closed: $\bigcup_{n \geq 1} [1/n, 1] = (0, 1]$. The finiteness restriction in items 3 and 5 is therefore essential, not cosmetic.
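The classic intersection example can be probed directly. A small sketch (our own illustration): any nonzero $x$ is expelled from $(-1/n, 1/n)$ once $n > 1/|x|$, so only $0$ survives every interval.

```python
def in_all_intervals(x, N):
    """Is x in the open interval (-1/n, 1/n) for every n = 1, ..., N?"""
    return all(-1.0 / n < x < 1.0 / n for n in range(1, N + 1))

assert in_all_intervals(0.0, 10_000)        # 0 survives every interval
assert not in_all_intervals(0.01, 10_000)   # expelled once n > 100
```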

📐 Definition 4 (Interior, Closure, Boundary, Limit Point)

For a subset $A \subseteq X$:

  • The interior $A^\circ$ is the union of all open sets contained in $A$; equivalently, it is the largest open set inside $A$.
  • The closure $\overline{A}$ is the intersection of all closed sets containing $A$; equivalently, it is the smallest closed set that contains $A$.
  • The boundary is $\partial A = \overline{A} \setminus A^\circ$.
  • A point $x \in X$ is a limit point of $A$ if every open ball $B(x, r)$ contains a point of $A$ other than $x$.

Two facts follow immediately. First, $A^\circ \subseteq A \subseteq \overline{A}$: the interior is contained in $A$, which is contained in the closure. Second, $A$ is open exactly when $A = A^\circ$, and $A$ is closed exactly when $A = \overline{A}$ — these are not additional hypotheses but equivalent characterizations. The closure can also be described as $\overline{A} = A \cup \{\text{limit points of } A\}$, which is often the easiest way to compute it in practice.

📝 Example 5 (Open Sets in C([0,1]))

In $C([0, 1])$ with the sup-norm, the open ball $B(f, \varepsilon)$ consists of all continuous functions $g$ whose graph lies within the band $\{(x, y) : f(x) - \varepsilon < y < f(x) + \varepsilon\}$. A set $\mathcal{U} \subseteq C([0, 1])$ is open precisely when for every $f \in \mathcal{U}$ we can find a small enough $\varepsilon$ so that every continuous function inside the $\varepsilon$-tube around $f$ is also in $\mathcal{U}$. This is exactly the “$\varepsilon$-tube” picture from Topic 4’s visualization of uniform convergence — the only difference is that now we are thinking of a tube as an open ball in a metric space.

Four-panel figure: interior, closure, and boundary of a subset of R; an open ball in R²; an ε-tube open ball in C([0,1]); and the trivial "balls" of the discrete metric.

4. Continuity in Metric Spaces

Continuity is the next concept that lifts from $\mathbb{R}$ to arbitrary metric spaces. We give three characterizations — an $\varepsilon$–$\delta$ definition, a sequential definition, and a preimage-of-open-sets definition — and prove that all three are equivalent. The third characterization is the conceptual payoff: it describes continuity using only the structure of open sets, which is what lets the definition generalize to spaces without a metric at all.

📐 Definition 5 (Continuity (ε–δ))

Let $(X, d_X)$ and $(Y, d_Y)$ be metric spaces. A function $f : X \to Y$ is continuous at a point $x_0 \in X$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that

$$d_X(x, x_0) < \delta \quad \Longrightarrow \quad d_Y(f(x), f(x_0)) < \varepsilon.$$

The function $f$ is continuous (on all of $X$) if it is continuous at every point of $X$.

This is the same $\varepsilon$–$\delta$ definition we used in Topic 2 for real-valued functions of a real variable, with $d_X$ and $d_Y$ replacing the absolute value. Now the three characterizations.

🔷 Theorem 1 (Three Characterizations of Continuity)

Let $f : (X, d_X) \to (Y, d_Y)$ be a function between metric spaces. The following are equivalent:

  1. $f$ is continuous at every point of $X$ (the $\varepsilon$–$\delta$ definition).
  2. For every convergent sequence $x_n \to x$ in $X$, we have $f(x_n) \to f(x)$ in $Y$.
  3. For every open set $V \subseteq Y$, the preimage $f^{-1}(V) \subseteq X$ is open in $X$.

Proof.

We prove the cycle of implications $(1) \Rightarrow (2) \Rightarrow (3) \Rightarrow (1)$.

$(1) \Rightarrow (2)$. Assume $f$ is continuous at every point, let $x_n \to x$ in $X$, and fix $\varepsilon > 0$. Continuity at $x$ gives some $\delta > 0$ such that $d_X(y, x) < \delta$ implies $d_Y(f(y), f(x)) < \varepsilon$. Since $x_n \to x$, there exists $N$ such that for all $n \geq N$ we have $d_X(x_n, x) < \delta$, and therefore $d_Y(f(x_n), f(x)) < \varepsilon$. This is exactly the statement that $f(x_n) \to f(x)$.

$(2) \Rightarrow (3)$. Assume sequential continuity, let $V \subseteq Y$ be open, and let $x \in f^{-1}(V)$. We need to show some ball $B(x, r)$ is contained in $f^{-1}(V)$. Suppose not. Then for every $n \geq 1$ there exists a point $x_n$ with $d_X(x_n, x) < 1/n$ but $f(x_n) \notin V$. By construction $x_n \to x$, so by sequential continuity $f(x_n) \to f(x)$. But $f(x) \in V$ and $V$ is open, so some ball $B(f(x), \varepsilon)$ is contained in $V$, which means $f(x_n) \in V$ for all sufficiently large $n$ — contradicting $f(x_n) \notin V$.

$(3) \Rightarrow (1)$. Assume preimages of open sets are open, let $x_0 \in X$, and fix $\varepsilon > 0$. The open ball $B(f(x_0), \varepsilon)$ is open in $Y$, so by hypothesis its preimage $f^{-1}(B(f(x_0), \varepsilon))$ is open in $X$ and contains $x_0$. By the definition of “open” we can find some $\delta > 0$ with $B(x_0, \delta) \subseteq f^{-1}(B(f(x_0), \varepsilon))$, which rewrites as: $d_X(x, x_0) < \delta \Rightarrow d_Y(f(x), f(x_0)) < \varepsilon$. That is the $\varepsilon$–$\delta$ condition at $x_0$. Since $x_0$ was arbitrary, $f$ is continuous.

💡 Remark 2 (Why Preimages Matter)

Characterization (3) is the conceptual leap: it describes continuity entirely in terms of the structure of open sets, with no reference to distances or sequences. This is what lets the definition generalize to topological spaces in Section 8, where there may be no metric at all. In metric spaces all three characterizations are equivalent — we just proved it — so the preimage formulation looks like one more way to say the same thing. The power of (3) only becomes apparent when we leave metrics behind: in an arbitrary topological space, the $\varepsilon$–$\delta$ definition is not even expressible, and the sequential definition can fail in strange ways. The preimage characterization is the universally correct one.

📐 Definition 6 (Lipschitz Continuity)

A function $f : (X, d_X) \to (Y, d_Y)$ is Lipschitz continuous with constant $L \geq 0$ if

$$d_Y(f(x), f(y)) \leq L \cdot d_X(x, y) \quad \text{for all } x, y \in X.$$

The smallest such $L$ is called the Lipschitz constant of $f$. If $L < 1$ the map is called a contraction, and this case will drive Section 7.

Lipschitz continuity is strictly stronger than continuity (every Lipschitz function is continuous: for $L > 0$, take $\delta = \varepsilon / L$) and strictly weaker than differentiability with bounded derivative: by the mean value theorem, a differentiable function on an interval with $|f'| \leq L$ is Lipschitz with constant $L$, while $|x|$ is Lipschitz but not differentiable at $0$. For ML purposes, Lipschitz constants quantify how rapidly a function can change: they control the convergence rate of gradient descent, the sensitivity of a neural network to input perturbations, and the stability of numerical ODE solvers. We will see concrete examples in Sections 7 and 10.
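A Lipschitz constant can be estimated from below by sampling difference quotients. The sketch below is our own illustration (the helper name is ours, and a sampled maximum is only a lower bound on the true constant, not the supremum itself):

```python
import numpy as np

def lipschitz_lower_bound(f, xs):
    """max |f(x)-f(y)| / |x-y| over all sampled pairs — a lower bound on L."""
    xs = np.asarray(xs, dtype=float)
    fx = f(xs)
    dx = np.abs(xs[:, None] - xs[None, :])   # pairwise input distances
    df = np.abs(fx[:, None] - fx[None, :])   # pairwise output distances
    mask = dx > 0
    return float(np.max(df[mask] / dx[mask]))

xs = np.linspace(-1.0, 1.0, 401)
assert lipschitz_lower_bound(np.sin, xs) <= 1.0 + 1e-9          # |sin'| <= 1
assert abs(lipschitz_lower_bound(lambda x: 3 * x, xs) - 3.0) < 1e-9
```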

Two-panel figure: the ε–δ picture in a metric space, and the preimage-of-open-set characterization — an open ball V in Y pulled back to an open set in X.

5. Completeness in Metric Spaces

Completeness generalizes the Cauchy-completeness of $\mathbb{R}$ (Topic 3, Section 1) to arbitrary metric spaces. The crucial new example — the one that powers every fixed-point argument in the rest of the topic — is that $C([a, b])$ with the sup-norm is complete.

📐 Definition 7 (Cauchy Sequence in a Metric Space)

A sequence $(x_n)$ in a metric space $(X, d)$ is Cauchy if for every $\varepsilon > 0$ there exists $N$ such that

$$d(x_m, x_n) < \varepsilon \quad \text{for all } m, n \geq N.$$

📐 Definition 8 (Complete Metric Space)

A metric space $(X, d)$ is complete if every Cauchy sequence in $X$ converges to a limit that is also in $X$.

Cauchy says the terms of the sequence eventually cluster together; complete says that clustering is enough to guarantee a limit. The two concepts are linked but distinct: $\mathbb{Q}$ is not complete (the sequence $1, 1.4, 1.41, 1.414, \ldots$ of decimal approximations to $\sqrt{2}$ is Cauchy in $\mathbb{Q}$ but has no rational limit), while $\mathbb{R}$ is complete — and this is exactly how $\mathbb{R}$ was constructed from $\mathbb{Q}$ (by filling in the missing limits).
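The $\sqrt{2}$ example can be made exact with rational arithmetic. In this sketch (our own illustration, using the Babylonian update $x \mapsto (x + 2/x)/2$ rather than the decimal truncations in the text) every term is a genuine rational, consecutive terms cluster, yet the limit is missing from $\mathbb{Q}$:

```python
from fractions import Fraction

x = Fraction(2)
seq = [x]
for _ in range(6):
    x = (x + 2 / x) / 2   # Babylonian step: stays in Q, approaches sqrt(2)
    seq.append(x)

# consecutive gaps shrink rapidly: the sequence is Cauchy in Q...
gaps = [abs(seq[i + 1] - seq[i]) for i in range(len(seq) - 1)]
assert all(g2 < g1 for g1, g2 in zip(gaps, gaps[1:]))
# ...and the squares approach 2 exactly, though no rational squares to 2
assert abs(seq[-1] ** 2 - 2) < Fraction(1, 10**20)
```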

📝 Example 6 (Completeness of Rⁿ)

$\mathbb{R}$ is complete by the Cauchy-completeness theorem from Topic 3, Section 1. $\mathbb{R}^n$ is complete under any of the $\ell^p$ metrics: if $(x^{(k)})$ is a Cauchy sequence in $\mathbb{R}^n$, then each coordinate sequence $(x^{(k)}_i)$ is Cauchy in $\mathbb{R}$ (because $|x^{(k)}_i - x^{(\ell)}_i| \leq d_p(x^{(k)}, x^{(\ell)})$), so each coordinate converges to some limit $x^*_i$, and coordinate-wise convergence in $\mathbb{R}^n$ implies $\ell^p$ convergence. This is the same argument Topic 3 ran for the Euclidean metric; all it needed was the coordinate inequality, which holds for every $p \geq 1$.

🔷 Theorem 2 (Completeness of C([a,b]))

The space $C([a, b])$ with the sup-norm metric $d_\infty(f, g) = \|f - g\|_\infty$ is complete.

Proof.

Let $(f_n)$ be a Cauchy sequence in $C([a, b])$. We need to exhibit a continuous function $f$ such that $\|f_n - f\|_\infty \to 0$.

Step 1: Pointwise limit exists. For each $x \in [a, b]$, the sequence of real numbers $(f_n(x))$ is Cauchy in $\mathbb{R}$, because

$$|f_n(x) - f_m(x)| \leq \sup_{t \in [a,b]} |f_n(t) - f_m(t)| = \|f_n - f_m\|_\infty.$$

Since $\mathbb{R}$ is complete, $(f_n(x))$ converges to some real number, which we define to be $f(x)$. This gives a function $f : [a, b] \to \mathbb{R}$ as the pointwise limit.

Step 2: Convergence is uniform. Given $\varepsilon > 0$, the Cauchy condition gives $N$ such that $\|f_n - f_m\|_\infty < \varepsilon$ for all $m, n \geq N$. Fix $x \in [a, b]$ and any $n \geq N$. Letting $m \to \infty$ in the inequality $|f_n(x) - f_m(x)| < \varepsilon$, the left side converges to $|f_n(x) - f(x)|$, so $|f_n(x) - f(x)| \leq \varepsilon$. Since this holds for every $x$, we get $\|f_n - f\|_\infty \leq \varepsilon$ for all $n \geq N$. Since $\varepsilon$ was arbitrary, $f_n \to f$ uniformly.

Step 3: The limit is continuous. This is exactly the Uniform Limit Theorem from Uniform Convergence: a uniform limit of continuous functions is continuous. So $f \in C([a, b])$, and the previous step says $\|f_n - f\|_\infty \to 0$. The sequence converges in the sup-norm metric, as needed.

This theorem is small but it is the backbone of the rest of the topic. Every fixed-point argument in Sections 7 and 10 — Picard iteration, Newton’s method, Bellman iteration — lives inside $(C([a, b]), d_\infty)$ or an analogous function space, and every such argument needs to know that Cauchy sequences converge. The completeness of $C([a, b])$ is what lets us “extract a limit” at the end of each proof.

📝 Example 7 (Incompleteness of C([0,1]) under a Different Metric)

The same set $C([0, 1])$ is not complete under the integral-based metric

$$d_2(f, g) = \sqrt{\int_0^1 (f(x) - g(x))^2 \, dx}.$$

Consider the sequence of continuous functions $f_n$ on $[0, 1]$ defined to be $0$ on $[0, 1/2 - 1/n]$, linear from $0$ to $1$ on $[1/2 - 1/n, 1/2 + 1/n]$, and $1$ on $[1/2 + 1/n, 1]$ — a “ramp” of width $2/n$ around $x = 1/2$. For large $m, n$, the difference $f_m - f_n$ is bounded by $1$ and supported in a shrinking interval of length $2/\min(m, n)$, so $d_2(f_m, f_n) \to 0$ as $m, n \to \infty$: the sequence is Cauchy in $d_2$. But its pointwise limit is the step function that jumps from $0$ to $1$ at $x = 1/2$, which is discontinuous — and no continuous function can be the $d_2$-limit of this Cauchy sequence. So $(C([0, 1]), d_2)$ is not complete. The completeness of a space depends on the metric, not just the set.
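The ramp computation is easy to reproduce numerically. A sketch (our own illustration, approximating $d_2$ by a Riemann sum on a grid; the helper names are ours): the pairwise $d_2$ distances shrink toward zero while the $d_2$-limit is the discontinuous step.

```python
import numpy as np

def ramp(n):
    """f_n: 0 on [0, 1/2-1/n], linear up, 1 on [1/2+1/n, 1]."""
    return lambda x: np.clip((x - 0.5) * (n / 2) + 0.5, 0.0, 1.0)

def d2(f, g, m=100_001):
    """Riemann-sum approximation of the integral metric d_2 on [0, 1]."""
    xs = np.linspace(0.0, 1.0, m)
    return float(np.sqrt(np.mean((f(xs) - g(xs)) ** 2)))

step = lambda x: (x >= 0.5).astype(float)

# Cauchy in d2: distances between later and later terms keep shrinking...
assert d2(ramp(100), ramp(200)) < d2(ramp(10), ramp(20)) < d2(ramp(2), ramp(4))
# ...and the d2-limit is the step function, which is not continuous.
assert d2(ramp(1000), step) < 0.02
```

The exact value $d_2(f_n, \text{step}) = 1/\sqrt{6n}$ makes the $O(1/\sqrt{n})$ decay visible in the numbers.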

💡 Remark 3 (The Completion Theorem)

Every metric space $(X, d)$ has a completion: a complete metric space $(\hat{X}, \hat{d})$ together with an isometric embedding $\iota : X \hookrightarrow \hat{X}$ such that $\iota(X)$ is dense in $\hat{X}$. The construction mirrors how $\mathbb{R}$ is built from $\mathbb{Q}$: form equivalence classes of Cauchy sequences in $X$, where two Cauchy sequences are equivalent if their interleaving is also Cauchy. The completion of $(C([0, 1]), d_2)$ is a larger space containing limits like our ramp-sequence step function — a space that analysts call $L^2[0, 1]$ and that shows up constantly in modern machine learning. We state the completion theorem as a fact rather than proving it in detail; Munkres, Chapter 3, has the full construction.

Three-panel figure: a Cauchy sequence in R converging; a Cauchy sequence in (C([0,1]), sup-norm) converging uniformly; the ramp-sequence Cauchy in d₂ but with no continuous limit.

6. Compactness in Metric Spaces

Compactness is the most technically important concept in the topic. In $\mathbb{R}^n$, the Heine–Borel theorem from Topic 3 told us that compact sets are exactly the closed and bounded ones. In general metric spaces this is false — and the witness to the failure is the $\sin(n\pi x)$ counterexample we promised back in Topic 3. In this section we give the general definition, prove the three-way equivalence with sequential compactness and total boundedness, and visualize the counterexample.

📐 Definition 9 (Open-Cover Compactness)

A metric space $(X, d)$ is compact if every open cover of $X$ has a finite subcover: whenever $X = \bigcup_{\alpha \in I} U_\alpha$ with each $U_\alpha$ open, there is a finite set $\{\alpha_1, \ldots, \alpha_N\} \subseteq I$ such that $X = U_{\alpha_1} \cup \cdots \cup U_{\alpha_N}$. A subset $K \subseteq X$ is compact if it is compact as a metric space under the restricted metric.

📐 Definition 10 (Sequential Compactness)

A metric space $(X, d)$ is sequentially compact if every sequence in $X$ has a convergent subsequence.

📐 Definition 11 (Total Boundedness)

A metric space $(X, d)$ is totally bounded if for every $\varepsilon > 0$ there exist finitely many points $x_1, \ldots, x_N \in X$ such that the corresponding $\varepsilon$-balls cover $X$:

$$X = \bigcup_{i=1}^N B(x_i, \varepsilon).$$

The collection $\{x_1, \ldots, x_N\}$ is called a finite $\varepsilon$-net for $X$.

Total boundedness is strictly stronger than boundedness. A bounded subset of a metric space only needs to fit inside a single large ball; a totally bounded set must be approximable, at every resolution $\varepsilon$, by a finite number of points. In $\mathbb{R}^n$ the two concepts agree: because $\mathbb{R}^n$ is finite-dimensional, a single bounded ball can be covered by finitely many $\varepsilon$-balls. In infinite-dimensional spaces they come apart dramatically — and that is exactly why Heine–Borel fails in $C([0, 1])$.
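A finite $\varepsilon$-net is straightforward to construct for a cube in $\mathbb{R}^n$. In this sketch (the grid construction and helper name are ours) we place cell centers fine enough that each cell's half-diagonal is at most $\varepsilon$, then check coverage empirically:

```python
import itertools
import math
import numpy as np

def grid_net(eps, dim):
    """A finite eps-net for [0,1]^dim under the Euclidean metric: centers of
    a grid fine enough that every cube point is within eps of some center."""
    k = math.ceil(math.sqrt(dim) / (2 * eps))  # cells per axis
    axis = (np.arange(k) + 0.5) / k            # cell centers along one axis
    return np.array(list(itertools.product(axis, repeat=dim)))

net = grid_net(0.1, 2)                         # 64 centers suffice for eps=0.1
pts = np.random.default_rng(1).random((5000, 2))
dists = np.linalg.norm(pts[:, None, :] - net[None, :, :], axis=2)
assert dists.min(axis=1).max() <= 0.1          # every sample is eps-covered
```

Shrinking $\varepsilon$ only multiplies the number of centers; in an infinite-dimensional ball no finite collection works for small $\varepsilon$, which is the content of the next section's counterexample.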

🔷 Theorem 3 (Compactness Equivalence in Metric Spaces)

For a metric space $(X, d)$, the following three conditions are equivalent:

  1. $X$ is compact (every open cover has a finite subcover).
  2. $X$ is sequentially compact (every sequence has a convergent subsequence).
  3. $X$ is complete and totally bounded.

Proof.

We sketch the three directions. Full proofs are in Munkres, Chapter 3, or Rudin, Chapter 2.

$(1) \Rightarrow (2)$. Suppose $X$ is compact and let $(x_n)$ be a sequence with no convergent subsequence. Then no point $x \in X$ is a cluster point of $(x_n)$, so each $x$ has an open ball $B(x, r_x)$ containing only finitely many terms of the sequence. These balls form an open cover of $X$. Compactness gives a finite subcover, which therefore contains only finitely many terms in total — but the full sequence has infinitely many terms, a contradiction.

$(2) \Rightarrow (3)$. Sequential compactness implies completeness: a Cauchy sequence has a convergent subsequence by hypothesis, and a Cauchy sequence with a convergent subsequence converges to the same limit (the standard Cauchy-subsequence lemma). For total boundedness, suppose $X$ has no finite $\varepsilon$-net for some $\varepsilon > 0$. Pick any $x_1 \in X$. Since $B(x_1, \varepsilon)$ is not all of $X$, pick $x_2 \notin B(x_1, \varepsilon)$. Since $B(x_1, \varepsilon) \cup B(x_2, \varepsilon)$ is not all of $X$, pick $x_3$ outside both balls. Continuing, we construct a sequence with pairwise distances $\geq \varepsilon$, which has no convergent subsequence — contradicting sequential compactness.

$(3) \Rightarrow (1)$ (the hardest direction). Suppose $X$ is complete and totally bounded but has an open cover $\{U_\alpha\}$ with no finite subcover. Total boundedness gives a finite $\varepsilon_1$-net; at least one of the balls $B(x_i, \varepsilon_1)$ has no finite subcover (otherwise combining finite subcovers of the balls would give one for the whole space). Pick one such ball, halve the radius to $\varepsilon_1 / 2$, and apply total boundedness inside that ball to get a smaller ball with no finite subcover. Iterating, we obtain a nested sequence of sets with diameters shrinking to zero; picking one point from each gives a Cauchy sequence. By completeness it converges to some limit $x^*$, which belongs to some $U_\alpha$ in the cover. Because $U_\alpha$ is open, it contains a ball around $x^*$, so $U_\alpha$ alone covers the sets in our nested sequence once their diameters are small enough — giving a finite subcover after all, contradicting our construction.

💡 Remark 4 (Heine–Borel as a Special Case)

In $\mathbb{R}^n$, the Heine–Borel theorem from Completeness & Compactness in $\mathbb{R}^n$ says compact ⟺ closed and bounded. This is a special case of the general three-way equivalence we just proved: in $\mathbb{R}^n$, bounded and totally bounded coincide (a bounded ball in finite dimensions can be covered by finitely many $\varepsilon$-balls), and a closed subset of a complete space is complete. So in $\mathbb{R}^n$, closed + bounded is the same as complete + totally bounded, which by Theorem 3 is the same as compact. The equivalence “bounded ⟺ totally bounded” fails in infinite dimensions — and we are about to see exactly how badly.

Figure: the family sin(nπx) and its pairwise sup-norm distance matrix — all off-diagonal entries ≥ 1.76, so no subsequence is Cauchy. The closed unit ball in C([0,1]) is closed and bounded but not compact: no subsequence of these functions converges in the sup-norm.

📝 Example 8 (The C([0,1]) Counterexample)

The closed unit ball B={fC([0,1]):f1}\mathcal{B} = \{f \in C([0, 1]) : \|f\|_\infty \leq 1\} in the sup-norm is closed and bounded, but not compact. This is the counterexample promised in Topic 3.

Proof. Consider the sequence fn(x)=sin(nπx)f_n(x) = \sin(n\pi x) for n=1,2,3,n = 1, 2, 3, \ldots. Each fnf_n is continuous and fn=1\|f_n\|_\infty = 1, so every fnBf_n \in \mathcal{B}. We claim that no subsequence of (fn)(f_n) is Cauchy in the sup-norm — and therefore no subsequence converges, so B\mathcal{B} is not sequentially compact and (by Theorem 3) not compact.

To see why, fix distinct indices mnm \neq n. The functions sin(mπx)\sin(m\pi x) and sin(nπx)\sin(n\pi x) are orthogonal in L2[0,1]L^2[0, 1]:

01sin(mπx)sin(nπx)dx=0.\int_0^1 \sin(m\pi x) \sin(n\pi x) \, dx = 0.

Expanding 01(sin(mπx)sin(nπx))2dx\int_0^1 (\sin(m\pi x) - \sin(n\pi x))^2 \, dx using orthogonality and 01sin2(kπx)dx=1/2\int_0^1 \sin^2(k\pi x)\,dx = 1/2:

01(sin(mπx)sin(nπx))2dx=12+120=1.\int_0^1 (\sin(m\pi x) - \sin(n\pi x))^2 \, dx = \tfrac{1}{2} + \tfrac{1}{2} - 0 = 1.

Now use the bound f22f2\|f\|_2^2 \leq \|f\|_\infty^2 on [0,1][0, 1]: if the sup-norm of the difference were less than 1, the integral above would be less than 1. But the integral equals 1, so

fmfnfmfn2=1for all mn.\|f_m - f_n\|_\infty \geq \|f_m - f_n\|_2 = 1 \quad \text{for all } m \neq n.

Every pair of terms in the sequence is at sup-norm distance at least 1 — so no subsequence is Cauchy, and no subsequence converges.

The lesson. In an infinite-dimensional space, closed and bounded is not enough for compactness. The missing ingredient is equicontinuity: the family sin(nπx)\sin(n\pi x) is bounded, but the functions “wiggle faster and faster” — for large nn there is no single δ\delta that controls fn(x)fn(y)|f_n(x) - f_n(y)| uniformly across the family. We will name this obstruction and package it into the Arzelà–Ascoli theorem in Section 9.
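The pairwise-distance claim is easy to check numerically. This sketch (our own grid resolution and range of nn) samples the family and confirms that every off-diagonal sup-norm distance exceeds 1:

```python
import numpy as np

# Sample sin(n*pi*x) on a fine grid and measure all pairwise sup-norm
# distances -- a numerical companion to the proof, not part of it.
x = np.linspace(0.0, 1.0, 2001)
family = np.array([np.sin(n * np.pi * x) for n in range(1, 9)])

# pairwise sup-norm distance matrix over the sampled family
D = np.max(np.abs(family[:, None, :] - family[None, :, :]), axis=2)
off_diag = D[~np.eye(len(family), dtype=bool)]
print(off_diag.min())   # every pair is at sup-norm distance > 1, as proved
```

The observed minimum is in fact well above the guaranteed bound of 1, consistent with the distance matrix shown in the figure.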

Two-panel figure: the family sin(nπx) for n = 1..8 plotted, and the pairwise sup-norm distance matrix showing all off-diagonal entries ≥ 1.

7. The Contraction Mapping Theorem

We now prove the theorem that justifies most of the fixed-point arguments scattered throughout the rest of this curriculum. It is short, clean, and extraordinarily useful: four earlier topics have been waiting to cite it.

🔷 Theorem 4 (Banach Contraction Mapping Theorem)

Let (X,d)(X, d) be a complete metric space and let T:XXT : X \to X be a contraction with Lipschitz constant 0k<10 \leq k < 1, meaning

d(T(x),T(y))kd(x,y)for all x,yX.d(T(x), T(y)) \leq k \cdot d(x, y) \quad \text{for all } x, y \in X.

Then:

  1. TT has a unique fixed point xXx^* \in X — a point with T(x)=xT(x^*) = x^*.
  2. For any starting point x0Xx_0 \in X, the iterates xn+1=T(xn)x_{n+1} = T(x_n) converge to xx^*.
  3. The convergence rate is geometric: d(xn,x)kn1kd(x0,x1)d(x_n, x^*) \leq \dfrac{k^n}{1 - k} \cdot d(x_0, x_1).

Proof.

Fix a starting point x0Xx_0 \in X and define the iterates xn+1=T(xn)x_{n+1} = T(x_n). The contraction property applied to consecutive iterates gives

d(xn+1,xn)=d(T(xn),T(xn1))kd(xn,xn1).d(x_{n+1}, x_n) = d(T(x_n), T(x_{n-1})) \leq k \cdot d(x_n, x_{n-1}).

By induction, this bound extends to

d(xn+1,xn)knd(x1,x0).d(x_{n+1}, x_n) \leq k^n \cdot d(x_1, x_0).

Now we show (xn)(x_n) is Cauchy. For any m>nm > n, the triangle inequality stacks up mnm - n consecutive-iterate bounds:

d(xm,xn)j=nm1d(xj+1,xj)d(x1,x0)j=nm1kj.d(x_m, x_n) \leq \sum_{j = n}^{m-1} d(x_{j+1}, x_j) \leq d(x_1, x_0) \sum_{j = n}^{m-1} k^j.

The geometric-series tail bound gives j=nm1kjkn1k\sum_{j = n}^{m-1} k^j \leq \dfrac{k^n}{1 - k}, so

d(xm,xn)kn1kd(x1,x0).d(x_m, x_n) \leq \frac{k^n}{1 - k} \cdot d(x_1, x_0).

Since k<1k < 1, the right-hand side tends to 00 as nn \to \infty uniformly in mm — the sequence (xn)(x_n) is Cauchy. Because XX is complete, (xn)(x_n) converges to some limit xXx^* \in X.

Next we show xx^* is a fixed point. Contractions are Lipschitz (with constant kk), hence continuous, so we can pass TT through the limit:

T(x)=T(limnxn)=limnT(xn)=limnxn+1=x.T(x^*) = T\big(\lim_{n \to \infty} x_n\big) = \lim_{n \to \infty} T(x_n) = \lim_{n \to \infty} x_{n+1} = x^*.

Uniqueness: suppose yy^* is another fixed point of TT. Then

d(x,y)=d(T(x),T(y))kd(x,y).d(x^*, y^*) = d(T(x^*), T(y^*)) \leq k \cdot d(x^*, y^*).

Since k<1k < 1, this forces d(x,y)=0d(x^*, y^*) = 0, so y=xy^* = x^*.

Finally, the convergence rate. Letting mm \to \infty in the Cauchy-tail bound d(xm,xn)kn1kd(x1,x0)d(x_m, x_n) \leq \frac{k^n}{1-k} \cdot d(x_1, x_0), the left side converges to d(x,xn)d(x^*, x_n), proving

d(xn,x)kn1kd(x0,x1).d(x_n, x^*) \leq \frac{k^n}{1 - k} \cdot d(x_0, x_1).

Three things to note. First, the theorem tells us nothing about which starting point to use — any x0Xx_0 \in X works, and they all converge to the same fixed point. Second, the rate bound is sharp enough to give a stopping criterion: after nn iterations, the distance to xx^* is at most kn1k\frac{k^n}{1-k} times the first step d(x0,x1)d(x_0, x_1), a quantity we can measure after one evaluation of TT. Third, the contraction hypothesis k<1k < 1 is strict: at k=1k = 1 the theorem breaks, as the viz below demonstrates by letting you set k=1k = 1 and watch iterates orbit instead of converging.

Interactive demo: an affine map T(x) = k·Rθ·x + c with θ = 30°. The operator norm of k·Rθ is exactly k, so the map is a contraction whenever k < 1; after 60 iterations the demo reports d(x_n, x*) ≈ 8.4e−10, inside the bound kⁿ/(1−k)·d(x₀,x₁) ≈ 1.5e−9. Click anywhere in the plot to set a new starting point x₀; drag the slider to change k.
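A non-interactive sketch of the same experiment, with toy parameters of our own choosing (k = 0.8, θ = 30°, c = (1, 0)), verifies the Theorem 4 rate bound at every iterate:

```python
import numpy as np

# T(x) = k * R_theta @ x + c with k < 1: the Lipschitz constant of T is
# exactly k (rotations are isometries), so Theorem 4 applies on (R^2, ||.||).
k, theta = 0.8, np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
c = np.array([1.0, 0.0])

def T(x):
    return k * R @ x + c

x_star = np.linalg.solve(np.eye(2) - k * R, c)   # exact fixed point
x = np.array([5.0, 5.0])                         # arbitrary starting point
d01 = np.linalg.norm(T(x) - x)                   # first-step distance d(x0, x1)
for n in range(1, 61):
    x = T(x)
    err = np.linalg.norm(x - x_star)
    bound = k**n / (1 - k) * d01                 # Theorem 4's a priori bound
    assert err <= bound + 1e-12                  # the rate bound holds each step
print(err)   # tiny after 60 iterations, well inside the bound
```

Changing the starting point changes d(x₀, x₁) but not the limit, matching the first of the three notes above.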

📝 Example 9 (Picard Iteration for ODEs)

The Picard–Lindelöf existence-and-uniqueness theorem for first-order ODEs from First-Order ODEs rewrites the initial-value problem y(t)=f(t,y(t))y'(t) = f(t, y(t)), y(0)=y0y(0) = y_0 as the fixed-point equation

y(t)=y0+0tf(s,y(s))ds.y(t) = y_0 + \int_0^t f(s, y(s)) \, ds.

Defining the Picard operator Φ[y](t)=y0+0tf(s,y(s))ds\Phi[y](t) = y_0 + \int_0^t f(s, y(s)) \, ds, a fixed point of Φ\Phi is exactly a solution of the ODE. Topic 21 showed that when ff is Lipschitz in its second argument with constant LL and we work on a short interval [0,h][0, h] with Lh<1Lh < 1, the Picard operator is a contraction on C([0,h])C([0, h]) with the sup-norm — constant LhLh — and C([0,h])C([0, h]) is complete by Theorem 2. The Banach theorem then produces a unique fixed point, which is the ODE solution. The abstract theorem we just proved is what makes Topic 21’s concrete argument go through.
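As a concrete sketch (our own toy problem, not Topic 21's): for y′ = y, y(0) = 1 on [0, 0.5], the Lipschitz constant is L = 1 and Lh = 0.5 < 1, so Picard iteration contracts toward exp(t). Discretizing the integral with the trapezoid rule:

```python
import numpy as np

# Picard iteration for y' = y, y(0) = 1 on [0, h] with h = 0.5 (Lh = 0.5 < 1).
# The Picard operator Phi[y](t) = 1 + int_0^t y(s) ds is a contraction on
# C([0, h]) in the sup-norm; its fixed point is exp(t).
h, N = 0.5, 1001
t = np.linspace(0.0, h, N)
dt = t[1] - t[0]

def picard_step(y):
    # cumulative trapezoid approximation of int_0^t y(s) ds
    integral = np.concatenate(([0.0], np.cumsum((y[1:] + y[:-1]) / 2 * dt)))
    return 1.0 + integral

y = np.zeros_like(t)            # start from the zero function -- any start works
for _ in range(40):
    y = picard_step(y)

print(np.max(np.abs(y - np.exp(t))))   # sup-norm error: quadrature-limited
```

The remaining error is set by the trapezoid discretization, not by the iteration: the contraction has long since converged to the discrete fixed point.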

📝 Example 10 (Newton Iteration for Implicit Methods)

The implicit Euler method from Numerical Methods for ODEs requires solving the nonlinear equation yn+1=yn+hf(tn+1,yn+1)y_{n+1} = y_n + h \cdot f(t_{n+1}, y_{n+1}) at each time step. Newton’s method reduces this to iterating g(y)=y[Ihfy(tn+1,y)]1(yynhf(tn+1,y))g(y) = y - [I - h f_y(t_{n+1}, y)]^{-1}(y - y_n - h f(t_{n+1}, y)). For sufficiently small hh, the derivative of gg at the root is bounded in norm by some k<1k < 1, so by the mean value theorem gg is a contraction on a small neighborhood of the root and Banach’s theorem guarantees convergence. Topic 24 invoked this conclusion in its convergence analysis without proof; we have now proved the general result.
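Even without Newton, the implicit Euler equation can be solved by direct fixed-point iteration when h·L < 1 — a minimal sketch with an assumed test equation y′ = −5y:

```python
import numpy as np

# One implicit Euler step for y' = -5y: solve y = y_n + h * f(t, y) by
# iterating g(y) = y_n + h * (-5 * y). Here |g'| = 5h = 0.5 < 1, so g is a
# contraction on R and Banach's theorem guarantees convergence.
lam, h, y_n = -5.0, 0.1, 1.0

def g(y):
    return y_n + h * lam * y

y = y_n                          # warm-start from the previous time step
for _ in range(50):
    y = g(y)

exact = y_n / (1 - h * lam)      # closed-form solution of the linear step
print(y, exact)                  # the iterates converge to the exact value
```

Newton trades this simple iteration for a faster one, but both rest on the same contraction argument near the root.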

📝 Example 11 (The Inverse Function Theorem)

The proof of the Inverse Function Theorem in Topic 12 constructs a local inverse by setting up the fixed-point equation x=ϕ(x)x = \phi(x) where ϕ\phi is a map on a small ball in Rn\mathbb{R}^n whose Jacobian is bounded below 1 in operator norm. Banach’s theorem then gives a unique fixed point, which is the preimage of the target point. The whole construction is a single application of Theorem 4 to a carefully chosen contraction — the argument was specialized to Rn\mathbb{R}^n in Topic 12, but the contraction-mapping framework developed here shows why it works in full generality.

💡 Remark 5 (Bernstein Operators Are Not Contractions)

Topic 20 proved that Bernstein operators BnfB_n f converge uniformly to ff on [0,1][0, 1] — a statement that lives entirely in the metric space (C([0,1]),d)(C([0, 1]), d_\infty). But BnB_n is not a contraction: it does not uniformly shrink sup-norm distances between arbitrary functions. Bernstein convergence is therefore not an application of Theorem 4; it is a different kind of approximation phenomenon that happens to share the same ambient metric space. The general lesson: sitting inside a complete metric space does not automatically make every self-map a contraction, and density-of-polynomials results require separate machinery (probabilistic moment arguments in the Bernstein case). What Topic 20 does share with this topic is the underlying metric-space structure — the sup-norm distance between BnfB_n f and ff is how “convergence” is defined there, and that distance is a metric in exactly the sense defined here.

Three-panel figure: iterates converging for k < 1, orbiting for k = 1, and diverging for k > 1.

Log-scale convergence plot: actual d(x_n, x*) vs. theoretical bound k^n/(1−k)·d(x₀, x₁) for several values of k.

8. General Topology — Beyond Metric Spaces

Everything we have done so far has used the metric dd. But many of the concepts — continuity, open sets, closure, compactness — make sense in spaces where there is no metric at all, provided you know which sets are “open.” This section introduces the general notion of a topological space and explains why the preimage characterization of continuity is the one that generalizes.

📐 Definition 12 (Topological Space)

A topological space is a pair (X,τ)(X, \tau) where XX is a set and τ\tau is a collection of subsets of XX — called open sets — satisfying:

  1. τ\emptyset \in \tau and XτX \in \tau.
  2. Arbitrary unions of elements of τ\tau are in τ\tau.
  3. Finite intersections of elements of τ\tau are in τ\tau.

The collection τ\tau is called a topology on XX. A metric space (X,d)(X, d) induces a topology τd={UX:U is open in the metric-space sense}\tau_d = \{\, U \subseteq X : U \text{ is open in the metric-space sense}\,\} by Proposition 1, so every metric space is a topological space. But there are topological spaces with no compatible metric, and those are strictly more general.

The three axioms are exactly the properties Proposition 1 established for metric-space open sets. Any collection of subsets satisfying them qualifies as a topology, regardless of whether it came from a metric. The question is whether all topologies come from metrics — and the answer is no.

📝 Example 12 (The Cofinite Topology)

Let XX be an uncountable set — for concreteness, take X=RX = \mathbb{R}. Define τ\tau to consist of the empty set together with all subsets UXU \subseteq X whose complement XUX \setminus U is finite:

τ={}{UX:XU<}.\tau = \{\emptyset\} \cup \{\, U \subseteq X : |X \setminus U| < \infty\,\}.

This is a topology: the empty set and XX are in τ\tau, arbitrary unions of cofinite sets are cofinite (their complement is an intersection of finite sets, which is finite), and finite intersections of cofinite sets are cofinite (their complement is a finite union of finite sets, which is finite). So (X,τ)(X, \tau) is a valid topological space.

But (X,τ)(X, \tau) is not metrizable: there is no metric that generates this topology. Here is one quick reason. In the cofinite topology, any two non-empty open sets must intersect (the union of their complements is finite, so the intersection of the sets themselves is cofinite, hence non-empty). So for any two distinct points x,yXx, y \in X, every neighborhood of xx intersects every neighborhood of yy. In a metric space we can always separate distinct points by disjoint balls B(x,d(x,y)/3)B(x, d(x,y)/3) and B(y,d(x,y)/3)B(y, d(x,y)/3) — a property called the Hausdorff property. The cofinite topology on an uncountable set fails this, so no metric can generate it.

📐 Definition 13 (Continuity in Topological Spaces)

A function f:(X,τX)(Y,τY)f : (X, \tau_X) \to (Y, \tau_Y) between topological spaces is continuous if for every open set VτYV \in \tau_Y, the preimage f1(V)τXf^{-1}(V) \in \tau_X.

This is characterization (3) from Theorem 1 — the preimage-of-open-set condition — lifted verbatim to the topological setting. It is the only definition that works universally: the ε\varepsilon-δ\delta definition requires a metric, and the sequential definition fails in non-first-countable spaces (where there are “too many” open sets for sequences to detect). In metric spaces all three definitions coincide, as we proved in Theorem 1; in general topology, we start with (3) and take the other two only when the space is metrizable.

📐 Definition 14 (Homeomorphism)

A homeomorphism between topological spaces (X,τX)(X, \tau_X) and (Y,τY)(Y, \tau_Y) is a bijection f:XYf : X \to Y such that both ff and f1f^{-1} are continuous. Two spaces related by a homeomorphism are topologically equivalent — they have the same open sets up to relabeling, the same convergence, the same compactness properties, and the same connectedness properties.

Homeomorphism is the “sameness” relation for topological spaces. Two very different-looking metric spaces can be topologically equivalent — for example, the open interval (0,1)(0, 1) and the whole real line R\mathbb{R} are homeomorphic (via any monotone continuous bijection like xtan(πxπ/2)x \mapsto \tan(\pi x - \pi/2)), even though one has finite diameter and the other doesn’t. The metric “forgot” when we reduced to the topology.

💡 Remark 6 (Metrization and Fundamental Groups)

Topic 15 mentioned simply connected domains and fundamental groups as the setting for its discussion of conservative vector fields and path independence. These concepts live at the general-topology level: a domain is simply connected if every closed loop in it can be continuously contracted to a point, a property defined entirely in terms of continuous maps (homotopies of loops), and the fundamental group π1(X,x0)\pi_1(X, x_0) classifies loops up to continuous deformation. Fundamental groups make sense in any topological space and are preserved by homeomorphism — they are a topological invariant. Metric spaces inherit these invariants whenever they carry the metric-induced topology, which is almost always in analysis. The Urysohn Metrization Theorem gives a clean sufficient condition: a topological space is metrizable if (and essentially only if) it is second-countable, regular, and Hausdorff. Most spaces that appear in analysis satisfy these conditions, which is why metric-space theory covers so much ground. But the topological vocabulary — connectedness, path-connectedness, fundamental groups, homotopy — belongs to the general theory, not to metric spaces in particular.

9. Equicontinuity and the Arzelà–Ascoli Theorem

In Section 6 we saw that the closed unit ball of C([0,1])C([0, 1]) is not compact — the sequence fn(x)=sin(nπx)f_n(x) = \sin(n\pi x) has no convergent subsequence. The Arzelà–Ascoli theorem identifies exactly what extra condition is needed to restore compactness in function spaces: equicontinuity. Topic 4 introduced the concept and promised the proof; we pay that debt now.

📐 Definition 15 (Equicontinuity)

A family FC([a,b])\mathcal{F} \subseteq C([a, b]) is equicontinuous if for every ε>0\varepsilon > 0 there exists a single δ>0\delta > 0 — depending on ε\varepsilon but not on ff — such that for every fFf \in \mathcal{F} and every x,y[a,b]x, y \in [a, b] with xy<δ|x - y| < \delta:

f(x)f(y)<ε.|f(x) - f(y)| < \varepsilon.

This is the concept Topic 4 introduced in the context of equicontinuity figures. The crucial word is single: uniform continuity of a single function says there is one δ\delta that works for all points at once; equicontinuity of a family says there is one δ\delta that works for all points and all functions at once. The family fn(x)=sin(nπx)f_n(x) = \sin(n\pi x) fails equicontinuity because faster-wiggling functions need smaller δ\delta’s — no single choice works for the whole family. By contrast, any family of functions with a uniform Lipschitz bound (say f(x)f(y)Lxy|f(x) - f(y)| \leq L |x - y| for all ff in the family) is automatically equicontinuous: take δ=ε/L\delta = \varepsilon / L.
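A numerical probe of this contrast (with example families of our own: sin(cx) for c ≤ 3 versus sin(nπx) for n ≤ 50) measures the worst increment over each family at separation δ = ε/L:

```python
import numpy as np

# Worst increment sup_f |f(x) - f(y)| over a family at separation ~delta.
# A uniformly Lipschitz family passes with delta = eps / L; the increasingly
# wiggly sin(n*pi*x) family does not -- no single delta serves every n.
x = np.linspace(0.0, 1.0, 4001)

def worst_increment(family, delta):
    shift = int(round(delta / (x[1] - x[0])))   # grid offset of size ~delta
    return max(np.max(np.abs(f[shift:] - f[:-shift])) for f in family)

eps, L = 0.1, 3.0
lipschitz_family = [np.sin(c * x) for c in (0.5, 1.0, 2.0, 3.0)]  # |f'| <= 3
wiggly_family = [np.sin(n * np.pi * x) for n in range(1, 51)]

delta = eps / L
print(worst_increment(lipschitz_family, delta))   # <= eps: equicontinuous
print(worst_increment(wiggly_family, delta))      # near 2: equicontinuity fails
```

The Lipschitz family's worst increment stays under ε with the single δ = ε/L; the sin(nπx) family's worst increment saturates near the full oscillation range.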

📐 Definition 16 (Uniform Boundedness of a Family)

A family FC([a,b])\mathcal{F} \subseteq C([a, b]) is uniformly bounded if there exists M>0M > 0 such that

f(x)Mfor every fF and every x[a,b].|f(x)| \leq M \quad \text{for every } f \in \mathcal{F} \text{ and every } x \in [a, b].

Equivalently, the family fits inside a single closed ball of the sup-norm: fM\|f\|_\infty \leq M for all fFf \in \mathcal{F}.

🔷 Theorem 5 (Arzelà–Ascoli)

A subset FC([a,b])\mathcal{F} \subseteq C([a, b]) (with the sup-norm metric) has compact closure if and only if F\mathcal{F} is uniformly bounded and equicontinuous.

Equivalently: every uniformly bounded and equicontinuous sequence (fn)(f_n) in C([a,b])C([a, b]) has a subsequence that converges uniformly.

Proof.

We prove both directions. The forward direction is the routine one; the reverse direction is the workhorse “diagonal subsequence” argument.

(\Leftarrow) Uniformly bounded + equicontinuous ⟹ sequentially compact closure. Let (fn)(f_n) be a sequence in F\mathcal{F}. We will extract a subsequence that is Cauchy in the sup-norm; by completeness of C([a,b])C([a, b]) (Theorem 2) it then converges, proving sequential compactness.

Step 1: pointwise convergence on a countable dense set. Enumerate the rationals in [a,b][a, b] as r1,r2,r3,r_1, r_2, r_3, \ldots — a countable dense set. At r1r_1, the sequence of real numbers (fn(r1))(f_n(r_1)) is bounded (by uniform boundedness of F\mathcal{F}), so by the Bolzano-Weierstrass theorem it has a convergent subsequence, which we index as fnk(1)f_{n_k^{(1)}}. Next, the sequence (fnk(1)(r2))(f_{n_k^{(1)}}(r_2)) is bounded in R\mathbb{R}, so we extract a further subsequence fnk(2)f_{n_k^{(2)}} converging at r2r_2 — and still converging at r1r_1, since subsequences of convergent sequences converge to the same limit. Iterate this procedure: at step jj we have a subsequence fnk(j)f_{n_k^{(j)}} that converges at each of r1,r2,,rjr_1, r_2, \ldots, r_j.

Step 2: diagonal subsequence. Define the diagonal subsequence gk=fnk(k)g_k = f_{n_k^{(k)}}. For every fixed rational rjr_j, the sequence (gk)kj(g_k)_{k \geq j} is a subsequence of (fnk(j))(f_{n_k^{(j)}}), which converges at rjr_j; therefore (gk(rj))(g_k(r_j)) converges too. So (gk)(g_k) converges pointwise at every rational in [a,b][a, b].

Step 3: equicontinuity promotes pointwise to uniform. Given ε>0\varepsilon > 0, equicontinuity gives δ>0\delta > 0 such that

f(x)f(y)<ε/3for all fF whenever xy<δ.|f(x) - f(y)| < \varepsilon / 3 \quad \text{for all } f \in \mathcal{F} \text{ whenever } |x - y| < \delta.

Cover [a,b][a, b] by finitely many open intervals of length less than δ\delta centered at rationals rj1,,rjmr_{j_1}, \ldots, r_{j_m} (we can pick rationals because they are dense). By Step 2, (gk(rji))(g_k(r_{j_i})) converges for each i=1,,mi = 1, \ldots, m, so there exists KK such that for all k,Kk, \ell \geq K and all ii:

gk(rji)g(rji)<ε/3.|g_k(r_{j_i}) - g_\ell(r_{j_i})| < \varepsilon / 3.

Now let x[a,b]x \in [a, b] be arbitrary. Pick rjir_{j_i} with xrji<δ|x - r_{j_i}| < \delta. The triangle inequality gives

gk(x)g(x)gk(x)gk(rji)+gk(rji)g(rji)+g(rji)g(x).|g_k(x) - g_\ell(x)| \leq |g_k(x) - g_k(r_{j_i})| + |g_k(r_{j_i}) - g_\ell(r_{j_i})| + |g_\ell(r_{j_i}) - g_\ell(x)|.

The first and third terms are each less than ε/3\varepsilon / 3 by equicontinuity (applied to gkg_k and gg_\ell, which lie in F\mathcal{F}). The middle term is less than ε/3\varepsilon / 3 by the choice of KK. So gk(x)g(x)<ε|g_k(x) - g_\ell(x)| < \varepsilon for every xx, which means gkgε\|g_k - g_\ell\|_\infty \leq \varepsilon for all k,Kk, \ell \geq K. The subsequence (gk)(g_k) is Cauchy in (C([a,b]),d)(C([a, b]), d_\infty), and by completeness it converges to some continuous function — which is the uniformly convergent subsequence we wanted.

(\Rightarrow) Compact closure ⟹ uniformly bounded + equicontinuous. Suppose F\overline{\mathcal{F}} is compact. Then it is bounded (compact sets are bounded in any metric space), so F\mathcal{F} is uniformly bounded. For equicontinuity, fix ε>0\varepsilon > 0. Compactness of F\overline{\mathcal{F}} in the sup-norm means we can cover it by finitely many sup-norm balls B(φ1,ε/3),,B(φN,ε/3)B(\varphi_1, \varepsilon / 3), \ldots, B(\varphi_N, \varepsilon / 3) centered at functions φ1,,φNC([a,b])\varphi_1, \ldots, \varphi_N \in C([a, b]). Each φi\varphi_i is continuous on the compact interval [a,b][a, b], hence uniformly continuous, giving some δi\delta_i that makes φi(x)φi(y)<ε/3|\varphi_i(x) - \varphi_i(y)| < \varepsilon / 3 whenever xy<δi|x - y| < \delta_i. Take δ=min(δ1,,δN)\delta = \min(\delta_1, \ldots, \delta_N).

Now for any fFf \in \mathcal{F}, pick ii with fφi<ε/3\|f - \varphi_i\|_\infty < \varepsilon / 3. For xy<δ|x - y| < \delta:

f(x)f(y)f(x)φi(x)+φi(x)φi(y)+φi(y)f(y).|f(x) - f(y)| \leq |f(x) - \varphi_i(x)| + |\varphi_i(x) - \varphi_i(y)| + |\varphi_i(y) - f(y)|.

Each of the three terms is bounded by ε/3\varepsilon / 3, so f(x)f(y)<ε|f(x) - f(y)| < \varepsilon. This bound holds for every fFf \in \mathcal{F} with the same δ\delta — equicontinuity.

📝 Example 13 (Why sin(nπx) Fails Arzelà–Ascoli)

The counterexample family fn(x)=sin(nπx)f_n(x) = \sin(n\pi x) from Section 6 is uniformly bounded (fn=1\|f_n\|_\infty = 1 for every nn) but not equicontinuous. To see why, pick ε=1\varepsilon = 1 and ask for a δ\delta that forces fn(x)fn(y)<1|f_n(x) - f_n(y)| < 1 whenever xy<δ|x - y| < \delta. For any proposed δ>0\delta > 0, take nn large enough that 1/(2n)<δ1 / (2n) < \delta, and take x=0x = 0, y=1/(2n)y = 1 / (2n). Then xy=1/(2n)<δ|x - y| = 1/(2n) < \delta but

fn(0)fn(1/(2n))=sin(0)sin(π/2)=1.|f_n(0) - f_n(1/(2n))| = |\sin(0) - \sin(\pi/2)| = 1.

So the chosen δ\delta fails at fnf_n. No single δ\delta works for the whole family — equicontinuity fails exactly because the functions oscillate faster and faster as nn grows. The failure of equicontinuity is precisely what blocks the Arzelà–Ascoli conclusion and explains why no subsequence converges.

💡 Remark 7 (Arzelà–Ascoli in Practice)

The theorem is the workhorse compactness criterion for function spaces. It gets used constantly in the existence theory for partial differential equations (bound the approximate solutions in a uniform-Lipschitz sense, extract a convergent subsequence, pass to the limit), in the regularity theory of calculus-of-variations minimizers, and in proving convergence of numerical schemes. In machine learning, it appears implicitly whenever you control the Lipschitz constant of a hypothesis class: a uniformly bounded, uniformly Lipschitz family of functions has compact closure in C([a,b])C([a, b]), which means you can always extract convergent subsequences from training trajectories and study their limits. Spectral norm regularization of neural networks is partially motivated by exactly this kind of Arzelà–Ascoli argument.

Interactive demo: a function family with Lipschitz bound 3 — equicontinuous and pointwise bounded, so Arzelà–Ascoli applies. Click on the graph to probe equicontinuity at a point. Blue curves = extracted subsequence; dashed red = limit approximation.

Three-panel figure: an equicontinuous family (a single δ works for all functions), the non-equicontinuous sin(nπx) family, and a sketch of the diagonal subsequence extraction.

10. ML Connections — Distance, Fixed Points, and Embeddings

Everything in this topic has concrete machine learning payoffs. The Banach theorem explains why iterative algorithms converge; the metric-space axioms explain what “distance” must satisfy for similarity-based methods to work; the Arzelà–Ascoli theorem controls the compactness of function classes. Here are four connections that appear across modern ML.

📝 Example 14 (The Bellman Operator Is a Contraction)

In reinforcement learning, the Bellman optimality operator TT^* acts on bounded value functions V:SRV : \mathcal{S} \to \mathbb{R} by

(TV)(s)=maxa[R(s,a)+γsP(ss,a)V(s)],(T^* V)(s) = \max_a \left[\, R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \cdot V(s') \,\right],

where γ[0,1)\gamma \in [0, 1) is the discount factor. Under the sup-norm V=supsSV(s)\|V\|_\infty = \sup_{s \in \mathcal{S}} |V(s)|, the space of bounded value functions on a finite state space is complete, and TT^* is a contraction with constant γ\gamma:

TVTWγVW.\|T^* V - T^* W\|_\infty \leq \gamma \cdot \|V - W\|_\infty.

The proof is a short computation using the inequality maxaf(a)maxag(a)maxaf(a)g(a)|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)| and the fact that transition probabilities sum to 1. By Banach’s theorem, value iteration — the algorithm that repeatedly applies TT^* starting from any initial V0V_0 — converges to the unique optimal value function VV^*, at a geometric rate γn\gamma^n. This single fact is the reason value iteration works; it is also why a smaller discount γ\gamma (more aggressively short-sighted) gives faster convergence but different optimal policies. The viz below shows this concretely on a 3-state MDP.

Interactive demo: a 3-state MDP with two actions per state. C is the reward-rich absorbing-ish state; the optimal policy is "move toward C, then stay" (π*: A→move, B→move, C→stay). The Bellman optimality operator T* is a γ-contraction in the sup-norm, so Banach's theorem guarantees V_n → V* at the geometric rate γⁿ.
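The contraction inequality itself is easy to verify numerically — a sketch on a random MDP of our own construction (uniform random rewards, normalized random transitions):

```python
import numpy as np

# Check ||T*V - T*W||_inf <= gamma * ||V - W||_inf for the Bellman
# optimality operator on a random finite MDP.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)        # transition rows sum to 1
R = rng.random((S, A))

def bellman(V):
    # (T*V)(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    return np.max(R + gamma * P @ V, axis=1)

for _ in range(100):
    V, W = rng.normal(size=S), rng.normal(size=S)
    lhs = np.max(np.abs(bellman(V) - bellman(W)))
    rhs = gamma * np.max(np.abs(V - W))
    assert lhs <= rhs + 1e-12            # gamma-contraction in the sup-norm
print("contraction verified on 100 random pairs")
```

The key step in the analytic proof — |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)| — is exactly what the `np.max` in the operator preserves.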

📝 Example 15 (Lipschitz Continuity and Gradient Descent)

A function f:RnRf : \mathbb{R}^n \to \mathbb{R} has LL-Lipschitz gradient if f(x)f(y)Lxy\|\nabla f(x) - \nabla f(y)\| \leq L \cdot \|x - y\| for all x,yx, y — the gradient map is Lipschitz with constant LL. For such functions, the gradient descent map T(x)=xηf(x)T(x) = x - \eta \cdot \nabla f(x) has a Lipschitz constant of at most max(1ημ,1ηL)\max(|1 - \eta \mu|, |1 - \eta L|) on a μ\mu-strongly convex function, which is strictly less than 1 whenever 0<η<2/L0 < \eta < 2/L. So gradient descent is a contraction on strongly convex LL-smooth objectives, and the Banach theorem gives convergence to the unique minimizer at a geometric rate — this is the simplest known convergence proof for gradient descent. Much of convex optimization theory elaborates this idea: different step-size rules, accelerated methods, stochastic variants, and second-order methods all trade off against how tightly you can control the contraction constant.
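A sketch on an assumed toy problem — a diagonal quadratic with eigenvalues between μ = 1 and L = 10, using the classical step η = 2/(μ + L) — confirms the geometric contraction:

```python
import numpy as np

# f(x) = 1/2 x^T A x with eigenvalues in [mu, L]; minimizer is x* = 0.
# The GD map T(x) = x - eta * A x has Lipschitz constant
# max(|1 - eta*mu|, |1 - eta*L|), minimized by eta = 2/(mu + L).
mu, L = 1.0, 10.0
A = np.diag(np.linspace(mu, L, 4))             # eigenvalues 1, 4, 7, 10
eta = 2.0 / (mu + L)                           # classical optimal fixed step
k = max(abs(1 - eta * mu), abs(1 - eta * L))   # contraction constant = 9/11

x = np.array([3.0, -2.0, 1.0, 4.0])
d0 = np.linalg.norm(x)                         # initial distance to x* = 0
for n in range(1, 101):
    x = x - eta * A @ x                        # one gradient descent step
    assert np.linalg.norm(x) <= k**n * d0 + 1e-12
print(np.linalg.norm(x))                       # geometric decay, rate k = 9/11
```

With this step size the two endpoint terms balance, giving the familiar (L − μ)/(L + μ) rate of gradient descent on strongly convex quadratics.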

📝 Example 16 (Metric Learning and Embedding Spaces)

Modern neural networks routinely learn metrics rather than using a fixed one. Three examples:

  • Siamese networks learn an embedding ϕ:XRd\phi : X \to \mathbb{R}^d and use d(ϕ(x),ϕ(y))=ϕ(x)ϕ(y)2d(\phi(x), \phi(y)) = \|\phi(x) - \phi(y)\|_2 as a learned similarity. The training objective forces the embedding to place semantically similar inputs close together.
  • Triplet loss enforces d(ϕ(a),ϕ(p))+α<d(ϕ(a),ϕ(n))d(\phi(a), \phi(p)) + \alpha < d(\phi(a), \phi(n)) for anchor aa, positive example pp, and negative example nn. The triangle inequality of the embedding metric constrains which triplet configurations are even geometrically possible, which is why well-trained embeddings exhibit the “tripletable” structure we see in Word2Vec and BERT representations.
  • Contrastive learning (SimCLR, CLIP) implicitly learns a metric on the representation space where positive pairs are close and negative pairs are far, using cross-entropy loss as a surrogate for metric constraints.

In every case, the metric space axioms — positivity, symmetry, triangle inequality — are the constraints that well-behaved similarity scores must satisfy. A “distance” that violates symmetry (d(x,y)d(y,x)d(x, y) \neq d(y, x)) or the triangle inequality cannot be used with nearest-neighbor search acceleration structures like ball trees, and training losses that silently break the axioms often produce embedding spaces with weird pathologies.

💡 Remark 8 (k-NN and Metric Spaces)

The kk-nearest-neighbors algorithm is defined entirely in terms of a metric on the input space. Different metrics give different classifiers: Euclidean kk-NN, Manhattan kk-NN, Mahalanobis kk-NN, and cosine-similarity kk-NN all disagree on which points are “close” to a query. The triangle inequality is not just an axiom — it is what makes efficient nearest-neighbor search possible. Metric trees (ball trees, KD-trees, VP-trees) use the triangle inequality to prune branches of the search: if the distance from the query to a node’s center plus the node’s radius is larger than the best candidate found so far, the entire subtree can be skipped. Without the triangle inequality, no such pruning is correct, and nearest-neighbor search becomes linear in the dataset size. This is a concrete case where a mathematical axiom translates directly into algorithmic efficiency.
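A minimal one-level "ball tree" sketch (our own toy clustering, not a production structure) shows the pruning rule in action and why it stays exact:

```python
import numpy as np

# Triangle-inequality pruning: skip a cluster whenever
#   d(q, center) - radius > best_so_far,
# because then every point p in it has d(q, p) >= d(q, center) - radius.
rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 3))

# crude clustering: bucket points by nearest of 20 random "centers"
centers = data[rng.choice(len(data), 20, replace=False)]
labels = np.argmin(np.linalg.norm(data[:, None] - centers[None], axis=2), axis=1)
clusters = [(c, data[labels == i],
             np.max(np.linalg.norm(data[labels == i] - c, axis=1)))
            for i, c in enumerate(centers) if np.any(labels == i)]

def pruned_nn(q):
    best_d, checked = np.inf, 0
    # visit clusters nearest-center-first so pruning kicks in early
    for c, pts, r in sorted(clusters, key=lambda cl: np.linalg.norm(q - cl[0])):
        if np.linalg.norm(q - c) - r > best_d:
            continue                     # triangle inequality: skip whole cluster
        d = np.linalg.norm(pts - q, axis=1)
        checked += len(pts)
        best_d = min(best_d, d.min())
    return best_d, checked

q = rng.normal(size=3)
brute_d = np.min(np.linalg.norm(data - q, axis=1))
pruned_d, checked = pruned_nn(q)
print(pruned_d == brute_d, checked, "of", len(data))  # exact, fewer scans
```

The pruning is exact precisely because the metric axioms hold: the true nearest neighbor's cluster can never satisfy the skip condition, so it is always examined.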

Four-panel figure: Bellman iteration convergence, Lipschitz gradient descent trajectory, metric learning embedding separation, and k-NN decision boundaries under different metrics.

11. Computational Notes

Every concept in this topic has a short Python implementation. NumPy and SciPy have the basic machinery; the reference notebook spells out each one in detail.

p\ell^p distances in Rn\mathbb{R}^n are a single NumPy call:

import numpy as np

def lp_distance(x, y, p=2):
    return np.linalg.norm(x - y, ord=p)

# Euclidean (p=2), Manhattan (p=1), Chebyshev (p=inf)
lp_distance(np.array([1, 2, 3]), np.array([4, 6, 8]), p=2)

Fixed-point iteration is a for-loop with a convergence check. For a contraction TT:

def fixed_point_iterate(T, x0, tol=1e-10, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for n in range(max_iter):
        x_new = T(x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new, n + 1
        x = x_new
    return x, max_iter

The loop returns the iterate and the number of steps needed. Feed this a Picard operator, a Newton step, or a Bellman backup and the same skeleton applies.

Sup-norm distance between sampled functions is max(abs(...)) on a shared grid:

def sup_norm_distance(f_samples, g_samples):
    return np.max(np.abs(f_samples - g_samples))

x_grid = np.linspace(0, 1, 1000)
f = np.sin(2 * np.pi * x_grid)
g = np.sin(3 * np.pi * x_grid)
sup_norm_distance(f, g)

Pairwise distance matrices use scipy.spatial.distance.cdist, which supports every p\ell^p metric out of the box:

from scipy.spatial.distance import cdist

points = np.random.randn(100, 5)
D = cdist(points, points, metric='euclidean')  # or 'cityblock', 'chebyshev', ...

Value iteration on a small MDP is just repeated Bellman-optimality updates:

def value_iteration(P, R, gamma, tol=1e-10, max_iter=1000):
    # P: shape (S, A, S) transitions; R: shape (S, A) rewards
    S, A, _ = P.shape
    V = np.zeros(S)
    for n in range(max_iter):
        Q = R + gamma * P @ V  # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, n + 1
        V = V_new
    return V, max_iter

The convergence rate matches the theoretical γn\gamma^n bound — you can verify this by plotting logVnV\log \|V_n - V^*\|_\infty against nn and reading off the slope.
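That slope check can be scripted directly — a sketch on a random MDP of our own construction, fitting the slope of log‖Vₙ − V*‖∞ with np.polyfit:

```python
import numpy as np

# Fit the slope of log ||V_n - V*||_inf against n; Banach's theorem predicts
# a slope of about log(gamma).
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))

def bellman(V):
    return np.max(R + gamma * P @ V, axis=1)

# a near-exact V*: run the contraction far past any practical tolerance
V_star = np.zeros(S)
for _ in range(2000):
    V_star = bellman(V_star)

V, errs = np.zeros(S), []
for _ in range(60):
    V = bellman(V)
    errs.append(np.max(np.abs(V - V_star)))

slope = np.polyfit(np.arange(60), np.log(errs), 1)[0]
print(slope, np.log(gamma))   # slope is close to log(0.9) = -0.105
```

Reading the slope off a log-linear fit is exactly the visual check described above, done numerically instead of by eye.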

12. Summary and Key Results

The topic built a small library of theorems on top of the three metric-space axioms. Here is the proof-dependency structure in compact form.

| Result | Section | Key idea |
|---|---|---|
| Metric space axioms | 2 | Positivity, symmetry, triangle inequality |
| Three characterizations of continuity (Theorem 1) | 4 | ε–δ = sequential = preimage-of-open |
| Completeness of C([a, b]) (Theorem 2) | 5 | Uniform limit of continuous is continuous |
| Compactness equivalence (Theorem 3) | 6 | Compact ⟺ sequentially compact ⟺ complete + totally bounded |
| Banach Contraction Mapping Theorem (Theorem 4) | 7 | Geometric-series Cauchy bound + completeness |
| Arzelà–Ascoli Theorem (Theorem 5) | 9 | Diagonal subsequence + equicontinuity promotes to uniform |

The dependency chain is linear. The compactness equivalence (Theorem 3) gives us the tools to recognize compactness without unrolling open covers every time. The completeness of C([a, b]) (Theorem 2) is the missing ingredient for fixed-point arguments, and Theorem 2's proof depends on the Uniform Limit Theorem from Uniform Convergence. The Banach theorem (Theorem 4) depends on completeness. Arzelà–Ascoli (Theorem 5) depends on completeness and on the compactness equivalence. All of these depend on the three characterizations of continuity from Theorem 1.

One result unifies four earlier topics. The Banach Contraction Mapping Theorem gives a single proof that covers:

  • The Picard–Lindelöf existence theorem for first-order ODEs (Topic 21)
  • Newton iteration for implicit numerical methods (Topic 24)
  • The Inverse Function Theorem (Topic 12)
  • Value iteration in reinforcement learning (Section 10)

Before this topic, each of those was a separate argument using a specifically constructed self-map. After this topic, they are four corollaries of one theorem about complete metric spaces — and the proof you read in Section 7 is the proof of all four at once.

13. Track 8 Opening — The Road to Functional Analysis

This topic opens Track 8: Functional Analysis Essentials, a four-topic arc that adds successively more structure to metric spaces until we reach the framework behind modern machine learning optimization. Each topic adds one layer:

Topic 29 (this topic): Metric Spaces & Topology. The language of distance, open sets, completeness, and compactness in abstract spaces. The Banach Contraction Mapping Theorem unifies convergence arguments across the curriculum. The Arzelà–Ascoli theorem characterizes compact function families. Forward-looking: this is the geometric foundation.

Topic 30: Normed & Banach Spaces. Add algebraic structure — a vector-space operation and a norm that interacts with it. Bounded linear operators replace generic continuous maps; the three pillars of functional analysis (Open Mapping, Closed Graph, Uniform Boundedness) emerge as consequences of the Baire Category Theorem we deferred from this topic. The payoff: “metric space + vector space = analysis powerhouse.” Dual spaces of linear functionals live here, and the L^p spaces we have been quietly alluding to are Banach spaces — their theory gets its full treatment in Topic 30, not here.

Topic 31: Inner Product & Hilbert Spaces. Add geometric structure — an inner product that gives us angles, orthogonality, and projections. Orthogonal projections replace Banach-space linear functionals as the primary construction; the Riesz representation theorem identifies the dual of a Hilbert space with the space itself. This is the mathematical home of kernel methods, Reproducing Kernel Hilbert Spaces (RKHS), and Gaussian processes. Every symmetric positive-definite kernel in machine learning induces a Hilbert space; Topic 31 makes that construction rigorous.

Topic 32: Calculus of Variations. Optimization on function spaces. The Euler–Lagrange equations characterize critical points of functionals J : function space → ℝ, first and second variation generalize the gradient and Hessian, and the direct method of the calculus of variations gives existence theorems via compactness arguments (Arzelà–Ascoli and its Sobolev-space cousins). This is the framework behind physics-informed neural networks, optimal transport, variational autoencoders, and diffusion models — wherever ML optimizes over functions rather than parameters, calculus of variations is quietly at work.

The refinement sequence is metric → normed → inner product → optimization. Every Hilbert space is a Banach space; every Banach space is a metric space. The additional structure at each step buys additional theorems: metric spaces give us continuity and contraction arguments, normed spaces add linear operators and duality, inner product spaces add orthogonal projection, and the optimization layer turns projection into the Euler–Lagrange machinery.

Within formalCalculus — upcoming topics in this track:

  • Normed & Banach Spaces — Adds algebraic structure (vector space + norm) to metric spaces. The Baire Category Theorem, Open Mapping Theorem, Closed Graph Theorem, and Uniform Boundedness Principle — all deferred from this topic — are proved there.
  • Inner Product & Hilbert Spaces — Inner products, angles, orthogonality, the projection theorem, Riesz representation, RKHS construction, and kernel methods.
  • Calculus of Variations — The Euler–Lagrange equation, second variation, direct method for existence of minimizers via weak compactness, and compactness arguments (Arzelà–Ascoli in Sobolev spaces).

Forward links to formalml.com. Every topic in Track 8 has downstream machine-learning applications, and Topic 29 already touches several:

  • Gradient Descent — Lipschitz constants from Section 4 and the contraction argument from Section 7 give the cleanest convergence proof for gradient descent on strongly convex, L-smooth objectives.
  • Reinforcement Learning — The Bellman operator is a contraction in the sup-norm with constant γ, so value iteration converges to V* at rate γ^n. The BellmanIterationExplorer in Section 10 demonstrates this concretely on a 3-state MDP.
  • Metric Learning — Siamese networks, triplet loss, and contrastive learning all learn metrics on embedding spaces. The metric space axioms are exactly the constraints those training losses enforce.
  • Kernel Methods — Every positive-definite kernel induces a metric on its feature space. The RKHS norm is a metric-space distance, and Topic 31 will make the full construction rigorous.
  • Embedding Spaces — Word2Vec, BERT, and graph embeddings all live in metric spaces where Euclidean or cosine distance measures semantic proximity. The triangle inequality constrains the possible geometry of embeddings.
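The contraction claim in the Reinforcement Learning bullet is easy to test numerically. The sketch below (illustrative shapes and names) checks ‖TV − TW‖_∞ ≤ γ‖V − W‖_∞ for random pairs of value functions:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # rows of P[s, a, :] sum to 1
R = rng.random((S, A))

def T(V):
    # Bellman optimality operator:
    # (T V)(s) = max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) V(s') ]
    return (R + gamma * P @ V).max(axis=1)

# The sup-norm contraction bound holds exactly because each
# P[s, a, :] is a probability distribution and max is nonexpansive.
for _ in range(100):
    V, W = rng.standard_normal(S), rng.standard_normal(S)
    assert np.max(np.abs(T(V) - T(W))) <= gamma * np.max(np.abs(V - W)) + 1e-12
```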

References

  1. Munkres, J. R. (2000). Topology, 2nd ed. Chapters 2–3 (metric spaces, topological spaces, connectedness, compactness). The standard reference for general topology.
  2. Rudin, W. (1976). Principles of Mathematical Analysis, 3rd ed. Chapters 2–3 (basic topology, continuity). Clean treatment of metric-space compactness.
  3. Folland, G. B. (1999). Real Analysis: Modern Techniques and Their Applications, 2nd ed. Chapter 4 (point-set topology in analysis context). Good for the Arzelà–Ascoli proof.
  4. Kreyszig, E. (1978). Introductory Functional Analysis with Applications. Chapter 1 (metric spaces as foundation for functional analysis). Excellent pedagogy for the intermediate reader; Section 1.4 covers the contraction mapping theorem with worked examples.
  5. Stein, E. M. & Shakarchi, R. (2005). Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Chapter 7 (Hausdorff measure, fractals). Peripheral to Topic 29 but useful for the general topology bridge.