Functional Analysis · intermediate · 45 min read

Metric Spaces & Topology

Generalizing distance beyond Euclidean space — from metric axioms through completeness and compactness to the contraction mapping theorem and Arzelà–Ascoli, with applications to fixed-point iteration, embedding spaces, and metric learning.

Abstract. Every machine learning algorithm that computes a distance — every k-nearest-neighbor lookup, every gradient descent step, every embedding similarity score — is operating in a metric space. This topic generalizes the completeness and compactness theory from Topic 3 beyond Euclidean space to arbitrary metric spaces, where the notion of distance is axiomatized by three properties: positivity, symmetry, and the triangle inequality. The payoff is immediate. The Banach Contraction Mapping Theorem guarantees that any contraction on a complete metric space has a unique fixed point — this single result explains why Picard iteration solves ODEs, why Newton's method converges, why value iteration in reinforcement learning finds the optimal policy, and why many iterative algorithms in machine learning are guaranteed to converge. The Arzelà–Ascoli theorem characterizes compact subsets of function spaces and explains when a sequence of functions must have a uniformly convergent subsequence. Along the way, we develop the language of topology — open and closed sets, continuity via preimages, and the distinction between metric and general topological spaces — that will power the remaining three topics in this track.

Where this leads → formalML

  • formalML The contraction mapping theorem guarantees convergence of fixed-point iterations. Lipschitz continuity of gradient maps, formalized here, controls convergence rates of gradient descent.
  • formalML The Bellman operator is a contraction in the sup-norm on bounded value functions. Banach's fixed-point theorem guarantees that value iteration converges to the unique optimal value function V*.
  • formalML Metric spaces formalize learned similarity. Mahalanobis distance, Siamese networks, and triplet loss all learn metrics on embedding spaces — the axioms here are exactly the constraints those losses enforce.
  • formalML Positive-definite kernels induce metrics on feature spaces. The RKHS norm distance is a metric-space distance, and kernel methods exploit metric-space completeness.
  • formalML Word2Vec, BERT, and graph embeddings live in metric spaces where cosine or Euclidean distance measures semantic proximity. The triangle inequality constrains embedding geometry.

1. Why Metric Spaces?

Every machine learning algorithm that computes a distance is quietly operating in a metric space. The Euclidean distance in $\mathbb{R}^n$ is the obvious example — it is what $k$-nearest neighbors uses, what gradient descent steps respect, and what the word “similarity” means for Word2Vec embeddings. But Euclidean distance is not the only game in town, and for several of the questions we are about to ask, it is not even the right tool.

Three motivating puzzles set the stage.

Puzzle 1: Comparing functions. Given two continuous functions $f, g : [0, 1] \to \mathbb{R}$, how do we measure how far apart they are? One natural answer is the sup-norm distance $\|f - g\|_\infty = \sup_{x \in [0,1]} |f(x) - g(x)|$, which we saw already in Uniform Convergence — it is the metric that makes “convergence” mean “uniform convergence.” But there are other answers: the integral-based $L^1$ distance $\int_0^1 |f - g|$, the $L^2$ distance $\sqrt{\int_0^1 (f - g)^2}$, and many more. Each is a legitimate metric, and each gives $C([0, 1])$ a different geometric character. The framework we develop in this topic will explain why.

Puzzle 2: When does iteration converge? Suppose $T$ is a map from some space to itself and we iterate: start with $x_0$, set $x_1 = T(x_0)$, $x_2 = T(x_1)$, and so on. When does the sequence $(x_n)$ converge, and when it does, to what? We are going to prove a theorem — the Banach Contraction Mapping Theorem — that answers this completely for any map that “shrinks distances.” The same theorem is why Picard iteration solves ordinary differential equations, why Newton’s method converges in a neighborhood of a simple root, and why reinforcement-learning value iteration finds the optimal policy. A single abstract result unifies half a dozen concrete convergence arguments.

Puzzle 3: Which function families are “small” enough to extract a convergent subsequence? In Topic 3 we proved the Bolzano–Weierstrass theorem: every bounded sequence in $\mathbb{R}^n$ has a convergent subsequence. In $C([0, 1])$ this is simply false — the sequence $f_n(x) = \sin(n\pi x)$ is bounded in the sup-norm (each $\|f_n\|_\infty = 1$), yet no subsequence converges uniformly. We flagged this counterexample in Topic 3 and promised to explain it here. The explanation is that “bounded” is the wrong condition in infinite dimensions; the right one involves equicontinuity, and the theorem that packages it is Arzelà–Ascoli.

These three puzzles are really one question asked three ways: what structure does a space need for the usual analytic intuition to work? The answer — the metric space axioms — is small, geometric, and surprisingly powerful. By the end of this topic we will have generalized completeness, compactness, and continuity from $\mathbb{R}^n$ to arbitrary metric spaces, and we will have proved the two theorems that turn this machinery into concrete convergence guarantees for machine learning algorithms.

2. Metric Space Axioms and Examples

The metric space abstraction compresses the idea of “distance” into three axioms. Everything we prove from here down follows from these three statements and nothing else — which is what gives the theory its reach.

📐 Definition 1 (Metric Space)

A metric space is a pair $(X, d)$ consisting of a set $X$ and a function $d : X \times X \to [0, \infty)$ called a metric (or distance function) satisfying:

  1. Positivity: $d(x, y) \geq 0$ for all $x, y$, and $d(x, y) = 0$ if and only if $x = y$.
  2. Symmetry: $d(x, y) = d(y, x)$ for all $x, y \in X$.
  3. Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$ for all $x, y, z \in X$.

The three axioms are exactly the properties the word “distance” should have. Positivity says distance is never negative and vanishes only when the two points coincide. Symmetry says the distance from $x$ to $y$ does not depend on which one you started at. The triangle inequality says that going via an intermediate point $y$ is never a shortcut. If any of these fail, the machinery we are about to build breaks — which is why metric learning papers spend so much effort ensuring their learned “distances” actually satisfy these properties. Examples build intuition faster than axioms; here are four canonical ones.

📝 Example 1 (The ℓᵖ Metrics on Rⁿ)

For $1 \leq p < \infty$, the $\ell^p$ metric on $\mathbb{R}^n$ is

$$d_p(x, y) = \left(\sum_{i=1}^n |x_i - y_i|^p\right)^{1/p}.$$

For $p = \infty$ the formula degenerates and we use

$$d_\infty(x, y) = \max_{1 \leq i \leq n} |x_i - y_i|.$$

All three axioms are satisfied for every $p \in [1, \infty]$ (the triangle inequality is Minkowski’s inequality). Three values of $p$ are worth remembering: $p = 2$ gives the usual Euclidean distance, $p = 1$ gives the “Manhattan” or taxicab distance, and $p = \infty$ gives the Chebyshev (max-coordinate) distance. In $\mathbb{R}^2$, the “unit ball” $\{x : d_p(x, 0) \leq 1\}$ is a diamond for $p = 1$, a circle for $p = 2$, and a square for $p = \infty$ — the same set $\mathbb{R}^2$ acquires three visibly different geometries depending on which metric we choose.
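A quick numerical sanity check makes the axioms concrete. The sketch below is our own illustration (the helper name `lp_dist` is not from the text): it computes $\ell^p$ distances and spot-checks the triangle inequality — Minkowski’s inequality — on random triples.

```python
import numpy as np

def lp_dist(x, y, p):
    """l^p distance on R^n; p = np.inf gives the Chebyshev (max) metric."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return diff.max() if p == np.inf else float((diff ** p).sum() ** (1.0 / p))

rng = np.random.default_rng(0)
for p in (1, 2, np.inf):
    for _ in range(1000):
        x, y, z = rng.standard_normal((3, 5))
        # triangle inequality d(x,z) <= d(x,y) + d(y,z), up to float round-off
        assert lp_dist(x, z, p) <= lp_dist(x, y, p) + lp_dist(y, z, p) + 1e-12
```

On the pair $(0,0)$ and $(3,4)$ the three distances are $7$, $5$, and $4$ — the diamond, circle, and square geometries in one computation.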

📝 Example 2 (The Sup-Norm Metric on C([a,b]))

The space $C([a, b])$ of continuous real-valued functions on $[a, b]$ becomes a metric space under

$$d_\infty(f, g) = \|f - g\|_\infty = \sup_{x \in [a,b]} |f(x) - g(x)|.$$

Positivity and symmetry are immediate. The triangle inequality follows from the pointwise inequality $|f(x) - h(x)| \leq |f(x) - g(x)| + |g(x) - h(x)|$ by taking suprema on both sides. This is the same sup-norm that appeared in Uniform Convergence: a sequence $f_n$ converges to $f$ in this metric if and only if $f_n \to f$ uniformly. Section 5 will prove that $(C([a, b]), d_\infty)$ is complete — every Cauchy sequence converges — which is why the sup-norm is the right tool for function-space fixed-point arguments.

📝 Example 3 (The Discrete Metric)

On any set $X$, define

$$d(x, y) = \begin{cases} 0 & \text{if } x = y, \\ 1 & \text{if } x \neq y. \end{cases}$$

All three axioms are trivial to verify. Under this metric every subset is open, every convergent sequence is eventually constant, and the “unit ball” $B(x, 1)$ contains only the single point $x$. The discrete metric shows that metric-space theory applies to exotic settings — there is no requirement that $X$ look anything like $\mathbb{R}^n$ — and it is a useful source of counterexamples whenever you suspect a theorem secretly needs continuity.

📝 Example 4 (Hamming Distance)

On the set $\{0, 1\}^n$ of binary strings of length $n$, the Hamming distance is

$$d_H(x, y) = \#\{\, i : x_i \neq y_i \,\},$$

the number of coordinates in which $x$ and $y$ differ. Symmetry and positivity are clear, and the triangle inequality holds because “coordinates where $x$ and $z$ differ” is a subset of “coordinates where $x$ and $y$ differ” $\cup$ “coordinates where $y$ and $z$ differ.” Hamming distance is the metric behind error-correcting codes and is directly relevant to discrete ML: training a classifier to produce outputs within Hamming distance $k$ of a target is a metric-space constraint.
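The subset argument for the triangle inequality can be verified exhaustively for small $n$. A minimal check (our own illustration) over all binary strings of length 4:

```python
import itertools

def hamming(x, y):
    """Number of coordinates in which two equal-length tuples differ."""
    return sum(a != b for a, b in zip(x, y))

strings = list(itertools.product((0, 1), repeat=4))  # all 16 length-4 strings
for x, y in itertools.product(strings, repeat=2):
    assert (hamming(x, y) == 0) == (x == y)      # positivity
    assert hamming(x, y) == hamming(y, x)        # symmetry
for x, y, z in itertools.product(strings, repeat=3):
    assert hamming(x, z) <= hamming(x, y) + hamming(y, z)  # triangle inequality
```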

💡 Remark 1 (Different Metrics, Different Topologies)

The same underlying set can carry many different metrics, and the choice shapes everything else: which sequences converge, which sets are open, which functions are continuous. In $\mathbb{R}^2$ the $\ell^1$, $\ell^2$, and $\ell^\infty$ metrics all generate the same notion of “open set” (they are topologically equivalent — we will define this precisely in Section 8), so they all agree on which sequences converge. But the Hamming metric on $\{0, 1\}^n$ differs from the Euclidean metric on the same set viewed as a subset of $\mathbb{R}^n$: since each coordinate difference is $0$ or $1$, the Euclidean distance between two strings is exactly $\sqrt{d_H(x, y)}$ — the two metrics rank pairs the same way but measure separation on different scales. The metric determines the geometry, and the geometry determines the analysis.

Four-panel figure: the ℓ¹ diamond, ℓ² disk, and ℓ∞ square unit balls in R², plus the discrete-metric "ball" containing a single point.

3. Open Sets, Closed Sets, and the Language of Topology

With the metric in hand we can start building the vocabulary of topology: balls, open sets, closed sets, interiors, closures, boundaries. This is the language that every subsequent theorem in the topic will speak, and it is the same vocabulary Topic 3 used for $\mathbb{R}^n$ — only now it works in any metric space.

📐 Definition 2 (Open and Closed Balls)

Let $(X, d)$ be a metric space, let $x \in X$, and let $r > 0$. The open ball of radius $r$ centered at $x$ is

$$B(x, r) = \{\, y \in X : d(x, y) < r \,\}.$$

The closed ball of radius $r$ centered at $x$ is

$$\overline{B}(x, r) = \{\, y \in X : d(x, y) \leq r \,\}.$$

The open ball has a strict inequality; the closed ball includes the boundary sphere. In $\mathbb{R}^n$ with $d_2$ these are the familiar solid balls without and with their bounding spheres. In $C([a, b])$ with the sup-norm, the open ball $B(f, \varepsilon)$ is the set of all continuous functions whose graph lies strictly within an $\varepsilon$-tube around the graph of $f$ — exactly the “uniform tube” that Topic 4 used to define uniform convergence.

📐 Definition 3 (Open and Closed Sets)

A set $U \subseteq X$ is open if for every point $x \in U$ there exists some $r > 0$ such that the open ball $B(x, r)$ is entirely contained in $U$:

$$\forall x \in U \ \exists r > 0 : B(x, r) \subseteq U.$$

A set $F \subseteq X$ is closed if its complement $X \setminus F$ is open.

A set is open if every point has some breathing room: a small open ball around it that stays inside the set. A set is closed if its complement has that property. Note that “closed” does not mean “not open” — the empty set and the whole space $X$ are both open and closed in every metric space (check the definitions directly), and in the discrete metric every subset is both open and closed. Openness and closedness are more subtle than they look.

🔷 Proposition 1 (Properties of Open and Closed Sets)

In any metric space $(X, d)$:

  1. $\emptyset$ and $X$ are both open and closed.
  2. Arbitrary unions of open sets are open.
  3. Finite intersections of open sets are open.
  4. Arbitrary intersections of closed sets are closed.
  5. Finite unions of closed sets are closed.

The asymmetry between “arbitrary” and “finite” matters. Arbitrary intersections of open sets need not be open — the classic example is $\bigcap_{n \geq 1} (-1/n, 1/n) = \{0\}$, an intersection of open intervals whose limit is a single point, which is not open. Similarly, arbitrary unions of closed sets need not be closed: $\bigcup_{n \geq 1} [1/n, 1] = (0, 1]$. The finiteness restriction in items 3 and 5 is therefore essential, not cosmetic.
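The classic intersection example can be probed directly. A small sketch (our own illustration): any nonzero $x$ is expelled from $(-1/n, 1/n)$ once $n > 1/|x|$, so only $0$ survives every interval.

```python
def in_all_intervals(x, N):
    """Is x in the open interval (-1/n, 1/n) for every n = 1, ..., N?"""
    return all(-1.0 / n < x < 1.0 / n for n in range(1, N + 1))

assert in_all_intervals(0.0, 10_000)        # 0 survives every interval
assert not in_all_intervals(0.01, 10_000)   # expelled once n > 100
```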

📐 Definition 4 (Interior, Closure, Boundary, Limit Point)

For a subset $A \subseteq X$:

  • The interior $A^\circ$ is the union of all open sets contained in $A$; equivalently, it is the largest open set inside $A$.
  • The closure $\overline{A}$ is the intersection of all closed sets containing $A$; equivalently, it is the smallest closed set that contains $A$.
  • The boundary is $\partial A = \overline{A} \setminus A^\circ$.
  • A point $x \in X$ is a limit point of $A$ if every open ball $B(x, r)$ contains a point of $A$ other than $x$.

Two facts follow immediately. First, $A^\circ \subseteq A \subseteq \overline{A}$: the interior is contained in $A$, which is contained in the closure. Second, $A$ is open exactly when $A = A^\circ$, and $A$ is closed exactly when $A = \overline{A}$ — these are not additional hypotheses but equivalent characterizations. The closure can also be described as $\overline{A} = A \cup \{\text{limit points of } A\}$, which is often the easiest way to compute it in practice.

📝 Example 5 (Open Sets in C([0,1]))

In $C([0, 1])$ with the sup-norm, the open ball $B(f, \varepsilon)$ consists of all continuous functions $g$ whose graph lies within the band $\{(x, y) : f(x) - \varepsilon < y < f(x) + \varepsilon\}$. A set $\mathcal{U} \subseteq C([0, 1])$ is open precisely when for every $f \in \mathcal{U}$ we can find a small enough $\varepsilon$ so that every continuous function inside the $\varepsilon$-tube around $f$ is also in $\mathcal{U}$. This is exactly the “$\varepsilon$-tube” picture from Topic 4’s visualization of uniform convergence — the only difference is that now we are thinking of a tube as an open ball in a metric space.

Four-panel figure: interior, closure, and boundary of a subset of R; an open ball in R²; an ε-tube open ball in C([0,1]); and the trivial "balls" of the discrete metric.

4. Continuity in Metric Spaces

Continuity is the next concept that lifts from $\mathbb{R}$ to arbitrary metric spaces. We give three characterizations — an $\varepsilon$–$\delta$ definition, a sequential definition, and a preimage-of-open-sets definition — and prove that all three are equivalent. The third characterization is the conceptual payoff: it describes continuity using only the structure of open sets, which is what lets the definition generalize to spaces without a metric at all.

📐 Definition 5 (Continuity (ε–δ))

Let $(X, d_X)$ and $(Y, d_Y)$ be metric spaces. A function $f : X \to Y$ is continuous at a point $x_0 \in X$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that

$$d_X(x, x_0) < \delta \quad \Longrightarrow \quad d_Y(f(x), f(x_0)) < \varepsilon.$$

The function $f$ is continuous (on all of $X$) if it is continuous at every point of $X$.

This is the same $\varepsilon$–$\delta$ definition we used in Topic 2 for real-valued functions of a real variable, with $d_X$ and $d_Y$ replacing the absolute value. Now the three characterizations.

🔷 Theorem 1 (Three Characterizations of Continuity)

Let $f : (X, d_X) \to (Y, d_Y)$ be a function between metric spaces. The following are equivalent:

  1. $f$ is continuous at every point of $X$ (the $\varepsilon$–$\delta$ definition).
  2. For every convergent sequence $x_n \to x$ in $X$, we have $f(x_n) \to f(x)$ in $Y$.
  3. For every open set $V \subseteq Y$, the preimage $f^{-1}(V) \subseteq X$ is open in $X$.

Proof.

We prove the cycle of implications $(1) \Rightarrow (2) \Rightarrow (3) \Rightarrow (1)$.

$(1) \Rightarrow (2)$. Assume $f$ is continuous at every point, let $x_n \to x$ in $X$, and fix $\varepsilon > 0$. Continuity at $x$ gives some $\delta > 0$ such that $d_X(y, x) < \delta$ implies $d_Y(f(y), f(x)) < \varepsilon$. Since $x_n \to x$, there exists $N$ such that for all $n \geq N$ we have $d_X(x_n, x) < \delta$, and therefore $d_Y(f(x_n), f(x)) < \varepsilon$. This is exactly the statement that $f(x_n) \to f(x)$.

$(2) \Rightarrow (3)$. Assume sequential continuity, let $V \subseteq Y$ be open, and let $x \in f^{-1}(V)$. We need to show some ball $B(x, r)$ is contained in $f^{-1}(V)$. Suppose not. Then for every $n \geq 1$ there exists a point $x_n$ with $d_X(x_n, x) < 1/n$ but $f(x_n) \notin V$. By construction $x_n \to x$, so by sequential continuity $f(x_n) \to f(x)$. But $f(x) \in V$ and $V$ is open, so some ball $B(f(x), \varepsilon)$ is contained in $V$, which means $f(x_n) \in V$ for all sufficiently large $n$ — contradicting $f(x_n) \notin V$.

$(3) \Rightarrow (1)$. Assume preimages of open sets are open, let $x_0 \in X$, and fix $\varepsilon > 0$. The open ball $B(f(x_0), \varepsilon)$ is open in $Y$, so by hypothesis its preimage $f^{-1}(B(f(x_0), \varepsilon))$ is open in $X$ and contains $x_0$. By the definition of “open” we can find some $\delta > 0$ with $B(x_0, \delta) \subseteq f^{-1}(B(f(x_0), \varepsilon))$, which rewrites as: $d_X(x, x_0) < \delta \Rightarrow d_Y(f(x), f(x_0)) < \varepsilon$. That is the $\varepsilon$–$\delta$ condition at $x_0$. Since $x_0$ was arbitrary, $f$ is continuous.

💡 Remark 2 (Why Preimages Matter)

Characterization (3) is the conceptual leap: it describes continuity entirely in terms of the structure of open sets, with no reference to distances or sequences. This is what lets the definition generalize to topological spaces in Section 8, where there may be no metric at all. In metric spaces all three characterizations are equivalent — we just proved it — so the preimage formulation looks like one more way to say the same thing. The power of (3) only becomes apparent when we leave metrics behind: in an arbitrary topological space, the $\varepsilon$–$\delta$ definition is not even expressible, and the sequential definition can fail in strange ways. The preimage characterization is the universally correct one.

📐 Definition 6 (Lipschitz Continuity)

A function $f : (X, d_X) \to (Y, d_Y)$ is Lipschitz continuous with constant $L \geq 0$ if

$$d_Y(f(x), f(y)) \leq L \cdot d_X(x, y) \quad \text{for all } x, y \in X.$$

The smallest such $L$ is called the Lipschitz constant of $f$. If $L < 1$ the map is called a contraction, and this case will drive Section 7.

Lipschitz continuity is strictly stronger than continuity (every Lipschitz function is continuous: for $L > 0$, take $\delta = \varepsilon / L$) and strictly weaker than differentiability with bounded derivative: by the mean value theorem, a differentiable function on an interval with $|f'| \leq L$ is Lipschitz with constant $L$, while $|x|$ is Lipschitz but not differentiable at $0$. For ML purposes, Lipschitz constants quantify how rapidly a function can change: they control the convergence rate of gradient descent, the sensitivity of a neural network to input perturbations, and the stability of numerical ODE solvers. We will see concrete examples in Sections 7 and 10.
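A Lipschitz constant can be estimated from below by sampling difference quotients. The sketch below is our own illustration (the helper name is ours, and a sampled maximum is only a lower bound on the true constant, not the supremum itself):

```python
import numpy as np

def lipschitz_lower_bound(f, xs):
    """max |f(x)-f(y)| / |x-y| over all sampled pairs — a lower bound on L."""
    xs = np.asarray(xs, dtype=float)
    fx = f(xs)
    dx = np.abs(xs[:, None] - xs[None, :])   # pairwise input distances
    df = np.abs(fx[:, None] - fx[None, :])   # pairwise output distances
    mask = dx > 0
    return float(np.max(df[mask] / dx[mask]))

xs = np.linspace(-1.0, 1.0, 401)
assert lipschitz_lower_bound(np.sin, xs) <= 1.0 + 1e-9          # |sin'| <= 1
assert abs(lipschitz_lower_bound(lambda x: 3 * x, xs) - 3.0) < 1e-9
```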

Two-panel figure: the ε–δ picture in a metric space, and the preimage-of-open-set characterization — an open ball V in Y pulled back to an open set in X.

5. Completeness in Metric Spaces

Completeness generalizes the Cauchy-completeness of $\mathbb{R}$ (Topic 3, Section 1) to arbitrary metric spaces. The crucial new example — the one that powers every fixed-point argument in the rest of the topic — is that $C([a, b])$ with the sup-norm is complete.

📐 Definition 7 (Cauchy Sequence in a Metric Space)

A sequence $(x_n)$ in a metric space $(X, d)$ is Cauchy if for every $\varepsilon > 0$ there exists $N$ such that

$$d(x_m, x_n) < \varepsilon \quad \text{for all } m, n \geq N.$$

📐 Definition 8 (Complete Metric Space)

A metric space $(X, d)$ is complete if every Cauchy sequence in $X$ converges to a limit that is also in $X$.

Cauchy says the terms of the sequence eventually cluster together; complete says that clustering is enough to guarantee a limit. The two concepts are linked but distinct: $\mathbb{Q}$ is not complete (the sequence $1, 1.4, 1.41, 1.414, \ldots$ of decimal approximations to $\sqrt{2}$ is Cauchy in $\mathbb{Q}$ but has no rational limit), while $\mathbb{R}$ is complete — and this is exactly how $\mathbb{R}$ was constructed from $\mathbb{Q}$ (by filling in the missing limits).
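The $\sqrt{2}$ example can be made exact with rational arithmetic. In this sketch (our own illustration, using the Babylonian update $x \mapsto (x + 2/x)/2$ rather than the decimal truncations in the text) every term is a genuine rational, consecutive terms cluster, yet the limit is missing from $\mathbb{Q}$:

```python
from fractions import Fraction

x = Fraction(2)
seq = [x]
for _ in range(6):
    x = (x + 2 / x) / 2   # Babylonian step: stays in Q, approaches sqrt(2)
    seq.append(x)

# consecutive gaps shrink rapidly: the sequence is Cauchy in Q...
gaps = [abs(seq[i + 1] - seq[i]) for i in range(len(seq) - 1)]
assert all(g2 < g1 for g1, g2 in zip(gaps, gaps[1:]))
# ...and the squares approach 2 exactly, though no rational squares to 2
assert abs(seq[-1] ** 2 - 2) < Fraction(1, 10**20)
```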

📝 Example 6 (Completeness of Rⁿ)

$\mathbb{R}$ is complete by the Cauchy-completeness theorem from Topic 3, Section 1. $\mathbb{R}^n$ is complete under any of the $\ell^p$ metrics: if $(x^{(k)})$ is a Cauchy sequence in $\mathbb{R}^n$, then each coordinate sequence $(x^{(k)}_i)$ is Cauchy in $\mathbb{R}$ (because $|x^{(k)}_i - x^{(\ell)}_i| \leq d_p(x^{(k)}, x^{(\ell)})$), so each coordinate converges to some limit $x^*_i$, and coordinate-wise convergence in $\mathbb{R}^n$ implies $\ell^p$ convergence. This is the same argument Topic 3 ran for the Euclidean metric; all it needed was the coordinate inequality, which holds for every $p \geq 1$.

🔷 Theorem 2 (Completeness of C([a,b]))

The space $C([a, b])$ with the sup-norm metric $d_\infty(f, g) = \|f - g\|_\infty$ is complete.

Proof.

Let $(f_n)$ be a Cauchy sequence in $C([a, b])$. We need to exhibit a continuous function $f$ such that $\|f_n - f\|_\infty \to 0$.

Step 1: Pointwise limit exists. For each $x \in [a, b]$, the sequence of real numbers $(f_n(x))$ is Cauchy in $\mathbb{R}$, because

$$|f_n(x) - f_m(x)| \leq \sup_{t \in [a,b]} |f_n(t) - f_m(t)| = \|f_n - f_m\|_\infty.$$

Since $\mathbb{R}$ is complete, $(f_n(x))$ converges to some real number, which we define to be $f(x)$. This gives a function $f : [a, b] \to \mathbb{R}$ as the pointwise limit.

Step 2: Convergence is uniform. Given $\varepsilon > 0$, the Cauchy condition gives $N$ such that $\|f_n - f_m\|_\infty < \varepsilon$ for all $m, n \geq N$. Fix $x \in [a, b]$ and any $n \geq N$. Letting $m \to \infty$ in the inequality $|f_n(x) - f_m(x)| < \varepsilon$, the left side converges to $|f_n(x) - f(x)|$, so $|f_n(x) - f(x)| \leq \varepsilon$. Since this holds for every $x$, we get $\|f_n - f\|_\infty \leq \varepsilon$ for all $n \geq N$. Since $\varepsilon$ was arbitrary, $f_n \to f$ uniformly.

Step 3: The limit is continuous. This is exactly the Uniform Limit Theorem from Uniform Convergence: a uniform limit of continuous functions is continuous. So $f \in C([a, b])$, and the previous step says $\|f_n - f\|_\infty \to 0$. The sequence converges in the sup-norm metric, as needed.

This theorem is small but it is the backbone of the rest of the topic. Every fixed-point argument in Sections 7 and 10 — Picard iteration, Newton’s method, Bellman iteration — lives inside $(C([a, b]), d_\infty)$ or an analogous function space, and every such argument needs to know that Cauchy sequences converge. The completeness of $C([a, b])$ is what lets us “extract a limit” at the end of each proof.

📝 Example 7 (Incompleteness of C([0,1]) under a Different Metric)

The same set $C([0, 1])$ is not complete under the integral-based metric

$$d_2(f, g) = \sqrt{\int_0^1 (f(x) - g(x))^2 \, dx}.$$

Consider the sequence of continuous functions $f_n$ on $[0, 1]$ defined to be $0$ on $[0, 1/2 - 1/n]$, linear from $0$ to $1$ on $[1/2 - 1/n, 1/2 + 1/n]$, and $1$ on $[1/2 + 1/n, 1]$ — a “ramp” of width $2/n$ around $x = 1/2$. For large $m, n$, the difference $f_m - f_n$ is bounded by $1$ and supported in a shrinking interval of length $2/\min(m, n)$, so $d_2(f_m, f_n) \to 0$ as $m, n \to \infty$: the sequence is Cauchy in $d_2$. But its pointwise limit is the step function that jumps from $0$ to $1$ at $x = 1/2$, which is discontinuous — and no continuous function can be the $d_2$-limit of this Cauchy sequence. So $(C([0, 1]), d_2)$ is not complete. The completeness of a space depends on the metric, not just the set.
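The ramp computation is easy to reproduce numerically. A sketch (our own illustration, approximating $d_2$ by a Riemann sum on a grid; the helper names are ours): the pairwise $d_2$ distances shrink toward zero while the $d_2$-limit is the discontinuous step.

```python
import numpy as np

def ramp(n):
    """f_n: 0 on [0, 1/2-1/n], linear up, 1 on [1/2+1/n, 1]."""
    return lambda x: np.clip((x - 0.5) * (n / 2) + 0.5, 0.0, 1.0)

def d2(f, g, m=100_001):
    """Riemann-sum approximation of the integral metric d_2 on [0, 1]."""
    xs = np.linspace(0.0, 1.0, m)
    return float(np.sqrt(np.mean((f(xs) - g(xs)) ** 2)))

step = lambda x: (x >= 0.5).astype(float)

# Cauchy in d2: distances between later and later terms keep shrinking...
assert d2(ramp(100), ramp(200)) < d2(ramp(10), ramp(20)) < d2(ramp(2), ramp(4))
# ...and the d2-limit is the step function, which is not continuous.
assert d2(ramp(1000), step) < 0.02
```

The exact value $d_2(f_n, \text{step}) = 1/\sqrt{6n}$ makes the $O(1/\sqrt{n})$ decay visible in the numbers.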

💡 Remark 3 (The Completion Theorem)

Every metric space $(X, d)$ has a completion: a complete metric space $(\hat{X}, \hat{d})$ together with an isometric embedding $\iota : X \hookrightarrow \hat{X}$ such that $\iota(X)$ is dense in $\hat{X}$. The construction mirrors how $\mathbb{R}$ is built from $\mathbb{Q}$: form equivalence classes of Cauchy sequences in $X$, where two Cauchy sequences are equivalent if their interleaving is also Cauchy. The completion of $(C([0, 1]), d_2)$ is a larger space containing limits like our ramp-sequence step function — a space that analysts call $L^2[0, 1]$ and that shows up constantly in modern machine learning. We state the completion theorem as a fact rather than proving it in detail; Munkres, Chapter 3, has the full construction.

Three-panel figure: a Cauchy sequence in R converging; a Cauchy sequence in (C([0,1]), sup-norm) converging uniformly; the ramp-sequence Cauchy in d₂ but with no continuous limit.

6. Compactness in Metric Spaces

Compactness is the most technically important concept in the topic. In $\mathbb{R}^n$, the Heine–Borel theorem from Topic 3 told us that compact sets are exactly the closed and bounded ones. In general metric spaces this is false — and the witness to the failure is the $\sin(n\pi x)$ counterexample we promised back in Topic 3. In this section we give the general definition, prove the three-way equivalence with sequential compactness and total boundedness, and visualize the counterexample.

📐 Definition 9 (Open-Cover Compactness)

A metric space $(X, d)$ is compact if every open cover of $X$ has a finite subcover: whenever $X = \bigcup_{\alpha \in I} U_\alpha$ with each $U_\alpha$ open, there is a finite set $\{\alpha_1, \ldots, \alpha_N\} \subseteq I$ such that $X = U_{\alpha_1} \cup \cdots \cup U_{\alpha_N}$. A subset $K \subseteq X$ is compact if it is compact as a metric space under the restricted metric.

📐 Definition 10 (Sequential Compactness)

A metric space $(X, d)$ is sequentially compact if every sequence in $X$ has a convergent subsequence.

📐 Definition 11 (Total Boundedness)

A metric space $(X, d)$ is totally bounded if for every $\varepsilon > 0$ there exist finitely many points $x_1, \ldots, x_N \in X$ such that the corresponding $\varepsilon$-balls cover $X$:

$$X = \bigcup_{i=1}^N B(x_i, \varepsilon).$$

The collection $\{x_1, \ldots, x_N\}$ is called a finite $\varepsilon$-net for $X$.

Total boundedness is strictly stronger than boundedness. A bounded subset of a metric space only needs to fit inside a single large ball; a totally bounded set must be approximable, at every resolution $\varepsilon$, by a finite number of points. In $\mathbb{R}^n$ the two concepts agree: because $\mathbb{R}^n$ is finite-dimensional, a single bounded ball can be covered by finitely many $\varepsilon$-balls. In infinite-dimensional spaces they come apart dramatically — and that is exactly why Heine–Borel fails in $C([0, 1])$.
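A finite $\varepsilon$-net is straightforward to construct for a cube in $\mathbb{R}^n$. In this sketch (the grid construction and helper name are ours) we place cell centers fine enough that each cell's half-diagonal is at most $\varepsilon$, then check coverage empirically:

```python
import itertools
import math
import numpy as np

def grid_net(eps, dim):
    """A finite eps-net for [0,1]^dim under the Euclidean metric: centers of
    a grid fine enough that every cube point is within eps of some center."""
    k = math.ceil(math.sqrt(dim) / (2 * eps))  # cells per axis
    axis = (np.arange(k) + 0.5) / k            # cell centers along one axis
    return np.array(list(itertools.product(axis, repeat=dim)))

net = grid_net(0.1, 2)                         # 64 centers suffice for eps=0.1
pts = np.random.default_rng(1).random((5000, 2))
dists = np.linalg.norm(pts[:, None, :] - net[None, :, :], axis=2)
assert dists.min(axis=1).max() <= 0.1          # every sample is eps-covered
```

Shrinking $\varepsilon$ only multiplies the number of centers; in an infinite-dimensional ball no finite collection works for small $\varepsilon$, which is the content of the next section's counterexample.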

🔷 Theorem 3 (Compactness Equivalence in Metric Spaces)

For a metric space $(X, d)$, the following three conditions are equivalent:

  1. $X$ is compact (every open cover has a finite subcover).
  2. $X$ is sequentially compact (every sequence has a convergent subsequence).
  3. $X$ is complete and totally bounded.

Proof.

We sketch the three directions. Full proofs are in Munkres, Chapter 3, or Rudin, Chapter 2.

$(1) \Rightarrow (2)$. Suppose $X$ is compact and let $(x_n)$ be a sequence with no convergent subsequence. Then no point $x \in X$ is a cluster point of $(x_n)$, so each $x$ has an open ball $B(x, r_x)$ containing only finitely many terms of the sequence. These balls form an open cover of $X$. Compactness gives a finite subcover, which therefore contains only finitely many terms in total — but the full sequence has infinitely many terms, a contradiction.

$(2) \Rightarrow (3)$. Sequential compactness implies completeness: a Cauchy sequence has a convergent subsequence by hypothesis, and a Cauchy sequence with a convergent subsequence converges to the same limit (the standard Cauchy-subsequence lemma). For total boundedness, suppose $X$ has no finite $\varepsilon$-net for some $\varepsilon > 0$. Pick any $x_1 \in X$. Since $B(x_1, \varepsilon)$ is not all of $X$, pick $x_2 \notin B(x_1, \varepsilon)$. Since $B(x_1, \varepsilon) \cup B(x_2, \varepsilon)$ is not all of $X$, pick $x_3$ outside both balls. Continuing, we construct a sequence with pairwise distances $\geq \varepsilon$, which has no convergent subsequence — contradicting sequential compactness.

$(3) \Rightarrow (1)$ (the hardest direction). Suppose $X$ is complete and totally bounded but has an open cover $\{U_\alpha\}$ with no finite subcover. Total boundedness gives a finite $\varepsilon_1$-net; at least one of the balls $B(x_i, \varepsilon_1)$ has no finite subcover (otherwise combining finite subcovers of the balls would give one for the whole space). Pick one such ball, halve the radius to $\varepsilon_1 / 2$, and apply total boundedness inside that ball to get a smaller ball with no finite subcover. Iterating, we obtain a nested sequence of sets with diameters shrinking to zero; picking one point from each gives a Cauchy sequence. By completeness it converges to some limit $x^*$, which belongs to some $U_\alpha$ in the cover. Because $U_\alpha$ is open, it contains a ball around $x^*$, so $U_\alpha$ alone covers the sets in our nested sequence once their diameters are small enough — giving a finite subcover after all, contradicting our construction.

💡 Remark 4 (Heine–Borel as a Special Case)

In $\mathbb{R}^n$, the Heine–Borel theorem from Completeness & Compactness in $\mathbb{R}^n$ says compact ⟺ closed and bounded. This is a special case of the general three-way equivalence we just proved: in $\mathbb{R}^n$, bounded and totally bounded coincide (a bounded ball in finite dimensions can be covered by finitely many $\varepsilon$-balls), and a closed subset of a complete space is complete. So in $\mathbb{R}^n$, closed + bounded is the same as complete + totally bounded, which by Theorem 3 is the same as compact. The equivalence “bounded ⟺ totally bounded” fails in infinite dimensions — and we are about to see exactly how badly.

Figure: the family sin(nπx) and its pairwise sup-norm distance matrix — all off-diagonal entries ≥ 1.76, so no subsequence is Cauchy. The closed unit ball in C([0,1]) is closed and bounded but not compact: no subsequence of these functions converges in the sup-norm.

📝 Example 8 (The C([0,1]) Counterexample)

The closed unit ball B={fC([0,1]):f1}\mathcal{B} = \{f \in C([0, 1]) : \|f\|_\infty \leq 1\} in the sup-norm is closed and bounded, but not compact. This is the counterexample promised in Topic 3.

Proof. Consider the sequence fn(x)=sin(nπx)f_n(x) = \sin(n\pi x) for n=1,2,3,n = 1, 2, 3, \ldots. Each fnf_n is continuous and fn=1\|f_n\|_\infty = 1, so every fnBf_n \in \mathcal{B}. We claim that no subsequence of (fn)(f_n) is Cauchy in the sup-norm — and therefore no subsequence converges, so B\mathcal{B} is not sequentially compact and (by Theorem 3) not compact.

To see why, fix distinct indices mnm \neq n. The functions sin(mπx)\sin(m\pi x) and sin(nπx)\sin(n\pi x) are orthogonal in L2[0,1]L^2[0, 1]:

01sin(mπx)sin(nπx)dx=0.\int_0^1 \sin(m\pi x) \sin(n\pi x) \, dx = 0.

Expanding 01(sin(mπx)sin(nπx))2dx\int_0^1 (\sin(m\pi x) - \sin(n\pi x))^2 \, dx using orthogonality and 01sin2(kπx)dx=1/2\int_0^1 \sin^2(k\pi x)\,dx = 1/2:

01(sin(mπx)sin(nπx))2dx=12+120=1.\int_0^1 (\sin(m\pi x) - \sin(n\pi x))^2 \, dx = \tfrac{1}{2} + \tfrac{1}{2} - 0 = 1.

Now use the bound f22f2\|f\|_2^2 \leq \|f\|_\infty^2 on [0,1][0, 1]: if the sup-norm of the difference were less than 1, the integral above would be less than 1. But the integral equals 1, so

fmfnfmfn2=1for all mn.\|f_m - f_n\|_\infty \geq \|f_m - f_n\|_2 = 1 \quad \text{for all } m \neq n.

Every pair of terms in the sequence is at sup-norm distance at least 1 — so no subsequence is Cauchy, and no subsequence converges.

The lesson. In an infinite-dimensional space, closed and bounded is not enough for compactness. The missing ingredient is equicontinuity: the family sin(nπx)\sin(n\pi x) is bounded, but the functions “wiggle faster and faster” — for large nn there is no single δ\delta that controls fn(x)fn(y)|f_n(x) - f_n(y)| uniformly across the family. We will name this obstruction and package it into the Arzelà–Ascoli theorem in Section 9.
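The pairwise-distance claim is easy to check numerically. This sketch (our own grid resolution and range of nn) samples the family and confirms that every off-diagonal sup-norm distance exceeds 1:

```python
import numpy as np

# Sample sin(n*pi*x) on a fine grid and measure all pairwise sup-norm
# distances -- a numerical companion to the proof, not part of it.
x = np.linspace(0.0, 1.0, 2001)
family = np.array([np.sin(n * np.pi * x) for n in range(1, 9)])

# pairwise sup-norm distance matrix over the sampled family
D = np.max(np.abs(family[:, None, :] - family[None, :, :]), axis=2)
off_diag = D[~np.eye(len(family), dtype=bool)]
print(off_diag.min())   # every pair is at sup-norm distance > 1, as proved
```

The observed minimum is in fact well above the guaranteed bound of 1, consistent with the distance matrix shown in the figure.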

Two-panel figure: the family sin(nπx) for n = 1..8 plotted, and the pairwise sup-norm distance matrix showing all off-diagonal entries ≥ 1.

7. The Contraction Mapping Theorem

We now prove the theorem that justifies most of the fixed-point arguments scattered throughout the rest of this curriculum. It is short, clean, and extraordinarily useful: four earlier topics have been waiting to cite it.

🔷 Theorem 4 (Banach Contraction Mapping Theorem)

Let (X,d)(X, d) be a complete metric space and let T:XXT : X \to X be a contraction with Lipschitz constant 0k<10 \leq k < 1, meaning

d(T(x),T(y))kd(x,y)for all x,yX.d(T(x), T(y)) \leq k \cdot d(x, y) \quad \text{for all } x, y \in X.

Then:

  1. TT has a unique fixed point xXx^* \in X — a point with T(x)=xT(x^*) = x^*.
  2. For any starting point x0Xx_0 \in X, the iterates xn+1=T(xn)x_{n+1} = T(x_n) converge to xx^*.
  3. The convergence rate is geometric: d(xn,x)kn1kd(x0,x1)d(x_n, x^*) \leq \dfrac{k^n}{1 - k} \cdot d(x_0, x_1).

Proof.

Fix a starting point x0Xx_0 \in X and define the iterates xn+1=T(xn)x_{n+1} = T(x_n). The contraction property applied to consecutive iterates gives

d(xn+1,xn)=d(T(xn),T(xn1))kd(xn,xn1).d(x_{n+1}, x_n) = d(T(x_n), T(x_{n-1})) \leq k \cdot d(x_n, x_{n-1}).

By induction, this bound extends to

d(xn+1,xn)knd(x1,x0).d(x_{n+1}, x_n) \leq k^n \cdot d(x_1, x_0).

Now we show (xn)(x_n) is Cauchy. For any m>nm > n, the triangle inequality stacks up mnm - n consecutive-iterate bounds:

d(xm,xn)j=nm1d(xj+1,xj)d(x1,x0)j=nm1kj.d(x_m, x_n) \leq \sum_{j = n}^{m-1} d(x_{j+1}, x_j) \leq d(x_1, x_0) \sum_{j = n}^{m-1} k^j.

The geometric-series tail bound gives j=nm1kjkn1k\sum_{j = n}^{m-1} k^j \leq \dfrac{k^n}{1 - k}, so

d(xm,xn)kn1kd(x1,x0).d(x_m, x_n) \leq \frac{k^n}{1 - k} \cdot d(x_1, x_0).

Since k<1k < 1, the right-hand side tends to 00 as nn \to \infty uniformly in mm — the sequence (xn)(x_n) is Cauchy. Because XX is complete, (xn)(x_n) converges to some limit xXx^* \in X.

Next we show xx^* is a fixed point. Contractions are Lipschitz (with constant kk), hence continuous, so we can pass TT through the limit:

T(x)=T(limnxn)=limnT(xn)=limnxn+1=x.T(x^*) = T\big(\lim_{n \to \infty} x_n\big) = \lim_{n \to \infty} T(x_n) = \lim_{n \to \infty} x_{n+1} = x^*.

Uniqueness: suppose yy^* is another fixed point of TT. Then

d(x,y)=d(T(x),T(y))kd(x,y).d(x^*, y^*) = d(T(x^*), T(y^*)) \leq k \cdot d(x^*, y^*).

Since k<1k < 1, this forces d(x,y)=0d(x^*, y^*) = 0, so y=xy^* = x^*.

Finally, the convergence rate. Letting mm \to \infty in the Cauchy-tail bound d(xm,xn)kn1kd(x1,x0)d(x_m, x_n) \leq \frac{k^n}{1-k} \cdot d(x_1, x_0), the left side converges to d(x,xn)d(x^*, x_n), proving

d(xn,x)kn1kd(x0,x1).d(x_n, x^*) \leq \frac{k^n}{1 - k} \cdot d(x_0, x_1).

Three things to note. First, the theorem tells us nothing about which starting point to use — any x0Xx_0 \in X works, and they all converge to the same fixed point. Second, the rate bound is sharp enough to give a stopping criterion: after nn iterations, the distance to xx^* is at most kn1k\frac{k^n}{1-k} times the first step d(x0,x1)d(x_0, x_1), a quantity we can measure after one evaluation of TT. Third, the contraction hypothesis k<1k < 1 is strict: at k=1k = 1 the theorem breaks, as the viz below demonstrates by letting you set k=1k = 1 and watch iterates orbit instead of converging.

Interactive demo: an affine map T(x) = k·Rθ·x + c with θ = 30°. The operator norm of k·Rθ is exactly k, so the map is a contraction whenever k < 1; after 60 iterations the demo reports d(x_n, x*) ≈ 8.4e−10, inside the bound kⁿ/(1−k)·d(x₀,x₁) ≈ 1.5e−9. Click anywhere in the plot to set a new starting point x₀; drag the slider to change k.
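A non-interactive sketch of the same experiment, with toy parameters of our own choosing (k = 0.8, θ = 30°, c = (1, 0)), verifies the Theorem 4 rate bound at every iterate:

```python
import numpy as np

# T(x) = k * R_theta @ x + c with k < 1: the Lipschitz constant of T is
# exactly k (rotations are isometries), so Theorem 4 applies on (R^2, ||.||).
k, theta = 0.8, np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
c = np.array([1.0, 0.0])

def T(x):
    return k * R @ x + c

x_star = np.linalg.solve(np.eye(2) - k * R, c)   # exact fixed point
x = np.array([5.0, 5.0])                         # arbitrary starting point
d01 = np.linalg.norm(T(x) - x)                   # first-step distance d(x0, x1)
for n in range(1, 61):
    x = T(x)
    err = np.linalg.norm(x - x_star)
    bound = k**n / (1 - k) * d01                 # Theorem 4's a priori bound
    assert err <= bound + 1e-12                  # the rate bound holds each step
print(err)   # tiny after 60 iterations, well inside the bound
```

Changing the starting point changes d(x₀, x₁) but not the limit, matching the first of the three notes above.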

📝 Example 9 (Picard Iteration for ODEs)

The Picard–Lindelöf existence-and-uniqueness theorem for first-order ODEs from First-Order ODEs rewrites the initial-value problem y(t)=f(t,y(t))y'(t) = f(t, y(t)), y(0)=y0y(0) = y_0 as the fixed-point equation

y(t)=y0+0tf(s,y(s))ds.y(t) = y_0 + \int_0^t f(s, y(s)) \, ds.

Defining the Picard operator Φ[y](t)=y0+0tf(s,y(s))ds\Phi[y](t) = y_0 + \int_0^t f(s, y(s)) \, ds, a fixed point of Φ\Phi is exactly a solution of the ODE. Topic 21 showed that when ff is Lipschitz in its second argument with constant LL and we work on a short interval [0,h][0, h] with Lh<1Lh < 1, the Picard operator is a contraction on C([0,h])C([0, h]) with the sup-norm — constant LhLh — and C([0,h])C([0, h]) is complete by Theorem 2. The Banach theorem then produces a unique fixed point, which is the ODE solution. The abstract theorem we just proved is what makes Topic 21’s concrete argument go through.
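As a concrete sketch (our own toy problem, not Topic 21's): for y′ = y, y(0) = 1 on [0, 0.5], the Lipschitz constant is L = 1 and Lh = 0.5 < 1, so Picard iteration contracts toward exp(t). Discretizing the integral with the trapezoid rule:

```python
import numpy as np

# Picard iteration for y' = y, y(0) = 1 on [0, h] with h = 0.5 (Lh = 0.5 < 1).
# The Picard operator Phi[y](t) = 1 + int_0^t y(s) ds is a contraction on
# C([0, h]) in the sup-norm; its fixed point is exp(t).
h, N = 0.5, 1001
t = np.linspace(0.0, h, N)
dt = t[1] - t[0]

def picard_step(y):
    # cumulative trapezoid approximation of int_0^t y(s) ds
    integral = np.concatenate(([0.0], np.cumsum((y[1:] + y[:-1]) / 2 * dt)))
    return 1.0 + integral

y = np.zeros_like(t)            # start from the zero function -- any start works
for _ in range(40):
    y = picard_step(y)

print(np.max(np.abs(y - np.exp(t))))   # sup-norm error: quadrature-limited
```

The remaining error is set by the trapezoid discretization, not by the iteration: the contraction has long since converged to the discrete fixed point.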

📝 Example 10 (Newton Iteration for Implicit Methods)

The implicit Euler method from Numerical Methods for ODEs requires solving the nonlinear equation yn+1=yn+hf(tn+1,yn+1)y_{n+1} = y_n + h \cdot f(t_{n+1}, y_{n+1}) at each time step. Newton’s method reduces this to iterating g(y)=y[Ihfy(tn+1,y)]1(yynhf(tn+1,y))g(y) = y - [I - h f_y(t_{n+1}, y)]^{-1}(y - y_n - h f(t_{n+1}, y)). For sufficiently small hh, the derivative of gg at the root is bounded in norm by some k<1k < 1, so by the mean value theorem gg is a contraction on a small neighborhood of the root and Banach’s theorem guarantees convergence. Topic 24 invoked this conclusion in its convergence analysis without proof; we have now proved the general result.
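Even without Newton, the implicit Euler equation can be solved by direct fixed-point iteration when h·L < 1 — a minimal sketch with an assumed test equation y′ = −5y:

```python
import numpy as np

# One implicit Euler step for y' = -5y: solve y = y_n + h * f(t, y) by
# iterating g(y) = y_n + h * (-5 * y). Here |g'| = 5h = 0.5 < 1, so g is a
# contraction on R and Banach's theorem guarantees convergence.
lam, h, y_n = -5.0, 0.1, 1.0

def g(y):
    return y_n + h * lam * y

y = y_n                          # warm-start from the previous time step
for _ in range(50):
    y = g(y)

exact = y_n / (1 - h * lam)      # closed-form solution of the linear step
print(y, exact)                  # the iterates converge to the exact value
```

Newton trades this simple iteration for a faster one, but both rest on the same contraction argument near the root.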

📝 Example 11 (The Inverse Function Theorem)

The proof of the Inverse Function Theorem in Topic 12 constructs a local inverse by setting up the fixed-point equation x=ϕ(x)x = \phi(x) where ϕ\phi is a map on a small ball in Rn\mathbb{R}^n whose Jacobian is bounded below 1 in operator norm. Banach’s theorem then gives a unique fixed point, which is the preimage of the target point. The whole construction is a single application of Theorem 4 to a carefully chosen contraction — the argument was specialized to Rn\mathbb{R}^n in Topic 12, but the contraction-mapping framework developed here shows why it works in full generality.

💡 Remark 5 (Bernstein Operators Are Not Contractions)

Topic 20 proved that Bernstein operators BnfB_n f converge uniformly to ff on [0,1][0, 1] — a statement that lives entirely in the metric space (C([0,1]),d)(C([0, 1]), d_\infty). But BnB_n is not a contraction: it does not uniformly shrink sup-norm distances between arbitrary functions. Bernstein convergence is therefore not an application of Theorem 4; it is a different kind of approximation phenomenon that happens to share the same ambient metric space. The general lesson: sitting inside a complete metric space does not automatically make every self-map a contraction, and density-of-polynomials results require separate machinery (probabilistic moment arguments in the Bernstein case). What Topic 20 does share with this topic is the underlying metric-space structure — the sup-norm distance between BnfB_n f and ff is how “convergence” is defined there, and that distance is a metric in exactly the sense defined here.

Three-panel figure: iterates converging for k < 1, orbiting for k = 1, and diverging for k > 1.

Log-scale convergence plot: actual d(x_n, x*) vs. theoretical bound k^n/(1−k)·d(x₀, x₁) for several values of k.

8. General Topology — Beyond Metric Spaces

Everything we have done so far has used the metric dd. But many of the concepts — continuity, open sets, closure, compactness — make sense in spaces where there is no metric at all, provided you know which sets are “open.” This section introduces the general notion of a topological space and explains why the preimage characterization of continuity is the one that generalizes.

📐 Definition 12 (Topological Space)

A topological space is a pair (X,τ)(X, \tau) where XX is a set and τ\tau is a collection of subsets of XX — called open sets — satisfying:

  1. τ\emptyset \in \tau and XτX \in \tau.
  2. Arbitrary unions of elements of τ\tau are in τ\tau.
  3. Finite intersections of elements of τ\tau are in τ\tau.

The collection τ\tau is called a topology on XX. A metric space (X,d)(X, d) induces a topology τd={UX:U is open in the metric-space sense}\tau_d = \{\, U \subseteq X : U \text{ is open in the metric-space sense}\,\} by Proposition 1, so every metric space is a topological space. But there are topological spaces with no compatible metric, and those are strictly more general.

The three axioms are exactly the properties Proposition 1 established for metric-space open sets. Any collection of subsets satisfying them qualifies as a topology, regardless of whether it came from a metric. The question is whether all topologies come from metrics — and the answer is no.

📝 Example 12 (The Cofinite Topology)

Let XX be an uncountable set — for concreteness, take X=RX = \mathbb{R}. Define τ\tau to consist of the empty set together with all subsets UXU \subseteq X whose complement XUX \setminus U is finite:

τ={}{UX:XU<}.\tau = \{\emptyset\} \cup \{\, U \subseteq X : |X \setminus U| < \infty\,\}.

This is a topology: the empty set and XX are in τ\tau, arbitrary unions of cofinite sets are cofinite (their complement is an intersection of finite sets, which is finite), and finite intersections of cofinite sets are cofinite (their complement is a finite union of finite sets, which is finite). So (X,τ)(X, \tau) is a valid topological space.

But (X,τ)(X, \tau) is not metrizable: there is no metric that generates this topology. Here is one quick reason. In the cofinite topology, any two non-empty open sets must intersect (the union of their complements is finite, so the intersection of the sets themselves is cofinite, hence non-empty). So for any two distinct points x,yXx, y \in X, every neighborhood of xx intersects every neighborhood of yy. In a metric space we can always separate distinct points by disjoint balls B(x,d(x,y)/3)B(x, d(x,y)/3) and B(y,d(x,y)/3)B(y, d(x,y)/3) — a property called the Hausdorff property. The cofinite topology on an uncountable set fails this, so no metric can generate it.

📐 Definition 13 (Continuity in Topological Spaces)

A function f:(X,τX)(Y,τY)f : (X, \tau_X) \to (Y, \tau_Y) between topological spaces is continuous if for every open set VτYV \in \tau_Y, the preimage f1(V)τXf^{-1}(V) \in \tau_X.

This is characterization (3) from Theorem 1 — the preimage-of-open-set condition — lifted verbatim to the topological setting. It is the only definition that works universally: the ε\varepsilon-δ\delta definition requires a metric, and the sequential definition fails in non-first-countable spaces (where there are “too many” open sets for sequences to detect). In metric spaces all three definitions coincide, as we proved in Theorem 1; in general topology, we start with (3) and take the other two only when the space is metrizable.

📐 Definition 14 (Homeomorphism)

A homeomorphism between topological spaces (X,τX)(X, \tau_X) and (Y,τY)(Y, \tau_Y) is a bijection f:XYf : X \to Y such that both ff and f1f^{-1} are continuous. Two spaces related by a homeomorphism are topologically equivalent — they have the same open sets up to relabeling, the same convergence, the same compactness properties, and the same connectedness properties.

Homeomorphism is the “sameness” relation for topological spaces. Two very different-looking metric spaces can be topologically equivalent — for example, the open interval (0,1)(0, 1) and the whole real line R\mathbb{R} are homeomorphic (via any monotone continuous bijection like xtan(πxπ/2)x \mapsto \tan(\pi x - \pi/2)), even though one has finite diameter and the other doesn’t. The metric “forgot” when we reduced to the topology.

💡 Remark 6 (Metrization and Fundamental Groups)

Topic 15 mentioned simply connected domains and fundamental groups as the setting for its discussion of conservative vector fields and path independence. These concepts live at the general-topology level: a domain is simply connected if every closed loop in it can be continuously contracted to a point, a property defined entirely in terms of continuous maps (homotopies of loops), and the fundamental group π1(X,x0)\pi_1(X, x_0) classifies loops up to continuous deformation. Fundamental groups make sense in any topological space and are preserved by homeomorphism — they are a topological invariant. Metric spaces inherit these invariants whenever they carry the metric-induced topology, which is almost always in analysis. The Urysohn Metrization Theorem gives a clean sufficient condition: a topological space is metrizable if (and essentially only if) it is second-countable, regular, and Hausdorff. Most spaces that appear in analysis satisfy these conditions, which is why metric-space theory covers so much ground. But the topological vocabulary — connectedness, path-connectedness, fundamental groups, homotopy — belongs to the general theory, not to metric spaces in particular.

9. Equicontinuity and the Arzelà–Ascoli Theorem

In Section 6 we saw that the closed unit ball of C([0,1])C([0, 1]) is not compact — the sequence fn(x)=sin(nπx)f_n(x) = \sin(n\pi x) has no convergent subsequence. The Arzelà–Ascoli theorem identifies exactly what extra condition is needed to restore compactness in function spaces: equicontinuity. Topic 4 introduced the concept and promised the proof; we pay that debt now.

📐 Definition 15 (Equicontinuity)

A family FC([a,b])\mathcal{F} \subseteq C([a, b]) is equicontinuous if for every ε>0\varepsilon > 0 there exists a single δ>0\delta > 0 — depending on ε\varepsilon but not on ff — such that for every fFf \in \mathcal{F} and every x,y[a,b]x, y \in [a, b] with xy<δ|x - y| < \delta:

f(x)f(y)<ε.|f(x) - f(y)| < \varepsilon.

This is the concept Topic 4 introduced in the context of equicontinuity figures. The crucial word is single: uniform continuity of a single function says there is one δ\delta that works for all points at once; equicontinuity of a family says there is one δ\delta that works for all points and all functions at once. The family fn(x)=sin(nπx)f_n(x) = \sin(n\pi x) fails equicontinuity because faster-wiggling functions need smaller δ\delta’s — no single choice works for the whole family. By contrast, any family of functions with a uniform Lipschitz bound (say f(x)f(y)Lxy|f(x) - f(y)| \leq L |x - y| for all ff in the family) is automatically equicontinuous: take δ=ε/L\delta = \varepsilon / L.
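A numerical probe of this contrast (with example families of our own: sin(cx) for c ≤ 3 versus sin(nπx) for n ≤ 50) measures the worst increment over each family at separation δ = ε/L:

```python
import numpy as np

# Worst increment sup_f |f(x) - f(y)| over a family at separation ~delta.
# A uniformly Lipschitz family passes with delta = eps / L; the increasingly
# wiggly sin(n*pi*x) family does not -- no single delta serves every n.
x = np.linspace(0.0, 1.0, 4001)

def worst_increment(family, delta):
    shift = int(round(delta / (x[1] - x[0])))   # grid offset of size ~delta
    return max(np.max(np.abs(f[shift:] - f[:-shift])) for f in family)

eps, L = 0.1, 3.0
lipschitz_family = [np.sin(c * x) for c in (0.5, 1.0, 2.0, 3.0)]  # |f'| <= 3
wiggly_family = [np.sin(n * np.pi * x) for n in range(1, 51)]

delta = eps / L
print(worst_increment(lipschitz_family, delta))   # <= eps: equicontinuous
print(worst_increment(wiggly_family, delta))      # near 2: equicontinuity fails
```

The Lipschitz family's worst increment stays under ε with the single δ = ε/L; the sin(nπx) family's worst increment saturates near the full oscillation range.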

📐 Definition 16 (Uniform Boundedness of a Family)

A family FC([a,b])\mathcal{F} \subseteq C([a, b]) is uniformly bounded if there exists M>0M > 0 such that

f(x)Mfor every fF and every x[a,b].|f(x)| \leq M \quad \text{for every } f \in \mathcal{F} \text{ and every } x \in [a, b].

Equivalently, the family fits inside a single closed ball of the sup-norm: fM\|f\|_\infty \leq M for all fFf \in \mathcal{F}.

🔷 Theorem 5 (Arzelà–Ascoli)

A subset FC([a,b])\mathcal{F} \subseteq C([a, b]) (with the sup-norm metric) has compact closure if and only if F\mathcal{F} is uniformly bounded and equicontinuous.

Equivalently: every uniformly bounded and equicontinuous sequence (fn)(f_n) in C([a,b])C([a, b]) has a subsequence that converges uniformly.

Proof.

We prove both directions. The forward direction is the routine one; the reverse direction is the workhorse “diagonal subsequence” argument.

(\Leftarrow) Uniformly bounded + equicontinuous ⟹ sequentially compact closure. Let (fn)(f_n) be a sequence in F\mathcal{F}. We will extract a subsequence that is Cauchy in the sup-norm; by completeness of C([a,b])C([a, b]) (Theorem 2) it then converges, proving sequential compactness.

Step 1: pointwise convergence on a countable dense set. Enumerate the rationals in [a,b][a, b] as r1,r2,r3,r_1, r_2, r_3, \ldots — a countable dense set. At r1r_1, the sequence of real numbers (fn(r1))(f_n(r_1)) is bounded (by uniform boundedness of F\mathcal{F}), so by the Bolzano-Weierstrass theorem it has a convergent subsequence, which we index as fnk(1)f_{n_k^{(1)}}. Next, the sequence (fnk(1)(r2))(f_{n_k^{(1)}}(r_2)) is bounded in R\mathbb{R}, so we extract a further subsequence fnk(2)f_{n_k^{(2)}} converging at r2r_2 — and still converging at r1r_1, since subsequences of convergent sequences converge to the same limit. Iterate this procedure: at step jj we have a subsequence fnk(j)f_{n_k^{(j)}} that converges at each of r1,r2,,rjr_1, r_2, \ldots, r_j.

Step 2: diagonal subsequence. Define the diagonal subsequence gk=fnk(k)g_k = f_{n_k^{(k)}}. For every fixed rational rjr_j, the sequence (gk)kj(g_k)_{k \geq j} is a subsequence of (fnk(j))(f_{n_k^{(j)}}), which converges at rjr_j; therefore (gk(rj))(g_k(r_j)) converges too. So (gk)(g_k) converges pointwise at every rational in [a,b][a, b].

Step 3: equicontinuity promotes pointwise to uniform. Given ε>0\varepsilon > 0, equicontinuity gives δ>0\delta > 0 such that

f(x)f(y)<ε/3for all fF whenever xy<δ.|f(x) - f(y)| < \varepsilon / 3 \quad \text{for all } f \in \mathcal{F} \text{ whenever } |x - y| < \delta.

Cover [a,b][a, b] by finitely many open intervals of length less than δ\delta centered at rationals rj1,,rjmr_{j_1}, \ldots, r_{j_m} (we can pick rationals because they are dense). By Step 2, (gk(rji))(g_k(r_{j_i})) converges for each i=1,,mi = 1, \ldots, m, so there exists KK such that for all k,Kk, \ell \geq K and all ii:

gk(rji)g(rji)<ε/3.|g_k(r_{j_i}) - g_\ell(r_{j_i})| < \varepsilon / 3.

Now let x[a,b]x \in [a, b] be arbitrary. Pick rjir_{j_i} with xrji<δ|x - r_{j_i}| < \delta. The triangle inequality gives

gk(x)g(x)gk(x)gk(rji)+gk(rji)g(rji)+g(rji)g(x).|g_k(x) - g_\ell(x)| \leq |g_k(x) - g_k(r_{j_i})| + |g_k(r_{j_i}) - g_\ell(r_{j_i})| + |g_\ell(r_{j_i}) - g_\ell(x)|.

The first and third terms are each less than ε/3\varepsilon / 3 by equicontinuity (applied to gkg_k and gg_\ell, which lie in F\mathcal{F}). The middle term is less than ε/3\varepsilon / 3 by the choice of KK. So gk(x)g(x)<ε|g_k(x) - g_\ell(x)| < \varepsilon for every xx, which means gkgε\|g_k - g_\ell\|_\infty \leq \varepsilon for all k,Kk, \ell \geq K. The subsequence (gk)(g_k) is Cauchy in (C([a,b]),d)(C([a, b]), d_\infty), and by completeness it converges to some continuous function — which is the uniformly convergent subsequence we wanted.

(\Rightarrow) Compact closure ⟹ uniformly bounded + equicontinuous. Suppose F\overline{\mathcal{F}} is compact. Then it is bounded (compact sets are bounded in any metric space), so F\mathcal{F} is uniformly bounded. For equicontinuity, fix ε>0\varepsilon > 0. Compactness of F\overline{\mathcal{F}} in the sup-norm means we can cover it by finitely many sup-norm balls B(φ1,ε/3),,B(φN,ε/3)B(\varphi_1, \varepsilon / 3), \ldots, B(\varphi_N, \varepsilon / 3) centered at functions φ1,,φNC([a,b])\varphi_1, \ldots, \varphi_N \in C([a, b]). Each φi\varphi_i is continuous on the compact interval [a,b][a, b], hence uniformly continuous, giving some δi\delta_i that makes φi(x)φi(y)<ε/3|\varphi_i(x) - \varphi_i(y)| < \varepsilon / 3 whenever xy<δi|x - y| < \delta_i. Take δ=min(δ1,,δN)\delta = \min(\delta_1, \ldots, \delta_N).

Now for any fFf \in \mathcal{F}, pick ii with fφi<ε/3\|f - \varphi_i\|_\infty < \varepsilon / 3. For xy<δ|x - y| < \delta:

f(x)f(y)f(x)φi(x)+φi(x)φi(y)+φi(y)f(y).|f(x) - f(y)| \leq |f(x) - \varphi_i(x)| + |\varphi_i(x) - \varphi_i(y)| + |\varphi_i(y) - f(y)|.

Each of the three terms is bounded by ε/3\varepsilon / 3, so f(x)f(y)<ε|f(x) - f(y)| < \varepsilon. This bound holds for every fFf \in \mathcal{F} with the same δ\delta — equicontinuity.

📝 Example 13 (Why sin(nπx) Fails Arzelà–Ascoli)

The counterexample family fn(x)=sin(nπx)f_n(x) = \sin(n\pi x) from Section 6 is uniformly bounded (fn=1\|f_n\|_\infty = 1 for every nn) but not equicontinuous. To see why, pick ε=1\varepsilon = 1 and ask for a δ\delta that forces fn(x)fn(y)<1|f_n(x) - f_n(y)| < 1 whenever xy<δ|x - y| < \delta. For any proposed δ>0\delta > 0, take nn large enough that 1/(2n)<δ1 / (2n) < \delta, and take x=0x = 0, y=1/(2n)y = 1 / (2n). Then xy=1/(2n)<δ|x - y| = 1/(2n) < \delta but

fn(0)fn(1/(2n))=sin(0)sin(π/2)=1.|f_n(0) - f_n(1/(2n))| = |\sin(0) - \sin(\pi/2)| = 1.

So the chosen δ\delta fails at fnf_n. No single δ\delta works for the whole family — equicontinuity fails exactly because the functions oscillate faster and faster as nn grows. The failure of equicontinuity is precisely what blocks the Arzelà–Ascoli conclusion and explains why no subsequence converges.

💡 Remark 7 (Arzelà–Ascoli in Practice)

The theorem is the workhorse compactness criterion for function spaces. It gets used constantly in the existence theory for partial differential equations (bound the approximate solutions in a uniform-Lipschitz sense, extract a convergent subsequence, pass to the limit), in the regularity theory of calculus-of-variations minimizers, and in proving convergence of numerical schemes. In machine learning, it appears implicitly whenever you control the Lipschitz constant of a hypothesis class: a uniformly bounded, uniformly Lipschitz family of functions has compact closure in C([a,b])C([a, b]), which means you can always extract convergent subsequences from training trajectories and study their limits. Spectral norm regularization of neural networks is partially motivated by exactly this kind of Arzelà–Ascoli argument.

Interactive demo: a function family with Lipschitz bound 3 — equicontinuous and pointwise bounded, so Arzelà–Ascoli applies. Click on the graph to probe equicontinuity at a point. Blue curves = extracted subsequence; dashed red = limit approximation.

Three-panel figure: an equicontinuous family (a single δ works for all functions), the non-equicontinuous sin(nπx) family, and a sketch of the diagonal subsequence extraction.

10. ML Connections — Distance, Fixed Points, and Embeddings

Everything in this topic has concrete machine learning payoffs. The Banach theorem explains why iterative algorithms converge; the metric-space axioms explain what “distance” must satisfy for similarity-based methods to work; the Arzelà–Ascoli theorem controls the compactness of function classes. Here are four connections that appear across modern ML.

📝 Example 14 (The Bellman Operator Is a Contraction)

In reinforcement learning, the Bellman optimality operator TT^* acts on bounded value functions V:SRV : \mathcal{S} \to \mathbb{R} by

(TV)(s)=maxa[R(s,a)+γsP(ss,a)V(s)],(T^* V)(s) = \max_a \left[\, R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \cdot V(s') \,\right],

where γ[0,1)\gamma \in [0, 1) is the discount factor. Under the sup-norm V=supsSV(s)\|V\|_\infty = \sup_{s \in \mathcal{S}} |V(s)|, the space of bounded value functions on a finite state space is complete, and TT^* is a contraction with constant γ\gamma:

TVTWγVW.\|T^* V - T^* W\|_\infty \leq \gamma \cdot \|V - W\|_\infty.

The proof is a short computation using the inequality maxaf(a)maxag(a)maxaf(a)g(a)|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)| and the fact that transition probabilities sum to 1. By Banach’s theorem, value iteration — the algorithm that repeatedly applies TT^* starting from any initial V0V_0 — converges to the unique optimal value function VV^*, at a geometric rate γn\gamma^n. This single fact is the reason value iteration works; it is also why a smaller discount γ\gamma (more aggressively short-sighted) gives faster convergence but different optimal policies. The viz below shows this concretely on a 3-state MDP.

Interactive demo: a 3-state MDP with two actions per state. C is the reward-rich absorbing-ish state; the optimal policy is "move toward C, then stay" (π*: A→move, B→move, C→stay). The Bellman optimality operator T* is a γ-contraction in the sup-norm, so Banach's theorem guarantees V_n → V* at the geometric rate γⁿ.
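The contraction inequality itself is easy to verify numerically — a sketch on a random MDP of our own construction (uniform random rewards, normalized random transitions):

```python
import numpy as np

# Check ||T*V - T*W||_inf <= gamma * ||V - W||_inf for the Bellman
# optimality operator on a random finite MDP.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)        # transition rows sum to 1
R = rng.random((S, A))

def bellman(V):
    # (T*V)(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    return np.max(R + gamma * P @ V, axis=1)

for _ in range(100):
    V, W = rng.normal(size=S), rng.normal(size=S)
    lhs = np.max(np.abs(bellman(V) - bellman(W)))
    rhs = gamma * np.max(np.abs(V - W))
    assert lhs <= rhs + 1e-12            # gamma-contraction in the sup-norm
print("contraction verified on 100 random pairs")
```

The key step in the analytic proof — |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)| — is exactly what the `np.max` in the operator preserves.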

📝 Example 15 (Lipschitz Continuity and Gradient Descent)

A function f:RnRf : \mathbb{R}^n \to \mathbb{R} has LL-Lipschitz gradient if f(x)f(y)Lxy\|\nabla f(x) - \nabla f(y)\| \leq L \cdot \|x - y\| for all x,yx, y — the gradient map is Lipschitz with constant LL. For such functions, the gradient descent map T(x)=xηf(x)T(x) = x - \eta \cdot \nabla f(x) has a Lipschitz constant of at most max(1ημ,1ηL)\max(|1 - \eta \mu|, |1 - \eta L|) on a μ\mu-strongly convex function, which is strictly less than 1 whenever 0<η<2/L0 < \eta < 2/L. So gradient descent is a contraction on strongly convex LL-smooth objectives, and the Banach theorem gives convergence to the unique minimizer at a geometric rate — this is the simplest known convergence proof for gradient descent. Much of convex optimization theory elaborates this idea: different step-size rules, accelerated methods, stochastic variants, and second-order methods all trade off against how tightly you can control the contraction constant.
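A sketch on an assumed toy problem — a diagonal quadratic with eigenvalues between μ = 1 and L = 10, using the classical step η = 2/(μ + L) — confirms the geometric contraction:

```python
import numpy as np

# f(x) = 1/2 x^T A x with eigenvalues in [mu, L]; minimizer is x* = 0.
# The GD map T(x) = x - eta * A x has Lipschitz constant
# max(|1 - eta*mu|, |1 - eta*L|), minimized by eta = 2/(mu + L).
mu, L = 1.0, 10.0
A = np.diag(np.linspace(mu, L, 4))             # eigenvalues 1, 4, 7, 10
eta = 2.0 / (mu + L)                           # classical optimal fixed step
k = max(abs(1 - eta * mu), abs(1 - eta * L))   # contraction constant = 9/11

x = np.array([3.0, -2.0, 1.0, 4.0])
d0 = np.linalg.norm(x)                         # initial distance to x* = 0
for n in range(1, 101):
    x = x - eta * A @ x                        # one gradient descent step
    assert np.linalg.norm(x) <= k**n * d0 + 1e-12
print(np.linalg.norm(x))                       # geometric decay, rate k = 9/11
```

With this step size the two endpoint terms balance, giving the familiar (L − μ)/(L + μ) rate of gradient descent on strongly convex quadratics.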

📝 Example 16 (Metric Learning and Embedding Spaces)

Modern neural networks routinely learn metrics rather than using a fixed one. Three examples:

  • Siamese networks learn an embedding ϕ:XRd\phi : X \to \mathbb{R}^d and use d(ϕ(x),ϕ(y))=ϕ(x)ϕ(y)2d(\phi(x), \phi(y)) = \|\phi(x) - \phi(y)\|_2 as a learned similarity. The training objective forces the embedding to place semantically similar inputs close together.
  • Triplet loss enforces d(ϕ(a),ϕ(p))+α<d(ϕ(a),ϕ(n))d(\phi(a), \phi(p)) + \alpha < d(\phi(a), \phi(n)) for anchor aa, positive example pp, and negative example nn. The triangle inequality of the embedding metric constrains which triplet configurations are even geometrically possible, which is why well-trained embeddings exhibit the “tripletable” structure we see in Word2Vec and BERT representations.
  • Contrastive learning (SimCLR, CLIP) implicitly learns a metric on the representation space where positive pairs are close and negative pairs are far, using cross-entropy loss as a surrogate for metric constraints.

In every case, the metric space axioms — positivity, symmetry, triangle inequality — are the constraints that well-behaved similarity scores must satisfy. A “distance” that violates symmetry (d(x,y)d(y,x)d(x, y) \neq d(y, x)) or the triangle inequality cannot be used with nearest-neighbor search acceleration structures like ball trees, and training losses that silently break the axioms often produce embedding spaces with weird pathologies.

💡 Remark 8 (k-NN and Metric Spaces)

The kk-nearest-neighbors algorithm is defined entirely in terms of a metric on the input space. Different metrics give different classifiers: Euclidean kk-NN, Manhattan kk-NN, Mahalanobis kk-NN, and cosine-similarity kk-NN all disagree on which points are “close” to a query. The triangle inequality is not just an axiom — it is what makes efficient nearest-neighbor search possible. Metric trees (ball trees, KD-trees, VP-trees) use the triangle inequality to prune branches of the search: if the distance from the query to a node’s center plus the node’s radius is larger than the best candidate found so far, the entire subtree can be skipped. Without the triangle inequality, no such pruning is correct, and nearest-neighbor search becomes linear in the dataset size. This is a concrete case where a mathematical axiom translates directly into algorithmic efficiency.
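A minimal one-level "ball tree" sketch (our own toy clustering, not a production structure) shows the pruning rule in action and why it stays exact:

```python
import numpy as np

# Triangle-inequality pruning: skip a cluster whenever
#   d(q, center) - radius > best_so_far,
# because then every point p in it has d(q, p) >= d(q, center) - radius.
rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 3))

# crude clustering: bucket points by nearest of 20 random "centers"
centers = data[rng.choice(len(data), 20, replace=False)]
labels = np.argmin(np.linalg.norm(data[:, None] - centers[None], axis=2), axis=1)
clusters = [(c, data[labels == i],
             np.max(np.linalg.norm(data[labels == i] - c, axis=1)))
            for i, c in enumerate(centers) if np.any(labels == i)]

def pruned_nn(q):
    best_d, checked = np.inf, 0
    # visit clusters nearest-center-first so pruning kicks in early
    for c, pts, r in sorted(clusters, key=lambda cl: np.linalg.norm(q - cl[0])):
        if np.linalg.norm(q - c) - r > best_d:
            continue                     # triangle inequality: skip whole cluster
        d = np.linalg.norm(pts - q, axis=1)
        checked += len(pts)
        best_d = min(best_d, d.min())
    return best_d, checked

q = rng.normal(size=3)
brute_d = np.min(np.linalg.norm(data - q, axis=1))
pruned_d, checked = pruned_nn(q)
print(pruned_d == brute_d, checked, "of", len(data))  # exact, fewer scans
```

The pruning is exact precisely because the metric axioms hold: the true nearest neighbor's cluster can never satisfy the skip condition, so it is always examined.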

Four-panel figure: Bellman iteration convergence, Lipschitz gradient descent trajectory, metric learning embedding separation, and k-NN decision boundaries under different metrics.

11. Computational Notes

Every concept in this topic has a short Python implementation. NumPy and SciPy have the basic machinery; the reference notebook spells out each one in detail.

p\ell^p distances in Rn\mathbb{R}^n are a single NumPy call:

import numpy as np

def lp_distance(x, y, p=2):
    return np.linalg.norm(x - y, ord=p)

# Euclidean (p=2), Manhattan (p=1), Chebyshev (p=inf)
lp_distance(np.array([1, 2, 3]), np.array([4, 6, 8]), p=2)

Fixed-point iteration is a for-loop with a convergence check. For a contraction TT:

def fixed_point_iterate(T, x0, tol=1e-10, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for n in range(max_iter):
        x_new = T(x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new, n + 1
        x = x_new
    return x, max_iter

The loop returns the iterate and the number of steps needed. Feed this a Picard operator, a Newton step, or a Bellman backup and the same skeleton applies.

Sup-norm distance between sampled functions is max(abs(...)) on a shared grid:

def sup_norm_distance(f_samples, g_samples):
    return np.max(np.abs(f_samples - g_samples))

x_grid = np.linspace(0, 1, 1000)
f = np.sin(2 * np.pi * x_grid)
g = np.sin(3 * np.pi * x_grid)
sup_norm_distance(f, g)

Pairwise distance matrices use scipy.spatial.distance.cdist, which supports every p\ell^p metric out of the box:

from scipy.spatial.distance import cdist

points = np.random.randn(100, 5)
D = cdist(points, points, metric='euclidean')  # or 'cityblock', 'chebyshev', ...

Value iteration on a small MDP is just repeated Bellman-optimality updates:

def value_iteration(P, R, gamma, tol=1e-10, max_iter=1000):
    # P: shape (S, A, S) transitions; R: shape (S, A) rewards
    S, A, _ = P.shape
    V = np.zeros(S)
    for n in range(max_iter):
        Q = R + gamma * P @ V  # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, n + 1
        V = V_new
    return V, max_iter

The convergence rate matches the theoretical γn\gamma^n bound — you can verify this by plotting logVnV\log \|V_n - V^*\|_\infty against nn and reading off the slope.
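That slope check can be scripted directly — a sketch on a random MDP of our own construction, fitting the slope of log‖Vₙ − V*‖∞ with np.polyfit:

```python
import numpy as np

# Fit the slope of log ||V_n - V*||_inf against n; Banach's theorem predicts
# a slope of about log(gamma).
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))

def bellman(V):
    return np.max(R + gamma * P @ V, axis=1)

# a near-exact V*: run the contraction far past any practical tolerance
V_star = np.zeros(S)
for _ in range(2000):
    V_star = bellman(V_star)

V, errs = np.zeros(S), []
for _ in range(60):
    V = bellman(V)
    errs.append(np.max(np.abs(V - V_star)))

slope = np.polyfit(np.arange(60), np.log(errs), 1)[0]
print(slope, np.log(gamma))   # slope is close to log(0.9) = -0.105
```

Reading the slope off a log-linear fit is exactly the visual check described above, done numerically instead of by eye.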

12. Summary and Key Results

The topic built a small library of theorems on top of the three metric-space axioms. Here is the proof-dependency structure in compact form.

| Result | Section | Key idea |
|---|---|---|
| Metric space axioms | 2 | Positivity, symmetry, triangle inequality |
| Three characterizations of continuity (Theorem 1) | 4 | ε–δ = sequential = preimage-of-open |
| Completeness of C([a, b]) (Theorem 2) | 5 | Uniform limit of continuous is continuous |
| Compactness equivalence (Theorem 3) | 6 | Compact ⟺ sequentially compact ⟺ complete + totally bounded |
| Banach Contraction Mapping Theorem (Theorem 4) | 7 | Geometric-series Cauchy bound + completeness |
| Arzelà–Ascoli Theorem (Theorem 5) | 9 | Diagonal subsequence + equicontinuity promotes to uniform |

The dependency chain is linear. The compactness equivalence (Theorem 3) gives us the tools to recognize compactness without unrolling open covers every time. The completeness of C([a, b]) (Theorem 2) is the missing ingredient for fixed-point arguments, and Theorem 2's proof depends on the Uniform Limit Theorem from Uniform Convergence. The Banach theorem (Theorem 4) depends on completeness. Arzelà–Ascoli (Theorem 5) depends on completeness and on the compactness equivalence. All of these depend on the three characterizations of continuity from Theorem 1.

One result unifies four earlier topics. The Banach Contraction Mapping Theorem gives a single proof that covers:

  • The Picard–Lindelöf existence theorem for first-order ODEs (Topic 21)
  • Newton iteration for implicit numerical methods (Topic 24)
  • The Inverse Function Theorem (Topic 12)
  • Value iteration in reinforcement learning (Section 10)

Before this topic, each of those was a separate argument using a specifically constructed self-map. After this topic, they are four corollaries of one theorem about complete metric spaces — and the proof you read in Section 7 is the proof of all four at once.

13. Track 8 Opening — The Road to Functional Analysis

This topic opens Track 8: Functional Analysis Essentials, a four-topic arc that adds successively more structure to metric spaces until we reach the framework behind modern machine learning optimization. Each topic adds one layer:

Topic 29 (this topic): Metric Spaces & Topology. The language of distance, open sets, completeness, and compactness in abstract spaces. The Banach Contraction Mapping Theorem unifies convergence arguments across the curriculum. The Arzelà–Ascoli theorem characterizes compact function families. Forward-looking: this is the geometric foundation.

Topic 30: Normed & Banach Spaces. Add algebraic structure — a vector-space operation and a norm that interacts with it. Bounded linear operators replace generic continuous maps; the three pillars of functional analysis (Open Mapping, Closed Graph, Uniform Boundedness) emerge as consequences of the Baire Category Theorem we deferred from this topic. The payoff: “metric space + vector space = analysis powerhouse.” Dual spaces of linear functionals live here, and the L^p spaces we have been quietly alluding to are Banach spaces — their theory gets its full treatment in Topic 30, not here.

Topic 31: Inner Product & Hilbert Spaces. Add geometric structure — an inner product that gives us angles, orthogonality, and projections. Orthogonal projections replace Banach-space linear functionals as the primary construction; the Riesz representation theorem identifies the dual of a Hilbert space with the space itself. This is the mathematical home of kernel methods, Reproducing Kernel Hilbert Spaces (RKHS), and Gaussian processes. Every symmetric positive-definite kernel in machine learning induces a Hilbert space; Topic 31 makes that construction rigorous.

Topic 32: Calculus of Variations. Optimization on function spaces. The Euler–Lagrange equations characterize critical points of functionals J : function space → ℝ, first and second variation generalize the gradient and Hessian, and the direct method of the calculus of variations gives existence theorems via compactness arguments (Arzelà–Ascoli and its Sobolev-space cousins). This is the framework behind physics-informed neural networks, optimal transport, variational autoencoders, and diffusion models — wherever ML optimizes over functions rather than parameters, calculus of variations is quietly at work.

The refinement sequence is metric → normed → inner product → optimization. Every Hilbert space is a Banach space; every Banach space is a metric space. The additional structure at each step buys additional theorems: metric spaces give us continuity and contraction arguments, normed spaces add linear operators and duality, inner product spaces add orthogonal projection, and the optimization layer turns projection into the Euler–Lagrange machinery.

Within formalCalculus — upcoming topics in this track:

  • Normed & Banach Spaces — Adds algebraic structure (vector space + norm) to metric spaces. The Baire Category Theorem, Open Mapping Theorem, Closed Graph Theorem, and Uniform Boundedness Principle — all deferred from this topic — are proved there.
  • Inner Product & Hilbert Spaces — Inner products, angles, orthogonality, the projection theorem, Riesz representation, RKHS construction, and kernel methods.
  • Calculus of Variations — The Euler–Lagrange equation, second variation, direct method for existence of minimizers via weak compactness, and compactness arguments (Arzelà–Ascoli in Sobolev spaces).

Forward links to formalml.com. Every topic in Track 8 has downstream machine-learning applications, and Topic 29 already touches several:

  • Gradient Descent — Lipschitz constants from Section 4 and the contraction argument from Section 7 give the cleanest convergence proof for gradient descent on strongly convex, L-smooth objectives.
  • Reinforcement Learning — The Bellman operator is a contraction in the sup-norm with constant γ, so value iteration converges to V* at rate γ^n. The BellmanIterationExplorer in Section 10 demonstrates this concretely on a 3-state MDP.
  • Metric Learning — Siamese networks, triplet loss, and contrastive learning all learn metrics on embedding spaces. The metric space axioms are exactly the constraints those training losses enforce.
  • Kernel Methods — Every positive-definite kernel induces a metric on its feature space. The RKHS norm is a metric-space distance, and Topic 31 will make the full construction rigorous.
  • Embedding Spaces — Word2Vec, BERT, and graph embeddings all live in metric spaces where Euclidean or cosine distance measures semantic proximity. The triangle inequality constrains the possible geometry of embeddings.
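The contraction claim in the Reinforcement Learning bullet is easy to test numerically. The sketch below (illustrative shapes and names) checks ‖TV − TW‖_∞ ≤ γ‖V − W‖_∞ for random pairs of value functions:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # rows of P[s, a, :] sum to 1
R = rng.random((S, A))

def T(V):
    # Bellman optimality operator:
    # (T V)(s) = max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) V(s') ]
    return (R + gamma * P @ V).max(axis=1)

# The sup-norm contraction bound holds exactly because each
# P[s, a, :] is a probability distribution and max is nonexpansive.
for _ in range(100):
    V, W = rng.standard_normal(S), rng.standard_normal(S)
    assert np.max(np.abs(T(V) - T(W))) <= gamma * np.max(np.abs(V - W)) + 1e-12
```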

References

  1. Munkres, J. R. (2000). Topology, 2nd ed. Chapters 2–3 (metric spaces, topological spaces, connectedness, compactness). The standard reference for general topology.
  2. Rudin, W. (1976). Principles of Mathematical Analysis, 3rd ed. Chapters 2–3 (basic topology, continuity). Clean treatment of metric-space compactness.
  3. Folland, G. B. (1999). Real Analysis: Modern Techniques and Their Applications, 2nd ed. Chapter 4 (point-set topology in analysis context). Good for the Arzelà–Ascoli proof.
  4. Kreyszig, E. (1978). Introductory Functional Analysis with Applications. Chapter 1 (metric spaces as foundation for functional analysis). Excellent pedagogy for the intermediate reader; Section 1.4 covers the contraction mapping theorem with worked examples.
  5. Stein, E. M. & Shakarchi, R. (2005). Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Chapter 7 (Hausdorff measure, fractals). Peripheral to Topic 29 but useful for the general topology bridge.