
Completeness & Compactness

The structural properties of ℝ that guarantee limits exist and optima are attained — why regularization works and when your optimization problem has a solution

Abstract. The completeness of ℝ — every nonempty set bounded above has a least upper bound — is the axiom that separates the real numbers from the rationals and makes calculus possible. This single property implies the Monotone Convergence Theorem, the Bolzano-Weierstrass Theorem, and the Cauchy completeness that guarantees iterative algorithms converge to something. Compactness, the other structural pillar, characterizes the sets K ⊆ ℝ that are closed and bounded (Heine-Borel); equivalently, every sequence in a compact K has a subsequence converging to a point in K. Compactness is the topological property behind the Extreme Value Theorem: continuous functions on compact sets attain their maximum and minimum, which is exactly the existence guarantee for optimization problems like min_{θ ∈ Θ} L(θ) when Θ is compact. In machine learning, neural network parameter spaces are ℝ^d — decidedly not compact. Regularization techniques (L2 weight decay constraining ‖θ‖₂ ≤ R, L1 lasso constraining ‖θ‖₁ ≤ R, gradient clipping, spectral normalization) effectively restrict parameters to compact subsets, restoring the existence guarantees that compactness provides. Coercive loss functions (L(θ) → ∞ as ‖θ‖ → ∞) offer an alternative: their sublevel sets are compact, so minimizers exist even on unbounded domains.

Where this leads → formalML

  • formalML: The Weierstrass extreme value theorem — continuous functions on compact sets attain their extrema — is the foundational existence result for convex optimization. Compactness of sublevel sets for coercive convex functions guarantees minimizers exist even on unbounded domains.
  • formalML: Tightness of probability measures (Prokhorov's theorem) is a compactness condition: a family of measures is tight if for every ε > 0, there exists a compact set K with μ(K) > 1 - ε for all μ in the family. This is used in convergence-in-distribution proofs.
  • formalML: The existence of minimizers for regularized objectives (L(θ) + λ‖θ‖²) follows from compactness arguments: the regularization term makes the effective objective coercive, so sublevel sets are compact, and the EVT guarantees a minimizer exists.

Overview & Motivation

Suppose you’re training a neural network and the loss decreases at every step: \mathcal{L}(\theta_1) > \mathcal{L}(\theta_2) > \mathcal{L}(\theta_3) > \cdots. If the loss is bounded below (as most losses are), this is a bounded, decreasing sequence of real numbers. Does it converge? And does there actually exist a parameter vector \theta^* that achieves the infimum of the loss? The first question is about completeness; the second is about compactness. These are the two structural properties of \mathbb{R} that make calculus — and optimization — work.

In Sequences, Limits & Convergence, we used completeness when we proved the Monotone Convergence Theorem and the Bolzano-Weierstrass Theorem. In Epsilon-Delta & Continuity, we used compactness (implicitly) when we proved the Extreme Value Theorem and the Heine-Cantor Theorem. Now we develop the theory explicitly. We will see that a single axiom — the least upper bound property — is the foundation for all of these results, and that compactness is the structural property that makes continuous optimization well-posed.

This is where the abstraction level increases. In the first two topics, we worked with individual sequences and functions. Here we reason about structural properties of the real line itself — asking not “does this particular sequence converge?” but “what property of \mathbb{R} guarantees that every bounded monotone sequence converges?”

Completeness of ℝ — The Least Upper Bound Property

Why \mathbb{Q} is not enough

Consider the set S = \{q \in \mathbb{Q} : q^2 < 2\}. This set is nonempty (1 \in S) and bounded above (2 is an upper bound). Does S have a least upper bound in \mathbb{Q}? If it did, that least upper bound would have to be \sqrt{2} — but \sqrt{2} \notin \mathbb{Q}. The rationals have a “hole” right where the supremum should be.

This is not an isolated curiosity. The rationals are riddled with such holes — one for every irrational number. A number system with holes cannot support calculus: limits might not exist, monotone sequences might not converge, and continuous functions might not attain their extrema. The real numbers are constructed precisely to fill these holes.
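A quick numerical sketch of the hole (illustrative only, using truncated decimal expansions of \sqrt{2}): each truncation is a rational number in S, the truncations increase and are bounded above, yet their supremum is irrational.

```python
from fractions import Fraction
import math

# Truncated decimal expansions of sqrt(2): rational numbers q_n with q_n^2 < 2.
# They increase and are bounded above, but their supremum, sqrt(2), is not
# rational -- the "hole" in Q.
q = [Fraction(int(math.sqrt(2) * 10**n), 10**n) for n in range(1, 12)]

for qn in q[:4]:
    print(qn, float(qn))
assert all(qn * qn < 2 for qn in q)           # each q_n is in S
assert all(a <= b for a, b in zip(q, q[1:]))  # increasing
print("gap 2 - q_n^2:", [float(2 - qn * qn) for qn in q[-3:]])  # shrinks toward 0
```

The sequence is Cauchy in \mathbb{Q}, but its only candidate limit is \sqrt{2} \notin \mathbb{Q}.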

📐 Definition 1 (Upper Bound and Supremum)

Let S \subseteq \mathbb{R} be nonempty. A number M \in \mathbb{R} is an upper bound of S if s \leq M for all s \in S. The supremum (least upper bound) \sup S is an upper bound of S with the property that if M is any other upper bound of S, then \sup S \leq M.

Dually, m is a lower bound of S if m \leq s for all s \in S, and the infimum \inf S is the greatest lower bound.

The supremum has a crucial characterization: \sup S is the unique number \alpha such that (i) \alpha is an upper bound of S, and (ii) for every \varepsilon > 0, there exists s \in S with s > \alpha - \varepsilon. This “approximation property” — you can always find elements of S arbitrarily close to \sup S — is what makes the supremum useful for proofs.
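The two conditions can be checked numerically on a concrete set (a sketch on a finite sample; the set S = \{1 - 1/n : n \geq 1\} and its supremum 1 are assumptions of this illustration):

```python
# Approximation property of the supremum for S = {1 - 1/n : n = 1, 2, ...},
# whose supremum is 1 (known analytically; any finite sample stays below it).
S = [1 - 1/n for n in range(1, 10_000)]
sup_S = 1.0

for eps in [0.5, 0.1, 0.01, 0.001]:
    witness = next(s for s in S if s > sup_S - eps)  # (ii): some s in S exceeds sup - eps
    print(f"eps={eps}: found s = {witness:.6f} > {sup_S - eps}")

assert all(s <= sup_S for s in S)  # (i): sup S is an upper bound
```

No element of S ever equals \sup S here — the supremum of a set need not belong to the set.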

🔷 Theorem (Axiom — Completeness of ℝ (LUB Property))

Every nonempty subset of \mathbb{R} that is bounded above has a supremum in \mathbb{R}.

This is not a theorem we prove — it is the axiom that defines the real numbers. Every construction of \mathbb{R} (Dedekind cuts, Cauchy sequence completion of \mathbb{Q}) is designed to make this axiom true. And from this single axiom, the entire structure of calculus follows.

The completeness of ℝ and equivalent formulations

🔷 Theorem 1 (Equivalence of Completeness Formulations)

The following are equivalent:

(a) The LUB Property: Every nonempty subset of \mathbb{R} bounded above has a supremum in \mathbb{R}.

(b) Monotone Convergence Theorem: Every bounded monotone sequence in \mathbb{R} converges.

(c) Nested Interval Property: If [a_1, b_1] \supseteq [a_2, b_2] \supseteq \cdots is a nested sequence of closed bounded intervals with b_n - a_n \to 0, then \bigcap_{n=1}^{\infty} [a_n, b_n] contains exactly one point.

(d) Cauchy Completeness: Every Cauchy sequence in \mathbb{R} converges to a point in \mathbb{R}.

Proof.

We sketch the cycle of implications.

(a) ⟹ (b): Let (a_n) be bounded and increasing. The set \{a_n\} is bounded above, so L = \sup\{a_n\} exists by the LUB property. For every \varepsilon > 0, the approximation property gives a_N > L - \varepsilon for some N. Since (a_n) is increasing, a_n \geq a_N > L - \varepsilon for all n \geq N. And a_n \leq L since L is an upper bound. So |a_n - L| < \varepsilon for all n \geq N. This was proved as Theorem 1 in Sequences, Limits & Convergence.

(b) ⟹ (c): The sequence (a_n) is increasing and bounded above (by b_1), so a_n \to a by MCT. Similarly, (b_n) is decreasing and bounded below (by a_1), so b_n \to b. Since a_n \leq b_n for all n, we have a \leq b. If b_n - a_n \to 0, then b = a, and \{a\} = \bigcap [a_n, b_n].

(c) ⟹ Bolzano-Weierstrass: Given a bounded sequence in [a, b], repeatedly bisect: at least one half contains infinitely many terms. The Nested Interval Property guarantees the bisected intervals converge to a point, which is the limit of the extracted subsequence. This was the strategy behind Theorem 4 in Sequences, Limits & Convergence.

Bolzano-Weierstrass ⟹ (d): Every Cauchy sequence is bounded (proved in Sequences, Limits & Convergence, Theorem 5). Bolzano-Weierstrass gives a convergent subsequence a_{n_k} \to L. A Cauchy sequence with a convergent subsequence converges to the same limit: for \varepsilon > 0, choose N so that |a_m - a_n| < \varepsilon/2 for m, n \geq N and |a_{n_k} - L| < \varepsilon/2 for n_k \geq N. Then for any n \geq N, picking some n_k \geq N, |a_n - L| \leq |a_n - a_{n_k}| + |a_{n_k} - L| < \varepsilon.

(d) ⟹ (a): Let S be nonempty and bounded above. We construct a Cauchy sequence converging to \sup S. Start with any upper bound b_0 and any s_0 \in S. Bisect [s_0, b_0]: if the midpoint is an upper bound of S, take the left half; otherwise, the right half. This produces intervals of width (b_0 - s_0)/2^n \to 0, and the endpoints form Cauchy sequences. By Cauchy completeness, they converge to some \alpha \in \mathbb{R}, which we verify is \sup S.

💡 Remark 1

In machine learning, completeness ensures that iterative algorithms have a limit to converge to. When gradient descent produces a Cauchy sequence of parameters \theta_t — meaning \|\theta_m - \theta_n\| gets arbitrarily small for all large m, n, which is stronger than the successive updates \|\theta_{t+1} - \theta_t\| merely shrinking — completeness guarantees convergence to some \theta^* \in \mathbb{R}^d. Without completeness, the “limit” might not exist, like \sqrt{2} in \mathbb{Q}.
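A toy illustration of the distinction (the quadratic loss and step size are assumptions of this sketch): gradient descent on a strongly convex quadratic contracts geometrically, so its iterates are Cauchy; partial sums of the harmonic series have vanishing steps yet drift without bound.

```python
# Gradient descent on f(θ) = θ²/2, gradient ∇f(θ) = θ: iterates θ_t = (1-η)^t θ_0
# contract geometrically, so they form a Cauchy sequence and converge in R.
eta, theta = 0.1, 5.0
iterates = [theta]
for _ in range(200):
    theta -= eta * theta               # θ_{t+1} = θ_t − η ∇f(θ_t)
    iterates.append(theta)

steps = [abs(b - a) for a, b in zip(iterates, iterates[1:])]
print("last step size:", steps[-1])                        # successive updates shrink
print("tail spread:", abs(iterates[-1] - iterates[-50]))   # Cauchy: the whole tail is tight

# Contrast: harmonic partial sums have steps 1/n → 0 but are NOT Cauchy.
harmonic = [sum(1/k for k in range(1, n)) for n in range(2, 2000)]
print("harmonic step ≈", 1/1999, "but tail spread ≈", harmonic[-1] - harmonic[100])
```

Small individual steps alone do not certify convergence; the Cauchy property does, and completeness then supplies the limit.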

The Nested Interval Property

The equivalence theorem tells us the Nested Interval Property follows from completeness. But the NIP deserves a standalone treatment because it is the key proof technique for many existence theorems — and it is the theoretical backbone of bisection algorithms.

🔷 Theorem 2 (Nested Interval Property)

If [a_1, b_1] \supseteq [a_2, b_2] \supseteq \cdots is a nested sequence of closed, bounded intervals, then \bigcap_{n=1}^{\infty} [a_n, b_n] \neq \emptyset. If additionally b_n - a_n \to 0, then the intersection contains exactly one point.

Proof.

Let a = \sup\{a_n\}, which exists by the LUB property since the a_n are bounded above by b_1. Let b = \inf\{b_n\}, which exists since the b_n are bounded below by a_1.

We claim a \leq b. For any n, m, the nesting gives a_n \leq b_m (if n \leq m, then a_n \leq a_m \leq b_m; if m \leq n, then a_n \leq b_n \leq b_m). So every b_m is an upper bound for \{a_n\}, hence a = \sup\{a_n\} \leq b_m for all m, hence a \leq \inf\{b_m\} = b.

Now a \in [a_n, b_n] for all n (because a_n \leq a and a \leq b \leq b_n), so a \in \bigcap [a_n, b_n] \neq \emptyset.

If b_n - a_n \to 0, then 0 \leq b - a \leq b_n - a_n \to 0, so b = a and the intersection is \{a\}.

[Interactive bisection visualization — starting interval [1.00000000, 2.00000000], width 1.00e+0]

💡 Remark 2

The Nested Interval Property is the theoretical backbone of bisection-based algorithms. Every time you halve a search interval and keep the half containing your target, you are applying the NIP. Binary search, the bisection root-finding method from Epsilon-Delta & Continuity, and branch-and-bound optimization algorithms are all NIP in action. The visualization above lets you see this convergence directly — watch the interval width shrink exponentially while the endpoints close in on the target.

Open and Closed Sets

Before defining compactness, we need precise vocabulary for the topology of \mathbb{R}. The distinction between open and closed sets is not just a formality — it determines whether limits of sequences “stay inside” a set, which is exactly the property that makes compactness work.

📐 Definition 2 (Open Set)

A set S \subseteq \mathbb{R} is open if for every x \in S, there exists r > 0 such that the open ball (x - r, x + r) \subseteq S. In other words, every point of S has a “buffer zone” of neighboring points that also belong to S.

📐 Definition 3 (Closed Set)

A set S \subseteq \mathbb{R} is closed if its complement \mathbb{R} \setminus S is open. Equivalently, S is closed if and only if it contains all its limit points: whenever (x_n) is a sequence in S with x_n \to L, then L \in S.

The sequential characterization is often more useful: a closed set is one that “keeps its limits.”

Open and closed sets in ℝ

🔷 Proposition 1 (Properties of Open and Closed Sets)

(a) Arbitrary unions of open sets are open.

(b) Finite intersections of open sets are open.

(c) Arbitrary intersections of closed sets are closed.

(d) Finite unions of closed sets are closed.

Proof.

(a): If x \in \bigcup_\alpha S_\alpha, then x \in S_\beta for some \beta. Since S_\beta is open, there exists r > 0 with (x - r, x + r) \subseteq S_\beta \subseteq \bigcup_\alpha S_\alpha.

(b): If x \in \bigcap_{i=1}^n S_i, then for each i, there exists r_i > 0 with (x - r_i, x + r_i) \subseteq S_i. Take r = \min(r_1, \ldots, r_n) > 0 (this is where finiteness is essential — an infimum of infinitely many positive numbers could be zero). Then (x - r, x + r) \subseteq \bigcap_{i=1}^n S_i.

(c)–(d): Follow from (a)–(b) by De Morgan’s laws: \mathbb{R} \setminus \bigcap_\alpha S_\alpha = \bigcup_\alpha (\mathbb{R} \setminus S_\alpha) and \mathbb{R} \setminus \bigcup_{i=1}^n S_i = \bigcap_{i=1}^n (\mathbb{R} \setminus S_i).
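The failure of (b) for infinite intersections can be seen concretely with the standard example \bigcap_n (-1/n, 1/n) = \{0\}, which is not open — a small numerical check (finite truncation of the intersection, so only suggestive):

```python
# Why finiteness matters in (b): the open intervals (-1/n, 1/n) intersect
# to {0}, which is not open. Every neighbor of 0 is eventually ejected.
def in_intersection(x, N):
    """Is x in the intersection of (-1/n, 1/n) for n = 1..N?"""
    return all(-1/n < x < 1/n for n in range(1, N + 1))

print(in_intersection(0.0, 10**6))   # 0 survives every intersection
for r in [0.1, 0.01, 0.001]:
    # r/2 lies in (-r, r), yet fails once 1/n < r/2:
    print(r/2, in_intersection(r/2, 10**6))
```

No r > 0 gives (-r, r) \subseteq \{0\}, so the intersection has empty interior.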

📐 Definition 4 (Interior, Boundary, Closure)

For S \subseteq \mathbb{R}:

  • The interior \mathrm{int}(S) is the largest open set contained in S: the set of all x \in S for which some (x - r, x + r) \subseteq S.
  • The closure \overline{S} is the smallest closed set containing S: equivalently, S together with all its limit points.
  • The boundary \partial S = \overline{S} \setminus \mathrm{int}(S): the set of points where every neighborhood meets both S and \mathbb{R} \setminus S.

📝 Example 1

For S = [0, 1):

  • \mathrm{int}(S) = (0, 1) — the point 0 has no neighborhood contained in S (every (-r, r) contains negative numbers not in S).
  • \overline{S} = [0, 1] — the point 1 is a limit point of S (the sequence 1 - 1/n \to 1) but 1 \notin S.
  • \partial S = \{0, 1\}.

The set S = [0, 1) is neither open (0 is a boundary point in S with no interior buffer) nor closed (1 is a limit point not in S).

Sequential Compactness

The Bolzano-Weierstrass Theorem from Sequences, Limits & Convergence says every bounded sequence in \mathbb{R} has a convergent subsequence. But where does that subsequence converge to? If the original sequence lives in [0, 1], the limit is guaranteed to also be in [0, 1] — because [0, 1] is closed. If the sequence lives in (0, 1), the limit might escape to 0 or 1. This is the difference between compact and non-compact.

📐 Definition 5 (Sequential Compactness)

A set K \subseteq \mathbb{R} is sequentially compact if every sequence (x_n) in K has a subsequence (x_{n_k}) that converges to a point L \in K.

The key requirement is that the limit L belongs to K — not just that a convergent subsequence exists, but that the subsequence converges to a point inside the set.

🔷 Proposition 2 (Compact ⟺ Closed + Bounded in ℝ)

A subset K \subseteq \mathbb{R} is sequentially compact if and only if K is closed and bounded.

Proof.

(⟸) Closed + bounded implies compact: If K is bounded, then K \subseteq [-M, M] for some M. By the Bolzano-Weierstrass Theorem, any sequence (x_n) in K has a convergent subsequence x_{n_k} \to L. If K is closed, then L \in K (closed sets contain their limit points). So K is sequentially compact.

(⟹) Compact implies closed: Suppose K is not closed. Then there exists a sequence (x_n) in K with x_n \to L and L \notin K. Every subsequence of (x_n) also converges to L (a fact from Sequences, Limits & Convergence). Since L \notin K, no subsequence converges to a point in K — contradicting sequential compactness.

(⟹) Compact implies bounded: Suppose K is unbounded. Then for each n, there exists x_n \in K with |x_n| > n. The sequence (x_n) has |x_n| \to \infty, so no subsequence can converge (convergent sequences are bounded). This contradicts sequential compactness.

Sequential compactness: compact vs. non-compact sets


💡 Remark 3

The characterization “closed + bounded ⟺ compact” is specific to \mathbb{R}^n — this is the Heine-Borel theorem we prove in the next section. In general metric spaces, compactness is a strictly stronger property than “closed + bounded.” For example, in the space of continuous functions C([0,1]) with the supremum norm, the closed unit ball \{f : \|f\|_\infty \leq 1\} is closed and bounded but not compact — you can find a sequence of functions in it with no convergent subsequence. This distinction matters when working with function spaces in ML, and we develop it in Metric Spaces & Topology, where we prove that the sequence f_n(x) = \sin(n\pi x) has all pairwise sup-norm distances \geq 1, so no subsequence is Cauchy and no subsequence converges uniformly.
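The pairwise-distance claim can be estimated numerically (grid approximation, so the computed sup-norms are lower bounds on the true ones; the range n ≤ 8 is an arbitrary choice for the sketch):

```python
import numpy as np

# Pairwise sup-norm distances between f_n(x) = sin(nπx) on [0, 1], estimated
# on a fine grid. The distances never get small, so no subsequence of (f_n)
# can be Cauchy in C([0,1]) -- the closed unit ball is not compact.
x = np.linspace(0, 1, 20001)
F = {n: np.sin(n * np.pi * x) for n in range(1, 9)}

dists = [np.max(np.abs(F[n] - F[m])) for n in range(1, 9) for m in range(n + 1, 9)]
print(f"min pairwise sup distance ≈ {min(dists):.3f}")
print(f"max pairwise sup distance ≈ {max(dists):.3f}")
```

Every pair stays sup-norm distance at least about 1 apart, even though each f_n lies in the closed, bounded unit ball.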

Compactness — The Open Cover Definition

Sequential compactness is intuitive: “every sequence has a convergent subsequence staying in the set.” There is an equivalent definition using open covers that is more powerful for proofs, especially when we generalize beyond \mathbb{R}^n.

📐 Definition 6 (Open Cover)

An open cover of K \subseteq \mathbb{R} is a collection \{U_\alpha\}_{\alpha \in A} of open sets such that K \subseteq \bigcup_{\alpha \in A} U_\alpha. That is, every point of K belongs to at least one of the U_\alpha.

A subcover is a subcollection that still covers KK. A finite subcover is a subcover with finitely many sets.

📐 Definition 7 (Compactness via Open Covers)

A set K is compact if every open cover of K has a finite subcover: for any collection of open sets \{U_\alpha\} covering K, there exist finitely many U_{\alpha_1}, \ldots, U_{\alpha_n} such that K \subseteq U_{\alpha_1} \cup \cdots \cup U_{\alpha_n}.

🔷 Theorem 3 (Heine-Borel Theorem)

A subset K \subseteq \mathbb{R}^n is compact (every open cover has a finite subcover) if and only if K is closed and bounded.

In \mathbb{R}, this means the compact sets are precisely the closed, bounded intervals [a, b] and their closed subsets.

Proof.

We prove the theorem for \mathbb{R}; the \mathbb{R}^n case follows by the same argument applied coordinate-wise.

(⟸) Closed and bounded implies compact: It suffices to treat K = [a, b]: a general closed, bounded K sits inside some [a, b], and adjoining the open set \mathbb{R} \setminus K to any cover of K gives a cover of [a, b]; discarding \mathbb{R} \setminus K from a finite subcover of [a, b] leaves a finite subcover of K. So suppose K = [a, b] and let \{U_\alpha\} be any open cover. Define

S = \{x \in [a, b] : [a, x] \text{ can be covered by finitely many } U_\alpha\}.

S is nonempty: a \in U_\beta for some \beta, and since U_\beta is open, some interval [a, a + r) \subseteq U_\beta, so a \in S.

S is bounded above by b. Let c = \sup S, which exists by the completeness of \mathbb{R}. We claim c = b.

Since c \in [a, b] and \{U_\alpha\} covers [a, b], there exists U_\gamma with c \in U_\gamma. Since U_\gamma is open, there exists \delta > 0 with (c - \delta, c + \delta) \subseteq U_\gamma.

Since c = \sup S, there exists x \in S with x > c - \delta. Then [a, x] is covered by finitely many U_\alpha, and [x, \min(c + \delta/2, b)] \subseteq (c - \delta, c + \delta) \subseteq U_\gamma. So [a, \min(c + \delta/2, b)] is covered by finitely many U_\alpha.

If c < b, then \min(c + \delta/2, b) > c, contradicting c = \sup S. Therefore c = b, and [a, b] has a finite subcover.

(⟹) Compact implies bounded: Cover K by \{(-n, n) : n \in \mathbb{N}\}. A finite subcover gives K \subseteq (-N, N) for some N, so K is bounded.

(⟹) Compact implies closed: Suppose x \notin K. For each y \in K, let r_y = |x - y|/2 > 0. The intervals (y - r_y, y + r_y) cover K. Extract a finite subcover: K \subseteq \bigcup_{i=1}^n (y_i - r_{y_i}, y_i + r_{y_i}). Let r = \min(r_{y_1}, \ldots, r_{y_n}) > 0. Then (x - r, x + r) \cap K = \emptyset, so x is an interior point of \mathbb{R} \setminus K. Since every point of \mathbb{R} \setminus K is interior, \mathbb{R} \setminus K is open, hence K is closed.

Heine-Borel: finite subcovers on compact sets

📝 Example 2

Consider the open cover \{(1/n, 1) : n \geq 2\} of the open interval (0, 1). Each point x \in (0, 1) lies in (1/n, 1) for n > 1/x, so this is indeed an open cover.

But no finite subcollection works. If we select (1/n_1, 1), \ldots, (1/n_k, 1), the union is (1/N, 1) where N = \max(n_1, \ldots, n_k), which misses points in (0, 1/N]. This proves (0, 1) is not compact.

Contrast with [0, 1]: any open cover must include some U_\alpha containing 0 and some U_\beta containing 1, and the Heine-Borel proof guarantees a finite subcover exists.
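The failure of finite subcovers can be checked directly (the particular finite selection below is arbitrary, chosen for the sketch):

```python
# Any finite selection from the cover {(1/n, 1) : n ≥ 2} of (0, 1) has
# union (1/N, 1) with N = max of the chosen n's -- it misses (0, 1/N].
selection = [2, 5, 17, 100]            # an arbitrary finite subcollection
N = max(selection)
uncovered = 1 / (2 * N)                # a point of (0, 1) inside (0, 1/N]

covered = any(1/n < uncovered < 1 for n in selection)
print(f"point {uncovered} covered by the finite selection: {covered}")
# The full (infinite) family does cover it -- e.g. the interval (1/(4N), 1):
print(f"covered by the full family: {1/(4*N) < uncovered < 1}")
```

However large the finite selection, halving 1/N always produces an uncovered point — exactly the non-compactness of (0, 1).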


Continuous Functions on Compact Sets

This is where compactness meets continuity — producing the power theorems of analysis. In Epsilon-Delta & Continuity, we proved the Extreme Value Theorem and the Heine-Cantor Theorem using direct arguments. Here we reprove both as consequences of compactness, revealing the structural reason they hold: the essential ingredient is not the specific argument, but the compactness of the domain.

🔶 Lemma 1 (Continuous Image of a Compact Set)

If f: K \to \mathbb{R} is continuous and K is compact, then f(K) = \{f(x) : x \in K\} is compact.

Proof.

Let (y_n) be a sequence in f(K). Then y_n = f(x_n) for some x_n \in K. By compactness of K, there exists a subsequence x_{n_k} \to x^* \in K. By continuity of f, f(x_{n_k}) \to f(x^*) \in f(K).

So every sequence in f(K) has a subsequence converging to a point in f(K). Therefore f(K) is compact.

🔷 Theorem 4 (Extreme Value Theorem (Compactness Proof))

If f: K \to \mathbb{R} is continuous and K is compact, then f attains its maximum and minimum on K: there exist x_{\max}, x_{\min} \in K with f(x_{\min}) \leq f(x) \leq f(x_{\max}) for all x \in K.

Proof.

By Lemma 1, f(K) is compact, hence closed and bounded by Heine-Borel. Since f(K) is bounded, M = \sup f(K) exists. Since f(K) is closed and M is a limit point of f(K) (by the approximation property of the supremum: there exist y_n \in f(K) with y_n \to M), we have M \in f(K). So M = f(x_{\max}) for some x_{\max} \in K.

The argument for the minimum is identical with \inf replacing \sup.

Compare this with the proof in Epsilon-Delta & Continuity (Theorem 4): the compactness proof is cleaner because it identifies the essential structural ingredient. The continuous image of a compact set is compact — that is the entire story.
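A small grid-based sketch of the contrast (the functions f(x) = x(1-x) and g(x) = x are illustrative choices, and a finite grid only approximates the true extrema):

```python
import numpy as np

# EVT in action: on the compact [0, 1], f(x) = x(1 - x) attains its maximum.
x = np.linspace(0, 1, 100001)
f = x * (1 - x)
print(f"max of f on [0,1] ≈ {f.max():.6f} at x ≈ {x[f.argmax()]:.4f}")

# On the non-compact (0, 1), g(x) = x has infimum 0, attained by no point:
# every grid point strictly inside the interval can be beaten by a smaller one.
x_open = x[1:-1]
print(f"smallest grid value of g on (0,1): {x_open.min():.6f} — positive, never 0")
```

The maximum of f is genuinely attained (at x = 1/2); the infimum of g on (0, 1) is approached but never reached, because the would-be minimizer 0 lies outside the set.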

🔷 Theorem 5 (Heine-Cantor Theorem (Compactness Proof))

If f: K \to \mathbb{R} is continuous and K is compact, then f is uniformly continuous on K: for every \varepsilon > 0, there exists \delta > 0 such that |x - y| < \delta \implies |f(x) - f(y)| < \varepsilon for all x, y \in K.

Proof.

Suppose f is not uniformly continuous. Then there exists \varepsilon > 0 such that for every \delta = 1/n, there exist x_n, y_n \in K with |x_n - y_n| < 1/n but |f(x_n) - f(y_n)| \geq \varepsilon.

By compactness of K, the sequence (x_n) has a subsequence x_{n_k} \to x^* \in K. Since |x_{n_k} - y_{n_k}| < 1/n_k \to 0, we also have y_{n_k} \to x^*.

By continuity of f at x^*: f(x_{n_k}) \to f(x^*) and f(y_{n_k}) \to f(x^*), so |f(x_{n_k}) - f(y_{n_k})| \to 0. But |f(x_{n_k}) - f(y_{n_k})| \geq \varepsilon > 0 for all k — contradiction.
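The role of compactness can be probed numerically with the classic non-example f(x) = 1/x, continuous on the non-compact (0, 1] but not uniformly continuous there (the specific δ and sample points are assumptions of the sketch):

```python
import numpy as np

# f(x) = 1/x on (0, 1]: fix |x - y| = δ and watch |f(x) - f(y)| blow up near 0.
f = lambda x: 1 / x
delta = 1e-3
for x0 in [0.5, 0.1, 0.01, 0.002]:
    print(f"x = {x0}: |f(x) - f(x+δ)| = {abs(f(x0) - f(x0 + delta)):.2f}")

# On the compact [0.1, 1], Heine-Cantor applies: the same δ gives a uniform bound.
xs = np.linspace(0.1, 1, 10001)
print("sup over [0.1, 1]:", np.max(np.abs(f(xs) - f(xs + delta))))
```

No single δ works across all of (0, 1] — the bad pairs escape toward the missing endpoint 0, exactly the escape that compactness forbids.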

Continuous functions on compact sets: EVT and Heine-Cantor

💡 Remark 4

The Heine-Cantor proof by contradiction is a paradigm for compactness arguments: assume the negation, extract a sequence, use compactness to get a convergent subsequence, and reach a contradiction via continuity. This pattern — sometimes called the “compactness argument” — appears throughout analysis and optimization theory. Whenever you see a proof that constructs a sequence and extracts a convergent subsequence, compactness is at work.

Coercivity — Compactness Without Boundedness

In practice, parameter spaces are \mathbb{R}^d — not compact (unbounded). So the EVT does not apply directly: \min_{\theta \in \mathbb{R}^d} \mathcal{L}(\theta) might not be attained. This is not an abstract concern — the function f(x) = e^{-x} on \mathbb{R} has infimum 0 but never reaches it.

Coercivity provides an elegant substitute: if the function “blows up” at infinity, the optimization effectively takes place on a compact sublevel set.

📐 Definition 8 (Coercive Function)

A function f: \mathbb{R}^d \to \mathbb{R} is coercive if \lim_{\|x\| \to \infty} f(x) = +\infty. Equivalently, for every c \in \mathbb{R}, the sublevel set \{x \in \mathbb{R}^d : f(x) \leq c\} is bounded.

🔷 Theorem 6 (Weierstrass Theorem for Coercive Functions)

If f: \mathbb{R}^d \to \mathbb{R} is continuous and coercive, then f attains its minimum: there exists x^* \in \mathbb{R}^d with f(x^*) = \inf_{x \in \mathbb{R}^d} f(x).

Proof.

Let m = \inf_{x \in \mathbb{R}^d} f(x), which a priori could be -\infty.

Choose a minimizing sequence (x_n) with f(x_n) \to m. Since m \leq f(x_1), we have f(x_n) \leq f(x_1) + 1 for all sufficiently large n. The sublevel set S = \{x : f(x) \leq f(x_1) + 1\} is:

  • Bounded, by coercivity of f.
  • Closed, by continuity of f (preimage of the closed set (-\infty, f(x_1) + 1]).

So S is compact by the Heine-Borel theorem, and the tail of (x_n) lies in S. By compactness, extract a subsequence x_{n_k} \to x^* \in S. By continuity, f(x^*) = \lim f(x_{n_k}) = m — in particular m is finite, and it is attained at x^*.

Compactness in optimization: compact domain vs. coercive function

💡 Remark 5

Coercivity is why L_2 regularization guarantees a unique minimizer for convex losses. The regularized loss \mathcal{L}(\theta) + \lambda\|\theta\|_2^2 is coercive for any \lambda > 0: a convex \mathcal{L} is bounded below by an affine function, and the quadratic penalty dominates any affine decrease as \|\theta\| \to \infty. The sublevel sets are then compact (closed and bounded), so the Weierstrass theorem guarantees a minimizer exists — and the strong convexity contributed by the penalty makes it unique. Without regularization, the loss might decrease asymptotically — approaching its infimum along a sequence \theta_n with \|\theta_n\| \to \infty — without ever being attained.

Connections to Machine Learning

Regularization as compactification

Regularization creates compact constraint sets

The unconstrained parameter space \mathbb{R}^d is not compact. Regularization techniques create compact subsets, restoring the EVT guarantee that minimizers exist:

  • L_2 weight decay constrains \|\theta\|_2 \leq R, a closed ball in \mathbb{R}^d — compact by Heine-Borel.
  • L_1 lasso constrains \|\theta\|_1 \leq R, a closed “diamond” (cross-polytope) — also compact.
  • Elastic net combines both: \alpha\|\theta\|_1 + (1-\alpha)\|\theta\|_2 \leq R — again a closed, bounded set, hence compact.
  • Weight clipping (|\theta_i| \leq C for all i) constrains parameters to a closed hypercube — compact.
  • Spectral normalization constrains the spectral norm (largest singular value) of weight matrices, creating a closed, bounded — hence compact — set of matrices.

Each technique replaces optimization over an unbounded parameter space with optimization over a compact set, where the existence of minimizers is guaranteed by the EVT. Regularization is not just for generalization — it is for existence. See Convex Analysis on formalML for the full treatment.

Existence of optimal parameters

Without compactness or coercivity, \inf_\theta \mathcal{L}(\theta) may not be attained. Consider a simple example: \mathcal{L}(\theta) = e^{-\theta^2} on \mathbb{R}. The infimum is 0, but \mathcal{L}(\theta) > 0 for all \theta — the infimum is never achieved.

Regularization fixes this: \mathcal{L}(\theta) + \lambda\|\theta\|^2 = e^{-\theta^2} + \lambda\theta^2 is coercive for any \lambda > 0, so its minimum is attained. The regularization parameter \lambda controls the trade-off between fitting the data and keeping parameters bounded — but at a more fundamental level, \lambda > 0 ensures the optimization problem has a solution. See Gradient Descent on formalML.

Tightness and convergence of distributions

Compactness appears in probability theory through the concept of tightness. A family of probability distributions \{\mu_\alpha\} is tight if for every \varepsilon > 0, there exists a compact set K such that \mu_\alpha(K) > 1 - \varepsilon for all \alpha. Prokhorov’s theorem says that tightness is equivalent to relative sequential compactness in the topology of weak convergence: every sequence in a tight family has a weakly convergent subsequence.

In ML, tightness guarantees that sequences of empirical distributions (from training data) have convergent subsequences — a key ingredient in proving that training procedures converge to meaningful distributions. See Measure-Theoretic Probability on formalML.
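Tightness can be made concrete for a simple parametric family (the Gaussian family \{N(0, \sigma^2) : \sigma \leq 2\} and the search step are assumptions of this sketch):

```python
import math

# Tightness for {N(0, σ²) : σ ≤ 2}: given ε, find one compact set K = [-a, a]
# that works uniformly over the family. P(|X| ≤ a) = erf(a / (σ√2)) is
# decreasing in σ, so the worst case is the largest σ.
def gaussian_mass(a, sigma):
    """P(|X| <= a) for X ~ N(0, sigma^2)."""
    return math.erf(a / (sigma * math.sqrt(2)))

eps = 0.01
a = 2.0
while gaussian_mass(a, sigma=2.0) <= 1 - eps:   # check only the worst case σ = 2
    a += 0.1
print(f"K = [-{a:.1f}, {a:.1f}] gives μ(K) > {1 - eps} for every σ ≤ 2")
assert all(gaussian_mass(a, s) > 1 - eps for s in [0.1, 0.5, 1.0, 2.0])
```

One compact set serves the whole family — the defining feature of tightness. A family like \{N(n, 1) : n \in \mathbb{N}\}, whose mass escapes to infinity, admits no such K.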

Early stopping and implicit compactness

Early stopping — halting training after T iterations — implicitly constrains the effective parameter space. Starting from \theta_0 and taking gradient steps of size \eta, the parameter trajectory \theta_0, \theta_1, \ldots, \theta_T is bounded: \|\theta_t - \theta_0\| \leq t \cdot \eta \cdot G, where G is the maximum gradient norm along the trajectory. So the effective parameter space is contained in a ball of radius T \eta G centered at \theta_0 — a compact set.

This observation connects early stopping to explicit regularization: both create compact constraint sets, and both guarantee that the optimization trajectory stays in a region where the EVT applies. The theoretical equivalence between early stopping and L_2 regularization (for linear models) is a manifestation of this shared compactness structure.
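The trajectory bound can be verified on a toy problem (the quadratic loss, dimensions, and hyperparameters below are illustrative choices):

```python
import numpy as np

# Checking ‖θ_t − θ_0‖ ≤ t·η·G along gradient descent on L(θ) = ½‖θ − θ*‖²,
# whose gradient is θ − θ*. G is the running maximum gradient norm.
rng = np.random.default_rng(0)
theta_star = rng.normal(size=10)
theta0 = theta = np.zeros(10)
eta, T = 0.1, 50

G, drift = 0.0, []
for t in range(1, T + 1):
    grad = theta - theta_star
    G = max(G, float(np.linalg.norm(grad)))
    theta = theta - eta * grad
    drift.append(float(np.linalg.norm(theta - theta0)))
    assert drift[-1] <= t * eta * G + 1e-9   # the triangle-inequality bound holds

print(f"final drift {drift[-1]:.3f} ≤ T·η·G = {T * eta * G:.3f}")
```

The bound is just the triangle inequality applied step by step, but it is what confines the first T iterates to a compact ball regardless of where unconstrained training might wander.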

Computational Notes

Verifying the Nested Interval Property via bisection

import numpy as np

def nested_interval_bisection(a, b, target, max_steps=50):
    """Bisect [a, b] to locate target, returning interval history."""
    intervals = [(a, b)]
    for _ in range(max_steps):
        mid = (a + b) / 2
        if mid < target:
            a = mid
        else:
            b = mid
        intervals.append((a, b))
        if b - a < 1e-15:
            break
    return intervals

# Find √2 in [1, 2]
intervals = nested_interval_bisection(1, 2, np.sqrt(2), max_steps=50)
print(f"After {len(intervals)-1} steps: [{intervals[-1][0]:.15f}, {intervals[-1][1]:.15f}]")
print(f"Width: {intervals[-1][1] - intervals[-1][0]:.2e}")
print(f"√2 =   {np.sqrt(2):.15f}")

Demonstrating sequential compactness

def bw_cluster_point(sequence):
    """Locate a cluster point via the Bolzano-Weierstrass bisection argument:
    repeatedly halve an enclosing interval, keeping the half that contains
    the most terms of the sequence."""
    pts = [float(x) for x in sequence]
    lo, hi = min(pts), max(pts)
    while hi - lo > 1e-12 and len(pts) > 1:
        mid = (lo + hi) / 2
        left = [x for x in pts if x <= mid]
        right = [x for x in pts if x > mid]
        pts, lo, hi = (left, lo, mid) if len(left) >= len(right) else (right, mid, hi)
    return (lo + hi) / 2

# On [0, 1] (compact): every sequence has a cluster point, and it lies in [0, 1].
rng = np.random.default_rng(42)
seq = rng.uniform(0, 1, size=1000)
limit = bw_cluster_point(seq)
print(f"[0,1]: cluster point ≈ {limit:.6f}, in set: {0 <= limit <= 1}")

# On (0, 1) (not compact): x_n = 1/(n+1) converges to 0 ∉ (0, 1). Longer and
# longer samples cluster ever closer to the missing boundary point:
for N in [10, 100, 1000]:
    limit = bw_cluster_point([1/(n+1) for n in range(N)])
    print(f"(0,1), N={N}: cluster point ≈ {limit:.6f} — escaping toward 0 ∉ (0,1)")

Checking coercivity of regularized objectives

def check_coercivity(f, x_range, n_points=1000):
    """Check if f appears coercive by examining boundary behavior."""
    x = np.linspace(*x_range, n_points)
    y = np.array([f(xi) for xi in x])

    boundary = n_points // 10
    left_min = y[:boundary].min()
    right_min = y[-boundary:].min()
    interior_min = y[boundary:-boundary].min()

    is_coercive = left_min > interior_min and right_min > interior_min
    return is_coercive, y.min(), x[y.argmin()]

# Unregularized: L(θ) = e^(-θ²) — NOT coercive
coercive, fmin, xmin = check_coercivity(lambda x: np.exp(-x**2), (-10, 10))
print(f"exp(-x²): coercive = {coercive}, min ≈ {fmin:.6f} at x ≈ {xmin:.4f}")

# Regularized: L(θ) + λ‖θ‖² = e^(-θ²) + 0.1θ² — coercive
coercive, fmin, xmin = check_coercivity(lambda x: np.exp(-x**2) + 0.1*x**2, (-10, 10))
print(f"exp(-x²) + 0.1x²: coercive = {coercive}, min ≈ {fmin:.6f} at x ≈ {xmin:.4f}")

Connections & Further Reading

Where this leads within formalCalculus

  • Uniform Convergence — the Arzelà-Ascoli theorem extends Heine-Borel to function space: equicontinuity + pointwise boundedness characterizes compact subsets of C([a,b]).
  • The Riemann Integral & FTC — continuous functions on compact intervals are Riemann integrable. The proof uses compactness to control partition refinement.
  • The Derivative & Chain Rule — Rolle’s theorem and the Mean Value Theorem use compactness of [a, b] (via EVT) to guarantee the existence of critical points.
  • Metric Spaces & Topology — completeness and compactness generalize to arbitrary metric spaces. The characterization “closed + bounded ⟺ compact” fails in general — compactness becomes a much deeper property, as the \sin(n\pi x) counterexample in C([0,1]) shows.

Where this leads on formalML

  • Convex Analysis — Weierstrass EVT for optimization on compact sets; compactness of sublevel sets for coercive convex functions; existence of minimizers.
  • Gradient Descent — Existence of minimizers for regularized objectives (\mathcal{L}(\theta) + \lambda\|\theta\|^2) via coercivity arguments.
  • Measure-Theoretic Probability — Prokhorov’s theorem connects tightness to sequential compactness of probability measures.

References

  1. Abbott (2015). Understanding Analysis. Chapters 2–3 on completeness and Chapter 3 on compact sets — the primary reference for the completeness-first approach to real analysis.
  2. Rudin (1976). Principles of Mathematical Analysis. Chapter 2 on basic topology and Chapter 4 on compactness — the definitive treatment of Heine-Borel and its consequences.
  3. Tao (2016). Analysis I. Chapter 12 on metric spaces and compactness — careful first-principles construction connecting completeness, compactness, and continuity.
  4. Boyd & Vandenberghe (2004). Convex Optimization. Section 4.2 on existence of optimal solutions and Section 6.1 on regularization — direct applications of compactness to optimization.
  5. Folland (1999). Real Analysis. Chapter 4 on topological spaces and Tychonoff's theorem — for readers wanting the general topological perspective.
  6. Zou & Hastie (2005). “Regularization and Variable Selection via the Elastic Net.” The elastic net combines L1 and L2 penalties — both create compact constraint sets, guaranteeing minimizer existence while promoting sparsity.