
Completeness & Compactness

The structural properties of ℝ that guarantee limits exist and optima are attained — why regularization works and when your optimization problem has a solution

Abstract. The completeness of ℝ — every nonempty set bounded above has a least upper bound — is the axiom that separates the real numbers from the rationals and makes calculus possible. This single property implies the Monotone Convergence Theorem, the Bolzano-Weierstrass Theorem, and the Cauchy completeness that guarantees iterative algorithms converge to something. Compactness, the other structural pillar, characterizes the sets K ⊆ ℝ that are closed and bounded (Heine-Borel); equivalently, every sequence in a compact K has a subsequence converging to a point in K. Compactness is the topological property behind the Extreme Value Theorem: continuous functions on compact sets attain their maximum and minimum, which is exactly the existence guarantee for optimization problems like min_{θ ∈ Θ} L(θ) when Θ is compact. In machine learning, neural network parameter spaces are ℝ^d — decidedly not compact. Regularization techniques (L2 weight decay constraining ‖θ‖₂ ≤ R, L1 lasso constraining ‖θ‖₁ ≤ R, gradient clipping, spectral normalization) effectively restrict parameters to compact subsets, restoring the existence guarantees that compactness provides. Coercive loss functions (L(θ) → ∞ as ‖θ‖ → ∞) offer an alternative: their sublevel sets are compact, so minimizers exist even on unbounded domains.

Where this leads → formalML

  • formalML: The Weierstrass extreme value theorem — continuous functions on compact sets attain their extrema — is the foundational existence result for convex optimization. Compactness of sublevel sets for coercive convex functions guarantees minimizers exist even on unbounded domains.
  • formalML: Tightness of probability measures (Prokhorov's theorem) is a compactness condition: a family of measures is tight if for every ε > 0, there exists a compact set K with μ(K) > 1 - ε for all μ in the family. This is used in convergence-in-distribution proofs.
  • formalML: The existence of minimizers for regularized objectives (L(θ) + λ‖θ‖²) follows from compactness arguments: the regularization term makes the effective objective coercive, so sublevel sets are compact, and the EVT guarantees a minimizer exists.

Overview & Motivation

Suppose you’re training a neural network and the loss decreases at every step: \mathcal{L}(\theta_1) > \mathcal{L}(\theta_2) > \mathcal{L}(\theta_3) > \cdots. If the loss is bounded below (as most losses are), this is a bounded, decreasing sequence of real numbers. Does it converge? And does there actually exist a parameter vector \theta^* that achieves the infimum of the loss? The first question is about completeness; the second is about compactness. These are the two structural properties of \mathbb{R} that make calculus — and optimization — work.

In Sequences, Limits & Convergence, we used completeness when we proved the Monotone Convergence Theorem and the Bolzano-Weierstrass Theorem. In Epsilon-Delta & Continuity, we used compactness (implicitly) when we proved the Extreme Value Theorem and the Heine-Cantor Theorem. Now we develop the theory explicitly. We will see that a single axiom — the least upper bound property — is the foundation for all of these results, and that compactness is the structural property that makes continuous optimization well-posed.

This is where the abstraction level increases. In the first two topics, we worked with individual sequences and functions. Here we reason about structural properties of the real line itself — asking not “does this particular sequence converge?” but “what property of \mathbb{R} guarantees that every bounded monotone sequence converges?”

Completeness of ℝ — The Least Upper Bound Property

Why \mathbb{Q} is not enough

Consider the set S = \{q \in \mathbb{Q} : q^2 < 2\}. This set is nonempty (1 \in S) and bounded above (2 is an upper bound). Does S have a least upper bound in \mathbb{Q}? If it did, that least upper bound would have to be \sqrt{2} — but \sqrt{2} \notin \mathbb{Q}. The rationals have a “hole” right where the supremum should be.

This is not an isolated curiosity. The rationals are riddled with such holes — one for every irrational number. A number system with holes cannot support calculus: limits might not exist, monotone sequences might not converge, and continuous functions might not attain their extrema. The real numbers are constructed precisely to fill these holes.
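A quick numerical sketch of the hole (illustrative only, using truncated decimal expansions of \sqrt{2}): each truncation is a rational number in S, the truncations increase and are bounded above, yet their supremum is irrational.

```python
from fractions import Fraction
import math

# Truncated decimal expansions of sqrt(2): rational numbers q_n with q_n^2 < 2.
# They increase and are bounded above, but their supremum, sqrt(2), is not
# rational -- the "hole" in Q.
q = [Fraction(int(math.sqrt(2) * 10**n), 10**n) for n in range(1, 12)]

for qn in q[:4]:
    print(qn, float(qn))
assert all(qn * qn < 2 for qn in q)           # each q_n is in S
assert all(a <= b for a, b in zip(q, q[1:]))  # increasing
print("gap 2 - q_n^2:", [float(2 - qn * qn) for qn in q[-3:]])  # shrinks toward 0
```

The sequence is Cauchy in \mathbb{Q}, but its only candidate limit is \sqrt{2} \notin \mathbb{Q}.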

📐 Definition 1 (Upper Bound and Supremum)

Let S \subseteq \mathbb{R} be nonempty. A number M \in \mathbb{R} is an upper bound of S if s \leq M for all s \in S. The supremum (least upper bound) \sup S is an upper bound of S with the property that if M is any other upper bound of S, then \sup S \leq M.

Dually, m is a lower bound of S if m \leq s for all s \in S, and the infimum \inf S is the greatest lower bound.

The supremum has a crucial characterization: \sup S is the unique number \alpha such that (i) \alpha is an upper bound of S, and (ii) for every \varepsilon > 0, there exists s \in S with s > \alpha - \varepsilon. This “approximation property” — you can always find elements of S arbitrarily close to \sup S — is what makes the supremum useful for proofs.
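The two conditions can be checked numerically on a concrete set (a sketch on a finite sample; the set S = \{1 - 1/n : n \geq 1\} and its supremum 1 are assumptions of this illustration):

```python
# Approximation property of the supremum for S = {1 - 1/n : n = 1, 2, ...},
# whose supremum is 1 (known analytically; any finite sample stays below it).
S = [1 - 1/n for n in range(1, 10_000)]
sup_S = 1.0

for eps in [0.5, 0.1, 0.01, 0.001]:
    witness = next(s for s in S if s > sup_S - eps)  # (ii): some s in S exceeds sup - eps
    print(f"eps={eps}: found s = {witness:.6f} > {sup_S - eps}")

assert all(s <= sup_S for s in S)  # (i): sup S is an upper bound
```

No element of S ever equals \sup S here — the supremum of a set need not belong to the set.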

🔷 Theorem (Axiom — Completeness of ℝ (LUB Property))

Every nonempty subset of \mathbb{R} that is bounded above has a supremum in \mathbb{R}.

This is not a theorem we prove — it is the axiom that defines the real numbers. Every construction of \mathbb{R} (Dedekind cuts, Cauchy sequence completion of \mathbb{Q}) is designed to make this axiom true. And from this single axiom, the entire structure of calculus follows.

The completeness of ℝ and equivalent formulations

🔷 Theorem 1 (Equivalence of Completeness Formulations)

The following are equivalent:

(a) The LUB Property: Every nonempty subset of \mathbb{R} bounded above has a supremum in \mathbb{R}.

(b) Monotone Convergence Theorem: Every bounded monotone sequence in \mathbb{R} converges.

(c) Nested Interval Property: If [a_1, b_1] \supseteq [a_2, b_2] \supseteq \cdots is a nested sequence of closed bounded intervals with b_n - a_n \to 0, then \bigcap_{n=1}^{\infty} [a_n, b_n] contains exactly one point.

(d) Cauchy Completeness: Every Cauchy sequence in \mathbb{R} converges to a point in \mathbb{R}.

Proof.

We sketch the cycle of implications.

(a) ⟹ (b): Let (a_n) be bounded and increasing. The set \{a_n\} is bounded above, so L = \sup\{a_n\} exists by the LUB property. For every \varepsilon > 0, the approximation property gives a_N > L - \varepsilon for some N. Since (a_n) is increasing, a_n \geq a_N > L - \varepsilon for all n \geq N. And a_n \leq L since L is an upper bound. So |a_n - L| < \varepsilon for all n \geq N. This was proved as Theorem 1 in Sequences, Limits & Convergence.

(b) ⟹ (c): The sequence (a_n) is increasing and bounded above (by b_1), so a_n \to a by MCT. Similarly, (b_n) is decreasing and bounded below (by a_1), so b_n \to b. Since a_n \leq b_n for all n, we have a \leq b. If b_n - a_n \to 0, then b = a, and \{a\} = \bigcap [a_n, b_n].

(c) ⟹ Bolzano-Weierstrass: Given a bounded sequence in [a, b], repeatedly bisect: at least one half contains infinitely many terms. The Nested Interval Property guarantees the bisected intervals converge to a point, which is the limit of the extracted subsequence. This was the strategy behind Theorem 4 in Sequences, Limits & Convergence.

Bolzano-Weierstrass ⟹ (d): Every Cauchy sequence is bounded (proved in Sequences, Limits & Convergence, Theorem 5). Bolzano-Weierstrass gives a convergent subsequence a_{n_k} \to L. A Cauchy sequence with a convergent subsequence converges to the same limit: for \varepsilon > 0, choose N so that |a_m - a_n| < \varepsilon/2 for m, n \geq N and |a_{n_k} - L| < \varepsilon/2 for n_k \geq N. Then for any n \geq N, picking some n_k \geq N, |a_n - L| \leq |a_n - a_{n_k}| + |a_{n_k} - L| < \varepsilon.

(d) ⟹ (a): Let S be nonempty and bounded above. We construct a Cauchy sequence converging to \sup S. Start with any upper bound b_0 and any s_0 \in S. Bisect [s_0, b_0]: if the midpoint is an upper bound of S, take the left half; otherwise, the right half. This produces intervals of width (b_0 - s_0)/2^n \to 0, and the endpoints form Cauchy sequences. By Cauchy completeness, they converge to some \alpha \in \mathbb{R}, which we verify is \sup S.

💡 Remark 1

In machine learning, completeness ensures that iterative algorithms have a limit to converge to. When gradient descent produces a Cauchy sequence of parameters \theta_t — meaning \|\theta_m - \theta_n\| gets arbitrarily small for all large m, n, which is stronger than the successive updates \|\theta_{t+1} - \theta_t\| merely shrinking — completeness guarantees convergence to some \theta^* \in \mathbb{R}^d. Without completeness, the “limit” might not exist, like \sqrt{2} in \mathbb{Q}.
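A toy illustration of the distinction (the quadratic loss and step size are assumptions of this sketch): gradient descent on a strongly convex quadratic contracts geometrically, so its iterates are Cauchy; partial sums of the harmonic series have vanishing steps yet drift without bound.

```python
# Gradient descent on f(θ) = θ²/2, gradient ∇f(θ) = θ: iterates θ_t = (1-η)^t θ_0
# contract geometrically, so they form a Cauchy sequence and converge in R.
eta, theta = 0.1, 5.0
iterates = [theta]
for _ in range(200):
    theta -= eta * theta               # θ_{t+1} = θ_t − η ∇f(θ_t)
    iterates.append(theta)

steps = [abs(b - a) for a, b in zip(iterates, iterates[1:])]
print("last step size:", steps[-1])                        # successive updates shrink
print("tail spread:", abs(iterates[-1] - iterates[-50]))   # Cauchy: the whole tail is tight

# Contrast: harmonic partial sums have steps 1/n → 0 but are NOT Cauchy.
harmonic = [sum(1/k for k in range(1, n)) for n in range(2, 2000)]
print("harmonic step ≈", 1/1999, "but tail spread ≈", harmonic[-1] - harmonic[100])
```

Small individual steps alone do not certify convergence; the Cauchy property does, and completeness then supplies the limit.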

The Nested Interval Property

The equivalence theorem tells us the Nested Interval Property follows from completeness. But the NIP deserves a standalone treatment because it is the key proof technique for many existence theorems — and it is the theoretical backbone of bisection algorithms.

🔷 Theorem 2 (Nested Interval Property)

If [a_1, b_1] \supseteq [a_2, b_2] \supseteq \cdots is a nested sequence of closed, bounded intervals, then \bigcap_{n=1}^{\infty} [a_n, b_n] \neq \emptyset. If additionally b_n - a_n \to 0, then the intersection contains exactly one point.

Proof.

Let a = \sup\{a_n\}, which exists by the LUB property since the a_n are bounded above by b_1. Let b = \inf\{b_n\}, which exists since the b_n are bounded below by a_1.

We claim a \leq b. For any n, m, the nesting gives a_n \leq b_m (if n \leq m, then a_n \leq a_m \leq b_m; if m \leq n, then a_n \leq b_n \leq b_m). So every b_m is an upper bound for \{a_n\}, hence a = \sup\{a_n\} \leq b_m for all m, hence a \leq \inf\{b_m\} = b.

Now a \in [a_n, b_n] for all n (because a_n \leq a and a \leq b \leq b_n), so a \in \bigcap [a_n, b_n] \neq \emptyset.

If b_n - a_n \to 0, then 0 \leq b - a \leq b_n - a_n \to 0, so b = a and the intersection is \{a\}.

[Interactive bisection visualization — starting interval [1.00000000, 2.00000000], width 1.00e+0]

💡 Remark 2

The Nested Interval Property is the theoretical backbone of bisection-based algorithms. Every time you halve a search interval and keep the half containing your target, you are applying the NIP. Binary search, the bisection root-finding method from Epsilon-Delta & Continuity, and branch-and-bound optimization algorithms are all NIP in action. The visualization above lets you see this convergence directly — watch the interval width shrink exponentially while the endpoints close in on the target.

Open and Closed Sets

Before defining compactness, we need precise vocabulary for the topology of \mathbb{R}. The distinction between open and closed sets is not just a formality — it determines whether limits of sequences “stay inside” a set, which is exactly the property that makes compactness work.

📐 Definition 2 (Open Set)

A set S \subseteq \mathbb{R} is open if for every x \in S, there exists r > 0 such that the open ball (x - r, x + r) \subseteq S. In other words, every point of S has a “buffer zone” of neighboring points that also belong to S.

📐 Definition 3 (Closed Set)

A set S \subseteq \mathbb{R} is closed if its complement \mathbb{R} \setminus S is open. Equivalently, S is closed if and only if it contains all its limit points: whenever (x_n) is a sequence in S with x_n \to L, then L \in S.

The sequential characterization is often more useful: a closed set is one that “keeps its limits.”

Open and closed sets in ℝ

🔷 Proposition 1 (Properties of Open and Closed Sets)

(a) Arbitrary unions of open sets are open.

(b) Finite intersections of open sets are open.

(c) Arbitrary intersections of closed sets are closed.

(d) Finite unions of closed sets are closed.

Proof.

(a): If x \in \bigcup_\alpha S_\alpha, then x \in S_\beta for some \beta. Since S_\beta is open, there exists r > 0 with (x - r, x + r) \subseteq S_\beta \subseteq \bigcup_\alpha S_\alpha.

(b): If x \in \bigcap_{i=1}^n S_i, then for each i, there exists r_i > 0 with (x - r_i, x + r_i) \subseteq S_i. Take r = \min(r_1, \ldots, r_n) > 0 (this is where finiteness is essential — an infimum of infinitely many positive numbers could be zero). Then (x - r, x + r) \subseteq \bigcap_{i=1}^n S_i.

(c)–(d): Follow from (a)–(b) by De Morgan’s laws: \mathbb{R} \setminus \bigcap_\alpha S_\alpha = \bigcup_\alpha (\mathbb{R} \setminus S_\alpha) and \mathbb{R} \setminus \bigcup_{i=1}^n S_i = \bigcap_{i=1}^n (\mathbb{R} \setminus S_i).
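The failure of (b) for infinite intersections can be seen concretely with the standard example \bigcap_n (-1/n, 1/n) = \{0\}, which is not open — a small numerical check (finite truncation of the intersection, so only suggestive):

```python
# Why finiteness matters in (b): the open intervals (-1/n, 1/n) intersect
# to {0}, which is not open. Every neighbor of 0 is eventually ejected.
def in_intersection(x, N):
    """Is x in the intersection of (-1/n, 1/n) for n = 1..N?"""
    return all(-1/n < x < 1/n for n in range(1, N + 1))

print(in_intersection(0.0, 10**6))   # 0 survives every intersection
for r in [0.1, 0.01, 0.001]:
    # r/2 lies in (-r, r), yet fails once 1/n < r/2:
    print(r/2, in_intersection(r/2, 10**6))
```

No r > 0 gives (-r, r) \subseteq \{0\}, so the intersection has empty interior.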

📐 Definition 4 (Interior, Boundary, Closure)

For S \subseteq \mathbb{R}:

  • The interior \mathrm{int}(S) is the largest open set contained in S: the set of all x \in S for which some (x - r, x + r) \subseteq S.
  • The closure \overline{S} is the smallest closed set containing S: equivalently, S together with all its limit points.
  • The boundary \partial S = \overline{S} \setminus \mathrm{int}(S): the set of points where every neighborhood meets both S and \mathbb{R} \setminus S.

📝 Example 1

For S = [0, 1):

  • \mathrm{int}(S) = (0, 1) — the point 0 has no neighborhood contained in S (every (-r, r) contains negative numbers not in S).
  • \overline{S} = [0, 1] — the point 1 is a limit point of S (the sequence 1 - 1/n \to 1) but 1 \notin S.
  • \partial S = \{0, 1\}.

The set S = [0, 1) is neither open (0 is a boundary point in S with no interior buffer) nor closed (1 is a limit point not in S).

Sequential Compactness

The Bolzano-Weierstrass Theorem from Sequences, Limits & Convergence says every bounded sequence in \mathbb{R} has a convergent subsequence. But where does that subsequence converge to? If the original sequence lives in [0, 1], the limit is guaranteed to also be in [0, 1] — because [0, 1] is closed. If the sequence lives in (0, 1), the limit might escape to 0 or 1. This is the difference between compact and non-compact.

📐 Definition 5 (Sequential Compactness)

A set K \subseteq \mathbb{R} is sequentially compact if every sequence (x_n) in K has a subsequence (x_{n_k}) that converges to a point L \in K.

The key requirement is that the limit L belongs to K — not just that a convergent subsequence exists, but that the subsequence converges to a point inside the set.

🔷 Proposition 2 (Compact ⟺ Closed + Bounded in ℝ)

A subset K \subseteq \mathbb{R} is sequentially compact if and only if K is closed and bounded.

Proof.

(⟸) Closed + bounded implies compact: If K is bounded, then K \subseteq [-M, M] for some M. By the Bolzano-Weierstrass Theorem, any sequence (x_n) in K has a convergent subsequence x_{n_k} \to L. If K is closed, then L \in K (closed sets contain their limit points). So K is sequentially compact.

(⟹) Compact implies closed: Suppose K is not closed. Then there exists a sequence (x_n) in K with x_n \to L and L \notin K. Every subsequence of (x_n) also converges to L (a fact from Sequences, Limits & Convergence). Since L \notin K, no subsequence converges to a point in K — contradicting sequential compactness.

(⟹) Compact implies bounded: Suppose K is unbounded. Then for each n, there exists x_n \in K with |x_n| > n. The sequence (x_n) has |x_n| \to \infty, so no subsequence can converge (convergent sequences are bounded). This contradicts sequential compactness.

Sequential compactness: compact vs. non-compact sets


💡 Remark 3

The characterization “closed + bounded ⟺ compact” is specific to \mathbb{R}^n — this is the Heine-Borel theorem we prove in the next section. In general metric spaces, compactness is a strictly stronger property than “closed + bounded.” For example, in the space of continuous functions C([0,1]) with the supremum norm, the closed unit ball \{f : \|f\|_\infty \leq 1\} is closed and bounded but not compact — you can find a sequence of functions in it with no convergent subsequence. This distinction matters when working with function spaces in ML, and we develop it in Metric Spaces & Topology, where we prove that the sequence f_n(x) = \sin(n\pi x) has all pairwise sup-norm distances \geq 1, so no subsequence is Cauchy and no subsequence converges uniformly.
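The pairwise-distance claim can be estimated numerically (grid approximation, so the computed sup-norms are lower bounds on the true ones; the range n ≤ 8 is an arbitrary choice for the sketch):

```python
import numpy as np

# Pairwise sup-norm distances between f_n(x) = sin(nπx) on [0, 1], estimated
# on a fine grid. The distances never get small, so no subsequence of (f_n)
# can be Cauchy in C([0,1]) -- the closed unit ball is not compact.
x = np.linspace(0, 1, 20001)
F = {n: np.sin(n * np.pi * x) for n in range(1, 9)}

dists = [np.max(np.abs(F[n] - F[m])) for n in range(1, 9) for m in range(n + 1, 9)]
print(f"min pairwise sup distance ≈ {min(dists):.3f}")
print(f"max pairwise sup distance ≈ {max(dists):.3f}")
```

Every pair stays sup-norm distance at least about 1 apart, even though each f_n lies in the closed, bounded unit ball.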

Compactness — The Open Cover Definition

Sequential compactness is intuitive: “every sequence has a convergent subsequence staying in the set.” There is an equivalent definition using open covers that is more powerful for proofs, especially when we generalize beyond \mathbb{R}^n.

📐 Definition 6 (Open Cover)

An open cover of K \subseteq \mathbb{R} is a collection \{U_\alpha\}_{\alpha \in A} of open sets such that K \subseteq \bigcup_{\alpha \in A} U_\alpha. That is, every point of K belongs to at least one of the U_\alpha.

A subcover is a subcollection that still covers KK. A finite subcover is a subcover with finitely many sets.

📐 Definition 7 (Compactness via Open Covers)

A set K is compact if every open cover of K has a finite subcover: for any collection of open sets \{U_\alpha\} covering K, there exist finitely many U_{\alpha_1}, \ldots, U_{\alpha_n} such that K \subseteq U_{\alpha_1} \cup \cdots \cup U_{\alpha_n}.

🔷 Theorem 3 (Heine-Borel Theorem)

A subset K \subseteq \mathbb{R}^n is compact (every open cover has a finite subcover) if and only if K is closed and bounded.

In \mathbb{R}, this means the compact sets are precisely the closed, bounded intervals [a, b] and their closed subsets.

Proof.

We prove the theorem for \mathbb{R}; the \mathbb{R}^n case follows by the same argument applied coordinate-wise.

(⟸) Closed and bounded implies compact: It suffices to treat K = [a, b]: a general closed, bounded K sits inside some [a, b], and adjoining the open set \mathbb{R} \setminus K to any cover of K gives a cover of [a, b]; discarding \mathbb{R} \setminus K from a finite subcover of [a, b] leaves a finite subcover of K. So suppose K = [a, b] and let \{U_\alpha\} be any open cover. Define

S = \{x \in [a, b] : [a, x] \text{ can be covered by finitely many } U_\alpha\}.

S is nonempty: a \in U_\beta for some \beta, and since U_\beta is open, some interval [a, a + r) \subseteq U_\beta, so a \in S.

S is bounded above by b. Let c = \sup S, which exists by the completeness of \mathbb{R}. We claim c = b.

Since c \in [a, b] and \{U_\alpha\} covers [a, b], there exists U_\gamma with c \in U_\gamma. Since U_\gamma is open, there exists \delta > 0 with (c - \delta, c + \delta) \subseteq U_\gamma.

Since c = \sup S, there exists x \in S with x > c - \delta. Then [a, x] is covered by finitely many U_\alpha, and [x, \min(c + \delta/2, b)] \subseteq (c - \delta, c + \delta) \subseteq U_\gamma. So [a, \min(c + \delta/2, b)] is covered by finitely many U_\alpha.

If c < b, then \min(c + \delta/2, b) > c, contradicting c = \sup S. Therefore c = b, and [a, b] has a finite subcover.

(⟹) Compact implies bounded: Cover K by \{(-n, n) : n \in \mathbb{N}\}. A finite subcover gives K \subseteq (-N, N) for some N, so K is bounded.

(⟹) Compact implies closed: Suppose x \notin K. For each y \in K, let r_y = |x - y|/2 > 0. The intervals (y - r_y, y + r_y) cover K. Extract a finite subcover: K \subseteq \bigcup_{i=1}^n (y_i - r_{y_i}, y_i + r_{y_i}). Let r = \min(r_{y_1}, \ldots, r_{y_n}) > 0. Then (x - r, x + r) \cap K = \emptyset, so x is an interior point of \mathbb{R} \setminus K. Since every point of \mathbb{R} \setminus K is interior, \mathbb{R} \setminus K is open, hence K is closed.

Heine-Borel: finite subcovers on compact sets

📝 Example 2

Consider the open cover \{(1/n, 1) : n \geq 2\} of the open interval (0, 1). Each point x \in (0, 1) lies in (1/n, 1) for n > 1/x, so this is indeed an open cover.

But no finite subcollection works. If we select (1/n_1, 1), \ldots, (1/n_k, 1), the union is (1/N, 1) where N = \max(n_1, \ldots, n_k), which misses points in (0, 1/N]. This proves (0, 1) is not compact.

Contrast with [0, 1]: any open cover must include some U_\alpha containing 0 and some U_\beta containing 1, and the Heine-Borel proof guarantees a finite subcover exists.
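The failure of finite subcovers can be checked directly (the particular finite selection below is arbitrary, chosen for the sketch):

```python
# Any finite selection from the cover {(1/n, 1) : n ≥ 2} of (0, 1) has
# union (1/N, 1) with N = max of the chosen n's -- it misses (0, 1/N].
selection = [2, 5, 17, 100]            # an arbitrary finite subcollection
N = max(selection)
uncovered = 1 / (2 * N)                # a point of (0, 1) inside (0, 1/N]

covered = any(1/n < uncovered < 1 for n in selection)
print(f"point {uncovered} covered by the finite selection: {covered}")
# The full (infinite) family does cover it -- e.g. the interval (1/(4N), 1):
print(f"covered by the full family: {1/(4*N) < uncovered < 1}")
```

However large the finite selection, halving 1/N always produces an uncovered point — exactly the non-compactness of (0, 1).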


Continuous Functions on Compact Sets

This is where compactness meets continuity — producing the power theorems of analysis. In Epsilon-Delta & Continuity, we proved the Extreme Value Theorem and the Heine-Cantor Theorem using direct arguments. Here we reprove both as consequences of compactness, revealing the structural reason they hold: the essential ingredient is not the specific argument, but the compactness of the domain.

🔶 Lemma 1 (Continuous Image of a Compact Set)

If f: K \to \mathbb{R} is continuous and K is compact, then f(K) = \{f(x) : x \in K\} is compact.

Proof.

Let (y_n) be a sequence in f(K). Then y_n = f(x_n) for some x_n \in K. By compactness of K, there exists a subsequence x_{n_k} \to x^* \in K. By continuity of f, f(x_{n_k}) \to f(x^*) \in f(K).

So every sequence in f(K) has a subsequence converging to a point in f(K). Therefore f(K) is compact.

🔷 Theorem 4 (Extreme Value Theorem (Compactness Proof))

If f: K \to \mathbb{R} is continuous and K is compact, then f attains its maximum and minimum on K: there exist x_{\max}, x_{\min} \in K with f(x_{\min}) \leq f(x) \leq f(x_{\max}) for all x \in K.

Proof.

By Lemma 1, f(K) is compact, hence closed and bounded by Heine-Borel. Since f(K) is bounded, M = \sup f(K) exists. Since f(K) is closed and M is a limit point of f(K) (by the approximation property of the supremum: there exist y_n \in f(K) with y_n \to M), we have M \in f(K). So M = f(x_{\max}) for some x_{\max} \in K.

The argument for the minimum is identical with \inf replacing \sup.

Compare this with the proof in Epsilon-Delta & Continuity (Theorem 4): the compactness proof is cleaner because it identifies the essential structural ingredient. The continuous image of a compact set is compact — that is the entire story.
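A small grid-based sketch of the contrast (the functions f(x) = x(1-x) and g(x) = x are illustrative choices, and a finite grid only approximates the true extrema):

```python
import numpy as np

# EVT in action: on the compact [0, 1], f(x) = x(1 - x) attains its maximum.
x = np.linspace(0, 1, 100001)
f = x * (1 - x)
print(f"max of f on [0,1] ≈ {f.max():.6f} at x ≈ {x[f.argmax()]:.4f}")

# On the non-compact (0, 1), g(x) = x has infimum 0, attained by no point:
# every grid point strictly inside the interval can be beaten by a smaller one.
x_open = x[1:-1]
print(f"smallest grid value of g on (0,1): {x_open.min():.6f} — positive, never 0")
```

The maximum of f is genuinely attained (at x = 1/2); the infimum of g on (0, 1) is approached but never reached, because the would-be minimizer 0 lies outside the set.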

🔷 Theorem 5 (Heine-Cantor Theorem (Compactness Proof))

If f: K \to \mathbb{R} is continuous and K is compact, then f is uniformly continuous on K: for every \varepsilon > 0, there exists \delta > 0 such that |x - y| < \delta \implies |f(x) - f(y)| < \varepsilon for all x, y \in K.

Proof.

Suppose f is not uniformly continuous. Then there exists \varepsilon > 0 such that for every \delta = 1/n, there exist x_n, y_n \in K with |x_n - y_n| < 1/n but |f(x_n) - f(y_n)| \geq \varepsilon.

By compactness of K, the sequence (x_n) has a subsequence x_{n_k} \to x^* \in K. Since |x_{n_k} - y_{n_k}| < 1/n_k \to 0, we also have y_{n_k} \to x^*.

By continuity of f at x^*: f(x_{n_k}) \to f(x^*) and f(y_{n_k}) \to f(x^*), so |f(x_{n_k}) - f(y_{n_k})| \to 0. But |f(x_{n_k}) - f(y_{n_k})| \geq \varepsilon > 0 for all k — contradiction.
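The role of compactness can be probed numerically with the classic non-example f(x) = 1/x, continuous on the non-compact (0, 1] but not uniformly continuous there (the specific δ and sample points are assumptions of the sketch):

```python
import numpy as np

# f(x) = 1/x on (0, 1]: fix |x - y| = δ and watch |f(x) - f(y)| blow up near 0.
f = lambda x: 1 / x
delta = 1e-3
for x0 in [0.5, 0.1, 0.01, 0.002]:
    print(f"x = {x0}: |f(x) - f(x+δ)| = {abs(f(x0) - f(x0 + delta)):.2f}")

# On the compact [0.1, 1], Heine-Cantor applies: the same δ gives a uniform bound.
xs = np.linspace(0.1, 1, 10001)
print("sup over [0.1, 1]:", np.max(np.abs(f(xs) - f(xs + delta))))
```

No single δ works across all of (0, 1] — the bad pairs escape toward the missing endpoint 0, exactly the escape that compactness forbids.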

Continuous functions on compact sets: EVT and Heine-Cantor

💡 Remark 4

The Heine-Cantor proof by contradiction is a paradigm for compactness arguments: assume the negation, extract a sequence, use compactness to get a convergent subsequence, and reach a contradiction via continuity. This pattern — sometimes called the “compactness argument” — appears throughout analysis and optimization theory. Whenever you see a proof that constructs a sequence and extracts a convergent subsequence, compactness is at work.

Coercivity — Compactness Without Boundedness

In practice, parameter spaces are \mathbb{R}^d — not compact (unbounded). So the EVT does not apply directly: \min_{\theta \in \mathbb{R}^d} \mathcal{L}(\theta) might not be attained. This is not an abstract concern — the function f(x) = e^{-x} on \mathbb{R} has infimum 0 but never reaches it.

Coercivity provides an elegant substitute: if the function “blows up” at infinity, the optimization effectively takes place on a compact sublevel set.

📐 Definition 8 (Coercive Function)

A function f: \mathbb{R}^d \to \mathbb{R} is coercive if \lim_{\|x\| \to \infty} f(x) = +\infty. Equivalently, for every c \in \mathbb{R}, the sublevel set \{x \in \mathbb{R}^d : f(x) \leq c\} is bounded.

🔷 Theorem 6 (Weierstrass Theorem for Coercive Functions)

If f: \mathbb{R}^d \to \mathbb{R} is continuous and coercive, then f attains its minimum: there exists x^* \in \mathbb{R}^d with f(x^*) = \inf_{x \in \mathbb{R}^d} f(x).

Proof.

Let m = \inf_{x \in \mathbb{R}^d} f(x), which a priori could be -\infty.

Choose a minimizing sequence (x_n) with f(x_n) \to m. Since m \leq f(x_1), we have f(x_n) \leq f(x_1) + 1 for all sufficiently large n. The sublevel set S = \{x : f(x) \leq f(x_1) + 1\} is:

  • Bounded, by coercivity of f.
  • Closed, by continuity of f (preimage of the closed set (-\infty, f(x_1) + 1]).

So S is compact by the Heine-Borel theorem, and the tail of (x_n) lies in S. By compactness, extract a subsequence x_{n_k} \to x^* \in S. By continuity, f(x^*) = \lim f(x_{n_k}) = m — in particular m is finite, and it is attained at x^*.

Compactness in optimization: compact domain vs. coercive function

💡 Remark 5

Coercivity is why L_2 regularization guarantees a unique minimizer for convex losses. The regularized loss \mathcal{L}(\theta) + \lambda\|\theta\|_2^2 is coercive for any \lambda > 0: a convex \mathcal{L} is bounded below by an affine function, and the quadratic penalty dominates any affine decrease as \|\theta\| \to \infty. The sublevel sets are then compact (closed and bounded), so the Weierstrass theorem guarantees a minimizer exists — and the strong convexity contributed by the penalty makes it unique. Without regularization, the loss might decrease asymptotically — approaching its infimum along a sequence \theta_n with \|\theta_n\| \to \infty — without ever being attained.

Connections to Machine Learning

Regularization as compactification

Regularization creates compact constraint sets

The unconstrained parameter space \mathbb{R}^d is not compact. Regularization techniques create compact subsets, restoring the EVT guarantee that minimizers exist:

  • L_2 weight decay constrains \|\theta\|_2 \leq R, a closed ball in \mathbb{R}^d — compact by Heine-Borel.
  • L_1 lasso constrains \|\theta\|_1 \leq R, a closed “diamond” (cross-polytope) — also compact.
  • Elastic net combines both: \alpha\|\theta\|_1 + (1-\alpha)\|\theta\|_2 \leq R — again a closed, bounded set, hence compact.
  • Weight clipping (|\theta_i| \leq C for all i) constrains parameters to a closed hypercube — compact.
  • Spectral normalization constrains the spectral norm (largest singular value) of weight matrices, creating a closed, bounded — hence compact — set of matrices.

Each technique replaces optimization over an unbounded parameter space with optimization over a compact set, where the existence of minimizers is guaranteed by the EVT. Regularization is not just for generalization — it is for existence. See Convex Analysis on formalML for the full treatment.

Existence of optimal parameters

Without compactness or coercivity, \inf_\theta \mathcal{L}(\theta) may not be attained. Consider a simple example: \mathcal{L}(\theta) = e^{-\theta^2} on \mathbb{R}. The infimum is 0, but \mathcal{L}(\theta) > 0 for all \theta — the infimum is never achieved.

Regularization fixes this: \mathcal{L}(\theta) + \lambda\|\theta\|^2 = e^{-\theta^2} + \lambda\theta^2 is coercive for any \lambda > 0, so its minimum is attained. The regularization parameter \lambda controls the trade-off between fitting the data and keeping parameters bounded — but at a more fundamental level, \lambda > 0 ensures the optimization problem has a solution. See Gradient Descent on formalML.

Tightness and convergence of distributions

Compactness appears in probability theory through the concept of tightness. A family of probability distributions \{\mu_\alpha\} is tight if for every \varepsilon > 0, there exists a compact set K such that \mu_\alpha(K) > 1 - \varepsilon for all \alpha. Prokhorov’s theorem says that tightness is equivalent to relative sequential compactness in the topology of weak convergence: every sequence in a tight family has a weakly convergent subsequence.

In ML, tightness guarantees that sequences of empirical distributions (from training data) have convergent subsequences — a key ingredient in proving that training procedures converge to meaningful distributions. See Measure-Theoretic Probability on formalML.
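Tightness can be made concrete for a simple parametric family (the Gaussian family \{N(0, \sigma^2) : \sigma \leq 2\} and the search step are assumptions of this sketch):

```python
import math

# Tightness for {N(0, σ²) : σ ≤ 2}: given ε, find one compact set K = [-a, a]
# that works uniformly over the family. P(|X| ≤ a) = erf(a / (σ√2)) is
# decreasing in σ, so the worst case is the largest σ.
def gaussian_mass(a, sigma):
    """P(|X| <= a) for X ~ N(0, sigma^2)."""
    return math.erf(a / (sigma * math.sqrt(2)))

eps = 0.01
a = 2.0
while gaussian_mass(a, sigma=2.0) <= 1 - eps:   # check only the worst case σ = 2
    a += 0.1
print(f"K = [-{a:.1f}, {a:.1f}] gives μ(K) > {1 - eps} for every σ ≤ 2")
assert all(gaussian_mass(a, s) > 1 - eps for s in [0.1, 0.5, 1.0, 2.0])
```

One compact set serves the whole family — the defining feature of tightness. A family like \{N(n, 1) : n \in \mathbb{N}\}, whose mass escapes to infinity, admits no such K.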

Early stopping and implicit compactness

Early stopping — halting training after T iterations — implicitly constrains the effective parameter space. Starting from \theta_0 and taking gradient steps of size \eta, the parameter trajectory \theta_0, \theta_1, \ldots, \theta_T is bounded: \|\theta_t - \theta_0\| \leq t \cdot \eta \cdot G, where G is the maximum gradient norm along the trajectory. So the effective parameter space is contained in a ball of radius T \eta G centered at \theta_0 — a compact set.

This observation connects early stopping to explicit regularization: both create compact constraint sets, and both guarantee that the optimization trajectory stays in a region where the EVT applies. The theoretical equivalence between early stopping and L_2 regularization (for linear models) is a manifestation of this shared compactness structure.
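The trajectory bound can be verified on a toy problem (the quadratic loss, dimensions, and hyperparameters below are illustrative choices):

```python
import numpy as np

# Checking ‖θ_t − θ_0‖ ≤ t·η·G along gradient descent on L(θ) = ½‖θ − θ*‖²,
# whose gradient is θ − θ*. G is the running maximum gradient norm.
rng = np.random.default_rng(0)
theta_star = rng.normal(size=10)
theta0 = theta = np.zeros(10)
eta, T = 0.1, 50

G, drift = 0.0, []
for t in range(1, T + 1):
    grad = theta - theta_star
    G = max(G, float(np.linalg.norm(grad)))
    theta = theta - eta * grad
    drift.append(float(np.linalg.norm(theta - theta0)))
    assert drift[-1] <= t * eta * G + 1e-9   # the triangle-inequality bound holds

print(f"final drift {drift[-1]:.3f} ≤ T·η·G = {T * eta * G:.3f}")
```

The bound is just the triangle inequality applied step by step, but it is what confines the first T iterates to a compact ball regardless of where unconstrained training might wander.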

Computational Notes

Verifying the Nested Interval Property via bisection

import numpy as np

def nested_interval_bisection(a, b, target, max_steps=50):
    """Bisect [a, b] to locate target, returning interval history."""
    intervals = [(a, b)]
    for _ in range(max_steps):
        mid = (a + b) / 2
        if mid < target:
            a = mid
        else:
            b = mid
        intervals.append((a, b))
        if b - a < 1e-15:
            break
    return intervals

# Find √2 in [1, 2]
intervals = nested_interval_bisection(1, 2, np.sqrt(2), max_steps=50)
print(f"After {len(intervals)-1} steps: [{intervals[-1][0]:.15f}, {intervals[-1][1]:.15f}]")
print(f"Width: {intervals[-1][1] - intervals[-1][0]:.2e}")
print(f"√2 =   {np.sqrt(2):.15f}")

Demonstrating sequential compactness

def bw_cluster_point(sequence):
    """Locate a cluster point via the Bolzano-Weierstrass bisection argument:
    repeatedly halve an enclosing interval, keeping the half that contains
    the most terms of the sequence."""
    pts = [float(x) for x in sequence]
    lo, hi = min(pts), max(pts)
    while hi - lo > 1e-12 and len(pts) > 1:
        mid = (lo + hi) / 2
        left = [x for x in pts if x <= mid]
        right = [x for x in pts if x > mid]
        pts, lo, hi = (left, lo, mid) if len(left) >= len(right) else (right, mid, hi)
    return (lo + hi) / 2

# On [0, 1] (compact): every sequence has a cluster point, and it lies in [0, 1].
rng = np.random.default_rng(42)
seq = rng.uniform(0, 1, size=1000)
limit = bw_cluster_point(seq)
print(f"[0,1]: cluster point ≈ {limit:.6f}, in set: {0 <= limit <= 1}")

# On (0, 1) (not compact): x_n = 1/(n+1) converges to 0 ∉ (0, 1). Longer and
# longer samples cluster ever closer to the missing boundary point:
for N in [10, 100, 1000]:
    limit = bw_cluster_point([1/(n+1) for n in range(N)])
    print(f"(0,1), N={N}: cluster point ≈ {limit:.6f} — escaping toward 0 ∉ (0,1)")

Checking coercivity of regularized objectives

def check_coercivity(f, x_range, n_points=1000):
    """Check if f appears coercive by examining boundary behavior."""
    x = np.linspace(*x_range, n_points)
    y = np.array([f(xi) for xi in x])

    boundary = n_points // 10
    left_min = y[:boundary].min()
    right_min = y[-boundary:].min()
    interior_min = y[boundary:-boundary].min()

    is_coercive = left_min > interior_min and right_min > interior_min
    return is_coercive, y.min(), x[y.argmin()]

# Unregularized: L(θ) = e^(-θ²) — NOT coercive
coercive, fmin, xmin = check_coercivity(lambda x: np.exp(-x**2), (-10, 10))
print(f"exp(-x²): coercive = {coercive}, min ≈ {fmin:.6f} at x ≈ {xmin:.4f}")

# Regularized: L(θ) + λ‖θ‖² = e^(-θ²) + 0.1θ² — coercive
coercive, fmin, xmin = check_coercivity(lambda x: np.exp(-x**2) + 0.1*x**2, (-10, 10))
print(f"exp(-x²) + 0.1x²: coercive = {coercive}, min ≈ {fmin:.6f} at x ≈ {xmin:.4f}")

Connections & Further Reading

Where this leads within formalCalculus

  • Uniform Convergence — the Arzelà-Ascoli theorem extends Heine-Borel to function space: equicontinuity + pointwise boundedness characterizes compact subsets of C([a,b]).
  • The Riemann Integral & FTC — continuous functions on compact intervals are Riemann integrable. The proof uses compactness to control partition refinement.
  • The Derivative & Chain Rule — Rolle’s theorem and the Mean Value Theorem use compactness of [a, b] (via EVT) to guarantee the existence of critical points.
  • Metric Spaces & Topology — completeness and compactness generalize to arbitrary metric spaces. The characterization “closed + bounded ⟺ compact” fails in general — compactness becomes a much deeper property, as the \sin(n\pi x) counterexample in C([0,1]) shows.

Where this leads on formalML

  • Convex Analysis — Weierstrass EVT for optimization on compact sets; compactness of sublevel sets for coercive convex functions; existence of minimizers.
  • Gradient Descent — Existence of minimizers for regularized objectives (\mathcal{L}(\theta) + \lambda\|\theta\|^2) via coercivity arguments.
  • Measure-Theoretic Probability — Prokhorov’s theorem connects tightness to sequential compactness of probability measures.

References

  1. Abbott (2015). Understanding Analysis. Chapters 2–3 on completeness and Chapter 3 on compact sets — the primary reference for the completeness-first approach to real analysis.
  2. Rudin (1976). Principles of Mathematical Analysis. Chapter 2 on basic topology and Chapter 4 on compactness — the definitive treatment of Heine-Borel and its consequences.
  3. Tao (2016). Analysis I. Chapter 12 on metric spaces and compactness — careful first-principles construction connecting completeness, compactness, and continuity.
  4. Boyd & Vandenberghe (2004). Convex Optimization. Section 4.2 on existence of optimal solutions and Section 6.1 on regularization — direct applications of compactness to optimization.
  5. Folland (1999). Real Analysis. Chapter 4 on topological spaces and Tychonoff's theorem — for readers wanting the general topological perspective.
  6. Zou & Hastie (2005). “Regularization and Variable Selection via the Elastic Net.” The elastic net combines L1 and L2 penalties — both create compact constraint sets, guaranteeing minimizer existence while promoting sparsity.