Measure & Integration · advanced · 45 min read

Sigma-Algebras & Measures

Why not every set is measurable, and how the right framework for 'size' unlocks rigorous probability — from Borel sets to Lebesgue measure to the mathematical language that makes p(x) mean something precise.

Abstract. Measure theory is where analysis leaves the realm of pictures and becomes structural. The Riemann integral fails on important functions — the indicator of the rationals on [0,1] has no Riemann integral at all. To build a more powerful integral theory, we first need a more powerful notion of 'size.' A sigma-algebra tells us which sets we are allowed to measure; a measure assigns a non-negative number to each of those sets, respecting countable additivity. The Lebesgue measure on ℝ extends the intuitive notion of length to a far richer collection of sets than intervals — but the Vitali construction shows that not every subset of ℝ can be assigned a consistent measure. The measurable functions with respect to a sigma-algebra are precisely the functions for which integration will be defined. These concepts are the mathematical language of probability: a probability space is a measure space with total measure 1, and every density function, expectation, and convergence theorem in ML rests on this foundation.

Where this leads → formalML

  • formalML A probability space (Ω, F, P) is a measure space with P(Ω) = 1. Sigma-algebras formalize observable events; filtrations model information flow.
  • formalML The set of initial conditions that converge to saddle points has Lebesgue measure zero — 'gradient descent avoids saddle points with probability 1' is a measure-theoretic statement.
  • formalML Normalizing flows are pushforward measures: T*μ_X = μ_Y. The change-of-variables formula for densities is a Radon-Nikodym derivative of the pushforward.
  • formalML Markov, Chebyshev, Hoeffding — all are bounds on the measure of tail sets P(|X - μ| > t).

1. Why the Riemann Integral Isn’t Enough

If this topic feels more abstract than the previous ones, that’s the subject matter, not your mathematical maturity — measure theory is where analysis becomes structural. Every prior topic in this curriculum has been geometric: epsilon-delta limits live near a point, derivatives are slopes, integrals are areas, vector fields are arrows, phase portraits are trajectories. Measure theory has none of those pictures. A sigma-algebra is a family of sets, and a measure is a function on that family. There is no picture of “the Borel sigma-algebra on $\mathbb{R}$” — it is an uncountable collection of subsets that we can describe but never see.

That’s the bad news. The good news is that this abstract framework is what unlocks rigorous probability, modern analysis, and most of the mathematical machinery that machine learning silently depends on. By the end of this topic you’ll understand exactly what people mean when they write $p(x)\,dx$ for a probability density, why “with probability 1” is more than a figure of speech, and why the Riemann integral you learned in calculus is not the right tool for any serious work in probability or functional analysis.

We’ll motivate the framework with three puzzles — each one a question the Riemann integral cannot answer.

Puzzle 1: The Dirichlet function

Define $\mathbf{1}_\mathbb{Q}: [0, 1] \to \{0, 1\}$ by $\mathbf{1}_\mathbb{Q}(x) = \begin{cases} 1 & \text{if } x \in \mathbb{Q}, \\ 0 & \text{otherwise.} \end{cases}$

This is the indicator function of the rationals. Try to integrate it on $[0, 1]$ via Riemann sums and you immediately hit a wall.

📝 Example 1 (The Dirichlet function has no Riemann integral)

Let $P = \{0 = x_0 < x_1 < \cdots < x_n = 1\}$ be any partition of $[0, 1]$. On every subinterval $[x_{i-1}, x_i]$ — no matter how small — we can find both a rational point and an irrational point. So $\sup_{x \in [x_{i-1}, x_i]} \mathbf{1}_\mathbb{Q}(x) = 1$ and $\inf_{x \in [x_{i-1}, x_i]} \mathbf{1}_\mathbb{Q}(x) = 0$.

Therefore the upper Darboux sum is $U(\mathbf{1}_\mathbb{Q}, P) = \sum_i 1 \cdot (x_i - x_{i-1}) = 1$ and the lower sum is $L(\mathbf{1}_\mathbb{Q}, P) = 0$, regardless of the partition. The Riemann integral exists only when $\sup_P L = \inf_P U$, but here the sup of the lower sums is $0$ while the inf of the upper sums is $1$. The Riemann integral does not exist.
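The combinatorial heart of the argument, that every subinterval contains witnesses of both kinds, can be checked mechanically. A minimal sketch in Python (the particular partition and the witness-finding helpers are illustrative choices, not part of the original argument; floats can only stand in for irrationals):

```python
from fractions import Fraction
from math import sqrt, ceil

def rational_in(a, b):
    """A rational strictly inside (a, b): some k/n with spacing 1/n < b - a."""
    n = ceil(1 / (b - a)) + 1
    k = int(a * n) + 1
    return Fraction(k, n)

def irrational_in(a, b):
    """An irrational-style witness inside (a, b) (floats only illustrate this)."""
    return a + (b - a) / sqrt(2)

partition = [0.0, 0.1, 0.37, 0.5, 0.81, 1.0]    # an arbitrary partition of [0, 1]

for a, b in zip(partition, partition[1:]):
    assert a < rational_in(a, b) < b             # forces sup of 1_Q on [a, b] to be 1
    assert a < irrational_in(a, b) < b           # forces inf of 1_Q on [a, b] to be 0

# Upper/lower Darboux sums: sup is 1 and inf is 0 on every piece,
# so U telescopes to 1 and L to 0 for any partition whatsoever.
upper = sum(1 * (b - a) for a, b in zip(partition, partition[1:]))
lower = sum(0 * (b - a) for a, b in zip(partition, partition[1:]))
print("U =", upper, " L =", lower)
```

The same two witnesses exist in every subinterval of every partition, which is exactly why refining the partition never closes the gap between $U$ and $L$.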

But our intuition is screaming that the integral should be $0$. The rationals are countable; the irrationals are uncountable. Almost every point in $[0, 1]$ is irrational, so $\mathbf{1}_\mathbb{Q}$ is “almost everywhere zero.” A good theory of integration should recognize that. Measure theory will let us make “the rationals are negligible” a precise statement: $\mathbb{Q} \cap [0, 1]$ has Lebesgue measure zero, so $\int \mathbf{1}_\mathbb{Q} \, d\lambda = 0$ in the Lebesgue sense.

Puzzle 2: The Vitali impossibility

Suppose we want to assign a “length” $\mu(A)$ to every subset $A \subseteq [0, 1]$. We’d want this length function to satisfy two reasonable properties:

  1. Translation-invariance: $\mu(A + t) = \mu(A)$ for any $t$ — sliding a set sideways doesn’t change its length.
  2. Countable additivity: $\mu(\bigsqcup_n A_n) = \sum_n \mu(A_n)$ for any countable collection of disjoint sets.

These look harmless. The shocking fact, due to Vitali in 1905, is that no such function exists — at least not one defined on every subset. We will prove this in Section 8: using the axiom of choice, we can construct a subset $V \subset [0, 1]$ to which no consistent length can be assigned. The resolution is sobering: we don’t measure all subsets. We measure only those in a designated collection — a sigma-algebra. Choosing the right sigma-algebra is the central design choice of measure theory.

Puzzle 3: The ML density puzzle

When you write $p(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ for the standard normal density, what does this mean operationally? The continuous random variable $X$ takes any specific value $x$ with probability zero — $P(X = x) = 0$ for every $x \in \mathbb{R}$. So $p(x)$ is not a probability. What is it?

The answer is that $p(x)$ is a Radon-Nikodym derivative: it is the rate at which probability mass accumulates per unit Lebesgue measure. The expression “$p(x)\,dx$” means $\frac{dP}{d\lambda} \cdot d\lambda$, where $P$ is the probability measure and $\lambda$ is Lebesgue measure. Without measure theory, “$p(x)$” is a notational convenience with no operational content; with it, the relationship between $P$ and $\lambda$ becomes precise. This connection is the entry point to the formalml.com topic on Probability Spaces.
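The “mass per unit length” reading can be seen concretely by comparing the probability of a shrinking interval with its length. A sketch using only the standard library (`math.erf` gives the normal CDF exactly; the base point $x = 0.7$ is an arbitrary choice):

```python
from math import erf, exp, pi, sqrt

def Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def p(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)

x = 0.7
for h in (0.1, 0.01, 0.001):
    rate = (Phi(x + h) - Phi(x)) / h    # probability mass per unit length
    print(f"h = {h:6}: P([x, x+h]) / h = {rate:.6f}   p(x) = {p(x):.6f}")
```

As $h \to 0$ the ratio $P([x, x+h])/h$ converges to $p(x)$: the density really is the derivative of probability mass with respect to length.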

These three puzzles all point to the same need: a framework where “size” is defined on a controlled family of sets, where countable operations behave well, and where the limitations of the Riemann integral can be transcended. That framework is built in the next eleven sections.

Three integration scenarios: smooth function (Riemann and Lebesgue agree), oscillatory function (both work but Lebesgue is natural), Dirichlet-like function (Riemann fails)

Riemann sum (midpoint, n = 8): 0.6407 (err 4.11e-3) · Lebesgue sum (n = 8): 0.5642 (err 7.24e-2) · Exact integral: 0.6366

A smooth, well-behaved bump on [0, 1]. Both Riemann and Lebesgue integrals agree exactly with the closed form 2/π. Use this preset to see the two strategies converge in lockstep.
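The two summation strategies are easy to imitate numerically. The sketch below assumes the preset bump is $f(x) = \sin(\pi x)$, whose exact integral over $[0,1]$ is $2/\pi \approx 0.6366$ (an assumption inferred from the quoted exact value, not stated in the widget); the layer-cake discretization here is one reasonable choice and need not reproduce the widget’s readouts digit for digit:

```python
from math import sin, pi

f = lambda x: sin(pi * x)      # assumed preset; exact integral on [0, 1] is 2/pi
n = 8

# Riemann: partition the DOMAIN into n strips, sample f at midpoints.
riemann = sum(f((k + 0.5) / n) for k in range(n)) / n

# Lebesgue: partition the RANGE into n levels; each level contributes
# (1/n) x (measure of the super-level set {x : f(x) >= level}).
# A fine grid stands in for Lebesgue measure of the level sets.
grid = [(j + 0.5) / 100_000 for j in range(100_000)]
def measure_above(t):
    return sum(1 for x in grid if f(x) >= t) / len(grid)

lebesgue = sum(measure_above(j / n) for j in range(1, n + 1)) / n

exact = 2 / pi
print(f"Riemann (midpoint, n=8): {riemann:.4f}  err {abs(riemann - exact):.1e}")
print(f"Lebesgue (n=8 levels):   {lebesgue:.4f}  err {abs(lebesgue - exact):.1e}")
```

The Riemann sum slices vertically; the Lebesgue sum slices horizontally and asks “how much of the domain sits above each level?” — for a smooth bump both converge to the same number as $n$ grows.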

2. Sigma-Algebras

A sigma-algebra is a family of subsets of a fixed ambient set $\Omega$ that is closed under complement, countable union, and contains $\Omega$ itself. That’s it — three axioms. But these three axioms have surprisingly deep consequences, and choosing the right sigma-algebra is the central modeling decision of measure theory.

📐 Definition 1 (Sigma-algebra (σ-algebra))

Let $\Omega$ be a non-empty set. A sigma-algebra on $\Omega$ is a collection $\mathcal{F} \subseteq \mathcal{P}(\Omega)$ of subsets of $\Omega$ satisfying:

  1. $\Omega \in \mathcal{F}$.
  2. Closure under complement. If $A \in \mathcal{F}$, then $A^c = \Omega \setminus A \in \mathcal{F}$.
  3. Closure under countable union. If $A_1, A_2, A_3, \ldots$ is a countable sequence in $\mathcal{F}$, then $\bigcup_{n=1}^\infty A_n \in \mathcal{F}$.

The pair $(\Omega, \mathcal{F})$ is called a measurable space. Members of $\mathcal{F}$ are called measurable sets.

A few easy consequences fall out. Since $\Omega \in \mathcal{F}$ and $\mathcal{F}$ is closed under complement, $\emptyset = \Omega^c \in \mathcal{F}$. Closure under countable union plus complement gives closure under countable intersection (de Morgan). And since the union of a finite number of sets is a special case of a countable union (pad with $\emptyset$), $\mathcal{F}$ is also closed under finite unions and finite intersections.

The simplest examples are at the two extremes: the smallest possible sigma-algebra and the largest possible one.

📝 Example 2 (The trivial sigma-algebra)

For any set $\Omega$, the collection $\{\emptyset, \Omega\}$ is a sigma-algebra — the smallest one possible. Closure under complement is immediate ($\emptyset^c = \Omega$ and vice versa), and any countable union of these two sets is again one of these two sets. This sigma-algebra “knows” only whether a point is in $\Omega$ or not — it carries no further information.

📝 Example 3 (The power set)

The full power set $\mathcal{P}(\Omega)$ is a sigma-algebra — the largest one possible. It contains every subset of $\Omega$ and is trivially closed under all set operations. When $\Omega$ is countable, $\mathcal{P}(\Omega)$ is the only sigma-algebra we ever use. But when $\Omega = \mathbb{R}$, we will see that $\mathcal{P}(\mathbb{R})$ is too big — it contains pathological sets that no consistent length function can measure.

📝 Example 4 (A finite sigma-algebra from a partition)

Take $\Omega = \{1, 2, 3, 4\}$ and consider the partition into $\{1, 2\}$ and $\{3, 4\}$. The sigma-algebra generated by this partition is $\mathcal{F} = \{\emptyset, \{1, 2\}, \{3, 4\}, \{1, 2, 3, 4\}\}$.

This collection is closed under complement (the complement of $\{1, 2\}$ is $\{3, 4\}$) and under union (any union of the four sets is again one of the four). Note that $\{1\}$ is not in $\mathcal{F}$ — this sigma-algebra cannot distinguish $1$ from $2$. It encodes precisely the information “which side of the partition does this point belong to?” and nothing more.
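For a finite $\Omega$ the axioms can be verified exhaustively. A small check of this partition example (finite closure suffices here, since a finite $\Omega$ has only finitely many subsets):

```python
Omega = frozenset({1, 2, 3, 4})
F = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), Omega}

assert Omega in F                                  # axiom 1: contains Omega
assert all(Omega - A in F for A in F)              # axiom 2: closed under complement
assert all(A | B in F for A in F for B in F)       # axiom 3 (finite case): unions
assert frozenset({1}) not in F                     # cannot distinguish 1 from 2
print("F is a sigma-algebra on", set(Omega), "with", len(F), "members")
```

The last assertion is the “information” reading in code: no member of $\mathcal{F}$ separates the points $1$ and $2$.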

💡 Remark 1 (Sigma-algebras as 'levels of information')

The previous example hints at an interpretation that becomes essential in probability theory: a sigma-algebra is an abstract description of what information is available. A coarser sigma-algebra (fewer sets) corresponds to less information; a finer sigma-algebra (more sets) corresponds to more information. In the example above, $\mathcal{F}$ knows only “is this point in $\{1, 2\}$ or in $\{3, 4\}$?” but not “is this point exactly $1$?” In probability, a filtration — a nested sequence of sigma-algebras $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$ — models information being revealed over time. This is the foundation of stochastic processes and the source of every “conditional expectation given $\mathcal{F}_t$” in mathematical finance and reinforcement learning.

In practice, we rarely write down a sigma-algebra by listing its members. We specify a smaller collection of “interesting” sets and then close it up under the sigma-algebra axioms.

📐 Definition 2 (Generated sigma-algebra σ(C))

Let $\mathcal{C} \subseteq \mathcal{P}(\Omega)$ be any collection of subsets of $\Omega$. The sigma-algebra generated by $\mathcal{C}$, written $\sigma(\mathcal{C})$, is the smallest sigma-algebra on $\Omega$ that contains $\mathcal{C}$. Equivalently, it is the intersection of all sigma-algebras containing $\mathcal{C}$: $\sigma(\mathcal{C}) = \bigcap \{\mathcal{F} : \mathcal{F} \text{ is a sigma-algebra on } \Omega \text{ with } \mathcal{C} \subseteq \mathcal{F}\}$.

This intersection is over a non-empty family (the power set $\mathcal{P}(\Omega)$ always qualifies) and is itself a sigma-algebra (the intersection of any family of sigma-algebras is a sigma-algebra — check the axioms). So $\sigma(\mathcal{C})$ is well-defined.
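On a finite $\Omega$ the generated sigma-algebra can also be computed directly from the other direction: start from $\mathcal{C}$ and close under complement and union until nothing new appears. A brute-force sketch (valid only in the finite case, where “countable union” reduces to finite union):

```python
def sigma(Omega, C):
    """Sigma-algebra generated by C on a FINITE Omega: close under
    complement and pairwise union until a fixed point is reached."""
    Omega = frozenset(Omega)
    F = {frozenset(), Omega} | {frozenset(A) for A in C}
    while True:
        new = {Omega - A for A in F} | {A | B for A in F for B in F}
        if new <= F:
            return F
        F |= new

F = sigma({1, 2, 3, 4}, [{1, 2}])
print(sorted(map(tuple, map(sorted, F))))   # recovers the four-set algebra of Example 4
```

Starting from the single set $\{1, 2\}$, the closure produces exactly $\{\emptyset, \{1,2\}, \{3,4\}, \Omega\}$ — the fixpoint construction and the intersection definition agree.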

The most important generated sigma-algebra in all of analysis is the Borel sigma-algebra on the real line.

📐 Definition 3 (The Borel sigma-algebra B(ℝ))

The Borel sigma-algebra on $\mathbb{R}$, denoted $\mathcal{B}(\mathbb{R})$, is the sigma-algebra generated by the collection of all open intervals: $\mathcal{B}(\mathbb{R}) = \sigma\left(\{(a, b) : a < b \in \mathbb{R}\}\right)$.

Members of $\mathcal{B}(\mathbb{R})$ are called Borel sets. The collection $\mathcal{B}(\mathbb{R})$ contains every open set (each is a countable union of open intervals), every closed set (complements of open sets), every countable union of closed sets ($F_\sigma$ sets), every countable intersection of open sets ($G_\delta$ sets), and so on through a transfinite hierarchy. In short: every “reasonable” subset of $\mathbb{R}$ that you would write down without using the axiom of choice is Borel.

Equivalently, $\mathcal{B}(\mathbb{R})$ can be generated by any of the following: all open sets, all closed sets, all half-open intervals $(a, b]$, all rays $(-\infty, b]$, or even just the rays $(-\infty, b]$ with $b$ rational (a countable generating family). The choice of generator doesn’t matter — the resulting sigma-algebra is the same.

💡 Remark 2 (Why 'countable' and not 'finite' additivity)

The choice of countable (rather than finite) closure in the sigma-algebra axioms is the central technical decision of measure theory. With only finite closure, you get an algebra of sets — adequate for classical probability of dice and cards, but too weak for analysis. Countable closure is what makes limits behave correctly. For example, the set $\mathbb{Q} \cap [0, 1]$ is the countable union of singletons $\{q\}$ for $q$ rational; in the Borel sigma-algebra it is automatically measurable, and its measure is the countable sum $\sum_q \lambda(\{q\}) = 0$. With only finite additivity, this argument fails — the rationals would slip through the cracks. Countable additivity is exactly the right strength to make a.e. arguments, dominated convergence, and Fubini’s theorem all work.

3. Measures

A sigma-algebra tells us which sets we are allowed to measure. A measure tells us what number to assign to each one.

📐 Definition 4 (Measure on a measurable space)

Let $(\Omega, \mathcal{F})$ be a measurable space. A measure on $(\Omega, \mathcal{F})$ is a function $\mu: \mathcal{F} \to [0, \infty]$ satisfying:

  1. Non-negativity. $\mu(A) \geq 0$ for all $A \in \mathcal{F}$.
  2. Empty set has measure zero. $\mu(\emptyset) = 0$.
  3. Countable additivity. For any countable sequence of pairwise disjoint sets $A_1, A_2, \ldots \in \mathcal{F}$, $\mu\left(\bigcup_{n=1}^\infty A_n\right) = \sum_{n=1}^\infty \mu(A_n)$.

The triple $(\Omega, \mathcal{F}, \mu)$ is called a measure space. When $\mu(\Omega) = 1$, the triple is a probability space and $\mu$ is a probability measure — usually denoted $P$ or $\mathbb{P}$.

Note that measures take values in the extended non-negative reals $[0, \infty]$ — sets of infinite measure are allowed, and their arithmetic obeys the conventions $a + \infty = \infty$ for $a \geq 0$ and $0 \cdot \infty = 0$ (the latter is a convention, not a theorem, but it makes integrals of functions on null sets behave correctly).

The three axioms force a number of properties to hold automatically — properties that mirror the geometric intuition of “area” or “length” in the real line. Together they form the toolkit you’ll use whenever you work with measures.

🔷 Theorem 1 (Basic properties of measures)

Let $(\Omega, \mathcal{F}, \mu)$ be a measure space. Then for all measurable sets $A, B, A_n \in \mathcal{F}$:

  1. Monotonicity. If $A \subseteq B$ then $\mu(A) \leq \mu(B)$.
  2. Subadditivity. $\mu\left(\bigcup_{n=1}^\infty A_n\right) \leq \sum_{n=1}^\infty \mu(A_n)$ (no disjointness required).
  3. Continuity from below. If $A_1 \subseteq A_2 \subseteq \cdots$ is an increasing sequence with union $A = \bigcup_n A_n$, then $\mu(A_n) \to \mu(A)$ as $n \to \infty$.
  4. Continuity from above. If $A_1 \supseteq A_2 \supseteq \cdots$ is a decreasing sequence with intersection $A = \bigcap_n A_n$, and $\mu(A_1) < \infty$, then $\mu(A_n) \to \mu(A)$ as $n \to \infty$.

Proof.

Monotonicity (1). Write $B = A \sqcup (B \setminus A)$ as a disjoint union. Both $A$ and $B \setminus A = B \cap A^c$ are measurable. By countable additivity (with all but two terms equal to $\emptyset$), $\mu(B) = \mu(A) + \mu(B \setminus A) \geq \mu(A)$, since $\mu(B \setminus A) \geq 0$.

Continuity from below (3). Define a disjointified sequence: $B_1 = A_1$ and $B_n = A_n \setminus A_{n-1}$ for $n \geq 2$. Each $B_n$ is measurable (difference of measurable sets), the $B_n$ are pairwise disjoint, and $\bigcup_{k=1}^n B_k = A_n$, $\bigcup_{k=1}^\infty B_k = \bigcup_{k=1}^\infty A_k = A$.

By countable additivity applied to the disjoint $B_k$, $\mu(A) = \sum_{k=1}^\infty \mu(B_k) = \lim_{n \to \infty} \sum_{k=1}^n \mu(B_k) = \lim_{n \to \infty} \mu(A_n)$, where the second equality is the definition of an infinite series and the third equality is finite additivity ($A_n$ is the disjoint union of $B_1, \ldots, B_n$).

Subadditivity (2). Disjointify as in (3): let $B_1 = A_1$ and $B_n = A_n \setminus (A_1 \cup \cdots \cup A_{n-1})$ for $n \geq 2$. Then $B_n \subseteq A_n$ (so $\mu(B_n) \leq \mu(A_n)$ by monotonicity), the $B_n$ are pairwise disjoint, and $\bigcup B_n = \bigcup A_n$. Countable additivity gives $\mu\left(\bigcup_{n} A_n\right) = \mu\left(\bigcup_{n} B_n\right) = \sum_{n} \mu(B_n) \leq \sum_{n} \mu(A_n)$.

Continuity from above (4). This reduces to (3) by taking complements relative to $A_1$. The hypothesis $\mu(A_1) < \infty$ is essential: with Lebesgue measure, $A_n = [n, \infty)$ decreases to the empty set, yet $\mu(A_n) = \infty$ for every $n$, so $\mu(A_n) \not\to \mu(\emptyset) = 0$ — this is the cautionary counter-example discussed in Royden §17.4.
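Monotonicity and subadditivity are easy to sanity-check by brute force in the simplest setting, the counting measure on a finite universe. A property-testing sketch (the universe size, sample sizes, and seed are arbitrary choices):

```python
import random

def mu(A):
    """Counting measure on a finite universe: mu(A) = |A|."""
    return len(A)

random.seed(0)
universe = list(range(50))
for _ in range(1000):
    A = set(random.sample(universe, 10))
    B = A | set(random.sample(universe, 15))     # guarantees A is a subset of B
    parts = [set(random.sample(universe, random.randint(0, 8))) for _ in range(5)]
    assert mu(A) <= mu(B)                                          # monotonicity
    assert mu(set().union(*parts)) <= sum(mu(S) for S in parts)    # subadditivity
print("monotonicity and subadditivity verified on 1000 random instances")
```

Random testing is no proof, of course, but it makes the content of the theorem tangible: overlapping parts are counted once on the left of subadditivity and possibly many times on the right.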

Now for the canonical examples. These three measures are the workhorses of probability and analysis.

📝 Example 5 (Counting measure on ℕ)

Let $\Omega = \mathbb{N}$ and $\mathcal{F} = \mathcal{P}(\mathbb{N})$ (every subset is measurable, since $\mathbb{N}$ is countable). Define $\mu(A) = |A|$, the number of elements of $A$, where $|A| = \infty$ if $A$ is infinite. This is the counting measure. It satisfies all three axioms: $\mu(\emptyset) = 0$, $\mu(A) \geq 0$, and the cardinality of a countable disjoint union is the sum of cardinalities. The counting measure is the foundation of discrete probability — every “$P(X = k)$” is integration against the counting measure.

📝 Example 6 (Dirac measure δₐ)

Fix a point $a \in \Omega$. The Dirac measure at $a$ is $\delta_a(A) = \begin{cases} 1 & \text{if } a \in A, \\ 0 & \text{otherwise.} \end{cases}$

This is a probability measure ($\delta_a(\Omega) = 1$, since $a \in \Omega$). Countable additivity holds because $a$ belongs to at most one set in any disjoint collection. Dirac measures are the building blocks of point masses in probability and the simplest non-trivial example of a measure singular with respect to Lebesgue measure on $\mathbb{R}$.
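A Dirac measure is a one-line function on sets, which makes the additivity argument transparent in code. A sketch (the point $a = 3$ and the disjoint family are arbitrary choices):

```python
def dirac(a):
    """The Dirac measure at a, as a set function."""
    return lambda A: 1 if a in A else 0

delta = dirac(3)
disjoint = [{0, 1}, {2}, {3, 4}, {7}]                # a pairwise disjoint family
union = set().union(*disjoint)
assert delta(set()) == 0
assert delta(union) == sum(delta(A) for A in disjoint) == 1   # additivity
print("delta_3(union) =", delta(union))
```

Exactly one term of the sum is nonzero — the set containing $3$ — which is the whole content of countable additivity for a point mass.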

📝 Example 7 (Lebesgue measure on [0,1] (preview))

On the measurable space $([0, 1], \mathcal{B}([0, 1]))$, there exists a unique measure $\lambda$ such that $\lambda([a, b]) = b - a$ for every interval $[a, b] \subseteq [0, 1]$. This is the Lebesgue measure on the unit interval, and constructing it rigorously is the work of Section 4 (we will need Carathéodory’s extension theorem). For now, observe that $\lambda([0, 1]) = 1$, so $([0, 1], \mathcal{B}([0, 1]), \lambda)$ is a probability space — the uniform distribution on the unit interval.

💡 Remark 3 (Sigma-finite vs. finite measures)

A measure $\mu$ on $(\Omega, \mathcal{F})$ is finite if $\mu(\Omega) < \infty$ and sigma-finite if $\Omega$ can be written as a countable union $\Omega = \bigcup_n \Omega_n$ with $\mu(\Omega_n) < \infty$ for every $n$. Probability measures are finite. Lebesgue measure on $\mathbb{R}$ is not finite ($\lambda(\mathbb{R}) = \infty$) but is sigma-finite — write $\mathbb{R} = \bigcup_n [-n, n]$. Most theorems we care about (Fubini, Radon-Nikodym) assume sigma-finiteness; the non-sigma-finite case is where measure theory gets pathological. Throughout this topic, every measure is sigma-finite unless we explicitly say otherwise.

4. Lebesgue Measure on ℝ

We now construct the most important measure on the real line: the one that extends our intuitive notion of “length.” The construction is due to Lebesgue (1902) and Carathéodory (1914), and it proceeds in three stages: first define an outer measure on every subset of $\mathbb{R}$, then identify which subsets are measurable under that outer measure, and finally restrict to the measurable subsets to get a genuine countably additive measure.

📐 Definition 5 (Lebesgue outer measure λ*(A))

For any subset $A \subseteq \mathbb{R}$, the Lebesgue outer measure of $A$ is $\lambda^*(A) \;=\; \inf\left\{\sum_{k=1}^\infty (b_k - a_k) \;:\; A \subseteq \bigcup_{k=1}^\infty (a_k, b_k)\right\}$.

The infimum is taken over all countable covers of $A$ by open intervals, and the quantity being minimized is the total length of the cover. Outer measure is defined for every subset of $\mathbb{R}$ — but as we will see, $\lambda^*$ alone is not countably additive on the full power set, so we need to restrict.

Some basic properties drop out immediately: $\lambda^*(\emptyset) = 0$ (use the empty cover), $\lambda^*$ is monotone ($A \subseteq B \Rightarrow \lambda^*(A) \leq \lambda^*(B)$), $\lambda^*$ is translation-invariant ($\lambda^*(A + t) = \lambda^*(A)$ — translating a cover gives a cover of the translate with the same total length), and $\lambda^*$ is countably subadditive ($\lambda^*(\bigcup_n A_n) \leq \sum_n \lambda^*(A_n)$). What we don’t yet have is countable additivity on disjoint sets — and that requires choosing the right sigma-algebra to restrict to.

The crucial insight is Carathéodory’s: a set $A$ is “well-behaved” with respect to $\lambda^*$ if it cleanly splits any other set $E$ into two pieces whose outer measures add up.

🔷 Theorem 2 (Carathéodory's extension theorem (statement))

A set $A \subseteq \mathbb{R}$ is Carathéodory-measurable (with respect to $\lambda^*$) if for every test set $E \subseteq \mathbb{R}$, $\lambda^*(E) \;=\; \lambda^*(E \cap A) + \lambda^*(E \cap A^c)$.

Let $\mathcal{L}(\mathbb{R})$ denote the collection of all such measurable sets. Then:

  1. $\mathcal{L}(\mathbb{R})$ is a sigma-algebra containing every Borel set: $\mathcal{B}(\mathbb{R}) \subseteq \mathcal{L}(\mathbb{R})$.
  2. The restriction $\lambda = \lambda^*|_{\mathcal{L}(\mathbb{R})}$ is a countably additive measure on $\mathcal{L}(\mathbb{R})$.
  3. $\lambda([a, b]) = b - a$ for every interval, recovering the geometric notion of length.

This restriction $\lambda$ is the Lebesgue measure on $\mathbb{R}$, and $\mathcal{L}(\mathbb{R})$ is the Lebesgue sigma-algebra.

Proof.

The full proof is long — about ten pages in Royden — and we will not reproduce it here. The structural ingredients are:

  • $\mathcal{L}(\mathbb{R})$ contains $\emptyset$ (trivially) and is closed under complement (the Carathéodory criterion is symmetric in $A$ and $A^c$).
  • Closure under finite union follows from a careful inclusion-exclusion argument on the criterion.
  • Closure under countable union uses both subadditivity of $\lambda^*$ and continuity arguments to upgrade finite unions to countable ones.
  • The fact that $\lambda^*$ is countably additive on $\mathcal{L}(\mathbb{R})$ — the punchline — uses subadditivity in one direction and the Carathéodory criterion applied to disjoint sets in the other.
  • The inclusion $\mathcal{B}(\mathbb{R}) \subseteq \mathcal{L}(\mathbb{R})$ comes from showing every interval is Carathéodory-measurable, then using closure under countable operations to extend to all Borel sets.

For the complete argument, see Royden §2.4 or Folland §1.4. The takeaway for this topic is that $\mathcal{L}(\mathbb{R})$ is a strictly larger sigma-algebra than $\mathcal{B}(\mathbb{R})$ — every Borel set is Lebesgue-measurable, but there exist Lebesgue-measurable sets that are not Borel (a cardinality argument shows this: every subset of the measure-zero Cantor set is Lebesgue-measurable, giving $2^{\mathfrak{c}}$ measurable sets, while there are only $\mathfrak{c}$ Borel sets). Both are smaller than $\mathcal{P}(\mathbb{R})$, which contains the non-measurable Vitali set we will construct in Section 8.

📐 Definition 6 (Lebesgue measurable set)

A set $A \subseteq \mathbb{R}$ is Lebesgue-measurable if it satisfies the Carathéodory criterion: for every $E \subseteq \mathbb{R}$, $\lambda^*(E) = \lambda^*(E \cap A) + \lambda^*(E \cap A^c)$. The collection of all such sets is the Lebesgue sigma-algebra $\mathcal{L}(\mathbb{R})$, and $\lambda = \lambda^*|_{\mathcal{L}(\mathbb{R})}$ is the Lebesgue measure on $\mathbb{R}$.

The next theorem captures the key invariance property — Lebesgue measure does not see translations. We give the full proof because it illustrates exactly how Carathéodory measurability propagates from $\lambda^*$ to $\lambda$.

🔷 Theorem 3 (Translation-invariance of Lebesgue measure)

For every $A \in \mathcal{L}(\mathbb{R})$ and every $t \in \mathbb{R}$, the translate $A + t = \{a + t : a \in A\}$ is also Lebesgue-measurable, and $\lambda(A + t) = \lambda(A)$.

Proof.

We split the proof into two parts: first that $\lambda^*$ is translation-invariant on every subset of $\mathbb{R}$, then that $\mathcal{L}(\mathbb{R})$ is closed under translation.

Step 1: $\lambda^*(A + t) = \lambda^*(A)$ for every $A \subseteq \mathbb{R}$.

Let $\{(a_k, b_k)\}_{k=1}^\infty$ be any countable open cover of $A$. Then $\{(a_k + t, b_k + t)\}_{k=1}^\infty$ is a countable open cover of $A + t$ with the same total length: $\sum_{k=1}^\infty ((b_k + t) - (a_k + t)) = \sum_{k=1}^\infty (b_k - a_k)$.

Taking the infimum over all covers of $A$ on the left and noting that every cover of $A$ produces a cover of $A + t$ of equal length (and vice versa, by translating by $-t$), we get $\lambda^*(A + t) = \lambda^*(A)$. So $\lambda^*$ is translation-invariant on the full power set.

Step 2: If $A \in \mathcal{L}(\mathbb{R})$, then $A + t \in \mathcal{L}(\mathbb{R})$.

We must show that $A + t$ satisfies the Carathéodory criterion: for every $E \subseteq \mathbb{R}$, $\lambda^*(E) = \lambda^*(E \cap (A + t)) + \lambda^*(E \cap (A + t)^c)$.

Apply Step 1 to translate $E$ by $-t$: $\lambda^*(E) = \lambda^*(E - t)$.

Now $E - t$ can be split using the measurability of $A$ (which we are given): $\lambda^*(E - t) = \lambda^*((E - t) \cap A) + \lambda^*((E - t) \cap A^c)$.

We translate each piece on the right side back by $+t$. Note that $(E - t) \cap A$ translated by $+t$ gives $E \cap (A + t)$, and $(E - t) \cap A^c$ translated by $+t$ gives $E \cap (A + t)^c$. By Step 1 again, translation does not change outer measure: $\lambda^*((E - t) \cap A) = \lambda^*(E \cap (A + t))$ and $\lambda^*((E - t) \cap A^c) = \lambda^*(E \cap (A + t)^c)$.

Substituting back: $\lambda^*(E) = \lambda^*(E \cap (A + t)) + \lambda^*(E \cap (A + t)^c)$.

This is exactly the Carathéodory criterion for $A + t$, so $A + t \in \mathcal{L}(\mathbb{R})$. Step 1 then gives $\lambda(A + t) = \lambda^*(A + t) = \lambda^*(A) = \lambda(A)$.

📝 Example 8 (λ([a,b]) = b - a — recovering interval length)

Let $[a, b] \subset \mathbb{R}$ with $a < b$. The interval can be covered by $(a - \varepsilon, b + \varepsilon)$ for any $\varepsilon > 0$, so $\lambda^*([a, b]) \leq (b - a) + 2\varepsilon$. Taking $\varepsilon \to 0$ gives $\lambda^*([a, b]) \leq b - a$. The reverse inequality $\lambda^*([a, b]) \geq b - a$ is the non-trivial direction — it requires showing that no countable open cover of $[a, b]$ can have total length less than $b - a$. The argument uses the Heine-Borel theorem (compactness of $[a, b]$, from Topic 3) to extract a finite sub-cover, then a clean overlapping-intervals argument to bound the total length below by $b - a$. Combined with the fact that $[a, b]$ is Borel and hence Lebesgue-measurable, we get $\lambda([a, b]) = b - a$.

📝 Example 9 (λ(ℚ ∩ [0,1]) = 0 — the rationals have measure zero)

The rationals in $[0, 1]$ form a countable set: $\mathbb{Q} \cap [0, 1] = \{q_1, q_2, q_3, \ldots\}$. For any $\varepsilon > 0$, cover the $n$-th rational $q_n$ by the open interval $(q_n - \varepsilon/2^{n+1}, q_n + \varepsilon/2^{n+1})$, which has length $\varepsilon/2^n$. The total length of this cover is $\sum_{n=1}^\infty \frac{\varepsilon}{2^n} = \varepsilon$.

So $\lambda^*(\mathbb{Q} \cap [0, 1]) \leq \varepsilon$ for every $\varepsilon > 0$, hence $\lambda^*(\mathbb{Q} \cap [0, 1]) = 0$. Since $\mathbb{Q} \cap [0, 1]$ is Borel (a countable union of singletons), it is Lebesgue-measurable, so $\lambda(\mathbb{Q} \cap [0, 1]) = 0$.

This is the precise statement of “the rationals are negligible.” Combined with $\lambda([0, 1]) = 1$, we get that the irrationals in $[0, 1]$ have measure $1$ — almost every point in the unit interval is irrational. The Dirichlet function $\mathbf{1}_\mathbb{Q}$ is therefore zero almost everywhere with respect to Lebesgue measure, and its Lebesgue integral is $0$ — exactly what our intuition wanted in Section 1.
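The $\varepsilon$-cover from the example can be built explicitly for a finite prefix of the enumeration. A sketch in exact rational arithmetic (the denominator bound `D` is an arbitrary cutoff; here the enumeration index starts at $0$, so the $n$-th interval has length $\varepsilon/2^{n+1}$ and the finite total stays strictly below $\varepsilon$):

```python
from fractions import Fraction

D = 30   # arbitrary cutoff: rationals in [0, 1] with denominator <= D
rationals = sorted({Fraction(p, q) for q in range(1, D + 1) for p in range(q + 1)})

eps = Fraction(1, 100)
# n-th interval is centred at q_n with length eps / 2^(n+1), n = 0, 1, 2, ...
cover = [(q - eps / 2**(n + 2), q + eps / 2**(n + 2))
         for n, q in enumerate(rationals)]

total = sum(b - a for a, b in cover)
assert all(a < q < b for (a, b), q in zip(cover, rationals))   # each q_n is covered
assert total < eps                                             # geometric tail < eps
print(len(rationals), "rationals covered; total length", float(total), "<", float(eps))
```

Shrinking `eps` makes the same construction work with arbitrarily small total length — that is exactly the statement $\lambda^*(\mathbb{Q} \cap [0,1]) = 0$, restricted to a finite prefix that a computation can touch.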

💡 Remark 4 (Lebesgue measure extends the Riemann integral)

A bounded function $f: [a, b] \to \mathbb{R}$ is Riemann-integrable if and only if its set of discontinuities has Lebesgue measure zero (Lebesgue’s criterion for Riemann integrability). For every Riemann-integrable $f$, the Lebesgue integral exists and equals the Riemann integral. So Lebesgue integration is a strict extension of Riemann integration — it agrees with the old theory wherever the old theory applies, and assigns sensible values in many cases where the old theory fails (like the Dirichlet function).

Outer measure covering: progressively refined interval covers of [0.3, 0.7]

Σ |cover intervals| = 0.5600 · Exact measure of A = [0.3, 0.7]: λ(A) = 0.4000 · Gap = 0.1600

Drag the endpoints (or the bar bodies) to reshape the intervals. The Lebesgue outer measure λ*(A) is the infimum of the total length over all countable open covers of A — so when the intervals cover A, their total length is at least λ(A) = 0.4. If you drag them so they no longer cover A, that lower bound no longer applies to the current arrangement. As you tighten a valid cover, the gap shrinks toward zero. Try "Optimize cover" to snap to a near-optimal arrangement.

5. The Cantor Set and Measure-Zero Surprises

The Cantor set is the single most important example in measure theory. It demonstrates that a set can be uncountable, closed, totally disconnected, and have Lebesgue measure zero — four properties that look incompatible until you see them coexist. The Cantor set is also the canonical example of a “fractal” — the construction (published by Cantor in 1883) predates Mandelbrot’s coinage of the term by nearly a century.

📐 Definition 7 (The Cantor middle-thirds set C)

Define a sequence of sets recursively. Start with $C_0 = [0, 1]$. To form $C_{n+1}$, remove the open middle third of every interval in $C_n$:
$$C_1 = [0, 1/3] \cup [2/3, 1],$$
$$C_2 = [0, 1/9] \cup [2/9, 3/9] \cup [6/9, 7/9] \cup [8/9, 1],$$
and so on. The Cantor set is the intersection $C = \bigcap_{n=0}^\infty C_n$.

After $n$ iterations, $C_n$ is a disjoint union of $2^n$ intervals each of length $(1/3)^n$, so the total length of $C_n$ is $(2/3)^n$. As $n \to \infty$, this length tends to zero — but the intersection $C$ is nevertheless non-empty and, as we will prove, uncountable.
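The interval counts and lengths are easy to verify numerically. A minimal sketch (hypothetical helper `cantor_step`, exact arithmetic via `fractions`) that checks the $(2/3)^n$ length formula through five iterations:

```python
from fractions import Fraction

def cantor_step(intervals):
    """Remove the open middle third of each interval (a, b)."""
    out = []
    for a, b in intervals:
        third = (b - a) / 3
        out.append((a, a + third))   # left closed third
        out.append((b - third, b))   # right closed third
    return out

intervals = [(Fraction(0), Fraction(1))]   # C_0 = [0, 1]
for n in range(1, 6):
    intervals = cantor_step(intervals)
    total = sum(b - a for a, b in intervals)
    assert total == Fraction(2, 3) ** n    # total length of C_n is (2/3)^n

print(len(intervals))  # 32 intervals at n = 5, total length (2/3)^5 ≈ 0.1317
```

Exact rational arithmetic makes the check an identity rather than a floating-point approximation.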

🔷 Theorem 4 (The Cantor set is uncountable and has Lebesgue measure zero)

The Cantor set $C$ is a closed subset of $[0, 1]$ that is:

  1. Uncountable — there is a bijection $C \leftrightarrow \{0, 1\}^\mathbb{N}$, so $|C| = 2^{\aleph_0}$, the cardinality of the continuum.
  2. Of Lebesgue measure zero — $\lambda(C) = 0$.

Proof.

Measure zero. Since each $C_n$ has Lebesgue measure $(2/3)^n$ and $C \subseteq C_n$ for every $n$, monotonicity of $\lambda$ gives
$$0 \leq \lambda(C) \leq \lambda(C_n) = (2/3)^n.$$

Since $(2/3)^n \to 0$ as $n \to \infty$, the squeeze gives $\lambda(C) = 0$.

Uncountability. We use the ternary expansion characterization of $C$. Every $x \in [0, 1]$ has a base-3 (ternary) expansion
$$x = 0.d_1 d_2 d_3 \ldots_{(3)} = \sum_{k=1}^\infty \frac{d_k}{3^k}, \quad d_k \in \{0, 1, 2\}.$$

The ternary expansion is unique except at terminating expansions, where the same number has two representations (analogous to $0.999\ldots = 1$ in base 10). For terminating expansions we adopt the convention of choosing the expansion that ends in all $2$’s when possible — this resolves the ambiguity.

The middle third $(1/3, 2/3)$ removed at step $1$ is precisely the set of $x$ whose ternary expansion must have $d_1 = 1$. (The boundary points survive: $1/3 = 0.0\overline{2}_{(3)}$ and $2/3 = 0.2_{(3)}$ have expansions with $d_1 \in \{0, 2\}$.) So $C_1$ consists of points with $d_1 \in \{0, 2\}$. By induction, $C_n$ consists of points whose first $n$ ternary digits are all in $\{0, 2\}$. Therefore
$$C = \{x \in [0, 1] : x \text{ has a ternary expansion with all digits } d_k \in \{0, 2\}\}.$$

The map $\phi: C \to \{0, 1\}^\mathbb{N}$ defined by sending $0.d_1 d_2 \ldots_{(3)}$ to the binary sequence $(d_1/2, d_2/2, d_3/2, \ldots)$ is a bijection. (It is well-defined and injective by the uniqueness of the chosen ternary expansion; it is surjective because every binary sequence corresponds to a valid ternary string in $\{0, 2\}^\mathbb{N}$.)

By Cantor’s diagonal argument (Topic 3), $\{0, 1\}^\mathbb{N}$ is uncountable, with cardinality $2^{\aleph_0}$ — the cardinality of the continuum. Therefore $|C| = 2^{\aleph_0}$, and $C$ is uncountable.
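The digit bijection can be exercised at finite depth. A small sketch (hypothetical helper `cantor_point`) maps binary strings $(b_1, \ldots, b_n)$ to Cantor-set points via $d_k = 2 b_k$ and confirms injectivity on all $2^{10}$ strings of length 10:

```python
from fractions import Fraction
from itertools import product

def cantor_point(bits):
    """Map a finite binary string (b_1, ..., b_n) to Σ 2·b_k / 3^k ∈ C."""
    return sum(Fraction(2 * b, 3 ** k) for k, b in enumerate(bits, start=1))

# Distinct binary strings map to distinct points — the finite-depth
# shadow of the bijection C ↔ {0,1}^ℕ.
n = 10
points = {cantor_point(bits) for bits in product((0, 1), repeat=n)}
assert len(points) == 2 ** n   # 1024 distinct Cantor-set points
```

Injectivity holds because ternary expansions using only the digits $\{0, 2\}$ are unique.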

📝 Example 10 (Fat Cantor sets with positive measure)

The standard middle-thirds Cantor set has measure zero because we remove a constant fraction $1/3$ at every step. If instead we remove a shrinking fraction at each step, we get a Cantor-like set with positive measure. Specifically, at step $k \geq 1$, remove a centered open interval of length $1/4^k$ from each of the $2^{k-1}$ remaining intervals. The total length removed is
$$\sum_{k=1}^\infty 2^{k-1} \cdot \frac{1}{4^k} \;=\; \sum_{k=1}^\infty \frac{1}{2^{k+1}} \;=\; \frac{1}{2}.$$

So the limiting set, the Smith–Volterra–Cantor set (or “fat Cantor set”), has Lebesgue measure $1 - 1/2 = 1/2$. It is closed, nowhere dense, and uncountable — exactly like the standard Cantor set — but it has positive measure. This shows that “Cantor-like” does not imply “measure zero”: uncountability and measure zero are independent properties, and the standard middle-thirds set is a particular case where they happen to coincide.
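The removed-length series can be checked in exact arithmetic. A sketch of the partial sums, which approach $1/2$ geometrically:

```python
from fractions import Fraction

# Total length removed in the Smith–Volterra–Cantor construction:
# at step k, remove 2^(k-1) centered intervals of length 1/4^k each.
removed = sum(Fraction(2 ** (k - 1), 4 ** k) for k in range(1, 60))

# Each term is 1/2^(k+1), so the partial sum through k = 59 is
# 1/2 - 1/2^60; the limiting set keeps measure 1 - 1/2 = 1/2.
assert Fraction(1, 2) - removed == Fraction(1, 2 ** 60)
```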

💡 Remark 5 (The Cantor set is compact and totally disconnected)

$C$ is closed (it is the intersection of the closed sets $C_n$) and bounded (a subset of $[0, 1]$), so by Heine–Borel from Topic 3, $C$ is compact. It is also totally disconnected: between any two distinct points $x, y \in C$ there is a deleted middle interval at some stage of the construction, so the connected components of $C$ are single points. A non-empty compact totally disconnected metric space without isolated points is called a Cantor space, and the Cantor set is the prototype. (Any two such spaces are homeomorphic — the Cantor set is, up to homeomorphism, the Cantor space.)

Cantor set construction: iterations of the middle-thirds set. At every step, remove the open middle $1/3$ of each remaining interval; after $n = 5$ iterations there are $2^5 = 32$ intervals of total length $(2/3)^5 \approx 0.1317$, tending to $0$ as $n \to \infty$ — yet the limiting set is uncountable, since every ternary expansion in $\{0, 2\}^\mathbb{N}$ gives a point in $C$.

6. Measurable Functions

Measurable functions are the class of functions for which the Lebesgue integral will be defined. The definition is suspiciously simple: a function is measurable if the preimage of every Borel set is measurable.

📐 Definition 8 (Measurable function)

Let $(\Omega, \mathcal{F})$ be a measurable space. A function $f: \Omega \to \mathbb{R}$ is $\mathcal{F}$-measurable (or just measurable when the sigma-algebra is clear from context) if for every Borel set $B \in \mathcal{B}(\mathbb{R})$,
$$f^{-1}(B) = \{\omega \in \Omega : f(\omega) \in B\} \in \mathcal{F}.$$

Equivalently — and this is usually how you check it in practice — $f$ is measurable if and only if $f^{-1}((-\infty, t]) \in \mathcal{F}$ for every $t \in \mathbb{R}$. The “if” direction follows from the fact that the half-rays $\{(-\infty, t] : t \in \mathbb{R}\}$ generate $\mathcal{B}(\mathbb{R})$, and preimages commute with countable set operations.

The two universal examples are the continuous functions and the indicator functions.

📝 Example 11 (Continuous functions are Borel-measurable)

If $f: \mathbb{R} \to \mathbb{R}$ is continuous and $U \subseteq \mathbb{R}$ is open, then $f^{-1}(U)$ is open, hence Borel, hence Lebesgue-measurable. Since the open sets generate $\mathcal{B}(\mathbb{R})$ and preimages preserve countable set operations, $f^{-1}(B) \in \mathcal{B}(\mathbb{R})$ for every Borel $B$. So every continuous function is Borel-measurable. This is why the integral of a continuous function on a compact interval — the bread and butter of single-variable calculus — is just a special case of the Lebesgue integral.

📝 Example 12 (The Dirichlet function is Borel-measurable)

The indicator $\mathbf{1}_\mathbb{Q}: [0, 1] \to \{0, 1\}$ takes only two values, so its preimage of any Borel set $B$ is one of $\emptyset$, $\mathbb{Q} \cap [0, 1]$, $[0, 1] \setminus \mathbb{Q}$, or $[0, 1]$, depending on which of $0$ and $1$ lie in $B$. All four sets are Borel ($\mathbb{Q} \cap [0, 1]$ is a countable union of singletons), so $\mathbf{1}_\mathbb{Q}$ is Borel-measurable. Combined with our earlier observation that $\lambda(\mathbb{Q} \cap [0, 1]) = 0$, this is the foundation for showing $\int \mathbf{1}_\mathbb{Q} \, d\lambda = 0$ — once we have built the integral in Topic 26.

The Lebesgue integral of a measurable function is built up in two steps. First we define the integral for “simple functions” — finite linear combinations of indicators — and then we extend to general non-negative measurable functions by approximating from below.

📐 Definition 9 (Simple function)

A measurable function $s: \Omega \to \mathbb{R}$ is simple if it takes only finitely many values. Equivalently, $s$ has a representation $s = \sum_{k=1}^n c_k \cdot \mathbf{1}_{A_k}$ where $c_1, \ldots, c_n$ are distinct real numbers and $A_1, \ldots, A_n$ is a finite measurable partition of $\Omega$ (each $A_k$ is measurable, the $A_k$ are pairwise disjoint, and $\Omega = \bigsqcup_k A_k$). The integral of a non-negative simple function with respect to a measure $\mu$ is defined to be
$$\int s \, d\mu = \sum_{k=1}^n c_k \cdot \mu(A_k).$$

(With the convention $0 \cdot \infty = 0$, this is well-defined even when some $c_k = 0$ but $\mu(A_k) = \infty$.)
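The definition is a finite sum, so it computes directly. A sketch with a hypothetical three-interval partition of $[0, 1]$ under Lebesgue measure, where $\lambda(A_k)$ is just interval length:

```python
# A simple function s = Σ c_k · 1_{A_k} on [0, 1] with Lebesgue measure.
# The partition and values below are an illustrative example.
partition = [(0.0, 0.2), (0.2, 0.7), (0.7, 1.0)]   # A_1, A_2, A_3
values = [3.0, 1.0, 2.0]                           # c_1, c_2, c_3

# ∫ s dλ = Σ c_k · λ(A_k), with λ(A_k) = length of the interval
integral = sum(c * (b - a) for c, (a, b) in zip(values, partition))
print(round(integral, 10))  # 3·0.2 + 1·0.5 + 2·0.3 = 1.7
```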

The next theorem is the engine of Lebesgue integration: every non-negative measurable function is the pointwise limit of an increasing sequence of simple functions. Once you have this, the integral of a general non-negative measurable function can be defined as the limit of the integrals of the approximating simple functions.

🔷 Theorem 5 (Simple function approximation (dyadic construction))

Let $f: \Omega \to [0, \infty]$ be a non-negative measurable function. Then there exists a sequence $(s_n)_{n=1}^\infty$ of non-negative simple measurable functions such that:

  1. $s_1 \leq s_2 \leq s_3 \leq \cdots$ (the sequence is monotone increasing).
  2. $s_n(\omega) \to f(\omega)$ for every $\omega \in \Omega$.
  3. The convergence is uniform on every set where $f$ is bounded.

Proof.

We construct $s_n$ explicitly via dyadic slicing of the range. For each $n \geq 1$, partition the interval $[0, n)$ into $n \cdot 2^n$ dyadic subintervals of length $2^{-n}$:
$$[0, 2^{-n}), \quad [2^{-n}, 2 \cdot 2^{-n}), \quad \ldots, \quad [(n \cdot 2^n - 1) \cdot 2^{-n}, \, n) = [n - 2^{-n}, \, n).$$

Define
$$s_n(\omega) = \begin{cases} k \cdot 2^{-n} & \text{if } k \cdot 2^{-n} \leq f(\omega) < (k + 1) \cdot 2^{-n} \text{ for some } k = 0, 1, \ldots, n \cdot 2^n - 1, \\ n & \text{if } f(\omega) \geq n. \end{cases}$$

Each $s_n$ is a finite sum of step values times indicators of preimage sets, so it is a simple function. Each preimage set $\{f \in [k \cdot 2^{-n}, (k+1) \cdot 2^{-n})\}$ is measurable (preimage of a Borel set under a measurable function), so $s_n$ is measurable.

Monotonicity ($s_n \leq s_{n+1}$). Suppose $s_n(\omega) = k \cdot 2^{-n}$ for some $k < n \cdot 2^n$. Then $f(\omega) \in [k \cdot 2^{-n}, (k+1) \cdot 2^{-n})$. The next-finer dyadic partition splits this interval in half: $f(\omega)$ lies in either $[2k \cdot 2^{-(n+1)}, (2k + 1) \cdot 2^{-(n+1)})$ or $[(2k + 1) \cdot 2^{-(n+1)}, (2k + 2) \cdot 2^{-(n+1)})$. In the first case $s_{n+1}(\omega) = 2k \cdot 2^{-(n+1)} = k \cdot 2^{-n} = s_n(\omega)$. In the second case $s_{n+1}(\omega) = (2k + 1) \cdot 2^{-(n+1)} > s_n(\omega)$. Either way $s_n(\omega) \leq s_{n+1}(\omega)$. The case where $s_n(\omega) = n$ (the “cap”) is handled similarly: increasing $n$ either keeps $s_n$ at the cap or moves it into a finer dyadic level above $n$.

Pointwise convergence ($s_n(\omega) \to f(\omega)$). Fix $\omega$. If $f(\omega) < \infty$, choose $n_0$ so large that $n_0 > f(\omega)$. For all $n \geq n_0$, $f(\omega) \in [k_n \cdot 2^{-n}, (k_n + 1) \cdot 2^{-n})$ for some $k_n$ in the dyadic partition, so
$$s_n(\omega) = k_n \cdot 2^{-n} \in (f(\omega) - 2^{-n}, \, f(\omega)].$$

Hence $|f(\omega) - s_n(\omega)| < 2^{-n} \to 0$. If $f(\omega) = \infty$, then $f(\omega) > n$ for every $n$, so $s_n(\omega) = n \to \infty = f(\omega)$.

Uniform convergence on bounded sets. If $|f| \leq M$ on a set $E \subseteq \Omega$, then for $n > M$ we have $f(\omega) < n$ for all $\omega \in E$, and the bound $|f(\omega) - s_n(\omega)| < 2^{-n}$ holds uniformly in $\omega \in E$. So $s_n \to f$ uniformly on $E$.
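The dyadic construction is one line of code: below the cap, $s_n = \lfloor f \cdot 2^n \rfloor / 2^n$. A sketch (hypothetical helper `s_n`) that checks monotonicity and the $2^{-n}$ uniform error bound on a bounded target:

```python
import numpy as np

def s_n(f_vals, n):
    """Dyadic simple-function approximation from Theorem 5:
    floor(f · 2^n) / 2^n below the cap, and n at or above it."""
    return np.where(f_vals >= n, float(n), np.floor(f_vals * 2 ** n) / 2 ** n)

x = np.linspace(0.0, 1.0, 1001)
f = np.exp(x)                        # a bounded non-negative target, max e < 3

approx = [s_n(f, n) for n in range(1, 12)]
# Monotone: s_n ≤ s_{n+1} pointwise.
assert all(np.all(a <= b) for a, b in zip(approx, approx[1:]))
# Uniform convergence on bounded sets: error < 2^{-n} once n > max f.
assert np.max(f - approx[-1]) < 2 ** -11
```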

💡 Remark 6 (Simple function approximation partitions the range)

Theorem 5 is the constructive heart of Lebesgue integration. Notice what it does: it slices the range of $f$ into dyadic levels and uses the preimages of those levels as the building blocks for simple functions. This is the exact opposite of the Riemann strategy, which slices the domain into uniform subintervals. The Riemann approach fails on the Dirichlet function because no domain partition can separate rationals from irrationals; the Lebesgue approach succeeds because the range of $\mathbf{1}_\mathbb{Q}$ has only two values and their preimages ($\mathbb{Q} \cap [0, 1]$ and its complement) are both Borel-measurable. The “partition the range” insight is the single most important conceptual move in measure theory, and it is what makes the Lebesgue integral strictly more powerful than the Riemann integral.

Simple function approximation: dyadic staircases converging to the target function. Slice the $y$-axis into $2^n$ levels; each level’s preimage $A_k = \{x : f(x) \in [k \cdot 2^{-n}, (k+1) \cdot 2^{-n})\}$ carries the value $c_k = k \cdot 2^{-n}$, and $s = \sum_k c_k \cdot \mathbf{1}_{A_k}$ is the largest dyadic step function below $f$. As the level count increases, $s$ converges to $f$ from below — the constructive heart of Theorem 5.

7. Null Sets and “Almost Everywhere”

Once we have measures, we can make precise the language of “ignoring negligible sets” — the language of statements that hold “almost everywhere” or “with probability 1.” This is the vocabulary every probability theorem uses.

📐 Definition 10 (Null set (measure-zero set))

Let $(\Omega, \mathcal{F}, \mu)$ be a measure space. A set $N \in \mathcal{F}$ is a null set (or $\mu$-null set) if $\mu(N) = 0$. Examples on $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda)$: every singleton $\{x\}$, every countable set (such as $\mathbb{Q}$), and the Cantor set $C$.

📐 Definition 11 (Almost everywhere (a.e.))

A property $P(\omega)$ is said to hold almost everywhere with respect to $\mu$ — written $\mu$-a.e. or simply a.e. when the measure is clear — if the set of $\omega$ for which $P(\omega)$ fails is contained in a null set; when that failure set is itself measurable, this means
$$\mu(\{\omega \in \Omega : P(\omega) \text{ does not hold}\}) = 0.$$

In probability, where $\mu = P$ is a probability measure, “almost everywhere” is usually called almost surely, abbreviated a.s. or “with probability $1$.” When you read “the gradient flow converges to a critical point with probability 1,” the “with probability 1” is the measure-theoretic statement that the bad set has Lebesgue measure zero.

📝 Example 13 (The Dirichlet function is zero almost everywhere)

On $([0, 1], \mathcal{B}([0, 1]), \lambda)$ the rationals form a null set ($\lambda(\mathbb{Q} \cap [0, 1]) = 0$, by Example 9). So
$$\mathbf{1}_\mathbb{Q}(x) = 0 \text{ for almost every } x \in [0, 1],$$
and the almost-everywhere equivalence class of $\mathbf{1}_\mathbb{Q}$ is the same as that of the constant zero function. Once we have the Lebesgue integral, this will give $\int_0^1 \mathbf{1}_\mathbb{Q} \, d\lambda = 0$ — exactly the answer our intuition demanded back in Section 1.

📝 Example 14 (The Cantor function (devil's staircase))

There exists a function $\phi: [0, 1] \to [0, 1]$, called the Cantor function or devil’s staircase, with the following remarkable properties:

  1. $\phi$ is continuous.
  2. $\phi$ is non-decreasing, with $\phi(0) = 0$ and $\phi(1) = 1$.
  3. $\phi$ is differentiable almost everywhere with $\phi'(x) = 0$ a.e.

The construction is iterative: $\phi$ is constant on every removed middle third of the Cantor set construction (taking the value $1/2$ on $(1/3, 2/3)$, the values $1/4$ and $3/4$ on the next two removed middles, and so on). On the Cantor set itself, $\phi$ is defined by a base-$3$-to-base-$2$ digit substitution: replace each ternary digit $2$ by $1$ and read the result in binary.

The “missing increase” paradox is striking: $\phi$ goes from $0$ to $1$ continuously, yet its derivative is $0$ almost everywhere. All of the “increase” happens on the Cantor set $C$, which has measure zero. The fundamental theorem of calculus fails for $\phi$:
$$\phi(1) - \phi(0) = 1 \;\neq\; 0 \;=\; \int_0^1 \phi'(x) \, dx.$$

The FTC requires absolute continuity, and the Cantor function is the canonical example of a continuous, non-decreasing function that is not absolutely continuous. This is exactly the kind of pathology that motivates measure theory — the Riemann picture cannot detect the difference, but the measure-theoretic picture can.
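The digit-substitution rule translates directly into code. A sketch (hypothetical `cantor_phi`, truncating at finite ternary depth): read ternary digits of $x$ until a $1$ appears (a removed middle third, where $\phi$ is constant), converting $2$s to binary $1$s along the way.

```python
def cantor_phi(x, depth=40):
    """Approximate the Cantor function via the base-3 → base-2 digit rule."""
    if x >= 1.0:
        return 1.0
    result, scale = 0.0, 0.5
    for _ in range(depth):
        x *= 3
        digit = int(x)
        x -= digit
        if digit == 1:              # x fell in a removed middle third:
            return result + scale   # φ is constant there, value ...1000₂
        result += scale * (digit // 2)   # ternary 2 becomes binary 1
        scale /= 2
    return result

assert cantor_phi(0.0) == 0.0 and cantor_phi(1.0) == 1.0
assert cantor_phi(0.5) == 0.5            # constant value 1/2 on (1/3, 2/3)
assert abs(cantor_phi(0.25) - 1/3) < 1e-9   # 1/4 = 0.0202…₃ maps to 0.0101…₂ = 1/3
```

The endpoints confirm the paradox numerically: $\phi$ climbs from $0$ to $1$ even though every sampled slope away from $C$ is zero.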

💡 Remark 7 ('Almost everywhere' is relative to a measure)

The property “$\mu$-almost everywhere” depends entirely on which measure $\mu$ you have in mind. The set $\{1\} \subset \mathbb{R}$ has Lebesgue measure zero (so $\mathbf{1}_{\{1\}} = 0$ Lebesgue-a.e.), but it has counting measure $1$ (so $\mathbf{1}_{\{1\}} \neq 0$ counting-a.e.). When you read or write “a.e.” in measure theory, always know which measure you mean — switching measures changes which sets are negligible.

📐 Definition 12 (Complete measure)

A measure $\mu$ on $(\Omega, \mathcal{F})$ is complete if every subset of every null set is itself measurable (and therefore also has measure zero). Equivalently: if $N \in \mathcal{F}$ with $\mu(N) = 0$ and $A \subseteq N$, then $A \in \mathcal{F}$. A measure is incomplete if there exists a null set $N$ with a non-measurable subset.

💡 Remark 8 (Borel vs. Lebesgue — completeness via completion)

The Borel sigma-algebra $\mathcal{B}(\mathbb{R})$ is not complete: there exist subsets of the Cantor set $C$ that are not Borel, for a cardinality reason. The Cantor set has cardinality $|C| = 2^{\aleph_0}$, so its power set has cardinality $|\mathcal{P}(C)| = 2^{2^{\aleph_0}}$ — strictly larger than $|\mathcal{B}(\mathbb{R})| = 2^{\aleph_0}$. There are more subsets of $C$ than there are Borel sets in total, so most subsets of $C$ are not Borel. The Lebesgue sigma-algebra $\mathcal{L}(\mathbb{R})$ is the completion of $\mathcal{B}(\mathbb{R})$ with respect to Lebesgue measure: it contains every Borel set plus every subset of every Lebesgue-null set. Lebesgue measure $\lambda$, restricted to $\mathcal{L}(\mathbb{R})$, is complete by construction. In practice, completeness rarely matters for the theorems we care about (the dominated convergence theorem doesn’t require it), but it eliminates pathological “this set is null but its subsets aren’t measurable” annoyances.

The Cantor function (devil's staircase): continuous, non-decreasing, derivative zero almost everywhere

8. Non-Measurable Sets — The Vitali Construction

We have spent six sections building Lebesgue measure on a carefully chosen sigma-algebra. We can now answer the question that started this topic: why do we need to restrict to a sigma-algebra at all? Why not just measure every subset of $\mathbb{R}$?

The answer is the Vitali construction. Using the axiom of choice, we can build a subset $V \subset [0, 1]$ to which no consistent length can be assigned — assuming “consistent” means translation-invariant and countably additive. This is one of the most important (and most disquieting) results in classical analysis.

🔷 Theorem 6 (Existence of a non-Lebesgue-measurable set (Vitali, 1905))

There exists a subset $V \subseteq [0, 1]$ such that $V \notin \mathcal{L}(\mathbb{R})$. That is, $V$ is not Lebesgue-measurable.

Proof.

Define an equivalence relation $\sim$ on $[0, 1]$ by
$$x \sim y \iff x - y \in \mathbb{Q}.$$

This is reflexive ($x - x = 0 \in \mathbb{Q}$), symmetric ($y - x = -(x - y) \in \mathbb{Q}$ iff $x - y \in \mathbb{Q}$), and transitive ($x - y, y - z \in \mathbb{Q} \Rightarrow x - z = (x - y) + (y - z) \in \mathbb{Q}$). So $\sim$ partitions $[0, 1]$ into equivalence classes — each class is the intersection $[0, 1] \cap (\mathbb{Q} + r)$ for some real number $r$, so each class is countably infinite, and there are uncountably many distinct classes.

By the axiom of choice, we can select exactly one representative from each equivalence class. Let $V \subset [0, 1]$ be the resulting set of representatives.

We claim $V$ is not Lebesgue-measurable. Suppose for contradiction that it is, with measure $\lambda(V) = m \geq 0$.

Enumerate the rationals in $[-1, 1]$ as $\mathbb{Q} \cap [-1, 1] = \{q_1, q_2, q_3, \ldots\}$ (a countable set). Consider the translates $V + q_n$ for $n = 1, 2, 3, \ldots$. We make two observations.

Disjointness. Suppose $x \in (V + q_m) \cap (V + q_n)$ for some $m \neq n$. Then $x = v + q_m = v' + q_n$ for some $v, v' \in V$, so $v - v' = q_n - q_m \in \mathbb{Q}$, meaning $v \sim v'$. But $V$ contains exactly one representative of each equivalence class, so $v = v'$, which gives $q_m = q_n$, contradicting $m \neq n$. Hence the translates $V + q_n$ are pairwise disjoint.

Coverage. For any $x \in [0, 1]$, the equivalence class $[x]$ has a representative $v \in V$. Then $x = v + (x - v)$, where $x - v \in [-1, 1] \cap \mathbb{Q}$. So $x - v = q_n$ for some $n$, and $x \in V + q_n$. Hence
$$[0, 1] \subseteq \bigcup_{n=1}^\infty (V + q_n) \subseteq [-1, 2].$$

By countable additivity (the translates are disjoint and measurable, since $V$ is assumed measurable and Lebesgue measure is translation-invariant by Theorem 3),
$$\lambda\left(\bigcup_n (V + q_n)\right) = \sum_{n=1}^\infty \lambda(V + q_n) = \sum_{n=1}^\infty m.$$

By monotonicity,
$$1 = \lambda([0, 1]) \leq \sum_{n=1}^\infty m \leq \lambda([-1, 2]) = 3.$$

Now examine the cases:

  • If $m = 0$, the sum $\sum_n m = 0$, contradicting $\sum_n m \geq 1$.
  • If $m > 0$, the sum $\sum_n m = \infty$, contradicting $\sum_n m \leq 3$.

Either way we have a contradiction. So our assumption that $V$ is measurable was false: $V \notin \mathcal{L}(\mathbb{R})$.

💡 Remark 9 (The axiom of choice is essential — Solovay (1970))

The Vitali construction critically depends on the axiom of choice — without it, we cannot “select one representative from each equivalence class.” This raises a natural question: is the existence of non-measurable sets a theorem of ZFC, or is it an artifact of the axiom of choice? The answer, due to Solovay (1970), is striking: there exists a model of Zermelo–Fraenkel set theory (without choice, but with the weaker axiom of dependent choice) in which every subset of $\mathbb{R}$ is Lebesgue-measurable. So non-measurable sets are not a feature of the real line itself — they are a consequence of the strong choice principle that mathematicians have collectively decided to adopt. In a universe without the full axiom of choice, the entire Vitali pathology vanishes, and Lebesgue measure extends to every subset.

The pragmatic upshot: in everything we do in measure theory and probability, we work with the axiom of choice, accept that non-measurable sets exist, and quietly restrict all our attention to the Borel or Lebesgue sigma-algebra. The non-measurable sets are out there, but the theorems we care about never need them.

Vitali construction: equivalence classes under rational translation, then a translation-invariance contradiction

9. Product Measures (Preview)

To integrate functions of two or more variables, we need to combine measures on different spaces into a product measure on the Cartesian product. The full theory — including Fubini’s theorem, which lets us compute double integrals as iterated integrals — belongs to The Lebesgue Integral. Here we lay the foundation.

📐 Definition 13 (Product sigma-algebra)

Let $(\Omega_1, \mathcal{F}_1)$ and $(\Omega_2, \mathcal{F}_2)$ be measurable spaces. The product sigma-algebra on $\Omega_1 \times \Omega_2$, denoted $\mathcal{F}_1 \otimes \mathcal{F}_2$, is the sigma-algebra generated by all “rectangles” $A_1 \times A_2$ with $A_1 \in \mathcal{F}_1$ and $A_2 \in \mathcal{F}_2$:
$$\mathcal{F}_1 \otimes \mathcal{F}_2 = \sigma\left(\{A_1 \times A_2 : A_1 \in \mathcal{F}_1, A_2 \in \mathcal{F}_2\}\right).$$

📝 Example 15 (The Borel sigma-algebra on ℝ² is the product of two copies of B(ℝ))

On the plane $\mathbb{R}^2 = \mathbb{R} \times \mathbb{R}$, the Borel sigma-algebra $\mathcal{B}(\mathbb{R}^2)$ generated by all open sets equals the product sigma-algebra $\mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathbb{R})$. This is because every open set in $\mathbb{R}^2$ can be written as a countable union of open rectangles $(a_1, b_1) \times (a_2, b_2)$, and conversely every open rectangle is open in $\mathbb{R}^2$. So the two ways of building “Borel sets in the plane” — directly from the topology, or as a product of one-dimensional Borel sigma-algebras — give the same answer. The same holds in arbitrary dimension: $\mathcal{B}(\mathbb{R}^n) = \mathcal{B}(\mathbb{R})^{\otimes n}$.

💡 Remark 10 (Joint distributions as product measures)

In probability theory, if $X$ and $Y$ are independent random variables on a probability space $(\Omega, \mathcal{F}, P)$, then their joint distribution is the product measure $P_X \otimes P_Y$ on $(\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2))$. The product structure is the formal definition of independence: for any Borel sets $A, B \subseteq \mathbb{R}$,
$$P(X \in A, Y \in B) = P_X(A) \cdot P_Y(B).$$

This is the measure-theoretic foundation of every “i.i.d.” assumption in machine learning. When you train a model on data points $(x_i, y_i)$ assumed i.i.d. from a distribution, you are implicitly working with a product measure over the data space.

10. Computational Notes

Measure theory is a foundational subject — most of what we have covered is non-computational. But the framework has direct algorithmic counterparts in scientific Python that practitioners use every day, often without realizing they are doing measure theory.

Monte Carlo estimation of Lebesgue measure. For a Borel set $A \subseteq [0, 1]$, the Lebesgue measure $\lambda(A)$ can be estimated by sampling $N$ independent uniform points $X_1, \ldots, X_N \sim \text{Uniform}[0, 1]$ and computing
$$\hat{\lambda}_N(A) = \frac{1}{N} \sum_{i=1}^N \mathbf{1}_A(X_i).$$

By the strong law of large numbers, $\hat{\lambda}_N(A) \to \lambda(A)$ almost surely as $N \to \infty$. The central limit theorem gives the convergence rate: the standard error is $\sqrt{\lambda(A)(1 - \lambda(A))/N} = O(N^{-1/2})$. This is exactly the same rate as Monte Carlo integration of any bounded function — measure estimation is integration of an indicator function.

scipy.stats is measure theory in disguise. Every distribution object in scipy.stats is, mathematically, a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. The .pdf() method returns the Radon-Nikodym derivative of the distribution with respect to Lebesgue measure (when the distribution is absolutely continuous). The .cdf() method returns the measure of the half-line $(-\infty, x]$:

from scipy.stats import norm
norm.cdf(1.96) - norm.cdf(-1.96)  # ≈ 0.95

This is the measure-theoretic statement $P(\{X \in [-1.96, 1.96]\}) \approx 0.95$, where $P$ is the standard normal measure.

Empirical measures in PyTorch. The empirical measure of a finite sample $\{x_1, \ldots, x_N\}$ is
$$\hat{\mu}_N = \frac{1}{N} \sum_{i=1}^N \delta_{x_i},$$
where each $\delta_{x_i}$ is the Dirac measure at $x_i$. This is a probability measure on $\mathbb{R}^d$ that puts mass $1/N$ at each sample point and zero elsewhere. Empirical risk minimization — the workhorse of supervised learning — is precisely the minimization of $\int L(\hat{f}, x) \, d\hat{\mu}_N(x)$ over a hypothesis class of functions $\hat{f}$. The fact that this approximates the true population risk $\int L(\hat{f}, x) \, d\mu(x)$ is the Glivenko-Cantelli theorem, a measure-theoretic statement we will see in Section 11.
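Integrating against the empirical measure is just a sample mean. A NumPy sketch (the same computation in PyTorch swaps `np` for `torch`; the squared loss and parameters here are hypothetical) comparing empirical risk to the closed-form population risk:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta, x):
    return (x - theta) ** 2   # a hypothetical squared loss L(θ, x)

# Empirical risk ∫ L dμ̂_N: the integral against μ̂_N = (1/N) Σ δ_{x_i}
# collapses to the average of the loss over the sample.
x = rng.normal(loc=1.0, scale=1.0, size=100_000)   # sample from μ = N(1, 1)
emp_risk = np.mean(loss(0.5, x))

# Population risk ∫ L dμ = Var(X) + (E[X] - θ)² = 1 + 0.25 = 1.25
print(emp_risk)  # ≈ 1.25, within O(N^{-1/2}) of the population risk
```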

Numerical pitfall: floating-point measure-zero detection. Measure-zero properties are asymptotic and inherently invisible to finite-precision arithmetic. A real-valued sample from any continuous distribution would, with probability one, avoid any specific countable set — but a finite floating-point sample of size $10^6$ contains exactly $10^6$ points, all rational (every IEEE-754 double is rational). The “set of irrationals” is mathematically well-defined and has measure $1$ on $[0, 1]$, but no Python program can ever directly check whether a sampled $x$ is in it. Measure-zero subtleties live in the limits, not in the sampled data.

📝 Example 16 (Monte Carlo estimation of λ([0.3, 0.7]))

We want to estimate the measure of $A = [0.3, 0.7]$, which we know is $\lambda(A) = 0.4$.

import numpy as np
rng = np.random.default_rng()
N = 10**6
samples = rng.uniform(0, 1, N)                           # X_i ~ Uniform[0, 1]
estimate = np.mean((samples >= 0.3) & (samples <= 0.7))  # fraction landing in A
print(estimate)  # ≈ 0.4000

Running this with $N = 10^6$ gives an estimate within about $\pm 0.001$ of the true value, consistent with the predicted standard error $\sqrt{0.4 \cdot 0.6 / 10^6} \approx 0.0005$. As $N$ grows, the empirical proportion converges to the true Lebesgue measure — this is Glivenko-Cantelli in its most elementary form.

11. Connections to ML

Measure theory is the mathematical language of probability, and probability is the mathematical language of machine learning. Here are five connections, each substantial enough to be its own research thread.

Probability densities as Radon-Nikodym derivatives. When you write $p(x)$ for the density of a continuous random variable, you are writing a Radon-Nikodym derivative: $p(x) = \frac{dP}{d\lambda}(x)$, where $P$ is the probability measure and $\lambda$ is Lebesgue measure on $\mathbb{R}$. This is what makes $p(x) \, dx$ a meaningful expression — it represents the differential measure $dP = p(x) \, d\lambda$. Discrete distributions don’t have densities with respect to Lebesgue measure (they live on a measure-zero set of points); they have densities with respect to the counting measure on the support. Any time you mix continuous and discrete components — a Bernoulli mixture model, a quantized neural network — you are working with two different reference measures simultaneously, and the Radon-Nikodym formalism keeps the bookkeeping straight.

The manifold hypothesis. Real-world data — natural images, text embeddings, audio spectrograms — does not fill its ambient space uniformly. It concentrates on or near a low-dimensional manifold $\mathcal{M} \subset \mathbb{R}^d$, where $\dim \mathcal{M} \ll d$. From a measure-theoretic standpoint, this means the data distribution is singular with respect to the $d$-dimensional Lebesgue measure $\lambda^d$: the support has measure zero in $\mathbb{R}^d$. Generative models that try to fit a density $p(x)$ via maximum likelihood will fail catastrophically on truly singular distributions — there is no density to fit. The remedies (denoising score matching, normalizing flows with explicit Jacobians, diffusion models that add noise to lift the data off the manifold) are all about constructing measures whose Radon-Nikodym derivative with respect to $\lambda^d$ is well-defined. Forward link: Normalizing Flows.

Empirical risk minimization as measure convergence. Training a model on $N$ data points $(x_1, \ldots, x_N)$ replaces the true population measure $\mu$ with the empirical measure $\hat{\mu}_N = \frac{1}{N} \sum_i \delta_{x_i}$. The Glivenko-Cantelli theorem says that for measures on $\mathbb{R}$, $\hat{\mu}_N$ converges to $\mu$ uniformly over half-lines: $\sup_t |\hat{F}_N(t) - F(t)| \to 0$ almost surely, where $F$ is the true CDF. More general versions (Vapnik-Chervonenkis, Rademacher complexity) extend this to arbitrary classes of measurable sets and functions, giving us the generalization bounds that explain why training on finite data can produce models that work on unseen data. All of these bounds are measure-theoretic in nature.

Importance sampling and change of measure. To estimate $\mathbb{E}_P[f(X)]$ when sampling from $P$ is hard, we can sample from a different (easier) measure $Q$ and reweight: $$\mathbb{E}_P[f(X)] = \mathbb{E}_Q\left[f(X) \cdot \frac{dP}{dQ}(X)\right].$$

The ratio $dP/dQ$ is a Radon-Nikodym derivative, and it exists exactly when $P$ is absolutely continuous with respect to $Q$ — i.e., when $Q(A) = 0$ implies $P(A) = 0$ for every measurable $A$. Importance sampling underlies off-policy reinforcement learning, sequential Monte Carlo, and many variance-reduction tricks in deep learning. Forward link: Concentration Inequalities.
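A minimal sketch of the reweighting identity, under assumptions chosen for illustration: the target is $P = N(0,1)$, the proposal is the wider Gaussian $Q = N(0, 2^2)$ (so $P \ll Q$ and $dP/dQ$ is well-defined), and we estimate $\mathbb{E}_P[X^2] = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_density(x, sigma):
    """Log density of N(0, sigma^2) at x."""
    return -x**2 / (2 * sigma**2) - np.log(sigma * np.sqrt(2 * np.pi))

f = lambda x: x**2                         # estimate E_P[X^2] = Var_P(X) = 1

x = rng.normal(0.0, 2.0, size=200_000)     # sample from the proposal Q
# dP/dQ evaluated at each sample, computed stably in log space
w = np.exp(log_density(x, 1.0) - log_density(x, 2.0))

estimate = np.mean(f(x) * w)               # E_Q[f * dP/dQ] approximates E_P[f]
print(estimate)
```

If the proposal were narrower than the target instead, the weights would have infinite variance even though $dP/dQ$ still exists pointwise, which is why proposal choice, not just absolute continuity, decides whether the estimator is usable.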

“With probability 1” statements and gradient descent. The result that “gradient descent avoids saddle points with probability 1” (Lee, Simchowitz, Jordan, Recht, 2016 and follow-ups) is a measure-theoretic statement about the set of bad initial conditions in $\mathbb{R}^d$. The set of initialization vectors $\theta_0$ from which gradient descent converges to a strict saddle point has Lebesgue measure zero — a negligible exceptional set in the parameter space. So almost every initialization avoids the strict saddles, which is why gradient methods work on non-convex losses despite their abundance of saddle points. The argument uses center-stable manifold theory, but the statement is pure measure theory. Forward link: Gradient Descent.
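A toy illustration of the measure-zero claim (not the center-stable-manifold argument itself): on $f(x, y) = x^2 - y^2$, whose origin is a strict saddle, the gradient-descent map multiplies $y$ by $1 + 2\eta$ each step, so only initializations on the measure-zero set $\{y_0 = 0\}$ flow into the saddle. Every random initialization escapes.

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x, y) = x^2 - y^2 has a strict saddle at the origin.
# Gradient descent: x <- x(1 - 2*eta), y <- y(1 + 2*eta).  The y-coordinate
# is repelled from 0 unless y0 = 0 exactly, a set of Lebesgue measure zero.
eta, steps = 0.1, 200
theta = rng.normal(size=(1000, 2))       # random initializations (a.s. y0 != 0)

for _ in range(steps):
    grad = np.stack([2 * theta[:, 0], -2 * theta[:, 1]], axis=1)
    theta = theta - eta * grad

# Every trajectory leaves the saddle: |y| grows geometrically, while the
# x-coordinate (the attracting direction) shrinks to 0.
escaped = np.abs(theta[:, 1]) > 1.0
print(bool(escaped.all()))
```

The simulation is the theorem in miniature: the stable set of the saddle is the $x$-axis, a one-dimensional subspace of $\mathbb{R}^2$ with $\lambda^2$-measure zero, so random initialization misses it with probability 1.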

The next theorem is the workhorse for “with probability 1” statements in convergence theory. It’s a series convergence test recast as a measure-theoretic statement about tail events — the bridge between Topic 18 (Series Convergence & Tests) and probability theory.

🔷 Theorem 7 (Borel-Cantelli lemma (first form))

Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and let $(E_n)_{n=1}^\infty$ be a sequence of measurable sets with $$\sum_{n=1}^\infty \mu(E_n) < \infty.$$

Then $\mu$-almost every $\omega \in \Omega$ belongs to only finitely many of the $E_n$. Formally, $$\mu\left(\limsup_{n \to \infty} E_n\right) = \mu\left(\bigcap_{N=1}^\infty \bigcup_{n=N}^\infty E_n\right) = 0.$$

Proof.

Let $A = \limsup_n E_n = \bigcap_{N \geq 1} \bigcup_{n \geq N} E_n$. The set $A$ is the collection of $\omega$ that belong to $E_n$ for infinitely many indices $n$ — every $\omega \in A$ is in some $E_n$ for arbitrarily large $n$. We must show $\mu(A) = 0$.

For every $N \geq 1$, the definition of $A$ as an intersection gives $$A \subseteq \bigcup_{n=N}^\infty E_n.$$

By monotonicity (Theorem 1.1) and countable subadditivity (Theorem 1.2), $$\mu(A) \leq \mu\left(\bigcup_{n=N}^\infty E_n\right) \leq \sum_{n=N}^\infty \mu(E_n).$$

The right-hand side is the tail of the convergent series $\sum_{n=1}^\infty \mu(E_n)$. Since the full series converges, its tails tend to zero: $$\sum_{n=N}^\infty \mu(E_n) \to 0 \quad \text{as } N \to \infty.$$

Since the bound holds for every $N$, letting $N \to \infty$ gives $\mu(A) \leq 0$. Combined with $\mu(A) \geq 0$, we get $\mu(A) = 0$.

The second form (which we will use without proof) reverses the implication when independence is assumed: if the $E_n$ are mutually independent and $\sum_n P(E_n) = \infty$, then $P(\limsup E_n) = 1$ — almost every $\omega$ belongs to infinitely many of the $E_n$. Together, the two forms give a complete dichotomy for tail events: either $\sum P(E_n) < \infty$ and a.e. point is in finitely many sets, or (under independence) $\sum P(E_n) = \infty$ and a.e. point is in infinitely many sets. This is the foundation for nearly every “with probability 1” theorem in the theory of stochastic processes and ML convergence.
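The first form can be watched numerically. A quick NumPy simulation (the setup is illustrative): take independent events $E_n$ with $P(E_n) = 1/n^2$, a summable series with $\sum_n 1/n^2 = \pi^2/6 \approx 1.645$, and count how many $E_n$ each sample path lands in.

```python
import numpy as np

rng = np.random.default_rng(42)

# Independent events E_n with P(E_n) = 1/n^2; the series is summable,
# so Borel-Cantelli says a.e. path lies in only finitely many E_n.
n_paths, n_events = 1000, 5000
n = np.arange(1, n_events + 1)
# E_n occurs on a path iff an independent uniform U_n falls below 1/n^2
hits = rng.random((n_paths, n_events)) < 1.0 / n**2

counts = hits.sum(axis=1)     # number of E_n each path belongs to
print(counts.mean())          # close to sum 1/n^2 = pi^2/6
print(counts.max())           # small: no path hits more than a handful of E_n
```

Note that $E_1$ has probability 1, so every path hits at least one event; the point of the lemma is the upper behavior: the expected total count is finite, so the count itself is finite almost surely.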

Borel-Cantelli: convergent vs divergent series of measures, and the resulting almost-everywhere behavior

📝 Example 17 (Empirical measure convergence — Glivenko-Cantelli)

Sample $X_1, X_2, \ldots, X_N$ i.i.d. from the standard normal distribution. The empirical CDF is $$\hat{F}_N(t) = \frac{1}{N} \sum_{i=1}^N \mathbf{1}_{X_i \leq t}.$$

This is the cumulative distribution function of the empirical measure $\hat{\mu}_N = \frac{1}{N} \sum_i \delta_{X_i}$. As $N$ grows, $\hat{F}_N$ converges to the true CDF $F(t) = \Phi(t) = \int_{-\infty}^t \frac{1}{\sqrt{2\pi}} e^{-s^2/2} \, ds$ pointwise (by the law of large numbers) and in fact uniformly in $t$ (by the Glivenko-Cantelli theorem): $$\sup_{t \in \mathbb{R}} |\hat{F}_N(t) - F(t)| \to 0 \quad \text{almost surely as } N \to \infty.$$

The Kolmogorov-Smirnov distance $D_N = \sup_t |\hat{F}_N(t) - F(t)|$ shrinks like $1/\sqrt{N}$, with the precise distributional behavior given by the Kolmogorov distribution. This is the rigorous statement that “training data converges to the true distribution as the sample size grows” — every empirical-risk-minimization argument in supervised learning depends on it.
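A small sketch of this convergence, assuming only NumPy and the standard fact that the supremum of a step CDF against a continuous CDF is attained at a jump point, so $D_N$ can be computed exactly from the order statistics:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def Phi(t):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

def ks_distance(samples):
    """Exact sup_t |F_hat_N(t) - Phi(t)| for the empirical CDF of the samples."""
    x = np.sort(samples)
    N = len(x)
    F = np.array([Phi(t) for t in x])
    # the empirical CDF jumps at each x_(i): check both sides of every jump
    above = np.abs(np.arange(1, N + 1) / N - F)
    below = np.abs(np.arange(0, N) / N - F)
    return float(max(above.max(), below.max()))

D = {}
for N in [10, 100, 1000, 10_000]:
    D[N] = ks_distance(rng.normal(size=N))
    print(N, D[N], math.sqrt(N) * D[N])   # sqrt(N) * D_N stays O(1)
```

The last column is the rescaled statistic $\sqrt{N}\,D_N$, whose limiting law is the Kolmogorov distribution; it hovers around the same magnitude across all four sample sizes while $D_N$ itself shrinks.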

Four measures on ℝ: Dirac delta, Gaussian density, mixture, and an empirical measure on 15 sample points

Empirical CDF convergence at N = 10, 100, 1000, 10000 samples — Glivenko-Cantelli in action

12. Connections & Further Reading

This is the first topic in Track 7 — Measure & Integration — and the first advanced topic in formalCalculus. It builds the framework that the next three topics in the track will populate: the Lebesgue integral is built from simple functions and measures defined here; $L^p$ spaces (Topic 27) are equivalence classes of measurable functions identified up to a.e.-equality; the Radon-Nikodym theorem (Topic 28) characterizes when one measure has a density with respect to another. Without the sigma-algebra and measure framework constructed in this topic, none of the next three topics has a foundation to stand on.

The topic is also a bridge from classical calculus to modern probability. The measure-theoretic vocabulary built here — sigma-algebras as families of events, measures as probability assignments, measurable functions as random variables, “almost everywhere” as “with probability 1” — is exactly the vocabulary of probability theory. Every theorem in Probability Spaces on formalml.com starts from a measure space.

Within formalCalculus:

  • Completeness & Compactness — Completeness of $\mathbb{R}$ powers the closure-under-countable-operations arguments behind the Borel sigma-algebra (Section 2). Compactness of $[a, b]$ via Heine-Borel is the engine behind $\lambda([a, b]) = b - a$ in Section 4. The uncountability of $\mathbb{R}$ from Topic 3, together with the axiom of choice, is what makes the Vitali construction work in Section 8.
  • The Riemann Integral & FTC — The Riemann integral’s failure on the Dirichlet function (Section 1) is the entire motivation for measure theory. Every Riemann-integrable function is Lebesgue-integrable, and the integrals agree (Remark 4) — so the new theory is a strict extension of the old.
  • Series Convergence & Tests — Countable additivity is defined via convergent series of measures (Definition 4), and the first Borel-Cantelli lemma (Theorem 7) is a series convergence test recast as a measure-theoretic statement about tail events.

Successor topics now published:

  • The Lebesgue Integral — builds the integral $\int f \, d\mu$ for non-negative measurable $f$ as the supremum of integrals of approximating simple functions, then extends to general measurable functions via $f = f^+ - f^-$. Proves the Monotone, Fatou, and Dominated Convergence theorems, plus Fubini-Tonelli for product measures.
  • $L^p$ Spaces — Banach spaces of measurable functions where $\|f\|_p = (\int |f|^p \, d\mu)^{1/p} < \infty$. Equivalence classes under a.e.-equality (the “null set” concept from Section 7 is essential).

Successor topics within formalCalculus:

  • Radon-Nikodym & Probability Densities — when one measure $\nu$ has a density $f = d\nu/d\mu$ with respect to another measure $\mu$. Characterizes absolute continuity. The bridge from measure theory to densities, conditional expectation, and Bayesian inference.

Forward to formalml.com:

  • Probability Spaces — A probability space $(\Omega, \mathcal{F}, P)$ is exactly a measure space with $P(\Omega) = 1$. Sigma-algebras formalize observable events; filtrations $(\mathcal{F}_t)$ model information flow in stochastic processes. Random variables are measurable functions $\Omega \to \mathbb{R}$. Independence is product measure structure.
  • Gradient Descent — “SGD avoids saddle points with probability 1” is a statement about Lebesgue measure on initialization space. The set of bad initial conditions has measure zero in $\mathbb{R}^d$, so the algorithm succeeds for almost every starting point.
  • Normalizing Flows — Pushforward measures: a flow $T: \mathbb{R}^d \to \mathbb{R}^d$ transforms a base measure $\mu_X$ into $\mu_Y = T_* \mu_X$ via change of variables. The change-of-variables formula for densities is a Radon-Nikodym computation on the pushforward.
  • Concentration Inequalities — Markov, Chebyshev, Hoeffding, Bernstein: all are upper bounds on the measure $P(\{|X - \mathbb{E}[X]| > t\})$ of a tail set. Every concentration result is a statement about how much measure can lie far from the mean.
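The tail-set picture can be checked numerically. A Chebyshev sketch (NumPy only; the exponential variable with $\mu = \sigma^2 = 1$ is an illustrative stand-in for $X$): the empirical measure of the tail set $\{|X - \mu| > t\}$ never exceeds the bound $\sigma^2 / t^2$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Chebyshev: P(|X - mu| > t) <= sigma^2 / t^2, an upper bound on the
# measure of the tail set.  Estimate the tail measure by simulation for
# an exponential(1) variable, which has mu = 1 and sigma^2 = 1.
x = rng.exponential(1.0, size=1_000_000)
mu, var = 1.0, 1.0

tails = {}
for t in [1.0, 2.0, 3.0]:
    tails[t] = float(np.mean(np.abs(x - mu) > t))   # empirical tail measure
    print(t, tails[t], var / t**2)                  # measure vs. Chebyshev bound
```

For this distribution the true tail measure decays exponentially while the Chebyshev bound decays only like $1/t^2$, which is the general pattern: Chebyshev holds for every finite-variance measure and is therefore loose for any particular one; Hoeffding and Bernstein recover exponential decay by assuming more.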

References

  1. book Royden, H. L. & Fitzpatrick, P. M. (2010). Real Analysis, fourth edition. Chapters 2–3 cover sigma-algebras, Lebesgue measure, and Carathéodory's extension theorem with the same proof structure used here.
  2. book Folland, G. B. (1999). Real Analysis: Modern Techniques and Their Applications, second edition. Chapters 1–2 are the standard graduate-level reference for measure-theoretic foundations. The Vitali construction proof in Section 8 follows Folland's presentation.
  3. book Rudin, W. (1976). Principles of Mathematical Analysis. Chapter 11 gives a Lebesgue-theory overview, connecting the real analysis foundations from earlier tracks to the measure-theoretic viewpoint.
  4. book Rudin, W. (1987). Real and Complex Analysis. Chapters 1–2 give the condensed, elegant treatment of abstract measure construction and integration.
  5. book Tao, T. (2011). An Introduction to Measure Theory. Free PDF, excellent for self-study. Chapters 1–2 cover Lebesgue measure constructively with extensive geometric intuition.
  6. book Halmos, P. R. (1974). Measure Theory. The classic treatment of rings, sigma-rings, and extension theorems. Historically important and still widely referenced.
  7. book Billingsley, P. (1995). Probability and Measure, third edition. Chapter 1 connects measure theory directly to probability — the bridge to formalml.com that this topic builds toward.
  8. book Durrett, R. (2019). Probability: Theory and Examples. Chapter 1 covers measure theory for probabilists. Free online, and directly connects sigma-algebras to ML probability foundations.