Measure & Integration · advanced · 50 min read

The Lebesgue Integral

Building the integral that handles limits — from simple function construction through the convergence theorems that make measure theory the engine of modern probability and machine learning.

Abstract. The Lebesgue integral is the completion of a promise. Topic 25 built the framework — sigma-algebras, measures, measurable functions, simple functions — but stopped short of the payoff: an integral powerful enough to handle limits. The Riemann integral fails when you need to interchange a limit and an integral, which is exactly what probability theory and machine learning do constantly. The Lebesgue integral fixes this. We construct it in three stages: first, for simple functions (finite sums weighted by measures); second, for non-negative measurable functions (supremum over simple approximations from below); and third, for general measurable functions (splitting into positive and negative parts). The three convergence theorems — Monotone Convergence, Fatou's Lemma, and Dominated Convergence — are the reason this integral exists. Dominated Convergence, in particular, is the mathematical justification for every 'swap the gradient and expectation' move in stochastic optimization. Fubini's theorem extends the integral to product spaces, making iterated integration rigorous. Expected value, KL divergence, and the ELBO are all Lebesgue integrals — this topic makes that precise.

Where this leads → formalML

  • formalML Expected value E[X] = ∫ X dP is a Lebesgue integral against a probability measure. Every moment, MGF, and characteristic function computation is an application of this topic.
  • formalML The interchange ∇E[L(θ,X)] = E[∇L(θ,X)] in SGD convergence proofs is a Dominated Convergence Theorem application — the single most important ML use of this topic.
  • formalML KL divergence D_KL(P||Q) = ∫ log(dP/dQ) dP and cross-entropy are Lebesgue integrals against probability measures. Their properties (non-negativity, convexity) follow from integral inequalities.
  • formalML The ELBO is an integral identity derived from Jensen's inequality applied to a Lebesgue integral: log p(x) ≥ ∫ q(z) log[p(x,z)/q(z)] dz.

1. Three Puzzles the Lebesgue Integral Solves

Topic 25 built the measure-theoretic framework — sigma-algebras, measures, measurable functions, simple functions, null sets, “almost everywhere” — but stopped short of the actual payoff: an integral. We had the vocabulary in place to talk about integration in a measure-theoretic way, but no integral defined yet. This topic builds it.

The motivation is the same one that drove the original development of the theory in the early twentieth century. The Riemann integral, beautiful as it is, fails on three different kinds of question that we routinely need to answer in probability theory and machine learning. Each failure points at a property we need the new integral to have.

If the proofs in this topic feel more technically demanding than Topic 25, that’s accurate — we are now working with suprema of integrals and interchanges of limits, which is where measure theory earns its keep. The conceptual mode is the same (sets, measures, measurable functions), but the manipulations are harder. Expect to slow down at the convergence-theorem proofs.

Puzzle 1: The limit–integral interchange failure

Define a sequence of functions $f_n : [0, 1] \to \mathbb{R}$ by $$f_n(x) = n \cdot \mathbf{1}_{[0, 1/n]}(x).$$

Each $f_n$ is a simple function: it equals $n$ on the interval $[0, 1/n]$ and $0$ elsewhere. The Riemann integral is well-defined for every $f_n$ and equals $$\int_0^1 f_n(x) \, dx = n \cdot \frac{1}{n} = 1.$$

So the integrals are constant: $\int_0^1 f_n = 1$ for every $n$.

Now look at the pointwise limit. For any fixed $x \in (0, 1]$, eventually $1/n < x$ — specifically, for $n > 1/x$ — and so $f_n(x) = 0$ for all sufficiently large $n$. At $x = 0$, $f_n(0) = n$ for every $n$, so $f_n(0) \to \infty$ — a single point we can ignore. Putting these together, $$\lim_{n \to \infty} f_n(x) = 0 \quad \text{for every } x \in (0, 1].$$

So the pointwise limit $f$ equals $0$ almost everywhere, and any reasonable theory of integration should give $\int_0^1 f = 0$.

But $$\lim_{n \to \infty} \int_0^1 f_n = 1 \neq 0 = \int_0^1 \lim_{n \to \infty} f_n.$$

The Riemann integral provides no theorem that tells us when we are allowed to swap a limit and an integral. We are simply forbidden from doing it without justification, and there is no general justification available.

📝 Example 1 (The limit–integral interchange failure)

The sequence $f_n(x) = n \cdot \mathbf{1}_{[0, 1/n]}(x)$ on $[0, 1]$ has pointwise limit $f(x) = 0$ (for every $x > 0$), but $$\int_0^1 f_n(x) \, dx = 1 \quad \text{for every } n, \qquad \int_0^1 f(x) \, dx = 0.$$

This is the simplest possible witness that “limit of integrals” and “integral of limit” can disagree. The mass of $f_n$ — concentrated on a thin spike near $x = 0$ — does not vanish in the limit, even though the spike itself does.

The Lebesgue integral comes with a precise theorem — the Dominated Convergence Theorem — that tells us exactly when the swap is legal: if $|f_n| \leq g$ for some integrable function $g$, the swap is valid. The sequence above escapes every dominating function (any $g$ would have to satisfy $g \geq n$ on $[0, 1/n]$ for every $n$, forcing $g$ to be unbounded near $0$ in a non-integrable way), so its failure is not a defect of the theory but a genuine warning. By the end of Section 7 we’ll see this tightness from both sides: scenarios where DCT applies and the swap is valid, and the spike sequence above as a textbook example of what happens when the domination hypothesis fails.
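The spike sequence is easy to watch numerically. A minimal sketch in plain NumPy (the grid size and the helper name `f` are our own illustrative choices, not from the text):

```python
import numpy as np

# The spike f_n(x) = n * 1_{[0, 1/n]}(x): every integral is 1,
# yet f_n(x) -> 0 for each fixed x > 0.
def f(n, x):
    return np.where(x <= 1.0 / n, float(n), 0.0)

x = np.linspace(0.0, 1.0, 2_000_001)        # fine grid on [0, 1]
dx = x[1] - x[0]
for n in [1, 10, 100]:
    integral = float(np.sum(f(n, x)) * dx)  # grid approximation of the integral
    print(n, round(integral, 3))            # stays near 1 for every n

# At a fixed point x = 0.01, the values are eventually 0:
print(float(f(1000, np.array([0.01]))[0]))  # 0.0, since 1/1000 < 0.01
```

The integrals refuse to shrink even as the set carrying the mass shrinks to a point, which is exactly the failure the text describes.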

Puzzle 2: The mixture distribution

Suppose a random variable $X$ has the following law: with probability $1/2$ it equals exactly $0$ (a point mass), and with probability $1/2$ it is drawn from the exponential density $p(x) = e^x$ on $(-\infty, 0)$. Computing the expected value $E[X]$ requires integrating against a measure that is neither purely absolutely continuous nor purely discrete.

Try to write this expectation as a Riemann integral. The continuous part contributes $$\frac{1}{2} \int_{-\infty}^0 x \, e^{x} \, dx,$$ which is a perfectly fine improper Riemann integral and evaluates to $-1/2$. But the point mass at $0$ contributes $0 \cdot \frac{1}{2} = 0$ to the expectation, and there is no way to capture “$0$ with probability $1/2$” inside a Riemann integral framework — the value of any Riemann integral over a single point is always zero, regardless of how much probability is concentrated there.

In the Lebesgue framework, the entire expectation is one integral against a single mixed measure $\mu = \frac{1}{2} e^x \, \lambda|_{(-\infty, 0)} + \frac{1}{2} \delta_0$: $$E[X] = \int x \, d\mu(x) = -\frac{1}{2} + 0 = -\frac{1}{2}.$$

The point is not the numerical answer (which is the same either way — we got lucky here because $0$ contributes zero). The point is that the framework handles continuous and discrete components uniformly, with no special-case grafting. Mixtures of densities and atoms are everywhere in real-world ML — quantized neural networks, censored data, discrete-continuous outcome models — and the Lebesgue integral is what makes them legal mathematics rather than ad-hoc bookkeeping.
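A quick Monte Carlo sanity check of the mixed-measure expectation (the sampling scheme and seed are our own illustrative choices):

```python
import numpy as np

# Sample from the mixed law: with prob 1/2 the atom at 0, with prob 1/2
# a draw from the density e^x on (-inf, 0), i.e. the negative of an Exp(1) draw.
rng = np.random.default_rng(0)
n = 1_000_000
is_atom = rng.random(n) < 0.5
continuous = -rng.exponential(1.0, size=n)   # density e^x on (-inf, 0)
samples = np.where(is_atom, 0.0, continuous)
print(samples.mean())                        # close to -1/2, matching the integral
```

The empirical mean lands near $-1/2$, and the sampler itself mirrors the mixture structure of $\mu$: one branch for the atom, one for the density.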

Puzzle 3: The gradient–expectation swap in SGD

When you train a model with stochastic gradient descent, the move at the heart of the algorithm is $$\nabla_\theta \, E[L(\theta, X)] \;=\; E[\nabla_\theta L(\theta, X)].$$

The left side is “differentiate the population loss”; the right side is “average the per-sample gradients.” We use this identity every time we compute a minibatch gradient and call it an unbiased estimator of the population gradient.

Look closely at what is being claimed. The gradient $\nabla_\theta$ is a limit (it’s the limit of difference quotients in $\theta$). The expectation $E[\,\cdot\,] = \int \cdot \, dP$ is an integral. So the identity is $$\lim_{h \to 0} \int \frac{L(\theta + h, X) - L(\theta, X)}{h} \, dP(X) \;=\; \int \lim_{h \to 0} \frac{L(\theta + h, X) - L(\theta, X)}{h} \, dP(X).$$

This is exactly a “limit of integrals equals integral of limit” claim. Without a convergence theorem, swapping the limit and the integral is just wishful thinking. The Dominated Convergence Theorem provides the conditions under which the swap is legal: if there exists an integrable $g(X)$ such that $|\nabla_\theta L(\theta, X)| \leq g(X)$ for all $\theta$ in a neighborhood, the interchange is valid.

In practice, ML papers cite this without naming it. Bounded gradients (gradient clipping), Lipschitz losses, and bounded-support data distributions are all ways to manufacture an integrable dominator. Every single SGD convergence proof in the literature implicitly invokes DCT at this step. Forward link: Gradient Descent.
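A hedged numerical illustration of the swap, using a quadratic loss $L(\theta, x) = (\theta - x)^2$ chosen purely for transparency (the names, constants, and data distribution are ours, not the text's):

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.normal(2.0, 1.0, size=100_000)   # stand-in data distribution
theta, h = 0.5, 1e-6

# Left side: differentiate the (empirical) expected loss by central differences.
mean_loss = lambda t: float(np.mean((t - xs) ** 2))
grad_of_mean = (mean_loss(theta + h) - mean_loss(theta - h)) / (2 * h)

# Right side: average the per-sample gradients dL/dtheta = 2*(theta - x).
mean_of_grad = float(np.mean(2 * (theta - xs)))

print(grad_of_mean, mean_of_grad)   # the two sides agree up to floating-point error
```

Here the gradient is bounded by the integrable dominator $g(x) = 2(|\theta| + |x|) + 2$ on a neighborhood of $\theta$, so DCT applies and the two computations must agree.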

Three failures, three needs: a convergence theorem for limits and integrals, a unified treatment of mixed measures, and a justification for swapping derivatives with expectations. The Lebesgue integral delivers on all three, and the rest of this topic shows how.

2. The Integral for Simple Functions

We start at the foundation. Recall from Topic 25 that a simple function on a measure space $(\Omega, \mathcal{F}, \mu)$ is a function of the form $$s(x) = \sum_{k=1}^m c_k \cdot \mathbf{1}_{A_k}(x),$$ where $c_1, \ldots, c_m$ are real numbers, $A_1, \ldots, A_m \in \mathcal{F}$ are measurable sets, and the $A_k$ partition (or at least cover) the relevant part of $\Omega$. A simple function takes only finitely many distinct values, each on a measurable set.

We already know what simple functions are. Now we give them an integral. The definition is the only reasonable one: if $s$ takes value $c_k$ on the measurable set $A_k$, then the integral of $s$ should be $\sum c_k \, \mu(A_k)$ — the “height times width” principle, with “width” measured by $\mu$ instead of by interval length.

📐 Definition 1 (Lebesgue integral of a non-negative simple function)

Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and let $s = \sum_{k=1}^m c_k \cdot \mathbf{1}_{A_k}$ be a non-negative simple function ($c_k \geq 0$, $A_k \in \mathcal{F}$, the $A_k$ disjoint). The Lebesgue integral of $s$ with respect to $\mu$ is $$\int s \, d\mu \;=\; \sum_{k=1}^m c_k \cdot \mu(A_k),$$ with the convention $0 \cdot \infty = 0$ (so a coefficient of zero on an infinite-measure set contributes zero, not “indeterminate”).

The convention $0 \cdot \infty = 0$ is essential. Without it, we couldn’t integrate the zero function over an infinite-measure space. With it, $\int 0 \, d\mu = 0$ for every measure $\mu$, which is exactly what we want.
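Definition 1 is just a finite weighted sum, so it fits in a few lines of code. A minimal sketch (representing $s$ as a list of $(c_k, \mu(A_k))$ pairs is our own convention):

```python
import math

def simple_integral(pieces):
    """Integral of s = sum of c_k * 1_{A_k}: pieces is a list of (c_k, mu(A_k))
    with the A_k disjoint. Enforces the convention 0 * inf = 0."""
    total = 0.0
    for c, m in pieces:
        if c == 0.0:
            continue          # 0 times anything, including math.inf, contributes 0
        total += c * m
    return total

# s = 3 on a set of measure 1/2, 1 on a set of measure 2, 0 on an infinite-measure set:
print(simple_integral([(3.0, 0.5), (1.0, 2.0), (0.0, math.inf)]))  # 3.5
```

Note that without the explicit `continue`, the pair `(0.0, math.inf)` would produce `nan` in floating-point arithmetic, which is precisely why the convention has to be stated.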

📝 Example 2 (The Dirichlet function has Lebesgue integral 0)

Take $\mu = \lambda$ (Lebesgue measure on $\mathbb{R}$) and $s = \mathbf{1}_{\mathbb{Q} \cap [0, 1]}$. This is a simple function: it takes value $1$ on $\mathbb{Q} \cap [0, 1]$ and value $0$ on the rest of $[0, 1]$. From Topic 25, every countable set has Lebesgue measure zero, so $\lambda(\mathbb{Q} \cap [0, 1]) = 0$. The integral is $$\int \mathbf{1}_{\mathbb{Q}} \, d\lambda \;=\; 1 \cdot \lambda(\mathbb{Q} \cap [0, 1]) + 0 \cdot \lambda([0, 1] \setminus \mathbb{Q}) \;=\; 1 \cdot 0 + 0 \cdot 1 \;=\; 0.$$ This is the answer the Riemann integral could not give us. Topic 25’s opening puzzle is now closed: the Dirichlet function is Lebesgue-integrable, and its integral is exactly the value our intuition demanded.

📝 Example 3 (Integration against a Dirac measure evaluates at the point)

Let $\delta_a$ be the Dirac measure at $a \in \mathbb{R}$: $\delta_a(A) = 1$ if $a \in A$ and $0$ otherwise. For any simple function $s = \sum c_k \mathbf{1}_{A_k}$, $$\int s \, d\delta_a \;=\; \sum_{k=1}^m c_k \cdot \delta_a(A_k) \;=\; c_{k(a)},$$ where $k(a)$ is the unique $k$ with $a \in A_k$. In other words, integrating against $\delta_a$ just evaluates the function at $a$: $\int s \, d\delta_a = s(a)$. This generalizes to all measurable functions in the next section, and it is the precise mathematical sense in which a “point mass” picks out a single value.

The integral of simple functions has exactly the algebraic structure we want — it is linear in the function and monotonic with respect to pointwise inequalities. The proof is a straightforward but careful unwinding of the definition.

🔷 Theorem 1 (Linearity and monotonicity for simple functions)

Let $s, t$ be non-negative simple functions on $(\Omega, \mathcal{F}, \mu)$ and $\alpha, \beta \geq 0$. Then $\alpha s + \beta t$ is a non-negative simple function and $$\int (\alpha s + \beta t) \, d\mu \;=\; \alpha \int s \, d\mu + \beta \int t \, d\mu.$$ Moreover, if $s \leq t$ pointwise (or even just $\mu$-a.e.), then $$\int s \, d\mu \;\leq\; \int t \, d\mu.$$

Proof.

We prove linearity first; monotonicity follows from it.

Step 1: A common refinement. Suppose $s = \sum_{i=1}^m a_i \mathbf{1}_{A_i}$ and $t = \sum_{j=1}^n b_j \mathbf{1}_{B_j}$, where the $A_i$ partition the support of $s$ and the $B_j$ partition the support of $t$. Form the joint refinement consisting of the sets $C_{ij} = A_i \cap B_j$. These are measurable (intersections of measurable sets), pairwise disjoint, and they cover the union of supports. On each $C_{ij}$, $s$ takes the constant value $a_i$ and $t$ takes the constant value $b_j$, so $\alpha s + \beta t$ takes the constant value $\alpha a_i + \beta b_j$. We have rewritten everything on a single common partition.

Step 2: Compute on the refinement. With everything on the refinement, the integrals are sums: $$\int s \, d\mu \;=\; \sum_{i, j} a_i \cdot \mu(C_{ij}), \qquad \int t \, d\mu \;=\; \sum_{i, j} b_j \cdot \mu(C_{ij}), \qquad \int (\alpha s + \beta t) \, d\mu \;=\; \sum_{i, j} (\alpha a_i + \beta b_j) \cdot \mu(C_{ij}).$$

The first and second sums recover the original definitions of $\int s \, d\mu$ and $\int t \, d\mu$ because $\sum_j \mu(C_{ij}) = \mu(A_i)$ by additivity of $\mu$ on the disjoint $\{C_{ij}\}_j$, and similarly for the $B_j$. The third sum splits as $$\sum_{i, j} (\alpha a_i + \beta b_j) \cdot \mu(C_{ij}) \;=\; \alpha \sum_{i, j} a_i \cdot \mu(C_{ij}) + \beta \sum_{i, j} b_j \cdot \mu(C_{ij}) \;=\; \alpha \int s \, d\mu + \beta \int t \, d\mu.$$

That is linearity.

Step 3: Monotonicity. Suppose $s \leq t$ pointwise. On the common refinement, this means $a_i \leq b_j$ for every $C_{ij}$ with $\mu(C_{ij}) > 0$. (If $\mu(C_{ij}) = 0$ the term contributes nothing.) Summing over $i, j$: $$\int s \, d\mu \;=\; \sum_{i, j} a_i \cdot \mu(C_{ij}) \;\leq\; \sum_{i, j} b_j \cdot \mu(C_{ij}) \;=\; \int t \, d\mu.$$

If $s \leq t$ only $\mu$-almost everywhere — that is, the set $\{x : s(x) > t(x)\}$ has measure zero — then on the common refinement the offending $C_{ij}$ have $\mu(C_{ij}) = 0$ and contribute zero to both sums. The same chain of inequalities goes through.

💡 Remark 1 (Well-definedness of the simple-function integral)

The definition $\int s \, d\mu = \sum c_k \mu(A_k)$ depends on the representation $s = \sum c_k \mathbf{1}_{A_k}$ that we chose. A given simple function has many such representations — we could refine the $A_k$ further, or we could merge sets that share the same coefficient. The proof above shows that the value $\int s \, d\mu$ is independent of representation: any two representations have a common refinement, and the integral computed on the common refinement matches the integral computed on either original. So $\int s \, d\mu$ is a well-defined number, not a function-of-presentation.
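The representation-independence claim can be spot-checked directly. A tiny sketch, using Lebesgue measure on $[0, 1]$ and the same (value, measure) pair encoding as before (both choices are ours):

```python
# The same simple function s = 1 on [0, 2/3), 3 on [2/3, 1], written two ways:
coarse = [(1.0, 2/3), (3.0, 1/3)]               # one set per value
fine = [(1.0, 1/3), (1.0, 1/3), (3.0, 1/3)]     # first set refined into two halves

integral = lambda pieces: sum(c * m for c, m in pieces)
print(integral(coarse), integral(fine))          # identical values, 5/3
```

Refining the partition changes the bookkeeping but not the sum, which is the content of Remark 1.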

3. The Integral for Non-Negative Measurable Functions

Now we extend the integral from simple functions to arbitrary non-negative measurable functions. The strategy is one of the central constructions in real analysis: define the integral as the supremum of the integrals of all simple functions that lie underneath. This works because (a) every non-negative measurable function is the pointwise limit of an increasing sequence of simple functions (Topic 25, Theorem 5), and (b) the integral on simple functions is monotone, so taking the sup of integrals of approximations from below is a sensible analogue of the Riemann integral’s “lower sum.”

📐 Definition 2 (Lebesgue integral of a non-negative measurable function)

Let $f: \Omega \to [0, \infty]$ be a non-negative measurable function on $(\Omega, \mathcal{F}, \mu)$. The Lebesgue integral of $f$ with respect to $\mu$ is $$\int f \, d\mu \;=\; \sup\left\{ \int s \, d\mu \;:\; s \text{ is a non-negative simple function with } 0 \leq s \leq f \right\}.$$ The supremum is taken over all non-negative simple functions $s$ that lie pointwise below $f$.

A few features of this definition deserve immediate attention.

💡 Remark 2 (The integral can be infinite — and that is the point)

The supremum on the right side may be $+\infty$. We allow this. A non-negative measurable function for which $\int f \, d\mu = +\infty$ is integrable in the extended sense but not Lebesgue-integrable in the strict sense. Reserving $+\infty$ as a valid value of the integral lets us state the convergence theorems below without piling on extra hypotheses every time. A function $f$ with $\int f \, d\mu < \infty$ is called integrable (or, more carefully, summable), and we write $f \in L^1(\mu)$ once we have extended the definition to general — not just non-negative — measurable functions in Section 4.

📝 Example 4 (Recovering the improper integral $\int_0^\infty e^{-x} \, dx = 1$)

Let $\mu = \lambda$ on $\mathbb{R}$ and $f(x) = e^{-x} \cdot \mathbf{1}_{[0, \infty)}(x)$. We can build a simple-function approximation from below as follows. For each $n \in \mathbb{N}$, partition $[0, n]$ into $n \cdot 2^n$ subintervals of equal length $1/2^n$, set $s_n$ equal to the infimum of $f$ on each subinterval, and set $s_n = 0$ outside $[0, n]$. Each $s_n$ is a non-negative simple function with $s_n \leq f$, and a direct computation shows $\int s_n \, d\lambda \to 1$ as $n \to \infty$. By the definition, $\int f \, d\lambda = \sup_n \int s_n \, d\lambda \geq 1$. The reverse inequality follows from a similar refinement argument bounding every simple $s \leq f$ from above (or — more cleanly — from the comparison-with-Riemann theorem in Section 8). The conclusion is $\int_0^\infty e^{-x} \, d\lambda = 1$, exactly the value the improper Riemann integral gives.
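The staircase in Example 4 can be computed directly. A sketch under the observation that $e^{-x}$ is decreasing, so its infimum on each cell is the value at the cell's right endpoint (the function name is ours):

```python
import numpy as np

def staircase_integral(n):
    """Integral of the dyadic staircase s_n below f(x) = e^{-x} on [0, n]."""
    width = 2.0 ** -n
    rights = np.arange(1, n * 2 ** n + 1) * width   # right endpoints of the n * 2^n cells
    return float(np.sum(np.exp(-rights)) * width)   # inf of f on each cell, times width

for n in [1, 4, 8, 12]:
    print(n, staircase_integral(n))   # climbs toward 1 from below
```

Each refinement level gives a larger simple function under $f$, so the printed values increase toward the supremum $1$, never overshooting it.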

📝 Example 5 (The Cantor set has Lebesgue measure zero — and so does its indicator)

Let $C \subset [0, 1]$ be the standard middle-thirds Cantor set from Topic 25, and consider $f = \mathbf{1}_C$. From Topic 25, $\lambda(C) = 0$. The function $f$ is itself a simple function (one set, coefficient $1$), so the simple-function integral applies directly: $$\int \mathbf{1}_C \, d\lambda \;=\; 1 \cdot \lambda(C) \;=\; 0.$$ The Cantor set is uncountable — so $\mathbf{1}_C$ is non-zero at uncountably many points — but the Lebesgue integral correctly assigns it value zero. “Uncountable” and “has positive measure” are unrelated concepts; the Cantor set is the canonical example of a set that is simultaneously uncountable and Lebesgue-null.

The integral on non-negative measurable functions has one immediate application that is worth proving right now, because it is the source of every tail bound in probability and ML.

🔷 Proposition 1 (Markov's inequality)

Let $f \geq 0$ be measurable and let $t > 0$. Then $$\mu\bigl(\{x : f(x) \geq t\}\bigr) \;\leq\; \frac{1}{t} \int f \, d\mu.$$

Proof.

Let $A_t = \{x : f(x) \geq t\}$. The set $A_t$ is measurable because $f$ is measurable. Consider the simple function $s = t \cdot \mathbf{1}_{A_t}$. By construction $s(x) = t \leq f(x)$ on $A_t$, and $s(x) = 0 \leq f(x)$ off $A_t$, so $s \leq f$ pointwise. The integral of $s$ is $$\int s \, d\mu \;=\; t \cdot \mu(A_t).$$

By the definition of $\int f \, d\mu$ as a supremum over simple functions $\leq f$, $$t \cdot \mu(A_t) \;=\; \int s \, d\mu \;\leq\; \int f \, d\mu.$$

Dividing both sides by $t$ (which is positive): $$\mu(A_t) \;\leq\; \frac{1}{t} \int f \, d\mu.$$
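A hedged empirical check of the inequality, taking $f(X) = X$ for $X \sim \mathrm{Exp}(1)$ (our choice of distribution; any non-negative random variable works):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(1.0, size=1_000_000)   # non-negative samples with mean close to 1

for t in [0.5, 1.0, 2.0, 4.0]:
    tail = float(np.mean(x >= t))          # empirical mu({f >= t}) = P(X >= t)
    bound = float(x.mean()) / t            # (1/t) * integral of f = E[X] / t
    print(t, tail, "<=", bound)            # the bound holds at every t, loosely
```

For this distribution the true tail is $e^{-t}$ while the bound is $1/t$, so Markov is valid but far from tight, which is typical; Chebyshev and Chernoff sharpen it by applying Markov to better-chosen functions of $X$.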

💡 Remark 3 (Markov's inequality is the foundation of every tail bound in probability)

Markov’s inequality looks innocent — three lines of proof, no clever tricks — but it is the seed crystal for the entire theory of concentration of measure. Specializing $f$ gives Chebyshev’s inequality (set $f = (X - \mu)^2$), Hoeffding’s inequality (set $f = e^{tX}$ for some $t$ and tune), and ultimately the Chernoff–Cramér bound that powers VC dimension and Rademacher complexity arguments. Every “with high probability” statement in modern statistical learning theory rests on a Markov-style tail bound applied to a cleverly chosen non-negative random variable. Forward link: Concentration Inequalities.

The interactive viz below makes the supremum construction concrete. As you increase the dyadic refinement level, the simple-function staircase climbs upward toward $f$ and the integral climbs upward toward $\int f \, d\mu$. The supremum is the limit of this climbing process — it is what the construction computes in the limit, not a single staircase you can draw on the page.


The half-cycle of a sine wave on [0, 1]. Exact integral 2/π ≈ 0.6366. The simple function s_n is the largest dyadic staircase below f at this level — the supremum over all such staircases as n → ∞ is the Lebesgue integral ∫ f dλ.

4. The Integral for General Measurable Functions

So far we have only integrated non-negative functions. To integrate a general measurable function (positive or negative), we split it into its positive and negative parts and integrate each separately.

📐 Definition 3 (Positive and negative parts)

For any measurable function $f: \Omega \to \mathbb{R}$, define $$f^+(x) \;=\; \max(f(x), 0), \qquad f^-(x) \;=\; \max(-f(x), 0).$$ Both $f^+$ and $f^-$ are non-negative measurable functions. They satisfy $$f \;=\; f^+ - f^-, \qquad |f| \;=\; f^+ + f^-.$$

A small visual: $f^+$ keeps the positive parts of $f$ and zeros out the negative parts; $f^-$ does the opposite (and flips the sign so it is also non-negative). Their difference is $f$, and their sum is $|f|$. Both are non-negative, so the previous section applies to each.

📐 Definition 4 (Lebesgue integral of a general measurable function)

Let $f: \Omega \to \mathbb{R}$ be a measurable function on $(\Omega, \mathcal{F}, \mu)$. If at least one of $\int f^+ \, d\mu$ and $\int f^- \, d\mu$ is finite, the Lebesgue integral of $f$ with respect to $\mu$ is $$\int f \, d\mu \;=\; \int f^+ \, d\mu \;-\; \int f^- \, d\mu.$$ If both are finite, $f$ is Lebesgue-integrable and we write $f \in L^1(\mu)$. The space $L^1(\mu)$ consists of all measurable functions $f$ with $\int |f| \, d\mu < \infty$.

The reason we require at least one to be finite is to avoid the indeterminate form $\infty - \infty$. If both $\int f^+$ and $\int f^-$ are infinite, the integral $\int f$ is left undefined. (In some treatments such functions are called “non-integrable” or “improperly integrable”; we will not need that distinction in this topic.)

📝 Example 6 ($\int_{-\pi}^{\pi} \sin(x) \, d\lambda = 0$ — positive and negative parts cancel)

The function $\sin(x)$ on $[-\pi, \pi]$ is non-negative on $[0, \pi]$ and non-positive on $[-\pi, 0]$. Its positive part is $\sin^+(x) = \sin(x) \cdot \mathbf{1}_{[0, \pi]}(x)$ and its negative part is $\sin^-(x) = -\sin(x) \cdot \mathbf{1}_{[-\pi, 0]}(x) = \sin(-x) \cdot \mathbf{1}_{[-\pi, 0]}(x)$. By symmetry, $$\int \sin^+ \, d\lambda \;=\; \int_0^{\pi} \sin(x) \, d\lambda \;=\; 2, \qquad \int \sin^- \, d\lambda \;=\; \int_{-\pi}^0 -\sin(x) \, d\lambda \;=\; \int_0^{\pi} \sin(x) \, d\lambda \;=\; 2.$$ Both integrals are finite, so $\sin \in L^1([-\pi, \pi], \lambda)$, and $$\int \sin \, d\lambda \;=\; 2 - 2 \;=\; 0.$$
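The split-and-cancel computation can be mirrored numerically, with grid-based sums standing in for the Lebesgue integrals (the grid size is our choice):

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 2_000_001)
dx = x[1] - x[0]
f = np.sin(x)

pos = float(np.sum(np.maximum(f, 0.0)) * dx)    # integral of sin^+, close to 2
neg = float(np.sum(np.maximum(-f, 0.0)) * dx)   # integral of sin^-, close to 2
print(pos, neg, pos - neg)                       # difference close to 0
```

`np.maximum(f, 0.0)` and `np.maximum(-f, 0.0)` are exactly the pointwise definitions of $f^+$ and $f^-$ from Definition 3.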

The definition extends linearity from simple functions to all of $L^1$, but the proof requires a small bit of bookkeeping.

🔷 Theorem 2 (Linearity of the Lebesgue integral on $L^1(\mu)$)

Let $f, g \in L^1(\mu)$ and $\alpha, \beta \in \mathbb{R}$. Then $\alpha f + \beta g \in L^1(\mu)$ and $$\int (\alpha f + \beta g) \, d\mu \;=\; \alpha \int f \, d\mu + \beta \int g \, d\mu.$$

Proof.

The strategy is to reduce to the non-negative case, where linearity is already established (as a limit of the simple-function linearity from Theorem 1).

Step 1: Non-negative linearity. First we extend Theorem 1 from simple functions to general non-negative measurable functions. If $f, g \geq 0$ are measurable and $\alpha, \beta \geq 0$, choose increasing sequences of non-negative simple functions $s_n \uparrow f$ and $t_n \uparrow g$ (Topic 25, Theorem 5). Then $\alpha s_n + \beta t_n$ is a non-negative simple function with $\alpha s_n + \beta t_n \uparrow \alpha f + \beta g$. By the definition of the integral as a supremum and Theorem 1 applied at each step, $$\int (\alpha f + \beta g) \, d\mu \;=\; \sup_n \int (\alpha s_n + \beta t_n) \, d\mu \;=\; \sup_n \left( \alpha \int s_n \, d\mu + \beta \int t_n \, d\mu \right).$$

The sup of a sum of non-negative terms equals the sum of sups (each term is non-decreasing in $n$), so this becomes $$\alpha \sup_n \int s_n \, d\mu + \beta \sup_n \int t_n \, d\mu \;=\; \alpha \int f \, d\mu + \beta \int g \, d\mu.$$

We will give a tighter version of this “sup of sum” step inside the proof of the Monotone Convergence Theorem in Section 5; for now we treat it as the natural extension of Theorem 1 to limits.

Step 2: Subtraction by splitting. For general $f \in L^1(\mu)$, write $f = f^+ - f^-$ with both $f^+, f^- \in L^1(\mu)$ (their integrals are both finite, since $\int |f| < \infty$). Similarly $g = g^+ - g^-$. The sum $f + g$ has positive and negative parts that satisfy $$(f + g)^+ - (f + g)^- \;=\; f + g \;=\; (f^+ + g^+) - (f^- + g^-).$$

This identity is the key. The two decompositions of $f + g$ — its own canonical $(f+g)^+ - (f+g)^-$ split, and the sum-of-splits $(f^+ + g^+) - (f^- + g^-)$ — are not the same decomposition (the second one is “wasteful” in that it can have $f^+$ and $g^-$ both non-zero at the same point, which the canonical decomposition would simplify), but they represent the same function. Rearranging: $$(f + g)^+ + f^- + g^- \;=\; (f + g)^- + f^+ + g^+.$$

Both sides are non-negative measurable functions, so we can apply Step 1 (non-negative linearity) to each. Integrating: $$\int (f + g)^+ \, d\mu + \int f^- \, d\mu + \int g^- \, d\mu \;=\; \int (f + g)^- \, d\mu + \int f^+ \, d\mu + \int g^+ \, d\mu.$$

All six integrals are finite (everything is in $L^1$), so we can rearrange: $$\int (f + g)^+ \, d\mu - \int (f + g)^- \, d\mu \;=\; \left[ \int f^+ \, d\mu - \int f^- \, d\mu \right] + \left[ \int g^+ \, d\mu - \int g^- \, d\mu \right].$$

The left side is $\int (f + g) \, d\mu$ by Definition 4; the right side is $\int f \, d\mu + \int g \, d\mu$. So $\int (f + g) \, d\mu = \int f \, d\mu + \int g \, d\mu$.

Step 3: Scalar multiplication. For $\alpha \geq 0$, $(\alpha f)^+ = \alpha f^+$ and $(\alpha f)^- = \alpha f^-$, so $\int \alpha f \, d\mu = \alpha \int f^+ \, d\mu - \alpha \int f^- \, d\mu = \alpha \int f \, d\mu$. For $\alpha < 0$, $(\alpha f)^+ = (-\alpha) f^- = |\alpha| f^-$ and $(\alpha f)^- = |\alpha| f^+$, so $\int \alpha f \, d\mu = |\alpha| \int f^- \, d\mu - |\alpha| \int f^+ \, d\mu = -|\alpha| \int f \, d\mu = \alpha \int f \, d\mu$. Either way, scalar multiplication pulls through.

Combining additivity (Step 2) and scalar multiplication (Step 3), $\int (\alpha f + \beta g) \, d\mu = \alpha \int f \, d\mu + \beta \int g \, d\mu$.

💡 Remark 4 (The Lebesgue integral strictly extends the Riemann integral)

If $f: [a, b] \to \mathbb{R}$ is bounded and Riemann-integrable, then $f$ is Lebesgue-integrable on $[a, b]$ (with respect to Lebesgue measure restricted to $[a, b]$) and the two integrals agree: $$\int_a^b f(x) \, dx \;=\; \int_{[a, b]} f \, d\lambda.$$ We will prove this in Section 8 (Theorem 6). The converse is false: the Dirichlet function $\mathbf{1}_{\mathbb{Q}}$ on $[0, 1]$ is Lebesgue-integrable (Example 2) with integral $0$, but it is not Riemann-integrable (Topic 25, Section 1). So the Lebesgue integral is a strict extension — the same functions as before plus genuinely more.

5. The Monotone Convergence Theorem

The first of the three convergence theorems. The Monotone Convergence Theorem — historically due to Beppo Levi (1906) — handles the easy case where $(f_n)$ is an increasing sequence. Increasing means we never lose mass, so the limit–integral interchange goes through with no extra hypotheses.

🔷 Theorem 3 (Monotone Convergence Theorem (Beppo Levi))

Let $(f_n)_{n \geq 1}$ be a sequence of non-negative measurable functions on $(\Omega, \mathcal{F}, \mu)$ with $$0 \;\leq\; f_1(x) \;\leq\; f_2(x) \;\leq\; \cdots \quad \text{for $\mu$-a.e. } x,$$ and let $f(x) = \lim_{n \to \infty} f_n(x)$ be the pointwise limit (which exists in $[0, \infty]$ because the sequence is non-decreasing). Then $f$ is measurable and $$\int f \, d\mu \;=\; \lim_{n \to \infty} \int f_n \, d\mu.$$

The interactive viz below shows three monotone-convergence sequence types — truncation $f_n = \min(f, n)$, expanding support $f_n = f \cdot \mathbf{1}_{[-n, n]}$, and the dyadic simple-function staircase — and the corresponding integral values climbing toward the limit. Try each and see how the bar chart of $\int f_1, \int f_2, \ldots$ converges to $\int f$ from below.


Truncate f(x) = 1/√x at height n. As n → ∞ the cap rises and f_n ↑ f. The integrals climb from a small value toward 2 = ∫ f. MCT predicts: ∫ f_n ↑ ∫ f because f_n is non-decreasing and converges pointwise to f.
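The truncation curve can be reproduced in a few lines. For $f(x) = 1/\sqrt{x}$ on $(0, 1]$ the truncated integral is exactly $2 - 1/n$: the cap region $(0, 1/n^2]$ contributes $n \cdot 1/n^2 = 1/n$ and the rest contributes $2 - 2/n$. A sketch using a midpoint-grid approximation (the grid size and function name are ours):

```python
import numpy as np

def truncated_integral(n, cells=1_000_000):
    """Integral of f_n = min(1/sqrt(x), n) over (0, 1], by the midpoint rule."""
    x = (np.arange(cells) + 0.5) / cells          # cell midpoints in (0, 1)
    fn = np.minimum(1.0 / np.sqrt(x), float(n))   # the truncated function f_n
    return float(fn.mean())                        # mean value = integral over (0, 1)

for n in [1, 2, 5, 10, 100]:
    print(n, truncated_integral(n), 2 - 1 / n)    # numeric value vs exact 2 - 1/n
```

The truncation never overshoots, so the numeric values climb monotonically toward $\int f \, d\lambda = 2$, exactly the MCT picture.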

The proof is the most technically demanding piece of this topic and is worth working through carefully. It is also the prototype for every other “limit and integral commute” argument we’ll see.

Proof.

We must show $\int f \, d\mu = \lim_n \int f_n \, d\mu$. The plan is to prove $\leq$ and $\geq$ separately. The $\geq$ direction is trivial; the $\leq$ direction is where the work happens.

Step 1: $\lim_n \int f_n \, d\mu \leq \int f \, d\mu$.

Since $f_n \leq f$ for every $n$, monotonicity of the integral gives $\int f_n \, d\mu \leq \int f \, d\mu$ for every $n$. Taking the supremum (which equals the limit because the sequence $\int f_n \, d\mu$ is non-decreasing — a consequence of monotonicity applied to $f_n \leq f_{n+1}$), $$\lim_{n \to \infty} \int f_n \, d\mu \;=\; \sup_n \int f_n \, d\mu \;\leq\; \int f \, d\mu.$$

Step 2: $\int f \, d\mu \leq \lim_n \int f_n \, d\mu$.

This is the harder direction. By Definition 2, $\int f \, d\mu$ is a supremum over simple functions $s$ with $0 \leq s \leq f$. So it is enough to show that for every such $s$, $$\int s \, d\mu \;\leq\; \lim_n \int f_n \, d\mu.$$

Fix one such simple function $s = \sum_{k=1}^m c_k \mathbf{1}_{A_k}$ with $0 \leq s \leq f$.

Step 2a: A scaled cutoff trick. Pick any $\alpha \in (0, 1)$. Define $$E_n \;=\; \{x : f_n(x) \geq \alpha \cdot s(x)\}.$$

The set $E_n$ is measurable (it is a level set of the difference $f_n - \alpha s$, which is measurable). Crucially, the sets $E_n$ are increasing: $f_n \leq f_{n+1}$ implies that if $f_n \geq \alpha s$ then $f_{n+1} \geq \alpha s$, so $E_n \subseteq E_{n+1}$. And their union covers all of $\Omega$ (well, all of $\Omega$ except a null set): for any fixed $x$ with $s(x) > 0$, the sequence $f_n(x)$ converges to $f(x) \geq s(x) > \alpha s(x)$ (the strict inequality holds because $\alpha < 1$), so $f_n(x) \geq \alpha s(x)$ for all sufficiently large $n$; on $\{s = 0\}$ the inequality $f_n \geq 0 = \alpha s$ holds trivially, so $x \in E_n$ for every $n$. So $\bigcup_n E_n = \Omega$ up to a null set.

Step 2b: Bound $\int f_n$ from below using $E_n$. Restrict the integral to $E_n$: $$\int f_n \, d\mu \;\geq\; \int_{E_n} f_n \, d\mu \;=\; \int f_n \cdot \mathbf{1}_{E_n} \, d\mu.$$

(The first inequality is monotonicity: $f_n \geq f_n \cdot \mathbf{1}_{E_n}$ pointwise, since the indicator is at most $1$ and $f_n \geq 0$.) On $E_n$ we have $f_n \geq \alpha s$ by definition, so $$\int f_n \cdot \mathbf{1}_{E_n} \, d\mu \;\geq\; \int \alpha s \cdot \mathbf{1}_{E_n} \, d\mu \;=\; \alpha \int s \cdot \mathbf{1}_{E_n} \, d\mu \;=\; \alpha \int_{E_n} s \, d\mu.$$

(The second equality uses linearity from Theorem 1.) Combining the two displayed inequalities: $$\int f_n \, d\mu \;\geq\; \alpha \int_{E_n} s \, d\mu \;=\; \alpha \sum_{k=1}^m c_k \, \mu(A_k \cap E_n).$$

Step 2c: Send nn \to \infty. The sets AkEnA_k \cap E_n are increasing in nn for each kk, with union AknEn=AkA_k \cap \bigcup_n E_n = A_k (up to a null set). By continuity of measure from below (Topic 25, Theorem 1, second form), μ(AkEn)  n  μ(Ak).\mu(A_k \cap E_n) \;\xrightarrow[n \to \infty]{}\; \mu(A_k).

So k=1mckμ(AkEn)k=1mckμ(Ak)=sdμ\sum_{k=1}^m c_k \, \mu(A_k \cap E_n) \to \sum_{k=1}^m c_k \, \mu(A_k) = \int s \, d\mu as nn \to \infty. Plugging this into the bound from Step 2b: limnfndμ    αsdμ.\lim_{n \to \infty} \int f_n \, d\mu \;\geq\; \alpha \cdot \int s \, d\mu.

Step 2d: Send α1\alpha \to 1^-. The bound limnfndμαsdμ\lim_n \int f_n \, d\mu \geq \alpha \int s \, d\mu holds for every α(0,1)\alpha \in (0, 1). Letting α1\alpha \uparrow 1, limnfndμ    sdμ.\lim_{n \to \infty} \int f_n \, d\mu \;\geq\; \int s \, d\mu.

Step 2e: Take the supremum over ss. This bound holds for every simple function ss with 0sf0 \leq s \leq f. Taking the supremum over all such ss on the right and using Definition 2, limnfndμ    sup0sfsdμ  =  fdμ.\lim_{n \to \infty} \int f_n \, d\mu \;\geq\; \sup_{0 \leq s \leq f} \int s \, d\mu \;=\; \int f \, d\mu.

This is the inequality we needed. Combined with Step 1, we get equality: fdμ=limnfndμ\int f \, d\mu = \lim_n \int f_n \, d\mu.

The “scaled cutoff” trick — picking α<1\alpha < 1 to give ourselves room, then sending α1\alpha \to 1 at the end — is a recurring move in measure theory. It is the technical engine behind every “lower-semicontinuity” type argument.

📝 Example 7 (Series as integrals against counting measure)

Let μ\mu be the counting measure on N\mathbb{N}: μ(A)\mu(A) is the cardinality of AA for finite AA, and \infty otherwise. A function f:N[0,)f: \mathbb{N} \to [0, \infty) is just a sequence (ak)k1(a_k)_{k \geq 1} with ak=f(k)a_k = f(k), and the Lebesgue integral against μ\mu is the sum fdμ=k=1ak\int f \, d\mu = \sum_{k=1}^{\infty} a_k.

Apply MCT to the partial-sum sequence fn=k=1nak1{k}f_n = \sum_{k=1}^n a_k \mathbf{1}_{\{k\}}. Each fnf_n is a non-negative simple function with fndμ=k=1nak\int f_n \, d\mu = \sum_{k=1}^n a_k. The sequence is increasing pointwise (each fnf_n adds one more non-negative term). The limit is f=k=1ak1{k}f = \sum_{k=1}^{\infty} a_k \mathbf{1}_{\{k\}}, and fdμ=k=1ak\int f \, d\mu = \sum_{k=1}^{\infty} a_k. MCT says k=1ak  =  fdμ  =  limnfndμ  =  limnk=1nak,\sum_{k=1}^{\infty} a_k \;=\; \int f \, d\mu \;=\; \lim_{n \to \infty} \int f_n \, d\mu \;=\; \lim_{n \to \infty} \sum_{k=1}^n a_k, which is just the definition of an infinite series. MCT specializes to “the series of non-negative terms equals the limit of partial sums” — a fact we already knew, but now justified inside the unified Lebesgue framework.

A clean corollary of MCT is the swap of sum and integral for non-negative functions — a result we will use repeatedly.

🔷 Corollary 1 (Sum–integral swap for non-negative functions)

Let (fn)n1(f_n)_{n \geq 1} be a sequence of non-negative measurable functions on (Ω,F,μ)(\Omega, \mathcal{F}, \mu). Then n=1fndμ  =  n=1fndμ.\int \sum_{n=1}^{\infty} f_n \, d\mu \;=\; \sum_{n=1}^{\infty} \int f_n \, d\mu.

Proof: apply MCT to the partial sums SN=n=1NfnS_N = \sum_{n=1}^N f_n. The sequence (SN)(S_N) is non-negative and increasing, with pointwise limit n=1fn\sum_{n=1}^\infty f_n. By Theorem 1 (linearity for finite sums of simple functions, then promoted to non-negative measurable functions in Theorem 2 Step 1), SNdμ=n=1Nfndμ\int S_N \, d\mu = \sum_{n=1}^N \int f_n \, d\mu. MCT gives (fn)dμ=limNSNdμ=limNn=1Nfndμ=n=1fndμ\int (\sum f_n) \, d\mu = \lim_N \int S_N \, d\mu = \lim_N \sum_{n=1}^N \int f_n \, d\mu = \sum_{n=1}^\infty \int f_n \, d\mu.
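Corollary 1 can be sanity-checked numerically. A hypothetical instance (our choice, not one from the text): take fn(x)=(x/2)nf_n(x) = (x/2)^n on [0,1][0, 1], whose pointwise sum is x/(2x)x/(2-x) and whose individual integrals are 1/(2n(n+1))1/(2^n(n+1)).

```python
from scipy import integrate

# Sum-integral swap for f_n(x) = (x/2)^n on [0, 1]: the pointwise sum
# is x/(2 - x), and each term integrates to 1/(2^n (n + 1)).
lhs, _ = integrate.quad(lambda x: x / (2 - x), 0, 1)    # ∫ Σ f_n dλ
rhs = sum(1 / (2**n * (n + 1)) for n in range(1, 60))   # Σ ∫ f_n dλ
print(lhs, rhs)   # both ≈ 2 ln 2 − 1 ≈ 0.386294
```

Both sides agree with the closed form 2ln212 \ln 2 - 1, as the corollary demands.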

Three-panel monotone convergence: truncation min(f, n) at n=1, 3, 10 with shaded integrals climbing toward the limit

6. Fatou’s Lemma

MCT requires the sequence to be increasing. What if it isn’t? Fatou’s Lemma gives the next-best thing: a one-sided inequality that holds for any sequence of non-negative measurable functions, with no monotonicity assumption.

🔷 Theorem 4 (Fatou's Lemma)

Let (fn)n1(f_n)_{n \geq 1} be a sequence of non-negative measurable functions on (Ω,F,μ)(\Omega, \mathcal{F}, \mu). Then lim infnfndμ    lim infnfndμ.\int \liminf_{n \to \infty} f_n \, d\mu \;\leq\; \liminf_{n \to \infty} \int f_n \, d\mu.

A reminder on the notation: for any sequence (an)(a_n) of extended reals, lim infnan=limninfknak\liminf_n a_n = \lim_{n \to \infty} \inf_{k \geq n} a_k — the “eventual” lower bound on the sequence. For functions, lim infnfn\liminf_n f_n is the pointwise lim inf\liminf at each point.

Proof.

Define a new sequence gn(x)  =  infknfk(x).g_n(x) \;=\; \inf_{k \geq n} f_k(x).

The function gng_n is the pointwise infimum of the tail {fn,fn+1,}\{f_n, f_{n+1}, \ldots\}. By the standard argument (intersection of measurable level sets), each gng_n is measurable and non-negative.

Key observation 1: (gn)(g_n) is increasing. As nn grows, the set {k:kn}\{k : k \geq n\} shrinks, and the infimum over a smaller set is at least as large as the infimum over a larger set: gn  =  infknfk    infkn+1fk  =  gn+1.g_n \;=\; \inf_{k \geq n} f_k \;\leq\; \inf_{k \geq n + 1} f_k \;=\; g_{n + 1}.

So 0g1g20 \leq g_1 \leq g_2 \leq \cdots.

Key observation 2: gnfng_n \leq f_n. Trivially: gng_n is the infimum of {fn,fn+1,}\{f_n, f_{n+1}, \ldots\}, which includes fnf_n itself, so gnfng_n \leq f_n pointwise. By monotonicity, gndμ    fndμ.\int g_n \, d\mu \;\leq\; \int f_n \, d\mu.

Key observation 3: gnlim infnfng_n \to \liminf_n f_n. By definition, lim infnfn(x)  =  limninfknfk(x)  =  limngn(x).\liminf_{n \to \infty} f_n(x) \;=\; \lim_{n \to \infty} \inf_{k \geq n} f_k(x) \;=\; \lim_{n \to \infty} g_n(x).

So gng_n converges pointwise to lim infnfn\liminf_n f_n, and the convergence is monotone-increasing (Observation 1).

Apply MCT to (gn)(g_n). All three hypotheses of MCT hold: (gn)(g_n) is non-negative, measurable, and monotone-increasing with pointwise limit lim infnfn\liminf_n f_n. So lim infnfndμ  =  limngndμ  =  limngndμ.\int \liminf_{n \to \infty} f_n \, d\mu \;=\; \int \lim_{n \to \infty} g_n \, d\mu \;=\; \lim_{n \to \infty} \int g_n \, d\mu.

Combine with Observation 2. From Observation 2, gndμfndμ\int g_n \, d\mu \leq \int f_n \, d\mu, so the limit on the left is bounded by the lim inf\liminf of the right: limngndμ    lim infnfndμ.\lim_{n \to \infty} \int g_n \, d\mu \;\leq\; \liminf_{n \to \infty} \int f_n \, d\mu.

(Note: we use lim inf\liminf on the right rather than lim\lim, because fndμ\int f_n \, d\mu does not necessarily converge — it could oscillate. Whatever it does, gnlim inffn\int g_n \to \int \liminf f_n from below, and stays below the lim inf\liminf of fn\int f_n.) Combining with the previous display: lim infnfndμ    lim infnfndμ.\int \liminf_{n \to \infty} f_n \, d\mu \;\leq\; \liminf_{n \to \infty} \int f_n \, d\mu.

Fatou’s Lemma is a one-sided inequality, and the example below shows that the inequality can be strict — equality is not guaranteed.

📝 Example 8 (Strict inequality in Fatou: the spike sequence)

Take fn(x)=n1[0,1/n](x)f_n(x) = n \cdot \mathbf{1}_{[0, 1/n]}(x) on [0,1][0, 1] — the same sequence from Example 1. Then fn(x)0f_n(x) \to 0 pointwise for every x>0x > 0, so lim infnfn=0\liminf_n f_n = 0 everywhere (well, almost everywhere — the value at x=0x = 0 tends to \infty, but a single point is null). Therefore lim infnfndλ  =  0dλ  =  0.\int \liminf_{n \to \infty} f_n \, d\lambda \;=\; \int 0 \, d\lambda \;=\; 0. But fndλ=1\int f_n \, d\lambda = 1 for every nn, so lim infnfndλ=1\liminf_n \int f_n \, d\lambda = 1.

Fatou’s Lemma gives 010 \leq 1 — correct, but strict. The mass of fnf_n does not vanish in the integral even though it vanishes pointwise. This is the same phenomenon we observed in Puzzle 1 from Section 1: pointwise convergence does not always imply convergence of integrals.
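The spike computation from Example 8 is easy to reproduce numerically — a quick sketch with scipy, again with quadrature standing in for the Lebesgue integral:

```python
from scipy import integrate

# The spike sequence f_n = n·1_[0, 1/n]: every ∫ f_n dλ equals 1,
# while the pointwise limit is 0 a.e., so ∫ liminf f_n dλ = 0.
for n in [2, 10, 100, 1000]:
    # points=[1/n] flags the jump for the adaptive quadrature rule
    val, _ = integrate.quad(lambda x, n=n: n if x <= 1 / n else 0.0,
                            0, 1, points=[1 / n])
    print(f"n={n:5d}  int f_n = {val:.6f}")   # stays at 1.000000
```

The integrals sit at 1 for every n while the pointwise limit integrates to 0 — Fatou's inequality is strict here.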

💡 Remark 5 (Fatou cannot be promoted to equality without extra hypotheses)

The strict inequality in Example 8 shows that Fatou’s Lemma cannot in general be improved to equality. Some additional structure on the sequence (fn)(f_n) is required to recover equality. Monotone Convergence gives one such structure (an increasing sequence), and Dominated Convergence in the next section gives another (existence of an integrable dominator). Without one of these extra assumptions, Fatou’s one-sided inequality is the best we can say.

Two-panel Fatou's lemma: strict inequality case with the spike sequence and the integrals staying at 1 while the pointwise limit is 0

7. The Dominated Convergence Theorem

The crown jewel. The Dominated Convergence Theorem is the result that justifies the limit-integral interchange in nearly every modern application — including the gradient–expectation swap that drives stochastic gradient descent.

🔷 Theorem 5 (Dominated Convergence Theorem (Lebesgue))

Let (fn)n1(f_n)_{n \geq 1} be a sequence of measurable functions on (Ω,F,μ)(\Omega, \mathcal{F}, \mu) with fn(x)f(x)f_n(x) \to f(x) pointwise μ\mu-a.e. Suppose there exists an integrable function gL1(μ)g \in L^1(\mu) (i.e., gdμ<\int g \, d\mu < \infty) such that fn(x)    g(x)for μ-a.e. x, for every n.|f_n(x)| \;\leq\; g(x) \quad \text{for $\mu$-a.e. } x, \text{ for every } n. Then fL1(μ)f \in L^1(\mu) and fdμ  =  limnfndμ.\int f \, d\mu \;=\; \lim_{n \to \infty} \int f_n \, d\mu. Equivalently, limnfndμ=limnfndμ\lim_n \int f_n \, d\mu = \int \lim_n f_n \, d\mu — limits and integrals commute.

The flagship visualization for this topic is below. It contrasts two scenarios where DCT applies (sequences that have an integrable dominator and whose integrals do converge to the integral of the limit) with one scenario where DCT does not apply (the spike sequence from Example 1, which has no integrable dominator and whose integrals do not converge). Toggle between scenarios and watch the integral history bar chart on the right; the contrast is the conceptual heart of DCT.

Readout: ∫ f_n dμ = 0.41990 · ∫ f dμ = 0.00000 · |∫ f_n − ∫ f| = 0.41990 · Dominator g is integrable

f_n(x) = sin(x/n) / (1 + x²). As n → ∞ the argument x/n shrinks toward 0, so sin(x/n) → sin(0) = 0 and f_n → 0 pointwise (not by oscillation cancellation, but because the argument itself collapses). Dominator g(x) = 1 / (1 + x²) is integrable on [0, ∞) with ∫ g = π/2, and |sin(x/n)| ≤ 1 gives |f_n| ≤ g. DCT applies, and ∫ f_n → 0 = ∫ f. DCT predicts: ∫ f_n → ∫ f because |f_n| ≤ g and ∫ g < ∞.

The proof uses a slick double application of Fatou’s Lemma — once each to two non-negative sequences manufactured from fnf_n and the dominator gg.

Proof.

The strategy is to apply Fatou’s Lemma to two sequences: (g+fn)(g + f_n) and (gfn)(g - f_n). Both are non-negative because fng|f_n| \leq g implies gfng-g \leq f_n \leq g, so g+fn0g + f_n \geq 0 and gfn0g - f_n \geq 0.

Step 1: ff is integrable. Pointwise convergence fnff_n \to f and fng|f_n| \leq g together imply fg|f| \leq g. By monotonicity, fdμ    gdμ  <  ,\int |f| \, d\mu \;\leq\; \int g \, d\mu \;<\; \infty, so fL1(μ)f \in L^1(\mu). The integral fdμ\int f \, d\mu is therefore well-defined and finite.

Step 2: Apply Fatou to g+fng + f_n. The sequence (g+fn)(g + f_n) is non-negative and converges pointwise to g+fg + f. Fatou’s Lemma gives (g+f)dμ    lim infn(g+fn)dμ.\int (g + f) \, d\mu \;\leq\; \liminf_{n \to \infty} \int (g + f_n) \, d\mu.

By linearity (Theorem 2), gdμ+fdμ    lim infn[gdμ+fndμ]  =  gdμ+lim infnfndμ.\int g \, d\mu + \int f \, d\mu \;\leq\; \liminf_{n \to \infty} \left[ \int g \, d\mu + \int f_n \, d\mu \right] \;=\; \int g \, d\mu + \liminf_{n \to \infty} \int f_n \, d\mu.

(The constant g\int g pulls out of the lim inf\liminf.) Subtracting gdμ\int g \, d\mu from both sides — legal because gdμ\int g \, d\mu is finite — we get \int f \, d\mu \;\leq\; \liminf_{n \to \infty} \int f_n \, d\mu. \tag{$\star$}

Step 3: Apply Fatou to gfng - f_n. The sequence (gfn)(g - f_n) is also non-negative and converges pointwise to gfg - f. Fatou’s Lemma gives (gf)dμ    lim infn(gfn)dμ.\int (g - f) \, d\mu \;\leq\; \liminf_{n \to \infty} \int (g - f_n) \, d\mu.

Linearity again: gdμfdμ    gdμlim supnfndμ.\int g \, d\mu - \int f \, d\mu \;\leq\; \int g \, d\mu - \limsup_{n \to \infty} \int f_n \, d\mu.

The key move on the right: lim infn(an)=lim supnan\liminf_n (-a_n) = -\limsup_n a_n, so lim infn(gfn)=glim supnfn\liminf_n (\int g - \int f_n) = \int g - \limsup_n \int f_n. Subtract gdμ\int g \, d\mu and multiply both sides by 1-1 (which flips the inequality): \limsup_{n \to \infty} \int f_n \, d\mu \;\leq\; \int f \, d\mu. \tag{$\star\star$}

Step 4: Combine ()(\star) and ()(\star\star). From ()(\star) and ()(\star\star), fdμ    lim infnfndμ    lim supnfndμ    fdμ.\int f \, d\mu \;\leq\; \liminf_{n \to \infty} \int f_n \, d\mu \;\leq\; \limsup_{n \to \infty} \int f_n \, d\mu \;\leq\; \int f \, d\mu.

Equality must hold throughout. In particular, lim infnfn=lim supnfn\liminf_n \int f_n = \limsup_n \int f_n, which means limnfn\lim_n \int f_n exists, and its common value is fdμ\int f \, d\mu.

The proof is short once you see the trick: manufacture two non-negative sequences from fnf_n and the dominator, apply Fatou to each, and let the upper and lower bounds collapse on f\int f. The hard work was in MCT (which underlies Fatou); DCT is a clean consequence.

📝 Example 9 (DCT applied: $\int_0^\infty \sin(x/n) / (x^2 + 1) \, dx \to 0$)

For each n1n \geq 1, define fn(x)=sin(x/n)/(x2+1)f_n(x) = \sin(x/n) / (x^2 + 1) on [0,)[0, \infty). As nn \to \infty, sin(x/n)sin(0)=0\sin(x/n) \to \sin(0) = 0 for every xx, so fn0f_n \to 0 pointwise. (The convergence is even uniform on bounded intervals.)

We need a dominator. Note that sin(x/n)1|\sin(x/n)| \leq 1 for every xx and nn, so fn(x)  =  sin(x/n)x2+1    1x2+1.|f_n(x)| \;=\; \frac{|\sin(x/n)|}{x^2 + 1} \;\leq\; \frac{1}{x^2 + 1}. Set g(x)=1/(x2+1)g(x) = 1 / (x^2 + 1). Then 0g(x)dλ=[arctan(x)]0=π/2<\int_0^\infty g(x) \, d\lambda = [\arctan(x)]_0^\infty = \pi/2 < \infty, so gg is integrable. DCT applies, and limn0sin(x/n)x2+1dλ  =  00dλ  =  0.\lim_{n \to \infty} \int_0^\infty \frac{\sin(x/n)}{x^2 + 1} \, d\lambda \;=\; \int_0^\infty 0 \, d\lambda \;=\; 0. This is the same sequence the flagship viz in this section uses. (Note: the related sequence fn(x)=sin(nx)/(x2+1)f_n(x) = \sin(nx)/(x^2 + 1) with high-frequency oscillation also has fn0\int f_n \to 0, but for a different reason — the Riemann-Lebesgue lemma rather than DCT, since sin(nx)\sin(nx) does not converge pointwise. Both reach the same conclusion via different theorems.)

📝 Example 10 (DCT justifies differentiation under the integral sign)

Let f(x,θ)f(x, \theta) be a function with the following properties: for each fixed θ\theta, f(,θ)L1(μ)f(\cdot, \theta) \in L^1(\mu); for each fixed xx, f(x,)f(x, \cdot) is differentiable in θ\theta; and there is an integrable function g(x)g(x) with fθ(x,θ)    g(x)for μ-a.e. x, for every θI\left| \frac{\partial f}{\partial \theta}(x, \theta) \right| \;\leq\; g(x) \quad \text{for $\mu$-a.e. } x, \text{ for every } \theta \in I on some interval II containing the point of interest. Then ddθf(x,θ)dμ(x)  =  fθ(x,θ)dμ(x).\frac{d}{d\theta} \int f(x, \theta) \, d\mu(x) \;=\; \int \frac{\partial f}{\partial \theta}(x, \theta) \, d\mu(x).

The proof is exactly DCT applied to the difference quotient: as h0h \to 0, the function f(x,θ+h)f(x,θ)h\frac{f(x, \theta + h) - f(x, \theta)}{h} converges pointwise to f/θ\partial f / \partial \theta, and the mean value theorem bounds the difference quotient by supθIf/θ(x,θ)g(x)\sup_{\theta' \in I} |\partial f / \partial \theta(x, \theta')| \leq g(x), an integrable dominator. DCT gives the limit-integral swap.

In machine learning, f(x,θ)=L(θ,x)f(x, \theta) = L(\theta, x) is the per-sample loss as a function of parameters and data, and the integral L(θ,x)dP(x)=EP[L(θ,X)]\int L(\theta, x) \, dP(x) = E_P[L(\theta, X)] is the population risk. The displayed identity is exactly the gradient–expectation swap, the move at the heart of every SGD convergence proof: θE[L(θ,X)]=E[θL(θ,X)]\nabla_\theta E[L(\theta, X)] = E[\nabla_\theta L(\theta, X)].
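A minimal numeric check of the swap, for the hypothetical choice f(x,θ)=eθxf(x, \theta) = e^{-\theta x} on [0,1][0, 1] (ours, not the text's). Here f/θ=xeθx1|\partial f/\partial \theta| = x e^{-\theta x} \leq 1, an integrable dominator on [0,1][0, 1], so Example 10 applies:

```python
import numpy as np
from scipy import integrate

# Differentiation under the integral sign at θ = 1: compare a central
# difference of F(θ) = ∫₀¹ exp(−θx) dx against ∫₀¹ ∂f/∂θ dx.
F = lambda t: integrate.quad(lambda x: np.exp(-t * x), 0, 1)[0]
h = 1e-5
fd = (F(1 + h) - F(1 - h)) / (2 * h)                       # d/dθ ∫ f dμ
swap, _ = integrate.quad(lambda x: -x * np.exp(-x), 0, 1)  # ∫ ∂f/∂θ dμ
print(fd, swap)   # both ≈ −0.264241 = −(1 − 2/e)
```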

💡 Remark 6 (The domination hypothesis is essential — do not skip it)

The spike sequence fn=n1[0,1/n]f_n = n \cdot \mathbf{1}_{[0, 1/n]} from Example 1 is the canonical witness that the domination hypothesis cannot be dropped. Any candidate dominator gg would have to satisfy gng \geq n on [0,1/n][0, 1/n] for every nn, so g(x)1/xg(x) \geq 1/x near x=0x = 0, which is not integrable on (0,1](0, 1]. There is no integrable dominator. DCT therefore does not apply, and indeed the conclusion fails: fn=1\int f_n = 1 for every nn, but limfn=0\int \lim f_n = 0. Whenever you write down an SGD-style identity that swaps a limit and an integral, the cost of skipping the domination check is exactly this kind of failure — silent and wrong, with no warning from the theorem.

Three-panel dominated convergence: f_n with dominator envelope, shaded areas, and integral convergence plot

Comparison of the three convergence theorems: hypotheses and conclusions side by side

8. Comparison with the Riemann Integral

Time to tie up Remark 4 from Section 4 with a real proof. Every Riemann-integrable function is Lebesgue-integrable, and the two integrals agree. The Lebesgue integral is therefore a strict extension of the Riemann integral — same answers on the old domain, plus new answers on a larger class of functions.

🔷 Theorem 6 (Riemann-integrable implies Lebesgue-integrable)

Let f:[a,b]Rf: [a, b] \to \mathbb{R} be a bounded function. If ff is Riemann-integrable on [a,b][a, b], then ff is Lebesgue-integrable on [a,b][a, b] (with respect to Lebesgue measure restricted to [a,b][a, b]), and abf(x)dx  =  [a,b]fdλ.\int_a^b f(x) \, dx \;=\; \int_{[a, b]} f \, d\lambda.

Proof.

We give the proof in sketch form; the details are spelled out in Royden §4.3.

The Riemann integral abf(x)dx\int_a^b f(x) \, dx exists and equals VV if and only if for every ε>0\varepsilon > 0 there is a partition PεP_\varepsilon of [a,b][a, b] such that U(f,Pε)L(f,Pε)<εU(f, P_\varepsilon) - L(f, P_\varepsilon) < \varepsilon, where UU and LL are the upper and lower Darboux sums.

The upper and lower Darboux sums are themselves integrals of step functions. Specifically, L(f,Pε)=sεdλL(f, P_\varepsilon) = \int s_\varepsilon \, d\lambda where sεs_\varepsilon is the step function equal to inf[xi1,xi]f\inf_{[x_{i-1}, x_i]} f on each subinterval, and U(f,Pε)=tεdλU(f, P_\varepsilon) = \int t_\varepsilon \, d\lambda where tεt_\varepsilon takes the supremum on each subinterval. Both sεs_\varepsilon and tεt_\varepsilon are simple functions on [a,b][a, b] — their integrals exist by Definition 1 and equal the Darboux sums by direct computation.

Refining the partition (taking common refinements as ε\varepsilon shrinks) produces an increasing sequence of sεs_\varepsilon‘s and a decreasing sequence of tεt_\varepsilon‘s, with sεftεs_\varepsilon \leq f \leq t_\varepsilon everywhere. By Riemann-integrability, tεsε0\int t_\varepsilon - \int s_\varepsilon \to 0. Taking the pointwise limits s=limsεs = \lim s_\varepsilon and t=limtεt = \lim t_\varepsilon (which exist almost everywhere by the increasing/decreasing structure), we get sfts \leq f \leq t with (ts)dλ=0\int (t - s) \, d\lambda = 0. This forces s=f=ts = f = t almost everywhere, so ff agrees with a measurable function (ss) almost everywhere, hence ff is itself measurable. Applying MCT to (sε)(s_\varepsilon): [a,b]fdλ  =  [a,b]sdλ  =  limε0sεdλ  =  limε0L(f,Pε)  =  abf(x)dx.\int_{[a, b]} f \, d\lambda \;=\; \int_{[a, b]} s \, d\lambda \;=\; \lim_{\varepsilon \to 0} \int s_\varepsilon \, d\lambda \;=\; \lim_{\varepsilon \to 0} L(f, P_\varepsilon) \;=\; \int_a^b f(x) \, dx.

The Riemann and Lebesgue values agree.
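The Darboux-sums-as-simple-functions mechanism in the proof can be watched numerically. A sketch for the increasing function f(x)=x2f(x) = x^2 on [0,1][0, 1] (a choice of ours, not the text's):

```python
import numpy as np
from scipy import integrate

# Lower/upper Darboux sums are integrals of step (simple) functions
# squeezing f; both converge to the common Riemann/Lebesgue value 1/3.
# Since f is increasing, inf/sup on each cell sit at the endpoints.
f = lambda x: x**2
for m in [4, 16, 64, 256]:
    edges = np.linspace(0, 1, m + 1)
    lower = sum(f(a) * (b - a) for a, b in zip(edges[:-1], edges[1:]))
    upper = sum(f(b) * (b - a) for a, b in zip(edges[:-1], edges[1:]))
    print(f"m={m:4d}  L(f, P) = {lower:.6f}  U(f, P) = {upper:.6f}")
lebesgue, _ = integrate.quad(f, 0, 1)
print(f"int f dlambda = {lebesgue:.6f}")   # 1/3 ≈ 0.333333
```

The gap U − L shrinks like 1/m, exactly the Riemann-integrability criterion in the proof.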

The converse is much sharper — it tells us exactly which bounded functions are Riemann-integrable. The criterion is purely measure-theoretic: discontinuities obstruct Riemann integrability only when they fill a set of positive measure.

🔷 Theorem 7 (Lebesgue's criterion for Riemann integrability)

Let f:[a,b]Rf: [a, b] \to \mathbb{R} be a bounded function. Then ff is Riemann-integrable on [a,b][a, b] if and only if the set of points at which ff is discontinuous has Lebesgue measure zero.

💡 Remark 7 (Lebesgue's criterion closes the Riemann question)

The “only if” direction is the powerful one: it says Riemann-integrability forces the discontinuity set to be small. Combined with Theorem 6, this gives the complete picture:

  • The Dirichlet function 1Q\mathbf{1}_{\mathbb{Q}} on [0,1][0, 1] is discontinuous everywhere (at every point, you can find both rationals and irrationals nearby). The discontinuity set is all of [0,1][0, 1], which has measure 11, not 00. Lebesgue’s criterion says: not Riemann-integrable. Topic 25’s opening puzzle is now closed at the most precise possible level.
  • 1Q\mathbf{1}_{\mathbb{Q}} is Lebesgue-integrable, with integral 00 (Example 2). The Lebesgue integral handles it gracefully because it doesn’t care about the discontinuity set as long as the function agrees almost everywhere with a “nice” function (the constant zero, in this case).
  • A function with finitely many jump discontinuities — a step function, say — has discontinuity set of measure zero, so it is Riemann-integrable. Both integrals agree.

Lebesgue’s criterion is the bridge between the analytic notion of “Riemann-integrable” and the measure-theoretic notion of “almost-everywhere continuous.” It is one of the cleanest results in real analysis: a deep characterization with a one-line statement.
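A small numeric illustration of the tag-dependence that kills Riemann integrability for 1Q\mathbf{1}_{\mathbb{Q}}. Note the rationality test below is a construction device (floats cannot be tested for rationality), so the tags are built rational or irrational by design:

```python
import math
from fractions import Fraction

# Riemann sums for the Dirichlet function depend on the tags, not just
# the partition: rational tags (Fractions k/m) give 1, irrational tags
# (k/m + sqrt(2)/(10m), still inside each cell) give 0. No refinement
# reconciles them. Rationality is encoded by type, by construction.
def dirichlet(t):
    return 1.0 if isinstance(t, Fraction) else 0.0

m = 1000
rational_tags = [Fraction(k, m) for k in range(m)]
irrational_tags = [k / m + math.sqrt(2) / (10 * m) for k in range(m)]
rat_sum = sum(dirichlet(t) for t in rational_tags) / m
irr_sum = sum(dirichlet(t) for t in irrational_tags) / m
print(rat_sum, irr_sum)   # 1.0  0.0
```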

Two-panel Lebesgue vs Riemann: staircase comparison and convergence rates

9. Fubini-Tonelli Theorem

The fourth and final big theorem. Up to this point we have integrated functions of one variable on a single measure space. To integrate functions of two (or more) variables, we need a way to combine measures on different spaces into a product measure on the Cartesian product, and a theorem that lets us compute the resulting integral as an iterated integral (one variable at a time).

This completes the preview from Topic 25, Section 9, where we defined the product sigma-algebra F1F2\mathcal{F}_1 \otimes \mathcal{F}_2 but stopped short of building the integral on it.

The construction of the product measure μ×ν\mu \times \nu goes as follows: on rectangles A×BA \times B with AF1A \in \mathcal{F}_1 and BF2B \in \mathcal{F}_2, set (μ×ν)(A×B)=μ(A)ν(B)(\mu \times \nu)(A \times B) = \mu(A) \cdot \nu(B). Carathéodory’s extension theorem (Topic 25, Theorem 2 in spirit) extends this set function to a σ\sigma-additive measure on the product sigma-algebra F1F2\mathcal{F}_1 \otimes \mathcal{F}_2, provided that both μ\mu and ν\nu are σ\sigma-finite — meaning each space can be written as a countable union of measurable sets of finite measure. (Lebesgue measure on R\mathbb{R} is σ\sigma-finite — write R=n[n,n]\mathbb{R} = \bigcup_n [-n, n].) The σ\sigma-finiteness hypothesis is essential; without it, the product measure construction can fail to be unique.

With μ×ν\mu \times \nu in hand, we have an integral X×Yfd(μ×ν)\int_{X \times Y} f \, d(\mu \times \nu) defined by the same machinery as before. The Fubini-Tonelli theorems tell us that this integral can be computed as an iterated integral.

🔷 Theorem 8 (Tonelli's Theorem)

Let (X,F,μ)(X, \mathcal{F}, \mu) and (Y,G,ν)(Y, \mathcal{G}, \nu) be σ\sigma-finite measure spaces, and let f:X×Y[0,]f: X \times Y \to [0, \infty] be measurable with respect to the product sigma-algebra FG\mathcal{F} \otimes \mathcal{G}. Then the maps x    Yf(x,y)dν(y),y    Xf(x,y)dμ(x)x \;\mapsto\; \int_Y f(x, y) \, d\nu(y), \qquad y \;\mapsto\; \int_X f(x, y) \, d\mu(x) are measurable on XX and YY respectively, and X×Yfd(μ×ν)  =  X(Yf(x,y)dν(y))dμ(x)  =  Y(Xf(x,y)dμ(x))dν(y).\int_{X \times Y} f \, d(\mu \times \nu) \;=\; \int_X \left( \int_Y f(x, y) \, d\nu(y) \right) d\mu(x) \;=\; \int_Y \left( \int_X f(x, y) \, d\mu(x) \right) d\nu(y).

Tonelli is the “non-negative” half of Fubini-Tonelli. Because f0f \geq 0, all three integrals are non-negative (possibly ++\infty), and the equalities hold without any extra integrability assumption — that is the whole point of Tonelli. For sign-changing functions, we need an additional integrability hypothesis to avoid \infty - \infty ambiguities.

🔷 Theorem 9 (Fubini's Theorem)

Let (X,F,μ)(X, \mathcal{F}, \mu) and (Y,G,ν)(Y, \mathcal{G}, \nu) be σ\sigma-finite measure spaces, and let f:X×YRf: X \times Y \to \mathbb{R} be measurable with respect to FG\mathcal{F} \otimes \mathcal{G}. If X×Yfd(μ×ν)  <  \int_{X \times Y} |f| \, d(\mu \times \nu) \;<\; \infty (ff is integrable with respect to the product measure), then fL1(X×Y,μ×ν)f \in L^1(X \times Y, \mu \times \nu), the iterated integrals exist for μ\mu-a.e. xx (and ν\nu-a.e. yy), and X×Yfd(μ×ν)  =  X(Yf(x,y)dν(y))dμ(x)  =  Y(Xf(x,y)dμ(x))dν(y).\int_{X \times Y} f \, d(\mu \times \nu) \;=\; \int_X \left( \int_Y f(x, y) \, d\nu(y) \right) d\mu(x) \;=\; \int_Y \left( \int_X f(x, y) \, d\mu(x) \right) d\nu(y).

Proof.

Both theorems are proved by the same four-step “extension by linearity and limits” pattern that runs through all of measure theory.

Step 1: Verify on indicator functions of measurable rectangles. For f=1A×Bf = \mathbf{1}_{A \times B} with AFA \in \mathcal{F}, BGB \in \mathcal{G}, the iterated integral computation reduces to X(Y1A×B(x,y)dν(y))dμ(x)  =  X1A(x)ν(B)dμ(x)  =  μ(A)ν(B).\int_X \left( \int_Y \mathbf{1}_{A \times B}(x, y) \, d\nu(y) \right) d\mu(x) \;=\; \int_X \mathbf{1}_A(x) \cdot \nu(B) \, d\mu(x) \;=\; \mu(A) \cdot \nu(B). The other iterated order gives the same answer by symmetry, and the product integral is (μ×ν)(A×B)=μ(A)ν(B)(\mu \times \nu)(A \times B) = \mu(A) \cdot \nu(B) by definition. All three agree.

Step 2: Extend to indicators of measurable sets. A general measurable set EFGE \in \mathcal{F} \otimes \mathcal{G} is built up from rectangles via countable unions, intersections, and complements. The “monotone class theorem” (or equivalently the π\pi-λ\lambda theorem) lets us promote the rectangle case to the general measurable set case. Briefly: the collection of EE for which the equality X×Y1Ed(μ×ν)=XY1Edνdμ\int_{X \times Y} \mathbf{1}_E \, d(\mu \times \nu) = \int_X \int_Y \mathbf{1}_E \, d\nu \, d\mu holds is closed under increasing unions and complements, and it contains the rectangles, so it contains the entire generated sigma-algebra.

Step 3: Extend to non-negative simple functions, then to non-negative measurable functions. Linearity of the integral promotes Step 2 from indicators to finite linear combinations of indicators (i.e., simple functions). MCT promotes from non-negative simple functions to non-negative measurable functions: choose an increasing sequence of simple functions snfs_n \uparrow f, apply the simple-function case to each sns_n, and pass to the limit using MCT on both the inner integral and the outer integral. This gives Tonelli.

Step 4: Extend to general integrable functions. For fL1(X×Y,μ×ν)f \in L^1(X \times Y, \mu \times \nu), write f=f+ff = f^+ - f^-. Both f+,f0f^+, f^- \geq 0 and both have finite integrals (since f<\int |f| < \infty). Apply Tonelli to each separately and subtract. The integrability hypothesis is what guarantees that the subtraction is unambiguous (no \infty - \infty). This gives Fubini.

The full details for Steps 2 and 3 occupy several pages and use a careful application of the π\pi-λ\lambda theorem; see Folland §2.5 for the textbook treatment.

The integrability hypothesis in Fubini is essential. Without it, the iterated integrals can disagree even when both exist as improper limits.

📝 Example 11 (Iterated integrals can disagree without absolute integrability)

Let f(x,y)=(x2y2)/(x2+y2)2f(x, y) = (x^2 - y^2) / (x^2 + y^2)^2 on (0,1]×(0,1](0, 1] \times (0, 1] (extend by 00 at the origin). A direct computation (or a clever trick: f=/y[y/(x2+y2)]f = \partial / \partial y [y / (x^2 + y^2)]) gives 0101x2y2(x2+y2)2dydx  =  01[yx2+y2]y=0y=1dx  =  0111+x2dx  =  π4,\int_0^1 \int_0^1 \frac{x^2 - y^2}{(x^2 + y^2)^2} \, dy \, dx \;=\; \int_0^1 \left[ \frac{y}{x^2 + y^2} \right]_{y=0}^{y=1} dx \;=\; \int_0^1 \frac{1}{1 + x^2} \, dx \;=\; \frac{\pi}{4}, 0101x2y2(x2+y2)2dxdy  =  π4(by the same computation with xy, picking up a sign).\int_0^1 \int_0^1 \frac{x^2 - y^2}{(x^2 + y^2)^2} \, dx \, dy \;=\; -\frac{\pi}{4} \quad \text{(by the same computation with $x \leftrightarrow y$, picking up a sign).}

The two iterated integrals exist (as improper Riemann integrals) and disagree. Fubini tells us that this can only happen when the integrability hypothesis fails, and indeed 0101f(x,y)dydx=\int_0^1 \int_0^1 |f(x, y)| \, dy \, dx = \infty — the singularity at the origin is bad enough that the function is not absolutely integrable on (0,1]2(0, 1]^2. The product integral fd(λ×λ)\int f \, d(\lambda \times \lambda) does not exist.
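Example 11's computation can be reproduced with nested scipy.integrate.quad calls. The points hint below marks the ridge of the integrand near the small variable and is our numerical device, not part of the example:

```python
from scipy import integrate

# The two iterated integrals of f(x, y) = (x² − y²)/(x² + y²)² on
# (0, 1]² disagree: +π/4 versus −π/4.
f = lambda x, y: (x**2 - y**2) / (x**2 + y**2)**2

def inner_dy(x):   # ∫₀¹ f(x, y) dy for fixed x > 0 (equals 1/(1 + x²))
    return integrate.quad(lambda y: f(x, y), 0, 1, points=[x])[0]

def inner_dx(y):   # ∫₀¹ f(x, y) dx for fixed y > 0
    return integrate.quad(lambda x: f(x, y), 0, 1, points=[y])[0]

dy_dx = integrate.quad(inner_dy, 0, 1)[0]
dx_dy = integrate.quad(inner_dx, 0, 1)[0]
print(dy_dx, dx_dy)   # ≈ 0.785398, −0.785398  (±π/4)
```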

📝 Example 12 ($E[XY] = E[X] \cdot E[Y]$ for independent random variables via Fubini)

Let X,YX, Y be independent real-valued random variables on a probability space (Ω,F,P)(\Omega, \mathcal{F}, P). Independence means the joint distribution of (X,Y)(X, Y) on R2\mathbb{R}^2 is the product measure PXPYP_X \otimes P_Y (this is the formal definition of independence — see Topic 25, Remark 10). Suppose EX,EY<E|X|, E|Y| < \infty.

The expected value E[XY]=xyd(PXPY)E[XY] = \int xy \, d(P_X \otimes P_Y). By Fubini (the integrability hypothesis follows from xy=xy|xy| = |x| \cdot |y| and Tonelli applied to xy|x| \cdot |y|, which gives xyd(PXPY)=EXEY<\int |xy| \, d(P_X \otimes P_Y) = E|X| \cdot E|Y| < \infty), E[XY]  =  xyd(PXPY)  =  R(RxydPY(y))dPX(x)  =  RxE[Y]dPX(x)  =  E[X]E[Y].E[XY] \;=\; \int xy \, d(P_X \otimes P_Y) \;=\; \int_{\mathbb{R}} \left( \int_{\mathbb{R}} xy \, dP_Y(y) \right) dP_X(x) \;=\; \int_{\mathbb{R}} x \cdot E[Y] \, dP_X(x) \;=\; E[X] \cdot E[Y].

This is the textbook identity “expectation factors over independent variables.” The proof is one application of Fubini. The same argument extends to any finite collection of independent random variables and is the foundation of every “weak law of large numbers” computation.
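A Monte Carlo sketch of Example 12, with a hypothetical pair of independent distributions (our choice for illustration):

```python
import numpy as np

# Independent X ~ Exp(1) and Y ~ N(2, 1), so E[XY] = E[X]·E[Y] = 1·2 = 2.
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.exponential(1.0, size=n)
y = rng.normal(2.0, 1.0, size=n)
print(np.mean(x * y), np.mean(x) * np.mean(y))   # both ≈ 2.0
```

With a million samples the Monte Carlo error is on the order of 10⁻³, so both estimates land close to the Fubini-predicted value 2.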

💡 Remark 8 (Fubini connects back to Topic 13)

The measure-theoretic Fubini theorem (Theorem 9) is the generalization of the Riemann-integral Fubini theorem from Multiple Integrals & Fubini’s Theorem. Same statement — “swap the order of iterated integration” — but now valid on arbitrary $\sigma$-finite measure spaces, not just $\mathbb{R}^n$ with Lebesgue measure. In particular, Tonelli’s theorem applied to non-negative integrands on $\mathbb{R}^2$ recovers the Topic 13 statement that the order of integration is irrelevant for non-negative continuous functions on a rectangle, and Fubini extends the result to the much broader class of $L^1$ functions on arbitrary product measure spaces (e.g., probability times probability, counting times Lebesgue, $\sigma$-finite times $\sigma$-finite). The Topic 13 version was already enough for almost all calculus computations; the Topic 26 version is what probability theory and functional analysis need.

The interactive viz below lets you choose a function on $[0, 1]^2$ and see all three integrals — both iterated orderings and the product integral — computed simultaneously. For the smooth and separable cases, all three agree. For the pathological case from Example 11, the iterated integrals disagree visibly.

$\int_0^1\!\int_0^1 f \, dy \, dx = 0.25000$, $\int_0^1\!\int_0^1 f \, dx \, dy = 0.25000$, $\int f \, d(\lambda \times \lambda) = 0.25000$ — all three agree.

f(x, y) = xy on [0, 1]². Smooth and integrable. All three integrals agree at 1/4.

Three-panel Fubini iterated: heatmap of f(x,y), slice plot, and final integral values

Two-panel Fubini pathological: heatmap of (x²-y²)/(x²+y²)² and the disagreeing iterated integrals

10. Computational Notes

A few practical notes on how the Lebesgue integral surfaces in scientific Python.

Numerical integration as Lebesgue approximation. scipy.integrate.quad computes $\int_a^b f(x) \, dx$ using adaptive Gauss-Kronrod quadrature. Conceptually, this is a sophisticated version of the simple-function supremum from Definition 2: it samples $f$ at strategically chosen points, computes a weighted sum, and refines until the estimated error is below a tolerance. The key practical difference from the theoretical definition is that the computer uses finitely many function evaluations (typically 21–63 in a single call), while the Lebesgue supremum is over all simple functions $\leq f$. For smooth integrands the convergence is fast enough that the difference is invisible; for integrands with singularities or oscillations, you can hit the tolerance limit and need adaptive subdivision (scipy.integrate.quad does this automatically) or specialized methods.
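The simple-function construction itself can be carried out by hand for a concrete $f$. The sketch below (our own illustration, not a scipy internal) integrates $f(x) = x^2$ on $[0, 1]$ by flooring $f$ to dyadic levels $k/2^n$ and weighting each level by the Lebesgue measure of its level set — the supremum in Definition 2 is approached from below as $n$ grows:

```python
import numpy as np

def simple_function_integral(n):
    """∫₀¹ x² dλ via the canonical simple function s_n ≤ f:
    s_n takes value k/2ⁿ on {x : f(x) ∈ [k/2ⁿ, (k+1)/2ⁿ)}."""
    levels = np.arange(2**n) / 2**n
    # For f(x) = x², the level set is the interval [√(k/2ⁿ), √((k+1)/2ⁿ))
    left = np.sqrt(levels)
    right = np.sqrt(np.minimum(levels + 1 / 2**n, 1.0))
    return np.sum(levels * (right - left))   # Σ (level value) · λ(level set)

for n in (2, 6, 10, 14):
    print(n, simple_function_integral(n))    # increases toward 1/3 from below
```

Since $s_n \leq f \leq s_n + 2^{-n}$ pointwise, the approximation error is at most $2^{-n}$, and the sequence of integrals is monotone increasing — a miniature of the Monotone Convergence Theorem.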

Monte Carlo integration. For high-dimensional integrals — say $\int_{\mathbb{R}^d} f(x) \, dP(x)$ with $d \geq 5$ — quadrature methods scale exponentially in $d$ and become unusable. The alternative is Monte Carlo: sample $X_1, \ldots, X_N$ independently from $P$ and estimate
$$\int f \, dP \;\approx\; \frac{1}{N} \sum_{i=1}^N f(X_i).$$
The strong law of large numbers (which is itself a corollary of MCT and the first Borel-Cantelli lemma from Topic 25) guarantees almost-sure convergence as $N \to \infty$. The standard error scales as $\sigma_f / \sqrt{N}$, where $\sigma_f^2 = \mathrm{Var}_P[f(X)]$ — the dimension does not appear. This is the reason every modern probabilistic ML algorithm (variational inference, MCMC, importance sampling, score matching) is built on Monte Carlo rather than quadrature.

torch.distributions and expected values. Every call to dist.mean, dist.variance, or dist.entropy() is computing a Lebesgue integral against the distribution dist. The .log_prob(x) method returns $\log p(x) = \log (dP/d\lambda)(x)$ — the log of the Radon-Nikodym derivative of the distribution with respect to Lebesgue measure (when one exists). For discrete distributions, the same method returns $\log p(x) = \log P(\{x\})$ — the log probability of the singleton set. The same code path handles both cases because torch.distributions is built on top of a measure-theoretic abstraction in which the underlying reference measure (Lebesgue or counting) is implicit. Topic 28 (Radon-Nikodym) will make this connection precise.

The numerical pitfall: verifying the DCT hypothesis. In practice, finding an integrable dominator $g$ is the hardest part of applying DCT correctly. For a parametric family $f(x, \theta)$, a common strategy is to dominate by the envelope $\sup_{\theta \in \Theta} |f(x, \theta)|$ over a compact parameter set $\Theta$, and verify integrability of the envelope numerically. When you can’t find a dominator, you usually have to fall back on a weaker convergence theorem (Vitali convergence, dominated convergence in measure) or do the computation a different way. If your training loop is silently producing biased gradient estimates, it is likely because you implicitly invoked DCT in a place where the dominator does not exist — exactly the failure mode of the spike sequence in Example 1.
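Here is what the numerical envelope check might look like for a made-up family $f(x, \theta) = \sin(\theta x)\, e^{-x}$ with $\Theta = [-5, 5]$ (the family and all names are our own illustration; the analytic dominator is obviously $g(x) = e^{-x}$, and the code just confirms it on a grid):

```python
import numpy as np

# Hypothetical parametric family: f(x, θ) = sin(θx)·e^{-x}, Θ = [-5, 5].
# Candidate dominator: g(x) = e^{-x}, since |sin| ≤ 1 everywhere.
thetas = np.linspace(-5.0, 5.0, 201)
n = 10_000
x = (np.arange(n) + 0.5) * (50.0 / n)            # midpoints of [0, 50]

envelope = np.max(np.abs(np.sin(np.outer(thetas, x))) * np.exp(-x), axis=0)
assert np.all(envelope <= np.exp(-x) + 1e-12)    # g dominates on the grid

# ∫₀^∞ g dλ = 1 < ∞, so the DCT hypothesis holds for this family
integral_g = np.sum(np.exp(-x)) * (50.0 / n)
print(integral_g)  # ≈ 1.0
```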

📝 Example 13 (Monte Carlo estimation of $E[\sin(X)]$ for $X \sim \mathrm{Exp}(1)$)

Let $X$ be an exponential random variable with rate $1$, so $X$ has density $p(x) = e^{-x}$ on $[0, \infty)$. We want $E[\sin(X)] = \int_0^\infty \sin(x) e^{-x} \, dx$.

A direct integration (integration by parts twice, or recognizing the Laplace transform of $\sin$) gives
$$\int_0^\infty \sin(x) e^{-x} \, dx \;=\; \frac{1}{1 + 1^2} \;=\; \frac{1}{2}.$$

Monte Carlo: draw $X_1, \ldots, X_N$ i.i.d. from $\mathrm{Exp}(1)$ and form $\hat{m}_N = (1/N) \sum_i \sin(X_i)$. By the strong law of large numbers, $\hat{m}_N \to 1/2$ almost surely. The standard error is $\sigma / \sqrt{N}$ where $\sigma^2 = \mathrm{Var}[\sin(X)] = E[\sin^2(X)] - (1/2)^2$. A short numerical experiment with $N = 10^6$ typically gives $\hat{m}_N$ within $\pm 0.001$ of $0.5$.

import numpy as np

rng = np.random.default_rng(0)      # seeded generator, reproducible
N = 10**6
X = rng.exponential(1.0, size=N)    # i.i.d. draws from Exp(1)
estimate = np.mean(np.sin(X))       # Monte Carlo estimate of E[sin(X)]
print(estimate)  # ≈ 0.5

The convergence rate $1/\sqrt{N}$ is independent of dimension, which is why the same script generalizes — without modification — to estimating $E[g(X_1, \ldots, X_d)]$ for $d = 100$ or $d = 10{,}000$.
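A quick illustration of the dimension claim (our own toy example, not from the text: $g(x) = \|x\|^2 / d$ under a standard Gaussian in $d = 100$, where the exact answer is $1$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 100, 50_000
X = rng.standard_normal((N, d))            # N i.i.d. draws from N(0, I_d)
est = np.mean(np.sum(X**2, axis=1) / d)    # estimates E[‖X‖²/d] = 1

print(est)  # ≈ 1.0, with the same N a one-dimensional problem would need
```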

11. Connections to ML

Four big connections, each substantial enough to be its own research thread.

Expected value as a Lebesgue integral. The expected value of a real random variable $X$ on a probability space $(\Omega, \mathcal{F}, P)$ is, by definition, the Lebesgue integral
$$E[X] \;=\; \int_\Omega X \, dP \;=\; \int_\Omega X(\omega) \, dP(\omega).$$

This single definition unifies all the textbook formulas you have seen in elementary probability. For a discrete random variable taking values $x_1, x_2, \ldots$ with probabilities $p_1, p_2, \ldots$, the integral against $P$ becomes the sum $E[X] = \sum_k x_k p_k$. For a continuous random variable with density $p(x)$ with respect to Lebesgue measure, the integral becomes $E[X] = \int_{\mathbb{R}} x \cdot p(x) \, dx$. For the mixture cases that appeared in Puzzle 2 of Section 1, the integral handles the discrete and continuous components in a single uniform formula. Every higher moment ($E[X^k]$), every moment generating function ($E[e^{tX}]$), and every characteristic function ($E[e^{itX}]$) is a Lebesgue integral against $P$. The Lebesgue integral is what makes “expectation” a single mathematical operation rather than a family of formulas. Forward link: Probability Spaces.

KL divergence and cross-entropy as Lebesgue integrals. The Kullback-Leibler divergence between two probability measures $P, Q$ on a common measurable space, with $P \ll Q$ (meaning $P$ is absolutely continuous with respect to $Q$), is
$$D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \int \log \frac{dP}{dQ} \, dP.$$

The integrand $\log (dP/dQ)$ is the log of a Radon-Nikodym derivative — a measurable function on the underlying space. The integral is taken with respect to $P$. KL divergence is non-negative (a consequence of Jensen’s inequality applied to the convex function $-\log$, which is itself a Lebesgue integral inequality), zero iff $P = Q$ a.e., and convex in both arguments. None of these properties is provable without the Lebesgue framework; they all use integral inequalities that require absolute integrability and the linearity from Theorem 2.

The cross-entropy $H(P, Q) = -\int \log(dQ/d\lambda) \, dP$ — the loss function for classification when $Q$ is the model’s predictive distribution and $P$ is the empirical distribution of the training labels — is the same kind of object: a Lebesgue integral of a log-density against a probability measure. Forward link: Information Geometry.

SGD as Dominated Convergence. This is the connection from Puzzle 3 of Section 1 and Example 10 of Section 7. The identity
$$\nabla_\theta E[L(\theta, X)] \;=\; E[\nabla_\theta L(\theta, X)]$$
is the mathematical move that makes minibatch gradient descent an unbiased estimator of full-batch gradient descent. The interchange is justified by the Dominated Convergence Theorem applied to the difference quotients in $\theta$, with the dominator coming from any uniform Lipschitz bound on $\nabla_\theta L$. In practice, gradient clipping and bounded-support data assumptions are the operational ways to ensure a dominator exists. Every SGD convergence proof in the optimization literature (Robbins-Monro, Nesterov, Adam, anything in the deep learning theory canon) implicitly invokes DCT at this step. Forward link: Gradient Descent.

The ELBO and Jensen’s inequality on integrals. In variational inference, we want to compute the marginal likelihood $\log p(x) = \log \int p(x, z) \, dz$, which is intractable for most interesting models. The trick is to introduce a tractable variational distribution $q(z)$ and use Jensen’s inequality:
$$\log p(x) \;=\; \log \int p(x, z) \, dz \;=\; \log \int \frac{p(x, z)}{q(z)} \cdot q(z) \, dz \;=\; \log E_q\left[ \frac{p(x, z)}{q(z)} \right] \;\geq\; E_q\left[ \log \frac{p(x, z)}{q(z)} \right].$$

The last step is Jensen’s inequality on the concave function $\log$, applied to the random variable $p(x, Z)/q(Z)$ under the law $q$. The right-hand side is the evidence lower bound (ELBO):
$$\mathrm{ELBO}(q) \;=\; E_q\left[ \log p(x, Z) \right] - E_q\left[ \log q(Z) \right].$$

Jensen’s inequality is a Lebesgue integral inequality — it requires $\log$ being concave plus integrability of both sides, both of which are statements about the integral defined in this topic. The tightness of the ELBO (the gap $\log p(x) - \mathrm{ELBO}(q)$) is exactly the KL divergence $D_{\mathrm{KL}}(q \,\|\, p(\cdot \mid x))$, which is itself a Lebesgue integral. The whole machinery of variational inference is integral inequalities all the way down. Forward link: Variational Inference.
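The bound and its gap can be seen concretely in a toy conjugate model (our own construction, not from the text: $z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z,1)$, observed $x = 2$, so the marginal $p(x) = \mathcal{N}(0,2)$ is known exactly and the posterior is $\mathcal{N}(1, 1/2)$). With a deliberately mismatched $q$, the Monte Carlo ELBO sits below $\log p(x)$ by the KL gap:

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs = 2.0
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x_obs**2 / 4.0   # log N(2; 0, 2)

def log_joint(z):
    """log p(x_obs, z) = log N(z; 0, 1) + log N(x_obs; z, 1)."""
    return (-0.5 * np.log(2 * np.pi) - z**2 / 2
            - 0.5 * np.log(2 * np.pi) - (x_obs - z)**2 / 2)

# Mismatched variational q = N(0.5, 1); the true posterior is N(1, 1/2)
mu_q, s_q = 0.5, 1.0
Z = rng.normal(mu_q, s_q, 1_000_000)
log_q = -0.5 * np.log(2 * np.pi * s_q**2) - (Z - mu_q)**2 / (2 * s_q**2)
elbo = np.mean(log_joint(Z) - log_q)

print(log_px, elbo, log_px - elbo)   # gap ≈ KL(q ‖ posterior) ≈ 0.403
```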

📝 Example 14 (KL divergence between two Gaussians as a closed-form Lebesgue integral)

Let $P = \mathcal{N}(\mu_1, \sigma_1^2)$ and $Q = \mathcal{N}(\mu_2, \sigma_2^2)$ be two univariate Gaussians on $\mathbb{R}$. Both are absolutely continuous with respect to Lebesgue measure, with densities $p_1(x), p_2(x)$. The Radon-Nikodym derivative $dP/dQ$ at $x$ is $p_1(x)/p_2(x)$, so
$$D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \int \log \frac{p_1(x)}{p_2(x)} \cdot p_1(x) \, dx \;=\; E_{X \sim P}\left[ \log \frac{p_1(X)}{p_2(X)} \right].$$

Plugging in the Gaussian densities and evaluating the integral (which is a finite computation involving $E_P[X]$ and $E_P[X^2]$), the answer is
$$D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}.$$

This closed form is used inside the reparametrization-trick KL term of the variational autoencoder loss — every VAE that assumes Gaussian posterior and Gaussian prior ships this expression in its loss function.
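The closed form is easy to cross-check against a Monte Carlo evaluation of the defining integral (a numpy sketch; the parameter values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0   # arbitrary illustrative parameters

# Closed form from Example 14
kl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Monte Carlo: E_{X~P}[log p1(X) - log p2(X)] with X ~ P = N(mu1, s1²)
X = rng.normal(mu1, s1, 1_000_000)
log_p1 = -0.5 * np.log(2 * np.pi * s1**2) - (X - mu1)**2 / (2 * s1**2)
log_p2 = -0.5 * np.log(2 * np.pi * s2**2) - (X - mu2)**2 / (2 * s2**2)
kl_mc = np.mean(log_p1 - log_p2)

print(kl_exact, kl_mc)   # both ≈ 0.443
```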

📝 Example 15 (Verifying the DCT hypothesis for $L^2$-regularized SGD)

Suppose the per-sample loss is $L(\theta, x) = \ell(\theta, x) + (\lambda/2) \|\theta\|^2$ for some baseline loss $\ell$ and regularization strength $\lambda > 0$. The gradient is $\nabla_\theta L(\theta, x) = \nabla_\theta \ell(\theta, x) + \lambda \theta$.

Suppose $\nabla_\theta \ell$ is uniformly bounded: $\|\nabla_\theta \ell(\theta, x)\| \leq G$ for all $\theta$ in a compact set $\Theta$ and $\mu$-a.e. $x$. (This is the standard “bounded gradient” assumption that holds for any logistic-regression-style model with bounded features.) Then for $\theta \in \Theta$,
$$\|\nabla_\theta L(\theta, x)\| \;\leq\; G + \lambda \sup_{\theta \in \Theta} \|\theta\| \;=:\; M.$$

The constant $M$ is finite. The constant function $g(x) = M$ is integrable with respect to any probability measure $P$ on $\mathbb{R}^d$ (its integral is just $M$). So the DCT hypothesis is satisfied with dominator $g \equiv M$, and the interchange $\nabla_\theta E[L(\theta, X)] = E[\nabla_\theta L(\theta, X)]$ is justified for $\theta \in \Theta$. The minibatch gradient is therefore an unbiased estimator of the population gradient — this is the precise statement that makes SGD work, and the verification took three lines once we knew the dominator structure to look for.
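The unbiasedness is visible in a toy simulation (our own example: the quadratic loss $L(\theta, x) = (\theta - x)^2$, where the interchange also follows from plain linearity, so DCT is overkill here but the phenomenon is easy to see):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(3.0, 1.0, 100_000)   # empirical data distribution
theta = 1.0

# Population gradient of E[(θ - X)²] is 2(θ - E[X])
pop_grad = 2 * (theta - data.mean())

# Minibatch gradients (batch size 32) average to the population gradient
mb_grads = [2 * (theta - rng.choice(data, 32).mean()) for _ in range(2000)]
print(pop_grad, np.mean(mb_grads))   # ≈ equal
```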

Four-panel ML connections: expected value, KL divergence, gradient-expectation swap, and the ELBO

12. Connections & Further Reading

This is the second topic in Track 7 — Measure & Integration — and the second advanced topic in formalCalculus. Topic 25 built the framework; this topic built the integral. The combination is what makes measure theory a usable tool, not just a piece of mathematical infrastructure: every theorem in this topic is the rigorous version of an everyday calculation in probability theory or machine learning. The next two topics in the track will build the function spaces ($L^p$) and the change-of-measure machinery (Radon-Nikodym) that complete the foundation for modern probability.

If the convergence-theorem proofs felt harder than the proofs in Topic 25, that is accurate, and it is the natural difficulty step-up of moving from “set-theoretic reasoning about measures” to “limit-of-integrals manipulations.” The MCT proof technique (the scaled cutoff $\alpha < 1$, the increasing exhaustion $E_n \uparrow \Omega$, the diagonal sup over simple functions) is the prototype for almost every proof in measure-theoretic real analysis from this point forward.

Within formalCalculus:

  • Sigma-Algebras & Measures — Every concept used here was built in Topic 25: sigma-algebras, measures, measurable functions, simple functions, null sets, “almost everywhere,” product sigma-algebras, and the simple-function approximation theorem. Topic 26 is a direct continuation of Topic 25; if any of those tools is unfamiliar, Topic 25 is the place to refresh.
  • The Riemann Integral & FTC — Section 8 proves that every Riemann-integrable function is Lebesgue-integrable with the same value. Lebesgue’s criterion (Theorem 7) is the cleanest possible characterization of Riemann-integrability, and it is a measure-theoretic statement: the discontinuity set must have measure zero.
  • Multiple Integrals & Fubini’s Theorem — Section 9’s measure-theoretic Fubini-Tonelli generalizes the Riemann-integral version from Topic 13. Same statement, broader domain.

Successor topic now published:

  • $L^p$ Spaces — Banach spaces of measurable functions where $\|f\|_p = (\int |f|^p \, d\mu)^{1/p} < \infty$. Equivalence classes under a.e.-equality. The completeness theorem (Riesz-Fischer) is a direct application of the Dominated Convergence Theorem from Section 7 — the $L^p$ norm is exactly the right tool to make the convergence-theorem machinery into a complete metric space structure on functions.

Successor topics within formalCalculus:

  • Radon-Nikodym & Probability Densities — Characterizes when one measure $\nu$ has a density $f = d\nu/d\mu$ with respect to another measure $\mu$. The bridge from measure theory to probability densities, conditional expectation, and Bayesian inference. The KL-divergence and importance-sampling formulas from Section 11 of this topic are special cases that we formalize there.

Forward to formalml.com:

  • Probability Spaces — Expected value $E[X] = \int X \, dP$ is the central operation in probability theory, and it is exactly the Lebesgue integral built in this topic. Every moment, MGF, characteristic function, and conditional expectation in probability theory is an instance of this integral.
  • Gradient Descent — The interchange $\nabla_\theta E[L(\theta, X)] = E[\nabla_\theta L(\theta, X)]$ is the single most-cited DCT application in machine learning. Every SGD convergence proof in the optimization literature implicitly invokes the Dominated Convergence Theorem from Section 7 of this topic.
  • Information Geometry — KL divergence $D_{\mathrm{KL}}(P \,\|\, Q) = \int \log(dP/dQ) \, dP$ and cross-entropy are Lebesgue integrals against probability measures. Their analytic properties (non-negativity via Jensen, convexity, lower semicontinuity) are integral inequalities provable inside the framework of this topic.
  • Variational Inference — The evidence lower bound is Jensen’s inequality applied to a Lebesgue integral. The tightness gap is a KL divergence, which is itself a Lebesgue integral. The entire variational machinery is integral inequalities on Lebesgue integrals.

References:

  • Royden, H. L. & Fitzpatrick, P. M. Real Analysis (4th ed., 2010), Chapters 4 and 18. The closest match to our exposition order — Chapter 4 covers Lebesgue integration on $\mathbb{R}$ with the same supremum-of-simple-functions construction we used in Section 3, and Chapter 18 generalizes to abstract measure spaces.
  • Folland, G. B. Real Analysis: Modern Techniques and Their Applications (2nd ed., 1999), Chapters 2 and 3. The reference for the Tonelli/Fubini proof sketch in Section 9. Concise, rigorous, graduate-level.
  • Tao, T. An Introduction to Measure Theory, Chapters 1.4–1.6. Free PDF, written with the same geometric-first instinct that drives this curriculum.
  • Billingsley, P. Probability and Measure (3rd ed., 1995), Chapters 3–4. The probability-flavored treatment, closest to our ML-connection sections.
