Measure & Integration · advanced · 55 min read

Radon-Nikodym & Probability Densities

Turning measures into functions — from the Radon-Nikodym theorem through conditional expectation and change of measure to the rigorous foundations of probability densities, importance sampling, and Bayesian inference.

Abstract. Every probability density function you have ever written down — every Gaussian bell curve, every softmax output, every generative model's learned distribution — is a Radon-Nikodym derivative. The Radon-Nikodym theorem is the single theorem that turns measure theory into probability theory: it says that when one measure is absolutely continuous with respect to another, the relationship can be captured by a single integrable function. We prove this using the L² projection machinery from Topic 27, reducing the existence of densities to a Hilbert-space projection argument. The theorem's reach is extraordinary: probability densities are Radon-Nikodym derivatives with respect to Lebesgue measure. Conditional expectation — the engine behind regression, Bayesian inference, and reinforcement learning — is a Radon-Nikodym derivative with respect to a sub-sigma-algebra. Importance sampling weights are Radon-Nikodym derivatives between proposal and target measures. KL divergence is the expected log-Radon-Nikodym derivative. This topic completes the Measure and Integration track by bridging the gap between abstract measure theory and the concrete probabilistic tools that machine learning practitioners use every day.

Where this leads → formalML

  • Probability densities are Radon-Nikodym derivatives dP/dλ. The entire measure-theoretic probability framework rests on the R-N theorem, making densities rigorous.
  • Conditional expectation E[Y|X] is the best L² predictor — the R-N derivative of a restricted measure. Least-squares regression computes it.
  • Bayes' theorem is a change of measure: the posterior density is the prior times the likelihood ratio, which is a Radon-Nikodym derivative.
  • The importance weight dP/dQ is literally a Radon-Nikodym derivative. The identity E_P[f] = E_Q[f · dP/dQ] is a change of measure.
  • Normalizing flows use the change-of-variables formula p_Y(y) = p_X(T⁻¹(y)) · |det DT⁻¹|, which is a Radon-Nikodym derivative under pushforward.
  • KL divergence D_KL(P || Q) = ∫ log(dP/dQ) dP is the expected log-Radon-Nikodym derivative. Fisher information is the variance of the score, which is ∇ log(dP/dλ).
  • Score matching involves ∇ log p, the gradient of the log-Radon-Nikodym derivative. Diffusion model training minimizes an L²(p_t) distance involving scores.

1. Three Puzzles the Radon-Nikodym Theorem Solves

Topic 27 organized integrable functions into the $L^p$ spaces — vector spaces with a norm, a notion of distance, and (for $L^2$) an inner product and an orthogonal-projection theorem. Along the way we used the Radon-Nikodym theorem as a black box twice: once in Section 9, when we previewed $L^2$ projection, and once in Section 10, when we proved $(L^p)^* = L^q$ duality and said “by the Radon-Nikodym theorem (which we will prove in Topic 28; for now we use it as a black box), absolute continuity implies that $\nu$ has a density…”. This topic pays the debt. The von Neumann proof uses the $L^2$ projection theorem from Topic 27 §9 — the same tool we previewed for regression — to reduce the existence of densities to a single Hilbert-space projection argument.

The pivot in this topic is from spaces of functions to measures as functions. Topic 27 asked “how do we organize integrable functions?”; Topic 28 asks “when can one measure be represented as a function against another?” The answer makes measure theory into probability theory. Three concrete questions motivate the theorem.

1. What exactly is a “probability density”? Every ML practitioner writes $p(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ and calls it “the density of the standard normal.” But a probability measure $P$ assigns numbers to sets, not points — and when $P$ is continuous, $P(\{x\}) = 0$ for every individual $x$. So what is $p(x)$? It is not $P(\{x\})$. The Radon-Nikodym theorem answers: $p(x)$ is the derivative $\frac{dP}{d\lambda}$ of the probability measure with respect to Lebesgue measure $\lambda$ — a function whose integral recovers $P$. The “density” is a derivative in a precise measure-theoretic sense, not a metaphor.

2. What is $E[Y \mid X]$, really? Introductory courses define conditional expectation $E[Y \mid X = x]$ by “averaging $Y$ over the slice where $X = x$.” But when $X$ is continuous, the event $\{X = x\}$ has probability zero — you cannot condition on a zero-probability event without doing some work first. Measure-theoretic conditional expectation resolves this: $E[Y \mid X]$ is the Radon-Nikodym derivative of a certain restricted measure, and (when $Y \in L^2$) it is the best $L^2$ predictor of $Y$ given $X$. Regression algorithms — linear regression, random forests, neural networks — are all approximating this single object.

3. How do importance-sampling weights work? To estimate $E_P[f]$ using samples from a different distribution $Q$, we reweight: $E_P[f] = E_Q[f \cdot w]$ where $w = \frac{dP}{dQ}$. But what is $\frac{dP}{dQ}$? It is a Radon-Nikodym derivative — the function that converts integration with respect to $Q$ into integration with respect to $P$. The reweighting identity is a change-of-measure formula, and its validity requires $P \ll Q$ (every $Q$-null set is $P$-null). When importance sampling fails — when the variance of the weights blows up — it is almost always because the absolute-continuity hypothesis is silently violated in some region of the sample space.

These three questions are the same question from three angles: when can a measure be written as a function against another measure, and what is that function? The answer is the Radon-Nikodym derivative, and the theorem that produces it is the bridge from measure theory to probability theory.
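The reweighting identity from question 3 can be checked numerically. A minimal sketch, with a made-up target/proposal pair: estimate $E_P[X^2] = 1$ for a standard normal $P$ using only samples from a wider Gaussian proposal $Q$, weighted by the density ratio $dP/dQ$.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    """Lebesgue density dP/dλ of the N(mu, sigma²) measure."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target P = N(0, 1), proposal Q = N(0, 4).  P << Q, so dP/dQ exists;
# it is computed here as the ratio of the two Lebesgue densities.
x = rng.normal(0.0, 2.0, size=200_000)           # samples from Q
w = normal_pdf(x, 0, 1) / normal_pdf(x, 0, 2)    # dP/dQ at each sample

estimate = np.mean(x**2 * w)    # E_Q[f · dP/dQ] with f(x) = x²
print(round(estimate, 2))       # ≈ E_P[X²] = 1
```

Reversing the roles (sampling a wide target from a narrow proposal) breaks the bounded-weight structure and the variance of `w` explodes, which is the practical symptom of a violated $P \ll Q$ hypothesis.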

2. Absolute Continuity and Mutual Singularity

Before stating the theorem we need two definitions. Both describe relationships between measures, and the contrast between them is sharp: absolutely continuous measures share their null sets; mutually singular measures live on disjoint sets.

📐 Definition 1 (Absolute continuity of measures)

Let $\mu$ and $\nu$ be measures on a measurable space $(\Omega, \mathcal{F})$. We say $\nu$ is absolutely continuous with respect to $\mu$, written $\nu \ll \mu$, if for every $A \in \mathcal{F}$,

$$\mu(A) = 0 \;\;\Longrightarrow\;\; \nu(A) = 0.$$

In words: every $\mu$-null set is also a $\nu$-null set. The relationship is one-directional — $\nu \ll \mu$ does not imply $\mu \ll \nu$.

📐 Definition 2 (Mutual singularity)

Two measures $\mu$ and $\nu$ on $(\Omega, \mathcal{F})$ are mutually singular, written $\mu \perp \nu$, if there exist disjoint sets $A, B \in \mathcal{F}$ with $A \cup B = \Omega$ such that

$$\mu(A) = 0 \quad \text{and} \quad \nu(B) = 0.$$

In words: the two measures live on disjoint pieces of $\Omega$ — $\mu$ assigns all its mass to $B$, and $\nu$ assigns all its mass to $A$. They cannot see each other.

📝 Example 1 (Gaussian measure is absolutely continuous w.r.t. Lebesgue)

Let $P$ be the standard normal probability measure on $(\mathbb{R}, \mathcal{B})$ and let $\lambda$ denote Lebesgue measure. If $\lambda(A) = 0$ — say, $A$ is a countable set or a Cantor-type set of Lebesgue measure zero — then

$$P(A) = \int_A \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = 0,$$

because the Lebesgue integral over a null set is always zero. So $P \ll \lambda$. The density $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is — as we will see in Section 3 — the Radon-Nikodym derivative $\frac{dP}{d\lambda}$. Every Gaussian distribution you have ever computed with is a worked example of absolute continuity.

📝 Example 2 (Dirac measure is singular w.r.t. Lebesgue)

Let $\delta_0$ be the Dirac measure at $0$: $\delta_0(A) = 1$ if $0 \in A$, else $0$. Then $\delta_0 \perp \lambda$. To verify, take $A = \{0\}$ and $B = \mathbb{R} \setminus \{0\}$. Then $\lambda(A) = 0$ (a single point has Lebesgue measure zero) and $\delta_0(B) = 0$ (the Dirac mass is not at any point of $B$). The two measures partition $\mathbb{R}$ into the single point where $\delta_0$ lives and the rest of the line where $\lambda$ lives. They cannot be combined into a density relationship — there is no function $f$ with $\delta_0(A) = \int_A f \, d\lambda$, because integrating any Lebesgue-integrable function over $\{0\}$ gives zero.

📝 Example 3 (Counting measure on $\mathbb{Z}$ is singular w.r.t. Lebesgue on $\mathbb{R}$)

Let $\#$ be the counting measure on $\mathbb{R}$ that assigns measure $1$ to each integer and $0$ to non-integers; equivalently, $\#(A) = |A \cap \mathbb{Z}|$. Then $\# \perp \lambda$. Take $A = \mathbb{Z}$ and $B = \mathbb{R} \setminus \mathbb{Z}$. Then $\lambda(A) = 0$ (a countable set has Lebesgue measure zero) and $\#(B) = 0$ (there are no integers in $B$). Discrete and continuous measures live on disjoint supports. This is why you cannot write a Poisson distribution as $dP/d\lambda$ — the Poisson measure is supported on $\{0, 1, 2, \ldots\}$, which is a $\lambda$-null set.

💡 Remark 1 (Absolute continuity of measures vs. absolute continuity of functions)

The reader may know “absolutely continuous function” from calculus — a function $F: [a, b] \to \mathbb{R}$ is absolutely continuous if for every $\varepsilon > 0$ there is a $\delta > 0$ such that any finite collection of disjoint subintervals with total length less than $\delta$ produces total variation less than $\varepsilon$. These two concepts are related but distinct: a function $F$ is absolutely continuous on $[a, b]$ if and only if there exists $f \in L^1([a, b])$ with $F(x) = F(a) + \int_a^x f(t) \, dt$ — equivalently, the Lebesgue-Stieltjes measure $\nu_F$ determined by $\nu_F((c, d]) = F(d) - F(c)$ is absolutely continuous (in our measure-theoretic sense) with respect to Lebesgue measure. The function-theoretic notion is a special case of the measure-theoretic one.

💡 Remark 2 ($\sigma$-finiteness is essential)

The Radon-Nikodym theorem requires both $\mu$ and $\nu$ to be $\sigma$-finite: $\Omega$ is a countable union of measurable sets of finite measure under each. Without this, the theorem can fail. The standard counterexample compares Lebesgue measure $\lambda$ with counting measure $\#$ on $\mathbb{R}$, which is not $\sigma$-finite (the only sets of finite counting measure are finite sets, and $\mathbb{R}$ is not a countable union of finite sets). Here $\lambda \ll \#$ holds vacuously (the only $\#$-null set is $\emptyset$), but no density can exist: the identity $\lambda(A) = \int_A f \, d\#$ applied to $A = \{x\}$ would force $f(x) = \lambda(\{x\}) = 0$ for every $x$, making $\lambda$ the zero measure — a contradiction. In probability applications, $\sigma$-finiteness is automatic: every probability measure is finite (hence $\sigma$-finite), so this hypothesis is rarely something the practitioner needs to check.

Three-panel illustration: a Gaussian density (absolutely continuous w.r.t. Lebesgue), a Dirac mass at zero (singular w.r.t. Lebesgue), and the corresponding null sets that distinguish the two regimes

3. The Radon-Nikodym Derivative

We can now define the central object of the topic. The definition is short — it just names a function that satisfies a particular integral identity — and the substance of the topic is showing that such a function actually exists whenever absolute continuity holds.

📐 Definition 3 (The Radon-Nikodym derivative)

Let $\mu$ and $\nu$ be $\sigma$-finite measures on $(\Omega, \mathcal{F})$ with $\nu \ll \mu$. If there exists a non-negative measurable function $f$ (which lies in $L^1(\mu)$ exactly when $\nu$ is finite) such that

$$\nu(A) = \int_A f \, d\mu \quad \text{for every } A \in \mathcal{F},$$

then $f$ is called the Radon-Nikodym derivative of $\nu$ with respect to $\mu$, written

$$f = \frac{d\nu}{d\mu}.$$

The notation is deliberately Leibniz-style: $f$ behaves like a derivative in several senses (chain rule, inverse rule), as we will see in Section 5.

🔷 Proposition 1 (Uniqueness of the R-N derivative ($\mu$-a.e.))

If $f$ and $g$ are both Radon-Nikodym derivatives of $\nu$ with respect to $\mu$, then $f = g$ $\mu$-almost everywhere.

Proof.

Suppose $\int_A f \, d\mu = \nu(A) = \int_A g \, d\mu$ for every measurable $A$, so

$$\int_A (f - g) \, d\mu = 0 \quad \text{for every } A \in \mathcal{F}.$$

Let $A_+ = \{x : f(x) > g(x)\}$. Then $f - g > 0$ on $A_+$, and $\int_{A_+} (f - g) \, d\mu = 0$ forces $\mu(A_+) = 0$ (an integral of a strictly positive function over a set is zero only when the set has measure zero). Similarly, with $A_- = \{x : g(x) > f(x)\}$, we get $\mu(A_-) = 0$. So $\mu(\{f \neq g\}) = \mu(A_+ \cup A_-) = 0$, which is exactly the statement that $f = g$ $\mu$-a.e.

📝 Example 4 (The R-N derivative you already know)

Every probability density function the reader has ever computed with is a Radon-Nikodym derivative w.r.t. Lebesgue measure on $\mathbb{R}$. For the standard normal,

$$\frac{dP_{\text{normal}}}{d\lambda}(x) = \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2}.$$

For the exponential distribution with rate $\theta > 0$,

$$\frac{dP_{\text{exp}}}{d\lambda}(x) = \theta \, e^{-\theta x} \, \mathbf{1}_{x \geq 0}.$$

For the Beta$(\alpha, \beta)$ distribution on $[0, 1]$,

$$\frac{dP_{\text{Beta}}}{d\lambda}(x) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)} \, \mathbf{1}_{[0, 1]}(x),$$

where $B(\alpha, \beta)$ is the Beta function. In each case the function on the right is the R-N derivative of the probability measure on the left with respect to Lebesgue measure on $\mathbb{R}$. The defining property $\nu(A) = \int_A f \, d\mu$ becomes the familiar identity $P(A) = \int_A p(x) \, dx$.
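The defining property is directly checkable: integrating the standard normal density over an interval $A = [a, b]$ (numerically, with a plain trapezoidal rule) should reproduce the CDF difference $\Phi(b) - \Phi(a)$. A small sketch with an arbitrarily chosen interval:

```python
from math import erf, exp, sqrt, pi

def normal_density(x):
    """The R-N derivative dP/dλ of the standard normal measure."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def integrate(f, a, b, n=20_000):
    """Plain trapezoidal rule for ∫_a^b f dλ."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

a, b = -0.5, 1.25
p_A = integrate(normal_density, a, b)          # P(A) = ∫_A (dP/dλ) dλ
Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))   # standard normal CDF
print(abs(p_A - (Phi(b) - Phi(a))) < 1e-8)     # True
```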

📝 Example 5 (Discrete R-N derivative is the PMF)

Let $\mu$ be counting measure on $\{1, 2, 3, \ldots\}$ (assigning measure $1$ to each integer) and let $\nu$ be the geometric distribution with parameter $\frac{1}{2}$, so $\nu(\{k\}) = 2^{-k}$. Then $\nu \ll \mu$ (the only $\mu$-null set is empty), and the R-N derivative is

$$\frac{d\nu}{d\mu}(k) = 2^{-k}.$$

The defining property reduces to $\nu(A) = \sum_{k \in A} 2^{-k}$ for $A \subseteq \mathbb{Z}_{\geq 1}$. The R-N derivative of a discrete probability measure with respect to counting measure is exactly the probability mass function. The “PDF vs PMF” distinction every student learns is purely about which dominating measure you take the derivative against — Lebesgue for continuous distributions, counting for discrete ones. The Radon-Nikodym framework unifies them.
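Because integration against counting measure is just summation, the defining identity can be verified exactly in rational arithmetic; the event $A$ below is chosen arbitrarily:

```python
from fractions import Fraction

# dν/dμ for the geometric(1/2) measure w.r.t. counting measure on
# {1, 2, 3, ...} is the PMF k ↦ 2^{-k}; integrating against counting
# measure is summation, so ν(A) = Σ_{k∈A} 2^{-k}.
pmf = lambda k: Fraction(1, 2**k)     # the R-N derivative at the point k

A = {2, 3, 5}                         # an arbitrary finite event
nu_A = sum(pmf(k) for k in A)         # ν(A) = ∫_A (dν/dμ) dμ
print(nu_A)                           # 13/32 = 1/4 + 1/8 + 1/32
```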

Interactive figure (two panels): CDFs of two Beta distributions, Beta(α₁, β₁) and Beta(α₂, β₂), alongside their densities (R-N derivatives), with an adjustable interval $A = [a, b]$. Integrating the density $dP/d\lambda$ over $A$ recovers the probability $P(A)$, which equals the CDF difference $F(b) - F(a)$: for $A = [0.10, 0.40]$ the readouts give $P_1(A) = 0.6525$ versus $F_1(b) - F_1(a) = 0.6524$, and $P_2(A) = 0.0409$ versus $0.0409$. The two computations agree to numerical precision — exactly what the Radon-Nikodym theorem guarantees.

4. The Radon-Nikodym Theorem

We have defined the R-N derivative and shown it is unique when it exists. The substance of the theory is the existence statement: whenever $\nu \ll \mu$ and both measures are $\sigma$-finite, the derivative exists. The proof uses the $L^2$ projection theorem from Topic 27 §9 — the very tool that section advertised as a forward reference. This is the proof Topic 27 promised.

🔷 Theorem 1 (Radon-Nikodym theorem)

Let $(\Omega, \mathcal{F}, \mu)$ be a $\sigma$-finite measure space and let $\nu$ be a $\sigma$-finite measure on $(\Omega, \mathcal{F})$ with $\nu \ll \mu$. Then there exists a non-negative measurable function $f$ such that

$$\nu(A) = \int_A f \, d\mu \quad \text{for every } A \in \mathcal{F}.$$

The function $f$ is unique $\mu$-a.e. (Proposition 1), lies in $L^1(\mu)$ when $\nu$ is finite, and is called the Radon-Nikodym derivative $\frac{d\nu}{d\mu}$.

Proof.

Following von Neumann, we reduce the existence of the density to a single application of $L^2$ orthogonal projection. The argument has four stages.

Stage 1 — Reduction to finite measures. By $\sigma$-finiteness, we can write $\Omega$ as a disjoint union $\Omega = \bigsqcup_n E_n$ with $\mu(E_n) < \infty$ and $\nu(E_n) < \infty$ for every $n$. If we can produce a density $f_n$ on each $E_n$ — that is, a function with $\nu|_{E_n}(A) = \int_A f_n \, d\mu|_{E_n}$ for every measurable $A \subseteq E_n$ — then the function $f = \sum_n f_n \cdot \mathbf{1}_{E_n}$ is a density for $\nu$ on all of $\Omega$. So it suffices to prove the theorem under the additional assumption that both $\mu$ and $\nu$ are finite. Assume from here on that $\mu(\Omega) < \infty$ and $\nu(\Omega) < \infty$.

Stage 2 — The auxiliary measure $\rho = \mu + \nu$. Define $\rho = \mu + \nu$, the sum measure. Both $\mu$ and $\nu$ are absolutely continuous with respect to $\rho$ (every $\rho$-null set is in particular a $\mu$-null set and a $\nu$-null set). Now consider the linear functional $\phi: L^2(\rho) \to \mathbb{R}$ defined by

$$\phi(g) = \int g \, d\nu.$$

This is linear in $g$. To check it is bounded, first dominate the $\nu$-integral by the $\rho$-integral:

$$|\phi(g)| = \left|\int g \, d\nu\right| \leq \int |g| \, d\nu \leq \int |g| \, d\rho.$$

The last step uses $\nu \leq \rho$ pointwise as set functions. Now apply the Cauchy-Schwarz inequality in $L^2(\rho)$ to bound the $L^1(\rho)$ norm by the $L^2(\rho)$ norm:

$$\int |g| \, d\rho = \int |g| \cdot 1 \, d\rho \leq \|g\|_{L^2(\rho)} \cdot \|1\|_{L^2(\rho)} = \rho(\Omega)^{1/2} \|g\|_{L^2(\rho)}.$$

So $|\phi(g)| \leq \rho(\Omega)^{1/2} \|g\|_{L^2(\rho)}$, which is exactly the statement that $\phi$ is a bounded linear functional on $L^2(\rho)$ with operator norm at most $\rho(\Omega)^{1/2}$.

Stage 3 — Riesz representation via $L^2$ projection. By Topic 27 §9 Proposition 3 — or equivalently the Riesz representation theorem for Hilbert spaces, which we previewed in Section 9 of that topic — every bounded linear functional on a Hilbert space is represented by an inner product with a unique element of the space. Applied to our $\phi$ on the Hilbert space $L^2(\rho)$, this gives a function $h \in L^2(\rho)$ such that

$$\phi(g) = \langle g, h \rangle_{L^2(\rho)} = \int g h \, d\rho \quad \text{for every } g \in L^2(\rho).$$

Combining with the definition of $\phi$, we get the integral identity that drives the rest of the proof:

$$\int g \, d\nu = \int g h \, d\rho \quad \text{for every } g \in L^2(\rho). \tag{$\ast$}$$

This is the entire content of Topic 27’s projection theorem applied here. Everything in Stage 4 is bookkeeping that turns $h$ into the actual R-N derivative.

Stage 4 — Extract the derivative. Substitute $g = \mathbf{1}_A$ in $(\ast)$ for an arbitrary measurable set $A$ (this is allowed because $\rho$ is finite, so $\mathbf{1}_A \in L^2(\rho)$). The left side becomes $\nu(A)$, and the right side becomes $\int_A h \, d\rho = \int_A h \, d\mu + \int_A h \, d\nu$. So

$$\nu(A) = \int_A h \, d\mu + \int_A h \, d\nu.$$

Rearranging,

$$\int_A (1 - h) \, d\nu = \int_A h \, d\mu \quad \text{for every } A \in \mathcal{F}. \tag{$\ast\ast$}$$

We claim $0 \leq h \leq 1$ $\rho$-a.e. To see $h \geq 0$, choose $A = \{h < 0\}$ in $(\ast\ast)$: the right side is $\int_A h \, d\mu \leq 0$ (because $h < 0$ on $A$), and the left side is $\int_A (1 - h) \, d\nu \geq 0$ (because $1 - h > 1$ on $A$ and $\nu$ is positive). Both being equal forces both to be zero, which forces $\mu(A) = 0$ and $\nu(A) = 0$, hence $\rho(A) = 0$. So $h \geq 0$ $\rho$-a.e. The argument that $h \leq 1$ is symmetric: choose $A = \{h > 1\}$, observe that $1 - h < 0$ on $A$ while $h > 0$, and conclude both sides are zero, so $\rho(A) = 0$.

Now consider the set $\{h = 1\}$. Plugging $A = \{h = 1\}$ into $(\ast\ast)$ gives $\int_{\{h=1\}} 0 \, d\nu = \int_{\{h=1\}} h \, d\mu$, hence $\mu(\{h = 1\}) = 0$ (since $h = 1 > 0$ on the set). By absolute continuity $\nu \ll \mu$, this forces $\nu(\{h = 1\}) = 0$. So we may discard the set $\{h = 1\}$ — it has both $\mu$-measure zero and $\nu$-measure zero — and assume $0 \leq h < 1$ on $\Omega$.

Define

$$f = \frac{h}{1 - h} \quad \text{on } \{h < 1\}.$$

Then $f \geq 0$ everywhere (since $h \in [0, 1)$). The remaining step is to upgrade the test-function identity $(\ast\ast)$ from indicators to functions large enough to invert $(1 - h)$ — this is the only place the algebra needs care.

The integral identity $(\ast\ast)$ — namely $\int_A (1 - h) \, d\nu = \int_A h \, d\mu$ for every measurable $A$ — extends from indicators to non-negative measurable test functions by the standard machine: linearity gives it for simple functions, and the Monotone Convergence Theorem (Topic 26 Theorem 1) extends it to non-negative measurables. Equivalently, the two measures $(1 - h) \, d\nu$ and $h \, d\mu$ on $\{h < 1\}$ agree on every measurable set, hence are equal as measures.

Now apply this extended identity to the test function $\varphi = \mathbf{1}_A / (1 - h)$, where $A \subseteq \{h < 1 - \tfrac{1}{n}\}$ for some $n \geq 1$ (so that $\varphi$ is bounded by $n$ and hence integrable). On this restricted set the substitution gives

$$\nu(A) = \int \frac{\mathbf{1}_A}{1 - h} \cdot (1 - h) \, d\nu = \int \frac{\mathbf{1}_A}{1 - h} \cdot h \, d\mu = \int_A \frac{h}{1 - h} \, d\mu = \int_A f \, d\mu.$$

So $\nu(A) = \int_A f \, d\mu$ for every $A \subseteq \{h < 1 - \tfrac{1}{n}\}$. Taking $n \to \infty$ and applying MCT to the increasing sequence $\{h < 1 - \tfrac{1}{n}\} \uparrow \{h < 1\}$ extends the identity to all measurable $A \subseteq \{h < 1\}$. Combined with the earlier observation that $\nu(\{h = 1\}) = 0$, we get $\nu(A) = \int_A f \, d\mu$ for every measurable $A \subseteq \Omega$.

This is exactly the Radon-Nikodym identity: $f$ is the density we sought. Integrability of $f$ follows from $\nu(\Omega) < \infty$, since $\int_\Omega f \, d\mu = \nu(\Omega) < \infty$.
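On a finite space the whole construction can be traced concretely: $L^2(\rho)$ is just $\mathbb{R}^n$, the Riesz representative $h = d\nu/d\rho$ can be computed coordinate-wise, and $f = h/(1-h)$ recovers the density. A sketch with made-up weights:

```python
import numpy as np

# Toy run of the von Neumann construction on a 4-point space.
# μ and ν are finite measures with ν << μ (illustrative numbers).
mu = np.array([0.5, 1.0, 2.0, 0.25])
nu = np.array([0.2, 0.7, 0.1, 0.05])
rho = mu + nu                          # auxiliary measure ρ = μ + ν

# On a finite space the Riesz representative of φ(g) = ∫ g dν in L²(ρ)
# is found pointwise: Σ g·h·ρ equals Σ g·ν for all g  iff  h = ν/ρ.
h = nu / rho
assert np.all((0 <= h) & (h < 1))      # 0 ≤ h < 1 since μ > 0 everywhere

f = h / (1 - h)                        # the extracted density dν/dμ
# Verify the R-N identity ν(A) = ∫_A f dμ on every one of the 16 subsets.
for mask in range(16):
    A = [(mask >> i) & 1 for i in range(4)]
    assert abs(np.dot(A, nu) - np.dot(A, f * mu)) < 1e-12
print("ν(A) = ∫_A f dμ for all 16 subsets")
```

Pointwise, $f = (\nu/\rho)\,/\,(\mu/\rho) = \nu/\mu$, which is what the density must be on atoms; the value of the proof is that the same projection argument works when there are no atoms to divide by.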

💡 Remark 3 (Why the von Neumann proof)

The classical proof of Radon-Nikodym (via the Hahn decomposition theorem) requires building two pages of signed-measure machinery from scratch — positive sets, negative sets, total variation, and the Hahn decomposition itself. Von Neumann’s proof reduces the entire theorem to a single application of $L^2$ projection, which we already proved in Topic 27. The argument is essentially: “a bounded linear functional on $L^2$ is representable as an inner product, and the R-N derivative falls out of the algebra.” This is one of the most elegant proof strategies in analysis. It also closes Topic 27’s promise: the $L^2$ projection theorem we previewed in §9 is the engine that drives this proof. The function-space machinery built one topic ago has paid off in full.

5. Properties of the Radon-Nikodym Derivative

The Leibniz notation $d\nu/d\mu$ is not a coincidence: R-N derivatives obey a chain rule and an inverse rule that mirror the behavior of classical derivatives. These properties make formal manipulations with R-N derivatives feel exactly like the change-of-variables computations from single-variable calculus — except that the “derivative” is now a function in $L^1$ instead of a number.

🔷 Theorem 2 (Chain rule for R-N derivatives)

Let $\mu$, $\nu$, and $\lambda$ be $\sigma$-finite measures on $(\Omega, \mathcal{F})$ with $\nu \ll \mu \ll \lambda$. Then $\nu \ll \lambda$ and

$$\frac{d\nu}{d\lambda} = \frac{d\nu}{d\mu} \cdot \frac{d\mu}{d\lambda} \qquad \lambda\text{-a.e.}$$

Proof.

That $\nu \ll \lambda$ follows immediately from the chain of implications: if $\lambda(A) = 0$ then $\mu(A) = 0$ (by $\mu \ll \lambda$), and then $\nu(A) = 0$ (by $\nu \ll \mu$). For the formula, let $f = d\nu/d\mu$ and $g = d\mu/d\lambda$, both of which exist by Theorem 1. We need to show $\nu(A) = \int_A f g \, d\lambda$ for every $A$, because then by uniqueness (Proposition 1), $fg$ must equal $d\nu/d\lambda$ a.e.

Start with the definition of $f$: $\nu(A) = \int_A f \, d\mu$. We want to rewrite the right side as a $\lambda$-integral. The key fact is that for any non-negative measurable function $\phi$,

$$\int \phi \, d\mu = \int \phi \cdot g \, d\lambda$$

where $g = d\mu/d\lambda$. This identity follows by the standard machine: it holds for $\phi = \mathbf{1}_B$ (just the definition of $g$ as the density), extends to simple functions by linearity, extends to non-negative measurables by the Monotone Convergence Theorem (Topic 26 Theorem 1), and extends to $L^1(\mu)$ functions by the $\phi = \phi^+ - \phi^-$ decomposition. Apply this with $\phi = f \cdot \mathbf{1}_A$:

$$\nu(A) = \int_A f \, d\mu = \int (f \cdot \mathbf{1}_A) \, d\mu = \int (f \cdot \mathbf{1}_A) \cdot g \, d\lambda = \int_A f g \, d\lambda.$$

This holds for every measurable $A$, so by Proposition 1, $fg = d\nu/d\lambda$ $\lambda$-a.e.
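The chain rule can be sanity-checked numerically. In this sketch the densities are made up: $d\mu/d\lambda(x) = 2x$ and $d\nu/d\mu(x) = \tfrac{3}{2}x$ on $[0,1]$, so the chain rule predicts $d\nu/d\lambda(x) = 3x^2$ and hence $\nu([0,t]) = t^3$ in closed form.

```python
# Chain rule dν/dλ = (dν/dμ)·(dμ/dλ) with illustrative densities on [0, 1].
def dmu_dlam(x): return 2.0 * x      # dμ/dλ
def dnu_dmu(x):  return 1.5 * x      # dν/dμ

def integrate(f, a, b, n=100_000):
    """Midpoint rule for ∫_a^b f dλ."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

t = 0.8
# ν([0, t]) computed by integrating the product of the two derivatives:
nu_A = integrate(lambda x: dnu_dmu(x) * dmu_dlam(x), 0.0, t)
print(round(nu_A, 6), t**3)          # both 0.512: the product is dν/dλ
```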

🔷 Proposition 2 (Inverse rule)

Let $\mu$ and $\nu$ be $\sigma$-finite measures on $(\Omega, \mathcal{F})$ with $\nu \ll \mu$ and $\mu \ll \nu$ (mutual absolute continuity). Then

$$\frac{d\mu}{d\nu} = \left(\frac{d\nu}{d\mu}\right)^{-1} \quad \mu\text{-a.e. and } \nu\text{-a.e.}$$

This follows from the chain rule with $\lambda = \nu$: $\frac{d\nu}{d\nu} = 1 = \frac{d\nu}{d\mu} \cdot \frac{d\mu}{d\nu}$, so the two factors are reciprocals where they are nonzero. Mutual absolute continuity guarantees that the set where one factor vanishes is a null set under both measures, so the reciprocal is well-defined a.e.

📝 Example 6 (Chain rule and the change-of-variables formula for densities)

Let $X$ be a real-valued random variable with density $f_X$ with respect to Lebesgue measure $\lambda$. Let $g: \mathbb{R} \to \mathbb{R}$ be a strictly monotone differentiable function with nowhere-zero derivative — that is, a $C^1$ diffeomorphism of $\mathbb{R}$ onto its image — and let $Y = g(X)$. The pushforward measure $P_Y(A) = P_X(g^{-1}(A))$ has a density with respect to Lebesgue measure on the image. By the chain rule for R-N derivatives applied to the pushforward,

$$\frac{dP_Y}{d\lambda}(y) = f_X(g^{-1}(y)) \cdot \frac{1}{|g'(g^{-1}(y))|}.$$

This is the change-of-variables formula for probability densities that every introductory probability course states without proof. The factor $1/|g'|$ is the Jacobian of the inverse transformation, and the entire formula is just the chain rule for R-N derivatives applied in the special case of a bijection between two open subsets of $\mathbb{R}$. The same formula in higher dimensions involves the Jacobian determinant of the inverse map — again an application of the chain rule, now for measures on $\mathbb{R}^n$.
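A quick numeric check of the formula, with the assumed transformation $g(x) = e^x$ applied to a standard normal $X$ (so $Y$ is lognormal): here $g^{-1}(y) = \log y$ and $g'(x) = e^x$, hence $f_Y(y) = f_X(\log y)/y$, and integrating $f_Y$ over $[a,b]$ must match $\Phi(\log b) - \Phi(\log a)$.

```python
from math import erf, exp, log, sqrt, pi

f_X = lambda x: exp(-x * x / 2) / sqrt(2 * pi)   # density of X ~ N(0, 1)
f_Y = lambda y: f_X(log(y)) / y                  # f_X(g⁻¹(y)) / |g'(g⁻¹(y))|

def integrate(f, a, b, n=100_000):
    """Midpoint rule for ∫_a^b f dλ."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))     # standard normal CDF
a, b = 0.5, 3.0
lhs = integrate(f_Y, a, b)                       # ∫_a^b (dP_Y/dλ) dλ
rhs = Phi(log(b)) - Phi(log(a))                  # P(X ∈ [log a, log b])
print(abs(lhs - rhs) < 1e-8)                     # True
```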

💡 Remark 4 (The Leibniz notation is not a coincidence)

R-N derivatives satisfy a chain rule (Theorem 2), an inverse rule (Proposition 2), and — with the right definitions — a kind of product rule. This is exactly the algebra of classical derivatives, and the Leibniz-style notation $d\nu/d\mu$ was chosen to make the analogy explicit. But the analogy has a limit: the R-N derivative is an equivalence class of functions in $L^1(\mu)$ (modulo $\mu$-null sets), not a number. Identities like the chain rule hold almost everywhere, not pointwise. When using R-N notation, always remember the “a.e.” qualifier — it is the price of generality, and dropping it is the most common bookkeeping error in measure-theoretic probability.

6. Signed Measures and Jordan Decomposition

So far we have only discussed positive measures (functions $\mu: \mathcal{F} \to [0, \infty]$). The Radon-Nikodym theorem extends to signed measures, which take values in $\mathbb{R}$ (or $[-\infty, \infty]$, with at most one infinity allowed). We give the bare minimum needed to state the extension; the full Hahn decomposition theory is treated in Folland §3.1 for the reader who wants more.

📐 Definition 4 (Signed measure)

A signed measure on $(\Omega, \mathcal{F})$ is a function $\nu: \mathcal{F} \to [-\infty, \infty]$ that takes at most one of the values $\pm \infty$, satisfies $\nu(\emptyset) = 0$, and is countably additive: for every disjoint sequence $(A_n) \subseteq \mathcal{F}$,

$$\nu\left(\bigsqcup_{n} A_n\right) = \sum_{n} \nu(A_n).$$

Positive measures are the special case of signed measures that take only non-negative values.

📐 Definition 5 (Total variation)

The total variation of a signed measure $\nu$ on $(\Omega, \mathcal{F})$ is the positive measure $|\nu|$ defined by

$$|\nu|(A) = \sup \left\{ \sum_{i=1}^{n} |\nu(A_i)| : A = \bigsqcup_{i=1}^{n} A_i, \, A_i \in \mathcal{F} \right\},$$

where the supremum runs over all finite measurable partitions of $A$. The total variation $|\nu|$ measures the total mass moved by $\nu$, treating positive and negative contributions equally.

🔷 Theorem 3 (Jordan decomposition)

Every signed measure $\nu$ with $|\nu|(\Omega) < \infty$ can be written uniquely as

$$\nu = \nu^+ - \nu^-$$

where $\nu^+$ and $\nu^-$ are positive (finite) measures with $\nu^+ \perp \nu^-$. Moreover, $|\nu| = \nu^+ + \nu^-$.

Proof.

We sketch the proof; the full Hahn decomposition argument is in Folland §3.1. The Hahn decomposition theorem (which we state without proof here) gives disjoint sets $P, N \in \mathcal{F}$ with $\Omega = P \cup N$ such that $\nu(A) \geq 0$ for every measurable $A \subseteq P$ and $\nu(A) \leq 0$ for every measurable $A \subseteq N$. Define

$$\nu^+(A) = \nu(A \cap P), \qquad \nu^-(A) = -\nu(A \cap N).$$

Both are positive measures by construction: $\nu^+$ is non-negative because $A \cap P \subseteq P$, and $\nu^-$ is non-negative because $\nu \leq 0$ on subsets of $N$ (so $-\nu \geq 0$). The difference recovers $\nu$ because $\nu(A) = \nu(A \cap P) + \nu(A \cap N) = \nu^+(A) - \nu^-(A)$. They are mutually singular because they live on the disjoint sets $P$ and $N$. Uniqueness of the decomposition follows from the mutual singularity of $\nu^+$ and $\nu^-$: any other pair of positive mutually-singular measures with the same difference must coincide with these on the sets $P$ and $N$.
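On a finite space the Hahn sets are simply $P = \{\nu \geq 0\}$ and $N = \{\nu < 0\}$, so the whole decomposition is a two-line computation; the weights below are illustrative:

```python
import numpy as np

# Jordan decomposition of a signed measure on a 5-point space,
# represented as a weight vector (illustrative numbers).
nu = np.array([0.4, -0.1, 0.0, -0.7, 0.3])

nu_plus = np.where(nu > 0, nu, 0.0)       # ν⁺ = ν(· ∩ P), P = {ν ≥ 0}
nu_minus = np.where(nu < 0, -nu, 0.0)     # ν⁻ = −ν(· ∩ N), N = {ν < 0}

assert np.allclose(nu, nu_plus - nu_minus)   # ν = ν⁺ − ν⁻
assert np.all(nu_plus * nu_minus == 0)       # ν⁺ ⟂ ν⁻ (disjoint supports)

total_variation = nu_plus + nu_minus         # |ν| = ν⁺ + ν⁻
print(round(total_variation.sum(), 10))      # |ν|(Ω) = 0.4+0.1+0.7+0.3 = 1.5
```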

💡 Remark 5 (Why the minimal scope)

We keep signed-measure theory deliberately minimal because the primary applications in this topic — probability densities, conditional expectation, importance sampling — are about positive measures. Jordan decomposition appears here for two reasons. First, it gives the cleanest extension of the R-N theorem to signed measures: if $\nu$ is a signed measure with $|\nu| \ll \mu$, then $\nu^+ \ll \mu$ and $\nu^- \ll \mu$, so applying Theorem 1 to each piece gives $d\nu^+/d\mu$ and $d\nu^-/d\mu$, and we define $d\nu/d\mu = d\nu^+/d\mu - d\nu^-/d\mu$ as a function in $L^1(\mu)$. Second, the conditional-expectation construction in Section 9 applies to any $L^1$ random variable $X$, not just non-negative ones — and that argument relies on the signed-measure version of R-N to handle $X = X^+ - X^-$ properly.

7. The Lebesgue Decomposition Theorem

The Radon-Nikodym theorem assumes νμ\nu \ll \mu. What happens when we drop that assumption? The Lebesgue decomposition theorem says: any σ\sigma-finite measure ν\nu splits canonically into an absolutely continuous part and a singular part. Together with R-N, this gives a complete picture: the absolutely continuous part has a density, and the singular part lives on a null set of μ\mu.

🔷 Theorem 4 (Lebesgue decomposition theorem)

Let μ\mu and ν\nu be σ\sigma-finite measures on (Ω,F)(\Omega, \mathcal{F}). Then ν\nu decomposes uniquely as

ν=νac+νsing\nu = \nu_{\mathrm{ac}} + \nu_{\mathrm{sing}}

where νacμ\nu_{\mathrm{ac}} \ll \mu and νsingμ\nu_{\mathrm{sing}} \perp \mu. In particular, νac\nu_{\mathrm{ac}} has a Radon-Nikodym derivative dνac/dμL1(μ)d\nu_{\mathrm{ac}}/d\mu \in L^1(\mu), and νsing\nu_{\mathrm{sing}} is concentrated on a μ\mu-null set.

📝 Example 7 (The Cantor three-part decomposition)

The cleanest worked example is a measure built from three pieces of mismatched type. Let

ν=13λ[0,1]  +  13δ0  +  13μC\nu = \tfrac{1}{3} \, \lambda|_{[0, 1]} \;+\; \tfrac{1}{3} \, \delta_0 \;+\; \tfrac{1}{3} \, \mu_C

on R\mathbb{R}, where λ[0,1]\lambda|_{[0,1]} is Lebesgue measure restricted to [0,1][0, 1], δ0\delta_0 is the Dirac mass at 00, and μC\mu_C is the Cantor measure (the probability measure whose CDF is the Devil’s staircase from Topic 25). Decompose ν\nu with respect to Lebesgue measure λ\lambda on R\mathbb{R}. The absolutely continuous part is

νac=13λ[0,1],with density dνacdλ=131[0,1].\nu_{\mathrm{ac}} = \tfrac{1}{3} \, \lambda|_{[0,1]}, \quad \text{with density } \frac{d\nu_{\mathrm{ac}}}{d\lambda} = \tfrac{1}{3} \, \mathbf{1}_{[0, 1]}.

The singular part is

νsing=13δ0+13μC.\nu_{\mathrm{sing}} = \tfrac{1}{3} \, \delta_0 + \tfrac{1}{3} \, \mu_C.

The Dirac component sits on the single point {0}\{0\} (Lebesgue measure zero), and the Cantor measure is supported on the Cantor set (also Lebesgue measure zero), so νsingλ\nu_{\mathrm{sing}} \perp \lambda via the partition A={0}CA = \{0\} \cup \mathcal{C} versus B=RAB = \mathbb{R} \setminus A. The singular part itself can be split further into a discrete component (δ0\delta_0, atoms) and a singular continuous component (μC\mu_C, no atoms but no density). The full Lebesgue decomposition of any σ\sigma-finite measure on R\mathbb{R} has at most these three pieces: absolutely continuous, discrete, and singular continuous.

💡 Remark 6 (Lebesgue decomposition from the von Neumann proof)

The Lebesgue decomposition theorem can be read off the same von Neumann argument that proved Radon-Nikodym. Recall the function hL2(ρ)h \in L^2(\rho) from Stage 3 of the proof, where ρ=μ+ν\rho = \mu + \nu. The set {h=1}\{h = 1\} is exactly where the absolute-continuity argument forced μ({h=1})=0\mu(\{h = 1\}) = 0; on this set, ν\nu lives on a μ\mu-null set, which is the singular part. On {h<1}\{h < 1\}, the algebra produced the density f=h/(1h)f = h/(1 - h), which is the absolutely continuous part. So defining νsing(A)=ν(A{h=1})\nu_{\mathrm{sing}}(A) = \nu(A \cap \{h = 1\}) and νac(A)=ν(A{h<1})=Afdμ\nu_{\mathrm{ac}}(A) = \nu(A \cap \{h < 1\}) = \int_A f \, d\mu gives the decomposition. The full derivation is in Stein and Shakarchi, Chapter 6.

Three-panel: the absolutely continuous component (uniform density on [0,1]), the discrete atom (Dirac mass at 0), and the singular continuous Cantor staircase, plus the combined CDF showing all three contributions

8. Probability Densities as Radon-Nikodym Derivatives

We promised in Section 1 to explain what a probability density “really is.” Here is the answer, made precise. The point of this section is not new mathematics — Definition 6 is just Definition 3 specialized to probability measures — but the change of vocabulary. Once you read “probability density function” as “Radon-Nikodym derivative with respect to Lebesgue measure,” everything in elementary probability snaps into rigorous focus.

📐 Definition 6 (Probability density function (rigorous))

Let PP be a probability measure on (Rn,B(Rn))(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n)) with PλP \ll \lambda, where λ\lambda is Lebesgue measure on Rn\mathbb{R}^n. The probability density function of PP is

p=dPdλ,p = \frac{dP}{d\lambda},

the Radon-Nikodym derivative of PP with respect to Lebesgue measure. It satisfies (i) p0p \geq 0 λ\lambda-a.e., (ii) Rnpdλ=1\int_{\mathbb{R}^n} p \, d\lambda = 1, and (iii) P(A)=ApdλP(A) = \int_A p \, d\lambda for every Borel ARnA \subseteq \mathbb{R}^n.
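The three properties in Definition 6 can be checked numerically for any continuous distribution; a sketch for the standard normal, using `scipy.stats` and quadrature (the test set A=[1,1]A = [-1, 1] is an arbitrary choice):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = norm(0, 1).pdf                     # dP/dλ for the standard normal

# (i) non-negativity, checked on a grid
assert (p(np.linspace(-10, 10, 1001)) >= 0).all()

# (ii) total mass one
total, _ = quad(p, -np.inf, np.inf)
assert abs(total - 1) < 1e-8

# (iii) P(A) = ∫_A p dλ, checked for A = [-1, 1] against the exact CDF
mass, _ = quad(p, -1, 1)
assert abs(mass - (norm.cdf(1) - norm.cdf(-1))) < 1e-8
```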

📝 Example 8 (Beta density as an R-N derivative)

The Beta(α,β)(\alpha, \beta) probability measure PP on [0,1][0, 1] has

dPdλ(x)=xα1(1x)β1B(α,β)1[0,1](x),\frac{dP}{d\lambda}(x) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)} \, \mathbf{1}_{[0, 1]}(x),

where B(α,β)=01tα1(1t)β1dtB(\alpha, \beta) = \int_0^1 t^{\alpha - 1}(1 - t)^{\beta - 1} \, dt is the Beta function. That 01dPdλdλ=1\int_0^1 \frac{dP}{d\lambda} \, d\lambda = 1 is immediate: integrating the numerator over [0,1][0, 1] yields exactly B(α,β)B(\alpha, \beta), which the normalizing constant cancels. The shape of the density encodes how PP distributes mass relative to the uniform distribution: when α=β=1\alpha = \beta = 1, the density is constant and PP is uniform; when α>β\alpha > \beta the mass shifts right; when α<1\alpha < 1 or β<1\beta < 1 the density blows up at an endpoint. The viz in Section 3 lets the reader explore this dependence interactively.
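The normalization can be confirmed by quadrature (the parameter values here are arbitrary choices for illustration):

```python
from scipy.integrate import quad
from scipy.special import beta as beta_fn

alpha, b = 2.5, 1.5                                  # illustrative parameters

# The unnormalized integrand integrates to B(alpha, b) by definition ...
unnorm, _ = quad(lambda t: t**(alpha - 1) * (1 - t)**(b - 1), 0, 1)
assert abs(unnorm - beta_fn(alpha, b)) < 1e-8

# ... so the density dP/dλ integrates to exactly 1.
dens, _ = quad(
    lambda t: t**(alpha - 1) * (1 - t)**(b - 1) / beta_fn(alpha, b), 0, 1
)
assert abs(dens - 1) < 1e-8
```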

📝 Example 9 (When no density exists)

The Cantor measure μC\mu_C on [0,1][0, 1] is not absolutely continuous with respect to Lebesgue measure: μC\mu_C is supported on the Cantor set C\mathcal{C}, which has Lebesgue measure zero. So μCλ\mu_C \perp \lambda, and dμC/dλd\mu_C/d\lambda does not exist. Not every probability measure has a density — only the absolutely continuous ones. The Cantor measure is the canonical example of a probability measure on [0,1][0, 1] with no density: it is continuous (no atoms — every singleton has measure zero) but not absolutely continuous (it lives on a Lebesgue-null set). It is the singular continuous part of the Lebesgue decomposition. Discrete distributions (like the Poisson or geometric) also have no density with respect to Lebesgue, but they do have densities with respect to counting measure — see Example 5.

💡 Remark 7 (Density is always with respect to a reference measure)

The phrase “the density of the Poisson distribution is eθθk/k!e^{-\theta} \theta^k / k!” is shorthand for dP/d#(k)=eθθk/k!dP/d\#(k) = e^{-\theta} \theta^k / k!, where #\# is counting measure on {0,1,2,}\{0, 1, 2, \ldots\}. The same Poisson measure has no density with respect to Lebesgue measure on R\mathbb{R} — it is supported on the integers, which form a Lebesgue-null set. Always ask: “density with respect to what?” The answer is almost always “Lebesgue measure” for continuous distributions and “counting measure” for discrete ones, but the framework allows densities with respect to any dominating measure, and exotic choices (like the Cantor measure on [0,1][0, 1]) are sometimes useful in advanced applications.
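The counting-measure version of "integrates to one" is a sum; a sketch with the Poisson pmf (the rate θ=3\theta = 3 and truncation point are illustrative — the tail mass beyond the truncation is negligible):

```python
import numpy as np
from scipy.stats import poisson

theta = 3.0
k = np.arange(0, 200)                 # truncation; tail mass is negligible

# dP/d#(k) = e^{-θ} θ^k / k!  — the density with respect to counting measure
pmf = poisson(theta).pmf(k)

assert abs(pmf.sum() - 1) < 1e-12     # total mass one under counting measure
# P(A) = Σ_A dP/d# for A = {0,1,2,3}, checked against the CDF
assert abs(pmf[:4].sum() - poisson(theta).cdf(3)) < 1e-12
```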

Three-panel: standard normal density on the real line, Exponential density on the non-negative reals, and Beta density on the unit interval, each with a shaded region whose area is the corresponding probability — the picture every "probability is the area under the curve" intuition is grounded in

9. Conditional Expectation via Radon-Nikodym

This is the flagship application. Conditional expectation E[YX]E[Y \mid X] is the single most important object in measure-theoretic probability — the workhorse of regression, Bayesian inference, reinforcement learning, and stochastic calculus — and the Radon-Nikodym theorem is what makes it well-defined when conditioning on a continuous variable. We state the definition, prove existence, prove the four basic properties, and then prove the L2L^2-best-predictor theorem that connects everything to regression.

📐 Definition 7 (Conditional expectation $E[X \mid \mathcal{G}]$)

Let (Ω,F,P)(\Omega, \mathcal{F}, P) be a probability space, XL1(P)X \in L^1(P) an integrable random variable, and GF\mathcal{G} \subseteq \mathcal{F} a sub-σ\sigma-algebra. The conditional expectation of XX given G\mathcal{G} is the (a.e.-unique) G\mathcal{G}-measurable function E[XG]L1(P)E[X \mid \mathcal{G}] \in L^1(P) satisfying

AE[XG]dP=AXdPfor every AG.\int_A E[X \mid \mathcal{G}] \, dP = \int_A X \, dP \quad \text{for every } A \in \mathcal{G}.

In words: E[XG]E[X \mid \mathcal{G}] has the same integral as XX over every set you can describe using only the information in G\mathcal{G}.

🔷 Theorem 5 (Existence and uniqueness of conditional expectation)

Under the conditions of Definition 7, E[XG]E[X \mid \mathcal{G}] exists and is unique PP-almost everywhere.

Proof.

Define the signed measure ν\nu on the sub-σ\sigma-algebra (Ω,G)(\Omega, \mathcal{G}) by

ν(A)=AXdPfor AG.\nu(A) = \int_A X \, dP \quad \text{for } A \in \mathcal{G}.

Countable additivity of ν\nu follows from the Dominated Convergence Theorem (Topic 26 Theorem 3). Then νPG\nu \ll P|_\mathcal{G}: if AGA \in \mathcal{G} and P(A)=0P(A) = 0, then ν(A)=AXdP=0\nu(A) = \int_A X \, dP = 0 because the Lebesgue integral over a null set is zero. Both measures are σ\sigma-finite on G\mathcal{G} (since PP is finite). By the Radon-Nikodym theorem extended to signed measures via Jordan decomposition (Section 6), there exists a G\mathcal{G}-measurable function fL1(PG)f \in L^1(P|_\mathcal{G}) with

ν(A)=AfdPfor every AG.\nu(A) = \int_A f \, dP \quad \text{for every } A \in \mathcal{G}.

This ff is G\mathcal{G}-measurable by the way the R-N theorem produces densities (the derivative is constructed inside the function space tied to the σ\sigma-algebra), and it satisfies the defining property of conditional expectation. Set E[XG]=fE[X \mid \mathcal{G}] = f. Uniqueness PP-a.e. follows from Proposition 1 applied to PGP|_\mathcal{G}.
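For a G\mathcal{G} generated by a finite partition, the defining integral identity can be checked empirically: E[XG]E[X \mid \mathcal{G}] is constant on each partition block, equal to the block average. A sketch on Uniform[0,1)[0, 1) with a four-block partition (the choice of XX and of the partition is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
omega = rng.random(100_000)        # samples from P = Uniform[0,1)
X = omega**2                       # an integrable random variable

# G generated by the partition [0,.25), [.25,.5), [.5,.75), [.75,1)
block = (omega * 4).astype(int)

# E[X|G] is constant on each block, equal to the block average of X
cond = np.array([X[block == b].mean() for b in range(4)])[block]

# Defining property: same integral as X over every G-measurable set,
# e.g. A = union of the first two blocks. Empirical integrals are
# sample means of the indicator times the integrand.
A = block < 2
assert abs((cond * A).mean() - (X * A).mean()) < 1e-9
```

The identity holds exactly (up to floating-point error) because the block averages are constructed to match the block integrals; the theorem says such a G\mathcal{G}-measurable function exists for any sub-σ\sigma-algebra, not just finitely generated ones.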

🔷 Proposition 3 (Properties of conditional expectation)

Let X,YL1(P)X, Y \in L^1(P) and let HGF\mathcal{H} \subseteq \mathcal{G} \subseteq \mathcal{F} be sub-σ\sigma-algebras of F\mathcal{F}.

(i) Linearity. E[aX+bYG]=aE[XG]+bE[YG]E[aX + bY \mid \mathcal{G}] = a E[X \mid \mathcal{G}] + b E[Y \mid \mathcal{G}] a.e., for any constants a,bRa, b \in \mathbb{R}.

(ii) Tower property. E[E[XG]H]=E[XH]E[E[X \mid \mathcal{G}] \mid \mathcal{H}] = E[X \mid \mathcal{H}] a.e. Conditioning twice on a coarser-and-finer pair collapses to conditioning once on the coarser one.

(iii) Conditional Jensen. If ϕ:RR\phi: \mathbb{R} \to \mathbb{R} is convex and ϕ(X)L1(P)\phi(X) \in L^1(P), then ϕ(E[XG])E[ϕ(X)G]\phi(E[X \mid \mathcal{G}]) \leq E[\phi(X) \mid \mathcal{G}] a.e.

(iv) Independence. If XX is independent of G\mathcal{G} — meaning the σ\sigma-algebra σ(X)\sigma(X) generated by XX is independent of G\mathcal{G} — then E[XG]=E[X]E[X \mid \mathcal{G}] = E[X] a.e. (the constant function equal to the unconditional mean).

All four properties follow directly from the defining integral identity by routine measure-theoretic arguments — see Billingsley §34.

🔷 Theorem 6 ($L^2$-best-predictor theorem)

Let XL2(Ω,F,P)X \in L^2(\Omega, \mathcal{F}, P) and let GF\mathcal{G} \subseteq \mathcal{F} be a sub-σ\sigma-algebra. Then E[XG]E[X \mid \mathcal{G}] is the unique element of L2(Ω,G,P)L^2(\Omega, \mathcal{G}, P) that minimizes

XYL2(P)2=(XY)2dP\|X - Y\|_{L^2(P)}^2 = \int (X - Y)^2 \, dP

over all G\mathcal{G}-measurable YL2(P)Y \in L^2(P). Equivalently, E[XG]E[X \mid \mathcal{G}] is the orthogonal projection of XX onto the closed subspace L2(Ω,G,P)L2(Ω,F,P)L^2(\Omega, \mathcal{G}, P) \subseteq L^2(\Omega, \mathcal{F}, P).

Proof.

Let V=L2(Ω,G,P)V = L^2(\Omega, \mathcal{G}, P) and H=L2(Ω,F,P)H = L^2(\Omega, \mathcal{F}, P). We claim VV is a closed subspace of HH. It is clearly a linear subspace. For closedness, suppose (Yn)V(Y_n) \subseteq V converges to YY in HH — that is, YnYL2(P)0\|Y_n - Y\|_{L^2(P)} \to 0. By the Riesz-Fischer theorem from Topic 27, L2L^2 convergence implies a subsequence converges PP-a.e. Pass to that subsequence. Since each YnY_n is G\mathcal{G}-measurable and the pointwise a.e. limit of G\mathcal{G}-measurable functions is G\mathcal{G}-measurable (after redefining on a null set), YY is G\mathcal{G}-measurable, hence YVY \in V.

Now apply the L2L^2 projection theorem from Topic 27 §9 Proposition 3: there exists a unique element X^V\hat X \in V minimizing XX^L2\|X - \hat X\|_{L^2}, characterized by the orthogonality condition

XX^,ZL2(P)=(XX^)ZdP=0for every ZV.\langle X - \hat X, Z \rangle_{L^2(P)} = \int (X - \hat X) Z \, dP = 0 \quad \text{for every } Z \in V.

Specializing Z=1AZ = \mathbf{1}_A for AGA \in \mathcal{G} (note 1AV\mathbf{1}_A \in V since AA is G\mathcal{G}-measurable), the orthogonality condition becomes

A(XX^)dP=0,i.e.,AX^dP=AXdPfor every AG.\int_A (X - \hat X) \, dP = 0, \quad \text{i.e.,} \quad \int_A \hat X \, dP = \int_A X \, dP \quad \text{for every } A \in \mathcal{G}.

This is exactly the defining property of conditional expectation. By the uniqueness in Definition 7, X^=E[XG]\hat X = E[X \mid \mathcal{G}] a.e. So E[XG]E[X \mid \mathcal{G}] is the L2L^2 projection of XX onto VV, which is the best predictor of XX given G\mathcal{G} in the mean-squared sense.

📝 Example 10 (Conditional expectation of a bivariate normal)

Let (X,Y)(X, Y) be jointly normal with μ=(μX,μY)\mu = (\mu_X, \mu_Y) and covariance matrix

Σ=(σX2ρσXσYρσXσYσY2).\Sigma = \begin{pmatrix} \sigma_X^2 & \rho \, \sigma_X \sigma_Y \\ \rho \, \sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix}.

A direct computation (complete the square in the joint density) gives

E[YX=x]=μY+ρσYσX(xμX).E[Y \mid X = x] = \mu_Y + \rho \, \frac{\sigma_Y}{\sigma_X} \, (x - \mu_X).

This is a linear function of xx — the conditional expectation is the linear regression of YY on XX. With the notebook’s preset values μX=μY=0.5\mu_X = \mu_Y = 0.5, σX=0.2\sigma_X = 0.2, σY=0.25\sigma_Y = 0.25, ρ=0.7\rho = 0.7, the slope is 0.70.25/0.2=0.8750.7 \cdot 0.25/0.2 = 0.875, so

E[YX=x]=0.5+0.875(x0.5).E[Y \mid X = x] = 0.5 + 0.875 \, (x - 0.5).

At x=0.65x = 0.65, this gives E[YX=0.65]=0.5+0.8750.15=0.63125E[Y \mid X = 0.65] = 0.5 + 0.875 \cdot 0.15 = 0.63125 (analytical), and the notebook’s slice-averaging numerical method recovers approximately 0.62400.6240 (modulo discretization error). The flagship visualization below lets you see the same curve from three angles — slice averaging, L2L^2 projection, and the explicit R-N derivative — and verify that they all produce the same function.
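The slice-averaging view can be reproduced by brute-force sampling with the preset values above (the slice half-width and sample size are arbitrary choices; the jointly normal pair is generated via the standard conditional construction):

```python
import numpy as np

rng = np.random.default_rng(1)
muX, muY, sX, sY, rho = 0.5, 0.5, 0.2, 0.25, 0.7   # the preset values

n = 400_000
X = rng.normal(muX, sX, n)
# Y | X=x ~ N(muY + rho (sY/sX)(x - muX), sY^2 (1 - rho^2))
Y = muY + rho * (sY / sX) * (X - muX) \
    + rng.normal(0, sY * np.sqrt(1 - rho**2), n)

# Slice average near x = 0.65 vs the analytic line 0.5 + 0.875 (x - 0.5)
x0, h = 0.65, 0.01
slice_mean = Y[np.abs(X - x0) < h].mean()
analytic = muY + rho * (sY / sX) * (x0 - muX)       # = 0.63125

assert abs(analytic - 0.63125) < 1e-12
assert abs(slice_mean - analytic) < 0.01
```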

💡 Remark 8 (Regression is conditional expectation)

The L2L^2-best-predictor theorem says: among all functions of XX, the one that best predicts YY in mean-squared error is E[YX]E[Y \mid X]. Linear regression, polynomial regression, kernel regression, random forests, and neural networks are all attempting to approximate this single object. The R-N theorem is what guarantees that the object exists at all (the function might be exotic, but it is well-defined and integrable). The L2L^2 projection theorem from Topic 27 is what guarantees it is unique and optimal in mean-squared error. The training loss of a regression model is the squared distance from this projection, and successful training reduces the gap. Forward link: Regression.

The amber curve is E[Y | X = x]. In the slice view, dots mark the mean of each vertical slice — connecting them gives the curve. Toggle the views to see this same curve as an L² projection or as an R-N derivative.

Toggle the three views to see one curve from three angles. Slice averaging: cut the joint density into vertical slices, average Y in each slice, and connect the means. L² projection: this is the closest function of X (in mean-squared error) to Y. R-N derivative: E[Y | X = x] is the Radon-Nikodym derivative of ν(A) = ∫_A Y dP with respect to P, restricted to the σ-algebra generated by X. The amber curve is the same in every view — that's the whole point.

Three-panel: a joint density heatmap with the conditional expectation E[Y|X] curve overlaid, the slice-averaging interpretation showing means at representative x-values, and an L² error comparison showing that E[Y|X] beats the constant predictor E[Y]

10. Change of Measure and Importance Sampling

The change-of-measure identity is the operational use of R-N derivatives that ML practitioners encounter most often: it is the engine of importance sampling, REINFORCE policy gradients, and the variational lower bound. The identity is short, the proof is the standard “indicator → simple → non-negative measurable → general” machine, and the consequences are everywhere.

🔷 Theorem 7 (Change of measure)

Let P,QP, Q be σ\sigma-finite measures on (Ω,F)(\Omega, \mathcal{F}) with PQP \ll Q, and let ff be a PP-integrable function. Then

EP[f]=fdP=fdPdQdQ=EQ ⁣[fdPdQ].E_P[f] = \int f \, dP = \int f \cdot \frac{dP}{dQ} \, dQ = E_Q\!\left[f \cdot \frac{dP}{dQ}\right].

Integration with respect to PP is the same as integration with respect to QQ after multiplying the integrand by the R-N derivative.

Proof.

Start with f=1Af = \mathbf{1}_A for AFA \in \mathcal{F}. The left side is 1AdP=P(A)\int \mathbf{1}_A \, dP = P(A). The right side is

1AdPdQdQ=AdPdQdQ=P(A),\int \mathbf{1}_A \cdot \frac{dP}{dQ} \, dQ = \int_A \frac{dP}{dQ} \, dQ = P(A),

where the last step is the defining identity of the R-N derivative. So the identity holds for indicators. By linearity, it extends to simple functions f=ici1Aif = \sum_i c_i \mathbf{1}_{A_i}. By the Monotone Convergence Theorem (Topic 26), it extends to non-negative measurable functions: any such ff is the increasing limit of simple functions fnff_n \uparrow f, and both sides of the identity pass to the limit. Finally, for general PP-integrable ff, write f=f+ff = f^+ - f^- with f±0f^\pm \geq 0 and apply the identity to each piece separately. The change-of-measure identity is the indicator-simple-monotone-linear chain in its purest form.

📝 Example 11 (Importance sampling)

Goal: estimate EP[f]E_P[f] where PP is a target distribution that is hard or expensive to sample from, using samples from a proposal distribution QQ that is easier. The change-of-measure identity gives

EP[f]=EQ ⁣[fdPdQ]1ni=1nf(Xi)w(Xi),XiQ,E_P[f] = E_Q\!\left[f \cdot \frac{dP}{dQ}\right] \approx \frac{1}{n} \sum_{i=1}^{n} f(X_i) \, w(X_i), \quad X_i \sim Q,

where w(x)=dP/dQ(x)w(x) = dP/dQ(x) is the importance weight. The estimator on the right is unbiased for EP[f]E_P[f] provided PQP \ll Q (so the weight exists) and ff is PP-integrable.

The notebook’s preset case takes P=N(0,1)P = N(0, 1), Q=N(2,1)Q = N(2, 1), and f(x)=x2f(x) = x^2 (so EP[f]=1E_P[f] = 1). The weight has the closed form

w(x)=p(x)q(x)=ex2/2e(x2)2/2=e2x+2.w(x) = \frac{p(x)}{q(x)} = \frac{e^{-x^2/2}}{e^{-(x - 2)^2/2}} = e^{-2x + 2}.

The notebook generates samples from QQ and computes the running estimate μ^n=(1/n)if(Xi)w(Xi)\hat\mu_n = (1/n) \sum_i f(X_i) w(X_i) as nn grows; the trace converges to 11, but the convergence is noisy because the proposal QQ sits away from the bulk of PP, which blows up the variance of the weights for negative xx. The interactive visualization below lets you watch this convergence directly and compare different target/proposal pairs.
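A minimal sketch of this preset case, using the closed-form weight derived above (seed and sample size are arbitrary; the effective sample size diagnostic is the standard w2/w2\sum w^2 / \|w\|^2 formula):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(2.0, 1.0, n)          # samples from the proposal Q = N(2,1)

# Closed-form weight dP/dQ = e^{-2x+2} for the target P = N(0,1)
w = np.exp(-2 * x + 2)
est = np.mean(x**2 * w)              # estimates E_P[x^2] = 1

# Effective sample size: how many samples carry meaningful weight
ess = w.sum()**2 / (w**2).sum()

assert abs(est - 1.0) < 0.3          # unbiased, but high-variance
assert ess < n / 10                  # mismatched proposal → small ESS
```

The ESS assertion makes the text's point quantitative: even with a million samples, only a small fraction do meaningful work when the proposal is shifted two standard deviations away from the target.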

📝 Example 12 (KL divergence as expected log-R-N derivative)

The Kullback-Leibler divergence between two probability measures P,QP, Q with PQP \ll Q is

DKL(PQ)=log ⁣dPdQdP=EP ⁣[log ⁣dPdQ].D_{\mathrm{KL}}(P \,\|\, Q) = \int \log\!\frac{dP}{dQ} \, dP = E_P\!\left[\log\!\frac{dP}{dQ}\right].

It is the expected log-density-ratio under PP. Non-negativity follows from Jensen’s inequality (Topic 27 Theorem 1) applied with ϕ=log\phi = -\log (which is convex):

DKL(PQ)=EP ⁣[log ⁣dQdP]    logEP ⁣[dQdP]=log1=0.D_{\mathrm{KL}}(P \,\|\, Q) = -E_P\!\left[\log\!\frac{dQ}{dP}\right] \;\geq\; -\log E_P\!\left[\frac{dQ}{dP}\right] = -\log 1 = 0.

The middle inequality is Jensen, and the last step uses the change-of-measure identity:

EP ⁣[dQdP]=dQdPdP=dQ=1.E_P\!\left[\frac{dQ}{dP}\right] = \int \frac{dQ}{dP} \, dP = \int dQ = 1.

For two normal distributions P=N(0,1)P = N(0, 1) and Q=N(1,1.5)Q = N(1, 1.5), the closed form is

DKL(PQ)=log ⁣σQσP+σP2+(μPμQ)22σQ2120.3499.D_{\mathrm{KL}}(P \,\|\, Q) = \log\!\frac{\sigma_Q}{\sigma_P} + \frac{\sigma_P^2 + (\mu_P - \mu_Q)^2}{2 \sigma_Q^2} - \tfrac{1}{2} \approx 0.3499.

The notebook verifies this analytic value against direct numerical integration of the integral p(x)log(p(x)/q(x))dx\int p(x) \log(p(x)/q(x)) \, dx, matching to within ∼10−6\sim 10^{-6}.
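That verification can be sketched directly (integration limits chosen wide enough that the truncated tail mass is negligible):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = norm(0.0, 1.0).pdf
q = norm(1.0, 1.5).pdf

# KL(P || Q) by direct quadrature of the expected log density ratio
kl_num, _ = quad(lambda t: p(t) * np.log(p(t) / q(t)), -10, 10)

# Closed form for two Gaussians
muP, sP, muQ, sQ = 0.0, 1.0, 1.0, 1.5
kl_closed = np.log(sQ / sP) + (sP**2 + (muP - muQ)**2) / (2 * sQ**2) - 0.5

assert abs(kl_closed - 0.3499) < 5e-4
assert abs(kl_num - kl_closed) < 1e-6
```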

💡 Remark 9 (Normalizing flows are R-N chain rules in disguise)

A normalizing flow defines a C1C^1 bijection T:RdRdT: \mathbb{R}^d \to \mathbb{R}^d and pushes a base distribution PZP_Z (typically standard normal) forward to a learned distribution PX=T#PZP_X = T_\# P_Z. The R-N derivative of PXP_X with respect to Lebesgue measure on Rd\mathbb{R}^d is computed via the chain rule for densities (Example 6 in higher dimensions):

pX(x)=pZ(T1(x))detDT1(x).p_X(x) = p_Z(T^{-1}(x)) \cdot \big|\det DT^{-1}(x)\big|.

This is the change-of-variables formula for densities, and it is just the chain rule for R-N derivatives applied to the pushforward measure. Normalizing flows are designed so that the determinant on the right is cheap to compute — coupling layers, autoregressive flows, and continuous flows are all architectural tricks for keeping detDT1|\det DT^{-1}| tractable while preserving expressivity. Forward link: Generative Modeling.
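The formula is easy to check in one dimension, where a "flow" can be as simple as an affine map (parameters here are illustrative): T(z)=az+bT(z) = az + b pushes N(0,1)N(0, 1) forward to N(b,a2)N(b, a^2), and the change-of-variables density must agree with the known Gaussian pdf.

```python
import numpy as np
from scipy.stats import norm

# One-dimensional "flow": T(z) = a z + b pushes the base N(0,1) forward
a, b = 1.7, -0.4                          # illustrative parameters, a != 0

def push_density(x):
    # p_X(x) = p_Z(T^{-1}(x)) |det D T^{-1}(x)|, with T^{-1}(x) = (x - b)/a
    z = (x - b) / a
    return norm.pdf(z) * abs(1.0 / a)

# The pushforward of N(0,1) under T is N(b, a^2); the formula must agree
xs = np.linspace(-5.0, 5.0, 101)
assert np.allclose(push_density(xs), norm(b, a).pdf(xs))
```

Real flows compose many such maps; the log-determinants add along the composition, which is the chain rule for R-N derivatives in logarithmic form.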

n = 500, Ê_P[f] = 0.5317, true value = 1.0000, ESS = 47 / 500

Estimate E_P[f] using samples from the proposal Q. The importance weight w(x) = dP/dQ is a Radon-Nikodym derivative — the function that converts integration with respect to Q into integration with respect to P. The running estimate (slate) converges toward the true value (indigo dashed) as n grows. The effective sample size (ESS) measures how many of the n samples are doing meaningful work — an ESS far below n means the proposal is poorly matched to the target and most of the variance is concentrated in a few high-weight points.

Three-panel: target N(0,1) and proposal N(2,1) densities with the importance weight w(x) overlaid, a histogram of weighted samples, and the convergence of the Monte Carlo estimate to the true value as n grows

Two-panel: density ratio dP/dQ for P=N(0,1), Q=N(1,1.5) on the left, and log(dP/dQ) weighted by p(x) with the shaded KL integral on the right, illustrating that KL divergence is the expected log-R-N derivative

11. Computational Notes

A few practical observations about computing R-N derivatives in code, since ML practitioners encounter them constantly under different names.

  • Kernel density estimation as R-N approximation. Given samples x1,,xnx_1, \ldots, x_n from a probability measure PP with PλP \ll \lambda, the kernel density estimator p^h(x)=1nhi=1nK ⁣(xxih)\hat p_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right) is a non-parametric estimate of dP/dλdP/d\lambda. Under regularity conditions on the kernel KK and the bandwidth hh, p^h\hat p_h converges to the true density in L1L^1 as nn \to \infty, h0h \to 0, and nhnh \to \infty. KDE is the simplest and most direct way to estimate an R-N derivative from samples.

  • scipy.stats and densities. Every continuous distribution in scipy.stats exposes a .pdf(x) method that returns dP/dλ(x)dP/d\lambda(x) and a .logpdf(x) method that returns log(dP/dλ(x))\log(dP/d\lambda(x)). The .logpdf form is preferred for numerical work because density values can be very small and underflow, but log densities are well-behaved across the entire support.

  • Conditional expectation in pandas. A df.groupby('X')['Y'].mean() call computes the empirical conditional expectation of YY given XX — that is, the slice average of YY over the groups defined by distinct XX values. This is the empirical version of the slice-averaging interpretation in Section 9, applied to a finite sample.

  • Importance weights in PyTorch. torch.distributions.Normal(0, 1).log_prob(x) returns logp(x)\log p(x) for the standard normal density. The log-importance weight for samples from QQ used to estimate expectations under PP is log_p(x) - log_q(x), which is log(dP/dQ)\log(dP/dQ). The actual weight is exp(log_p(x) - log_q(x)), but the log form is preferred until the very last step to avoid overflow.
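The log-space discipline from the last bullet, sketched with `scipy.stats` rather than PyTorch so the snippet stays self-contained (distributions, seed, and sample size are illustrative); subtracting the maximum log-weight before exponentiating is the standard stabilization trick, and the constant cancels in the self-normalized estimator:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, 1_000_000)       # samples from the proposal Q = N(2,1)

# Work in log space: log w = log p - log q = log(dP/dQ)
log_w = norm(0, 1).logpdf(x) - norm(2, 1).logpdf(x)

# Subtract the max before exponentiating so the largest weight is exactly 1;
# the overall constant cancels in the self-normalized estimator below.
w = np.exp(log_w - log_w.max())

# Self-normalized importance-sampling estimate of E_P[x^2] = 1 for P = N(0,1)
est = np.sum(w * x**2) / np.sum(w)
assert abs(est - 1.0) < 0.3
```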

📝 Example 13 (Numerical density-ratio verification)

For P=N(0,1)P = N(0, 1) and Q=N(1,1.5)Q = N(1, 1.5), the density ratio dP/dQdP/dQ has the closed form

dPdQ(x)=p(x)q(x)=1.5ex2/2e(x1)2/(21.52).\frac{dP}{dQ}(x) = \frac{p(x)}{q(x)} = \frac{1.5 \, e^{-x^2/2}}{e^{-(x-1)^2/(2 \cdot 1.5^2)}}.

Pick the integrand f(x)=x2f(x) = x^2 and verify the change-of-measure identity numerically by computing both sides of fpdλ=f(dP/dQ)qdλ\int f \, p \, d\lambda = \int f \cdot (dP/dQ) \cdot q \, d\lambda via quadrature on [5,5][-5, 5]. Both integrals should agree to within ∼10−6\sim 10^{-6} — the expected accuracy of scipy.integrate.quad for smooth integrands. If they disagree by more than discretization error, either the weight is wrong or the support of PP extends beyond the support of QQ (in which case P≪̸QP \not\ll Q and the identity does not hold). The notebook performs this verification step explicitly and confirms the agreement.
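A sketch of that verification (the quadrature window [5,5][-5, 5] follows the text; the residual E_P[x^2] \approx 1 check is accurate only up to the tail mass outside the window):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = norm(0.0, 1.0).pdf      # target density
q = norm(1.0, 1.5).pdf      # proposal density
f = lambda t: t**2

lhs, _ = quad(lambda t: f(t) * p(t), -5, 5)                   # ∫ f dP
rhs, _ = quad(lambda t: f(t) * (p(t) / q(t)) * q(t), -5, 5)   # ∫ f (dP/dQ) dQ

assert abs(lhs - rhs) < 1e-8
assert abs(lhs - 1.0) < 1e-3    # E_P[x^2] = 1 up to tail mass beyond ±5
```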

12. Connections to Machine Learning

This section is longer than the ML-connections sections in Topics 25–27, reflecting the extraordinary breadth of Radon-Nikodym applications in machine learning. Once you read “the density of PP” as “the R-N derivative dP/dλdP/d\lambda,” every probabilistic computation in ML traces back to a change of measure or a conditional expectation.

Probability densities and likelihoods. Every PDF the practitioner has ever written down — Gaussian, Beta, Dirichlet, mixture model, neural-network output — is dP/dλdP/d\lambda. Every likelihood L(θ)=ipθ(xi)L(\theta) = \prod_i p_\theta(x_i) is a product of R-N derivative values. Maximum likelihood estimation, Fisher information, and the Cramér-Rao bound all operate on R-N derivatives. The entire parametric statistics framework is built on this single object. Forward link: Probability Spaces.

Conditional expectation and regression. E[YX]E[Y \mid X] is the best L2L^2 predictor of YY given XX (Theorem 6), and it is an R-N derivative on a sub-σ\sigma-algebra (Theorem 5). Least-squares regression, random forests, gradient-boosted trees, and neural-network regression are all approximating this single object. The normal equations of OLS are exactly the orthogonality condition from the L2L^2 projection theorem applied to the linear subspace of affine functions of XX. Forward link: Regression.

Bayesian inference as change of measure. Bayes’ theorem says the posterior density is proportional to the prior times the likelihood: p(θx)p(xθ)p(θ)p(\theta \mid x) \propto p(x \mid \theta) \, p(\theta). In R-N terms, the posterior measure PθXP_{\theta \mid X} is absolutely continuous with respect to the prior PθP_\theta, and the R-N derivative is the (normalized) likelihood ratio. The posterior update is a change of measure from prior to posterior, and the normalization constant p(x)=p(xθ)p(θ)dθp(x) = \int p(x \mid \theta) p(\theta) \, d\theta is the marginal likelihood. Variational inference, MCMC, and Hamiltonian Monte Carlo are all algorithms for working with this change of measure when the posterior cannot be computed in closed form. Forward link: Bayesian Inference.

Importance sampling and REINFORCE. The identity EP[f]=EQ[fdP/dQ]E_P[f] = E_Q[f \cdot dP/dQ] from Section 10 is the engine of importance sampling. In reinforcement learning, the REINFORCE policy-gradient estimator uses the same identity in disguise: θEpθ[R]=Epθ[Rθlogpθ]\nabla_\theta E_{p_\theta}[R] = E_{p_\theta}[R \cdot \nabla_\theta \log p_\theta], where the gradient θlogpθ\nabla_\theta \log p_\theta is the score function — the gradient of the log R-N derivative. Every time a policy gradient is estimated by sampling from the current policy, an R-N derivative is being approximated. Forward link: Importance Sampling.

KL divergence and information geometry. DKL(PQ)=EP[log(dP/dQ)]D_{\mathrm{KL}}(P \,\|\, Q) = E_P[\log(dP/dQ)] is the expected log-R-N derivative. The Fisher information matrix I(θ)ij=EPθ[ilogpθjlogpθ]I(\theta)_{ij} = E_{P_\theta}[\partial_i \log p_\theta \cdot \partial_j \log p_\theta] is the covariance of the score vector — the gradient of the log R-N derivative — and it is the local curvature of the manifold of probability distributions in the KL geometry. Natural-gradient methods, mirror descent, and trust-region policy optimization all use this geometry. Forward link: Information Geometry.

Normalizing flows and diffusion models. Normalizing flows compute dPX/dλdP_X/d\lambda via the R-N chain rule applied to a learned bijection TT (Remark 9). Diffusion models learn the score logpt\nabla \log p_t at each noise level tt — the gradient of the log R-N derivative of the noisy data distribution — and use it to reverse the noising process. Both frameworks are built on R-N derivatives all the way down. Forward link: Diffusion Models.

ff-divergences and GANs. Every ff-divergence Df(PQ)=f(dP/dQ)dQD_f(P \,\|\, Q) = \int f(dP/dQ) \, dQ is a functional of the R-N derivative. KL is the special case f(t)=tlogtf(t) = t \log t. The total variation, Jensen-Shannon, χ2\chi^2, and Hellinger divergences are other choices of ff. Generative adversarial networks (GANs) train a discriminator that approximates dPdata/dPgendP_{\mathrm{data}}/dP_{\mathrm{gen}}, and the generator’s loss is some ff-divergence between the two distributions — the discriminator literally learns the R-N derivative. Forward link: Generative Modeling.

📝 Example 14 (Score function identity)

For a parametric family {Pθ}\{P_\theta\} with densities pθ=dPθ/dλp_\theta = dP_\theta/d\lambda, the score function is sθ(x)=θlogpθ(x)s_\theta(x) = \nabla_\theta \log p_\theta(x). A foundational identity says it has mean zero under PθP_\theta:

EPθ[sθ(X)]=0.E_{P_\theta}[s_\theta(X)] = 0.

Proof: differentiate pθdλ=1\int p_\theta \, d\lambda = 1 with respect to θ\theta. Under regularity conditions (those that justify swapping the derivative and the integral via the Dominated Convergence Theorem from Topic 26),

0=θ ⁣pθdλ=θpθdλ=θpθpθpθdλ=EPθ[θlogpθ]=EPθ[sθ].0 = \nabla_\theta \!\int p_\theta \, d\lambda = \int \nabla_\theta p_\theta \, d\lambda = \int \frac{\nabla_\theta p_\theta}{p_\theta} \, p_\theta \, d\lambda = E_{P_\theta}[\nabla_\theta \log p_\theta] = E_{P_\theta}[s_\theta].

The middle step uses the chain rule for derivatives and the fact that pθ>0p_\theta > 0 where the support lies. This identity is the basis of maximum-likelihood asymptotics — the score is a martingale at the true parameter, and the variance of the score is the Fisher information.
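The mean-zero identity is easy to confirm by Monte Carlo for a Gaussian location family, where the score with respect to the mean is (xμ)/σ2(x - \mu)/\sigma^2 (parameter values and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8                     # illustrative true parameters
x = rng.normal(mu, sigma, 1_000_000)

# Score of N(mu, sigma^2) with respect to mu: s(x) = (x - mu) / sigma^2
score = (x - mu) / sigma**2

assert abs(score.mean()) < 0.01          # E[s] = 0 at the true parameter
# The variance of the score is the Fisher information, 1/sigma^2 here
assert abs(score.var() - 1 / sigma**2) < 0.02
```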

📝 Example 15 (Variational lower bound (ELBO))

In variational inference, the marginal log-likelihood logp(x)\log p(x) is intractable because it requires integrating over a latent variable zz: p(x)=p(x,z)dzp(x) = \int p(x, z) \, dz. The evidence lower bound (ELBO) sidesteps this by introducing an auxiliary distribution q(z)q(z) over the latents and applying Jensen’s inequality to the log:

logp(x)=log ⁣p(x,z)dz=logEq ⁣[p(x,z)q(z)]    Eq ⁣[log ⁣p(x,z)q(z)].\log p(x) = \log\!\int p(x, z) \, dz = \log \, E_q\!\left[\frac{p(x, z)}{q(z)}\right] \;\geq\; E_q\!\left[\log\!\frac{p(x, z)}{q(z)}\right].

For fixed xx, the ratio p(x,z)/q(z)p(x, z)/q(z) inside the log equals p(x)p(x) times the Radon-Nikodym derivative of the posterior p(zx)p(z \mid x) with respect to the variational distribution qq — a change of measure on the latent space. The ELBO is itself an expectation of a log R-N derivative, and maximizing it with respect to both the model parameters and qq tightens the bound on logp(x)\log p(x) from below; the gap is exactly DKL(q(z)p(zx))D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid x)). Variational autoencoders, expectation maximization, and amortized inference all maximize ELBO objectives.
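A sketch with a hypothetical conjugate toy model chosen so everything is available in closed form — zN(0,1)z \sim N(0, 1), xzN(z,1)x \mid z \sim N(z, 1), so p(x)=N(x;0,2)p(x) = N(x; 0, 2) and the posterior is N(x/2,1/2)N(x/2, 1/2): when qq is the exact posterior the bound is tight, and any other qq falls strictly below.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3                                   # one observed data point (arbitrary)

# Toy model: z ~ N(0,1), x|z ~ N(z,1). Marginal p(x) = N(x; 0, 2),
# posterior p(z|x) = N(x/2, 1/2) by Gaussian conjugacy.
log_px = norm(0, np.sqrt(2)).logpdf(x)

def elbo(m, s, n=200_000):
    """Monte Carlo ELBO E_q[log p(x,z) - log q(z)] with q = N(m, s^2)."""
    z = rng.normal(m, s, n)
    log_joint = norm(0, 1).logpdf(z) + norm(z, 1).logpdf(x)
    return np.mean(log_joint - norm(m, s).logpdf(z))

tight = elbo(x / 2, np.sqrt(0.5))         # q = exact posterior: tight bound
loose = elbo(0.0, 1.0)                    # q = prior: strictly below log p(x)

assert abs(tight - log_px) < 0.01
assert loose < log_px                     # gap = KL(q || posterior) > 0
```

With qq equal to the posterior the integrand log(p(x,z)/q(z))\log(p(x, z)/q(z)) is constant and equal to logp(x)\log p(x) — zero-variance, which is exactly the tightness statement.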

Four-panel ML connections to Radon-Nikodym: Bayesian update as change of measure, importance sampling weight visualization, normalizing flow density transformation via the chain rule, and the score function as the gradient of the log density

13. The Measure & Integration Arc — Connections & Further Reading

This is the fourth and final topic in the Measure & Integration track. The track tells a single story in four chapters. Topic 25 built the language: $\sigma$-algebras gave us a rigorous notion of “measurable set,” and measures assigned sizes to those sets in a way that is countably additive and well-behaved under countable operations. Topic 26 built the integral: the Lebesgue integral generalized Riemann’s by measuring the domain rather than the range, unlocking three convergence theorems (Monotone Convergence, Fatou, Dominated Convergence) that make limit/integral interchange rigorous. Topic 27 built function spaces: $L^p$ organized integrable functions into normed vector spaces with geometric structure (distances, balls, completeness), and the $L^2$ inner product made “projection onto a closed subspace” meaningful. Topic 28 completed the arc by turning measures into functions: the Radon-Nikodym theorem says that when one measure is absolutely continuous with respect to another, the relationship is captured by a single integrable function, the density.

The four topics form a chain of increasingly powerful tools. $\sigma$-algebras are the foundation (you cannot measure without them). The Lebesgue integral is the engine (you cannot integrate without it). $L^p$ spaces are the geometry (you cannot do analysis without norms and completeness). And the Radon-Nikodym theorem is the bridge to probability (you cannot do measure-theoretic probability without densities and conditional expectation). Each topic depends on the previous one, and the dependency is not merely formal: the proofs literally use the tools.

For the ML practitioner, this arc provides the rigorous underpinnings of the probabilistic tools used daily. The density $p(x)$ is $dP/d\lambda$, a Radon-Nikodym derivative. The expected value $E[X]$ is $\int X \, dP$, a Lebesgue integral. The best predictor $E[Y \mid X]$ is the $L^2$ projection of the random variable $Y$ onto the subspace of functions measurable with respect to the sub-$\sigma$-algebra generated by $X$. KL divergence is an integral of a log R-N derivative. Every probabilistic computation in modern machine learning traces back to this four-topic chain. The track exists precisely so that a reader who finishes it can pick up any modern paper in measure-theoretic ML and recognize the pieces.
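The KL claim above can be checked directly. The sketch below is an illustration under assumed Gaussian choices: it estimates $\mathrm{KL}(P \| Q) = E_P[\log(dP/dQ)]$ for $P = N(0, 1)$ and $Q = N(1, 4)$ by averaging the log density ratio over samples from $P$, and compares the estimate with the Gaussian closed form.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_normal_pdf(v, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at v."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (v - mu)**2 / (2 * sigma**2)

# Assumed pair: P = N(0, 1), Q = N(1, 2^2).
x = rng.normal(0.0, 1.0, size=500_000)  # samples from P

# log of the R-N derivative dP/dQ, evaluated along samples from P
log_rn = log_normal_pdf(x, 0.0, 1.0) - log_normal_pdf(x, 1.0, 2.0)
kl_mc = log_rn.mean()                   # Monte Carlo E_P[log(dP/dQ)]

# Closed form: log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
kl_exact = np.log(2.0) + (1.0 + 1.0) / 8.0 - 0.5

assert abs(kl_mc - kl_exact) < 0.01
```

The same estimator with $P$ and $Q$ swapped gives a different number, which is the asymmetry of KL made concrete: the log R-N derivative is inverted and the averaging measure changes.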

Within formalCalculus:

  • The Lebesgue Integral — The defining identity $\nu(A) = \int_A f \, d\mu$ is a Lebesgue integral, and the change-of-variables extension in Section 5 used the Monotone Convergence Theorem from Topic 26 as its closing tool.
  • $L^p$ Spaces — The von Neumann proof of Radon-Nikodym in Section 4 is a single application of the $L^2$ projection theorem from Topic 27 §9. R-N derivatives live in $L^1(\mu)$. Topic 27’s duality proof sketch ($(L^p)^* = L^q$, §10) used R-N as a black box; this topic closes that loop.
  • Sigma-Algebras & Measures — Measurable spaces, null sets, and a.e.-equivalence are the framework throughout. The sub-$\sigma$-algebra construction in Section 9 reuses the $\sigma$-algebra machinery from Topic 25.
  • Completeness & Compactness — Completeness of $L^2$ (proved by Riesz-Fischer in Topic 27, using the Bolzano-Weierstrass-style subsequence extraction from Topic 3) is what makes the closed-subspace projection theorem work. Without completeness, the von Neumann proof would not close.

Successor topics within formalCalculus:

  • Normed & Banach Spaces — $L^p$ duality proved via Radon-Nikodym (now complete) is the canonical example of a dual-space characterization in functional analysis. The abstract dual-space theory, Hahn-Banach theorem, and the three pillars (UBP, Open Mapping, Closed Graph) are developed there.
  • Inner Product & Hilbert Spaces — Conditional expectation is the orthogonal projection of an $L^2$ random variable onto the subspace of functions measurable with respect to a sub-$\sigma$-algebra. The projection theorem provides existence and uniqueness.
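The projection characterization in the second bullet can be demonstrated directly. For a discrete conditioning variable, the orthogonal projection of $Y$ onto the $\sigma(X)$-measurable functions is the per-group mean of $Y$, and no competing $\sigma(X)$-measurable predictor achieves a smaller squared error. A minimal sketch with assumed toy data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy data: X uniform on {0,1,2}, Y = X^2 + noise, so E[Y|X=k] = k^2.
n = 300_000
x = rng.integers(0, 3, size=n)
y = x.astype(float)**2 + rng.normal(0.0, 1.0, size=n)

# The L^2 projection of Y onto sigma(X) is the per-group mean of Y.
cond_mean = np.array([y[x == k].mean() for k in range(3)])

# It recovers k^2 and beats any other sigma(X)-measurable predictor in MSE.
mse_proj = ((y - cond_mean[x])**2).mean()
mse_other = ((y - (cond_mean[x] + 0.3))**2).mean()  # a shifted competitor

assert np.allclose(cond_mean, [0.0, 1.0, 4.0], atol=0.05)
assert mse_proj < mse_other
```

This is exactly what least-squares regression approximates when $X$ is continuous: the fitted function is a finite-dimensional stand-in for the projection onto all $\sigma(X)$-measurable $L^2$ functions.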

Forward to formalml.com:

  • Probability Spaces — Every PDF is a Radon-Nikodym derivative; the entire measure-theoretic probability framework rests on this topic.
  • Regression — Conditional expectation is the best $L^2$ predictor; regression algorithms are how we approximate it.
  • Bayesian Inference — The posterior update is a change of measure from prior to posterior.
  • Importance Sampling — The importance weight is literally a Radon-Nikodym derivative.
  • Generative Modeling — Normalizing flows compute density via the R-N chain rule; GANs learn the R-N derivative as a discriminator.
  • Information Geometry — KL divergence and Fisher information are both functionals of R-N derivatives.
  • Diffusion Models — Score matching learns the gradient of the log R-N derivative at every noise level.
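The importance-sampling bullet above can be made concrete: sampling from a proposal $Q$ and weighting by the R-N derivative $dP/dQ$ reproduces expectations under the target $P$, since $E_P[f] = E_Q[f \cdot dP/dQ]$. A small sketch with an assumed Gaussian target and proposal:

```python
import numpy as np

rng = np.random.default_rng(4)

def log_normal_pdf(v, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at v."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (v - mu)**2 / (2 * sigma**2)

# Assumed pair: target P = N(0, 1), proposal Q = N(0, 2^2).
z = rng.normal(0.0, 2.0, size=500_000)  # samples from Q, not from P

# Importance weight = R-N derivative dP/dQ = p(z) / q(z)
w = np.exp(log_normal_pdf(z, 0.0, 1.0) - log_normal_pdf(z, 0.0, 2.0))

# E_P[f] = E_Q[f * dP/dQ]; with f(x) = x^2 the target value is Var_P = 1.
est = (w * z**2).mean()

assert abs(est - 1.0) < 0.02
```

The sketch works here because the proposal has heavier tails than the target, so the weights stay bounded; a proposal narrower than the target makes $dP/dQ$ blow up in the tails and the estimator's variance can become infinite.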


Four-panel timeline showing the conceptual arc of the Measure & Integration track: σ-algebras (the language) → Lebesgue integral (the engine) → Lp spaces (the geometry) → Radon-Nikodym (the bridge to probability)

References

  1. Royden, H. L. & Fitzpatrick, P. M. (2010). Real Analysis (4th ed.). Chapter 18 (Radon-Nikodym, Lebesgue decomposition). Closest to our von Neumann proof path.
  2. Folland, G. B. (1999). Real Analysis: Modern Techniques and Their Applications (2nd ed.). Chapter 3 (signed measures, R-N theorem, Lebesgue decomposition). Concise classical treatment.
  3. Billingsley, P. (2012). Probability and Measure (anniversary ed.). Chapters 31–33. Excellent for the probability interpretation: densities, conditional expectation, regular conditional distributions.
  4. Stein, E. M. & Shakarchi, R. (2005). Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Chapter 6. Clean presentation of the von Neumann proof using L² projection.