Measure & Integration · advanced · 50 min read

Lp Spaces

Function spaces with norms — from Hölder and Minkowski inequalities through Riesz-Fischer completeness to the geometric engine of regression, regularization, and modern function-space learning.

Abstract. Lp spaces are where functions become geometry. The Lebesgue integral from Topic 26 lets us integrate individual functions; Lp spaces organize those functions into vector spaces with norms, turning questions about function approximation into questions about distances and projections. We define the Lp norm, prove the three fundamental inequalities (Jensen, Hölder, Minkowski), and then prove the Riesz-Fischer theorem: Lp is complete, meaning every Cauchy sequence converges — using the Dominated Convergence Theorem from Topic 26 as the key tool. Completeness is what makes Lp spaces Banach spaces, and L2 a Hilbert space. The geometry of the Lp unit ball — diamond at p=1, circle at p=2, square at p=∞ — directly explains why L1 regularization produces sparsity and L2 regularization produces smoothness. Least-squares regression is an L2 projection. Fourier neural operators live in L2. Score matching minimizes an L2 distance between log-density gradients. This topic builds the function-space geometry that machine learning assumes you already have.

Where this leads → formalML

  • formalML Least-squares regression minimizes ||y - Xβ||₂², which is an L² norm. The existence and uniqueness of the minimizer follows from L² projection onto a closed subspace — the projection theorem previewed in Section 9.
  • formalML Fourier neural operators learn mappings between L²(ℝᵈ) spaces. The completeness (Riesz-Fischer) and inner product structure of L² are prerequisites for their well-definedness.
  • formalML Score matching minimizes ∫||∇ log p_θ - ∇ log p_data||² dx, an L² norm of gradient differences. Well-definedness requires ∇ log p ∈ L².
  • formalML The Kantorovich dual formulation of Wasserstein-p distances uses (Lp)* = Lq duality. The constraint set is described using Lp norms.
  • formalML Reproducing kernel Hilbert spaces (RKHS) are subspaces of L². Kernel evaluations are inner products in L², and the representer theorem uses L² completeness.

1. Three Puzzles $L^p$ Spaces Solve

Topic 26 built the Lebesgue integral and proved the three convergence theorems — Monotone Convergence, Fatou, and Dominated Convergence. We can now integrate measurable functions, take limits of integrals, and swap limits and integrals under modest hypotheses. That is enough to evaluate integrals, but it is not yet enough to organize them. The functions we integrate live as scattered objects: each one has its own integral, but we have no way to talk about how close two functions are, no way to talk about a sequence of functions converging to another function in a useful sense, and no way to do the kind of geometric reasoning — distances, projections, perpendicularity — that we routinely do with vectors in $\mathbb{R}^n$.

This topic fixes that. We organize measurable functions into vector spaces — the $L^p$ spaces — equipped with norms that turn function-space questions into geometric ones. Three puzzles motivate the construction.

1. When does least-squares regression have a unique solution? Linear regression finds the function $\hat f$ in some class $\mathcal{F}$ that minimizes $\int (y - f)^2 \, d\mu$. This is a “closest point” problem: we want the element of $\mathcal{F}$ that is nearest to the data $y$. But “nearest” requires a distance on functions. And for the minimization to have a solution at all, we need the space to be complete — minimizing sequences should converge to something inside the space, not slip out the side. We will see in Sections 6 and 7 that $L^2$ provides exactly this: a distance ($\|f - g\|_2$) and a completeness theorem (Riesz–Fischer) that guarantees minimizing sequences land somewhere legal.

2. Why do Fourier series converge “in energy” but not pointwise? Topic 22 showed that the partial Fourier sums of a discontinuous function — say, a square wave — overshoot at the discontinuity by about 9% no matter how many terms we add. This is the Gibbs phenomenon, and it tells us the partial sums do not converge pointwise to the target at the jump. Yet the integrated squared error $\int |f - S_n f|^2 \, dx$ goes to zero. So in some sense the partial sums do converge — just not pointwise. The right notion of convergence is convergence in the $L^2$ norm, which measures “how much energy is in the error” rather than “how big is the error at each point.” The two notions are different, and the difference is exactly the difference between asking “how does $f$ behave at $x$?” and “how does $f$ behave on average?”

3. What does it mean for two probability densities to be “close”? Generative models in machine learning are trained by minimizing some kind of distance between the learned density $p_\theta$ and the data density $p_{\text{data}}$. But “distance between densities” is not a single thing — it depends on which $L^p$ norm we use to measure it. The $L^1$ distance $\int |p_\theta - p_{\text{data}}| \, dx$ is the total variation: it cares about every region where the two densities disagree. The $L^2$ distance penalizes large pointwise errors quadratically, smoothing out the influence of small errors and amplifying the influence of large ones. The $L^\infty$ distance demands uniform agreement: even one point of large disagreement makes the distance large. The choice of $p$ encodes a modeling decision about which kinds of errors we are willing to tolerate.

These three puzzles all reduce to the same demand: we need a vector space of functions, equipped with a norm, in which we can do geometric reasoning. That is what $L^p$ provides.

2. The $L^p$ Norm and $L^p$ Spaces

We start with the definition of the norm and then quotient by an equivalence relation to make it a true norm rather than a seminorm. The construction is one of the cleanest examples of “promote a seminorm to a norm by killing the kernel” in analysis.

📐 Definition 1 (The $L^p$ seminorm)

Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, $1 \leq p < \infty$, and $f$ a measurable function $\Omega \to \mathbb{R}$. The $L^p$ seminorm of $f$ is

$$\|f\|_p \;=\; \left( \int |f|^p \, d\mu \right)^{1/p}.$$

This is a seminorm, not a norm: $\|f\|_p \geq 0$, scalar multiplication pulls through ($\|\alpha f\|_p = |\alpha| \, \|f\|_p$), and the triangle inequality holds (Minkowski, Section 5). What it does not satisfy is the definiteness axiom: $\|f\|_p = 0$ does not imply $f = 0$ — only $f = 0$ almost everywhere.

The failure of definiteness is the whole reason we need the next step. If $f = 0$ on a set of full measure but is non-zero on a measure-zero set, the integral $\int |f|^p \, d\mu = 0$ and so $\|f\|_p = 0$ — yet $f$ is not the zero function. To get a true norm we have to identify functions that agree almost everywhere.

📐 Definition 2 (The space $L^p(\mu)$)

Define the equivalence relation $f \sim g$ on measurable functions by

$$f \sim g \;\iff\; f = g \text{ $\mu$-almost everywhere}.$$

The space $L^p(\mu)$ is the set of equivalence classes $[f]$ of measurable functions for which $\|f\|_p < \infty$. On $L^p(\mu)$, the seminorm becomes a norm:

$$\|[f]\|_p \;=\; 0 \;\iff\; [f] = [0],$$

because $\|f\|_p = 0$ now means “$f = 0$ a.e.”, which is exactly the statement that $[f]$ is the zero equivalence class.

The quotient is not just a technicality. Without it, we would have a seminormed space: a vector space where some non-zero elements have zero “length.” Quotient by the kernel (the functions of seminorm zero) and the seminorm becomes a norm. The norm is what gives us a metric ($d(f, g) = \|f - g\|_p$), and the metric is what gives us geometry — open balls, convergence, completeness, projections. From now on we will write $f \in L^p$ when we mean the equivalence class $[f] \in L^p$.

The case $p = \infty$ needs its own definition because the formula $(\int |f|^p)^{1/p}$ has no obvious meaning when $p$ is infinite. The right notion is the essential supremum: the smallest constant that bounds $|f|$ except possibly on a null set.

📐 Definition 3 (The essential supremum and $L^\infty(\mu)$)

For a measurable function $f$, the essential supremum is

$$\operatorname{ess\,sup} |f| \;=\; \inf \bigl\{ M \geq 0 \;:\; |f(x)| \leq M \text{ for $\mu$-a.e. } x \bigr\}.$$

The space $L^\infty(\mu)$ consists of equivalence classes (under a.e.-equality) of measurable functions $f$ for which $\operatorname{ess\,sup} |f| < \infty$. Its norm is

$$\|f\|_\infty \;=\; \operatorname{ess\,sup} |f|.$$

The essential supremum is the natural measure-theoretic version of the supremum: it ignores what happens on null sets. A function that equals $1$ everywhere except at a single point where it equals $10^{100}$ has essential supremum $1$, even though its actual supremum is $10^{100}$ — because the offending point is a null set, and a.e.-equivalence renders it invisible.

📝 Example 1 ($x^{-1/3}$ is in $L^p$ for $p < 3$ but not $L^3$)

Consider $f(x) = x^{-1/3}$ on $(0, 1]$ with Lebesgue measure. We compute

$$\|f\|_p^p \;=\; \int_0^1 x^{-p/3} \, dx \;=\; \begin{cases} \dfrac{1}{1 - p/3} & \text{if } p < 3, \\ \infty & \text{if } p \geq 3. \end{cases}$$

So $f \in L^p((0, 1])$ for every $p < 3$, with $\|f\|_p$ growing as $p$ approaches $3$. At $p = 3$ the integral $\int_0^1 x^{-1} \, dx$ diverges logarithmically and the norm becomes infinite, so $f \notin L^3$. The number $3$ is called the critical exponent for this function: for any $p$ below it the function is integrable to the $p$-th power, and for any $p$ at or above it, it is not. Different functions have different critical exponents, and that variability is what makes the family of $L^p$ spaces interesting — the same function can be a member of one space and not another.
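The critical exponent is easy to see numerically. Below is a minimal quadrature sketch (not part of the original article; the helper `truncated_lp_p` is ours) that approximates the truncated integrals $\int_\varepsilon^1 x^{-p/3}\,dx$: for $p < 3$ they converge to $1/(1 - p/3)$ as $\varepsilon \to 0$, while at $p = 3$ they equal $\log(1/\varepsilon)$ and diverge.

```python
import numpy as np

def truncated_lp_p(p, eps, n=50_000):
    """Midpoint rule on a log-spaced grid for  int_eps^1 x^(-p/3) dx."""
    x = np.geomspace(eps, 1.0, n + 1)
    mid = np.sqrt(x[:-1] * x[1:])          # geometric midpoints suit power laws
    return float(np.sum(mid ** (-p / 3) * np.diff(x)))

# For p < 3 the truncated integral converges to 1/(1 - p/3) as eps -> 0:
for p in (1.0, 2.0):
    vals = [truncated_lp_p(p, eps) for eps in (1e-4, 1e-8, 1e-12)]
    print(f"p = {p}: {vals}, closed form {1 / (1 - p / 3):.4f}")

# At p = 3 the truncated integral equals log(1/eps): it diverges as eps -> 0.
for eps in (1e-2, 1e-4, 1e-8):
    print(f"p = 3, eps = {eps:g}: {truncated_lp_p(3.0, eps):.3f}")
```

The log-spaced grid matters: a uniform grid cannot resolve the singularity at $0$, but on a geometric grid the power-law integrand varies slowly within each cell.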

📝 Example 2 (A function in every finite $L^p$ but not in $L^\infty$)

Let $f(x) = (\log(1/x))^{-1}$ on $(0, 1/2)$, extended by zero outside. As $x \to 0^+$, $\log(1/x) \to \infty$, so $f(x) \to 0$ — the function is bounded near $0$. And as $x \to 1/2$ from below, $\log(1/x) \to \log 2 > 0$, so $f$ approaches a finite positive value. Nothing blows up anywhere:

$$\sup_{x \in (0, 1/2)} f(x) \;=\; \lim_{x \to 1/2^-} \frac{1}{\log(1/x)} \;=\; \frac{1}{\log 2} \;\approx\; 1.44.$$

So this particular $f$ is in fact in $L^\infty$. To get a function in every finite $L^p$ but not in $L^\infty$, take instead $f(x) = (\log(1/x))^{1/2}$ on $(0, 1)$: the integrals $\int_0^1 (\log(1/x))^{p/2} \, dx$ are all finite (the logarithm grows slower than any power, so integrating against it converges for every $p$), but $f(x) \to \infty$ as $x \to 0$, so $\operatorname{ess\,sup} |f| = \infty$ and $f \notin L^\infty$. The membership patterns of $L^p$ as $p$ varies are surprisingly subtle, and they reflect different ways a function can fail to be “small.”
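The finiteness claim for $(\log(1/x))^{1/2}$ has a closed form worth checking: substituting $u = \log(1/x)$ turns $\int_0^1 (\log(1/x))^{p/2}\,dx$ into $\int_0^\infty u^{p/2} e^{-u}\,du = \Gamma(p/2 + 1)$, finite for every $p$. A numeric sketch (our helper `lp_p_norm_pow`, not from the article) confirms the match:

```python
import math
import numpy as np

# f(x) = (log(1/x))^(1/2) on (0,1): in every finite L^p, yet unbounded at 0.
# Closed form via u = log(1/x):  ||f||_p^p = Gamma(p/2 + 1)  for every p.
def lp_p_norm_pow(p, n=400_000):
    """Midpoint rule for  int_0^1 (log(1/x))^(p/2) dx."""
    x = np.linspace(0.0, 1.0, n + 1)
    mid = 0.5 * (x[:-1] + x[1:])
    return float(np.sum(np.log(1.0 / mid) ** (p / 2) * np.diff(x)))

for p in (1, 2, 4, 8):
    print(p, lp_p_norm_pow(p), math.gamma(p / 2 + 1))
```

Every norm is finite even though $f$ itself is unbounded near $0$ — exactly the membership pattern the example describes.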

💡 Remark 1 (Notation: $f$ and $[f]$)

We will write $f \in L^p$ throughout this topic when we technically mean $[f] \in L^p$. The abuse is universal in analysis and rarely causes confusion: when a statement holds for $f$, it holds for every representative of $[f]$, since the statements are themselves only sensitive to a.e.-behavior. Two functions in the same equivalence class are interchangeable for every purpose we care about — integrals against them, $L^p$ norms, convergence in $L^p$ — so working with representatives instead of classes is harmless and notationally cleaner.

💡 Remark 2 (Why the quotient is essential)

The quotient by a.e.-equivalence is not optional. Without it, the seminorm $\|\cdot\|_p$ is not a norm, the functional $f \mapsto \|f\|_p$ does not separate points, and the distance $d(f, g) = \|f - g\|_p$ is not a metric — two distinct functions can be at distance zero. None of the geometric constructions we want — open balls, convergence in norm, completeness, orthogonal projection — make sense in a seminormed space, because the topology is not Hausdorff. The quotient promotes the seminorm to a norm by collapsing every equivalence class to a single point, and that single move makes the entire theory of $L^p$ spaces possible.

The interactive viz below makes the dependence of $\|f\|_p$ on $p$ concrete: pick a function from the dropdown, slide $p$, and watch how the integrand $|f(x)|^p$ redistributes mass as $p$ changes. For large $p$ the mass concentrates where $|f|$ is largest, and the norm approaches the essential supremum.

[Interactive explorer — function $|f(x)|$ (blue); mass $|f(x)|^p$ (red, shaded); curve $\|f\|_p$ vs. $p$ (cyan).]

A smooth half-cycle of a sine wave on [0, 1]. As p grows, |f|^p concentrates where |f| is largest — for large p the integrand is dominated by the peak, and ||f||_p approaches the essential supremum (||f||_∞).
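The same experiment runs in a few lines of numpy (a sketch, not the viz itself; the helper `lp_norm` is ours). On a probability space such as $[0,1]$ with Lebesgue measure, $\|f\|_p$ is non-decreasing in $p$ and converges to $\|f\|_\infty$; for $f(x) = \sin(\pi x)$ the limits are $\|f\|_1 = 2/\pi$, $\|f\|_2 = 1/\sqrt{2}$, and $\|f\|_\infty = 1$.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200_001)
mid = 0.5 * (x[:-1] + x[1:])
dx = np.diff(x)

def lp_norm(p):
    """Midpoint-rule approximation of ||sin(pi .)||_p on [0, 1]."""
    return float(np.sum(np.abs(np.sin(np.pi * mid)) ** p * dx) ** (1.0 / p))

# The norms climb monotonically toward ess sup |f| = 1 as p grows.
for p in (1, 2, 8, 32, 128):
    print(p, lp_norm(p))
```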

3. Jensen’s Inequality

Before we prove Hölder and Minkowski, we need one preparatory inequality that is interesting in its own right and that also previews the proof technique. Jensen’s inequality says that integrating a convex function against a probability measure gives at least the convex function applied to the integral. It is the single inequality that connects measure theory to information theory: every non-negativity result about KL divergence, mutual information, and entropy is a Jensen application.

🔷 Theorem 1 (Jensen's inequality)

Let $(\Omega, \mathcal{F}, \mu)$ be a probability space ($\mu(\Omega) = 1$), let $f \in L^1(\mu)$, and let $\phi: \mathbb{R} \to \mathbb{R}$ be convex. Then

$$\phi\!\left( \int f \, d\mu \right) \;\leq\; \int \phi \circ f \, d\mu.$$

Proof.

The proof uses one fact about convex functions: at every point, a convex function has a supporting line. That is, for every $a \in \mathbb{R}$ there exists a slope $\lambda \in \mathbb{R}$ (the right or left derivative of $\phi$ at $a$, or any value in between if $\phi$ is not differentiable at $a$) such that

$$\phi(t) \;\geq\; \phi(a) + \lambda (t - a) \quad \text{for all } t \in \mathbb{R}.$$

This is convexity restated geometrically: the graph of $\phi$ lies above every one of its supporting lines.

Step 1: Choose $a$. Set $a = \int f \, d\mu$ — the integral of $f$ against $\mu$. Since $\mu$ is a probability measure and $f \in L^1$, this integral is a finite real number. Choose any supporting line of $\phi$ at $a$ with slope $\lambda$.

Step 2: Apply the supporting-line inequality pointwise. For each $x \in \Omega$ we have $f(x) \in \mathbb{R}$, so the supporting-line inequality with $t = f(x)$ gives

$$\phi(f(x)) \;\geq\; \phi(a) + \lambda (f(x) - a).$$

This holds at every point $x$, with the same constants $\phi(a)$ and $\lambda$.

Step 3: Integrate against $\mu$. Both sides of the pointwise inequality are measurable functions of $x$, and the right-hand side is integrable because it is an affine function of $f$ and $f \in L^1$. Integrating preserves the inequality:

$$\int \phi(f(x)) \, d\mu(x) \;\geq\; \int \bigl[ \phi(a) + \lambda (f(x) - a) \bigr] \, d\mu(x).$$

The right side splits using linearity of the integral:

$$\int \phi(f) \, d\mu \;\geq\; \phi(a) \cdot \mu(\Omega) + \lambda \int (f - a) \, d\mu \;=\; \phi(a) + \lambda \cdot 0 \;=\; \phi(a).$$

We used $\mu(\Omega) = 1$ (probability measure) for the first term and $\int (f - a) \, d\mu = \int f \, d\mu - a = a - a = 0$ for the second. Substituting $a = \int f \, d\mu$ back into $\phi(a)$ gives the conclusion.

The proof is short but geometrically sharp: convexity says the graph of $\phi$ lies above its supporting lines, and integration is linear, so anything you can say about a supporting line passes through the integral unchanged — the convex function can only “lose” in the direction the inequality allows.

📝 Example 3 (KL divergence is non-negative — the first Jensen application)

Let $P$ and $Q$ be probability measures on $\Omega$ with $P \ll Q$ (i.e., $P$ is absolutely continuous with respect to $Q$ — every $Q$-null set is $P$-null), and let $g = dP/dQ$ be the Radon–Nikodym density (we will formalize Radon–Nikodym in Topic 28; for now treat $g$ as the “ratio” of $P$ to $Q$). The Kullback–Leibler divergence of $P$ from $Q$ is

$$D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \int \log g \, dP \;=\; -\int \log(1/g) \, dP.$$

The function $\phi(t) = -\log t$ is convex on $(0, \infty)$. Apply Jensen’s inequality to $f = 1/g$ against the probability measure $P$:

$$-\log\!\left( \int (1/g) \, dP \right) \;\leq\; \int -\log(1/g) \, dP \;=\; D_{\mathrm{KL}}(P \,\|\, Q).$$

Now compute the integral on the left. Since $g = dP/dQ$, we have $\int (1/g) \, dP = \int (1/g) \cdot g \, dQ = \int 1 \, dQ = 1$. So $-\log(1) = 0$, and we get

$$0 \;\leq\; D_{\mathrm{KL}}(P \,\|\, Q),$$

with equality iff $1/g$ is constant $P$-a.e. (the equality case of Jensen for strictly convex $\phi$), which means $g$ is constant, which means $g = 1$ a.e., which means $P = Q$. This is Gibbs’ inequality, the foundational non-negativity result of information theory.
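A discrete sanity check of Gibbs’ inequality (our sketch, with counting measure standing in for $\mu$): for probability vectors $p, q$ with $q > 0$ everywhere, $D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log(p_i/q_i) \geq 0$, with equality exactly when $p = q$.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    mask = p > 0                       # convention: 0 * log(0/q) = 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = rng.random(10); p /= p.sum()
q = rng.random(10); q /= q.sum()
print(kl(p, q))   # strictly positive, since these random p and q differ
print(kl(p, p))   # the equality case P = Q gives divergence 0
```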

💡 Remark 3 (Jensen connects measure theory to information theory)

Jensen’s inequality is the bridge from measure theory to information theory, and it is the only bridge most people remember after they leave grad school. Every non-negativity statement in information theory — Gibbs’ inequality (KL $\geq 0$), the chain rule for KL divergence, the data processing inequality, the convexity of mutual information — is a Jensen application. Variational methods in machine learning rest on the same foundation: the ELBO (evidence lower bound) used in variational autoencoders and Bayesian neural networks is Jensen applied to $\log p(x) = \log \int p(x, z) \, dz$, and the inequality slack is exactly the KL divergence between the variational posterior and the true posterior. Forward link: Variational Inference.

4. Hölder’s Inequality

Hölder’s inequality is the central duality inequality of $L^p$ spaces. It says that the integral of a product of two functions is bounded by the product of their respective $L^p$ norms — provided the exponents are conjugate. The notion of conjugate exponents is the engine of the entire $L^p$ duality theory we will develop in Section 10.

We prove Hölder via Young’s inequality, which is itself a one-line consequence of the convexity of the exponential. Young’s inequality is the pointwise inequality that, when integrated, becomes Hölder.

🔷 Proposition 1 (Young's inequality)

For $a, b \geq 0$ and conjugate exponents $p, q > 1$ satisfying $1/p + 1/q = 1$:

$$ab \;\leq\; \frac{a^p}{p} + \frac{b^q}{q}.$$

Equality holds iff $a^p = b^q$.

Proof.

The cases $a = 0$ or $b = 0$ are trivial, so assume $a, b > 0$.

Step 1: Rewrite the product as an exponential. Take logarithms inside an exponential:

$$ab \;=\; \exp(\log a + \log b) \;=\; \exp\!\left( \frac{1}{p} \cdot p \log a + \frac{1}{q} \cdot q \log b \right).$$

The exponent on the right is a convex combination of $p \log a$ and $q \log b$, with weights $1/p$ and $1/q$ that sum to $1$ by the conjugacy condition.

Step 2: Apply convexity of the exponential. The function $\phi(t) = e^t$ is convex. For any convex combination $\alpha u + \beta v$ with $\alpha, \beta \geq 0$ and $\alpha + \beta = 1$, convexity gives $e^{\alpha u + \beta v} \leq \alpha e^u + \beta e^v$. With $\alpha = 1/p$, $\beta = 1/q$, $u = p \log a$, and $v = q \log b$:

$$\exp\!\left( \frac{1}{p} \cdot p \log a + \frac{1}{q} \cdot q \log b \right) \;\leq\; \frac{1}{p} \exp(p \log a) + \frac{1}{q} \exp(q \log b).$$

Step 3: Simplify. Note that $\exp(p \log a) = a^p$ and $\exp(q \log b) = b^q$. Combining the previous two displays:

$$ab \;\leq\; \frac{a^p}{p} + \frac{b^q}{q}.$$

Equality in the convexity step requires $p \log a = q \log b$ (the two points being averaged are the same), which after exponentiation is $a^p = b^q$.
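Both the inequality and its equality case are easy to exercise numerically. The sketch below (the helper `young_gap` is ours) checks that the gap $a^p/p + b^q/q - ab$ is non-negative on random inputs, and that choosing $b = a^{p/q}$ — which forces $a^p = b^q$ — drives the gap to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.random(1000) * 5.0
b = rng.random(1000) * 5.0

def young_gap(a, b, p):
    """Gap in Young's inequality; >= 0, zero exactly when a^p = b^q."""
    q = p / (p - 1.0)                  # conjugate exponent
    return a ** p / p + b ** q / q - a * b

for p in (1.5, 2.0, 3.0):
    q = p / (p - 1.0)
    print(p, young_gap(a, b, p).min())                     # nonnegative
    print(p, np.abs(young_gap(a, a ** (p / q), p)).max())  # ~0: equality case
```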

Young’s inequality is a pointwise statement: it bounds the product of two non-negative numbers by an explicit sum involving their powers. Hölder’s inequality is what you get by integrating Young pointwise after a clever normalization step — it converts a pointwise bound on numbers into an integrated bound on functions.

🔷 Theorem 2 (Hölder's inequality)

Let $f \in L^p(\mu)$ and $g \in L^q(\mu)$ with $1 \leq p \leq \infty$ and $1/p + 1/q = 1$ (with the convention $1/\infty = 0$, so $p = 1$ pairs with $q = \infty$ and vice versa). Then $fg \in L^1(\mu)$ and

$$\|fg\|_1 \;\leq\; \|f\|_p \, \|g\|_q.$$

Proof.

We prove the case $1 < p < \infty$ (so $1 < q < \infty$ as well). The endpoint cases $p = 1, q = \infty$ and $p = \infty, q = 1$ are immediate from the definition of $\|g\|_\infty$ as the essential supremum: $|fg| \leq \|g\|_\infty \cdot |f|$ a.e., so $\int |fg| \leq \|g\|_\infty \int |f| = \|f\|_1 \|g\|_\infty$.

Step 1: Trivial cases. If $\|f\|_p = 0$, then $f = 0$ a.e., so $fg = 0$ a.e., and both sides of the inequality are zero. Similarly for $\|g\|_q = 0$. So we may assume $\|f\|_p > 0$ and $\|g\|_q > 0$.

Step 2: Normalize. Define the rescaled functions

$$\tilde f \;=\; \frac{f}{\|f\|_p}, \qquad \tilde g \;=\; \frac{g}{\|g\|_q}.$$

By construction $\|\tilde f\|_p = 1$ and $\|\tilde g\|_q = 1$. Proving $\|\tilde f \tilde g\|_1 \leq 1$ for the normalized pair will give $\|fg\|_1 \leq \|f\|_p \|g\|_q$ after multiplying through by $\|f\|_p \|g\|_q$.

Step 3: Apply Young’s inequality pointwise. For each $x$, set $a = |\tilde f(x)|$ and $b = |\tilde g(x)|$ in Young’s inequality:

$$|\tilde f(x) \tilde g(x)| \;=\; |\tilde f(x)| \cdot |\tilde g(x)| \;\leq\; \frac{|\tilde f(x)|^p}{p} + \frac{|\tilde g(x)|^q}{q}.$$

This holds at every point $x$, with the same $p$ and $q$.

Step 4: Integrate against $\mu$. Both sides are measurable functions of $x$, and the right-hand side is integrable since $\tilde f \in L^p$ and $\tilde g \in L^q$. Integration preserves the inequality:

$$\int |\tilde f \tilde g| \, d\mu \;\leq\; \frac{1}{p} \int |\tilde f|^p \, d\mu + \frac{1}{q} \int |\tilde g|^q \, d\mu.$$

By construction, $\int |\tilde f|^p \, d\mu = \|\tilde f\|_p^p = 1^p = 1$, and likewise $\int |\tilde g|^q \, d\mu = 1$. The right side becomes $1/p + 1/q = 1$, so

$$\|\tilde f \tilde g\|_1 \;=\; \int |\tilde f \tilde g| \, d\mu \;\leq\; 1.$$

Step 5: Un-normalize. Multiplying both sides by $\|f\|_p \|g\|_q$:

$$\|fg\|_1 \;=\; \|f\|_p \|g\|_q \cdot \|\tilde f \tilde g\|_1 \;\leq\; \|f\|_p \|g\|_q.$$

The proof is a perfect example of the “integrate a pointwise inequality” technique. Young’s inequality is a pointwise bound on numbers; integrating it after a normalization step turns it into an integrated bound on functions. The conjugacy condition $1/p + 1/q = 1$ is what makes the right-hand side simplify to $1$ — without it, the proof would not close.

📝 Example 4 (Cauchy–Schwarz is Hölder at $p = q = 2$)

The most familiar special case of Hölder is the Cauchy–Schwarz inequality:

$$\int |fg| \, d\mu \;\leq\; \left( \int |f|^2 \, d\mu \right)^{1/2} \left( \int |g|^2 \, d\mu \right)^{1/2}.$$

This is Hölder’s inequality with $p = q = 2$ — and the conjugacy condition $1/2 + 1/2 = 1$ checks out. Cauchy–Schwarz is the inequality that makes $L^2$ an inner product space: if we define $\langle f, g \rangle = \int fg \, d\mu$, then Cauchy–Schwarz says $|\langle f, g \rangle| \leq \|f\|_2 \|g\|_2$, which is the defining property of an inner product. Section 9 develops this perspective in detail.

📝 Example 5 (Hölder with explicit power functions)

Consider $f(x) = x^{1/3}$ and $g(x) = x^{1/6}$ on $[0, 1]$ with Lebesgue measure, $p = 3$ and $q = 3/2$ (note $1/3 + 2/3 = 1$). Compute each piece of Hölder:

$$\|fg\|_1 \;=\; \int_0^1 x^{1/3} \cdot x^{1/6} \, dx \;=\; \int_0^1 x^{1/2} \, dx \;=\; \frac{2}{3} \;\approx\; 0.6667.$$

$$\|f\|_3 \;=\; \left( \int_0^1 (x^{1/3})^3 \, dx \right)^{1/3} \;=\; \left( \int_0^1 x \, dx \right)^{1/3} \;=\; \left( \frac{1}{2} \right)^{1/3} \;\approx\; 0.7937.$$

$$\|g\|_{3/2} \;=\; \left( \int_0^1 (x^{1/6})^{3/2} \, dx \right)^{2/3} \;=\; \left( \int_0^1 x^{1/4} \, dx \right)^{2/3} \;=\; \left( \frac{4}{5} \right)^{2/3} \;\approx\; 0.8618.$$

The product $\|f\|_3 \cdot \|g\|_{3/2} \approx 0.7937 \cdot 0.8618 \approx 0.6840$, so Hölder gives

$$0.6667 \;\leq\; 0.6840.$$

The inequality holds with a ratio of about $0.974$ — close to equality but not sharp. The interactive viz below lets you slide $p$ and watch this ratio evolve, including for choices that drive the ratio toward $1$ (the Young’s-inequality saturation case).
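The example is also a good quadrature exercise. The sketch below (our code, not the article’s viz) reproduces the three numbers and the ratio $\approx 0.9747$, and then checks the sharp pair $g_2 = x^{2/3}$ discussed in Remark 4, for which the ratio should be $1$:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1_000_001)
mid = 0.5 * (x[:-1] + x[1:])
dx = np.diff(x)

def integral(vals):
    return float(np.sum(vals * dx))   # midpoint rule on [0, 1]

f = mid ** (1.0 / 3.0)
g = mid ** (1.0 / 6.0)
lhs = integral(f * g)                       # ||fg||_1       = 2/3
nf = integral(f ** 3) ** (1.0 / 3.0)        # ||f||_3        = (1/2)^(1/3)
ng = integral(g ** 1.5) ** (2.0 / 3.0)      # ||g||_{3/2}    = (4/5)^(2/3)
print(lhs, nf * ng, lhs / (nf * ng))        # Hölder: lhs <= nf * ng

# Sharp pair: g2 = x^(2/3) makes |f|^3 = x = |g2|^(3/2), so the ratio is 1.
g2 = mid ** (2.0 / 3.0)
print(integral(f * g2) / (nf * integral(g2 ** 1.5) ** (2.0 / 3.0)))
```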

[Interactive explorer — readouts for the default preset: $\|fg\|_1 = 0.18930$, $\|f\|_p \cdot \|g\|_q = 0.29823$, ratio $= 0.6348$.]

A sine half-cycle and a quadratic — Hölder is strict and the ratio sits well below 1. The bar chart on the right compares the two sides of the inequality. Slide p (or toggle Minkowski) to see how the ratio responds.

💡 Remark 4 (When does Hölder hold with equality?)

Tracing through the proof, equality in Hölder is equivalent to equality in the Young’s-inequality step at $\mu$-almost every point, which (Proposition 1) requires $|\tilde f|^p = |\tilde g|^q$ almost everywhere. Un-normalizing, this becomes: there exist constants $\alpha, \beta \geq 0$, not both zero, such that $\alpha |f|^p = \beta |g|^q$ a.e. Functions of this form — where $|f|^p$ and $|g|^q$ are proportional — are called conjugate pairs.

The sharpness condition is surprisingly fragile: it is not enough for $f$ and $g$ to “look similar.” Consider $f(x) = x^{1/3}$ and $g(x) = x^{1/6}$ on $[0, 1]$ with $p = 3$, $q = 3/2$ (the “power-pair” preset in the explorer above). The powers look naturally matched, but the pointwise check fails: $|f|^p = x$ while $|g|^q = (x^{1/6})^{3/2} = x^{1/4}$, so $|f|^p$ and $|g|^q$ differ by a factor of $x^{3/4}$ — they are not proportional, and the Hölder ratio sits at about $0.9747$ rather than $1$.

To actually get equality at $p = 3$, $q = 3/2$, we need $|f|^p \propto |g|^q$. The simplest way: take $f = x^{\alpha}$ and $g = x^{\beta}$ with $\alpha p = \beta q$, so $3\alpha = (3/2)\beta$, so $\beta = 2\alpha$. Choosing $\alpha = 1/3$ gives $g = x^{2/3}$, and now $|f|^p = x = |g|^q$ exactly. This is the “power-pair-sharp” preset in the explorer — flip to it and the ratio snaps to $1$ to within quadrature precision. Sharp Hölder pairs are the analogue of eigenvalue–eigenvector pairs in a finite-dimensional inner product space: they expose the “tightness directions” of the inequality.

5. Minkowski’s Inequality

Minkowski’s inequality is the triangle inequality for $L^p$. It is the inequality that promotes $\|\cdot\|_p$ from a “size functional” to a norm: the triangle inequality is exactly the axiom that the seminorm definition did not give us for free. We prove Minkowski directly from Hölder.

🔷 Theorem 3 (Minkowski's inequality)

Let $f, g \in L^p(\mu)$ with $1 \leq p \leq \infty$. Then $f + g \in L^p(\mu)$ and

$$\|f + g\|_p \;\leq\; \|f\|_p + \|g\|_p.$$

Proof.

The cases $p = 1$ and $p = \infty$ are immediate: $|f + g| \leq |f| + |g|$ pointwise, and integrating (for $p = 1$) or taking the essential supremum (for $p = \infty$) gives the result. We prove the case $1 < p < \infty$ via Hölder.

Step 1: Bound $|f + g|^p$ in terms of $|f|$ and $|g|$. The triangle inequality for absolute values gives $|f + g| \leq |f| + |g|$ pointwise, so

$$|f + g|^p \;=\; |f + g| \cdot |f + g|^{p-1} \;\leq\; \bigl( |f| + |g| \bigr) \cdot |f + g|^{p-1}.$$

Distributing:

$$|f + g|^p \;\leq\; |f| \cdot |f + g|^{p-1} + |g| \cdot |f + g|^{p-1}.$$

Step 2: Integrate. Integrating both sides against $\mu$:

$$\|f + g\|_p^p \;=\; \int |f + g|^p \, d\mu \;\leq\; \int |f| \cdot |f + g|^{p-1} \, d\mu + \int |g| \cdot |f + g|^{p-1} \, d\mu.$$

Step 3: Apply Hölder to each term. Let $q = p / (p - 1)$ be the conjugate exponent of $p$, so $1/p + 1/q = 1$. Apply Hölder’s inequality to the first integral with $f$ in the $L^p$ slot and $|f + g|^{p-1}$ in the $L^q$ slot:

$$\int |f| \cdot |f + g|^{p-1} \, d\mu \;\leq\; \|f\|_p \cdot \left( \int |f + g|^{(p-1)q} \, d\mu \right)^{1/q}.$$

Compute $(p - 1) q = (p - 1) \cdot p / (p - 1) = p$. So the right side is $\|f\|_p \cdot \|f + g\|_p^{p/q}$. Similarly for the second integral:

$$\int |g| \cdot |f + g|^{p-1} \, d\mu \;\leq\; \|g\|_p \cdot \|f + g\|_p^{p/q}.$$

Step 4: Combine and divide. Substituting back into Step 2:

$$\|f + g\|_p^p \;\leq\; \bigl( \|f\|_p + \|g\|_p \bigr) \cdot \|f + g\|_p^{p/q}.$$

If $\|f + g\|_p = 0$ the inequality holds trivially; otherwise we may divide both sides by $\|f + g\|_p^{p/q}$. (The division is legitimate because $\|f + g\|_p^{p/q}$ is finite: convexity of $t \mapsto t^p$ gives $|f + g|^p \leq 2^{p-1}(|f|^p + |g|^p)$ pointwise, so $f + g \in L^p$.) Note that $p - p/q = p(1 - 1/q) = p \cdot (1/p) = 1$ by conjugacy, so $\|f + g\|_p^p / \|f + g\|_p^{p/q} = \|f + g\|_p^1$. The result is

$$\|f + g\|_p \;\leq\; \|f\|_p + \|g\|_p.$$

The trick in the proof is “split $|f+g|^p$ as $|f+g| \cdot |f+g|^{p-1}$ and apply Hölder,” where the conjugate exponent $q = p/(p-1)$ is engineered exactly so that $(p-1)q = p$ — the inner integrals collapse back to powers of $\|f+g\|_p$, which we can then divide out. This is a small algebraic miracle that only works because of the conjugacy condition.

📝 Example 6 (Minkowski with disjoint indicators)

Take $f = \mathbf{1}_{[0, 1/2]}$ and $g = \mathbf{1}_{[1/2, 1]}$ on $[0, 1]$ with Lebesgue measure, and $p = 2$. Compute the three quantities:

$$\|f\|_2 \;=\; \left( \int_0^{1/2} 1 \, dx \right)^{1/2} \;=\; \frac{1}{\sqrt{2}},$$

$$\|g\|_2 \;=\; \left( \int_{1/2}^1 1 \, dx \right)^{1/2} \;=\; \frac{1}{\sqrt{2}},$$

$$\|f + g\|_2 \;=\; \left( \int_0^1 1 \, dx \right)^{1/2} \;=\; 1.$$

(The supports are disjoint up to the single point $x = 1/2$, which has measure zero, so $f + g$ agrees a.e. with the indicator of $[0, 1]$.) Minkowski says $\|f + g\|_2 \leq \|f\|_2 + \|g\|_2$, which here becomes

$$1 \;\leq\; \frac{1}{\sqrt{2}} + \frac{1}{\sqrt{2}} \;=\; \sqrt{2} \;\approx\; 1.4142.$$

This is a strict inequality with a healthy gap — about $41\%$ slack. The reason: disjoint supports make $f$ and $g$ orthogonal in the inner-product sense ($\langle f, g \rangle = \int fg \, d\mu = 0$; Section 9), and for orthogonal functions the Pythagorean identity gives $\|f + g\|_2^2 = \|f\|_2^2 + \|g\|_2^2 = 1/2 + 1/2 = 1$. Minkowski’s bound $\|f\|_2 + \|g\|_2$, by contrast, is attained only when $f$ and $g$ point in the same direction — when one is a non-negative multiple of the other — and orthogonal non-zero functions never do. So the Minkowski inequality holds strictly while the Pythagorean identity holds with equality: two different statements about the same pair, and both true.
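The sketch below (our numpy code, `l2` is an illustrative helper) reproduces the example: $\|f\|_2 = \|g\|_2 = 1/\sqrt{2}$, $\|f+g\|_2 = 1$, Minkowski strict, Pythagoras exact, $\langle f, g \rangle = 0$.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100_001)
mid = 0.5 * (x[:-1] + x[1:])
dx = np.diff(x)
f = (mid <= 0.5).astype(float)          # indicator of [0, 1/2]
g = (mid > 0.5).astype(float)           # indicator of (1/2, 1]

def l2(h):
    """Midpoint-rule L^2 norm on [0, 1]."""
    return float(np.sqrt(np.sum(h ** 2 * dx)))

inner = float(np.sum(f * g * dx))       # 0: the supports are disjoint
print(l2(f), l2(g), l2(f + g), inner)
print(l2(f) + l2(g))                    # the Minkowski bound, sqrt(2)
```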

💡 Remark 5 ($p < 1$: the triangle inequality fails)

The Minkowski proof uses the fact that $t \mapsto t^p$ is convex on $[0, \infty)$, which holds for $p \geq 1$. For $0 < p < 1$ the function $t^p$ is concave, and the triangle inequality reverses direction. Concretely, with the same disjoint indicators $f = \mathbf{1}_{[0, 1/2]}$ and $g = \mathbf{1}_{[1/2, 1]}$ but $p = 1/2$:

$$\|f\|_{1/2} \;=\; \left( \int_0^{1/2} 1^{1/2} \, dx \right)^{2} \;=\; \left( \frac{1}{2} \right)^2 \;=\; \frac{1}{4},$$

$$\|g\|_{1/2} \;=\; \frac{1}{4} \quad \text{by symmetry},$$

$$\|f + g\|_{1/2} \;=\; \left( \int_0^1 1^{1/2} \, dx \right)^2 \;=\; 1^2 \;=\; 1.$$

Comparing: $\|f + g\|_{1/2} = 1$ but $\|f\|_{1/2} + \|g\|_{1/2} = 1/2$. So $\|f + g\|_{1/2} > \|f\|_{1/2} + \|g\|_{1/2}$ — the “norm” of the sum is larger than the sum of the “norms.” The triangle inequality fails, $\|\cdot\|_{1/2}$ is not a norm, and $L^{1/2}$ is not a normed space. (It is still a topological vector space; the formula $d(f, g) = \int |f - g|^{1/2} \, d\mu$ defines a translation-invariant metric. But there is no norm.) For this reason the entire $L^p$ theory in the rest of this topic restricts to $p \geq 1$.
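The failure is easy to reproduce on the same pair of indicators; a sketch with an illustrative grid (the `lp_quasi_norm` helper is hypothetical):

```python
import numpy as np

x = np.linspace(0, 1, 100_001)
dx = x[1] - x[0]
f = (x <= 0.5).astype(float)
g = (x > 0.5).astype(float)

def lp_quasi_norm(h, p):
    # (integral of |h|^p dx)^(1/p) — only a quasi-norm when p < 1
    return (np.sum(np.abs(h) ** p) * dx) ** (1.0 / p)

p = 0.5
lhs = lp_quasi_norm(f + g, p)                       # approximately 1
rhs = lp_quasi_norm(f, p) + lp_quasi_norm(g, p)     # approximately 1/4 + 1/4
assert lhs > rhs    # the triangle inequality FAILS for p < 1
```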

6. $L^p$ Is a Normed Vector Space

We have all the pieces. Sections 2 and 5 give us the seminorm, the quotient that promotes it to a norm, and the triangle inequality. Putting them together:

🔷 Theorem 4 ($L^p(\mu)$ is a normed vector space)

For every $1 \leq p \leq \infty$, $\bigl( L^p(\mu), \|\cdot\|_p \bigr)$ is a normed vector space. That is:

$$\|f\|_p \;\geq\; 0 \text{ for all } f \in L^p, \text{ with equality iff } f = 0 \text{ in } L^p.$$

$$\|\alpha f\|_p \;=\; |\alpha| \, \|f\|_p \text{ for all } \alpha \in \mathbb{R} \text{ and } f \in L^p.$$

$$\|f + g\|_p \;\leq\; \|f\|_p + \|g\|_p \text{ for all } f, g \in L^p.$$

The first axiom (definiteness) is the part where the a.e.-quotient earns its keep: $\|f\|_p = 0$ implies $\int |f|^p \, d\mu = 0$, which implies $|f|^p = 0$ a.e., which implies $f = 0$ a.e., which is exactly the statement that $f = 0$ in $L^p$ (as an equivalence class). The second axiom (homogeneity) is immediate from pulling the constant out of the integral. The third axiom (triangle inequality) is Minkowski. So $L^p$ is a normed vector space.

But “normed vector space” is not yet enough for the geometric reasoning we want. Real geometry — the kind where minimizing sequences converge, where projections exist, where Cauchy implies convergent — needs completeness. That is what Section 7 establishes: $L^p$ is not just a normed vector space, it is a Banach space (a complete normed vector space). The Riesz–Fischer theorem is the cornerstone result of this topic, and its proof uses the Dominated Convergence Theorem from Topic 26 in an essential way.

Before we get there, let us look at what these norms actually mean geometrically. The unit ball of an $L^p$ norm — the set of vectors with norm at most $1$ — is a shape that depends in a striking way on $p$. At $p = 1$ the ball is a diamond (or, in higher dimensions, a cross-polytope); at $p = 2$ it is a sphere; at $p = \infty$ it is a cube. The shape morphs continuously between these extremes as $p$ varies, and the morphing is the visual core of the entire theory of $L^p$ regularization in machine learning. The flagship explorer below lets you slide $p$ and watch the unit ball change shape — pay particular attention to the corners that appear at $p = 1$ and disappear at $p > 1$, because those corners are exactly the geometric reason that $L^1$ regularization (Lasso) produces sparse solutions and $L^2$ regularization (Ridge) does not.


Slide p from 0.5 to 20 (or toggle p = ∞) to watch the unit ball morph from a non-convex star (p < 1) through a diamond (p = 1), to a circle (p = 2), to a square (p = ∞). The corners at p = 1 are exactly why L¹ regularization (Lasso) produces sparse solutions.
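The morphing can also be seen without plotting: the diagonal point $(t, t)$ on the $L^p$ unit sphere solves $2 t^p = 1$, i.e. $t = 2^{-1/p}$. A short sketch (the sample values of $p$ are illustrative):

```python
import numpy as np

# The point (t, t) on the L^p unit sphere satisfies 2 t^p = 1, so t = 2^(-1/p).
# As p grows, the diagonal pushes outward: diamond face -> circle -> square corner.
for p in (1.0, 2.0, 8.0):
    t = 2.0 ** (-1.0 / p)
    print(f"p={p}: diagonal point ({t:.4f}, {t:.4f})")
# p=1 gives t = 0.5 (flat face of the diamond); p=2 gives t = 0.7071 (circle);
# p -> infinity gives t -> 1 (corner of the square).

# For p < 1 the "ball" is not even convex: the midpoint of the unit vectors
# e1 = (1, 0) and e2 = (0, 1) lands OUTSIDE the unit ball.
p = 0.5
mid = np.array([0.5, 0.5])
quasi_norm = (np.abs(mid) ** p).sum() ** (1.0 / p)
assert quasi_norm > 1    # (0.5^0.5 + 0.5^0.5)^2 = 2 > 1: the non-convex star
```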

💡 Remark 6 (The quotient promotes a seminorm to a norm)

Looking back at the construction: the seminorm $\|\cdot\|_p$ on the space of measurable functions has a non-trivial kernel — the set of functions with $\|f\|_p = 0$ — and that kernel is precisely the set of functions that are zero almost everywhere. Quotienting by the kernel collapses the kernel to a single point (the zero element), and on the quotient the seminorm becomes a norm. This is a general construction: in any seminormed vector space, the kernel of the seminorm is a linear subspace, and the quotient by the kernel is automatically a normed space. $L^p$ is one of the most famous applications of this general fact, and it is the reason measure-theoretic functional analysis can use the geometric language of normed vector spaces at all.

7. Completeness — The Riesz–Fischer Theorem

The single most important fact about $L^p$ spaces — the result that makes essentially every downstream use of $L^p$ work — is that $L^p$ is complete. Every Cauchy sequence converges to an element of the space. This is not automatic for normed vector spaces in general (the space of polynomials with the $L^2$ norm, for example, is not complete: a Cauchy sequence of polynomials can converge to a non-polynomial), and it is not at all obvious from the definition. The Riesz–Fischer theorem is the proof.

📐 Definition 4 (Cauchy sequence in $L^p$)

A sequence $(f_n)$ in $L^p(\mu)$ is Cauchy if for every $\varepsilon > 0$ there exists $N$ such that $\|f_n - f_m\|_p < \varepsilon$ for all $n, m \geq N$.

Equivalently, $\|f_n - f_m\|_p \to 0$ as $\min(n, m) \to \infty$.

The Cauchy condition is purely internal to the sequence — it asks “are the terms getting closer to each other?” without ever mentioning a candidate limit. Completeness is the assertion that every Cauchy sequence in fact has a limit in the space. The Riesz–Fischer theorem says $L^p$ has this property.

🔷 Theorem 5 (Riesz–Fischer theorem)

For every $1 \leq p \leq \infty$ and every measure space $(\Omega, \mathcal{F}, \mu)$, the space $L^p(\mu)$ is complete: every Cauchy sequence in $L^p(\mu)$ converges to an element of $L^p(\mu)$.

In particular, $L^p(\mu)$ with the norm $\|\cdot\|_p$ is a Banach space.

Proof.

We prove the case $1 \leq p < \infty$. The case $p = \infty$ is a separate (easier) argument that we sketch in Remark 7. Let $(f_n)$ be a Cauchy sequence in $L^p$. We will construct a candidate limit, show that a subsequence converges to it, and then use the Cauchy property to upgrade subsequence convergence to full-sequence convergence.

Step 1: Extract a fast-converging subsequence. Since $(f_n)$ is Cauchy, we can choose indices $n_1 < n_2 < n_3 < \cdots$ such that

$$\|f_{n_{k+1}} - f_{n_k}\|_p \;<\; 2^{-k} \quad \text{for every } k \geq 1.$$

This is possible because the Cauchy condition with $\varepsilon_k = 2^{-k}$ gives an index $N_k$ such that $\|f_n - f_m\|_p < 2^{-k}$ for $n, m \geq N_k$; we just choose $n_{k+1} > \max(n_k, N_k)$.

Step 2: Bound the partial-sum-of-differences. Define the non-negative function

$$g_N(x) \;=\; \sum_{k=1}^N |f_{n_{k+1}}(x) - f_{n_k}(x)|.$$

Each $g_N$ is measurable and non-negative. By Minkowski’s inequality applied $N$ times,

$$\|g_N\|_p \;\leq\; \sum_{k=1}^N \|f_{n_{k+1}} - f_{n_k}\|_p \;<\; \sum_{k=1}^N 2^{-k} \;<\; 1.$$

So the $L^p$ norms of the partial sums $g_N$ are uniformly bounded by $1$.

Step 3: Pass to the monotone limit. The sequence $(g_N)$ is monotone increasing — adding another non-negative term cannot decrease the sum. By the Monotone Convergence Theorem (Topic 26, Section 5) applied to $(g_N^p)$, the pointwise limit

$$g(x) \;=\; \lim_{N \to \infty} g_N(x) \;=\; \sum_{k=1}^\infty |f_{n_{k+1}}(x) - f_{n_k}(x)|$$

exists in $[0, \infty]$ at every $x$ and satisfies

$$\int g^p \, d\mu \;=\; \lim_{N \to \infty} \int g_N^p \, d\mu \;=\; \lim_{N \to \infty} \|g_N\|_p^p \;\leq\; 1^p \;=\; 1.$$

So $g \in L^p$, and in particular $g(x) < \infty$ for $\mu$-almost every $x$. This is the key consequence of the bound: the infinite sum of absolute differences is finite a.e.

Step 4: Construct the candidate limit $f$. Where $g(x) < \infty$ — i.e., on a set of full measure — the series

$$f_{n_1}(x) + \sum_{k=1}^\infty \bigl( f_{n_{k+1}}(x) - f_{n_k}(x) \bigr)$$

is absolutely convergent. (This is the standard “absolute convergence implies convergence” fact for real-valued series, applied at each $x$.) The partial sums of this series are exactly $f_{n_{N+1}}(x)$ — the series telescopes — so we define

$$f(x) \;=\; \lim_{k \to \infty} f_{n_k}(x) \quad \text{for $\mu$-a.e. } x,$$

and $f(x) = 0$ on the null set where the series fails to converge. The function $f$ is measurable (as the a.e.-limit of measurable functions, redefined on a null set).

Step 5: $f \in L^p$. We need to check that the candidate limit actually lives in $L^p$. The triangle inequality gives, for every $k \geq 1$,

$$|f_{n_k}(x) - f_{n_1}(x)| \;\leq\; \sum_{j=1}^{k-1} |f_{n_{j+1}}(x) - f_{n_j}(x)| \;\leq\; g(x).$$

Letting $k \to \infty$ and using the pointwise limit definition of $f$:

$$|f(x) - f_{n_1}(x)| \;\leq\; g(x) \quad \text{a.e.}$$

So $|f| \leq |f_{n_1}| + g$ a.e., and both $f_{n_1} \in L^p$ and $g \in L^p$, so $f \in L^p$ by Minkowski.

Step 6: $f_{n_k} \to f$ in $L^p$. We use the Dominated Convergence Theorem from Topic 26. Define $h_k(x) = |f_{n_k}(x) - f(x)|^p$. We need:

(a) $h_k(x) \to 0$ a.e. as $k \to \infty$. This holds because $f_{n_k}(x) \to f(x)$ a.e. by Step 4 and $t \mapsto |t|^p$ is continuous.

(b) A dominating function. By the triangle-inequality bounds of Step 5, $|f_{n_k}(x) - f(x)| \leq |f_{n_k}(x) - f_{n_1}(x)| + |f_{n_1}(x) - f(x)| \leq 2 g(x)$ for almost every $x$. So

$$h_k(x) \;=\; |f_{n_k}(x) - f(x)|^p \;\leq\; (2 g(x))^p \;=\; 2^p \cdot g(x)^p.$$

The function $2^p \cdot g^p$ is in $L^1$ because $\int g^p \, d\mu \leq 1$ from Step 3. So $h_k$ is dominated by an integrable function for every $k$.

DCT now gives $\int h_k \, d\mu \to 0$, which is exactly $\|f_{n_k} - f\|_p^p \to 0$. So $f_{n_k} \to f$ in $L^p$ along the chosen subsequence.

Step 7: The full sequence converges. We have shown that the subsequence $(f_{n_k})$ converges to $f$ in $L^p$. To upgrade this to convergence of the full sequence $(f_n)$, fix $\varepsilon > 0$. Since $(f_n)$ is Cauchy, choose $N$ such that $\|f_n - f_m\|_p < \varepsilon / 2$ whenever $n, m \geq N$. Since $f_{n_k} \to f$ in $L^p$, choose $K$ such that $\|f_{n_K} - f\|_p < \varepsilon / 2$ and $n_K \geq N$. Then for any $n \geq n_K$,

$$\|f_n - f\|_p \;\leq\; \|f_n - f_{n_K}\|_p + \|f_{n_K} - f\|_p \;<\; \frac{\varepsilon}{2} + \frac{\varepsilon}{2} \;=\; \varepsilon.$$

So $\|f_n - f\|_p \to 0$ and the full sequence converges to $f$ in $L^p$.

The proof has the unmistakable structure of a classic real-analysis subsequence argument: extract a fast-decaying subsequence so that the differences are summable, sum the differences to get an a.e. limit, dominate to get $L^p$ convergence, then use Cauchyness to extend from the subsequence to the full sequence. The DCT plays the role of the closer — once we have an a.e. limit and a uniform $L^p$ dominator, DCT does the rest. Everything else in the proof is preparatory work to set up the DCT application.
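Step 1 can be mimicked in a concrete setting. The sketch below builds a Cauchy sequence in $\ell^2$ (counting measure on 50 points, so norms are plain `numpy` vector norms), extracts a subsequence with $\|f_{n_{k+1}} - f_{n_k}\|_2 < 2^{-k}$, and checks that the summed differences stay bounded, mirroring $\|g_N\|_p < 1$; the particular sequence and index rule are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=50)
v = np.sin(np.arange(50) * 1.7)      # a fixed perturbation direction

# A Cauchy sequence in l^2: f_n = target + v / n, so
# ||f_n - f_m||_2 = ||v|| * |1/n - 1/m| -> 0 as min(n, m) -> infinity.
def f(n):
    return target + v / n

# Step 1 of Riesz-Fischer: choose n_k = ceil(||v|| * 2^(k+1)), which guarantees
# ||f_{n_{k+1}} - f_{n_k}||_2 < ||v|| / n_k <= 2^{-(k+1)}.
v_norm = np.linalg.norm(v)
ns = [int(np.ceil(v_norm * 2 ** (k + 1))) for k in range(1, 11)]

diffs = [np.linalg.norm(f(ns[k + 1]) - f(ns[k])) for k in range(len(ns) - 1)]
assert all(d < 2.0 ** -(k + 1) for k, d in enumerate(diffs))
# Mirroring Step 2: the total of the differences is bounded by 1.
assert sum(diffs) < 1.0
```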

📝 Example 7 (A Cauchy sequence converging in $L^2$)

The smooth-step functions $f_n: [-1/2, 3/2] \to \mathbb{R}$ defined by

$$f_n(t) \;=\; \frac{1}{2}\bigl(1 + \tanh(n t)\bigr) \cdot \frac{1}{2}\bigl(1 + \tanh(n (1 - t))\bigr)$$

are smooth approximations to the indicator $\mathbf{1}_{[0, 1]}$. As $n$ grows the two $\tanh$ transitions sharpen, and $f_n(t) \to \mathbf{1}_{[0, 1]}(t)$ pointwise at every $t$ except the two boundary points $\{0, 1\}$ (where $f_n \to 1/2$ while the indicator equals $1$). The exceptional set is a null set, so the convergence holds almost everywhere.

Numerically, $\|f_n - \mathbf{1}_{[0, 1]}\|_2$ decays rapidly: at $n = 10$ it is around $0.18$, at $n = 50$ around $0.08$, at $n = 200$ around $0.04$. The sequence $(f_n)$ is Cauchy in $L^2$ — for any $m, n$ both large, the two functions both approximate the indicator and so are close to each other — and Riesz–Fischer guarantees a limit in $L^2$. That limit is the equivalence class $[\mathbf{1}_{[0, 1]}]$. The interactive viz below lets you advance $n$ and watch both the function and the $L^p$-norm difference.
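The reported norms can be reproduced with the quadrature recipe from Section 11; splitting the integral at the jump points $0$ and $1$ keeps `quad` accurate near the sharp transitions (the helper names are illustrative):

```python
import numpy as np
from scipy.integrate import quad

def f_n(t, n):
    # Smooth tanh step approximating the indicator of [0, 1] (Example 7).
    return 0.25 * (1 + np.tanh(n * t)) * (1 + np.tanh(n * (1 - t)))

def l2_err(n):
    # ||f_n - 1_[0,1]||_2 on [-1/2, 3/2], integrated piecewise around the jumps.
    sq = lambda t: (f_n(t, n) - (1.0 if 0 <= t <= 1 else 0.0)) ** 2
    total = sum(quad(sq, a, b)[0] for a, b in [(-0.5, 0), (0, 1), (1, 1.5)])
    return np.sqrt(total)

errs = {n: l2_err(n) for n in (10, 50, 200)}
print(errs)    # roughly 0.18-0.2 at n=10, ~0.08 at n=50, ~0.04 at n=200
assert errs[10] > errs[50] > errs[200]    # monotone decay
assert errs[50] < 0.6 * errs[10]          # ~ 1/sqrt(n) rate (loose check)
```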


Smooth tanh-mollified approximations to 1_[0, 1]. As n grows the transitions sharpen and f_n converges to the indicator in every L^p (1 ≤ p < ∞). The bar chart shows how the L^p norm difference decays as n grows — completeness of L^p (Riesz–Fischer) guarantees that this Cauchy sequence converges in the L^p norm to the limit shown.

💡 Remark 7 (Why completeness matters for ML — and the $p = \infty$ case)

Completeness is the abstract guarantee that minimizing sequences in optimization actually converge to something legitimate. In gradient descent, EM, variational inference, and any other iterative procedure that produces a sequence of approximations $(f_n)$ to a target, completeness of the ambient space $L^p$ is what guarantees that “Cauchy” implies “converges to a real $L^p$ function.” Without completeness, a perfectly well-behaved minimizing sequence could “escape” the space and converge to something pathological — a delta function, a non-measurable beast, or nothing at all. Forward link: Gradient Descent.

The case $p = \infty$ is technically separate but easier than the proof we just gave. If $(f_n)$ is Cauchy in $L^\infty$, then for every $\varepsilon > 0$ there is an $N$ such that $|f_n(x) - f_m(x)| < \varepsilon$ for $\mu$-almost every $x$ when $n, m \geq N$. Taking a countable union of null sets (one per pair $(n, m)$) and removing it from $\Omega$ leaves a set on which $(f_n)$ is uniformly Cauchy as a sequence of bounded functions, so it converges uniformly to a bounded function $f$. Then $\|f_n - f\|_\infty \to 0$. So $L^\infty$ is also complete.

8. Density Results

The space $L^p$ has many “nice” subspaces — simple functions, continuous compactly supported functions, polynomials on compact intervals, smooth functions — and an important question is whether these subspaces are dense. Density means: every $L^p$ function can be approximated arbitrarily well in the $L^p$ norm by an element of the subspace. Density results are the bread and butter of approximation theory: they say that questions about general $L^p$ functions can often be reduced to questions about simpler, more tractable functions.

🔷 Theorem 6 (Density of simple functions in $L^p$)

For every $1 \leq p < \infty$, the simple functions in $L^p(\mu)$ are dense: for every $f \in L^p$, there exists a sequence of simple functions $(s_n)$ with $\|f - s_n\|_p \to 0$.

Proof.

By Topic 25’s simple-function approximation theorem, there exists a sequence of simple functions $(s_n)$ with $|s_n(x)| \leq |f(x)|$ pointwise and $s_n(x) \to f(x)$ pointwise (and even uniformly on sets where $f$ is bounded, but we only need pointwise here).

The sequence $|f - s_n|^p$ converges pointwise to zero. To upgrade pointwise convergence to $L^p$ convergence, we need a dominator. The triangle inequality gives

$$|f(x) - s_n(x)|^p \;\leq\; (|f(x)| + |s_n(x)|)^p \;\leq\; (2 |f(x)|)^p \;=\; 2^p |f(x)|^p.$$

The function $2^p |f|^p$ is in $L^1$ because $f \in L^p$, so $\int |f|^p \, d\mu < \infty$. By the Dominated Convergence Theorem (Topic 26),

$$\|f - s_n\|_p^p \;=\; \int |f - s_n|^p \, d\mu \;\longrightarrow\; \int 0 \, d\mu \;=\; 0.$$

Taking $p$-th roots gives $\|f - s_n\|_p \to 0$.

The proof is a beautiful one-line application of DCT: we already know the simple functions converge pointwise from Topic 25, and we already know $|f|^p$ is integrable from $f \in L^p$, so DCT gives the rest. This is the recurring pattern of this topic: every density and convergence result is “find a pointwise limit, find a dominator, apply DCT.”
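The theorem can be watched in action with the dyadic scheme $s_n = \min(\lfloor 2^n f \rfloor / 2^n, n)$ in the style of Topic 25, here applied to $f(x) = \sqrt{x}$ on $[0, 1]$ with a Riemann-sum stand-in for the integral (grid size and helper names are illustrative):

```python
import numpy as np

# Dyadic simple-function approximation of f(x) = sqrt(x) on [0, 1],
# and the L^p error decay guaranteed by Theorem 6.
x = np.linspace(0, 1, 100_001)
dx = x[1] - x[0]
f = np.sqrt(x)

def simple_approx(values, n):
    # s_n = min(floor(2^n f) / 2^n, n): a simple function with s_n <= f
    # and |f - s_n| <= 2^-n wherever f <= n (here f <= 1, so everywhere).
    return np.minimum(np.floor(values * 2 ** n) / 2 ** n, n)

def lp_err(p, n):
    return (np.sum(np.abs(f - simple_approx(f, n)) ** p) * dx) ** (1 / p)

errs = [lp_err(2, n) for n in (1, 3, 5, 7)]
assert all(a > b for a, b in zip(errs, errs[1:]))   # monotone decay
assert errs[-1] < 2 ** -7                           # pointwise bound |f - s_7| <= 2^-7
```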

For $L^p$ on $\mathbb{R}^n$ with Lebesgue measure there is an even stronger density result: continuous, compactly supported functions are dense. This is what justifies using neural networks — universal approximators of continuous functions — to approximate $L^p$ functions.

🔷 Theorem 7 (Density of $C_c(\mathbb{R}^n)$ in $L^p(\mathbb{R}^n)$)

For $1 \leq p < \infty$, the space $C_c(\mathbb{R}^n)$ of continuous, compactly supported functions is dense in $L^p(\mathbb{R}^n, \lambda)$: for every $f \in L^p(\mathbb{R}^n)$, there exists a sequence $(\varphi_n) \subset C_c(\mathbb{R}^n)$ with $\|f - \varphi_n\|_p \to 0$.

Proof.

The proof has two stages. First, we approximate $f$ by simple functions (Theorem 6). Second, we approximate each simple function by a continuous compactly supported function.

A simple function on $\mathbb{R}^n$ is a finite linear combination of indicator functions of measurable sets. By inner regularity of Lebesgue measure, every measurable set of finite measure can be approximated from inside by a closed set (and from outside by an open set) up to arbitrarily small measure. Urysohn’s lemma then gives a continuous function that is $1$ on the closed set, $0$ outside the open set, and interpolates continuously in between. Taking the linear combination with the same coefficients as the simple function, we obtain a continuous compactly supported function that differs from the simple function in $L^p$ norm by an arbitrarily small amount. Combining the two stages — each with error $\varepsilon / 2$ — and applying the triangle inequality completes the proof.

A full version of this argument lives in Folland §6.1 (or Brezis §4.1 for the $\mathbb{R}^n$ case in detail). The key technical lemma is mollification: convolving an $L^p$ function with a smooth, compactly supported “bump” function (a mollifier) produces a smooth approximation that converges in $L^p$ as the mollifier scale shrinks.

📝 Example 8 (Density failure in $L^\infty$)

The density of $C_c$ in $L^p$ holds for $1 \leq p < \infty$, but fails at $p = \infty$. Consider $f = \mathbf{1}_{[0, 1]}$ as an element of $L^\infty(\mathbb{R})$, and let $g$ be any continuous function. Near $x = 0$, $g$ has no jump: its values just to the left and just to the right of $0$ are both close to $g(0)$. So $g$ cannot be close to $0$ for $x < 0$ and close to $1$ for $x > 0$ simultaneously: for any $\delta > 0$, there are points $x \in (-\delta, 0)$ where $|g(x)| \geq 1/2$ or points $x \in (0, \delta)$ where $|g(x) - 1| \geq 1/2$. The same argument applies near $x = 1$.

Quantitatively: for any continuous $g: \mathbb{R} \to \mathbb{R}$, the difference $f - g$ has $\|f - g\|_\infty \geq 1/2$. (In a neighborhood of either jump, $f$ and $g$ differ by at least $1/2$ in essential supremum.) So $C_c$ is not dense in $L^\infty$. The closure of $C_c$ in $L^\infty$ is the smaller space $C_0(\mathbb{R})$ of continuous functions vanishing at infinity, which is a proper subspace. The density failure at $p = \infty$ is one of several reasons $L^\infty$ is “harder” than the other $L^p$ spaces and why most theorems in this topic explicitly restrict to $1 \leq p < \infty$.

💡 Remark 8 (Density for neural network approximation)

Density of $C_c$ in $L^p$ has a striking machine learning consequence. The universal approximation theorem (which we will state precisely in Approximation Theory — Topic 20) says that neural networks with one hidden layer can approximate any continuous function uniformly on compact sets. Combined with density of $C_c$ in $L^p$, this gives:

Any $L^p$ function can be approximated arbitrarily well in $L^p$ norm by a neural network.

The logic is a two-step approximation: given $f \in L^p$, first find a continuous compactly supported $\varphi$ with $\|f - \varphi\|_p < \varepsilon / 2$ (Theorem 7), then find a neural network $h$ with $\|\varphi - h\|_\infty < \varepsilon / (2 \cdot \text{vol}(\text{supp } \varphi)^{1/p})$ (universal approximation), and conclude $\|f - h\|_p \leq \|f - \varphi\|_p + \|\varphi - h\|_p < \varepsilon$. This is the existence proof behind every “neural networks can approximate any function” claim in machine learning textbooks. What it leaves open — and what most of modern statistical learning theory addresses — is the question of how big the network has to be to achieve a given approximation error.

9. $L^2$ as a Hilbert Space (Preview)

Among all the $L^p$ spaces, $p = 2$ is special. The $L^2$ norm comes from an inner product — a bilinear pairing $\langle f, g \rangle$ that satisfies $\|f\|_2 = \sqrt{\langle f, f \rangle}$ — which gives $L^2$ a richer geometric structure than the other $L^p$ spaces. In particular, $L^2$ has notions of orthogonality and projection that the other $L^p$ spaces lack. This section previews the inner-product structure; the full Hilbert space theory will be developed in Topic 31.

📐 Definition 5 (The $L^2$ inner product)

For real-valued $f, g \in L^2(\mu)$, the $L^2$ inner product is

$$\langle f, g \rangle \;=\; \int f g \, d\mu.$$

For complex-valued functions, the inner product uses the conjugate of $g$: $\langle f, g \rangle = \int f \bar g \, d\mu$.

🔷 Proposition 2 (Inner product axioms on $L^2$)

The $L^2$ inner product is well-defined (the integral is finite by Cauchy–Schwarz; see Example 9), bilinear in the real case (sesquilinear in the complex case), symmetric (conjugate-symmetric in the complex case), and positive definite: $\langle f, f \rangle \geq 0$ with equality iff $f = 0$ in $L^2$. Moreover,

$$\|f\|_2 \;=\; \sqrt{\langle f, f \rangle} \;=\; \sqrt{\int |f|^2 \, d\mu}.$$

So the $L^2$ norm is the norm induced by the inner product, and $L^2$ is an inner product space. Combined with the Riesz–Fischer completeness from Section 7, this makes $L^2$ a Hilbert space: a complete inner product space.

📝 Example 9 (Cauchy–Schwarz as Hölder at $p = q = 2$)

For $f, g \in L^2$, the Cauchy–Schwarz inequality

$$|\langle f, g \rangle| \;=\; \left| \int f \bar g \, d\mu \right| \;\leq\; \int |f \bar g| \, d\mu \;=\; \|f \bar g\|_1 \;\leq\; \|f\|_2 \, \|g\|_2$$

is exactly Hölder’s inequality with $p = q = 2$. The first inequality is $|\int h \, d\mu| \leq \int |h| \, d\mu$ (the integral form of the triangle inequality), and the second is Hölder. So Cauchy–Schwarz is not an independent inequality — it is a special case of the more general Hölder result we proved in Section 4. This is the geometric reason that $L^2$ is the “best-behaved” $L^p$ space: at $p = 2$, the basic inequality of the inner-product structure coincides with the basic inequality of the integral structure.

📝 Example 10 (The Fourier basis is orthonormal in $L^2([0, 2\pi])$)

On the interval $[0, 2\pi]$ with normalized Lebesgue measure $d\mu = dx / (2\pi)$, the complex exponentials $e_n(x) = e^{i n x}$ for $n \in \mathbb{Z}$ form an orthonormal system: $\langle e_n, e_m \rangle = \delta_{nm}$ (Kronecker delta). To see this, compute

$$\langle e_n, e_m \rangle \;=\; \frac{1}{2\pi} \int_0^{2\pi} e^{i n x} e^{-i m x} \, dx \;=\; \frac{1}{2\pi} \int_0^{2\pi} e^{i (n - m) x} \, dx.$$

If $n = m$, the integrand is $1$ and the integral is $2\pi$, so $\langle e_n, e_n \rangle = 1$. If $n \neq m$, the integral evaluates to $[e^{i (n - m) x} / (i (n - m))]_0^{2\pi} = 0$ (the function $e^{i (n - m) x}$ is $2\pi$-periodic). So the Fourier basis is orthonormal, which means partial Fourier series can be computed by a finite-dimensional projection, the Fourier coefficients are inner products $\hat f(n) = \langle f, e_n \rangle$, and Parseval’s identity $\|f\|_2^2 = \sum_n |\hat f(n)|^2$ is the Pythagorean theorem in the Hilbert space $L^2([0, 2\pi])$. Forward link: Fourier Series (Topic 22).

🔷 Proposition 3 (Orthogonal projection (statement only))

Let $V \subseteq L^2(\mu)$ be a closed linear subspace. For every $f \in L^2$, there exists a unique element $\hat f \in V$ — the orthogonal projection of $f$ onto $V$ — that minimizes $\|f - v\|_2$ over all $v \in V$. The error vector $f - \hat f$ is orthogonal to every element of $V$:

$$\langle f - \hat f, v \rangle \;=\; 0 \quad \text{for all } v \in V.$$

A full proof requires the Hilbert space theory developed in Topic 31, but the statement is accessible now and is the central geometric fact about $L^2$.

💡 Remark 9 (Regression as $L^2$ projection)

Linear regression is exactly an orthogonal projection in $L^2$. Given data $y \in \mathbb{R}^n$ and a design matrix $X \in \mathbb{R}^{n \times d}$, the least-squares estimator is $\hat \beta = \arg\min_\beta \|y - X \beta\|_2^2$. The vector $X \hat \beta$ is the orthogonal projection of $y$ onto the column space of $X$ (a closed subspace of $\mathbb{R}^n = L^2$ with counting measure), and the normal equations $X^T X \hat \beta = X^T y$ are the orthogonality condition $\langle y - X \hat \beta, X v \rangle = 0$ for every $v$ — i.e., the residual is orthogonal to every linear combination of the columns of $X$. This identification is not a metaphor: it is literally the same thing. Generalized least squares uses a weighted inner product $\langle f, g \rangle_W = f^T W g$ for some positive-definite weight $W$, and the geometry remains the same — projection onto the column space, orthogonality of the residual, existence and uniqueness via the projection theorem. Forward link: Regression.
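The identification is easy to verify numerically: solve least squares, then check that the residual is orthogonal to every column of $X$ and that the projection is the closest point (the data here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Least squares = orthogonal projection of y onto col(X).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ beta

# Normal equations <y - X beta, X v> = 0: residual orthogonal to every column.
assert np.allclose(X.T @ residual, 0, atol=1e-8)

# The projection is the closest point: perturbing beta only increases the error.
for _ in range(5):
    perturbed = beta + rng.normal(scale=0.1, size=3)
    assert np.linalg.norm(y - X @ perturbed) >= np.linalg.norm(residual)
```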

10. Duality — $(L^p)^* = L^q$

Every normed vector space $X$ has a dual space $X^*$: the space of bounded linear functionals $\Phi: X \to \mathbb{R}$, with the operator norm $\|\Phi\| = \sup_{\|f\| \leq 1} |\Phi(f)|$. For $L^p$ spaces with conjugate exponents $p, q$, the dual space has a particularly clean characterization — it is isometrically isomorphic to the other space in the conjugate pair. This is the Riesz representation theorem for $L^p$, and it is the foundation of $L^p$ duality theory.

🔷 Theorem 8 (Riesz representation theorem for $L^p$)

Let $1 < p < \infty$ and let $(\Omega, \mathcal{F}, \mu)$ be a $\sigma$-finite measure space. Let $q$ be the conjugate exponent of $p$, so $1/p + 1/q = 1$. Then every bounded linear functional $\Phi: L^p(\mu) \to \mathbb{R}$ has the form

$$\Phi(f) \;=\; \int f g \, d\mu \quad \text{for a unique } g \in L^q(\mu),$$

and the operator norm of $\Phi$ equals the $L^q$ norm of the representing function:

$$\|\Phi\|_{(L^p)^*} \;=\; \|g\|_q.$$

In particular, the map $g \mapsto \Phi_g$ is an isometric isomorphism $L^q \to (L^p)^*$, often written as $(L^p)^* \cong L^q$.

Proof.

We sketch the proof; full details are in Folland §6.2 or Royden §19.5.

Embedding $L^q \hookrightarrow (L^p)^*$. Given $g \in L^q$, define $\Phi_g(f) = \int f g \, d\mu$. By Hölder’s inequality, $|\Phi_g(f)| \leq \|f\|_p \|g\|_q$, so $\Phi_g$ is a bounded linear functional with $\|\Phi_g\|_{(L^p)^*} \leq \|g\|_q$. To show equality, take $f = |g|^{q-1} \operatorname{sgn}(g) / \|g\|_q^{q-1}$; a direct computation gives $\|f\|_p = 1$ and $\Phi_g(f) = \|g\|_q$, so $\|\Phi_g\| \geq \|g\|_q$. Combining, $\|\Phi_g\| = \|g\|_q$.

Surjectivity. Given a bounded linear functional $\Phi \in (L^p)^*$, we want to find $g \in L^q$ with $\Phi = \Phi_g$. For each measurable set $A$ of finite measure, define $\nu(A) = \Phi(\mathbf{1}_A)$. Linearity of $\Phi$ on indicator functions makes $\nu$ a finitely additive set function; boundedness and a countable-additivity argument upgrade $\nu$ to a $\sigma$-additive signed measure. Moreover, if $\mu(A) = 0$, then $\mathbf{1}_A = 0$ in $L^p$, so $\Phi(\mathbf{1}_A) = 0$. So $\nu$ is absolutely continuous with respect to $\mu$ (every $\mu$-null set is $\nu$-null).

By the Radon–Nikodym theorem, absolute continuity implies that $\nu$ has a density $g = d\nu / d\mu$, so $\nu(A) = \int_A g \, d\mu$. By construction, $\Phi(\mathbf{1}_A) = \int \mathbf{1}_A g \, d\mu$ for every set $A$ of finite measure. Linearity then extends this from indicators to simple functions, and density of simple functions in $L^p$ (Theorem 6) plus continuity of $\Phi$ extends it to all of $L^p$. So $\Phi(f) = \int f g \, d\mu$ on all of $L^p$. Showing $g \in L^q$ uses the operator norm bound $|\Phi(f)| \leq \|\Phi\| \|f\|_p$ and a clever choice of test function similar to the embedding step.

💡 Remark 10 (Endpoint cases — $L^1$, $L^\infty$, and the asymmetry)

The same theorem holds at $p = 1$, $q = \infty$ — namely $(L^1)^* \cong L^\infty$, with the same proof structure but using $\sigma$-finiteness explicitly to handle the unboundedness of representing functions. The mirror statement $(L^\infty)^* \cong L^1$ is false: the dual of $L^\infty$ is strictly larger than $L^1$, containing “singular” functionals such as evaluation at a point that cannot be represented as $f \mapsto \int f g \, d\mu$ for any $g \in L^1$. This asymmetry — duality is symmetric for $1 < p < \infty$ but not at the endpoints — is one of the features that make $L^\infty$ the most idiosyncratic $L^p$ space and the reason most theorems in functional analysis explicitly exclude $p = \infty$.

📝 Example 11 (Dual pairing in finite dimensions: $(\ell^p)^* = \ell^q$)

When $\Omega = \{1, 2, \ldots, n\}$ with counting measure, $L^p$ becomes $\ell^p$ — the space of $n$-tuples with the $p$-norm $\|\mathbf{x}\|_p = (\sum_i |x_i|^p)^{1/p}$. Hölder becomes the finite-dimensional inequality $|\sum_i x_i y_i| \leq \|\mathbf{x}\|_p \|\mathbf{y}\|_q$, and the Riesz representation theorem becomes the statement that every linear functional $\Phi: \mathbb{R}^n \to \mathbb{R}$ has the form $\Phi(\mathbf{x}) = \sum_i x_i y_i = \mathbf{x} \cdot \mathbf{y}$ for a unique $\mathbf{y} \in \mathbb{R}^n$, with $\|\Phi\|_{(\ell^p)^*} = \|\mathbf{y}\|_q$. So the dual space of $\mathbb{R}^n$ with the $p$-norm is $\mathbb{R}^n$ with the $q$-norm. In coordinates, every linear functional is a dot product — but the norm on the dual is the $q$-norm rather than the $p$-norm. This is the finite-dimensional fingerprint of the general $L^p$ duality theorem, and it is the foundation of every constrained optimization argument that uses Lagrangian duality with $p$-norm constraints.
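The finite-dimensional statement — including the extremal vector $f = |g|^{q-1} \operatorname{sgn}(g) / \|g\|_q^{q-1}$ from the embedding step of Theorem 8 — can be verified directly (the specific $p$, $q$, and random $g$ are illustrative):

```python
import numpy as np

# Verify ||Phi_g|| = ||g||_q in (l^p)* = l^q using the extremal vector.
p, q = 3.0, 1.5                    # conjugate exponents: 1/3 + 1/1.5 = 1
rng = np.random.default_rng(1)
g = rng.normal(size=6)

gq = np.sum(np.abs(g) ** q) ** (1 / q)               # ||g||_q
f = np.abs(g) ** (q - 1) * np.sign(g) / gq ** (q - 1)

fp = np.sum(np.abs(f) ** p) ** (1 / p)               # = 1, since (q-1)p = q
assert abs(fp - 1) < 1e-9
assert abs(np.dot(f, g) - gq) < 1e-9                 # Phi_g(f) = ||g||_q

# Hoelder: no other unit vector does better.
for _ in range(200):
    h = rng.normal(size=6)
    h /= np.sum(np.abs(h) ** p) ** (1 / p)
    assert np.dot(h, g) <= gq + 1e-12
```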

11. Computational Notes

A few practical observations about computing $L^p$ norms in code, since machine learning practitioners encounter discrete and continuous $L^p$ norms constantly.

  • Discrete $L^p$ norms. numpy.linalg.norm(x, ord=p) computes $\|\mathbf{x}\|_p = (\sum_i |x_i|^p)^{1/p}$ for a vector $\mathbf{x}$. This is $L^p$ against the counting measure on $\{1, 2, \ldots, n\}$. The ord=np.inf case gives $\max_i |x_i|$ — the discrete essential supremum, which equals the actual supremum because every singleton has positive measure under counting measure.

  • Continuous $L^p$ norms. For a function available in closed form, scipy.integrate.quad(lambda x: abs(f(x))**p, a, b)[0]**(1/p) computes $\|f\|_p$ on $[a, b]$ via adaptive numerical quadrature. This is the “integrate $|f|^p$, then take the $p$-th root” recipe straight from Definition 1, and it works for any $p \in [1, \infty)$. For $p = \infty$, take the maximum of $|f|$ over a fine grid instead.

  • torch.nn.functional.mse_loss is a discrete $L^2$ computation. The mean squared error loss is $\text{MSE}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{n} \sum_i (y_i - \hat y_i)^2 = \frac{1}{n} \|\mathbf{y} - \hat{\mathbf{y}}\|_2^2$, which is $\|\cdot\|_2^2$ against the normalized counting measure on $\{1, \ldots, n\}$ (assigning measure $1/n$ to each point). The factor of $1/n$ converts counting measure to the empirical measure, which is what makes MSE the natural loss for finite-sample expected value estimation.

  • Regularization geometry: the $L^p$ ball is the constraint set. Ridge regression solves $\min_\beta \|y - X \beta\|_2^2 + \lambda \|\beta\|_2^2$; the penalty constrains $\beta$ to lie inside an $L^2$ ball, whose boundary is a smooth sphere with no corners. The least-squares contour is generically tangent to the sphere at a point with all coordinates non-zero, so Ridge “shrinks” coefficients but never sets them exactly to zero. Lasso solves $\min_\beta \|y - X \beta\|_2^2 + \lambda \|\beta\|_1$; the penalty constrains $\beta$ to an $L^1$ ball, a diamond with corners on the coordinate axes. The least-squares contour generically meets the diamond at a corner, where one or more coordinates are exactly zero, so Lasso produces sparse solutions. Elastic net uses both penalties, $\lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$; its constraint set interpolates between the two extremes, keeping corners on the axes (so it can still zero out coordinates) while being strictly convex away from them. The flagship $L^p$ ball visualization in Section 6 makes this geometric explanation visual.
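The first three bullets can be verified directly. A minimal sketch using numpy alone (the vectors are arbitrary illustrative values; numpy stands in for torch.nn.functional.mse_loss in the last check, since the identity involves only arithmetic):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

# ord=p computes (sum |x_i|^p)^(1/p): the Lp norm against counting measure
for p in (1, 2, 3):
    assert np.isclose(np.linalg.norm(x, ord=p),
                      np.sum(np.abs(x)**p)**(1 / p))

# ord=np.inf is the discrete essential supremum, i.e. max_i |x_i|
assert np.linalg.norm(x, ord=np.inf) == np.max(np.abs(x))

# MSE is the squared L2 norm against the empirical measure (mass 1/n each)
y, y_hat = np.array([1.0, 2.0, 3.0]), np.array([1.5, 1.5, 2.0])
mse = np.mean((y - y_hat)**2)
assert np.isclose(mse, np.linalg.norm(y - y_hat)**2 / len(y))
```

Each assertion is one of the measure-theoretic identities above, stated in code.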

L1 vs L2 ball geometry with level curves of a least-squares loss — the L1 corner at the axis is where Lasso produces sparsity, the smooth L2 sphere is where Ridge shrinks but never zeros

📝 Example 12 (Numerical verification of Hölder)

For $f(x) = \sin(\pi x)$ and $g(x) = x^2$ on $[0, 1]$, with $p = 3$, $q = 3/2$, computing each $L^p$ norm via scipy.integrate.quad:

$$\|fg\|_1 \;=\; \int_0^1 \sin(\pi x) \cdot x^2 \, dx \;=\; \frac{1}{\pi} - \frac{4}{\pi^3} \;\approx\; 0.1893,$$

$$\|f\|_3 \;=\; \left( \int_0^1 \sin^3(\pi x) \, dx \right)^{1/3} \;=\; \left( \frac{4}{3\pi} \right)^{1/3} \;\approx\; 0.7515,$$

$$\|g\|_{3/2} \;=\; \left( \int_0^1 x^3 \, dx \right)^{2/3} \;=\; (0.25)^{2/3} \;\approx\; 0.3969.$$

Hölder bound: $\|f\|_3 \cdot \|g\|_{3/2} \approx 0.2982$. The inequality $0.1893 \leq 0.2982$ holds with substantial slack (ratio about $0.635$). The ratio is far from $1$ because $|f|^3 = \sin^3(\pi x)$ and $|g|^{3/2} = x^3$ are not proportional — the proportionality condition for equality in Hölder fails. Compare to the power-pair example in Section 4, where the ratio approached $1$ because the functions were specifically designed to be a conjugate pair.
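These numbers can be reproduced with a few lines of scipy, following the quad recipe from Section 11 (the closed-form values in the comments are computed by hand from the antiderivatives):

```python
import numpy as np
from scipy.integrate import quad

f = lambda x: np.sin(np.pi * x)
g = lambda x: x**2
p, q = 3.0, 1.5  # conjugate exponents: 1/3 + 2/3 = 1

fg_1 = quad(lambda x: abs(f(x) * g(x)), 0, 1)[0]       # 1/pi - 4/pi^3 ~ 0.1893
f_p = quad(lambda x: abs(f(x))**p, 0, 1)[0]**(1 / p)   # (4/(3 pi))^(1/3) ~ 0.7515
g_q = quad(lambda x: abs(g(x))**q, 0, 1)[0]**(1 / q)   # (1/4)^(2/3) ~ 0.3969

assert fg_1 <= f_p * g_q          # Hölder holds ...
assert fg_1 / (f_p * g_q) < 0.7   # ... with substantial slack (ratio ~ 0.635)
```

Swapping in a conjugate pair ($|f|^3$ proportional to $|g|^{3/2}$) pushes the ratio to $1$, which is a useful sanity check on any quadrature-based implementation.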

12. Connections to Machine Learning

LpL^p spaces are not an abstract curiosity — they are the function-space framework that every modern ML algorithm implicitly assumes. Five concrete connections illustrate the range.

Four-panel ML connections to Lp spaces: regression as L2 projection, ridge/lasso constraint sets, score matching in L2, Wasserstein distances

Regression as $L^2$ projection. Least-squares regression finds the closest element of a closed subspace of $L^2$, and the existence and uniqueness of the minimizer is exactly the orthogonal-projection theorem from Section 9. The normal equations are the orthogonality condition. Generalized least squares uses a weighted $L^2$ norm in which the weights encode observation precisions, and weighted regression remains a projection — just in a re-weighted Hilbert space. Forward link: Regression.

Regularization geometry. The shapes of the $L^1$ and $L^2$ unit balls determine the qualitative behavior of regularized estimators. The $L^1$ ball has corners on the coordinate axes; the intersection of a level curve of the data-fit term with a corner produces sparsity (some coordinates exactly zero). The $L^2$ ball is a smooth sphere; intersections never sit on axes, so $L^2$ regularization shrinks coordinates but never zeros them out. This is why Lasso selects features and Ridge does not — and the explanation is purely geometric, not statistical.

Fourier neural operators. Fourier neural operators (FNO) learn mappings between function spaces, typically $\mathcal{G}_\theta: L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$. The training objective is an $L^2$ loss over function pairs $(u, s)$ from a PDE dataset: $\mathcal{L}(\theta) = \mathbb{E}\bigl[ \|\mathcal{G}_\theta(u) - s\|_{L^2}^2 \bigr]$. Riesz–Fischer guarantees that the optimization happens in a complete function space — minimizing sequences converge inside $L^2$. Without completeness, the optimization could “escape” to non-functions, and the entire approach would be ill-defined. Forward link: Fourier Neural Operators.

Score matching. Score matching trains a generative model by minimizing the $L^2$ distance between the score functions (gradients of log-densities) of the model and data: $\mathcal{L}_{\text{SM}}(\theta) = \int \|\nabla \log p_\theta(x) - \nabla \log p_{\text{data}}(x)\|_2^2 \, p_{\text{data}}(x) \, dx$. The objective is well-defined precisely when both score functions live in $L^2(p_{\text{data}})$. Denoising score matching, sliced score matching, and diffusion-model training are all computational tricks for evaluating this $L^2$ functional without knowing $\nabla \log p_{\text{data}}$ explicitly. Forward link: Score Matching.

Wasserstein distances and optimal transport. The Wasserstein-$p$ distance between two probability measures is

$$W_p(\mu, \nu) \;=\; \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int |x - y|^p \, d\gamma(x, y) \right)^{1/p},$$

which has the structure of an $L^p$ norm on the space of couplings $\Gamma(\mu, \nu)$. The Kantorovich dual formulation is in the spirit of the $(L^p)^* = L^q$ duality from Section 10: $W_1(\mu, \nu) = \sup_{\|f\|_{\text{Lip}} \leq 1} \int f \, d(\mu - \nu)$, where the supremum is over $1$-Lipschitz functions (a subset of $L^\infty$). The dual side replaces the infimum over couplings with a supremum over functions, mirroring $L^p$/$L^q$ duality in the optimal transport setting. Forward link: Wasserstein Distances.
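On the real line the optimal coupling is monotone, which gives a direct way to sanity-check $W_1$ numerically. A sketch using scipy.stats.wasserstein_distance (the two sample distributions below are arbitrary illustrations): for equal-size empirical measures, $W_1$ reduces to quantile matching, i.e. the mean absolute difference of sorted samples.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=500)
y = rng.normal(2.0, 1.0, size=500)

# For equal-size empirical measures on R, the optimal coupling is monotone:
# W_1 = mean |x_(i) - y_(i)| over sorted samples (quantile matching)
w1_sorted = np.mean(np.abs(np.sort(x) - np.sort(y)))
w1_scipy = wasserstein_distance(x, y)
assert np.isclose(w1_sorted, w1_scipy)
```

scipy computes the same quantity by integrating the absolute CDF difference, which is the one-dimensional specialization of the Kantorovich dual.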

📝 Example 13 ($L^1$ vs. $L^2$ regularization paths)

Consider a 2D regression with parameter vector $\beta = (\beta_1, \beta_2)$. As the regularization strength $\lambda$ varies, the Ridge path $\hat\beta(\lambda) = (X^T X + \lambda I)^{-1} X^T y$ moves smoothly from the OLS solution $(\lambda = 0)$ toward the origin $(\lambda \to \infty)$, with both coordinates shrinking continuously and never reaching zero except in the limit. The Lasso path $\hat\beta(\lambda) = \arg\min_\beta \|y - X \beta\|_2^2 + \lambda \|\beta\|_1$ is piecewise linear and kinks at values of $\lambda$ where one of the coordinates first becomes exactly zero. After the kink, the Lasso solution stays on the axis — it has selected one feature and dropped the other. The qualitative difference between “smooth shrinkage” (Ridge) and “kink-then-zero” (Lasso) is entirely a consequence of the $L^p$ ball geometry: the $L^2$ ball is differentiable, so the optimum moves smoothly; the $L^1$ ball has corners, so the optimum hits a corner at some $\lambda$ and stays there.
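In one dimension both paths have closed forms, which makes the kink-then-zero behavior easy to verify. A toy sketch (this is the one-parameter specialization with $X = 1$, not a general design matrix): Ridge divides, Lasso soft-thresholds.

```python
import numpy as np

# 1D prototypes of the two regularization paths:
#   ridge: argmin (y - b)^2 + lam * b^2  ->  b = y / (1 + lam)   (smooth shrinkage)
#   lasso: argmin (y - b)^2 + lam * |b|  ->  b = soft(y, lam/2)  (kink, then exactly 0)
def ridge(y, lam):
    return y / (1.0 + lam)

def lasso(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

y = 1.0
assert all(ridge(y, lam) > 0 for lam in np.linspace(0, 4, 9))  # never exactly zero
assert lasso(y, 1.0) == 0.5   # piecewise linear before the kink
assert lasso(y, 3.0) == 0.0   # exactly zero once lam/2 >= |y|
```

With an orthonormal design matrix the multivariate Lasso decouples into exactly these coordinatewise soft-thresholding problems, which is where the piecewise-linear path comes from.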

📝 Example 14 ($L^2$ convergence of Fourier partial sums)

Take $f = \mathbf{1}_{[0, \pi]}$ on $[0, 2\pi]$. The partial Fourier sums $S_N f = \sum_{|n| \leq N} \hat f(n) e^{i n x}$ exhibit the Gibbs phenomenon: at the discontinuities $x = 0$ and $x = \pi$, $S_N f$ overshoots $f$ by about $9\%$ of the jump height no matter how large $N$ is. So $S_N f$ does not converge pointwise to $f$ at the jumps. But the $L^2$ error decays:

$$\|f - S_N f\|_2^2 \;=\; \sum_{|n| > N} |\hat f(n)|^2 \;\xrightarrow[N \to \infty]{}\; 0,$$

since the Fourier coefficients of $f$ are square-summable (Parseval’s identity). The decay is moderate — the error norm falls like $1/\sqrt{N}$ — but it goes to zero. So $S_N f \to f$ in $L^2$ but not pointwise. The Gibbs overshoot is an artifact of pointwise behavior at a measure-zero set, and the $L^2$ norm is insensitive to it because it integrates against a measure that assigns no mass to individual points. This is the cleanest example of “two notions of convergence give different answers, and the answer that survives in the long run is the one that respects the underlying measure.”
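The $1/\sqrt{N}$ decay can be checked from the coefficients alone. A sketch, using the hand-computed fact that for this $f$ (with the normalized measure $dx/2\pi$) the coefficients satisfy $|\hat f(0)|^2 = 1/4$, $|\hat f(n)|^2 = 1/(\pi^2 n^2)$ for odd $n$, and $0$ for even $n \neq 0$:

```python
import numpy as np

# Squared L2 error of the N-th partial sum = tail sum of |f_hat(n)|^2 over |n| > N,
# truncated at a large but finite number of terms for the numerical check.
def tail_error_sq(N, terms=2_000_000):
    n = np.arange(N + 1, N + 1 + terms)
    odd = n[n % 2 == 1]
    return 2.0 * np.sum(1.0 / (np.pi**2 * odd**2))

# Parseval sanity check: ||f||_2^2 = 1/2 = 1/4 + sum over all odd n
assert np.isclose(0.25 + tail_error_sq(0), 0.5, atol=1e-6)

# Squared error decays like 1/N, so the error norm decays like 1/sqrt(N)
e10, e1000 = tail_error_sq(10), tail_error_sq(1000)
assert np.isclose(e1000 / e10, 10 / 1000, rtol=0.1)
```

The same check done by sampling $S_N f$ on a grid gives matching numbers, but the coefficient form makes the Parseval bookkeeping exact.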

13. Connections & Further Reading

This is the third topic in Track 7 — Measure & Integration — and the third advanced topic in formalCalculus. Topics 25 and 26 built the framework and the integral; this topic built the function spaces. The combination is the foundation that the next several topics across measure theory, functional analysis, and probability rest on. The Riesz–Fischer completeness theorem proved here is what makes everything downstream possible: without completeness of $L^p$, there is no Banach space theory, no Hilbert space theory, no Radon–Nikodym theorem, no functional-analytic optimization. Topic 27 is the smallest topic that the rest of modern analysis depends on.

The central editorial pivot in this topic was from individual functions to spaces of functions. Topic 26 asked “does this function have a finite integral?” and Topic 27 asks “how far apart are these two functions?” Both questions are about measurable functions, but the second one requires organizing functions into a vector space with a norm — and that organization is exactly what makes function-space methods feel like linear algebra. Once $L^p$ is in place, the analogies between $\mathbb{R}^n$ and function spaces (vectors / functions, dot products / integrals, projections / regression, completeness / convergence guarantees) become literal identities rather than loose metaphors.

Within formalCalculus:

  • The Lebesgue Integral — The $L^p$ norm $\|f\|_p = (\int |f|^p \, d\mu)^{1/p}$ is a Lebesgue integral, and the Riesz–Fischer proof uses the Dominated Convergence Theorem from Topic 26 as its closing tool. Without the convergence theorems from Topic 26, the Riesz–Fischer proof would not close.
  • Sigma-Algebras & Measures — Measurable functions, the a.e.-equivalence relation, simple functions, and the simple-function approximation theorem are all from Topic 25. The density-of-simple-functions theorem in Section 8 is a direct application of the Topic 25 simple-function approximation result.
  • Completeness & Compactness — The Riesz–Fischer subsequence-extraction technique (extract a fast-decaying subsequence, sum the differences, recover an a.e. limit) is the same kind of move as the Bolzano–Weierstrass extraction from Topic 3, applied in function space rather than in $\mathbb{R}$.
  • Fourier Series — The $L^2$ convergence of Fourier series (Example 14) is a special case of the general orthonormal-system theory that we previewed in Section 9. Topic 22’s geometric statements about Fourier series being “basis expansions” are made rigorous by the Hilbert space structure of $L^2$ developed here.

Successor topics within formalCalculus:

  • Radon-Nikodym & Probability Densities — Densities $d\nu / d\mu$ live in $L^1(\mu)$. The Radon-Nikodym proof uses $L^2$ projection onto closed subspaces of measure differences — an $L^p$ technique applied to a measure-theoretic problem.
  • Normed & Banach Spaces — $L^p$ is the canonical example of a Banach space (a complete normed vector space). Topic 30 develops the abstract theory — bounded operators, operator norms, the Baire Category Theorem and its three consequences, dual spaces — that this topic’s concrete $L^p$ results exemplify.
  • Inner Product & Hilbert Spaces — $L^2$ is the canonical Hilbert space. Topic 31 develops orthogonal projection, the Riesz representation theorem (the general version of Section 10’s special case), and the spectral theorem — all of which generalize properties we saw in $L^2$ here.

Forward to formalml.com:

  • Regression — Least-squares regression is exactly $L^2$ projection onto a closed subspace. The existence and uniqueness of the OLS estimator follows from the projection theorem previewed in Section 9. The normal equations are the orthogonality condition.
  • Fourier Neural Operators — FNOs learn mappings $L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ between function spaces. The completeness of $L^2$ from Riesz–Fischer is what makes the optimization well-posed.
  • Score Matching — Score matching minimizes the $L^2(p_{\text{data}})$ distance between model and data score functions. The well-posedness of the objective requires both scores to live in $L^2$.
  • Wasserstein Distances — The Kantorovich dual formulation of Wasserstein distances uses the $(L^p)^* = L^q$ duality from Section 10. The Lipschitz constraint on the dual side describes a subset of $L^\infty$.
  • Kernel Methods — Reproducing kernel Hilbert spaces (RKHS) embed into $L^2$ under mild conditions on the kernel. Kernel evaluations are inner products, and the representer theorem is a finite-dimensional projection within an infinite-dimensional Hilbert space — making essential use of the completeness guaranteed by Riesz–Fischer.

References:

  • Royden, H. L. & Fitzpatrick, P. M. Real Analysis (4th ed., 2010), Chapter 7 ($L^p$ spaces on $\mathbb{R}$) and Chapter 19 (general measure spaces). Closest match to our exposition order, with the same seminorm-then-quotient construction as in Section 2.
  • Folland, G. B. Real Analysis: Modern Techniques and Their Applications (2nd ed., 1999), Chapter 6. Concise and rigorous treatment of Hölder, Minkowski, completeness, density, and duality. Reference for the duality proof sketch in Section 10.
  • Brezis, H. Functional Analysis, Sobolev Spaces and Partial Differential Equations (2011), Chapter 4. Excellent for the density results and the $\sigma$-finiteness condition in the dual space characterization. Brezis is also where many readers first see the $L^p$/Sobolev space theory in the form ML practitioners actually use it.
  • Stein, E. M. & Shakarchi, R. Real Analysis: Measure Theory, Integration, and Hilbert Spaces (2005), Chapter 2. Elegant treatment that connects $L^p$ theory directly to the Hilbert space theory of $L^2$. The geometric instinct in Stein & Shakarchi is closest to the philosophy of this curriculum.
