Functional Analysis · advanced · 55 min read

Inner Product & Hilbert Spaces

Adding angles and orthogonality to Banach spaces — from the projection theorem and Riesz representation through orthonormal bases and the spectral theorem to RKHS, kernel methods, and Gaussian processes.

Abstract. A normed space tells you how far apart points are, but not the angle between directions. An inner product adds angles, orthogonality, and projection — and these three geometric ideas unlock results of extraordinary power. The projection theorem guarantees that every point in a Hilbert space has a unique closest point in any closed convex subset. The Riesz representation theorem identifies a Hilbert space with its dual, collapsing the distinction between vectors and linear functionals. The spectral theorem decomposes compact self-adjoint operators into eigenspaces. And reproducing kernel Hilbert spaces provide the mathematical framework for every kernel method in machine learning — from SVMs through Gaussian processes to neural tangent kernels.

Where this leads → formalML

  • formalML RKHS is a Hilbert space; the representer theorem is a consequence of the Riesz representation theorem. Kernel evaluation is inner-product evaluation via the reproducing property.
  • formalML Proximal operators are defined via orthogonal projections in Hilbert spaces. The Riesz identification lets us interpret gradients as descent directions.
  • formalML Gaussian processes operate in RKHS. The posterior mean is a projection onto a finite-dimensional subspace spanned by the kernel evaluations at observed data points.
  • formalML The spectral theorem for compact self-adjoint operators is the foundation of PCA, kernel PCA, and spectral clustering.
  • formalML Hilbert-space gradient descent converges because the Riesz identification turns dual-space gradients into primal-space descent directions — no dual-space bookkeeping needed.

1. Overview and Motivation

In a reproducing kernel Hilbert space, evaluating a function $f$ at a point $x$ is an inner product: $f(x) = \langle f, K_x \rangle$. This single identity — the reproducing property — is the engine behind SVMs, Gaussian processes, and kernel PCA. It works because $H$ is a Hilbert space, not just a Banach space. And it works because the inner product gives us something a norm alone cannot: angles.

A norm measures length. An inner product measures both length and angle. Angle gives you orthogonality; orthogonality gives you projection; projection gives you everything.

The arc of Track 8 is a staircase of abstraction, where each step adds one axiom and gains strictly stronger conclusions:

  1. Metric spaces (Topic 29): distances → completeness → fixed-point theorems.
  2. Normed/Banach spaces (Topic 30): length → bounded operators → Baire, UBP, Open Mapping, Closed Graph.
  3. Inner product/Hilbert spaces (this topic): angles → orthogonality → projection theorem, Riesz representation, spectral decomposition, RKHS.

Topic 31 is where the reader adds one more axiom and gains an enormous amount of power. Where Topic 30 was algebraic machinery — Baire-powered theorems, careful epsilon-management — this topic is geometric machinery. The proofs are more visual. The projection theorem is a “closest point” argument. The Riesz representation theorem is a “find the unique vector representing a functional” argument. The spectral theorem is an eigenvalue-decomposition argument. At every turn, the geometry of orthogonality does the heavy lifting.


2. Inner Product Spaces

📐 Definition 1 (Inner Product)

An inner product on a real vector space $H$ is a function $\langle \cdot, \cdot \rangle : H \times H \to \mathbb{R}$ satisfying:

  1. Symmetry: $\langle x, y \rangle = \langle y, x \rangle$ for all $x, y \in H$.
  2. Linearity in the first argument: $\langle \alpha x + \beta y, z \rangle = \alpha \langle x, z \rangle + \beta \langle y, z \rangle$ for all $x, y, z \in H$ and scalars $\alpha, \beta$.
  3. Positive definiteness: $\langle x, x \rangle \geq 0$ for all $x \in H$, with equality if and only if $x = 0$.

Every inner product induces a norm $\|x\| = \sqrt{\langle x, x \rangle}$ and hence a metric $d(x, y) = \|x - y\|$. A vector space equipped with an inner product is called an inner product space (or pre-Hilbert space).

The inner product encodes geometry: the angle between nonzero vectors $x$ and $y$ is $\theta = \arccos\!\left(\frac{\langle x, y \rangle}{\|x\|\,\|y\|}\right)$, and two vectors are orthogonal when $\langle x, y \rangle = 0$. This is the geometric structure that a norm lacks.

📝 Example 1 (Euclidean Inner Product on ℝⁿ)

The standard inner product on $\mathbb{R}^n$ is the dot product:

$$\langle x, y \rangle = \sum_{i=1}^n x_i y_i$$

This induces the Euclidean norm $\|x\|_2 = \sqrt{\sum x_i^2}$. The angle formula recovers the geometric angle between vectors in $\mathbb{R}^2$ and $\mathbb{R}^3$ that we know from linear algebra.

📝 Example 2 (The L² Inner Product)

On the space of square-integrable functions $L^2[a,b]$:

$$\langle f, g \rangle = \int_a^b f(x) \, g(x) \, dx$$

This is the inner product that makes Fourier analysis work (Topic 22). The induced norm is $\|f\|_2 = \sqrt{\int |f|^2}$, which matches the $L^2$ norm from Topic 27. Functions are “orthogonal” when their pointwise product integrates to zero — $\sin(nx)$ and $\cos(mx)$ are orthogonal on $[-\pi, \pi]$ precisely because $\int_{-\pi}^\pi \sin(nx)\cos(mx)\,dx = 0$.

📝 Example 3 (The ℓ² Inner Product)

On the sequence space $\ell^2 = \{(x_n) : \sum |x_n|^2 < \infty\}$:

$$\langle x, y \rangle = \sum_{n=1}^\infty x_n y_n$$

The series converges absolutely by Cauchy-Schwarz (Theorem 1 below). This is the discrete analog of the $L^2$ inner product — it is $L^2(\mathbb{N}, \text{counting measure})$.

📝 Example 4 (Weighted Inner Products)

Given positive weights $w_1, \ldots, w_n > 0$, the weighted inner product on $\mathbb{R}^n$ is:

$$\langle x, y \rangle_w = \sum_{i=1}^n w_i x_i y_i$$

The induced norm $\|x\|_w = \sqrt{\sum w_i x_i^2}$ stretches the geometry along coordinate axes. In machine learning, the Mahalanobis distance uses a weighted inner product $\langle x, y \rangle_\Sigma = x^T \Sigma^{-1} y$, where $\Sigma$ is a covariance matrix.
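A minimal numerical sketch of the two weighted inner products above. The weight vector `w` and the positive-definite matrix `Sigma` are illustrative values, not data from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Weighted inner product <x, y>_w = sum_i w_i x_i y_i with positive weights.
def weighted_inner(x, y, w):
    return float(np.sum(w * x * y))

w = np.array([1.0, 4.0, 0.25])          # illustrative positive weights
x = rng.standard_normal(3)
y = rng.standard_normal(3)

# Induced norm ||x||_w = sqrt(<x, x>_w).
norm_w = np.sqrt(weighted_inner(x, x, w))

# Mahalanobis-style inner product <x, y>_Sigma = x^T Sigma^{-1} y,
# with Sigma an illustrative symmetric positive-definite matrix.
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
mahalanobis_ip = float(x @ np.linalg.solve(Sigma, y))
```

Because $\Sigma^{-1}$ is symmetric, the Mahalanobis inner product inherits the symmetry axiom automatically.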

💡 Remark 1 (Banach Space Vocabulary Refresh)

If you completed Topic 30 some time ago, here is a brief refresher of the key definitions we will need. A normed space is a vector space with a norm satisfying positive definiteness, homogeneity, and the triangle inequality (Topic 30, Definition 1). A Banach space is a complete normed space (Topic 30, Definition 2). A bounded linear operator $T: X \to Y$ satisfies $\|Tx\|_Y \leq C\|x\|_X$ for some constant $C$ (Topic 30, Definition 3). The dual space $X^*$ is the space of all bounded linear functionals $\varphi: X \to \mathbb{R}$ (Topic 30, Section 10). Every inner product space is a normed space (via $\|x\| = \sqrt{\langle x, x \rangle}$), and the Hilbert-space inner product will simplify the dual-space machinery of Topic 30 dramatically.

Inner product geometry: signed projection, Cauchy-Schwarz cone, angle between vectors
The inner product ⟨x, y⟩ encodes both length (‖x‖ = √⟨x, x⟩) and angle (cos θ = ⟨x, y⟩ / ‖x‖‖y‖). Left: inner product as signed projection. Center: the Cauchy-Schwarz inequality constrains ⟨x, y⟩ to lie between −‖x‖‖y‖ and ‖x‖‖y‖. Right: orthogonality when ⟨x, y⟩ = 0.

3. Cauchy-Schwarz and the Parallelogram Law

These are the two fundamental identities that govern inner product spaces. Every further result in this topic depends on one or both of them.

🔷 Theorem 1 (Cauchy-Schwarz Inequality)

For any vectors $x, y$ in an inner product space:

$$|\langle x, y \rangle| \leq \|x\| \cdot \|y\|$$

Equality holds if and only if $x$ and $y$ are linearly dependent.

Proof.

If $y = 0$, both sides are zero. Assume $y \neq 0$. For any $t \in \mathbb{R}$, the positive definiteness of the inner product gives:

$$0 \leq \langle x - ty, x - ty \rangle = \|x\|^2 - 2t\langle x, y \rangle + t^2 \|y\|^2$$

This is a quadratic in $t$ that is non-negative for all $t$, so its discriminant is non-positive:

$$4\langle x, y \rangle^2 - 4\|x\|^2 \|y\|^2 \leq 0$$

which gives $|\langle x, y \rangle| \leq \|x\| \cdot \|y\|$.

For the equality case: if $x = ty$ for some scalar $t$, then $|\langle x, y \rangle| = |t| \, \|y\|^2 = \|x\| \cdot \|y\|$. Conversely, equality in the discriminant means the quadratic has a real root $t_0$, so $\|x - t_0 y\|^2 = 0$, hence $x = t_0 y$.

Cauchy-Schwarz is the engine that makes the angle formula well-defined: it guarantees $\frac{|\langle x, y \rangle|}{\|x\| \, \|y\|} \leq 1$, so the arccosine is always defined.
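A quick numerical sanity check of Cauchy-Schwarz and its equality case, on random vectors (the dimension and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# |<x, y>| <= ||x|| ||y|| on many random pairs.
for _ in range(1000):
    x = rng.standard_normal(5)
    y = rng.standard_normal(5)
    assert abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y) + 1e-12

# Equality case: y = t x (linearly dependent) gives |<x, y>| = ||x|| ||y||.
x = rng.standard_normal(5)
y = -3.0 * x
lhs = abs(x @ y)
rhs = np.linalg.norm(x) * np.linalg.norm(y)

# The angle formula is well-defined: the ratio lies in [-1, 1].
u, v = rng.standard_normal(5), rng.standard_normal(5)
cos_theta = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
```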

🔷 Theorem 2 (Parallelogram Law)

In any inner product space:

$$\|x + y\|^2 + \|x - y\|^2 = 2(\|x\|^2 + \|y\|^2)$$

for all vectors $x, y$.

Proof.

We expand both squared norms using the inner product:

$$\|x + y\|^2 = \langle x + y, x + y \rangle = \|x\|^2 + 2\langle x, y \rangle + \|y\|^2$$

$$\|x - y\|^2 = \langle x - y, x - y \rangle = \|x\|^2 - 2\langle x, y \rangle + \|y\|^2$$

Adding these two equations, the cross terms $\pm 2\langle x, y \rangle$ cancel:

$$\|x + y\|^2 + \|x - y\|^2 = 2\|x\|^2 + 2\|y\|^2$$

💡 Remark 2 (The Parallelogram Law Characterizes Inner Product Norms)

The parallelogram law is not just a consequence of inner products — it characterizes them. A normed space $(X, \|\cdot\|)$ has a norm induced by an inner product if and only if the parallelogram law holds. When it does, the polarization identity recovers the inner product from the norm:

$$\langle x, y \rangle = \frac{1}{4}\left(\|x + y\|^2 - \|x - y\|^2\right)$$

This is why $\ell^p$ for $p \neq 2$ is a Banach space but not a Hilbert space: the $\ell^1$ and $\ell^\infty$ norms violate the parallelogram law. The $\ell^2$ norm is the only $\ell^p$ norm that comes from an inner product.
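A minimal sketch of this characterization: compute the parallelogram "gap" for the $\ell^2$, $\ell^1$, and $\ell^\infty$ norms on a random pair of vectors (dimension and seed are arbitrary), and recover the inner product via polarization:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(6)
y = rng.standard_normal(6)

def parallelogram_gap(norm):
    """||x+y||^2 + ||x-y||^2 - 2(||x||^2 + ||y||^2) for a given norm."""
    return (norm(x + y) ** 2 + norm(x - y) ** 2
            - 2 * (norm(x) ** 2 + norm(y) ** 2))

gap_l2 = parallelogram_gap(lambda v: np.linalg.norm(v, 2))
gap_l1 = parallelogram_gap(lambda v: np.linalg.norm(v, 1))
gap_linf = parallelogram_gap(lambda v: np.linalg.norm(v, np.inf))

# Polarization identity: recover <x, y> from the l2 norm alone.
ip_polar = 0.25 * (np.linalg.norm(x + y) ** 2 - np.linalg.norm(x - y) ** 2)
```

The $\ell^2$ gap vanishes identically; the $\ell^1$ and $\ell^\infty$ gaps are nonzero for generic vectors, witnessing that those norms come from no inner product.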

Parallelogram law satisfied in ℓ² but violated in ℓ¹ and ℓ∞
The parallelogram law ‖x+y‖² + ‖x−y‖² = 2(‖x‖² + ‖y‖²) holds in ℓ² (left) but fails in ℓ¹ (center) and ℓ∞ (right). Only norms satisfying this law come from inner products.

4. Hilbert Spaces

📐 Definition 2 (Hilbert Space)

A Hilbert space is a complete inner product space — an inner product space $H$ in which every Cauchy sequence converges (in the norm induced by the inner product).

Equivalently: $H$ is a Banach space whose norm satisfies the parallelogram law.

📝 Example 5 (Catalog of Hilbert Spaces)

The following are Hilbert spaces (complete inner product spaces):

  • $\mathbb{R}^n$ with the standard inner product — finite-dimensional, hence automatically complete.
  • $\ell^2$ with $\langle x, y \rangle = \sum x_n y_n$ — completeness is the Riesz-Fischer theorem (Topic 27, Theorem 5) for counting measure.
  • $L^2(\mu)$ with $\langle f, g \rangle = \int fg \, d\mu$ — completeness is Riesz-Fischer for a general measure.
  • $L^2[-\pi, \pi]$ — the Hilbert space where Fourier analysis lives (Topic 22).

The following are inner product spaces that are not Hilbert spaces (not complete):

  • $C[0,1]$ with $\langle f, g \rangle = \int_0^1 fg \, dx$ — not complete under this inner product because Cauchy sequences of continuous functions can converge to discontinuous $L^2$ functions.
  • The space of polynomials on $[0,1]$ with the $L^2$ inner product — not complete because polynomials are dense in $L^2$ but do not exhaust it.

💡 Remark 3 (Why Hilbert Spaces Are Special)

Every Hilbert space is a Banach space. The converse is emphatically false: $\ell^1$, $\ell^\infty$, $C[0,1]$ with the sup-norm, and $L^p$ for $p \neq 2$ are all Banach but not Hilbert — their norms violate the parallelogram law. The distinction matters because the inner product gives us orthogonality, and orthogonality gives us the projection theorem (Theorem 4), which has no analog in general Banach spaces. The Baire-powered theorems from Topic 30 (Uniform Boundedness, Open Mapping, Closed Graph) still apply — a Hilbert space is a Banach space, so all Banach-space theorems carry over — but the inner product adds an entire geometric toolkit on top.


5. Orthogonality

📐 Definition 3 (Orthogonality and Orthogonal Complement)

Two vectors $x, y \in H$ are orthogonal (written $x \perp y$) if $\langle x, y \rangle = 0$.

For a subset $M \subseteq H$, the orthogonal complement is:

$$M^\perp = \{x \in H : \langle x, m \rangle = 0 \text{ for all } m \in M\}$$

$M^\perp$ is always a closed subspace of $H$, regardless of whether $M$ itself is a subspace.

The Pythagorean theorem extends to inner product spaces: if $x \perp y$, then $\|x + y\|^2 = \|x\|^2 + \|y\|^2$. This is the geometric identity that the projection theorem exploits.

🔷 Theorem 3 (Orthogonal Decomposition)

Let $M$ be a closed subspace of a Hilbert space $H$. Then every $x \in H$ can be written uniquely as:

$$x = m + m^\perp$$

where $m \in M$ and $m^\perp \in M^\perp$. In other words, $H = M \oplus M^\perp$ (orthogonal direct sum), and $(M^\perp)^\perp = M$.

This is one of the most powerful structural results in functional analysis. It says that a closed subspace of a Hilbert space has a complement — and not just any complement, but an orthogonal one. In general Banach spaces, closed subspaces need not have complements at all.

📝 Example 6 (Orthogonal Complement in ℝ³)

In $\mathbb{R}^3$ with the standard inner product, let $M$ be the $xy$-plane: $M = \{(x, y, 0) : x, y \in \mathbb{R}\}$. Then $M^\perp = \{(0, 0, z) : z \in \mathbb{R}\}$ — the $z$-axis. Every vector $(a, b, c) = (a, b, 0) + (0, 0, c)$ decomposes uniquely into its $M$ and $M^\perp$ components.

📝 Example 7 (Orthogonal Complement in L²)

In $L^2[-\pi, \pi]$, let $M = \{f \in L^2 : f \text{ is even}\}$ (even functions). Then $M^\perp = \{f \in L^2 : f \text{ is odd}\}$ (odd functions). Every $L^2$ function decomposes uniquely as $f = f_{\text{even}} + f_{\text{odd}}$, where $f_{\text{even}}(x) = \frac{f(x) + f(-x)}{2}$ and $f_{\text{odd}}(x) = \frac{f(x) - f(-x)}{2}$.
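The even/odd decomposition can be checked numerically on a symmetric grid over $[-\pi, \pi]$ (the test function and grid size are arbitrary choices; the integral is a simple Riemann-sum approximation):

```python
import numpy as np

# Symmetric grid over [-pi, pi]; reversing the array evaluates g at -x.
x = np.linspace(-np.pi, np.pi, 2001)
dx = x[1] - x[0]
f = np.exp(0.3 * x) + np.sin(2 * x)     # an arbitrary test function

f_even = 0.5 * (f + f[::-1])            # (f(x) + f(-x)) / 2
f_odd = 0.5 * (f - f[::-1])             # (f(x) - f(-x)) / 2

# Riemann-sum approximation of the L^2[-pi, pi] inner product.
def l2_inner(g, h):
    return float(np.sum(g * h) * dx)

recon_err = float(np.max(np.abs(f_even + f_odd - f)))   # exact decomposition
cross = l2_inner(f_even, f_odd)                          # ~0: even ⟂ odd
```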

Orthogonal projection in ℝ², ℝ³, and function space
Orthogonal projection: in ℝ² (left), ℝ³ (center), and a schematic for function spaces (right). The residual x − P_M(x) is always perpendicular to the subspace M.

6. The Projection Theorem

This is the central result of the topic. The projection theorem says: in a Hilbert space, every point has a unique closest point in any closed convex subset — and when the subset is a subspace, the closest point is the orthogonal projection.

🔷 Theorem 4 (Projection Theorem)

Let $H$ be a Hilbert space and $C \subseteq H$ a nonempty, closed, convex subset. For every $x \in H$, there exists a unique $p \in C$ such that:

$$\|x - p\| = \inf_{c \in C} \|x - c\| = d(x, C)$$

When $C = M$ is a closed subspace, the minimizer $p$ is characterized by the orthogonality condition: $\langle x - p, m \rangle = 0$ for all $m \in M$.

Proof.

We prove existence, uniqueness, and the orthogonality characterization.

Step 1. Setup. Let $d = \inf_{c \in C} \|x - c\|$. Choose a minimizing sequence $(c_n) \subset C$ with $\|x - c_n\| \to d$.

Step 2. The minimizing sequence is Cauchy. Apply the parallelogram law to $u = x - c_n$ and $v = x - c_m$:

$$\|u + v\|^2 + \|u - v\|^2 = 2(\|u\|^2 + \|v\|^2)$$

The left side involves $u + v = 2x - c_n - c_m$ and $u - v = c_m - c_n$. Since $C$ is convex, $\frac{c_n + c_m}{2} \in C$, so $\|x - \frac{c_n + c_m}{2}\| \geq d$, which gives $\|u + v\| = 2\|x - \frac{c_n + c_m}{2}\| \geq 2d$. Thus:

$$\|c_n - c_m\|^2 = 2\|x - c_n\|^2 + 2\|x - c_m\|^2 - \|2x - c_n - c_m\|^2$$

$$\leq 2\|x - c_n\|^2 + 2\|x - c_m\|^2 - 4d^2$$

Since $\|x - c_n\|^2 \to d^2$ and $\|x - c_m\|^2 \to d^2$, the right side $\to 2d^2 + 2d^2 - 4d^2 = 0$. So $(c_n)$ is Cauchy.

Step 3. Existence. Since $H$ is complete and $C$ is closed, the Cauchy sequence $(c_n)$ converges to some $p \in C$. Continuity of the norm gives $\|x - p\| = \lim \|x - c_n\| = d$.

Step 4. Uniqueness. If $p, q \in C$ both achieve the minimum $d$, apply the parallelogram law to $u = x - p$ and $v = x - q$:

$$\|p - q\|^2 = 2\|x - p\|^2 + 2\|x - q\|^2 - \|2x - p - q\|^2 \leq 2d^2 + 2d^2 - 4d^2 = 0$$

So $p = q$.

Step 5. Orthogonality (when $C = M$ is a subspace). Suppose $p \in M$ is the closest point to $x$. For any $m \in M$ and $t \in \mathbb{R}$, $p + tm \in M$, so:

$$\|x - p\|^2 \leq \|x - p - tm\|^2 = \|x - p\|^2 - 2t\langle x - p, m \rangle + t^2 \|m\|^2$$

This gives $0 \leq -2t\langle x - p, m \rangle + t^2\|m\|^2$ for all $t$. Taking $t = \frac{\langle x - p, m \rangle}{\|m\|^2}$ (assuming $m \neq 0$) gives $\langle x - p, m \rangle^2 \leq 0$, hence $\langle x - p, m \rangle = 0$.

The parallelogram law is essential in Step 2. In a Banach space without an inner product (where the parallelogram law fails), the closest point may not exist or may not be unique. This is the fundamental reason Hilbert spaces are geometrically better behaved than general Banach spaces.

📐 Definition 4 (Orthogonal Projection Operator)

Let $M$ be a closed subspace of a Hilbert space $H$. The orthogonal projection $P_M : H \to M$ maps each $x \in H$ to its unique closest point $P_M(x) \in M$. It satisfies:

  1. $P_M$ is linear and bounded with $\|P_M\| = 1$ (unless $M = \{0\}$).
  2. $P_M^2 = P_M$ (idempotent: projecting twice is the same as projecting once).
  3. $P_M^* = P_M$ (self-adjoint: $\langle P_M x, y \rangle = \langle x, P_M y \rangle$).
  4. $\ker(P_M) = M^\perp$ and $\operatorname{range}(P_M) = M$.
  5. $I - P_M = P_{M^\perp}$.

📝 Example 8 (Projection in ℝⁿ)

In $\mathbb{R}^n$, if $M$ is the span of an orthonormal set $\{e_1, \ldots, e_k\}$, then:

$$P_M(x) = \sum_{j=1}^k \langle x, e_j \rangle \, e_j$$

The residual is $x - P_M(x) = \sum_{j=k+1}^n \langle x, e_j \rangle \, e_j$ (where $\{e_{k+1}, \ldots, e_n\}$ extends to an ONB of $\mathbb{R}^n$). The Pythagorean theorem gives $\|x\|^2 = \|P_M(x)\|^2 + \|x - P_M(x)\|^2$.
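A minimal sketch of the projection formula in $\mathbb{R}^5$, with the orthonormal basis of a random 2-dimensional subspace obtained via QR factorization (the dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Columns of Q form an orthonormal basis for M = span of two random vectors.
A = rng.standard_normal((5, 2))
Q, _ = np.linalg.qr(A)

def project(x, Q):
    """P_M(x) = sum_j <x, e_j> e_j, with e_j the columns of Q."""
    return Q @ (Q.T @ x)

x = rng.standard_normal(5)
p = project(x, Q)
residual = x - p

# Residual ⟂ M, Pythagoras, and idempotence P_M^2 = P_M.
orth_gap = float(np.max(np.abs(Q.T @ residual)))
pyth_gap = abs(np.linalg.norm(x) ** 2
               - (np.linalg.norm(p) ** 2 + np.linalg.norm(residual) ** 2))
idem_gap = float(np.max(np.abs(project(p, Q) - p)))
```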

📝 Example 9 (Best L² Approximation as Projection)

Finding the best $L^2$ approximation to a function $f$ by polynomials of degree $\leq n$ is exactly orthogonal projection onto the $(n+1)$-dimensional subspace $\Pi_n \subset L^2$. The error $f - P_{\Pi_n}(f)$ is orthogonal to every polynomial of degree $\leq n$. When we use the normalized Legendre polynomials as an orthonormal basis for $\Pi_n$ on $[-1,1]$, the projection formula becomes:

$$P_{\Pi_n}(f) = \sum_{k=0}^n \langle f, P_k \rangle \, P_k$$

This is best approximation in the $L^2$ sense, developed abstractly in Topic 20 and now revealed as a special case of the projection theorem.


The projection PMx is the closest point in M to x. The residual x − PMx is perpendicular to M — this is the geometric content of the projection theorem.

Projection theorem: closest point, variational characterization, projection operator
The projection theorem in three views: closest point in a convex set (left), the variational characterization via orthogonality (center), and the projection operator P_M as a linear map (right).

7. Orthonormal Bases

📐 Definition 5 (Orthonormal System and Orthonormal Basis)

An orthonormal system (ONS) in $H$ is a set $\{e_\alpha\}_{\alpha \in A}$ such that $\langle e_\alpha, e_\beta \rangle = \delta_{\alpha\beta}$ (each vector has unit norm and distinct vectors are orthogonal).

An orthonormal basis (ONB) is a maximal orthonormal system — equivalently, an ONS $\{e_\alpha\}$ such that $\overline{\operatorname{span}}\{e_\alpha\} = H$ (the closed linear span is all of $H$).

An ONB is not a Hamel basis (which requires every vector to be a finite linear combination). An ONB allows infinite linear combinations — series that converge in the norm.

🔷 Theorem 5 (Gram-Schmidt Orthonormalization)

Given a finite or countably infinite linearly independent set $\{v_1, v_2, \ldots\}$ in an inner product space, the Gram-Schmidt process produces an orthonormal set $\{e_1, e_2, \ldots\}$ with the property that $\operatorname{span}\{e_1, \ldots, e_k\} = \operatorname{span}\{v_1, \ldots, v_k\}$ for every $k$.

The algorithm: set $u_1 = v_1$, $e_1 = u_1/\|u_1\|$. For $k \geq 2$:

$$u_k = v_k - \sum_{j=1}^{k-1} \langle v_k, e_j \rangle \, e_j, \quad e_k = \frac{u_k}{\|u_k\|}$$

Each $u_k$ is obtained by subtracting the projection of $v_k$ onto the span of $\{e_1, \ldots, e_{k-1}\}$. What remains is the component of $v_k$ orthogonal to the previous basis vectors.
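The algorithm above translates directly into code. A minimal sketch (the input vectors are illustrative):

```python
import numpy as np

def gram_schmidt(vectors):
    """Classical Gram-Schmidt: orthonormalize a linearly independent list.

    u_k = v_k - sum_j <v_k, e_j> e_j,  e_k = u_k / ||u_k||.
    """
    basis = []
    for v in vectors:
        u = np.asarray(v, dtype=float).copy()
        for e in basis:                       # subtract projections onto e_1..e_{k-1}
            u -= np.dot(v, e) * e
        norm = np.linalg.norm(u)
        if norm < 1e-12:
            raise ValueError("vectors are linearly dependent")
        basis.append(u / norm)
    return np.array(basis)

V = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
E = gram_schmidt(V)          # rows e_1, e_2, e_3

gram = E @ E.T               # Gram matrix: should be the identity
```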


Gram-Schmidt subtracts the projection of each new vector onto the already-orthonormalized set, then normalizes. Orange dashed lines show the projections being removed.

🔷 Theorem 6 (Bessel's Inequality)

Let $\{e_k\}_{k=1}^\infty$ be an orthonormal system in a Hilbert space $H$. For every $x \in H$:

$$\sum_{k=1}^\infty |\langle x, e_k \rangle|^2 \leq \|x\|^2$$

The coefficients $c_k = \langle x, e_k \rangle$ are called the Fourier coefficients of $x$ with respect to the ONS.

Proof.

For any finite $N$, let $P_N x = \sum_{k=1}^N \langle x, e_k \rangle \, e_k$ be the projection of $x$ onto $\operatorname{span}\{e_1, \ldots, e_N\}$. By orthogonality of the residual:

$$\|x\|^2 = \|P_N x\|^2 + \|x - P_N x\|^2 \geq \|P_N x\|^2$$

Now $\|P_N x\|^2 = \sum_{k=1}^N |\langle x, e_k \rangle|^2$ (because the $e_k$ are orthonormal). So:

$$\sum_{k=1}^N |\langle x, e_k \rangle|^2 \leq \|x\|^2$$

This holds for every $N$, so the series $\sum_{k=1}^\infty |\langle x, e_k \rangle|^2$ converges and is bounded by $\|x\|^2$.

🔷 Theorem 7 (Parseval's Identity)

Let $\{e_k\}_{k=1}^\infty$ be an orthonormal basis for a Hilbert space $H$. Then for every $x \in H$:

$$\sum_{k=1}^\infty |\langle x, e_k \rangle|^2 = \|x\|^2$$

Equivalently, $x = \sum_{k=1}^\infty \langle x, e_k \rangle \, e_k$ (convergence in norm).

Proof.

Bessel’s inequality gives $\leq$. For the reverse, since $\{e_k\}$ is an ONB, the closed span of $\{e_k\}$ is all of $H$. So for every $\epsilon > 0$, there exist $N$ and scalars $a_1, \ldots, a_N$ with $\|x - \sum_{k=1}^N a_k e_k\| < \epsilon$.

The projection $P_N x = \sum_{k=1}^N \langle x, e_k \rangle \, e_k$ minimizes the distance from $x$ to $\operatorname{span}\{e_1, \ldots, e_N\}$ (by the projection theorem), so $\|x - P_N x\| \leq \|x - \sum a_k e_k\| < \epsilon$.

Therefore $\|x - P_N x\|^2 = \|x\|^2 - \sum_{k=1}^N |\langle x, e_k \rangle|^2 < \epsilon^2$, and since $\epsilon$ is arbitrary, $\sum_{k=1}^\infty |\langle x, e_k \rangle|^2 = \|x\|^2$.

Parseval’s identity is a conservation law: it says the “energy” of $x$ is fully captured by its Fourier coefficients. For $L^2[-\pi, \pi]$ with the trigonometric ONB, Parseval becomes $\frac{1}{\pi}\int_{-\pi}^\pi |f|^2 = \frac{a_0^2}{2} + \sum_{n=1}^\infty (a_n^2 + b_n^2)$ — the concrete version from Topic 22.
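Bessel and Parseval are easy to see in finite dimensions, where QR factorization of a random matrix supplies an orthonormal basis (the dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)

# Rows of `basis` form a full orthonormal basis of R^8.
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
basis = Q.T

x = rng.standard_normal(8)
coeffs = basis @ x                      # Fourier coefficients <x, e_k>

# Bessel: a partial ONS (first 3 vectors) captures at most the full energy.
bessel_partial = float(np.sum(coeffs[:3] ** 2))

# Parseval: the full ONB captures exactly the energy, and x is recovered
# from its coefficients: x = sum_k <x, e_k> e_k.
parseval_sum = float(np.sum(coeffs ** 2))
x_recon = basis.T @ coeffs
energy = float(np.linalg.norm(x) ** 2)
```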

🔷 Proposition 1 (Separable Hilbert Spaces Have Countable ONBs)

A Hilbert space $H$ is separable (has a countable dense subset) if and only if it admits a countable orthonormal basis. Every separable infinite-dimensional Hilbert space is isometrically isomorphic to $\ell^2$.

📝 Example 10 (Fourier Basis as an Orthonormal Basis)

The trigonometric system $\{1/\sqrt{2\pi},\ \cos(nx)/\sqrt{\pi},\ \sin(nx)/\sqrt{\pi}\}_{n \geq 1}$ is an orthonormal basis for $L^2[-\pi, \pi]$. Parseval’s identity in this basis is the classical result that the Fourier coefficients capture all the $L^2$ energy of the function. This concrete Fourier theory from Topic 22 is a special case of the abstract Hilbert space theory developed here.

💡 Remark 4 (All Separable Hilbert Spaces Are the Same)

Proposition 1 says something remarkable: up to isometric isomorphism, there is only one separable infinite-dimensional Hilbert space — $\ell^2$. Whether you work with $L^2[0,1]$, $L^2(\mathbb{R})$ (with Lebesgue measure), or any other separable Hilbert space, the abstract structure is the same. The isomorphism maps any ONB to the standard basis of $\ell^2$. This universality is unique to Hilbert spaces — for Banach spaces, $L^p$ and $\ell^p$ are genuinely different spaces when $p \neq 2$.

Gram-Schmidt orthonormalization step by step in ℝ³
Gram-Schmidt orthonormalization in ℝ³: at each step, subtract the projection onto the existing orthonormal vectors (orange dashed), then normalize the residual.
Parseval's identity and Bessel's inequality
Parseval’s identity: the sum of squared Fourier coefficients equals ‖f‖². Bessel’s inequality with partial sums — the sum converges monotonically to ‖f‖².

8. The Riesz Representation Theorem

The Riesz representation theorem is the most important structural result about Hilbert spaces. It says that the dual space $H^*$ (all bounded linear functionals on $H$) is isometrically isomorphic to $H$ itself — every bounded linear functional is “represented” by a unique vector via the inner product.

🔷 Theorem 8 (Riesz Representation Theorem)

Let $H$ be a Hilbert space and $\varphi : H \to \mathbb{R}$ a bounded linear functional. Then there exists a unique $y \in H$ such that:

$$\varphi(x) = \langle x, y \rangle \quad \text{for all } x \in H$$

Moreover, $\|\varphi\|_{H^*} = \|y\|_H$. The map $\varphi \mapsto y$ is a linear isometric isomorphism $H^* \to H$ (in the real case; conjugate-linear in the complex case).

Proof.

Step 1. The kernel of $\varphi$. If $\varphi = 0$, take $y = 0$. Assume $\varphi \neq 0$. The kernel $M = \ker(\varphi) = \{x : \varphi(x) = 0\}$ is a closed subspace of $H$ (closed because $\varphi$ is continuous).

Step 2. The codimension-1 decomposition. Since $\varphi \neq 0$, $M \neq H$. By the orthogonal decomposition theorem (Theorem 3), $H = M \oplus M^\perp$. Since $\varphi$ is a nonzero functional, $\dim(M^\perp) = 1$ — pick any $z \in M^\perp$ with $z \neq 0$.

Step 3. Construct the representing vector. Set $y = \frac{\overline{\varphi(z)}}{\|z\|^2} z$ (in the real case, $\overline{\varphi(z)} = \varphi(z)$). For any $x \in H$, write $x = m + \alpha z$ where $m \in M$ and $\alpha = \frac{\varphi(x)}{\varphi(z)}$.

Then:

$$\langle x, y \rangle = \left\langle m + \alpha z, \frac{\varphi(z)}{\|z\|^2} z \right\rangle = \alpha \frac{\varphi(z)}{\|z\|^2} \|z\|^2 = \alpha \varphi(z) = \varphi(x)$$

Step 4. Uniqueness. If $\langle x, y_1 \rangle = \langle x, y_2 \rangle$ for all $x$, then $\langle x, y_1 - y_2 \rangle = 0$ for all $x$. Taking $x = y_1 - y_2$ gives $\|y_1 - y_2\|^2 = 0$, so $y_1 = y_2$.

Step 5. Isometry. $\|\varphi\| = \sup_{\|x\| \leq 1} |\langle x, y \rangle| = \|y\|$ by Cauchy-Schwarz, with equality achieved at $x = y/\|y\|$.
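In $\mathbb{R}^n$ the representing vector can be read off by evaluating the functional on the standard basis: $y_i = \varphi(e_i)$. A minimal sketch (the functional here is constructed from an arbitrary hidden vector purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6

# A bounded linear functional on R^n, built from a hidden vector y_true.
y_true = rng.standard_normal(n)
phi = lambda x: float(np.dot(x, y_true))

# Riesz in R^n: recover the representing vector y from phi alone,
# coordinate by coordinate, via y_i = phi(e_i).
y = np.array([phi(e) for e in np.eye(n)])

# Isometry: ||phi|| = ||y||, attained at x = y / ||y||.
attained = phi(y / np.linalg.norm(y))
```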

💡 Remark 5 (Contrast with Banach-Space Duality)

In Topic 30, computing the dual space $X^*$ required the full Hahn-Banach machinery, and the dual of $L^p$ required the Radon-Nikodym theorem (Topic 28). For Hilbert spaces, the Riesz representation theorem replaces all of this with a single, geometric argument based on the projection theorem. The dual of $H$ is $H$ itself — there is no separate “dual space” to worry about. This is why optimization in Hilbert spaces (gradient descent, proximal methods) is fundamentally simpler than in general Banach spaces (mirror descent, Fenchel conjugates) — the gradient is a vector in the same space, not an element of a dual space that needs to be mapped back.

🔷 Corollary 1 (Hilbert Spaces Are Reflexive)

Every Hilbert space is reflexive: the canonical embedding $J: H \to H^{**}$ is surjective. This follows immediately from the Riesz representation theorem applied twice: $H \cong H^*$ and $H^* \cong H^{**}$.

Reflexivity was an open question for general Banach spaces in Topic 30 (Remark 7). For Hilbert spaces, it is an immediate corollary.

Riesz representation: functional as hyperplane, representing vector, H* ≅ H
The Riesz representation theorem: a bounded linear functional φ defines a hyperplane ker(φ) (left), the representing vector y is the normal to this hyperplane (center), and the identification H* ≅ H collapses the distinction between vectors and functionals (right).

9. Compact Operators and the Spectral Theorem

📐 Definition 6 (Compact Operator)

A bounded linear operator $T: H \to H$ is compact if it maps every bounded set to a relatively compact set (a set whose closure is compact). Equivalently, $T$ maps bounded sequences to sequences with convergent subsequences.

Compact operators are the “almost finite-dimensional” operators. They can be approximated by finite-rank operators. In finite dimensions, every linear operator is compact.

📐 Definition 7 (Self-Adjoint Operator)

A bounded linear operator $T: H \to H$ is self-adjoint (or symmetric) if:

$$\langle Tx, y \rangle = \langle x, Ty \rangle \quad \text{for all } x, y \in H$$

Self-adjoint operators are the infinite-dimensional generalization of symmetric matrices. Their eigenvalues are real, and eigenvectors for distinct eigenvalues are orthogonal.

🔷 Theorem 9 (Spectral Theorem for Compact Self-Adjoint Operators)

Let $T: H \to H$ be a compact self-adjoint operator on a Hilbert space $H$. Then:

  1. The eigenvalues of $T$ are real and form a sequence $(\lambda_n)$ converging to $0$ (or there are only finitely many).
  2. Eigenvectors corresponding to distinct eigenvalues are orthogonal.
  3. $H = \ker(T) \oplus \overline{\bigoplus_n E_{\lambda_n}}$, where $E_{\lambda_n} = \ker(T - \lambda_n I)$ are the eigenspaces.
  4. For every $x \in H$:

$$Tx = \sum_n \lambda_n \langle x, e_n \rangle \, e_n$$

where $\{e_n\}$ is an orthonormal basis of eigenvectors for $(\ker T)^\perp$, with $Te_n = \lambda_n e_n$.

Proof.

Step 1. Eigenvalues are real. If Tx=λxTx = \lambda x with x0x \neq 0, then λx2=Tx,x=x,Tx=λx2\lambda \|x\|^2 = \langle Tx, x \rangle = \langle x, Tx \rangle = \overline{\lambda} \|x\|^2, so λ=λR\lambda = \overline{\lambda} \in \mathbb{R}.

Step 2. Eigenvectors for distinct eigenvalues are orthogonal. If Tx=λxTx = \lambda x and Ty=μyTy = \mu y with λμ\lambda \neq \mu, then λx,y=Tx,y=x,Ty=μx,y\lambda \langle x, y \rangle = \langle Tx, y \rangle = \langle x, Ty \rangle = \mu \langle x, y \rangle, so (λμ)x,y=0(\lambda - \mu)\langle x, y \rangle = 0, giving x,y=0\langle x, y \rangle = 0.

Step 3. The first eigenvalue exists via the Rayleigh quotient. Define T=supx=1Tx,x\|T\| = \sup_{\|x\|=1} |\langle Tx, x \rangle| (which equals the operator norm for self-adjoint TT). There exists a sequence (xn)(x_n) on the unit sphere with Txn,xnT|\langle Tx_n, x_n \rangle| \to \|T\|. By compactness, (Txn)(Tx_n) has a convergent subsequence. One shows the subsequence of (xn)(x_n) converges to an eigenvector e1e_1 with eigenvalue λ1=±T\lambda_1 = \pm\|T\|.

Step 4. Induction on orthogonal complements. The subspace {e1}\{e_1\}^\perp is invariant under TT (because TT is self-adjoint). Restrict TT to {e1}\{e_1\}^\perp — the restriction is still compact and self-adjoint — and apply Step 3 to find λ2,e2\lambda_2, e_2. Continue inductively.

Step 5. Eigenvalues converge to 0. If the eigenvalues $(\lambda_n)$ do not converge to $0$, there exist $\epsilon > 0$ and a subsequence with $|\lambda_{n_k}| \geq \epsilon$. The eigenvectors $e_{n_k}$ are orthonormal, so $\|Te_{n_k} - Te_{n_j}\|^2 = \lambda_{n_k}^2 + \lambda_{n_j}^2 \geq 2\epsilon^2$ — the sequence $(Te_{n_k})$ has no convergent subsequence, contradicting compactness of $T$.
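The decay in Step 5 is easy to observe numerically. A minimal sketch, under illustrative assumptions: we discretize the integral operator $(Tf)(x) = \int_0^1 K(x,y)\,f(y)\,dy$ with a Gaussian kernel (the grid size and kernel width are arbitrary choices, not from the text). The resulting symmetric matrix has a real spectrum whose tail collapses toward $0$, exactly as parts 1 and 5 of Theorem 9 predict.

```python
import numpy as np

# Discretize (Tf)(x) = ∫₀¹ K(x, y) f(y) dy with a Gaussian kernel.
# This operator is compact and self-adjoint; its matrix discretization
# is symmetric, so the spectrum is real.
n = 200
x = np.linspace(0.0, 1.0, n)
h = 1.0 / n                                       # quadrature weight
T = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1) * h

# eigvalsh: real eigenvalues of a symmetric matrix, ascending order.
eigs = np.linalg.eigvalsh(T)[::-1]                # sort descending
print(eigs[:4])                                   # a few dominant eigenvalues
print(eigs[30])                                   # tail is already negligible
```

Only a handful of eigenvalues carry almost all of the trace; the rest are numerically indistinguishable from zero, which is the discrete shadow of $\lambda_n \to 0$.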

📝 Example 11 (Spectral Decomposition of a Symmetric Matrix)

A $2 \times 2$ symmetric matrix $A = \begin{pmatrix} a & b \\ b & c \end{pmatrix}$ has eigenvalues $\lambda_{1,2} = \frac{(a+c) \pm \sqrt{(a-c)^2 + 4b^2}}{2}$ with orthogonal eigenvectors. The spectral decomposition is $A = \lambda_1 v_1 v_1^T + \lambda_2 v_2 v_2^T$. The unit circle maps to an ellipse with semi-axes $|\lambda_1|$ and $|\lambda_2|$ along the eigenvector directions. The truncation $\lambda_1 v_1 v_1^T$ is the best rank-1 approximation in the operator norm (Eckart-Young theorem) — this is PCA in finite dimensions.

Demo readout: λ₁ = 3.0000, λ₂ = 0.5000, v₁ = (0.894, 0.447), rank-1 error = 0.5000.

A symmetric matrix maps the unit circle to an ellipse whose axes are the eigenvectors, scaled by eigenvalues. The rank-1 approximation λ₁v₁v₁ᵀ keeps only the dominant eigendirection.
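The readout above can be reproduced with, for instance, the illustrative matrix $A = \begin{pmatrix} 2.5 & 1 \\ 1 & 1 \end{pmatrix}$ (our choice, not fixed by the text); a minimal sketch using `np.linalg.eigh`:

```python
import numpy as np

# Illustrative symmetric matrix (an assumption, not given in the text);
# its spectrum happens to match the demo readout above.
A = np.array([[2.5, 1.0],
              [1.0, 1.0]])

# eigh returns real eigenvalues in ascending order with orthonormal eigenvectors.
vals, vecs = np.linalg.eigh(A)
l2, l1 = vals                                 # λ₂ = 0.5, λ₁ = 3.0
v1 = vecs[:, 1]                               # dominant eigenvector, ±(0.894, 0.447)

# Spectral decomposition and the best rank-1 approximation (Eckart-Young):
A1 = l1 * np.outer(v1, v1)
rank1_error = np.linalg.norm(A - A1, ord=2)   # equals |λ₂| = 0.5
```

The rank-1 error in operator norm is exactly the discarded eigenvalue, which is the Eckart-Young statement in its simplest form.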

💡 Remark 6 (Beyond Compact Self-Adjoint)

The spectral theorem for compact self-adjoint operators is a stepping stone to the general spectral theorem for bounded (or unbounded) self-adjoint operators, which replaces the discrete eigenvalue sum with a spectral integral $T = \int \lambda \, dE(\lambda)$ over a projection-valued measure. This general version — due to von Neumann — is essential for quantum mechanics (where observables are self-adjoint operators) but is beyond the scope of this topic. We note only that our compact case captures the most common finite-dimensional intuition: diagonalization with orthogonal eigenvectors.

Figure: Spectral decomposition of a symmetric matrix. The unit circle maps to an ellipse whose axes lie along the eigenvectors (left); eigenvalues as a bar chart (center); rank-$k$ approximation convergence (right).

10. Reproducing Kernel Hilbert Spaces

This section is the strongest bridge between Hilbert space theory and machine learning. A reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which “evaluate the function at a point” is a continuous operation — and by the Riesz representation theorem, evaluation at each point is an inner product with a special element.

📐 Definition 8 (Reproducing Kernel Hilbert Space)

A reproducing kernel Hilbert space (RKHS) is a Hilbert space $H$ of functions $f: \mathcal{X} \to \mathbb{R}$ such that for every $x \in \mathcal{X}$, the evaluation functional $\delta_x : f \mapsto f(x)$ is a bounded linear functional on $H$.

By the Riesz representation theorem (Theorem 8), there exists a unique $K_x \in H$ such that $f(x) = \langle f, K_x \rangle$ for all $f \in H$. The function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defined by $K(x, y) = \langle K_x, K_y \rangle = K_y(x)$ is called the reproducing kernel of $H$.

🔷 Theorem 10 (The Reproducing Property)

Let $H$ be an RKHS with kernel $K$. Then for every $f \in H$ and every $x \in \mathcal{X}$:

$$f(x) = \langle f, K(\cdot, x) \rangle_H$$

This is the reproducing property: evaluating $f$ at $x$ is the same as taking the inner product of $f$ with the kernel function $K(\cdot, x)$.

🔷 Theorem 11 (Moore-Aronszajn Theorem)

For every symmetric positive-definite kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (meaning $\sum_{i,j} c_i c_j K(x_i, x_j) \geq 0$ for all finite sets of points and coefficients), there exists a unique RKHS $H_K$ with $K$ as its reproducing kernel.

Proof sketch. Start with the pre-Hilbert space of finite linear combinations $\sum_i c_i K(\cdot, x_i)$ with inner product $\left\langle \sum_i c_i K(\cdot, x_i), \sum_j d_j K(\cdot, y_j) \right\rangle = \sum_{i,j} c_i d_j K(x_i, y_j)$. Positive-definiteness of $K$ ensures this inner product is well-defined and positive. Complete the space to get $H_K$.
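The defining inner product above is just a bilinear form in kernel values. A minimal numerical sketch, with an illustrative Gaussian kernel and arbitrarily chosen points and coefficients (all our assumptions):

```python
import numpy as np

def K(a, b, sigma=1.0):
    """Gaussian kernel: an illustrative positive-definite choice."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma**2))

# Finite combinations f = Σᵢ cᵢ K(·, xᵢ) and g = Σⱼ dⱼ K(·, yⱼ).
x_pts, c = np.array([0.0, 0.5, 1.2]), np.array([1.0, -2.0, 0.5])
y_pts, d = np.array([0.3, 0.9]), np.array([0.7, 1.1])

# Defining inner product: ⟨f, g⟩ = Σᵢⱼ cᵢ dⱼ K(xᵢ, yⱼ).
inner_fg = c @ K(x_pts, y_pts) @ d

# Positive-definiteness of K makes every squared norm nonnegative:
# ‖f‖² = cᵀ K(x, x) c ≥ 0.
norm_f_sq = c @ K(x_pts, x_pts) @ c
```

Symmetry of the kernel makes the form symmetric, and positive-definiteness makes it a genuine (pre-)inner product; completing then yields $H_K$ as in the proof sketch.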

📝 Example 12 (Common Kernels and Their RKHS)

  • Linear kernel: $K(x, y) = x \cdot y$. The RKHS is the space of linear functions.
  • Polynomial kernel (degree $d$): $K(x, y) = (1 + x \cdot y)^d$. The RKHS consists of polynomials of degree $\leq d$.
  • Gaussian (RBF) kernel: $K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$. The RKHS is an infinite-dimensional space of smooth functions that is dense in $L^2$ — this kernel is universal.

🔷 Theorem 12 (Representer Theorem)

Let $H_K$ be an RKHS with kernel $K$, and let $\{(x_i, y_i)\}_{i=1}^n$ be training data. Any minimizer of the regularized empirical risk:

$$f^* = \arg\min_{f \in H_K} \sum_{i=1}^n \ell(f(x_i), y_i) + \lambda \|f\|_{H_K}^2$$

(where $\ell$ is any loss function and $\lambda > 0$) lies in the finite-dimensional subspace spanned by the kernel functions at the training points:

$$f^* = \sum_{i=1}^n \alpha_i K(\cdot, x_i)$$

for some coefficients $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$.

Proof.

Decompose $f = f_\parallel + f_\perp$ where $f_\parallel \in \operatorname{span}\{K(\cdot, x_1), \ldots, K(\cdot, x_n)\}$ and $f_\perp \perp K(\cdot, x_i)$ for all $i$ (using the projection theorem).

By the reproducing property, $f(x_i) = \langle f, K(\cdot, x_i) \rangle = \langle f_\parallel, K(\cdot, x_i) \rangle + \langle f_\perp, K(\cdot, x_i) \rangle = f_\parallel(x_i) + 0 = f_\parallel(x_i)$.

So the loss $\sum_i \ell(f(x_i), y_i)$ depends only on $f_\parallel$. Meanwhile, $\|f\|^2 = \|f_\parallel\|^2 + \|f_\perp\|^2 \geq \|f_\parallel\|^2$. Therefore setting $f_\perp = 0$ leaves the loss unchanged and strictly decreases the regularization term unless $f_\perp = 0$ already. Any minimizer has $f_\perp = 0$, hence $f^* \in \operatorname{span}\{K(\cdot, x_i)\}$.

The representer theorem is the mathematical foundation of kernel methods in machine learning. It says that even though $H_K$ may be infinite-dimensional (as for the Gaussian kernel), the optimal function lives in a finite-dimensional subspace whose dimension is at most the number of training points — not the dimension of $H_K$.

💡 Remark 7 (Why the Representer Theorem Matters for ML)

The representer theorem turns an infinite-dimensional optimization problem into a finite-dimensional one. Instead of searching over all functions in $H_K$, we search over $n$ coefficients $\alpha_1, \ldots, \alpha_n$. The kernel trick then allows us to compute everything — predictions, regularization, gradients — using only kernel evaluations $K(x_i, x_j)$, without ever constructing the (possibly infinite-dimensional) feature map explicitly. This is why SVMs, Gaussian processes, and kernel PCA scale with the number of training points rather than the dimension of the feature space. See Kernel Methods on formalML for the full development.
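As a concrete instance, kernel ridge regression with squared loss: the representer theorem says $f^* = \sum_i \alpha_i K(\cdot, x_i)$, and the coefficients solve a linear system. A sketch under illustrative assumptions (the data, kernel width, and $\lambda$ below are our choices):

```python
import numpy as np

def rbf(A, B, sigma=0.25):
    """Gaussian (RBF) kernel matrix between point sets A and B."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 8)                        # training inputs
y = np.sin(2 * np.pi * X) + 0.05 * rng.standard_normal(8)
lam = 1e-3                                          # regularization λ

# For squared loss, minimizing ‖Kα − y‖² + λ αᵀKα in the representer
# coordinates reduces (for invertible K) to (K + λI) α = y.
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict anywhere using only kernel evaluations: f*(x) = Σᵢ αᵢ K(x, xᵢ).
X_test = np.linspace(0.0, 1.0, 50)
f_star = rbf(X_test, X) @ alpha

# The RKHS regularizer in coordinates: ‖f*‖²_{H_K} = αᵀ K α.
rkhs_norm_sq = alpha @ K @ alpha
```

Everything — fitting, predicting, and the regularizer — touches only the $n \times n$ Gram matrix, never an explicit feature map.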

Interactive demo: kernel interpolation with a Gaussian (RBF) kernel. Add or remove training points (up to 8) and watch $f^*$ stay in the span of the kernel functions at the training points, as the representer theorem guarantees.


11. Connections to ML

📝 Example 13 (Kernel Methods and SVMs)

Support vector machines solve a classification problem by finding a maximum-margin hyperplane in an RKHS. The SVM dual problem involves only kernel evaluations $K(x_i, x_j)$ — the representer theorem guarantees the optimal classifier is $f^*(x) = \sum_i \alpha_i K(x_i, x)$, where the $\alpha_i$ are the Lagrange multipliers (support vector coefficients). See Kernel Methods on formalML.

📝 Example 14 (Gaussian Processes as Projections in RKHS)

A Gaussian process (GP) prior on functions induces an RKHS. The GP posterior mean at a test point $x^*$, given training data, is the orthogonal projection of $K(\cdot, x^*)$ onto the subspace $\operatorname{span}\{K(\cdot, x_1), \ldots, K(\cdot, x_n)\}$. The posterior variance measures the squared distance from $K(\cdot, x^*)$ to this subspace — it is $K(x^*, x^*) - k^T (K + \sigma^2 I)^{-1} k$ where $k_i = K(x^*, x_i)$. See Generative Modeling on formalML.
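These two formulas are a few lines of linear algebra. A sketch with illustrative toy data (the points, kernel length scale, and noise level are our assumptions):

```python
import numpy as np

def k(a, b, ell=0.3):
    """Gaussian covariance kernel with length scale ell (illustrative)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

# Toy observations and noise variance σ².
X = np.array([0.1, 0.4, 0.8])
y = np.array([1.0, 0.2, -0.5])
sigma2 = 1e-2

Kn = k(X, X) + sigma2 * np.eye(len(X))        # K + σ²I

x_star = np.array([0.45])
k_vec = k(X, x_star)[:, 0]                    # kᵢ = K(x*, xᵢ)

# Posterior mean: a combination of kernel evaluations at the data.
mean = k_vec @ np.linalg.solve(Kn, y)
# Posterior variance: K(x*, x*) − kᵀ(K + σ²I)⁻¹k, the (regularized)
# squared distance from K(·, x*) to span{K(·, xᵢ)}.
var = k(x_star, x_star)[0, 0] - k_vec @ np.linalg.solve(Kn, k_vec)
```

Near the data the subspace almost contains $K(\cdot, x^*)$ and the variance is small; far from the data the distance, and hence the uncertainty, grows back toward the prior variance $K(x^*, x^*)$.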

📝 Example 15 (Proximal Operators as Projections)

The proximal operator $\text{prox}_f(x) = \arg\min_y \{f(y) + \frac{1}{2}\|y - x\|^2\}$ is a generalization of orthogonal projection. When $f = \iota_C$ is the indicator function of a closed convex set $C$, $\text{prox}_f(x) = P_C(x)$ — exactly the projection from Theorem 4. Proximal gradient descent, ISTA/FISTA, and ADMM all use projections in Hilbert spaces as their core operations. See Optimization Theory on formalML.
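Two standard prox computations illustrate the point (the examples are our choices): the indicator of the unit ball recovers Euclidean projection, while $f = \lambda\|\cdot\|_1$ yields componentwise soft-thresholding, the workhorse of ISTA.

```python
import numpy as np

def prox_unit_ball(x):
    """prox of the indicator of the closed unit ball = projection P_C(x)."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def prox_l1(x, lam):
    """prox of λ‖·‖₁ = componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

p = prox_unit_ball(np.array([3.0, 4.0]))        # → (0.6, 0.8)
s = prox_l1(np.array([1.5, -0.2, 0.05]), 0.1)   # → (1.4, -0.1, 0.0)
```

Both are closed-form solutions of the defining minimization; the second shrinks small coordinates exactly to zero, which is why $\ell^1$ regularization produces sparsity.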

📝 Example 16 (Attention as Projection)

In the transformer attention mechanism, the output for a query $q$ is $\text{Attention}(q) = \sum_i \text{softmax}(\langle q, k_i \rangle)\, v_i$ — a weighted combination of values $v_i$, where the weights are inner-product similarities between the query and keys. This is a “soft projection” of $q$ onto the subspace spanned by the values, with the inner product determining the projection weights. Standard orthogonal projection is the “hard” limit as the softmax temperature goes to zero.
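A minimal sketch of this soft-vs-hard behavior, with toy keys and values of our choosing and an explicit temperature $\tau$:

```python
import numpy as np

def attention(q, keys, values, tau=1.0):
    """Soft projection: softmax-weighted combination of values,
    with weights given by inner products ⟨q, kᵢ⟩ / τ."""
    scores = keys @ q / tau
    w = np.exp(scores - scores.max())          # numerically stable softmax
    w = w / w.sum()
    return w @ values

q = np.array([1.0, 0.0])
keys = np.array([[0.9, 0.1],
                 [0.0, 1.0]])
values = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

soft = attention(q, keys, values, tau=1.0)     # blend of both values
hard = attention(q, keys, values, tau=1e-4)    # τ → 0: snaps to the best-matching key
```

At moderate temperature the output blends both values; as $\tau \to 0$ the softmax concentrates on the key most aligned with $q$ and the output collapses to that key's value.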

Figure: Hilbert spaces in ML. (a) SVM kernel trick — classification via a hyperplane in RKHS; (b) GP posterior as projection in RKHS; (c) proximal operator as generalized projection; (d) attention as soft projection via inner products.

12. Computational Notes

All of the algorithms in this topic have efficient numerical implementations.

Inner products and norms. In NumPy: np.dot(x, y) or x @ y for the Euclidean inner product, np.linalg.norm(x) for the norm. For weighted inner products: np.dot(w * x, y).

Projection. Onto a subspace with orthonormal basis matrix E (columns are the ONB vectors): P = E @ E.T, then proj = P @ x. For a general basis, run Gram-Schmidt first or use the normal equations: proj = E @ np.linalg.solve(E.T @ E, E.T @ x).

Gram-Schmidt. NumPy: np.linalg.qr(V) computes the QR decomposition, where Q contains the orthonormalized vectors. For step-by-step visualization, implement the loop explicitly as in the notebook.

Kernel methods. Build the Gram matrix $K_{ij} = K(x_i, x_j)$, solve alpha = np.linalg.solve(K + lambda * I, y), predict via f_star = K_test @ alpha.

Eigendecomposition. For a symmetric matrix: eigenvalues, eigenvectors = np.linalg.eigh(A). Use eigh (not eig) for symmetric matrices — it guarantees real eigenvalues and orthogonal eigenvectors.
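The notes above, gathered into one runnable sketch (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Inner products and norms.
x = rng.standard_normal(5)
w = np.abs(rng.standard_normal(5))
ip = x @ x                                    # Euclidean inner product
wip = np.dot(w * x, x)                        # weighted inner product

# Projection onto span(V): orthonormalize via QR, then P = Q Qᵀ.
V = rng.standard_normal((5, 2))               # general (non-orthonormal) basis
Q, _ = np.linalg.qr(V)
proj = Q @ (Q.T @ x)
resid = x - proj                              # orthogonal to the subspace

# Eigendecomposition of a symmetric matrix: eigh, not eig.
A = V @ V.T                                   # symmetric PSD
vals, vecs = np.linalg.eigh(A)                # real eigenvalues, ascending
```

The two checks worth internalizing: the residual $x - Px$ is orthogonal to the subspace, and projecting twice changes nothing ($P^2 = P$).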


13. Connections and Further Reading

Topic connections:

  • Metric Spaces: completeness and compactness foundations; inner product spaces add geometry to the metric layer.
  • Banach Spaces: every Hilbert space is Banach; the inner product replaces Baire-powered arguments with projection-based ones.
  • $L^p$ Spaces: $L^2$ is the canonical Hilbert space; the duality $(L^p)^* \cong L^q$ becomes the simpler $H^* \cong H$ via Riesz.
  • Fourier Series: trigonometric functions as an ONB for $L^2[-\pi, \pi]$; Parseval’s identity as energy conservation.
  • Approximation Theory: best $L^2$ approximation = orthogonal projection; Legendre polynomials as an ONB.
  • Radon-Nikodym: conditional expectation as $L^2$ projection onto a sub-$\sigma$-algebra.
  • Kernel Methods (formalML): RKHS is a Hilbert space; representer theorem via Riesz.
  • Gradient Descent (formalML): the Riesz identification turns gradients into descent directions in the primal space.
  • Spectral Methods (formalML): PCA and spectral clustering via the spectral theorem.

14. Summary

Concept · key result · why it matters:

  • Inner product: adds angles and orthogonality to normed spaces; enables the geometric toolkit that norms lack.
  • Cauchy-Schwarz: $|\langle x, y \rangle| \leq \|x\| \cdot \|y\|$; makes angles well-defined and controls all inner-product estimates.
  • Parallelogram law: $\|x+y\|^2 + \|x-y\|^2 = 2(\|x\|^2 + \|y\|^2)$; characterizes inner-product norms, essential in the projection proof.
  • Hilbert space: a complete inner product space; the “good” geometric spaces where all the theorems work.
  • Projection theorem: unique closest point in closed convex sets; the foundation for everything below.
  • Orthonormal bases: Parseval’s identity $\sum_k |\langle x, e_k \rangle|^2 = \|x\|^2$; generalized Fourier analysis, and every separable Hilbert space $\cong \ell^2$.
  • Riesz representation: $H^* \cong H$ via $\varphi(x) = \langle x, y \rangle$; collapses dual-space complexity, since Hilbert spaces are self-dual.
  • Spectral theorem: compact self-adjoint $T = \sum_n \lambda_n \, e_n \otimes e_n$; eigendecomposition with orthogonal eigenvectors, the foundation of PCA.
  • RKHS: $f(x) = \langle f, K_x \rangle_H$; evaluation via inner product, and the representer theorem for kernel methods.

15. Looking Ahead

The next and final topic in Track 8 is Calculus of Variations — the study of functionals defined on function spaces, and the search for functions that minimize (or maximize) them. The Euler-Lagrange equation, the direct method for existence of minimizers, and weak solutions to PDEs all rely on the Hilbert space infrastructure we have built here. The projection theorem provides existence of minimizers in reflexive spaces; the Riesz representation theorem turns the weak formulation of PDEs into a Hilbert-space problem; and Sobolev spaces — the natural domains for variational problems — are Hilbert spaces when $p = 2$.

Where this topic added one axiom (the inner product) to Banach spaces and gained geometric power, Topic 32 adds one more idea — variation — to function spaces and gains the ability to find optimal functions. The staircase of abstraction reaches its summit.

References

  1. Kreyszig (1978). Introductory Functional Analysis with Applications. Chapters 3–4 (inner product spaces, Hilbert spaces). Comprehensive and accessible.
  2. Conway (1990). A Course in Functional Analysis. Chapters 1–2 (Hilbert spaces, projection, Riesz). Clean modern treatment.
  3. Brezis (2011). Functional Analysis, Sobolev Spaces and Partial Differential Equations. Chapter 5 (Hilbert spaces). Concise proofs with applications.
  4. Folland (1999). Real Analysis: Modern Techniques and Their Applications. Chapter 5 (elements of functional analysis). Rigorous with measure-theoretic context.
  5. Schölkopf & Smola (2002). Learning with Kernels. RKHS theory for machine learning; the representer theorem and kernel trick.