Series & Approximation · intermediate · 50 min read

Approximation Theory

Every continuous function on a closed interval can be uniformly approximated by polynomials — the Weierstrass theorem. Bernstein polynomials provide the constructive proof, Chebyshev polynomials provide the optimal nodes, and Jackson's theorem quantifies how smoothness controls the rate. This is the existence theorem of approximation: the guarantee that good approximants exist, the machinery to build them, and the connection to the universal approximation theorem for neural networks.

Abstract. The Weierstrass approximation theorem guarantees that every continuous function f on a closed interval [a,b] can be uniformly approximated by polynomials: for any ε > 0, there exists a polynomial p with ||f − p||∞ < ε. The Bernstein polynomial construction Bₙ(f; x) = Σ f(k/n)C(n,k)xᵏ(1−x)ⁿ⁻ᵏ provides a concrete, constructive proof — each Bₙ is the expected value of f evaluated at a binomial random variable, and convergence follows from the law of large numbers. The convergence rate O(1/√n) is slow, but the theorem's power lies in its generality: no differentiability, no analyticity, just continuity. The Stone-Weierstrass theorem generalizes to compact spaces: a closed subalgebra of C(K) that separates points and contains the constants is dense. For optimal polynomial approximation, Chebyshev polynomials Tₙ(x) = cos(n arccos x) provide near-minimax approximation and the ideal interpolation nodes, avoiding the Runge phenomenon that plagues equispaced interpolation. Jackson's theorem quantifies the rate of best approximation: Eₙ(f) ≤ C·ω(f; 1/n), where ω is the modulus of continuity — smoother functions admit faster polynomial approximation. Bernstein's inverse theorem provides the converse: fast approximation implies smoothness. In machine learning, the universal approximation theorem for neural networks (Cybenko, Hornik) is the direct analog of Weierstrass: single-hidden-layer networks can uniformly approximate any continuous function on a compact set. Barron's theorem gives approximation rates for networks, and polynomial approximation underlies homomorphic encryption for private ML inference.

Where this leads → formalML

  • formalML The universal approximation theorem for neural networks (Cybenko 1989, Hornik 1991) is the neural-network analog of the Weierstrass theorem. The proof structure mirrors Stone-Weierstrass: networks form an algebra that separates points. Barron's theorem gives O(1/√n) approximation rates for networks with bounded first moments — the same rate as Bernstein polynomials.
  • formalML Polynomial and Chebyshev approximation of activation functions enables private ML inference via homomorphic encryption. Bernstein polynomials preserve the bounded range [0,1], making them suitable for approximating sigmoid and ReLU in encrypted computation.
  • formalML The Stone-Weierstrass theorem for C*-algebras gives the continuous functional calculus: for self-adjoint operator A, f(A) is defined by approximating f with polynomials in A. This is the infinite-dimensional generalization of the polynomial approximation theory developed here.
  • formalML RKHS approximation theory provides error bounds for kernel regression and Gaussian processes. The representer theorem is an approximation-theoretic result about best approximation in RKHS — the kernel analog of best polynomial approximation in C[a,b].

1. Overview & Motivation — Why Approximation Theory?

Track 5 has now developed three approximation paradigms. Power Series & Taylor Series showed how to approximate a function locally by matching derivatives at a center c — but Taylor series require analyticity and are limited by the radius of convergence. Fourier Series & Orthogonal Expansions showed how to approximate a function globally by decomposing it into trigonometric harmonics — but Fourier series work best for periodic functions and converge in the L^2 norm, which allows large pointwise errors on small sets.

The capstone question of this track is: can we approximate any continuous function on [a,b] by polynomials, uniformly (with a guaranteed pointwise bound everywhere)? The answer is yes — the Weierstrass approximation theorem — and understanding why, how, and how fast is the subject of approximation theory.

Why this matters in ML. The universal approximation theorem for neural networks (Cybenko 1989) is the neural-network analog of Weierstrass. Just as Weierstrass says polynomials are dense in C[a,b], Cybenko says single-hidden-layer networks are dense in C(K) for compact K. Understanding the classical theory illuminates the modern one — including its limitations. Density says approximants exist but says nothing about how to find them or how many parameters are needed. The gap between existence and computation is the gap between Weierstrass and training a neural network.

💡 Remark 1 (The approximation triptych)

Track 5 has developed three approximation paradigms, each with its own strengths and limitations:

Comparing Taylor (Topic 18), Fourier (Topic 19), and Weierstrass/Bernstein (Topic 20):

  • Basis — Taylor: monomials \{1, x, x^2, \ldots\}; Fourier: trig functions \{1, \cos x, \sin x, \ldots\}; Bernstein: the Bernstein basis \{b_{k,n}\}
  • Scope — Taylor: local (radius R around center c); Fourier: global on the period [-\pi, \pi]; Bernstein: global on [a, b]
  • Requires — Taylor: analyticity; Fourier: integrability; Bernstein: continuity only
  • Convergence — Taylor: inside radius R; Fourier: L^2 norm, pointwise under conditions; Bernstein: uniform on [a, b]
  • Coefficients — Taylor: derivatives f^{(n)}(c)/n!; Fourier: integrals \hat{f}(n) = \frac{1}{2\pi}\int f e^{-inx}\,dx; Bernstein: samples f(k/n)

Each successive paradigm trades specificity for generality. The thread connecting them all is the question: how well can we approximate functions, and what determines the answer?

The approximation triptych — Taylor (local polynomial at a center), Fourier (global trigonometric on a period), Bernstein (global polynomial on an interval)

2. The Weierstrass Approximation Theorem

The Weierstrass approximation theorem (1885) is the existence theorem of approximation theory. It tells us that polynomial approximation is always possible for continuous functions — the polynomials are dense in C[a,b].

📐 Definition 1 (Best Polynomial Approximation)

For f \in C[a,b] and n \geq 0, the best polynomial approximation error (or minimax error) of degree n is

E_n(f) = \inf_{p \in \mathcal{P}_n} \|f - p\|_\infty = \inf_{p \in \mathcal{P}_n} \max_{x \in [a,b]} |f(x) - p(x)|

where \mathcal{P}_n is the space of polynomials of degree \leq n. The infimum is attained — compactness of [a,b], continuity of f, and the finite dimension of \mathcal{P}_n guarantee the existence of a best approximant p_n^* with \|f - p_n^*\|_\infty = E_n(f).

🔷 Theorem 1 (The Weierstrass Approximation Theorem)

If f: [a,b] \to \mathbb{R} is continuous, then for every \varepsilon > 0 there exists a polynomial p such that

\|f - p\|_\infty = \max_{x \in [a,b]} |f(x) - p(x)| < \varepsilon.

Equivalently, E_n(f) \to 0 as n \to \infty.

In the language of metric spaces: the polynomials \mathcal{P} = \bigcup_{n=0}^{\infty} \mathcal{P}_n are dense in (C[a,b], \|\cdot\|_\infty).

💡 Remark 2 (What Weierstrass does and does not say)

The theorem is an existence result. It guarantees that good polynomial approximants exist but does not:

  • Construct the best polynomial for a given \varepsilon
  • Specify what degree n is needed
  • Identify the optimal polynomial p_n^*

The Bernstein construction (Section 3) provides an explicit sequence of polynomial approximants. The Chebyshev theory (Section 6) characterizes the best approximant. Jackson’s theorem (Section 7) quantifies the rate.

3. Bernstein Polynomials — The Constructive Proof

The Bernstein construction is the simplest constructive proof of Weierstrass. It builds polynomial approximants by averaging f over a binomial distribution — a probabilistic construction that converts the law of large numbers into an approximation theorem.

📐 Definition 2 (Bernstein Polynomial)

For f: [0,1] \to \mathbb{R} and n \geq 1, the nth Bernstein polynomial of f is

B_n(f; x) = \sum_{k=0}^{n} f\!\left(\frac{k}{n}\right) \binom{n}{k} x^k (1 - x)^{n-k}.

The Bernstein basis polynomials are

b_{k,n}(x) = \binom{n}{k} x^k (1 - x)^{n-k}, \quad k = 0, 1, \ldots, n.

These form a partition of unity: \sum_{k=0}^n b_{k,n}(x) = 1 for all x \in [0,1].
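
The partition-of-unity identity — along with the first- and second-moment identities used in the convergence proof below — can be checked numerically. A minimal sketch using only the Python standard library (the function names are illustrative):

```python
from math import comb

def bernstein_basis(k, n, x):
    """Bernstein basis polynomial b_{k,n}(x) = C(n,k) x^k (1-x)^(n-k)."""
    return comb(n, k) * x**k * (1 - x) ** (n - k)

def basis_moments(n, x):
    """Return (sum_k b_k, sum_k (k/n) b_k, sum_k (k/n - x)^2 b_k)."""
    b = [bernstein_basis(k, n, x) for k in range(n + 1)]
    s0 = sum(b)                                                # partition of unity: 1
    s1 = sum(k / n * bk for k, bk in enumerate(b))             # first moment: x
    s2 = sum((k / n - x) ** 2 * bk for k, bk in enumerate(b))  # variance: x(1-x)/n
    return s0, s1, s2
```

At n = 20, x = 0.3, the three sums return 1, 0.3, and 0.3·0.7/20 up to floating-point rounding.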

💡 Remark 3 (The probabilistic interpretation)

Let S_n \sim \text{Binomial}(n, x). Then

B_n(f; x) = \mathbb{E}\!\left[f\!\left(\frac{S_n}{n}\right)\right].

Each Bernstein polynomial evaluates f at the sample points k/n weighted by the binomial probabilities. As n \to \infty, the law of large numbers gives S_n/n \to x in probability, so f(S_n/n) \to f(x) by continuity, giving B_n(f; x) \to f(x). This is the heuristic — the rigorous proof makes it precise.

🔷 Theorem 2 (Uniform Convergence of Bernstein Polynomials)

If f: [0,1] \to \mathbb{R} is continuous, then B_n(f; x) \to f(x) uniformly on [0,1]:

\|B_n(f) - f\|_\infty \to 0 \quad \text{as } n \to \infty.

Proof.

The proof uses three identities for the Bernstein basis:

Identity 1 (partition of unity): \displaystyle\sum_{k=0}^n b_{k,n}(x) = 1.

Identity 2 (first moment): \displaystyle\sum_{k=0}^n \frac{k}{n}\, b_{k,n}(x) = x.

Identity 3 (variance): \displaystyle\sum_{k=0}^n \left(\frac{k}{n} - x\right)^2 b_{k,n}(x) = \frac{x(1-x)}{n} \leq \frac{1}{4n}.

Now fix \varepsilon > 0. Since f is continuous on the compact set [0,1], it is uniformly continuous: choose \delta > 0 such that |f(s) - f(t)| < \varepsilon whenever |s - t| < \delta.

For any x \in [0,1], using Identity 1:

|B_n(f;x) - f(x)| = \left|\sum_{k=0}^n \bigl[f(k/n) - f(x)\bigr]\, b_{k,n}(x)\right| \leq \sum_{k=0}^n |f(k/n) - f(x)|\, b_{k,n}(x).

Split the sum into two parts. Let M = \|f\|_\infty.

Near terms (|k/n - x| < \delta): Here |f(k/n) - f(x)| < \varepsilon, so these terms contribute at most \varepsilon \cdot \sum b_{k,n}(x) \leq \varepsilon.

Far terms (|k/n - x| \geq \delta): Here |f(k/n) - f(x)| \leq 2M, and by Identity 3:

\sum_{|k/n - x| \geq \delta} b_{k,n}(x) \leq \frac{1}{\delta^2}\sum_{k=0}^n \left(\frac{k}{n} - x\right)^2 b_{k,n}(x) \leq \frac{1}{4n\delta^2}.

These terms contribute at most 2M \cdot \frac{1}{4n\delta^2} = \frac{M}{2n\delta^2}.

Combining: |B_n(f;x) - f(x)| \leq \varepsilon + \frac{M}{2n\delta^2}.

Choose n > \frac{M}{2\varepsilon\delta^2}. Then |B_n(f;x) - f(x)| < 2\varepsilon for all x \in [0,1], so \|B_n(f) - f\|_\infty < 2\varepsilon.

Since \varepsilon > 0 was arbitrary, B_n(f) \to f uniformly. \square

📝 Example 1 (Bernstein polynomials for f(x) = |2x − 1|)

The function f(x) = |2x - 1| is continuous but not differentiable at x = 1/2. Its Bernstein polynomials B_n(f; x) = \sum_k |2k/n - 1|\, b_{k,n}(x) are smooth polynomials that converge uniformly to f, rounding the corner at x = 1/2. At n = 5, the approximation is visibly smooth; at n = 50, it is nearly indistinguishable from f except in a narrow region around x = 1/2.

📝 Example 2 (Convergence rate comparison)

For the smooth function f(x) = x(1-x), the Bernstein polynomial satisfies B_n(f; x) = x(1-x)(n-1)/n, giving \|B_n(f) - f\|_\infty = 1/(4n) = O(1/n) — faster than the general O(1/\sqrt{n}) because f is smooth.

For f(x) = |2x - 1| (Lipschitz but not C^1), \|B_n(f) - f\|_\infty = O(1/\sqrt{n}) — the worst-case rate for Lipschitz functions.

This illustrates that Bernstein convergence is slow — adequate for proving the Weierstrass theorem, but not optimal for computation.


Bernstein polynomials B_n(f; x) for f(x) = |2x − 1| at n = 5, 20, 100
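
The rates in Example 2 can be reproduced in a few lines. A self-contained sketch (the sup-norm is estimated on a grid, so the reported errors are approximate):

```python
from math import comb

def bernstein(f, n, x):
    """Evaluate the nth Bernstein polynomial B_n(f; x) on [0, 1]."""
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x) ** (n - k)
               for k in range(n + 1))

def sup_error(f, n, m=1001):
    """Grid estimate of ||B_n(f) - f||_inf over [0, 1]."""
    xs = [i / (m - 1) for i in range(m)]
    return max(abs(bernstein(f, n, x) - f(x)) for x in xs)

# Smooth f: the error is exactly 1/(4n), the O(1/n) rate.
smooth = lambda x: x * (1 - x)
# Corner at x = 1/2: the error decays at the slower O(1/sqrt(n)) rate.
corner = lambda x: abs(2 * x - 1)
```

For the smooth function, sup_error(smooth, 10) returns 1/(4·10) = 0.025 up to rounding; for the corner function, quadrupling n from 20 to 80 roughly halves the error, consistent with O(1/√n).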

4. Stone-Weierstrass — The Grand Generalization

Stone’s generalization (1937, 1948) replaces the specific polynomial algebra on [a,b] with an abstract algebra on a compact space, identifying exactly what properties an algebra needs to be dense in C(K).

📐 Definition 3 (Separating Algebra)

A set \mathcal{A} \subset C(K) is a subalgebra of C(K) if it is closed under addition, scalar multiplication, and pointwise multiplication. \mathcal{A} separates points if for any x \neq y in K, there exists g \in \mathcal{A} with g(x) \neq g(y). \mathcal{A} vanishes nowhere if for every x \in K, there exists g \in \mathcal{A} with g(x) \neq 0. If \mathcal{A} contains the constant functions, it automatically vanishes nowhere.

🔷 Theorem 3 (The Stone-Weierstrass Theorem (Real Version))

Let K be a compact metric space and let \mathcal{A} \subset C(K, \mathbb{R}) be a subalgebra that separates points and contains the constant functions. Then \mathcal{A} is dense in C(K, \mathbb{R}) under the sup-norm:

\overline{\mathcal{A}} = C(K, \mathbb{R}).

📝 Example 3 (Recovering the Weierstrass theorem)

Take K = [a,b] and \mathcal{A} = \mathcal{P}, the polynomials. Polynomials form a subalgebra (closed under addition, multiplication, and scalar multiplication). They separate points: p(x) = x distinguishes any x \neq y. They contain the constants. Therefore \overline{\mathcal{P}} = C[a,b] by Stone-Weierstrass. This is precisely Theorem 1.

📝 Example 4 (Trigonometric polynomials on the circle)

Take K = \mathbb{T} = \mathbb{R}/(2\pi\mathbb{Z}) (the circle) and \mathcal{A} the trigonometric polynomials. They form a subalgebra (products of trig functions are trig polynomials), separate points (if e^{ix} \neq e^{iy}, then \cos x \neq \cos y or \sin x \neq \sin y), and contain the constants. Stone-Weierstrass gives density, consistent with the Fourier series theory from Topic 19, which proved this concretely via Parseval’s identity.

📝 Example 5 (Polynomials in several variables)

Take K \subset \mathbb{R}^d compact and \mathcal{A} the polynomials in d variables. They separate points: the coordinate functions x_i distinguish points that differ in any coordinate. Stone-Weierstrass gives the density of multivariate polynomials in C(K) — a result that would be laborious to prove directly but follows immediately from the abstract theorem.

💡 Remark 4 (What Stone-Weierstrass does not cover)

The theorem applies to C(K, \mathbb{R}) with the sup-norm. It does not directly apply to L^p spaces (which require measure theory) or to non-compact domains (which require additional decay conditions). The complex version requires \mathcal{A} to be self-conjugate: g \in \mathcal{A} \Rightarrow \bar{g} \in \mathcal{A}. This condition is automatic for real-valued algebras but must be checked for complex ones.

Three instances of Stone-Weierstrass — polynomials on [0,1], trigonometric polynomials on the circle, multivariate polynomials on a compact subset of R²

5. Best Approximation — L^2 vs. Uniform

The choice of norm changes the best approximant, the characterization of optimality, and the connection to other mathematical structures. Two norms, two optimization problems, two different answers.

📐 Definition 4 (Best Approximation in L² and C)

The best L^2 approximation of f \in L^2[a,b] from \mathcal{P}_n is the polynomial p_n^{L^2} minimizing

\|f - p\|_2 = \sqrt{\int_a^b |f(x) - p(x)|^2\,dx}.

The best uniform (Chebyshev) approximation of f \in C[a,b] from \mathcal{P}_n is the polynomial p_n^* minimizing

\|f - p\|_\infty = \max_{x \in [a,b]} |f(x) - p(x)|.

🔷 Theorem 4 (Best L² Approximation as Orthogonal Projection)

The best L^2 polynomial approximation p_n^{L^2} is the orthogonal projection of f onto \mathcal{P}_n in the inner product \langle f, g \rangle = \int_a^b f(x)g(x)\,dx. If \{\phi_0, \phi_1, \ldots, \phi_n\} is an orthonormal basis for \mathcal{P}_n (e.g., the normalized Legendre polynomials), then

p_n^{L^2}(x) = \sum_{k=0}^{n} \langle f, \phi_k \rangle\, \phi_k(x).

This is exactly the Fourier coefficient formula from Topic 19, with orthogonal polynomials replacing trigonometric functions.

💡 Remark 5 (L² vs. uniform — why both matter)

L^2 best approximation is easy to compute (orthogonal projection — just compute inner products) but allows large pointwise errors on small sets. Uniform best approximation is hard to compute (a nonlinear optimization problem) but gives a guaranteed pointwise bound everywhere.

In ML: L^2 approximation corresponds to minimizing mean squared error (training loss), while uniform approximation corresponds to worst-case guarantees (robustness). A model that minimizes MSE can still have catastrophic errors on rare inputs — the gap between L^2 and uniform approximation is the gap between average-case and worst-case performance.

📝 Example 6 (Comparing L² and uniform best approximation of |x|)

For f(x) = |x| on [-1,1] with n = 2 (quadratic approximation): the best L^2 approximant is the orthogonal projection p_2^{L^2}(x) = \frac{1}{2} + \frac{15}{16}\left(x^2 - \frac{1}{3}\right) (computed from Legendre coefficients), while the best uniform approximant is p_2^*(x) = x^2 + \frac{1}{8}, whose error equioscillates between +E_2(f) and -E_2(f) with E_2(f) = \frac{1}{8} — here at the five points x = -1, -\tfrac{1}{2}, 0, \tfrac{1}{2}, 1, more than the n + 2 = 4 required. The two approximants are different polynomials — the norm determines the answer.


L² vs. uniform best quadratic approximation of |x| on [-1,1], showing equioscillation for the uniform approximant
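
Both approximants in Example 6 can be computed and checked numerically; a sketch using numpy's Legendre module (quadrature by composite trapezoid, so coefficients are accurate only up to quadrature error):

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

def legendre_projection(f, degree, num=200001):
    """Best L^2[-1,1] polynomial approximation of f, computed as the
    orthogonal projection onto the Legendre polynomials P_0..P_degree."""
    x = np.linspace(-1.0, 1.0, num)
    dx = x[1] - x[0]
    coeffs = []
    for k in range(degree + 1):
        y = f(x) * Legendre.basis(k)(x)
        inner = dx * (y.sum() - 0.5 * (y[0] + y[-1]))  # trapezoid: <f, P_k>
        coeffs.append(inner / (2.0 / (2 * k + 1)))     # divide by ||P_k||^2
    return Legendre(coeffs)

# Projection of |x|: Legendre-basis coefficients (1/2, 0, 5/8),
# i.e. p(x) = 3/16 + (15/16) x^2.
p_l2 = legendre_projection(np.abs, 2)

# Best *uniform* quadratic is p*(x) = x^2 + 1/8; its error equioscillates
# between -1/8 and +1/8 at x = -1, -1/2, 0, 1/2, 1.
uniform_err = lambda x: abs(x) - (x**2 + 1.0 / 8.0)
```

The two polynomials differ, as the example claims: the projection weights squared error, while p* controls the worst case.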

6. Chebyshev Polynomials & Optimal Nodes

Chebyshev polynomials are the algebraic analog of the trigonometric functions — they arise from \cos(n \arccos x), connect polynomial approximation to Fourier analysis, and provide the optimal interpolation nodes.

📐 Definition 5 (Chebyshev Polynomials)

The Chebyshev polynomial of the first kind is

T_n(x) = \cos(n \arccos x), \quad x \in [-1, 1], \quad n \geq 0.

Equivalently, for n \geq 1, T_n is the unique polynomial of degree n with leading coefficient 2^{n-1} satisfying |T_n(x)| \leq 1 for x \in [-1, 1].

The zeros are the Chebyshev nodes:

x_k = \cos\!\left(\frac{2k - 1}{2n}\pi\right), \quad k = 1, \ldots, n.

🔷 Theorem 5 (Chebyshev Minimax Property)

Among all monic polynomials of degree n (leading coefficient = 1), the one with smallest sup-norm on [-1,1] is \tilde{T}_n(x) = T_n(x)/2^{n-1}, and

\min_{\text{monic } p,\ \deg p = n} \|p\|_\infty = \frac{1}{2^{n-1}},

achieved uniquely by the normalized Chebyshev polynomial.

🔷 Proposition 1 (The Equioscillation Theorem (Chebyshev's Theorem))

For f \in C[a,b], a polynomial p^* \in \mathcal{P}_n is the best uniform approximation if and only if the error f - p^* equioscillates at least n + 2 times: there exist a \leq x_0 < x_1 < \cdots < x_{n+1} \leq b such that

f(x_j) - p^*(x_j) = \pm(-1)^j \|f - p^*\|_\infty

(alternating sign, full magnitude, with one fixed overall sign). This characterizes the best approximant uniquely.

📝 Example 7 (The Runge phenomenon)

Consider interpolating f(x) = 1/(1 + 25x^2) on [-1, 1] using polynomial interpolation. With n + 1 equispaced nodes, the interpolant diverges near the endpoints as n \to \infty — the maximum interpolation error grows without bound. With Chebyshev nodes x_k = \cos((2k-1)\pi/(2n+2)), the interpolant converges uniformly.

The Runge phenomenon is the polynomial interpolation analog of the Gibbs phenomenon from Topic 19: equispaced sampling is suboptimal, and the right node distribution matters enormously.
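
The comparison is easy to reproduce. A minimal numpy sketch (np.polyfit is adequate at this small degree; production code would use a barycentric formulation):

```python
import numpy as np

def runge(x):
    return 1.0 / (1.0 + 25.0 * x**2)

def max_interp_error(nodes, num=2001):
    """Sup-norm error on [-1,1] of the polynomial interpolant of the
    Runge function through the given nodes (degree = len(nodes) - 1)."""
    coeffs = np.polyfit(nodes, runge(nodes), len(nodes) - 1)
    xs = np.linspace(-1.0, 1.0, num)
    return float(np.max(np.abs(np.polyval(coeffs, xs) - runge(xs))))

n = 10  # polynomial degree; n + 1 nodes
equispaced = np.linspace(-1.0, 1.0, n + 1)
chebyshev = np.cos((2 * np.arange(1, n + 2) - 1) * np.pi / (2 * (n + 1)))
```

At n = 10 the equispaced error is already around 1.9 while the Chebyshev error is about 0.11, and the gap widens as n grows.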

💡 Remark 6 (Chebyshev polynomials as 'cosines in disguise')

The identity T_n(\cos\theta) = \cos(n\theta) means that Chebyshev approximation on [-1,1] is equivalent to Fourier cosine approximation on [0, \pi] under the substitution x = \cos\theta. This connects Topic 20 back to Topic 19: the Chebyshev coefficients of f are the Fourier cosine coefficients of f(\cos\theta), and the coefficient decay rates from Topic 19 translate directly into approximation rates for Chebyshev expansions.


The Runge phenomenon — equispaced interpolation diverges while Chebyshev interpolation converges for the Runge function

7. Jackson’s Theorem & Bernstein’s Inverse

How fast can we approximate? Jackson’s theorem says the rate of best polynomial approximation is controlled by the smoothness of f — smoother functions admit faster approximation. Bernstein’s inverse theorem gives the converse: fast approximation implies smoothness.

📐 Definition 6 (Modulus of Continuity)

The modulus of continuity of f \in C[a,b] is

\omega(f; \delta) = \sup_{|x - y| \leq \delta} |f(x) - f(y)|, \quad \delta > 0.

f is Lipschitz (with constant L) if \omega(f; \delta) \leq L\delta. f is Hölder continuous with exponent \alpha \in (0, 1] if \omega(f; \delta) \leq C\delta^\alpha. The modulus quantifies “how continuous” f is — it is the most refined scalar measure of regularity for continuous functions.

🔷 Theorem 6 (Jackson's Theorem (Direct Theorem))

If f \in C[-1, 1], then for every n \geq 1,

E_n(f) \leq C \cdot \omega\!\left(f; \frac{1}{n}\right),

where C is an absolute constant. More generally, if f \in C^r[-1,1] (i.e., f is r times continuously differentiable), then

E_n(f) \leq \frac{C_r}{n^r}\, \omega\!\left(f^{(r)}; \frac{1}{n}\right).

The smoother f is, the faster E_n(f) \to 0.

🔷 Theorem 7 (Bernstein's Inverse Theorem)

If E_n(f) = O(n^{-r-\alpha}) for some integer r \geq 0 and 0 < \alpha < 1, then f \in C^r[-1,1] and f^{(r)} is Hölder continuous with exponent \alpha:

\omega(f^{(r)}; \delta) = O(\delta^\alpha).

The converse of Jackson’s theorem: fast approximation rates imply smoothness.

📝 Example 8 (Jackson's theorem for functions of varying smoothness)

(a) f(x) = |x| is Lipschitz (\omega(f; \delta) = \delta), so Jackson gives E_n(f) = O(1/n).

(b) f(x) = x|x| is C^1 with Lipschitz derivative, so E_n(f) = O(1/n^2).

(c) If f is analytic on a neighborhood of [-1, 1], then E_n(f) = O(e^{-cn}) — geometric convergence; for f \in C^\infty the rate is superalgebraic (faster than any power of 1/n).

This mirrors the Fourier coefficient decay hierarchy from Topic 19: the same smoothness conditions control both Fourier coefficient decay and polynomial approximation rates.
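
These rates can be observed numerically. Best uniform approximants are expensive to compute exactly, but interpolation at Chebyshev points is within a logarithmic factor of E_n(f), so it serves as a proxy; a sketch with numpy (the degrees 8 and 64 are arbitrary choices):

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def cheb_interp_error(f, deg, num=4001):
    """Sup-norm error on [-1,1] of the degree-`deg` interpolant of f
    at Chebyshev points -- a stand-in for the minimax error E_deg(f)."""
    coeffs = cheb.chebinterpolate(f, deg)
    xs = np.linspace(-1.0, 1.0, num)
    return float(np.max(np.abs(cheb.chebval(xs, coeffs) - f(xs))))

f_lip = np.abs                    # Lipschitz: error ~ 1/n
f_c1 = lambda x: x * np.abs(x)    # C^1 with Lipschitz derivative: ~ 1/n^2

err_lip = [cheb_interp_error(f_lip, d) for d in (8, 64)]
err_c1 = [cheb_interp_error(f_c1, d) for d in (8, 64)]
```

Going from degree 8 to 64 shrinks the |x| error by roughly a factor of 8 and the x|x| error by roughly 64, matching the predicted exponents.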

💡 Remark 7 (The Jackson-Bernstein equivalence)

Together, Jackson’s and Bernstein’s theorems establish that

E_n(f) \asymp n^{-r-\alpha} \quad \Longleftrightarrow \quad f^{(r)} \text{ exists and is Hölder-}\alpha.

Approximation rate and smoothness are two sides of the same coin. This tight equivalence has an exact analog in Fourier theory, where coefficient decay characterizes the smoothness of periodic functions.

Jackson-predicted approximation rates for functions of varying smoothness — |x| at O(1/n), x|x| at O(1/n²), analytic functions at O(e^-cn)

8. ML Connections — Universal Approximation, Barron’s Theorem, Private Inference

8.1 The universal approximation theorem

Cybenko (1989) and Hornik, Stinchcombe & White (1990) proved that single-hidden-layer networks with sigmoidal activation functions can uniformly approximate any continuous function on a compact set: the set

\left\{x \mapsto \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + b_j) : N \in \mathbb{N},\ \alpha_j, b_j \in \mathbb{R},\ w_j \in \mathbb{R}^d\right\}

is dense in C(K) for compact K \subset \mathbb{R}^d. The proof structure mirrors Stone-Weierstrass in spirit: varying w_j gives a family that separates points, and varying b_j supplies the shifts. Networks are not literally closed under multiplication, so Cybenko's proof instead uses the Hahn-Banach theorem and the notion of a discriminatory activation: no nonzero measure annihilates every unit \sigma(w^T x + b).

The theorem says neural networks can approximate anything — but says nothing about how many neurons are needed. This is the same gap as in Weierstrass: existence without constructive bounds.

8.2 Barron’s theorem — approximation rates for networks

Barron (1993) proved a Jackson-type theorem for neural networks: if f has a Fourier transform with bounded first moment, C_f = \int |\omega|\,|\hat{f}(\omega)|\,d\omega < \infty, then the best approximation by a network with N hidden units satisfies

\inf_{f_N} \|f - f_N\|_2^2 \leq C \cdot \frac{C_f^2}{N}.

The rate O(1/N) is independent of the input dimension d — this avoids the curse of dimensionality that plagues polynomial approximation in high dimensions, where achieving a fixed accuracy can require a number of coefficients growing exponentially in d. This is why neural networks can outperform polynomials for high-dimensional function approximation.

8.3 Polynomial approximation in homomorphic encryption

In privacy-preserving ML inference, computations on encrypted data must be polynomial — addition and multiplication only, no comparisons, no divisions. Activation functions like ReLU and sigmoid are approximated by low-degree polynomials (typically Chebyshev or Bernstein) so that inference can be performed on encrypted inputs.

Bernstein polynomials are particularly useful because they preserve the range [0,1]: if 0 \leq f(x) \leq 1, then 0 \leq B_n(f; x) \leq 1 — a property that sigmoid approximations need.
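
As an illustration (not drawn from any particular HE library — the interval [-6, 6] and the degrees are arbitrary choices), here is a Bernstein approximation of the sigmoid that provably stays inside [0, 1]:

```python
from math import comb, exp

def sigmoid(t):
    return 1.0 / (1.0 + exp(-t))

def bernstein_sigmoid(n, t, lo=-6.0, hi=6.0):
    """Degree-n Bernstein approximation of sigmoid on [lo, hi].

    Rescale t to x in [0, 1]; since the Bernstein basis is nonnegative
    and sums to 1, the output is a convex combination of sigmoid values
    and therefore stays in [0, 1] -- the range-preservation property."""
    x = (t - lo) / (hi - lo)
    return sum(sigmoid(lo + (hi - lo) * k / n)
               * comb(n, k) * x**k * (1 - x) ** (n - k)
               for k in range(n + 1))
```

The result is a plain degree-n polynomial in t, so an HE scheme can evaluate it using additions and multiplications only.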

8.4 Kernel methods and RKHS approximation

In kernel-based ML (Gaussian processes, support vector machines), the function space is a reproducing kernel Hilbert space (RKHS). The representer theorem states that the best approximation in the RKHS from n data points is a finite linear combination of kernel evaluations — an approximation-theoretic result. The RKHS approximation error depends on the RKHS norm of the target function, analogous to how E_n(f) depends on the smoothness of f in Jackson’s theorem.

ML connections — universal approximation, Barron's theorem, Chebyshev activation approximation for homomorphic encryption, RKHS approximation bounds

9. Computational Notes

  • Bernstein polynomial evaluation: Direct evaluation of B_n(f; x) = \sum_k f(k/n)\, b_{k,n}(x) is O(n) per point. The de Casteljau algorithm evaluates Bézier curves (which use the Bernstein basis) in a numerically stable way via recursive linear interpolation.

  • Chebyshev expansion: For a function f on [-1, 1], the Chebyshev coefficients c_k = \frac{2}{\pi}\int_{-1}^{1} \frac{f(x)\, T_k(x)}{\sqrt{1 - x^2}}\,dx can be computed via the FFT on the Chebyshev nodes (a discrete cosine transform). This gives all n coefficients in O(n \log n) operations — the same complexity as the FFT for Fourier coefficients.

  • Clenshaw’s algorithm evaluates a Chebyshev expansion \sum c_k T_k(x) in O(n) operations using the three-term recurrence T_{n+1}(x) = 2x\, T_n(x) - T_{n-1}(x), analogous to Horner’s method for monomial-form polynomials.

  • Numerical verification: Compare Bernstein approximation errors for f(x) = |2x - 1| (convergence rate O(1/\sqrt{n})) and f(x) = x(1-x) (convergence rate O(1/n)), and compare Chebyshev vs. equispaced interpolation for Runge’s function f(x) = 1/(1 + 25x^2).
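
The two evaluation algorithms above are short enough to state in full; a standard-library sketch, using cos(k·arccos x) as the reference for the Chebyshev sum:

```python
from math import acos, comb, cos

def de_casteljau(coeffs, x):
    """Evaluate sum_k coeffs[k] * b_{k,n}(x) (n = len(coeffs) - 1) by
    repeated linear interpolation -- numerically stable on [0, 1]."""
    beta = list(coeffs)
    for j in range(1, len(beta)):
        for k in range(len(beta) - j):
            beta[k] = beta[k] * (1 - x) + beta[k + 1] * x
    return beta[0]

def clenshaw(c, x):
    """Evaluate the Chebyshev expansion sum_k c[k] * T_k(x) via the
    backward three-term recurrence (Clenshaw's algorithm)."""
    b1 = b2 = 0.0
    for ck in reversed(c[1:]):
        b1, b2 = ck + 2.0 * x * b1 - b2, b1
    return c[0] + x * b1 - b2
```

Both agree with direct evaluation: de_casteljau against the explicit Bernstein sum, clenshaw against \sum_k c_k \cos(k \arccos x).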

Computational verification — Bernstein vs. Chebyshev approximation error comparison and Chebyshev coefficient decay

10. Connections & Further Reading

Track 5 completion

This topic completes the Sequences, Series & Approximation track. The four topics form a natural progression:

  1. Series Convergence & Tests — When do infinite sums converge? Absolute vs. conditional convergence, comparison and ratio tests.
  2. Power Series & Taylor Series — Local polynomial approximation from derivatives, radius of convergence, Taylor remainder.
  3. Fourier Series & Orthogonal Expansions — Global trigonometric approximation from integrals, orthogonality, coefficient decay, Gibbs phenomenon.
  4. Approximation Theory — The theoretical limits of polynomial approximation: existence (Weierstrass), construction (Bernstein), optimality (Chebyshev), rates (Jackson-Bernstein).

The thread connecting them all is the question: how well can we approximate functions, and what determines the answer?

Comparing three approximation methods — f(x) with its Taylor, Fourier, and Bernstein approximants; right panel: max |error| vs. degree (log scale)

Prerequisites used in this topic

  • Fourier Series & Orthogonal Expansions — Best L^2 approximation in the trigonometric basis, the coefficient decay \leftrightarrow approximation rate connection, the Gibbs phenomenon as the Fourier analog of the Runge phenomenon.
  • Uniform Convergence — Uniform convergence of Bernstein polynomials, the distinction between pointwise and uniform convergence, interchange theorems used in Stone-Weierstrass.
  • Power Series & Taylor Series — Local vs. global approximation contrast, Taylor remainder vs. Jackson’s theorem, the approximation triptych.
  • The Riemann Integral & FTC — L^2 norm computation, modulus of continuity, the Bernstein expectation interpretation as integration against the binomial distribution.

Forward references (planned topics)

  • Metric Spaces & Topology — density of polynomials in C[a,b] under the sup-norm is a statement about metric-space topology; the contraction mapping theorem sheds light on Bernstein operator iteration.
  • Normed & Banach Spaces — C[a,b] and L^2[a,b] as Banach spaces; best approximation in normed spaces; the equioscillation theorem as a characterization in the dual space.
  • Inner Product & Hilbert Spaces — Best L^2 approximation as orthogonal projection; Legendre polynomials as an orthogonal basis; RKHS approximation theory.
  • Calculus of Variations — Best approximation as optimization over function spaces. The direct method provides the general existence principle.
  • PAC Learning → formalML — The universal approximation theorem and Barron’s theorem on neural network approximation rates.
  • Gradient Descent → formalML — Polynomial approximation of activation functions for homomorphic encryption.
  • Spectral Theorem → formalML — Stone-Weierstrass gives the continuous functional calculus for self-adjoint operators.
  • Riemannian Geometry → formalML — RKHS approximation theory for kernel regression and Gaussian processes.

References

  1. book Rudin (1976). Principles of Mathematical Analysis. Chapter 7 — the Stone-Weierstrass theorem and uniform approximation.
  2. book Trefethen (2019). Approximation Theory and Approximation Practice. The definitive modern treatment of Chebyshev approximation, interpolation, and practical polynomial approximation — the computational perspective.
  3. book DeVore & Lorentz (1993). Constructive Approximation. Chapters 1–4 — Bernstein polynomials, Jackson theorems, moduli of smoothness, inverse theorems.
  4. book Abbott (2015). Understanding Analysis. Chapter 6 — the Weierstrass approximation theorem via Bernstein polynomials, the cleanest undergraduate proof.
  5. paper Cybenko (1989). “Approximation by Superpositions of a Sigmoidal Function.” The original universal approximation theorem for neural networks — the neural-network Weierstrass theorem.
  6. paper Hornik, Stinchcombe & White (1990). “Universal Approximation of an Unknown Mapping and Its Derivatives Using Multilayer Feedforward Networks.” Extension of Cybenko's result to networks with arbitrary activation functions and derivative approximation.