Series & Approximation · intermediate · 50 min read

Approximation Theory

Every continuous function on a closed interval can be uniformly approximated by polynomials — the Weierstrass theorem. Bernstein polynomials provide the constructive proof, Chebyshev polynomials provide the optimal nodes, and Jackson's theorem quantifies how smoothness controls the rate. This is the existence theorem of approximation: the guarantee that good approximants exist, the machinery to build them, and the connection to the universal approximation theorem for neural networks.

Abstract. The Weierstrass approximation theorem guarantees that every continuous function f on a closed interval [a,b] can be uniformly approximated by polynomials: for any ε > 0, there exists a polynomial p with ||f − p||∞ < ε. The Bernstein polynomial construction Bₙ(f; x) = Σ f(k/n)C(n,k)xᵏ(1−x)ⁿ⁻ᵏ provides a concrete, constructive proof — each Bₙ is the expected value of f evaluated at a binomial random variable, and convergence follows from the law of large numbers. The convergence rate O(1/√n) is slow, but the theorem's power lies in its generality: no differentiability, no analyticity, just continuity. The Stone-Weierstrass theorem generalizes to compact spaces: a closed subalgebra of C(K) that separates points and contains the constants is dense. For optimal polynomial approximation, Chebyshev polynomials Tₙ(x) = cos(n arccos x) provide near-minimax approximation and the ideal interpolation nodes, avoiding the Runge phenomenon that plagues equispaced interpolation. Jackson's theorem quantifies the rate of best approximation: Eₙ(f) ≤ C·ω(f; 1/n), where ω is the modulus of continuity — smoother functions admit faster polynomial approximation. Bernstein's inverse theorem provides the converse: fast approximation implies smoothness. In machine learning, the universal approximation theorem for neural networks (Cybenko, Hornik) is the direct analog of Weierstrass: single-hidden-layer networks can uniformly approximate any continuous function on a compact set. Barron's theorem gives approximation rates for networks, and polynomial approximation underlies homomorphic encryption for private ML inference.

Where this leads → formalML

  • formalML The universal approximation theorem for neural networks (Cybenko 1989, Hornik 1991) is the neural-network analog of the Weierstrass theorem. The proof structure mirrors Stone-Weierstrass: networks form an algebra that separates points. Barron's theorem gives O(1/√n) approximation rates for networks with bounded first moments — the same rate as Bernstein polynomials.
  • formalML Polynomial and Chebyshev approximation of activation functions enables private ML inference via homomorphic encryption. Bernstein polynomials preserve the bounded range [0,1], making them suitable for approximating sigmoid and ReLU in encrypted computation.
  • formalML The Stone-Weierstrass theorem for C*-algebras gives the continuous functional calculus: for self-adjoint operator A, f(A) is defined by approximating f with polynomials in A. This is the infinite-dimensional generalization of the polynomial approximation theory developed here.
  • formalML RKHS approximation theory provides error bounds for kernel regression and Gaussian processes. The representer theorem is an approximation-theoretic result about best approximation in RKHS — the kernel analog of best polynomial approximation in C[a,b].

1. Overview & Motivation — Why Approximation Theory?

Track 5 has now developed three approximation paradigms. Power Series & Taylor Series showed how to approximate a function locally by matching derivatives at a center c — but Taylor series require analyticity and are limited by the radius of convergence. Fourier Series & Orthogonal Expansions showed how to approximate a function globally by decomposing it into trigonometric harmonics — but Fourier series work best for periodic functions and converge in the L^2 norm, which allows large pointwise errors on small sets.

The capstone question of this track is: can we approximate any continuous function on [a,b] by polynomials, uniformly (with a guaranteed pointwise bound everywhere)? The answer is yes — the Weierstrass approximation theorem — and understanding why, how, and how fast is the subject of approximation theory.

Why this matters in ML. The universal approximation theorem for neural networks (Cybenko 1989) is the neural-network analog of Weierstrass. Just as Weierstrass says polynomials are dense in C[a,b], Cybenko says single-hidden-layer networks are dense in C(K) for compact K. Understanding the classical theory illuminates the modern one — including its limitations. Density says approximants exist but says nothing about how to find them or how many parameters are needed. The gap between existence and computation is the gap between Weierstrass and training a neural network.

💡 Remark 1 (The approximation triptych)

Track 5 has developed three approximation paradigms, each with its own strengths and limitations:

Comparing Taylor (Topic 18), Fourier (Topic 19), and Weierstrass/Bernstein (Topic 20):

  • Basis — Taylor: monomials \{1, x, x^2, \ldots\}; Fourier: trig functions \{1, \cos x, \sin x, \ldots\}; Bernstein: the Bernstein basis \{b_{k,n}\}
  • Scope — Taylor: local (radius R around center c); Fourier: global on the period [-\pi, \pi]; Bernstein: global on [a, b]
  • Requires — Taylor: analyticity; Fourier: integrability; Bernstein: continuity only
  • Convergence — Taylor: inside radius R; Fourier: L^2 norm, pointwise under conditions; Bernstein: uniform on [a, b]
  • Coefficients — Taylor: derivatives f^{(n)}(c)/n!; Fourier: integrals \hat{f}(n) = \frac{1}{2\pi}\int f e^{-inx}\,dx; Bernstein: samples f(k/n)

Each successive paradigm trades specificity for generality. The thread connecting them all is the question: how well can we approximate functions, and what determines the answer?

The approximation triptych — Taylor (local polynomial at a center), Fourier (global trigonometric on a period), Bernstein (global polynomial on an interval)

2. The Weierstrass Approximation Theorem

The Weierstrass approximation theorem (1885) is the existence theorem of approximation theory. It tells us that polynomial approximation is always possible for continuous functions — the polynomials are dense in C[a,b].

📐 Definition 1 (Best Polynomial Approximation)

For f \in C[a,b] and n \geq 0, the best polynomial approximation error (or minimax error) of degree n is

E_n(f) = \inf_{p \in \mathcal{P}_n} \|f - p\|_\infty = \inf_{p \in \mathcal{P}_n} \max_{x \in [a,b]} |f(x) - p(x)|

where \mathcal{P}_n is the space of polynomials of degree \leq n. The infimum is attained — compactness of [a,b], continuity of f, and the finite dimension of \mathcal{P}_n guarantee the existence of a best approximant p_n^* with \|f - p_n^*\|_\infty = E_n(f).

🔷 Theorem 1 (The Weierstrass Approximation Theorem)

If f: [a,b] \to \mathbb{R} is continuous, then for every \varepsilon > 0 there exists a polynomial p such that

\|f - p\|_\infty = \max_{x \in [a,b]} |f(x) - p(x)| < \varepsilon.

Equivalently, E_n(f) \to 0 as n \to \infty.

In the language of metric spaces: the polynomials \mathcal{P} = \bigcup_{n=0}^{\infty} \mathcal{P}_n are dense in (C[a,b], \|\cdot\|_\infty).

💡 Remark 2 (What Weierstrass does and does not say)

The theorem is an existence result. It guarantees that good polynomial approximants exist but does not:

  • Construct the best polynomial for a given \varepsilon
  • Specify what degree n is needed
  • Identify the optimal polynomial p_n^*

The Bernstein construction (Section 3) provides an explicit sequence of polynomial approximants. The Chebyshev theory (Section 6) characterizes the best approximant. Jackson’s theorem (Section 7) quantifies the rate.

3. Bernstein Polynomials — The Constructive Proof

The Bernstein construction is the simplest constructive proof of Weierstrass. It builds polynomial approximants by averaging f over a binomial distribution — a probabilistic construction that converts the law of large numbers into an approximation theorem.

📐 Definition 2 (Bernstein Polynomial)

For f: [0,1] \to \mathbb{R} and n \geq 1, the nth Bernstein polynomial of f is

B_n(f; x) = \sum_{k=0}^{n} f\!\left(\frac{k}{n}\right) \binom{n}{k} x^k (1 - x)^{n-k}.

The Bernstein basis polynomials are

b_{k,n}(x) = \binom{n}{k} x^k (1 - x)^{n-k}, \quad k = 0, 1, \ldots, n.

These form a partition of unity: \sum_{k=0}^n b_{k,n}(x) = 1 for all x \in [0,1].
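
The partition-of-unity identity — along with the first- and second-moment identities used in the convergence proof below — can be checked numerically. A minimal sketch using only the Python standard library (the function names are illustrative):

```python
from math import comb

def bernstein_basis(k, n, x):
    """Bernstein basis polynomial b_{k,n}(x) = C(n,k) x^k (1-x)^(n-k)."""
    return comb(n, k) * x**k * (1 - x) ** (n - k)

def basis_moments(n, x):
    """Return (sum_k b_k, sum_k (k/n) b_k, sum_k (k/n - x)^2 b_k)."""
    b = [bernstein_basis(k, n, x) for k in range(n + 1)]
    s0 = sum(b)                                                # partition of unity: 1
    s1 = sum(k / n * bk for k, bk in enumerate(b))             # first moment: x
    s2 = sum((k / n - x) ** 2 * bk for k, bk in enumerate(b))  # variance: x(1-x)/n
    return s0, s1, s2
```

At n = 20, x = 0.3, the three sums return 1, 0.3, and 0.3·0.7/20 up to floating-point rounding.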

💡 Remark 3 (The probabilistic interpretation)

Let S_n \sim \text{Binomial}(n, x). Then

B_n(f; x) = \mathbb{E}\!\left[f\!\left(\frac{S_n}{n}\right)\right].

Each Bernstein polynomial evaluates f at the sample points k/n weighted by the binomial probabilities. As n \to \infty, the law of large numbers gives S_n/n \to x in probability, so f(S_n/n) \to f(x) by continuity, giving B_n(f; x) \to f(x). This is the heuristic — the rigorous proof makes it precise.

🔷 Theorem 2 (Uniform Convergence of Bernstein Polynomials)

If f: [0,1] \to \mathbb{R} is continuous, then B_n(f; x) \to f(x) uniformly on [0,1]:

\|B_n(f) - f\|_\infty \to 0 \quad \text{as } n \to \infty.

Proof.

The proof uses three identities for the Bernstein basis:

Identity 1 (partition of unity): \displaystyle\sum_{k=0}^n b_{k,n}(x) = 1.

Identity 2 (first moment): \displaystyle\sum_{k=0}^n \frac{k}{n}\, b_{k,n}(x) = x.

Identity 3 (variance): \displaystyle\sum_{k=0}^n \left(\frac{k}{n} - x\right)^2 b_{k,n}(x) = \frac{x(1-x)}{n} \leq \frac{1}{4n}.

Now fix \varepsilon > 0. Since f is continuous on the compact set [0,1], it is uniformly continuous: choose \delta > 0 such that |f(s) - f(t)| < \varepsilon whenever |s - t| < \delta.

For any x \in [0,1], using Identity 1:

|B_n(f;x) - f(x)| = \left|\sum_{k=0}^n \bigl[f(k/n) - f(x)\bigr]\, b_{k,n}(x)\right| \leq \sum_{k=0}^n |f(k/n) - f(x)|\, b_{k,n}(x).

Split the sum into two parts. Let M = \|f\|_\infty.

Near terms (|k/n - x| < \delta): Here |f(k/n) - f(x)| < \varepsilon, so these terms contribute at most \varepsilon \cdot \sum b_{k,n}(x) \leq \varepsilon.

Far terms (|k/n - x| \geq \delta): Here |f(k/n) - f(x)| \leq 2M, and by Identity 3:

\sum_{|k/n - x| \geq \delta} b_{k,n}(x) \leq \frac{1}{\delta^2}\sum_{k=0}^n \left(\frac{k}{n} - x\right)^2 b_{k,n}(x) \leq \frac{1}{4n\delta^2}.

These terms contribute at most 2M \cdot \frac{1}{4n\delta^2} = \frac{M}{2n\delta^2}.

Combining: |B_n(f;x) - f(x)| \leq \varepsilon + \frac{M}{2n\delta^2}.

Choose n > \frac{M}{2\varepsilon\delta^2}. Then |B_n(f;x) - f(x)| < 2\varepsilon for all x \in [0,1], so \|B_n(f) - f\|_\infty < 2\varepsilon.

Since \varepsilon > 0 was arbitrary, B_n(f) \to f uniformly. \square

📝 Example 1 (Bernstein polynomials for f(x) = |2x − 1|)

The function f(x) = |2x - 1| is continuous but not differentiable at x = 1/2. Its Bernstein polynomials B_n(f; x) = \sum_k |2k/n - 1|\, b_{k,n}(x) are smooth polynomials that converge uniformly to f, rounding the corner at x = 1/2. At n = 5, the approximation is visibly smooth; at n = 50, it is nearly indistinguishable from f except in a narrow region around x = 1/2.

📝 Example 2 (Convergence rate comparison)

For the smooth function f(x) = x(1-x), the Bernstein polynomial satisfies B_n(f; x) = x(1-x)(n-1)/n, giving \|B_n(f) - f\|_\infty = 1/(4n) = O(1/n) — faster than the general O(1/\sqrt{n}) because f is smooth.

For f(x) = |2x - 1| (Lipschitz but not C^1), \|B_n(f) - f\|_\infty = O(1/\sqrt{n}) — the worst-case rate for Lipschitz functions.

This illustrates that Bernstein convergence is slow — adequate for proving the Weierstrass theorem, but not optimal for computation.


Bernstein polynomials B_n(f; x) for f(x) = |2x − 1| at n = 5, 20, 100
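
The rates in Example 2 can be reproduced in a few lines. A self-contained sketch (the sup-norm is estimated on a grid, so the reported errors are approximate):

```python
from math import comb

def bernstein(f, n, x):
    """Evaluate the nth Bernstein polynomial B_n(f; x) on [0, 1]."""
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x) ** (n - k)
               for k in range(n + 1))

def sup_error(f, n, m=1001):
    """Grid estimate of ||B_n(f) - f||_inf over [0, 1]."""
    xs = [i / (m - 1) for i in range(m)]
    return max(abs(bernstein(f, n, x) - f(x)) for x in xs)

# Smooth f: the error is exactly 1/(4n), the O(1/n) rate.
smooth = lambda x: x * (1 - x)
# Corner at x = 1/2: the error decays at the slower O(1/sqrt(n)) rate.
corner = lambda x: abs(2 * x - 1)
```

For the smooth function, sup_error(smooth, 10) returns 1/(4·10) = 0.025 up to rounding; for the corner function, quadrupling n from 20 to 80 roughly halves the error, consistent with O(1/√n).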

4. Stone-Weierstrass — The Grand Generalization

Stone’s generalization (1937, 1948) replaces the specific polynomial algebra on [a,b] with an abstract algebra on a compact space, identifying exactly what properties an algebra needs to be dense in C(K).

📐 Definition 3 (Separating Algebra)

A set \mathcal{A} \subset C(K) is a subalgebra of C(K) if it is closed under addition, scalar multiplication, and pointwise multiplication. \mathcal{A} separates points if for any x \neq y in K, there exists g \in \mathcal{A} with g(x) \neq g(y). \mathcal{A} vanishes nowhere if for every x \in K, there exists g \in \mathcal{A} with g(x) \neq 0. If \mathcal{A} contains the constant functions, it automatically vanishes nowhere.

🔷 Theorem 3 (The Stone-Weierstrass Theorem (Real Version))

Let K be a compact metric space and let \mathcal{A} \subset C(K, \mathbb{R}) be a subalgebra that separates points and contains the constant functions. Then \mathcal{A} is dense in C(K, \mathbb{R}) under the sup-norm:

\overline{\mathcal{A}} = C(K, \mathbb{R}).

📝 Example 3 (Recovering the Weierstrass theorem)

Take K = [a,b] and \mathcal{A} = \mathcal{P}, the polynomials. Polynomials form a subalgebra (closed under addition, multiplication, and scalar multiplication). They separate points: p(x) = x distinguishes any x \neq y. They contain the constants. Therefore \overline{\mathcal{P}} = C[a,b] by Stone-Weierstrass. This is precisely Theorem 1.

📝 Example 4 (Trigonometric polynomials on the circle)

Take K = \mathbb{T} = \mathbb{R}/(2\pi\mathbb{Z}) (the circle) and \mathcal{A} the trigonometric polynomials. They form a subalgebra (products of trig functions are trig polynomials), separate points (if e^{ix} \neq e^{iy}, then \cos x \neq \cos y or \sin x \neq \sin y), and contain the constants. Stone-Weierstrass gives density, consistent with the Fourier series theory from Topic 19, which proved this concretely via Parseval’s identity.

📝 Example 5 (Polynomials in several variables)

Take K \subset \mathbb{R}^d compact and \mathcal{A} the polynomials in d variables. They separate points: the coordinate functions x_i distinguish points that differ in any coordinate. Stone-Weierstrass gives the density of multivariate polynomials in C(K) — a result that would be laborious to prove directly but follows immediately from the abstract theorem.

💡 Remark 4 (What Stone-Weierstrass does not cover)

The theorem applies to C(K, \mathbb{R}) with the sup-norm. It does not directly apply to L^p spaces (which require measure theory) or to non-compact domains (which require additional decay conditions). The complex version requires \mathcal{A} to be self-conjugate: g \in \mathcal{A} \Rightarrow \bar{g} \in \mathcal{A}. This condition is automatic for real-valued algebras but must be checked for complex ones.

Three instances of Stone-Weierstrass — polynomials on [0,1], trigonometric polynomials on the circle, multivariate polynomials on a compact subset of R²

5. Best Approximation — L^2 vs. Uniform

The choice of norm changes the best approximant, the characterization of optimality, and the connection to other mathematical structures. Two norms, two optimization problems, two different answers.

📐 Definition 4 (Best Approximation in L² and C)

The best L^2 approximation of f \in L^2[a,b] from \mathcal{P}_n is the polynomial p_n^{L^2} minimizing

\|f - p\|_2 = \sqrt{\int_a^b |f(x) - p(x)|^2\,dx}.

The best uniform (Chebyshev) approximation of f \in C[a,b] from \mathcal{P}_n is the polynomial p_n^* minimizing

\|f - p\|_\infty = \max_{x \in [a,b]} |f(x) - p(x)|.

🔷 Theorem 4 (Best L² Approximation as Orthogonal Projection)

The best L^2 polynomial approximation p_n^{L^2} is the orthogonal projection of f onto \mathcal{P}_n in the inner product \langle f, g \rangle = \int_a^b f(x)g(x)\,dx. If \{\phi_0, \phi_1, \ldots, \phi_n\} is an orthonormal basis for \mathcal{P}_n (e.g., the normalized Legendre polynomials), then

p_n^{L^2}(x) = \sum_{k=0}^{n} \langle f, \phi_k \rangle\, \phi_k(x).

This is exactly the Fourier coefficient formula from Topic 19, with orthogonal polynomials replacing trigonometric functions.

💡 Remark 5 (L² vs. uniform — why both matter)

L^2 best approximation is easy to compute (orthogonal projection — just compute inner products) but allows large pointwise errors on small sets. Uniform best approximation is hard to compute (a nonlinear optimization problem) but gives a guaranteed pointwise bound everywhere.

In ML: L^2 approximation corresponds to minimizing mean squared error (training loss), while uniform approximation corresponds to worst-case guarantees (robustness). A model that minimizes MSE can still have catastrophic errors on rare inputs — the gap between L^2 and uniform approximation is the gap between average-case and worst-case performance.

📝 Example 6 (Comparing L² and uniform best approximation of |x|)

For f(x) = |x| on [-1,1] with n = 2 (quadratic approximation): the best L^2 approximant is the orthogonal projection p_2^{L^2}(x) = \frac{1}{2} + \frac{15}{16}\left(x^2 - \frac{1}{3}\right) (computed from Legendre coefficients), while the best uniform approximant is p_2^*(x) = x^2 + \frac{1}{8}, whose error equioscillates between +E_2(f) and -E_2(f) with E_2(f) = \frac{1}{8} — here at the five points x = -1, -\tfrac{1}{2}, 0, \tfrac{1}{2}, 1, more than the n + 2 = 4 required. The two approximants are different polynomials — the norm determines the answer.


L² vs. uniform best quadratic approximation of |x| on [-1,1], showing equioscillation for the uniform approximant
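
Both approximants in Example 6 can be computed and checked numerically; a sketch using numpy's Legendre module (quadrature by composite trapezoid, so coefficients are accurate only up to quadrature error):

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

def legendre_projection(f, degree, num=200001):
    """Best L^2[-1,1] polynomial approximation of f, computed as the
    orthogonal projection onto the Legendre polynomials P_0..P_degree."""
    x = np.linspace(-1.0, 1.0, num)
    dx = x[1] - x[0]
    coeffs = []
    for k in range(degree + 1):
        y = f(x) * Legendre.basis(k)(x)
        inner = dx * (y.sum() - 0.5 * (y[0] + y[-1]))  # trapezoid: <f, P_k>
        coeffs.append(inner / (2.0 / (2 * k + 1)))     # divide by ||P_k||^2
    return Legendre(coeffs)

# Projection of |x|: Legendre-basis coefficients (1/2, 0, 5/8),
# i.e. p(x) = 3/16 + (15/16) x^2.
p_l2 = legendre_projection(np.abs, 2)

# Best *uniform* quadratic is p*(x) = x^2 + 1/8; its error equioscillates
# between -1/8 and +1/8 at x = -1, -1/2, 0, 1/2, 1.
uniform_err = lambda x: abs(x) - (x**2 + 1.0 / 8.0)
```

The two polynomials differ, as the example claims: the projection weights squared error, while p* controls the worst case.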

6. Chebyshev Polynomials & Optimal Nodes

Chebyshev polynomials are the algebraic analog of the trigonometric functions — they arise from \cos(n \arccos x), connect polynomial approximation to Fourier analysis, and provide the optimal interpolation nodes.

📐 Definition 5 (Chebyshev Polynomials)

The Chebyshev polynomial of the first kind is

T_n(x) = \cos(n \arccos x), \quad x \in [-1, 1], \quad n \geq 0.

Equivalently, for n \geq 1, T_n is the unique polynomial of degree n with leading coefficient 2^{n-1} satisfying |T_n(x)| \leq 1 for x \in [-1, 1].

The zeros are the Chebyshev nodes:

x_k = \cos\!\left(\frac{2k - 1}{2n}\pi\right), \quad k = 1, \ldots, n.

🔷 Theorem 5 (Chebyshev Minimax Property)

Among all monic polynomials of degree n (leading coefficient = 1), the one with smallest sup-norm on [-1,1] is \tilde{T}_n(x) = T_n(x)/2^{n-1}, and

\min_{\text{monic } p,\ \deg p = n} \|p\|_\infty = \frac{1}{2^{n-1}},

achieved uniquely by the normalized Chebyshev polynomial.

🔷 Proposition 1 (The Equioscillation Theorem (Chebyshev's Theorem))

For f \in C[a,b], a polynomial p^* \in \mathcal{P}_n is the best uniform approximation if and only if the error f - p^* equioscillates at least n + 2 times: there exist a \leq x_0 < x_1 < \cdots < x_{n+1} \leq b such that

f(x_j) - p^*(x_j) = \pm(-1)^j \|f - p^*\|_\infty

(alternating sign, full magnitude, with one fixed overall sign). This characterizes the best approximant uniquely.

📝 Example 7 (The Runge phenomenon)

Consider interpolating f(x) = 1/(1 + 25x^2) on [-1, 1] using polynomial interpolation. With n + 1 equispaced nodes, the interpolant diverges near the endpoints as n \to \infty — the maximum interpolation error grows without bound. With Chebyshev nodes x_k = \cos((2k-1)\pi/(2n+2)), the interpolant converges uniformly.

The Runge phenomenon is the polynomial interpolation analog of the Gibbs phenomenon from Topic 19: equispaced sampling is suboptimal, and the right node distribution matters enormously.
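
The comparison is easy to reproduce. A minimal numpy sketch (np.polyfit is adequate at this small degree; production code would use a barycentric formulation):

```python
import numpy as np

def runge(x):
    return 1.0 / (1.0 + 25.0 * x**2)

def max_interp_error(nodes, num=2001):
    """Sup-norm error on [-1,1] of the polynomial interpolant of the
    Runge function through the given nodes (degree = len(nodes) - 1)."""
    coeffs = np.polyfit(nodes, runge(nodes), len(nodes) - 1)
    xs = np.linspace(-1.0, 1.0, num)
    return float(np.max(np.abs(np.polyval(coeffs, xs) - runge(xs))))

n = 10  # polynomial degree; n + 1 nodes
equispaced = np.linspace(-1.0, 1.0, n + 1)
chebyshev = np.cos((2 * np.arange(1, n + 2) - 1) * np.pi / (2 * (n + 1)))
```

At n = 10 the equispaced error is already around 1.9 while the Chebyshev error is about 0.11, and the gap widens as n grows.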

💡 Remark 6 (Chebyshev polynomials as 'cosines in disguise')

The identity T_n(\cos\theta) = \cos(n\theta) means that Chebyshev approximation on [-1,1] is equivalent to Fourier cosine approximation on [0, \pi] under the substitution x = \cos\theta. This connects Topic 20 back to Topic 19: the Chebyshev coefficients of f are the Fourier cosine coefficients of f(\cos\theta), and the coefficient decay rates from Topic 19 translate directly into approximation rates for Chebyshev expansions.


The Runge phenomenon — equispaced interpolation diverges while Chebyshev interpolation converges for the Runge function

7. Jackson’s Theorem & Bernstein’s Inverse

How fast can we approximate? Jackson’s theorem says the rate of best polynomial approximation is controlled by the smoothness of f — smoother functions admit faster approximation. Bernstein’s inverse theorem gives the converse: fast approximation implies smoothness.

📐 Definition 6 (Modulus of Continuity)

The modulus of continuity of f \in C[a,b] is

\omega(f; \delta) = \sup_{|x - y| \leq \delta} |f(x) - f(y)|, \quad \delta > 0.

f is Lipschitz (with constant L) if \omega(f; \delta) \leq L\delta. f is Hölder continuous with exponent \alpha \in (0, 1] if \omega(f; \delta) \leq C\delta^\alpha. The modulus quantifies “how continuous” f is — it is the most refined scalar measure of regularity for continuous functions.

🔷 Theorem 6 (Jackson's Theorem (Direct Theorem))

If f \in C[-1, 1], then for every n \geq 1,

E_n(f) \leq C \cdot \omega\!\left(f; \frac{1}{n}\right),

where C is an absolute constant. More generally, if f \in C^r[-1,1] (i.e., f is r times continuously differentiable), then

E_n(f) \leq \frac{C_r}{n^r}\, \omega\!\left(f^{(r)}; \frac{1}{n}\right).

The smoother f is, the faster E_n(f) \to 0.

🔷 Theorem 7 (Bernstein's Inverse Theorem)

If E_n(f) = O(n^{-r-\alpha}) for some integer r \geq 0 and 0 < \alpha < 1, then f \in C^r[-1,1] and f^{(r)} is Hölder continuous with exponent \alpha:

\omega(f^{(r)}; \delta) = O(\delta^\alpha).

The converse of Jackson’s theorem: fast approximation rates imply smoothness.

📝 Example 8 (Jackson's theorem for functions of varying smoothness)

(a) f(x) = |x| is Lipschitz (\omega(f; \delta) = \delta), so Jackson gives E_n(f) = O(1/n).

(b) f(x) = x|x| is C^1 with Lipschitz derivative, so E_n(f) = O(1/n^2).

(c) If f is analytic on a neighborhood of [-1, 1], then E_n(f) = O(e^{-cn}) — geometric convergence; for f \in C^\infty the rate is superalgebraic (faster than any power of 1/n).

This mirrors the Fourier coefficient decay hierarchy from Topic 19: the same smoothness conditions control both Fourier coefficient decay and polynomial approximation rates.
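
These rates can be observed numerically. Best uniform approximants are expensive to compute exactly, but interpolation at Chebyshev points is within a logarithmic factor of E_n(f), so it serves as a proxy; a sketch with numpy (the degrees 8 and 64 are arbitrary choices):

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def cheb_interp_error(f, deg, num=4001):
    """Sup-norm error on [-1,1] of the degree-`deg` interpolant of f
    at Chebyshev points -- a stand-in for the minimax error E_deg(f)."""
    coeffs = cheb.chebinterpolate(f, deg)
    xs = np.linspace(-1.0, 1.0, num)
    return float(np.max(np.abs(cheb.chebval(xs, coeffs) - f(xs))))

f_lip = np.abs                    # Lipschitz: error ~ 1/n
f_c1 = lambda x: x * np.abs(x)    # C^1 with Lipschitz derivative: ~ 1/n^2

err_lip = [cheb_interp_error(f_lip, d) for d in (8, 64)]
err_c1 = [cheb_interp_error(f_c1, d) for d in (8, 64)]
```

Going from degree 8 to 64 shrinks the |x| error by roughly a factor of 8 and the x|x| error by roughly 64, matching the predicted exponents.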

💡 Remark 7 (The Jackson-Bernstein equivalence)

Together, Jackson’s and Bernstein’s theorems establish that

E_n(f) \asymp n^{-r-\alpha} \quad \Longleftrightarrow \quad f^{(r)} \text{ exists and is Hölder-}\alpha.

Approximation rate and smoothness are two sides of the same coin. This tight equivalence has an exact analog in Fourier theory, where coefficient decay characterizes the smoothness of periodic functions.

Jackson-predicted approximation rates for functions of varying smoothness — |x| at O(1/n), x|x| at O(1/n²), analytic functions at O(e^-cn)

8. ML Connections — Universal Approximation, Barron’s Theorem, Private Inference

8.1 The universal approximation theorem

Cybenko (1989) and Hornik, Stinchcombe & White (1990) proved that single-hidden-layer networks with sigmoidal activation functions can uniformly approximate any continuous function on a compact set: the set

\left\{x \mapsto \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + b_j) : N \in \mathbb{N},\ \alpha_j, b_j \in \mathbb{R},\ w_j \in \mathbb{R}^d\right\}

is dense in C(K) for compact K \subset \mathbb{R}^d. The proof structure mirrors Stone-Weierstrass in spirit: varying w_j gives a family that separates points, and varying b_j supplies the shifts. Networks are not literally closed under multiplication, so Cybenko's proof instead uses the Hahn-Banach theorem and the notion of a discriminatory activation: no nonzero measure annihilates every unit \sigma(w^T x + b).

The theorem says neural networks can approximate anything — but says nothing about how many neurons are needed. This is the same gap as in Weierstrass: existence without constructive bounds.

8.2 Barron’s theorem — approximation rates for networks

Barron (1993) proved a Jackson-type theorem for neural networks: if f has a Fourier transform with bounded first moment, C_f = \int |\omega|\,|\hat{f}(\omega)|\,d\omega < \infty, then the best approximation by a network with N hidden units satisfies

\inf_{f_N} \|f - f_N\|_2^2 \leq C \cdot \frac{C_f^2}{N}.

The rate O(1/N) is independent of the input dimension d — this avoids the curse of dimensionality that plagues polynomial approximation in high dimensions, where achieving a fixed accuracy can require a number of coefficients growing exponentially in d. This is why neural networks can outperform polynomials for high-dimensional function approximation.

8.3 Polynomial approximation in homomorphic encryption

In privacy-preserving ML inference, computations on encrypted data must be polynomial — addition and multiplication only, no comparisons, no divisions. Activation functions like ReLU and sigmoid are approximated by low-degree polynomials (typically Chebyshev or Bernstein) so that inference can be performed on encrypted inputs.

Bernstein polynomials are particularly useful because they preserve the range [0,1]: if 0 \leq f(x) \leq 1, then 0 \leq B_n(f; x) \leq 1 — a property that sigmoid approximations need.
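
As an illustration (not drawn from any particular HE library — the interval [-6, 6] and the degrees are arbitrary choices), here is a Bernstein approximation of the sigmoid that provably stays inside [0, 1]:

```python
from math import comb, exp

def sigmoid(t):
    return 1.0 / (1.0 + exp(-t))

def bernstein_sigmoid(n, t, lo=-6.0, hi=6.0):
    """Degree-n Bernstein approximation of sigmoid on [lo, hi].

    Rescale t to x in [0, 1]; since the Bernstein basis is nonnegative
    and sums to 1, the output is a convex combination of sigmoid values
    and therefore stays in [0, 1] -- the range-preservation property."""
    x = (t - lo) / (hi - lo)
    return sum(sigmoid(lo + (hi - lo) * k / n)
               * comb(n, k) * x**k * (1 - x) ** (n - k)
               for k in range(n + 1))
```

The result is a plain degree-n polynomial in t, so an HE scheme can evaluate it using additions and multiplications only.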

8.4 Kernel methods and RKHS approximation

In kernel-based ML (Gaussian processes, support vector machines), the function space is a reproducing kernel Hilbert space (RKHS). The representer theorem states that the best approximation in the RKHS from n data points is a finite linear combination of kernel evaluations — an approximation-theoretic result. The RKHS approximation error depends on the RKHS norm of the target function, analogous to how E_n(f) depends on the smoothness of f in Jackson’s theorem.

ML connections — universal approximation, Barron's theorem, Chebyshev activation approximation for homomorphic encryption, RKHS approximation bounds

9. Computational Notes

  • Bernstein polynomial evaluation: Direct evaluation of B_n(f; x) = \sum_k f(k/n)\, b_{k,n}(x) is O(n) per point. The de Casteljau algorithm evaluates Bézier curves (which use the Bernstein basis) in a numerically stable way via recursive linear interpolation.

  • Chebyshev expansion: For a function f on [-1, 1], the Chebyshev coefficients c_k = \frac{2}{\pi}\int_{-1}^{1} \frac{f(x)\, T_k(x)}{\sqrt{1 - x^2}}\,dx can be computed via the FFT on the Chebyshev nodes (a discrete cosine transform). This gives all n coefficients in O(n \log n) operations — the same complexity as the FFT for Fourier coefficients.

  • Clenshaw’s algorithm evaluates a Chebyshev expansion \sum c_k T_k(x) in O(n) operations using the three-term recurrence T_{n+1}(x) = 2x\, T_n(x) - T_{n-1}(x), analogous to Horner’s method for monomial-form polynomials.

  • Numerical verification: Compare Bernstein approximation errors for f(x) = |2x - 1| (convergence rate O(1/\sqrt{n})) and f(x) = x(1-x) (convergence rate O(1/n)), and compare Chebyshev vs. equispaced interpolation for Runge’s function f(x) = 1/(1 + 25x^2).
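
The two evaluation algorithms above are short enough to state in full; a standard-library sketch, using cos(k·arccos x) as the reference for the Chebyshev sum:

```python
from math import acos, comb, cos

def de_casteljau(coeffs, x):
    """Evaluate sum_k coeffs[k] * b_{k,n}(x) (n = len(coeffs) - 1) by
    repeated linear interpolation -- numerically stable on [0, 1]."""
    beta = list(coeffs)
    for j in range(1, len(beta)):
        for k in range(len(beta) - j):
            beta[k] = beta[k] * (1 - x) + beta[k + 1] * x
    return beta[0]

def clenshaw(c, x):
    """Evaluate the Chebyshev expansion sum_k c[k] * T_k(x) via the
    backward three-term recurrence (Clenshaw's algorithm)."""
    b1 = b2 = 0.0
    for ck in reversed(c[1:]):
        b1, b2 = ck + 2.0 * x * b1 - b2, b1
    return c[0] + x * b1 - b2
```

Both agree with direct evaluation: de_casteljau against the explicit Bernstein sum, clenshaw against \sum_k c_k \cos(k \arccos x).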

Computational verification — Bernstein vs. Chebyshev approximation error comparison and Chebyshev coefficient decay

10. Connections & Further Reading

Track 5 completion

This topic completes the Sequences, Series & Approximation track. The four topics form a natural progression:

  1. Series Convergence & Tests — When do infinite sums converge? Absolute vs. conditional convergence, comparison and ratio tests.
  2. Power Series & Taylor Series — Local polynomial approximation from derivatives, radius of convergence, Taylor remainder.
  3. Fourier Series & Orthogonal Expansions — Global trigonometric approximation from integrals, orthogonality, coefficient decay, Gibbs phenomenon.
  4. Approximation Theory — The theoretical limits of polynomial approximation: existence (Weierstrass), construction (Bernstein), optimality (Chebyshev), rates (Jackson-Bernstein).

The thread connecting them all is the question: how well can we approximate functions, and what determines the answer?

Comparing three approximation methods — f(x) with its Taylor, Fourier, and Bernstein approximants; right panel: max |error| vs. degree (log scale)

Prerequisites used in this topic

  • Fourier Series & Orthogonal Expansions — Best L^2 approximation in the trigonometric basis, the coefficient decay \leftrightarrow approximation rate connection, the Gibbs phenomenon as the Fourier analog of the Runge phenomenon.
  • Uniform Convergence — Uniform convergence of Bernstein polynomials, the distinction between pointwise and uniform convergence, interchange theorems used in Stone-Weierstrass.
  • Power Series & Taylor Series — Local vs. global approximation contrast, Taylor remainder vs. Jackson’s theorem, the approximation triptych.
  • The Riemann Integral & FTC — L^2 norm computation, modulus of continuity, the Bernstein expectation interpretation as integration against the binomial distribution.

Forward references (planned topics)

  • Metric Spaces & Topology — density of polynomials in C[a,b] under the sup-norm is a statement about metric-space topology; the contraction mapping theorem sheds light on Bernstein operator iteration.
  • Normed & Banach Spaces — C[a,b] and L^2[a,b] as Banach spaces; best approximation in normed spaces; the equioscillation theorem as a characterization in the dual space.
  • Inner Product & Hilbert Spaces — Best L^2 approximation as orthogonal projection; Legendre polynomials as an orthogonal basis; RKHS approximation theory.
  • Calculus of Variations — Best approximation as optimization over function spaces. The direct method provides the general existence principle.
  • PAC Learning → formalML — The universal approximation theorem and Barron’s theorem on neural network approximation rates.
  • Gradient Descent → formalML — Polynomial approximation of activation functions for homomorphic encryption.
  • Spectral Theorem → formalML — Stone-Weierstrass gives the continuous functional calculus for self-adjoint operators.
  • Riemannian Geometry → formalML — RKHS approximation theory for kernel regression and Gaussian processes.

References

  1. book Rudin (1976). Principles of Mathematical Analysis. Chapter 7 — the Stone-Weierstrass theorem and uniform approximation.
  2. book Trefethen (2019). Approximation Theory and Approximation Practice. The definitive modern treatment of Chebyshev approximation, interpolation, and practical polynomial approximation — the computational perspective.
  3. book DeVore & Lorentz (1993). Constructive Approximation. Chapters 1–4 — Bernstein polynomials, Jackson theorems, moduli of smoothness, inverse theorems.
  4. book Abbott (2015). Understanding Analysis. Chapter 6 — the Weierstrass approximation theorem via Bernstein polynomials, the cleanest undergraduate proof.
  5. paper Cybenko (1989). “Approximation by Superpositions of a Sigmoidal Function.” The original universal approximation theorem for neural networks — the neural-network Weierstrass theorem.
  6. paper Hornik, Stinchcombe & White (1990). “Universal Approximation of an Unknown Mapping and Its Derivatives Using Multilayer Feedforward Networks.” Extension of Cybenko's result to networks with arbitrary activation functions and derivative approximation.