Single-Variable Calculus · foundational · 50 min read

The Riemann Integral & FTC

Area as a limit of sums — the Riemann integral defined via partitions and Darboux sums, integrability of continuous functions, and the Fundamental Theorem connecting differentiation and integration

Abstract. The Riemann integral formalizes the intuitive notion of 'area under a curve' as a limit of finite sums. Given a function f on [a,b] and a partition P = {x₀, x₁, ..., xₙ} of [a,b], a Riemann sum Σ f(tᵢ) Δxᵢ approximates the integral by summing the areas of narrow rectangles. The integral ∫ₐᵇ f(x) dx is the common value of the upper and lower Darboux sums — sup Σ Mᵢ Δxᵢ and inf Σ mᵢ Δxᵢ — when they agree in the limit of partition refinement. Continuous functions on [a,b] are always Riemann integrable: uniform continuity (guaranteed by compactness of [a,b]) ensures that the upper and lower sums can be made arbitrarily close by choosing a fine enough partition. The Fundamental Theorem of Calculus forges the link between differentiation and integration. Part 1 says that if f is continuous, the area function F(x) = ∫ₐˣ f(t) dt is differentiable with F'(x) = f(x) — integration followed by differentiation recovers the original function. Part 2 says that if F is any antiderivative of f, then ∫ₐᵇ f(x) dx = F(b) - F(a) — the integral is computed by evaluating the antiderivative at the endpoints. The proof of Part 2 uses the Mean Value Theorem to relate the telescoping sum F(b) - F(a) = Σ [F(xᵢ) - F(xᵢ₋₁)] to the Riemann sum Σ f(cᵢ) Δxᵢ. Substitution (u-substitution) is the chain rule in reverse: ∫ f(g(x))g'(x) dx = ∫ f(u) du. Integration by parts is the product rule in reverse: ∫ u dv = uv - ∫ v du. In machine learning, the integral is ubiquitous: expected values E[X] = ∫ x f(x) dx, loss functions as integrals over distributions, marginalizing over latent variables, and probability computations all require integration. Numerical quadrature — trapezoidal rule, Simpson's rule, Gaussian quadrature — provides the computational tools when antiderivatives don't exist in closed form.

Where this leads → formalML

  • formalML The Riemann integral defines expected values for continuous densities: E[X] = ∫ x f(x) dx. But Riemann integration fails for pathological functions like the indicator of the rationals. The Lebesgue integral, built on measure theory, resolves these limitations and is the foundation of rigorous probability theory.
  • formalML Differential entropy h(X) = -∫ f(x) log f(x) dx, KL divergence D_KL(p‖q) = ∫ p(x) log(p(x)/q(x)) dx, and cross-entropy are all defined as integrals. The properties of the integral — linearity, monotonicity, additivity — directly yield information-theoretic inequalities.
  • formalML Posterior computation requires integrating likelihood × prior over the parameter space. When conjugacy is unavailable, numerical quadrature (trapezoidal, Simpson's, Gaussian quadrature) provides the computational backbone. Understanding integration error bounds from this topic informs the accuracy of these approximations.

Overview & Motivation

You’re computing the expected loss of a model over a continuous data distribution: $\mathbb{E}[\ell(f_\theta(X), Y)] = \int \ell(f_\theta(x), y) \, p(x,y) \, dx \, dy$. What does this integral mean, precisely? It’s the limit of a sum: partition the input space into small regions, compute loss $\times$ density $\times$ volume in each region, and sum up. As the regions shrink, the sum converges to the integral. This is not a metaphor — it’s the definition.

The Riemann integral makes “area under a curve” (and its higher-dimensional generalizations) rigorous. We’ll define it via partitions and Darboux sums (§2–3), prove that continuous functions are integrable (§4), establish the Fundamental Theorem of Calculus that ties integration to differentiation (§6–7), develop the key integration techniques (§8), and connect everything to expected values and numerical computation in ML (§11).

Prerequisites: We use the sequence convergence framework from Sequences, Limits & Convergence, the epsilon-delta characterization of uniform continuity from Epsilon-Delta & Continuity, the Heine-Cantor theorem from Completeness & Compactness, and the Mean Value Theorem from Mean Value Theorem & Taylor Expansion.

Partitions and Riemann Sums

The geometric picture

Start with a curve $y = f(x)$ above the $x$-axis on $[a, b]$. Slice the interval into $n$ equal subintervals. In each subinterval, build a rectangle whose height is the function value at some sample point. The total area of these rectangles approximates the area under the curve. The approximation improves as the slices get thinner — and the Riemann integral is the limit of this process.

📐 Definition 1 (Partition)

A partition $P$ of $[a, b]$ is a finite ordered set $P = \{x_0, x_1, \ldots, x_n\}$ with $a = x_0 < x_1 < \cdots < x_n = b$. The mesh (or norm) of $P$ is $\|P\| = \max_{1 \le i \le n} \Delta x_i$, where $\Delta x_i = x_i - x_{i-1}$. A partition $Q$ is a refinement of $P$ if $P \subseteq Q$ (every point of $P$ is also a point of $Q$).

📐 Definition 2 (Riemann Sum)

Given a bounded function $f: [a, b] \to \mathbb{R}$, a partition $P = \{x_0, \ldots, x_n\}$, and sample points $t_i \in [x_{i-1}, x_i]$ for each $i$, the Riemann sum is

$$S(f, P, \{t_i\}) = \sum_{i=1}^{n} f(t_i) \, \Delta x_i.$$

Specific choices of sample points yield named rules:

  • Left rule: $t_i = x_{i-1}$ (left endpoint of each subinterval)
  • Right rule: $t_i = x_i$ (right endpoint)
  • Midpoint rule: $t_i = \frac{x_{i-1} + x_i}{2}$ (midpoint)

📝 Example 1 (Riemann sums for f(x) = x² on [0, 1])

Take the uniform partition with $n = 4$ subintervals, so $\Delta x = 1/4$ and the partition points are $\{0, 1/4, 1/2, 3/4, 1\}$.

Left sum: $S_L = \frac{1}{4}\bigl(f(0) + f(1/4) + f(1/2) + f(3/4)\bigr) = \frac{1}{4}\bigl(0 + \tfrac{1}{16} + \tfrac{4}{16} + \tfrac{9}{16}\bigr) = \frac{14}{64} \approx 0.219.$

Right sum: $S_R = \frac{1}{4}\bigl(f(1/4) + f(1/2) + f(3/4) + f(1)\bigr) = \frac{1}{4}\bigl(\tfrac{1}{16} + \tfrac{4}{16} + \tfrac{9}{16} + 1\bigr) = \frac{30}{64} \approx 0.469.$

The exact integral is $\int_0^1 x^2 \, dx = 1/3 \approx 0.333$, which is trapped between the left and right sums (as expected for this increasing function). As $n$ grows, both sums converge to $1/3$.

💡 Remark 1 (The trapezoidal rule)

Average the left and right Riemann sums to get the trapezoidal rule:

$$T(f, P) = \frac{1}{2}\bigl[S_L(f, P) + S_R(f, P)\bigr] = \sum_{i=1}^n \frac{f(x_{i-1}) + f(x_i)}{2} \, \Delta x_i.$$

Geometrically, each rectangle is replaced by a trapezoid whose top edge connects $(x_{i-1}, f(x_{i-1}))$ to $(x_i, f(x_i))$. This typically gives a better approximation than either endpoint rule alone, because the trapezoid’s area is the average of the over-estimate and the under-estimate.
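The numbers in Example 1 and Remark 1 are easy to check in code. A minimal Python sketch (the helper name `riemann_sums` is ours, not from any library):

```python
# Left, right, and trapezoidal sums for f(x) = x^2 on [0, 1],
# reproducing Example 1 and Remark 1 on the uniform n = 4 partition.

def riemann_sums(f, a, b, n):
    """Return (left, right, trapezoid) sums on a uniform partition."""
    dx = (b - a) / n
    xs = [a + i * dx for i in range(n + 1)]
    left = sum(f(x) for x in xs[:-1]) * dx
    right = sum(f(x) for x in xs[1:]) * dx
    trap = 0.5 * (left + right)  # average of the two endpoint rules
    return left, right, trap

left, right, trap = riemann_sums(lambda x: x * x, 0.0, 1.0, 4)
print(left, right, trap)  # 0.21875 0.46875 0.34375
```

The trapezoidal value $22/64 = 0.34375$ already sits much closer to the exact $1/3$ than either endpoint sum.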


Three-panel illustration: left, right, and midpoint Riemann sums for x² on [0,1] with n = 8

Upper and Lower Sums — The Darboux Approach

Why Darboux?

Riemann sums depend on the choice of sample points $t_i$. To define the integral without this ambiguity, we bound the sums from above and below using suprema and infima of $f$ on each subinterval. This is the Darboux approach, and it gives a clean criterion for integrability: the integral exists if and only if the upper and lower bounds can be squeezed arbitrarily close.

📐 Definition 3 (Upper and Lower Darboux Sums)

For a bounded function $f$ on $[a, b]$ and a partition $P = \{x_0, \ldots, x_n\}$, define

$$M_i = \sup_{x \in [x_{i-1}, x_i]} f(x) \quad \text{and} \quad m_i = \inf_{x \in [x_{i-1}, x_i]} f(x).$$

The upper Darboux sum is $U(f, P) = \sum_{i=1}^n M_i \, \Delta x_i$ and the lower Darboux sum is $L(f, P) = \sum_{i=1}^n m_i \, \Delta x_i$.

For any choice of sample points, $L(f, P) \le S(f, P, \{t_i\}) \le U(f, P)$.

🔷 Proposition 1 (Refinement Monotonicity)

If $Q$ is a refinement of $P$, then

$$L(f, P) \le L(f, Q) \le U(f, Q) \le U(f, P).$$

Refining a partition increases lower sums and decreases upper sums — the bounds get tighter.

Proof.

It suffices to show the result when $Q$ is obtained by adding a single point $x^* \in (x_{k-1}, x_k)$ to $P$, since the general case follows by induction on the number of added points.

On the subinterval $[x_{k-1}, x_k]$, the upper sum contributes $M_k \, \Delta x_k$ where $M_k = \sup_{[x_{k-1}, x_k]} f$. Splitting at $x^*$ gives two subintervals $[x_{k-1}, x^*]$ and $[x^*, x_k]$ with suprema

$$M_k' = \sup_{[x_{k-1}, x^*]} f \le M_k \quad \text{and} \quad M_k'' = \sup_{[x^*, x_k]} f \le M_k.$$

So the new contribution is

$$M_k'(x^* - x_{k-1}) + M_k''(x_k - x^*) \le M_k(x^* - x_{k-1}) + M_k(x_k - x^*) = M_k \, \Delta x_k.$$

All other subintervals are unchanged, so $U(f, Q) \le U(f, P)$. The argument for $L(f, P) \le L(f, Q)$ is analogous (infima on smaller subintervals are $\ge$ the infimum on the larger subinterval).

🔷 Proposition 2 (Upper-Lower Inequality)

For any partitions $P$ and $Q$ (not necessarily related by refinement), $L(f, P) \le U(f, Q)$.

Proof.

The common refinement $P \cup Q$ refines both $P$ and $Q$. By Proposition 1:

$$L(f, P) \le L(f, P \cup Q) \le U(f, P \cup Q) \le U(f, Q).$$

The middle inequality $L(f, P \cup Q) \le U(f, P \cup Q)$ holds because $m_i \le M_i$ on every subinterval.


Upper and lower Darboux sums squeezing toward the integral, with gap shading

The Riemann Integral

📐 Definition 4 (The Riemann Integral (Darboux Definition))

A bounded function $f: [a, b] \to \mathbb{R}$ is Riemann integrable if the upper and lower integrals agree:

$$\underline{\int_a^b} f = \sup_P L(f, P) = \inf_P U(f, P) = \overline{\int_a^b} f.$$

The common value is written $\int_a^b f(x) \, dx$ and called the Riemann integral of $f$ on $[a, b]$.

💡 Remark 2 (Equivalent ε-characterization of integrability)

$f$ is Riemann integrable if and only if for every $\varepsilon > 0$ there exists a partition $P$ such that $U(f, P) - L(f, P) < \varepsilon$. This is the criterion we’ll use in all integrability proofs: show that the gap between upper and lower sums can be made as small as we like.

🔷 Theorem 1 (Continuous Functions Are Riemann Integrable)

If $f: [a, b] \to \mathbb{R}$ is continuous, then $f$ is Riemann integrable.

Proof.

Since $[a, b]$ is compact (Heine–Borel, Completeness & Compactness), $f$ is uniformly continuous on $[a, b]$ (Heine–Cantor theorem, from the same topic).

Given $\varepsilon > 0$, choose $\delta > 0$ such that

$$|x - y| < \delta \implies |f(x) - f(y)| < \frac{\varepsilon}{b - a}.$$

Take any partition $P$ with mesh $\|P\| < \delta$. On each subinterval $[x_{i-1}, x_i]$, the length is $\Delta x_i < \delta$, so for any $x, y \in [x_{i-1}, x_i]$ we have $|f(x) - f(y)| < \varepsilon/(b - a)$. Since $f$ attains its supremum and infimum on each closed subinterval, this gives:

$$M_i - m_i = \sup_{[x_{i-1}, x_i]} f - \inf_{[x_{i-1}, x_i]} f < \frac{\varepsilon}{b - a}.$$

Therefore:

$$U(f, P) - L(f, P) = \sum_{i=1}^n (M_i - m_i) \, \Delta x_i < \frac{\varepsilon}{b - a} \sum_{i=1}^n \Delta x_i = \frac{\varepsilon}{b - a} \cdot (b - a) = \varepsilon.$$

Every step is explicit. The chain compactness $\to$ uniform continuity $\to$ integrability is fully traced.

🔷 Theorem 2 (Monotone Functions Are Riemann Integrable)

If $f: [a, b] \to \mathbb{R}$ is monotone (either increasing or decreasing throughout), then $f$ is Riemann integrable.

Proof.

Assume $f$ is increasing (the decreasing case is analogous). On any subinterval $[x_{i-1}, x_i]$, the supremum is $M_i = f(x_i)$ and the infimum is $m_i = f(x_{i-1})$ — since $f$ is increasing.

With the uniform partition $\Delta x_i = (b - a)/n$:

$$U(f, P) - L(f, P) = \sum_{i=1}^n \bigl(f(x_i) - f(x_{i-1})\bigr) \cdot \frac{b - a}{n} = \frac{b - a}{n} \sum_{i=1}^n \bigl[f(x_i) - f(x_{i-1})\bigr].$$

The sum telescopes: $\sum_{i=1}^n [f(x_i) - f(x_{i-1})] = f(b) - f(a)$.

So $U - L = \frac{(b - a)(f(b) - f(a))}{n}$. Given $\varepsilon > 0$, choose $n > \frac{(b - a)(f(b) - f(a))}{\varepsilon}$.
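Because the Darboux sums of a monotone function are exactly its endpoint sums, the gap in the proof can be computed exactly rather than estimated. A small sketch (the helper name `darboux_gap` is ours) confirming the $1/n$ decay for $f(x) = x^2$ on $[0, 1]$:

```python
# For an increasing f, M_i = f(x_i) and m_i = f(x_{i-1}), so
# U(f,P) - L(f,P) on the uniform partition is exactly (b-a)(f(b)-f(a))/n.

def darboux_gap(f, a, b, n):
    """U(f,P) - L(f,P) for an increasing f on a uniform n-part partition."""
    dx = (b - a) / n
    upper = sum(f(a + i * dx) for i in range(1, n + 1)) * dx  # right endpoints
    lower = sum(f(a + i * dx) for i in range(0, n)) * dx      # left endpoints
    return upper - lower

f = lambda x: x * x
for n in (4, 8, 16):
    gap = darboux_gap(f, 0.0, 1.0, n)
    print(n, gap)  # gap = 1/n: 0.25, 0.125, 0.0625
```

Doubling $n$ halves the gap, exactly as $U - L = (b - a)(f(b) - f(a))/n$ predicts.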

📝 Example 2 (The Dirichlet function is not Riemann integrable)

Define $\mathbf{1}_{\mathbb{Q}}(x) = 1$ if $x \in \mathbb{Q}$ and $\mathbf{1}_{\mathbb{Q}}(x) = 0$ if $x \notin \mathbb{Q}$. On any subinterval $[x_{i-1}, x_i]$:

  • $M_i = 1$ because the rationals are dense — every subinterval contains rational numbers.
  • $m_i = 0$ because the irrationals are dense — every subinterval contains irrational numbers.

So $U(f, P) = \sum M_i \, \Delta x_i = b - a$ and $L(f, P) = \sum m_i \, \Delta x_i = 0$ for every partition $P$. The upper and lower integrals disagree: $\overline{\int} = b - a \neq 0 = \underline{\int}$, so $f$ is not Riemann integrable.

This example matters for ML: the Dirichlet function is a legitimate measurable function, and the Lebesgue integral handles it ($\int \mathbf{1}_{\mathbb{Q}} = 0$). The failure of Riemann integration for such functions is precisely why measure-theoretic probability (→ formalML: Measure-Theoretic Probability) requires the more powerful Lebesgue integral.

💡 Remark 3 (Bounded functions with finitely many discontinuities)

If $f$ is bounded on $[a, b]$ and has only finitely many points of discontinuity, then $f$ is still Riemann integrable. The idea: cover the discontinuities with subintervals of arbitrarily small total length (contributing at most $\varepsilon$ to $U - L$), and handle the continuous portions with Theorem 1. Discontinuities on a set of “measure zero” don’t spoil integrability — a fact that foreshadows the Lebesgue criterion.

Integrability proof illustration: uniform continuity giving M_i - m_i < ε/(b-a), gap shrinking with n

Properties of the Integral

🔷 Theorem 3 (Properties of the Riemann Integral)

If $f$ and $g$ are Riemann integrable on $[a, b]$ and $c \in \mathbb{R}$:

(a) Linearity:

$$\int_a^b [f(x) + g(x)] \, dx = \int_a^b f(x) \, dx + \int_a^b g(x) \, dx \quad \text{and} \quad \int_a^b c \, f(x) \, dx = c \int_a^b f(x) \, dx.$$

(b) Monotonicity: If $f(x) \le g(x)$ for all $x \in [a, b]$, then $\int_a^b f \le \int_a^b g$.

(c) Additivity over intervals: For any $c \in (a, b)$:

$$\int_a^b f = \int_a^c f + \int_c^b f.$$

(d) Triangle inequality:

$$\left|\int_a^b f(x) \, dx\right| \le \int_a^b |f(x)| \, dx.$$

Proof.

(a) Linearity (additivity). For any partition $P$, the infimum of a sum satisfies $\inf(f + g) \ge \inf f + \inf g$ on each subinterval. Therefore $L(f + g, P) \ge L(f, P) + L(g, P)$. Similarly, $U(f + g, P) \le U(f, P) + U(g, P)$. Taking the supremum of lower sums and the infimum of upper sums gives

$$\int_a^b f + \int_a^b g \le \underline{\int_a^b} (f + g) \le \overline{\int_a^b} (f + g) \le \int_a^b f + \int_a^b g.$$

So all four quantities are equal. The scalar rule $\int cf = c \int f$ follows by similar reasoning (with separate cases for $c \ge 0$ and $c < 0$).

(b) Monotonicity. If $f(x) \le g(x)$ on $[a, b]$, then $L(f, P) \le L(g, P)$ for every partition $P$ (the infimum of $f$ on each subinterval is $\le$ the infimum of $g$). Taking $\sup_P$ gives $\int f \le \int g$.

(c) Additivity. Use a common refinement that includes the point $c$. On such a refinement, the Riemann sums over $[a, b]$ split exactly into sums over $[a, c]$ and $[c, b]$.

(d) Triangle inequality. Since $-|f(x)| \le f(x) \le |f(x)|$ for all $x$, apply monotonicity: $-\int |f| \le \int f \le \int |f|$, which gives $|\int f| \le \int |f|$.

📝 Example 3 (Computing ∫₀¹ x² dx from the definition)

We compute the integral directly as a limit of Riemann sums — the “hard” way — to see the definition in action.

Using the right-endpoint sum with the uniform partition ($\Delta x = 1/n$, sample points $t_i = i/n$):

$$S_n = \sum_{i=1}^n \left(\frac{i}{n}\right)^2 \cdot \frac{1}{n} = \frac{1}{n^3} \sum_{i=1}^n i^2 = \frac{1}{n^3} \cdot \frac{n(n+1)(2n+1)}{6} = \frac{(n+1)(2n+1)}{6n^2}.$$

As $n \to \infty$:

$$\lim_{n \to \infty} \frac{(n+1)(2n+1)}{6n^2} = \lim_{n \to \infty} \frac{2n^2 + 3n + 1}{6n^2} = \frac{2}{6} = \frac{1}{3}.$$

So $\int_0^1 x^2 \, dx = 1/3$. This is the hard way — the FTC (coming in §7) will make it a one-line computation.

📝 Example 4 (Computing ∫₀¹ eˣ dx as a limit of sums)

Right-endpoint uniform partition (a geometric series):

$$S_n = \frac{1}{n} \sum_{i=1}^n e^{i/n} = \frac{1}{n} \cdot e^{1/n} \cdot \frac{e^{n \cdot (1/n)} - 1}{e^{1/n} - 1} = \frac{e^{1/n}(e - 1)}{n(e^{1/n} - 1)}.$$

As $n \to \infty$, using $e^{1/n} - 1 \approx 1/n$: $S_n \to e - 1$. This matches $[e^x]_0^1 = e - 1$.

Integral properties: linearity (sum of areas), monotonicity (f ≤ g), and additivity (splitting at c)

The Fundamental Theorem of Calculus, Part 1

The geometric picture

Consider the “area function” $F(x) = \int_a^x f(t) \, dt$ — the accumulated area under $f$ from $a$ to $x$. As $x$ advances by a tiny amount $h$, the area increases by approximately $f(x) \cdot h$ — a thin rectangle of height $f(x)$ and width $h$. So

$$F'(x) \approx \frac{F(x+h) - F(x)}{h} \approx f(x).$$

The FTC Part 1 says this approximation is exact in the limit: $F'(x) = f(x)$. Differentiation undoes integration.

📐 Definition 5 (Antiderivative)

A function $F: [a, b] \to \mathbb{R}$ is an antiderivative of $f$ on $(a, b)$ if $F'(x) = f(x)$ for all $x \in (a, b)$.

If $F$ is an antiderivative of $f$, then so is $F + C$ for any constant $C$ — and these are the only antiderivatives. This follows from Corollary 1 of the MVT (Mean Value Theorem & Taylor Expansion): if $(F - G)' = 0$ on $(a, b)$, then $F - G$ is constant.

🔷 Theorem 4 (Fundamental Theorem of Calculus, Part 1)

If $f$ is continuous on $[a, b]$, then the function $F: [a, b] \to \mathbb{R}$ defined by

$$F(x) = \int_a^x f(t) \, dt$$

is continuous on $[a, b]$, differentiable on $(a, b)$, and $F'(x) = f(x)$ for all $x \in (a, b)$.

Proof.

Fix $x \in (a, b)$. For $h \neq 0$ with $x + h \in [a, b]$, additivity of the integral gives:

$$\frac{F(x+h) - F(x)}{h} = \frac{1}{h} \int_x^{x+h} f(t) \, dt.$$

Since $f$ is continuous at $x$: given $\varepsilon > 0$, choose $\delta > 0$ such that $|t - x| < \delta \implies |f(t) - f(x)| < \varepsilon$. For $|h| < \delta$, every $t$ between $x$ and $x + h$ satisfies $|t - x| \le |h| < \delta$, so $|f(t) - f(x)| < \varepsilon$. Therefore:

$$\left|\frac{F(x+h) - F(x)}{h} - f(x)\right| = \left|\frac{1}{h} \int_x^{x+h} [f(t) - f(x)] \, dt\right| \le \frac{1}{|h|} \left|\int_x^{x+h} |f(t) - f(x)| \, dt\right| < \frac{1}{|h|} \cdot \varepsilon \cdot |h| = \varepsilon.$$

The integral bound uses $|f(t) - f(x)| < \varepsilon$ for all $t$ in the interval of length $|h|$, together with the triangle inequality for integrals (Theorem 3(d)); the outer absolute value handles the case $h < 0$, where the integration limits are reversed.

This shows $\lim_{h \to 0} \frac{F(x+h) - F(x)}{h} = f(x)$, i.e., $F'(x) = f(x)$.

📝 Example 5 (Differentiating an integral with no closed-form antiderivative)

$$\frac{d}{dx} \int_0^x \sin(t^2) \, dt = \sin(x^2).$$

No antiderivative of $\sin(t^2)$ exists in elementary functions (the integral defines the Fresnel function $S(x)$). But FTC Part 1 gives us the derivative of the area function anyway. This illustrates the power of FTC1: we can differentiate integrals even when we cannot compute them in closed form.

📝 Example 6 (Chain rule version of FTC1)

$$\frac{d}{dx} \int_0^{x^2} e^{-t^2} \, dt = e^{-x^4} \cdot 2x.$$

Here the upper limit is $g(x) = x^2$ rather than $x$ itself. By the chain rule (The Derivative & Chain Rule, Theorem 6): if $G(u) = \int_0^u e^{-t^2} \, dt$, then $G'(u) = e^{-u^2}$ by FTC1, and $\frac{d}{dx} G(x^2) = G'(x^2) \cdot 2x = e^{-x^4} \cdot 2x$.
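Both sides of Example 6 can be compared numerically: approximate the inner integral by a trapezoidal sum and the outer derivative by a central difference. A sketch under those assumptions (the function name `integral` and the step sizes are ours):

```python
import math

# Numerical check of the chain-rule form of FTC1 at x = 1:
# d/dx of int_0^{x^2} e^{-t^2} dt should equal e^{-x^4} * 2x.

def integral(u, n=2000):
    """Composite trapezoidal approximation of int_0^u e^{-t^2} dt."""
    h = u / n
    s = 0.5 * (1.0 + math.exp(-u * u))  # half-weighted endpoints e^0, e^{-u^2}
    s += sum(math.exp(-(i * h) ** 2) for i in range(1, n))
    return s * h

x, h = 1.0, 1e-3
# Central difference of H(x) = integral(x^2):
numeric = (integral((x + h) ** 2) - integral((x - h) ** 2)) / (2 * h)
exact = math.exp(-x ** 4) * 2 * x
print(numeric, exact)  # both approximately 0.73576
```

The two values agree to several decimal places, limited only by the quadrature grid and the difference step.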


FTC Part 1: f(x) with shaded area to x, and F(x) below with tangent slope = f(x)

The Fundamental Theorem of Calculus, Part 2

FTC Part 1 tells us that every continuous function has an antiderivative (namely $F(x) = \int_a^x f$). FTC Part 2 tells us how to use an antiderivative to compute a definite integral: evaluate at the endpoints and subtract.

🔷 Theorem 5 (Fundamental Theorem of Calculus, Part 2)

If $f$ is continuous on $[a, b]$ and $F$ is any antiderivative of $f$ on $[a, b]$ (i.e., $F'(x) = f(x)$ for all $x \in (a, b)$), then

$$\int_a^b f(x) \, dx = F(b) - F(a).$$

Proof.

Let $P = \{x_0, x_1, \ldots, x_n\}$ be any partition of $[a, b]$. Write $F(b) - F(a)$ as a telescoping sum:

$$F(b) - F(a) = \sum_{i=1}^n [F(x_i) - F(x_{i-1})].$$

By the Mean Value Theorem (Mean Value Theorem & Taylor Expansion, Theorem 2), applied to $F$ on each subinterval $[x_{i-1}, x_i]$: there exists $c_i \in (x_{i-1}, x_i)$ such that

$$F(x_i) - F(x_{i-1}) = F'(c_i)(x_i - x_{i-1}) = f(c_i) \, \Delta x_i.$$

So the telescoping sum becomes a Riemann sum:

$$F(b) - F(a) = \sum_{i=1}^n f(c_i) \, \Delta x_i.$$

Since $f$ is continuous (hence Riemann integrable by Theorem 1), as $\|P\| \to 0$ every Riemann sum converges to $\int_a^b f$. But the left side $F(b) - F(a)$ is independent of the partition. Therefore:

$$F(b) - F(a) = \int_a^b f(x) \, dx.$$

📝 Example 7 (∫₀¹ x² dx via FTC2)

$$\int_0^1 x^2 \, dx = \left[\frac{x^3}{3}\right]_0^1 = \frac{1}{3} - 0 = \frac{1}{3}.$$

Compare with Example 3: the same answer, in one line instead of a page of algebra. The FTC transforms a limit evaluation into an antiderivative evaluation.

📝 Example 8 (∫₀π sin(x) dx)

$$\int_0^\pi \sin(x) \, dx = [-\cos(x)]_0^\pi = -\cos(\pi) + \cos(0) = 1 + 1 = 2.$$

The area under one arch of the sine curve is exactly 2.

💡 Remark 4 (The two parts of FTC are not converses)

Part 1 says: start with $f$ continuous, construct $F(x) = \int_a^x f$, and get $F' = f$. Part 2 says: start with any antiderivative $F$ of $f$, and get $\int_a^b f = F(b) - F(a)$. They are complementary, not equivalent. Part 1 is an existence result (continuous functions have antiderivatives); Part 2 is a computation result (integrals can be evaluated via antiderivatives).

FTC Part 2: telescoping sum F(b) - F(a) becoming a Riemann sum via MVT

Integration Techniques

🔷 Theorem 6 (Substitution Rule (Change of Variables))

If $g: [a, b] \to \mathbb{R}$ is continuously differentiable and $f$ is continuous on the range of $g$, then

$$\int_a^b f(g(x)) \, g'(x) \, dx = \int_{g(a)}^{g(b)} f(u) \, du.$$

Proof.

Let $F$ be an antiderivative of $f$ (one exists by FTC Part 1, since $f$ is continuous). Then by the chain rule (The Derivative & Chain Rule, Theorem 6):

$$(F \circ g)'(x) = F'(g(x)) \cdot g'(x) = f(g(x)) \cdot g'(x).$$

So $F \circ g$ is an antiderivative of $f(g(x)) \cdot g'(x)$. By FTC Part 2:

$$\int_a^b f(g(x)) \, g'(x) \, dx = F(g(b)) - F(g(a)) = \int_{g(a)}^{g(b)} f(u) \, du.$$

📝 Example 9 (∫₀¹ 2x e^(x²) dx)

Let $u = x^2$, so $du = 2x \, dx$. The bounds transform: $u(0) = 0$, $u(1) = 1$. Therefore:

$$\int_0^1 2x \, e^{x^2} \, dx = \int_0^1 e^u \, du = [e^u]_0^1 = e - 1.$$

🔷 Theorem 7 (Integration by Parts)

If $u$ and $v$ are continuously differentiable on $[a, b]$, then

$$\int_a^b u(x) \, v'(x) \, dx = [u(x) \, v(x)]_a^b - \int_a^b u'(x) \, v(x) \, dx.$$

Proof.

By the product rule (The Derivative & Chain Rule, Theorem 4): $(uv)' = u'v + uv'$. Integrate both sides from $a$ to $b$ using FTC Part 2:

$$u(b) \, v(b) - u(a) \, v(a) = \int_a^b u'(x) \, v(x) \, dx + \int_a^b u(x) \, v'(x) \, dx.$$

Rearrange to get the integration by parts formula.

📝 Example 10 (∫₀¹ x eˣ dx)

Let $u = x$ and $dv = e^x \, dx$, so $du = dx$ and $v = e^x$. Then:

$$\int_0^1 x \, e^x \, dx = [x \, e^x]_0^1 - \int_0^1 e^x \, dx = (1 \cdot e - 0) - [e^x]_0^1 = e - (e - 1) = 1.$$
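A quick numerical cross-check of Example 10, using the midpoint rule (which needs no antiderivative at all; the helper name `midpoint` is ours):

```python
import math

# Check int_0^1 x e^x dx = 1, the result of integration by parts,
# against a midpoint-rule approximation.

def midpoint(f, a, b, n=1000):
    """Midpoint-rule approximation of int_a^b f."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

approx = midpoint(lambda x: x * math.exp(x), 0.0, 1.0)
print(approx)  # approximately 1.0, matching [x e^x - e^x]_0^1 = 1
```

The agreement to many decimal places is the point: the by-parts formula and the limit-of-sums definition describe the same number.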

💡 Remark 5 (The chain rule and product rule in reverse)

Substitution is the chain rule $(f \circ g)' = (f' \circ g) \cdot g'$ read from right to left. Integration by parts is the product rule $(uv)' = u'v + uv'$ rearranged. Every differentiation rule has an integration counterpart — they are the same theorems, applied in opposite directions.


Substitution and integration by parts: x-domain vs u-domain areas, and rectangle decomposition

The Integral Form of the Taylor Remainder

This section resolves the forward reference from Mean Value Theorem & Taylor Expansion, Remark 4, which stated the integral remainder without proof.

🔷 Theorem 8 (Integral Remainder for Taylor's Theorem)

If $f$ is $(n+1)$-times continuously differentiable on an interval containing $a$ and $x$, then

$$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x - a)^k + R_n(x)$$

where the remainder is given by

$$R_n(x) = \frac{1}{n!} \int_a^x (x - t)^n \, f^{(n+1)}(t) \, dt.$$

Proof.

By induction on nn.

Base case ($n = 0$): We need $f(x) = f(a) + \int_a^x f'(t) \, dt$. This is exactly FTC Part 2 applied to $F = f$: $\int_a^x f'(t) \, dt = f(x) - f(a)$.

Inductive step: Assume the result holds for $n - 1$:

$$R_{n-1}(x) = \frac{1}{(n-1)!} \int_a^x (x - t)^{n-1} f^{(n)}(t) \, dt.$$

Apply integration by parts with

$$u = f^{(n)}(t), \quad dv = \frac{(x - t)^{n-1}}{(n-1)!} \, dt, \quad \text{so} \quad v = -\frac{(x - t)^n}{n!}.$$

Then:

$$R_{n-1}(x) = \left[-\frac{(x - t)^n}{n!} f^{(n)}(t)\right]_a^x + \frac{1}{n!} \int_a^x (x - t)^n f^{(n+1)}(t) \, dt.$$

Evaluating the boundary term: at $t = x$, $(x - t)^n = 0$; at $t = a$, the term is $-\left(-\frac{(x - a)^n}{n!} f^{(n)}(a)\right) = \frac{f^{(n)}(a)}{n!}(x - a)^n$. So:

$$R_{n-1}(x) = \frac{f^{(n)}(a)}{n!}(x - a)^n + R_n(x).$$

Absorbing $\frac{f^{(n)}(a)}{n!}(x - a)^n$ into the Taylor polynomial of degree $n$ gives the result.
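Theorem 8 can be sanity-checked numerically for $f = \exp$, $a = 0$, $x = 1$, $n = 3$: the remainder $e - \sum_{k \le 3} 1/k!$ should equal $\frac{1}{3!} \int_0^1 (1-t)^3 e^t \, dt$. A sketch using the midpoint rule for the integral (helper name ours):

```python
import math

# Integral form of the Taylor remainder for exp at a = 0, x = 1, n = 3:
# R_3 = e - (1 + 1 + 1/2 + 1/6)  vs  (1/3!) * int_0^1 (1-t)^3 e^t dt.

def midpoint(f, a, b, n=2000):
    """Midpoint-rule approximation of int_a^b f."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

taylor3 = sum(1.0 / math.factorial(k) for k in range(4))
direct = math.e - taylor3
via_integral = midpoint(lambda t: (1 - t) ** 3 * math.exp(t), 0.0, 1.0) / math.factorial(3)
print(direct, via_integral)  # both approximately 0.051615
```

The two remainders agree to quadrature accuracy, as the theorem guarantees.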

💡 Remark 6 (Lagrange remainder from the integral remainder)

The Lagrange form $R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!}(x - a)^{n+1}$ (from Mean Value Theorem & Taylor Expansion, Theorem 4) follows from the integral form by the Mean Value Theorem for Integrals. The integral form is stronger: it gives the exact remainder as an integral, not just a bound involving an unknown intermediate point $c$.

Taylor remainder: integral form vs Lagrange bound for e^x

Numerical Integration

When antiderivatives don’t exist in closed form — which is most of the time in practice — we compute integrals numerically. The Riemann sum framework we’ve developed is not just a theoretical tool; it’s the blueprint for computational quadrature.

🔷 Proposition 3 (Error Bounds for Quadrature Rules)

For $f$ with sufficient smoothness on $[a, b]$, with $n$ subintervals of width $h = (b - a)/n$:

(a) Left/right endpoint rule: $|E_n| \le \frac{(b - a)^2}{2n} \max_{[a,b]} |f'|$ — error $O(h)$.

(b) Trapezoidal rule: $|E_n| \le \frac{(b - a)^3}{12n^2} \max_{[a,b]} |f''|$ — error $O(h^2)$.

(c) Simpson’s rule: $|E_n| \le \frac{(b - a)^5}{180n^4} \max_{[a,b]} |f^{(4)}|$ — error $O(h^4)$.

📝 Example 11 (Error comparison across quadrature rules)

Approximate $\int_0^1 e^x \, dx = e - 1 \approx 1.71828$ with $n = 10$ subintervals:

| Rule | Approximation | $\lvert E_{10} \rvert$ | Bound |
|------|---------------|------------------------|-------|
| Left | $\approx 1.6338$ | $\approx 0.0845$ | $\le 0.1359$ |
| Trapezoidal | $\approx 1.7197$ | $\approx 1.43 \times 10^{-3}$ | $\le 2.27 \times 10^{-3}$ |
| Simpson’s | $\approx 1.718283$ | $\approx 9.5 \times 10^{-7}$ | $\le 1.5 \times 10^{-6}$ |

Simpson’s rule achieves 6 digits of accuracy with just 10 subintervals. The convergence rates — $O(1/n)$, $O(1/n^2)$, $O(1/n^4)$ — explain why higher-order methods dominate in practice.
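The comparison is easy to reproduce. A sketch implementing the three rules (helper names ours; `simpson` assumes an even number of subintervals):

```python
import math

# Left, trapezoidal, and composite Simpson approximations of
# int_0^1 e^x dx = e - 1, with n = 10 subintervals.

def left(f, a, b, n):
    h = (b - a) / n
    return sum(f(a + i * h) for i in range(n)) * h

def trapezoid(f, a, b, n):
    h = (b - a) / n
    return left(f, a, b, n) + h * (f(b) - f(a)) / 2  # shift left sum to trapezoid

def simpson(f, a, b, n):  # n must be even
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))  # odd nodes
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))  # even interior nodes
    return s * h / 3

exact = math.e - 1
for rule in (left, trapezoid, simpson):
    approx = rule(math.exp, 0.0, 1.0, 10)
    print(rule.__name__, approx, abs(approx - exact))
# left ~ 1.6338, trapezoid ~ 1.7197, simpson ~ 1.718283
```

The printed errors land inside the Proposition 3 bounds, with Simpson's far ahead at the same cost.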

💡 Remark 7 (Simpson's rule and polynomial interpolation)

Simpson’s rule fits a parabola through every three consecutive points $(x_{i-1}, f(x_{i-1}))$, $(x_i, f(x_i))$, $(x_{i+1}, f(x_{i+1}))$, then integrates the parabola exactly. It is exact for polynomials up to degree 3 — not just degree 2 — because of a symmetry cancellation. This is why the error involves $f^{(4)}$, not $f^{(3)}$.


Log-log convergence plot: error vs n for left, trapezoidal, and Simpson rules

Connections to ML

The integral is one of the most ubiquitous operations in machine learning — arguably more so than the derivative, since every expected value, every probability computation, and every marginal distribution involves integration.

Expected values as integrals

If $X$ is a continuous random variable with density $f(x)$, then the expected value of any function $g(X)$ is

$$\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) \, f(x) \, dx.$$

The risk (expected loss) of a hypothesis $h$ is $\text{Risk}(h) = \mathbb{E}_{(X,Y)}[\ell(h(X), Y)] = \int \ell(h(x), y) \, p(x, y) \, dx \, dy$. We can’t compute this integral exactly (we don’t know $p$), so we approximate it with the empirical risk: $\hat{R}(h) = \frac{1}{n} \sum_{i=1}^n \ell(h(x_i), y_i)$. This is the integral-as-finite-sum idea in action — the Monte Carlo approximation replaces the integral with a finite average over sampled data points.
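The empirical-average idea can be sketched in a few lines. Here we estimate $\mathbb{E}[g(X)]$ for $X \sim \mathcal{N}(0, 1)$ and $g(x) = x^2$, where the exact value is $\mathbb{E}[X^2] = 1$ (the sample size and seed are arbitrary choices of ours):

```python
import random

# Monte Carlo estimate of E[g(X)] = int g(x) f(x) dx:
# replace the integral by an average over samples from the density.

random.seed(0)
n = 200_000
samples = (random.gauss(0.0, 1.0) for _ in range(n))
mc_estimate = sum(x * x for x in samples) / n
print(mc_estimate)  # close to the exact value 1, with O(1/sqrt(n)) error
```

Unlike grid-based quadrature, the $O(1/\sqrt{n})$ error rate here does not depend on dimension, which is why Monte Carlo takes over for high-dimensional expectations.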

Measure-Theoretic Probability on formalML

Marginalizing over latent variables

Mixture models define $p(x) = \int p(x \mid z) \, p(z) \, dz$ — the observed density is obtained by integrating over the latent variable $z$. Variational autoencoders work with $\log p(x) = \log \int p(x \mid z) \, p(z) \, dz$. When this integral is intractable (it almost always is in high dimensions), we resort to variational lower bounds (the ELBO) or numerical quadrature.

Bayesian Nonparametrics on formalML

Information-theoretic quantities

Entropy, KL divergence, and cross-entropy are all integrals over densities:

$$H(X) = -\int f(x) \log f(x) \, dx, \qquad D_{\text{KL}}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx.$$

The properties of the integral — linearity, monotonicity — directly yield Gibbs’ inequality ($D_{\text{KL}} \ge 0$) and other fundamental information-theoretic bounds.
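For one-dimensional densities these integrals are directly computable by the quadrature rules of §10. A sketch comparing a midpoint-rule KL divergence between two Gaussians against the known closed form $\log(\sigma_2/\sigma_1) + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$ (the truncation to $[-12, 12]$ and the grid size are our arbitrary choices):

```python
import math

# D_KL(p || q) = int p(x) log(p(x)/q(x)) dx for p = N(0,1), q = N(1,4),
# via midpoint quadrature on a truncated grid vs the closed form.

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
n, lo, hi = 4000, -12.0, 12.0  # tails beyond |x| = 12 are negligible here
h = (hi - lo) / n
numeric = sum(
    normal_pdf(x, mu1, s1) * math.log(normal_pdf(x, mu1, s1) / normal_pdf(x, mu2, s2))
    for x in (lo + (i + 0.5) * h for i in range(n))
) * h
closed = math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5
print(numeric, closed)  # both approximately 0.4431
```

The quadrature value matches the closed form to roughly the midpoint rule's $O(h^2)$ accuracy.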

Shannon Entropy on formalML

Numerical quadrature in practice

When closed-form antiderivatives don’t exist, we compute integrals numerically. For 1D integrals, the trapezoidal and Simpson’s rules from §10 are effective. scipy.integrate.quad uses adaptive Gauss–Kronrod quadrature — choosing subinterval sizes based on local function behavior, exactly the kind of partition refinement we studied in §2. For high-dimensional integrals ($d > 3$), Monte Carlo methods replace deterministic quadrature, because the curse of dimensionality makes grid-based methods impractical.

From Riemann to Lebesgue: why the generalization matters

The Riemann integral handles continuous and piecewise-continuous functions on bounded intervals. But ML requires more:

  • Unbounded domains: $\int_{-\infty}^{\infty} f(x) \, dx$ (every expected value over $\mathbb{R}$).
  • Pathological functions: indicator functions, densities with heavy tails, functions with uncountably many discontinuities.
  • Exchanging limits and integrals: the dominated convergence theorem lets us pass limits inside integrals under mild conditions — essential for proving that the empirical risk converges to the true risk.

The Lebesgue integral handles all of these. The transition Riemann \to Lebesgue is the transition from introductory to measure-theoretic probability.

Measure-Theoretic Probability on formalML

ML connections: expected value as integral over density, and Monte Carlo approximation

Connections & Further Reading

Back-references

  • Sequences, Limits & Convergence — the integral is defined as a limit of Riemann sums; the convergence theory from Topic 1 provides the framework.
  • Epsilon-Delta & Continuity — uniform continuity (ε-δ framework) is the key ingredient in the proof that continuous functions are integrable.
  • Completeness & Compactness — compactness of $[a,b]$ gives uniform continuity (Heine–Cantor), which is the hidden engine behind integrability.
  • The Derivative & Chain Rule — the chain rule gives substitution; the product rule gives integration by parts. The FTC connects differentiation and integration as inverse operations.
  • Mean Value Theorem & Taylor Expansion — the MVT powers the FTC Part 2 proof; the integral form of the Taylor remainder (§9) resolves Remark 4 from Topic 6.

Forward references (within formalCalculus)

References

  1. Abbott (2015), Understanding Analysis. Chapter 7 develops the Riemann integral via the Darboux approach with exceptional clarity — our primary reference for the definition and integrability proofs.
  2. Rudin (1976), Principles of Mathematical Analysis. Chapter 6 on Riemann–Stieltjes integration — the definitive compact treatment, though we specialize to the Riemann case.
  3. Spivak (2008), Calculus. Chapters 13–14 develop integration with unmatched geometric motivation alongside full rigor — the best reference for the geometric intuition behind partitions and area.
  4. Folland (1999), Real Analysis. Chapter 2 gives the Lebesgue integral — useful for understanding where Riemann fails and why the generalization matters.
  5. Burden & Faires (2010), Numerical Analysis. Chapter 4 on numerical integration — trapezoidal rule, Simpson’s rule, Gaussian quadrature error analysis.