Multivariable Integral · intermediate · 50 min read

Multiple Integrals & Fubini's Theorem

Extending integration to functions of several variables — double and triple integrals defined via multidimensional Riemann sums, Fubini's theorem reducing them to iterated single integrals, integration over general regions, and Monte Carlo methods for high-dimensional computation

Abstract. The Riemann integral integrates a function of one variable over an interval. Multiple integrals extend this to functions of several variables over regions in higher dimensions. The double integral over a rectangle R = [a,b] × [c,d] is defined via two-dimensional Riemann sums: partition R into sub-rectangles, evaluate f at sample points, sum the volume contributions, and take the limit as the mesh goes to zero. Fubini's Theorem is the computational engine: for continuous f, the double integral equals the iterated integral in either order. For non-rectangular regions, Fubini extends to Type I domains (vertical slices with variable y-limits) and Type II domains (horizontal slices with variable x-limits). Triple integrals extend the construction to three dimensions. In machine learning, multiple integrals are pervasive: joint densities, marginal densities obtained by integrating out variables, expected values, and covariance are all multiple integrals. When closed-form evaluation is impossible in high dimensions, Monte Carlo integration approximates by random sampling, connecting to the mini-batch strategy of SGD.

Where this leads → formalML

  • formalML Joint densities and marginals are computed via Fubini's theorem. Every two-variable probabilistic computation is a multiple integral. The Fubini-Tonelli theorem generalizes this to Lebesgue integrals.
  • formalML Multi-parameter posteriors require integrating over high-dimensional parameter spaces. Fubini enables Gibbs sampling, and Monte Carlo methods estimate intractable integrals.
  • formalML The expected risk is a double integral over the joint distribution. Each SGD mini-batch is a Monte Carlo estimate. Monte Carlo error bounds inform stochastic gradient variance.

Overview & Motivation

In machine learning, the expected loss over a joint distribution of data and parameters is

\mathbb{E}[\ell] = \iint \ell(y, \hat{y}(x)) \, p(x, y) \, dx \, dy.

This is a double integral — the function \ell \cdot p is integrated over two variables simultaneously. But how do we define this integral rigorously? And more importantly, how do we compute it?

The answer comes in two parts. First, we define the double integral as a limit of two-dimensional Riemann sums, directly generalizing the 1D construction from The Riemann Integral. Second, Fubini’s theorem tells us we can evaluate the double integral as two nested single integrals — and the order doesn’t matter (for continuous functions). This transforms a genuinely two-dimensional problem into a sequence of one-dimensional problems we already know how to solve.

This topic opens the Multivariable Integral Calculus track. We combine the integration machinery from The Riemann Integral & FTC with the partial-derivative perspective from Partial Derivatives & the Gradient: an iterated integral is “integrate with respect to y, treating x as constant” — precisely the partial-derivative viewpoint applied to integration rather than differentiation.

Double Integrals over Rectangles

We start where The Riemann Integral started — with Riemann sums — but now in two dimensions. The surface z = f(x, y) hovers above a rectangle R in the xy-plane. We want to compute the “volume” between the surface and R.

📐 Definition 1 (Partition of a Rectangle)

Let R = [a, b] \times [c, d] be a closed rectangle in \mathbb{R}^2. A partition of R is a pair (P_x, P_y) where

P_x = \{x_0, x_1, \ldots, x_m\}, \quad a = x_0 < x_1 < \cdots < x_m = b

is a partition of [a, b] and

P_y = \{y_0, y_1, \ldots, y_n\}, \quad c = y_0 < y_1 < \cdots < y_n = d

is a partition of [c, d]. This creates m \times n sub-rectangles R_{ij} = [x_{i-1}, x_i] \times [y_{j-1}, y_j], each with area \Delta A_{ij} = \Delta x_i \cdot \Delta y_j.

The mesh (or norm) of the partition is \|P\| = \max(\|P_x\|, \|P_y\|) — the larger of the two 1D meshes.

📐 Definition 2 (Double Riemann Sum)

Given a function f: R \to \mathbb{R}, a partition (P_x, P_y) of R, and sample points (x_{ij}^*, y_{ij}^*) \in R_{ij} in each sub-rectangle, the double Riemann sum is

S(f, P) = \sum_{i=1}^{m} \sum_{j=1}^{n} f(x_{ij}^*, y_{ij}^*) \, \Delta A_{ij}.

Each term f(x_{ij}^*, y_{ij}^*) \, \Delta A_{ij} is the signed volume of a box with base R_{ij} and height f(x_{ij}^*, y_{ij}^*).

📐 Definition 3 (The Double Integral)

The double integral of f over R is the limit of double Riemann sums as the mesh of the partition tends to zero:

\iint_R f(x, y) \, dA = \lim_{\|P\| \to 0} \sum_{i=1}^{m} \sum_{j=1}^{n} f(x_{ij}^*, y_{ij}^*) \, \Delta A_{ij}

provided this limit exists and is the same for all choices of sample points. When this limit exists, we say f is integrable on R.

💡 Remark 1 (Notation)

The notations \iint_R f(x,y) \, dA, \iint_R f(x,y) \, dx \, dy, and \iint_R f \, dA are all equivalent. The "dA" notation emphasizes the area element dA = dx \, dy; the "dx \, dy" notation makes the variables explicit. We use whichever is clearest in context.

📝 Example 1 (Constant Function: Area Computation)

Let f(x, y) = 1 on R = [0, 1]^2. Every Riemann sum equals

S(f, P) = \sum_{i,j} 1 \cdot \Delta A_{ij} = \sum_{i,j} \Delta x_i \cdot \Delta y_j = \left(\sum_i \Delta x_i\right)\left(\sum_j \Delta y_j\right) = 1 \cdot 1 = 1.

So \iint_R 1 \, dA = 1, which is the area of R. This generalizes the 1D fact that \int_a^b 1 \, dx = b - a.

📝 Example 2 (f(x,y) = x + y on the Unit Square)

Let f(x, y) = x + y on R = [0, 1]^2. Using a uniform partition with n subdivisions per axis and center sample points, the Riemann sum is

S_n = \sum_{i=1}^{n} \sum_{j=1}^{n} \left(\frac{2i - 1}{2n} + \frac{2j - 1}{2n}\right) \frac{1}{n^2} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{2i + 2j - 2}{2n}.

As n \to \infty, this converges to \iint_R (x + y) \, dA = 1. (In fact S_n = 1 exactly for every n, since midpoint sampling is exact for linear integrands.)
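A corner-sampled Riemann sum makes the convergence visible numerically (midpoint sampling is already exact for this integrand). A minimal sketch with NumPy, using lower-left sample points so the sums approach the exact value 1 from below; the helper name `riemann_2d` is ours, not from a library:

```python
import numpy as np

def riemann_2d(f, a, b, c, d, n):
    """Double Riemann sum over [a,b] x [c,d] with n^2 uniform sub-rectangles,
    sampling f at the lower-left corner of each cell."""
    xs = np.linspace(a, b, n + 1)[:-1]       # left edges x_{i-1}
    ys = np.linspace(c, d, n + 1)[:-1]       # bottom edges y_{j-1}
    X, Y = np.meshgrid(xs, ys)
    dA = ((b - a) / n) * ((d - c) / n)       # area of each sub-rectangle
    return np.sum(f(X, Y)) * dA

f = lambda x, y: x + y
for n in (4, 16, 64, 256):
    print(n, riemann_2d(f, 0, 1, 0, 1, n))   # approaches 1 as n grows
```

For this f the corner sum works out to exactly 1 - 1/n, so the O(1/n) gap to the true value is visible in the printout.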

Try the interactive explorer below — increase n and watch the boxes fill the volume:


Three-panel visualization showing 2D Riemann sums converging at 4×4, 8×8, and 20×20 partitions for f(x,y) = x+y on the unit square

Iterated Integrals

Instead of computing a two-dimensional limit, can we reduce the double integral to a sequence of ordinary integrals? The idea is to integrate “one variable at a time.”

📐 Definition 4 (Iterated Integral)

Given f: R = [a,b] \times [c,d] \to \mathbb{R}, the iterated integral (with x outer, y inner) is

\int_a^b \left[ \int_c^d f(x, y) \, dy \right] dx.

First, fix x and integrate f(x, y) over y \in [c, d] to obtain the slice function A(x) = \int_c^d f(x, y) \, dy. Then integrate A(x) over x \in [a, b].

The iterated integral with the reversed order is \int_c^d \left[ \int_a^b f(x, y) \, dx \right] dy.

📝 Example 3 (Iterated Integral of x + y)

Compute \int_0^1 \int_0^1 (x + y) \, dy \, dx. Inner integral first:

A(x) = \int_0^1 (x + y) \, dy = \left[ xy + \frac{y^2}{2} \right]_0^1 = x + \frac{1}{2}.

Then the outer integral:

\int_0^1 A(x) \, dx = \int_0^1 \left(x + \frac{1}{2}\right) dx = \left[\frac{x^2}{2} + \frac{x}{2}\right]_0^1 = \frac{1}{2} + \frac{1}{2} = 1.

📝 Example 4 (Reversed Order Gives the Same Answer)

Now compute \int_0^1 \int_0^1 (x + y) \, dx \, dy. Inner integral:

B(y) = \int_0^1 (x + y) \, dx = \left[\frac{x^2}{2} + xy\right]_0^1 = \frac{1}{2} + y.

Outer integral:

\int_0^1 \left(\frac{1}{2} + y\right) dy = \left[\frac{y}{2} + \frac{y^2}{2}\right]_0^1 = \frac{1}{2} + \frac{1}{2} = 1.

Both orders give 1 — this is not a coincidence.
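The same iterated integrals can be checked numerically. A quick sketch assuming SciPy is available; note that `dblquad` always integrates the first argument of the integrand as the inner variable, so swapping the lambda's argument order swaps the integration order:

```python
from scipy.integrate import dblquad

# dblquad(func, a, b, gfun, hfun) integrates func(y, x):
# y (the FIRST argument) runs over [gfun, hfun] inner, x over [a, b] outer.

# dy-first: inner y over [0, 1], outer x over [0, 1].
inner_y, _ = dblquad(lambda y, x: x + y, 0, 1, 0, 1)

# dx-first: make x the first argument so it is treated as the inner variable.
inner_x, _ = dblquad(lambda x, y: x + y, 0, 1, 0, 1)

print(inner_y, inner_x)  # both ≈ 1.0
```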

💡 Remark 2 (Cavalieri's Principle)

The slice function A(x) = \int_c^d f(x, y) \, dy represents the cross-sectional area at position x. The iterated integral \int_a^b A(x) \, dx sums these cross-sectional areas — this is precisely Cavalieri’s principle: the volume of a solid is the integral of its cross-sectional areas. Cavalieri stated this in 1635 as a geometric principle; we now prove it as a consequence of Fubini’s theorem.

Cavalieri's principle: cross-sectional slicing of a surface, showing the slice function A(x)

Fubini’s Theorem

This is the central result of this topic. It says that the double integral equals either iterated integral — the “two-dimensional” computation reduces to a pair of “one-dimensional” computations.

🔷 Theorem 1 (Fubini's Theorem (Rectangle Case))

Let f: R = [a,b] \times [c,d] \to \mathbb{R} be continuous. Then f is integrable on R, and

\iint_R f(x, y) \, dA = \int_a^b \left[\int_c^d f(x, y) \, dy\right] dx = \int_c^d \left[\int_a^b f(x, y) \, dx\right] dy.

That is, the double integral equals the iterated integral in either order.

Proof.

We prove that \iint_R f \, dA = \int_a^b \left[\int_c^d f(x,y) \, dy\right] dx. The equality with the other iterated integral follows by symmetry (swap the roles of x and y).

Step 1: Integrability. Since f is continuous on the compact set R = [a,b] \times [c,d], it is uniformly continuous (Heine-Cantor theorem, Completeness & Compactness). In particular, f is bounded on R. For any \varepsilon > 0, uniform continuity provides \delta > 0 such that |f(x_1, y_1) - f(x_2, y_2)| < \varepsilon whenever \|(x_1, y_1) - (x_2, y_2)\| < \delta. For any partition P with \|P\| < \delta, on each sub-rectangle R_{ij}:

M_{ij} - m_{ij} = \sup_{R_{ij}} f - \inf_{R_{ij}} f < \varepsilon

where M_{ij} and m_{ij} are the supremum and infimum of f on R_{ij}. Therefore the gap between upper and lower Darboux sums is

U(f, P) - L(f, P) = \sum_{i,j} (M_{ij} - m_{ij}) \Delta A_{ij} < \varepsilon \cdot \text{Area}(R).

Since \varepsilon was arbitrary, f is integrable on R.

Step 2: The slice function is well-defined. For each fixed x \in [a, b], the function y \mapsto f(x, y) is continuous on [c, d] (since f is continuous on R). By the 1D integrability theorem (The Riemann Integral, Theorem 1), f(x, \cdot) is integrable, so the slice function

A(x) = \int_c^d f(x, y) \, dy

is well-defined for every x \in [a, b].

Step 3: A(x) is continuous. We show A is continuous using uniform continuity of f. For the \delta from Step 1:

|A(x_1) - A(x_2)| = \left|\int_c^d [f(x_1, y) - f(x_2, y)] \, dy\right| \le \int_c^d |f(x_1, y) - f(x_2, y)| \, dy.

If |x_1 - x_2| < \delta, then \|(x_1, y) - (x_2, y)\| = |x_1 - x_2| < \delta, so |f(x_1, y) - f(x_2, y)| < \varepsilon for all y. Therefore |A(x_1) - A(x_2)| \le \varepsilon(d - c). This shows A is continuous on [a, b], hence integrable.

Step 4: Squeeze. Take a partition P = (P_x, P_y) with \|P\| < \delta. For each sub-rectangle R_{ij}:

m_{ij} \, \Delta y_j \le \int_{y_{j-1}}^{y_j} f(x, y) \, dy \le M_{ij} \, \Delta y_j \quad \text{for all } x \in [x_{i-1}, x_i].

Summing over j: \sum_j m_{ij} \Delta y_j \le A(x) \le \sum_j M_{ij} \Delta y_j for all x \in [x_{i-1}, x_i]. Integrating over [x_{i-1}, x_i]:

\sum_j m_{ij} \Delta y_j \cdot \Delta x_i \le \int_{x_{i-1}}^{x_i} A(x) \, dx \le \sum_j M_{ij} \Delta y_j \cdot \Delta x_i.

Summing over all i:

L(f, P) \le \int_a^b A(x) \, dx \le U(f, P).

But \iint_R f \, dA also lies between L(f, P) and U(f, P). Since U(f, P) - L(f, P) < \varepsilon \cdot \text{Area}(R) and \varepsilon is arbitrary:

\iint_R f(x, y) \, dA = \int_a^b A(x) \, dx = \int_a^b \left[\int_c^d f(x, y) \, dy\right] dx. \qquad \blacksquare

💡 Remark 3 (Fubini Beyond Continuity)

Fubini’s theorem extends well beyond continuous functions. The Fubini-Tonelli theorem applies to Lebesgue-integrable functions: if \int |f| \, dA < \infty (or if f \ge 0), then the iterated integrals exist and are equal to the double integral. The continuity hypothesis above is a convenient sufficient condition, but not necessary. The full generalization lives in measure theory — a topic for the Measure & Integration track (coming soon).

📝 Example 5 (∫∫ x sin(xy) dA over [0, π] × [0, 1])

Compute \iint_R x \sin(xy) \, dA where R = [0, \pi] \times [0, 1].

dy-first: Fix x. Inner integral: \int_0^1 x \sin(xy) \, dy = \left[-\cos(xy)\right]_{y=0}^{y=1} = -\cos(x) + 1 = 1 - \cos x.

Outer integral: \int_0^\pi (1 - \cos x) \, dx = \left[x - \sin x\right]_0^\pi = \pi - 0 = \pi.

The other order (dx-first) requires integration by parts twice — both orders give \pi, but dy-first is dramatically simpler.
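A numeric check of both orders, as a sketch assuming SciPy is available (`dblquad` treats the integrand's first argument as the inner variable):

```python
from math import pi, sin
from scipy.integrate import dblquad

# dy-first: inner y over [0, 1], outer x over [0, pi].
I_dy_first, _ = dblquad(lambda y, x: x * sin(x * y), 0, pi, 0, 1)

# dx-first: inner x over [0, pi], outer y over [0, 1].
I_dx_first, _ = dblquad(lambda x, y: x * sin(x * y), 0, 1, 0, pi)

print(I_dy_first, I_dx_first, pi)  # all three agree
```

Numerically the orders are equally easy, of course; the asymmetry only appears when integrating by hand.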


Side-by-side comparison of both integration orders: vertical slices (dy-first) vs. horizontal slices (dx-first)

Integration over General Regions

So far we’ve integrated over rectangles. Real applications require integration over non-rectangular regions — disks, triangles, regions bounded by curves.

📐 Definition 5 (Type I Region)

A Type I region D in \mathbb{R}^2 is described by

D = \{(x, y) : a \le x \le b, \quad g_1(x) \le y \le g_2(x)\}

where g_1 and g_2 are continuous on [a, b] with g_1(x) \le g_2(x). The y-limits depend on x — we slice D with vertical strips.

📐 Definition 6 (Type II Region)

A Type II region D is described by

D = \{(x, y) : c \le y \le d, \quad h_1(y) \le x \le h_2(y)\}

where h_1 and h_2 are continuous on [c, d] with h_1(y) \le h_2(y). The x-limits depend on y — we slice D with horizontal strips.

🔷 Theorem 2 (Fubini's Theorem (General Regions))

Let f be continuous on a region D.

If D is Type I: \displaystyle\iint_D f(x, y) \, dA = \int_a^b \int_{g_1(x)}^{g_2(x)} f(x, y) \, dy \, dx.

If D is Type II: \displaystyle\iint_D f(x, y) \, dA = \int_c^d \int_{h_1(y)}^{h_2(y)} f(x, y) \, dx \, dy.

If D is both Type I and Type II, both iterated integrals are equal.

📝 Example 6 (Triangle: Both Orders)

Let D = \{(x, y) : 0 \le y \le x, \; 0 \le x \le 1\} — the triangle below the line y = x.

Type I (x outer): x ranges from 0 to 1; for each x, y ranges from 0 to x: \int_0^1 \int_0^x 1 \, dy \, dx = \int_0^1 x \, dx = \frac{1}{2}.

Type II (y outer): y ranges from 0 to 1; for each y, x ranges from y to 1: \int_0^1 \int_y^1 1 \, dx \, dy = \int_0^1 (1 - y) \, dy = \frac{1}{2}.

Both give \frac{1}{2} — the area of the triangle.
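Variable limits translate directly into code: `dblquad` accepts callables for the inner bounds. A sketch assuming SciPy is available, computing the triangle's area both ways:

```python
from scipy.integrate import dblquad

# Type I: x in [0, 1] outer; inner y from 0 up to the line y = x.
area_type1, _ = dblquad(lambda y, x: 1.0, 0, 1, 0, lambda x: x)

# Type II: y in [0, 1] outer; inner x from the line x = y across to 1.
area_type2, _ = dblquad(lambda x, y: 1.0, 0, 1, lambda y: y, 1)

print(area_type1, area_type2)  # both ≈ 0.5
```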

📝 Example 7 (Disk — Painful in Cartesian)

The area of the unit disk D = \{(x,y) : x^2 + y^2 \le 1\} as a Type I integral:

\int_{-1}^{1} \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} 1 \, dy \, dx = \int_{-1}^{1} 2\sqrt{1 - x^2} \, dx = \pi.

This is computable (via the substitution x = \sin\theta) but awkward. In polar coordinates the same integral becomes \int_0^{2\pi} \int_0^1 r \, dr \, d\theta = \pi — trivially. This is the motivation for the next topic: Change of Variables & the Jacobian Determinant.

💡 Remark 4 (Regions That Are Neither Type I Nor Type II)

Some regions (e.g., an annulus, or an L-shaped domain) cannot be described as a single Type I or Type II region. The solution: decompose the region into finitely many pieces, each of which is Type I or Type II, and use the additivity property (Theorem 3 below). Alternatively, a change of coordinates (polar, cylindrical, spherical) may reduce the region to a simple rectangle in the new coordinates.

Four-panel comparison: two regions shown as both Type I (vertical strips) and Type II (horizontal strips), with labeled boundaries and integration formulas

Paraboloid z = x² + y² capped by the plane z = 4: the natural coordinates are polar, motivating the next topic

Choosing Integration Order

Sometimes one integration order leads to an impossible inner integral, while the other order is elementary. Recognizing this — and knowing how to reverse the order — is an essential computational skill.

📝 Example 8 (e^{x³}: Impossible in One Order, Elementary in the Other)

Compute \int_0^4 \int_{\sqrt{y}}^2 e^{x^3} \, dx \, dy.

The inner integral \int_{\sqrt{y}}^2 e^{x^3} \, dx has no closed-form antiderivative — e^{x^3} cannot be integrated in terms of elementary functions. We are stuck.

Reverse the order. Sketch the region: 0 \le y \le 4, \sqrt{y} \le x \le 2. Equivalently: 0 \le x \le 2, 0 \le y \le x^2. In the reversed order:

\int_0^2 \int_0^{x^2} e^{x^3} \, dy \, dx = \int_0^2 x^2 e^{x^3} \, dx = \left[\frac{e^{x^3}}{3}\right]_0^2 = \frac{e^8 - 1}{3}.

The key move: the inner integral \int_0^{x^2} e^{x^3} \, dy = x^2 e^{x^3}, and now x^2 e^{x^3} is a perfect u-substitution candidate with u = x^3.
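The reversed order can be checked numerically against the closed form; a sketch assuming SciPy is available:

```python
from math import exp
from scipy.integrate import dblquad

# Reversed order: x in [0, 2] outer, y in [0, x^2] inner.
I, _ = dblquad(lambda y, x: exp(x**3), 0, 2, 0, lambda x: x**2)

print(I, (exp(8) - 1) / 3)  # numeric result and closed form agree
```

(The quadrature routine has no trouble in either order, because it never needs a symbolic antiderivative; the order only matters for hand computation.)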

💡 Remark 5 (When to Switch Order)

Consider switching the integration order when:

  1. The inner integral has no elementary antiderivative.
  2. The limits in the current order are complicated but would simplify if swapped.
  3. The integrand factors nicely in the other variable.

The procedure: (1) sketch the region, (2) re-describe it as the other type (Type I ↔ Type II), (3) write the new limits, (4) evaluate.

Three-panel diagram: the original impossible integral, the region sketch, and the tractable reversed integral

Properties of Double Integrals

The double integral inherits the familiar properties of the 1D integral — linearity, monotonicity, additivity — because Fubini reduces it to 1D integrals.

🔷 Theorem 3 (Properties of the Double Integral)

Let f, g be continuous on a region D, and let c \in \mathbb{R}. Then:

  1. Linearity: \iint_D [f + cg] \, dA = \iint_D f \, dA + c \iint_D g \, dA.
  2. Monotonicity: If f(x,y) \le g(x,y) on D, then \iint_D f \, dA \le \iint_D g \, dA.
  3. Additivity: If D = D_1 \cup D_2 with D_1 \cap D_2 having zero area, then \iint_D f \, dA = \iint_{D_1} f \, dA + \iint_{D_2} f \, dA.
  4. Triangle Inequality: \left|\iint_D f \, dA\right| \le \iint_D |f| \, dA.
  5. Area: \iint_D 1 \, dA = \text{Area}(D).

Proof.

Each property follows from Fubini’s theorem + the corresponding 1D property. For example, linearity: by Fubini,

\iint_D [f + cg] \, dA = \int_a^b \int_{g_1(x)}^{g_2(x)} [f(x,y) + cg(x,y)] \, dy \, dx.

By linearity of the 1D integral (The Riemann Integral, Theorem 3), the inner integral splits:

= \int_a^b \left[\int_{g_1(x)}^{g_2(x)} f(x,y) \, dy + c \int_{g_1(x)}^{g_2(x)} g(x,y) \, dy\right] dx.

Linearity of the outer integral then gives

= \int_a^b \int_{g_1(x)}^{g_2(x)} f \, dy \, dx + c \int_a^b \int_{g_1(x)}^{g_2(x)} g \, dy \, dx = \iint_D f \, dA + c \iint_D g \, dA.

The other properties follow similarly — monotonicity from the 1D monotonicity property, additivity from additivity of 1D integrals over adjacent intervals, and the triangle inequality from -|f| \le f \le |f| together with monotonicity and linearity. \blacksquare

🔷 Proposition 1 (Mean Value Theorem for Double Integrals)

If f is continuous on a connected, bounded region D with \text{Area}(D) > 0, then there exists a point (x_0, y_0) \in D such that

\iint_D f(x, y) \, dA = f(x_0, y_0) \cdot \text{Area}(D).

In other words, f attains its average value somewhere in D.

Proof.

Since f is continuous on the compact set \overline{D}, the Extreme Value Theorem (Completeness & Compactness) gives m \le f(x,y) \le M for all (x,y) \in D, where m and M are the minimum and maximum of f on D. By monotonicity of the integral:

m \cdot \text{Area}(D) \le \iint_D f \, dA \le M \cdot \text{Area}(D).

Dividing by \text{Area}(D):

m \le \frac{1}{\text{Area}(D)} \iint_D f \, dA \le M.

The quantity in the middle lies between m = f(p_1) and M = f(p_2) for some p_1, p_2 \in D. Since D is connected and f is continuous, the Intermediate Value Theorem guarantees there exists (x_0, y_0) \in D with f(x_0, y_0) = \frac{1}{\text{Area}(D)} \iint_D f \, dA. \blacksquare

Triple Integrals and Higher Dimensions

The construction extends naturally to three (and more) dimensions. We partition a box B = [a_1, b_1] \times [a_2, b_2] \times [a_3, b_3] into sub-boxes, form Riemann sums with volume elements \Delta V_{ijk}, and take the limit.

📐 Definition 7 (Triple Integral)

The triple integral of f over a region E \subset \mathbb{R}^3 is

\iiint_E f(x, y, z) \, dV = \lim_{\|P\| \to 0} \sum_{i,j,k} f(x_{ijk}^*, y_{ijk}^*, z_{ijk}^*) \, \Delta V_{ijk}.

Fubini’s theorem extends: for continuous f, the triple integral equals any of the six possible iterated integrals (three nested single integrals in any order of the three variables).

📝 Example 9 (A Triple Integral over a Tetrahedron)

Compute \iiint_E z \, dV where E = \{(x,y,z) : x, y, z \ge 0, \; x + y + z \le 1\} — the standard tetrahedron.

Set up as a Type-I-style iterated integral: x ranges from 0 to 1; for each x, y ranges from 0 to 1 - x; for each (x, y), z ranges from 0 to 1 - x - y.

\int_0^1 \int_0^{1-x} \int_0^{1-x-y} z \, dz \, dy \, dx.

Innermost: \int_0^{1-x-y} z \, dz = \frac{(1-x-y)^2}{2}.

Middle: \int_0^{1-x} \frac{(1-x-y)^2}{2} \, dy. Substituting u = 1-x-y: = \frac{1}{2} \int_0^{1-x} u^2 \, du = \frac{(1-x)^3}{6}.

Outer: \int_0^1 \frac{(1-x)^3}{6} \, dx = \frac{1}{6} \cdot \frac{1}{4} = \frac{1}{24}.
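The same nested limits carry over verbatim to SciPy's `tplquad`, a sketch assuming SciPy is available (`tplquad` integrates `func(z, y, x)` with z innermost and x outermost):

```python
from scipy.integrate import tplquad

# Integrate z over the standard tetrahedron x, y, z >= 0, x + y + z <= 1.
I, _ = tplquad(lambda z, y, x: z,
               0, 1,                       # x in [0, 1]        (outermost)
               0, lambda x: 1 - x,         # y in [0, 1 - x]
               0, lambda x, y: 1 - x - y)  # z in [0, 1 - x - y] (innermost)

print(I, 1 / 24)  # ≈ 0.041666...
```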

💡 Remark 6 (The Curse of Dimensionality)

For a product-rule grid in d dimensions with k points per axis, the total number of evaluations is k^d. With k = 100 and d = 10, that’s 100^{10} = 10^{20} evaluations — far beyond any computer. This exponential scaling is the curse of dimensionality. It’s why the ML integrals over 10,000-dimensional parameter spaces don’t use quadrature. They use Monte Carlo — next section.

3D tetrahedron with vertices labeled and integration limits annotated at each nesting level

Monte Carlo Integration

When deterministic quadrature fails (high dimensions, complex regions), we turn to randomness. Monte Carlo integration replaces systematic grids with random samples.

📐 Definition 8 (Monte Carlo Estimate)

Let D \subset \mathbb{R}^d be a region with volume |D|, and let X_1, \ldots, X_N be independent, uniformly distributed random points in D. The Monte Carlo estimate of \int_D f \, dV is

\hat{I}_N = \frac{|D|}{N} \sum_{k=1}^{N} f(X_k).

This is the sample mean of f scaled by the volume of D.

🔷 Proposition 2 (Monte Carlo Error)

The Monte Carlo estimate is unbiased: \mathbb{E}[\hat{I}_N] = \int_D f \, dV. Its standard error is

\text{SE}(\hat{I}_N) = \frac{|D| \cdot \sigma_f}{\sqrt{N}}

where \sigma_f^2 = \frac{1}{|D|}\int_D [f(x) - \bar{f}]^2 \, dV is the variance of f over D. The convergence rate is O(1/\sqrt{N}) — independent of dimension. This is why Monte Carlo wins in high dimensions.

Proof.

Sketch. Since the X_k are i.i.d. uniform on D, \mathbb{E}[f(X_k)] = \frac{1}{|D|} \int_D f \, dV. So

\mathbb{E}[\hat{I}_N] = \frac{|D|}{N} \cdot N \cdot \frac{1}{|D|} \int_D f \, dV = \int_D f \, dV.

The variance: \text{Var}(\hat{I}_N) = \frac{|D|^2}{N^2} \cdot N \cdot \text{Var}(f(X_k)) = \frac{|D|^2 \sigma_f^2}{N}.

Taking the square root gives the standard error O(1/\sqrt{N}). \blacksquare
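Definition 8 and Proposition 2 fit in a few lines of NumPy. A minimal sketch for box-shaped domains; the helper name `mc_integrate` is ours, and the reported standard error uses the sample standard deviation as a stand-in for \sigma_f:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_integrate(f, lo, hi, n):
    """Monte Carlo estimate of the integral of f over the box
    D = [lo_1, hi_1] x ... x [lo_d, hi_d], plus its estimated
    standard error |D| * sigma_hat / sqrt(n)."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    vol = np.prod(hi - lo)                          # |D|
    x = rng.uniform(lo, hi, size=(n, len(lo)))      # n uniform points in D
    fx = f(x)
    return vol * fx.mean(), vol * fx.std(ddof=1) / np.sqrt(n)

# Check against Example 3: integral of (x + y) over [0, 1]^2 is 1.
est, se = mc_integrate(lambda p: p[:, 0] + p[:, 1], [0, 0], [1, 1], 100_000)
print(est, "+/-", se)
```

Note that nothing in `mc_integrate` depends on the dimension except the shape of the sample array, exactly the point of Proposition 2.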

📝 Example 10 (Estimating π via Monte Carlo)

The area of the unit disk is \pi. Inscribe it in the square [-1, 1]^2 (area 4). Sample N uniform points in the square. The fraction inside the disk estimates \pi / 4:

\hat{\pi}_N = 4 \cdot \frac{\text{points inside disk}}{N}.

With N = 10{,}000 points, we typically get \hat{\pi} \approx 3.14 \pm 0.02. Try it below:
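The hit-or-miss estimator is a one-liner in NumPy; a minimal sketch, with the standard error taken from the binomial variance of the hit fraction:

```python
import numpy as np

rng = np.random.default_rng(42)

N = 100_000
pts = rng.uniform(-1, 1, size=(N, 2))       # uniform points in [-1, 1]^2
inside = (pts ** 2).sum(axis=1) <= 1.0      # hit test: x^2 + y^2 <= 1
p = inside.mean()                           # estimates pi / 4
pi_hat = 4.0 * p

# Standard error of the hit-or-miss estimator: 4 * sqrt(p(1-p)/N)
se = 4.0 * np.sqrt(p * (1 - p) / N)
print(pi_hat, "+/-", se)
```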

💡 Remark 7 (Monte Carlo vs. Quadrature — and SGD)

In d dimensions, trapezoidal-rule quadrature with k points per axis needs N = k^d evaluations for accuracy O(k^{-2}) = O(N^{-2/d}). Monte Carlo achieves accuracy O(1/\sqrt{N}) with N evaluations — a rate independent of d. For d > 4 or so, Monte Carlo dominates.

This is exactly why stochastic gradient descent (SGD) uses random mini-batches. The expected risk \mathbb{E}[\ell] = \iint \ell \cdot p \, dx \, dy is a high-dimensional integral. Each mini-batch of size B is a Monte Carlo estimate with error O(1/\sqrt{B}).

Gradient Descent on formalML


Monte Carlo estimation of π: scatter plots at N=500 and N=5000, plus log-log convergence plot with O(1/√N) reference line

Connections to ML

Multiple integrals are not just mathematical machinery — they are the language of probability and statistical learning.

Joint and Marginal Densities

A joint density f_{X,Y}(x, y) satisfies \iint f_{X,Y}(x,y) \, dx \, dy = 1. The marginal density of X is obtained by “integrating out” Y:

f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy.

This is Fubini applied to probability. Every time you marginalize — whether computing the marginal likelihood in Bayesian inference or the evidence lower bound (ELBO) in variational inference — you’re invoking Fubini’s theorem.
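Marginalization is literally a 1D integral at each x. A sketch assuming SciPy is available, using the joint density of two independent standard normals (an illustrative choice, for which the marginal is known in closed form):

```python
import numpy as np
from scipy.integrate import quad

# Joint density of two independent standard normals.
def joint(x, y):
    return np.exp(-(x**2 + y**2) / 2) / (2 * np.pi)

# Marginalize out y numerically: f_X(x) = integral of f_{X,Y}(x, y) dy.
def marginal_x(x):
    val, _ = quad(lambda y: joint(x, y), -np.inf, np.inf)
    return val

# Closed-form marginal: the standard normal density.
exact = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

for x in (0.0, 1.0, 2.0):
    print(x, marginal_x(x), exact(x))   # numeric and exact marginals agree
```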

Measure-Theoretic Probability on formalML

Expected Values and Covariance

The expected value of a function g(X, Y) under a joint density is

\mathbb{E}[g(X,Y)] = \iint g(x,y) \, f_{X,Y}(x,y) \, dx \, dy.

Covariance is \text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] — each term a double integral, simplified by Fubini and linearity.

Monte Carlo Estimation and SGD

The expected risk R(\theta) = \iint \ell(y, h_\theta(x)) \, p(x, y) \, dx \, dy is a double integral we can’t compute directly (we don’t know p(x,y)). Each SGD mini-batch of size B is a Monte Carlo estimate:

\hat{R}_B(\theta) = \frac{1}{B} \sum_{k=1}^{B} \ell(y_k, h_\theta(x_k))

with standard error O(1/\sqrt{B}). Larger batches reduce variance but cost more computation — the fundamental trade-off of stochastic optimization.
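The 1/\sqrt{B} scaling is easy to see empirically. A toy sketch: the gamma-distributed "losses" and all names below are illustrative stand-ins, not a real training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy population of per-example losses; its mean plays the role of the
# expected risk that a mini-batch estimates.
losses = rng.gamma(shape=2.0, scale=0.5, size=1_000_000)
true_risk = losses.mean()

spread = {}
for B in (16, 64, 256, 1024):
    # Draw many mini-batches and measure the spread of their means.
    batches = rng.choice(losses, size=(2000, B))
    spread[B] = batches.mean(axis=1).std()
    print(B, spread[B])   # shrinks roughly like 1 / sqrt(B)
```

Quadrupling B roughly halves the spread, matching the O(1/\sqrt{B}) standard error.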

Gradient Descent on formalML

Partition Function in Energy-Based Models

In models like the Boltzmann machine, the partition function Z = \int e^{-E(x)} \, dx is an intractable integral over a high-dimensional space. Contrastive divergence and other training methods are essentially schemes for estimating ratios of such integrals without computing them directly.

Preview: Change of Variables

The Gaussian integral \int_{-\infty}^{\infty} e^{-x^2} \, dx = \sqrt{\pi} is proved by converting to polar coordinates — a change of variables that transforms the double integral into a simpler form. The general theory requires the Jacobian determinant as a volume scaling factor. The full proof appears in Change of Variables & the Jacobian Determinant, §5.


Three-panel ML connections: joint density with marginal integration, expected loss as double integral with SGD as Monte Carlo, and curse of dimensionality comparison

Connections & Further Reading

This topic builds on:

This topic leads to:

References

  1. Spivak (1965). Calculus on Manifolds, Chapter 3 — the Riemann integral in Rⁿ via partitions and Fubini's theorem.
  2. Rudin (1976). Principles of Mathematical Analysis, Chapter 10 — rigorous treatment of multiple integrals via iterated integration.
  3. Munkres (1991). Analysis on Manifolds, Chapters 2-3 — Riemann integration in Rⁿ with measurability conditions for Fubini.
  4. Hubbard & Hubbard (2015). Vector Calculus, Linear Algebra, and Differential Forms, Chapter 4 — geometric motivation and computational techniques for multiple integrals.
  5. Bishop (2006). Pattern Recognition and Machine Learning, Chapter 2 — marginal, conditional, and joint densities via multiple integrals.