Multivariable Differential · advanced · 45 min read

Inverse & Implicit Function Theorems

When can you locally invert a function? When can you solve F(x, y) = 0 for y?

Abstract. The Inverse Function Theorem and the Implicit Function Theorem are the two great existence theorems of multivariable calculus. The IFT tells you when a differentiable function is locally invertible: if the Jacobian matrix is non-singular at a point, the function has a smooth local inverse. The ImFT tells you when a level set F(x, y) = 0 can be locally described as the graph of a function y = g(x): when the partial Jacobian with respect to y is invertible. Together, they provide the rigorous foundation for constrained optimization (Lagrange multipliers), the definition of smooth manifolds, change of variables in integration, and the local structure of parameter spaces in machine learning.

Where this leads → formalML

  • formalML The IFT guarantees local convergence of gradient descent near non-degenerate minima: the gradient map is locally invertible when the Hessian is non-singular, ensuring that small perturbations from the optimum produce proportional gradient signals.
  • formalML The Implicit Function Theorem is the foundational tool for proving that level sets are smooth manifolds. The regular value theorem — that F⁻¹(c) is a manifold when J_F has full rank on the level set — is a direct consequence of the ImFT.
  • formalML The statistical manifold of a parametric family is locally diffeomorphic to ℝⁿ via the IFT when the Fisher information matrix is non-singular. Reparametrization invariance relies on the IFT to guarantee that coordinate changes on the parameter space are locally invertible.

Overview & Motivation

A normalizing flow transforms a simple distribution — say, a standard Gaussian — into a complex one by composing a sequence of invertible maps $f_1, f_2, \ldots, f_K$. The density of the transformed distribution involves the Jacobian determinant of each layer: $p_K(z_K) = p_0(z_0) \prod_{k=1}^{K} |\det J_{f_k}(z_{k-1})|^{-1}$. Every layer must be invertible, and we need to know it is invertible — not just hope. The Inverse Function Theorem provides exactly this guarantee: if the Jacobian matrix $J_f(a)$ is non-singular at a point $a$, then $f$ is locally invertible near $a$, with a smooth inverse whose own Jacobian is the matrix inverse $[J_f(f^{-1}(y))]^{-1}$. This is the foundational existence theorem that underpins the entire normalizing flow framework.

The Implicit Function Theorem addresses a complementary question. Given a system of equations $F(x, y) = 0$ — such as the constraints in a Lagrangian optimization problem, or the fixed-point equation defining the output of a deep equilibrium model — when can we solve for $y$ as a smooth function of $x$? The ImFT says: when the partial Jacobian $D_y F$ is invertible at the point of interest, the level set $F(x, y) = 0$ is locally the graph of a $C^1$ function $y = g(x)$, and it hands us a formula for the derivative $g'(x)$. This is implicit differentiation elevated from a calculus trick to a rigorous existence theorem.

These are existence theorems — they do not compute the inverse or the implicit function. They tell you when the thing you want exists and what properties it has. The computational toolkit was built in Topics 9 through 11: the gradient for first-order information, the Jacobian for multivariate derivatives, the Hessian for second-order structure. This topic provides the theoretical guarantees that make the entire framework coherent.

Local Invertibility — The Geometric Picture

Before stating the theorem, we need to be precise about what “invertible” means in the multivariable context — and in particular, the crucial distinction between local and global invertibility.

📐 Definition 1 (Local Diffeomorphism)

Let $U \subseteq \mathbb{R}^n$ be open. A $C^1$ map $f: U \to \mathbb{R}^n$ is a local diffeomorphism at $a \in U$ if there exist open sets $V \ni a$ and $W \ni f(a)$ such that $f|_V: V \to W$ is:

  1. Bijective (one-to-one and onto),
  2. $C^1$ (continuously differentiable), and
  3. Its inverse $f|_V^{-1}: W \to V$ is also $C^1$.

If $f$ is a local diffeomorphism at every point of $U$, we call it a local diffeomorphism on $U$. If $f: U \to f(U)$ is a bijective local diffeomorphism, it is a (global) diffeomorphism.

The word “local” is essential. A function can be a local diffeomorphism at every point and yet fail to be globally injective. The critical distinction: the IFT guarantees local invertibility from a pointwise condition (non-singular Jacobian at a single point), but global invertibility requires additional structure.

📝 Example 1 (Polar coordinates — local diffeomorphism with a singularity)

The polar coordinate map $f: \mathbb{R}^2 \to \mathbb{R}^2$ defined by

$$f(r, \theta) = (r\cos\theta,\; r\sin\theta)$$

has Jacobian

$$J_f(r, \theta) = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}$$

with determinant $\det J_f = r\cos^2\theta + r\sin^2\theta = r$. This is non-zero when $r \neq 0$, so $f$ is a local diffeomorphism everywhere except at the origin. At $r = 0$, the Jacobian is singular — and geometrically, all angles $\theta$ map to the same point $(0, 0)$. The polar coordinate map collapses an entire line ($r = 0$, all $\theta$) to a single point, which is the geometric source of the singularity.

Restricting to $r > 0$ and $\theta \in (0, 2\pi)$, the map becomes a diffeomorphism onto $\mathbb{R}^2 \setminus \{(x, 0) : x \geq 0\}$. The restriction to $\theta \in (0, 2\pi)$ is necessary to prevent $\theta = 0$ and $\theta = 2\pi$ from mapping to the same point.

📝 Example 2 (Exponential map — everywhere locally invertible, not globally injective)

Define $f: \mathbb{R}^2 \to \mathbb{R}^2$ by

$$f(x, y) = (e^x \cos y,\; e^x \sin y).$$

The Jacobian is

$$J_f(x, y) = \begin{pmatrix} e^x\cos y & -e^x\sin y \\ e^x\sin y & e^x\cos y \end{pmatrix}$$

with $\det J_f = e^{2x}(\cos^2 y + \sin^2 y) = e^{2x} > 0$ everywhere. The Jacobian is never singular — the IFT guarantees that $f$ is a local diffeomorphism at every point. Yet $f$ is not globally injective: $f(x, y) = f(x, y + 2\pi)$ for all $(x, y)$, because the complex exponential $e^{x+iy}$ is $2\pi i$-periodic. This is a clean example of the gap between “locally invertible everywhere” and “globally invertible.”
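The gap between the two notions is easy to check numerically. A minimal NumPy sketch (the sample points and function names are ours):

```python
import numpy as np

def f(x, y):
    """The map f(x, y) = (e^x cos y, e^x sin y) from Example 2."""
    return np.array([np.exp(x) * np.cos(y), np.exp(x) * np.sin(y)])

def det_jacobian(x, y):
    """det J_f = e^{2x}, from the closed form computed above."""
    return np.exp(2 * x)

# det J_f > 0 at every sample point: f is a local diffeomorphism everywhere.
dets = [det_jacobian(x, y) for x in (-2.0, 0.0, 3.0) for y in (0.0, 1.5, 4.0)]
assert all(d > 0 for d in dets)

# Yet f is not injective: shifting y by 2*pi produces the same output.
assert np.allclose(f(0.7, 0.3), f(0.7, 0.3 + 2 * np.pi))
```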

💡 Remark 1 (Local vs. global invertibility in normalizing flows)

The distinction between local and global invertibility is not merely a mathematical subtlety — it drives architectural decisions in machine learning. The IFT only guarantees local invertibility: given a non-singular Jacobian at one point, the function has a smooth inverse in some neighborhood. Normalizing flows need global bijectivity (every input maps to a unique output over the entire domain) to define valid density transformations.

Architectures like Real-NVP and NICE achieve global bijectivity by construction: affine coupling layers $y_{1:d} = x_{1:d}$, $y_{d+1:n} = x_{d+1:n} \odot \exp(s(x_{1:d})) + t(x_{1:d})$ are globally invertible because the triangular structure makes the inverse explicit. The IFT is not needed for these architectures — but it is needed for residual flows $f(x) = x + g(x)$, where global invertibility requires additional conditions (e.g., Lipschitz constant of $g$ less than 1).

The geometry of local invertibility: a map f sends a small neighborhood of a to a small neighborhood of f(a), with the Jacobian controlling the local distortion. When det J_f = 0, the map collapses a direction — the tangent space image loses a dimension.

The Inverse Function Theorem

This is the central result. The hypothesis is a pointwise condition on the Jacobian; the conclusion is local invertibility with a smooth inverse.

🔷 Theorem 1 (Inverse Function Theorem (IFT))

Let $U \subseteq \mathbb{R}^n$ be open and let $f: U \to \mathbb{R}^n$ be $C^1$ (continuously differentiable). Let $a \in U$ and $b = f(a)$. Suppose that the Jacobian matrix $J_f(a)$ is invertible (equivalently, $\det J_f(a) \neq 0$).

Then there exist open sets $V \subseteq U$ with $a \in V$ and $W \subseteq \mathbb{R}^n$ with $b \in W$ such that:

  1. Bijection: $f: V \to W$ is a bijection.
  2. Smooth inverse: The inverse function $g = f^{-1}: W \to V$ is $C^1$.
  3. Derivative formula: For every $y \in W$,

$$J_g(y) = [J_f(g(y))]^{-1}.$$

In words: if the linear approximation to $f$ at $a$ is invertible, then $f$ itself is invertible in a neighborhood of $a$, with a $C^1$ inverse whose Jacobian is the matrix inverse of $J_f$ evaluated at the corresponding point.

Proof.

We prove the IFT via the Contraction Mapping Theorem (Theorem 2 below). The strategy: reformulate the equation $f(x) = y$ as a fixed-point problem $T_y(x) = x$, show that $T_y$ is a contraction on a closed ball near $a$, and invoke completeness to obtain the unique fixed point $g(y) = f^{-1}(y)$.

Setup. Let $A = J_f(a)$. Since $\det A \neq 0$, the inverse $A^{-1}$ exists. Define the auxiliary map

$$T_y(x) = x + A^{-1}(y - f(x)) = x - A^{-1}(f(x) - y).$$

Observe that $T_y(x) = x$ if and only if $A^{-1}(y - f(x)) = 0$, which (since $A^{-1}$ is invertible) happens if and only if $f(x) = y$. So finding a fixed point of $T_y$ is equivalent to solving $f(x) = y$.

Step 1: $T_y$ is a contraction near $a$. The Jacobian of $T_y$ with respect to $x$ is

$$DT_y(x) = I - A^{-1} J_f(x) = A^{-1}(A - J_f(x)).$$

Since $J_f$ is continuous (the $C^1$ hypothesis) and $J_f(a) = A$, for any $\varepsilon > 0$ there exists $\delta > 0$ such that

$$\|x - a\| < \delta \implies \|J_f(x) - A\| < \varepsilon.$$

Choose $\varepsilon = \frac{1}{2\|A^{-1}\|}$. Then for $\|x - a\| < \delta$:

$$\|DT_y(x)\| = \|A^{-1}(A - J_f(x))\| \leq \|A^{-1}\| \cdot \|A - J_f(x)\| < \|A^{-1}\| \cdot \frac{1}{2\|A^{-1}\|} = \frac{1}{2}.$$

By the Mean Value Inequality (the multivariable Mean Value Theorem), for any $x_1, x_2 \in B(a, \delta)$:

$$\|T_y(x_1) - T_y(x_2)\| \leq \sup_{x \in B(a, \delta)} \|DT_y(x)\| \cdot \|x_1 - x_2\| \leq \frac{1}{2}\|x_1 - x_2\|.$$

So $T_y$ is a contraction with Lipschitz constant $\lambda = \frac{1}{2}$ on $B(a, \delta)$.

Step 2: $T_y$ maps $\overline{B}(a, r)$ to itself for $y$ near $b$. Fix $r < \delta$, so that $\overline{B}(a, r) \subset B(a, \delta)$ and the contraction estimate holds on the closed ball. For any $x \in \overline{B}(a, r)$:

$$\|T_y(x) - a\| \leq \|T_y(x) - T_y(a)\| + \|T_y(a) - a\|.$$

The first term is controlled by the contraction:

$$\|T_y(x) - T_y(a)\| \leq \frac{1}{2}\|x - a\| \leq \frac{r}{2}.$$

For the second term:

$$\|T_y(a) - a\| = \|a + A^{-1}(y - f(a)) - a\| = \|A^{-1}(y - b)\| \leq \|A^{-1}\| \cdot \|y - b\|.$$

Therefore:

$$\|T_y(x) - a\| \leq \frac{r}{2} + \|A^{-1}\| \cdot \|y - b\|.$$

This is $\leq r$ provided $\|y - b\| \leq \frac{r}{2\|A^{-1}\|}$. So let $W = B\!\left(b, \frac{r}{2\|A^{-1}\|}\right)$. For every $y \in W$, the map $T_y$ sends $\overline{B}(a, r)$ to itself.

Step 3: Fixed point existence and uniqueness. The closed ball $\overline{B}(a, r) \subset \mathbb{R}^n$ is a complete metric space (closed subsets of complete metric spaces are complete, and $\mathbb{R}^n$ is complete — this is where completeness enters the proof). The map $T_y: \overline{B}(a, r) \to \overline{B}(a, r)$ is a contraction with constant $\frac{1}{2}$.

By the Contraction Mapping Theorem (Theorem 2), $T_y$ has a unique fixed point $g(y) \in \overline{B}(a, r)$. In fact $g(y)$ lies in the open ball $B(a, r)$: for $y$ in the open ball $W$, the estimate of Step 2 is strict, giving $\|T_y(g(y)) - a\| < r$. Since $T_y(g(y)) = g(y)$ if and only if $f(g(y)) = y$, we have constructed a function $g: W \to V = B(a, r)$ such that $f(g(y)) = y$ for all $y \in W$.

Injectivity of $f$ on $V$: If $f(x_1) = f(x_2) = y$ for $x_1, x_2 \in V$, then both $x_1$ and $x_2$ are fixed points of $T_y$ on $\overline{B}(a, r)$. By uniqueness of the fixed point, $x_1 = x_2$. So $f|_V$ is injective.

Surjectivity onto $W$: For every $y \in W$, the fixed point $g(y)$ satisfies $f(g(y)) = y$, so $y \in f(V)$. Therefore $f(V) \supseteq W$. (Replacing $V$ by $V \cap f^{-1}(W)$, which is open by continuity of $f$, gives $f(V) = W$.)

This establishes part (1): $f: V \to W$ is a bijection with inverse $g$.

Step 4: $g$ is $C^1$. We show that $g$ is continuous, then differentiable, then $C^1$.

Continuity of $g$: For $y_1, y_2 \in W$, let $x_i = g(y_i)$, so $f(x_i) = y_i$. Then:

$$\begin{aligned}
\|x_1 - x_2\| &= \|T_{y_1}(x_1) - T_{y_2}(x_2)\| \\
&= \|(x_1 + A^{-1}(y_1 - f(x_1))) - (x_2 + A^{-1}(y_2 - f(x_2)))\| \\
&= \|(x_1 - x_2) + A^{-1}((y_1 - y_2) - (f(x_1) - f(x_2)))\| \\
&= \|T_{y_1}(x_1) - T_{y_1}(x_2) + A^{-1}(y_1 - y_2)\| \\
&\leq \|T_{y_1}(x_1) - T_{y_1}(x_2)\| + \|A^{-1}\|\|y_1 - y_2\| \\
&\leq \tfrac{1}{2}\|x_1 - x_2\| + \|A^{-1}\|\|y_1 - y_2\|.
\end{aligned}$$

Rearranging: $\|g(y_1) - g(y_2)\| = \|x_1 - x_2\| \leq 2\|A^{-1}\|\|y_1 - y_2\|$. So $g$ is Lipschitz (hence continuous).

Differentiability: Shrinking $r$ if necessary, $J_f(x)$ is invertible for all $x \in V$ (because $\det J_f$ is a continuous function that is nonzero at $a$, hence nonzero near $a$). A direct estimate using the Lipschitz bound on $g$ from the previous step shows that $g$ is differentiable at each $y \in W$; granting this, differentiate the identity $f(g(y)) = y$ using the chain rule:

$$J_f(g(y)) \cdot J_g(y) = I.$$

Solving: $J_g(y) = [J_f(g(y))]^{-1}$.

$C^1$ regularity: The map $y \mapsto J_g(y)$ is the composition

$$y \xrightarrow{\;g\;} g(y) \xrightarrow{\;J_f\;} J_f(g(y)) \xrightarrow{\;\mathrm{inv}\;} [J_f(g(y))]^{-1}.$$

Each arrow is continuous: $g$ is continuous (shown above), $J_f$ is continuous (by the $C^1$ hypothesis on $f$), and matrix inversion $M \mapsto M^{-1}$ is continuous on $GL(n)$ (the set of invertible matrices). Therefore $J_g$ is continuous, which means $g$ is $C^1$.

\square
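The map $T_y$ in the proof is not just a theoretical device — it is a Newton-type iteration with a frozen Jacobian, and it can be run as written. A minimal NumPy sketch (the sample map $f$ and the point $a$ are our choices; any $C^1$ map with non-singular Jacobian at $a$ works):

```python
import numpy as np

def f(p):
    # Sample C^1 map; det J_f = 1 - 4uv, which is 0.4 != 0 at a = (0.3, 0.5).
    u, v = p
    return np.array([u + v**2, v + u**2])

def jacobian(p):
    u, v = p
    return np.array([[1.0, 2 * v], [2 * u, 1.0]])

a = np.array([0.3, 0.5])
A_inv = np.linalg.inv(jacobian(a))       # A^{-1}, with A = J_f(a)

y = f(a) + np.array([0.01, -0.02])       # a target y near b = f(a)
x = a.copy()
for _ in range(50):                      # iterate T_y(x) = x + A^{-1} (y - f(x))
    x = x + A_inv @ (y - f(x))

assert np.allclose(f(x), y, atol=1e-10)  # the fixed point solves f(x) = y
```

The iterate converges to $g(y) = f^{-1}(y)$ exactly as Step 3 of the proof predicts.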

The proof uses the Contraction Mapping Theorem as a black box. Let us now state and prove it.

🔷 Theorem 2 (Contraction Mapping Theorem (Banach Fixed-Point Theorem))

Let (X,d)(X, d) be a complete metric space and let T:XXT: X \to X be a contraction: there exists λ[0,1)\lambda \in [0, 1) such that

d(T(x),T(y))λd(x,y)for all x,yX.d(T(x), T(y)) \leq \lambda \, d(x, y) \quad \text{for all } x, y \in X.

Then TT has a unique fixed point xXx^* \in X (i.e., T(x)=xT(x^*) = x^*). Moreover, for any starting point x0Xx_0 \in X, the iterates xk+1=T(xk)x_{k+1} = T(x_k) converge to xx^*, with the convergence rate bound

d(xk,x)λk1λd(x0,x1).d(x_k, x^*) \leq \frac{\lambda^k}{1 - \lambda} \, d(x_0, x_1).

Proof.

Existence. Pick any $x_0 \in X$ and define $x_{k+1} = T(x_k)$. We show $(x_k)$ is Cauchy. By the contraction property:

$$d(x_{k+1}, x_k) = d(T(x_k), T(x_{k-1})) \leq \lambda \, d(x_k, x_{k-1}) \leq \cdots \leq \lambda^k d(x_1, x_0).$$

For $m > n$, the triangle inequality gives:

$$d(x_n, x_m) \leq \sum_{k=n}^{m-1} d(x_k, x_{k+1}) \leq \sum_{k=n}^{m-1} \lambda^k d(x_0, x_1) \leq \frac{\lambda^n}{1 - \lambda}\, d(x_0, x_1).$$

Since $0 \leq \lambda < 1$, we have $\lambda^n \to 0$ as $n \to \infty$. For any $\varepsilon > 0$, choose $N$ such that $\frac{\lambda^N}{1-\lambda} d(x_0, x_1) < \varepsilon$. Then for all $m > n \geq N$, $d(x_n, x_m) < \varepsilon$. So $(x_k)$ is Cauchy.

Since $X$ is complete, the Cauchy sequence $(x_k)$ converges to some $x^* \in X$.

$x^*$ is a fixed point. Since $T$ is a contraction, it is Lipschitz, hence continuous. Therefore:

$$T(x^*) = T\!\left(\lim_{k \to \infty} x_k\right) = \lim_{k \to \infty} T(x_k) = \lim_{k \to \infty} x_{k+1} = x^*.$$

Uniqueness. Suppose $T(x^*) = x^*$ and $T(y^*) = y^*$. Then:

$$d(x^*, y^*) = d(T(x^*), T(y^*)) \leq \lambda \, d(x^*, y^*).$$

Since $\lambda < 1$, this implies $(1 - \lambda) d(x^*, y^*) \leq 0$, so $d(x^*, y^*) = 0$, hence $x^* = y^*$.

Rate bound. Letting $m \to \infty$ in the estimate $d(x_n, x_m) \leq \frac{\lambda^n}{1-\lambda} d(x_0, x_1)$:

$$d(x_n, x^*) = \lim_{m \to \infty} d(x_n, x_m) \leq \frac{\lambda^n}{1 - \lambda}\, d(x_0, x_1).$$

\square
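The theorem and its rate bound are easy to watch in action. A sketch using the classic example $T(x) = \cos x$, a contraction on $[0, 1]$ with Lipschitz constant $\lambda = \sin 1 < 1$ (this example is our choice, not from the text above):

```python
import math

T = math.cos            # contraction on [0, 1]: |T'(x)| = |sin x| <= sin(1) < 1
lam = math.sin(1.0)     # Lipschitz constant on [0, 1]

xs = [1.0]              # x_0 = 1, then x_{k+1} = T(x_k)
for _ in range(100):
    xs.append(T(xs[-1]))
x_star = xs[-1]         # approximates the unique fixed point cos(x*) = x*

assert abs(T(x_star) - x_star) < 1e-12

# A priori rate bound from Theorem 2: d(x_k, x*) <= lam^k / (1 - lam) * d(x_0, x_1).
d01 = abs(xs[1] - xs[0])
for k in range(50):
    assert abs(xs[k] - x_star) <= lam**k / (1 - lam) * d01 + 1e-12
```

Each iterate contracts the distance to the fixed point, and the geometric bound holds at every step.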

📝 Example 3 (Inverse Jacobian for the complex squaring map)

Define $f: \mathbb{R}^2 \to \mathbb{R}^2$ by $f(x, y) = (x^2 - y^2,\; 2xy)$ — this is the real-variable form of the complex map $z \mapsto z^2$, where $z = x + iy$.

The Jacobian is:

$$J_f(x, y) = \begin{pmatrix} 2x & -2y \\ 2y & 2x \end{pmatrix}, \qquad \det J_f = 4x^2 + 4y^2 = 4|z|^2.$$

At $a = (1, 1)$: $\det J_f(1, 1) = 8 \neq 0$, so the IFT guarantees a local inverse near $b = f(1, 1) = (0, 2)$.

The inverse Jacobian is:

$$[J_f(1,1)]^{-1} = \frac{1}{8}\begin{pmatrix} 2 & 2 \\ -2 & 2 \end{pmatrix} = \begin{pmatrix} 1/4 & 1/4 \\ -1/4 & 1/4 \end{pmatrix}.$$

This tells us that a small perturbation $\Delta y = (\Delta u, \Delta v)$ near $b = (0, 2)$ produces a perturbation $\Delta x \approx [J_f(1,1)]^{-1} \Delta y$ near $a = (1, 1)$.

The Jacobian is singular when $|z| = 0$, i.e., at the origin — exactly where the complex squaring map $z^2$ has a branch point. The IFT fails at the origin, and indeed $z \mapsto z^2$ is two-to-one on every punctured neighborhood of $0$.
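The inverse-Jacobian claim can be checked numerically: near $a = (1, 1)$ the local inverse of this map is the principal complex square root, and its finite-difference Jacobian at $b = (0, 2)$ should match $[J_f(1,1)]^{-1}$. A NumPy sketch:

```python
import numpy as np

def f(p):
    x, y = p
    return np.array([x**2 - y**2, 2 * x * y])   # real form of z -> z^2

def g(q):
    u, v = q
    w = np.sqrt(u + 1j * v)                     # principal branch: local inverse near z = 1 + i
    return np.array([w.real, w.imag])

a, b = np.array([1.0, 1.0]), np.array([0.0, 2.0])
assert np.allclose(f(a), b) and np.allclose(g(b), a)

J_f_inv = np.linalg.inv(np.array([[2.0, -2.0], [2.0, 2.0]]))   # [J_f(1,1)]^{-1}

h = 1e-6                                        # central finite differences for J_g at b
J_g = np.column_stack([(g(b + h * e) - g(b - h * e)) / (2 * h) for e in np.eye(2)])
assert np.allclose(J_g, J_f_inv, atol=1e-6)
```

The numerically differentiated inverse agrees with the matrix inverse of the forward Jacobian, as the IFT predicts.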

💡 Remark 2 (Higher regularity: the C^k version)

The IFT as stated gives a $C^1$ inverse when $f$ is $C^1$. But the regularity transfers: if $f$ is $C^k$ (all partial derivatives up to order $k$ exist and are continuous), then the local inverse $g$ is also $C^k$. If $f$ is $C^\infty$ (smooth), so is $g$. If $f$ is real-analytic, so is $g$.

The key to the induction is the derivative formula $J_g(y) = [J_f(g(y))]^{-1}$. If $g$ is $C^{k-1}$ and $J_f$ is $C^{k-1}$ (which follows from $f$ being $C^k$), then $J_g$ is a composition of $C^{k-1}$ maps with matrix inversion, hence $C^{k-1}$. But $J_g$ being $C^{k-1}$ means $g$ is $C^k$.

Polar coordinates: det J = r, singular at the origin


The Inverse Function Theorem: a C^1 map f with non-singular Jacobian at a point a is a local diffeomorphism — it maps a neighborhood of a bijectively onto a neighborhood of f(a), with a smooth inverse.

The Derivative of the Inverse

The derivative formula $J_g(y) = [J_f(g(y))]^{-1}$ from the IFT proof deserves its own spotlight — it is the tool you actually use to compute derivatives of inverse functions in practice.

🔷 Proposition 1 (Jacobian of the Inverse)

Let $f: V \to W$ be a $C^1$ diffeomorphism with inverse $g = f^{-1}: W \to V$. Then for every $y \in W$:

$$J_g(y) = [J_f(g(y))]^{-1}.$$

In components: if $f = (f_1, \ldots, f_n)$ and $g = (g_1, \ldots, g_n)$, then

$$\frac{\partial g_i}{\partial y_j}(y) = \left([J_f(g(y))]^{-1}\right)_{ij}.$$

In the scalar case $n = 1$, this reduces to the familiar formula $(f^{-1})'(y) = \frac{1}{f'(f^{-1}(y))}$.

Proof.

Differentiate $f(g(y)) = y$ with respect to $y$ using the chain rule (both $f$ and $g$ are $C^1$ by hypothesis):

$$J_f(g(y)) \cdot J_g(y) = I_n.$$

This identity exhibits a right inverse of $J_f(g(y))$; differentiating $g(f(x)) = x$ likewise yields a left inverse, so $J_f(g(y))$ is invertible. Multiplying both sides on the left by $[J_f(g(y))]^{-1}$:

$$J_g(y) = [J_f(g(y))]^{-1}.$$

\square

📝 Example 4 (Inverse derivative for polar coordinates)

Polar coordinates: $f(r, \theta) = (r\cos\theta, r\sin\theta)$ with inverse $g(x, y) = \left(\sqrt{x^2 + y^2},\; \arctan(y/x)\right)$ (for $x > 0$).

Via the formula: At the point $(r, \theta) = (2, \pi/6)$, so $(x, y) = (\sqrt{3}, 1)$:

$$J_f(2, \pi/6) = \begin{pmatrix} \cos(\pi/6) & -2\sin(\pi/6) \\ \sin(\pi/6) & 2\cos(\pi/6) \end{pmatrix} = \begin{pmatrix} \sqrt{3}/2 & -1 \\ 1/2 & \sqrt{3} \end{pmatrix}.$$

The determinant is $r = 2$. So:

$$J_g(\sqrt{3}, 1) = [J_f(2, \pi/6)]^{-1} = \frac{1}{2}\begin{pmatrix} \sqrt{3} & 1 \\ -1/2 & \sqrt{3}/2 \end{pmatrix} = \begin{pmatrix} \sqrt{3}/2 & 1/2 \\ -1/4 & \sqrt{3}/4 \end{pmatrix}.$$

Direct computation (verification):

$$\frac{\partial r}{\partial x} = \frac{x}{\sqrt{x^2+y^2}} = \frac{\sqrt{3}}{2}, \quad \frac{\partial r}{\partial y} = \frac{y}{\sqrt{x^2+y^2}} = \frac{1}{2},$$

$$\frac{\partial \theta}{\partial x} = \frac{-y}{x^2+y^2} = -\frac{1}{4}, \quad \frac{\partial \theta}{\partial y} = \frac{x}{x^2+y^2} = \frac{\sqrt{3}}{4}.$$

The results agree. The formula $J_g = [J_f(g(y))]^{-1}$ saves us the trouble of computing the inverse function and differentiating it directly — which in higher dimensions may not be feasible.
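The same check runs in a few lines of NumPy — invert the forward Jacobian and compare with the directly computed partials (a sketch):

```python
import numpy as np

r, th = 2.0, np.pi / 6
J_f = np.array([[np.cos(th), -r * np.sin(th)],
                [np.sin(th),  r * np.cos(th)]])
J_g = np.linalg.inv(J_f)                       # [J_f(g(y))]^{-1} at (x, y) = (sqrt(3), 1)

x, y = r * np.cos(th), r * np.sin(th)          # forward image (sqrt(3), 1)
J_direct = np.array([[ x / np.hypot(x, y), y / np.hypot(x, y)],   # dr/dx,     dr/dy
                     [-y / (x**2 + y**2),  x / (x**2 + y**2)]])   # dtheta/dx, dtheta/dy

assert np.allclose(J_g, J_direct)
```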

📝 Example 5 (Inverse derivative in a normalizing flow coupling layer)

An affine coupling layer in Real-NVP splits the input $z = (z_a, z_b) \in \mathbb{R}^d \times \mathbb{R}^{d'}$ and defines:

$$f(z_a, z_b) = (z_a,\; z_b \odot \exp(s(z_a)) + t(z_a))$$

where $s, t: \mathbb{R}^d \to \mathbb{R}^{d'}$ are neural networks and $\odot$ is elementwise multiplication. The Jacobian is block-triangular:

$$J_f = \begin{pmatrix} I_d & 0 \\ \frac{\partial}{\partial z_a}[z_b \odot e^{s(z_a)} + t(z_a)] & \operatorname{diag}(e^{s(z_a)}) \end{pmatrix}.$$

The determinant is $\det J_f = \prod_{j=1}^{d'} \exp(s_j(z_a)) = \exp\!\left(\sum_j s_j(z_a)\right)$, which is always positive — so the IFT hypothesis is satisfied everywhere. The inverse is explicit: $g(y_a, y_b) = (y_a,\; (y_b - t(y_a)) \odot \exp(-s(y_a)))$. The inverse Jacobian is:

$$J_g = [J_f]^{-1} = \begin{pmatrix} I_d & 0 \\ * & \operatorname{diag}(e^{-s(y_a)}) \end{pmatrix}$$

with $\log|\det J_g| = -\sum_j s_j(y_a)$. The entire normalizing flow density computation reduces to evaluating the $s$ network — this is why coupling layers are popular.
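A toy coupling layer makes both claims checkable: the explicit inverse really inverts the forward map, and $\log|\det J_f|$ really equals $\sum_j s_j(z_a)$. A NumPy sketch with $d = 2$, $d' = 3$ and random-weight stand-ins for the $s$ and $t$ networks (all names and shapes here are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
W_s, W_t = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
s = lambda za: np.tanh(W_s @ za)       # stand-in "networks" s, t : R^2 -> R^3
t = lambda za: W_t @ za

def forward(z):
    za, zb = z[:2], z[2:]
    return np.concatenate([za, zb * np.exp(s(za)) + t(za)])

def inverse(y):
    ya, yb = y[:2], y[2:]
    return np.concatenate([ya, (yb - t(ya)) * np.exp(-s(ya))])

z = rng.normal(size=5)
assert np.allclose(inverse(forward(z)), z)   # globally invertible by construction

# log|det J_f| = sum_j s_j(z_a), versus a finite-difference Jacobian determinant.
h = 1e-6
J = np.column_stack([(forward(z + h * e) - forward(z - h * e)) / (2 * h)
                     for e in np.eye(5)])
assert np.allclose(np.log(abs(np.linalg.det(J))), s(z[:2]).sum(), atol=1e-5)
```

Note that the determinant never depends on $t$ or on the off-diagonal block — exactly the block-triangular structure derived above.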

Computing the inverse Jacobian: the formula avoids computing the inverse function explicitly. The left panel shows the forward Jacobian field; the right panel shows the inverse Jacobian field.

The Implicit Function Theorem

The Inverse Function Theorem answers “when is $f(x) = y$ solvable for $x$ near a known solution?” The Implicit Function Theorem answers a more structured question: given a system $F(x, y) = 0$ where $x$ and $y$ play different roles, when can we solve for $y$ as a function of $x$?

📐 Definition 2 (Implicit Equation and Level Set)

Let $F: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ be a $C^1$ map. An implicit equation is the system $F(x, y) = 0$, where $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$. The level set (or solution set) is

$$S = \{(x, y) \in \mathbb{R}^{n+m} : F(x, y) = 0\}.$$

We say $y = g(x)$ is an implicit function defined by $F$ near a point $(a, b)$ if $g$ is a $C^1$ map defined on a neighborhood of $a$ such that $F(x, g(x)) = 0$ for all $x$ in that neighborhood, and $g(a) = b$.

To state the theorem, we need to decompose the Jacobian of $F$ into parts corresponding to the $x$-variables and $y$-variables.

📐 Definition 3 (Partial Jacobians)

Let $F: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ be $C^1$. At a point $(a, b)$, the partial Jacobian with respect to $x$ is the $m \times n$ matrix

$$D_x F(a,b) = \left(\frac{\partial F_i}{\partial x_j}(a,b)\right)_{\substack{1 \leq i \leq m \\ 1 \leq j \leq n}}$$

and the partial Jacobian with respect to $y$ is the $m \times m$ matrix

$$D_y F(a,b) = \left(\frac{\partial F_i}{\partial y_k}(a,b)\right)_{\substack{1 \leq i \leq m \\ 1 \leq k \leq m}}.$$

The full Jacobian of $F$ viewed as a map $\mathbb{R}^{n+m} \to \mathbb{R}^m$ is the $m \times (n+m)$ block matrix $J_F(a,b) = \begin{pmatrix} D_x F(a,b) & D_y F(a,b) \end{pmatrix}$.

🔷 Theorem 3 (Implicit Function Theorem (ImFT))

Let $F: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ be $C^1$ in a neighborhood of $(a, b) \in \mathbb{R}^{n+m}$. Suppose:

  • $F(a, b) = 0$, and
  • the partial Jacobian $D_y F(a, b)$ is invertible ($\det D_y F(a, b) \neq 0$).

Then there exist open sets $U \ni a$ in $\mathbb{R}^n$ and $V \ni b$ in $\mathbb{R}^m$ such that:

  1. Existence and uniqueness: For every $x \in U$, there is a unique $y = g(x) \in V$ with $F(x, g(x)) = 0$.
  2. Smoothness: The implicit function $g: U \to V$ is $C^1$.
  3. Derivative formula:

$$J_g(x) = -[D_y F(x, g(x))]^{-1} \, D_x F(x, g(x)).$$

Proof.

We derive the ImFT from the IFT by a standard augmentation trick.

Step 1: Construct an auxiliary map. Define $\Phi: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n \times \mathbb{R}^m$ by

$$\Phi(x, y) = (x, \, F(x, y)).$$

So $\Phi$ keeps $x$ unchanged and applies $F$ to get the second component. Note $\Phi(a, b) = (a, F(a,b)) = (a, 0)$.

Step 2: Compute the Jacobian of $\Phi$. The Jacobian of $\Phi$ at $(a, b)$ is the $(n+m) \times (n+m)$ block matrix:

$$J_\Phi(a, b) = \begin{pmatrix} I_n & 0 \\ D_x F(a,b) & D_y F(a,b) \end{pmatrix}.$$

The determinant of this block-triangular matrix is:

$$\det J_\Phi(a, b) = \det I_n \cdot \det D_y F(a, b) = \det D_y F(a, b) \neq 0$$

by hypothesis. So $J_\Phi(a, b)$ is invertible.

Step 3: Apply the IFT to $\Phi$. By the Inverse Function Theorem (Theorem 1), $\Phi$ is locally invertible near $(a, b)$: there exist neighborhoods $\widetilde{V} \ni (a, b)$ and $\widetilde{W} \ni (a, 0)$ and a $C^1$ inverse $\Psi = \Phi^{-1}: \widetilde{W} \to \widetilde{V}$.

Since $\Phi(x, y) = (x, F(x, y))$, the first component of $\Phi$ is the identity on $x$. Therefore the inverse $\Psi$ has the form $\Psi(x, w) = (x, \psi(x, w))$ for some $C^1$ function $\psi$, because the first component must invert the identity.

Step 4: Extract $g$. Set $w = 0$: the equation $\Phi(x, y) = (x, 0)$ means $F(x, y) = 0$. Its unique solution near $(a, b)$ is $(x, y) = \Psi(x, 0) = (x, \psi(x, 0))$. Define $g(x) = \psi(x, 0)$. Then:

  • $g(a) = \psi(a, 0) = b$ (since $\Psi(a, 0) = (a, b)$),
  • $F(x, g(x)) = 0$ for all $x$ near $a$,
  • $g$ is $C^1$ (as a restriction of the $C^1$ map $\psi$).

Step 5: Derive the derivative formula. Differentiate $F(x, g(x)) = 0$ with respect to $x$ using the chain rule:

$$D_x F(x, g(x)) + D_y F(x, g(x)) \cdot J_g(x) = 0.$$

Solving for $J_g(x)$ (using the invertibility of $D_y F$):

$$J_g(x) = -[D_y F(x, g(x))]^{-1}\, D_x F(x, g(x)).$$

\square

📝 Example 6 (The unit circle: x^2 + y^2 = 1)

Let $F(x, y) = x^2 + y^2 - 1$. The level set $F = 0$ is the unit circle $S^1$.

We have $F_x = 2x$ and $F_y = 2y$. The ImFT applies at any $(a, b) \in S^1$ where $F_y(a, b) = 2b \neq 0$, i.e., where $b \neq 0$.

Near $(0, 1)$: $F_y(0, 1) = 2 \neq 0$. The ImFT gives $g(x) = \sqrt{1 - x^2}$ (the upper semicircle) with

$$g'(x) = -\frac{F_x}{F_y} = -\frac{2x}{2\sqrt{1-x^2}} = -\frac{x}{\sqrt{1-x^2}}.$$

Near $(0, -1)$: $F_y(0, -1) = -2 \neq 0$. The ImFT gives $g(x) = -\sqrt{1 - x^2}$ (the lower semicircle).

At $(\pm 1, 0)$: $F_y = 0$. The ImFT does not apply, and indeed the circle has vertical tangent lines there — $y$ cannot be expressed as a function of $x$ near these points. However, $F_x(\pm 1, 0) = \pm 2 \neq 0$, so by applying the ImFT with the roles of $x$ and $y$ reversed, we can solve for $x = h(y)$ near these points.
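A quick numerical sanity check on the upper semicircle (a sketch; the sample point $x = 0.6$ is our choice):

```python
import numpy as np

g = lambda x: np.sqrt(1 - x**2)              # y = g(x), upper semicircle
gprime = lambda x: -x / np.sqrt(1 - x**2)    # ImFT formula -F_x / F_y

x0, eps = 0.6, 1e-6
fd = (g(x0 + eps) - g(x0 - eps)) / (2 * eps)   # finite-difference slope
assert abs(fd - gprime(x0)) < 1e-8             # gprime(0.6) = -0.75

# At (1, 0), F_y = 0 and y = g(x) fails, but with roles swapped x = h(y) works:
h = lambda y: np.sqrt(1 - y**2)
assert h(0.0) == 1.0
```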

📝 Example 7 (A 2D system: two constraints in four variables)

Let $F: \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}^2$ be defined by

$$F(x_1, x_2, y_1, y_2) = \begin{pmatrix} x_1 y_1 + x_2 y_2 - 1 \\ x_1^2 + y_1^2 - x_2 \end{pmatrix}.$$

A first guess $(a, b) = ((1, 1), (1, 0))$ is not on the level set: $F(1, 1, 1, 0) = (1 \cdot 1 + 1 \cdot 0 - 1,\; 1 + 1 - 1) = (0, 1) \neq 0$. Adjusting to $(a, b) = ((1, 2), (1, 0))$ works: $F(1, 2, 1, 0) = (1 + 0 - 1,\; 1 + 1 - 2) = (0, 0)$.

The partial Jacobian with respect to $(y_1, y_2)$:

$$D_y F = \begin{pmatrix} x_1 & x_2 \\ 2y_1 & 0 \end{pmatrix}.$$

At our point: $D_y F(1, 2, 1, 0) = \begin{pmatrix} 1 & 2 \\ 2 & 0 \end{pmatrix}$ with $\det D_y F = -4 \neq 0$.

The ImFT guarantees a $C^1$ function $g = (g_1, g_2): U \to \mathbb{R}^2$ with $g(1, 2) = (1, 0)$ and $F(x, g(x)) = 0$ near $(1, 2)$. The partial Jacobian with respect to $(x_1, x_2)$ is

$$D_x F = \begin{pmatrix} y_1 & y_2 \\ 2x_1 & -1 \end{pmatrix}$$

(note $\partial F_2 / \partial x_2 = -1$), so the derivative formula gives

$$J_g(x) = -[D_y F]^{-1} D_x F = -\frac{1}{-4}\begin{pmatrix} 0 & -2 \\ -2 & 1 \end{pmatrix}\begin{pmatrix} y_1 & y_2 \\ 2x_1 & -1 \end{pmatrix}.$$

At our point: $D_x F = \begin{pmatrix} 1 & 0 \\ 2 & -1 \end{pmatrix}$, so

$$J_g = \frac{1}{4}\begin{pmatrix} 0 & -2 \\ -2 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 2 & -1 \end{pmatrix} = \frac{1}{4}\begin{pmatrix} -4 & 2 \\ 0 & -1 \end{pmatrix} = \begin{pmatrix} -1 & 1/2 \\ 0 & -1/4 \end{pmatrix}.$$
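Since no closed form for $g$ is available here, we can realize $g$ numerically (Newton's method in $y$, holding $x$ fixed) and compare a finite-difference Jacobian against the ImFT formula — in particular this confirms the entry $\partial F_2 / \partial x_2 = -1$ in $D_x F$. A NumPy sketch (function names are ours):

```python
import numpy as np

def F(x, y):
    x1, x2 = x; y1, y2 = y
    return np.array([x1 * y1 + x2 * y2 - 1.0, x1**2 + y1**2 - x2])

def DyF(x, y):
    x1, x2 = x; y1, _ = y
    return np.array([[x1, x2], [2 * y1, 0.0]])

def g(x):
    """Solve F(x, y) = 0 for y by Newton's method, starting near b = (1, 0)."""
    y = np.array([1.0, 0.0])
    for _ in range(50):
        y = y - np.linalg.solve(DyF(x, y), F(x, y))
    return y

a = np.array([1.0, 2.0])
assert np.allclose(F(a, g(a)), 0.0)

eps = 1e-6                                   # finite-difference Jacobian of g at a
J_g = np.column_stack([(g(a + eps * e) - g(a - eps * e)) / (2 * eps) for e in np.eye(2)])

DxF = np.array([[1.0, 0.0], [2.0, -1.0]])    # D_x F at (a, b); note dF2/dx2 = -1
expected = -np.linalg.solve(DyF(a, g(a)), DxF)
assert np.allclose(J_g, expected, atol=1e-5)
```

The numerically differentiated implicit function matches $-[D_y F]^{-1} D_x F$ entry by entry.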

💡 Remark 3 (The ImFT as a corollary of the IFT)

The proof above makes the logical relationship clear: the Implicit Function Theorem is a corollary of the Inverse Function Theorem. The augmentation trick — defining $\Phi(x, y) = (x, F(x, y))$ to turn a rectangular system into a square one — is the standard reduction. Conversely, the IFT is a special case of the ImFT: to find $x$ such that $f(x) = y$, set $F(x, y) = f(x) - y$ and apply the ImFT. The two theorems are logically equivalent.


Unit circle: F_y = 2y = 0 at (±1, 0) — vertical tangents where the ImFT fails for y = g(x)

Implicit curves defined by F(x,y) = 0 for various functions F. The Implicit Function Theorem guarantees local graph structure wherever the gradient of F is nonzero — the curve has a well-defined tangent line and can be locally parameterized.

Implicit Differentiation

The derivative formula from the ImFT is the rigorous version of the “implicit differentiation” technique from introductory calculus. Let us spell it out and apply it.

🔷 Proposition 2 (Implicit Differentiation Formula)

Let $F: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ be $C^1$ with $F(a, b) = 0$ and $\det D_y F(a, b) \neq 0$. If $y = g(x)$ is the implicit function guaranteed by the ImFT, then:

$$J_g(x) = -[D_y F(x, g(x))]^{-1}\, D_x F(x, g(x)).$$

In the scalar case $n = m = 1$ (one equation $F(x, y) = 0$, one unknown $y$):

$$g'(x) = -\frac{F_x(x, g(x))}{F_y(x, g(x))}.$$

📝 Example 8 (Implicit differentiation on the circle (verification))

F(x,y)=x2+y21F(x, y) = x^2 + y^2 - 1. Near (0,1)(0, 1), the ImFT gives g(x)=1x2g(x) = \sqrt{1 - x^2}.

Implicit differentiation:

g(x)=FxFy=2x2y=xy=xg(x)=x1x2.g'(x) = -\frac{F_x}{F_y} = -\frac{2x}{2y} = -\frac{x}{y} = -\frac{x}{g(x)} = -\frac{x}{\sqrt{1 - x^2}}.

Direct differentiation of g(x)=1x2g(x) = \sqrt{1-x^2}:

g(x)=2x21x2=x1x2.g'(x) = \frac{-2x}{2\sqrt{1-x^2}} = -\frac{x}{\sqrt{1-x^2}}.

The results agree. The point of implicit differentiation is that we obtained gg' without solving for gg explicitly — we only needed the partial derivatives of FF. In higher dimensions, where explicit solutions are typically impossible, the implicit formula is the only option.
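The agreement above is easy to check numerically; a minimal sketch (assuming NumPy) comparing the ImFT formula against a central finite difference:

```python
import numpy as np

# Upper unit circle as an implicit function: g(x) = sqrt(1 - x^2)
def g(x):
    return np.sqrt(1.0 - x**2)

# ImFT formula: g'(x) = -F_x / F_y with F(x, y) = x^2 + y^2 - 1
def g_prime_imft(x):
    return -(2.0 * x) / (2.0 * g(x))

# Independent check: central finite difference
x, h = 0.5, 1e-6
fd = (g(x + h) - g(x - h)) / (2.0 * h)
print(g_prime_imft(x), fd)   # both ≈ -0.5774 = -x / sqrt(1 - x^2)
```

The two values agree to the accuracy of the finite difference, with the implicit formula needing only the partials of F.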

📝 Example 9 (Tangent to an elliptic curve — singular points)

Consider the elliptic curve F(x,y)=y2x3+x=0F(x, y) = y^2 - x^3 + x = 0.

Partial derivatives: Fx=3x2+1F_x = -3x^2 + 1, Fy=2yF_y = 2y.

Implicit derivative: g(x)=3x2+12y=3x212yg'(x) = -\frac{-3x^2 + 1}{2y} = \frac{3x^2 - 1}{2y}.

At (0,0)(0, 0): F(0,0)=0F(0, 0) = 0 (the origin is on the curve), but Fy(0,0)=0F_y(0, 0) = 0, so the ImFT does not give y=g(x)y = g(x). However, Fx(0,0)=10F_x(0, 0) = 1 \neq 0, so we can solve for xx as a function of yy near the origin: x=h(y)x = h(y) with h(0)=Fy/Fx=0h'(0) = -F_y/F_x = 0. The curve passes through the origin with a vertical tangent. Indeed, near (0,0)(0, 0), y2=x3xxy^2 = x^3 - x \approx -x for small xx, so xy2x \approx -y^2: the branch lives only in x0x \leq 0 and is perfectly smooth as a graph over the yy-axis — there is no cusp, only a failure of the y=g(x)y = g(x) description.

At (1,0)(1, 0): F(1,0)=01+1=0F(1, 0) = 0 - 1 + 1 = 0 and Fy(1,0)=0F_y(1, 0) = 0, but Fx(1,0)=3+1=20F_x(1, 0) = -3 + 1 = -2 \neq 0. Similar analysis: we can solve for x=h(y)x = h(y) but not y=g(x)y = g(x).

At (1,1)(1, 1): F(1,1)=11+1=10F(1, 1) = 1 - 1 + 1 = 1 \neq 0. This point is not on the curve.

At a regular point like (1,1)(-1, 1): F(1,1)=1+11=10F(-1, 1) = 1 + 1 - 1 = 1 \neq 0. Also not on the curve. Let us find a regular point: (x,y)(x, y) with y2=x3xy^2 = x^3 - x and Fy=2y0F_y = 2y \neq 0. Take x=2x = 2: y2=82=6y^2 = 8 - 2 = 6, so y=6y = \sqrt{6}. Then g(2)=12126=1126g'(2) = \frac{12 - 1}{2\sqrt{6}} = \frac{11}{2\sqrt{6}}.

The singular points of an elliptic curve (where both Fx=0F_x = 0 and Fy=0F_y = 0 on the curve) are precisely where the ImFT fails in both the xx-for-yy and yy-for-xx directions. For a non-singular elliptic curve, the gradient F\nabla F never vanishes on the curve, so the ImFT guarantees local graph structure at every point of the curve.
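The value g(2)=11/(26)g'(2) = 11/(2\sqrt{6}) computed above can be validated numerically; a minimal check (assuming NumPy), using the explicit branch g(x)=x3xg(x) = \sqrt{x^3 - x} only to verify the implicit formula:

```python
import numpy as np

# Upper branch of y^2 = x^3 - x for x >= sqrt(... ) near x = 2: g(x) = sqrt(x^3 - x)
def g(x):
    return np.sqrt(x**3 - x)

x = 2.0
y = g(x)                              # sqrt(6)
slope = (3 * x**2 - 1) / (2 * y)      # ImFT formula: 11 / (2 sqrt(6)) ≈ 2.245

h = 1e-6
fd = (g(x + h) - g(x - h)) / (2 * h)  # finite-difference check
print(slope, fd)
```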

💡 Remark 4 (Second-order implicit differentiation)

The ImFT derivative formula gives the first derivative of gg. To find g(x)g''(x) (which connects to the Hessian), differentiate g(x)=Fx/Fyg'(x) = -F_x/F_y using the quotient rule and the chain rule — remembering that y=g(x)y = g(x) depends on xx:

g(x)=FxxFy22FxyFxFy+FyyFx2Fy3g''(x) = -\frac{F_{xx} F_y^2 - 2F_{xy} F_x F_y + F_{yy} F_x^2}{F_y^3}

(all partials evaluated at (x,g(x))(x, g(x))). This expression involves the second-order partials FxxF_{xx}, FxyF_{xy}, FyyF_{yy} — the entries of the Hessian of FF. The curvature of the implicit curve F(x,y)=0F(x, y) = 0 is governed by the interplay between the Hessian of FF and the gradient of FF. This is the bridge between the implicit function machinery and the second-order analysis from Topic 11.
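The second-order formula can be sanity-checked on the circle, where g(x)=1x2g(x) = \sqrt{1 - x^2} gives g(x)=1/(1x2)3/2g''(x) = -1/(1 - x^2)^{3/2} by direct differentiation (a minimal check, assuming NumPy):

```python
import numpy as np

# Verify the second-order formula on F(x, y) = x^2 + y^2 - 1:
# F_x = 2x, F_y = 2y, F_xx = F_yy = 2, F_xy = 0
def g_dd_implicit(x):
    y = np.sqrt(1 - x**2)
    Fx, Fy = 2 * x, 2 * y
    Fxx, Fxy, Fyy = 2.0, 0.0, 2.0
    return -(Fxx * Fy**2 - 2 * Fxy * Fx * Fy + Fyy * Fx**2) / Fy**3

x = 0.3
direct = -1.0 / (1 - x**2) ** 1.5    # g''(x) for g(x) = sqrt(1 - x^2)
print(g_dd_implicit(x), direct)      # agree: both reduce to -1 / y^3
```

On the circle both expressions simplify to (x2+y2)/y3=1/y3-(x^2 + y^2)/y^3 = -1/y^3, which is what the numeric comparison confirms.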

Implicit differentiation in higher dimensions: the derivative formula computes the derivative of the implicit function without solving for it explicitly. The left panel shows a constraint surface in 3D; the right panel shows the resulting implicit function and its derivative.

Constraint Optimization & Lagrange Multipliers

The ImFT provides the theoretical backbone for constrained optimization — the setting where we minimize a function subject to equality constraints.

📐 Definition 4 (Constrained Optimization Problem)

Given f:RnRf: \mathbb{R}^n \to \mathbb{R} (the objective function) and G:RnRmG: \mathbb{R}^n \to \mathbb{R}^m with m<nm < n (the constraint map), the constrained optimization problem is:

minxRnf(x)subject toG(x)=0.\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad G(x) = 0.

The constraint set (or feasible set) is S={xRn:G(x)=0}S = \{x \in \mathbb{R}^n : G(x) = 0\}.

🔷 Theorem 4 (Lagrange Multiplier Necessary Condition)

Let f:RnRf: \mathbb{R}^n \to \mathbb{R} and G=(G1,,Gm):RnRmG = (G_1, \ldots, G_m): \mathbb{R}^n \to \mathbb{R}^m be C1C^1. Suppose aS=G1(0)a \in S = G^{-1}(0) is a local extremum of ff restricted to SS.

Constraint qualification: If the Jacobian JG(a)J_G(a) has full rank mm (equivalently, the gradients G1(a),,Gm(a)\nabla G_1(a), \ldots, \nabla G_m(a) are linearly independent), then there exist unique Lagrange multipliers λ1,,λmR\lambda_1, \ldots, \lambda_m \in \mathbb{R} such that

f(a)=i=1mλiGi(a),\nabla f(a) = \sum_{i=1}^{m} \lambda_i \nabla G_i(a),

or equivalently, f(a)=JG(a)Tλ\nabla f(a) = J_G(a)^T \lambda where λ=(λ1,,λm)T\lambda = (\lambda_1, \ldots, \lambda_m)^T.

In the scalar constraint case (m=1m = 1, single constraint g(x)=0g(x) = 0): f(a)=λg(a)\nabla f(a) = \lambda \nabla g(a).

Proof.

We use the ImFT to reduce the constrained problem to an unconstrained one.

Step 1: Apply the ImFT. Since JG(a)J_G(a) has rank mm, there is an m×mm \times m submatrix of JG(a)J_G(a) with nonzero determinant. After reindexing, we may assume that the last mm columns of JG(a)J_G(a) form an invertible matrix. Write x=(u,v)x = (u, v) with uRnmu \in \mathbb{R}^{n-m} (the “free” variables) and vRmv \in \mathbb{R}^m (the “constrained” variables). Then

DvG(a) is invertible.D_v G(a) \text{ is invertible.}

By the Implicit Function Theorem, there exists a C1C^1 function v=φ(u)v = \varphi(u) defined near u0=(a1,,anm)u_0 = (a_1, \ldots, a_{n-m}) such that G(u,φ(u))=0G(u, \varphi(u)) = 0. The constraint surface SS is locally the graph of φ\varphi.

Step 2: Reduce to unconstrained optimization. Define h(u)=f(u,φ(u))h(u) = f(u, \varphi(u)). Since aa is a local extremum of ff on SS, and SS is locally parametrized by u(u,φ(u))u \mapsto (u, \varphi(u)), the point u0u_0 is a local extremum of hh on Rnm\mathbb{R}^{n-m}. By the standard first-order necessary condition (no constraints), uh(u0)=0\nabla_u h(u_0) = 0, which gives:

fuj(a)+k=1mfvk(a)φkuj(u0)=0for j=1,,nm.\frac{\partial f}{\partial u_j}(a) + \sum_{k=1}^{m} \frac{\partial f}{\partial v_k}(a) \cdot \frac{\partial \varphi_k}{\partial u_j}(u_0) = 0 \quad \text{for } j = 1, \ldots, n-m.

Step 3: Use the ImFT derivative formula. From the ImFT, Jφ(u0)=[DvG(a)]1DuG(a)J_\varphi(u_0) = -[D_v G(a)]^{-1} D_u G(a). Substituting:

uf(a)vf(a)T[DvG(a)]1DuG(a)=0.\nabla_u f(a) - \nabla_v f(a)^T [D_v G(a)]^{-1} D_u G(a) = 0.

Define λT=vf(a)T[DvG(a)]1\lambda^T = \nabla_v f(a)^T [D_v G(a)]^{-1}, i.e., λ=[DvG(a)]Tvf(a)\lambda = [D_v G(a)]^{-T} \nabla_v f(a). Then:

uf(a)=λTDuG(a)=i=1mλiGiu(a).\nabla_u f(a) = \lambda^T D_u G(a) = \sum_{i=1}^{m} \lambda_i \frac{\partial G_i}{\partial u}(a).

And from the definition of λ\lambda:

vf(a)=DvG(a)Tλ=i=1mλiGiv(a).\nabla_v f(a) = D_v G(a)^T \lambda = \sum_{i=1}^{m} \lambda_i \frac{\partial G_i}{\partial v}(a).

Combining both sets of equations: f(a)=i=1mλiGi(a)\nabla f(a) = \sum_{i=1}^{m} \lambda_i \nabla G_i(a).

\square

📝 Example 10 (Maximum entropy on the probability simplex)

Maximize the entropy H(p)=i=1npilnpiH(p) = -\sum_{i=1}^{n} p_i \ln p_i subject to the constraint G(p)=i=1npi1=0G(p) = \sum_{i=1}^{n} p_i - 1 = 0 (probabilities sum to 1). Set f=Hf = -H to convert to a minimization problem.

The Lagrange condition f(a)=λG(a)\nabla f(a) = \lambda \nabla G(a) gives:

lnpi+1=λfor each i.\ln p_i + 1 = \lambda \quad \text{for each } i.

So pi=eλ1p_i = e^{\lambda - 1} for all ii. The constraint pi=1\sum p_i = 1 gives neλ1=1n \cdot e^{\lambda-1} = 1, hence pi=1/np_i = 1/n for all ii. The uniform distribution maximizes entropy on the simplex — a foundational result in information theory.

The constraint qualification holds: G=(1,1,,1)0\nabla G = (1, 1, \ldots, 1) \neq 0, so JGJ_G has rank 1 everywhere.

📝 Example 11 (Nearest point on a constraint surface)

Find the point on the surface G(x,y,z)=x2+y2+z24xz=0G(x, y, z) = x^2 + y^2 + z^2 - 4xz = 0 nearest to the origin. We minimize f(x,y,z)=x2+y2+z2f(x, y, z) = x^2 + y^2 + z^2 (distance squared) subject to G=0G = 0.

Lagrange condition: f=λG\nabla f = \lambda \nabla G.

f=(2x,2y,2z),G=(2x4z,  2y,  2z4x).\nabla f = (2x, 2y, 2z), \quad \nabla G = (2x - 4z,\; 2y,\; 2z - 4x).

This gives the system:

  • 2x=λ(2x4z)2x = \lambda(2x - 4z)
  • 2y=2λy2y = 2\lambda y
  • 2z=λ(2z4x)2z = \lambda(2z - 4x)

From the second equation: either y=0y = 0 or λ=1\lambda = 1.

Case λ=1\lambda = 1: Equation 1 gives 2x=2x4z2x = 2x - 4z, so z=0z = 0. Equation 3 gives 2z=2z4x2z = 2z - 4x, so x=0x = 0. Then G(0,y,0)=y2=0G(0, y, 0) = y^2 = 0, so y=0y = 0. The origin satisfies G=0G = 0 but f=0f = 0 — this is the minimum, but it’s a degenerate case (the origin is on the surface).

Case y=0y = 0: Equations 1 and 3 become 2x=λ(2x4z)2x = \lambda(2x - 4z) and 2z=λ(2z4x)2z = \lambda(2z - 4x). Adding: 2(x+z)=λ(2x4z+2z4x)=2λ(x+z)2(x+z) = \lambda(2x - 4z + 2z - 4x) = -2\lambda(x+z), so (1+λ)(x+z)=0(1+\lambda)(x+z) = 0. If λ=1\lambda = -1: 2x=2x+4z2x = -2x + 4z, giving 4x=4z4x = 4z, so x=zx = z; then G=x2+x24x2=2x2=0G = x^2 + x^2 - 4x^2 = -2x^2 = 0, requiring x=0x = 0. If x+z=0x + z = 0: z=xz = -x; then G=x2+x2+4x2=6x2=0G = x^2 + x^2 + 4x^2 = 6x^2 = 0, giving x=0x = 0 again.

In every case the Lagrange system forces the point to the origin, and the geometry explains why: GG is homogeneous of degree 2, so the surface G=0G = 0 is a cone through the origin — it contains the entire ray through any of its points (for instance (t,0,(2+3)t)(t, 0, (2+\sqrt{3})t) for all tt). The squared distance therefore has infimum 0 on the punctured surface and attains no constrained minimum away from the origin; the only candidate is the origin itself, where G=0\nabla G = 0 and the constraint qualification fails.
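A short numeric illustration of this degeneracy (assuming NumPy): the constraint map is homogeneous of degree 2, so scaling any point on the zero set stays on the zero set while the distance to the origin shrinks to 0.

```python
import numpy as np

def G(x, y, z):
    return x**2 + y**2 + z**2 - 4 * x * z

# G is homogeneous of degree 2, so G = 0 is a cone: it contains the whole
# ray through any of its points. With y = 0, solving x^2 + z^2 = 4xz gives
# z = (2 ± sqrt(3)) x, so nonzero points do lie on the surface.
p = np.array([1.0, 0.0, 2.0 + np.sqrt(3.0)])
print(G(*p))                         # ≈ 0
for t in (0.5, 0.1, 0.01):
    q = t * p
    print(G(*q), q @ q)              # still on the cone; distance^2 -> 0
```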

Interactive demo readout: minimizing squared distance to the origin on a line gives the optimum (c/2,c/2)(c/2, c/2) with multiplier λ=c\lambda = c (shown for c=2c = 2: optimum (1,1)(1, 1), f=2f = 2, λ=2\lambda = 2).

Lagrange multipliers: at a constrained extremum, the gradient of the objective is a linear combination of the constraint gradients. The left panel shows contours of f, the constraint curve G = 0, and the tangent/normal vectors. At the optimum, ∇f is parallel to ∇G — perpendicular to the constraint surface.

The Contraction Mapping Principle

The Contraction Mapping Theorem (Theorem 2) played a supporting role in the IFT proof, but it is a powerful tool in its own right — one of the most versatile fixed-point theorems in analysis.

📐 Definition 5 (Contraction Mapping)

Let (X,d)(X, d) be a metric space. A map T:XXT: X \to X is a contraction (or contraction mapping) if there exists a constant λ[0,1)\lambda \in [0, 1) such that

d(T(x),T(y))λd(x,y)for all x,yX.d(T(x), T(y)) \leq \lambda \, d(x, y) \quad \text{for all } x, y \in X.

The constant λ\lambda is the contraction constant (or Lipschitz constant). Every contraction is continuous (it is Lipschitz), but the crucial point is that λ<1\lambda < 1 — each application of TT brings points strictly closer together.

💡 Remark 5 (Newton's method as approximate contraction)

Newton’s method for solving f(x)=0f(x) = 0 defines the iteration xk+1=xk[Jf(xk)]1f(xk)x_{k+1} = x_k - [J_f(x_k)]^{-1} f(x_k). Near a root xx^* where Jf(x)J_f(x^*) is invertible, the Newton map N(x)=x[Jf(x)]1f(x)N(x) = x - [J_f(x)]^{-1}f(x) satisfies DN(x)=0DN(x^*) = 0 (the derivative of the Newton map vanishes at the fixed point). This means NN is super-contractive near xx^* — the contraction constant is not merely less than 1, but tends to 0, which is why Newton’s method converges quadratically rather than geometrically.

The connection to the IFT is direct: the map Ty(x)=x+A1(yf(x))T_y(x) = x + A^{-1}(y - f(x)) used in the IFT proof is a “frozen Newton” iteration — it uses the fixed matrix A1=[Jf(a)]1A^{-1} = [J_f(a)]^{-1} instead of updating [Jf(xk)]1[J_f(x_k)]^{-1} at each step. Freezing the matrix makes the analysis cleaner (the contraction constant is uniform) at the cost of slower convergence (linear instead of quadratic).
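The trade-off can be seen on a one-variable example; a sketch (plain Python, no library assumptions) comparing full Newton with the "frozen Newton" iteration on f(x)=x32f(x) = x^3 - 2:

```python
# Compare full Newton with the "frozen Newton" iteration from the IFT proof
# on f(x) = x^3 - 2, whose root is 2^(1/3) ≈ 1.2599.
f = lambda x: x**3 - 2
df = lambda x: 3 * x**2

root = 2 ** (1 / 3)
x_newton = x_frozen = 1.0
A_inv = 1 / df(1.0)        # freeze the derivative at the starting point

for k in range(6):
    x_newton -= f(x_newton) / df(x_newton)   # quadratic convergence
    x_frozen -= A_inv * f(x_frozen)          # linear convergence
    print(k, abs(x_newton - root), abs(x_frozen - root))
```

After six steps, full Newton is accurate to machine precision while the frozen iteration still carries an error around 10^-2 — exactly the quadratic-versus-linear gap described above.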

📝 Example 12 (Fixed-point iteration for x = cos(x))

The equation x=cos(x)x = \cos(x) has a unique solution x0.7390851332x^* \approx 0.7390851332 — the Dottie number. Define T(x)=cos(x)T(x) = \cos(x) on [0,1][0, 1].

Is TT a contraction? We need T(x)=sin(x)λ<1|T'(x)| = |\sin(x)| \leq \lambda < 1 for all x[0,1]x \in [0, 1]. Since sin(x)sin(1)0.8415<1|\sin(x)| \leq \sin(1) \approx 0.8415 < 1 on [0,1][0, 1], the map TT is a contraction with λ0.8415\lambda \approx 0.8415. Also, T([0,1])=[cos(1),1][0.5403,1][0,1]T([0,1]) = [\cos(1), 1] \approx [0.5403, 1] \subset [0, 1].

Starting from x0=0x_0 = 0: the iterates x1=1x_1 = 1, x2=0.5403x_2 = 0.5403, x3=0.8576x_3 = 0.8576, x4=0.6543x_4 = 0.6543, … converge (slowly) to x0.7391x^* \approx 0.7391. The rate bound gives:

xkx0.8415k10.8415x1x0=0.8415k0.158516.310.8415k.|x_k - x^*| \leq \frac{0.8415^k}{1 - 0.8415} \cdot |x_1 - x_0| = \frac{0.8415^k}{0.1585} \cdot 1 \approx 6.31 \cdot 0.8415^k.

After k=50k = 50 iterations: 0.8415501.8×1040.8415^{50} \approx 1.8 \times 10^{-4}, so the bound gives x50x1.1×103|x_{50} - x^*| \lesssim 1.1 \times 10^{-3} — about 3 digits of accuracy. Convergence is slow because λ0.84\lambda \approx 0.84 is close to 1 — contrast this with the IFT proof, where λ=1/2\lambda = 1/2 gives an extra digit of accuracy every 3-4 iterations.
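The iteration itself is a one-liner; a minimal check of the Dottie number (standard library only):

```python
import math

# Iterate T(x) = cos(x) from x0 = 0; the contraction bound guarantees the
# error shrinks at least like 6.31 * 0.8415^k.
x = 0.0
for k in range(60):
    x = math.cos(x)
print(x)   # ≈ 0.7390851332, the Dottie number
```

The actual convergence is a bit faster than the worst-case bound, since the local rate is sin(x)0.67|\sin(x^*)| \approx 0.67 rather than 0.84.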


The Contraction Mapping Principle: iterating T(x) = cos(x) from any starting point produces a sequence that converges to the unique fixed point x* ≈ 0.739. The cobweb diagram (left) and error plot (right) show the geometric convergence rate.

Connections to ML

The IFT and ImFT are not abstract existence theorems that sit on the shelf — they are active structural guarantees used throughout modern machine learning. Here are three central connections.

Normalizing flows and invertible neural networks

A normalizing flow transforms a simple base distribution p0(z0)p_0(z_0) into a complex target distribution by composing invertible layers zK=fKf1(z0)z_K = f_K \circ \cdots \circ f_1(z_0). The change-of-variables formula gives the density:

logpK(zK)=logp0(z0)k=1KlogdetJfk(zk1).\log p_K(z_K) = \log p_0(z_0) - \sum_{k=1}^{K} \log |\det J_{f_k}(z_{k-1})|.

The IFT is the theoretical guarantee behind this formula: at each layer, if detJfk0\det J_{f_k} \neq 0, the layer is locally invertible and the density formula is valid. Architectures like Real-NVP, Glow, and Neural Spline Flows are designed so that each layer is globally invertible by construction (triangular Jacobians, monotone splines), sidestepping the “local vs. global” issue. But residual flows f(x)=x+g(x)f(x) = x + g(x) require the IFT: the Jacobian I+Jg(x)I + J_g(x) must be invertible at every point, which holds when Jg(x)<1\|J_g(x)\| < 1 (the spectral norm is less than 1) — making the map a contraction-like perturbation of the identity.
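A minimal sketch of inverting such a residual layer (assuming NumPy; the tanh map gg here is a hypothetical example, rescaled so its spectral-norm bound is 0.5): since gg is a contraction, the inverse of f(x)=x+g(x)f(x) = x + g(x) can be found by iterating xyg(x)x \mapsto y - g(x).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))
W *= 0.5 / np.linalg.norm(W, 2)   # rescale so the spectral norm is 0.5 < 1

g = lambda x: np.tanh(W @ x)      # Lipschitz constant <= ||W||_2 = 0.5
f = lambda x: x + g(x)            # residual layer; invertible since ||J_g|| < 1

y = f(np.array([0.3, -1.0, 0.7]))  # forward pass
x = np.zeros(3)
for _ in range(50):                # invert: x = y - g(x) is a contraction in x
    x = y - g(x)
print(np.abs(f(x) - y).max())      # ≈ 0: preimage recovered
```

The iteration converges geometrically with rate 0.5, so 50 steps recover the preimage to machine precision.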

See Gradient Descent → formalML for the optimization perspective on these architectures.

Deep Equilibrium Models and implicit layers

A Deep Equilibrium Model (DEQ) defines its output implicitly: instead of stacking LL layers zl+1=fθ(zl)z_{l+1} = f_\theta(z_l), it finds the fixed point z=fθ(z)z^* = f_\theta(z^*) — the “equilibrium” of the infinite-depth limit. The output is z(θ,x)z^*(\theta, x), which depends implicitly on the parameters θ\theta and input xx.

How do we differentiate through this implicit layer? The ImFT. Define F(θ,z)=fθ(z)zF(\theta, z) = f_\theta(z) - z. At the fixed point, F(θ,z)=0F(\theta, z^*) = 0. The partial Jacobian is DzF=Jfθ(z)ID_z F = J_{f_\theta}(z^*) - I. If IJfθ(z)I - J_{f_\theta}(z^*) is invertible (i.e., 11 is not an eigenvalue of Jfθ(z)J_{f_\theta}(z^*)), the ImFT guarantees that zz^* varies smoothly with θ\theta, and the derivative is:

zθ=[DzF]1DθF=[IJfθ(z)]1fθθ(z).\frac{\partial z^*}{\partial \theta} = -[D_z F]^{-1} D_\theta F = [I - J_{f_\theta}(z^*)]^{-1} \frac{\partial f_\theta}{\partial \theta}(z^*).

This is implicit differentiation — we compute gradients through the equilibrium without backpropagating through the fixed-point iteration. The ImFT is not just a convenience here; it is the theoretical foundation guaranteeing that the gradient exists and is well-defined.

See Smooth Manifolds → formalML for the geometric perspective on implicit layers.

Constrained optimization on manifolds

Many ML problems involve optimization on constraint sets that are smooth manifolds: the Stiefel manifold Vk(Rn)V_k(\mathbb{R}^n) (orthonormal kk-frames, used in spectral methods), the probability simplex Δn1\Delta^{n-1} (used in attention mechanisms and mixture models), and the manifold of positive definite matrices S++nS^n_{++} (used in covariance estimation).

The ImFT is the theorem that makes these sets manifolds in the first place. The Stiefel manifold is the level set {WRn×k:WTW=Ik}\{W \in \mathbb{R}^{n \times k} : W^T W = I_k\} of the map F(W)=WTWIkF(W) = W^T W - I_k. The ImFT (via the Regular Value Theorem, Theorem 5 below) guarantees that this is a smooth manifold of dimension nkk(k+1)2nk - \frac{k(k+1)}{2}, with tangent spaces that can be computed from the Jacobian of FF.
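The dimension count can be verified numerically; a sketch (assuming NumPy) that forms the constraint map F(W)=WTWIkF(W) = W^T W - I_k, takes a finite-difference Jacobian at a random orthonormal WW, and checks its rank for n=5n = 5, k=2k = 2:

```python
import numpy as np

n, k = 5, 2
rng = np.random.default_rng(1)
W, _ = np.linalg.qr(rng.normal(size=(n, k)))   # a point on the Stiefel manifold

# F(W) = W^T W - I_k has k(k+1)/2 independent (symmetric) components
def F(w_flat):
    M = w_flat.reshape(n, k)
    S = M.T @ M - np.eye(k)
    return S[np.triu_indices(k)]

# Numerical Jacobian of F at W by central differences
h = 1e-6
w0 = W.ravel()
J = np.array([(F(w0 + h * e) - F(w0 - h * e)) / (2 * h)
              for e in np.eye(n * k)]).T

rank = np.linalg.matrix_rank(J, tol=1e-8)
print(rank, n * k - rank)   # rank k(k+1)/2 = 3, manifold dimension 10 - 3 = 7
```

Full rank of this Jacobian at every orthonormal WW is exactly the regular-value condition, so the Stiefel manifold has dimension nkk(k+1)/2=7nk - k(k+1)/2 = 7 here.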

Lagrange multipliers (Theorem 4) are then the first-order optimality conditions on these manifolds — and the entire framework of Riemannian optimization (projecting gradients onto tangent spaces, geodesic line search, retraction maps) rests on the smooth manifold structure guaranteed by the ImFT.

See Information Geometry → formalML for the Riemannian optimization perspective.

Normalizing flows (left): each invertible layer transforms a simple distribution into a complex one, with the IFT guaranteeing local invertibility. Deep Equilibrium Models (right): the output z* is defined implicitly as a fixed point, with the ImFT enabling gradient computation through the equilibrium.

Degenerate Cases

What happens when the IFT or ImFT hypotheses fail? When detJf(a)=0\det J_f(a) = 0, the function is not locally invertible — but the failure can be analyzed, and it leads to rich mathematical structure.

📐 Definition 6 (Critical Point of a Map)

Let f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m be C1C^1. A point aa is a critical point of ff if the Jacobian Jf(a)J_f(a) does not have full rank, i.e., rankJf(a)<min(n,m)\text{rank}\, J_f(a) < \min(n, m). The image f(a)f(a) of a critical point is a critical value. A value cRmc \in \mathbb{R}^m that is not a critical value is a regular value — this means that for every xf1(c)x \in f^{-1}(c), the Jacobian Jf(x)J_f(x) has full rank.

Note: a value cc is regular even if f1(c)=f^{-1}(c) = \emptyset — the condition is vacuously satisfied when no preimage exists.

🔷 Theorem 5 (Regular Value Theorem)

Let f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m be C1C^1 with nmn \geq m. If cRmc \in \mathbb{R}^m is a regular value of ff and the level set M=f1(c)={xRn:f(x)=c}M = f^{-1}(c) = \{x \in \mathbb{R}^n : f(x) = c\} is nonempty, then MM is a smooth (nm)(n-m)-dimensional manifold.

More precisely: for every aMa \in M, there exist an open neighborhood UU of aa in Rn\mathbb{R}^n and a C1C^1 diffeomorphism ϕ:Uϕ(U)Rn\phi: U \to \phi(U) \subseteq \mathbb{R}^n such that

ϕ(MU)=ϕ(U)(Rnm×{0}).\phi(M \cap U) = \phi(U) \cap (\mathbb{R}^{n-m} \times \{0\}).

The tangent space of MM at aa is TaM=kerJf(a)T_a M = \ker J_f(a), which has dimension nmn - m by the rank-nullity theorem (since Jf(a)J_f(a) has rank mm).

The proof follows directly from the ImFT: rewrite f(x)=cf(x) = c as F(x)=f(x)c=0F(x) = f(x) - c = 0. Since cc is regular, Jf(a)=JF(a)J_f(a) = J_F(a) has rank mm at every point of MM. After reindexing variables into “free” and “constrained” groups, the ImFT provides local graph representations xconstrained=g(xfree)x_{\text{constrained}} = g(x_{\text{free}}), which are the chart maps of the manifold structure.
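For a concrete instance of TaM=kerJf(a)T_a M = \ker J_f(a), a minimal sketch (assuming NumPy) on the unit sphere f(x)=x2f(x) = \|x\|^2, where the Jacobian at aa is the row vector 2aT2a^T:

```python
import numpy as np

# The sphere S^2 = f^{-1}(1) for f(x) = ||x||^2; J_f(a) = 2 a^T has rank 1
a = np.array([1.0, 2.0, 2.0]) / 3.0    # a point on the unit sphere
J = 2 * a.reshape(1, 3)

# T_a M = ker J_f(a): read the null space off the SVD
_, s, Vt = np.linalg.svd(J)
tangent_basis = Vt[1:]                  # rows spanning the kernel of J
print(tangent_basis @ a)                # ≈ (0, 0): tangent vectors orthogonal to a
print(tangent_basis.shape[0])           # 2 = n - m, as rank-nullity predicts
```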

📝 Example 13 (The fold catastrophe: f(x) = x³ − tx)

Consider the family of functions ft(x)=x3txf_t(x) = x^3 - tx parametrized by tRt \in \mathbb{R}. The critical points of ftf_t satisfy ft(x)=3x2t=0f_t'(x) = 3x^2 - t = 0, giving x=±t/3x = \pm\sqrt{t/3} (for t>0t > 0). Define the gradient map F:R×RRF: \mathbb{R} \times \mathbb{R} \to \mathbb{R} by F(t,x)=3x2tF(t, x) = 3x^2 - t.

The level set F(t,x)=0F(t, x) = 0 is the parabola t=3x2t = 3x^2 — the bifurcation curve in the (t,x)(t, x) parameter space. On this curve, Fx=6xF_x = 6x, which vanishes at (t,x)=(0,0)(t, x) = (0, 0). The ImFT fails at the origin: Fx(0,0)=0F_x(0, 0) = 0, so we cannot solve for xx as a smooth function of tt near (t,x)=(0,0)(t, x) = (0, 0).

This is a fold bifurcation: for t<0t < 0, ftf_t has no critical points; for t>0t > 0, ftf_t has two critical points (a local max and a local min); at t=0t = 0, the two critical points collide and annihilate. The ImFT detects this qualitative change in structure — the smooth implicit function x(t)x(t) ceases to exist precisely at the bifurcation point.
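The birth and death of the critical-point pair is easy to tabulate; a minimal check (assuming NumPy):

```python
import numpy as np

# Critical points of f_t(x) = x^3 - t x solve 3x^2 - t = 0
def critical_points(t):
    return np.roots([3.0, 0.0, -t])

for t in (-1.0, 0.0, 1.0):
    r = critical_points(t)
    real = r[np.abs(r.imag) < 1e-12].real
    print(t, np.sort(real))   # t < 0: none; t = 0: double root at 0; t > 0: ±sqrt(t/3)
```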

📝 Example 14 (Loss landscape bifurcations in neural networks)

Consider a two-layer neural network fθ(x)=w2σ(w1x)f_\theta(x) = w_2 \sigma(w_1 x) with scalar weights θ=(w1,w2)\theta = (w_1, w_2) and ReLU activation σ(z)=max(0,z)\sigma(z) = \max(0, z). The loss on a single data point (x0,y0)(x_0, y_0) is L(θ)=(fθ(x0)y0)2L(\theta) = (f_\theta(x_0) - y_0)^2.

The critical points satisfy L=0\nabla L = 0. For the ReLU network, the gradient is:

Lw2=2(w2σ(w1x0)y0)σ(w1x0),Lw1=2(w2σ(w1x0)y0)w2x01w1x0>0.\frac{\partial L}{\partial w_2} = 2(w_2\sigma(w_1 x_0) - y_0)\sigma(w_1 x_0), \qquad \frac{\partial L}{\partial w_1} = 2(w_2\sigma(w_1 x_0) - y_0)w_2 x_0 \cdot \mathbf{1}_{w_1 x_0 > 0}.

The Hessian HL(θ)=J(L)(θ)H_L(\theta) = J(\nabla L)(\theta) — the Jacobian of the gradient — is singular whenever the ReLU transitions between active and inactive: at w1x0=0w_1 x_0 = 0, the function is not differentiable, and the smooth landscape splits into two regions. This is a form of bifurcation in the loss landscape: the topology of the critical point set changes as w1w_1 crosses zero. The IFT applied to the gradient map L\nabla L tells us that critical points are isolated (and vary smoothly with hyperparameters) wherever HLH_L is non-singular — exactly the Hessian condition from Topic 11.

💡 Remark 6 (The rank theorem and the road to differential topology)

The IFT and ImFT are the starting points of differential topology — the study of smooth manifolds and smooth maps between them. The Regular Value Theorem (Theorem 5) is the bridge: it tells us that level sets of smooth maps are generically manifolds. The rank theorem generalizes this: near a point where JfJ_f has constant rank rr, the map ff can be locally reduced (by smooth coordinate changes in both domain and range) to the linear projection (x1,,xn)(x1,,xr,0,,0)(x_1, \ldots, x_n) \mapsto (x_1, \ldots, x_r, 0, \ldots, 0). Maps of constant rank are submersions (when r=mr = m, the map is “onto” locally) and immersions (when r=nr = n, the map is “one-to-one” locally). The IFT is the special case where r=n=mr = n = m.

For the connection to manifold theory in the ML context, see Smooth Manifolds → formalML.

Degenerate cases: the fold catastrophe (left) shows a bifurcation where two critical points collide and vanish as a parameter changes. The cusp catastrophe (right) shows a higher-order degeneracy. These are the simplest examples of catastrophe theory — the classification of degeneracies of smooth maps.

Computational Notes

The IFT and ImFT tell us when inverse and implicit functions exist; computation requires numerical methods. Here are the essential tools.

Solving implicit equations: scipy.optimize.fsolve

import numpy as np
from scipy.optimize import fsolve

# Find y such that F(x, y) = x^2 + y^2 - 1 = 0, near y = 0.8
def F(y, x):
    return x**2 + y**2 - 1

x_val = 0.5
y_solution = fsolve(F, x0=0.8, args=(x_val,))
print(f"y = {y_solution[0]:.6f}")  # y ≈ 0.866025 = sqrt(3)/2

Computing inverse Jacobians: np.linalg.solve vs. np.linalg.inv

import numpy as np

# Jacobian of the complex squaring map at (1, 1)
J_f = np.array([[2, -2],
                [2,  2]])

# Bad: explicit inverse (O(n^3), numerically unstable for large n)
J_g_bad = np.linalg.inv(J_f)

# Good: solve J_f @ J_g = I directly (same cost, better stability)
J_g_good = np.linalg.solve(J_f, np.eye(2))

print("Inverse Jacobian:\n", J_g_good)
print("Condition number:", np.linalg.cond(J_f))

Implicit differentiation with JAX: jax.custom_vjp

import jax
import jax.numpy as jnp

def f(params, x, z):
    """Example implicit layer: z = tanh(params @ z + x)."""
    return jnp.tanh(params @ z + x)

def fixed_point_layer(params, x, max_iter=100, tol=1e-6):
    """Find z* such that z* = f(params, x, z*) by forward iteration."""
    z = jnp.zeros_like(x)
    for _ in range(max_iter):
        z_new = f(params, x, z)
        if jnp.max(jnp.abs(z_new - z)) < tol:
            break
        z = z_new
    return z

@jax.custom_vjp
def deq_layer(params, x):
    return fixed_point_layer(params, x)

def deq_layer_fwd(params, x):
    z_star = fixed_point_layer(params, x)
    return z_star, (params, x, z_star)

def deq_layer_bwd(res, g):
    """Backward pass via the ImFT: dz*/dtheta = (I - df/dz)^{-1} df/dtheta."""
    params, x, z_star = res
    # VJP form: solve (I - J_f)^T v = g, where J_f = df/dz at z*
    J_f = jax.jacobian(f, argnums=2)(params, x, z_star)
    v = jnp.linalg.solve((jnp.eye(z_star.size) - J_f).T, g)
    # Chain v through df/dparams and df/dx; for tanh, f' = 1 - tanh^2
    sech2 = 1.0 - jnp.tanh(params @ z_star + x) ** 2
    params_grad = jnp.outer(v * sech2, z_star)   # d/dparams of tanh(params @ z* + x)
    x_grad = v * sech2
    return params_grad, x_grad

deq_layer.defvjp(deq_layer_fwd, deq_layer_bwd)

Constrained optimization with SciPy

import numpy as np
from scipy.optimize import minimize

# Maximize entropy on the simplex (n=3)
def neg_entropy(p):
    return np.sum(p * np.log(p + 1e-12))

def constraint_sum(p):
    return np.sum(p) - 1.0

result = minimize(neg_entropy, x0=[0.2, 0.3, 0.5],
                  constraints={'type': 'eq', 'fun': constraint_sum},
                  bounds=[(1e-10, 1)] * 3)
print(f"Optimal p: {result.x}")  # Should be ≈ [1/3, 1/3, 1/3]

💡 Remark 7 (Numerical vs. mathematical invertibility — the condition number)

The IFT says that ff is locally invertible when detJf(a)0\det J_f(a) \neq 0. Numerically, the question is how far from zero the determinant is — or more precisely, how well-conditioned the Jacobian is. The condition number κ(Jf)=JfJf1=σmax/σmin\kappa(J_f) = \|J_f\| \cdot \|J_f^{-1}\| = \sigma_{\max}/\sigma_{\min} (where σi\sigma_i are the singular values) measures the sensitivity of the inverse to perturbations.

If κ(Jf)=1016\kappa(J_f) = 10^{16} and you are working in double precision (about 16 digits), the inverse is computed with roughly zero digits of accuracy. The map is mathematically invertible but numerically singular. In normalizing flows, this manifests as numerical overflow in logdetJf\log|\det J_f| computations — the density estimate blows up. Architectures like Real-NVP avoid this by ensuring triangular Jacobians with bounded diagonal entries.

The rule: never compute Jf1J_f^{-1} explicitly. Instead, solve Jfv=bJ_f \cdot v = b using np.linalg.solve (LU decomposition) or iterative methods. This is more numerically stable and often faster.

Connections & Further Reading

This topic connects the entire Multivariable Differential Calculus track into a unified framework:

  • Epsilon-Delta Continuity: the IFT proof requires ε\varepsilon-δ\delta control of the neighborhood size; the C1C^1 hypothesis ensures continuity of JfJ_f, which is used to guarantee DTy(x)<1/2\|DT_y(x)\| < 1/2.
  • Partial Derivatives & the Gradient: the gradient f\nabla f is the constraint surface normal; the ImFT applied to f(x)=cf(x) = c recovers gradient orthogonality to level sets.
  • The Jacobian & Multivariate Chain Rule: the Jacobian Jf(a)J_f(a) is the central hypothesis of the IFT; the chain rule gives Jf1=[Jf]1J_{f^{-1}} = [J_f]^{-1}.
  • The Hessian & Second-Order Analysis: the Hessian is Hf=J(f)H_f = J(\nabla f); non-singularity of HfH_f at a critical point is the IFT condition applied to the gradient map, so critical points are isolated when the Hessian is non-singular.
  • Completeness & Compactness: the Contraction Mapping Theorem uses completeness of Rn\mathbb{R}^n — the iterates form a Cauchy sequence that converges because the space is complete.

Prerequisites met: this topic assumes familiarity with the topics listed in the connections above.

Within formalCalculus — forward references:

  • Change of Variables & the Jacobian Determinant — the IFT provides the theoretical foundation for the substitution formula ϕ(U)f=U(fϕ)detJϕ\int_{\phi(U)} f = \int_U (f \circ \phi)|\det J_\phi|
  • First-Order ODEs & Existence Theorems — existence and uniqueness of ODE solutions via the contraction mapping principle (Picard-Lindelöf theorem)
  • Metric Spaces & Topology — the Banach Contraction Mapping Theorem in abstract metric spaces, with the IFT argument from this topic appearing as Example 11

Forward links to formalML:

  • Gradient Descent → formalML — the IFT guarantees local convergence near non-degenerate critical points; the Hessian condition number governs convergence rate
  • Smooth Manifolds → formalML — the Regular Value Theorem (Theorem 5) is the primary tool for constructing smooth manifolds from level sets
  • Information Geometry → formalML — the Fisher information matrix as Hessian of the KL divergence; the IFT guarantees reparametrization invariance when the Fisher matrix is non-singular

References

  1. book Munkres (1991). Analysis on Manifolds Chapter 7 develops the Inverse Function Theorem and the Implicit Function Theorem with careful attention to the neighborhoods where invertibility holds — the primary reference for our treatment
  2. book Spivak (1965). Calculus on Manifolds Chapter 2 proves the IFT via contraction mapping and derives the ImFT as a corollary — the cleanest modern treatment and our model for the proof structure
  3. book Rudin (1976). Principles of Mathematical Analysis Chapter 9, Theorems 9.24 and 9.28 — the IFT and ImFT in Rudin's characteristically concise style. Useful for the contraction mapping setup
  4. book Lee (2013). Introduction to Smooth Manifolds Chapter 4 uses the IFT/ImFT to define submersions, immersions, and the regular value theorem — the direct bridge to the smooth manifolds topic on formalml.com
  5. book Nocedal & Wright (2006). Numerical Optimization Chapter 12 on constrained optimization via Lagrange multipliers — the practical application of the ImFT in optimization
  6. paper Chen, Rubanova, Bettencourt & Duvenaud (2018). “Neural Ordinary Differential Equations” Neural ODEs use the IFT implicitly: the flow map of an ODE is locally invertible when the vector field is smooth, and the adjoint method computes gradients through this inverse
  7. paper Dinh, Sohl-Dickstein & Bengio (2017). “Density Estimation using Real-NVP” Normalizing flows construct diffeomorphisms with tractable Jacobian determinants — the IFT guarantees invertibility at every point in the flow
  8. paper Bai, Kolter & Koltun (2019). “Deep Equilibrium Models” DEQs find fixed points of implicit layers and differentiate through them via the ImFT — the canonical ML application of implicit differentiation