The Gradient, Jacobian, Hessian, and Laplacian

A physics story

A first-principles walkthrough of gradients, Jacobians, Hessians, and Laplacians. What they are, where they come from, and why they matter in machine learning.
Categories: ai, learning, physics, machine learning, gradient, jacobian, hessian, laplacian
Author

Michael Green

Published

April 6, 2026

Introduction

If you’ve spent any time training neural networks you’ve seen gradients everywhere. They’re the thing that makes learning happen. But gradients are just the beginning. There’s a whole family of mathematical operators sitting behind the scenes, and most ML practitioners only ever meet the first one. The rest (the Jacobian, the Hessian, and the Laplacian) tend to get a hand-wavy paragraph in a textbook and then everyone moves on.

I think that’s a shame. These operators aren’t abstract curiosities. They are the language physics uses to describe how fields change in space, how forces arise from potentials, and how heat flows through materials. The fact that machine learning rediscovered them independently should tell you something about how fundamental they are. So let’s walk through them properly. Not just “what they are” but where they come from, how they relate to each other, and why they matter when you’re building models.

Image from Freepik

If you’ve read my earlier post on derivatives you’ll have the foundation we need. If not, the short version is: a derivative tells you how a function changes when you wiggle its input a little. Everything else builds from there.

The gradient

Ok so you have a function \(f(\mathbf{x})\) that takes a vector \(\mathbf{x} = [x_1, x_2, \ldots, x_n]^T\) and spits out a single number. A loss function, for instance. The gradient is just the vector of all partial derivatives of \(f\) with respect to each input:

\[ \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \]

That’s it. Each component tells you how much \(f\) changes when you nudge the corresponding input variable. The gradient as a whole points in the direction of steepest ascent. If you want to minimize \(f\) (which you usually do), you walk in the opposite direction. That’s gradient descent in a nutshell.

But here’s where the physics perspective adds something useful. In physics, the gradient of a scalar field gives you a vector field. If \(f\) is a potential energy surface (think: a hilly landscape), then \(-\nabla f\) gives you the force field. A ball placed on that surface will roll in the direction of \(-\nabla f\). Gradient descent is literally simulating a ball rolling downhill on your loss surface. Not metaphorically. Literally. The math is the same.

The gradient of a scalar field is a vector field pointing in the direction of steepest ascent. Negate it and you get the force that drives optimization.

A concrete example

Let’s say \(f(x_1, x_2) = x_1^2 + 3x_1 x_2 + x_2^2\). The gradient is

\[ \nabla f = \begin{bmatrix} 2x_1 + 3x_2 \\ 3x_1 + 2x_2 \end{bmatrix} \]

At the point \(\mathbf{x} = [1, 1]^T\) this evaluates to \([5, 5]^T\). So the function increases most steeply in the direction \([5, 5]^T\), and a gradient descent step would move you toward \([-5, -5]^T\) (scaled by some learning rate). Nothing mysterious.
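To make this concrete in code, here’s a minimal NumPy sketch that evaluates the gradient of this example both from the analytic formula and by central finite differences (`numerical_grad` is an illustrative helper of my own, not a library function):

```python
import numpy as np

def f(x):
    # f(x1, x2) = x1^2 + 3*x1*x2 + x2^2
    return x[0]**2 + 3*x[0]*x[1] + x[1]**2

def grad_f(x):
    # Analytic gradient from the partial derivatives above
    return np.array([2*x[0] + 3*x[1], 3*x[0] + 2*x[1]])

def numerical_grad(f, x, h=1e-6):
    # Central differences: (f(x + h*e_i) - f(x - h*e_i)) / (2h)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, 1.0])
print(grad_f(x))             # [5. 5.]
print(numerical_grad(f, x))  # ≈ [5. 5.]
```

Comparing an analytic gradient against finite differences like this is also a handy debugging trick when you implement derivatives by hand.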

Now, a question worth asking: the gradient tells you which direction to go, but does it tell you how far? No. Answering that takes curvature, which is the Hessian’s job, and we’ll get there. First, though: what about functions with more than one output?

The Jacobian

The gradient handles the case where your function maps \(\mathbb{R}^n \rightarrow \mathbb{R}\): many inputs, one output. But what if you have many outputs too? A function \(\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m\) that maps \(n\) inputs to \(m\) outputs? This is exactly what a neural network layer does. It takes an input vector and produces an output vector.

The Jacobian is the natural generalization. Instead of stacking partial derivatives into a vector (gradient), you stack them into a matrix:

\[ \mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \]

Each row \(i\) is the gradient of the \(i\)-th output \(f_i\) with respect to all inputs. So the Jacobian is really just a stack of gradients. If \(m = 1\) (one output), the Jacobian collapses to the gradient (transposed). They’re the same thing in different clothes.

Why this matters for neural networks

Backpropagation is just repeated application of the chain rule through a composition of functions. Each layer in a neural network is a function, and the Jacobian of that layer tells you how its outputs change with respect to its inputs. When you chain layers together, the Jacobians multiply:

\[ \mathbf{J}_{\text{total}} = \mathbf{J}_L \cdot \mathbf{J}_{L-1} \cdots \mathbf{J}_2 \cdot \mathbf{J}_1 \]

This is backpropagation. That’s literally all it is. A product of Jacobian matrices, evaluated at the current inputs, propagated backward through the network. When people talk about “vanishing gradients” in deep networks, they’re really talking about this product of Jacobians: when the singular values of the layer Jacobians sit mostly below 1, the product shrinks exponentially with depth (and when they sit mostly above 1, it explodes).
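The Jacobian-product view is easy to verify numerically. Here’s a sketch with a tiny made-up two-layer map (a `tanh` layer followed by a linear one, with random weights): the product of the per-layer Jacobians matches a finite-difference Jacobian of the whole composition.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))  # first layer: R^2 -> R^3
W2 = rng.normal(size=(2, 3))  # second layer: R^3 -> R^2

def layer1(x):
    return np.tanh(W1 @ x)

def network(x):
    return W2 @ layer1(x)

def jacobian_numerical(f, x, h=1e-6):
    # Finite-difference Jacobian: one input dimension per column
    cols = []
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        cols.append((f(x + e) - f(x - e)) / (2 * h))
    return np.stack(cols, axis=1)

x = np.array([0.3, -0.7])
# Analytic layer Jacobians: J1 = diag(1 - tanh^2(W1 x)) @ W1 and J2 = W2
J1 = np.diag(1 - np.tanh(W1 @ x)**2) @ W1
J2 = W2
print(np.allclose(J2 @ J1, jacobian_numerical(network, x)))  # True
```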

The Jacobian determinant

There’s another reason to care about the Jacobian (especially in physics and generative modeling): the determinant. When you apply a transformation \(\mathbf{f}\) to a region of space, the absolute value of the Jacobian determinant \(|\det(\mathbf{J})|\) tells you how much the volume of that region changes.

This shows up directly in normalizing flows, a class of generative models where you transform a simple distribution (like a Gaussian) through a sequence of invertible functions to get a complex distribution. The change-of-variables formula requires the Jacobian determinant:

\[ p_Y(\mathbf{y}) = p_X(\mathbf{x}) \cdot |\det(\mathbf{J})|^{-1} \]

where \(\mathbf{y} = \mathbf{f}(\mathbf{x})\). The entire architecture of normalizing flows is designed to make this determinant cheap to compute. So when someone says “we need a triangular Jacobian for efficiency,” now you know what they mean and why.
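Here’s a small sketch of why triangular Jacobians are so attractive: for a triangular matrix, the determinant is just the product of the diagonal entries, so the determinant term in the change-of-variables formula costs \(O(n)\) instead of \(O(n^3)\). (The matrix below is random and purely illustrative.)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
# A lower-triangular Jacobian, the shape used by autoregressive flows
J = np.tril(rng.normal(size=(n, n)))
np.fill_diagonal(J, np.abs(np.diag(J)) + 0.5)  # keep it invertible

# For a triangular matrix the determinant is the product of the
# diagonal: O(n), versus O(n^3) for a general matrix.
print(np.allclose(np.linalg.det(J), np.prod(np.diag(J))))  # True
```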

The Hessian

Ok so the gradient tells you the slope. But the slope itself can change. Is the surface curving sharply or gently? Is it curving the same way in all directions? The Hessian answers these questions.

The Hessian is the matrix of second-order partial derivatives of a scalar function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\):

\[ \mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \]

Or more compactly: \(H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}\). If \(f\) is twice continuously differentiable (which it usually is in ML), then \(H_{ij} = H_{ji}\) and the Hessian is symmetric. This is Schwarz’s theorem and it’s actually quite important because it means the Hessian has real eigenvalues, which we can interpret.

You might have noticed something. The Hessian is the Jacobian of the gradient. Just as the Jacobian generalizes the gradient to vector-valued functions, the Hessian is what you get when you apply the Jacobian operator to the gradient vector field:

\[ \mathbf{H} = \mathbf{J}(\nabla f) \]

These operators are nested. They build on each other.
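You can see the nesting directly in code. Differentiating the gradient of the earlier example \(f(x_1, x_2) = x_1^2 + 3x_1 x_2 + x_2^2\) component by component recovers its constant, symmetric Hessian. A finite-difference sketch:

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x1, x2) = x1^2 + 3*x1*x2 + x2^2
    return np.array([2*x[0] + 3*x[1], 3*x[0] + 2*x[1]])

def hessian_numerical(grad, x, h=1e-6):
    # The Hessian is the Jacobian of the gradient: differentiate each
    # gradient component with respect to each input.
    n = len(x)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        H[:, j] = (grad(x + e) - grad(x - e)) / (2 * h)
    return H

H = hessian_numerical(grad_f, np.array([1.0, 1.0]))
print(H)                    # ≈ [[2, 3], [3, 2]]
print(np.allclose(H, H.T))  # True: symmetric, as Schwarz's theorem promises
```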

What the eigenvalues tell you

Here’s where the Hessian becomes really powerful. The eigenvalues of the Hessian at a critical point (where \(\nabla f = 0\)) tell you the nature of that point:

  • All eigenvalues positive: you’re at a local minimum. The surface curves upward in every direction. Congratulations.
  • All eigenvalues negative: local maximum. Curves down everywhere. (Rare in ML loss landscapes, but it happens.)
  • Mixed signs: saddle point. It’s a minimum in some directions and a maximum in others. In high-dimensional optimization, saddle points are far more common than local minima. This is one of the most important insights in modern deep learning theory.
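Amusingly, the example function from the gradient section already illustrates the saddle case. Its Hessian is constant, and its eigenvalues have mixed signs, so its one critical point (the origin) is a saddle:

```python
import numpy as np

# Hessian of f(x1, x2) = x1^2 + 3*x1*x2 + x2^2 (constant, since f is quadratic)
H = np.array([[2.0, 3.0],
              [3.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(H)  # symmetric matrix -> real eigenvalues
print(eigenvalues)                   # ≈ [-1.  5.]

# Mixed signs: the critical point at the origin is a saddle, not a minimum.
print(np.all(eigenvalues > 0))       # False
```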

Newton’s method

The Hessian enables second-order optimization. While gradient descent uses only the slope to decide where to go, Newton’s method uses both the slope and the curvature:

\[ \mathbf{x}_{\text{new}} = \mathbf{x}_{\text{old}} - \mathbf{H}^{-1} \nabla f \]

The intuition is this: in a region where the surface curves sharply, you should take a small step (the curvature is large, so the gradient changes fast). In a flat region, you can take a big step. The Hessian inverse automatically scales the gradient by the curvature, giving you an adaptive step size per direction.

The catch? Computing the full Hessian for a neural network with millions of parameters means storing and inverting an \(n \times n\) matrix where \(n\) is in the millions. That’s \(O(n^2)\) memory and \(O(n^3)\) for inversion. Not exactly practical. This is why methods like L-BFGS (which builds a low-rank approximation of the Hessian from gradient history) and Adam (which keeps running averages of squared gradients as a cheap diagonal preconditioner, a stand-in for curvature rather than the true Hessian diagonal) are so popular. They give you some of the curvature information without the full cost.
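Here’s a small sketch of the adaptive-step intuition on a deliberately badly scaled quadratic (the matrix and step size are made up for illustration): one Newton step lands exactly on the minimizer, while a fixed-step gradient step does not.

```python
import numpy as np

# Convex quadratic f(x) = 0.5 x^T A x - b^T x, a toy test problem;
# gradient = A x - b, Hessian = A, minimizer = A^{-1} b.
A = np.array([[10.0, 0.0],
              [0.0,  1.0]])  # badly scaled: steep along x1, flat along x2
b = np.array([1.0, 1.0])

x = np.zeros(2)
grad = A @ x - b

# One Newton step: for a quadratic, x - H^{-1} grad hits the minimum exactly.
x_newton = x - np.linalg.solve(A, grad)
print(np.allclose(A @ x_newton, b))  # True: gradient is zero, we're done

# Gradient descent with a fixed step size is still far away after one step.
x_gd = x - 0.1 * grad
print(np.allclose(A @ x_gd, b))      # False
```

Note the `np.linalg.solve` call: even in this toy example, you solve the linear system rather than explicitly inverting the Hessian.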

A physics perspective

In physics, the Hessian shows up when you study small oscillations around equilibrium. If you have a potential energy surface \(V(\mathbf{x})\) and a particle sitting at a minimum, the Hessian of \(V\) at that point gives you the spring constants in every direction. The eigenvalues are the squared frequencies of the normal modes of vibration. This is how you analyze molecular vibrations, crystal lattice dynamics, and yes, the “vibrations” around minima in loss landscapes. Same math, different context.

The Laplacian

Now for the final member of our family. The Laplacian is the trace of the Hessian. Or equivalently, the sum of the unmixed second partial derivatives:

\[ \Delta f = \nabla^2 f = \sum_{i=1}^{n} \frac{\partial^2 f}{\partial x_i^2} = \text{tr}(\mathbf{H}) \]

Where the gradient gives you a vector, and the Hessian gives you a matrix, the Laplacian collapses everything back to a single scalar. It’s a measure of the average curvature at a point. Or more precisely: it tells you how much the value of \(f\) at a point differs from the average value of \(f\) in a small neighborhood around that point.

If \(\Delta f > 0\) at a point, the function value is below the neighborhood average (the surface is concave up, like a valley). If \(\Delta f < 0\), the function value is above the average (concave down, like a hilltop).

Diffusion and the heat equation

This is where the physics story really pays off. The heat equation, one of the most fundamental PDEs in physics, is:

\[ \frac{\partial u}{\partial t} = \alpha \nabla^2 u \]

It says that temperature \(u\) changes over time in proportion to its Laplacian. If a point is hotter than its neighbors (\(\nabla^2 u < 0\)), it cools down. If it’s cooler (\(\nabla^2 u > 0\)), it heats up. The Laplacian is the engine of diffusion.
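A minimal 1D simulation shows this in action: start with a single hot spot, repeatedly add \(\alpha \, \Delta t\) times the discrete Laplacian, and watch the spike spread out. (Grid size, \(\alpha\), and step sizes here are arbitrary illustrative choices.)

```python
import numpy as np

# Explicit finite-difference step for the 1D heat equation
# du/dt = alpha * d^2u/dx^2, using the discrete Laplacian
# (u[i-1] - 2*u[i] + u[i+1]) / dx^2.
n, alpha, dx, dt = 50, 1.0, 1.0, 0.2  # dt <= dx^2 / (2*alpha) for stability
u = np.zeros(n)
u[n // 2] = 1.0                       # a single hot spot in the middle

for _ in range(100):
    lap = np.zeros(n)
    lap[1:-1] = (u[:-2] - 2 * u[1:-1] + u[2:]) / dx**2
    u = u + alpha * dt * lap

# Heat diffuses: the peak drops while the total heat stays (nearly) constant.
print(u.max() < 1.0)  # True
```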

Why should you care? Because diffusion models (the kind behind Stable Diffusion, DALL-E, etc.) are literally built on this idea. They add noise to data (forward diffusion) and then learn to reverse the process (denoising). The score function that these models learn is closely related to the gradient of the log-density, and the dynamics of the diffusion process are governed by a stochastic differential equation that has the Laplacian baked in. When people say “score-based generative models,” the score is \(\nabla_\mathbf{x} \log p(\mathbf{x})\) and the diffusion process follows a Fokker-Planck equation involving \(\nabla^2\).

Graph Laplacians

There’s also a discrete version of the Laplacian that shows up in graph neural networks. For a graph with adjacency matrix \(\mathbf{A}\) and degree matrix \(\mathbf{D}\), the graph Laplacian is:

\[ \mathbf{L} = \mathbf{D} - \mathbf{A} \]

This discrete Laplacian has the same interpretation as its continuous cousin: it measures how much a node’s value differs from its neighbors’ values. Graph convolution operations in GNNs are essentially applications of the graph Laplacian, smoothing signals over the graph topology. Spectral graph theory, which underlies many GNN architectures, is built entirely on the eigenvalues and eigenvectors of \(\mathbf{L}\).
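Here’s a tiny sketch for a 4-node path graph: build \(\mathbf{L} = \mathbf{D} - \mathbf{A}\), check that constant signals have zero Laplacian, and see that the quadratic form \(\mathbf{x}^T \mathbf{L} \mathbf{x}\) scores rough signals higher than smooth ones.

```python
import numpy as np

# Path graph on 4 nodes: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))  # degree matrix
L = D - A                   # graph Laplacian

# Each row sums to zero: a constant signal has zero Laplacian,
# i.e. it is perfectly "smooth" over the graph.
print(np.allclose(L @ np.ones(4), 0))           # True

# The quadratic form x^T L x sums squared differences across edges,
# so rougher signals score higher.
smooth = np.array([1.0, 2.0, 3.0, 4.0])
rough = np.array([1.0, -1.0, 1.0, -1.0])
print(smooth @ L @ smooth < rough @ L @ rough)  # True
```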

Regularization

The Laplacian also appears in regularization. Penalizing the Laplacian of a function (or its discrete approximation) encourages smoothness. This is the basis of Laplacian smoothing in image processing and mesh processing. In ML, Laplacian regularization in semi-supervised learning encourages the model to produce similar predictions for nearby points in the data graph.

How they all fit together

Let’s put the hierarchy in one place:

| Operator | Input | Output | What it tells you |
|---|---|---|---|
| Gradient \(\nabla f\) | \(f: \mathbb{R}^n \to \mathbb{R}\) | Vector (\(\mathbb{R}^n\)) | Direction of steepest ascent |
| Jacobian \(\mathbf{J}\) | \(\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m\) | Matrix (\(\mathbb{R}^{m \times n}\)) | How each output changes w.r.t. each input |
| Hessian \(\mathbf{H}\) | \(f: \mathbb{R}^n \to \mathbb{R}\) | Matrix (\(\mathbb{R}^{n \times n}\)) | Curvature of \(f\) in every direction |
| Laplacian \(\Delta f\) | \(f: \mathbb{R}^n \to \mathbb{R}\) | Scalar (\(\mathbb{R}\)) | Average curvature (trace of Hessian) |

The relationships:

  • The gradient is a special case of the Jacobian (when \(m = 1\)).
  • The Hessian is the Jacobian of the gradient.
  • The Laplacian is the trace of the Hessian.

Each one peels back another layer of information about how your function behaves. The gradient gives direction. The Jacobian extends that to multi-output functions. The Hessian reveals curvature. The Laplacian summarizes that curvature into a single number.

Conclusion

I’ve heard people say “you don’t need to understand the math to do machine learning.” That’s technically true, the same way you don’t need to understand combustion to drive a car. But if you’re building the engine, debugging a misfire, or trying to make the thing go faster, it helps to know what’s actually happening.

These four operators (the gradient, Jacobian, Hessian, and Laplacian) are the vocabulary that physics developed over centuries to describe how fields change in space. Machine learning borrowed this vocabulary, sometimes without realizing it. Backpropagation is Jacobian products. Adam’s second-moment estimates stand in for the Hessian diagonal. Diffusion models are Laplacian dynamics. Normalizing flows are Jacobian determinants. It’s all connected.

The more of this you internalize, the less “magical” these algorithms feel. And the less magical they feel, the better you get at building them, fixing them, and knowing when they’re lying to you.

Don’t hesitate to reach out if I’ve made any mistakes or if something could be explained more clearly. I’m always happy to be corrected.