§1 Not dimensionality reduction

The companion paper states this plainly: diffusion maps is not dimensionality reduction in the sense of PCA or UMAP. Those methods find a low-dimensional representation that preserves some notion of distance or topology in the ambient space $\mathbb{R}^D$. DMAP does something categorically different — it recovers the intrinsic geometry of the manifold the data lives on.

High-dimensional scientific data — MD trajectories, cryo-EM particle stacks, XFEL diffraction — does not fill $\mathbb{R}^D$ uniformly. It concentrates near a low-dimensional manifold carved by the underlying physics. A protein does not explore all of $\mathbb{R}^{3N}$; it explores a conformational landscape shaped by energy and entropy. The diffusion coordinates are coordinates on that landscape — not a projection of the data, but a discovery of the geometry hidden within it.

Coifman & Lafon (2006): running a Markov random walk on the data graph for time $t$ filters noise and reveals the large-scale geometry. The eigenvectors of the walk operator are the natural intrinsic coordinates.

§2 The kernel $P = e^{-\beta H}$

The diffusion kernel in DMAP is the quantum Boltzmann operator. Given $N$ data points $\{x_i\}_{i=1}^N \subset \mathbb{R}^D$, we define the Hamiltonian as the simple harmonic oscillator potential:

Hamiltonian
$$ H \;=\; \tfrac{1}{2} D^2 $$

where $D^2_{ij} = \|x_i - x_j\|^2$ is the squared pairwise distance matrix. The kernel is then the matrix exponential of $-\beta H$:

Diffusion kernel — the Boltzmann operator
$$ P \;=\; e^{-\beta H} \;=\; \exp\!\left(-\,\frac{\beta}{2}\, D^2\right) $$

Entry-wise, this gives the familiar Gaussian kernel $P_{ij} = \exp\!\bigl(-\tfrac{\beta}{2}\|x_i - x_j\|^2\bigr)$, but the derivation from $H = \tfrac{1}{2}D^2$ is not merely notational. It places DMAP firmly within the framework of quantum statistical mechanics: $\beta$ is the inverse temperature, $P$ is the thermal density matrix, and the diffusion coordinates are the low-energy eigenstates of $H$.

Physical interpretation

The bandwidth parameter $\beta$ controls which energy scale is resolved. Large $\beta$ (low temperature) emphasizes local geometry — only nearby points interact strongly. Small $\beta$ (high temperature) reveals global structure across the manifold.

Computing $P$ requires all $N^2$ pairwise squared distances — an $O(N^2 D)$ operation. This is done exactly, without approximation. No kNN graph, no pruning, no sparsification. Every pair of points contributes.

§3 Normalization and the operator $P^+$

The raw kernel $P$ conflates geometry with sampling density. Coifman–Lafon normalization removes this bias, producing the row-stochastic diffusion operator $P^+$:

Diffusion operator — central object of DMAP
$$ P^+ \;=\; \mathrm{diag}(Z)^{-1}\, P, \qquad Z = P\,\mathbf{1} $$

$P^+$ is the transition matrix of a Markov random walk whose stationary distribution reflects the geometry of the manifold, independent of how densely the data was sampled in any region. The normalization parameter $\alpha \in [0,1]$ controls how aggressively density is corrected:

Special cases of $\alpha$

$\alpha = 0$ — graph Laplacian normalization, stationary measure $\propto$ sampling density.
$\alpha = 1$ — Laplace–Beltrami operator, stationary measure $\propto$ uniform on manifold. This is the natural choice for physics applications.

§4 Diffusion coordinates

The diffusion coordinates at time $t$ are the leading eigenvectors of $P^+$, weighted by their eigenvalues raised to power $t$:

Diffusion embedding
$$ \Psi_t(x_i) \;=\; \bigl( \lambda_1^t\,\psi_1(x_i),\; \lambda_2^t\,\psi_2(x_i),\; \ldots,\; \lambda_m^t\,\psi_m(x_i) \bigr) $$

where $1 = \lambda_0 > \lambda_1 \geq \lambda_2 \geq \ldots$ are the eigenvalues of $P^+$ and $\psi_j$ the corresponding eigenvectors. The fundamental theorem of diffusion maps: the Euclidean distance in this embedding equals the diffusion distance between points — a geometry-aware, noise-robust intrinsic distance on the manifold.

Choice of $t$

Large $t$ suppresses eigenvectors with $\lambda_j^t \approx 0$, revealing only the slowest, most global modes of the geometry. For MD trajectories $t$ has a physical interpretation as a lag time: fast local transitions stay close, slow conformational changes separate. In the quantum analogy, $t$ plays the role of imaginary time — the path integral representation of the thermal density matrix $e^{-\beta t H}$.

§5 Lanczos eigensolver

Extracting the top $m$ eigenvectors of $P^+$ does not require materializing or factorizing the full $N \times N$ matrix. The Lanczos algorithm only requires matrix-vector products $v \mapsto P^+ v$, each costing $O(N^2 D)$ — one pass over the kernel.

Lanczos iteration

Starting from a random vector, Lanczos builds a Krylov subspace $\mathcal{K}_k = \mathrm{span}\{v, P^+v, (P^+)^2 v, \ldots, (P^+)^{k-1}v\}$ and extracts the extremal eigenpairs from this subspace. Convergence to the top $m$ eigenvectors typically requires $k = O(m)$ to $O(m \log N)$ iterations, each a single $O(N^2 D)$ matvec.

This is the direct analogy to how FlashAttention computes attention: instead of neural network layers, the operator is applied iteratively via Lanczos to extract the geometric eigenmodes. The operator is never stored — only applied.

FlashAttention: implicit operator applied via forward pass through NN layers.
FlashDiffusion: implicit operator $P^+$ applied via Lanczos matvec. Same tiling strategy, same memory savings, different spectral target.

§6 FlashDiffusion kernel

The Lanczos matvec $v \mapsto P^+ v$ requires computing $P_{ij} = \exp\!\bigl(-\tfrac{\beta}{2}\|x_i - x_j\|^2\bigr)$ for all $i, j$ at each iteration — $O(N^2 D)$ floating point operations. On naive hardware this is memory-bandwidth limited: the $N \times N$ kernel matrix cannot fit in SRAM and must be streamed from HBM, making it intractable beyond $N \sim 10^4$.

FlashDiffusion solves this exactly as FlashAttention solves attention: by tiling. The kernel computation is blocked into tiles that fit in the SRAM of SM100/SM103 streaming multiprocessors on Blackwell GPUs. Each tile computes a block of $P_{ij}$ values, immediately accumulates them into the output vector $P^+ v$, and discards them — the kernel is never written to HBM.

Tiling strategy

For a tile of rows $I$ and columns $J$: load $\{x_i\}_{i \in I}$ and $\{x_j\}_{j \in J}$ into SRAM, compute $\|x_i - x_j\|^2$ for all $(i,j)$ pairs in the tile, apply the kernel $\exp(-\tfrac{\beta}{2}\,\cdot)$, multiply by $v_J$ and accumulate into $(P^+v)_I$. Tile size is chosen to saturate SM100/SM103 tensor cores.

The result: 1M frames $\times$ 64k features computed in approximately 30 minutes on a B300 — a dataset that would require weeks on CPU or be entirely intractable with naive GPU implementations. The geometry is exact. No approximation is made to the kernel or the walk operator.

§7 DMAP vs UMAP vs PCA

Method Kernel Eigensolver Guarantees
PCA Linear (dot product) SVD, $O(ND^2)$ Global linear structure only
Laplacian Eigenmaps kNN graph (sparse) Sparse eigensolver Local neighborhoods, graph-dependent
Diffusion Maps (classical) $e^{-\beta H}$, exact Dense, $O(N^3)$ — limited to ~10k Manifold geometry, $t$-dependent
UMAP kNN approximation Stochastic gradient descent Heuristic — no metric guarantee
DMAP (FlashDiffusion) $e^{-\beta H}$, exact, tiled Lanczos + flash matvec, up to 100M Exact manifold geometry + density correction

UMAP approximates the kernel and uses heuristic optimization. DMAP computes the exact Boltzmann kernel and exact eigenvectors. For reaction coordinates, guided NN training, or any downstream task requiring a true metric on the manifold, exactness is not optional.