Transformers, Attention, and Hamiltonian Neural Networks
This lecture bridges classical attention mechanisms from natural language processing with physics-inspired neural network architectures. Having developed graph calculus and the Hodge decomposition in previous lectures, we now see how these mathematical structures naturally arise in modern deep learning architectures—particularly in transformers and graph neural networks.
The key insight is that attention can be viewed as a soft dictionary lookup or, equivalently, as a data-driven basis function similar to finite element methods. When we introduced finite elements, we saw that functions could be represented as $f(x) = \sum_i \alpha_i(x, x_i) y_i$ where $\alpha_i$ are piecewise linear basis functions. Attention generalizes this by learning the weights $\alpha_i(q, \vec{k})$ from data rather than fixing them geometrically.
We begin with conditional neural fields, where a latent code $z$ modulates input-output relationships—crucial for operator learning in PDEs. We then dissect the transformer architecture: queries, keys, values, multi-head attention, and dropout. Moving to graphs, we examine Graph Attention Networks (GATs), which suffer from over-squashing and over-smoothing problems. Physics-inspired solutions emerge from recognizing that GATs implicitly define graph Laplacians.
Finally, we introduce Hamiltonian neural networks and double-bracket dynamics, showing how energy-preserving and energy-dissipating systems can be built into neural architectures. These connections reveal deep relationships between optimization, dynamical systems, and modern machine learning.
Consider the conditional neural field:
$$ y = f(x \mid z; \theta) $$

Idea: The latent code $z$ modulates the $x \to y$ input-output relationship.
Examples:
Many options for implementing this:
Pedagogical note: Cross-attention provides the most flexible framework, allowing the model to dynamically select relevant features from the conditioning variable $z$ based on the query location $x$.
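As a minimal illustration (not the lecture's construction), here is a NumPy sketch of a conditional neural field in which $z$ is injected by simple concatenation with $x$ before a small MLP; the concatenation mechanism and the layer sizes are assumptions chosen only for concreteness.

```python
import numpy as np

def mlp_forward(inp, weights, biases):
    """Plain MLP with tanh hidden activations and a linear output layer."""
    h = inp
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)
    return h @ weights[-1] + biases[-1]

def conditional_field(x, z, weights, biases):
    """y = f(x | z; theta): condition by concatenating the latent code z to x."""
    inp = np.concatenate([x, z])
    return mlp_forward(inp, weights, biases)

# Hypothetical sizes: x in R^2, z in R^4, hidden width 16, scalar output.
rng = np.random.default_rng(0)
sizes = [2 + 4, 16, 16, 1]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = np.array([0.3, -0.7])
z_a, z_b = rng.standard_normal(4), rng.standard_normal(4)
print(conditional_field(x, z_a, weights, biases))  # same x, one latent code ...
print(conditional_field(x, z_b, weights, biases))  # ... different code, different output
```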
Reference: Murphy §15.4.1
Consider $(k_i, v_i)_{i=1}^N$ as key-value pairs describing stored input-output examples (the entries of a soft dictionary).
Consider a query $q$ representing where the model is evaluated.
Definition: The attention function is:
$$ \text{Attn}(q, \{k_i, v_i\}) = \sum_{i=1}^N \alpha_i(q, \vec{k}) v_i $$

where the attention weights satisfy:
$$ \begin{aligned} 0 &\leq \alpha_i \leq 1 \\ \sum_i \alpha_i &= 1 \end{aligned} $$

Introduce an attention score $a(q, k_i)$:
$$ \alpha_i = \text{softmax}(a(q, \vec{k}))_i = \frac{\exp[a(q, k_i)]}{\sum_j \exp[a(q, k_j)]} $$

Remark: This is a data-driven basis!
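A minimal NumPy sketch of the attention function, assuming a scaled dot-product score $a(q, k_i) = q^\top k_i / \sqrt{d}$ (the definition above leaves the score function unspecified, so this particular choice is an assumption):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())   # shift by the max for numerical stability
    return e / e.sum()

def attention(q, K, V):
    """Attn(q, {k_i, v_i}) = sum_i alpha_i(q, K) v_i with softmax weights."""
    d = K.shape[1]
    a = K @ q / np.sqrt(d)    # scores a(q, k_i); scaled dot product is an assumption
    alpha = softmax(a)        # 0 <= alpha_i <= 1 and sum_i alpha_i = 1
    return alpha @ V, alpha

# Toy example: 5 key-value pairs, keys/queries in R^3, values in R^2.
rng = np.random.default_rng(1)
K, V = rng.standard_normal((5, 3)), rng.standard_normal((5, 2))
q = rng.standard_normal(3)
out, alpha = attention(q, K, V)
print(alpha.sum())            # -> 1.0: a convex combination, i.e. a data-driven basis
```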
For example, taking $\alpha_i(q, \vec{k}) = \phi_i(q)$, with continuous piecewise linear (CPWL) basis functions $\phi_i$, satisfies the same properties:
$$ f(x) = \sum_i \alpha_i(x, x_i) y_i $$

where the $x_i$ are the interpolation nodes and $y_i = f(x_i)$ are the nodal values.
Critical observation: Attention generalizes finite element interpolation by learning the basis functions rather than fixing them geometrically!
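To make the analogy concrete, here is a sketch of 1D hat-function (CPWL) interpolation written in the same $f(x) = \sum_i \alpha_i(x) y_i$ form; the node locations and the test function are arbitrary toy choices. The weights are fixed by geometry rather than learned, but they are nonnegative and sum to one, exactly like attention weights.

```python
import numpy as np

def hat_weights(x, nodes):
    """CPWL 'hat' basis weights alpha_i(x) on a 1D mesh: nonnegative, sum to 1."""
    alpha = np.zeros(len(nodes))
    j = np.searchsorted(nodes, x) - 1            # interval [nodes[j], nodes[j+1]]
    j = np.clip(j, 0, len(nodes) - 2)
    t = (x - nodes[j]) / (nodes[j + 1] - nodes[j])
    alpha[j], alpha[j + 1] = 1 - t, t
    return alpha

nodes = np.linspace(0.0, 1.0, 6)                 # x_i
y = np.sin(2 * np.pi * nodes)                    # y_i = f(x_i)
x = 0.37
alpha = hat_weights(x, nodes)
print(alpha @ y)                                 # f(x) = sum_i alpha_i(x) y_i
print(alpha.sum())                               # -> 1.0, like attention weights
```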
For multi-head attention, introduce dense MLP blocks $Q_i$, $K_i$, $V_i$ for $i = 1, \ldots, M$ (the number of heads):
$$ \begin{aligned} \hat{q}_i &= Q_i(q) \\ \hat{k}_i &= K_i(k) \\ \hat{v}_i &= V_i(v) \end{aligned} $$

Then:
$$ h_i = \text{Attn}(\hat{q}_i, \{\hat{k}_i, \hat{v}_i\}) $$

Final aggregation:
$$ h = \text{MHA}(q, \{k, v\}) = \text{Concat}(h_1, \ldots, h_M) W^O $$

Interpretation: Each head learns to attend to different aspects of the input (e.g., position vs. semantic meaning), analogous to using multiple finite element spaces for different solution components.
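A sketch of multi-head attention assembled from the single-head attention above; the $Q_i, K_i, V_i$ blocks are taken to be plain linear maps (a common special case of the dense MLP blocks), and the dimensions are hypothetical.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attention(q, K, V):
    alpha = softmax(K @ q / np.sqrt(K.shape[1]))
    return alpha @ V

def multi_head_attention(q, K, V, heads, W_O):
    """h = Concat(h_1, ..., h_M) W^O with h_i = Attn(Q_i(q), K_i(k), V_i(v))."""
    outs = []
    for (W_Q, W_K, W_V) in heads:
        outs.append(attention(W_Q @ q, K @ W_K.T, V @ W_V.T))
    return np.concatenate(outs) @ W_O

# Toy sizes: model dimension 8, M = 2 heads of dimension 4 each (hypothetical).
rng = np.random.default_rng(2)
d, d_h, M = 8, 4, 2
heads = [tuple(rng.standard_normal((d_h, d)) for _ in range(3)) for _ in range(M)]
W_O = rng.standard_normal((M * d_h, d))
q = rng.standard_normal(d)
K = V = rng.standard_normal((5, d))
print(multi_head_attention(q, K, V, heads, W_O).shape)  # -> (8,)
```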
Dropout: at each forward pass, the output of a given neuron is dropped (set to zero) with probability $p$.
Implementation: Replace weights:
$$ \begin{aligned} \theta_{lji} &= \omega_{lji} \varepsilon_{li} \\ \varepsilon_{li} &\sim \text{Bernoulli}(1-p) \end{aligned} $$

Interpretation: Dropout forces the network to learn robust features that don't rely on any single neuron, preventing co-adaptation and improving generalization.
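A minimal sketch of dropout applied to a vector of activations, using the Bernoulli$(1-p)$ mask written above; the $1/(1-p)$ rescaling ("inverted dropout") is a standard practical detail not spelled out in the notes.

```python
import numpy as np

def dropout(h, p, rng, training=True):
    """Drop each activation with probability p (keep mask ~ Bernoulli(1-p)).

    The 1/(1-p) rescaling keeps the expected activation unchanged;
    at test time the layer is left untouched.
    """
    if not training or p == 0.0:
        return h
    keep = rng.binomial(1, 1.0 - p, size=h.shape)   # epsilon ~ Bernoulli(1-p)
    return h * keep / (1.0 - p)

rng = np.random.default_rng(3)
h = np.ones(10)
print(dropout(h, p=0.5, rng=rng))   # roughly half the entries are zeroed
```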
A Graph Attention Network (GAT) layer is built in the following steps.
Step 1: Input features. Each node $i$ in the node set $N$ has a feature vector $h_i^n \in \mathbb{R}^F$, where the superscript $n$ indexes the layer.
Step 2: Linear transform:
$$ h_i' = W h_i^n, \quad W \in \mathbb{R}^{F \times F} $$

Step 3: Pre-attention coefficients:
$$ e_{ij} = \text{LeakyReLU}(\vec{a}^\top [W h_i \| W h_j]) $$

where $\|$ denotes concatenation.
Step 4: Attention mechanism (softmax normalization):
$$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \sim i} \exp(e_{ik})} $$

Step 5: Aggregate:
$$ h_i^{n+1} = \sigma\left( \sum_{j \sim i} \alpha_{ij} W h_j^n \right) $$

Important note: stacking many such layers leads to the over-squashing and over-smoothing problems mentioned in the overview; a minimal one-layer implementation is sketched below.
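Putting steps 1 through 5 together, here is a dense NumPy sketch of a single-head GAT layer; the toy adjacency matrix, the nonlinearity $\sigma = \tanh$, and the sizes are assumptions for illustration only.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0.0, x, slope * x)

def gat_layer(H, A, W, a_vec, sigma=np.tanh):
    """One GAT layer (single head): h_i^{n+1} = sigma(sum_{j~i} alpha_ij W h_j^n).

    H: (N, F) node features; A: (N, N) adjacency with self-loops (1 if j ~ i);
    W: (F, F) shared linear map; a_vec: (2F,) attention parameter vector.
    """
    HW = H @ W.T                                   # step 2: W h_i for all nodes
    F = W.shape[0]
    # step 3: e_ij = LeakyReLU(a^T [W h_i || W h_j]) = LeakyReLU(a_1^T W h_i + a_2^T W h_j)
    e = leaky_relu(np.add.outer(HW @ a_vec[:F], HW @ a_vec[F:]))
    # step 4: softmax restricted to the neighbourhood j ~ i
    e = np.where(A > 0, e, -np.inf)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    # step 5: aggregate the transformed neighbour features
    return sigma(alpha @ HW)

# Toy graph: 4 nodes on a path, F = 3 features (hypothetical sizes).
rng = np.random.default_rng(4)
A = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]], dtype=float)
H = rng.standard_normal((4, 3))
W, a_vec = rng.standard_normal((3, 3)), rng.standard_normal(6)
print(gat_layer(H, A, W, a_vec).shape)             # -> (4, 3)
```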
Critical observation: With physics-inspired ideas, we will show how to keep such deep graph networks stable!
Key idea: Rewrite GAT update as a graph diffusion equation:
$$ h_i^{n+1} = h_i^n + \sigma\left( \sum_{j \sim i} \alpha_{ij} (h_j^n - h_i^n) \right) $$

or, in graph calculus:
$$ h^{n+1} = h^n + \sigma(\delta^* \alpha \delta h^n) $$

where $\delta$ is the graph coboundary (edge-difference) operator, $\alpha$ acts as an attention-induced edge weight, and the adjoint $\delta^*$ is defined by:
$$ \langle f, \delta^* g \rangle_\alpha = \langle \delta f, g \rangle_\alpha $$

Interpretation: View attention as inducing a weighted inner product.
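A sketch of one diffusion step on a toy path graph, with $\delta$ realized as the oriented incidence matrix and $\alpha$ as (symmetric) per-edge weights; it follows the componentwise form $h_i^{n+1} = h_i^n + \sigma\big(\sum_{j \sim i} \alpha_{ij}(h_j^n - h_i^n)\big)$, with $\delta^\top \operatorname{diag}(\alpha)\,\delta$ playing the role of the attention-weighted graph Laplacian. The graph, weights, and $\sigma = \tanh$ are assumptions for illustration.

```python
import numpy as np

def incidence_matrix(edges, n_nodes):
    """Coboundary delta: (delta h)_e = h_j - h_i for each oriented edge e = (i, j)."""
    D = np.zeros((len(edges), n_nodes))
    for e, (i, j) in enumerate(edges):
        D[e, i], D[e, j] = -1.0, 1.0
    return D

def diffusion_step(h, edges, alpha_e, n_nodes, sigma=np.tanh):
    """One step of h_i^{n+1} = h_i^n + sigma(sum_{j~i} alpha_ij (h_j^n - h_i^n))."""
    D = incidence_matrix(edges, n_nodes)
    L = D.T @ np.diag(alpha_e) @ D        # attention-weighted graph Laplacian
    return h + sigma(-L @ h)              # (-L h)_i = sum_{j~i} alpha_ij (h_j - h_i)

# Path graph on 4 nodes with uniform edge weights (a hypothetical toy example).
edges = [(0, 1), (1, 2), (2, 3)]
alpha_e = np.array([0.5, 0.5, 0.5])
h = np.array([1.0, 0.0, 0.0, -1.0])
for _ in range(3):
    h = diffusion_step(h, edges, alpha_e, 4)
print(h)                                   # node features are smoothed toward each other
```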
Writing $h^{n+1} = Q^n h^n$, given graph data $D = \{x_i, y_i\}$:
$$ \text{encoder} \to h^0 \to Q h^0 \to Q h^1 \to \cdots \to h^N \to \text{decoder} $$

Critical observation: This is a discrete dynamical system where the graph Laplacian provides inherent stability through diffusion.
Recall the Hamiltonian formulation:
$$ \frac{d}{dt} \begin{pmatrix} q \\ p \end{pmatrix} = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix} \begin{pmatrix} \partial_q E \\ \partial_p E \end{pmatrix} $$

where $E = \text{NN}(p, q)$ is the energy (learned by a neural network).
Key property: Energy conservation:
$$ \dot{E} = 0 $$

Same structure as GRAND (graph neural diffusion)!
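A sketch of the Hamiltonian dynamics with a hand-written quadratic energy standing in for the learned $E = \text{NN}(p, q)$; the finite-difference gradient and the symplectic-Euler step are implementation choices made here for illustration, not part of the lecture.

```python
import numpy as np

def grad(E, q, p, eps=1e-5):
    """Finite-difference gradients (dE/dq, dE/dp); a stand-in for autodiff of a learned E."""
    dq = (E(q + eps, p) - E(q - eps, p)) / (2 * eps)
    dp = (E(q, p + eps) - E(q, p - eps)) / (2 * eps)
    return dq, dp

def hamiltonian_step(E, q, p, dt=0.01):
    """d/dt (q, p) = (dE/dp, -dE/dq), integrated with a symplectic Euler update."""
    dq, dp = grad(E, q, p)
    q_new = q + dt * dp              # dq/dt =  dE/dp
    dq_new, _ = grad(E, q_new, p)
    p_new = p - dt * dq_new          # dp/dt = -dE/dq, evaluated at the updated q
    return q_new, p_new

# Toy "learned" energy: a harmonic oscillator standing in for E = NN(p, q).
E = lambda q, p: 0.5 * (p**2 + q**2)
q, p = 1.0, 0.0
for _ in range(1000):
    q, p = hamiltonian_step(E, q, p)
print(E(q, p))   # stays close to E(1, 0) = 0.5: the energy is (nearly) conserved
```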
For graph dynamics:
$$ \frac{d}{dt} \begin{pmatrix} q \\ p \end{pmatrix} = \begin{pmatrix} 0 & -\delta_0^* \\ \delta_0 & 0 \end{pmatrix} \begin{pmatrix} \partial_q E \\ \partial_p E \end{pmatrix} $$

Interpretation: The graph coboundary operators $\delta_0$ and $\delta_0^*$ play the role of the symplectic structure, ensuring energy conservation exactly at the discrete level.
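A sketch of this graph system on a small path graph, with $\delta_0$ the oriented incidence matrix, $\delta_0^* = \delta_0^\top$ (unweighted inner products), and a toy quadratic energy; the staggered (symplectic-Euler-style) update is an assumption made so that the near-conservation of energy is visible numerically.

```python
import numpy as np

# Oriented incidence (coboundary) matrix delta_0 of a path graph on 4 nodes.
D0 = np.array([[-1.0,  1.0,  0.0, 0.0],
               [ 0.0, -1.0,  1.0, 0.0],
               [ 0.0,  0.0, -1.0, 1.0]])

def step(q, p, dt=0.01):
    """dq/dt = -delta_0^* dE/dp, dp/dt = delta_0 dE/dq, with E = (|q|^2 + |p|^2)/2."""
    p = p + dt * (D0 @ q)        # dp/dt =  delta_0 q
    q = q + dt * (-D0.T @ p)     # dq/dt = -delta_0^* p (staggered for near-conservation)
    return q, p

q = np.array([1.0, 0.0, 0.0, -1.0])   # node signal
p = np.zeros(3)                        # edge signal
E0 = 0.5 * (q @ q + p @ p)
for _ in range(1000):
    q, p = step(q, p)
print(E0, 0.5 * (q @ q + p @ p))       # the energy stays close to its initial value
```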
The encoder-decoder pipeline:
$$ x \to h^0 \to Q h^0 \to Q h^1 \to \cdots \to Q h^N \to y $$

can be viewed as evolving a Hamiltonian system in latent space, where energy conservation provides implicit regularization.
Consider the dynamics:
$$ \frac{dx}{dt} = J \nabla E - J^\top J \nabla E $$

where $J$ is antisymmetric (as in the symplectic matrix above). Compute the energy rate of change:
$$ \begin{aligned} \frac{dE}{dt} &= \frac{\partial E}{\partial x} \cdot \frac{dx}{dt} \\ &= \frac{\partial E}{\partial x}^\top J \nabla E - \frac{\partial E}{\partial x}^\top J^\top J \nabla E \\ &= 0 - \|\nabla E\|_{J^\top J}^2 \\ &\leq 0 \end{aligned} $$

Key property: The first term (Hamiltonian part) conserves energy, while the second term (double-bracket part) dissipates energy.
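A sketch verifying the dissipation inequality numerically for a toy quadratic energy and the canonical antisymmetric $J$; forward Euler with a small step is enough to see the monotone decrease. The energy, step size, and initial state are arbitrary toy choices.

```python
import numpy as np

# Canonical antisymmetric structure matrix (2D phase space: x = (q, p)).
J = np.array([[0.0, 1.0], [-1.0, 0.0]])

def energy(x):
    return 0.5 * x @ x            # toy quadratic energy standing in for a learned E

def grad_E(x):
    return x                      # gradient of the quadratic energy above

def step(x, dt=0.01):
    """Forward Euler for dx/dt = J grad(E) - J^T J grad(E)."""
    g = grad_E(x)
    return x + dt * (J @ g - J.T @ J @ g)

x = np.array([1.0, 0.5])
energies = [energy(x)]
for _ in range(500):
    x = step(x)
    energies.append(energy(x))
print(np.all(np.diff(energies) <= 1e-12))   # True: dE/dt <= 0, energy only dissipates
```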
Interpretation: By combining Hamiltonian and double-bracket terms, we can design neural networks that conserve energy where conservation is desired and dissipate it where damping is needed, interpolating between energy-preserving and energy-dissipating dynamics.
This provides a principled way to build stable deep architectures with physics-inspired inductive biases.
where each layer $Q^k$ in the pipeline above implements either a conservative (Hamiltonian) update or a dissipative (double-bracket) update.
This lecture covered conditional neural fields, attention as a data-driven basis and its relation to finite element interpolation, the transformer ingredients (queries, keys, values, multi-head attention, dropout), graph attention networks and their reading as graph diffusion, Hamiltonian neural networks on graphs, and double-bracket dynamics.
Key Takeaway: Attention mechanisms can be viewed as learned finite element basis functions, generalizing interpolation from geometry to data. Graph attention networks implicitly define weighted graph Laplacians, connecting to diffusion equations and Hamiltonian mechanics. By recognizing these connections, we can design physics-inspired architectures that combine the expressiveness of modern deep learning with the stability and interpretability of classical numerical methods. The double-bracket formulation $\dot{x} = J \nabla E - J^\top J \nabla E$ provides a unified framework for energy-preserving and energy-dissipating dynamics, enabling neural networks that respect physical conservation laws while achieving stable training and generalization.