Transformers, Attention, and Hamiltonian Neural Networks
This lecture bridges classical attention mechanisms from natural language processing with physics-inspired neural network architectures. Having developed graph calculus and the Hodge decomposition in previous lectures, we now see how these mathematical structures naturally arise in modern deep learning architectures—particularly in transformers and graph neural networks.
The key insight is that attention can be viewed as a soft dictionary lookup or, equivalently, as a data-driven basis function similar to finite element methods. When we introduced finite elements, we saw that functions could be represented as $f(x) = \sum_i \alpha_i(x, x_i) y_i$ where $\alpha_i$ are piecewise linear basis functions. Attention generalizes this by learning the weights $\alpha_i(q, \vec{k})$ from data rather than fixing them geometrically.
We begin with conditional neural fields, where a latent code $z$ modulates input-output relationships—crucial for operator learning in PDEs. We then dissect the transformer architecture: queries, keys, values, multi-head attention, and dropout. Moving to graphs, we examine Graph Attention Networks (GATs), which suffer from over-squashing and over-smoothing problems. Physics-inspired solutions emerge from recognizing that GATs implicitly define graph Laplacians.
Finally, we introduce Hamiltonian neural networks and double-bracket dynamics, showing how energy-preserving and energy-dissipating systems can be built into neural architectures. These connections reveal deep relationships between optimization, dynamical systems, and modern machine learning.
Consider the conditional neural field:
$$ y = f(x \mid z; \theta) $$

Idea: The latent code $z$ modulates the $x \to y$ input-output relationship.
Examples:
Many options for implementing this:
Pedagogical note: Cross-attention provides the most flexible framework, allowing the model to dynamically select relevant features from the conditioning variable $z$ based on the query location $x$.
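As a minimal illustration (not the lecture's construction), here is a NumPy sketch of a conditional neural field in which $z$ is injected by simple concatenation with $x$ before a small MLP; the concatenation mechanism and the layer sizes are assumptions chosen only for concreteness.

```python
import numpy as np

def mlp_forward(inp, weights, biases):
    """Plain MLP with tanh hidden activations and a linear output layer."""
    h = inp
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)
    return h @ weights[-1] + biases[-1]

def conditional_field(x, z, weights, biases):
    """y = f(x | z; theta): condition by concatenating the latent code z to x."""
    inp = np.concatenate([x, z])
    return mlp_forward(inp, weights, biases)

# Hypothetical sizes: x in R^2, z in R^4, hidden width 16, scalar output.
rng = np.random.default_rng(0)
sizes = [2 + 4, 16, 16, 1]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = np.array([0.3, -0.7])
z_a, z_b = rng.standard_normal(4), rng.standard_normal(4)
print(conditional_field(x, z_a, weights, biases))  # same x, one latent code ...
print(conditional_field(x, z_b, weights, biases))  # ... different code, different output
```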
Reference: Murphy §15.4.1
Consider $(k_i, v_i)_{i=1}^N$ as key-value pairs describing stored input-output examples (the entries of a soft dictionary).
Consider a query $q$ representing where the model is evaluated.
Definition: The attention function is:
$$ \text{Attn}(q, \{k_i, v_i\}) = \sum_{i=1}^N \alpha_i(q, \vec{k}) v_i $$

where the attention weights satisfy:
$$ \begin{aligned} 0 &\leq \alpha_i \leq 1 \\ \sum_i \alpha_i &= 1 \end{aligned} $$

Introduce an attention score $a(q, k_i)$:
$$ \alpha_i = \text{softmax}(a(q, \vec{k}))_i = \frac{\exp[a(q, k_i)]}{\sum_j \exp[a(q, k_j)]} $$

Remark: This is a data-driven basis!
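A minimal NumPy sketch of the attention function, assuming a scaled dot-product score $a(q, k_i) = q^\top k_i / \sqrt{d}$ (the definition above leaves the score function unspecified, so this particular choice is an assumption):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())   # shift by the max for numerical stability
    return e / e.sum()

def attention(q, K, V):
    """Attn(q, {k_i, v_i}) = sum_i alpha_i(q, K) v_i with softmax weights."""
    d = K.shape[1]
    a = K @ q / np.sqrt(d)    # scores a(q, k_i); scaled dot product is an assumption
    alpha = softmax(a)        # 0 <= alpha_i <= 1 and sum_i alpha_i = 1
    return alpha @ V, alpha

# Toy example: 5 key-value pairs, keys/queries in R^3, values in R^2.
rng = np.random.default_rng(1)
K, V = rng.standard_normal((5, 3)), rng.standard_normal((5, 2))
q = rng.standard_normal(3)
out, alpha = attention(q, K, V)
print(alpha.sum())            # -> 1.0: a convex combination, i.e. a data-driven basis
```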
For example, taking $\alpha_i(q, \vec{k}) = \phi_i(q)$, with continuous piecewise linear (CPWL) basis functions $\phi_i$, satisfies the same properties:
$$ f(x) = \sum_i \alpha_i(x, x_i) y_i $$

where the $x_i$ are the interpolation nodes and $y_i = f(x_i)$ are the nodal values.
Critical observation: Attention generalizes finite element interpolation by learning the basis functions rather than fixing them geometrically!
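To make the analogy concrete, here is a sketch of 1D hat-function (CPWL) interpolation written in the same $f(x) = \sum_i \alpha_i(x) y_i$ form; the node locations and the test function are arbitrary toy choices. The weights are fixed by geometry rather than learned, but they are nonnegative and sum to one, exactly like attention weights.

```python
import numpy as np

def hat_weights(x, nodes):
    """CPWL 'hat' basis weights alpha_i(x) on a 1D mesh: nonnegative, sum to 1."""
    alpha = np.zeros(len(nodes))
    j = np.searchsorted(nodes, x) - 1            # interval [nodes[j], nodes[j+1]]
    j = np.clip(j, 0, len(nodes) - 2)
    t = (x - nodes[j]) / (nodes[j + 1] - nodes[j])
    alpha[j], alpha[j + 1] = 1 - t, t
    return alpha

nodes = np.linspace(0.0, 1.0, 6)                 # x_i
y = np.sin(2 * np.pi * nodes)                    # y_i = f(x_i)
x = 0.37
alpha = hat_weights(x, nodes)
print(alpha @ y)                                 # f(x) = sum_i alpha_i(x) y_i
print(alpha.sum())                               # -> 1.0, like attention weights
```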
For multi-head attention, introduce dense MLP blocks $Q_i$, $K_i$, $V_i$ for $i = 1, \ldots, M$ (the number of heads):
$$ \begin{aligned} \hat{q}_i &= Q_i(q) \\ \hat{k}_i &= K_i(k) \\ \hat{v}_i &= V_i(v) \end{aligned} $$

Then:
$$ h_i = \text{Attn}(\hat{q}_i, \{\hat{k}_i, \hat{v}_i\}) $$

Final aggregation:
$$ h = \text{MHA}(q, \{k, v\}) = \text{Concat}(h_1, \ldots, h_M) W^O $$

Interpretation: Each head learns to attend to different aspects of the input (e.g., position vs. semantic meaning), analogous to using multiple finite element spaces for different solution components.
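A sketch of multi-head attention assembled from the single-head attention above; the $Q_i, K_i, V_i$ blocks are taken to be plain linear maps (a common special case of the dense MLP blocks), and the dimensions are hypothetical.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attention(q, K, V):
    alpha = softmax(K @ q / np.sqrt(K.shape[1]))
    return alpha @ V

def multi_head_attention(q, K, V, heads, W_O):
    """h = Concat(h_1, ..., h_M) W^O with h_i = Attn(Q_i(q), K_i(k), V_i(v))."""
    outs = []
    for (W_Q, W_K, W_V) in heads:
        outs.append(attention(W_Q @ q, K @ W_K.T, V @ W_V.T))
    return np.concatenate(outs) @ W_O

# Toy sizes: model dimension 8, M = 2 heads of dimension 4 each (hypothetical).
rng = np.random.default_rng(2)
d, d_h, M = 8, 4, 2
heads = [tuple(rng.standard_normal((d_h, d)) for _ in range(3)) for _ in range(M)]
W_O = rng.standard_normal((M * d_h, d))
q = rng.standard_normal(d)
K = V = rng.standard_normal((5, d))
print(multi_head_attention(q, K, V, heads, W_O).shape)  # -> (8,)
```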
Dropout: at each forward pass, the output of a given neuron is dropped (set to zero) with probability $p$.
Implementation: Replace weights:
$$ \begin{aligned} \theta_{lji} &= \omega_{lji} \varepsilon_{li} \\ \varepsilon_{li} &\sim \text{Bernoulli}(1-p) \end{aligned} $$

Interpretation: Dropout forces the network to learn robust features that don't rely on any single neuron, preventing co-adaptation and improving generalization.
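A minimal sketch of dropout applied to a vector of activations, using the Bernoulli$(1-p)$ mask written above; the $1/(1-p)$ rescaling ("inverted dropout") is a standard practical detail not spelled out in the notes.

```python
import numpy as np

def dropout(h, p, rng, training=True):
    """Drop each activation with probability p (keep mask ~ Bernoulli(1-p)).

    The 1/(1-p) rescaling keeps the expected activation unchanged;
    at test time the layer is left untouched.
    """
    if not training or p == 0.0:
        return h
    keep = rng.binomial(1, 1.0 - p, size=h.shape)   # epsilon ~ Bernoulli(1-p)
    return h * keep / (1.0 - p)

rng = np.random.default_rng(3)
h = np.ones(10)
print(dropout(h, p=0.5, rng=rng))   # roughly half the entries are zeroed
```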
A Graph Attention Network (GAT) layer is built in the following steps.
Step 1: Input features. Each node $i$ in the node set $N$ has a feature vector $h_i^n \in \mathbb{R}^F$, where the superscript $n$ indexes the layer.
Step 2: Linear transform:
$$ h_i' = W h_i^n, \quad W \in \mathbb{R}^{F \times F} $$

Step 3: Pre-attention coefficients:
$$ e_{ij} = \text{LeakyReLU}(\vec{a}^\top [W h_i \| W h_j]) $$

where $\|$ denotes concatenation.
Step 4: Attention mechanism (softmax normalization):
$$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \sim i} \exp(e_{ik})} $$

Step 5: Aggregate:
$$ h_i^{n+1} = \sigma\left( \sum_{j \sim i} \alpha_{ij} W h_j^n \right) $$

Important note: stacking many such layers leads to the over-squashing and over-smoothing problems mentioned in the overview; a minimal one-layer implementation is sketched below.
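Putting steps 1 through 5 together, here is a dense NumPy sketch of a single-head GAT layer; the toy adjacency matrix, the nonlinearity $\sigma = \tanh$, and the sizes are assumptions for illustration only.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0.0, x, slope * x)

def gat_layer(H, A, W, a_vec, sigma=np.tanh):
    """One GAT layer (single head): h_i^{n+1} = sigma(sum_{j~i} alpha_ij W h_j^n).

    H: (N, F) node features; A: (N, N) adjacency with self-loops (1 if j ~ i);
    W: (F, F) shared linear map; a_vec: (2F,) attention parameter vector.
    """
    HW = H @ W.T                                   # step 2: W h_i for all nodes
    F = W.shape[0]
    # step 3: e_ij = LeakyReLU(a^T [W h_i || W h_j]) = LeakyReLU(a_1^T W h_i + a_2^T W h_j)
    e = leaky_relu(np.add.outer(HW @ a_vec[:F], HW @ a_vec[F:]))
    # step 4: softmax restricted to the neighbourhood j ~ i
    e = np.where(A > 0, e, -np.inf)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    # step 5: aggregate the transformed neighbour features
    return sigma(alpha @ HW)

# Toy graph: 4 nodes on a path, F = 3 features (hypothetical sizes).
rng = np.random.default_rng(4)
A = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]], dtype=float)
H = rng.standard_normal((4, 3))
W, a_vec = rng.standard_normal((3, 3)), rng.standard_normal(6)
print(gat_layer(H, A, W, a_vec).shape)             # -> (4, 3)
```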
Critical observation: With physics-inspired ideas, we will show how to keep such deep graph networks stable!
Key idea: Rewrite GAT update as a graph diffusion equation:
$$ h_i^{n+1} = h_i^n + \sigma\left( \sum_{j \sim i} \alpha_{ij} (h_j^n - h_i^n) \right) $$

or, in graph calculus:
$$ h^{n+1} = h^n + \sigma(\delta^* \alpha \delta h^n) $$

where $\delta$ is the graph coboundary (edge-difference) operator, $\alpha$ acts as an attention-induced edge weight, and the adjoint $\delta^*$ is defined by:
$$ \langle f, \delta^* g \rangle_\alpha = \langle \delta f, g \rangle_\alpha $$

Interpretation: View attention as inducing a weighted inner product.
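A sketch of one diffusion step on a toy path graph, with $\delta$ realized as the oriented incidence matrix and $\alpha$ as (symmetric) per-edge weights; it follows the componentwise form $h_i^{n+1} = h_i^n + \sigma\big(\sum_{j \sim i} \alpha_{ij}(h_j^n - h_i^n)\big)$, with $\delta^\top \operatorname{diag}(\alpha)\,\delta$ playing the role of the attention-weighted graph Laplacian. The graph, weights, and $\sigma = \tanh$ are assumptions for illustration.

```python
import numpy as np

def incidence_matrix(edges, n_nodes):
    """Coboundary delta: (delta h)_e = h_j - h_i for each oriented edge e = (i, j)."""
    D = np.zeros((len(edges), n_nodes))
    for e, (i, j) in enumerate(edges):
        D[e, i], D[e, j] = -1.0, 1.0
    return D

def diffusion_step(h, edges, alpha_e, n_nodes, sigma=np.tanh):
    """One step of h_i^{n+1} = h_i^n + sigma(sum_{j~i} alpha_ij (h_j^n - h_i^n))."""
    D = incidence_matrix(edges, n_nodes)
    L = D.T @ np.diag(alpha_e) @ D        # attention-weighted graph Laplacian
    return h + sigma(-L @ h)              # (-L h)_i = sum_{j~i} alpha_ij (h_j - h_i)

# Path graph on 4 nodes with uniform edge weights (a hypothetical toy example).
edges = [(0, 1), (1, 2), (2, 3)]
alpha_e = np.array([0.5, 0.5, 0.5])
h = np.array([1.0, 0.0, 0.0, -1.0])
for _ in range(3):
    h = diffusion_step(h, edges, alpha_e, 4)
print(h)                                   # node features are smoothed toward each other
```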
Writing $h^{n+1} = Q^n h^n$, given graph data $D = \{x_i, y_i\}$:
$$ \text{encoder} \to h^0 \to Q h^0 \to Q h^1 \to \cdots \to h^N \to \text{decoder} $$

Critical observation: This is a discrete dynamical system where the graph Laplacian provides inherent stability through diffusion.
Recall the Hamiltonian formulation:
$$ \frac{d}{dt} \begin{pmatrix} q \\ p \end{pmatrix} = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix} \begin{pmatrix} \partial_q E \\ \partial_p E \end{pmatrix} $$

where $E = \text{NN}(p, q)$ is the energy (learned by a neural network).
Key property: Energy conservation:
$$ \dot{E} = 0 $$

Same structure as GRAND (graph neural diffusion)!
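A sketch of the Hamiltonian dynamics with a hand-written quadratic energy standing in for the learned $E = \text{NN}(p, q)$; the finite-difference gradient and the symplectic-Euler step are implementation choices made here for illustration, not part of the lecture.

```python
import numpy as np

def grad(E, q, p, eps=1e-5):
    """Finite-difference gradients (dE/dq, dE/dp); a stand-in for autodiff of a learned E."""
    dq = (E(q + eps, p) - E(q - eps, p)) / (2 * eps)
    dp = (E(q, p + eps) - E(q, p - eps)) / (2 * eps)
    return dq, dp

def hamiltonian_step(E, q, p, dt=0.01):
    """d/dt (q, p) = (dE/dp, -dE/dq), integrated with a symplectic Euler update."""
    dq, dp = grad(E, q, p)
    q_new = q + dt * dp              # dq/dt =  dE/dp
    dq_new, _ = grad(E, q_new, p)
    p_new = p - dt * dq_new          # dp/dt = -dE/dq, evaluated at the updated q
    return q_new, p_new

# Toy "learned" energy: a harmonic oscillator standing in for E = NN(p, q).
E = lambda q, p: 0.5 * (p**2 + q**2)
q, p = 1.0, 0.0
for _ in range(1000):
    q, p = hamiltonian_step(E, q, p)
print(E(q, p))   # stays close to E(1, 0) = 0.5: the energy is (nearly) conserved
```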
For graph dynamics:
$$ \frac{d}{dt} \begin{pmatrix} q \\ p \end{pmatrix} = \begin{pmatrix} 0 & -\delta_0^* \\ \delta_0 & 0 \end{pmatrix} \begin{pmatrix} \partial_q E \\ \partial_p E \end{pmatrix} $$

Interpretation: The graph coboundary operators $\delta_0$ and $\delta_0^*$ play the role of the symplectic structure, ensuring energy conservation exactly at the discrete level.
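A sketch of this graph system on a small path graph, with $\delta_0$ the oriented incidence matrix, $\delta_0^* = \delta_0^\top$ (unweighted inner products), and a toy quadratic energy; the staggered (symplectic-Euler-style) update is an assumption made so that the near-conservation of energy is visible numerically.

```python
import numpy as np

# Oriented incidence (coboundary) matrix delta_0 of a path graph on 4 nodes.
D0 = np.array([[-1.0,  1.0,  0.0, 0.0],
               [ 0.0, -1.0,  1.0, 0.0],
               [ 0.0,  0.0, -1.0, 1.0]])

def step(q, p, dt=0.01):
    """dq/dt = -delta_0^* dE/dp, dp/dt = delta_0 dE/dq, with E = (|q|^2 + |p|^2)/2."""
    p = p + dt * (D0 @ q)        # dp/dt =  delta_0 q
    q = q + dt * (-D0.T @ p)     # dq/dt = -delta_0^* p (staggered for near-conservation)
    return q, p

q = np.array([1.0, 0.0, 0.0, -1.0])   # node signal
p = np.zeros(3)                        # edge signal
E0 = 0.5 * (q @ q + p @ p)
for _ in range(1000):
    q, p = step(q, p)
print(E0, 0.5 * (q @ q + p @ p))       # the energy stays close to its initial value
```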
The encoder-decoder pipeline:
$$ x \to h^0 \to Q h^0 \to Q h^1 \to \cdots \to Q h^N \to y $$

can be viewed as evolving a Hamiltonian system in latent space, where energy conservation provides implicit regularization.
Consider the dynamics:
$$ \frac{dx}{dt} = J \nabla E - J^\top J \nabla E $$

where $J$ is antisymmetric (as in the symplectic matrix above). Compute the energy rate of change:
$$ \begin{aligned} \frac{dE}{dt} &= \frac{\partial E}{\partial x} \cdot \frac{dx}{dt} \\ &= \frac{\partial E}{\partial x}^\top J \nabla E - \frac{\partial E}{\partial x}^\top J^\top J \nabla E \\ &= 0 - \|\nabla E\|_{J^\top J}^2 \\ &\leq 0 \end{aligned} $$

Key property: The first term (Hamiltonian part) conserves energy, while the second term (double-bracket part) dissipates energy.
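A sketch verifying the dissipation inequality numerically for a toy quadratic energy and the canonical antisymmetric $J$; forward Euler with a small step is enough to see the monotone decrease. The energy, step size, and initial state are arbitrary toy choices.

```python
import numpy as np

# Canonical antisymmetric structure matrix (2D phase space: x = (q, p)).
J = np.array([[0.0, 1.0], [-1.0, 0.0]])

def energy(x):
    return 0.5 * x @ x            # toy quadratic energy standing in for a learned E

def grad_E(x):
    return x                      # gradient of the quadratic energy above

def step(x, dt=0.01):
    """Forward Euler for dx/dt = J grad(E) - J^T J grad(E)."""
    g = grad_E(x)
    return x + dt * (J @ g - J.T @ J @ g)

x = np.array([1.0, 0.5])
energies = [energy(x)]
for _ in range(500):
    x = step(x)
    energies.append(energy(x))
print(np.all(np.diff(energies) <= 1e-12))   # True: dE/dt <= 0, energy only dissipates
```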
Interpretation: By combining Hamiltonian and double-bracket terms, we can design neural networks that conserve energy where conservation is desired and dissipate it where damping is needed, interpolating between energy-preserving and energy-dissipating dynamics.
This provides a principled way to build stable deep architectures with physics-inspired inductive biases.
where each layer $Q^k$ in the pipeline above implements either a conservative (Hamiltonian) update or a dissipative (double-bracket) update.
This lecture covered conditional neural fields, attention as a data-driven basis and its relation to finite element interpolation, the transformer ingredients (queries, keys, values, multi-head attention, dropout), graph attention networks and their reading as graph diffusion, Hamiltonian neural networks on graphs, and double-bracket dynamics.
Key Takeaway: Attention mechanisms can be viewed as learned finite element basis functions, generalizing interpolation from geometry to data. Graph attention networks implicitly define weighted graph Laplacians, connecting to diffusion equations and Hamiltonian mechanics. By recognizing these connections, we can design physics-inspired architectures that combine the expressiveness of modern deep learning with the stability and interpretability of classical numerical methods. The double-bracket formulation $\dot{x} = J \nabla E - J^\top J \nabla E$ provides a unified framework for energy-preserving and energy-dissipating dynamics, enabling neural networks that respect physical conservation laws while achieving stable training and generalization.