Lecture 11: Stochastic Differential Equations and Probabilistic Learning

Topics Covered: Integrating probability into learned models, stochastic dynamics, maximum likelihood estimation (MLE), Markov processes, Euler-Maruyama integration, metriplectic/GENERIC formalism

0. Overview

Up until now, our focus has been on deterministic systems governed by variational principles and Hamiltonian mechanics. However, real-world physical systems invariably involve uncertainty: sensor noise, incomplete physics descriptions, and stochastic forcing all require us to think probabilistically. This lecture marks a crucial expansion of our framework to handle stochastic differential equations (SDEs), which combine deterministic dynamics with random processes.

The motivation for this probabilistic perspective is threefold. First, tracking a single point trajectory is often insufficient—we need to track entire probability distributions that evolve under both deterministic forces and random fluctuations. Second, even when our measurements are noisy or our physics model is incomplete, we can still learn meaningful dynamics by fitting probabilistic models to data using maximum likelihood estimation. Third, and perhaps most surprisingly, even in the absence of noise, training models in a probabilistic context often provides better regularization and more robust learning.

We'll build our SDE framework systematically, starting with a probability review, then introducing the Wiener process (Brownian motion) and showing how to integrate SDEs driven by it using the Euler-Maruyama method. Using Markov process structure, we'll derive a negative log-likelihood (NLL) loss function that enables us to fit learnable SDE models to time series data. Finally, we'll see an elegant example of structure-preserving SDEs through the metriplectic (GENERIC) formalism, which conserves energy while ensuring that entropy never decreases, generalizing Hamiltonian mechanics to irreversible thermodynamic systems.

1. Motivation for Probabilistic Dynamics

1.1 From Point Tracking to Probability Tracking

Instead of tracking a single point trajectory, we often need to track a "blob" of probability that evolves over time. This arises naturally in several contexts, including sensor noise, incomplete physics descriptions, and stochastic forcing.

1.2 Probabilistic Regularization

Even without noise, training a model in a probabilistic context is often easier, because probabilistic objectives provide natural regularization.

1.3 Recommended References for Probability

2. Probability Review

2.1 Random Variables and Distributions

Definition: A continuous random variable $\bar{X}$ takes a random value $x \in \mathbb{R}$.

Cumulative distribution function (CDF) defines probability over a range of values:

$$F_{\bar{X}}(x) = \mathbb{P}(\bar{X} \leq x)$$

If the CDF is differentiable, we define the probability density function (PDF):

$$f_{\bar{X}}(x) = \frac{d}{dx} F_{\bar{X}}(x)$$

Thus:

$$\mathbb{P}(a \leq \bar{X} \leq b) = \int_a^b f_{\bar{X}}(x) \, dx = F_{\bar{X}}(b) - F_{\bar{X}}(a)$$
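As a quick numerical sanity check of the identity above, the sketch below integrates a PDF over an interval and compares the result with the difference of CDF values. The standard normal distribution, the interval endpoints, and the helper names (`normal_pdf`, `normal_cdf`, `integrate_pdf`) are assumptions chosen purely for illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density f(x)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Gaussian CDF F(x), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def integrate_pdf(a, b, n=10_000):
    """Midpoint-rule approximation of the integral of the PDF over [a, b]."""
    h = (b - a) / n
    return h * sum(normal_pdf(a + (i + 0.5) * h) for i in range(n))

a, b = -1.0, 2.0  # hypothetical interval
prob_from_pdf = integrate_pdf(a, b)
prob_from_cdf = normal_cdf(b) - normal_cdf(a)
print(prob_from_pdf, prob_from_cdf)
```

The two numbers should agree up to the quadrature error of the midpoint rule.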

2.2 Joint Distributions and Conditioning

Probability chain rule:

$$\begin{aligned} f(x_1, \ldots, x_N) &= f(x_2, \ldots, x_N \mid x_1) f(x_1) \\ &= f(x_3, \ldots, x_N \mid x_1, x_2) f(x_2 \mid x_1) f(x_1) \\ &= f(x_N \mid x_1, \ldots, x_{N-1}) \cdots f(x_2 \mid x_1) f(x_1) \end{aligned}$$

2.3 Independence

3. Fitting Distributions to Data with Maximum Likelihood

3.1 Maximum Likelihood Estimation (MLE)

Define the likelihood by evaluating the parameterized joint distribution:

$$\mathcal{L}(\theta) = f(x_1, \ldots, x_{N_{\text{data}}} \mid \theta)$$

Fit the distribution to data by solving:

$$\theta^* = \arg\min_\theta \underbrace{-\log \mathcal{L}(\theta)}_{\text{NLL (Negative Log-Likelihood)}}$$

3.2 Example: Fitting a Gaussian Distribution

Step 1: Assume independent data:

$$f(x_1, \ldots, x_N \mid \theta) = \prod_{i=1}^N f(x_i \mid \theta)$$

Step 2: Choose marginal distribution as Gaussian:

$$f(x; \theta) = \mathcal{N}(x; \mu, \sigma^2) \quad \text{where} \quad \theta = \{\mu, \sigma^2\}$$

$$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$$

Step 3: Compute log-likelihood:

$$\log \mathcal{N}(x; \mu, \sigma^2) = -C - \frac{1}{2} \log \sigma^2 - \frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2, \quad C = \frac{1}{2}\log(2\pi)$$

Step 4: Write negative log-likelihood, dropping the additive constant $NC$, which does not depend on $\theta$:

$$\text{NLL} = \sum_{i=1}^N \left[ \frac{1}{2} \log \sigma^2 + \frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2 \right]$$

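Step 5: Set $\partial \text{NLL} / \partial \mu = 0$ and $\partial \text{NLL} / \partial \sigma^2 = 0$, then solve for $\mu^*$ and $\sigma^{2*}$:

$$\boxed{\mu^* = \frac{1}{N}\sum_{i=1}^N x_i}$$

$$\boxed{\sigma^{2*} = \frac{1}{N}\sum_{i=1}^N (x_i - \mu^*)^2}$$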

Result: MLE recovers the sample mean and the sample variance (the biased estimator, normalized by $1/N$ rather than $1/(N-1)$).
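The closed-form result can be checked numerically. The following is a minimal sketch, assuming synthetic IID Gaussian data with hypothetical true parameters; the function name `gaussian_nll` is illustrative, and the constant term of the NLL is omitted as in Step 4.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma = 2.0, 0.5  # hypothetical ground truth
x = rng.normal(true_mu, true_sigma, size=5000)

def gaussian_nll(mu, sigma2, x):
    """NLL of IID Gaussian data, without the theta-independent constant."""
    return np.sum(0.5 * np.log(sigma2) + 0.5 * (x - mu) ** 2 / sigma2)

# Closed-form maximum likelihood estimates
mu_star = x.mean()
sigma2_star = np.mean((x - mu_star) ** 2)  # 1/N normalization, same as x.var()

nll_star = gaussian_nll(mu_star, sigma2_star, x)
print(mu_star, sigma2_star, nll_star)
```

Perturbing either estimate and re-evaluating `gaussian_nll` should give a larger value, which is a simple way to confirm that the closed-form solution is a minimizer.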

4. Stochastic Differential Equations

4.1 From ODEs to SDEs

Critical observation: We will see that random forcing leads to solutions which aren't differentiable, so it's necessary to interpret them in the integral sense.

4.2 SDE Formulation

To account for stochastic terms, we consider an SDE:

$$dx_t = \underbrace{f(x_t, t)}_{\text{drift}} \, dt + \underbrace{g(x_t, t) \, dW_t}_{\text{diffusion (stochastic process)}}$$

where:

$f(x_t, t)$ is the drift (deterministic part)
$g(x_t, t)$ is the diffusion coefficient
$W_t$ is a Wiener process (Brownian motion)

5. The Wiener Process

5.1 Definition

We'll consider the Wiener process $W_t = \int_0^t dW_\tau$ defined via:

Initial condition: $W_0 = 0$
Independent increments: For $t > 0, u \geq 0$, the increment $W_{t+u} - W_t$ is independent of $W_s$ for any $s \leq t$
Gaussian increments with variance equal to time increment: $$W_{t+u} - W_t \sim \mathcal{N}(0, u)$$
Continuity: $W_t$ is continuous (almost surely)
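The defining properties suggest a direct way to sample discretized Wiener paths: start at zero and accumulate independent Gaussian increments whose variance equals the time step. The sketch below does this with NumPy; the function name `sample_wiener_paths` and all numerical values are assumptions for illustration.

```python
import numpy as np

def sample_wiener_paths(n_paths, n_steps, dt, rng):
    """Return an array of shape (n_paths, n_steps + 1) with W_0 = 0."""
    # Independent increments, each distributed as N(0, dt)
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    zeros = np.zeros((n_paths, 1))
    return np.concatenate([zeros, np.cumsum(dW, axis=1)], axis=1)

rng = np.random.default_rng(0)
W = sample_wiener_paths(n_paths=20000, n_steps=100, dt=0.01, rng=rng)

var_T = W[:, -1].var()              # estimate of Var[W_1], expected near 1
inc = W[:, 100] - W[:, 50]          # increment over a time span of 0.5
corr = np.corrcoef(W[:, 50], inc)[0, 1]  # expected near 0 (independence)
print(var_T, inc.var(), corr)
```

The empirical variance of an increment should track the elapsed time, and the increment should be uncorrelated with the earlier value of the path.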

5.2 Construction

Example construction: Let $\xi_1, \xi_2, \ldots$ be IID random variables with $\mathbb{E}[\xi_i] = 0$ and $\operatorname{Var}[\xi_i] = 1$. Define:
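One standard construction consistent with this setup (assumed here for illustration) is the scaled random walk $W^{(n)}_t = \frac{1}{\sqrt{n}} \sum_{k=1}^{\lfloor nt \rfloor} \xi_k$, which converges in distribution to a Wiener process as $n \to \infty$ by Donsker's theorem. The sketch below uses $\pm 1$ coin flips for the $\xi_k$, which have mean 0 and variance 1; the function name `scaled_random_walk` and the parameter values are hypothetical.

```python
import numpy as np

def scaled_random_walk(t, n, n_paths, rng):
    """Sample W^(n)_t = n^(-1/2) * sum_{k <= floor(n t)} xi_k for many paths."""
    n_terms = int(np.floor(n * t))
    # xi_k = +1 or -1 with equal probability: mean 0, variance 1
    xi = rng.choice([-1.0, 1.0], size=(n_paths, n_terms))
    return xi.sum(axis=1) / np.sqrt(n)

rng = np.random.default_rng(1)
samples = scaled_random_walk(t=2.0, n=200, n_paths=5000, rng=rng)
print(samples.mean(), samples.var())  # expected near 0 and near t = 2
```

Even though each step is a coin flip rather than a Gaussian, the sample variance at time $t$ should be close to $t$, matching the Gaussian-increment property of the Wiener process.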

5.3 Properties

6. Integrating Against the Wiener Process

6.1 Riemann-Like Construction

Problem: This leads to many pathological issues that violate usual assumptions from standard calculus.

In a probability class, we would get into the details of Itô versus Stratonovich calculus, but this intuition is enough to pose a simple scheme for solving SDEs in our learning problems.

7. Euler-Maruyama Method

7.1 Discretization Scheme

Given the SDE:

$$dx_t = f(x_t, t) \, dt + g(x_t, t) \, dW_t$$

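Solve for $x_n \approx x(t_n)$ on the grid $t_n = nk$, $n = 0, 1, 2, \ldots$, with step size $k$:

$$\boxed{x_{n+1} = x_n + k f(x_n, t_n) + g(x_n, t_n) \, \xi_n}$$

where the $\xi_n$ are independent samples of the Wiener increment $W_{t_{n+1}} - W_{t_n}$:

$$\xi_n \sim \mathcal{N}(0, k)$$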

Note: Some better integrators exist (see Milstein's method) but require a deeper dive into stochastic calculus.
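The update rule translates almost line by line into code. Below is a minimal scalar sketch; the function name `euler_maruyama` and the Ornstein-Uhlenbeck test problem (drift $-\theta x$, constant diffusion $\sigma$) are assumptions for illustration, not part of the scheme itself.

```python
import numpy as np

def euler_maruyama(f, g, x0, k, n_steps, rng):
    """Integrate dx = f(x, t) dt + g(x, t) dW with step size k."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    for n in range(n_steps):
        t_n = n * k
        xi = rng.normal(0.0, np.sqrt(k))  # Wiener increment, N(0, k)
        x[n + 1] = x[n] + k * f(x[n], t_n) + g(x[n], t_n) * xi
    return x

# Hypothetical test problem: Ornstein-Uhlenbeck process
theta, sigma = 1.0, 0.5
drift = lambda x, t: -theta * x
diffusion = lambda x, t: sigma

rng = np.random.default_rng(0)
path = euler_maruyama(drift, diffusion, x0=2.0, k=0.01, n_steps=1000, rng=rng)
print(path[:5])
```

Setting the diffusion to zero reduces the loop to the forward Euler method for the ODE $\dot{x} = f(x, t)$, which is a convenient way to test an implementation.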

8. Markov Processes and MLE for SDEs

8.1 Markov Process Definition

Definition: For a $K$-th order Markov process, given a discrete time series $\vec{y} = \langle y_0, y_1, \ldots, y_N \rangle$:

$$P(y_i \mid y_0, \ldots, y_{i-1}) = P(y_i \mid y_{i-1}, \ldots, y_{i-K})$$

This means you only need to model the last $K$ timesteps → this is like a multi-step integrator with $K$ steps.

8.2 Proof of Transition Distribution

Claim: If $X \sim \mathcal{N}(\mu, \sigma^2)$, then:

$$AX + b \sim \mathcal{N}(A\mu + b, \, A^2 \sigma^2)$$

For Euler-Maruyama:

$$x_{n+1} = x_n + kf_n + g_n \xi_n$$

where $\xi_n \sim \mathcal{N}(0, k)$.

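Identify, conditioning on $x_n$ so that $f_n = f(x_n, t_n)$ and $g_n = g(x_n, t_n)$ are fixed:

$X = \xi_n$
$\mu = 0$
$\sigma^2 = k$
$A = g_n$
$b = x_n + kf_n$

Therefore:

$$x_{n+1} \mid x_n \sim \mathcal{N}(x_n + kf_n, \, k g_n^2) \quad \checkmark$$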
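The transition distribution can also be checked empirically: take many Euler-Maruyama steps from the same $x_n$ and compare the sample mean and variance of $x_{n+1}$ with $x_n + k f_n$ and $k g_n^2$. The values of $x_n$, $k$, $f_n$, and $g_n$ below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

x_n, k = 1.5, 0.1
f_n = -2.0 * x_n  # hypothetical drift evaluated at x_n
g_n = 0.7         # hypothetical diffusion coefficient at x_n

# Many independent single steps from the same starting point
xi = rng.normal(0.0, np.sqrt(k), size=200_000)
x_next = x_n + k * f_n + g_n * xi

emp_mean, emp_var = x_next.mean(), x_next.var()
pred_mean, pred_var = x_n + k * f_n, k * g_n ** 2
print(emp_mean, pred_mean)
print(emp_var, pred_var)
```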

8.3 Deriving the NLL for Trainable SDEs

Step 1: Apply the chain rule using the (first-order) Markov property, conditioning on the initial observation $y_0$:

$$\begin{aligned} -\log p(y_1, \ldots, y_N \mid y_0) &= -\log \left[ p(y_N \mid y_{N-1}) p(y_{N-1} \mid y_{N-2}) \cdots p(y_1 \mid y_0) \right] \\ &= -\sum_{i=1}^N \log p(y_i \mid y_{i-1}) \end{aligned}$$

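Step 2: Substitute the Gaussian transition probabilities from Section 8.2, with the data $y_i$ in the role of $x_i$ and with $f_{i-1} = f(y_{i-1}, t_{i-1})$, $g_{i-1} = g(y_{i-1}, t_{i-1})$:

$$= -\sum_{i=1}^N \log \mathcal{N}(y_i; \, y_{i-1} + kf_{i-1}, \, kg_{i-1}^2)$$

Step 3: Expand log-Gaussian:

$$\text{NLL} = C + \sum_{i=1}^N \left[ \frac{1}{2} \log(kg_{i-1}^2) + \frac{1}{2} \frac{(y_i - y_{i-1} - kf_{i-1})^2}{kg_{i-1}^2} \right]$$

where $C = \frac{N}{2} \log(2\pi)$ does not depend on the model.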
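A minimal sketch of this loss for a scalar time series follows. It assumes uniformly spaced observations with step size $k$ and drift and diffusion callables that accept NumPy arrays; the function name `em_nll` and its signature are hypothetical. Each transition is scored with a Gaussian whose mean and variance come from one Euler-Maruyama step started at the previous observation.

```python
import numpy as np

def em_nll(y, k, f, g, t0=0.0):
    """Negative log-likelihood of y[1:] given y[0] under Euler-Maruyama transitions."""
    y = np.asarray(y, dtype=float)
    y_prev, y_next = y[:-1], y[1:]
    t_prev = t0 + k * np.arange(len(y_prev))

    mean = y_prev + k * f(y_prev, t_prev)  # transition mean
    var = k * g(y_prev, t_prev) ** 2       # transition variance

    const = 0.5 * len(y_prev) * np.log(2.0 * np.pi)  # the constant C
    return const + np.sum(0.5 * np.log(var) + 0.5 * (y_next - mean) ** 2 / var)

# Hypothetical usage with a linear drift and constant diffusion
y_obs = [0.0, 0.1, -0.2, 0.05]
loss = em_nll(y_obs, k=0.5, f=lambda y, t: -y, g=lambda y, t: 2.0 + 0.0 * y)
print(loss)
```

Because the loss is a plain sum over transitions, it can be evaluated on mini-batches of consecutive pairs, which is convenient when the drift and diffusion are neural networks.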

8.4 Learning SDEs from Data

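If we replace $f(x, t)$ with $f(x, t; \theta)$ and $g(x, t)$ with $g(x, t; \theta)$ (i.e., make the SDE trainable), we can fit the parameters by minimizing the NLL from Section 8.3:

$$\theta^* = \arg\min_\theta \text{NLL}(\theta)$$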

This provides a principled way to learn stochastic dynamics from noisy time series data!
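As a small worked instance, consider a hypothetical trainable model with linear drift $f(x; \theta) = -\theta x$ and constant diffusion $g = \sigma$. For this special case the Euler-Maruyama NLL can be minimized in closed form: setting its derivatives with respect to $\theta$ and $\sigma^2$ to zero gives a least-squares estimate of $\theta$ and a residual-variance estimate of $\sigma^2$. The sketch below generates data with the same Euler-Maruyama scheme (so the model is exactly specified) and recovers the parameters. General drift and diffusion models, such as neural networks, would instead need a gradient-based optimizer.

```python
import numpy as np

def simulate_ou(theta, sigma, x0, k, n_steps, rng):
    """Euler-Maruyama samples of dx = -theta * x dt + sigma dW."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    xi = rng.normal(0.0, np.sqrt(k), size=n_steps)
    for n in range(n_steps):
        x[n + 1] = x[n] - k * theta * x[n] + sigma * xi[n]
    return x

def fit_ou_mle(x, k):
    """Closed-form minimizer of the Euler-Maruyama NLL for this linear model."""
    x_prev, dx = x[:-1], np.diff(x)
    # d(NLL)/d(theta) = 0  ->  least-squares slope of dx against x_prev
    theta_hat = -np.sum(dx * x_prev) / (k * np.sum(x_prev ** 2))
    # d(NLL)/d(sigma^2) = 0  ->  mean squared residual divided by k
    resid = dx + k * theta_hat * x_prev
    sigma2_hat = np.mean(resid ** 2) / k
    return theta_hat, sigma2_hat

rng = np.random.default_rng(0)
x = simulate_ou(theta=1.5, sigma=0.4, x0=1.0, k=0.01, n_steps=200_000, rng=rng)
theta_hat, sigma2_hat = fit_ou_mle(x, k=0.01)
print(theta_hat, sigma2_hat)  # compare with 1.5 and 0.4**2 = 0.16
```

The diffusion estimate is typically much tighter than the drift estimate, since every transition carries information about the noise scale while the drift is only identifiable over long time horizons.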

9. Example: Structure-Preserving SDEs

9.1 Metriplectic / GENERIC Formalism

The metriplectic (also called GENERIC: General Equation for Non-Equilibrium Reversible-Irreversible Coupling) formalism provides a structure-preserving SDE framework:

$$\boxed{dx_t = \left(L \partial_x E + M \partial_x S + K_B \partial_x \cdot M\right) dt + \sqrt{2K_B M} \, dW_t}$$

where:

$E$ = energy (conserved functional)
$S$ = entropy (non-decreasing functional)
$L = -L^T$ = Poisson operator (skew-symmetric)
$M = M^T \geq 0$ = friction operator (symmetric, positive semi-definite)
$K_B$ = Boltzmann constant

Degeneracy conditions:

$$\begin{aligned} L \partial_x S &= 0 \\ M \partial_x E &= 0 \end{aligned}$$

9.2 Deterministic Limit ($K_B \to 0$)

Theorem 1: Energy is Conserved

Proof:

$$\begin{aligned} \frac{dE}{dt} &= \partial_x E^\top \frac{dx}{dt} \\ &= \partial_x E^\top (L \partial_x E + M \partial_x S) \\ &= \underbrace{\partial_x E^\top L \partial_x E}_{= 0 \text{ (skew-symmetry)}} + \underbrace{\partial_x E^\top M \partial_x S}_{= 0 \text{ (degeneracy)}} \\ &= 0 \end{aligned}$$

Therefore: $\frac{dE}{dt} = 0$ $\quad \checkmark$

Theorem 2: Entropy is Non-Decreasing

Proof:

$$\begin{aligned} \frac{dS}{dt} &= \partial_x S^\top \frac{dx}{dt} \\ &= \partial_x S^\top (L \partial_x E + M \partial_x S) \\ &= \underbrace{\partial_x S^\top L \partial_x E}_{= 0 \text{ (degeneracy)}} + \underbrace{\partial_x S^\top M \partial_x S}_{\geq 0 \text{ (positive semi-definite)}} \\ &\geq 0 \end{aligned}$$

Therefore: $\frac{dS}{dt} \geq 0$ $\quad \checkmark$

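Remark: Metriplectic structure is therefore a generalization of Hamiltonian mechanics, with the degeneracy conditions preventing cross-terms between reversible ($L$) and irreversible ($M$) parts. Explicitly, $\partial_x E^\top M \partial_x S = (M \partial_x E)^\top \partial_x S = 0$ by the symmetry of $M$, and $\partial_x S^\top L \partial_x E = -(L \partial_x S)^\top \partial_x E = 0$ by the skew-symmetry of $L$. This allows simultaneous treatment of conservative and dissipative dynamics in a geometrically consistent framework.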
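Both theorems can be verified numerically on a toy system. The example below is an assumption made for illustration: a damped oscillator with state $x = (q, p, s)$, energy $E = \frac{1}{2}p^2 + \frac{1}{2}q^2 + s$, entropy $S = s$, the canonical Poisson matrix in the $(q, p)$ block, and friction matrix $M = \gamma \, v v^\top$ with $v = (0, 1, -p)$, chosen so that $M \partial_x E = 0$. The resulting deterministic dynamics are $\dot{q} = p$, $\dot{p} = -q - \gamma p$, $\dot{s} = \gamma p^2$.

```python
import numpy as np

gamma = 0.3  # hypothetical friction coefficient

def grad_E(x):
    q, p, s = x
    return np.array([q, p, 1.0])   # E = p^2/2 + q^2/2 + s

def grad_S(x):
    return np.array([0.0, 0.0, 1.0])  # S = s

# Skew-symmetric Poisson matrix acting on (q, p); s is a Casimir direction
L = np.array([[0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])

def M(x):
    """Symmetric PSD friction matrix, built so that M @ grad_E(x) = 0."""
    q, p, s = x
    v = np.array([0.0, 1.0, -p])
    return gamma * np.outer(v, v)

def rhs(x):
    """Deterministic metriplectic vector field L dE + M dS."""
    return L @ grad_E(x) + M(x) @ grad_S(x)

rng = np.random.default_rng(0)
states = rng.normal(size=(100, 3))

dE_dt = np.array([grad_E(x) @ rhs(x) for x in states])
dS_dt = np.array([grad_S(x) @ rhs(x) for x in states])
degeneracy_L = max(np.abs(L @ grad_S(x)).max() for x in states)
degeneracy_M = max(np.abs(M(x) @ grad_E(x)).max() for x in states)
print(np.abs(dE_dt).max(), dS_dt.min(), degeneracy_L, degeneracy_M)
```

At every sampled state, the energy rate vanishes up to round-off and the entropy rate equals $\gamma p^2 \geq 0$, so the kinetic energy lost to friction reappears as internal energy carried by $s$.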

Summary

This lecture covered:

Motivation for probabilistic dynamics: tracking distributions, handling uncertainty
Probability review: random variables, distributions, conditioning, independence
Maximum likelihood estimation (MLE) for fitting distributions to data
Stochastic differential equations (SDEs) with drift and diffusion
Wiener process definition and properties (continuous but nowhere differentiable)
Euler-Maruyama method for discretizing SDEs
Markov processes and deriving NLL loss for learning SDEs from data
Metriplectic/GENERIC formalism as structure-preserving SDEs with energy conservation and entropy increase

Key Takeaway: Stochastic differential equations provide a principled framework for learning dynamics from noisy data while preserving physical structure. By formulating SDEs with trainable drift and diffusion, we can fit probabilistic models using maximum likelihood estimation. The Euler-Maruyama discretization, combined with the Markov property, yields a tractable negative log-likelihood loss. For physics-informed learning, the metriplectic formalism shows how to design SDEs that simultaneously conserve energy and satisfy the second law of thermodynamics—a beautiful generalization that unifies reversible Hamiltonian dynamics with irreversible dissipation.