Stochastic Processes and Differential Equations
Continuous Processes

Partial Differential Equations from Continuous Stochastic Processes


This is a quick tutorial on the deep connection between elliptic/parabolic systems and stochastic processes. The connection gives a lot of useful intuition about how to interpret elliptic/parabolic PDEs as well as Laplacian systems on graphs. This part covers continuous-time stochastic processes and Ito calculus.

This tutorial is meant for students at the graduate or advanced undergraduate level.

Moving to Continuous Stochastic Processes: Ito's Formula

Here is where we finally make the jump to continuous time and space. To begin this conversion process, we need to note that the fundamental analogue of the stochastic graph process $X_t$ is going to be a continuous space/time Ito process $X_t$ defined by the stochastic differential equation

\begin{equation} dX_t = \mu(X_t) \, dt + \sigma(X_t) \, dB_t \,, \end{equation}

where $B_t$ is a standard Brownian motion (i.e., a continuous-time random walk). The term $\mu(x)$ is referred to as the drift of the process $X_t$ when it is at $x$ and $\sigma(x)$ is the volatility of the process $X_t$ when it is at $x$. If you've never seen stochastic differential equations before, the above equation shouldn't scare you - you should read it as defining the continuous-time limit of the following discrete-time process:

\begin{equation} X_{t + \Delta t} - X_t = \mu(X_t) \, \Delta t + \sigma(X_t) \, \Delta B_t \,, \end{equation}

where $\Delta B_t \sim \mathcal{N}(0, \Delta t)$ are independent normal variables with mean $0$ and variance $\Delta t$, and the time step $\Delta t$ is taken to be infinitesimally small.
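
This discrete-time recursion is exactly how SDEs are simulated in practice (the Euler-Maruyama method). Here is a minimal sketch in Python; the mean-reverting drift $\mu(x) = -x$ and constant volatility $\sigma = 0.5$ are illustrative choices of mine, not anything fixed by the text.

```python
import numpy as np

def euler_maruyama(mu, sigma, x0, T, n_steps, rng):
    """Simulate one path of dX_t = mu(X_t) dt + sigma(X_t) dB_t
    using the discrete-time recursion above."""
    dt = T / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        dB = rng.normal(0.0, np.sqrt(dt))  # Delta B_t ~ N(0, dt)
        x[k + 1] = x[k] + mu(x[k]) * dt + sigma(x[k]) * dB
    return x

# Example: a mean-reverting process dX_t = -X_t dt + 0.5 dB_t.
rng = np.random.default_rng(0)
path = euler_maruyama(lambda x: -x, lambda x: 0.5,
                      x0=1.0, T=5.0, n_steps=1000, rng=rng)
```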

Here is where we are going to need a bit of Ito calculus. Don't worry! It's just like normal calculus, except instead of just having a single class of standard infinitesimal quantity $dt$, we also have a second class of "special" infinitesimal normal random variables $dB_t$. Just like in normal elementary calculus, the way we perform any calculation is to expand in powers of $dt$ and $dB_t$ and then discard anything that is smaller than $dt$ (since $dt$ is already infinitesimally small). The caveat is that

\begin{equation} dB_t \approx \sqrt{dt} \,, \end{equation}

since $dB_t$ is an infinitesimal normal random variable with variance $dt$ and hence standard deviation $\sqrt{dt}$. The way that Ito calculus expresses this equivalence is with the identity

\begin{equation} (dB_t)^2 = dt \,. \end{equation}

Think of this like an infinitesimal version of the law of large numbers for Gaussian random variables. As $dt$ grows infinitesimally small, the randomness in $(dB_t)^2$ basically vanishes and it becomes a deterministic quantity (in reality the randomness is still there, but it is negligible because it has order less than $dt$, and hence $(dB_t)^2$ concentrates about its expectation, $dt$). Therefore, when taking a Taylor expansion with an infinitesimal variable $dt$ and the random variable $dB_t$, we can discard everything above first order in $dt$ or second order in $dB_t$.
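
As a quick numerical illustration of $(dB_t)^2 = dt$ (a sketch of my own, not part of the derivation): the sum of squared Brownian increments over $[0, T]$ concentrates at $T$ as the partition gets finer.

```python
import numpy as np

# Each squared increment (Delta B)^2 has mean Delta t, so their sum over
# [0, T] concentrates at T as the number of steps grows.
rng = np.random.default_rng(1)
T = 1.0
for n in [10, 1_000, 100_000]:
    dB = rng.normal(0.0, np.sqrt(T / n), size=n)  # increments ~ N(0, dt)
    print(n, (dB ** 2).sum())  # tends to T = 1.0
```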

Suppose now that $u(x)$ is a function and we want to say something about the stochastic process $u(X_t)$. Performing a Taylor expansion on $u$ allows us to express the infinitesimal change in $u(X_t)$ in terms of quantities we have already defined and know how to handle,

\begin{equation} du(X_t) = \frac{\partial u}{\partial x} \, dX_t + \frac{1}{2} \frac{\partial^2 u}{\partial x^2} \, (dX_t)^2 \,. \end{equation}

We ignore everything higher than second order because anything smaller than $dt$ is effectively zero. Note that, by definition,

\begin{equation} dX_t = \mu(X_t) \, dt + \sigma(X_t) \, dB_t \,, \end{equation}

and also,

\begin{equation} (dX_t)^2 = \mu^2(X_t) \, dt^2 + 2 \mu(X_t) \sigma(X_t) \, dt \, dB_t + \sigma^2(X_t) \, (dB_t)^2 = \sigma^2(X_t) \, dt \,, \end{equation}

where we have used the identity $(dB_t)^2 = dt$, and discarded any terms that are higher order in $dt$. Therefore,

\begin{equation} du(X_t) = \mu(X_t) \frac{\partial u}{\partial x} \, dt + \sigma(X_t)\frac{\partial u}{\partial x} \, dB_t + \frac{1}{2} \sigma^2(X_t) \frac{\partial^2 u}{\partial x^2} \, dt \,. \end{equation}

This formula is the analogue of the chain rule for elementary calculus, and goes by the name of Ito's formula. Note that the presence of randomness adds the $\sigma(X_t) \frac{\partial u}{\partial x} \, dB_t$ and $\frac{1}{2}\sigma^2(X_t)\frac{\partial^2 u}{\partial x^2}$ terms to the otherwise standard expression. The Ito formula is often written as follows:

\begin{equation} du(X_t) = \mathcal{L}u(X_t) \, dt + \sigma(X_t) \frac{\partial u}{\partial x} \, dB_t \,, \end{equation}

where $\mathcal{L}$ is known as the infinitesimal generator of the process $X_t$, and is given by

\begin{equation} \mathcal{L} = \mu(x)\frac{\partial}{\partial x} + \frac{1}{2}\sigma^2(x) \frac{\partial^2}{\partial x^2} \,. \end{equation}

Now, integrating eq. (9) in time from $t = 0$ to $t = T$ gives

\begin{equation} u(X_T) - u(X_0) = \int_0^T \mathcal{L} u(X_t) \, dt + \int_0^T \sigma(X_t) \frac{\partial u}{\partial x} \, dB_t \,. \end{equation}

Taking expectations of both sides and letting $X_0 = x$, we note that the integral above with $dB_t$ becomes zero, since it is an infinite sum of mean-zero random variables. This leaves us with

\begin{equation} \mathbb{E}\left[u(X_T) \mid X_0 = x\right] - u(x) = \mathbb{E}\left[ \int_0^T \mathcal{L} u(X_t) \, dt \mid X_0 = x\right] \,. \end{equation}

This gives us an expression for $u(x)$ in terms of the expectation of a stochastic process,

\begin{equation} u(x) = \mathbb{E}\left[u(X_T) - \int_0^T \mathcal{L} u(X_t) \, dt \mid X_0 = x\right] \,. \end{equation}

Of course, this by itself isn't a very useful expression for $u(x)$, because the expression has $u(x)$ on the right hand side. This is where boundary conditions for $u(x)$ typically come in, but we will address this in the next section. First, it is important to note that it is actually possible to take $T$ to be a random variable (we won't show it here because that requires some analysis machinery that's beyond the scope of this tutorial). $T$ can be (and is very often taken to be) the hitting time of some set in $\mathbb{R}$. Oftentimes, we let $\Omega \subset \mathbb{R}$ be an interval and then define the stopping time $T$ as the first time at which $X_t$ hits the boundary of the domain $\Omega$, i.e.,

\begin{equation} T = \inf \{ t \mid X_t \in \partial \Omega \} \,. \end{equation}

We can also define higher dimensional versions of $X_t$ in $\mathbb{R}^n$ instead of in $\mathbb{R}$. In that case, $X_t$ is defined as

\begin{equation} dX_t = \mu(X_t) \, dt + \sigma(X_t) \, dB_t^n \,. \end{equation}

In this SDE, $\mu(x)$ is a vector field on $\mathbb{R}^n$ determining the drift of the process $X_t$ when it is at $x$ and $\sigma(x)$ determines the volatility (i.e., noise) of the process $X_t$ when it is at $x$. $dB_t^n$ denotes the standard multidimensional infinitesimal Brownian motion, i.e., $dB_t^n \sim \mathcal{N}(0, dt \, I)$. In this case we get the same expression for $u(x)$, but the infinitesimal generator becomes

\begin{equation} \mathcal{L} = \mu(x) \cdot \nabla + \frac{1}{2} \sigma^2(x) \, \Delta \,. \end{equation}

Elliptic Partial Differential Equations

We can take the above analysis and use it to get intuition on many of the partial differential equations we are used to seeing in our everyday lives by connecting them to properties of stochastic processes (have you ever read a more ivory tower sentence?). First, let's think of the analogue of our discrete-time game where $X_t$ wanders around a graph and collects $f(x)$ reward after leaving vertex $x$ and $g(x)$ terminal reward when it reaches a terminal vertex $x \in D$. In the continuous space/time case, we instead consider $X_t$ as a stochastic process on a connected bounded domain $\Omega \subset \mathbb{R}^n$. The internal reward function $f(x)$ is now defined on $\Omega$ and is interpreted as a rate, i.e., if $X_t$ is at $x$ for a period of time $dt$, then it picks up $f(x) \, dt$ reward. The terminal reward function $g(x)$ is now defined on the boundary $\partial \Omega$. Our expression from the previous section tells us that, if $u(x)$ solves the elliptic PDE with Dirichlet boundary conditions,

\begin{equation} \begin{split} \mathcal{L} u(x) = -f(x)\,,&\qquad\text{on }\Omega \,, \\ u(x) = g(x)\,,&\qquad \text{on } \partial \Omega \,, \end{split} \end{equation}

then eq. (13) tells us that it must satisfy

\begin{equation} u(x) = \mathbb{E}\left[g(X_T) + \int_0^T f(X_t) \, dt \mid X_0 = x\right] \,. \end{equation}

This formula provides a very nice probabilistic interpretation of second order elliptic PDEs.

Note, in particular, that this gives a nice probabilistic form for the solution of the Laplace equation with Dirichlet boundary conditions, i.e.,

\begin{equation} \begin{split} \Delta u(x) = 0\,,&\qquad \text{on }\Omega \,, \\ u(x) = g(x) \,, &\qquad \text{on } \partial \Omega \,. \end{split} \end{equation}

The solution $u(x)$ satisfies the probabilistic expression

\begin{equation} u(x) = \mathbb{E}[g(B_T) \mid B_0 = x] \,, \end{equation}

where $B_t$ is a standard Brownian motion. Therefore, the Laplace equation with Dirichlet boundary conditions is the continuous analogue of the expected terminal reward problem we saw in the discrete version of this tutorial. Note, moreover, that it is possible to use this expression to evaluate $u(x)$ with Monte Carlo without solving the entire equation. This is especially useful in high dimensional settings where solving the Laplace equation can be prohibitively expensive.
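
Here is a crude sketch of that Monte Carlo approach (the setup is an illustrative choice of mine): take $\Omega$ to be the unit disk and $g(x, y) = x^2 - y^2$ on the boundary, whose harmonic extension is $u(x, y) = x^2 - y^2$, so the estimate can be checked against the exact answer. The fixed time step introduces some discretization bias at the boundary, which fancier schemes (e.g., walk-on-spheres) avoid.

```python
import numpy as np

rng = np.random.default_rng(2)

def g(p):
    # Boundary data on the unit circle; its harmonic extension is x^2 - y^2.
    return p[0] ** 2 - p[1] ** 2

def sample_exit_value(x, dt=1e-3):
    # Step a Brownian path from x until it leaves the unit disk, then pay
    # out g at the exit point (projected back onto the boundary).
    p = np.array(x, dtype=float)
    while p @ p < 1.0:
        p += rng.normal(0.0, np.sqrt(dt), size=2)
    return g(p / np.linalg.norm(p))

x0 = (0.3, 0.4)
estimate = np.mean([sample_exit_value(x0) for _ in range(2000)])
print(estimate, x0[0] ** 2 - x0[1] ** 2)  # Monte Carlo vs exact (-0.07)
```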

The Maximum Principle for Elliptic Equations

Another outcome of this probabilistic form is the maximum principle. Suppose that $\mathcal{L}$ is an infinitesimal generator of some stochastic process $X_t \in \mathbb{R}^n$, and suppose that the function $u(x)$ satisfies

\begin{equation} \mathcal{L} u(x) \geq 0\,, \qquad \text{on } \Omega\,. \end{equation}

Note that when $\mathcal{L} = \Delta$, such a function $u(x)$ is called subharmonic on $\Omega$. If this is the case, then from the probabilistic interpretation of $u(x)$ in the previous section, we have that

\begin{equation} \begin{split} u(x) &= \mathbb{E}\left[\int_0^T -\mathcal{L} u(X_t) \, dt + u(X_T) \mid X_0 = x\right] \\ &\leq \mathbb{E}\left[u(X_T) \mid X_0 = x\right] \leq \sup_{y \in \partial \Omega} u(y) \,. \end{split} \end{equation}

This is therefore the continuous analogue of the discrete maximum principle we saw earlier in the discrete version of this tutorial.

Hitting Time Problems

Suppose that we have a stochastic process $X_t$ on $\Omega \subset \mathbb{R}^n$ given by

\begin{equation} dX_t = \mu(X_t) \, dt + \sigma(X_t) \, dB_t^n \,, \end{equation}

and we want to ask the continuous analogue of the discrete hitting time problem. That is, how long does it take $X_t$ to hit the boundary $\partial \Omega$ if it starts from $X_0 = x$? We will write this function as

\begin{equation} u(x) = \mathbb{E}\left[T \mid X_0 = x\right] = \mathbb{E}\left[\int_0^T dt \mid X_0 = x \right] \,. \end{equation}

The probabilistic interpretation we derived in the previous sections tells us that the function $u(x)$ that satisfies the PDE

\begin{equation} \begin{split} \mu(x) \cdot \nabla u(x) + \frac{1}{2}\sigma^2(x)\, \Delta u(x) = -1 \,, &\qquad \text{on }\Omega \,, \\ u(x) = 0\,,&\qquad \text{on }\partial \Omega \,, \end{split} \end{equation}

also satisfies eq. (24) and hence is the function we want.
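
As a sanity check (a toy example of my own choosing): for standard Brownian motion on $\Omega = (0, 1)$, the PDE above reduces to $\frac{1}{2} u''(x) = -1$ with $u(0) = u(1) = 0$, whose solution is $u(x) = x(1 - x)$. A crude fixed-step simulation reproduces this, up to discretization bias at the boundary.

```python
import numpy as np

rng = np.random.default_rng(3)

def exit_time(x, dt=1e-3):
    # Step a Brownian path from x until it leaves (0, 1), tracking time.
    t = 0.0
    while 0.0 < x < 1.0:
        x += rng.normal(0.0, np.sqrt(dt))
        t += dt
    return t

x0 = 0.3
mean_exit = np.mean([exit_time(x0) for _ in range(2000)])
print(mean_exit, x0 * (1 - x0))  # both should be close to 0.21
```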

First Arrival Problems

Suppose that we have a stochastic process $X_t$ on $\Omega \subset \mathbb{R}^n$ given by

\begin{equation} dX_t = \mu(X_t) \, dt + \sigma(X_t) \, dB_t^n \,, \end{equation}

and we want to ask the continuous analogue of the discrete first arrival problem. That is, suppose that $\partial \Omega$ can be split into two disjoint parts $\partial \Omega^+$ and $\partial \Omega^-$. What is the probability that $X_t$ hits $\partial \Omega^+$ before it hits $\partial \Omega^-$, given that it starts at $X_0 = x$? Let $1_{\partial \Omega^+}$ be the indicator function for $\partial \Omega^+$; then the function we want to compute is

\begin{equation} u(x) = \mathbb{P}(X_T \in \partial \Omega^+ \mid X_0 = x) = \mathbb{E}[1_{\partial \Omega^+}(X_T) \mid X_0 = x]\,. \end{equation}

The probabilistic interpretation we derived in the previous sections tells us that the $u(x)$ that satisfies the PDE

\begin{equation} \begin{split} \mu(x) \cdot \nabla u(x) + \frac{1}{2}\sigma^2(x)\, \Delta u(x) = 0 \,, &\qquad \text{on }\Omega \,, \\ u(x) = 1\,,&\qquad \text{on }\partial \Omega^+ \,, \\ u(x) = 0\,,&\qquad \text{on }\partial \Omega^- \,, \end{split} \end{equation}

also satisfies eq. (27) and hence is the function we want.
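
Again as a toy check of my own: for standard Brownian motion on $(0, 1)$ with $\partial \Omega^+ = \{1\}$ and $\partial \Omega^- = \{0\}$, the PDE reads $\frac{1}{2} u''(x) = 0$ with $u(1) = 1$ and $u(0) = 0$, so $u(x) = x$: the probability of hitting $1$ before $0$ is the starting point itself.

```python
import numpy as np

rng = np.random.default_rng(4)

def hits_one_first(x, dt=1e-3):
    # Run a Brownian path from x until it exits (0, 1); report whether it
    # left through the right endpoint.
    while 0.0 < x < 1.0:
        x += rng.normal(0.0, np.sqrt(dt))
    return float(x >= 1.0)

x0 = 0.3
print(np.mean([hits_one_first(x0) for _ in range(2000)]))  # approx 0.3
```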

Parabolic Differential Equations

To move to parabolic differential equations, i.e., time evolution problems, we need a version of Ito's formula for functions $u(x, t)$ that depend both on space and time. It turns out that adding a time variable doesn't change anything except for adding an extra term involving the time partial derivative of $u(x, t)$. Let's redo the Taylor expansion of $u(X_t, t)$ in terms of $dX_t$ and $dt$,

\begin{equation} du(X_t, t) = \frac{\partial u}{\partial t} \, dt + \frac{\partial u}{\partial x} \, dX_t + \frac{1}{2} \frac{\partial^2 u}{\partial x^2} \, (dX_t)^2 \,. \end{equation}

Redoing the calculation that we did earlier, this gives us

\begin{equation} du(X_t, t) = \left(\frac{\partial}{\partial t} + \mathcal{L}\right) u(X_t, t) \, dt + \sigma(X_t) \frac{\partial u}{\partial x} \, dB_t\,. \end{equation}

Note that the only thing that has changed is that there is now a time partial derivative in the $dt$ terms. Once again, we integrate this from $t = 0$ to $t = T$ to obtain

\begin{equation} u(X_T, T) = u(X_0, 0) + \int_0^T \left(\frac{\partial}{\partial t} + \mathcal{L}\right) u(X_t, t) \, dt + \int_0^T \sigma(X_t) \frac{\partial u}{\partial x} \, dB_t \,. \end{equation}

Since $dB_t$ is mean zero and independent of everything that happens up until time $t$, the second integral vanishes under expectation and we are left with

\begin{equation} u(x, 0) = \mathbb{E}\left[u(X_T, T) - \int_0^T \left(\frac{\partial}{\partial t} + \mathcal{L}\right) u(X_t, t) \, dt \mid X_0 = x \right] \,. \end{equation}

Note that this is equivalent to the following more general expression (just shift the result in time):

\begin{equation} u(x, s) = \mathbb{E}\left[u(X_T, T) - \int_s^T \left(\frac{\partial}{\partial t} + \mathcal{L}\right) u(X_t, t) \, dt \mid X_s = x \right] \,. \end{equation}

Once again, it is also possible for $T$ to be a stopping time (e.g., the hitting time of the boundary $\partial \Omega$ of a set), provided one assumes some nice properties of the stopping time $T$ (e.g., finite expectation); deriving this is beyond the scope of this tutorial.

In particular, if $u$ satisfies the parabolic second-order differential equation

\begin{equation} \begin{split} \left(\frac{\partial}{\partial t} + \mathcal{L}\right) u(x, t) &= -f(x, t)\,, \qquad \text{for } 0 \leq t \leq T \,, \\ u(x, T) &= g(x) \,, \end{split} \end{equation}

then we get an analogue of eq. (18) for parabolic PDEs,

\begin{equation} u(x, s) = \mathbb{E}\left[g(X_T) + \int_s^T f(X_t, t) \, dt \mid X_s = x \right] \,. \end{equation}

This is a simplified version of the more general Feynman-Kac formula. Note that this is fundamentally a Kolmogorov backward equation for the value function $u$. The interpretation is that we let the stochastic process $X_t$ run from time $t = s$ to $t = T$ and along the way it picks up reward at a rate of $f(X_t, t)$ per unit time, until it receives a final payout of $g(X_T)$ at time $t = T$. The value function $u(x, s)$ measures the expected payout of this game.
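
A tiny sanity check of this representation (with illustrative choices of mine): take $\mathcal{L} = \frac{1}{2} \frac{\partial^2}{\partial x^2}$ (standard Brownian motion), $f = 0$, and terminal payout $g(x) = x^2$. Then $u(x, s) = \mathbb{E}[(x + B_{T-s})^2] = x^2 + (T - s)$, which direct sampling reproduces.

```python
import numpy as np

# Monte Carlo check of u(x, s) = E[g(X_T) | X_s = x] for Brownian motion
# and g(x) = x^2; the exact answer is x^2 + (T - s).
rng = np.random.default_rng(5)
x, s, T = 0.5, 0.0, 2.0
X_T = x + rng.normal(0.0, np.sqrt(T - s), size=100_000)
print(np.mean(X_T ** 2), x ** 2 + (T - s))  # both approx 2.25
```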

The Maximum Principle for Parabolic Equations

A cool corollary of the above probabilistic expression is the maximum principle for parabolic equations. Suppose that $u(x, t)$ is defined on $\mathbb{R}^n \times [0, T]$ and satisfies

(t+L)u0,\begin{equation} \left(\frac{\partial}{\partial t} + \mathcal{L}\right) u \leq 0 \,, \end{equation}

for the infinitesimal generator $\mathcal{L}$ of some stochastic process $X_t$. Then it follows that

\begin{equation} \begin{split} u(x, s) &= \mathbb{E}\left[u(X_T, T) - \int_s^T \left(\frac{\partial}{\partial t} + \mathcal{L}\right) u(X_t, t) \, dt \mid X_s = x \right] \\ &\leq \mathbb{E}\left[u(X_T, T) \mid X_s = x\right] \leq \sup_y u(y, T) \,. \end{split} \end{equation}

Duality

It is at this point that it is worth taking a digression for another discussion of the concept of duality. Unfortunately, unlike in the discrete case, duality is far more complicated in the continuous case because our underlying vector spaces are infinite dimensional. However, I will try to have a limited discussion of it here.

To begin, we need to consider our primal space. This will be the space of value functions analogous to the set of functions on vertices that we saw in the previous tutorial. However, in our case, I will take our primal set of functions to be the set of compactly supported infinitely differentiable functions $\mathcal{C}_0^\infty(\mathbb{R}^n)$. Note that any function can be well-approximated by an infinitely differentiable function (see mollifiers), so this restriction is not too much of a loss. Thus, we will write $f, g \in \mathcal{C}_0^\infty(\mathbb{R}^n)$ for such functions.

Now, the dual $\mathcal{C}_0^\infty(\mathbb{R}^n)^*$ of $\mathcal{C}_0^\infty(\mathbb{R}^n)$ is defined as the set of all linear functionals $\pi$ on $\mathcal{C}_0^\infty(\mathbb{R}^n)$, i.e., those satisfying

\begin{equation} \pi[\alpha f + \beta g] = \alpha \pi[f] + \beta \pi[g] \,, \qquad \text{for } f, g \in \mathcal{C}_0^\infty(\mathbb{R}^n) \,. \end{equation}

Note that for Banach spaces $V$ (i.e., complete normed vector spaces), the dual space usually has the additional restriction that the functionals must be continuous,

\begin{equation} |\pi[f]| \leq C \|f\|_V \,, \qquad \text{for all } f \in V \,. \end{equation}

These two notions of duality are different (the second is stronger than the first); the reader should be aware of this, but the distinction won't be important for our purposes.

Note that there is not a one-to-one correspondence between the primal set $\mathcal{C}_0^\infty(\mathbb{R}^n)$ and $\mathcal{C}_0^\infty(\mathbb{R}^n)^*$, since the Dirac delta distribution $\delta_0$ is a member of the dual,

\begin{equation} \delta_0[f] = f(0) \,. \end{equation}

However, there is no actual function $\delta_0$ in $\mathcal{C}^\infty$ with the property that

\begin{equation} \int_{\mathbb{R}^n} \delta_0(x) f(x) \, dx = \delta_0[f] = f(0) \,. \end{equation}

Therefore, $\mathcal{C}_0^\infty(\mathbb{R}^n)$ and $\mathcal{C}_0^\infty(\mathbb{R}^n)^*$ are fundamentally different. Nonetheless, we can still nominally think of distributions as "row vectors" and functions as "column vectors", even if this relationship is strained somewhat in the continuous case. While it may not be possible to transform a distribution into a function, it is always possible to transform a function $f$ into a distribution $\pi_f$ via the inner product

\begin{equation} \pi_f[g] = \int_{\mathbb{R}^n} f(x) g(x) \, dx \,. \end{equation}

But the converse is not necessarily true.

In particular, all probability distributions $\pi$ on $\mathbb{R}^n$ can be interpreted as linear functionals on $\mathcal{C}_0^\infty(\mathbb{R}^n)$ via the expectation operator

\begin{equation} \pi[f] \equiv \mathbb{E}_{X \sim \pi} [ f(X) ] \,. \end{equation}

This now returns us to a very fundamental problem with stochastic processes. We have a stochastic process $X_t$ that begins with probability distribution $\pi_0$ at time $t = 0$ and we would like to evaluate $\mathbb{E}[V_T(X_T)]$ at time $T$. As we saw in the previous tutorial about discrete processes, there are two ways of doing this: we can either evolve $\pi_0$ forward in time to a distribution $\pi_T$ and then take the expectation $\mathbb{E}_{X \sim \pi_T}[V_T(X)]$, or we can evolve $V_T$ backward in time to a function $V_0$ and take the expectation $\mathbb{E}_{X \sim \pi_0}[V_0(X)]$. Eq. (35) from the previous section already tells us exactly how to do the latter, backward approach. So, let us start with the backward equation first.

The Backward Kolmogorov Equation

To derive an expression for $V_0$, first we use the law of iterated expectation,

\begin{equation} \mathbb{E}[V_T(X_T)] = \mathbb{E}_{X \sim \pi_0}\left[\mathbb{E}\left[V_T(X_T) \mid X_0 = X\right]\right] \,. \end{equation}

Thus, we see that $V_0$ has the form

\begin{equation} V_0(x) = \mathbb{E}\left[V_T(X_T) \mid X_0 = x\right] \,, \end{equation}

and we know from eq. (35) that $V$ therefore satisfies the PDE

\begin{equation} \frac{\partial V}{\partial t} = -\mathcal{L} V(x, t) \,, \end{equation}

with the appropriate boundary conditions at $t = T$. If we do a variable substitution $s = T - t$, then we get

\begin{equation} \frac{\partial V}{\partial s} = \mathcal{L} V(x, s) \,, \end{equation}

with boundary conditions at $s = 0$. Because of the similarity of this expression to the ordinary differential equation

\begin{equation} \frac{dy}{ds} = a y \,, \end{equation}

whose solution is the exponential function $y(s) = e^{as} y(0)$, we write the operator that takes $V_T$ at time $t = T$ to $V_{T-s}$ at time $t = T - s$ as

\begin{equation} V_{T - s} = \exp(s \mathcal{L}) V_T \,. \end{equation}

While there are some caveats, generally, one can treat the exponential of operators in much the same way as one treats the exponential of matrices. This expression therefore tells us how value functions evolve backwards in time under the dynamics of the stochastic process $X_t$. To write out the equation in its full glory, we have

\begin{equation} \frac{\partial V}{\partial s} = \mu(x) \cdot \nabla V(x, s) + \frac{1}{2} \sigma^2(x) \, \Delta V(x, s) \,. \end{equation}
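
To make the operator exponential concrete, here is a sketch of my own: discretize $\mathcal{L}$ with finite differences on a grid (with illustrative Ornstein-Uhlenbeck coefficients $\mu(x) = -x$, $\sigma = 1$) and apply the matrix exponential to a terminal value function. For this process with $V_T(x) = x^2$, the exact value at $x = 0$ is $(1 - e^{-2s})/2$, which the discretization should roughly reproduce.

```python
import numpy as np
from scipy.linalg import expm

# Central-difference discretization of L = mu d/dx + (1/2) sigma^2 d^2/dx^2.
# The endpoint rows are left zero, a crude absorbing boundary placed far
# from the region of interest.
n, half_width = 201, 5.0
x = np.linspace(-half_width, half_width, n)
h = x[1] - x[0]
mu, sig2 = -x, np.ones(n)  # OU drift, unit volatility

L = np.zeros((n, n))
for i in range(1, n - 1):
    L[i, i - 1] = -mu[i] / (2 * h) + sig2[i] / (2 * h ** 2)
    L[i, i + 1] = mu[i] / (2 * h) + sig2[i] / (2 * h ** 2)
    L[i, i] = -sig2[i] / h ** 2

s = 2.0
V_T = x ** 2                   # terminal value function
V = expm(s * L) @ V_T          # V_{T-s} = exp(s L) V_T
print(V[n // 2], (1 - np.exp(-2 * s)) / 2)  # value at x = 0 vs exact
```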

The Forward Kolmogorov Equation

Now, we would like to turn to how the distribution of $X_t$ evolves from $\pi_0$ to $\pi_T$. Note that the previous section gave us the following expression,

\begin{equation} \mathbb{E}[V_T(X_T)] = \pi_0 [\exp(T \mathcal{L}) V_T] \,. \end{equation}

The distribution $\pi_T$ of $X_T$ is therefore given by

\begin{equation} \pi_T[f] = \pi_0 [\exp(T \mathcal{L}) f] \,. \end{equation}

However, this is somewhat of a cop-out answer to the question of how distributions evolve forward in time. Ideally, we would like an actual equation that describes the evolution of $\pi$. To do this, we have to assume that $\pi_t$ is a nice function. We will assume, therefore, that it is a member of $\mathcal{C}^\infty(\mathbb{R}^n \times [0, T])$ (note, not necessarily compactly supported). That is, there exists a function $\pi(x, t)$ such that

\begin{equation} \pi_t[f] = \int_{\mathbb{R}^n} f(x) \pi(x, t) \, dx \,, \end{equation}

where the function $\pi(x, t)$ is the probability density function (PDF) representing the functional $\pi_t[\cdot]$. We would like to know how the probability density function $\pi(x, t)$ evolves over time. To do this, we derive the weak form of the evolution equations.

First, consider test functions $\phi(x, t)$ that vanish at the end times, $\phi(x, 0) = \phi(x, T) = 0$, and are compactly supported in $x$. Let us plug $\phi$ into eq. (32), which we derived earlier,

\begin{equation} \phi(x, 0) = \mathbb{E}\left[\phi(X_T, T) - \int_0^T \left(\frac{\partial }{\partial t} + \mathcal{L}\right) \phi(X_t, t) \, dt \mid X_0 = x \right] \,. \end{equation}

But since $\phi(x, 0) = \phi(x, T) = 0$ by construction, we have that

\begin{equation} \mathbb{E}\left[\int_0^T \left(\frac{\partial }{\partial t} + \mathcal{L}\right) \phi(X_t, t) \, dt \right] = 0\,. \end{equation}

Now, exchanging the expectation with the time integral gives

\begin{equation} \int_0^T \mathbb{E}\left[\left(\frac{\partial }{\partial t} + \mathcal{L}\right) \phi(X_t, t)\right] dt = 0\,. \end{equation}

Writing each of these expectations as an integral against the density $\pi(x, t)$ gives us

\begin{equation} \int_0^T \int_{\mathbb{R}^n} \left(\frac{\partial }{\partial t} + \mathcal{L}\right) \phi(x, t) \, \pi(x, t) \, dx \, dt = 0\,. \end{equation}

We now perform an integration by parts in both $x$ and $t$ to obtain

\begin{equation} \int_0^T \int_{\mathbb{R}^n} \phi(x, t) \left(-\frac{\partial }{\partial t} + \mathcal{L}^* \right) \pi(x, t)\, dx \, dt = 0\,, \end{equation}

where $\mathcal{L}^*$ is the adjoint operator of $\mathcal{L}$ (think of it like the matrix transpose). Note that the boundary terms of the integration by parts vanish because $\phi$ is compactly supported and zero at $t = 0$ and $t = T$. We will not do the computation here, but the adjoint operator written out in full is

\begin{equation} \mathcal{L}^*\pi = -\nabla \cdot (\mu(x) \pi(x, t)) + \frac{1}{2}\Delta (\sigma^2(x) \pi(x, t))\,. \end{equation}

But note that since eq. (58) holds for any choice of $\phi$, it must be the case that

\begin{equation} \left(-\frac{\partial }{\partial t} + \mathcal{L}^* \right) \pi(x, t) = 0\,, \end{equation}

or rather

\begin{equation} \frac{\partial \pi}{\partial t} = \mathcal{L}^* \pi(x, t) \,. \end{equation}

This is the Kolmogorov forward equation for $\pi$, as it governs $\pi$'s evolution forward in time. Writing it out in full glory gives

\begin{equation} \frac{\partial \pi}{\partial t} = -\nabla \cdot (\mu(x) \pi(x, t)) + \frac{1}{2}\Delta (\sigma^2(x) \pi(x, t)) \,. \end{equation}

It is also sometimes written in the form

\begin{equation} \frac{\partial \pi}{\partial t} = -\nabla \cdot (\mu(x) \pi(x, t)) + \nabla \cdot(a(x) \nabla \pi(x, t)) \,, \end{equation}

and in this form, it is called the advection-diffusion equation, where $a(x)$ is the diffusion coefficient (the diffusivity).

One special case is the heat equation

\begin{equation} \frac{\partial \pi}{\partial t} = \frac{1}{2} \Delta \pi \,, \end{equation}

which describes the time evolution of the probability distribution of a standard Brownian motion $B_t^n$ on $\mathbb{R}^n$. This fact is the reason why the heat equation (and moreover, any forward Kolmogorov equation) preserves the energy functional given by

\begin{equation} \mathcal{E}_t = \int_{\mathbb{R}^n} \pi(x, t) \, dx \,, \end{equation}

since $\pi(x, t)$ is a probability distribution for each fixed $t$, and so $\mathcal{E}_t$ must always evaluate to one.

Duality of the Evolution Operators

I will make a final remark that this result gives us the expression for the operator that evolves $\pi$ forward in time, namely, it is

\begin{equation} \pi_t = \exp(t \mathcal{L}^*) \pi_0 \,. \end{equation}

Furthermore, this implies the following

\begin{equation} \pi_0[\exp(T \mathcal{L}) f] = \pi_T[f] = (\exp(T \mathcal{L}^*) \pi_0)[f] \,. \end{equation}

This means that the adjoint of the evolution operator $\exp(T \mathcal{L})$ for value functions $f$ is actually the evolution operator $\exp(T \mathcal{L}^*)$ for probability distributions:

\begin{equation} \exp(T \mathcal{L})^* = \exp(T \mathcal{L}^*) \,. \end{equation}
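
In finite dimensions this duality is just the fact that transposition commutes with the matrix exponential, which is easy to check numerically (a toy sketch, with a random matrix standing in for a discretized generator):

```python
import numpy as np
from scipy.linalg import expm

# The adjoint of exp(A) is exp(A^T): evolving distributions forward is
# adjoint to evolving value functions backward.
rng = np.random.default_rng(6)
A = rng.normal(size=(4, 4))
print(np.allclose(expm(A).T, expm(A.T)))  # True
```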

Stationary Distribution Problems

Like in the discrete case, the distribution $\pi_t$ of a stochastic process $X_t$ with nonzero diffusion (i.e., $\sigma > 0$) usually approaches a stationary distribution $\pi$ as $t \rightarrow \infty$. To compute what this stationary distribution is, we take a look at the evolution equation for $\pi_t$, i.e.,

\begin{equation} \frac{\partial \pi}{\partial t} = \mathcal{L}^* \pi(x, t) \,. \end{equation}

For the change in $\pi$ to be zero, therefore, we must have that the stationary distribution satisfies the elliptic equation

\begin{equation} \mathcal{L}^* \pi = 0\,. \end{equation}

Written out in full,

\begin{equation} -\nabla \cdot (\mu(x) \pi(x)) + \nabla \cdot(a(x) \nabla \pi(x)) = 0 \,. \end{equation}

Naturally, one must also enforce the constraint that

\begin{equation} \int_{\mathbb{R}^n} \pi(x) \, dx = 1 \,, \end{equation}

to ensure the result is actually a probability distribution.

Markov Chain Monte Carlo with Langevin Dynamics

Langevin dynamics is an important tool in computational science for sampling from distributions on $\mathbb{R}^n$. Imagine that we are given a probability distribution $p(x)$ such that we can evaluate the gradient of its logarithm, $\nabla \log p(x)$. For example, $p(x)$ may be the posterior of a statistical model after it has been trained on data. Oftentimes, in these statistics settings, evaluating $p(x)$ is hard (due to normalization factors), but $\nabla \log p(x)$ can be evaluated easily. Another common situation where one wants to sample from a distribution is in statistical physics, where the equilibrium distributions of systems are often known and one is interested in computing properties of the system.

Sometimes we might want to evaluate the mode of $p(x)$ (i.e., compute a maximum likelihood estimate). This is often done using gradient ascent on $\log p(x)$, whose continuous-time analogue can be written as

\begin{equation} dX_t = \nabla \log p(X_t) \, dt\,. \end{equation}

If $X_t$ is allowed to run long enough, eventually $X_t$ will converge to a local mode of the distribution $p(x)$. However, oftentimes the mode of a distribution is not a good representation of the distribution as a whole, and we would rather have the ability to sample from it. To see how this can be done, consider the above equation, but with a bit of noise added from Brownian motion,

\begin{equation} dX_t = \nabla \log p(X_t) \, dt + \sqrt{2} \, dB_t \,. \end{equation}

This process is referred to as overdamped Langevin dynamics. Let's compute the stationary distribution of this process using the machinery from the previous section. The stationary distribution $\pi$ should satisfy

\begin{equation} -\nabla \cdot (\pi(x) \nabla \log p(x)) + \Delta \pi(x) = 0 \,. \end{equation}

We can rearrange this to get

\begin{equation} \nabla \cdot (\nabla \pi(x) - \pi(x) \nabla \log p(x)) = 0 \,. \end{equation}

Therefore,

\begin{equation} \nabla \pi(x) - \pi(x) \, \nabla \log p(x) = \varphi \,, \end{equation}

where $\varphi$ is a divergence-free vector field (i.e., $\nabla \cdot \varphi = 0$). However, note that since $\pi$ is a probability distribution, it must vanish as $x \rightarrow \infty$, and so must the field on the left hand side. Under this decay, one can argue that $\varphi = 0$ (in one dimension this is immediate: a divergence-free field is constant, and a constant that vanishes at infinity must be zero; in higher dimensions a more careful argument is needed). Therefore,

\begin{equation} \nabla \pi(x) - \pi(x) \nabla \log p(x) = 0\,. \end{equation}

Note that, because the noise is nondegenerate, there is always a nonzero probability density that $X_t$ ends up at any given location $x$; therefore, $\pi(x) > 0$ and we can divide by $\pi(x)$ to get

\begin{equation} \frac{\nabla \pi(x)}{\pi(x)} = \nabla \log p(x) \,. \end{equation}

But the left hand side is just

\begin{equation} \nabla \log \pi(x) = \nabla \log p(x) \,, \end{equation}

which means

\begin{equation} \log \pi(x) = \log p(x) + C \,. \end{equation}

Or rather,

\begin{equation} \pi(x) = A \, p(x) \,, \end{equation}

for some constant $A$. But integrating both sides gives

\begin{equation} \int_{\mathbb{R}^n} \pi(x) \, dx = A \int_{\mathbb{R}^n} p(x) \, dx \,. \end{equation}

And since both $\pi(x)$ and $p(x)$ are probability distributions, their integrals evaluate to one. We therefore have $A = 1$ and

\begin{equation} \pi(x) = p(x) \,. \end{equation}

Therefore, the stationary distribution of $X_t$ is precisely the distribution $p(x)$. This means we can draw a sample from $p(x)$ by sampling $X_0$ from some arbitrary distribution (preferably one already close to $p(x)$ if we can) and letting $X_t$ run for a long time. After this burn-in period, we examine $X_t$, and the result is an approximate sample from $p(x)$. This is an example of a Markov chain Monte Carlo technique. This technique forms the basis for stable diffusion models in Machine Learning.
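
To make this concrete, here is a minimal sketch of the resulting sampler, usually called the unadjusted Langevin algorithm (ULA): discretize the SDE above with a step size $\varepsilon$. The standard Gaussian target is an illustrative choice of mine; the finite step size introduces a small bias, which Metropolis-adjusted variants remove.

```python
import numpy as np

rng = np.random.default_rng(7)

def grad_log_p(x):
    # Target p = N(0, 1), so grad log p(x) = -x; swap in any score function.
    return -x

eps, n_steps, n_burn = 1e-2, 60_000, 10_000
x, samples = 3.0, []
for k in range(n_steps):
    # Euler-Maruyama step: drift eps * grad log p, noise of variance 2 * eps.
    x += eps * grad_log_p(x) + np.sqrt(2 * eps) * rng.normal()
    if k >= n_burn:
        samples.append(x)

samples = np.array(samples)
print(samples.mean(), samples.var())  # approx 0 and 1 for the N(0,1) target
```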