Partial Differential Equations from Continuous Stochastic Processes
This is a quick tutorial on the deep connection between elliptic/parabolic systems and stochastic processes. The connection gives a lot of useful intuition about how to interpret elliptic/parabolic PDEs as well as Laplacian systems on graphs. This part covers continuous time stochastic processes and Ito calculus.
This tutorial is meant for students at the graduate or advanced undergraduate level.
Moving to Continuous Stochastic Processes: Ito's Formula
Here is where we finally make the jump to continuous time and space. To begin this conversion process, we need to make note that the fundamental analogue of the stochastic graph process is going to be a continuous space/time Ito process $X_t$ defined by the stochastic differential equation
$$dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dB_t,$$
where $B_t$ is a standard Brownian motion (i.e., a continuous time random walk). The term $\mu(x)$ is referred to as the drift of the process when it is at $x$ and $\sigma(x)$ is the volatility of the process when it is at $x$. If you've never seen stochastic differential equations before, the above equation shouldn't scare you - you should try to read it as defining $X_t$ as the continuous time limit of the following discrete-time process:
$$X_{t + \Delta t} = X_t + \mu(X_t)\,\Delta t + \sigma(X_t)\,\Delta B_t,$$
where the increments $\Delta B_t$ are independent normal variables with mean $0$ and variance $\Delta t$, and the time step $\Delta t$ is taken to be infinitesimally small.
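To make this concrete, here is a minimal simulation sketch of the discrete-time process above (this discretization is commonly called the Euler-Maruyama scheme). The specific drift, volatility, starting point, and step size below are illustrative choices, not anything fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_path(mu, sigma, x0, dt=1e-3, n_steps=5000):
    """Simulate X_{t+dt} = X_t + mu(X_t) dt + sigma(X_t) dB_t."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        dB = rng.normal(0.0, np.sqrt(dt))          # increment with mean 0, variance dt
        x[k + 1] = x[k] + mu(x[k]) * dt + sigma(x[k]) * dB
    return x

# Example: a process pulled toward 0 with constant volatility.
path = simulate_path(mu=lambda x: -x, sigma=lambda x: 0.3, x0=1.0)
print(path[-1])
```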
Here is where we are going to need a bit of Ito calculus. Don't worry! It's just like normal calculus, except instead of just having a single class of standard infinitesimal quantity $dt$, we also have a second class of "special" infinitesimal normal random variables $dB_t$. Just like in normal elementary calculus, the way we perform any calculation is to expand in powers of $dt$ and $dB_t$ and then discard anything that is smaller than $dt$ (since $dt$ is already infinitesimally small). The caveat is that $dB_t$ is of order $\sqrt{dt}$ rather than $dt$,
since $dB_t$ is an infinitesimal normal random variable with variance $dt$ and hence standard deviation $\sqrt{dt}$. The way that Ito calculus expresses this equivalence is with the identity
$$dB_t^2 = dt.$$
Think of this like an infinitesimal version of the law of large numbers for Gaussian random variables. As $dt$ grows infinitesimally small, the randomness in $dB_t^2$ basically vanishes and it becomes a deterministic quantity (in reality the randomness is still there, but it is negligible because it has order less than $dt$, and hence $dB_t^2$ concentrates about its expectation, $dt$). Therefore, when taking a Taylor expansion with an infinitesimal variable $dt$ and the random variable $dB_t$, we can discard everything above first order in $dt$ or second order in $dB_t$.
Suppose now that $f : \mathbb{R} \to \mathbb{R}$ is a function and we want to say something about the stochastic process $f(X_t)$. Performing a Taylor expansion on $f$ allows us to express the infinitesimal change in $f(X_t)$ in terms of quantities we have already defined and know how to handle,
$$df(X_t) = f'(X_t)\,dX_t + \tfrac{1}{2} f''(X_t)\,dX_t^2 + \dots$$
We ignore everything higher than second order because anything smaller than $dt$ is effectively zero. Note that, by definition,
$$dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dB_t.$$
And also,
$$dX_t^2 = \mu(X_t)^2\,dt^2 + 2\mu(X_t)\sigma(X_t)\,dt\,dB_t + \sigma(X_t)^2\,dB_t^2 = \sigma(X_t)^2\,dt,$$
where we have used the identity $dB_t^2 = dt$, and discarded any terms that are higher order in $dt$. Therefore,
$$df(X_t) = \left(\mu(X_t)\,f'(X_t) + \tfrac{1}{2}\sigma(X_t)^2 f''(X_t)\right)dt + \sigma(X_t)\,f'(X_t)\,dB_t.$$
This formula is the analogue of the chain rule for elementary calculus, and goes by the name of Ito's formula. Note that the presence of randomness adds the $\tfrac{1}{2}\sigma^2 f''\,dt$ and $\sigma f'\,dB_t$ terms to the otherwise standard expressions. The Ito formula is often written as follows
$$df(X_t) = \mathcal{L}f(X_t)\,dt + \sigma(X_t)\,f'(X_t)\,dB_t,$$
where $\mathcal{L}$ is known as the infinitesimal generator of the process $X_t$, and is given by
$$\mathcal{L}f(x) = \mu(x)\,f'(x) + \tfrac{1}{2}\sigma(x)^2 f''(x).$$
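As a sanity check, the generator can be probed numerically: for small $dt$, the quantity $(\mathbb{E}[f(X_{dt}) \mid X_0 = x] - f(x))/dt$ should be close to $\mathcal{L}f(x)$. Below is a minimal sketch of this check; the drift, volatility, test function, step size, and sample count are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = lambda x: -x          # drift pulling toward 0
sigma = lambda x: 0.5      # constant volatility
f = lambda x: np.sin(x)    # test function
x0, dt, n_samples = 0.7, 1e-3, 4_000_000

# One Euler-Maruyama step from x0 for many independent samples.
dB = rng.normal(0.0, np.sqrt(dt), n_samples)
x1 = x0 + mu(x0) * dt + sigma(x0) * dB

# (E[f(X_dt)] - f(x0)) / dt approximates L f(x0), up to O(dt) bias and Monte Carlo noise.
mc_estimate = (f(x1).mean() - f(x0)) / dt
exact = mu(x0) * np.cos(x0) + 0.5 * sigma(x0)**2 * (-np.sin(x0))
print(mc_estimate, exact)   # the two numbers should be close
```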
Now, integrating Ito's formula in time from $0$ to $T$ gives
$$f(X_T) - f(X_0) = \int_0^T \mathcal{L}f(X_t)\,dt + \int_0^T \sigma(X_t)\,f'(X_t)\,dB_t.$$
Taking expectations of both sides and letting $X_0 = x$, we note that the integral above with $dB_t$ becomes zero, since it is the infinite sum of mean-zero random variables. This leaves us with
$$\mathbb{E}[f(X_T) \mid X_0 = x] - f(x) = \mathbb{E}\left[\int_0^T \mathcal{L}f(X_t)\,dt \,\Big|\, X_0 = x\right].$$
This gives us an expression for $f(x)$ in terms of the expectation of a stochastic process
$$f(x) = \mathbb{E}\left[f(X_T) - \int_0^T \mathcal{L}f(X_t)\,dt \,\Big|\, X_0 = x\right].$$
Of course, this by itself isn't a very useful expression for $f$, because the expression has $f$ on the right hand side. This is where boundary conditions for $f$ typically come in, but we will address this in the next section. First, it is important to note that it is actually possible to take $T$ to be a random variable (we won't show it here because that requires some analysis stuff that's beyond the scope of this tutorial). $T$ can be (and is very often taken to be) the hitting time of some set in $\mathbb{R}$. Often times, we let $\Omega \subset \mathbb{R}$ be an interval and then define the stopping time $T$ as the first time at which $X_t$ hits the boundary of the domain $\Omega$, i.e.,
$$T = \inf\{t \ge 0 : X_t \in \partial\Omega\}.$$
We can also define higher dimensional versions of $X_t$ in $\mathbb{R}^n$ instead of in $\mathbb{R}$. In that case, $X_t$ is defined as
$$dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dB_t.$$
In this SDE, $\mu : \mathbb{R}^n \to \mathbb{R}^n$ is a vector field on $\mathbb{R}^n$ determining the drift of the process when it is at $x$ and $\sigma : \mathbb{R}^n \to \mathbb{R}^{n \times n}$ determines the volatility (i.e., noise) of the process when it is at $x$. $B_t$ denotes the standard multidimensional Brownian motion, i.e., $dB_t \sim \mathcal{N}(0, I\,dt)$. In this case we get the same expression for $f(x)$, but the infinitesimal generator becomes
$$\mathcal{L}f(x) = \mu(x) \cdot \nabla f(x) + \tfrac{1}{2}\,\mathrm{tr}\!\left(\sigma(x)\sigma(x)^\top\,\nabla^2 f(x)\right).$$
Elliptic Partial Differential Equations
We can take the above analysis and use it to get intuition on many of the partial differential equations we are used to seeing in our everyday lives by connecting them to properties of stochastic processes (have you ever read a more ivory tower sentence?). First, let's think of the analogue to our discrete-time game where a walker wanders around a graph, collects reward $r(v)$ after leaving vertex $v$, and collects terminal reward $g(v)$ when it reaches a terminal vertex $v$. In the continuous space/time case, we instead consider $X_t$ as a stochastic process on a connected bounded domain $\Omega \subset \mathbb{R}^n$. The internal reward function $r$ is now defined on $\Omega$ and is interpreted as a rate, i.e., if $X_t$ is at $x$ for a period of time $dt$, then it picks up $r(x)\,dt$ reward. The terminal reward function $g$ is now defined on the boundary $\partial\Omega$. Our expression from the previous section tells us that, if $u$ solves the elliptic PDE with Dirichlet boundary conditions,
$$\mathcal{L}u(x) = -r(x) \ \ \text{for } x \in \Omega, \qquad u(x) = g(x) \ \ \text{for } x \in \partial\Omega,$$
then the representation formula from the previous section tells us that it must satisfy
$$u(x) = \mathbb{E}\left[\int_0^T r(X_t)\,dt + g(X_T) \,\Big|\, X_0 = x\right],$$
where $T$ is the first time $X_t$ hits the boundary $\partial\Omega$.
This formula provides a very nice probabilistic interpretation of second order elliptic PDEs.
Note in particular, this gives a nice probabilistic form for the solution of the Laplace equation with Dirichlet boundary conditions, i.e.,
$$\Delta u(x) = 0 \ \ \text{for } x \in \Omega, \qquad u(x) = g(x) \ \ \text{for } x \in \partial\Omega.$$
The solution satisfies the probabilistic expression
$$u(x) = \mathbb{E}[g(B_T) \mid B_0 = x],$$
where $B_t$ is a standard Brownian motion and $T$ is the first time $B_t$ hits $\partial\Omega$. Therefore, the Laplace equation with Dirichlet boundary conditions is the continuous analogue of the expected terminal reward problem we saw in the discrete version of this tutorial. Note, moreover, that it is possible to use this expression to evaluate $u(x)$ with Monte Carlo without solving the entire equation. This is especially useful in high dimensional settings where solving the Laplace equation can be prohibitively expensive.
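Here is a sketch of that Monte Carlo idea for the Laplace equation on the unit disk. The boundary data $g(x, y) = x$ is chosen because $x$ is itself harmonic, so the exact solution is $u(x, y) = x$ and the estimate is easy to check; the step size and number of walks are arbitrary illustrative choices, and the small overshoot past the boundary at each discrete step introduces a correspondingly small bias.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_u(x0, g, dt=1e-3, n_walks=5000):
    """Estimate u(x0) = E[ g(B_T) | B_0 = x0 ] by simulating Brownian walks to the boundary."""
    total = 0.0
    for _ in range(n_walks):
        x = np.array(x0, dtype=float)
        while np.linalg.norm(x) < 1.0:                 # walk until we exit the unit disk
            x += rng.normal(0.0, np.sqrt(dt), size=2)  # Brownian increment
        total += g(x)                                  # terminal reward g(B_T)
    return total / n_walks

print(estimate_u((0.3, 0.2), g=lambda x: x[0]))   # should be close to 0.3
```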
The Maximum Principle for Elliptic Equations
Another outcome of this probabilistic form is the maximum principle. Suppose that $\mathcal{L}$ is the infinitesimal generator of some stochastic process $X_t$, and suppose that the function $u$ satisfies
$$\mathcal{L}u(x) \ge 0 \quad \text{for all } x \in \Omega.$$
Note that when $\mathcal{L} = \Delta$, such a function $u$ is called subharmonic on $\Omega$. If this is the case, then from the probabilistic interpretation of $u$ in the previous section, we have that
$$u(x) = \mathbb{E}\left[u(X_T) - \int_0^T \mathcal{L}u(X_t)\,dt \,\Big|\, X_0 = x\right] \le \mathbb{E}[u(X_T) \mid X_0 = x] \le \max_{y \in \partial\Omega} u(y).$$
This is therefore the continuous analogue of the discrete maximum principle we saw earlier in the discrete version of this tutorial.
Hitting Time Problems
Suppose that we have a stochastic process $X_t$ on a domain $\Omega \subset \mathbb{R}^n$ given by
$$dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dB_t,$$
and we want to ask the continuous analogue of the discrete hitting time problem. That is, how long does it take $X_t$ to hit the boundary $\partial\Omega$ if it starts from $x$? We will write this function as
$$u(x) = \mathbb{E}[T \mid X_0 = x], \qquad T = \inf\{t \ge 0 : X_t \in \partial\Omega\}.$$
The probabilistic interpretation we derived in the previous sections tells us that the function $u$ that satisfies the PDE
$$\mathcal{L}u(x) = -1 \ \ \text{for } x \in \Omega, \qquad u(x) = 0 \ \ \text{for } x \in \partial\Omega$$
also satisfies the expectation above (take $r \equiv 1$ and $g \equiv 0$) and hence is the function we want.
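A quick sketch of this in one dimension: for standard Brownian motion on $\Omega = (0, 1)$, the generator is $\tfrac{1}{2}\,d^2/dx^2$, so the PDE above becomes $\tfrac{1}{2}u'' = -1$ with $u(0) = u(1) = 0$, whose solution is $u(x) = x(1 - x)$. A Monte Carlo estimate of the mean exit time should land near this value (the step size and sample count below are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_exit_time(x0, dt=1e-4, n_walks=2000):
    """Estimate E[T | X_0 = x0] for standard Brownian motion exiting (0, 1)."""
    times = []
    for _ in range(n_walks):
        x, t = x0, 0.0
        while 0.0 < x < 1.0:
            x += rng.normal(0.0, np.sqrt(dt))
            t += dt
        times.append(t)
    return np.mean(times)

x0 = 0.3
print(mean_exit_time(x0), x0 * (1 - x0))   # both should be near 0.21
```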
First Arrival Problems
Suppose that we have a stochastic process $X_t$ on a domain $\Omega \subset \mathbb{R}^n$ given by
$$dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dB_t,$$
and we want to ask the continuous analogue of the discrete first arrival problem. That is, suppose that $\partial\Omega$ can be split into two disjoint parts $\Gamma_1$ and $\Gamma_2$. What is the probability that $X_t$ hits $\Gamma_1$ before it hits $\Gamma_2$ given that it starts at $x$? Let $\mathbb{1}_{\Gamma_1}$ be the indicator function for $\Gamma_1$, then the function we want to compute is
$$u(x) = \mathbb{P}[X_T \in \Gamma_1 \mid X_0 = x] = \mathbb{E}[\mathbb{1}_{\Gamma_1}(X_T) \mid X_0 = x].$$
The probabilistic interpretation we derived in the previous sections tells us that the $u$ that satisfies the PDE
$$\mathcal{L}u(x) = 0 \ \ \text{for } x \in \Omega, \qquad u(x) = 1 \ \ \text{for } x \in \Gamma_1, \qquad u(x) = 0 \ \ \text{for } x \in \Gamma_2$$
also satisfies the expectation above and hence is the function we want.
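The same kind of check works for the first arrival problem: for standard Brownian motion on $(0, 1)$ with $\Gamma_1 = \{1\}$ and $\Gamma_2 = \{0\}$, the PDE $u'' = 0$, $u(1) = 1$, $u(0) = 0$ has solution $u(x) = x$, so the probability of hitting $1$ before $0$ starting from $x$ is exactly $x$. A minimal Monte Carlo sketch (parameters are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def prob_hit_right_first(x0, dt=1e-3, n_walks=5000):
    """Estimate P[X_T = 1 before 0 | X_0 = x0] for standard Brownian motion on (0, 1)."""
    hits = 0
    for _ in range(n_walks):
        x = x0
        while 0.0 < x < 1.0:
            x += rng.normal(0.0, np.sqrt(dt))
        hits += (x >= 1.0)
    return hits / n_walks

print(prob_hit_right_first(0.3))   # should be close to 0.3
```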
Parabolic Differential Equations
To move to parabolic differential equations, i.e., time evolution problems, we need a version of Ito's formula for functions that depend both on space and time. It turns out that adding a time variable doesn't change anything except for adding an extra time partial derivative term. Let's redo the Taylor expansion of $f(t, X_t)$ in terms of $dt$ and $dX_t$,
$$df(t, X_t) = \frac{\partial f}{\partial t}\,dt + \frac{\partial f}{\partial x}\,dX_t + \frac{1}{2}\frac{\partial^2 f}{\partial x^2}\,dX_t^2 + \dots$$
Redoing the calculation that we did earlier, this gives us
$$df(t, X_t) = \left(\frac{\partial f}{\partial t} + \mathcal{L}f\right)(t, X_t)\,dt + \sigma(X_t)\,\frac{\partial f}{\partial x}(t, X_t)\,dB_t.$$
Note that the only thing that has changed is that there is now a time partial derivative in the $dt$ term. Once again, we integrate this from $0$ to $T$ to obtain
$$f(T, X_T) - f(0, X_0) = \int_0^T \left(\frac{\partial f}{\partial t} + \mathcal{L}f\right)(t, X_t)\,dt + \int_0^T \sigma(X_t)\,\frac{\partial f}{\partial x}(t, X_t)\,dB_t.$$
Since $dB_t$ is mean zero and independent of everything that happens up until time $t$, the second integral vanishes under expectation and we are left with
$$\mathbb{E}[f(T, X_T) \mid X_0 = x] - f(0, x) = \mathbb{E}\left[\int_0^T \left(\frac{\partial f}{\partial t} + \mathcal{L}f\right)(t, X_t)\,dt \,\Big|\, X_0 = x\right].$$
Note that this is equivalent to the following more general expression (just shift the result in time):
$$f(s, x) = \mathbb{E}\left[f(T, X_T) - \int_s^T \left(\frac{\partial f}{\partial t} + \mathcal{L}f\right)(t, X_t)\,dt \,\Big|\, X_s = x\right].$$
Once again, it is also possible for $T$ to be a stopping time (the hitting time of the boundary of a set, for example), provided one assumes some nice properties about the stopping time (e.g., finite expectation); deriving this is beyond the scope of this tutorial.
In particular, if $u(t, x)$ satisfies the parabolic second-order differential equation
$$\frac{\partial u}{\partial t}(t, x) + \mathcal{L}u(t, x) = -r(t, x), \qquad u(T, x) = g(x),$$
then we get an analogue of the elliptic representation formula above,
$$u(s, x) = \mathbb{E}\left[\int_s^T r(t, X_t)\,dt + g(X_T) \,\Big|\, X_s = x\right].$$
This is a simplified version of the more general Feynman-Kac formula. Note that this is fundamentally a Kolmogorov backward equation for the value function $u$. The interpretation is that we let the stochastic process run from time $s$ to $T$ and along the way it picks up reward at a rate of $r(t, X_t)$ per unit time, until it receives a final payout of $g(X_T)$ at time $T$. The value function $u(s, x)$ measures the expected payout of this game.
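As a concrete illustration of this representation, take standard Brownian motion with no running reward ($r = 0$) and terminal payout $g(x) = x^2$. Then $u(s, x) = \mathbb{E}[(x + B_{T-s})^2] = x^2 + (T - s)$, which a Monte Carlo estimate can reproduce directly. The sketch below samples the terminal state in one shot (valid because Brownian increments are Gaussian); the sample count and parameter values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def u_estimate(s, x, T=1.0, n_samples=200_000):
    # For standard Brownian motion, X_T | X_s = x is Gaussian with mean x and
    # variance (T - s), so we can sample the terminal state directly.
    x_T = x + rng.normal(0.0, np.sqrt(T - s), n_samples)
    return np.mean(x_T ** 2)          # Monte Carlo estimate of E[g(X_T) | X_s = x]

s, x = 0.25, 0.5
print(u_estimate(s, x), x**2 + (1.0 - s))   # both should be near 1.0
```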
The Maximum Principle for Parabolic Equations
A cool corollary of the above probabilistic expression is the maximum principle for parabolic equations. Suppose that $u(t, x)$ is defined on $[0, T] \times \Omega$ and satisfies
$$\frac{\partial u}{\partial t}(t, x) + \mathcal{L}u(t, x) \ge 0$$
for $\mathcal{L}$ the infinitesimal generator of some stochastic process $X_t$. Then it follows that
$$u(s, x) \le \mathbb{E}[u(T, X_T) \mid X_s = x] \le \max_{y} u(T, y).$$
Duality
It is at this point that it is worth taking a digression for another discussion of the concept of duality. Unfortunately, unlike in the discrete case, duality is far more complicated in the continuous case because our underlying vector spaces are infinite dimensional. However, I will try to have a limited discussion of it here.
To begin, we need to consider our primal space. This will be the space of value functions analogous to the set of functions on vertices that we saw in the previous tutorial. However, in our case, I will take our primal set of functions to be the set of compactly supported infinitely differentiable functions $C_c^\infty(\mathbb{R}^n)$. Note that any function can be well-approximated by an infinitely differentiable function (see mollifiers), so this restriction is not too much of a loss. Thus, we will write $C_c^\infty(\mathbb{R}^n)$ for this set of functions.
Now, the dual of $C_c^\infty(\mathbb{R}^n)$ is defined as the set of all linear functionals on it, i.e., maps $\phi : C_c^\infty(\mathbb{R}^n) \to \mathbb{R}$ satisfying
$$\phi(\alpha f + \beta g) = \alpha\,\phi(f) + \beta\,\phi(g).$$
Note that for Banach spaces (i.e., complete normed vector spaces), the dual space usually has the additional restriction that the functionals must be continuous,
$$f_n \to f \implies \phi(f_n) \to \phi(f).$$
Since these two notions of duality are different (the second is stronger than the first), the reader should be aware of the distinction, but it won't be important for our purposes.
Note that there is not a one-to-one correspondence between the primal set $C_c^\infty(\mathbb{R}^n)$ and its dual, since the Dirac delta distribution $\delta_{x_0}$ is a member of the dual,
$$\delta_{x_0}(f) = f(x_0).$$
However, there is no function $d_{x_0} \in C_c^\infty(\mathbb{R}^n)$ with the property that
$$\int_{\mathbb{R}^n} d_{x_0}(x)\,f(x)\,dx = f(x_0) \quad \text{for all } f \in C_c^\infty(\mathbb{R}^n).$$
Therefore, $C_c^\infty(\mathbb{R}^n)$ and its dual are fundamentally different. Nonetheless, we can still nominally think of distributions as "row vectors" and functions as "column vectors", even if this relationship is strained somewhat in the continuous case. While it may not be possible to transform a distribution into a function, it is always possible to transform a function $g$ into a distribution $\phi_g$ via the inner product
$$\phi_g(f) = \langle g, f \rangle = \int_{\mathbb{R}^n} g(x)\,f(x)\,dx.$$
But the converse is not necessarily true.
In particular, all probability distributions $\rho$ on $\mathbb{R}^n$ can be interpreted as linear functionals on $C_c^\infty(\mathbb{R}^n)$ via the expectation operator
$$\phi_\rho(f) = \mathbb{E}_{x \sim \rho}[f(x)].$$
This now returns us to a very fundamental problem with stochastic processes. We have a stochastic process $X_t$ that begins with probability distribution $\rho_0$ at time $0$ and we would like to evaluate $\mathbb{E}[g(X_T)]$ at time $T$. As we saw in the previous tutorial about discrete processes, there are two ways of doing this: we can either evolve $\rho_0$ forward in time to a distribution $\rho_T$ and then take the expectation of $g$ under $\rho_T$, or we can evolve $g$ backward in time to a value function $u(0, \cdot)$ and take its expectation under $\rho_0$. The Feynman-Kac expression from the previous section already tells us exactly how we can do the latter backwards approach. So, let us start with the backwards equation first.
The Backward Kolmogorov Equation
To derive an expression for $\mathbb{E}[g(X_T)]$, first we use the law of iterated expectation,
$$\mathbb{E}[g(X_T)] = \mathbb{E}_{x \sim \rho_0}\big[\mathbb{E}[g(X_T) \mid X_0 = x]\big].$$
Thus, we see that $\mathbb{E}[g(X_T)]$ has the form
$$\mathbb{E}[g(X_T)] = \mathbb{E}_{x \sim \rho_0}[u(0, x)], \qquad u(s, x) = \mathbb{E}[g(X_T) \mid X_s = x],$$
and we know from the Feynman-Kac expression that $u$ therefore satisfies the PDE
$$\frac{\partial u}{\partial t}(t, x) + \mathcal{L}u(t, x) = 0$$
with the appropriate boundary condition $u(T, x) = g(x)$ at $t = T$. If we do a variable substitution $v(t, x) = u(T - t, x)$, then we get
$$\frac{\partial v}{\partial t}(t, x) = \mathcal{L}v(t, x)$$
with boundary condition $v(0, x) = g(x)$ at $t = 0$. Because of the similarity of this expression to the ordinary differential equation
$$\frac{dy}{dt} = \lambda y$$
for the exponential function $y(t) = e^{\lambda t}\,y(0)$, we write the operator that takes $v$ at time $0$ to $v$ at time $t$ as
$$v(t, \cdot) = e^{t\mathcal{L}}\,v(0, \cdot) = e^{t\mathcal{L}}\,g.$$
While there are some caveats, generally, one can treat the exponential of operators in much the same way as one treats the exponential of matrices. This expression therefore tells us how value functions evolve backwards in time under the dynamics of the stochastic process $X_t$. To write out the equation in its full glory, we have
$$\frac{\partial u}{\partial t}(t, x) + \mu(x) \cdot \nabla u(t, x) + \frac{1}{2}\,\mathrm{tr}\!\left(\sigma(x)\sigma(x)^\top\,\nabla^2 u(t, x)\right) = 0.$$
The Forward Kolmogorov Equation
Now, we would like to turn to how the distribution of $X_t$ evolves from $\rho_0$ to $\rho_T$. Note that the previous section gave us the following expression
$$\mathbb{E}[g(X_T)] = \mathbb{E}_{x \sim \rho_0}\big[\big(e^{T\mathcal{L}}\,g\big)(x)\big].$$
The distribution of $X_T$ is therefore given by the linear functional
$$g \mapsto \mathbb{E}_{x \sim \rho_0}\big[\big(e^{T\mathcal{L}}\,g\big)(x)\big].$$
However, this is somewhat of a cop-out answer to the question of how distributions evolve forward in time. Ideally, we would like an actual equation that describes the evolution of $\rho_t$. To do this, we have to assume that $\rho_t$ is a nice function. We will assume therefore, that it is a member of $C^\infty(\mathbb{R}^n)$ (note, not necessarily compactly supported). That is, there exists a function $\rho(t, x)$ such that
$$\mathbb{E}[g(X_t)] = \int_{\mathbb{R}^n} \rho(t, x)\,g(x)\,dx \quad \text{for all } g \in C_c^\infty(\mathbb{R}^n),$$
where the function $\rho(t, \cdot)$ is the probability density function (PDF) of $X_t$. We would like to know how this density evolves over time. To do this, we derive the weak form of the evolution equations.
First, consider test functions $f(t, x)$ that vanish at the end times $t = 0$ and $t = T$ and are compactly supported in $x$. Let us plug such an $f$ into the integrated form of Ito's formula that we derived earlier,
$$\mathbb{E}[f(T, X_T)] - \mathbb{E}[f(0, X_0)] = \mathbb{E}\left[\int_0^T \left(\frac{\partial f}{\partial t} + \mathcal{L}f\right)(t, X_t)\,dt\right].$$
But since $f(0, \cdot) = f(T, \cdot) = 0$ by construction, we have that
$$\mathbb{E}\left[\int_0^T \left(\frac{\partial f}{\partial t} + \mathcal{L}f\right)(t, X_t)\,dt\right] = 0.$$
Now, writing the expectation as an integral against the density $\rho(t, x)$ and taking the integral over time inside gives us
$$\int_0^T \int_{\mathbb{R}^n} \rho(t, x)\left(\frac{\partial f}{\partial t}(t, x) + \mathcal{L}f(t, x)\right)dx\,dt = 0.$$
We now perform an integration by parts in both $t$ and $x$ to obtain
$$\int_0^T \int_{\mathbb{R}^n} \left(-\frac{\partial \rho}{\partial t}(t, x) + \mathcal{L}^* \rho(t, x)\right) f(t, x)\,dx\,dt = 0,$$
where $\mathcal{L}^*$ is the adjoint operator of $\mathcal{L}$ (think of it like the matrix transpose). Note that the boundary terms of the integration by parts vanish because $f$ is compactly supported in $x$ and zero at $t = 0$ and $t = T$. We will not do the computation here, but the adjoint operator written out in full is
$$\mathcal{L}^* \rho(x) = -\nabla \cdot \big(\mu(x)\,\rho(x)\big) + \frac{1}{2}\sum_{i,j} \frac{\partial^2}{\partial x_i\,\partial x_j}\Big[\big(\sigma(x)\sigma(x)^\top\big)_{ij}\,\rho(x)\Big].$$
But note that since the weak-form equation above is true for any choice of test function $f$, it must be the case that
$$-\frac{\partial \rho}{\partial t}(t, x) + \mathcal{L}^* \rho(t, x) = 0,$$
or rather
$$\frac{\partial \rho}{\partial t}(t, x) = \mathcal{L}^* \rho(t, x).$$
This is the Kolmogorov forward equation for $\rho$, as it governs $\rho$'s evolution forward in time. Writing it out in full glory gives
$$\frac{\partial \rho}{\partial t}(t, x) = -\nabla \cdot \big(\mu(x)\,\rho(t, x)\big) + \frac{1}{2}\sum_{i,j} \frac{\partial^2}{\partial x_i\,\partial x_j}\Big[\big(\sigma(x)\sigma(x)^\top\big)_{ij}\,\rho(t, x)\Big].$$
It is also sometimes written in the form
$$\frac{\partial \rho}{\partial t} = -\nabla \cdot (\mu\,\rho) + \nabla \cdot (D\,\nabla \rho),$$
and in this form, it is called the advection-diffusion equation and $D$ is called the coefficient of diffusivity (for $\sigma$ constant in $x$, $D = \tfrac{1}{2}\sigma\sigma^\top$).
One special case is the heat equation
$$\frac{\partial \rho}{\partial t}(t, x) = \frac{1}{2}\Delta \rho(t, x),$$
which describes the time evolution of the probability distribution of a standard Brownian motion on $\mathbb{R}^n$. This fact is the reason why the heat equation (and moreover, any forward Kolmogorov equation) preserves the total mass functional given by
$$E[\rho](t) = \int_{\mathbb{R}^n} \rho(t, x)\,dx,$$
since $\rho(t, \cdot)$ is a probability distribution for each fixed $t$, $E[\rho](t)$ must always evaluate to one.
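A small numerical illustration of this conservation property, using an explicit finite-difference discretization of the heat equation on a periodic 1D grid (the grid size, time step, and initial density are illustrative choices; the time step respects the usual explicit-scheme stability bound):

```python
import numpy as np

n, L = 200, 1.0
dx = L / n
x = np.arange(n) * dx
dt = 0.2 * dx**2                      # stable for the explicit scheme with D = 1/2

rho = np.exp(-100 * (x - 0.5)**2)     # an (unnormalized) bump...
rho /= rho.sum() * dx                 # ...normalized so that integral of rho dx = 1

for _ in range(5000):
    # Periodic centered second difference approximating the Laplacian.
    laplacian = (np.roll(rho, -1) - 2 * rho + np.roll(rho, 1)) / dx**2
    rho = rho + dt * 0.5 * laplacian  # explicit Euler step of d rho/dt = (1/2) Laplacian(rho)

print(rho.sum() * dx)                 # still 1 (up to floating point error)
```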
Duality of the Evolution Operators
I will make a final remark that this result gives us the expression for the operator that evolves $\rho$ forward in time, namely, it is
$$\rho(t, \cdot) = e^{t\mathcal{L}^*}\,\rho(0, \cdot).$$
Furthermore, this implies the following
$$\mathbb{E}[g(X_t)] = \big\langle e^{t\mathcal{L}^*}\rho_0,\; g\big\rangle = \big\langle \rho_0,\; e^{t\mathcal{L}}\,g\big\rangle.$$
This means that the adjoint of the evolution operator for value functions is actually the evolution operator for probability distributions:
$$\big(e^{t\mathcal{L}}\big)^* = e^{t\mathcal{L}^*}.$$
Stationary Distribution Problems
Like in the discrete case, the distribution of a stochastic process with nonzero diffusion (i.e., $\sigma > 0$) usually approaches a stationary distribution $\rho_\infty$ as $t \to \infty$. To compute what this stationary distribution is, we take a look at the evolution equation for $\rho$, i.e.,
$$\frac{\partial \rho}{\partial t}(t, x) = \mathcal{L}^* \rho(t, x).$$
For the change in $\rho$ to be zero, therefore, we must have that the stationary distribution $\rho_\infty$ satisfies the elliptic equation
$$\mathcal{L}^* \rho_\infty(x) = 0.$$
Written out in full,
$$-\nabla \cdot \big(\mu(x)\,\rho_\infty(x)\big) + \frac{1}{2}\sum_{i,j} \frac{\partial^2}{\partial x_i\,\partial x_j}\Big[\big(\sigma(x)\sigma(x)^\top\big)_{ij}\,\rho_\infty(x)\Big] = 0.$$
Naturally, one must also enforce the constraint that
$$\int_{\mathbb{R}^n} \rho_\infty(x)\,dx = 1$$
to ensure the result is actually a probability distribution.
Markov Chain Monte Carlo with Langevin Dynamics
Langevin dynamics is an important tool in computational science for sampling from distributions on $\mathbb{R}^n$. Imagine that we are given a probability distribution $\pi$ such that we can evaluate the gradient of its logarithm, $\nabla \log \pi(x)$. $\pi$ may be the posterior of a statistical model after it has been trained on data, for example. Often times, in these statistics settings, evaluating $\pi$ is hard (due to normalization factors), but $\nabla \log \pi$ can be evaluated easily. Another common situation where one wants to sample from a distribution is in statistical physics, where the equilibrium distributions of systems are often times known and one is interested in computing properties of the system.
Sometimes we might want to evaluate the mode of $\pi$ (i.e., compute a maximum likelihood estimator). This is often done using gradient descent on $-\log \pi$, whose continuous time analogue can be written as
$$dX_t = \nabla \log \pi(X_t)\,dt.$$
If $X_t$ is allowed to run long enough, eventually it will converge to a local mode of the distribution $\pi$. However, often times the mode of a distribution is not a good representation of the distribution as a whole, and we would rather have the ability to sample from it. To see how this can be done, consider the above equation, but with a bit of noise added from Brownian motion,
$$dX_t = \nabla \log \pi(X_t)\,dt + \sqrt{2}\,dB_t.$$
This process is referred to as overdamped Langevin dynamics. Let's compute the stationary distribution of this process using the machinery from the previous section. The stationary distribution $\rho_\infty$ should satisfy
$$-\nabla \cdot \big(\rho_\infty(x)\,\nabla \log \pi(x)\big) + \Delta \rho_\infty(x) = 0.$$
We can rearrange this to get
$$\nabla \cdot \big(\nabla \rho_\infty(x) - \rho_\infty(x)\,\nabla \log \pi(x)\big) = 0.$$
Therefore,
$$\nabla \rho_\infty(x) - \rho_\infty(x)\,\nabla \log \pi(x) = c(x),$$
where $c(x)$ is a divergence free vector field (i.e., $\nabla \cdot c = 0$). However, note that since $\rho_\infty$ is a probability distribution, it must vanish as $\|x\| \to \infty$, and hence so must the left hand side above. There is no nonzero everywhere differentiable divergence free vector field of this form which vanishes as $\|x\| \to \infty$. Therefore $c \equiv 0$ and
$$\nabla \rho_\infty(x) = \rho_\infty(x)\,\nabla \log \pi(x).$$
Note that there is always a nonzero probability density that $X_t$ ends up at any given location $x$, so $\rho_\infty(x) > 0$, and we can divide by $\rho_\infty(x)$ to get
$$\frac{\nabla \rho_\infty(x)}{\rho_\infty(x)} = \nabla \log \pi(x).$$
But the left hand side is just
$$\nabla \log \rho_\infty(x),$$
which means
$$\nabla \log \rho_\infty(x) = \nabla \log \pi(x).$$
Or rather,
$$\rho_\infty(x) = C\,\pi(x)$$
for some constant $C$. But integrating both sides gives
$$\int_{\mathbb{R}^n} \rho_\infty(x)\,dx = C \int_{\mathbb{R}^n} \pi(x)\,dx.$$
And since both $\rho_\infty$ and $\pi$ are probability distributions, their integrals evaluate to one and we therefore have $C = 1$ and
$$\rho_\infty(x) = \pi(x).$$
Therefore, the stationary distribution of $X_t$ is precisely the distribution $\pi$. This means we can draw a sample from $\pi$ by sampling $X_0$ from some arbitrary distribution (preferably one already close to $\pi$ if we can) and letting $X_t$ run for a long time. After this burn-in period, we examine $X_t$ and the result is an approximate sample from $\pi$. This is an example of a Markov Chain Monte Carlo technique. This technique forms the basis for stable diffusion models in Machine Learning.
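Here is a minimal sketch of this procedure, often called the unadjusted Langevin algorithm: discretize the Langevin SDE with Euler-Maruyama steps and keep the iterates after a burn-in period. The target below is a standard 2D Gaussian (so $\nabla \log \pi(x) = -x$); the step size, burn-in length, and sample count are illustrative choices, and the finite step size introduces a small bias that vanishes as the step size shrinks.

```python
import numpy as np

rng = np.random.default_rng(7)

def grad_log_pi(x):
    return -x                                    # standard Gaussian target

def langevin_samples(x0, step=1e-2, n_burn=5000, n_samples=20000):
    """Unadjusted Langevin MCMC: X <- X + step * grad log pi(X) + sqrt(2 * step) * noise."""
    x = np.array(x0, dtype=float)
    out = []
    for k in range(n_burn + n_samples):
        noise = rng.normal(0.0, np.sqrt(2 * step), size=x.shape)
        x = x + step * grad_log_pi(x) + noise    # one Euler-Maruyama step of the Langevin SDE
        if k >= n_burn:
            out.append(x.copy())
    return np.array(out)

samples = langevin_samples(x0=[3.0, -3.0])
print(samples.mean(axis=0), samples.var(axis=0))   # roughly [0, 0] and [1, 1]
```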