Finite Difference Equations from Discrete Stochastic Processes
This is a quick tutorial on the deep connection between elliptic/parabolic systems and stochastic processes. The connection provides a lot of useful intuition for interpreting elliptic/parabolic PDEs as well as Laplacian systems on graphs. Part I covers discrete-time stochastic processes on graphs.
This tutorial is meant for students at the graduate or advanced undergraduate level.
Hitting Time Problems on the Path Graph
Let us consider the simplest case of an elliptic system - namely an elliptic system on the path graph $G$ with $N$ vertices, labelled $1, \dots, N$. Let us define a random process $X_t$ for $t = 0, 1, 2, \dots$ as a random walk on the graph (i.e., the process moves left or right with equal probability, unless it is at the boundary, where it bounces back).
A natural question to ask about the process is, given that $X_t$ starts at a certain location (say, $X_0 = x$), how long will it take the process to hit either the left or right boundary of the graph? For some visual flair, maybe imagine that $X_t$ is a drunk guy who is trying to get home and is stumbling back and forth on a circular driveway (the endpoints of $G$ are his home) -- and we want to ask how long on average it will take the drunk man to get home?
This question turns out to essentially be a linear system. To see why, let's define the stopping time $\tau$ to be the first time that $X_t$ arrives either at $1$ or $N$. The above problem is asking us to compute the expected value of $\tau$, i.e., compute the function
$$u(x) = \mathbb{E}[\tau \mid X_0 = x].$$
A linear system for $u$ can be expressed using a technique known as first transition analysis. First transition analysis expands the above expression by considering all of the possible transitions that the process can make starting at $x$. We know that $X_t$ can go left or right with probability $1/2$, therefore by the properties of conditional expectation, we have that
$$u(x) = \frac{1}{2} \mathbb{E}[\tau \mid X_0 = x, X_1 = x + 1] + \frac{1}{2} \mathbb{E}[\tau \mid X_0 = x, X_1 = x - 1].$$
However, note that the process is Markovian, that is, it has no memory of the past. What the process does at time $t + 1$ doesn't depend on anything except the state of $X$ at time $t$. Therefore, it follows that
$$\mathbb{E}[\tau \mid X_0 = x, X_1 = x \pm 1] = u(x \pm 1) + 1.$$
In English, this mathematical statement is expressing the fact that the expected time until $X_t$ hits the boundary, given that it is at state $x \pm 1$ at time $1$, is the same as the expected time to hit the boundary had it started at $x \pm 1$ originally, plus one (because we have effectively rewound the clock by one unit of time and need to account for this). Thus, we can plug this into the first transition analysis expression for $u(x)$ to get
$$u(x) = \frac{1}{2}\left(u(x + 1) + 1\right) + \frac{1}{2}\left(u(x - 1) + 1\right).$$
Rearranging,
$$-\frac{1}{2}\left(u(x + 1) - 2 u(x) + u(x - 1)\right) = 1.$$
Note however, that this is only valid for the interior of the path graph, and therefore we need to apply boundary conditions at $x = 1$ and $x = N$. This is fairly easy, because if the process starts at either of the ends of the path, then the expected amount of time until it reaches the ends of the graph should be zero,
$$u(1) = u(N) = 0.$$
If we write out the matrix for this system (eliminating the known boundary values), we see it has the form
$$\frac{1}{2} \begin{bmatrix} 2 & -1 & & \\ -1 & 2 & -1 & \\ & \ddots & \ddots & \ddots \\ & & -1 & 2 \end{bmatrix} \begin{bmatrix} u(2) \\ u(3) \\ \vdots \\ u(N-1) \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}.$$
You might recognize the matrix above as the second order finite difference stencil matrix for the Poisson equation in 1D. As we will see later, this is not entirely a coincidence, as the Poisson equation has a similar probabilistic interpretation.
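To make this concrete, here is a minimal numerical sketch (the choice of $N$ and all variable names are mine): build the stencil matrix above, solve for the interior values of $u$, and check against the closed-form answer $(x-1)(N-x)$ known from the classical gambler's ruin problem.

```python
import numpy as np

# Hitting times on the path graph with N vertices, labelled 1..N,
# with absorbing boundary at vertices 1 and N.
N = 11
n = N - 2  # number of interior vertices 2..N-1

# The second-order finite difference stencil matrix, scaled by 1/2.
A = 0.5 * (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))
u_interior = np.linalg.solve(A, np.ones(n))

# Compare with the closed-form gambler's ruin answer (x - 1)(N - x).
x = np.arange(2, N)
print(np.allclose(u_interior, (x - 1) * (N - x)))  # True
```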
Hitting Time Problems on General Graphs
Now that we've seen a basic elliptic system arise out of a hitting time problem for a stochastic process on the path graph, it is time to move to more general types of graphs. Suppose we have a graph $G = (V, E)$ and a stochastic process $X_t$ defined on the vertex set $V$. If we have a special set of death vertices $\Gamma \subset V$, we can once again ask the analogue of the above hitting time question: how long does $X_t$ take to hit a vertex in $\Gamma$, given that it starts at $x$? Once again, we want to compute
$$u(x) = \mathbb{E}[\tau \mid X_0 = x],$$
where $\tau$ is the hitting time of $\Gamma$, i.e.,
$$\tau = \min\{t \geq 0 : X_t \in \Gamma\}.$$
If we only consider $X_t$ which are Markovian, the dynamics of $X_t$ can be characterized completely by its probability transition kernel $P$ (also known as the transition matrix), defined by
$$P_{xy} = \mathbb{P}(X_{t+1} = y \mid X_t = x),$$
i.e., the probability that we make the transition $x \to y$ if we are at vertex $x$. If we do first transition analysis on the process $X_t$, we can write down an expression for $u$ like in the previous section,
$$u(x) = \sum_{y \in V} P_{xy} \left(u(y) + 1\right) = 1 + \sum_{y \in V} P_{xy}\, u(y).$$
Once again, rearranging this expression yields
$$u(x) - \sum_{y \in V} P_{xy}\, u(y) = 1.$$
This equation is often written as
$$-L u = \mathbf{1},$$
where
$$L = P - I$$
is known as the generator of the process $X_t$. You may recognize this matrix from spectral graph theory, where it is the negative of the (random walk) graph Laplacian.
To get a well-posed system of equations for $u$, we must also enforce boundary conditions, namely,
$$u(x) = 0, \quad x \in \Gamma.$$
Together this gives us the complete elliptic system of equations for the hitting time of $\Gamma$ from anywhere in the graph, which we can write down as
$$\begin{cases} -(L u)(x) = 1, & x \in V \setminus \Gamma, \\ u(x) = 0, & x \in \Gamma. \end{cases}$$
A very nice thing about these types of problems is that they can be solved in nearly linear time by Laplacian system solvers.
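For intuition (not performance), here is a small dense-linear-algebra sketch of this system; a serious implementation at scale would use a fast Laplacian solver instead, and the function and variable names here are my own.

```python
import numpy as np

def hitting_times(P, death):
    """Solve -Lu = 1 off the death set, with u = 0 on it (L = P - I)."""
    n = P.shape[0]
    interior = [v for v in range(n) if v not in death]
    # Restrict I - P to the interior; boundary terms vanish since u = 0 there.
    A = np.eye(len(interior)) - P[np.ix_(interior, interior)]
    u = np.zeros(n)
    u[interior] = np.linalg.solve(A, np.ones(len(interior)))
    return u

# Example: random walk on the path graph with 5 vertices, absorbing ends.
P = np.zeros((5, 5))
for v in range(1, 4):
    P[v, v - 1] = P[v, v + 1] = 0.5
P[0, 0] = P[4, 4] = 1.0
print(hitting_times(P, {0, 4}))  # [0. 3. 4. 3. 0.]
```

Note that restricting $I - P$ to the interior vertices is exactly how the Dirichlet boundary condition $u = 0$ enters the linear algebra.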
However, there are many other types of interesting problems concerning stochastic processes which also reduce to elliptic systems. We will examine an example in the next section.
Expected Terminal Reward Problems on Graphs
Suppose instead of being interested in the hitting time of a set $\Gamma$, we are instead interested in some kind of terminal reward on $\Gamma$. Denote this reward as a function $f : \Gamma \to \mathbb{R}$. Maybe think of this like $X_t$ is a person wandering through a maze until he gets to an exit $x \in \Gamma$, whereupon he receives the reward $f(x)$. A question we can ask is the expected terminal reward that the process receives given it starts at a vertex $x$, i.e.
$$u(x) = \mathbb{E}[f(X_\tau) \mid X_0 = x].$$
Turns out we can use first transition analysis here too! Let's try it,
$$u(x) = \sum_{y \in V} P_{xy}\, \mathbb{E}[f(X_\tau) \mid X_0 = x, X_1 = y].$$
Now, note that by the Markov property,
$$\mathbb{E}[f(X_\tau) \mid X_0 = x, X_1 = y] = \mathbb{E}[f(X_\tau) \mid X_0 = y] = u(y),$$
because the expected terminal reward is the same whether $X_t$ starts at $y$ or we know that $X_t$ transitioned to $y$ on the first time step. Note we don't have to add one in this case like we did for the hitting time problem --- this is because the value of $\tau$ does not impact the terminal reward. Therefore,
$$u(x) = \sum_{y \in V} P_{xy}\, u(y),$$
which can be written as
$$(L u)(x) = 0, \quad x \in V \setminus \Gamma.$$
However, we now need different boundary conditions. Note that if $X_t$ starts on a vertex $x \in \Gamma$, then we immediately terminate and receive a reward of $f(x)$. Therefore, we have that
$$\begin{cases} (L u)(x) = 0, & x \in V \setminus \Gamma, \\ u(x) = f(x), & x \in \Gamma; \end{cases}$$
this is sometimes called a Laplace equation with Dirichlet boundary conditions.
First Arrival Problems
An interesting subset of the above collection of problems are first arrival problems. Suppose the termination set is split into two disjoint subsets $\Gamma = A \cup B$. What is the probability that, when we terminate, we do so in $A$? Another way of phrasing this problem: what is the probability that $X_t$ hits $A$ before it hits $B$? It turns out that solving this problem is very easy with the above machinery. Simply define $f$ to be $1$ on $A$ and $0$ on $B$. Then, the expected reward $u(x)$ is precisely the probability that we hit $A$ before $B$.
The system of equations written out in full is
$$\begin{cases} (L u)(x) = 0, & x \in V \setminus \Gamma, \\ u(x) = 1, & x \in A, \\ u(x) = 0, & x \in B. \end{cases}$$
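The same dense approach handles terminal rewards, and hence first arrival probabilities. This sketch (function name and toy example mine) recovers the classic gambler's ruin probabilities on a path with 5 vertices:

```python
import numpy as np

def terminal_reward(P, f, terminal):
    """Solve Lu = 0 off the terminal set with u = f on it."""
    n = P.shape[0]
    interior = [v for v in range(n) if v not in terminal]
    boundary = sorted(terminal)
    A = np.eye(len(interior)) - P[np.ix_(interior, interior)]
    # Dirichlet data enters as the inhomogeneous term sum_y P_xy f(y).
    b = P[np.ix_(interior, boundary)] @ f[boundary]
    u = f.astype(float)
    u[interior] = np.linalg.solve(A, b)
    return u

# First arrival: probability of hitting A = {4} before B = {0},
# encoded as f = 1 on A and f = 0 on B.
P = np.zeros((5, 5))
for v in range(1, 4):
    P[v, v - 1] = P[v, v + 1] = 0.5
P[0, 0] = P[4, 4] = 1.0
f = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
print(terminal_reward(P, f, {0, 4}))  # [0. 0.25 0.5 0.75 1.]
```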
General Discounted Reward Problems
The most general form of an elliptic system arising from a stochastic process on a graph can be phrased in terms of discounted reward. Suppose, like above, that $X_t$ wanders through the graph playing the above game until it arrives at a vertex in $\Gamma$ (this time though, we allow $\Gamma = \emptyset$, i.e., the game continues forever). When $X_t$ arrives at a terminal vertex in $\Gamma$, the game stops and $X_t$ receives a reward $f(X_\tau)$. However, along the way, the process also receives an ongoing stream of rewards depending on the vertices it visits. That is, if $X_t \notin \Gamma$, then the process receives reward $g(X_t)$ at time $t$, where $g$ is defined on the non-terminal vertices, i.e. $g : V \setminus \Gamma \to \mathbb{R}$. This problem revolves around the expected total reward received over the course of the game,
$$u(x) = \mathbb{E}\left[\sum_{t=0}^{\tau - 1} g(X_t) + f(X_\tau) \,\middle|\, X_0 = x\right].$$
In general settings, one may also consider discounted reward, where the reward received at time $t$ is discounted by a factor of $\gamma^t$ for some $\gamma \in (0, 1]$. This means that reward collected at the beginning of the game is more important than reward received later on (this is a common situation in economics or finance, where having money in the bank can net you interest, so one dollar today is worth more than one dollar a year from now). Then, the total discounted reward reads
$$u(x) = \mathbb{E}\left[\sum_{t=0}^{\tau - 1} \gamma^t g(X_t) + \gamma^\tau f(X_\tau) \,\middle|\, X_0 = x\right].$$
How do we solve for $u$? You guessed it! First transition analysis! Here we go:
$$\begin{aligned} u(x) &= \mathbb{E}\left[\sum_{t=0}^{\tau - 1} \gamma^t g(X_t) + \gamma^\tau f(X_\tau) \,\middle|\, X_0 = x\right] \\ &= g(x) + \gamma \sum_{y \in V} P_{xy}\, \mathbb{E}\left[\sum_{t=1}^{\tau - 1} \gamma^{t-1} g(X_t) + \gamma^{\tau - 1} f(X_\tau) \,\middle|\, X_1 = y\right] \\ &= g(x) + \gamma \sum_{y \in V} P_{xy}\, u(y), \end{aligned}$$
where in the second line we used the fact that the process receives reward $g(x)$ and then transitions to a new state $y$ with probability $P_{xy}$, and in the third line we used the Markov property of $X_t$. Therefore, on $V \setminus \Gamma$, we have
$$u(x) - \gamma \sum_{y \in V} P_{xy}\, u(y) = g(x),$$
and on $\Gamma$, we have
$$u(x) = f(x).$$
In matrix form, this can be written as
$$\begin{cases} (I - \gamma P) u = g & \text{on } V \setminus \Gamma, \\ u = f & \text{on } \Gamma. \end{cases}$$
Therefore, if you ever see a linear system of the above form on a graph (i.e., a Laplacian system with discounting), where $P$ is a transition matrix, you can always interpret the result with the expression
$$u(x) = \mathbb{E}\left[\sum_{t=0}^{\tau - 1} \gamma^t g(X_t) + \gamma^\tau f(X_\tau) \,\middle|\, X_0 = x\right],$$
where the inhomogeneous term $g(x)$ is the reward received for leaving vertex $x$, the Dirichlet boundary condition $f(x)$ is the terminal reward received when the game ends as $X_t$ arrives at $\Gamma$, and $\gamma$ is the discount factor of this game. Note that if the termination set $\Gamma$ is empty, then the game runs forever (for the total reward to be finite, the discount factor must satisfy $\gamma < 1$ in this case), and the expression for the reward is given by
$$u(x) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t g(X_t) \,\middle|\, X_0 = x\right].$$
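As a sanity check on this interpretation, here is a sketch (the toy cycle graph and all names are my own) that solves $(I - \gamma P)u = g$ with an empty termination set and compares against a truncated Monte Carlo estimate of the discounted reward:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random walk on a cycle with n vertices; no termination set.
n, gamma = 6, 0.9
P = np.zeros((n, n))
for v in range(n):
    P[v, (v - 1) % n] = P[v, (v + 1) % n] = 0.5
g = rng.random(n)  # reward for leaving each vertex

# Solve the discounted Laplacian system (I - gamma * P) u = g.
u = np.linalg.solve(np.eye(n) - gamma * P, g)

# Monte Carlo estimate of the discounted reward starting from vertex 0.
trials, horizon, total = 2000, 200, 0.0
for _ in range(trials):
    x, disc = 0, 1.0
    for _ in range(horizon):
        total += disc * g[x]
        disc *= gamma
        x = (x + rng.choice([-1, 1])) % n
print(u[0], total / trials)  # the two should roughly agree
```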
An immediate consequence of these probabilistic expressions is the maximum principle.
Discrete Maximum Principle for Subharmonic Functions
A function $u$ defined on the vertices of a graph $G$ equipped with a stochastic process with generator $L$ is called subharmonic on $\Omega \subset V$ if
$$(L u)(x) \geq 0, \quad x \in \Omega.$$
(It is harmonic if $(L u)(x) = 0$.) Note that any $u$ is trivially the solution to the system
$$\begin{cases} ((I - P) u)(x) = -(L u)(x), & x \in \Omega, \\ u(x) = u(x), & x \in \partial\Omega, \end{cases}$$
(the first equation is just the definition $L = P - I$, and the boundary condition pins $u$ to its own values),
where we define $\partial\Omega$ as the set of vertices that are not in $\Omega$, but are adjacent to a vertex in $\Omega$. Therefore, the probabilistic interpretation we've developed (the $\gamma = 1$ reward problem with running reward $-Lu$ and terminal reward $u$) tells us that, for $x \in \Omega$,
$$u(x) = \mathbb{E}\left[u(X_\tau) - \sum_{t=0}^{\tau - 1} (L u)(X_t) \,\middle|\, X_0 = x\right] \leq \mathbb{E}[u(X_\tau) \mid X_0 = x] \leq \max_{y \in \partial\Omega} u(y),$$
where $\tau$ is the hitting time of $\partial\Omega$. Note that we have used the fact that $(L u)(X_t) \geq 0$ to drop the sum and then used the fact that $X_\tau \in \partial\Omega$.
Therefore, if $u$ is a subharmonic function on a domain $\Omega$ in the graph $G$, its maximum must always occur on the boundary of $\Omega$. This is a fundamental result in analysis (for subharmonic functions on $\mathbb{R}^n$), but this probabilistic interpretation of the Laplacian gives us a super easy way of seeing why the result is true.
A byproduct of this analysis is that if the function $u$ is actually harmonic, instead of just subharmonic, we have that, for $x \in \Omega$,
$$u(x) = \mathbb{E}[u(X_\tau) \mid X_0 = x] = \sum_{y \in \partial\Omega} \mathbb{P}(X_\tau = y \mid X_0 = x)\, u(y).$$
Thus, $u(x)$ is essentially just a weighted average of $u$ over the boundary $\partial\Omega$, where the contribution of $u(y)$ depends on the probability of the walk ending up at $y$ when it terminates.
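Here is a quick numerical illustration of the mean-value property and the maximum principle (the graph and boundary data are my own toy choices): the harmonic extension of boundary data on a path is the linear interpolant, and its values stay between the boundary values.

```python
import numpy as np

# Random walk on the path graph with 7 vertices; Omega is the interior,
# and the boundary consists of the two endpoints.
n = 7
P = np.zeros((n, n))
for v in range(1, n - 1):
    P[v, v - 1] = P[v, v + 1] = 0.5
P[0, 0] = P[n - 1, n - 1] = 1.0

# Harmonic extension of boundary data u(0) = 2, u(6) = 5:
# solve (I - P) u = 0 on the interior.
interior = list(range(1, n - 1))
A = np.eye(n - 2) - P[np.ix_(interior, interior)]
b = P[np.ix_(interior, [0, n - 1])] @ np.array([2.0, 5.0])
u = np.linalg.solve(A, b)

print(u)  # linear interpolation between 2 and 5
print(2 <= u.min() and u.max() <= 5)  # maximum principle holds: True
```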
This idea of expressing a function in terms of an expectation of a function of a stochastic process gives us a glimpse of a fundamental technique in stochastic differential equations called Itô's formula (to be discussed later).
Stationary Distribution Problems
Another important property is the stationary distribution of stochastic processes on graphs. Stationary distributions are extremely important in many data science applications. For example, one way to rank the relative importance of web domains on the internet is to think about a random walk $X_t$ through the internet and ask the question: after a very long time, what fraction of its time has $X_t$ spent at a specific webpage? It turns out this question has a unique answer (for most graphs) that doesn't depend on the starting point of $X_t$. Indeed, the average fraction of the time $X_t$ spends at $x$ (over an infinitely long time) is given by the entries of a special distribution $\pi$ called the stationary distribution. I won't go into much detail here, but basically if $\mu_t$ denotes the probability distribution of $X_t$ at time $t$ on the graph, then $\pi$ is the limit of the $\mu_t$ as $t \to \infty$.
The key property which defines the stationary distribution is that if $X_0$ has distribution $\pi$, i.e., $X_0 \sim \pi$, then $X_1, X_2$, etc. also have distribution $\pi$. Hence why $\pi$ is called stationary. To write down an equation for the stationary distribution, we should first write down how the distributions $\mu_t$ evolve with time. This turns out not to be too difficult, as to compute $\mu_{t+1}(y)$ we simply need to sum over all of the possible ways $X_{t+1}$ can reach $y$ and weigh those quantities by the probability of making the transition $x \to y$,
$$\mu_{t+1}(y) = \sum_{x \in V} \mu_t(x) P_{xy}.$$
If we represent $\mu_t$ as a row vector, this is written in matrix form by
$$\mu_{t+1} = \mu_t P.$$
Note that since multiplication by $P$ occurs on the right, effectively the transpose of the probability transition kernel has been used. We remark that the stationary distribution $\pi$ should therefore satisfy
$$\pi = \pi P.$$
Rewriting this gives us
$$\pi L = 0.$$
Note that this equation is actually underdetermined, because any scalar multiple of $\pi$ will also solve the above problem. This can be fixed by adding the additional constraint that the entries of $\pi$ should sum to $1$,
$$\sum_{x \in V} \pi(x) = 1.$$
If the chain is aperiodic and the graph connected, this system of equations gives the unique stationary distribution (see the fundamental theorem of Markov chains). In particular, in this situation the generator $L$ has exactly one zero eigenvalue, corresponding to the all-ones vector $\mathbf{1}$,
$$L \mathbf{1} = (P - I)\mathbf{1} = 0.$$
This is the right zero eigenvector of $L$; the left zero eigenvector of $L$ is $\pi$.
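A small sketch of computing $\pi$ numerically (a random row-stochastic $P$ is used as a stand-in; on large graphs one would use eigensolvers or power iteration instead): stack the equations $\pi L = 0$ and $\sum_x \pi(x) = 1$ and solve by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

# A random row-stochastic transition matrix (strictly positive entries,
# so the chain is irreducible and aperiodic and pi is unique).
n = 5
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)

L = P - np.eye(n)  # the generator

# Stack pi L = 0 (transposed to act on column vectors) with 1^T pi = 1.
A = np.vstack([L.T, np.ones((1, n))])
b = np.zeros(n + 1)
b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(pi @ P, pi))  # pi is stationary: True
```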
Duality
In the above section, we saw a hint of what is called duality. In particular, it is the duality between functions on the vertex set $V$ and distributions on the vertex set $V$. This notion of duality is worth investigating because it reveals two fundamental approaches to computations involving these types of stochastic processes: forward and backward computation.
First, let us denote the set of real-valued functions on $V$ by $\mathcal{F}(V)$. Note that $\mathcal{F}(V)$ is a vector space. The dual space $\mathcal{F}(V)^*$ is defined as the set of all linear functionals $\mu : \mathcal{F}(V) \to \mathbb{R}$. That is, functionals satisfying
$$\mu(\alpha f + \beta g) = \alpha\, \mu(f) + \beta\, \mu(g).$$
This space is itself a vector space over $\mathbb{R}$, because you can add two distributions $\mu$ and $\nu$ together via:
$$(\alpha \mu + \beta \nu)(f) = \alpha\, \mu(f) + \beta\, \nu(f).$$
Linearity and finite dimensionality allow us to find a correspondence between the space $\mathcal{F}(V)$ and the dual space $\mathcal{F}(V)^*$. If we let $\mathbb{1}_x$ denote the indicator function on the vertex $x$ (i.e., $1$ on $x$ and zero elsewhere), any function $f$ on $V$ can be written as:
$$f = \sum_{x \in V} f(x)\, \mathbb{1}_x.$$
And hence, any distribution $\mu$ on $V$ can be computed as:
$$\mu(f) = \sum_{x \in V} f(x)\, \mu(\mathbb{1}_x).$$
This is why, for finite linear algebra, one can save time and just define $\mu$ as a vector with entries $\mu_x = \mu(\mathbb{1}_x)$. By convention, distributions are treated as row-vectors whereas functions are treated as column-vectors. This means that the product of the two as vectors corresponds to the evaluation of the distribution:
$$\mu f = \sum_{x \in V} \mu_x f(x) = \mu(f).$$
In particular, $\mu$ is said to be a probability distribution if the following is true,
$$\mu_x \geq 0 \text{ for all } x \in V, \qquad \mu \mathbf{1} = 1,$$
where $\mathbf{1}$ is the function that takes value $1$ everywhere. When this happens, the product between $\mu$ and $f$ is the expected value of $f(X)$ for a random variable $X \sim \mu$,
$$\mu f = \mathbb{E}_{X \sim \mu}[f(X)].$$
Why is this picture useful for computation? Well, observe that if $X_t$ is a Markov chain with initial distribution $\mu_0$ and transition probability matrix $P$, then we have that
$$\mathbb{E}[f(X_t)] = (\mu_0 P^t) f = \mu_0 (P^t f).$$
This is somewhat profound, because it shows us that there are two completely different approaches to computing the expected value of $f(X_t)$ at time $t$:
- Forward: Compute the probability distribution $\mu_t = \mu_0 P^t$ and then take the product with $f$.
- Backward: Compute the quantity $f_t = P^t f$ and then take the product with the initial probability distribution $\mu_0$.
The first method of computation is pretty clear: we simply take the initial probability distribution of the Markov chain and evolve it forward in time. However, the second method of computation is a little profound: we can take the final value function $f$ at time $t$ and evolve it backwards in time to get a function $f_t = P^t f$. This is the crux of duality for stochastic processes: probability distributions always evolve forward in time, whereas value functions (things we compute expectations of) evolve backward in time. The latter is very closely related to the Heisenberg picture from quantum physics -- we don't actually ever have to think about the initial state of the system, we only need to think about the quantity that we want to evaluate at time $t$, and this will give us a corresponding quantity that can be evaluated at time $0$ with any starting probability distribution. Let us examine the equations governing the evolution of these objects in the next two sections.
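The following sketch (toy data; names my own) verifies the two routes numerically: evolving the distribution forward and pairing with $f$ gives the same number as evolving $f$ backward and pairing with $\mu_0$.

```python
import numpy as np

rng = np.random.default_rng(2)

n, t = 5, 20
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
mu0 = rng.random(n); mu0 /= mu0.sum()  # distribution (row vector)
f = rng.random(n)                      # function (column vector)

# Forward: evolve the distribution, then pair with f.
mu = mu0.copy()
for _ in range(t):
    mu = mu @ P
forward = mu @ f

# Backward: evolve the function, then pair with mu0.
ft = f.copy()
for _ in range(t):
    ft = P @ ft
backward = mu0 @ ft

print(np.allclose(forward, backward))  # True
```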
The Forward Kolmogorov Equation
First we look at the governing equations for the evolution of probability distributions. Let us recall from the previous section that the equation governing the evolution of the probability distributions is given by
$$\mu_{t+1} = \mu_t P.$$
By subtracting $\mu_t$ from both sides, we arrive at
$$\mu_{t+1} - \mu_t = \mu_t (P - I),$$
which can be rewritten as
$$\Delta_t \mu = \mu L,$$
where $L = P - I$ is the generator of $X_t$ (i.e., the negative of the graph Laplacian) and $\Delta_t$ is the discrete time derivative, $(\Delta_t \mu)_t = \mu_{t+1} - \mu_t$. This is the discrete-time forward Kolmogorov equation or the discrete-time heat equation. Note the similarity of this equation to the continuous heat equation,
$$\frac{\partial \mu}{\partial t} = \frac{1}{2} \Delta \mu,$$
where $\Delta$ is the continuous Laplacian. However, there are some notable differences between the discrete and continuous heat equations. The factor of $1/2$ is to some extent the result of convention, but it is hard to explain away for now. Morally, you should think of $L$ as the discrete analogue of $\frac{1}{2}\Delta$ (note the factor of $1/2$ that appeared in the introduction section). Indeed, on grid graphs, an appropriately normalized $L$ ($L$ is symmetric on grid graphs) will "converge" to $\frac{1}{2}\Delta$ as the grid becomes infinitely fine. This bridge allows us to translate intuition between the worlds of discrete spectral graph theory and continuous PDEs.
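Here is a quick numerical check of this convergence claim in 1D (grid spacing and test function are my own choices): for the simple random walk on a grid with spacing $h$, $(Lu)(x) = \frac{1}{2}(u(x+h) - 2u(x) + u(x-h))$, so $\frac{2}{h^2} Lu$ should approximate $u''$.

```python
import numpy as np

h = 1e-3
x = np.arange(-1, 1, h)
u = np.sin(3 * x)

# Action of L = P - I for the simple random walk, at interior grid points.
Lu = 0.5 * (u[2:] - 2 * u[1:-1] + u[:-2])

# (2 / h^2) * Lu should converge to the second derivative u'' as h -> 0.
u_xx = -9 * np.sin(3 * x[1:-1])
print(np.allclose((2 / h**2) * Lu, u_xx, atol=1e-3))  # True
```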
The Backward Kolmogorov Equation
In contrast to probability distributions, value functions are evolved via the expression
$$u_s = P u_{s+1}, \qquad u_T = f,$$
where $u_s(x) = \mathbb{E}[f(X_T) \mid X_s = x]$ is the value function at time $s$ for a fixed horizon $T$ (in the notation of the previous section, $u_s = P^{T-s} f = f_{T-s}$).
We therefore perform the same calculation as the previous section to arrive at
$$-\nabla_t u = L u,$$
where $\nabla_t$ is the backward finite difference operator, $(\nabla_t u)_{s+1} = u_{s+1} - u_s$. The negative sign arises from the fact that we are evolving this equation backwards in time. If we substitute variables $v_s = u_{T-s}$, we instead get
$$\Delta_t v = L v.$$
However, we can ask for a fair bit more than this. Instead of just having a payoff at time $T$, we can have a series of payoffs at different times. Let us modify the discounted reward problem a bit to turn it into a time evolution problem. Suppose we have a similar setup as before: i.e., we have a process $X_t$ on the graph $G$. However, this time, the reward function is allowed to be time dependent, so it is a function $g(x, t)$ both of a vertex and a time. Suppose that we let the process run on the graph until time $T$: at every time step, it collects reward $g(X_t, t)$, until time $T$, when it collects terminal reward $f(X_T)$ and the game ends. Note that unlike $\tau$ from the previous sections, which is a stopping time and may be random, $T$ is simply a constant that we choose beforehand. Now, we ask the question: given we start the game with $X_0 = x$, what is the expected amount of (discounted) reward that accumulates in time up to $T$? That is, we want to compute the value function of the state $x$,
$$u(x, 0) = \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^t g(X_t, t) + \gamma^T f(X_T) \,\middle|\, X_0 = x\right].$$
One way to solve this problem is by recursion. Let us extend the value function to starting times $s > 0$. We do this as follows:
$$u(x, s) = \mathbb{E}\left[\sum_{t=s}^{T-1} \gamma^{t-s} g(X_t, t) + \gamma^{T-s} f(X_T) \,\middle|\, X_s = x\right].$$
That is, $u(x, s)$ is the expected reward collected if we start at time $s$ and at state $x$. We can use first transition analysis to write down an equation for this function that we can then solve,
$$u(x, s) = g(x, s) + \gamma \sum_{y \in V} P_{xy}\, u(y, s + 1).$$
Note that the above is a linear system for the $u(\cdot, s)$ in terms of the $u(\cdot, s+1)$. Therefore, since we have the boundary condition
$$u(x, T) = f(x),$$
we can recursively solve for the $u(\cdot, s)$ backwards in time until we get the $u(\cdot, 0)$ we originally desired. Let's rewrite the system of equations to put it into a more (subjectively) digestible form,
$$-(u(x, s + 1) - u(x, s)) = g(x, s) + ((\gamma P - I)\, u(\cdot, s + 1))(x).$$
Note we have simply subtracted $u(x, s+1)$ from both sides. Note also that the left hand side of the equation is essentially a time derivative of $u$. Thus, we can write the whole equation above in matrix form as
$$-\nabla_t u = g + (\gamma P - I) u,$$
where $\nabla_t$ is the backward finite difference time derivative. When the discount factor $\gamma$ is $1$, we get
$$-\nabla_t u = L u + g.$$
Note that the negative signs here are an artifact of the fact that we are solving backwards in time. If instead we rewrite $v(\cdot, s) = u(\cdot, T - s)$, then we have
$$\Delta_t v = L v + \tilde{g},$$
where $\tilde{g}(\cdot, s) = g(\cdot, T - s - 1)$ is the running reward reindexed to run forward in time.
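Before comparing with the continuous equations, here is a sketch of the backward recursion in code (toy data; the Monte Carlo check and all names are mine), marching $u(\cdot, s) = g(\cdot, s) + \gamma P u(\cdot, s+1)$ from $s = T$ down to $s = 0$:

```python
import numpy as np

rng = np.random.default_rng(3)

n, T, gamma = 4, 10, 0.95
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
g = rng.random((n, T))  # g[x, s]: reward at vertex x and time s
f = rng.random(n)       # terminal reward at time T

# Backward-in-time recursion from the boundary condition u(., T) = f.
u = f.copy()
for s in reversed(range(T)):
    u = g[:, s] + gamma * (P @ u)

# Monte Carlo check starting from vertex 0.
trials, est = 2000, 0.0
for _ in range(trials):
    x, disc, total = 0, 1.0, 0.0
    for s in range(T):
        total += disc * g[x, s]
        disc *= gamma
        x = rng.choice(n, p=P[x])
    est += total + disc * f[x]
print(u[0], est / trials)  # should roughly agree
```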
Compare the backward equation $-\nabla_t u = L u + g$ to the continuous backward heat equation with an inhomogeneous term,
$$-\frac{\partial u}{\partial t} = \frac{1}{2} \Delta u + g.$$
We will examine how all of these results from the discrete case carry over into the continuous case in subsequent sections.
You should also compare it to the forward Kolmogorov equation and note that since the forward equation involves vector multiplication on the left, it effectively involves the transpose/adjoint $L^T$ of the operator $L$. A good way to make sense of this is to think of it in the following way: essentially, $L^T$ generates the evolution of distributions forwards in time, while $L$ generates the evolution of value functions / test functions backwards in time. When the underlying process is reversible (i.e., $X_{T-t}$ is the same process as $X_t$, so the process looks the same when time is reversed), we have that $L = L^T$ and the backwards and forwards equations are the same. This is a manifestation of the fact that the evolution of the process forward in time and backward in time are indistinguishable. This happens to be the case on grid graphs.