Stochastic Processes and Differential Equations
Discrete Processes

Finite Difference Equations from Discrete Stochastic Processes


This is a quick tutorial on the deep connection between elliptic/parabolic systems and stochastic processes. The connection gives a lot of useful intuition about how to interpret elliptic/parabolic PDEs, as well as Laplacian systems on graphs. Part I covers discrete-time stochastic processes on graphs.

This tutorial is meant for students at the graduate or advanced undergraduate level.

Hitting Time Problems on the Path Graph

Let us consider the simplest case of an elliptic system, namely an elliptic system on the path graph $P_N$ with $N$ vertices, labelled $0, \dots, N-1$. Let us define a random process $X_t \in V(P_N)$ for $t \in \mathbb{Z}_{\geq 0}$ as a random walk on the graph $P_N$ (i.e., the process $X_t$ moves left or right with equal probability, unless it is at the boundary, where it bounces back).

A natural question to ask about the process $X_t$ is: given that $X_t$ starts at a certain location (say, $X_0=n$), how long will it take the process $X_t$ to hit either the left or right boundary of the graph? For some visual flair, imagine that $X_t$ is a drunk man trying to get home, stumbling back and forth on a circular driveway (the endpoints of $P_N$ are his home), and we want to ask how long, on average, it will take him to get home.

This question turns out to essentially be a linear system. To see why, let's define the stopping time $T$ to be the first time that $X_t$ arrives either at $0$ or $N-1$. The above problem is asking us to compute the expected value of $T$, i.e., compute the function

\begin{equation} u(x)=\mathbb{E}[T \mid X_0=x] \,. \end{equation}

A linear system for $u(x)$ can be expressed using a technique known as first transition analysis. First transition analysis expands the above expression by considering all of the possible transitions that the process $X_t$ can make starting at $X_0=x$. We know that $X_t$ can go left or right with probability $1/2$; therefore, by the properties of conditional expectation, we have that

\begin{equation} u(x)=\frac{1}{2}\mathbb{E}[T\mid X_1 = x-1 ] +\frac{1}{2} \mathbb{E} [T \mid X_1=x+1] \,. \end{equation}

However, note that the process $X_t$ is Markovian; that is, it has no memory of the past. What the process $X_t$ does at time $t+1$ doesn't depend on anything except the state of $X_t$ at time $t$. Therefore, it follows that

\begin{equation} \mathbb{E}[T \mid X_1 = y] = \mathbb{E}[T \mid X_0 = y ]+1 \,. \end{equation}

In English, this mathematical statement expresses the fact that the expected time until $X_t$ hits the boundary, given that it is at state $y$ at time $1$, is the same as the expected time for $X_t$ to hit the boundary had it started from $y$ originally, plus one (because we have effectively rewound the clock by one unit of time and need to account for this). Thus, we can plug this into the first transition analysis expression for $u(x)$ to get

\begin{equation} u(x)=\frac{1}{2}\mathbb{E}[T \mid X_0=x-1] + \frac{1}{2}\mathbb{E}[T \mid X_0=x+1] + 1=\frac{1}{2} u(x-1) + \frac{1}{2}u(x+1) + 1 \,. \end{equation}

Rearranging,

\begin{equation} -\frac{1}{2}u(x-1)+u(x)-\frac{1}{2}u(x+1)=1\,. \end{equation}

Note, however, that this is only valid on the interior of the path graph $P_N$; therefore, we need to apply boundary conditions at $x=0$ and $x=N-1$. This is fairly easy: if the process starts at either of the ends of the path, then the expected amount of time until it reaches the ends of the graph should be zero,

\begin{equation} \begin{split} u(0)&=0 \,, \\ u(N-1)&=0 \,. \end{split} \end{equation}

If we write out the matrix for this system, we see it has the form

\begin{equation} \frac{1}{2} \left[ \begin{array}{cccccc} 2 & -1 & & & & \\ -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & -1 & 2 & -1 \\ & & & & -1 & 2 \end{array}\right] \left[\begin{array}{c} u(1) \\ u(2) \\ u(3) \\ \vdots \\ u(N - 3) \\ u(N - 2) \end{array}\right] = \left[\begin{array}{c} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \\ 1 \end{array}\right] \end{equation}

You might recognize the matrix above as the second order finite difference stencil matrix for the Poisson equation in 1D. As we will see later, this is not entirely a coincidence, as the Poisson equation has a similar probabilistic interpretation.
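To make this concrete, here is a minimal numerical sketch (not from the original text; it assumes NumPy) that assembles the tridiagonal system above, solves it, and checks the answer against the closed form $u(x) = x(N-1-x)$, which is the standard gambler's ruin formula for the expected duration of this walk:

```python
import numpy as np

# Hitting time of {0, N-1} for a simple random walk on the path graph P_N.
N = 10  # number of vertices, labelled 0, ..., N-1

# Assemble the (N-2) x (N-2) system (1/2) * tridiag(-1, 2, -1) u = 1
# over the interior vertices 1, ..., N-2.
A = 0.5 * (2 * np.eye(N - 2) - np.eye(N - 2, k=1) - np.eye(N - 2, k=-1))
u = np.linalg.solve(A, np.ones(N - 2))

# Closed form for this walk (a standard gambler's ruin fact): u(x) = x (N-1-x).
x = np.arange(1, N - 1)
assert np.allclose(u, x * (N - 1 - x))
print(u)
```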

Hitting Time Problems on General Graphs

Now that we've seen a basic elliptic system arise out of a hitting time problem for a stochastic process on the path graph $P_N$, it is time to move to more general graphs. Suppose we have a graph $G=(V,E)$ and a stochastic process $X_t \in V$ defined on the vertex set $V$. If we have a special set of death vertices $D \subset V$, we can once again ask the analogue of the above hitting time question: how long does $X_t$ take to hit a vertex in $D$, given that it starts at $x \in V$? Once again, we want to compute

\begin{equation} u(x)=\mathbb{E}[T \mid X_0=x] \,, \end{equation}

where $T$ is the hitting time of $D$, i.e.,

\begin{equation} T = \inf \{ t \mid X_t \in D \} \,. \end{equation}

If we only consider $X_t$ which are Markovian, the dynamics of $X_t$ can be characterized completely by its probability transition kernel $P$ (also known as the transition matrix), defined by

\begin{equation} P(x,y)=\mathbb{P}(X_t=y \mid X_{t-1}=x) \,, \end{equation}

i.e., the probability that we make the transition $x \rightarrow y$ if we are at vertex $x$. If we do first transition analysis on the process $X_t$, we can write down an expression for $u(x)$ like in the previous section,

\begin{equation} u(x) = \sum_v P(x,v) \, u(v) + 1\,. \end{equation}

Once again, rearranging this expression yields

\begin{equation} u(x)-\sum_v P(x,v) \, u(v) = 1 \,. \end{equation}

This equation is often written as

\begin{equation} \mathcal{L} u=-1\,, \end{equation}

where

\begin{equation} \mathcal{L}=P-I \,, \end{equation}

is known as the generator of the process $X_t$. You may recognize this matrix from spectral graph theory, where it is the negative of the (random walk) graph Laplacian.

To get a well-posed system of equations for $u$, we must also enforce boundary conditions, namely,

\begin{equation} u(x)=0 \,, \qquad \text{for } x \in D \,. \end{equation}

Together this gives us the complete elliptic system of equations for the hitting time of $D$ from anywhere in the graph, which we can write down as

\begin{equation} \begin{split} \mathcal{L} u=-1 \,, &\qquad \text{on } V \setminus D \,, \\ u=0\,, &\qquad \text{on } D \,. \end{split} \end{equation}

A very nice thing about these types of problems is that they can be solved in nearly linear time by Laplacian system solvers.
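As an illustration, here is a small sketch (the adjacency matrix below is an arbitrary example, not anything from the text) that sets up and solves the system $\mathcal{L}u = -1$ on $V \setminus D$, $u = 0$ on $D$, for a random walk on a general graph:

```python
import numpy as np

# Expected hitting time of a death set D for a random walk on a general graph.
# The adjacency matrix below is an arbitrary example.
adj = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0],
], dtype=float)
P = adj / adj.sum(axis=1, keepdims=True)   # random walk transition matrix

D = [4]                                     # death vertices
free = [v for v in range(len(P)) if v not in D]

# Solve (I - P) u = 1 on V \ D (equivalently, L u = -1), with u = 0 on D.
I_P = np.eye(len(free)) - P[np.ix_(free, free)]
u = np.zeros(len(P))
u[free] = np.linalg.solve(I_P, np.ones(len(free)))
print(u)  # expected number of steps to reach D from each vertex
```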

However, there are many other types of interesting problems concerning stochastic processes which also reduce to elliptic systems. We will examine an example in the next section.

Expected Terminal Reward Problems on Graphs

Suppose instead of being interested in the hitting time of a set $D \subset V$, we are interested in some kind of terminal reward on $D$. Denote this reward as a function $g: D \rightarrow \mathbb{R}$. Think of it like this: $X_t$ is a person wandering through a maze until he gets to an exit $v \in D$, whereupon he receives the reward $g(v)$. A question we can ask about $X_t$ is the expected terminal reward that the process $X_t$ receives given that it starts at a vertex $x$, i.e.,

\begin{equation} u(x)=\mathbb{E}[g(X_T) \mid X_0=x] \,. \end{equation}

Turns out we can use first transition analysis here too! Let's try it,

\begin{equation} u(x)=\sum_v \mathbb{P}(X_1=v \mid X_0 = x) \, \mathbb{E}[g(X_T) \mid X_1 = v] \,. \end{equation}

Now, note that by the Markov property,

\begin{equation} \mathbb{E}[g(X_T) \mid X_1=v ]= \mathbb{E}[g(X_T) \mid X_0= v] \,, \end{equation}

because the expected terminal reward is the same if $X_t$ starts at $v$ or if we know that $X_t$ transitioned to $v$ on the first time step. Note we don't have to add one in this case like we did for the hitting time problem; this is because the value of $T$ does not impact the terminal reward. Therefore,

\begin{equation} u(x)=\sum_v P(x,v) \, u(v) \,, \end{equation}

which can be written as

\begin{equation} \mathcal{L} u=0\,, \qquad \text{on } V\setminus D \,. \end{equation}

However, we now need different boundary conditions. Note that if $X_t$ starts on a vertex $x\in D$, then we immediately terminate and receive a reward of $g(x)$. Therefore, we have that

\begin{equation} u=g\,, \qquad \text{on } D \,. \end{equation}

This system is sometimes called a Laplace equation with Dirichlet boundary conditions.

First Arrival Problems

An interesting subset of the above collection of problems are first arrival problems. Suppose the termination set $D \subset V$ is split into two disjoint subsets $A \cup B = D$. What is the probability that, when we terminate, we do so in $A$? Another way of phrasing this: what is the probability that $X_t$ hits $A$ before it hits $B$? It turns out that solving this problem is very easy with the above machinery. Simply define $g$ to be $1$ on $A$ and $0$ on $B$. Then, the expected reward is precisely the probability that we hit $A$ before $B$.

The system of equations written out in full is

\begin{equation} \begin{split} \mathcal{L}u=0\,,&\qquad \text{on } V\setminus D \,, \\ u=1\,,&\qquad \text{on }A \,, \\ u=0\,,&\qquad \text{on }B \,. \end{split} \end{equation}
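For a concrete check, the sketch below (again an illustrative example, not from the text) solves the first arrival system on the path graph $P_N$ with $A = \{N-1\}$ and $B = \{0\}$, where the answer is the classic gambler's ruin probability $u(x) = x/(N-1)$:

```python
import numpy as np

# Probability that a simple random walk on P_N hits A = {N-1} before B = {0}.
N = 10
P = np.zeros((N, N))
for x in range(1, N - 1):
    P[x, x - 1] = P[x, x + 1] = 0.5

free = list(range(1, N - 1))
I_P = np.eye(N - 2) - P[np.ix_(free, free)]

# The boundary data g (1 on A, 0 on B) enters through the right-hand side:
# rhs(x) = sum over boundary vertices v of P(x, v) g(v).
rhs = P[np.ix_(free, [N - 1])] @ np.array([1.0])
u = np.linalg.solve(I_P, rhs)

# Gambler's ruin closed form: u(x) = x / (N - 1).
assert np.allclose(u, np.arange(1, N - 1) / (N - 1))
print(u)
```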

General Discounted Reward Problems

The most general form of an elliptic system arising from a stochastic process on a graph can be phrased in terms of discounted reward. Suppose, like above, that $X_t$ wanders through the graph playing the above game until it arrives at a vertex in $D$ (this time, though, we allow $D=\emptyset$, i.e., the game may continue forever). When $X_t$ arrives at a terminal vertex $x \in D$, the game stops and $X_t$ receives a reward $g(x)$. However, along the way, the process $X_t$ also receives an ongoing stream of rewards depending on the vertices it visits. That is, if $X_t=x$, then the process receives $f(x)$ reward at time $t$, where $f$ is defined on the non-terminal vertices, i.e., $f: V \setminus D \rightarrow \mathbb{R}$. This problem revolves around the expected total reward received over the course of the game,

\begin{equation} u(x)=\mathbb{E}\left[\sum_{t = 0}^{T-1}f(X_t)+g(X_T) \mid X_0=x\right] \,. \end{equation}

In general settings, one may also consider discounted reward, where the reward received at time $t$ is discounted by a factor of $\gamma^t$ for some $0 \leq \gamma \leq 1$. This means that reward collected at the beginning of the game is more important than reward received later on (this is a common situation in economics or finance, where having money in the bank can net you interest, so one dollar today is worth more than one dollar a year from now). Then, the total discounted reward reads

\begin{equation} u(x)=\mathbb{E}\left[\sum_{t = 0}^{T-1} \gamma^t f(X_t) + \gamma^T g(X_T) \mid X_0=x\right]\,. \end{equation}

How do we solve for $u(x)$? You guessed it! First transition analysis! Here we go:

\begin{align} &\mathbb{E}\left[\sum_{t = 0}^{T-1}\gamma^t f(X_t)+\gamma^T g(X_T) \mid X_0 = x\right] \\ &\quad= \sum_v P(x,v) \left(\mathbb{E}\left[\sum_{t=1}^{T-1} \gamma^t f(X_t)+\gamma^T g(X_T) \mid X_1 =v\right] +f(x)\right) \\ &\quad=f(x)+\gamma \sum_v P(x,v) \, \mathbb{E}\left[\sum_{t=0}^{T-2} \gamma^t f(X_{t+1})+\gamma^{T-1}g(X_T) \mid X_1=v\right] \\ &\quad=f(x)+\gamma \sum_v P(x,v) \, \mathbb{E}\left[\sum_{t = 0}^{T-1} \gamma^t f(X_t)+\gamma^T g(X_T) \mid X_0=v\right] \\ &\quad=f(x)+\gamma \sum_v P(x,v) \, u(v)\,, \end{align}

where in the second line we used the fact that the process receives $f(x)$ reward and then transitions to a new state $v$ with probability $P(x,v)$; in the third line we factored out one power of $\gamma$; and in the fourth line we used the Markov property of $X_t$. Therefore, on $V\setminus D$, we have

\begin{equation} u(x)- \gamma\sum_v P(x,v) \, u(v) =f(x) \,, \end{equation}

and on $D$, we have

\begin{equation} u(x)=g(x)\,. \end{equation}

In matrix form, this can be written as

\begin{equation} \begin{split} (I-\gamma P)u=f\,,&\qquad \text{on }V\setminus D \,, \\ u=g\,,&\qquad \text{on } D \,. \end{split} \end{equation}

Therefore, if you ever see a linear system of the above form on a graph (i.e., a Laplacian system with discounting), where $P$ is a transition matrix, you can always interpret the result with the expression

\begin{equation} u(x)=\mathbb{E}\left[\sum_{t = 0}^{T-1}\gamma^t f(X_t) + \gamma^T g(X_T)\mid X_0=x\right]\,, \end{equation}

where the inhomogeneous term $f(x)$ is the reward received for leaving vertex $x$, the Dirichlet boundary condition $g(x)$ is the terminal reward received when the game ends as $X_t$ arrives at $x$, and $\gamma$ is the discount factor of the game. Note that if the termination set $D$ is empty, then the game runs forever (for the total reward to be finite, the discount factor must satisfy $0 \leq \gamma < 1$ in this case), and the expression for the reward is given by

\begin{equation} u(x)=\mathbb{E}\left[\sum_{t = 0}^\infty \gamma^t f(X_t) \mid X_0 = x \right]\,. \end{equation}
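As a sanity check on this interpretation, the sketch below (with an arbitrary randomly generated transition matrix $P$ and reward $f$, which are placeholders rather than anything from the text) solves $(I - \gamma P)u = f$ directly and compares against a truncated Neumann series $\sum_t \gamma^t P^t f$, which is exactly the discounted reward summed in expectation over time steps:

```python
import numpy as np

# Infinite-horizon discounted reward with D empty: u = (I - gamma P)^{-1} f.
rng = np.random.default_rng(0)
n, gamma = 6, 0.9

A = rng.random((n, n))
P = A / A.sum(axis=1, keepdims=True)    # an arbitrary transition matrix
f = rng.random(n)                       # running reward for each vertex

u = np.linalg.solve(np.eye(n) - gamma * P, f)

# Sanity check against the Neumann series u = sum_t gamma^t P^t f.
u_series, term = np.zeros(n), f.copy()
for _ in range(500):
    u_series += term
    term = gamma * (P @ term)
assert np.allclose(u, u_series)
print(u)
```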

An immediate consequence of these probabilistic expressions is the maximum principle:

Discrete Maximum Principle for Subharmonic Functions

A function $v$ defined on the vertices of a graph $G$, equipped with a stochastic process $X_t$ with generator $\mathcal{L}$, is called subharmonic on $A \subset V$ if

\begin{equation} \mathcal{L} v\geq 0\,,\qquad \text{on }A\,. \end{equation}

(It is harmonic if $\mathcal{L}v=0$.) Note that any $v$ is trivially the solution to the system

\begin{equation} \begin{split} (I-P)v=-\mathcal{L}v\,,&\qquad \text{on }A\,, \\ v=v\,,&\qquad \text{on }\partial A \,, \end{split} \end{equation}

where we define $\partial A$ as the set of vertices that are not in $A$, but are adjacent to a vertex in $A$. Therefore, the probabilistic interpretation we've developed tells us that, for $x \in A$,

\begin{equation} v(x)=\mathbb{E}\left[\sum_{t = 0}^{T-1}-\mathcal{L} v(X_t)+v(X_T)\mid X_0=x\right]\leq\mathbb{E}\left[v(X_T) \mid X_0 = x\right]\leq \max_{y \in \partial A} v(y)\,, \end{equation}

where $T$ is the hitting time of $\partial A$. Note that we have used the fact that $\mathcal{L}v\geq 0$ to drop the $\sum_{t=0}^{T-1}(-\mathcal{L}v)(X_t)$ term, and then used the fact that $X_T \in \partial A$.

Therefore, if $v$ is a subharmonic function on a domain $A$ in the graph $G$, its maximum must always occur on the boundary of $A$. This is a fundamental result in analysis (for subharmonic functions on $\mathbb{R}^n$), but this probabilistic interpretation of the Laplacian gives us a super easy way of seeing why the result is true.
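Here is a tiny numerical illustration on the path graph (my own example, not from the text): $v(x) = x^2$ is subharmonic for the random walk generator, since $(\mathcal{L}v)(x) = 1 \geq 0$ on the interior, so its maximum over $A$ and $\partial A$ must occur on the boundary:

```python
import numpy as np

# Max principle check on P_N: v(x) = x^2 satisfies (L v)(x) = 1 >= 0 on the
# interior (so v is subharmonic there), and its max over A = {1, ..., N-2}
# is dominated by its max over the boundary dA = {0, N-1}.
N = 20
x = np.arange(N, dtype=float)
v = x ** 2

Lv = 0.5 * (v[:-2] + v[2:]) - v[1:-1]     # generator applied at interior vertices
assert np.all(Lv >= 0)                     # subharmonic on the interior

assert v[1:-1].max() <= max(v[0], v[-1])   # maximum occurs on the boundary
print(v[1:-1].max(), max(v[0], v[-1]))
```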

A byproduct of this analysis is that if the function $v(x)$ is actually harmonic, instead of just subharmonic, we have that, for $x \in A$,

\begin{equation} v(x)=\mathbb{E}\left[v(X_T) \mid X_0 =x \right] = \sum_{y\in\partial A}\mathbb{P}(X_T=y \mid X_0 = x) \, v(y) \,. \end{equation}

Thus, $v(x)$ is essentially just a weighted average over the boundary $\partial A$, where the contribution of $v(y)$ depends on the probability of the walk $X_t$ ending up at $y$ when it terminates.

This idea of expressing a function in terms of an expectation of a function of a stochastic process $X_t$ gives us a glimpse of a fundamental technique in stochastic differential equations called Ito's formula (to be discussed later).

Stationary Distribution Problems

Another important property is the stationary distribution of a stochastic process $X_t$ on a graph. Stationary distributions are extremely important in many data science applications. For example, one way to rank the relative importance of web domains on the internet is to think about a random walk $X_t$ through the internet and ask the question: after a very long time, what fraction of its time has $X_t$ spent at a specific webpage? It turns out this question has a unique answer (for most graphs) that doesn't depend on the starting point of $X_t$. Indeed, the average fraction of the time $X_t$ spends at $x$ (over an infinitely long time) is given by the entries of a special distribution $\pi$ called the stationary distribution. I won't go into much detail here, but basically, if $\pi_t$ denotes the probability distribution of $X_t$ at time $t$ on the graph, then $\pi$ is the limit of the $\pi_t$ as $t \rightarrow \infty$.

The key property which defines the stationary distribution is that if $X_0$ has distribution $\pi$, i.e., $X_0 \sim \pi$, then $X_1, X_2, X_3, \dots$ also have distribution $\pi$. Hence why $\pi$ is called stationary. To write down an equation for the stationary distribution, we should first write down how the distributions $\pi_t$ evolve with time. This turns out not to be too difficult: to compute $\pi_t(x)$, we simply need to sum over all of the possible ways $X_{t-1}$ can reach $x$ and weigh those quantities by the probability of making the transition $X_{t-1} \rightarrow x$,

\begin{equation} \pi_t(x)=\sum_v \mathbb{P}(X_t=x \mid X_{t-1}=v) \, \mathbb{P}(X_{t-1}=v) = \sum_v P(v,x) \, \pi_{t-1}(v) \,. \end{equation}

If we represent $\pi_t$ as a row vector, this is written in matrix form by

\begin{equation} \pi_t= \pi_{t - 1} P \,. \end{equation}

Note that since multiplication by $P$ occurs on the right, effectively the transpose of the probability transition kernel has been used. We remark that the stationary distribution should therefore satisfy

\begin{equation} \pi = \pi P\,. \end{equation}

Rewriting this gives us

\begin{equation} \pi \mathcal{L} = \bm{0} \,. \end{equation}

Note that this equation is actually underdetermined, because any scalar multiple of $\pi$ will also solve the above problem. This can be resolved by adding the additional constraint that the entries of $\pi$ should sum to $1$,

\begin{equation} \sum_x \pi(x) = 1 \,. \end{equation}

If the graph is aperiodic and connected, this system of equations gives the unique stationary distribution (see the fundamental theorem of Markov chains). In particular, in this situation the generator $\mathcal{L}$ has exactly one zero eigenvalue, whose corresponding right eigenvector is the all-ones vector $\bm{1}$,

\begin{equation} \mathcal{L} \bm{1} = \bm{0} \,. \end{equation}

This is the right zero eigenvector of $\mathcal{L}$; the left zero eigenvector of $\mathcal{L}$ is $\pi$.
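The following sketch (with an arbitrary ergodic transition matrix as a placeholder) computes $\pi$ by solving $\pi \mathcal{L} = 0$ together with the normalization constraint, and cross-checks the answer by evolving a distribution forward in time:

```python
import numpy as np

# Stationary distribution of an arbitrary ergodic chain: solve pi L = 0
# together with the normalization sum(pi) = 1 as one consistent linear system.
rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n)) + 0.1
P = A / A.sum(axis=1, keepdims=True)    # strictly positive, hence ergodic

L = P - np.eye(n)                       # the generator
M = np.vstack([L.T, np.ones(n)])        # stack the normalization row
rhs = np.zeros(n + 1)
rhs[-1] = 1.0
pi, *_ = np.linalg.lstsq(M, rhs, rcond=None)

# Cross-check: evolving any distribution forward converges to pi.
pi_t = np.full(n, 1.0 / n)
for _ in range(1000):
    pi_t = pi_t @ P
assert np.allclose(pi, pi_t)
print(pi)
```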

Duality

In the above section, we saw a hint of what is called duality. In particular, it is the duality between functions on the vertex set $V$ and distributions on the vertex set $V$. This notion of duality is worth investigating because it reveals two fundamental approaches to computations involving these types of stochastic processes, forward and backward computation.

First, let us denote the set of real-valued functions on $V$ by $\mathbb{R}^{V}$. Note that $\mathbb{R}^{V}$ is a vector space. The dual space $(\mathbb{R}^{V})^*$ is defined as the set of all linear functionals $\pi : \mathbb{R}^{V} \rightarrow \mathbb{R}$, that is, functionals satisfying

\begin{equation} \pi[\alpha f + \beta g] = \alpha \pi[f] + \beta \pi[g] \,. \end{equation}

This space is itself a vector space over $\mathbb{R}$ because you can add two distributions $\pi_1$ and $\pi_2$ together via:

\begin{equation} (\pi_1 + \pi_2)[f] = \pi_1[f] + \pi_2[f] \,. \end{equation}

Linearity and finite dimensionality allow us to find a correspondence between the space $\mathbb{R}^{V}$ and the dual space $(\mathbb{R}^{V})^*$. If we let $\bm{1}_v$ denote the indicator function of the vertex $v$ (i.e., $1$ on $v$ and zero elsewhere), any function on $V$ can be written as:

\begin{equation} f = \sum_v f(v) \bm{1}_v \,. \end{equation}

And hence, the action of any distribution $\pi$ can be computed as:

\begin{equation} \pi[f] = \sum_v f(v) \pi[\bm{1}_v] \,. \end{equation}

This is why, in finite-dimensional linear algebra, one can save time and just define $\pi$ as a vector with entries $\pi(v) = \pi[\bm{1}_v]$. By convention, distributions $\pi$ are treated as row vectors, whereas functions $f$ are treated as column vectors. This means that the product of the two as vectors corresponds to the evaluation of the distribution:

\begin{equation} \pi f = \sum_v \pi(v) f(v) = \sum_v f(v) \pi[\bm{1}_v] = \pi[f] \,. \end{equation}

In particular, $\pi$ is said to be a probability distribution if the following is true,

\begin{equation} \begin{split} \pi[\bm{1}_v] &\geq 0 \,, \\ \pi[\bm{1}] &= 1 \,, \end{split} \end{equation}

where $\bm{1}$ is the function that takes value $1$ everywhere. When this happens, the product between $\pi$ and $f$ is the expected value of $f(X)$ for a random variable $X \sim \pi$,

\begin{equation} \pi f = \mathbb{E}_{X \sim \pi}[f(X)] \,. \end{equation}

Why is this picture useful for computation? Well, observe that if $X_t$ is a Markov chain with transition probability matrix $P$, then we have that

\begin{equation} \mathbb{E}[f(X_t)] = \pi_t f = (\pi_0 P^t) f = \pi_0 (P^t f) \,. \end{equation}

This is somewhat profound, because it shows us that there are two completely different approaches to computing the expected value of $f(X_t)$ at time $t$:

  1. Forward: Compute the probability distribution $\pi_t = \pi_0 P^t$ and then take the product with $f$.
  2. Backward: Compute the quantity $f^{(0)} = P^t f$ by evolving $f$ backwards in time, and then take the product with the initial probability distribution $\pi_0$.

The first method of computation is pretty clear: we simply take the initial probability distribution $\pi_0$ of the Markov chain and evolve it forward in time. However, the second method is a little more profound: we can take the final value function $f^{(t)} = f$ at time $t$ and evolve it backwards in time to get a function $f^{(0)}$. This is the crux of duality for stochastic processes: probability distributions always evolve forward in time, whereas value functions (things we compute expectations of) evolve backward in time. The latter is very closely related to the Heisenberg picture from quantum physics; we don't actually ever have to think about the initial state of the system, we only need to think about the quantity $f^{(t)}$ that we want to evaluate at time $t$, and this will give us a corresponding quantity $f^{(0)}$ that can be evaluated at time $0$ with any starting probability distribution. Let us examine the equations governing the evolution of these objects in the next two sections.
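Before moving on, here is a short sketch verifying that the forward and backward computations of $\mathbb{E}[f(X_t)]$ agree (the chain and value function below are arbitrary examples, not from the text):

```python
import numpy as np

# Two ways to compute E[f(X_t)]: push pi_0 forward, or pull f backward.
rng = np.random.default_rng(1)
n, t = 5, 7
A = rng.random((n, n))
P = A / A.sum(axis=1, keepdims=True)
pi0 = np.full(n, 1.0 / n)               # initial distribution (row vector)
f = rng.random(n)                       # value function (column vector)

pi_t = pi0.copy()
for _ in range(t):                      # forward: pi_t = pi_0 P^t
    pi_t = pi_t @ P
forward = pi_t @ f

f0 = f.copy()
for _ in range(t):                      # backward: f^(0) = P^t f
    f0 = P @ f0
backward = pi0 @ f0

assert np.isclose(forward, backward)    # both equal pi_0 P^t f
print(forward, backward)
```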

The Forward Kolmogorov Equation

First, we look at the governing equations for the evolution of probability distributions. Recall from the previous section that the equation governing the evolution of the probability distributions $\pi_t$ is given by

\begin{equation} \pi_{t+1}=\pi_t P \,. \end{equation}

By subtracting $\pi_t$ from both sides, we arrive at

\begin{equation} \pi_{t+1} - \pi_t = \pi_t (P - I) \,, \end{equation}

which can be rewritten as

\begin{equation} \Delta_t \pi_t = \pi_t \mathcal{L} \,, \end{equation}

where $\mathcal{L} = P - I$ is the generator of $X_t$ (i.e., the negative of the graph Laplacian) and $\Delta_t$ is the forward discrete time derivative, $\Delta_t \pi_t = \pi_{t+1} - \pi_t$. This is the discrete-time forward Kolmogorov equation, or the discrete-time heat equation. Note the similarity of this equation to the continuous heat equation,

\begin{equation} \frac{\partial \pi_t}{\partial t}=\frac{1}{2} \Delta \, \pi_t \,, \end{equation}

where $\Delta$ is the continuous Laplacian. However, there are some notable differences between the discrete and continuous heat equations. The factor of $\frac{1}{2}$ is to some extent the result of convention, but it is hard to explain away for now. Morally, you should think of $\mathcal{L}$ as the discrete analogue of $\frac{1}{2} \Delta$ (note the factor of $1/2$ that appeared in the path graph section). Indeed, on grid graphs, an appropriately normalized $\mathcal{L}$ ($\mathcal{L}$ is symmetric on grid graphs) will "converge" to $\frac{1}{2} \Delta$ as the grid becomes infinitely fine. This bridge allows us to translate intuition between the worlds of discrete spectral graph theory and continuous PDEs.
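To see this convergence numerically, the sketch below (my own illustration) applies the rescaled generator $\frac{1}{h^2}\mathcal{L}$ of the random walk on a 1D grid with spacing $h$ to $f(x) = \sin(x)$ and compares against $\frac{1}{2}f''(x) = -\frac{1}{2}\sin(x)$:

```python
import numpy as np

# On a 1D grid with spacing h, the rescaled generator (1/h^2) L applied to a
# smooth function approaches (1/2) f''. Here f = sin, so the target is -sin/2.
for h in [0.1, 0.01, 0.001]:
    x = np.arange(0.0, 2.0 * np.pi, h)
    f = np.sin(x)
    Lf = 0.5 * (f[:-2] + f[2:]) - f[1:-1]        # generator at interior nodes
    err = np.abs(Lf / h**2 + 0.5 * np.sin(x[1:-1])).max()
    print(h, err)                                 # error shrinks like h^2
```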

The Backward Kolmogorov Equation

In contrast to probability distributions, value functions $f$ are evolved via the expression

\begin{equation} f^{(t - 1)} = P f^{(t)} \,. \end{equation}

We therefore perform the same calculation as in the previous section to arrive at

\begin{equation} f^{(t)} - f^{(t - 1)} = -(P - I) f^{(t)} \,, \end{equation}

i.e.,

\begin{equation} \nabla_{t} f^{(t)} = -\mathcal{L} f^{(t)} \,, \end{equation}

where $\nabla_t$ is the backward finite difference operator, $\nabla_t f^{(t)} = f^{(t)} - f^{(t-1)}$. The negative sign arises from the fact that we are evolving this equation backwards in time. If we substitute variables $t = T - s$, where $T$ is the final time, we instead get

\begin{equation} \Delta_{s} f^{(s)} = \mathcal{L} f^{(s)} \,. \end{equation}

However, we can ask for a fair bit more than this. Instead of just having a payoff $f$ at time $t$, we can have a series of payoffs at different times. Let us modify the discounted reward problem a bit to turn it into a time evolution problem. Suppose we have a similar setup as before: i.e., we have a process $X_t$ on the graph $G=(V,E)$. However, this time, the reward function is allowed to be time dependent, so it is a function $f(x,t)$ both of a vertex $x$ and a time $t$. Suppose that we let the process $X_t \in V$ run on the graph until time $\tau \in \mathbb{N}$, collecting $f(X_t, t)$ reward at every time step until time $\tau$, when it collects $g(X_\tau)$ terminal reward and the game ends. Note that unlike $T$ from the previous sections, which is a stopping time and may be random, $\tau$ is simply a constant that we choose beforehand. Now, we ask the question: given that we start the game with $X_0=x$, what is the expected amount of (discounted) reward that $X_t$ accumulates up to time $\tau$? That is, we want to compute the value function of the state $x$,

\begin{equation} V(x)=\mathbb{E}\left[\sum_{r = 0}^{\tau - 1} \gamma^r f(X_r,r) + \gamma^\tau g(X_\tau) \mid X_0=x\right]\,. \end{equation}

One way to solve this problem is by recursion. Let us extend the value function to starting times $t>0$. We do this as follows:

\begin{equation} V(x,t)=\mathbb{E}\left[\sum_{r = t}^{\tau - 1} \gamma^{r-t} f(X_r,r) + \gamma^{\tau -t} g(X_\tau) \mid X_t = x\right]\,. \end{equation}

That is, $V(x,t)$ is the expected reward collected if we start at time $t$ and at state $x$. We can use first transition analysis to write down an equation for this function that we can then solve,

\begin{equation} \begin{split} V(x,t) &= f(x,t)+\gamma \sum_v P(x,v) \, \mathbb{E}\left[\sum_{r = t + 1}^{\tau-1} \gamma^{r-t-1} f(X_r,r) + \gamma^{\tau - t - 1} g(X_\tau) \mid X_{t+1}=v\right] \\ &=f(x,t)+ \gamma \sum_v P(x,v) \, V(v,t+1) \,. \end{split} \end{equation}

Note that the above is a linear system expressing the $\{V(x,t)\}_{x\in V}$ in terms of the $\{V(x,t+1)\}_{x \in V}$. Therefore, since we have the boundary condition

\begin{equation} V(x,\tau)=g(x) \,, \end{equation}

we can recursively solve for the $V(x,t)$ backwards in time until we get the $V(x,0)$ we originally desired. Let's rewrite the system of equations to put it into a more (subjectively) digestible form,

\begin{equation} V(x,t) - V(x,t+1) = \left(\gamma\sum_v P(x,v) \, V(v,t+1) - V(x,t+1)\right) + f(x, t) \,. \end{equation}

Note we have simply subtracted $V(x,t+1)$ from both sides. Note also that the left-hand side of the equation is essentially a time derivative of $V$. Thus, we can write the whole equation above in matrix form as

\begin{equation} \nabla_t V_{t + 1} = -(\gamma P - I) V_{t+1} - f_t \,, \end{equation}

where $\nabla_t$ is the backward finite difference time derivative. When the discount factor $\gamma$ equals one, we get

\begin{equation} \begin{split} \nabla_t V_{t} = -\mathcal{L} V_{t} - f_{t - 1}\,, &\qquad \text{for } 0 < t \leq \tau \,, \\ V_{\tau} = g \,. & \end{split} \end{equation}

Note that the negative signs here are an artifact of the fact that we are solving backwards in $t$. If we instead rewrite $s = \tau - t$, then we have

\begin{equation} \begin{split} \Delta_s V_{s} = \mathcal{L} V_{s} + f_{s + 1} \,, &\qquad \text{for } 0 \leq s < \tau \,, \\ V_{0} = g \,. & \end{split} \end{equation}

Compare this to the continuous backward heat equation with an inhomogeneous term,

\begin{equation} -\frac{\partial V}{\partial t} = \frac{1}{2} \Delta V + f \,. \end{equation}

We will examine how all of these results from the discrete case carry over into the continuous case in subsequent sections.
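As a final discrete example, the backward induction implied by the recursion $V(x,t) = f(x,t) + \gamma \sum_v P(x,v) V(v,t+1)$ takes only a few lines (the chain and rewards below are arbitrary placeholders, not anything from the text):

```python
import numpy as np

# Backward induction for the finite-horizon value function: start from the
# terminal condition V_tau = g and recurse V_t = f_t + gamma * P V_{t+1}.
rng = np.random.default_rng(2)
n, tau, gamma = 5, 10, 0.95
A = rng.random((n, n))
P = A / A.sum(axis=1, keepdims=True)
f = rng.random((tau, n))                # time-dependent running reward f(x, t)
g = rng.random(n)                       # terminal reward at time tau

V = g.copy()
for t in reversed(range(tau)):
    V = f[t] + gamma * (P @ V)          # one backward-in-time step

print(V)                                # V(x, 0) for every starting vertex x
```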

You should also compare it to the forward Kolmogorov equation and note that, since the forward equation involves vector multiplication on the left, it effectively involves the transpose/adjoint $\mathcal{L}^*$ of the operator $\mathcal{L}$. A good way to make sense of this is to think of $\mathcal{L}$ in the following way: $\mathcal{L}^*$ generates the evolution of distributions forwards in time, while $\mathcal{L}$ generates the evolution of value functions / test functions backwards in time. When the underlying process $X_t$ is reversible (i.e., $Y_t=X_{-t}$ is the same process as $X_t$, so the process looks the same when time is reversed), we have that $\mathcal{L}^*=\mathcal{L}$ and the backwards and forwards equations are the same. This is a manifestation of the fact that the evolution of the process forward in time and backward in time are indistinguishable. This happens to be the case on grid graphs.