Random Walks, Mixing Time
Today we will discuss random walks on graphs, the concept of stationary distributions, and the mixing time of random walks.
Random Walks on Graphs
Given a graph $G = (V, E)$, a random walk on $G$ is a sequence of vertices $v_0, v_1, v_2, \ldots$ such that $v_0$ is the starting vertex and for each $i \geq 0$, $v_{i+1}$ is chosen uniformly at random from the neighbors of $v_i$.
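This definition translates directly into a short simulation. The sketch below encodes the graph as an adjacency list (just one convenient choice of representation):

```python
import random

def random_walk(adj, v0, steps):
    """Simulate `steps` steps of a random walk on a graph given as an
    adjacency list {vertex: [neighbors]}, starting from v0."""
    path = [v0]
    for _ in range(steps):
        # each step moves to a uniformly random neighbor of the current vertex
        path.append(random.choice(adj[path[-1]]))
    return path

# Example: a 4-cycle 0 - 1 - 2 - 3 - 0
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
walk = random_walk(cycle, 0, 10)
```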
Now that we have defined a random walk, we can ask questions about its behavior. Here are four basic questions we might ask:
- Stationary Distribution: Does the random walk converge to a “stable” distribution over the vertices? If it does, what is this distribution?
- Mixing Time: How long does it take for the random walk to converge to the stationary distribution?
- Cover Time: How long does it take for the random walk to visit every vertex at least once?
- Hitting Time: Starting from a vertex $v_0$, how long does it take for the random walk to reach a specific vertex $v_f$? (When $v_f$ is the same as $v_0$, this is called the return time.)
Example of a Random Walk
Let $G = K_n$ be the complete graph on $n$ vertices.
- Stationary Distribution: One stationary distribution of the random walk on $K_n$ is the uniform distribution over the vertices.
Question: Does there exist another stationary distribution?
We will answer this question later in the lecture.
- Mixing Time: the $\varepsilon$-mixing time of the random walk on $K_n$ is $\Theta(\log_{n-1}(n/\varepsilon))$.
We will also prove this later in the lecture.
- Cover Time: the cover time of the random walk on $K_n$ is $\Theta(n \log n)$.
You can see this by considering the coupon collector’s problem.
- Hitting Time: given two vertices $a, b \in [n]$, the expected hitting time of the random walk on $K_n$ from $a$ to $b$ is $\Theta(n)$.
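Both estimates can be checked empirically. The sketch below samples the cover time and hitting time of the walk on $K_n$ (the choices $n = 50$ and 200 trials are arbitrary):

```python
import random

def cover_time_Kn(n):
    """One sample of the cover time of the random walk on K_n."""
    v, seen, t = 0, {0}, 0
    while len(seen) < n:
        v = random.choice([u for u in range(n) if u != v])  # uniform neighbor
        seen.add(v)
        t += 1
    return t

def hitting_time_Kn(n, a, b):
    """One sample of the hitting time from a to b on K_n."""
    v, t = a, 0
    while v != b:
        v = random.choice([u for u in range(n) if u != v])
        t += 1
    return t

random.seed(0)
n, trials = 50, 200
avg_cover = sum(cover_time_Kn(n) for _ in range(trials)) / trials
avg_hit = sum(hitting_time_Kn(n, 0, 1) for _ in range(trials)) / trials
```

The cover-time average should be near $(n-1) H_{n-1} \approx n \ln n$ (coupon collector), and the hitting-time average near $n - 1$, since each step hits $b$ with probability $1/(n-1)$.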
Markov Chains
A random walk on a graph is an example of a special kind of stochastic process: a Markov chain.
A Markov chain is a stochastic process which is memoryless/forgetful in the sense that the future behavior of the process depends only on the current state and not on the past. In formal terms, a Markov chain is a sequence of random variables $X_0, X_1, X_2, \ldots$ such that for all $i \geq 0$ and all states $v_0, v_1, \ldots, v_{i+1} \in V$, we have $$\Pr[X_{i+1} = v_{i+1} \mid X_0 = v_0, X_1 = v_1, \ldots, X_i = v_i] = \Pr[X_{i+1} = v_{i+1} \mid X_i = v_i].$$
Remark: in this course, we will only be interested in finite Markov chains. These are Markov chains where the state space $V$ is finite.
Why are Markov Chains Important?
Markov chains and random walks are ubiquitous in computer science and mathematics, appearing quite prominently in the study of randomized algorithms. For instance, they appear in:
- the PageRank algorithm (next lecture)
- approximation algorithms for counting problems
- algorithms for efficient sampling from distributions
- probability amplification without too much randomness
- many more…
Representing Markov Chains
A Markov chain can be seen as a weighted directed graph $G = (V, E, P)$ where $P_{uv}$ is the probability of transitioning to state $u$ from state $v$. The matrix $P$ is called the transition matrix of the Markov chain.
TODO: put a picture of a Markov chain here
The transition matrix $P$ allows us to represent the transition probabilities of the Markov chain in a really nice way: if we denote by $p_t \in \mathbb{R}^{V}$ the probability distribution over the vertices at time $t$, that is, $p_t(v)$ is the probability of being at vertex $v$ at time $t$, then we have $$p_{t+1} = P \cdot p_t.$$
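The evolution rule $p_{t+1} = P \cdot p_t$ is a single matrix-vector product per step. A small numerical sketch, using a hypothetical 3-state chain (the matrix entries are made up for illustration):

```python
import numpy as np

# Column-stochastic transition matrix of a hypothetical 3-state chain:
# P[u, v] = probability of moving to state u from state v,
# so each *column* sums to 1 and p_{t+1} = P @ p_t.
P = np.array([[0.0, 0.5, 0.3],
              [0.7, 0.0, 0.7],
              [0.3, 0.5, 0.0]])
assert np.allclose(P.sum(axis=0), 1.0)

p = np.array([1.0, 0.0, 0.0])   # start deterministically in state 0
for _ in range(50):
    p = P @ p                   # one step of the chain
```

After many steps, `p` barely changes from one step to the next, foreshadowing the stationary distributions defined below.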
For instance, for the random walk on $K_n$, the transition matrix $P$ is given by $$P_{uv} = \begin{cases} \frac{1}{n-1} & \text{if } u \neq v, \\ 0 & \text{if } u = v. \end{cases}$$ More succinctly, we can write $P = \frac{1}{n-1} \cdot (J - I)$, where $J$ is the all-ones matrix and $I$ is the identity matrix.
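A quick numerical check of this formula (the choice $n = 6$ is arbitrary):

```python
import numpy as np

n = 6
J, I = np.ones((n, n)), np.eye(n)
P = (J - I) / (n - 1)                    # transition matrix of the walk on K_n

assert np.allclose(P.sum(axis=0), 1.0)   # column-stochastic
pi = np.ones(n) / n                      # uniform distribution on the vertices
assert np.allclose(P @ pi, pi)           # uniform is stationary
```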
Properties of Markov Chains
We will now define some important properties of Markov chains.

A Markov chain is irreducible if the directed graph is strongly connected.

A Markov chain is aperiodic if the greatest common divisor of the lengths of all cycles in the directed graph is 1.
In other words, if we define the period of a state $v$ as $$\text{period}(v) = \gcd\{t \geq 1 \mid P_{vv}^{t} > 0\},$$ then the Markov chain is aperiodic if $\text{period}(v) = 1$ for all $v \in V$.
Are there examples of periodic Markov chains? Yes: for instance, the random walk on any bipartite graph is periodic, since every closed walk in a bipartite graph has even length.
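On small examples, the period can be computed directly on the support of the chain. A sketch (the cutoff `max_t` is a heuristic of ours, sufficient for these toy chains):

```python
from math import gcd

def period(adj, v, max_t=50):
    """Compute gcd{ t >= 1 : (P^t)_{vv} > 0 } up to t = max_t, where adj[u]
    lists the states reachable from u in one step."""
    reach = {v}   # states reachable from v in exactly t steps
    g = 0
    for t in range(1, max_t + 1):
        reach = {w for u in reach for w in adj[u]}
        if v in reach:
            g = gcd(g, t)
    return g

# A 4-cycle is bipartite: every closed walk has even length, so period 2.
cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
# A triangle has closed walks of lengths 2 and 3, so period gcd(2, 3) = 1.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
```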
The following lemma is a useful tool for analyzing Markov chains.
Lemma 1 (Positivity of Powers of Transition Matrix): Let $P$ be the transition matrix of a finite, irreducible and aperiodic Markov chain. Then there exists an integer $0 < T < \infty$ such that for all $t \geq T$, and all $u, v \in V$, we have $P_{uv}^{t} > 0$.
For a proof of the above lemma, see [Häggström, Chapter 4].
Stationary Distributions and Mixing Time
Now that we have formally defined Markov chains, we can return to the questions we asked about random walks on graphs, but in the more general setting of Markov chains. From now on, we will consider finite Markov chains over the state space $[n]$, with transition matrix $P$.
Definition (Stationary Distribution): a probability distribution $\pi \in \mathbb{R}^n$ is a stationary distribution of the Markov chain with transition matrix $P$ if $\pi = P \cdot \pi$.
Informally, a stationary distribution is an “equilibrium/fixed” state of the Markov chain, as the distribution does not change over time (that is, $\pi = P^t \cdot \pi$ for all $t \geq 0$).
The definition of a stationary distribution can be seen through the lens of linear algebra: $\pi$ is a stationary distribution if and only if it is an eigenvector of $P$ with eigenvalue 1, normalized so that its entries sum to 1.
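Following this linear-algebra view, a stationary distribution can be computed with an eigendecomposition. A sketch with a made-up 3-state chain (the matrix entries are arbitrary):

```python
import numpy as np

# A hypothetical 3-state chain (column-stochastic: columns sum to 1).
P = np.array([[0.5, 0.2, 0.1],
              [0.3, 0.6, 0.4],
              [0.2, 0.2, 0.5]])

vals, vecs = np.linalg.eig(P)
k = int(np.argmin(np.abs(vals - 1)))  # index of the eigenvalue closest to 1
pi = np.real(vecs[:, k])
pi = pi / pi.sum()                    # rescale so the entries sum to 1
```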
Now that we have defined stationary distributions, we can ask the following questions:
- When does a stationary distribution exist?
- If it exists, is it unique?
- If there is a unique stationary distribution, and we start from an arbitrary distribution, how long does it take for the distribution to converge to the stationary distribution?
We will address the existence and uniqueness of stationary distributions in the next subsection, and in more depth in the next lecture. But to formalize the last question, we need to define a distance between probability distributions, so we can talk about how close two distributions are.
The definition that we will use will be the total variation distance.
Definition (Total Variation Distance): Given two probability distributions $p, q \in \mathbb{R}^n$, the total variation distance between $p$ and $q$ is defined as $$\Delta_{TV}(p, q) = \frac{1}{2} \sum_{i=1}^{n} |p_i - q_i| = \frac{1}{2} \|p - q\|_1.$$
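A direct implementation of this definition:

```python
def tv_distance(p, q):
    """Total variation distance: half the l1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Distributions with disjoint support are at the maximal distance 1:
assert tv_distance([1.0, 0.0], [0.0, 1.0]) == 1.0
assert tv_distance([0.5, 0.5], [0.5, 0.5]) == 0.0
```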
With the notion of total variation distance, we say that a sequence of probability distributions $p_0, p_1, p_2, \ldots$ converges to a distribution $p$ if $\Delta_{TV}(p_t, p) \to 0$ as $t \to \infty$.
We are also able to quantify how fast the sequence converges to $p$ by defining the mixing time of the Markov chain.
Definition (Mixing Time): The $\varepsilon$-mixing time of a Markov chain with transition matrix $P$ is defined as the smallest integer $t$ such that for any initial distribution $p_0$, we have $$\Delta_{TV}(P^t \cdot p_0, \pi) \leq \varepsilon,$$ where $\pi$ is the stationary distribution of the Markov chain.
Let us now return to the example of the random walk on the complete graph $K_n$, and compute its $\varepsilon$-mixing time.
Example: Random Walk on $K_n$
It is easy to see that the stationary distribution of the random walk on $K_n$ is the uniform distribution over the vertices. Thus, we have $\pi = \frac{1}{n} \cdot \vec{1}$, where $\vec{1}$ is the all-ones vector.
Since the transition matrix $P$ of the random walk on $K_n$ is $P = \frac{1}{n-1} \cdot (J - I)$, we have that the eigenvalues of $P$ are $\lambda_1 = 1$ and $\lambda_2 = \lambda_3 = \ldots = \lambda_n = -\frac{1}{n-1}$. Since $P$ is a symmetric matrix, we have that $P$ has an orthonormal basis of eigenvectors. It is easy to see that the eigenvector corresponding to $\lambda_1 = 1$ is $v_1 := \dfrac{1}{\sqrt{n}} \cdot \vec{1}$. Let $v_2, v_3, \ldots, v_n$ be the other eigenvectors corresponding to $\lambda_2, \lambda_3, \ldots, \lambda_n$, such that $\{v_1, v_2, \ldots, v_n\}$ is an orthonormal basis of $\mathbb{R}^n$.
Then, we can write $$ P = \sum_{i=1}^{n} \lambda_i \cdot v_i \cdot v_i^T = \dfrac{1}{n} \cdot \vec{1} \cdot \vec{1}^T - \frac{1}{n-1} \sum_{i=2}^{n} v_i \cdot v_i^T.$$
Hence, for any initial distribution $p_0$, we have $$P^t \cdot p_0 = \sum_{i=1}^{n} \lambda_i^t \cdot v_i \cdot v_i^T \cdot p_0 = \dfrac{1}{n} \cdot \vec{1} \cdot \langle \vec{1} , p_0 \rangle + \left( -\frac{1}{n-1} \right)^t \cdot \sum_{i=2}^{n} \langle v_i, p_0 \rangle \cdot v_i.$$ Since $p_0$ is a probability distribution, we have $\langle \vec{1}, p_0 \rangle = 1$, so the RHS above becomes $$\dfrac{1}{n} \cdot \vec{1} + \left(-\frac{1}{n-1}\right)^t \cdot \sum_{i=2}^{n} \langle v_i, p_0 \rangle \cdot v_i = \pi + \left(-\frac{1}{n-1}\right)^t \cdot \vec{u}, $$ where $\vec{u} := \sum_{i=2}^{n} \langle v_i, p_0 \rangle \cdot v_i$ is a vector of $\ell_1$-norm at most $$\|\vec{u}\|_1 \leq \sum_{i=2}^{n} |\langle v_i, p_0 \rangle| \cdot \|v_i\|_1 \leq \sum_{i=2}^{n} \|v_i\|_1 \leq \sum_{i=2}^{n} \sqrt{n} \leq n \sqrt{n}, $$ where we used $|\langle v_i, p_0 \rangle| \leq \|v_i\|_2 \cdot \|p_0\|_2 \leq 1$ and $\|v_i\|_1 \leq \sqrt{n} \cdot \|v_i\|_2 = \sqrt{n}$. Hence, if we choose $t = \Theta(\log_{n-1}(n/\varepsilon))$, we have $$\Delta_{TV}(P^t \cdot p_0, \pi) = \frac{1}{2} \left(\frac{1}{n-1}\right)^t \|\vec{u}\|_1 \leq \varepsilon.$$ This gives us the $\varepsilon$-mixing time of the random walk on $K_n$.
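This contraction can be verified numerically. Note that $\vec{u} = p_0 - \pi$ (it is the component of $p_0$ orthogonal to $v_1$), so $P^t p_0 - \pi = \left(-\tfrac{1}{n-1}\right)^t (p_0 - \pi)$ exactly. A sketch with $n = 8$ (an arbitrary choice) and a worst-case point-mass start:

```python
import numpy as np

n = 8
P = (np.ones((n, n)) - np.eye(n)) / (n - 1)   # random walk on K_8
pi = np.ones(n) / n
p = np.zeros(n); p[0] = 1.0                   # point-mass start

t = 5
for _ in range(t):
    p = P @ p
dist = 0.5 * np.abs(p - pi).sum()
# By the derivation, dist = (1/2) * (1/(n-1))**t * ||p_0 - pi||_1,
# i.e. the TV distance shrinks by a factor 1/(n-1) per step.
```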
Fundamental Theorem of Markov Chains
In order to formally state the Fundamental Theorem of Markov Chains, we first need to define the hitting time and the return time of a Markov chain.
Definition (Hitting Time): Given a Markov chain with transition matrix $P$ and initial state $v_0$, the hitting time of the Markov chain from $v_0$ to a state $v_f$ is defined as $$T_{v_0, v_f} = \min\{t \geq 1 \mid X_t = v_f\}, \quad \text{given that } X_0 = v_0,$$ where $X_t$ is the state of the Markov chain at time $t$. (The minimum is taken over $t \geq 1$ so that the hitting time from $v_0$ back to itself is not trivially $0$.)
When $v_f = v_0$, the hitting time is called the return time.
Note that the hitting time is a random variable, as it depends on the randomness of the Markov chain.
The mean hitting time of the Markov chain from $i$ to $j$ is defined as $$\tau_{ij} := \mathbb{E}[T_{i, j}].$$
We are now able to state the hitting time lemma.
Lemma 2 (Hitting Time Lemma): Let $P$ be the transition matrix of a finite, irreducible and aperiodic Markov chain. Then, for any two states $i, j$, we have $$\Pr[T_{i, j} < \infty] = 1 \quad \text{and} \quad \tau_{ij} < \infty.$$
Proof: Since our Markov chain is finite, irreducible, and aperiodic, by Lemma 1, there exists an integer $M$ such that for all $t \geq M$ and any two states $u, v$, we have $P_{uv}^{t} > 0$.
Let $\alpha := \min_{u, v} P_{uv}^{M} > 0$. Then, we have $$\Pr[T_{i, j} > M] \leq \Pr[X_M \neq j] = 1 - P_{ij}^{M} \leq 1 - \alpha.$$ Moreover, we have $$\Pr[T_{i, j} > 2M] = \Pr[T_{i, j} > M] \cdot \Pr[T_{i, j} > 2M \mid T_{i, j} > M] \leq (1 - \alpha) \cdot \Pr[X_{2M} \neq j \mid T_{i,j} > M] \leq (1 - \alpha)^2,$$ where the last inequality follows from the Markov property: conditioned on any value of $X_M$, the probability that $X_{2M} = j$ is at least $\alpha$. By induction, we have that for all $\ell \geq 1$, $$\Pr[T_{i, j} > \ell M] \leq (1 - \alpha)^{\ell}.$$ Thus, we have proved that $\Pr[T_{i, j} < \infty] = 1$.
Moreover, we have $$\tau_{ij} = \sum_{t=0}^{\infty} \Pr[T_{i, j} > t] = \sum_{\ell=0}^{\infty} \sum_{t=\ell M}^{(\ell+1)M - 1} \Pr[T_{i, j} > t] \leq \sum_{\ell=0}^{\infty} M \cdot \Pr[T_{i, j} > \ell M] \leq \sum_{\ell=0}^{\infty} M \cdot (1 - \alpha)^{\ell} = \frac{M}{1 - (1 - \alpha)} = \frac{M}{\alpha} < \infty.$$ This completes the proof of the lemma.
We are now able to state the Fundamental Theorem of Markov Chains.
Theorem (Fundamental Theorem of Markov Chains): Let $P$ be the transition matrix of a finite, irreducible and aperiodic Markov chain. Then, the following statements hold:
- There exists a unique stationary distribution $\pi$ of the Markov chain, with $\pi_i > 0$ for all $i \in [n]$, where $n$ is the number of states of the Markov chain.
- For any initial distribution $p_0$, we have $$ \lim_{t \to \infty} \Delta_{TV}(P^t \cdot p_0, \pi) = 0.$$
- The stationary distribution $\pi$ is given by $$\pi_i = \lim_{t \to \infty} P_{ii}^{t} = \frac{1}{\tau_{ii}}.$$
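The identity $\pi_i = 1/\tau_{ii}$ can be checked by simulation. Below is a sketch on a hypothetical biased walk on a triangle (the probabilities 0.8/0.2 are our choice); the chain is doubly stochastic, so its stationary distribution is uniform and the mean return time to each state should be close to $3$:

```python
import random

# Biased walk on a triangle: from each state, move to the next state
# (mod 3) with probability 0.8, and to the previous one with probability 0.2.
nxt = {0: 1, 1: 2, 2: 0}
prv = {0: 2, 1: 0, 2: 1}

def step(v):
    return nxt[v] if random.random() < 0.8 else prv[v]

random.seed(1)
# Estimate the mean return time tau_00 over many excursions from state 0.
returns, trials = 0, 2000
for _ in range(trials):
    v, t = step(0), 1
    while v != 0:
        v, t = step(v), t + 1
    returns += t
tau_00 = returns / trials   # should approach 1/pi_0 = 3
```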
We will see a proof of this theorem in the next lecture.