12. Set Disjointness II

At the beginning of the last lecture, we introduced the claim that the randomized communication complexity of the Set Disjointness function is $R(\mathrm{Disj}_n) = \Theta(n)$. By the end of the lecture, we had only established the weaker lower bound of $\Omega(\sqrt{n})$.

The optimal lower bound was first established by Kalyanasundaram and Schnitger (1992) and shortly afterwards an alternative proof was given by Razborov (1992). But we will cover a different proof of the same theorem that provides us with a great excuse to introduce a fundamental tool in communication complexity (and other areas of theoretical computer science): information theory.

Basics of Information Theory

The starting point for information theory is the notion of entropy. When $X$ is a random variable with distribution $\mu$ over some finite set $\mathcal{X}$, the entropy of $X$ is

$$H(X) = \sum_{x \in \mathcal{X}} \mu(x) \log \frac{1}{\mu(x)}.$$
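For concreteness, here is a minimal Python sketch of this formula applied to an explicit probability vector (the helper name `entropy` is just an illustrative choice, and logarithms are taken base 2):

```python
import math

def entropy(mu):
    """H(X) = sum_x mu(x) * log2(1 / mu(x)) for a distribution given as a
    list of probabilities; outcomes with mu(x) = 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in mu if p > 0)

print(entropy([0.5, 0.5]))   # a fair coin has 1 bit of entropy
print(entropy([0.9, 0.1]))   # a biased coin has less (about 0.469)
print(entropy([1/8] * 8))    # the uniform distribution over 8 outcomes: log2(8) = 3
```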

The notion of entropy satisfies a number of useful properties. Perhaps the most fundamental is a direct consequence of Jensen’s inequality which gives a universal upper bound on the entropy of any random variable.

Proposition 1. For any random variable $X$ with distribution $\mu$ over a finite set $\mathcal{X}$, the entropy of $X$ is bounded above by

$$H(X) \le \log |\mathcal{X}|.$$

When $\mu$ is some joint distribution on $\mathcal{X} \times \mathcal{Y}$ and $(X, Y) \sim \mu$, the conditional entropy of $X$ relative to $Y$ is

$$H(X \mid Y) = \mathop{\mathbb{E}}_{y \sim \mu_Y}\big[ H(X \mid Y = y) \big]$$

where we write $\mu_Y$ to represent the marginal distribution of $Y$ under $\mu$. The most basic property of conditional entropy is that conditioning never increases entropy.

Proposition 2. For any jointly distributed random variables $X$ and $Y$,

$$0 \le H(X \mid Y) \le H(X).$$

The mutual information between jointly distributed random variables $X$ and $Y$ is

$$I(X; Y) = H(X) - H(X \mid Y).$$

(There are many other equivalent formulations of mutual information as well.) Informally, mutual information measures “how much information” about $X$ we learn by observing $Y$.
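The following Python sketch computes these quantities directly from an explicitly given joint distribution; the dictionary representation and the helper names (`marginal`, `cond_entropy`, `mutual_info`) are just illustrative choices.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Entropy of a distribution given as a dict mapping outcomes to probabilities."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

def marginal(joint, keep):
    """Marginal of a joint distribution over tuples, keeping the coordinates in `keep`."""
    m = defaultdict(float)
    for outcome, p in joint.items():
        m[tuple(outcome[i] for i in keep)] += p
    return dict(m)

def cond_entropy(joint, A, B):
    """H(X_A | X_B) = sum over (a, b) of p(a, b) * log2(p(b) / p(a, b))."""
    pAB, pB = marginal(joint, A + B), marginal(joint, B)
    return sum(p * math.log2(pB[k[len(A):]] / p) for k, p in pAB.items() if p > 0)

def mutual_info(joint, A, B, C=()):
    """I(X_A ; X_B | X_C) = H(X_A | X_C) - H(X_A | X_B X_C)."""
    return cond_entropy(joint, A, C) - cond_entropy(joint, A, B + C)

# Example: X is a uniform bit and Y is a noisy copy of X (equal with probability 3/4).
joint = {(0, 0): 3/8, (0, 1): 1/8, (1, 0): 1/8, (1, 1): 3/8}
HX = entropy(marginal(joint, (0,)))        # H(X)   = 1
HXgY = cond_entropy(joint, (0,), (1,))     # H(X|Y) is about 0.811
IXY = mutual_info(joint, (0,), (1,))       # I(X;Y) = H(X) - H(X|Y), about 0.189
print(HX, HXgY, IXY)                       # consistent with 0 <= H(X|Y) <= H(X)
```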

We can also define conditional mutual information similarly, as $I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$.

Mutual information satisfies a number of useful properties. One of the most useful is the chain rule.

Proposition 3. When $X$ and $Y_1, \ldots, Y_n$ are jointly distributed random variables,

$$I(X; Y_{[n]}) = \sum_{i=1}^{n} I(X; Y_i \mid Y_{[i-1]}),$$

where $Y_{[i]}$ denotes the tuple $(Y_1, \ldots, Y_i)$ and $Y_{[0]}$ is empty.
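For instance, the case $n = 2$ is just a telescoping of the definitions above:

$$\begin{aligned} I(X; Y_1 Y_2) &= H(X) - H(X \mid Y_1 Y_2) \\ &= \big[ H(X) - H(X \mid Y_1) \big] + \big[ H(X \mid Y_1) - H(X \mid Y_1 Y_2) \big] \\ &= I(X; Y_1) + I(X; Y_2 \mid Y_1). \end{aligned}$$

The general statement follows by repeating this argument coordinate by coordinate.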

Another useful property is that conditioning on a random variable that is independent of $Y$ can never decrease the mutual information between $X$ and $Y$.

Proposition 4. When $X, Y, Z$ are jointly distributed and $Y$ and $Z$ are independent under this joint distribution, then

$$I(X; Y \mid Z) \ge I(X; Y).$$
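A short derivation, using the symmetry of mutual information (one of the equivalent formulations alluded to above) together with the fact that conditioning never increases entropy: since $Y$ and $Z$ are independent, $H(Y \mid Z) = H(Y)$, and therefore

$$I(X; Y \mid Z) = H(Y \mid Z) - H(Y \mid X Z) = H(Y) - H(Y \mid X Z) \ge H(Y) - H(Y \mid X) = I(X; Y).$$

Note that the inequality can be strict: if $Y$ and $Z$ are independent uniform bits and $X = Y \oplus Z$, then $I(X; Y) = 0$ while $I(X; Y \mid Z) = 1$.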

Intuition for a Set Disjointness Lower Bound

We can use the information theory results introduced above to obtain a linear lower bound on the randomized communication complexity of Set Disjointness.

The starting point of this lower bound requires us to go back to a very basic question: why do we believe that the randomized communication complexity of the set disjointness function should be linear in $n$ in the first place? Well, recalling that our function is defined as

$$\mathrm{Disj}_n(x, y) = \neg \bigvee_{i \in [n]} (x_i \wedge y_i),$$

there is one very natural answer to this question: the communication complexity of disjointness should be $\Omega(n)$ because it appears to implicitly make us compute the AND function on a linear number of instances. Letting $\mathrm{And} \colon \{0,1\} \times \{0,1\} \to \{0,1\}$ denote the simple AND function on a pair of bits, this intuition suggests that we should maybe try to prove an inequality of the form

$$R(\mathrm{Disj}_n) \overset{?}{\ge} \Omega\big(n \cdot R(\mathrm{And})\big).$$

If such an inequality can be established, then we are done because $R(\mathrm{And}) \ge 1$ follows from the easy observation that any protocol for the AND function must involve some communication between Alice and Bob.
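To make the "$n$ copies of AND" intuition concrete, here is a small Python sketch of the decomposition (the function names are just illustrative choices):

```python
def AND(a, b):
    """The two-bit AND function."""
    return a & b

def disj(x, y):
    """Disj_n(x, y) = NOT OR_i (x_i AND y_i): outputs 1 iff the sets are disjoint."""
    return 0 if any(AND(a, b) for a, b in zip(x, y)) else 1

print(disj([1, 0, 1, 0], [0, 1, 0, 1]))  # 1: the sets are disjoint
print(disj([1, 0, 1, 0], [0, 0, 1, 1]))  # 0: both sets contain the third element
```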

The problem with the intuition described above is that inequalities like the conjectured one above do not hold in general. Indeed, this intuition implicitly assumes that the best way to compute a function which can be represented as multiple instances of a smaller function must be to compute each of those instances separately to obtain the final result. This is certainly not always the best course of action! In fact, we can see how wrong this implicit assumption can be by considering a slight variant of the disjointness function where we replace the $\wedge$ with a $\oplus$. The exact same intuition we described could also be applied to the resulting function $\bigvee_{i \in [n]} (x_i \oplus y_i)$, and yet this function, in strong contrast to disjointness, can always be computed with constant communication cost.
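To see why, note that $\bigvee_{i \in [n]} (x_i \oplus y_i)$ simply asks whether $x \ne y$, and this can be decided by comparing random fingerprints of the two inputs using public randomness. The sketch below simulates one standard such protocol in Python (the function name, the use of random inner products modulo 2, and the choice of $k$ repetitions are illustrative assumptions, not the only way to do this); the communication is $k$ bits regardless of $n$, and the protocol errs with probability at most $2^{-k}$, and only when $x \ne y$.

```python
import random

def xor_or_protocol(x, y, k=10, rng=random):
    """Public-coin protocol for OR_i (x_i XOR y_i), i.e. for deciding whether x != y.

    For each of k shared random strings r, Alice sends the single bit <x, r> mod 2
    and Bob compares it with <y, r> mod 2, for a total of k bits of communication.
    If x == y the output is always 0; if x != y each round catches the difference
    with probability 1/2, so the output is 1 except with probability 2**(-k)."""
    n = len(x)
    for _ in range(k):
        r = [rng.randint(0, 1) for _ in range(n)]            # shared randomness
        alice_bit = sum(a * b for a, b in zip(x, r)) % 2      # the one bit Alice sends
        bob_bit = sum(a * b for a, b in zip(y, r)) % 2
        if alice_bit != bob_bit:
            return 1    # some coordinate differs, so OR_i (x_i XOR y_i) = 1
    return 0

print(xor_or_protocol([1, 0, 1, 0], [1, 0, 1, 0]))  # 0: the inputs are equal
print(xor_or_protocol([1, 0, 1, 0], [1, 1, 1, 0]))  # 1 with high probability
```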

At this point, it would be natural to assume we need to go back to the very beginning and find a different line of attack to try to prove a lower bound on disjointness. But remarkably, using information theory we don't need to do that: we can show that our original intuition on the disjointness function was in fact generally correct if we simply relax our target inequality slightly to replace $R(\mathrm{And})$ with a different information-theoretic measure of complexity of the AND function.

Let us see how we obtain such a result.

From Set Disjointness to AND

Fix a randomized protocol $\pi$ for the $\mathrm{Disj}_n$ function. We let $X$ and $Y$ denote the inputs to Alice and Bob, drawn from some distribution that we will define shortly. And we let $\Pi$ denote the transcript of $\pi$ on input $(X, Y)$. The transcript is simply a binary string that describes the bits that are communicated by Alice and Bob during the execution of $\pi$.

The distribution $\mu$ on $(X, Y, \Pi)$ is obtained by running the following procedure:

1. For each $i \in [n]$ independently, draw $D_i$ uniformly from $\{\mathsf{A}, \mathsf{B}\}$ and draw a uniformly random bit $Z_i$.
2. If $D_i = \mathsf{A}$, set $X_i = 0$ and $Y_i = Z_i$; if $D_i = \mathsf{B}$, set $X_i = Z_i$ and $Y_i = 0$.
3. Run $\pi$ on the input $(X, Y)$ and let $\Pi$ be the resulting transcript.

Note that under $\mu$ we always have $X_i \wedge Y_i = 0$ in every coordinate $i$, and that conditioned on $D_i$ the pair $(X_i, Y_i)$ is independent of $(D_j, X_j, Y_j)_{j \ne i}$.
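A minimal Python sketch of the sampling procedure just described (the function name `sample_mu` is ours, and the protocol $\pi$ is left abstract as a callable argument):

```python
import random

def sample_mu(n, pi=None, rng=random):
    """Sample (X, Y, D, transcript) from the distribution mu described above.

    For each coordinate i: D_i is 'A' or 'B' uniformly at random and Z_i is a
    uniform bit; if D_i == 'A' then X_i = 0 and Y_i = Z_i, and otherwise
    Y_i = 0 and X_i = Z_i.  In particular X_i AND Y_i = 0 in every coordinate.
    The transcript is obtained by running the protocol pi on the input (X, Y)."""
    X, Y, D = [], [], []
    for _ in range(n):
        d = rng.choice(['A', 'B'])
        z = rng.randint(0, 1)
        D.append(d)
        X.append(0 if d == 'A' else z)
        Y.append(z if d == 'A' else 0)
    transcript = pi(X, Y) if pi is not None else None
    return X, Y, D, transcript

X, Y, D, _ = sample_mu(8)
assert all(x & y == 0 for x, y in zip(X, Y))   # the sampled inputs are always disjoint
```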

Let $\nu$ be a distribution on $(X, Y, \Sigma)$ defined analogously, except that in this case $n = 1$ and $\Sigma$ is the transcript of a protocol $\sigma$ for the AND function.

We can use the information theory tools developed earlier to establish the following result.

Lemma 4. With $\mu$ and $\nu$ defined as above and letting $\mathrm{CIC}_\nu(\mathrm{And}) = \inf_{\sigma} I_\nu(\Sigma; XY \mid D)$, we have

$$R(\mathrm{Disj}_n) \ge n \cdot \mathrm{CIC}_\nu(\mathrm{And}).$$

Proof. Let $\pi$ represent a randomized communication protocol that computes $\mathrm{Disj}_n$ with minimal worst-case cost. Then

$$
\begin{aligned}
R(\mathrm{Disj}_n) = \max |\Pi| &\ge H(\Pi) && \text{(Entropy upper bound)} \\
&\ge H(\Pi \mid D) && \text{(Conditioning does not increase entropy)} \\
&\ge H(\Pi \mid D) - H(\Pi \mid XY\, D) && \text{(Non-negativity of entropy)} \\
&= I(\Pi; XY \mid D) && \text{(Mutual information definition)} \\
&= \sum_{i=1}^{n} I(\Pi; X_i Y_i \mid D\, X_1 \cdots X_{i-1}\, Y_1 \cdots Y_{i-1}) && \text{(Chain rule)} \\
&\ge \sum_{i=1}^{n} I(\Pi; X_i Y_i \mid D_i). && \text{(Independence)}
\end{aligned}
$$

For any $i \in [n]$, consider now the protocol $\sigma_i$ for the AND function that proceeds as follows. On input $(x, y)$, it generates $D, Z, X, Y$ as in the procedure for $\mu$ and then overwrites $X_i = x$ and $Y_i = y$. Then Alice and Bob simulate the protocol $\pi$ on the resulting input $(X, Y)$. Note that for any $j \ne i$, we will have $X_j \wedge Y_j = 0$ by construction, and so the outcome of $\pi$ determines the value of $x \wedge y$. Furthermore, if $(x, y, \Sigma_i)$ is drawn from the $\nu$ distribution with the protocol $\sigma_i$, then we have

$$I(\Sigma_i; X_i Y_i \mid D_i) = I(\Pi; X_i Y_i \mid D_i)$$

and by the definition in the lemma statement, this value is bounded below by $\mathrm{CIC}_\nu(\mathrm{And})$. Summing over all $i \in [n]$ and combining with the chain of inequalities above gives $R(\mathrm{Disj}_n) \ge n \cdot \mathrm{CIC}_\nu(\mathrm{And})$, as claimed.
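As a sanity check on the reduction (though not on the information-theoretic bookkeeping), the following Python sketch performs the embedding step of $\sigma_i$ and verifies that the value of $\mathrm{Disj}_n$ on the embedded input determines $x \wedge y$; the function names are ours, and the protocol $\pi$ is replaced here by a direct evaluation of $\mathrm{Disj}_n$.

```python
import random

def disj(X, Y):
    """Disj_n(X, Y) = 1 iff X_j AND Y_j = 0 for every coordinate j."""
    return 0 if any(a & b for a, b in zip(X, Y)) else 1

def embed_and_evaluate(x, y, i, n, rng=random):
    """Embed the single AND instance (x, y) into coordinate i of an input (X, Y)
    sampled as in the procedure for mu, then evaluate Disj_n on the result.
    Since X_j AND Y_j = 0 for every j != i, we get Disj_n(X, Y) = NOT (x AND y)."""
    X, Y = [], []
    for _ in range(n):
        d = rng.choice(['A', 'B'])
        z = rng.randint(0, 1)
        X.append(0 if d == 'A' else z)
        Y.append(z if d == 'A' else 0)
    X[i], Y[i] = x, y            # overwrite coordinate i with the AND instance
    return disj(X, Y)

# The outcome determines x AND y on every run, regardless of the randomness used.
for x in (0, 1):
    for y in (0, 1):
        assert all(embed_and_evaluate(x, y, i=2, n=8) == 1 - (x & y) for _ in range(100))
```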

Note that we are not quite done at this point. Earlier, we obtained the trivial lower bound of $R(\mathrm{And}) \ge 1$ from the observation that any protocol for the AND function must involve some amount of communication. But with information complexity, this basic observation does not suffice to give a constant lower bound on the conditional information cost of the AND function. After all, it's possible that there are infinitely many different communication protocols that each have positive information cost but that this cost tends to 0 as we consider more and more sophisticated protocols. Such a situation would of course invalidate our entire proof approach. But, as our intuition suggests, this is not the case: it's possible to give a (positive constant) lower bound on $\mathrm{CIC}_\nu(\mathrm{And})$ and this will complete the proof of the claim that Set Disjointness has linear communication complexity.

We will see this lower bound in the next lecture.