This post revisits the confounding phenomenon from the angle of probabilistic reasoning with conditional probabilities and Bayes' theorem. This analysis reveals what sets probabilistic and causal inference apart. It also shows how probabilistic models and causal models formulate assumptions of a fundamentally different nature.
Tip: Feel free to jump to the conclusion if you find this discussion too technical.
Suppose for instance that we have access to a large trove of medical reports describing the outcomes of applying medical treatment to patients displaying certain symptoms. In the following, we use variable $Y$ to denote observable outcomes such as “patient is considered cured after five days”; we use variable $X$ to denote the treatment selected by a doctor; and we use variable $Z$ to represent the known contextual information. Although we refer to $Z$ as the symptoms variable for simplicity, it really represents all the observed and recorded context of the treatment decision: patient symptoms, patient history, hospital history, place and time, or even environmental factors such as the pollution level.
We merely want to know which treatment $X$ works best for symptoms $Z$.
This question is natural but also ambiguous. Did we mean
Confounding exists because these two questions often have different answers.
To answer the first question, we merely need to go through our medical records and count the observed outcomes $Y$ for each symptom-treatment pair $(Z,X)$. If we assume that our records are independent samples from a joint probability distribution $P(Z,X,Y)$, this amounts to estimating the conditional probability distribution $P(Y|Z,X)$.
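The counting procedure described above can be sketched in a few lines of Python. This is only an illustration: the records, the symptom and treatment labels, and the helper name `estimate_outcome_given_context` are all made up for the example and do not come from any real data set.

```python
from collections import Counter

def estimate_outcome_given_context(records):
    """Estimate P(Y | Z, X) by counting (z, x, y) triples.

    `records` is a list of (symptoms, treatment, outcome) tuples.
    Only (z, x) pairs that appear in the records get an estimate.
    """
    pair_counts = Counter((z, x) for z, x, _ in records)
    triple_counts = Counter(records)
    return {
        (z, x, y): n / pair_counts[(z, x)]
        for (z, x, y), n in triple_counts.items()
    }

# Hypothetical records, for illustration only.
records = [
    ("fever", "A", "cured"),
    ("fever", "A", "cured"),
    ("fever", "A", "not cured"),
    ("fever", "B", "cured"),
    ("cough", "A", "not cured"),
]

p_y_given_zx = estimate_outcome_given_context(records)
print(p_y_given_zx[("fever", "A", "cured")])   # two cures out of three records
print(("cough", "B", "cured") in p_y_given_zx)  # this pair was never observed
```

Note that the estimate is simply undefined for symptom-treatment pairs that never occur in the records.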
Since it is very unlikely that our trove of medical reports covers all possible combinations of symptoms $Z$ and treatments $X$, we may have no data to estimate $P(Y|Z,X)$ when $(Z,X)$ is one of these missing pairs. This is not a problem if our goal is merely to answer the first question because saying which treatments have historically worked best does not require us to know anything about treatments we never tried. In contrast, the second question is more demanding because the best treatment for symptoms $Z$ could possibly be a treatment that was never tried for such symptoms. This difference is only the tip of a much larger iceberg.
Essentially the second question envisions an intervention to change the treatment selection policy. Instead of doing whatever the doctors were doing to select a treatment during the data collection period, we are considering an alternate treatment selection policy and we would like to know about its performance.
We can approach such an intervention by decomposing the joint distribution as \begin{equation} \label{eq:xyz} P(Z,X,Y) = P(Z) \, P(X|Z) \, P(Y|Z,X)~. \end{equation} This particular ordering of the variables is attractive because it obeys the arrow of time: first come the symptoms with distribution $P(Z)$, then comes the treatment selection with distribution $P(X|Z)$, and finally comes the treatment outcome with distribution $P(Y|Z,X)$.
Graphical representation[1] of the decomposition \eqref{eq:xyz}.
Each node in this graph represents one variable and receives arrows from all
variables that condition the corresponding term in the decomposition.
We would like to know what our medical records would look like if we were to follow an alternate policy to select a treatment $X$ given symptoms $Z$. This calls for replacing the term $P(X|Z)$ in the decomposition \eqref{eq:xyz} by a new conditional distribution $\textcolor{red}{P^*(X|Z)}$ that represents the alternate treatment selection policy: \begin{equation} \label{eq:xyz:interv} \textcolor{red}{P^*(Z,X,Y)} = P(Z) \, \textcolor{red}{P^*(X|Z)} \, P(Y|Z,X)~. \end{equation}
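As a purely mechanical illustration, the substitution in \eqref{eq:xyz:interv} can be carried out on a small table. The sketch below assumes two symptom levels, two treatments, and a binary outcome; all the probabilities and the function name `interventional_joint` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical numbers, for illustration only.
p_z = {"mild": 0.6, "severe": 0.4}            # P(Z)
p_cured_given_zx = {                          # P(Y = cured | Z, X)
    ("mild", "A"): 0.9, ("mild", "B"): 0.8,
    ("severe", "A"): 0.3, ("severe", "B"): 0.5,
}
new_policy = {                                # P*(X | Z), the alternate policy
    "mild": {"A": 1.0, "B": 0.0},
    "severe": {"A": 0.0, "B": 1.0},
}

def interventional_joint(p_z, policy, p_cured):
    """Build P*(Z, X, Y) = P(Z) P*(X | Z) P(Y | Z, X)."""
    joint = {}
    for z, pz in p_z.items():
        for x, px in policy[z].items():
            pc = p_cured[(z, x)]
            joint[(z, x, "cured")] = pz * px * pc
            joint[(z, x, "not cured")] = pz * px * (1.0 - pc)
    return joint

joint_star = interventional_joint(p_z, new_policy, p_cured_given_zx)

# Overall cure rate predicted for the alternate policy.
cure_rate = sum(p for (z, x, y), p in joint_star.items() if y == "cured")
print(round(cure_rate, 3))
```

The computation itself is trivial. Whether its result says anything about what would really happen under the new policy is a separate matter.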
Does the new joint distribution $\textcolor{red}{P^*(Z,X,Y)}$ represent what one could observe if one were to implement the new treatment policy? Equivalently, do the conditional distributions $P(Z)$ and $P(Y|Z,X)$ remain invariant when one changes the treatment policy $P(X|Z)$?
Nontrivial assumption: Changing the treatment selection policy in decomposition \eqref{eq:xyz} leaves the terms $P(Z)$ and $P(Y|Z,X)$ invariant.
Such an assumption is not a part of our probabilistic model of the data, namely the assumption that the medical records were independently sampled from a joint distribution $P(Z,X,Y)$. To determine whether this assumption is correct or not, we must invoke pieces of knowledge that do not change the probabilistic model itself. These pieces of knowledge are often formulated in causal terms:
Probabilities are often used to represent processes that we cannot model completely. For instance, what happens when we roll dice might depend on their initial configuration, the speed of our motion, the exact moment we release the dice, the temperature of the air, etc. Since we do not know these things, we simply say that each face of a die is shown with equal chance.
We shall now change our model to include a new variable $U$ that represents additional information that was not recorded in the medical reports and therefore is no longer accessible. The exact nature of the information $U$ is not important at this point, except for the fact that it may impact which treatment was applied as well as its outcome. Such a variable $U$ is often called an unobserved variable because its value is unknown at the time of the analysis of the data.
The canonical decomposition then becomes \begin{equation} \label{eq:xyzu} P(Z,U,X,Y) = P(Z) \, P(U|Z) \, P(X|Z,U) \, P(Y|Z,X,U) ~. \end{equation}
We can then write the observed joint distribution $P(Z,X,Y)$ as \begin{align*} P(Z,X,Y) &= \sum_U P(Z,U,X,Y) \\ &= P(Z) \, \sum_U P(U|Z) \, P(X|Z,U) \, P(Y|Z,X,U)~. \end{align*}
The term $P(X|Z,U)$ that represents which treatments $X$ were applied now depends on both the symptoms $Z$ and the unobserved variable $U$. Proceeding as in the previous section, we replace this term by the conditional distribution $\textcolor{red}{P^*(X|Z)}$ that represents an alternate treatment policy. Note that the alternate policy cannot depend on $U$ because we do not know $U$. This gives the following expression for the alternate joint distribution \begin{equation} \label{eq:xyzu:interv} \textcolor{red}{P^*(Z,X,Y)} = P(Z) \, \textcolor{red}{P^*(X|Z)} ~ \textcolor{blue}{\underbrace{\sum_U P(U|Z) \, P(Y|Z,X,U)}_{P^*(Y|Z,X)} } \end{equation}
New assumption: We now assume that changing the treatment policy leaves the terms $P(Z)$, $P(U|Z)$, and $P(Y|Z,X,U)$ invariant.
The nature of this new assumption depends of course on the assumed nature of the unobserved variable $U$. For instance, if the variable $U$ is constant, our new assumption is equivalent to our previous assumption, namely the invariance of $P(Z)$ and $P(Y|Z,X)$. We now would like to know whether there are choices of $U$ that make this assumption different and lead to different conclusions.
This boils down to determining whether the blue term in \eqref{eq:xyzu:interv}, \[ \textcolor{blue}{P^*(Y|Z,X) = \sum_U P(U|Z) P(Y|Z,X,U) } ~, \] is equal to \[ P(Y|Z,X) = \sum_U P(Y,U|Z,X) = \sum_U P(U|Z,X) P(Y|Z,X,U)~. \] If these terms were equal, expressions \eqref{eq:xyzu:interv} and \eqref{eq:xyz:interv} would be identical and therefore would lead to the same conclusions regardless of the nature of the unknown information $U$. Unfortunately, the first expression has $P(U|Z)$ where the second expression has $P(U|Z,X)$.
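The two conditional distributions $\textcolor{blue}{P^*(Y|Z,X)}$ and $P(Y|Z,X)$ are nevertheless equal in two important cases: first, when the unobserved variable $U$ has no impact on the outcome once $Z$ and $X$ are known, that is, $P(Y|Z,X,U){=}P(Y|Z,X)$; second, when $U$ has no impact on the treatment selection once $Z$ is known, that is, $P(X|Z,U){=}P(X|Z)$.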
Proof: In the first case, we can indeed write \[ P^*(Y|Z,X) = \sum_U P(U|Z) P(Y|Z,X,U) = \big( \sum_U P(U|Z) \big) P(Y|Z,X) = P(Y|Z,X)~. \] For the second case, decomposing $P(X,U|Z)$ in two ways, \[ P(X,U|Z) = P(X|Z) P(U|Z,X) = P(U|Z) P(X|Z,U)~, \] shows that $P(X|Z,U){=}P(X|Z) \Longleftrightarrow P(U|Z){=}P(U|Z,X)$. Therefore $P^*(Y|Z,X){=}P(Y|Z,X)$. $~\blacksquare$
However, when the unobserved variable $U$ impacts both the treatments and the outcomes, the two conditional distributions $\textcolor{blue}{P^*(Y|Z,X)}$ and $P(Y|Z,X)$ are not equal. Since equations \eqref{eq:xyz:interv} and \eqref{eq:xyzu:interv} are different, it is very likely that these two approaches lead to different joint distributions $\textcolor{red}{P^*(Z,X,Y)}$ and therefore different conclusions about which treatment works best for which symptom. To make things worse, we cannot in general estimate $\textcolor{blue}{P^*(Y|Z,X)}$ as defined in equation \eqref{eq:xyzu:interv} because it depends on the unobserved variable $U$ which cannot be found in our medical records.
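A small numerical sketch makes the gap concrete. It considers one fixed value of the symptoms $Z$ (so $Z$ is dropped from the notation) and binary $U$, $X$, $Y$. Every number below is hypothetical, and the function names `observational` and `interventional` are introduced only for this illustration.

```python
# Hypothetical numbers for one fixed value of Z (Z is dropped from the notation).
P_U = {0: 0.5, 1: 0.5}            # P(U)
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}   # P(X=1 | U): the historical policy depends on U
P_Y1_GIVEN_XU = {                 # P(Y=1 | X, U): the outcome also depends on U
    (0, 0): 0.2, (1, 0): 0.3,
    (0, 1): 0.7, (1, 1): 0.8,
}

def p_x_given_u(x, u, p_x1_given_u):
    """P(X=x | U=u) for a binary treatment."""
    return p_x1_given_u[u] if x == 1 else 1.0 - p_x1_given_u[u]

def observational(x, p_x1_given_u):
    """P(Y=1 | X=x) = sum_U P(U | X=x) P(Y=1 | X=x, U)."""
    p_x = sum(P_U[u] * p_x_given_u(x, u, p_x1_given_u) for u in P_U)
    return sum(
        P_U[u] * p_x_given_u(x, u, p_x1_given_u) / p_x * P_Y1_GIVEN_XU[(x, u)]
        for u in P_U
    )

def interventional(x):
    """P*(Y=1 | X=x) = sum_U P(U) P(Y=1 | X=x, U)."""
    return sum(P_U[u] * P_Y1_GIVEN_XU[(x, u)] for u in P_U)

for x in (0, 1):
    print(x, round(observational(x, P_X1_GIVEN_U), 3), round(interventional(x), 3))

# If the historical policy ignores U, the two quantities coincide.
UNCONFOUNDED_POLICY = {0: 0.5, 1: 0.5}
print(round(observational(1, UNCONFOUNDED_POLICY), 3), round(interventional(1), 3))
```

With these made-up numbers, the records suggest that treatment $X{=}1$ raises the cure rate from about 0.30 to about 0.70, whereas the quantity defined in \eqref{eq:xyzu:interv} moves only from 0.45 to 0.55. The last line checks the second case of the proof: when the policy does not depend on $U$, both computations agree.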
This example illustrates a fundamental difference between probabilistic and causal inference.
This also implies that probabilistic and causal modeling are fundamentally different.