We have an intuitive sense of causality, but how can we define it mathematically? In Defining Causality we saw a definition which revolves around intervention. But in that post we assumed that (1) we could observe all the variables in our model and (2) we had complete access to the exact parameters of our underlying distribution. In reality, (1) there are things we can’t observe and (2) the things we can observe are samples, not populations.

## The Addictive Gene (recap)

Consider the causal diagram in Figure 1. In this example we have the binary variables indicating whether someone:

- Smokes
- Drinks
- Dies
- Has the (fictional) addictive gene, gene X, which makes an individual more likely to both smoke and drink

Figure 1: Causal diagram for smoking (observational distribution)

For some choice of parameters for this distribution, we arrive at the underlying distribution as visualised by the probability bars in Figure 2.

Figure 2: Probability bars for smoking (observational distribution)

Our big causal question here is: does smoking cause death? By eyeballing Figure 2 we can see that among the smoking (orange) bars there is a higher rate of death (red) bars. Is the answer as simple as that? Sadly no. The difficulty is that among smokers there is also a higher rate of drinking (green) bars, due to the influence of gene X. How can we correctly adjust for the fact that the drinking variable also contributes to higher mortality?

The key to a correct answer is to examine a set of **different distributions**, namely the experimental distributions. The importance of these distributions was discussed in the last post, and they are visualised in the probability bars in Figure 3 and Figure 4.

Our task now is to estimate them.

Figure 3: Probability bars for the everyone experimental distribution.

Figure 4: Probability bars for the no-one experimental distribution.

## Recovering the experimental distribution

Suppose that we only have access to samples from the observational distribution. It is then easy to estimate this observational distribution through sample averages. So given that we can (to some extent) recover the observational distribution, **how do we use this to recover the experimental distributions**? To answer this, let us begin with the following interpretation.

Imagine that we are Mother Nature, and it is our job to generate samples from the observational distribution. The causal diagram is important because it provides rules for the order in which we may proceed.

First, we draw the root nodes – which are our first causes – each drawn according to its own sacred distribution, not depending on any other node in the graph. Then we may draw any remaining node at any time, so long as its parents have already been drawn; for it is the outcomes of the parents which determine the distribution from which we draw. In the smoking example, we can proceed with one of two different ordering strategies:

- gene X → drinking → smoking → death
- gene X → smoking → drinking → death

Okay, job done for Mother Nature generating observational samples; we shall now refer to her as Observational Mother Nature.
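Observational Mother Nature’s job can be sketched in a few lines of Python. This is only a minimal sketch: the post does not state the parameters behind Figure 2, so every probability below is hypothetical, as are the names `bernoulli` and `draw_observational`. It also assumes that the diagram in Figure 1 has the edges gene X → smoking, gene X → drinking, smoking → death and drinking → death.

```python
import random

# Hypothetical parameters (the post does not give the values behind Figure 2).
P_GENE = 0.3                            # P(gene X = 1)
P_SMOKE = {0: 0.2, 1: 0.7}              # P(smoke = 1 | gene X)
P_DRINK = {0: 0.3, 1: 0.8}              # P(drink = 1 | gene X)
P_DEATH = {(0, 0): 0.1, (0, 1): 0.3,
           (1, 0): 0.4, (1, 1): 0.6}    # P(death = 1 | smoke, drink)


def bernoulli(p, rng):
    """Return 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0


def draw_observational(rng):
    """Draw one individual in the order gene X -> drinking -> smoking -> death."""
    gene = bernoulli(P_GENE, rng)                     # root node: no parents
    drink = bernoulli(P_DRINK[gene], rng)             # parent: gene X
    smoke = bernoulli(P_SMOKE[gene], rng)             # parent: gene X
    death = bernoulli(P_DEATH[(smoke, drink)], rng)   # parents: smoke, drink
    return {"gene": gene, "drink": drink, "smoke": smoke, "death": death}


rng = random.Random(0)
samples = [draw_observational(rng) for _ in range(50_000)]

# The naive comparison: death rate among smokers versus non-smokers.
for value in (1, 0):
    group = [s for s in samples if s["smoke"] == value]
    rate = sum(s["death"] for s in group) / len(group)
    print(f"P(death | smoke={value}) is roughly {rate:.3f}")
```

Swapping the `drink` and `smoke` lines gives the second ordering strategy; since neither variable is a parent of the other, the resulting distribution is the same.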

Now imagine that we are an Experimental Mother Nature, and we wish to generate samples from the experimental distribution. The obvious technique would be *the intervention technique*: we proceed as before but with one crucial change. At the moment where we are about to draw the experimental variable, we instead forcibly set this variable to our experimental value, then continue as normal, drawing the remaining downstream variables. In our smoking example, just at the point where we would have *randomly* determined whether an individual was a smoker, instead we force them to be so.
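As a rough illustration of the intervention technique, here is a minimal Python sketch. The parameters are hypothetical (the post does not state them), the edges gene X → smoking, gene X → drinking, smoking → death and drinking → death are assumed, and the names `draw_experimental` and `death_rate` are made up for this example.

```python
import random

# Hypothetical parameters (the post does not give the values behind the figures).
P_GENE = 0.3                            # P(gene X = 1)
P_DRINK = {0: 0.3, 1: 0.8}              # P(drink = 1 | gene X)
P_DEATH = {(0, 0): 0.1, (0, 1): 0.3,
           (1, 0): 0.4, (1, 1): 0.6}    # P(death = 1 | smoke, drink)


def bernoulli(p, rng):
    """Return 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0


def draw_experimental(rng, forced_smoke):
    """Draw one individual, but set the smoking variable instead of drawing it."""
    gene = bernoulli(P_GENE, rng)
    drink = bernoulli(P_DRINK[gene], rng)
    smoke = forced_smoke                              # the intervention
    death = bernoulli(P_DEATH[(smoke, drink)], rng)   # downstream: drawn as normal
    return {"gene": gene, "drink": drink, "smoke": smoke, "death": death}


def death_rate(forced_smoke, n=50_000, seed=0):
    """Estimate the death rate when smoking is forced to the given value."""
    rng = random.Random(seed)
    return sum(draw_experimental(rng, forced_smoke)["death"] for _ in range(n)) / n


rate_everyone = death_rate(1)   # everyone is forced to smoke
rate_no_one = death_rate(0)     # no-one is allowed to smoke
print(f"everyone smokes: {rate_everyone:.3f}, no-one smokes: {rate_no_one:.3f}")
```

Note that the distribution of smoking given gene X never appears: once we intervene, the way smokers used to be chosen is irrelevant.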

But here is another technique: *the restriction technique*. The key insight here is that from the point of intervention, the life of a person who we force into being a smoker is exactly the same as the life of a person who was determined to be a smoker by chance. Therefore, given a large set of samples generated by Observational Mother Nature, Experimental Mother Nature can piggyback off these in the following way:

- Divide up Observational Mother Nature’s samples into groups according to their history prior to the experimental variable. That is, each separate combination of outcomes occurring in the variables drawn upstream of the experimental variable gets its own group.
- At that point, keep only the samples which agree with our experimental value.
- Stitch together these remaining samples, **weighting each group by the history distribution**, that is, the original frequencies of each history group.

In our example, suppose Observational Mother Nature is following the first ordering strategy: gene X → drinking → smoking → death. Then there are four history groups before the smoking variable is drawn. The restricting and re-weighting which we describe has been visualised in Figure 5. This is also referred to as *adjusting for gene X and drinking*.

So here we have it: miraculously we recover the entire experimental distribution without any experimental data!
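The restriction technique can be sketched on simulated data. Again this is a sketch under assumptions: the parameters are hypothetical, the edges of the diagram are assumed to be gene X → smoking, gene X → drinking, smoking → death and drinking → death, and `restrict_and_reweight` is a name invented here. It also assumes that every history group contains at least one sample with the experimental value, otherwise there is nothing to re-weight.

```python
import random
from collections import Counter

# Hypothetical parameters (the post does not give the values behind the figures).
P_GENE = 0.3
P_SMOKE = {0: 0.2, 1: 0.7}              # P(smoke = 1 | gene X)
P_DRINK = {0: 0.3, 1: 0.8}              # P(drink = 1 | gene X)
P_DEATH = {(0, 0): 0.1, (0, 1): 0.3,
           (1, 0): 0.4, (1, 1): 0.6}    # P(death = 1 | smoke, drink)


def bernoulli(p, rng):
    return 1 if rng.random() < p else 0


def draw_observational(rng):
    gene = bernoulli(P_GENE, rng)
    drink = bernoulli(P_DRINK[gene], rng)
    smoke = bernoulli(P_SMOKE[gene], rng)
    death = bernoulli(P_DEATH[(smoke, drink)], rng)
    return {"gene": gene, "drink": drink, "smoke": smoke, "death": death}


def restrict_and_reweight(samples, history_vars, smoke_value):
    """Return (sample, weight) pairs approximating the experimental distribution."""
    def history(s):
        return tuple(s[v] for v in history_vars)

    n = len(samples)
    history_counts = Counter(history(s) for s in samples)        # 1. group by history
    kept = [s for s in samples if s["smoke"] == smoke_value]     # 2. restrict
    kept_counts = Counter(history(s) for s in kept)
    if set(history_counts) - set(kept_counts):
        raise ValueError("a history group has no sample with the experimental value")
    # 3. Re-weight: each group's total weight equals its original frequency.
    return [(s, history_counts[history(s)] / (n * kept_counts[history(s)]))
            for s in kept]


rng = random.Random(0)
samples = [draw_observational(rng) for _ in range(100_000)]
weighted = restrict_and_reweight(samples, ("gene", "drink"), smoke_value=1)
adjusted_rate = sum(w * s["death"] for s, w in weighted)
print(f"estimated death rate if everyone smoked: {adjusted_rate:.3f}")
```

Because the output is a weighted sample rather than a single number, any statistic of the experimental distribution can be estimated from it as a weighted average, not just the death rate.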

Figure 5: Adjusting for drinking and gene X, proof-by-picture. (I have switched the ordering of *smoke* and *drink* to make the comparison clearer)

**Note on how to read these graphs**

In this and all the pictures that follow, we have the observational distribution on the top, and the experimental distribution on the bottom (sometimes marginalised to a subset of variables). The blue boxes separate out each history group – i.e. each combination of our *adjustment variables* – but restricted to the experimental value.

The arrows indicate that after restricting and re-weighting, the distributions are the same.

## Mother Nature: the second ordering strategy

What if Mother Nature decided to draw variables according to the other ordering strategy: gene X → smoking → drinking → death? This means only gene X is upstream of the smoking variable, and we only have two possible histories. And indeed adjusting only for gene X still works: see Figure 6.

Figure 6: Adjusting for gene X only

Also, it is overkill to want to recover the entire experimental distribution: often we only care about the causal effect, which is the marginal distribution of the dependent variable, in this case mortality.

If this is so, then according to the second ordering strategy we don’t even need to see the outcome of the drinking variable in our observational sample; if it were hidden from us, we could still recover the marginal of the remaining three variables and, in particular, the causal effect. See Figure 7 for the visualisation.

Figure 7: Adjusting for gene X, with the drinking variable being unobserved.

You will notice that this graph looks different because we can no longer group by whether the individual is a drinker or not. E.g. for those with gene X, all of the drinkers and non-drinkers are grouped together, but it still works.
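To check this claim exactly rather than by picture, the sketch below works with probabilities instead of samples. It builds the observational joint distribution, throws away the drinking column, adjusts for gene X alone, and compares the result with a direct intervention on the full model. The parameters are hypothetical, the edges gene X → smoking, gene X → drinking, smoking → death and drinking → death are assumed, and the function names are invented for this example.

```python
from itertools import product

# Hypothetical parameters (the post does not give the values behind the figures).
P_GENE = 0.3
P_SMOKE = {0: 0.2, 1: 0.7}              # P(smoke = 1 | gene X)
P_DRINK = {0: 0.3, 1: 0.8}              # P(drink = 1 | gene X)
P_DEATH = {(0, 0): 0.1, (0, 1): 0.3,
           (1, 0): 0.4, (1, 1): 0.6}    # P(death = 1 | smoke, drink)


def prob(p_one, value):
    """P(variable = value) for a binary variable with P(variable = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


# Observational joint distribution over (gene, smoke, drink, death).
joint = {
    (g, s, d, y): prob(P_GENE, g) * prob(P_SMOKE[g], s)
                  * prob(P_DRINK[g], d) * prob(P_DEATH[(s, d)], y)
    for g, s, d, y in product((0, 1), repeat=4)
}

# Hide drinking: all we get to see is the marginal over (gene, smoke, death).
visible = {}
for (g, s, d, y), p in joint.items():
    visible[(g, s, y)] = visible.get((g, s, y), 0.0) + p


def adjust_for_gene(smoke):
    """Restrict to the given smoking value and re-weight by the gene X frequencies."""
    total = 0.0
    for g in (0, 1):
        p_history = sum(p for (g2, _, _), p in visible.items() if g2 == g)
        p_restricted = visible[(g, smoke, 0)] + visible[(g, smoke, 1)]
        total += p_history * visible[(g, smoke, 1)] / p_restricted
    return total


def intervene(smoke):
    """Ground truth from the full model: force smoking, draw the rest as usual."""
    return sum(prob(P_GENE, g) * prob(P_DRINK[g], d) * P_DEATH[(smoke, d)]
               for g, d in product((0, 1), repeat=2))


for smoke in (1, 0):
    print(smoke, adjust_for_gene(smoke), intervene(smoke))
```

For each smoking value the two printed numbers should agree up to floating-point rounding, even though `adjust_for_gene` never looks at the drinking variable.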

## Mother Nature: for two players

Let’s make things more complicated.

Imagine now that Observational Mother Nature is tired of drawing all the variables herself, so she divides the set of variables in two. Observational Mother Nature 1 (MN1) draws all the variables up to **and including** the experimental variable, then she passes her outcomes over to Observational Mother Nature 2 (MN2), who draws the remaining downstream variables. She only needs to pass along the relevant outcomes: those which have a child in the downstream variables.

Suppose we only care about the causal effect, i.e. the dependent variable. As before, MN1 doesn’t need to pass on all of her outcomes, only those which have children among the downstream variables. For the smoking example in the case of the first ordering, this two-player method is visualised in Figure 8.

Figure 8: Two-player generation of the observational distribution

Experimental Mother Nature (EMN) realises that to use the intervention technique she doesn’t need to interfere with what MN1 is doing; she only needs to intervene with MN2. Her technique involves intervening just after receiving the outcomes from MN1, and her method is as follows:

- Take MN1’s outcomes (which include the experimental variable)
- Forcibly set the experimental variable to the experimental outcome
- Take over the task of MN2 in drawing all of the remaining variables.

For the smoking example, this method is depicted in Figure 9. Note that it is crucial that the experimental variable is the last thing drawn by MN1 before hand-off! So long as this is the case, we can be sure that tinkering with this variable has no influence on the other variables handed off by MN1.
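The two-player game and EMN’s three steps might look like this in Python. As a sketch it rests on several assumptions: hypothetical parameters, the first ordering (gene X → drinking → smoking → death), the edges gene X → smoking, gene X → drinking, smoking → death and drinking → death, and the invented function names `mn1`, `mn2` and `emn`.

```python
import random

# Hypothetical parameters (the post does not give the values behind the figures).
P_GENE = 0.3
P_SMOKE = {0: 0.2, 1: 0.7}              # P(smoke = 1 | gene X)
P_DRINK = {0: 0.3, 1: 0.8}              # P(drink = 1 | gene X)
P_DEATH = {(0, 0): 0.1, (0, 1): 0.3,
           (1, 0): 0.4, (1, 1): 0.6}    # P(death = 1 | smoke, drink)


def bernoulli(p, rng):
    return 1 if rng.random() < p else 0


def mn1(rng):
    """Draw everything up to and including smoking, which is drawn last."""
    gene = bernoulli(P_GENE, rng)
    drink = bernoulli(P_DRINK[gene], rng)
    smoke = bernoulli(P_SMOKE[gene], rng)
    # Gene X has no child among the downstream variables, so it is not handed off.
    return {"drink": drink, "smoke": smoke}


def mn2(handoff, rng):
    """Draw the remaining downstream variable (death) from the hand-off."""
    return bernoulli(P_DEATH[(handoff["smoke"], handoff["drink"])], rng)


def emn(handoff, rng, forced_smoke):
    """Take MN1's outcomes, overwrite smoking, then do MN2's job."""
    tampered = dict(handoff, smoke=forced_smoke)
    return mn2(tampered, rng)


rng = random.Random(0)
n = 50_000
observational_rate = sum(mn2(mn1(rng), rng) for _ in range(n)) / n
everyone_rate = sum(emn(mn1(rng), rng, forced_smoke=1) for _ in range(n)) / n
print(f"observational: {observational_rate:.3f}, everyone smokes: {everyone_rate:.3f}")
```

Overwriting `smoke` inside `emn` cannot change `drink`, because drinking was already drawn (and never depended on smoking) by the time of the hand-off.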

Figure 9: Two-player generation of the experimental distribution

And the restriction technique should still work. EMN might not know the outcomes of all the upstream variables drawn by MN1, but so long as EMN preserves the distribution of groups as given to her by MN1, then restricting and re-weighting ought to work just like before, for the same reason as before.

Returning then for the last time to our example, the restriction technique is the following:

- MN1 draws gene X, drinking, smoking
- MN1 passes only the drinking and smoking outcomes to EMN
- EMN restricts only to the handed-off outcomes for which the individual smokes
- EMN re-weights according to the distribution of groups handed off by MN1: i.e. preserving the original prevalence of drinkers and non-drinkers

This has been depicted in Figure 10. What is the corollary? We can recover the causal effect **even when gene X is hidden from view**!
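Here is a small simulation of that corollary. It is a sketch only: the parameters are hypothetical, the edges gene X → smoking, gene X → drinking, smoking → death and drinking → death are assumed, and `handed_off_sample` is an invented name. Gene X is drawn but never recorded, so the estimate uses drinking, smoking and death alone.

```python
import random
from collections import Counter

# Hypothetical parameters (the post does not give the values behind the figures).
P_GENE = 0.3
P_SMOKE = {0: 0.2, 1: 0.7}              # P(smoke = 1 | gene X)
P_DRINK = {0: 0.3, 1: 0.8}              # P(drink = 1 | gene X)
P_DEATH = {(0, 0): 0.1, (0, 1): 0.3,
           (1, 0): 0.4, (1, 1): 0.6}    # P(death = 1 | smoke, drink)


def bernoulli(p, rng):
    return 1 if rng.random() < p else 0


def handed_off_sample(rng):
    """Draw one individual, recording only drinking, smoking and death."""
    gene = bernoulli(P_GENE, rng)                     # drawn, then hidden from view
    drink = bernoulli(P_DRINK[gene], rng)
    smoke = bernoulli(P_SMOKE[gene], rng)
    death = bernoulli(P_DEATH[(smoke, drink)], rng)
    return drink, smoke, death


rng = random.Random(0)
records = [handed_off_sample(rng) for _ in range(100_000)]
n = len(records)

drink_counts = Counter(d for d, _, _ in records)              # original prevalence
smoker_counts = Counter(d for d, s, _ in records if s == 1)   # restricted to smokers
smoker_deaths = Counter()
for d, s, y in records:
    if s == 1:
        smoker_deaths[d] += y

# Re-weight the smokers' death rates by the prevalence of drinkers and non-drinkers.
adjusted = sum((drink_counts[d] / n) * (smoker_deaths[d] / smoker_counts[d])
               for d in drink_counts)
naive = sum(smoker_deaths.values()) / sum(smoker_counts.values())
print(f"naive: {naive:.3f}, adjusted for drinking: {adjusted:.3f}")
```

With these made-up parameters the naive rate among smokers overstates the effect, because smokers are more often drinkers; the adjusted rate removes that excess.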

Figure 10: Adjusting for drinking, when gene X is unobserved.

## Back door paths

All of this has been a rather convoluted attempt to make intuitive what the literature refers to as *blocking all the back door paths*^{1}. The main takeaway is that for any group of variables chosen in a way that is consistent with the above fairy-tale, the adjustment we describe here correctly transforms the observational distribution into the experimental distribution. These groups are exactly those which are said to *block all the back door paths*.
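For readers who prefer symbols, the restrict-and-re-weight recipe can be written as a single formula, usually called the back door adjustment formula. The notation is introduced here for illustration and does not appear elsewhere in this post: $X$ is the experimental variable, $Y$ the dependent variable, $Z$ the group of adjustment variables, and $do(X = x)$ denotes forcibly setting $X$ to $x$.

$$
P\big(Y = y \mid do(X = x)\big) \;=\; \sum_{z} P\big(Y = y \mid X = x,\, Z = z\big)\, P\big(Z = z\big)
$$

The factor $P(Y = y \mid X = x, Z = z)$ is the restriction step within each history group, and $P(Z = z)$ is the history distribution used for re-weighting. In the smoking example, $Z$ was in turn {gene X, drinking}, {gene X} and {drinking}.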

In fact there are other valid groups and different methods of adjustment, so the fun doesn’t stop here. If you have any ideas for crazy ways of explaining any of these other methods, then please get in touch.

## Footnotes:

^{1}

For a technical exposition of what it means to block all the back door paths, see Pearl, J. (2009). Causality. Cambridge University Press.