The correctness or otherwise of different statistical methods can be a difficult and contentious topic. To this end, I want to talk about an interesting trio of principles, and a result which I found surprising when I first heard it. And still I am trying to develop an intuition for what it means!

## What is Evidence?

Without wanting to sound too dramatic, I would argue that everything is evidence. Good evidence, on the other hand, is that which is useful to a scientist: evidence which can be used as ammunition in an argument for or against a certain theory. And often, in search of good evidence, a scientist sets up an experiment.

In this article we will use the following formalisation. An instance of *statistical evidence* is a pair describing the experiment and the observed outcome:

$$(E, x).$$

The description may contain all sorts of information as to how the experiment is conducted, and the likelihoods of all possible outcomes according to different hypotheses. In our case, the hypotheses relate to the possible values of some unknown parameter $\theta$, which we assert as belonging to some parameter-space $\Theta$. Moreover we will be fully describing the experiment with a likelihood function $f(x \mid \theta)$.

## What is the problem that we are trying to solve?

Ideally, if two statisticians perform the same experiment and get the same outcome, then they should draw the same conclusions. But in what other cases should we expect their conclusions to be the same? For example, what if the outcomes were different only in that one was in imperial units, and the other in metric units. These experiments are different, so what exactly is it that they have in common?

We shall call the abstract essential properties of a piece of evidence the *evidential meaning*, written $\mathrm{Ev}(E, x)$. In the example just given — where we change the units, or perform some other invertible conversion — the two scientists have different experiments and data, but the same evidential meaning.

With this in mind, we are trying to solve two tasks:

- In the experimental regime described above, can we mathematically characterise the statistical evidence? To put this in a silly way, can we replace all of the experimental articles in all of the scientific journals with a database of mathematical objects? And as a corollary, when is it correct to say that two bits of evidence are the same; when is $\mathrm{Ev}(E_1, x_1) = \mathrm{Ev}(E_2, x_2)$?
- Once characterised, how should these objects be interpreted qualitatively?

This article will address the first question. What follows are three principles on evidential meaning which concern exactly the question of when two different experiments should lead to the same conclusions. The first two seem fairly intuitive, and the third is a bit more mysterious.

## The Sufficiency Principle (S)

*If one has the value of a sufficient statistic, further details are superfluous.*

Given a model parameterised by $\theta$, and some data $x$ from that model, a *sufficient statistic* is a function of the data containing everything there is to know about $\theta$. Often the sample-mean might be sufficient, which indeed is the case when drawing from a normal distribution with unknown mean parameter. To be precise, a statistic $t = T(x)$ is sufficient when, conditional on $t$, the distribution of the data does not depend on $\theta$.

The sufficiency principle asserts that no evidential meaning is lost when we alter our experiment to return not the original outcome $x$, but instead any corresponding sufficient statistic $t = T(x)$. As a corollary, if we perform an experiment and get two outcomes which have the same sufficient statistic, then our conclusions should be the same. To put this in notation, suppose that $E'$ is the experiment returning the sufficient statistic $t$, then:

$$\mathrm{Ev}(E, x) = \mathrm{Ev}(E', t).$$

As an example, suppose we wish to estimate the bias of a potentially unfair coin. Our experiment is to flip the coin $n$ times and record the results in their exact order. The sufficiency principle states that it would have been just as well to record only the number of heads; the order in which they came up is superfluous.
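As a small numerical sketch of this (the function names are my own, not from the paper), we can check that the likelihood of an exact flip sequence depends on the bias $p$ only through the number of heads:

```python
from math import prod

def seq_likelihood(seq, p):
    """Likelihood of observing the exact H/T sequence `seq` under head-probability p."""
    return prod(p if c == "H" else 1 - p for c in seq)

# Two different orderings with the same number of heads give identical
# likelihood functions of p -- the order is superfluous.
for p in (0.2, 0.5, 0.9):
    assert abs(seq_likelihood("HHTHT", p) - seq_likelihood("THHTH", p)) < 1e-12
```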

## The Conditionality Principle (C)

*Experiments which were considered, but not performed, are irrelevant.*

Suppose we have a super-experiment $E_{\text{mix}}$ which performs one experiment from a possible set of experiments $E_1, \dots, E_n$. We select $E_h$ according to some distribution — a standalone distribution independent of our parameter $\theta$ — and then perform the corresponding experiment. The statistical evidence in this case is $(E_{\text{mix}}, (E_h, x_h))$. The conditionality principle asserts that we should come away with the same conclusions as any scientist whose experimental plan was always to perform only $E_h$, and whose outcome was the same as $x_h$. In other words:

$$\mathrm{Ev}(E_{\text{mix}}, (E_h, x_h)) = \mathrm{Ev}(E_h, x_h).$$

To add some intuition, suppose a scientist could have used one of two instruments with which to perform an experiment, and the decision will be made by the availability of funding. Sadly the funding is tight, and the scientist has to use the cheap and inferior instrument. The fact that the scientist *could* have been doing the experiment with the other instrument is surely no longer relevant.
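To make this concrete, here is a sketch (the instrument densities and all names are illustrative assumptions of mine): as a function of $\theta$, the likelihood of the super-experiment's outcome is just a constant multiple of the chosen instrument's likelihood, so the unperformed experiment drops out.

```python
from math import exp

def f_precise(x, theta):
    return exp(-0.5 * (x - theta) ** 2)            # Normal(theta, 1), up to a constant

def f_cheap(x, theta):
    return exp(-0.5 * ((x - theta) / 3) ** 2) / 3  # Normal(theta, 3), up to the same constant

COMPONENTS = {"precise": f_precise, "cheap": f_cheap}
PI = {"precise": 0.5, "cheap": 0.5}  # instrument selection probabilities, independent of theta

def f_mixture(h, x, theta):
    """Likelihood of the super-experiment outcome (instrument h, measurement x)."""
    return PI[h] * COMPONENTS[h](x, theta)

# For every theta, the mixture likelihood is the component likelihood scaled
# by the constant PI[h] -- the same likelihood function, evidentially speaking.
for theta in (0.0, 1.0, 2.5):
    assert abs(f_mixture("cheap", 1.3, theta) / f_cheap(1.3, theta) - 0.5) < 1e-12
```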

## The Likelihood Principle (L)

*Experiments with the same likelihood function give the same conclusions.*

Consider two scientists, each of whom generates some statistical evidence — $(E_1, x_1)$ and $(E_2, x_2)$ — each parameterised by the shared parameter $\theta$. Their experiments and their results are such that, as functions of $\theta$, the two likelihood functions are the same up to a constant factor: $f_1(x_1 \mid \theta) = c \, f_2(x_2 \mid \theta)$. The likelihood principle asserts that their conclusions are the same:

$$\mathrm{Ev}(E_1, x_1) = \mathrm{Ev}(E_2, x_2).$$

What is surprising about this principle is not that different data can give the same conclusion — which is already the case for the sufficiency principle — but that the entire experimental design can also be different, so long as the observed data give the same likelihood function. At first glance, this principle is more objectionable than the other two, which seem to be very intuitive.

**Example one**

Consider a Poisson model for the number of customers visiting a cafe in any single hour: $X \sim \mathrm{Poisson}(\lambda)$. Scientist-A sits in the cafe for an hour and counts the number of customers: only one. Scientist-B comes early the next day, and times the gap between the first and second customer: it takes an hour. Since these two experimental outcomes correspond to the same likelihood function $\lambda \mapsto \lambda e^{-\lambda}$, the likelihood principle asserts that they have the same evidential meaning.
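We can verify the claim numerically — a minimal sketch, with function names of my own choosing:

```python
from math import exp

def lik_count(lam):
    """Scientist-A: P(exactly 1 customer in an hour) under a Poisson(lam) count."""
    return lam * exp(-lam)

def lik_gap(lam, t=1.0):
    """Scientist-B: density of a gap of t hours between arrivals, Exponential(lam)."""
    return lam * exp(-lam * t)

# A count of one customer and a one-hour gap yield the very same function of lam.
for lam in (0.5, 1.0, 3.0):
    assert abs(lik_count(lam) - lik_gap(lam)) < 1e-12
```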

**Example two**

Consider again the case of a potentially unfair coin with head-probability $p$. Scientist-A flips the coin until they get a tail: HHHHT. Scientist-B simply flips five coins and gets four heads and a tail in some unimportant order: HTHHH. Same likelihood — $p^4(1-p)$, up to a constant factor — same evidential meaning.
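Again the proportionality is easy to check numerically (a sketch with illustrative names): the fixed-five-flips likelihood differs from the flip-until-tail likelihood only by the constant $\binom{5}{4} = 5$.

```python
from math import comb

def lik_until_tail(p, heads=4):
    """Scientist-A: flip until the first tail; likelihood of seeing `heads` heads first."""
    return p ** heads * (1 - p)

def lik_fixed_n(p, n=5, heads=4):
    """Scientist-B: n flips, recording the number of heads."""
    return comb(n, heads) * p ** heads * (1 - p) ** (n - heads)

# The two likelihood functions agree up to the constant comb(5, 4) = 5.
for p in (0.3, 0.5, 0.8):
    assert abs(lik_fixed_n(p) / lik_until_tail(p) - 5) < 1e-9
```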

## C + S = L

Although the likelihood principle is less intuitive, it happens to be equivalent to sufficiency plus conditionality (proof is in the appendix). So for the scientist who believes in S and C, we have a solution to our first problem of inference: the substance of any piece of statistical evidence is exactly the resulting likelihood function.

What does this mean? What does this mean for p-values? The observant reader may have noticed that p-values rely not only on what was observed, but also on what *could have been* observed. Take another look at example two. The experiments have the same likelihood function, but do they have the same p-value? No! Testing fairness against a bias towards heads, Scientist-A gets a one-sided p-value of $1/16 \approx 0.06$, while Scientist-B gets $3/16 \approx 0.19$ — so at a 10% significance level, one scientist would reject the coin being fair, and the other would not.
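To see the divergence concretely, here is a sketch computing the two one-sided p-values under $H_0\!: p = 1/2$ (the enumeration is my own; the 200-term sum approximates the geometric tail):

```python
from math import comb

# One-sided test of H0: p = 1/2 against "the coin favours heads".

# Scientist-A stops at the first tail: the p-value is the probability of
# seeing at least 4 heads before that tail (a geometric tail sum).
p_value_A = sum(0.5 ** k * 0.5 for k in range(4, 200))  # ~ (1/2)^4 = 0.0625

# Scientist-B flips exactly 5 coins: the p-value is P(at least 4 heads in 5).
p_value_B = sum(comb(5, k) for k in (4, 5)) / 2 ** 5    # 6/32 = 0.1875

# Same data, same likelihood function -- different p-values.
assert p_value_A < 0.1 < p_value_B
```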

Figure 1: “Frequentists vs. Bayesians” strip from xkcd

You may have seen the popular xkcd strip about p-values and the death of the sun (see Figure 1). What is the interpretation under the lens of the likelihood principle? For one, it doesn’t matter that the probability of the detector having said yes was 0.027. What matters is how this probability scales as we vary our hypothesis: in this case, the sun having exploded or not. In particular, our method should be scale invariant. Whereas for p-values, the fact that 0.027 is 0.027 — and not 27, or 0.5, or anything else — is the only thing that matters. It is what we look at to conclude our hypothesis test: “is this number less than 5%?”

Secondly, it doesn’t matter that the detector could have said no. For all we care, the counterfactual could have been the detector running another test, or considering an output of “maybe”. Once we have our observation, all of the counterfactuals go out of the window.

## Conclusion

The three principles — of sufficiency, conditionality, and likelihood — help us determine the essential properties of a scientific experiment. And surprisingly the likelihood principle is only as debatable as the other two combined.

The material of this post comes from the paper *On The Foundations Of Statistical Inference* [†]. The content is theoretical, and it is natural to wonder what the practical applications are. Should we stop using p-values? I expect not; p-values are ubiquitous and extremely useful! However, in my opinion as a likelihoodist, they are incorrect. But then again, so is Newtonian mechanics.

## Appendix

### The Likelihood Lemma

We will use the following result: if two outcomes $x_1$ and $x_2$ of the **same** experiment $E$ admit the same likelihood function (up to a constant factor), then there exists a sufficient statistic $T$ such that $T(x_1) = T(x_2)$. Then, assuming the principle of sufficiency, we have the corollary:

$$\mathrm{Ev}(E, x_1) = \mathrm{Ev}(E, x_2).$$

### Proof of C + S = L

The proof is short, and goes like this: It is clear that L $\implies$ C + S. Suppose then that $(E_1, x_1)$ and $(E_2, x_2)$ admit proportional likelihood functions as described. Consider then the following mixture experiment $E^*$, where we flip a fair coin, and if heads we perform $E_1$, if tails $E_2$:

$$f^*\big((E_i, x) \mid \theta\big) = \tfrac{1}{2} f_i(x \mid \theta), \qquad i = 1, 2.$$

Then by the conditionality principle we have $\mathrm{Ev}(E^*, (E_1, x_1)) = \mathrm{Ev}(E_1, x_1)$, and likewise for $(E_2, x_2)$. Since these two outcomes are now from the same experiment and admit the same likelihood function up to a constant — each has merely been scaled by a factor of $\tfrac{1}{2}$ — we can use the likelihood lemma to deduce the existence of a sufficient statistic $T$ for which $T((E_1, x_1)) = T((E_2, x_2))$. Then we apply the sufficiency principle to get the following chain:

$$\mathrm{Ev}(E_1, x_1) = \mathrm{Ev}(E^*, (E_1, x_1)) = \mathrm{Ev}(E^*, (E_2, x_2)) = \mathrm{Ev}(E_2, x_2).$$

And therefore $\mathrm{Ev}(E_1, x_1) = \mathrm{Ev}(E_2, x_2)$; that is, C + S $\implies$ L.

## References

- [†] Allan Birnbaum, “On the Foundations of Statistical Inference”, *Journal of the American Statistical Association*, **57(298)**, 269–306 (1962).