Bad Stuff in Food: Risk Analysis and Political Commentary: Statistical Insignificance

Regression

The idea of using mathematics to formulate decision theory perhaps began with the use of linear regression as a technique for fitting a line to data. Although the correlation coefficient (r) was originally used by Francis Galton, Karl Pearson, and others to describe the degree to which different biological measures co-occurred, it was soon discovered that the same mathematics could also be used to describe the correlation between a linear model and observed values, and the r-value came to be regarded as a measure of the strength of an inference (Porter, 2004). This came to be seen as a solution to the problem of induction; instead of relying on subjective judgment, inferences could be drawn from the data mathematically. It never really was, of course. Although least squares regression was settled on as a standard curve-fitting technique, it has never been universally accepted, and r-values are often not especially useful for discriminating between alternative models. But regression is undeniably useful as a technique for generating a model that fits the data as well as possible.
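To make the point about r-values concrete, here is a minimal sketch in Python (my own illustration, not anything from Porter). The data are made up: ten points that follow a curve (y = x squared) exactly. The helper name least_squares is hypothetical. A straight line fitted to them still earns a very high r, which says little about whether a line was the right model in the first place.

```python
import math

def least_squares(xs, ys):
    """Return (slope, intercept, r) for an ordinary least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r

# Illustrative data only: deliberately curved, with no noise at all
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

slope, intercept, r = least_squares(xs, ys)
print(f"line: y = {slope:.2f}x + {intercept:.2f}, r = {r:.3f}")
```

The fitted line comes back with r above 0.97 even though the straight-line model is wrong by construction, so a high r alone would not have told you to prefer the curve.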

Hypothesis Testing

The development of statistical decision theory began in earnest in England between the two world wars. Two very different schools of thought developed, but they both had two important commonalities. First, they put the business of inductive reasoning in the past by giving it the role of framing a hypothesis to be tested. For example, Ronald Fisher (1939; quoted in Gigerenzer et al, 1989) wrote:

Constructive imagination, together with much knowledge based on experience of data of the same kind, must be exercised before deciding on what hypotheses are worth testing, and in what respects. Only when this fundamental thinking has been accomplished can the problem be given a mathematical form.

Second, it was generally presumed that the decision to be made would revolve around the conduct of a single experiment.

For Fisher, hypothesis testing compared a new hypothesis against an old “null” hypothesis, which Fisher sometimes defined as the “treatment that has no effect, period” (Gigerenzer et al, 1989). Typically, the new hypothesis would assert a difference between two groups that the “null” denied, and if an experiment demonstrated a difference that was unlikely (typically defined as a chance of less than 5%) to be explained by random variation, the difference was considered to be “statistically significant”.

But, there was a problem with this procedure that was pointed out by some of Fisher’s contemporaries: The measure of significance functioned like a signal-to-noise ratio, where the ability to detect a signal is dependent on background variation of what is being measured. The actual magnitude of the difference does not matter; if a teeny difference cannot be explained by teeny background variation, the result will be “statistically significant” even if it is not significant in any practical sense.
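The signal-to-noise point can be sketched with a few lines of Python. This is a rough illustration under stated assumptions, not Fisher's own procedure: the numbers are invented, both groups are assumed to have the same size and standard deviation, and the p-value uses a normal (large-sample) approximation rather than a proper t distribution. The function name two_sided_p is hypothetical.

```python
import math
from statistics import NormalDist

def two_sided_p(mean_a, mean_b, sd, n):
    """Approximate two-sided p-value for a difference in means.

    Assumes two groups of equal size n with a common standard
    deviation sd, and uses a normal approximation.
    """
    standard_error = sd * math.sqrt(2 / n)      # the "noise"
    z = (mean_b - mean_a) / standard_error      # signal divided by noise
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Teeny difference, teeny background variation (made-up numbers)
p_tiny = two_sided_p(mean_a=100.0, mean_b=100.1, sd=0.05, n=20)

# Large difference, large background variation (made-up numbers)
p_large = two_sided_p(mean_a=100.0, mean_b=110.0, sd=30.0, n=20)

print(f"difference of 0.1: p = {p_tiny:.2g}")
print(f"difference of 10:  p = {p_large:.2g}")
```

Under these assumptions the 0.1 difference clears the 5% bar easily and the 10-unit difference does not, even though the second is a hundred times larger.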

To address this issue, the Neyman-Pearson (Egon was the son of Karl) method arose (Gigerenzer et al, 1989). NP hypothesis testing had two crucial differences. First, neither hypothesis had any a priori primacy over the other. They are simply two alternative hypotheses being compared to each other, and one might suppose that there could just as well be more than two. Second, the actual value deemed to be significant was considered to be driven by practical considerations arising from how the test was to be used. Ideally, the significance value would be an optimum where the consequences of being wrong in either direction are approximately equal. In short, Neyman and Pearson considered “significance” to be a value judgment.
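One way to picture the Neyman-Pearson idea is as a search for the cutoff at which the two kinds of error cost about the same. The Python sketch below is my own simplification, not the historical formulation: it assumes two hypotheses about a mean (0 versus 1), a known standard error of 0.5, and invented cost weights, and it simply scans a grid of cutoffs. The names balanced_cutoff, cost_false_alarm, and cost_miss are hypothetical.

```python
from statistics import NormalDist

def balanced_cutoff(cost_false_alarm, cost_miss, standard_error=0.5):
    """Find the cutoff where the weighted error rates are closest to equal.

    Hypothesis A says the true mean is 0, hypothesis B says it is 1.
    We choose B when the observed mean exceeds the cutoff.
    """
    dist_a = NormalDist(mu=0.0, sigma=standard_error)
    dist_b = NormalDist(mu=1.0, sigma=standard_error)

    def imbalance(cutoff):
        false_alarm = 1 - dist_a.cdf(cutoff)   # choose B when A is true
        miss = dist_b.cdf(cutoff)              # choose A when B is true
        return abs(cost_false_alarm * false_alarm - cost_miss * miss)

    grid = [i / 100 for i in range(-100, 201)]  # cutoffs from -1.00 to 2.00
    return min(grid, key=imbalance)

# Equal consequences: the cutoff sits halfway between the two hypotheses
c_equal = balanced_cutoff(cost_false_alarm=1.0, cost_miss=1.0)

# Missing a real effect is assumed to be five times worse: cutoff moves down
c_asym = balanced_cutoff(cost_false_alarm=1.0, cost_miss=5.0)

print(f"equal costs:   cutoff = {c_equal:.2f}")
print(f"costly misses: cutoff = {c_asym:.2f}")
```

Nothing in the mathematics picks the cost weights for you. They come from how the test will be used, which is exactly why the resulting "significance" level is a value judgment.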

Academic Significance

In the academic world, Fisher won the argument. He sold many more textbooks than Neyman and Pearson, and some variation of Fisherian significance testing is the standard in many academic fields. It is not too hard to figure out why that happened. Most of the scientific studies published in academic journals are not really trying to come to a decision at all. Academic authors are, generally speaking, simply trying to establish facts. So, a test that allows factdom to be established, or at least supported, without considering whether or not the fact actually matters is exactly what they need.

Practical Significance

Outside the ivory tower, Neyman and Pearson were obviously right; Fisherian tests are indeed not very significant. An effect of large magnitude that is not quite statistically significant may be far more important than a very small effect that is. In fact, the replacement of the NOAEL with the BMD in regulatory toxicology is an example where a Fisher test has been replaced with a statistical construct that is far more compatible with NP decision theory.

However, both the Fisher and NP testing regimens have common problems that stem from the strategies they both settled on. First, inductive reasoning does not stop in its tracks after an experiment has been completed. In fact, after seeing the actual data, scientists often think of a new hypothesis that they should have tested instead. Statistical decision theorists have tried to nip further induction in the bud by prohibiting “ad hoc” analyses that are based on the introduction of a new theory that was not considered before the experiment was planned (Mayo, 1996). Since this essentially mandates repeating the same experiment just for the sake of testing a new hypothesis that perhaps should have been considered in the first place, this is quite silly. Second, decisions often do not revolve around a single experiment. So, as Hill (1965) noted, there really is no substitute for a good old-fashioned weight of the evidence evaluation that looks at all the plausible explanations for all of the studies:

Between the two world wars there was a strong case for emphasizing to the clinician and other research workers the importance of not overlooking the effects of the play of chance upon their data. Perhaps too often generalities were based upon two men and a laboratory dog while the treatment of choice was deduced from a difference between two bedfuls of patients and might easily have no true meaning. It was therefore a useful corrective for statisticians to stress, and to teach the need for, tests of significance merely to serve as guides to caution before drawing a conclusion, before inflating the particular to the general.

I wonder whether the pendulum has not swung too far - not only with the attentive pupils but even with the statisticians themselves. To decline to draw conclusions without standard errors can surely be just as silly? Fortunately I believe we have not yet gone so far as our friends in the USA where, I am told, some editors of journals will return an article because tests of significance have not been applied. Yet there are innumerable situations in which they are totally unnecessary because the difference is grotesquely obvious, because it is negligible, or because, whether it be formally significant or not, it is too small to be of any practical importance.

Of course I exaggerate. Yet too often I suspect we waste a deal of time, we grasp the shadow and lose the substance, we weaken our capacity to interpret data and to take reasonable decisions whatever the value of P. And far too often we deduce 'no difference' from 'no significant difference'. Like fire, the [chi-square] test is an excellent servant and a bad master.

References

Gigerenzer G, Swijtink Z, Porter T, Daston L, Beatty J, and Krüger L (1989). The Inference Experts. In: The Empire of Chance. Cambridge University Press, Cambridge, pp.70-122.

Hill, Sir Arthur Bradford (1965). The Environment and Disease: Association or Causation? Proc Royal Soc Med 58:295-300.

Mayo, DG (1996). Error and the Growth of Experimental Knowledge. University of Chicago Press, Chicago.

Porter, TM (2004). Karl Pearson. Princeton University Press, Princeton.

Official Post Soundtrack

Porcupine Tree (1996). Insignificance. In: Metanoia, Track 6

Post Note

Thesis Post #7. They didn't call him Sir for nothing.


Saturday, March 7, 2015
