Thursday, July 23, 2015

The Meaning of the Mean

The Average as a Surrogate for the Total

In statistics and probability, the arithmetic mean or the average value is often considered to be especially important.  There are some good reasons for this – sometimes.  Other times, not so much. For starters, there really is no such thing as an average person.  So, knowing the average value for a population may not give you very much information about yourself or any other specific individual.  But, for the purpose of providing a quantitative description of a population, the average often works rather well.  The reason is simple; the average is proportional to the total:

Average = Population Total / number of persons

Therefore, as long as the utility function is also proportional to the quantitative value, the average serves as a utilitarian measure of value.  Even though that proposition is dubitable to the point of being obviously wrong under some circumstances (e.g. for risk assessments where the risk is driven by extreme values), the “average person” often serves as a useful stand-in for “everyone”.
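As a rough illustration of that point, here is a minimal Python sketch.  The two populations and both harm functions are made up for the example; nothing here comes from real data.  It shows that two populations with the same average have the same total harm when harm is proportional to the value, but not when harm is driven by extreme values.

```python
from statistics import mean

# Hypothetical exposures for two populations of five people (arbitrary units).
even = [2.0, 2.0, 2.0, 2.0, 2.0]
skewed = [0.5, 0.5, 0.5, 0.5, 8.0]

def total_harm(values, harm):
    """Sum a per-person harm function over the whole population."""
    return sum(harm(v) for v in values)

def proportional(v):
    # Assumed harm function: proportional to the value.
    return 3.0 * v

def extreme_only(v):
    # Assumed harm function: only values above 5 units matter.
    return 1.0 if v > 5.0 else 0.0

print("means:", mean(even), mean(skewed))
print("proportional harm:", total_harm(even, proportional), total_harm(skewed, proportional))
print("extreme-value harm:", total_harm(even, extreme_only), total_harm(skewed, extreme_only))
```

With the proportional harm function the average tells you everything the total would.  With the extreme-value harm function, the two populations share an average but not an outcome.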

Then, there is the "Expected Value".  Mathematical probability was originally devised to calculate the frequency of occurrence of specific results from games of chance played many times.  This, in turn, allowed the rate of return over a long (theoretically infinite) period of time to be estimated.  Once again, the value of interest corresponds to the arithmetic mean.  For example, a gambler seeking to profit from a series of bets can evaluate the bet as follows:
Expected Value = Total Net Return / number of bets
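The formula above can be sketched in Python.  The bet is a hypothetical one chosen for illustration (a single-number wager that pays 35 to 1 with a 1 in 37 chance of winning), and the function names are mine.  The sketch compares the probability-weighted expected value with the average net return over many simulated bets.

```python
import random

# Hypothetical bet: (net return, probability) for each outcome.
bet = [(35.0, 1 / 37), (-1.0, 36 / 37)]

def expected_value(outcomes):
    """Probability-weighted mean of the net returns."""
    return sum(x * p for x, p in outcomes)

def average_net_return(outcomes, n_bets, seed=0):
    """Total net return divided by the number of simulated bets."""
    rng = random.Random(seed)
    returns = [x for x, _ in outcomes]
    weights = [p for _, p in outcomes]
    draws = rng.choices(returns, weights=weights, k=n_bets)
    return sum(draws) / n_bets

print("expected value per bet:", expected_value(bet))
print("average over 200,000 bets:", average_net_return(bet, 200_000))
```

The two numbers only agree approximately, and only because the number of bets is large; over a handful of bets the average return can be far from the expected value.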

The use of mathematical probability in finance and insurance often follows the same underlying logic:  Given the fact that there are sure to be some bad loans and bad insurance risks, the key to having a profitable business is to have the average return be positive.  At least, that is what investors expect.

Measurement Error

Using the Standard Error of the Mean to characterize the uncertainty associated with scientific measurements has a long history.  Writing in 1755, Thomas Simpson adapted the Bernoulli theorem (aka the law of large numbers) to make the following observation (quote from Stigler, 1986):
Upon the whole …. It appears, that the taking of the Mean of a number of observations, greatly diminishes the chances for all the smaller errors, and cuts off almost all possibility of any great ones: which last consideration, alone, seems sufficient to recommend the use of the method, not only to astronomers, but to all others concerned in making experiments of any kind (to which the above reasoning is equally applicable).  And the more observations or experiments that are made, the less will the conclusion be liable to err, provided they admit of being repeated under similar circumstances.
However, Simpson’s claim was met by immediate criticism from Thomas Bayes, who noted (also in Stigler, 1986):
As I see no mistakes in Mr. Simpson’s calculations, I will venture to say that there is one in the Hypothesis on which he proceeds.  And I think it is manifestly this, when we observe with imperfect instruments or organs; he supposes that the chances for the same error in excess or defect are exactly the same, and upon this hypothesis only has he shown the incredible advantage, which he would prove arises from taking the mean of a great many observations.
In other words, the standard error of the mean accurately characterizes the uncertainty of a measurement only when, as Simpson assumed, the true value corresponds to the arithmetic mean.  If it doesn’t, then even though the theorem is true, the result is irrelevant.  For example, if the errors are lognormally distributed, then the true value will correspond to the geometric mean rather than the arithmetic mean.  If the underlying distribution of the measurements is unknown, then so is the relationship of the true value to the distribution.  Calling the mean value the expected value doesn’t help at all.
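The lognormal example can be checked with a small simulation.  This is a sketch under stated assumptions: the true value (100), the size of the error (a log-scale standard deviation of 0.5), and the assumption that the error is multiplicative with a median of 1 are all choices made for the illustration.

```python
import math
import random
from statistics import stdev

TRUE_VALUE = 100.0  # hypothetical true quantity being measured
SIGMA = 0.5         # assumed log-scale standard deviation of the error

rng = random.Random(1)
# Each measurement is the true value times a lognormal error with median 1.
measurements = [TRUE_VALUE * rng.lognormvariate(0.0, SIGMA) for _ in range(20_000)]

n = len(measurements)
arithmetic_mean = sum(measurements) / n
standard_error = stdev(measurements) / math.sqrt(n)
geometric_mean = math.exp(sum(math.log(m) for m in measurements) / n)

print("arithmetic mean:", round(arithmetic_mean, 2))
print("standard error of the mean:", round(standard_error, 2))
print("geometric mean:", round(geometric_mean, 2))
```

Under these assumptions the standard error is small, yet the arithmetic mean sits many standard errors away from the true value, while the geometric mean lands close to it.  More observations shrink the standard error without moving the arithmetic mean any closer to the truth.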

Averaging the Truth

In the realm of the probability of chance, the mean value is almost certainly given far more credence than it deserves.  But still, under most circumstances the arithmetic mean is close enough to the actual value of interest to be considered approximately true.  On the other hand, with the probability of causes, or any other notion of probability arising from competing theoretical propositions, there is no basis for using a mean value at all.  For example, consider the probability that the earth is round as opposed to flat.  As a decision problem, under no circumstances would it make any sense to average the flat earth theory with the round one.  Yet, that is essentially what Bayesian Model Averaging does.

The admirable trait of Bayesian Model Averaging (BMA) is that it acknowledges that different plausible models may yield estimates that may be quite different (Hoeting et al, 1999).   Like a probability tree treatment of model uncertainty (e.g. Evans et al, 1994; Carrington et al, 2013), BMA requires identification of a set of alternative plausible models and establishing a model probability that will surely require some degree of subjective judgement.  But, with BMA the subjective probability is just the prior probability rather than the finished product.  Bayesian updating and averaging is the next step.
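The updating and averaging steps can be sketched in a few lines of Python.  This is a bare-bones illustration of the arithmetic, not the procedure from Hoeting et al; the two models, their estimates, the prior probabilities, and the likelihoods are all hypothetical numbers.

```python
def posterior_model_probabilities(priors, likelihoods):
    """Bayes' rule over a discrete set of models."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

def model_average(estimates, probabilities):
    """Probability-weighted arithmetic mean of the model estimates."""
    return sum(e * p for e, p in zip(estimates, probabilities))

# Hypothetical inputs: two competing models that disagree sharply.
estimates = [10.0, 0.0]     # what each model predicts
priors = [0.5, 0.5]         # subjective prior model probabilities
likelihoods = [0.02, 0.06]  # assumed probability of the data under each model

posteriors = posterior_model_probabilities(priors, likelihoods)
bma_estimate = model_average(estimates, posteriors)

print("posterior model probabilities:", posteriors)
print("model-averaged estimate:", bma_estimate)
```

With these numbers the averaged estimate is about 2.5, a value that neither model predicts.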

The differences between BMA and an unvarnished probability tree are all attributable to different notions of probability.  Like Bayesian schemes in general, BMA is intended to give the probability of causes a mathematical treatment that resembles that used for the probability of chance; and the fixation on the arithmetic mean comes with that package.  A probability tree approach that embellishes a weight-of-the-evidence evaluation is apt to use something like the Bradford-Hill criteria (Hill, 1965) to establish model probabilities, none of which assigns any importance to the arithmetic mean.  Given the fact that assuming the mean is what led Thomas Bayes to criticize Simpson, it seems that the real Bayes would never have approved of BMA.

Along with a range or outer bounds, the mean is perhaps a useful central estimate even when uncertainty arises from competing plausible propositions.   But, since it corresponds to a common legal standard of proof (“preponderance of the evidence”), the median is better for many purposes.  But there may be room for both.  The real problem with BMA is that it proffers the arithmetic mean as the value of interest.  It isn’t really; the value at stake is the truth.  If current science is unable to divulge it, then we really don’t know what to expect.
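To make the comparison concrete, here is a small Python sketch that reports the range, the median, and the mean for a set of competing estimates.  The three estimates and their model probabilities are hypothetical, and the median is taken as the smallest estimate at which the cumulative probability reaches one half, which is my reading of "more likely than not".

```python
def weighted_mean(values, probabilities):
    """Probability-weighted arithmetic mean."""
    return sum(v * p for v, p in zip(values, probabilities))

def weighted_median(values, probabilities):
    """Smallest value at which cumulative probability reaches 0.5."""
    cumulative = 0.0
    for value, prob in sorted(zip(values, probabilities)):
        cumulative += prob
        if cumulative >= 0.5:
            return value
    raise ValueError("probabilities must sum to at least 0.5")

# Hypothetical estimates from three competing models and their probabilities.
values = [0.0, 4.0, 10.0]
probabilities = [0.55, 0.25, 0.20]

summary = {
    "low": min(values),
    "median": weighted_median(values, probabilities),
    "mean": weighted_mean(values, probabilities),
    "high": max(values),
}
print(summary)
```

Here the median is the estimate from the model that is more likely than not to be correct, while the mean (about 3) matches none of the models.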

References

Carrington CD, Murray C, and Tao S (2013).  A Quantitative Assessment of Inorganic Arsenic in Apple Juice.
Evans JS, Graham JD, Gray GM, and Sielken RL Jr (1994).  A distributional approach to characterizing low-dose cancer risk.  Risk Anal 14:25-34.
Hill AB (1965).  The Environment and Disease: Association or Causation?  Proc Royal Soc Med 58:295-300.
Hoeting JA, Madigan D, Raftery AE, and Volinsky CT (1999).  Bayesian Model Averaging: A Tutorial.  Statistical Science 14:382-417.
Stigler SM (1986).  Probabilists and the Measurement of Uncertainty.  In: The History of Statistics: The Measurement of Uncertainty before 1900.  Belknap Press, Cambridge MA, pp. 62-98.

Official Post Soundtrack


Supertramp (1975).  The Meaning.  In: Crisis, What Crisis?, Track 9.

Post Notes

Thesis Post #47.  Best read in conjunction with "A Dictionary of Probability" and "Quantifiers".

Thursday, July 9, 2015

Middle Ground

The “Low Dose” Problem

Evidence of potential harm from arsenic and other contaminants usually comes from epidemiology studies of exposures that are much higher than those that occur in the diet.  Because the effects at these exposures are statistically significant and there is strong evidence that the association is causally related to exposure to the contaminant, this “high-dose” region is the only part of the curve where the data are good enough to empirically characterize the shape of a dose-response curve.  Potential effects at lower doses necessarily involve extrapolation from high doses to the “low-dose” region where exposures from the U.S. diet actually occur.  There is also often a substantial “intermediate-dose” part of the curve that is in between the high- and low-dose regions.

The inevitable question underlying most toxicological assessments is this:  What effects in the low-dose region can be inferred from the demonstrable effects in the high-dose region?  Since effects in the low-dose region are not within the limits of detection, by definition, any claim of an effect or lack thereof must be theoretical.  The scientific debate typically revolves around whether the shape of the dose-response curve is “linear” or “nonlinear”.  If it is “linear”, then it is supposed that the risk at low doses is proportional to the risk at high doses.  If it is nonlinear, then it is supposed that the risk at low doses is negligible, and therefore, no quantification of the risk is necessary.  But, there are many other plausible alternatives.  In particular, the risk at low doses may be linear without being proportional to the effect at high doses.  As a result, a risk assessment isn’t just about what happens at high doses and low doses; it is about what happens in the middle as well.
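Here is a minimal Python sketch of a curve that is linear at low doses without being proportional to the high-dose effect.  The linear-quadratic form and both coefficients are assumptions picked for the illustration; they are not fitted to any data, and the dose units are arbitrary.

```python
import math

# Hypothetical coefficients for a linear-quadratic exposure term.
B1 = 1e-5  # linear coefficient, dominates at low doses
B2 = 1e-6  # quadratic coefficient, dominates at high doses

def risk(dose):
    """Probability of response: 1 - exp(-(B1*d + B2*d^2))."""
    return -math.expm1(-(B1 * dose + B2 * dose * dose))

HIGH_DOSE = 500.0  # assumed dose where effects are observable
LOW_DOSE = 0.1     # assumed dietary-range dose

# "Linear" in the usual sense: scale the high-dose risk down in proportion to dose.
proportional_estimate = risk(HIGH_DOSE) / HIGH_DOSE * LOW_DOSE
# What the curve itself says at the low dose.
model_estimate = risk(LOW_DOSE)

print("proportional extrapolation:", proportional_estimate)
print("curve at low dose:", model_estimate)
print("doubling the low dose multiplies risk by:", risk(2 * LOW_DOSE) / model_estimate)
```

With these made-up coefficients, doubling the low dose very nearly doubles the risk, so the curve is approximately linear there, yet the proportional extrapolation from the high dose overstates the low-dose risk by more than a factor of ten.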

Some Theoretical Alternatives

A comparison of some of the mathematical models used for benchmark dose modeling is illustrative.  The behavior of these models when used to describe the relationship between exposure to inorganic arsenic and lung cancer in NE Taiwan (Chen et al, 2010; Carrington et al, 2013) is illustrated in the following three figures, which show four different models in three different dose ranges.

The High-Dose Region

The Intermediate-Dose Region

The Low-Dose Region

At high doses, all four of the models are nonlinear.  Even the Weibull model, which appears to be linear in Figure 1, becomes nonlinear at doses that result in incidence rates that exceed 50%.  However, near the transition point between the high and intermediate dose ranges there is a large discrepancy among the models.  The Weibull model is almost completely linear and the Probit model is somewhat nonlinear, while the Logprobit and Quantal Hill models are highly nonlinear.  As a result of their nonlinearity in the intermediate dose range, the latter two models are nearly flat at low doses, which is indicative of an incremental risk that is very close to zero.  Although the increase is very small relative to background, the other two models exhibit a noticeable slope in the range of dietary exposure.
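For readers who want to experiment with curve shapes, here is a Python sketch of the four model families.  The functional forms are common textbook parameterizations and may not match the ones actually fitted for the figures, and every parameter value is a placeholder I made up, so the output will not reproduce the figures.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# All default parameter values below are hypothetical placeholders.
def weibull(d, g=0.01, a=1.0, b=0.002):
    return g + (1.0 - g) * -math.expm1(-b * d ** a)

def probit(d, a=-2.33, b=0.004):
    return phi(a + b * d)

def log_probit(d, g=0.01, a=-12.0, b=2.0):
    if d <= 0.0:
        return g  # background only; log(0) is undefined
    return g + (1.0 - g) * phi(a + b * math.log(d))

def hill(d, g=0.01, k=400.0, n=4.0):
    # One Hill-type form; the post does not say which one was used.
    return g + (1.0 - g) * d ** n / (k ** n + d ** n)

def extra_risk(model, d):
    """Risk above background, as a fraction of the unaffected population."""
    p0 = model(0.0)
    return (model(d) - p0) / (1.0 - p0)

MODELS = {"Weibull": weibull, "Probit": probit, "Logprobit": log_probit, "Hill": hill}
DOSES = [1.0, 10.0, 100.0, 500.0]  # arbitrary units

for name, model in MODELS.items():
    row = ", ".join(f"{extra_risk(model, d):.2e}" for d in DOSES)
    print(f"{name:10s} {row}")
```

Changing the placeholder parameters is a quick way to see how much the intermediate-dose shape controls what each model says at the low end.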

No Dichotomy

Given the complexity of biological reality, none of these simple models is likely to be entirely correct:  They are approximations at best.  Nonetheless, they serve to demonstrate that the shape of the curve really does matter.  Just about all plausible curves are nonlinear at some point, yet are still approximately linear at very low doses.  Nonlinearity does not imply that there is a threshold.  Linearity does not imply that the risk is of any significance.  It all depends on how and where the nonlinearities occur, and in the intermediate region theoretical justification is the only game in town.

References

Carrington CD, Murray C, and Tao S (2013).  A Quantitative Assessment of Inorganic Arsenic in Apple Juice.

Chen CL, Chiou HY, Hsu LI, Hsueh YM, Wu MM, and Chen CJ (2010).  Ingested arsenic, characteristics of well water consumption and risk of different histological types of lung cancer in northeastern Taiwan.  Environ Res. 110:455-62.

Official Post Soundtrack

Fixx, The (2003).   Straight 'Round the Bend.  In: Want That Life, Track 7.

Post Notes

Thesis Post #46.  First post in almost a month.  That's mostly because my manifesto is pretty much manifested - I've already covered most of the main ideas I wanted to cover when I set out.