Overview of confidence intervals

Introduction to confidence intervals

A confidence interval is a range of values which quantifies the imprecision in the estimate of a particular value. Specifically, it quantifies the imprecision that results from random variation in the estimation of the value. It does not include imprecision resulting from systematic error (bias).

In the context of research, in many studies the source of this random variation is sampling. Even in the best designed studies there will be random differences between the sample group selected and the overall target population of inference. Any measurement taken from the sample group therefore provides an imprecise estimate of the true population value.

In public health, many indicators are based on data sets which can be considered complete, such as mortality rates based on death registers, rather than on samples. In these instances the imprecision arises not from sampling variation but from ‘natural’ variation. The indicator is influenced by the random occurrences which are inherent in the world around us and so is the result of a stochastic (random) process. The observed value is therefore only one of a set of values which could have occurred under the same circumstances. Generally, in public health, it is the underlying circumstances or process that is of interest, and the actual value observed gives only an imprecise estimate of this ‘underlying risk’. Although we have a ‘complete’ observation of the underlying process, it is not the only possible result that could have occurred from the process, so effectively we have ‘sampled’ from the underlying process.

For example, if we had two identical populations, and in any given year one of them had one or two more deaths than the other, we would not consider this to be a statistically significant difference: it could have been the other way round, purely by chance. We use confidence intervals to summarise this random element of the differences between observations. This helps us to focus on differences that are unlikely to be purely attributed to random variation, and hence suggest differences between the underlying risks in those populations.

The width of the confidence interval depends on three things:

  1. The sample size from which the estimate is derived (or population size if derived from a complete data set). Larger samples, or populations, give more precise estimates with smaller confidence intervals.

  2. The degree of variability in the phenomenon being measured. Fortunately, observed phenomena are often known, or can safely be assumed, to follow certain probability distributions, such as the Poisson or binomial. This allows us to express the amount of variability mathematically and build it into the confidence interval formulae.

  3. The required level of confidence. This is an arbitrary value set by the analyst which gives the desired probability that the interval includes the true value. In medicine and public health the conventional practice is to use 95% confidence, but it is not uncommon to see alternatives such as 99.8% confidence intervals (95% confidence intervals correspond approximately to two standard deviations and 99.8% confidence intervals to three, but the choice is arbitrary). It is sensible to be consistent in the presentation of data, to make it easier to interpret different indicators alongside one another. Increasing the level of confidence results in wider limits. For a given level of confidence, the wider the confidence interval, the greater the uncertainty in the estimate. The effect of sample size and confidence level on interval width is illustrated in the sketch following this list.
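
As an illustration of points 1 and 3, the following sketch shows how the interval narrows as the sample size grows and widens as the confidence level increases. It uses the Wilson score method for a proportion purely as one common choice, with an illustrative observed prevalence of 25%; the specific method used for any given indicator may differ.

```python
from math import sqrt
from scipy.stats import norm

def wilson_ci(p_hat, n, confidence=0.95):
    """Wilson score confidence interval for an observed proportion."""
    z = norm.ppf(1 - (1 - confidence) / 2)   # e.g. 1.96 for 95% confidence
    centre = p_hat + z**2 / (2 * n)
    spread = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    denom = 1 + z**2 / n
    return (centre - spread) / denom, (centre + spread) / denom

# Illustrative observed prevalence of 25%, at two sample sizes and two confidence levels
for n in (200, 2000):
    for confidence in (0.95, 0.998):
        lower, upper = wilson_ci(0.25, n, confidence)
        print(f"n={n:5d}  {confidence:.1%} CI: {lower:.3f} to {upper:.3f}  "
              f"(width {upper - lower:.3f})")
```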

Figure 1 demonstrates confidence intervals for smoking prevalence generated by 20 random samples of 200 people. For sample one the smoking prevalence observed is 24.7% and the associated 95% confidence interval is 19.2% to 31.1%. This means that we are 95% confident that the range 19.2% to 31.1% includes the true population prevalence. On average, 95 out of every 100 of these intervals will include the population value. This can be seen in the figure where 19 of the 20 intervals (95%) include the 25% population prevalence, the exception being sample 9. In many public health applications, we do not have the luxury of repeated samples, but the confidence intervals are calculated in such a way as to ensure that if we did, then 95% of them would contain the true value.

Figure 1: Smoking prevalence: sample estimates and 95% confidence intervals for 20 random samples of 200 persons from a population with a smoking prevalence of 25%
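
The behaviour shown in Figure 1 can be reproduced by simulation. The minimal sketch below assumes a true prevalence of 25% and samples of 200 people, as in Figure 1, and uses the Wilson score interval from the statsmodels library as an illustrative choice (the method actually used to produce Figure 1 is not stated here). Over many repeated samples, the proportion of 95% intervals that contain the true prevalence settles close to 95%.

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(1)           # seeded for reproducibility
true_prevalence = 0.25                   # the 'population' value, as in Figure 1
n_per_sample = 200
n_samples = 10_000                       # many more than the 20 shown in the figure

covered = 0
for _ in range(n_samples):
    smokers = rng.binomial(n_per_sample, true_prevalence)   # simulate one sample
    lower, upper = proportion_confint(smokers, n_per_sample,
                                      alpha=0.05, method="wilson")
    if lower <= true_prevalence <= upper:
        covered += 1

print(f"{covered / n_samples:.1%} of the 95% intervals contain the true prevalence")
```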

When should confidence intervals be used?

Confidence intervals should be presented alongside the point estimate wherever an inference is being made from a sample to a population or from a set of observations to the underlying process (or ‘risk’) that generated them. This covers virtually any public health information being published, because we always intend to infer something from it. It is particularly important to provide confidence intervals when numbers are small, and hence the random component of variation is likely to be large. This avoids over-interpretation of differences.

When should confidence intervals not be used?

In general, if confidence intervals can be calculated robustly there is no reason not to use them. However, if the numbers are very large they may add little, as the limits may be almost identical to the indicator value.

In some cases, when looking at the relationship between an outcome and a predictor that gives a gradient, adding confidence intervals to each decile value, while not incorrect, may make presentation of the gradient less clear. For example, with data for mortality rates across a range of deprivation deciles, the gradient across the deciles may be of more interest than individual comparisons between deciles.

Confidence intervals should only be calculated if they are correct. Examples where it may be impossible to calculate confidence intervals robustly could include cases where:

  • the events recorded as the numerator are known not to be independent of each other (such as multiple follow-up appointments in hospital)

  • there are known systematic differences between the data collection methods

  • the distributions underlying the occurrence of events are complex or unknown

It may be possible to calculate confidence intervals in these cases, using adjusted methods or simulation: examples are covered in the relevant sections of this guidance. Where it is not possible, it may still be appropriate to publish the data with caveats around interpretation.

Estimation of confidence intervals using simulation

For most statistics, we can construct confidence intervals by making appropriate distributional assumptions: for example, proportions can be assumed to follow a binomial distribution, and counts or rates a Poisson distribution.
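
As an illustration of these distributional assumptions, the sketch below computes an exact (Clopper-Pearson) binomial interval for a proportion and an exact Poisson interval for a count, using standard relationships with the beta and chi-squared distributions. The figures are invented, and these are illustrative choices only, not necessarily the formulae used for any particular indicator.

```python
from scipy.stats import beta, chi2

def binomial_exact_ci(k, n, confidence=0.95):
    """Exact (Clopper-Pearson) interval for a proportion of k events in n trials."""
    a = 1 - confidence
    lower = beta.ppf(a / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - a / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

def poisson_exact_ci(k, confidence=0.95):
    """Exact interval for a Poisson count of k events."""
    a = 1 - confidence
    lower = chi2.ppf(a / 2, 2 * k) / 2 if k > 0 else 0.0
    upper = chi2.ppf(1 - a / 2, 2 * (k + 1)) / 2
    return lower, upper

# Invented figures: 50 smokers in a sample of 200, and 30 deaths in a year
print(binomial_exact_ci(50, 200))   # CI for a proportion of 0.25
print(poisson_exact_ci(30))         # CI for a count of 30; divide by person-years for a rate
```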

However, some statistics or indicators have quite complex calculations, with several sources of uncertainty. In these cases it can be impossible or impractical to calculate confidence intervals directly from theoretical assumptions. However, we can often estimate the confidence intervals by stochastic (random) simulation (also called Monte Carlo methods). Other terms used for specific simulation methods are jackknife and bootstrap, and they are all very closely related.

Each case requires a specific model to be built which reflects the construction of the statistic. The model is used to randomly generate a very large number of values of the indicator, and the confidence interval is estimated from the distribution of those random values. For example, if 100,000 values have been generated randomly, the 2,500th smallest and the 2,500th largest values are very good estimates of the lower and upper 95% confidence limits.

Random numbers are generated using an algorithm such as the Mersenne Twister. These are then transformed to the appropriate distribution for each element in the model, for example, binomial or Poisson with a given mean and variance.
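
A minimal sketch of this general approach is given below, assuming a made-up indicator defined as the ratio of two independent Poisson counts, purely for illustration: random values are generated with a Mersenne Twister generator, transformed to the assumed distributions, and the 95% limits read off as the 2,500th smallest and 2,500th largest of 100,000 simulated values.

```python
import numpy as np

rng = np.random.Generator(np.random.MT19937(42))   # Mersenne Twister, seeded

n_sims = 100_000
# Hypothetical indicator: the ratio of two observed counts, each assumed Poisson
observed_events, observed_expected = 120, 100
simulated = (rng.poisson(observed_events, n_sims) /
             rng.poisson(observed_expected, n_sims))

simulated.sort()
lower = simulated[2_500 - 1]        # 2,500th smallest value
upper = simulated[n_sims - 2_500]   # 2,500th largest value
print(f"Simulated 95% interval for the ratio: {lower:.2f} to {upper:.2f}")
```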

An example is the slope index of inequality (SII). The SII is usually calculated from 10 values of an indicator (one for each deprivation decile within the overall population). Each of the decile values has a confidence interval, having been based on observations (for example mortality or life expectancy). The weighted combination of these decile values is complex, but we can construct a simulation based on the uncertainty described by the decile confidence intervals. The method is set out in the PHE Slope Index of Inequality Tool (Simulated CIs).
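
As a simplified sketch of that approach (not the PHE tool itself): suppose each deprivation decile has an indicator value with a 95% confidence interval, and that each value can be treated as approximately normal with equal population in each decile. Each simulation draws one value per decile, fits a straight line across the deciles, and records the slope; the percentiles of the simulated slopes then give the confidence interval for the SII. The decile values below are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented decile values (most to least deprived) with invented 95% limits
values = np.array([310, 295, 280, 270, 255, 250, 240, 230, 225, 215], dtype=float)
lowers = values - 20
uppers = values + 20
ses = (uppers - lowers) / (2 * 1.96)      # approximate standard errors from the CIs

# Midpoints of each decile on the 0 to 1 deprivation scale (equal decile populations assumed)
x = (np.arange(10) + 0.5) / 10

n_sims = 100_000
slopes = np.empty(n_sims)
for i in range(n_sims):
    simulated_values = rng.normal(values, ses)          # one random draw per decile
    slopes[i] = np.polyfit(x, simulated_values, 1)[0]   # slope of the fitted line = SII

slopes.sort()
print(f"SII point estimate: {np.polyfit(x, values, 1)[0]:.1f}")
print(f"Simulated 95% CI:   {slopes[2_500 - 1]:.1f} to {slopes[n_sims - 2_500]:.1f}")
```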

Specific applications of simulation for confidence intervals will be set out individually in this guidance.

Presentation of confidence intervals

When presenting the confidence interval, the level of confidence should be clearly stated and the lower and upper limits of the interval clearly labelled or presented as a range. The use of the ± sign or a dash to indicate the range should be avoided. The examples below should be followed instead:

Area     Statistic   95% confidence interval
Area 1   95.2        Lower limit 93.5, upper limit 96.9
Area 1   95.2        93.5 to 96.9
Area 1   95.2        (93.5, 96.9)

For Area 1 the statistic was 95.2 (95% confidence interval: 93.5 to 96.9).

In graphs or charts, the confidence limits are usually represented as bars or whiskers extending above and below the value of the estimate. Figure 1 is a typical example.

Using confidence intervals for making comparisons

The confidence interval may be used to compare an estimate against a target or benchmark value in a similar way to performing an exact statistical test. In such a test we proceed on the assumption that the true value is equal to the target or benchmark value. The appropriate probability distribution is then used to determine the probability (p-value) of observing a value as extreme, in either direction (a two-sided test), as the actual observed estimate. If the p-value is below a given significance threshold, α, then the original assumption is rejected and the difference is said to be statistically significant.

If the target or benchmark value is outside the confidence interval, it can be inferred that the difference between the estimate and the target is statistically significant at the corresponding significance level. For instance, if the target value is outside the 95% confidence interval, then the difference is significant at the α = 5% level (p<0.05).
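
For example (with invented figures), suppose a local sample of 200 people found 65 smokers and the benchmark prevalence is 25%. The sketch below uses scipy's exact binomial test and a Wilson interval as illustrative choices, and shows the two equivalent ways of reaching a conclusion: the p-value of the test, and whether the benchmark falls outside the 95% confidence interval.

```python
from scipy.stats import binomtest

observed, n, benchmark = 65, 200, 0.25     # invented figures

result = binomtest(observed, n, p=benchmark, alternative="two-sided")
ci = result.proportion_ci(confidence_level=0.95, method="wilson")

print(f"Observed prevalence: {observed / n:.1%}")
print(f"95% CI: {ci.low:.1%} to {ci.high:.1%}")
print(f"p-value against benchmark of {benchmark:.0%}: {result.pvalue:.3f}")
print("Benchmark outside CI" if not (ci.low <= benchmark <= ci.high)
      else "Benchmark inside CI")
```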

Comparisons are often made between two or more different estimates, for example, between different organisations or time periods. Sometimes in such cases statistical testing is undertaken by comparing the confidence intervals of the estimates to see if they overlap, with confidence intervals which do not overlap being considered as statistically significantly different. This is quick and easy to do, but not necessarily conclusive. While it is safe to assume that confidence intervals which do not overlap indicate a statistically significant difference, it is not always the case that overlapping confidence intervals mean that there is no significant difference. A more exact approach is to calculate the ratio of the two estimates, or the difference between them, and construct a test or confidence interval based on that statistic. Such methods are not covered in this guidance but can be found in standard textbooks.
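
The sketch below illustrates this with invented figures, using a simple normal approximation for the difference between two independent proportions (other methods exist and may be preferable for small numbers). In this example the two individual 95% intervals overlap, yet the interval for the difference excludes zero, so the difference is statistically significant at the 5% level.

```python
from math import sqrt

def normal_approx_ci(p, n, z=1.96):
    """Normal-approximation 95% CI for a proportion (illustration only)."""
    se = sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Invented figures for two areas
p1, n1 = 0.32, 500
p2, n2 = 0.25, 500

print("Area 1 CI:", normal_approx_ci(p1, n1))
print("Area 2 CI:", normal_approx_ci(p2, n2))   # these two intervals overlap

# CI for the difference between the two proportions
diff = p1 - p2
se_diff = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print("Difference CI:", (diff - 1.96 * se_diff, diff + 1.96 * se_diff))
# The difference interval excludes zero, so the difference is statistically
# significant at the 5% level even though the individual intervals overlap.
```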


Page last updated: August 2024