The following information was published in August 2009 as part of a series of technical briefings on public health methods and techniques. The text below is a summary of the original briefing which can be accessed in full on the public health methods guidance pages.
Jo Watson, Stacey Croft, Heather Heard, and Philip Mills: APHO
Each nation within the UK and Ireland has primary geographical units for local administration of public services and, naturally, data tend to be widely available to describe and compare these populations. However, these areas can have widely varying, and sometimes quite large, populations. For example, Essex Upper Tier Local Authority and Lancashire Upper Tier Local Authority both have populations of over one million. Such large populations are very diverse and contain communities with different characteristics. Detailed knowledge of the demography, socioeconomic structure, and usage of services in small areas is required in order to assess their needs and to provide appropriate services. For the purposes of this briefing, small areas are taken to include any geography below local authority district level in England.
There are four principal categories of small area which can be used to produce data:
This briefing concentrates on area-based populations for census and electoral areas, as these are the geographies used in local health. There are other non-geographical ways of defining populations, e.g. users of a particular service (patients registered with a GP practice, children attending a particular school, etc.) or people with particular characteristics (aged over 65, Asian ethnic origin, etc.). Many of the analytical issues discussed also apply to non-geographically defined populations.
Census geography is the hierarchical set of areas defined for the release of Census data in the UK. In England, it includes Census output areas (OAs), lower super output areas (LSOAs), and middle layer super output areas (MSOAs).
OAs were defined by computational analysis to be as homogeneous as possible, within the constraints set on population size, and the intention was that they would reflect natural communities as much as possible. Most of the Census results are published at this level, with small area counts adjusted to prevent the disclosure of individuals’ responses from the figures.
Following the 2001 Census, the Office for National Statistics (ONS) introduced the super output area (SOA), which was intended to provide a stable, permanent geography that could be used to publish a wide range of statistics on a consistent basis. To allow for their use in a wide range of applications, three levels of SOA were proposed for England and Wales. Lower super output areas (LSOAs) were generated automatically by ONS, constrained to the electoral ward boundaries at the time (see later), and hence could (and still can) be grouped to be coterminous with local authorities (LAs). Middle super output areas (MSOAs) nest within LAs and are not constrained by ward boundaries. Upper super output areas have never been defined but were planned to be grouped to be coterminous with LAs. Ref. 1
The first data to be published at SOA level were the Indices of Multiple Deprivation 2004. These indices, revised in 2007, provide data on various aspects of socioeconomic status at small area level and provide very useful information to support the assessment of health needs and interpretation of health outcomes. Ref. 2
Electoral areas include wards, civil parishes and parliamentary constituencies in England, Wales and Scotland; wards and parliamentary constituencies in Northern Ireland; and electoral divisions in the Republic of Ireland. The electoral ward’s primary purpose is to provide the constituencies for local elections; each ward elects one or more councillors to the LA. However, until the publication of the 2001 Census, wards were also used by ONS and others as the standard geography for data publication below LA level. Each LA is made up of a number of wards and, while wards vary hugely in size between districts (from a few hundred residents to over 30,000), they tend to be reasonably consistent within a district, to ensure residents are fairly represented by their local councillors. For the same reason, they are subject to regular review and, in areas with substantial levels of development or population movement, they may undergo frequent change.
The electoral wards that were in place or on the statute books at the time the 2001 Census was published have continued to be used for data publication by ONS, to provide comparability over time, and are referred to as (2003) statistical wards. Census Area Statistics and Standard Tables were released using slight variants (CAS wards and ST wards) which merge the smallest statistical wards. Ref. 4 Civil parishes are now defined only in rural areas and are not commonly used outside a very local context. UK parliamentary constituencies have an average total population of about 90,000. They are not coterminous with Local Authorities but are used for a range of statistics of particular interest to members of parliament and constituency organisations.
Analysing data at the smallest area level increases the granularity of the results, giving several advantages:
As mentioned in the previous section, data may not be available to be attributed to small areas (e.g. where data are published aggregated to a higher level of geography). Furthermore, variations in data quality or completeness become more important factors as the numbers get smaller. A key example is population data, required as denominators for comparison of areas through calculation of rates and discussed below. However, even where the availability and format of data technically allow very small area analysis, there can be problems that prevent such analysis or render the results unreliable or even meaningless. The most obvious and common problem with small areas is a simple function of their small populations: as the size of the population reduces, so does the reliability of statistics calculated for those areas. Hence it becomes increasingly difficult to observe any statistically significant differences between areas, as the size of random variation in the area statistics can mask the underlying differences. Similarly, real changes over time can be hidden by massive year-to-year random variation. It is particularly important to present small area data with confidence intervals or P values to avoid over-interpretation of apparent differences (see Technical Briefing 3: Commonly used public health statistics and their confidence intervals for further information).
When publishing health data, it is a primary concern that individuals should not be identifiable from the data, either directly or indirectly. If data are published for very small populations, small counts can, in certain circumstances, have the effect of disclosing information about individuals. For example, if there is known to be only one elderly man in a Census OA, and a published breakdown of the ages of residents gives a ‘1’ for one particular age, with zeros for all other categories around it, local people can deduce the age of that neighbour. This would undermine the confidentiality of the Census.
For many datasets analysed within the health and LA sectors, data disclosed could be a great deal more sensitive than just a person’s age. Hence a great deal of effort goes into disclosure control when data are published for small areas. The usual approach is simply to suppress counts of, say, three or fewer and rates derived from them, but there are many alternative methods. The UK Census 2001 used a combination of record swapping (randomly muddling the individual records prior to any analysis to a degree that would not affect aggregate analyses), minimum thresholds for population and households, and rounding of cell counts. Ref. 10
Specific guidelines are often set by data owners and vary in detail. Some data owners, such as UK cancer registries, also specify a minimum denominator population for publication of results. Guidelines set by data owners should be carefully observed, but in the absence of specific guidelines it will generally be safe to suppress cells in tables that are based on fewer than five individuals, and any other cells which might enable data users to derive suppressed results. In most situations, results based on fewer than five cases will have very wide confidence intervals and hence be uninformative in any case.
This would not be true when analysing potential clusters of very rare conditions, where even two or three cases together may be highly significant, but while such data may be appropriate for public health surveillance, they should not be made available outside the appropriate authorities (note that the Freedom of Information Act does not require data to be made available if there is a risk of disclosing confidential information). It is important to ensure that suppressed values cannot be derived from other data in a table, e.g. if a column of data has one suppressed value, but the correct total, then the suppressed value can easily be obtained by subtraction.
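The suppression rule and the subtraction problem described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the threshold of five is the fallback suggested above for cases without data owner guidance, zero counts are left visible (data owners may treat zeros differently), and the function name `suppress_column` and the example counts are hypothetical.

```python
from typing import List, Optional


def suppress_column(counts: List[int], threshold: int = 5) -> List[Optional[int]]:
    """Return the counts with small cells replaced by None (suppressed).

    Primary suppression: hide any non-zero count below `threshold`.
    Secondary suppression: if exactly one cell is hidden, also hide the
    smallest remaining cell, so that subtracting the visible cells from
    a published column total cannot recover the hidden value.
    Assumption: zero counts stay visible.
    """
    out: List[Optional[int]] = [None if 0 < c < threshold else c for c in counts]
    hidden = [i for i, v in enumerate(out) if v is None]
    if len(hidden) == 1:
        visible = [i for i, v in enumerate(out) if v is not None]
        if visible:
            smallest = min(visible, key=lambda i: counts[i])
            out[smallest] = None
    return out


# Hypothetical counts for four areas, published alongside their total of 45.
counts = [12, 3, 20, 10]
print(suppress_column(counts))
```

With only the count of 3 hidden, a reader could recover it as 45 - 12 - 20 - 10; hiding a second cell leaves only the sum of the two hidden cells deducible.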
This is a simple example of an effect called disclosure by differencing. In fact, reasonable efforts must be made to ensure that no suppressed values can be derived through combination of the data being published and any other published data. A similar situation arises when comparable data are published for a range of different geographies. For example, where results are published at both ward and LSOA level, by differencing the results for areas with slightly different boundaries it may be possible to produce information about an identifiable individual. It is wise to be very wary of producing the same datasets on the basis of more than one type of small area and be aware of other publications of the same data. More detailed confidentiality guidance has been published by ONS, Ref. 12 and by NHS National Services Scotland in the ISD Statistical Disclosure Control Protocol. Ref. 13
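A basic check for disclosure by differencing across geographies can be done with simple arithmetic before publication. The sketch below assumes a hypothetical ward that contains three LSOAs plus a small sliver of territory not covered by them; the counts, the threshold of five, and the function names are all invented for illustration.

```python
from typing import List


def differencing_residual(larger_area_count: int, nested_counts: List[int]) -> int:
    """Count attributable to the part of the larger area outside the nested areas."""
    return larger_area_count - sum(nested_counts)


def is_disclosive(residual: int, threshold: int = 5) -> bool:
    """Flag a non-zero residual that is smaller than the suppression threshold."""
    return 0 < residual < threshold


# Hypothetical published counts: one ward and the three LSOAs that mostly fill it.
ward_count = 48
lsoa_counts = [15, 17, 14]

residual = differencing_residual(ward_count, lsoa_counts)
print(residual, is_disclosive(residual))
```

None of the four published counts is small, yet together they imply a count of two for the uncovered sliver, which is why publishing the same dataset on more than one small area geography needs care.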
UK resident population estimates are published annually (usually at least a year behind) for administrative and electoral area residents. Since the underlying source is the decennial Census, the population estimates have the limitations of all Census datasets, including the reliability of the ages and ethnicity recorded and the under-recording of people in categories such as young adults, travellers, homeless people, and illegal immigrants. Figures are adjusted to take account of these problems, but the effectiveness of the adjustments is difficult to assess.
The Census figures have to be adjusted further to take account of the ageing of the population, births, deaths, migration, and other changes (such as movements of armed forces). Some additional information is provided by GP practice registers and electoral rolls which are updated annually. However, in areas with high levels of population movement, such as inner cities and university cities, practice populations and electoral returns are unlikely to be more reliable than Census returns for the categories mentioned.
In public health terms, rates are generally of greater interest than numbers because different populations can be compared with each other and with national or regional averages and the statistical significance of the results can be assessed. Methods for calculating and comparing rates are described fully in Technical Briefing 3 Ref. 9 and all of the principles apply equally to small area data.
However, there are some issues that arise when using these methods for small area analysis which are highlighted here. These comments should be read in conjunction with Technical Briefing 3. Ref. 9 When standardising rates to account for the different age and sex structures of different areas, there are some advantages to using indirect standardisation rather than direct standardisation for areas with smaller populations.
By contrast, to calculate directly standardised rates for aggregated areas it is necessary to recalculate the rates from scratch. It must be noted, however, that indirectly standardised ratios only allow comparison between the small area rate and the reference rate: comparisons between different small areas can be misleading. All rates within Local health are calculated using indirect standardisation.
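As a rough illustration of indirect standardisation, the sketch below computes a standardised ratio as observed events divided by expected events, where the expected count comes from applying reference age-specific rates to the local population. It also shows a ratio for a group of areas being obtained by summing the observed and expected counts. The age bands, rates, populations, and function names are hypothetical, and the ratio is scaled so that 100 equals the reference.

```python
from typing import Dict


def expected_events(population: Dict[str, int], reference_rates: Dict[str, float]) -> float:
    """Events expected if the area experienced the reference age-specific rates."""
    return sum(population[age] * reference_rates[age] for age in population)


def standardised_ratio(observed: float, expected: float) -> float:
    """Indirectly standardised ratio, scaled so that 100 equals the reference."""
    return 100.0 * observed / expected


# Hypothetical reference rates (events per person per year) by age band.
reference_rates = {"0-64": 0.002, "65+": 0.05}

# Hypothetical populations and observed events for two small areas.
area_a_pop = {"0-64": 4000, "65+": 1000}
area_b_pop = {"0-64": 6000, "65+": 500}
area_a_obs, area_b_obs = 70, 30

exp_a = expected_events(area_a_pop, reference_rates)
exp_b = expected_events(area_b_pop, reference_rates)

ratio_a = standardised_ratio(area_a_obs, exp_a)
ratio_b = standardised_ratio(area_b_obs, exp_b)

# Aggregating the two areas only needs the summed observed and expected counts.
ratio_combined = standardised_ratio(area_a_obs + area_b_obs, exp_a + exp_b)

print(round(ratio_a, 1), round(ratio_b, 1), round(ratio_combined, 1))
```

Each ratio compares an area with the reference population only; because the two areas have different age structures, the two ratios should not be read as a direct comparison of one area against the other.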
The calculation of confidence intervals and significance testing of results are also dealt with in Technical Briefing 3. Ref. 9 With small numbers of cases in small areas, it is particularly important that underlying variability is highlighted clearly by use of confidence intervals. Significant variations may still be observed in small areas, especially where there is a specific factor, such as the presence of a nursing home or prison.
Figure 1 below indicates the way that confidence intervals increase as the population size is reduced. The larger the confidence interval, the lower the likelihood of achieving statistically significant differences.
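The widening of confidence intervals as populations shrink can be reproduced numerically. The sketch below assumes a proportion with a constant underlying value of 1% and uses the Wilson score interval, a standard method for proportions; the choice of method, the population sizes, and the function name are assumptions made for illustration rather than a description of how Figure 1 was produced.

```python
import math
from typing import Tuple


def wilson_interval(events: int, population: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score confidence interval for a proportion (z = 1.96 gives 95%)."""
    p = events / population
    denom = 1 + z ** 2 / population
    centre = (p + z ** 2 / (2 * population)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / population + z ** 2 / (4 * population ** 2)) / denom
    )
    return centre - half_width, centre + half_width


# Same underlying proportion (1%) observed in populations of decreasing size.
for population in (100_000, 10_000, 1_000):
    events = population // 100
    low, high = wilson_interval(events, population)
    print(f"population {population:>7}: {100 * low:.2f}% to {100 * high:.2f}%")
```

Running the loop shows the interval around the same 1% estimate becoming progressively wider as the population falls.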
The problems of small numbers and disclosure can both be avoided by aggregating data from small areas into groups, defined in various ways. Data can be aggregated on the basis of geography, time, age group, or other characteristics such as socio-economic group or deprivation score. In each case, some detail or specificity of the information is lost in order to improve its statistical robustness. Aggregation over time involves grouping, for example, three or five years together. For visual purposes, a moving average can be graphed to smooth annual variations and give a clearer impression of underlying trends. A disadvantage of grouping data over several years is that the data become less timely. Time series or forecasting methods can be used to estimate more up-to-date underlying rates where observed values fluctuate from year to year. These methods will be covered in a future technical briefing on analysing trends and forecasting. Geographical aggregation involves grouping smaller areas together and need not necessarily be constrained to adjacent geographical areas.
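Aggregation over time can be sketched as a rolling pooled rate: events and populations are each summed over a window of consecutive years before dividing, which smooths annual fluctuations at the cost of timeliness. The annual figures, the three-year window, and the function name below are hypothetical.

```python
from typing import List


def rolling_pooled_rate(
    events: List[int],
    populations: List[int],
    window: int = 3,
    per: int = 100_000,
) -> List[float]:
    """Pooled rate per `per` population for each run of `window` consecutive years."""
    if len(events) != len(populations):
        raise ValueError("events and populations must cover the same years")
    rates = []
    for start in range(len(events) - window + 1):
        pooled_events = sum(events[start:start + window])
        pooled_population = sum(populations[start:start + window])
        rates.append(per * pooled_events / pooled_population)
    return rates


# Hypothetical annual counts for one small area over five years.
events = [4, 9, 2, 7, 5]
populations = [5000, 5000, 5000, 5000, 5000]

annual = [100_000 * e / n for e, n in zip(events, populations)]
pooled = rolling_pooled_rate(events, populations)
print([round(r, 1) for r in annual])
print([round(r, 1) for r in pooled])
```

The annual rates swing widely from year to year, while the three-year pooled rates vary far less.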
Mapping data using GIS can be a helpful way of presenting small area data. In particular, for abstract areas without names, such as OAs, showing the data on a map can make them comprehensible, because people can see which places the data relate to. However, displaying the data on a map does nothing to overcome the small number issues discussed above. If the data are dominated by random variation, a kaleidoscopic effect is created, obscuring any geographical patterns. There are many spatial analytical methods which can help to interpret small area data. Detailed description of these methods is beyond the scope of this briefing.
However, methods that should be considered include spatial smoothing Ref. 17, Ref. 18 (whereby each area’s data value is replaced by an average for the area and its immediate neighbours) and Bayesian methods for identifying significant clusters (e.g. WinBUGS). Ref. 19 When displaying data on maps, it should be noted that areas with a low population density tend to be most prominent on such maps because of their size, whereas areas with a high population density are so much smaller that they may be almost invisible. This can be tackled partially by presenting separate maps on different scales for urban and rural areas, or more radically by using cartograms. Ref. 20
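The simplest form of the spatial smoothing described above replaces each area's value with the unweighted mean of its own value and those of its immediate neighbours. The sketch below is a minimal version under that assumption; the area codes, values, adjacency lists, and function name are hypothetical, and more refined schemes (for example, weighting by population) are not shown.

```python
from typing import Dict, List


def smooth(values: Dict[str, float], neighbours: Dict[str, List[str]]) -> Dict[str, float]:
    """Replace each area's value with the mean of itself and its neighbours."""
    smoothed = {}
    for area, value in values.items():
        group = [value] + [values[n] for n in neighbours.get(area, [])]
        smoothed[area] = sum(group) / len(group)
    return smoothed


# Hypothetical rates for three areas in a row: A borders B, and B borders C.
values = {"A": 10.0, "B": 40.0, "C": 10.0}
neighbours = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

print(smooth(values, neighbours))
```

The isolated high value in area B is pulled towards its neighbours, which reduces noise but would also dampen a genuine local effect.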
There are many ways of presenting area-based data, either as profiles of an area or comparatively across areas. In general, best practice methodology applies to small area data exactly as for data for larger areas, so display methods which clearly show the statistical significance of variations, such as spine charts (Figure 3), as used in the APHO Health Profiles, Ref. 22 or funnel plots (Figure 4), should be considered. Funnel plots, which are explained fully in Technical Briefing 2: Statistical process control methods in public health intelligence, Ref. 23 have the advantage of being able to present data for hundreds of areas on a single graph, highlighting statistical outliers very clearly. As mentioned above, GIS software can be used to present data on maps, which helps people relate the data to their own local knowledge of areas.
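The idea behind a funnel plot can be sketched by computing control limits around an overall proportion that narrow as the denominator grows. The example below uses a normal approximation with z = 3.09 (roughly 99.8% limits); this approximation, the overall proportion, the area figures, and the function names are assumptions made for illustration, and Technical Briefing 2 should be consulted for the recommended methods.

```python
import math
from typing import Tuple


def funnel_limits(p0: float, n: int, z: float = 3.09) -> Tuple[float, float]:
    """Approximate control limits around the overall proportion p0 for denominator n."""
    se = math.sqrt(p0 * (1 - p0) / n)
    return max(0.0, p0 - z * se), min(1.0, p0 + z * se)


def outside_funnel(events: int, n: int, p0: float, z: float = 3.09) -> bool:
    """True if the area's observed proportion lies outside the control limits."""
    low, high = funnel_limits(p0, n, z)
    proportion = events / n
    return proportion < low or proportion > high


# Hypothetical overall proportion of 10% and two areas that both observe 15%.
overall = 0.10
print(outside_funnel(60, 400, overall))  # larger area
print(outside_funnel(15, 100, overall))  # smaller area
```

Both areas observe the same proportion, but only the larger one falls outside the limits, because the funnel is much wider at small denominators.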
1. ONS. Beginner’s Guide to UK Geography. Available at http://www.statistics.gov.uk/geography/beginners_guide.asp
2. Department of Communities and Local Government. Indices of Deprivation 2007. Available at http://www.communities.gov.uk/communities/neighbourhoodrenewal/deprivation/deprivation07/
3. The Scottish Government. Scottish Neighbourhood Statistics. Available at http://www.sns.gov.uk/
4. ONS. Statistical Wards, CAS Wards and ST Wards. Available at http://www.statistics.gov.uk/geography/statistical_cas_st_wards.asp
5. Cabinet Office. UK Government Data Standards Catalogue. Available at http://www.govtalk.gov.uk/gdsc/html/frames/Postcode.htm
6. NHS. NHS Postcode Directory. Available at http://www.datadictionary.nhs.uk/web_site_content/supporting_information/nhs_postcode_directory.asp?shownav=1
7. Abbas J et al. Technical Briefing 5: Geodemographic Segmentation. York: APHO; 2009. Available at http://www.apho.org.uk/resource/item.aspx?RID=67914
8. Department of Health. Quality and Outcomes Framework. Available at http://www.dh.gov.uk/en/Healthcare/Primarycare/Primarycarecontracting/QOF/DH_099079
9. Eayres D. Technical Briefing 3: Commonly used public health statistics and their confidence intervals. York: APHO; 2008. Available at http://www.apho.org.uk/resource/item.aspx?RID=48457
10. ONS. 2001 Census Disclosure Control. London: ONS; 2001. Available at http://www.statistics.gov.uk/about_ns/downloads/info_to_commission/AG(01)06_Disclosure_Control.doc
11. OPSI. Freedom of Information Act 2000. Available at http://www.opsi.gov.uk/acts/acts2000/ukpga_20000036_en_1
12. ONS. Review of the Dissemination of Health Statistics: Confidentiality Guidance (Working Paper 4: Glossary). Available at http://www.statistics.gov.uk/about/Consultations/downloads/Health_Stats/Health_Stats_4_Glossary.pdf
13. ISD Scotland. ISD Statistical Disclosure Control Protocol. 2009. Available at http://www.isdscotland.org/isd/4489.html#smallNumbers
14. Arrundale J et al. Handbook and Guide to the Investigation of Clusters of Diseases. London: Leukaemia Research Fund; 1997.
15. Vickers D. Introducing Clustering, Area Classification and Geodemographics. In: Multi-Level Integrated Classifications: Based on the 2001 Census. PhD thesis. Leeds: Department of Geography, University of Leeds; 2006.
16. Shewan J. Catchment areas and populations. Cambridge: ERPHO; 2003. Available at http://www.erpho.org.uk/viewResource.aspx?id=14754
17. Holmes N. Spatial Smoothing. BURISA 2006;No.167:9-13. Available at http://www.burisa.org/Temp/167.pdf
18. Baker A, Ralphs M, Griffiths C. Standardised Mortality Ratios - the effect of smoothing ward-level results. Health Statistics Quarterly 2008;48.
19. Lunn DJ et al. WinBUGS - a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing 2000;10:325-337.
20. Krygier JK, Wood D. Making Maps: A Visual Guide to Map Design for GIS. New York: Guilford Press; 2005.
21. APHO. Prevalence Modelling. Available at http://www.apho.org.uk/resource/view.aspx?RID=48308
22. APHO. Health Profiles. Department of Health; 2009. Available at http://www.healthprofiles.info/
23. Flowers J. Technical Briefing 2: Statistical process control methods in public health intelligence. York: APHO; 2008. Available at http://www.apho.org.uk/resource/item.aspx?RID=39445
All links accessed 27 July 2009.