Academics are sometimes accused of living in an ‘Ivory Tower’, protected from the chaos of the world outside yet completely disconnected from it. This opinion is well articulated in a Guardian article by Aditya Chakrabortty. The problem has, in his view, got worse with “intellectual cleansing at many universities.” See below for my take on the issue and ways around it, with an example from my own research.
For my research into the energy costs of commuting, two types of dataset are useful. The first is geographical aggregates: counts of the number of people fitting into different categories, with each row representing a bounded area. The second reports characteristics at the individual level but provides little or no information about location. There is no publicly available dataset that contains both a large number of individuals and ‘georeferenced’ information about where they live.
This problem can be tackled through a technique known as ‘spatial microsimulation’: the allocation of individuals from a non-geographical dataset to administrative zones to simulate populations in local areas. In this context, the following post serves two purposes: as an introduction to spatial microsimulation and as an overview (with links to worked examples) of how it can be performed in R. It is based on a paper just published by my PhD supervisor and me in Computers, Environment and Urban Systems (an essentially identical preprint is on arXiv, with a ‘user manual’ appended). The IPFinR project, with plenty of R code and test data, is available on GitHub.
Figure 1: Levels of geographical aggregation used by UK data dissemination services
The technique, which is also called ‘population synthesis’ (by transport modellers), is useful in various situations: many governments collect information about every individual in the nation, but only release the data as counts at administrative zones for confidentiality reasons: publishing lists of individuals by area could allow people to be identified, with the potential to reduce public trust in the Census, undermining the process as a whole. On the other hand, making data available clearly has many benefits for academics, companies and local authorities. Therefore data is usually made available, but only as aggregated counts, at a sufficiently coarse level of geography and/or cross-tabulation to make identification of individuals highly unlikely. In general, the amount of detail provided (variables and cross-tabulations of variables) tends to increase with the size of the areal units, as illustrated by the administrative levels of the UK (Fig. 1).
The other common source of data about people comes from surveys. These tend to be funded by the government or public/private research institutions to investigate particular issues in greater detail than is covered in the census. Again, measures are taken to prevent any individual being identified: in this case geographical data is removed completely (a national survey) or supplied at only a very coarse level (e.g. the Government Office regions displayed in Fig. 1).
Clearly both data types have advantages: individual-level data can be used to analyse the distribution of continuous variables such as age or distance travelled to work, and the dependencies between different variables; the geographical data can be used to find out about spatial patterns. The problem emerges when we want to combine the advantages of both data types, for example to estimate the distribution of distances travelled to work within small areas, or how income (available in individual-level data, but not in geographically aggregated data) varies over space.
Spatial microsimulation solves this problem by allocating individuals to zones (Fig. 2). This can be done by creating a matrix of weights (rows representing individuals, columns representing areas) to allocate the individuals of a survey to zones. The allocation depends on variables shared between the two datasets, known as constraint variables. As well as this common scenario of matching individual and geographic datasets, the technique is also useful in a range of other situations, from modelling the distribution of car models in a particular fleet (Clarke, 2013) to projecting the values of internal cells within a two-way contingency table based only on marginal totals (Fienberg, 1970). Iterative proportional fitting (IPF), the deterministic reweighting strategy (or ‘microsimulation engine’ in Fig. 2) used here, has an even wider range of uses, described in Jiroušek and Preucil (1995). Spatial microsimulation is useful, and IPF is one of the most established and computationally efficient ways of generating the necessary micro-level data, so making this tool more widely available was a major reason for writing this piece.
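To make the idea of a weight matrix concrete, here is a minimal sketch in R. The dimensions (five survey individuals, three zones) and the object names are assumptions chosen for illustration; they are not taken from the IPFinR code.

```r
# Hypothetical dimensions: 5 survey individuals and 3 zones
n_ind  <- 5
n_zone <- 3

# One row per individual, one column per zone. Every weight starts at 1,
# so each individual is initially counted once in every zone.
weights <- matrix(1, nrow = n_ind, ncol = n_zone,
                  dimnames = list(paste0("ind", 1:n_ind),
                                  paste0("zone", 1:n_zone)))

# Column sums give the simulated population of each zone before reweighting
zone_pop <- colSums(weights)
```

Reweighting then amounts to adjusting each column until the weighted individuals reproduce that zone's constraint totals.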
Figure 2: Schematic of spatial microsimulation
(Un)reproducible research
Another motivation for writing this post is that researchers facing this kind of problem often need to start from scratch. This is despite previous work, including hundreds of lines of code and thousands of tests, to perform spatial microsimulation. Unfortunately, code is rarely published, and results comparing the performance of different approaches to spatial microsimulation are limited to a handful of studies, which seem not to provide reproducible examples (e.g. Harland et al., 2012). (A notable exception to this is Paul Williamson’s ‘Instruction Manual’ for his CO algorithms (Williamson, 2007).) Clearly it is much better to start from something than nothing: time is better spent advancing and applying existing models than trying to ‘re-invent the wheel’. Another factor hindering transparency is the use of proprietary software: if an algorithm is written in SAS or SPSS, for example, not everyone can use it due to licensing issues. Overall, this lack of ‘openness’ is surprising because ensuring the reproducibility of data analysis is actually quite easy (Ince et al., 2012):
Table 1: Criteria for reproducible research
Believing these principles to offer great benefits to people wanting to use spatial microsimulation for their own research, I have attempted to implement them in my own research. So let’s proceed with a worked example.
A worked example
The first table below describes a hypothetical micro-dataset comprising 5 individuals, who are defined by two constraint variables, age and sex. Each has two categories. The following table contains aggregated data for a hypothetical area, as it would be downloaded from the census dissemination portal Casweb.
Table 2: A hypothetical input microdata set (the original weights set to one). The bold value is used subsequently for illustrative purposes.
Table 3: Hypothetical aggregated constraint count data, the ‘small area constraints’ (s)
The next stage is to put the two datasets into the same format. Table 4 presents the aggregated data in a different form, which shows our ignorance of the interaction between age and sex; Table 5 shows the individual-level data in the same format, which is the one we’ll be using for IPF.
Table 4: Aggregated constraints expressed as marginal totals, and the cell values to be estimated.
Table 5: The aggregated results of the weighted microdata set (m(1)). Note that these values depend on the weights allocated in Table 2 and therefore change after each iteration.
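The step of putting the microdata into the same format as the constraints can be sketched in R as follows. The five individuals below are illustrative stand-ins: their ages, their sexes and the age threshold of 50 are assumptions, not necessarily the values in Tables 2 and 3.

```r
# Illustrative microdata (assumed values, not necessarily those of Table 2)
ind <- data.frame(id  = 1:5,
                  age = c(59, 54, 35, 73, 49),
                  sex = c("m", "m", "m", "f", "f"))

# Recode age into two categories (the threshold of 50 is an assumption)
ind$age_cat <- ifelse(ind$age < 50, "under50", "over50")

# Initial weights of one for every individual
ind$w <- 1

# Weighted cross-tabulation of age by sex: the cells of the weighted microdata
m <- xtabs(w ~ age_cat + sex, data = ind)

# Marginal totals, directly comparable with the small-area constraints
age_totals <- rowSums(m)
sex_totals <- colSums(m)
```

Because the weights enter through the left-hand side of the formula, rerunning the last three lines after each reweighting step gives the updated aggregates.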
Using these data it is possible to readjust the weights of the hypothetical individuals, so that their sum would add up to the totals given in Table 4 (12). In particular, the weights can be readjusted by multiplying them by the marginal totals, originally taken from Table 3 and then divided by the respective marginal totals in Table 5. Because the total for each small-area constraint is 12, this must be done one constraint at a time.
This can be expressed, for a given area and a given constraint (i or age in this case), as follows:
$$ w(n+1)_{ij} = \frac{w(n)_{ij} \times sT_i}{mT(n)_i} \qquad (1) $$
where $w(n+1)_{ij}$ is the new weight for individuals with characteristics i (age, in this case) and j (sex), $w(n)_{ij}$ is the original weight for individuals with these characteristics, $sT_i$ is the marginal total for category i of the small area constraint, s (Table 3), and $mT(n)_i$ is the marginal total for category i of the aggregated results of the weighted microdata, m (Table 5). n represents the iteration number. Although the marginal totals of s are known, its cell values are unknown. Thus, IPF estimates the interaction (or cross-tabulation) between constraint variables. (Follow the emboldened values in the tables to see how the new weight of individual 3 is calculated for the sex constraint.) Table 6 illustrates the weights that result. Notice that the sum of the weights is equal to the total population, from the constraint variables.
Table 6: Reweighting the hypothetical microdata set in order to fit Table 3
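A single reweighting step for the age constraint might look like this in R. It is a sketch under stated assumptions: the age categories of the five individuals and the constraint counts (8 under 50, 4 aged 50 and over) are illustrative values chosen to sum to the total of 12 mentioned above, and the object names are hypothetical.

```r
# Assumed age category of each of the five individuals
age_cat <- c("over50", "over50", "under50", "over50", "under50")

# w(n): the current weights, initially one
w_n <- rep(1, length(age_cat))

# sT_i: assumed small-area marginal totals for age (sum to 12)
s_age <- c(under50 = 8, over50 = 4)

# mT(n)_i: marginal totals of the weighted microdata for age
m_age <- tapply(w_n, age_cat, sum)

# New weight = old weight * constraint total / microdata total,
# looked up by each individual's own age category
w_n1 <- w_n * as.numeric(s_age[age_cat] / m_age[age_cat])
```

With these inputs the new weights sum to 12, the total population in the constraint.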
After the individual-level data has been re-aggregated (Table 7), the next stage is to repeat eq. (1) for the sex constraint to generate a third set of weights, by replacing the i in $sT_i$ and $mT(n)_i$ with j and incrementing the value of n:
$$ w(n+2)_{ij} = \frac{w(n+1)_{ij} \times sT_j}{mT(n+1)_j} \qquad (2) $$
To test your understanding of IPF, apply eq. (2) to the information above and that presented in Table 7 below. This should result in the following vector of new weights, for individuals 1 to 5:
Table 7: The aggregated results of the weighted microdata set after constraining for age (m(2))
This is IPF in action! The above process, when applied to more categories (e.g. socio-economic class) and repeated iteratively until a satisfactory convergence occurs, results in a series of weighted microdatasets, one for each of the small areas being simulated. This allows for the estimation of variables whose values are not known at the local level (e.g. income) (Ballas et al., 2005). An issue with IPF (absent from combinatorial optimisation methods), however, is that it produces non-integer weights: fractions of individuals appear in simulated areas. This problem is tackled in our CEUS paper.
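The full procedure for a single area (cycling through the constraints and repeating until the fit is satisfactory) can be sketched as a short R function. This is not the IPFinR implementation: the function name `ipf`, its arguments, the tolerance and the input values are all assumptions made for illustration.

```r
# Minimal IPF sketch for one area.
# cats: named list of category vectors, one element per constraint variable
# cons: named list of named constraint totals, matching the names in cats
ipf <- function(cats, cons, max_iter = 100, tol = 1e-6) {
  w <- rep(1, length(cats[[1]]))              # initial weights of one
  for (iter in seq_len(max_iter)) {
    for (v in names(cons)) {                  # one constraint at a time
      m <- tapply(w, cats[[v]], sum)          # current marginal totals
      w <- w * as.numeric(cons[[v]][cats[[v]]] / m[cats[[v]]])
    }
    # Largest absolute gap between any constraint total and the microdata
    gap <- max(sapply(names(cons), function(v) {
      m <- tapply(w, cats[[v]], sum)
      max(abs(m[names(cons[[v]])] - cons[[v]]))
    }))
    if (gap < tol) break                      # satisfactory convergence
  }
  list(weights = w, iterations = iter, gap = gap)
}

# Assumed inputs: five individuals, two constraints, total population 12
cats <- list(age = c("over50", "over50", "under50", "over50", "under50"),
             sex = c("m", "m", "m", "f", "f"))
cons <- list(age = c(under50 = 8, over50 = 4),
             sex = c(m = 6, f = 6))

fit <- ipf(cats, cons)
```

To simulate many areas, the same function could be applied to each row of a constraint table, filling one column of the weight matrix per area.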
Doing it in R
The above example is best undertaken by hand, probably with a pen and paper to gain an understanding of IPF. To do the calculations for a larger dataset would clearly be a waste of time, as computers have been developed to automate repetitive calculations. This section explains how the IPF algorithm described above was implemented in R, using exactly the same example. At this stage, the reader is recommended to transfer to the rpubs website ( http://rpubs.com/RobinLovelace/6193 ), to see the fully reproducible code.
Instead of repeating the code exhaustively here, I will focus on some assorted issues that may need to be addressed to replicate the above example for larger datasets.
Reading in the data is the first stage, and must be done with care if the process is to work for your data. One problem when reading individual-level data is that variables can be assigned the wrong class. In our example, the age variable is initially treated as a character variable; as.numeric() solves this problem.
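As a small illustration of the class problem, the sketch below uses a made-up data frame in which age has been read in as text. The values and object names are assumptions.

```r
# Hypothetical microdata in which age has been read in as character
ind <- data.frame(age = c("59", "54", "35", "73", "49"),
                  stringsAsFactors = FALSE)
original_class <- class(ind$age)

# Convert to numeric so that age can be compared and binned
ind$age <- as.numeric(ind$age)

# If the column had been read in as a factor, convert via character first:
# as.numeric() applied directly to a factor returns the internal level codes
age_factor  <- factor(c("59", "54", "35"))
age_numbers <- as.numeric(as.character(age_factor))
```

Checking `class()` or `str()` on each column straight after import is a cheap way to catch this before any reweighting is attempted.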
For the geographic data, it is important that the marginal totals add up; otherwise a different total population will result after each constraint is applied. The command rowSums(), applied to the columns that correspond to each variable, can be used to check this for many areas at once. In the rpubs example, the following command showed us that this was the case, as the answer was TRUE:
rowSums(all.msim[,1:2]) == rowSums(all.msim[,3:4])
If the counts do not add up, they can be adjusted to fit the desired population. (I had to do this for my research into commuting, as the number of employed people aged 16 and above does not equal the number of commuters in every area.)
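One simple way to make the counts consistent is to rescale one variable's categories so that they sum to the population implied by another. The sketch below assumes a small made-up constraint table that reuses the name `all.msim`; the counts, the column layout and the choice of age as the reference population are all illustrative.

```r
# Hypothetical constraints for two zones: columns 1-2 are age, 3-4 are sex
all.msim <- matrix(c( 8, 4, 6, 6,
                     10, 5, 7, 7),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("zone1", "zone2"),
                                   c("under50", "over50", "m", "f")))
age_cols <- 1:2
sex_cols <- 3:4

# Check which zones have matching totals (zone2 does not: 15 vs 14)
consistent <- rowSums(all.msim[, age_cols]) == rowSums(all.msim[, sex_cols])

# Rescale the sex counts so each zone sums to its age-based population.
# The scaling vector has one entry per zone, so it is applied row by row.
scale_by <- rowSums(all.msim[, age_cols]) / rowSums(all.msim[, sex_cols])
all.msim[, sex_cols] <- all.msim[, sex_cols] * scale_by
```

After rescaling, the counts are generally no longer integers, so the check is better repeated with `all.equal()` than with `==`.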
Assessing goodness of fit is vital to ensure that the IPF algorithm is working correctly. There is a range of measures that can be used, including root mean square error (RMSE), total absolute error (TAE) or simply the correlation between the aggregates and the aggregated individual data. For the purposes of understanding how the process works, however, I found scatter plots to be most useful, as illustrated in the rpubs example. To add context, attributes can be assigned to different constraint variables and plotted after each constraint is applied (Fig. 3):
Figure 3: Visualisation of the fitting process over the course of one complete iteration involving 4 constraints.
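The numerical measures of fit mentioned above take only a line each in R. In this sketch `observed` stands for the constraint counts and `simulated` for the re-aggregated weighted microdata; both vectors are made-up values for illustration.

```r
# Hypothetical constraint counts and the corresponding simulated counts
observed  <- c(8, 4, 6, 6)
simulated <- c(8.1, 3.9, 6, 6)

# Total absolute error: the sum of absolute differences across all cells
tae <- sum(abs(simulated - observed))

# Root mean square error: penalises large deviations more heavily
rmse <- sqrt(mean((simulated - observed)^2))

# Pearson correlation between constraint and simulated counts
r <- cor(simulated, observed)
```

A scatter plot of the same two vectors, for example `plot(observed, simulated)` followed by `abline(0, 1)`, shows at a glance which cells are furthest from a perfect fit.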
Integerisation is the process of converting the fractional weight matrices generated by IPF into integer weights that represent whole or absent individuals from each area. There are a number of algorithms that can be used to do this. According to my tests, which are replicable using code and data that have been published online, the probabilistic methods of ‘proportional probabilities’ and ‘truncate, replicate, sample’ are the most accurate and no slower. The latter of these, TRS, is the most accurate according to our tests, although there remains potential for further tests and perhaps more integerisation strategies (Lovelace and Ballas, 2013).
Integerisation is important for people wanting to use the outputs of IPF as an input into agent-based models: you need whole individuals. It is also much easier to deal with 3 people than 6 half people!
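The ‘truncate, replicate, sample’ idea can be sketched in a few lines of R. This is a simplified reading of the approach in Lovelace and Ballas (2013), not the published code: the function name `trs`, the use of `sample()` without replacement and the example weights are assumptions.

```r
# Sketch of 'truncate, replicate, sample' for the weights of one area.
# Returns the indices of the individuals selected, with repeats.
trs <- function(w) {
  int_part <- floor(w)                             # truncate
  ids <- rep(seq_along(w), times = int_part)       # replicate
  deficit <- round(sum(w)) - sum(int_part)         # people still missing
  if (deficit > 0) {
    frac <- w - int_part                           # decimal remainders
    extra <- sample(seq_along(w), size = deficit,  # sample the remainder,
                    replace = FALSE, prob = frac)  # weighted by the decimals
    ids <- c(ids, extra)
  }
  sort(ids)
}

set.seed(1)                                        # for reproducibility
w <- c(1.2, 1.2, 3.6, 1.5, 4.5)                    # assumed fractional weights
pop <- trs(w)
int_weights <- table(factor(pop, levels = seq_along(w)))
```

Each individual ends up with an integer weight equal to their fractional weight rounded either down or up, and the integer weights sum to the area's population.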
Spatial microsimulation has been defined in this article as the allocation of individual-level data to aggregate counts based on shared constraints. It is useful for a wide range of purposes, especially due to the way governments tend to disseminate census data, but methods for performing spatial microsimulation have tended to be inaccessible due to unpublished code, test data or the use of expensive closed-source software. My research on the subject has tackled these issues by providing documentation explaining how IPF, a re-weighting algorithm used for spatial microsimulation, can be implemented in the statistical software R. I hope this article is useful to others wanting to harness IPF for their own purposes. If it enables just one person to focus on applying or advancing the methods presented here for the greater good, rather than eternally starting from scratch, then it has served its purpose.
Ballas, D., Dorling, D., Thomas, B., & Rossiter, D. (2005). Geography matters: simulating the local impacts of national social policies. 3, Joseph Rowntree Foundation.
Zuo, C., Birkin, M., Clarke, G., McEvoy, F., & Bloodworth, A. (2013). Modelling the transportation of primary aggregates in England and Wales: exploring initiatives to reduce CO2 emissions. Presentation at the International Geographic Union 2013 Leeds conference.
Fienberg, S. (1970). An iterative procedure for estimation in contingency tables. The Annals of Mathematical Statistics, 41, 907–917.
Harland, K., Heppenstall, A., Smith, D., & Birkin, M. (2012). Creating realistic synthetic populations at varying spatial scales: A comparative critique of population synthesis techniques. Journal of Artificial Societies and Social Simulation, 15, 1.
Ince, D. C., Hatton, L., & Graham-Cumming, J. (2012). The case for open computer programs. Nature, 482, 485–488.
Jiroušek, R., & Preucil, S. (1995). On the effective implementation of the iterative proportional fitting procedure. Computational Statistics & Data Analysis, 19, 177–189.
Lovelace, R., & Ballas, D. (2013). ‘Truncate, replicate, sample’: A method for creating integer weights for spatial microsimulation. Computers, Environment and Urban Systems, 41, 1-11.
Williamson, P. (2007). CO instruction manual: Working Paper 2007/1 (v. 07.06.25). Technical Report June. University of Liverpool.