Frankenstein, PhD: pseudo-individual populations

The doctoral college holds its annual symposium this weekend (20-21 Sept 2013).

I will, among other things, contribute a poster about the recently completed first work package of my thesis. It covers disaggregating census data (a looooot of columns with a loooot of data) onto individual building polygons (no population data at all :/) with the help of a fine-mesh population grid (only a population count, but at comparatively high resolution). The final product is a set of interrelated tables of individuals, households, and buildings.

What I did was …

… apply a local/regional filter to the buildings, omitting everything whose area lies outside the median of its building block plus/minus three standard deviations. The aim here is to discard malls, supermarkets, news stands, and the like, and to work on mostly residential buildings.
Then, I distribute the grid cells’ population count over the buildings, by area. Buildings overlapping more than one grid cell receive proportional population counts from each of these grid cells. In the same way as later on with the census data columns, I first assign decimal values and then round them, iterating from highest to smallest value (as long as the total of the grid cell is not reached). This “initial population” serves only as a seed value for later on. Fortunately so, because the sample data I used (the données carroyées from INSEE) was found to contain errors.
Next, I calculate something like an IDW (inverse distance weighting) for each value for each building, taking into account every census tract polygon. Distances are measured between centroids, obviously.
In the same processing step, I normalise the calculated IDW values by the value of the respective local census tract. This leaves me with “gradients” towards the neighbouring census tract polygons.
Then, I calculate each building’s share of the census tract polygons from the “initial seed population”, and use it, together with the modified IDW value, as multipliers on the census tracts’ values.
We’re nearly there: I just repeat the whole-number thingy (see second step), and I have an integer population in the buildings.
Finally, I use an ugly fitting algorithm (developed by trial and error) to distribute the individuals over households that fit the buildings’ population counts. Household sizes and counts are in the census data, which gives a rough estimate for the distributed values.
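
For the curious, here is a minimal sketch of how the first few steps could look in plain Python. It is not the code from the repository: all function names and parameters are made up for illustration, the rounding uses the largest-remainder method as a stand-in for the "whole-number thingy", the IDW power of 2 is an assumption, and it only handles buildings that sit inside a single grid cell. Areas and centroids are assumed to have been computed beforehand (for example with Shapely).

```python
import math
import statistics


def filter_by_area(areas, k=3.0):
    """Indices of buildings whose area lies within the median
    plus/minus k standard deviations of their building block."""
    median = statistics.median(areas)
    spread = statistics.pstdev(areas)  # assumption: population std. dev.
    return [i for i, a in enumerate(areas) if abs(a - median) <= k * spread]


def round_to_total(values, total):
    """Round decimal values to integers that sum up to `total`
    (largest-remainder method, a stand-in for the rounding step)."""
    counts = [int(math.floor(v)) for v in values]
    missing = max(0, total - sum(counts))
    # Hand out the missing units, largest fractional part first.
    by_remainder = sorted(range(len(values)),
                          key=lambda i: values[i] - counts[i],
                          reverse=True)
    for i in by_remainder[:missing]:
        counts[i] += 1
    return counts


def distribute_by_area(cell_population, areas):
    """Spread one grid cell's population over its buildings by area,
    then round to whole persons."""
    total_area = float(sum(areas))
    shares = [cell_population * a / total_area for a in areas]
    return round_to_total(shares, cell_population)


def idw_gradient(building_xy, tract_centroids, tract_values,
                 local_value, power=2.0):
    """IDW of one census value at a building centroid, normalised by
    the value of the building's own (local) census tract."""
    weights = []
    for (x, y) in tract_centroids:
        d = math.hypot(x - building_xy[0], y - building_xy[1])
        # Guard against a zero distance (assumption: tiny epsilon).
        weights.append(1.0 / max(d, 1e-9) ** power)
    idw = sum(w * v for w, v in zip(weights, tract_values)) / sum(weights)
    return idw / local_value
```

With these, `distribute_by_area(7, [50, 30, 20])` first yields the decimal shares 3.5, 2.1 and 1.4, which are then rounded so that the seven persons of the grid cell are all accounted for. The household fitting step is left out on purpose, as it is too much trial and error to sketch honestly.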

Find the poster as a PDF here

The source code is stored in a git repository at Bitbucket: bitbucket.org/christophfink/one-by-one/. You will need Python 2.x, SpatiaLite, Shapely, and GDAL/OGR. All of them are also available via easy_install/pip/etc. Data and/or an explanation of the (I’m being honest with you) in places poorly commented source code are available upon request 🙂