Weighting in Vain

Michael O'Connor
22 min read · Nov 11, 2021

Back in January, Jonathan Portes and I conducted a thought experiment that led us to conclude that considerably more EU migrants might have left the UK in the course of the pandemic than officially estimated, published here. Not everyone agreed, including reputable voices like Prof Ian Gordon at LSE here and Madeleine Sumption of the Migration Observatory here. Notably, both of them conducted further analyses of the Labour Force Survey sample and concluded that the kind of people who had become less likely to appear in the survey was more consistent with non-response than non-presence in the UK. Broadly, Ian used variables on English language difficulties as the basis of his argument, and Madeleine variables on family composition. I don’t need to go into detail as a key point that Jonathan and I made was that ever finer slicing and dicing of the survey sample wouldn’t get us any closer to what was happening. What was needed was comparison of survey data to actual counts of EU migrants, for example in government data on taxpayers and benefit claimants.

To cut a long story slightly shorter, ONS did then embark on an exercise to reweight the Labour Force Survey by comparing survey data with HMRC data on payroll employees. For those who don’t like a still-long story….

Tl;dr

The ONS model for reweighting the Labour Force Survey is based on implausible assumptions that would lead to impossible results

Key data ONS used from their own datasets had already been significantly re-estimated by ONS but the old data was used.

ONS did not use data from the best dataset available to them

ONS are not using new data as it becomes available

The population of EU migrants in the UK pre-pandemic was probably much higher than official estimates

It still seems likely that there was a large exodus over the course of the pandemic that has yet to be reversed, though this cannot be ‘proved’.

The ONS article on their re-weighting exercise is here, published on 17 May 2021. So here we go with, firstly, some commentary on the conceptual and methodological approach taken and, secondly, a look at some of the actual numbers used and comparison with other indicators.

The ONS article starts by saying that given that available data to estimate a statistical model are limited and a model based on data from the pre-pandemic may not be appropriate from 2020, they have developed a simple and robust (my emphasis) method to estimate the population growth rates of the EU and non-EU sub-populations using RTI employee growth rates and that this is based on two main assumptions.

The first assumption is that change in the population growth rate of the non-UK sub-populations is in the same direction as the change in their RTI employee growth rate. However, this is by no means a necessary relationship as net migration may continue to be positive despite employment going down. Official statistics say that this was certainly the case in 2007–2009 for EU migrants.

The second assumption is that the magnitude of change in population growth rate does not exceed that of change in RTI employee growth rate. This is not a necessary relationship either. For example if there is a change in family composition of migrants such that more dependents are associated with each new employee, this will boost population ahead of growth in workers. Or if growth in worker numbers is driven by self-employment, the RTI employee growth rate might well fall behind population growth.

The article then says that the method involves adjusting the known population growth rate of a base period before the pandemic with the change in RTI employee growth rates adjusted by a specified factor. But this puts things the wrong way around: the RTI employee growth rates are ‘knowns’ i.e. the number of payroll employees reported each month to HMRC, but the population growth rate is not, being merely an estimate based on the International Passenger Survey.

The article then describes how the adjustment factor was established using three steps:

1. we show that the change of the population growth rate between a base period, from the pre-pandemic period, and a period from 2020 is approximately proportional to the change in RTI employee growth rates between the same periods; the proportionality factor is unknown but it is shown that it is positive and less than 1

2. the average prediction error of the estimator with respect to the unknown proportionality factor is minimised by adjusting the RTI growth rates by a factor of 1/2

3. to improve the accuracy of the method, we adjusted the RTI employee growth rates by subtracting the RTI growth rates of UK nationals (as this accounts for background change in employment)

Using the RTI employee growth rate to estimate the non-UK population from 2020 is likely to lead to biased estimates as the population tends to change at a lower rate than employment.

While it is possible that population might tend to change at a lower rate than employment, again this is by no means necessarily so. The same principle applies as above in relation to RTI vs population, for example a surge in the number of arrivals looking for work would boost population more than employment if there was a lag between arrival and getting a job or if there were a wave of immigration by family members to join people who had found work in the UK. Conversely, population would shrink at a greater rate than employment if people with a job stayed in the UK but family members left.
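One way to read the three quoted steps is sketched below. This is a paraphrase in Python, not ONS code: the function name, the argument names and the illustrative growth rates are all assumptions made for the sketch, and only the factor of 1/2 and the subtraction of UK growth come from the quoted text.

```python
def estimated_population_growth(base_population_growth,
                                non_uk_rti_growth_base, non_uk_rti_growth_now,
                                uk_rti_growth_base, uk_rti_growth_now,
                                factor=0.5):
    """Base-period population growth plus a scaled change in adjusted RTI growth.

    All growth rates are fractions (0.03 means 3%). This is a reading of the
    quoted description, not the ONS implementation.
    """
    # Step 3: net off UK employee growth as the 'background' change.
    adjusted_base = non_uk_rti_growth_base - uk_rti_growth_base
    adjusted_now = non_uk_rti_growth_now - uk_rti_growth_now
    # Steps 1 and 2: apply only a fraction of the change in adjusted growth.
    return base_population_growth + factor * (adjusted_now - adjusted_base)


# Hypothetical inputs, chosen only to show the mechanics.
rates = dict(base_population_growth=0.03,
             non_uk_rti_growth_base=0.06, non_uk_rti_growth_now=-0.04,
             uk_rti_growth_base=0.01, uk_rti_growth_now=-0.02)

no_change = estimated_population_growth(**rates, factor=0.0)    # base rate carried forward
half_way = estimated_population_growth(**rates)                 # the 1/2 adjustment
full_change = estimated_population_growth(**rates, factor=1.0)  # full RTI change applied
print(no_change, half_way, full_change)
```

Whatever inputs are used, the factor of 1/2 always lands exactly half-way between carrying the base-period rate forward and applying the full change in adjusted RTI growth, so the choice of factor, rather than any observed relationship, does much of the work.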

There then follows a passage that is rather hard to interpret.

If the relative bias is constant over time, then it can be shown that the change in population growth rate is approximately equal to the change in RTI employee growth rates. However, when the relative bias varies over time, in the absence of information on the actual magnitude of the variation, it has proven optimal to adjust the RTI growth rates by half.

Either the ‘bias’, i.e. the relationship between the growth rate of estimated population and the growth rate of RTI employees, is constant over time or it isn’t. The implication of the passage is that it has been observed that the relative bias does vary over time, but in that case the variation must also have been observed, so it is hard to understand how there can have been ‘absence of information on the actual magnitude of the variation’. Unless, that is, there was no observation of variation, and variation has instead been inferred from it having ‘proven optimal’ to adjust RTI growth rates by half.

Either way, it seems very much the wrong approach in an exercise whose purpose is to estimate better what is happening in a situation of highly unusual potential shock. If relative bias does vary over time, and thus change in population is not clearly correlated to change in RTI employment, then calculating or inferring some sort of ‘average’ bias and using that to create a model for projecting change in a shock situation seems quite unsound. This also seems quite inconsistent with the statement above that “a model based on data from the pre-pandemic may not be appropriate from 2020” as indeed the adjustment only ‘optimises’ the results for past trends, and thus tends to project continuation of trend despite the clear possibility of a significant break from trend.

The article continues

So, this adjustment is set half-way between adding the change in RTI employee growth rates and making no change to the population growth rate of the base period. Fitting the model to pre-pandemic data, yields adjustment factors equal to 0.45 and 0.42 for the EU and non-EU sub-populations, respectively. This explains why the proposed method performs well using previous data.

This does not appear to explain anything. Instead it seems to be a circular argument for using an adjustment factor — to make it fit the previous data. The section then concludes

We expect an adjustment factor of around 0.5 if the employee part of the population is about half the total population and change in the size of the population is dominated by employees leaving or entering the UK. Therefore, if this has continued to hold approximately since 2020, then the prediction error of the estimates based on the proposed method should be small.

This seems to be throwing stuff at the wall and seeing what sticks: the only finding is that you get a rough match between change in population estimates and change in non-UK RTI employees if you subtract change in UK employees from the latter and then divide by two. But there is no explanatory power in this, and saying “we expect an adjustment of around half if…” seems wrong, as the reasoning is merely an a posteriori search for an explanation.

Secondly, now for some numbers starting with the impact of the re-weighting on the key labour market variable of people in work, using data from the most recent ONS EMP06 dataset published on 15 July.

The first point to note is that a key apparent implausibility that Jonathan Portes and I had previously identified (the number of UK-born workers increasing by half a million during the pandemic) has been disappeared, and replaced with a rather more likely drop of a million. We noted that the extra UK-born workers seemed to have been magicked-up to replace non-UK born workers, in particular from the EU, who we theorised might well have left the UK. The ONS re-weighting gives a different picture from before, but it is a mixed one, showing a 30% drop in workers from Romania and Bulgaria and a 20% drop in workers from other Eastern European member states. These are large drops, although smaller than in previous ONS estimates. In contrast, the re-weighting changes a small drop in workers from the EU14 older member states into a rise of 10%.

For workers born outside the EU, the re-weighting also changes a small drop into a boost to their numbers by 10% since the start of the pandemic.

But putting together the non-UK series gives the result that there has been no material drop in the numbers of non UK-born workers in the UK at all over the course of the pandemic.

Of course, the numbers of workers don’t directly tell us anything about the size of the underlying populations. A loss of workers might simply be matched by an increase in unemployed or inactive, with the population staying the same. Although the EMP06 dataset includes employment rates as well as employment levels, the populations can’t be derived by dividing one by the other, as the levels are for all ages, and the rates for people aged 16–64 only. ONS don’t publish population figures with any great regularity or timeliness, but in dataset A12 have unemployment and inactivity figures by country of birth, although unlike EMP06 these only distinguish between EU and non EU, with no further country or group breakdown. The employment levels are again for all ages as are unemployment levels, but inactivity levels are only for 16–64 year olds. Still, adding these up should give a sense of how the populations are believed to have changed, especially for non UK sub-populations as a relatively small proportion are aged 65 and over. For UK-born, these imply a population little changed over the past five years, which is far more intuitively plausible than what would have been an unprecedented rise of over a million in a year.

For non UK-born, the re-weighted figures say that the equivalent EU born population has remained essentially unchanged, but that the equivalent population of those born outside the EU had noticeably jumped over the course of the pandemic.

Remember that these new figures were produced by observing a change in RTI payroll for those recorded as EU and non-EU, applying a coefficient of 0.5 (i.e. halving RTI change after adjusting for ‘background change’) and reweighting the LFS accordingly. So if adjusted RTI showed that ‘EU NINo nationals’ increased by 2%, it would be estimated that the EU-born population had increased by 1%, and the weighting of each EU-born respondent in the LFS sample would then be increased by 1%. So if the previous weighting for an individual had been 800 it would be increased to 808. But even if there did exist such a relationship between total population and RTI employees, applying the coefficient to individuals in the LFS dataset seems to mean that employees in the LFS will change at half the rate of employees in RTI.

Superficially this might seem plausible, as over the period 2015 to 2019 EU LFS employees indeed changed at a rather lower rate than EU employees in RTI, with a 42% increase in RTI vs a 26% increase in LFS. Not exactly half but in the same ballpark. However, one thing clear to anyone who regularly looks at these data is that over this period there was anything but a constant relationship between EU RTI employees and EU LFS employees. Some difference between levels in the two series is to be expected. For example the LFS does not cover people in communal accommodation, so farm workers living in bunkhouses or people working in hotels and living on site won’t be counted. Nor does it cover people who are seasonal workers or otherwise on a temporary engagement in the UK.

But it doesn’t seem very plausible that following a consistent difference of about 10% between the two series, a near 15% increase in EU RTI payroll between 2016 and 2019 (or over 300,000 people) was composed entirely of people who weren’t covered by LFS, which showed no increase in EU-born employees at all from 2016 onwards. Alternatively, if this were the case, it would suggest that there isn’t at all a simple relationship between the two measures that enables EU LFS employees to be estimated from RTI over time, let alone the EU population.
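The weighting example above (800 becoming 808) can be sketched in a few lines. This is a minimal illustration, assuming the same scaling is applied to every respondent in a group and that the sample itself is unchanged; the function name and the three-person sample are hypothetical, not taken from the LFS.

```python
def reweight(old_weight, adjusted_rti_growth, factor=0.5):
    """Scale one respondent's weight by the implied population growth."""
    implied_population_growth = factor * adjusted_rti_growth
    return old_weight * (1 + implied_population_growth)


# The example in the text: adjusted RTI up 2%, so the weight rises by 1%.
new_weight = reweight(800, 0.02)

# Hypothetical mini-sample of EU-born respondents: (weight, is_employee).
sample = [(800, True), (750, False), (900, True)]
employees_before = sum(w for w, is_employee in sample if is_employee)
employees_after = sum(reweight(w, 0.02) for w, is_employee in sample if is_employee)

# Weighted LFS employees grow by the same 1%, i.e. half the 2% RTI change.
lfs_employee_growth = employees_after / employees_before - 1
print(new_weight, lfs_employee_growth)
```

Because every weight in the group is multiplied by the same number, any weighted total for that group, employees included, moves by half the adjusted RTI change, which is the point made above.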

Thus the difference between EU employees in RTI and LFS increased from a tenth to well over a quarter. Notably though, no material change was observed in the difference between RTI data and LFS estimates for UK employees. This is the issue of ‘relative bias’ described previously, and that there are two fairly clear periods of quite different bias suggests to me that it is not at all sound to project a subsequent period by simply averaging out the previous.

But returning to the re-weighting itself, rather than actually calibrating the number of EU employees in the LFS using RTI data, and then building this up to the whole EU population using a relationship between employees and whole population, the ONS instead calibrated RTI employees against their previous estimates of long-term international migration based on the International Passenger Survey so as to establish a ‘direct’ relationship between RTI employees and whole population. Thus if in a year they had estimated net migration from the EU to be 100,000 men, women and children, and the number of EU RTI employees had changed by 100,000, then there would be a coefficient of 1. If the number of EU RTI employees had changed by 50,000 then there would be a coefficient of 2, and if the number of EU RTI employees had changed by 200,000 the coefficient would be 0.5.
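The three cases just described amount to a simple ratio, sketched here with a hypothetical function name; the figures are the ones used in the paragraph above.

```python
def coefficient(net_migration, rti_employee_change):
    """Estimated population change per unit change in RTI employees."""
    if rti_employee_change == 0:
        raise ValueError("coefficient is undefined when RTI employees do not change")
    return net_migration / rti_employee_change


# Net migration estimated at 100,000 against three different RTI changes.
for rti_change in (100_000, 50_000, 200_000):
    print(rti_change, coefficient(100_000, rti_change))
```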

The ‘coefficient’ is essentially a relationship between measures of change in stock. In this exercise, the smaller the coefficient, the greater the difference between the measures. The ‘coefficient’ columns show the change in LFS population that would result from applying varying ratios of total population to RTI employees. These changes in total population implied by RTI employees are compared to the change in total population expected from official estimates of net Long Term International Migration.

The ONS note says

The method was tested and evaluated using data from the pre-pandemic period and was found to produce estimates of year-on-year change that are close to previous long-term international migration (LTIM) estimates.

One key test of their method was by varying the coefficient used but it is quite hard to see how the data table in the publication squares with their conclusion. Instead, LTIM overall seems to match best a coefficient of 0.3, not 0.5. Also note that increasing the coefficient pushes down the estimation of net migration.

Notably, these results come from a ‘split the difference’ between rather different results for the EU and non-EU sub-populations. For EU, LTIM falls within the values given by coefficients 0.4 and 0.5 for three of the four years, but is at 0.6 for 2016. As with the overall results, increasing the coefficient pushes down the estimation, but it should be clear that the estimation for EU is highly sensitive to the coefficient used.

Yet their pre-pandemic estimates for the periods used in the re-weighting show unsurprisingly that, within the LFS, generally population and employees change at quite similar rates.

Over a longer period, when there are material differences these generally have another explanation. For example while employees grow ahead of population in 2014, total workers do not, implying switching from self-employment to employee status. Conversely, in 2012, population growth was rather more than employee growth, but less than total worker growth, implying switching from employee to self-employed status.

Differences between growth in population and growth in employees are very largely explained by the changing proportions of employee and self-employed and the overall employment rate.
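The point can be made concrete with an identity: employees equal population times the share of the population in work times the share of workers who are employees, so employee growth splits into three multiplicative parts. The sketch below uses invented round numbers, not LFS figures, and the "work rate" here is simply workers divided by population for all ages, not the official 16–64 employment rate.

```python
def employee_growth_factors(pop0, pop1, workers0, workers1, employees0, employees1):
    """Split employee growth into population, work-rate and employee-share factors."""
    population = pop1 / pop0
    work_rate = (workers1 / pop1) / (workers0 / pop0)
    employee_share = (employees1 / workers1) / (employees0 / workers0)
    return population, work_rate, employee_share


# Hypothetical year: population and total workers both grow 5%, employees grow 8%.
population, work_rate, employee_share = employee_growth_factors(
    pop0=3_000_000, pop1=3_150_000,
    workers0=1_800_000, workers1=1_890_000,
    employees0=1_500_000, employees1=1_620_000,
)

# The three factors multiply back to total employee growth.
employee_growth = population * work_rate * employee_share
print(population, work_rate, employee_share, employee_growth)
```

In this invented case the gap between employee growth and population growth comes entirely from the employee share, i.e. workers switching from self-employment to employee status, which is the pattern described above for 2014.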

There is thus no reason at all to ‘expect’ population to grow at half employee rate, and it is easy to calculate that if it did while employee numbers grew at their pre-pandemic average then we would see more EU employees than total EU population within ten years. This isn’t a reductio ad absurdum but a simple sense check, and indeed, whatever positive rate is used for employee growth, eventually employees will exceed population growing at half that rate.
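The sense check can be sketched as follows. The starting employee-to-population ratio and the growth rate used in the example call are placeholders, not the pre-pandemic averages referred to above, so the printed figure is illustrative only; the function name is also an assumption.

```python
import math


def years_until_employees_exceed_population(employee_share, employee_growth, factor=0.5):
    """Years for employees to overtake a population growing at factor x their rate."""
    if not 0 < employee_share < 1:
        raise ValueError("employee_share must be between 0 and 1")
    if employee_growth <= 0 or not 0 <= factor < 1:
        raise ValueError("needs positive employee growth and a factor below 1")
    population_growth = factor * employee_growth
    # Each year the employee-to-population ratio is multiplied by this amount.
    relative_growth = (1 + employee_growth) / (1 + population_growth)
    return math.log(1 / employee_share) / math.log(relative_growth)


# Hypothetical inputs: employees start at two-thirds of population, growing 10% a year.
years = years_until_employees_exceed_population(employee_share=2 / 3, employee_growth=0.10)
print(round(years, 1))
```

The result is finite for any positive growth rate, which is the general point: a population assumed to grow at half the employee rate must eventually be overtaken by its own employees.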

However, remember that ONS said ‘if the employee part of the population is about half the total population’. Looking at the LFS estimates, this is true enough, with employee/population ratio not straying very far from a half. But in this exercise, ONS have not been comparing LFS population with LFS employees but with RTI employees. And the ratio between these numbers has always been comfortably above half and in the run-up to the pandemic had reached well over two-thirds.

As to which is more likely to represent the relationship of actual employees to actual population, i.e. household composition, it should be obvious that the LFS — as a household survey including thousands of households — should be a better guide regardless of any differential response between EU households and UK households generally unless there is a considerable differential response among different types of EU households. So what seems most likely is that the LFS was falling further behind in its capturing of an EU sub-population whose household composition was changing little.

But in a further complication even more fateful for the re-weighting exercise the LTIM estimates had been revisited by ONS in a quite separate publication as part of their Population and Migration Statistics Transformation programme by comparing them to the new RAPID database created from HMRC and DWP data that encompassed not just RTI employees but the self-employed, others making Self-Assessment tax returns, and everyone claiming benefits from both HMRC (tax credits and Child Benefit) and DWP (Universal Credit, JSA, Housing Benefit, disability benefits etc etc). These data suggested that net migration from the EU had been considerably underestimated in all of these years, and adding estimates based on RAPID to the chart above gives a rather different picture.

Using these re-estimated figures for LTIM would mean that to get the change in LFS whole population to match the RAPID estimates a coefficient of something like 0.25 would have to be applied to RTI employees. But this would mean replacing the already implausible assumption that population grows at half RTI employee rate with the even less plausible one that it grows at only a quarter of RTI employee rate.

The alternative is that the LFS whole EU population in the period has been considerably under-estimated, which is of course the obvious implication of the RAPID data, and thus that the appropriate base from which to start the projections is not the existing LFS stock. In fact, the RAPID-based estimates of long-term international migration imply a pre-pandemic EU population as much as a million higher than in the LFS, or around 4.5m.

This is given support by the numbers of applications under the EU Settlement Scheme, the number being counted at around 5.5m once duplicates are removed. Allowing for the non-EEA nationals who applied, and the fact that a proportion of applicants will have been born in the UK, this would also suggest an EU-born population of around 4.5m rather than the re-weighted LFS figure of 3.6m. The official line is that very few applications have been made from abroad and that almost all are associated with addresses in the UK — as per the breakdown by local authority in the detailed EUSS statistics published quarterly. If this is correct, then again it points to more EU nationals having been in the UK than official statistics suggested.

Unfortunately, none of this can tell us very much reliably about how the population might have changed over the pandemic. While the benefit of RAPID is its broad coverage of interactions with officialdom from which presence in the UK can be inferred, many of the normal relationships between interaction and presence will have broken down. For example it would have been impossible to tell whether any of the many people switched to permanent home-working or put on furlough stayed in the UK or returned to their home country, because no change would be likely to appear in the data submitted to HMRC by their employers. Similarly with claims to benefits, the ability to claim online and the abandonment of requirements to attend at a DWP office mean that claims could be started while in the UK by people who subsequently left but continued to claim, or claims could even be made after departure.

Again, anyone looking at this is slightly hamstrung by the lack of publicly available data, but returning to the general theme, DWP publish an annual snapshot of benefit combinations that identifies UK, EU and non-EU individuals (as usual by ‘NINo nationality’) in benefit claims. The number of claimants in total is typically slightly higher than the number of unemployed in the LFS. For the EU sub-population, looking at the number of unemployed in the LFS together with the number of claimants of JSA and Out-of-Work Universal Credit showed about 32,000 more unemployed EU migrants in the LFS in Oct-Dec 2019 than the number of claimants in DWP data for November 2019. There are intuitive reasons for this, in particular that new arrivals who came to the UK looking for work, even under free movement, couldn’t immediately claim unemployment benefits. But roll forward a year to the LFS for Oct-Dec 2020 and while the number of EU unemployed has increased by three-quarters or 63,000, the DWP snapshot for November 2020 saw the number of EU claimants of the same benefit combination as before in their administrative data increase nearly five times (in contrast to a ‘mere’ doubling among the UK population) and by near 200,000. This is rather more than the decrease in RTI payroll numbers, and so yet another reason to question whether the population can possibly change at half the rate of RTI employees.
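The rounded figures quoted above hang together, as a back-of-envelope check shows. The levels below are backed out from those rounded figures ("three-quarters", "near 200,000"), so they are indicative only and are not published numbers; the variable names are chosen for the sketch.

```python
# Figures quoted in the text (rounded).
lfs_unemployed_increase = 63_000       # rise in EU unemployed, Oct-Dec 2019 to Oct-Dec 2020
lfs_increase_share = 0.75              # "three-quarters"
lfs_excess_over_claimants_2019 = 32_000
claimant_increase = 200_000            # "near 200,000"

# Levels implied by those figures (approximate, not published data).
lfs_unemployed_2019 = lfs_unemployed_increase / lfs_increase_share        # about 84,000
claimants_2019 = lfs_unemployed_2019 - lfs_excess_over_claimants_2019     # about 52,000
claimants_2020 = claimants_2019 + claimant_increase                       # about 252,000
lfs_unemployed_2020 = lfs_unemployed_2019 + lfs_unemployed_increase       # about 147,000

claimant_multiple = claimants_2020 / claimants_2019  # a little under 5, "nearly five times"
print(round(claimant_multiple, 2), lfs_unemployed_2020 - claimants_2020)
```

On these implied levels the relationship flips: EU claimants go from some way below the LFS unemployed count in late 2019 to well above it a year later.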

But even leaving that aside, this raises the question of why the re-weighting has been carried out using the original LTIM figures when these have been adjusted very significantly based on a broad base of administrative data.

A further wrinkle is that ONS have told me in correspondence that they have received no new data on the numbers of non-UK NINo nationals in RTI since those originally published in March 2021, meaning that the last growth rate for RTI that they have is for the year Oct-Dec 2019 to Oct-Dec 2020. This seems to mean that while they will have re-weighted previous periods on the basis of contemporaneous RTI change matching those periods, for any subsequent period i.e. current and future estimates they will be using a non-contemporaneous RTI change from a different period. Thus the exercise resulted in a one-off re-weighting. This seems odd. If a relationship were established between RTI employee growth and population growth then one would use RTI employee growth in the current period of interest — say in the year to September 2021 — and estimate population growth from that using the relationship established in the model. It would be a trivial task to compare LFS estimates with change in the RTI data month on month. But it seems that this isn’t being done — certainly I was told in September that ONS had had no new information on any period after December 2020.

On top of this, if EU/non-EU regional population figures that match LTIM are derived from RTI employee numbers and then used to re-weight the LFS, some confounders will include movement between regions compared to the period on which the model was based (say previous arrivals had tended to go to London and new arrivals are tending to go to the Midlands) and changes in proportions working/non-working or employee/self-employed. The change in proportion working/not working might be ‘real’ i.e. labour market flows between employment, unemployment and inactivity, or compositional e.g. more children. In both cases extrapolation of employees among whole population from employees in RTI will fall down. It all seems very simplistic and broad brush.

So does this show that Jonathan and I were wrong to suggest that there might have been a very large drop in the number of migrant workers in the UK, particularly from the EU? The short answer is no, because ONS have assumed that their pre-pandemic estimates were correct. Having magicked-up UK-born people to make up for the loss of foreign-born people from their LFS sample, ONS now seem to have disappeared them and magicked back the foreign born. I say magicked back because they obviously remain disproportionately absent from the sample. Ultimately their results seem based on a determination to assume that their pre-pandemic estimates for population and migration were correct even though they have now changed their migration estimates and have fairly compelling evidence that their population estimates were incorrect.

The bottom line from all this, both conceptually and empirically, is that it seems very likely that the ONS model for re-weighting applies the wrong rate of change to the wrong starting-base.

It’s possible for all sorts of things to be true at the same time. ONS could be right about the size of the UK-born and migrant sub-populations now if despite a significant drop in the number of migrants there had previously been far more than thought. As claims of labour shortage and recruitment difficulties are so widespread, very often with reference to a lack of EU workers, yet the official statistics show as many EU workers here as before by applying a modelled change to previous estimates, then either the claims are all specious, or the modelled change is wrong or the original estimates were wrong. But as the modelled change is based on actual data, and there are material indicators of labour market tightness in pay levels and advertised vacancies, it seems most likely that the original estimates were wrong. Bear in mind that they are assumed by ONS to have been correct despite their noticeable difference from the narrow RTI measure and even greater difference from the wider RAPID measure. A further interesting pointer to the original population estimates being wrong has just today been splashily reported e.g. in the Mail here. A version of the actual research paper by Francesco Rampazzo et al at the University of Southampton is here. It applies a Bayesian probabilistic analysis to social media data and concludes “Overall … an undercount of 25% for 2018 and 20% for 2019 based on the LFS data” which is largely in line with what the RAPID data above suggests.

Further, there’s a certain carelessness about the ONS approach that makes one wonder about the quality assurance processes. It was puzzling, but fairly obvious, that the initial publication, having described a model based on the LFS country of birth variable, then illustrated it with a table of data that was instead based on nationality. Far harder to discover, and only apparent after repeated failures on my part to reproduce the results in the publication, was an error that ONS finally acknowledged they had made in the calculation of a key input to the model: misaligning the months used to calculate quarterly growth rates such that the data they used for e.g. Apr-Jun was the average of March, April and May rather than April, May and June. While ONS slightly dismissively said that the difference this made didn’t put the results outside ‘usual confidence intervals’, that isn’t really the point. It was mere coincidence that the figures were displaced by a month rather than by a quarter or a year, and real-world confidence requires confidence that checks are undertaken that will ensure this kind of ‘spreadsheet error’ is detected.
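The month misalignment is easy to illustrate. The monthly values below are invented index numbers, not RTI data, and the helper name is hypothetical; the sketch only shows how a one-month shift changes a quarterly average.

```python
# Hypothetical monthly employee index values (not RTI data).
monthly = {"Mar": 100.0, "Apr": 96.0, "May": 94.0, "Jun": 95.0}


def quarter_mean(series, months):
    """Average the named months, failing loudly if one is missing."""
    return sum(series[month] for month in months) / len(months)


intended = quarter_mean(monthly, ["Apr", "May", "Jun"])    # what Apr-Jun should contain
misaligned = quarter_mean(monthly, ["Mar", "Apr", "May"])  # the one-month displacement

print(intended, round(misaligned, 2))
```

Naming the months explicitly for each quarter, rather than relying on row positions in a spreadsheet, is one simple way such a shift could be made visible.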

Finally, while I understand that the exercise might have been intended only as a quick and dirty interim measure to be replaced by something better in due course and that only limited resources might have been available, resource or time constraints certainly don’t seem to explain the use of ‘old’ LTIM rather than re-estimated LTIM as it’s as easy to plug the latter into the formulae as the former. Nor do resource or time constraints explain any of the conceptual issues like the methodological step that guarantees non-UK LFS employees will change at half the rate of non-UK RTI employees. Nor really do they explain why ongoing changes in RTI aren’t being (or haven’t been) observed.

And the reason this matters is that beyond simple population or labour market figures, the results are being fed into wider economic statistics like productivity estimates and will feed into OBR forecasts etc. and thus inform policy decisions. Having been involved in policy in my time I do realise that the evidence base is rarely as good as one would like but it’s very disappointing that in this case it seems that throughout the exercise really quite basic checks on accuracy and plausibility haven’t been carried out.

The better way would have been (as I and some others have always proposed) comparing like with like in survey and admin data e.g. LFS employees with RTI, LFS self-employed with HMRC self-assessors, LFS children with children in Child Benefit claims, LFS unemployed with DWP claimants etc etc. Obviously something more like this general approach is now being taken in the Population and Migration Statistics Transformation programme (though I have doubts about aspects of this too, they pale in comparison) and bearing in mind the availability of data in this strand of ONS work, it is hard to see why it wasn’t used for the re-weighting exercise.

So while what actually happened over the course of the pandemic remains something of a known unknown, it seems ever clearer that there were many more EU migrants in the UK pre-pandemic than shown in official statistics and this increases the likelihood that rather more might well have left than in official statistics.

As usual from me, this is intended as a contribution to an ongoing debate and all views expressed are mine personally and not to be assumed to be shared by anyone else, as of course are any errors or omissions!
