Government spending data: Some cleaning required

The release this morning of data detailing every Whitehall payment above £25,000 is a step towards the culture of public transparency that the previous Government intended to create when it passed the Freedom of Information Act a decade ago.

Rather than waiting for requests under the FOI regime, the coalition Government has committed to releasing this basic spending information proactively, in a format that allows scrutiny by anyone with the necessary time, software and skills.

This morning’s data release will be repeated each month, and from January, local government bodies will have to release similar datasets accounting for all transactions above £500.

Some of the frustrations with analysing public data to which journalists have become accustomed were absent. There were no files released as locked PDF documents that are difficult to import into database software, for example.

While each department’s monthly spending was released as a separate spreadsheet document, these were formatted in a consistent structure across departments, thanks to detailed Treasury guidance on how to release the data.

Nevertheless, analysing the data still posed significant technical challenges.

Once the 181 separate CSV files had been downloaded and collated into a single database of nearly 200,000 records, it quickly became apparent that comparing spending across Whitehall would prove difficult unless the data was subjected to an extensive “cleaning” process.
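For readers who want to attempt the same collation, the step can be sketched in a few lines of Python. This is a minimal illustration, not the process actually used: the column names (`department`, `date`, `expense_type`, `supplier`, `amount`), the `payments` table and the `collate` function are all assumptions, and the real Treasury template may label its fields differently.

```python
import csv
import sqlite3
from pathlib import Path

# Hypothetical column names; the published files may use different headers.
TEXT_COLUMNS = ["department", "date", "expense_type", "supplier"]


def collate(csv_dir, db_path=":memory:"):
    """Load every CSV file in csv_dir into one SQLite table and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments ("
        "department TEXT, date TEXT, expense_type TEXT, "
        "supplier TEXT, amount REAL, source_file TEXT)"
    )
    for path in sorted(Path(csv_dir).glob("*.csv")):
        # utf-8-sig drops the byte-order mark that spreadsheet exports often add.
        with open(path, newline="", encoding="utf-8-sig") as handle:
            for row in csv.DictReader(handle):
                text = [(row.get(col) or "").strip() for col in TEXT_COLUMNS]
                # Assumes amounts are plain numbers, possibly with thousands separators.
                amount = float((row.get("amount") or "0").replace(",", ""))
                conn.execute(
                    "INSERT INTO payments VALUES (?, ?, ?, ?, ?, ?)",
                    text + [amount, path.name],
                )
    conn.commit()
    return conn
```

Keeping the source file name alongside each record makes it possible to trace any odd-looking payment back to the department spreadsheet it came from.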

Suppliers’ VAT numbers were collected but removed from the publicly released datasets at the last moment, making it difficult to recognise the same firm when departments described it in slightly different ways. The same professional services firm was listed variously as “Accenture”, “Accenture UK”, and “Accenture (UK) Ltd”. And so it went for more than 9,500 different suppliers.
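One crude way to group such variants, in the absence of VAT numbers, is to reduce each name to a matching key. The sketch below is illustrative only: the list of words to ignore and the function names `supplier_key` and `group_suppliers` are assumptions, and a rule this blunt could wrongly merge distinct firms with similar names, so its output would need checking by hand.

```python
import re
from collections import defaultdict

# Assumed list of legal and geographic suffixes to ignore; extend as needed.
NOISE_WORDS = {"ltd", "limited", "plc", "llp", "uk"}


def supplier_key(name):
    """Reduce a supplier name to a rough key for matching variants."""
    # Lower-case the name and turn punctuation into spaces.
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    kept = [word for word in words if word not in NOISE_WORDS]
    # Fall back to the full word list if everything was filtered out.
    return " ".join(kept or words)


def group_suppliers(names):
    """Map each key to the set of original spellings that share it."""
    groups = defaultdict(set)
    for name in names:
        groups[supplier_key(name)].add(name)
    return dict(groups)
```

Applied to the three spellings quoted above, `supplier_key` yields the single key `accenture`, so `group_suppliers` places them in one group.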

Each department also has a different terminology for classifying its expenditures, meaning it is difficult to group similar costs across Government. There were some 2,092 different “expense types” listed.

Moreover, the vast majority of spending consisted of transactions between public sector bodies, such as grants to local authorities. As a result, it will remain difficult to see the true impact of public spending on the private sector until smaller public bodies’ spending data becomes available.

For much of the past week, Michael Jacobs, helped by William DeGenst and the Open Knowledge Foundation, has been developing a series of database lookups to regularise the data so that it becomes possible to collate suppliers and expenditure types across departments and identify those payments being made to the private sector.
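The general idea of a database lookup can be shown with a small example. This is a sketch under stated assumptions rather than the lookups actually built: the table names `payments` and `expense_lookup`, the `regularise` function, and every row of sample data are invented for illustration.

```python
import sqlite3


def regularise(conn):
    """Return (supplier, category, amount) rows, mapping raw expense types via a lookup table."""
    return conn.execute(
        """
        SELECT p.supplier,
               COALESCE(l.category, 'Unmapped') AS category,
               p.amount
        FROM payments AS p
        LEFT JOIN expense_lookup AS l
               ON LOWER(TRIM(p.expense_type)) = l.raw_type
        ORDER BY p.rowid
        """
    ).fetchall()


# Invented sample data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (supplier TEXT, expense_type TEXT, amount REAL)")
conn.execute("CREATE TABLE expense_lookup (raw_type TEXT PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [
        ("Supplier A", "IT Consultancy", 30000.0),
        ("Supplier B", " consultancy fees ", 45000.0),
        ("Supplier C", "Grant in aid", 26000.0),
    ],
)
conn.executemany(
    "INSERT INTO expense_lookup VALUES (?, ?)",
    [("it consultancy", "Consultancy"), ("consultancy fees", "Consultancy")],
)
rows = regularise(conn)
```

The left join keeps payments whose expense type has no entry in the lookup and labels them `Unmapped`, which gives a running list of the terms still to be classified. A similar lookup keyed on supplier could, in principle, flag which recipients are public bodies.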

The Government is clearly serious about opening up its data to aid public accountability. Releasing data is all well and good, but to encourage the nation’s “armchair auditors”, it must be readily usable. As the data flood increases with the addition of local government and NHS bodies’ records in the new year, one urgent task for the Government will be to regularise the structure public bodies use to categorise their expenditures.