I dislike the premise. I mean, good data is wonderful.
But if institutions are expected to release clean data or nothing, it will almost always be the latter.
What is important is to offer as much methodology and as many caveats as possible, even informally. There is a difference between "data covers 72% of companies registered in..." and presenting the data as complete and authoritative when it is not.
(Source: 10 years ago I worked a lot with official data. All data requires cleaning.)
But surely we should expect some basic sanity checks on published data? This isn't a case of petrol stations placed in the middle of a field because of minor typos or bad rounding, or every station's price listed as 1.00 £/l out of laziness, or even all unknown locations defaulting to 0°0'0" N, 0°0'0" E. What the author reports are mistakes that should be trivially detectable on input.
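Something along these lines would catch both classes of error at the point of entry. It's only a sketch: the field names (latitude, longitude, price_ppl) and the plausibility bands are my own assumptions, not the actual Fuel Finder schema.

    def validate_station(record):
        """Return a list of problems found in a submitted station record."""
        problems = []
        lat, lon = record.get("latitude"), record.get("longitude")
        # Coordinates must exist and fall roughly within the UK.
        if lat is None or lon is None:
            problems.append("missing coordinates")
        elif not (49.0 <= lat <= 61.0 and -9.0 <= lon <= 2.0):
            problems.append(f"coordinates ({lat}, {lon}) are not in the UK")
        # Fuel prices in pence per litre should sit in a plausible band.
        price = record.get("price_ppl")
        if price is None:
            problems.append("missing price")
        elif not (80 <= price <= 300):
            problems.append(f"implausible price {price} p/l (pounds vs pence?)")
        return problems

    # A price entered in pounds instead of pence is flagged immediately.
    print(validate_station({"latitude": 51.5, "longitude": -0.12, "price_ppl": 1.43}))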
The problem is that statistics can often do more with all the data, obvious errors included. If you start filtering out rows where someone mis-entered the lat/long, you might introduce a new bias.
Sure, we should indeed expect them to do that. But look at enough data and you'll learn that those expectations are a path towards never-ending frustration. I've been there, spending >100 hours cleaning data... that never got published, because I was too damn focused on the decades of errors that many, many people had created.
To be clear, I'm not saying that we should accept messy data. Just, reality is messy and it's naive to think we can catch and remove all of reality's messiness -- which includes the bureaucratic slop that led to the data being published in the first place.
I don't think these issues are close to the ones the article talks about. The author isn't discussing data coverage, collection methodology, or missing values, but data that is actually wrong: location coordinates, prices, numbers that make no sense, including swapped latitude/longitude and misplaced decimal points.
On the other hand, I agree that bad (but usually fixable) data is better than no data.
I have mixed feelings about this. On one hand, yeah stop publishing garbage data, but as a FOIA nerd... I'll take the data in any state it is. I'm not personally going to be able to clean the data before I receive it. Does that mean I shouldn't release the unsanitized (public) data knowing that it has garbage data within? Hell no. Instead, we should learn and cultivate techniques to work with shit data. Should I attempt to clean it? Sure. But it becomes a liability problem very, very quickly.
One of those people can republish their cleaned and validated version and the 999 others can compare it to the original to decide whether they agree with the way it was cleaned or not.
Do you remove those weird implausible outliers? They're probably garbage, but are they? Where do you draw the line?
If you've established the assumption that the data collection can go wrong, how do you know the points which look reasonable are actually accurate?
Working with data like this means unknown error bars. I've had weird shit happen where I fixed the tracing pipeline, and the metrics people complained because they had been correcting for the errors downstream, so once the source was fixed their corrections made the whole thing look out of shape.
Clean data is expensive--as in, it takes real human labor to obtain clean data.
One problem is that you can't just focus on outliers. Whatever pattern-matching you use to spot outliers will end up introducing a bias in the data. You need to check all the data, not just the data that "looks wrong". And that's expensive.
In clinical drug trials, we have the concept of SDV--Source Data Verification. Someone checks every data point against the official source record, usually a medical chart. We track the % of data points that have been verified. For important data (e.g., Adverse Events), the goal is to get SDV to 100%.
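A toy sketch of what tracking that percentage looks like, with made-up domains and records rather than any real trial schema:

    from collections import defaultdict

    # Made-up data points; "verified" means the value was checked against the
    # source record (e.g. the medical chart).
    points = [
        {"domain": "Adverse Events", "verified": True},
        {"domain": "Adverse Events", "verified": False},
        {"domain": "Vital Signs", "verified": True},
        {"domain": "Vital Signs", "verified": True},
    ]

    totals, done = defaultdict(int), defaultdict(int)
    for p in points:
        totals[p["domain"]] += 1
        done[p["domain"]] += p["verified"]

    for domain, n in totals.items():
        print(f"{domain}: {100 * done[domain] / n:.0f}% SDV complete")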
As you can imagine, this is expensive.
Will LLMs help to make this cheaper? I don't know, but if we can give this tedious, detail-oriented work to a machine, I would love it.
Data and metrics are 90% of what upper management sees of your project. You might not care about them and treat them as an afterthought, but organizationally they're almost the most important thing about it.
People who don't heed this advice get to discover it for themselves (I sure did)
If you can't make the data convincing, you'll lose all trust, and nobody will do business with you.
That said, it's better to publish the garbage data than not to publish it. I would worry about complaining too much, lest they just decide to stop publishing it because it creates bad PR.
This article assumes that there is a person with dedicated time to validate the data. Imagine you want this data and ask for it, but the government says, “sorry, we have this data, but we read an article that said we can only publish it if we spend a lot of time validating it. This data changes frequently and we don’t have a chunk of a full-time data analyst’s salary to spend on it, so we just aren’t going to publish anything. We’d rather put out nothing than embarrass ourselves, so you can’t even try to validate it yourself.”
A couple of days after the UK Fuel Finder service launched last month, I wrote a hobby site using its API to find the cheapest local fuel prices: https://fuelseeker.net. I too discovered prices which had obviously been entered in pounds rather than pence, or which were missing altogether in some cases. You would think they could have done a bit more basic data cleansing on the server to catch that sort of thing.
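A crude client-side heuristic along these lines would paper over most of the pounds/pence confusion; the thresholds are my own guesses, not anything from the API docs:

    def normalise_price(price):
        """Best-effort fix for prices submitted in pounds instead of pence.

        Assumes plausible UK fuel prices are roughly 80-300 pence per litre.
        Returns the corrected price in pence, or None if it can't be trusted.
        """
        if price is None or price <= 0:
            return None            # missing or nonsense value
        if price < 10:
            return price * 100     # e.g. 1.43 (pounds) -> 143.0 pence
        if 80 <= price <= 300:
            return price           # already plausible pence per litre
        return None                # out of range either way; don't guess

    print(normalise_price(1.43))    # 143.0
    print(normalise_price(142.9))   # 142.9
    print(normalise_price(None))    # None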
But, hey, we’re all wise after the event. To their credit though, they do seem to be actively reacting to feedback. I also contacted them about the bad data issue, and they are now adding user warnings about bad price values at the point of data entry (according to https://www.developer.fuel-finder.service.gov.uk/release-not...).
I was looking at that RAC chart this morning. Given it's Sunday, and I was reading before my morning coffee, I'm not ashamed to say it took me a good few seconds of zooming in and out to realise they'd used a decimal point where a comma should have been.
Easy typo to make, but seriously, does no one even take a cursory look at the charts when publishing articles like this? The chart looks _obviously_ wrong, so imagine how many are only slightly wrong and get missed.
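One cheap pre-publication check for exactly this class of mistake is to flag any point that sits orders of magnitude away from the rest of the series; the numbers below are made up for illustration:

    import statistics

    def flag_scale_errors(values, factor=50):
        """Flag values that differ from the series median by a huge factor,
        which is what a misplaced decimal point or thousands separator
        looks like in an otherwise smooth series."""
        med = statistics.median(v for v in values if v > 0)
        return [v for v in values if v <= 0 or v / med > factor or med / v > factor]

    # Made-up weekly pence-per-litre prices, with one value mangled to 1.349.
    prices = [134.2, 135.0, 1.349, 136.1, 135.7]
    print(flag_scale_errors(prices))  # [1.349]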
The fuel prices one could surely be solved with a tiny bit of validation: are the coordinates even within a reasonable range? Fortunately, in the UK it's easy to tell which is latitude and which is longitude, because longitude is always within a few degrees of zero on either side.
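For UK data that check really is a few lines; a sketch, assuming plain (lat, lon) pairs rather than whatever structure the feed actually uses:

    def fix_uk_coords(lat, lon):
        """Detect and undo swapped latitude/longitude for UK locations.

        UK latitudes are roughly 49-61, while UK longitudes are within
        about nine degrees of zero, so the two ranges never overlap.
        """
        def in_uk(la, lo):
            return 49.0 <= la <= 61.0 and -9.0 <= lo <= 2.0

        if in_uk(lat, lon):
            return lat, lon    # already the right way round
        if in_uk(lon, lat):
            return lon, lat    # swapped on entry; swap back
        return None            # not plausibly in the UK at all

    print(fix_uk_coords(51.5, -0.12))   # (51.5, -0.12)
    print(fix_uk_coords(-0.12, 51.5))   # (51.5, -0.12)
    print(fix_uk_coords(0.0, 0.0))      # None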
I prefer to get data with swapped lat/lng (a trivial fix), or prices labelled as dollars that are actually in cents, over no data at all.
Those seem reasonable asks.
Edit to add: the tragedy of the school in Minab is an example of how badly things can go--and it only hints at how much worse bad data can be.
See https://news.ycombinator.com/item?id=47544980
The list was never updated when the building was turned into a school.
It wasn't vibe bombing, and there certainly was enough time to do due diligence, but there was no process in place to do so.
> Authors should have their work proof read
Agreed.
Opening passage:
> A quick plot of the latitude and longitude shows some clear outliners
"outliners"
Ouch!
I have written my own Home Assistant custom component for the UK fuel finder data, and yes, the data really is that bad.
"Stop Publishing Garbage Data, It’s Embarrassing"
Compared to the rather lamer:
"Twice this week, I have come across embarassingly bad data"
?