You can back up climate data, but how can anyone be sure your backups are accurate? Let’s suppose the databases you’ve backed up have been deleted, so that there’s no way to directly compare your backup with the original. And to make things really tough, let’s suppose that faked databases are being promoted as competitors with the real ones! What can you do?
One idea is ‘safety in numbers’. If a bunch of backups all match, and they were made independently, it’s less likely that they all suffer from the same errors.
Another is ‘safety in reputation’. If a bunch of backups of climate data are held by academic institutes of climate science, and another are held by climate change denying organizations (conveniently listed here), you probably know which one you trust more. (And this is true even if you’re a climate change denier, though your answer may be different than mine.)
But a third idea is to use a cryptographic hash function. In very simplified terms, this is a method of taking a database and computing a fairly short string from it, called a ‘digest’.
A good hash function makes it hard to change the database and get a new one with the same digest. So, if the person owning a database computes and publishes the digest, anyone can check that your backup is correct by computing its digest and comparing it to the original.
It’s not foolproof, but it works well enough to be helpful.
Of course, it only works if we have some trustworthy record of the original digest. But the digest is much smaller than the original database: for example, in the popular method called SHA-256, the digest is 256 bits long. So it’s much easier to make copies of the digest than to back up the original database. These copies should be stored in trustworthy ways—for example, the Internet Archive.
When Sakari Maraanen made a backup of the University of Idaho Gridded Surface Meteorological Data, he asked the custodians of that data to publish a digest, or ‘hash file’. One of them responded:
Sakari and others,
I have made the checksums for the UofI METDATA/gridMET files (1979-2015) as both md5sums and sha256sums.
You can find these hash files here:
After you download the files, you can check the sums with:
md5sum -c hash.md5
sha256sum -c hash.sha256
Please let me know if something is not ideal and we’ll fix it!
Thanks for suggesting we do this!
Thank you so much! This means everything to public mirroring efforts. If you’d like to help further promoting this Best Practice, consider getting it recognized as a standard when you do online publishing of key public information.
1. Publishing those hashes is already a major improvement on its own.
2. Publishing them on a secure website offers people further guarantees that there has not been any man-in-the-middle.
3. Digitally signing the checksum files offers the best easily achievable guarantees of data integrity by the person(s) who sign the checksum files.
Please consider having these three steps included in your science organisation’s online publishing training and standard Best Practices.
Feel free to forward this message to whom it may concern. Feel free to rephrase as necessary.
As a separate item, public mirroring instructions for how to best download your data and/or public websites would further guarantee permanence of all your uniquely valuable science data and public contributions.
Right now we should get this message viral through the government funded science publishing people. Please approach the key people directly – avoiding the delay of using official channels. We need to have all the uniquely valuable public data mirrored before possible changes in funding.
Again, thank you for your quick response!
There are probably lots of things to be careful about. Here’s one. Maybe you can think of more, and ways to deal with them.
What if the data keeps changing with time? This is especially true of climate records, where new temperatures and so on are added to a database every day, or month, or year. Then I think we need to ‘time-stamp’ everything. The owners of the original database need to keep a list of digests, with the time each one was made. And when you make a copy, you need to record the time it was made.