Saving Climate Data (Part 3)

You can back up climate data, but how can anyone be sure your backups are accurate? Let’s suppose the databases you’ve backed up have been deleted, so that there’s no way to directly compare your backup with the original. And to make things really tough, let’s suppose that faked databases are being promoted as competitors with the real ones! What can you do?

One idea is ‘safety in numbers’. If a bunch of backups all match, and they were made independently, it’s less likely that they all suffer from the same errors.

Another is ‘safety in reputation’. If one bunch of backups of climate data is held by academic institutes of climate science, and another bunch is held by climate-change-denying organizations (conveniently listed here), you probably know which one you trust more. (And this is true even if you’re a climate change denier, though your answer may be different from mine.)

But a third idea is to use a cryptographic hash function. In very simplified terms, this is a method of taking a database and computing a fairly short string from it, called a ‘digest’.

[Figure: diagram of a cryptographic hash function]

A good hash function makes it hard to change the database and get a new one with the same digest. So, if the people who own a database compute and publish its digest, anyone can check that your backup is correct by computing the backup’s digest and comparing it with the published one.

It’s not foolproof, but it works well enough to be helpful.

Of course, it only works if we have some trustworthy record of the original digest. But the digest is much smaller than the original database: for example, in the popular method called SHA-256, the digest is 256 bits long. So it’s much easier to make copies of the digest than to back up the original database. These copies should be stored in trustworthy ways—for example, the Internet Archive.
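For example, on a typical Linux system the standard sha256sum tool computes this digest. Here is a minimal sketch; the file name is just a placeholder for whatever file you backed up:

sha256sum metdata_backup_1979_2015.nc
# prints a line of the form
#   <64 hexadecimal characters>  metdata_backup_1979_2015.nc
# if those 64 characters match the digest published by the data's custodians,
# the backup is (almost certainly) an exact copy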

When Sakari Maaranen made a backup of the University of Idaho Gridded Surface Meteorological Data, he asked the custodians of that data to publish a digest, or ‘hash file’. One of them responded:

Sakari and others,

I have made the checksums for the UofI METDATA/gridMET files (1979-2015) as both md5sums and sha256sums.

You can find these hash files here:

https://www.northwestknowledge.net/metdata/data/hash.md5

https://www.northwestknowledge.net/metdata/data/hash.sha256

After you download the files, you can check the sums with:

md5sum -c hash.md5

sha256sum -c hash.sha256

Please let me know if something is not ideal and we’ll fix it!

Thanks for suggesting we do this!

Sakari replied:

Thank you so much! This means everything to public mirroring efforts. If you’d like to help further promote this Best Practice, consider getting it recognized as a standard for online publishing of key public information.

1. Publishing those hashes is already a major improvement on its own.

2. Publishing them on a secure website offers people the further guarantee that there has not been a man-in-the-middle attack.

3. Digitally signing the checksum files offers the best easily achievable guarantees of data integrity by the person(s) who sign the checksum files.

Please consider having these three steps included in your science organisation’s online publishing training and standard Best Practices.

Feel free to forward this message to whom it may concern. Feel free to rephrase as necessary.

As a separate item, public mirroring instructions – explaining how best to download your data and/or public websites – would further help guarantee the permanence of all your uniquely valuable science data and public contributions.

Right now we should make this message go viral among the people who handle government-funded science publishing. Please approach the key people directly – avoiding the delay of official channels. We need to have all the uniquely valuable public data mirrored before possible changes in funding.

Again, thank you for your quick response!
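As an aside, generating checksum files like the ones above, and signing them as Sakari suggests in his third point, takes only a few commands on a Linux system. Here is a minimal sketch, assuming GnuPG is installed and a key pair already exists; the directory and file names are only placeholders:

# compute one digest line per data file and collect them in a hash file
cd metdata/data
sha256sum *.nc > hash.sha256

# sign the hash file with your GnuPG key (detached, ASCII-armored signature)
gpg --armor --detach-sign hash.sha256

# anyone who has your public key can then verify the signature
gpg --verify hash.sha256.asc hash.sha256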

There are probably lots of things to be careful about. Here’s one. Maybe you can think of more, and ways to deal with them.

What if the data keeps changing with time? This is especially true of climate records, where new temperatures and so on are added to a database every day, or month, or year. Then I think we need to ‘time-stamp’ everything. The owners of the original database need to keep a list of digests, with the time each one was made. And when you make a copy, you need to record the time it was made.
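One simple way to keep such a list, sketched here with purely illustrative file names, is to append the current UTC time and the new digest to a log whenever the database changes, and to publish that log alongside the data:

# append the current UTC time and the database's SHA-256 digest to a log
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ)  $(sha256sum metdata_current.nc)" >> digest_log.txt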

10 Responses to Saving Climate Data (Part 3)

  1. Julie Sylvia says:

    I am so grateful for your goodwill, hard work, and dedication. Thank you.

  2. John Wehrle says:

    “What if the data keeps changing with time? This is especially true of climate records, where new temperatures and so on are added to a database every day, or month, or year. Then I think we need to ‘time-stamp’ everything. The owners of the original database need to keep a list of digests, with the time each one was made. And when you make a copy, you need to record the time it was made.”

    That sounds like a job for a distributed version control system like Git.

  3. Tobias Fritz says:

    I bet that you’ve already looked into trusted timestamping? You send your hash to a trusted authority, who then concatenates it with a timestamp, hashes the result again, and signs it. Then anyone who trusts the authority’s timekeeping can verify that you’ve had the data before the date of the timestamp.
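    This is essentially the service that RFC 3161 time-stamping authorities provide. A minimal sketch with OpenSSL, in which the authority’s URL and its CA certificate are placeholders for whatever service one trusts, might look like this:

    # build a time-stamp request for the checksum file
    openssl ts -query -data hash.sha256 -sha256 -cert -out request.tsq

    # send the request to a time-stamping authority (placeholder URL)
    curl -H 'Content-Type: application/timestamp-query' \
         --data-binary @request.tsq https://tsa.example.org/tsr -o response.tsr

    # later, anyone can verify the signed time-stamp against the data
    openssl ts -verify -data hash.sha256 -in response.tsr -CAfile tsa_ca.pem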

  4. I don’t understand the technology well, but there is the transactional safeguarding system called blockchaining which may deal with this kind of thing. I don’t really understand how blockchains deal with deletions and edits, but that’s one more thing on the huge pile of new things to learn.

  5. By the way, I don’t recall where I got this link from, perhaps from John, perhaps somewhere else. But it does a nice job of sketching the possibilities and why it’s wise to do what we are doing.

  6. domenico says:

    I think it would be more useful to have a single secure site – one that distributes the climate database – with the files encrypted using a Pretty Good Privacy protocol, so that the source is both secure and verified.
    Each request would then go through an intermediate site, which talks to the encrypted site using a registered public/private key pair; the researcher needs to know nothing about the cryptographic protocol, because the intermediate site carries out the decryption on behalf of the client, concealed from them.
    A hash check could fail (with small probability), but decryption ensures that the file has not been modified, and only the encrypted site could modify the original files.
    There could be multiple copies of the secure database (for example, via torrent) to ensure quick downloads, but these would all be copies of a single secure database.

  7. John Baez says:

    Sakari put up an article on G+ about this issue:

    • Sakari Maaranen, Permanent digital publication, 26 December 2016.

    Here’s a tiny taste:

    This is a draft for comments. It’s a requirement-level description of long-term, archival-quality electronic publishing.

    The main problem with various persistent or permanent document identifiers, like PURLs, DOIs, LSIDs, or INFO URIs, is that each assumes it is the one scheme that does it. They may certainly overlap, such that one is defined as a namespace under another, but none of them recognizes that various identification schemes come and go, and that the same resource may be identified by many different names over time.

    Permanent identification metadata

    For permanent record, a digital work needs a separate set of metadata that is not a part of the work itself, but claims a reference to it. The reference is a set of one or more cryptographic hash values of the same work. These hash values may be added or dropped, as long as at least one currently relevant value remains at all times.

    In addition to the set of hash values, the metadata may contain any number of any persistent (or transient) identifiers for the work. These may be authenticated or contested by any currently relevant means. The metadata definition must be independent of any particular identification scheme and support all of them, including any future schemes that have not been thought of yet.

  8. gwpl says:

    One may want to verify in the future that the data have not been changed – this is the domain of #DataIntegrity.

    Historically this was done by timestamping, e.g. with official postal stamps.

    Nowadays we have cryptographic timestamps, of two kinds:
    * Based on PKI, i.e. asymmetric cryptography
    * Based on a blockchain.

    TL;DR – please timestamp! Making timestamps is not a big hassle, and it later supports legally valid claims that the data are as they were on a given day (e.g. GuardTime’s blockchain method prints a hash of all hashes in a newspaper, so a timestamp can be formally verified in libraries, etc.). The only moment to make a timestamp is now: if we do it later, we cannot prove the data were not modified in the meantime. Personally, I am a fan of the blockchain-based method, but I timestamp with as many providers (including PKI ones) as are handy, to be better covered.

    Please allow future #DataIntegrity checks and make a #DigitalTimestamp of those hashes! With #PKI and without (e.g. @GuardTime).

    (As I wrote on
    https://twitter.com/GWierzowiecki/status/814196190669602816 )
