Saving Climate Data (Part 3)

You can back up climate data, but how can anyone be sure your backups are accurate? Let’s suppose the databases you’ve backed up have been deleted, so that there’s no way to directly compare your backup with the original. And to make things really tough, let’s suppose that faked databases are being promoted as competitors with the real ones! What can you do?

One idea is ‘safety in numbers’. If a bunch of backups all match, and they were made independently, it’s less likely that they all suffer from the same errors.

Another is ‘safety in reputation’. If one bunch of backups of climate data is held by academic institutes of climate science, and another bunch is held by climate-change-denying organizations (conveniently listed here), you probably know which one you trust more. (And this is true even if you’re a climate change denier, though your answer may be different from mine.)

But a third idea is to use a cryptographic hash function. In very simplified terms, this is a method of taking a database and computing a fairly short string from it, called a ‘digest’.

[Figure: diagram of a cryptographic hash function]

A good hash function makes it hard to change the database and get a new one with the same digest. So, if the people who own a database compute and publish its digest, anyone can check that your backup is correct by computing the backup’s digest and comparing it with the published one.

It’s not foolproof, but it works well enough to be helpful.

Of course, it only works if we have some trustworthy record of the original digest. But the digest is much smaller than the original database: for example, in the popular method called SHA-256, the digest is 256 bits long. So it’s much easier to make copies of the digest than to back up the original database. These copies should be stored in trustworthy ways—for example, the Internet Archive.
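For example, on a typical Linux system the standard sha256sum tool computes this digest. Here is a minimal sketch; the file name is just a placeholder for whatever file you backed up:

sha256sum metdata_backup_1979_2015.nc
# prints a line of the form
#   <64 hexadecimal characters>  metdata_backup_1979_2015.nc
# if those 64 characters match the digest published by the data's custodians,
# the backup is (almost certainly) an exact copy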

When Sakari Maaranen made a backup of the University of Idaho Gridded Surface Meteorological Data, he asked the custodians of that data to publish a digest, or ‘hash file’. One of them responded:

Sakari and others,

I have made the checksums for the UofI METDATA/gridMET files (1979-2015) as both md5sums and sha256sums.

You can find these hash files here:

https://www.northwestknowledge.net/metdata/data/hash.md5

https://www.northwestknowledge.net/metdata/data/hash.sha256

After you download the files, you can check the sums with:

md5sum -c hash.md5

sha256sum -c hash.sha256

Please let me know if something is not ideal and we’ll fix it!

Thanks for suggesting we do this!

Sakari replied:

Thank you so much! This means everything to public mirroring efforts. If you’d like to help further promote this Best Practice, consider getting it recognized as a standard for online publishing of key public information.

1. Publishing those hashes is already a major improvement on its own.

2. Publishing them on a secure website offers people the further guarantee that there has not been a man-in-the-middle attack.

3. Digitally signing the checksum files offers the best easily achievable guarantees of data integrity by the person(s) who sign the checksum files.

Please consider having these three steps included in your science organisation’s online publishing training and standard Best Practices.

Feel free to forward this message to whom it may concern. Feel free to rephrase as necessary.

As a separate item, public mirroring instructions – explaining how best to download your data and/or public websites – would further help guarantee the permanence of all your uniquely valuable science data and public contributions.

Right now we should make this message go viral among the people who handle government-funded science publishing. Please approach the key people directly – avoiding the delay of official channels. We need to have all the uniquely valuable public data mirrored before possible changes in funding.

Again, thank you for your quick response!
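As an aside, generating checksum files like the ones above, and signing them as Sakari suggests in his third point, takes only a few commands on a Linux system. Here is a minimal sketch, assuming GnuPG is installed and a key pair already exists; the directory and file names are only placeholders:

# compute one digest line per data file and collect them in a hash file
cd metdata/data
sha256sum *.nc > hash.sha256

# sign the hash file with your GnuPG key (detached, ASCII-armored signature)
gpg --armor --detach-sign hash.sha256

# anyone who has your public key can then verify the signature
gpg --verify hash.sha256.asc hash.sha256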

There are probably lots of things to be careful about. Here’s one. Maybe you can think of more, and ways to deal with them.

What if the data keeps changing with time? This is especially true of climate records, where new temperatures and so on are added to a database every day, or month, or year. Then I think we need to ‘time-stamp’ everything. The owners of the original database need to keep a list of digests, with the time each one was made. And when you make a copy, you need to record the time it was made.
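One simple way to keep such a list, sketched here with purely illustrative file names, is to append the current UTC time and the new digest to a log whenever the database changes, and to publish that log alongside the data:

# append the current UTC time and the database's SHA-256 digest to a log
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ)  $(sha256sum metdata_current.nc)" >> digest_log.txt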

10 Responses to Saving Climate Data (Part 3)

  1. Julie Sylvia says:

    I am so grateful for your goodwill, hard work, and dedication. Thank you.

  2. John Wehrle says:

    “What if the data keeps changing with time? This is especially true of climate records, where new temperatures and so on are added to a database every day, or month, or year. Then I think we need to ‘time-stamp’ everything. The owners of the original database need to keep a list of digests, with the time each one was made. And when you make a copy, you need to record the time it was made.”

    That sounds like a job for a distributed version control system like Git.

  3. Tobias Fritz says:

    I bet that you’ve already looked into trusted timestamping? You send your hash to a trusted authority, who then concatenates it with a timestamp, hashes the result again, and signs it. Then anyone who trusts the authority’s timekeeping can verify that you’ve had the data before the date of the timestamp.
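    This is essentially the service that RFC 3161 time-stamping authorities provide. A minimal sketch with OpenSSL, in which the authority’s URL and its CA certificate are placeholders for whatever service one trusts, might look like this:

    # build a time-stamp request for the checksum file
    openssl ts -query -data hash.sha256 -sha256 -cert -out request.tsq

    # send the request to a time-stamping authority (placeholder URL)
    curl -H 'Content-Type: application/timestamp-query' \
         --data-binary @request.tsq https://tsa.example.org/tsr -o response.tsr

    # later, anyone can verify the signed time-stamp against the data
    openssl ts -verify -data hash.sha256 -in response.tsr -CAfile tsa_ca.pem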

  4. I don’t understand the technology well, but there is the transactional safeguarding system called blockchaining which may deal with this kind of thing. I don’t really understand how blockchains deal with deletions and edits, but that’s one more thing on the huge pile of new things to learn.

  5. By the way, I don’t recall where I got this link from, perhaps from John, perhaps somewhere else. But it does a nice job of sketching the possibilities and why it’s wise to do what we are doing.

  6. domenico says:

    I think it would be more useful to have a single secure site – one that distributes the climate database – with the files encrypted using a Pretty Good Privacy protocol, so that the source is both secure and verified.
    Each request would then go through an intermediate site, which talks to the encrypted site using a registered public/private key pair; the researcher needs to know nothing about the cryptographic protocol, because the intermediate site carries out the decryption on behalf of the client, concealed from them.
    A hash check could fail (with small probability), but decryption ensures that the file has not been modified, and only the encrypted site could modify the original files.
    There could be multiple copies of the secure database (for example, via torrent) to ensure quick downloads, but these would all be copies of a single secure database.

  7. John Baez says:

    Sakari put up an article on G+ about this issue:

    • Sakari Maaranen, Permanent digital publication, 26 December 2016.

    Here’s a tiny taste:

    This is a draft for comments. It’s a requirement-level description of long-term, archival-quality electronic publishing.

    The main problem with various persistent or permanent document identifiers, like PURLs, DOIs, LSIDs, or INFO URIs, is that each assumes it is the one scheme that does it. They may certainly overlap, such that one is defined as a namespace under another, but none of them recognizes that various identification schemes come and go, and that the same resource may be identified by many different names over time.

    Permanent identification metadata

    For permanent record, a digital work needs a separate set of metadata that is not a part of the work itself, but claims a reference to it. The reference is a set of one or more cryptographic hash values of the same work. These hash values may be added or dropped, as long as at least one currently relevant value remains at all times.

    In addition to the set of hash values, the metadata may contain any number of any persistent (or transient) identifiers for the work. These may be authenticated or contested by any currently relevant means. The metadata definition must be independent of any particular identification scheme and support all of them, including any future schemes that have not been thought of yet.

  8. gwpl says:

    One may want to verify in the future that the data have not been changed – this is the domain of #DataIntegrity.

    Historically this was done by timestamping, e.g. with official postal stamps.

    Nowadays we have cryptographic timestamps, of two kinds:
    * Based on PKI, i.e. asymmetric cryptography
    * Based on a blockchain.

    TL;DR – please timestamp! Making timestamps is not a big hassle, and it later supports legally valid claims that the data are as they were on a given day (e.g. GuardTime’s blockchain method prints a hash of all hashes in a newspaper, so a timestamp can be formally verified in libraries, etc.). The only moment to make a timestamp is now: if we do it later, we cannot prove the data were not modified in the meantime. Personally, I am a fan of the blockchain-based method, but I timestamp with as many providers (including PKI ones) as are handy, to be better covered.

    Please allow future #DataIntegrity checks and make a #DigitalTimestamp of those hashes! With #PKI and without (e.g. @GuardTime).

    (As I wrote on
    https://twitter.com/GWierzowiecki/status/814196190669602816 )
