Saving Climate Data (Part 2)

I want to get you involved in the Azimuth Environmental Data Backup Project, so click on that for more. But first, some background.

A few days ago, many scientists, librarians, archivists, computer geeks and environmental activists started to make backups of US government environmental databases. We’re trying to beat the January 20th deadline just in case.

Backing up data is always a good thing, so there’s no point in arguing about politics or the likelihood that these backups are needed. The present situation is just a nice reason to hurry up and do some things we should have been doing anyway.

As of 2 days ago the story looked like this:

Saving climate data (Part 1), Azimuth, 13 December 2016.

A lot has happened since then, but much more needs to be done. Right now you can see a list of 90 databases to be backed up here:

Gov. Climate Datasets (Archive). (Click on the tiny word “Datasets” at the bottom of the page!)

Despite the word ‘climate’, the scope includes other environmental databases, and rightly so. Here is a list of databases that have been backed up:

The Climate Mirror Project.

By going here and clicking “Start Here to Help”:

Climate Mirror.

you can nominate a dataset for rescue, claim a dataset to rescue, let everyone know about a data rescue event, or help in some other way (which you must specify). There’s also other useful information on this page, which was set up by Nick Santos.

The overall effort is being organized by the Penn Program in the Environmental Humanities, or ‘PPEHLab’ for short, headed by Bethany Wiggin. If you want to know what’s going on, it helps to look at their blog:

DataRescuePenn.

However, the people organizing the project are currently overwhelmed with offers of help! People worldwide are taking action in a decentralized way! So everything is a bit chaotic, and nobody has an overall view of what’s going on.

I can’t overstate this: if you think that ‘they’ have a plan and ‘they’ know what’s going on, you’re wrong. ‘They’ is us. Act accordingly.

Here’s a list of news articles, a list of ‘data rescue events’ where people get together with lots of computers and do stuff, and a bit about archives and archivists.

News

Here are some things to read:

• Jason Koebler, Researchers are preparing for Trump to delete government science from the web, Vice, 13 December 2016.

• Brady Dennis, Scientists frantically copying U.S. climate data, fearing it might vanish under Trump, Washington Post, 13 December 2016. (Also at the Chicago Tribune.)

• Eric Holthaus, Why I’m trying to preserve federal climate data before Trump takes office, Washington Post, 13 December 2016.

• Nicole Mortillaro, U of T heads ‘guerrilla archiving event’ to preserve climate data ahead of Trump presidency, CBC News, 14 December 2016.

• Audie Cornish and Eric Holthaus, Scientists race to preserve climate change data before Trump takes office, All Things Considered, National Public Radio, 14 December 2016.

Data rescue events

There’s one in Toronto:

Guerrilla archiving event, 10 am – 4 pm EST, Saturday 17 December 2016. Location: Bissell Building, 4th Floor, 140 St. George St., University of Toronto.

There will be one in Philadelphia:

DataRescuePenn Data Harvesting, Friday–Saturday 13–14 January 2017. Location: not determined yet, probably somewhere at the University of Pennsylvania, Philadelphia.

I hear there will also be events in New York City and Los Angeles, but I don’t know details. If you do, please tell me!

Archives and archivists

Today I helped catalyze a phone conversation between Bethany Wiggin, who heads the PPEHLab, and Nancy Beaumont, head of the Society of American Archivists. Digital archivists have a lot of expertise in saving information, so their skills are crucial here. Big wads of disorganized data are not very useful.

In this conversation I learned that some people are already in contact with the Internet Archive. This archive always tries to save US government websites and databases at the end of each presidential term. Their efforts are not limited to environmental data, and they save not only webpages but entire databases, e.g. data on FTP sites. You can nominate sites to be saved here:

• Internet Archive, End of Presidential Term Harvest 2016.

For more details read this:

• Internet Archive blog, Preserving U.S. Government Websites and Data as the Obama Term Ends, 15 December 2016.

19 Responses to Saving Climate Data (Part 2)

  1. This blog page is to help organize the Azimuth Environmental Data Backup Project, or Azimuth Backup Project for short. This is part of the larger but decentralized, frantic and somewhat disorganized project explained here:

    Saving Climate Data (Part 2), Azimuth, 15 December 2016.

    Here I’ll just say what we’re doing at Azimuth.

  2. Bruce Smith says:

    Just a general piece of advice: backups are only useful if they are themselves accessible, and tested/verified. So a site that lists “what has already been backed up” might be useful for getting started, but once it’s “full”, what it needs to list for each thing that “has been backed up” is the answer to: how can a third party verify that it’s been backed up correctly, and get and use the data themselves?

    For example, if we were talking about a single file (a gross oversimplification), what is its URL, length, format, and a hash of its content? Then anyone can verify that they can access the backup, simply by doing so. If there are two backups, anyone can compare them for equality.
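    For instance, here is a minimal Python sketch of that single-file check; the URL, expected length, and expected digest are made-up placeholders, not a real manifest entry:

    ```python
    import hashlib
    import urllib.request

    # Hypothetical manifest entry for one backed-up file (placeholders).
    URL = "https://example.org/backups/some-dataset.tar.gz"
    EXPECTED_LENGTH = 123456789  # bytes
    EXPECTED_SHA256 = "replace-with-the-published-hex-digest"

    def verify(url, expected_length, expected_sha256):
        """Stream a file and check its length and SHA-256 digest."""
        digest = hashlib.sha256()
        length = 0
        with urllib.request.urlopen(url) as response:
            while True:
                chunk = response.read(1 << 20)  # 1 MiB at a time
                if not chunk:
                    break
                length += len(chunk)
                digest.update(chunk)
        return length == expected_length and digest.hexdigest() == expected_sha256

    print(verify(URL, EXPECTED_LENGTH, EXPECTED_SHA256))
    ```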

    For a directory of many files, as long as some general info is given, anyone can verify accessibility of randomly chosen files within it, and compare the same file from two purported backups.

    For a “database” it’s more complicated, but therefore this verification is all the more needed. (A backup by an amateur is especially likely to be incomplete or incorrect without them realizing that.)

    Obviously the experts already know all this. I’m pointing it out just so you can easily notice whether a site that “lists which backups have been done” is missing this crucial info or not.

    • John Baez says:

      Digital archivists and others are getting on board, so I hope the issues you raise will be dealt with correctly.

      By the way, the old list of ecological databases to be backed up is now topped with a huge message saying that less information is available there now. I no longer see any information about whether each database has been backed up (and how many times, etc.). I don’t know where that information can be found.

  3. Bruce Smith says:

    I want to get you involved in the Azimuth Environmental Data Backup Project, so click on that for more.

    That link is incorrect — it’s just a link to this post.

  4. Alex says:

    I’m surprised no one has just downloaded everything and posted it to a quick GitHub repo… you know, just to make sure it’s safe and backed up.

    • John Baez says:

      You sound like the economist who, seeing a $20 bill lying in the gutter, didn’t pick it up, reasoning that if it were real, someone would have already done so.

      What you’re surprised hasn’t happened already is exactly what’s happening now. Many people are, as we speak, “just” downloading everything and letting others know they’ve done it. But “just” is not quite the right word when you’re trying to back up at least 90 databases, ranging from hundreds of gigabytes to hundreds of terabytes in size.

      One big job is just finding all the databases. There is no perfect list of all US federal government environmental databases. The best list I know is the one that people are throwing together now (click the tiny word “Datasets” in blue on top).

  5. John Baez says:

    Important progress! Here is a partial list of climate mirrors, and databases that have been backed up:

    The Climate Mirror Project.

  6. This seems like a natural fit for truly decentralized technologies such as IPFS and blockchains, which provide a framework for uncensorable, tamper-proof applications. I am a developer in this field; if you want to know more, AMA.

    • This is a copy of my comment to the Saving Climate Data (Part 1) post, just in case someone is reading here and didn’t see it there.

      While the blockchain as a technology can be used to provide an audit log for the data, I would still consider it a technology, not the technology, until it has further developed and matured enough not to require the complete chain all the way back to the genesis block.

      Message digests? Absolutely.
      Digital signatures? Of course.
      With timestamps? Certainly.
      The blockchain? Why not, among others.

      A sufficient level of trust does not require a mathematical proof of authenticity all the way back to the inception of any particular system.

      Having reasonable guarantees distributed widely enough in a decentralized fashion is enough. Future derivatives of blockchain technology and similar systems should rely on sufficiently many verifiable sources, instead of requiring an absolute chain to an absolute origin and everyone keeping a perfect record of all history up to the current point in time.

      A mature system will allow for an imperfect record and still work. Note that this can already be achieved to a reasonable level with message digests, digital signatures and timestamps alone, when widely deployed. Having a blockchain doesn’t hurt, but it shouldn’t be considered a necessary part of the publication process. We can use the first three in any case, with or without the fourth.
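      To make the first three concrete, here is a rough Python sketch of publishing a digest, a timestamp, and a signature for some data. It uses the standard library plus the third-party ‘cryptography’ package, and the data and key are invented for the example:

      ```python
      import datetime
      import hashlib

      from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

      data = b"contents of some archived dataset"  # stand-in for real data

      # 1. Message digest of the data.
      digest = hashlib.sha256(data).hexdigest()

      # 2-3. Sign the digest together with a timestamp.
      timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
      record = f"{digest} {timestamp}".encode()

      private_key = Ed25519PrivateKey.generate()  # in practice, a long-lived archive key
      signature = private_key.sign(record)

      # Anyone holding the matching public key can check the record later;
      # verify() raises InvalidSignature if record or signature was altered.
      private_key.public_key().verify(signature, record)
      print("verified:", record.decode())
      ```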

  7. So in addition to the detailed listing at:

    https://goo.gl/O7uOjm

    there is:

    https://data.noaa.gov/dataset

    which overlaps the former to some degree, but also offers datasets not mentioned there. I did not find a corresponding FTP site, although there may be ones lower down in the hierarchy of datasets.

    There is some documentation at:

    http://www.nws.noaa.gov/tg/dataprod.php

    and

    https://www.ncdc.noaa.gov/isd/data-access

    and

    https://www.ngdc.noaa.gov/ftp.html

    including FTP access to at least part of it at

    ftp://ftp.ncdc.noaa.gov/pub/data/noaa

    Do we trust the original spreadsheet? Or try to go after these? The nice thing about FTP access is there is no question of needing to understand the Web site structure in order to replicate files. Data can just be downloaded.
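    For what it’s worth, mirroring the top level of an FTP directory like the one above takes only a few lines with Python’s standard ftplib. A minimal sketch (no recursion into subdirectories, and no retry logic):

    ```python
    import ftplib
    import os

    HOST = "ftp.ncdc.noaa.gov"
    REMOTE_DIR = "/pub/data/noaa"
    LOCAL_DIR = "noaa-mirror"

    os.makedirs(LOCAL_DIR, exist_ok=True)

    ftp = ftplib.FTP(HOST)
    ftp.login()  # anonymous login
    ftp.cwd(REMOTE_DIR)

    for name in ftp.nlst():  # top-level entries only
        local_path = os.path.join(LOCAL_DIR, name)
        try:
            with open(local_path, "wb") as f:
                ftp.retrbinary("RETR " + name, f.write)
            print("saved", name)
        except ftplib.error_perm:
            os.remove(local_path)  # entry was a directory, not a plain file
            print("skipped", name)

    ftp.quit()
    ```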

    • John Baez says:

      Jan wrote:

      Do we trust the original spreadsheet? Or try to go after these?

      The original spreadsheet was randomly thrown together in haste by a bunch of people who aren’t much wiser than you and me. It’s probably missing a bunch of good sites, and it also includes some sites that probably shouldn’t be there, since they aren’t .gov. When Sakari asked the person running the University of Idaho Gridded Surface Meteorological Data site whether he could back that site up, that person replied:

      I am not as worried about my datasets as they are on servers hosted by US universities, rather than the federal gov’t.

      So, we should use our initiative in identifying and backing up .gov databases that contain important climate/ecological information and aren’t on the original spreadsheet!

      In short: go for it! It sounds like you found some good stuff. You can add it to the original spreadsheet, but also just start downloading.

      Also, please log the sites you’ve already backed up, so we can know what we’ve done! You can do it here:

      https://bitbucket.org/azimuth-backup/azimuth-inventory/wiki/Home

      Partial information is better than none.

  8. John Baez says:

    Here’s an article about the Toronto “guerrilla archiving” event, written before it happened on Saturday the 17th of December:

    • Roy Hales, Protecting America’s environmental websites & data from Trump, The ECOreport, 17 December 2016.

    It describes a bunch of Toronto people and what they’re doing, and then a bit about the event:

    U of T’s Patrick Keilty explains how this weekend’s guerrilla archiving event will proceed:

    “First, we need to identify vulnerable programs and then seed their URLs to the webcrawler of the End of Term project, which will make copies of those webpages. Second, we are researching and evaluating the many data repositories that the EPA has online: some of this data we know will be backed up and protected by laws, some data will be archivable at the Internet Archive through their webcrawler, and yet other sources of data will need to be identified as in need of saving at a library. Libraries, such as at the University of Pennsylvania, are arranging to become repositories of this kind of vulnerable data not easily preserved. We will be passing on what we build and research to our colleagues in other cities so that they can pick up where we have left off.”

    Somewhere we should be able to read about what actually happened!

  9. I have noted it elsewhere, but there’s a nice discussion of what a Trump administration could and could not do. Providing additional motivation and urgency for the Backup Project, it cites the case of a conservative Canadian PM who literally threw out many government reports from Environment Canada.

  10. More neat references, and things to which the effort could aspire:

    • The oceanographers’ SeaView project, including an interface, and a poster from the recent AGU meeting in San Francisco.
    • The End of Term Presidential Harvest project.
