I want to get you involved in the Azimuth Environmental Data Backup Project, so click on that for more. But first some background.
A few days ago, many scientists, librarians, archivists, computer geeks and environmental activists started to make backups of US government environmental databases. We’re trying to beat the January 20th deadline just in case.
Backing up data is always a good thing, so there’s no point in arguing about politics or the likelihood that these backups are needed. The present situation is just a nice reason to hurry up and do some things we should have been doing anyway.
As of 2 days ago the story looked like this:
• Saving climate data (Part 1), Azimuth, 13 December 2016.
A lot has happened since then, but much more needs to be done. Right now you can see a list of 90 databases to be backed up here:
• Gov. Climate Datasets (Archive). (Click on the tiny word “Datasets” at the bottom of the page!)
Despite the word ‘climate’, the scope includes other environmental databases, and rightly so. Here is a list of databases that have been backed up:
By going here and clicking “Start Here to Help”:
you can nominate a dataset for rescue, claim a dataset to rescue, let everyone know about a data rescue event, or help in some other way (which you must specify). There’s also other useful information on this page, which was set up by Nick Santos.
The overall effort is being organized by the Penn Program in the Environmental Humanities, or ‘PPEHLab’ for short, headed by Bethany Wiggin. If you want to know what’s going on, it helps to look at their blog:
However, the people organizing the project are currently overwhelmed with offers of help! People worldwide are taking action in a decentralized way! So everything is a bit chaotic, and nobody has an overall view of what’s going on.
I can’t overstate this: if you think that ‘they’ have a plan and ‘they’ know what’s going on, you’re wrong. ‘They’ is us. Act accordingly.
Here’s a list of news articles, a list of ‘data rescue events’ where people get together with lots of computers and do stuff, and a bit about archives and archivists.
News
Here are some things to read:
• Jason Koebler, Researchers are preparing for Trump to delete government science from the web, Vice, 13 December 2016.
• Brady Dennis, Scientists frantically copying U.S. climate data, fearing it might vanish under Trump, Washington Post, 13 December 2016. (Also at the Chicago Tribune.)
• Eric Holthaus, Why I’m trying to preserve federal climate data before Trump takes office, Washington Post, 13 December 2016.
• Nicole Mortillaro, U of T heads ‘guerrilla archiving event’ to preserve climate data ahead of Trump presidency, CBC News, 14 December 2016.
• Audie Kornish and Eric Holthaus, Scientists race to preserve climate change data before Trump takes office, All Things Considered, National Public Radio, 14 December 2016.
Data rescue events
There’s one in Toronto:
• Guerrilla archiving event, 10 am – 4 pm EST, Saturday 17 December 2016. Location: Bissell Building, 4th Floor, 140 St. George St., University of Toronto.
There will be one in Philadelphia:
• DataRescuePenn Data Harvesting, Friday–Saturday 13–14 January 2017. Location: not determined yet, probably somewhere at the University of Pennsylvania, Philadelphia.
I hear there will also be events in New York City and Los Angeles, but I don’t know details. If you do, please tell me!
Archives and archivists
Today I helped catalyze a phone conversation between Bethany Wiggin, who heads the PPEHLab, and Nancy Beaumont, head of the Society of American Archivists. Digital archivists have a lot of expertise in saving information, so their skills are crucial here. Big wads of disorganized data are not very useful.
In this conversation I learned that some people are already in contact with the Internet Archive. This archive always tries to save US government websites and databases at the end of each presidential term. Their efforts are not limited to environmental data, and they save not only webpages but entire databases, e.g. data on FTP sites. You can nominate sites to be saved here:
• Internet Archive, End of Presidential Term Harvest 2016.
For more details read this:
• Internet Archive blog, Preserving U.S. Government Websites and Data as the Obama Term Ends, 15 December 2016.
This blog page is to help organize the Azimuth Environmental Data Backup Project, or Azimuth Backup Project for short. This is part of the larger but decentralized, frantic and somewhat disorganized project explained here:
• Saving Climate Data (Part 2), Azimuth, 15 December 2016.
Here I’ll just say what we’re doing at Azimuth.
Just a general piece of advice: backups are only useful if they are themselves accessible, tested, and verified. So a site that lists “what has already been backed up” might be useful for getting started, but once it’s “full”, what it really needs to list for each item that “has been backed up” is how a third party can verify that it was backed up correctly and can get and use the data themselves.
For example, if we were talking about a single file (a gross oversimplification): what is its URL, length, format, and a hash of its content? Then anyone can verify that they can access the backup, simply by doing so. If there are two backups, anyone can compare them for equality.
For a directory of many files, as long as some general information is given, anyone can verify the accessibility of randomly chosen files within it, and compare the same file from two purported backups.
For a “database” it’s more complicated, but that makes this verification all the more necessary. (A backup made by an amateur is especially likely to be incomplete or incorrect without them realizing it.)
Obviously the experts already know all this. I’m pointing it out just so you can easily notice whether a site that “lists which backups have been done” is missing this crucial info or not.
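As a concrete illustration of the single-file case, here is a minimal sketch in Python of the kind of fingerprint a backup listing could publish for each file: its length in bytes and the SHA-256 digest of its contents. The file name used here is just a placeholder; anyone holding a copy can recompute both values and compare them with the published ones, and two purported backups of the same file match exactly when their digests match.

import hashlib

def file_fingerprint(path, chunk_size=1 << 20):
    # Return (size in bytes, SHA-256 hex digest) of a local file,
    # reading it in chunks so very large files need not fit in memory.
    sha = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            size += len(chunk)
            sha.update(chunk)
    return size, sha.hexdigest()

# "ghcnd_all.tar.gz" is a placeholder file name; compare the printed values
# with the length and digest published alongside the backup listing.
size, digest = file_fingerprint("ghcnd_all.tar.gz")
print(size, digest)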
Digital archivists and others are getting on board, so I hope the issues you raise will be dealt with correctly.
By the way, the old list of ecological databases to be backed up is now topped with a huge message saying that less information is available there. I no longer see any information about whether each database has been backed up (and how many times, etc.). I don’t know where that information can be found.
That link is incorrect — it’s just a link to this post.
Fixed—thanks!
I’m surprised no one has just downloaded everything and posted it to a quick GitHub repo… you know, just to make sure it’s safe and backed up.
You sound like the economist who, seeing a $20 bill lying in the gutter, didn’t pick it up, reasoning that if it were real, someone would have already done so.
What you’re surprised hasn’t happened already is exactly what’s happening now. Many people are, as we speak, “just” downloading everything and letting others know they’ve done it. But “just” is not quite the right word when you’re trying to back up at least 90 databases, ranging from hundreds of gigabytes to hundreds of terabytes in size.
One big job is just finding all the databases. There is no perfect list of all US federal government environmental databases. The best list I know is the one that people are throwing together now (click the tiny word “Datasets” in blue on top).
Important progress! Here is a partial list of climate mirrors, and databases that have been backed up:
• The Climate Mirror Project.
This seems like a natural fit for truly decentralized technologies such as IPFS and blockchains, which provide a framework for uncensorable, tamper-proof applications. I’m a developer in this field; if you want to know more, AMA.
This is a copy of my comment to the Saving Climate Data (Part 1) post, just in case someone is reading here and didn’t see it there.
While the blockchain as a technology can be used to provide an audit log for the data, I would still consider it a technology, not the technology, until it has developed and matured enough not to require the complete chain all the way back to the genesis block.
Message digests? Absolutely.
Digital signatures? Of course.
With timestamps? Certainly.
The blockchain? Why not, among others.
A sufficient level of trust does not require a mathematical proof of authenticity all the way back to the inception of any particular system.
Having reasonable guarantees distributed widely in a decentralized fashion is enough. Future derivatives of blockchain technology and similar systems should rely on sufficiently many verifiable sources, instead of requiring an absolute chain to an absolute origin and everyone keeping a perfect record of all history up to the current point in time.
A mature system will allow for an imperfect record and still work. Note that this can already be achieved to a reasonable level with message digests, digital signatures and timestamps alone, when widely deployed. Having a blockchain doesn’t hurt, but it shouldn’t be considered a necessary part of the publication process. We can use the first three in any case, with or without the fourth.
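For concreteness, here is a minimal sketch of the first three ingredients without a blockchain, written in Python and assuming the third-party cryptography package; the file name is just a placeholder. It hashes an archived file, wraps the digest and a UTC timestamp in a small record, and signs that record with an Ed25519 key, so anyone holding the public key can later check that the record has not been altered.

import datetime
import hashlib
import json
from cryptography.hazmat.primitives.asymmetric import ed25519

# 1. Message digest of the archived file ("archive.tar.gz" is a placeholder).
with open("archive.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

# 2-3. A timestamped record of the digest, then a digital signature over it.
record = json.dumps({
    "sha256": digest,
    "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
}).encode()

# In practice the key pair would be long-lived, with the public half
# published somewhere people already trust.
key = ed25519.Ed25519PrivateKey.generate()
signature = key.sign(record)

# Verification raises InvalidSignature if the record or signature was altered.
key.public_key().verify(signature, record)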
So in addition to the detailed listing at:
https://goo.gl/O7uOjm
there is:
https://data.noaa.gov/dataset
which looks like it overlaps the former to some degree, but also offers datasets not mentioned there. I did not find a corresponding FTP site, although there may be some lower down in the hierarchy of datasets.
There is some documentation at:
http://www.nws.noaa.gov/tg/dataprod.php
and
https://www.ncdc.noaa.gov/isd/data-access
and
https://www.ngdc.noaa.gov/ftp.html
including FTP access to some (part?) of it at
ftp://ftp.ncdc.noaa.gov/pub/data/noaa
Do we trust the original spreadsheet? Or try to go after these? The nice thing about FTP access is that there is no need to understand the website structure in order to replicate files; the data can just be downloaded.
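To illustrate, here is a minimal sketch using Python’s ftplib to mirror one directory of the NOAA FTP site above. The subdirectory chosen (“2016”) is only an assumption about the layout of /pub/data/noaa; a real run would need error handling, resumption, and a check that each listed name is actually a file rather than a subdirectory.

import os
from ftplib import FTP

HOST = "ftp.ncdc.noaa.gov"
REMOTE_DIR = "/pub/data/noaa/2016"   # assumed subdirectory, purely illustrative
LOCAL_DIR = "noaa-2016"

os.makedirs(LOCAL_DIR, exist_ok=True)
ftp = FTP(HOST)
ftp.login()                          # anonymous login
ftp.cwd(REMOTE_DIR)
for name in ftp.nlst():              # names in the remote directory
    local_path = os.path.join(LOCAL_DIR, name)
    with open(local_path, "wb") as f:
        ftp.retrbinary("RETR " + name, f.write)
    print("saved", local_path)
ftp.quit()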
Jan wrote:
The original spreadsheet was randomly thrown together in haste by a bunch of people who aren’t much wiser than you and me. It’s probably missing a bunch of good sites, and it also includes some sites that probably shouldn’t be there, since they aren’t .gov. When Sakari asked the person running the University of Idaho Gridded Surface Meteorological Data site whether he could back that site up, that person replied:
So, we should use our initiative in identifying and backing up .gov databases that contain important climate/ecological information and aren’t on the original spreadsheet!
In short: go for it! It sounds like you found some good stuff. You can add it to the original spreadsheet, but also just start downloading.
Also, please log the sites you’ve already backed up, so we can know what we’ve done! You can do it here:
https://bitbucket.org/azimuth-backup/azimuth-inventory/wiki/Home
Partial information is better than none.
All the datasets I am working on have been logged there, as well as two other related blocking issues I am working on. That’ll be all from me for a while, until I can clear out the backlog.
Sorry, I hadn’t figured out how to see the entries on the azimuth-inventory wiki. I’m still not sure I understand the best way to see a list of all the databases we have backed up. Right now the only thing I know is to go to
https://bitbucket.org/azimuth-backup/azimuth-inventory/issues
Is that the best way?
That’s where I’m reporting all my progress, John.
Here’s an article about the Toronto “guerrilla archiving” event, written before it happened on Saturday the 17th of December:
• Roy Hales, Protecting America’s environmental websites & data from Trump, The ECOreport, 17 December 2016.
It describes a bunch of Toronto people and what they’re doing, and then a bit about the event:
Somewhere we should be able to read about what actually happened!
I have noted it elsewhere, but there’s a nice discussion of what a Trump administration could and could not do. It adds motivation and urgency to the Backup Project by citing the case of a conservative Canadian PM who literally threw out many government reports from Environment Canada.
More neat references, and things to which the effort could aspire.
• The oceanographers’ SeaView project, including an interface, and a poster from the recent AGU meeting in San Francisco.
• The End of Term Presidential Harvest project.
Pertinent: About HDF5.