This blog page is to help organize the Azimuth Environmental Data Backup Project, or Azimuth Backup Project for short. This is part of the larger but decentralized, frantic and somewhat disorganized project discussed elsewhere:
• Saving Climate Data (Part 2), Azimuth, 15 December 2016.
Here I’ll just say what we’re doing at Azimuth.
Jan Galkowski is a statistician and engineer at Akamai Technologies, a company in Cambridge Massachusetts whose content delivery network is one of the world’s largest distributed computing platforms, responsible for serving at least 15% of all web traffic. He has begun copying some of the publicly accessible US government climate databases. On 11 December he wrote:
John, so I have just started trying to mirror all of CDIAC [the Carbon Dioxide Information Analysis Center]. We’ll see. I’ll put it in a tarball, and then throw it up on Google. It should keep everything intact. Using WinHTTrack. I have coordinated with Eric Holthaus via Twitter, creating, per your suggestion, a new personal account which I am using exclusively to follow the principals.
Once CDIAC is done, and checked over, I’ll move on to other sites.
There are things beyond our control, such as paper records, or records which are online but are not within visibility of the public.
Oh, and I’ve formally requested time off from work for latter half of December so I can work this on vacation. (I have a number of other projects I want to work in parallel, anyway.)
By 14 December he was wanting some more storage space. He asked David Tanzer and me:
Do either of you have a large Google account, or the “unlimited storage” option at Amazon?
I’m using WebDrive, a commercial product. What I’m (now) doing is defining an FTP map at a .gov server, and then a map to my Amazon Cloud Drive. I’m using Windows 7, so these appear as standard drives (or mounts, in *nix terms). I navigate to an appropriate place on the Amazon Drive, and then I proceed to copy from .gov to Amazon.
There is no compression, and, in order to be sure I don’t abuse the .gov site, I’m deliberately passing this over a wireless network in my home, which limits the transfer rate. If necessary, and if the .gov site permits, I could hardwire the workstation to our FIOS router and get appreciably faster transfer. (I often do that for large work files.)
The nice thing is I get to work from home 3 days a week, so I can keep an eye on this. And I’m taking days off just to do this.
I’m thinking about how I might get a second workstation in the act.
The Web sites themselves I’m downloading, as mentioned, using HTTrack. I intended to tarball-up the site structure and then upload to Amazon. I’m still working on CDIAC at ORNL. For future sites, I’m going to try to get HTTrack to mirror directly to Amazon using one of the mounts.
I asked around for more storage space, and my request was kindly answer by Scott Maxwell. Scott lives in Pasadena California and he used to work for NASA: he even had a job driving a Mars rover! He is now a site reliability engineer at Google, and he works on Google Drive. Scott is setting up a 10-terabyte account on Google Drive, which Jan and others will be able to use.
Meanwhile, Jan noticed some interesting technical problems: for some reason WebDrive is barely using the capacity of his network connection, so things are moving much more slowly than they could in theory.
Most recently, Sakari Maaranen offered his assistance. Sakari is a systems architect at Ubisecure, a firm in Finland that specializes in identity management, advanced user authentication, authorization, single sign-on, and federation. He wrote:
I have several terabytes worth in Helsinki (can get more) and a gigabit connection. I registered my offer but they [the DataRefuge people] didn’t reply though. I’m glad if that means you have help already and don’t need a copy in Helsinki.
I replied saying that the absence of a reply probably means that they’re overwhelmed by offers of help and are struggling to figure out exactly what to do. Scott said:
Hey, Sakari! Thank you for the generous offer!
I’m setting these guys up with Google Drive storage, as at least a short-term solution.
IMHO, our first order of business is just to get a copy of the data into a location we control—one that can’t easily be taken away from us. That’s the rationale for Google Drive: it fits into Jan’s existing workflow, so it’s the lowest-friction path to getting a copy of the data that’s under our control.
How about if I propose this: we let Jan go ahead with the plan of backing up the data in Drive. Then I’ll look evaluate moving it from there to whatever other location we come up with. (Or copying instead of moving! More copies is better. :-) How does that sound to you?
I admit I haven’t gotten as far as thinking about Web serving at all—and it’s not my area of expertise anyway. Maybe you’d be kind enough to elaborate on your thoughts there.
Sakari responded with some information about servers. In late January, U. C. Riverside may help me with this—until then they are busy trying to get more storage space, for wholly unrelated reasons. But right now it seems the main job is to identify important data and get it under our control.
There are a lot of ways you could help.
• Computer skills. Personally I’m not much use with anything technical about computers, but the rest of the Azimuth Data Backup gang probably has technical questions that some of you out there could answer… so, I encourage discussion of those questions. (Clearly some discussions are best done privately, and at some point we may encounter unfriendly forces, but this is a good place for roaming experts to answer questions.)
• Security. Having a backup of climate data is not very useful if there are also fake databases floating around and you can’t prove yours is authentic. How can we create a kind of digital certificate that our database matches what was on a specific website at a specific time? We should do this if someone here has the expertise.
• Money. If we wind up wanting to set up a permanent database with a nice front end, accessible from the web, we may want money. We could do a Kickstarter campaign. People may be more interested in giving money now than later, unless the political situation immediately gets worse after January 20th.
• Strategy. We should talk a bit about what to do next, though too much talk tends to prevent action. Eventually, if all goes well, our homegrown effort will be overshadowed by others, at least in sheer quantity. About 3 hours ago Eric Holthaus tweeted “we just got a donation of several petabytes”. If it becomes clear that others are putting together huge, secure databases with nice front ends, we can either quit or—better—cooperate with them, and specialize on something we’re good at and enjoy.