Azimuth Backup Project (Part 1)

This blog page is to help organize the Azimuth Environmental Data Backup Project, or Azimuth Backup Project for short. This is part of the larger but decentralized, frantic and somewhat disorganized project discussed elsewhere:

Saving Climate Data (Part 2), Azimuth, 15 December 2016.

Here I’ll just say what we’re doing at Azimuth.

Jan Galkowski is a statistician and engineer at Akamai Technologies, a company in Cambridge, Massachusetts, whose content delivery network is one of the world's largest distributed computing platforms, responsible for serving at least 15% of all web traffic. He has begun copying some of the publicly accessible US government climate databases. On 11 December he wrote:

John, so I have just started trying to mirror all of CDIAC [the Carbon Dioxide Information Analysis Center]. We’ll see. I’ll put it in a tarball, and then throw it up on Google. It should keep everything intact. Using WinHTTrack. I have coordinated with Eric Holthaus via Twitter, creating, per your suggestion, a new personal account which I am using exclusively to follow the principals.

Once CDIAC is done, and checked over, I’ll move on to other sites.

There are things beyond our control, such as paper records, or records which are online but are not within visibility of the public.

Oh, and I've formally requested time off from work for the latter half of December so I can work on this while on vacation. (I have a number of other projects I want to work on in parallel, anyway.)

By 14 December he was wanting some more storage space. He asked David Tanzer and me:

Do either of you have a large Google account, or the “unlimited storage” option at Amazon?

I’m using WebDrive, a commercial product. What I’m (now) doing is defining an FTP map at a .gov server, and then a map to my Amazon Cloud Drive. I’m using Windows 7, so these appear as standard drives (or mounts, in *nix terms). I navigate to an appropriate place on the Amazon Drive, and then I proceed to copy from .gov to Amazon.

There is no compression, and, in order to be sure I don’t abuse the .gov site, I’m deliberately passing this over a wireless network in my home, which limits the transfer rate. If necessary, and if the .gov site permits, I could hardwire the workstation to our FIOS router and get appreciably faster transfer. (I often do that for large work files.)

The nice thing is I get to work from home 3 days a week, so I can keep an eye on this. And I’m taking days off just to do this.

I’m thinking about how I might get a second workstation in the act.

The Web sites themselves I’m downloading, as mentioned, using HTTrack. I intended to tarball-up the site structure and then upload to Amazon. I’m still working on CDIAC at ORNL. For future sites, I’m going to try to get HTTrack to mirror directly to Amazon using one of the mounts.

I asked around for more storage space, and my request was kindly answered by Scott Maxwell. Scott lives in Pasadena, California, and used to work for NASA: he even had a job driving a Mars rover! He is now a site reliability engineer at Google, and he works on Google Drive. Scott is setting up a 10-terabyte account on Google Drive, which Jan and others will be able to use.

Meanwhile, Jan noticed some interesting technical problems: for some reason WebDrive is barely using the capacity of his network connection, so things are moving much more slowly than they could in theory.

Most recently, Sakari Maaranen offered his assistance. Sakari is a systems architect at Ubisecure, a firm in Finland that specializes in identity management, advanced user authentication, authorization, single sign-on, and federation. He wrote:

I have several terabytes worth in Helsinki (can get more) and a gigabit connection. I registered my offer but they [the DataRefuge people] didn’t reply though. I’m glad if that means you have help already and don’t need a copy in Helsinki.

I replied saying that the absence of a reply probably means that they’re overwhelmed by offers of help and are struggling to figure out exactly what to do. Scott said:

Hey, Sakari! Thank you for the generous offer!

I’m setting these guys up with Google Drive storage, as at least a short-term solution.

IMHO, our first order of business is just to get a copy of the data into a location we control—one that can’t easily be taken away from us. That’s the rationale for Google Drive: it fits into Jan’s existing workflow, so it’s the lowest-friction path to getting a copy of the data that’s under our control.

How about if I propose this: we let Jan go ahead with the plan of backing up the data in Drive. Then I'll evaluate moving it from there to whatever other location we come up with. (Or copying instead of moving! More copies is better. :-) How does that sound to you?

I admit I haven’t gotten as far as thinking about Web serving at all—and it’s not my area of expertise anyway. Maybe you’d be kind enough to elaborate on your thoughts there.

Sakari responded with some information about servers. In late January, U. C. Riverside may help me with this—until then they are busy trying to get more storage space, for wholly unrelated reasons. But right now it seems the main job is to identify important data and get it under our control.

There are a lot of ways you could help.

Computer skills. Personally I’m not much use with anything technical about computers, but the rest of the Azimuth Data Backup gang probably has technical questions that some of you out there could answer… so, I encourage discussion of those questions. (Clearly some discussions are best done privately, and at some point we may encounter unfriendly forces, but this is a good place for roaming experts to answer questions.)

Security. Having a backup of climate data is not very useful if there are also fake databases floating around and you can't prove yours is authentic. How can we create a kind of digital certificate showing that our database matches what was on a specific website at a specific time? If someone here has the expertise, we should do this.

Money. If we wind up wanting to set up a permanent database with a nice front end, accessible from the web, we may want money. We could do a Kickstarter campaign. People may be more interested in giving money now than later, unless the political situation immediately gets worse after January 20th.

Strategy. We should talk a bit about what to do next, though too much talk tends to prevent action. Eventually, if all goes well, our homegrown effort will be overshadowed by others, at least in sheer quantity. About 3 hours ago Eric Holthaus tweeted “we just got a donation of several petabytes”. If it becomes clear that others are putting together huge, secure databases with nice front ends, we can either quit or—better—cooperate with them, and specialize on something we’re good at and enjoy.



75 Responses to Azimuth Backup Project (Part 1)

  1. John Baez says:

    Regarding security, Andres Soolo gave the following advice on G+:

    For a decentralised project, I would think that everybody who handles data acquisition should have a GPG (or equivalent) keypair and sign the data files, or at least lists of strong checksums of the data files (such as SHA-256 or up; MD5 is already broken, SHA-1 is probably breakable by the NSA on a one-off basis right now and will be routinely breakable by a moneyed public within a few years). Then, whatever happens to the files afterwards, integrity breaks can be easily detected. Other archivists won't have to do anything but archive the checksum lists (if any), the public keys for signature checking, and the key signatures of people who trust that the acquisitors will do a good job, so as to link into the Web of Trust, together with the main data itself.

    It may be useful to know that a BitTorrent file of a dataset contains cryptographic hashes (using SHA-1, so not good for the long term) of the data, albeit in a somewhat idiosyncratic way. So releasing torrents of everything may serve as a sort of stop-gap measure, until a better process is in place.
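
    As a minimal sketch of that workflow in shell, assuming GnuPG is set up with a keypair and the mirrored dataset lives under ./cdiac/ (both of those are placeholders):

    # compute SHA-256 checksums for every file in the mirrored dataset
    find cdiac/ -type f -print0 | xargs -0 sha256sum > cdiac.sha256
    # sign the checksum list with the acquisitor's GPG key (detached, ASCII-armored)
    gpg --armor --detach-sign cdiac.sha256
    # anyone holding the public key can later verify integrity:
    gpg --verify cdiac.sha256.asc cdiac.sha256 && sha256sum --quiet -c cdiac.sha256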

    • John Baez says:

      Scott Maxwell replied:

      I’m thinking along similar lines. SHA-256 hashes mirrored in independent locations will be one of my first orders of business after we get the data backed up.

      I’ll probably also look into storing a copy on Resilio, which I’ve coincidentally just started using myself.

    • I understand little of this or why it works. Am willing to learn, though. Can the hashes of Merkle Trees be signed?

  2. Peter Morgan says:

    Reading this BBC report, http://www.bbc.com/news/science-environment-38322594, I noted the response given by a skeptic named Marcel Crok. You mention "fake" databases as an issue above, but Marcel Crok's comments brought to mind that there is also a distinction to be drawn between raw data and model-specific analyses of that raw data. It may be impossible to reconstruct raw data to test new models if only model-specific analyses of the raw data are available. It's usually the case that model-specific analyses are much smaller than raw datasets, and often only model-specific analyses are published, so there may be a temptation not to save raw data. Scientists may also be reluctant to publicly archive the most raw of their raw data (where I have some experience with experiments that some find contentious, with Bell-EPR raw datasets, public access is now considerably more restricted than in the early days, now always by request, AFAIK).

    From a skeptic's PoV (assuming we wish to, or think it possible to, convince skeptics), model-specific analysis is a significant part of the problem with climate scaremongering, so that to them model-specific data is as good as no data at all. A skeptic's analysis of raw data, OTOH, which a scientist might call a "fake" database, can be regarded by those who want to derive different conclusions from the raw data as "adjusted under different model assumptions".

    This is only to point out an aspect of your security bullet, not to advocate a skeptic PoV.

    • John Baez says:

      The more data we have, the better. Ideally we would like to save “raw data”, or something as close to “raw data” as possible, as well as all analyzed data and the software that did the analysis. Thus, we are trying to back up all such material that may only reside on US federal government databases. If you find something that should be saved, let us know.

      The Berkeley Earth project was an independent reanalysis of temperature data, undertaken to check previous analyses:

      The Berkeley Earth Surface Temperature Study has created a preliminary merged data set by combining 1.6 billion temperature reports from 16 preexisting data archives. Whenever possible, we have used raw data rather than previously homogenized or edited data. After eliminating duplicate records, the current archive contains over 39,000 unique stations. This is roughly five times the 7,280 stations found in the Global Historical Climatology Network Monthly data set (GHCN-M) that has served as the focus of many climate studies. The GHCN-M is limited by strict requirements for record length, completeness, and the need for nearly complete reference intervals used to define baselines. We have developed new algorithms that reduce the need to impose these requirements (see methodology), and as such we have intentionally created a more expansive data set.

      The Berkeley Earth data set is here. Some of this data originates from the US government, but the entire data set appears to be stored at U. C. Berkeley, and thus it can be considered ‘backed up’ for the purposes of this exercise.

      (I will assume the Berkeley Earth team is smart enough to have multiple backups of their dataset.)

  3. I have made a ten terabyte storage server available for the project in Germany. Jan has access.

    For data integrity, what Andres Soolo says about signed SHA-256 hashes above is correct. They would be generated automatically of course, if we decide to use that. Someone criticized that it would be difficult because of the number of files, but that doesn’t really matter. Automation doesn’t care how many files you have.

  4. wostraub says:

    I applaud the efforts of people dedicated to preserving climate data in what we now call (cliche coming) post-fact America. Sadly, it reminds me of the few German citizens who scurried around Nazi book bonfires trying to save what they could, the efforts of the same Nazis to destroy Einstein and “Jewish physics,” and the elderly book preservation group in the dystopian 1973 film Soylent Green.

    Equally sadly, President Trump could simply make all climate data illegal by fiat, meaning that all those multi-terabyte data preservers would be in violation of federal law.

    To paraphrase Beethoven: Has it come to this? It has come to this.

  5. Bruce Smith says:

    President Trump could simply make all climate data illegal by fiat …

    Fortunately he will only be the US president, not dictator, so I don’t think he (or any president) could legally do that, assuming the data is currently legal to copy and store.

  6. Really surprised that something as important as safeguarding, securing and organizing this type of data falls to a spontaneous community of volunteers. A Kickstarter campaign is certainly worthwhile. It is very easy to donate some money. It is much harder to do actual work. Perhaps a new organization needs to be formally born out of this effort? So that in the future there is an organization owning that problem?

    That said, is it worth it for me as an individual (small business) with a server with 1 TB of free space (which I could possibly extend) to make the effort to back up data? Yes, I run my own small business, but that's hardly a reputable organization to be entrusted with data security. Helping out with hardware resources etc. only makes sense if you are a moderately large organization, no? Let me know if this assumption is wrong, and I'll get to work! :)

  7. I want to get you involved in the Azimuth Environmental Data Backup Project, so click on that for more. But first some background.

  8. John Baez says:

    Jan Galkowski wrote:

    Well!

    That was a long, hard day. But I think there’s a process in hand. What I did:

    • Found out that WebDrive and ExpanDrive were too small-scale to handle this. I paid some monies to find this out, which I'm not likely to get back. (Not much: $100 total.)
    • I have centralized on using WinSCP as the primary tool for copying.
    • Ruled out using a Google Drive app and an Amazon Drive app on my workstation as a way of transferring files. It's awkward, and, for remote Google Drives, I would need to log in to each one each time.
    • Evaluated GNU Wget as a Web site scraper. It's pretty good, meaning it gets all the files accurately, but…
    • Found some .gov sites have certificate issues, at least from Wget's perspective. I don't yet know how to tell it to ignore the certificate discrepancy, even with the command line option that's supposed to do that.
    • HTTrack doesn't have the certificate issues, but it copies files incompletely. I suspect that's because I found no way to tell it to pause, say 3 seconds, between copies. You can do that with Wget.
    • I discovered that my NAS has cloud syncing software, so I can map a folder on it to each of any number of cloud sites, whether Google or Amazon. So I can map those folders as mounts from my Win7 workstation. And voila! If I WinSCP from a .gov site, I can dump it into those folders. And this means my workstation takes care of the .gov-to-folder transfer and the NAS does folder-to-cloud.
    • Copying several Web sites.
    • Copying two big FTP sites up to the cloud.
    • Was able to partly get to Sakari's Germany site.

    I’ll give details, tomorrow I hope.

    Letting these run, and taking a break for tonight.
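
    For reference, the combination Jan eventually lands on below pairs those two needs, a pause between requests and skipping the certificate check; in minimal form it looks like this (the URL is a placeholder):

    wget --recursive --no-parent --wait=3 --no-check-certificate https://data.example.gov/some/directory/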

  9. domenico says:

    It could be useful to have some batch executables that contain all the wget commands needed to get the complete web pages of the important databases, together with the size of the download, for example

    nasa.gov:

    800 GB size

    wget -r -np nasa_directory_1
    wget -r -np nasa_directory_2

    so that if a person wants to download the complete database, they can simply execute the batch program on their computer; afterwards, a torrent sharing of the files could be useful, using the computers as seeders.
    That way a standard download, and protocol, could be used, and all the sources could be compared: the torrent seeders could serve as an alternative distribution list.
    The experience of previous downloads could be used to make the downloading procedure faster.
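
    A minimal sketch of such a batch file (the URLs and the 800 GB figure above are placeholders, not real values):

    #!/bin/sh
    # mirror each directory of the database politely, pausing between requests
    wget -r -np --wait=3 https://data.example.gov/directory_1/
    wget -r -np --wait=3 https://data.example.gov/directory_2/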

  10. John Baez says:

    Jan Galkowski wrote:

    Good morning!

    I have completed the copying of the Web side of GISTEMP in final, and of all the data on CDIAC ORNL (ftp://ftp.ncdc.noaa.gov/pub/data/paleo/) and am now moving both to Sakari’s data refuge server. Unfortunately, WinSCP does not permit opening two remote locations in the two commander panes, so this needs to be a two-step process.

    The Web spidering of http://cdiac.ornl.gov is progressing, as is the Web spidering of http://www.ndbc.noaa.gov. These are being spidered using Wget from my home box, and then I’ll copy to Sakari’s data refuge server. I tried shelling into Sakari’s data refuge server but can’t, and I think Sakari explained that. If there were a Linux server we could use some place that was a proper destination for Azimuth data, I’d SSH there and do the Wgets directly, using Screen as Sakari said. But it’s okay.

    I have struggled for days to get data from ftp://eclipse.ncdc.noaa.gov/pub. Last night I left a large WinSCP transfer trying to copy those files, and it failed with a permissions problem some time in the night. The log records things like

    2016-12-16 23:32:23.297 Connecting to 205.167.25.154:61920 …
    . 2016-12-16 23:32:23.333 Transfer channel can’t be opened. Reason: An attempt was made to access a socket in a way forbidden by its access permissions.
    . 2016-12-16 23:32:23.333 Could not retrieve directory listing
    . 2016-12-16 23:32:23.333 Getting size of directory "pub/isccp/b1/.D2790P/images/1979/261"
    . 2016-12-16 23:32:23.333 Retrieving directory listing…
    > 2016-12-16 23:32:23.333 CWD /pub/isccp/b1/.D2790P/images/1979/261/
    . 2016-12-16 23:32:38.873 Timeout detected. (control connection)
    . 2016-12-16 23:32:38.873 Could not retrieve directory listing
    . 2016-12-16 23:32:38.873 Connection was lost, asking what to do.
    . 2016-12-16 23:32:38.873 Asking user:
    . 2016-12-16 23:32:38.873 Lost connection. ("Timeout detected. (control connection)","Could not retrieve directory listing")
    . 2016-12-16 23:32:43.953 Connecting to eclipse.ncdc.noaa.gov …
    . 2016-12-16 23:32:43.977 Connected with eclipse.ncdc.noaa.gov. Waiting for welcome message…
    2016-12-16 23:32:44.128 USER anonymous
    2016-12-16 23:32:44.157 PASS ************************************
    < 2016-12-16 23:32:44.188 530 Sorry, the maximum number of clients (2) from your host are already connected.
    . 2016-12-16 23:32:44.188 Disconnected from server
    . 2016-12-16 23:32:44.188 Connection failed.

    and I have made the entire log available at:

    https://drive.google.com/file/d/0B3Nnyie7hrIucXpzY3lzVng0N2c/view?usp=sharing

    I feel bad about this, because that FTP site itself has over a TB of data.

    Meanwhile, I'm moving on. Doing (from my own cryptic notes):

    http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html

    ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis/

    wget --wait=5 -nv -4 --output-file=esrl-noaa-gridded-ncep-gov-errors.log -t 40 -nc -r --no-check-certificate -p http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html

    12/17/2016 12:28:30 PM

    http://www.esrl.noaa.gov/gmd/dv/ftpdata.html
    ftp://aftp.cmdl.noaa.gov/ (will be on importing migrate)

    http://www.fs.fed.us/nrs/atlas/

    wget --wait=5 -nv -4 --output-file=esrl-noaa-gmd-dv-gov-errors.log -t 40 -nc -r --no-check-certificate -p http://www.esrl.noaa.gov/gmd/dv/ftpdata.html

    wget --wait=5 -nv -4 --output-file=fs-fed-nrs-atlas-us-errors.log -t 40 -nc -r --no-check-certificate -p http://www.fs.fed.us/nrs/atlas/

    – Jan
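
    For reference, the "SSH in and run Wget under Screen" approach Jan mentions would look roughly like this, assuming a Linux destination box with screen installed (host and URL are placeholders):

    ssh user@backup-server.example.net
    screen -S noaa-mirror                            # start a named screen session on the server
    wget --wait=5 -r -np http://www.ndbc.noaa.gov/   # run the crawl inside it
    # detach with Ctrl-a d; the download keeps running after logout
    # reattach later with: screen -r noaa-mirror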

  11. I am currently downloading a copy of the University of Idaho Gridded Surface Meteorological Data (UofI METDATA) to Germany. The source server seems to be limiting the transfer rate to 5.45 Mbps, so this process is going to take about three months, assuming the data set is as large as they said (approx. 5 TB) on the original DataRefuge worksheet, and that there are no irrecoverable interruptions.

    Below is the command that I am using. This is the first time I am using it. Let me know if you see anything wrong. Web crawling is not a part of my usual day job, so I have limited experience downloading entire sites.

    Note that if you plan on using a similar command, it is bypassing robots.txt and would be considered ill-behaved. My excuse is that we need backups of climate data and have very limited time available.

    wget --wait=12 --random-wait --prefer-family=IPv4 --verbose=on --tries=40 --timestamping=on --recursive --level=inf --no-remove-listing --output-file=climate.nkn.uidaho.edu_METDATA.log -e robots=off -Umozilla --no-check-certificate -H -Dclimate.nkn.uidaho.edu,northwestknowledge.net http://climate.nkn.uidaho.edu/METDATA/ https://www.northwestknowledge.net/metdata/data/ http://thredds.northwestknowledge.net:8080/thredds/reacch_climate_MET_catalog.html

    For explanation of the command line parameters, see:
    https://www.gnu.org/software/wget/manual/wget.html

    Do change the addresses, log file name, and host spanning limits as appropriate for your targeted site.

    • The transfer seems to hang from time to time (infrequently). Added the following command-line parameters to allow wget timeout and retry:

      --dns-timeout=10 --connect-timeout=20 --read-timeout=120

    • Do remove the parameters: -e robots=off -Umozilla

      It most likely works better without those, unless you have some very special reason. I shouldn't have used them.

    • There has been an order-of-magnitude increase in the transfer speed. Excellent! I am getting sustained rates of 67.5 Mbps, which means the download time could be down to a week or two instead of months.

      wget --dns-timeout=10 --connect-timeout=20 --read-timeout=120 \
      --wait=12 --random-wait --progress=dot:mega \
      --prefer-family=IPv4 --tries=40 --timestamping=on \
      --recursive --level=inf --no-remove-listing \
      --output-file=climate.nkn.uidaho.edu_METDATA.log \
      --no-check-certificate \
      -H -Dclimate.nkn.uidaho.edu,northwestknowledge.net \
      https://www.northwestknowledge.net/metdata/data/ \
      http://thredds.northwestknowledge.net:8080/thredds/reacch_climate_MET_catalog.html

    • If your target site uses FTP, remember to use the --follow-ftp option.

    • John Baez says:

      Sakari wrote:

      I am currently downloading a copy of the University of Idaho Gridded Surface Meteorological Data (UofI METDATA) to Germany.

      If the data is stored at the University of Idaho, as the address http://climate.nkn.uidaho.edu/METDATA/ suggests, I don't believe this is data that the new president could delete. Unless new laws are passed, the US federal government doesn't have any direct control of the University of Idaho databases, or any other university databases. The data at most risk is all on .gov sites, e.g.

      http://cdiac.ornl.gov
      http://data.giss.nasa.gov/gistemp/
      https://pds.nasa.gov/
      https://tidesandcurrents.noaa.gov/sltrends.html

      and many others. The distinguishing feature is “.gov”. These .gov sites are under direct control of the US federal government.

      So, for your next choices I urge that you copy .gov sites.

        • Regarding "I don't believe this is data that the new president could delete", I believe "deletion" is just a codeword for an equivalence class of possible scenarios, all of which are bad. For instance, a government (President + Congress) could declare any data produced in part or in whole using federal funds (even grants) to be copyrighted by the government and so to require written permission to share or otherwise circulate. This is completely contrary to present practice and tradition and makes no sense, but if the purpose is to inhibit and frustrate progress, that would do it. It is an extension of the restrictions attempted under the G. W. Bush administration on scientists on the government payroll speaking to media, a situation which James Hansen defied and got into hot water over. The same was implemented, for a short time, by a government in Canada.

        In addition to stopping work on climate science, and effectively shutting down the practice of science on these subjects — and on many subjects — in part or in whole, the idea is to control what the public knows and thinks about attribution of weather and other events and their causes. By denying access to datasets compiled using federal funds via legal means, the perfectly sensible challenge to a specific claim about attribution, that it be reproducible, can succeed from silence.

        By effectively “publishing” these data by copy, when the data are, presently, available to the general public, I see the project as frustrating any attempt to do such a thing. We can argue, even if we don’t get every last file, that restricting access to such data is silly because it is not something which the putative copyright holder can enforce, because of circumstances. And if they cannot enforce it, they won’t enforce it, and as in other cases of copyright (Disney’s stuffed doll characters, for instance), without enforcement, the copyright is meaningless.

        • John Baez says:

          Jan wrote:

          For instance, a government (President + Congress) could declare any data produced in part or in whole using federal funds (even grants) to be copyrighted by the government and so to require written permission to share or otherwise circulate.

          Right, but it’s hard to see how this would happen suddenly without warning unless even worse things had happened first. A .gov website run by the executive branch (e.g. NASA, NOAA, and most or all other .gov websites on this list) might be closed down quite quickly—and legally if there are no laws mandating its existence.

          Anyway, we could continue to argue about this, but I’ll moderate my position as follows: while backing up all climate data is good, I believe data at US federal government websites under the control of the executive branch should come first, right now.

          (Unless there’s something going on in Idaho that I don’t know about.)

        • Not arguing. Just airing my view FWIW. I don’t know what will happen.

        • Jon says:

          It would be difficult for the gov't to get the courts to go along with enforcement of such a copyright, IMVHO. We are not quite in danger of emulating Colombia yet.

      • I can certainly start downloading those government sites, but how large are they? If pds.nasa.gov is larger than 5 TB, it will fill up the current storage box and interrupt the already started download. It's risky to begin downloading something when you don't know how large it is.

        The last link, https://tidesandcurrents.noaa.gov/sltrends.html, that you gave above gives error 404 – Page Not Found.

        The first two links seem to have already been downloaded by Jan, assuming his spider configuration was complete.

  12. I am downloading ftp://hdsc.nws.noaa.gov/pub from

    http://hdsc.nws.noaa.gov/hdsc/pfds/

    There are some PDFs on the http server, and those download quickly, but the FTP server is the one containing a big data set that is taking longer. It’s available here:

    http://www.kobrix.com/hdsc.nws.noaa.gov

    I'm downloading from a server, not a home computer, and last night I stupidly forgot to use 'nohup', so the download was interrupted when the SSH session closed. The command I'm using is:

    nohup wget -k -m -np -U Mozilla ftp://hdsc.nws.noaa.gov/pub

    though the -k and -np options probably don't make sense for FTP; that's just what I used to get the HTTP portion.
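
    A variant that should survive the session closing, backgrounded and with output captured to a log (a sketch along the same lines, not a tested recipe):

    nohup wget -m -np ftp://hdsc.nws.noaa.gov/pub > hdsc.log 2>&1 &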

    I really wish there were more transparency and more open communication channels about what has been downloaded or not, and where help is needed. I don't mind going out and buying a few external drives and getting more data onto them.

    Another aspect of the long-term story is organizing the data in a meaningful way, using a relevant taxonomy/ontology (semantic web standards or not). This is something I know about and can help a lot with, as I'm sure can a lot of people with a software engineering background. I understand the priorities are different right now, but having an initial conversation about it might not be such a major distraction.

  13. We should probably have our own spreadsheet-type log listing what datasets have been done, where they are, what verification has been done, including things like Sakari's study of the Web sites before scraping, and comments.

    There's another side to this: the FTP sites. Sometimes these go well on their own, and sometimes files which look just like their neighbors won't transfer, even on a manual retry. WinSCP handles these quirks better than other software. Presumably there is comparable software for *nixes, like rsync.

    I’m intending to check out rclone, but my boxes are pretty busy.

    My LAN utilization is much better now: https://goo.gl/nHIHvl

    • John Baez says:

      Jan wrote:

      We should probably have our own spreadsheet-type log listing what datasets have been done, where they are, what verification has been done, including things like Sakari's study of the Web sites before scraping, and comments.

      Yes! I'm not sure if it should be publicly writable, privately writable, or privately writable but publicly readable. It would be nice to know that only "we" could write on it, and there might be advantages to knowing only we could read it. If we want something completely private, an ordinary spreadsheet together with a version control system like Subversion should do the job.

      • virtualadept says:

        The update submission process that neveragain.tech is using on Github might work: clone the repo, make the edits to the document, push the edits to your personal copy, then file a pull request from your repo to the core for inclusion. It's a little more manual than a GDoc, but it also adds a necessary gateway, via the review process, to prevent malicious edits from making it in.
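
        In concrete terms that flow is roughly the following (the repository names are placeholders):

        # one-time: fork the repository in the web UI, then clone your fork
        git clone git@github.com:yourname/inventory-fork.git
        cd inventory-fork
        # edit the document, then record and publish the change to your fork
        git commit -am "Add dataset X to the inventory"
        git push origin master
        # finally, open a pull request from your fork against the core repository in the web UI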

      • Let’s at least use Git, if we use version control.

    • I agree with the idea of another spreadsheet since it seems not so easy to coordinate with the organizer on the first Spreadsheet. It seems like there isn’t much transparency about what’s going on there and until that changes it might be worthwhile to have this parallel metadata as well.

    • John Baez says:

      Jan wrote:

      We should probably have our own spreadsheet-type log listing what datasets have been done, where they are, what verification has been done, including things like Sakari's study of the Web sites before scraping, and comments.

      Sakari has now created one. Just to help people find it, it's here:

      https://bitbucket.org/azimuth-backup/azimuth-inventory

      So far I’m finding this page the best one to look at:

      https://bitbucket.org/azimuth-backup/azimuth-inventory/issues

  14. virtualadept says:

    Currently backing up http://ftp.cdc.noaa.gov to a 24 TB RAID using wget. I'd very much like to get my backup of that site (when it's done) into the hands of people who have much better net connections than I do at present.

    • John Baez says:

      Thanks very much! Keep us posted, and when you feel sure you’ve got a full and accurate backup of that site please inform climatemirror.org (click on “Start Here to Help”). You can also tell them you want to deliver that data to them.

      In the short term, there will be quite a bit of confusion, so it's good to have multiple redundant copies and assume the worst. Later, things will become more organized, if enough well-organized people join the effort.

  15. John Baez says:

    On 16 December 2016 some crucial information was removed from the Google doc of databases that need to be saved—namely, the information on which ones have already been downloaded.

    But there’s some good news! I contacted Nick Santos about this and he wrote:

    Hi John,

    Thanks for the feedback – catching up on emails now – I’ve passed along the thoughts to the people stewarding the spreadsheet, and they indicated they’ll add a status column into the spreadsheet. They removed the columns that were there for privacy reasons, but see good reason to have some sort of official status for each. Probably won’t happen today, but tomorrow or Monday.

  16. wires0 says:

    https://ipfs.io would be the right system to store this on; it is a content-addressed store using Merkle trees.

    Linked (meta-) data can be used to describe and relate datasets (for instance https://github.com/ipld).

    If done properly CKAN http://ckan.org/ could be used as an interface on top of this, avoiding a lot of coding.

    One still needs to provide storage for the big datasets, but there is also Filecoin from the IPFS guys http://filecoin.io/ which could provide a decentralised alternative to this.
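
    For concreteness, getting a mirrored dataset into IPFS is roughly this, assuming the ipfs command-line tool is installed (the directory name is a placeholder):

    ipfs init                  # one-time local repository setup
    ipfs add -r cdiac-mirror/  # recursively add the dataset; prints a content hash for each file
    ipfs daemon                # run a node so other peers can fetch the added content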

    I'd be happy to help implement a system on top of this, time permitting.

    Best,
    Jelle

    • John Baez says:

      Hi, Jelle! I'd love your help. I'll send you an email putting you in touch with the rest of the gang. I think an ambitious but useful goal would be to develop a server that backs up some of the most important US federal government climate data, with guarantees that the backup matches the database at the time it was made. I don't quite understand how to do this if the database is later deleted, but people here have ideas, and I bet you have ideas too. If the method turned out to be easy once somebody explained it, we might help other people back up these (and other) databases.

  17. Noah Friedman says:

    I'm not sure what specific skills you may need, but I have Unix/Linux admin, networking admin and programming, RDBMS programming and some admin, and a little crypto and signing experience. I don't have any storage or hosting resources to offer, but I can contribute financially for someone else's.

    • John Baez says:

      Thanks! In the short term I plan to help pay for things, so a bit of help from you would be great. In a while I’d like to set up a Kickstarter campaign. As far as programming goes, I’ll send an email to you and the gang asking what you might do.

      • For some reason, the blog didn’t show your reply when I wrote my last comment. Let’s wait for your Kickstarter campaign then, if you are now setting it up. No need to send donations directly to anyone’s PayPal account, if we have Kickstarter.

    • I have already set up a 10 TB storage box and can set up more for people who contribute the cost of 50 euros per month, with a proper server to be set up next. Once we have the server, we can add more storage on demand (= per funding). My PayPal account is sam@iki.fi. I am paying for the current 10 TB storage box myself. Any donations will be used to add capacity and to download more data sets.

  18. Thank you for this project! I appreciate that it protects an important scientific resource, but also that it serves as a kind of public protest, highlighting the reality that these precious resources are no longer safe.

    I wonder if this occasion warrants a more dramatic display of protest? I'm thinking explicitly about the role that social media has played in bringing us to this point. All the major tech stacks have shown an incredible amount of deference to Trump. Jeff Bezos, Larry Page, Tim Cook, and Peter Thiel (a board member of FB and a vocal Trump supporter) have all met publicly with Trump and legitimized his authority.

    http://www.huffingtonpost.com/entry/trump-meets-with-silicon-valley-tech-titans-at-his-manhattan-tower_us_5851c8e2e4b02edd4115c79b

    There is a deep tension in attempting to organize a political resistance using the very same tools that empower the opposition. On the other hand, organizing a mass exodus from mainstream social media might function as a sufficiently symbolic gesture to make clear the risks at stake.

    I’ve thought about leaving my mainstream social networks before, but I don’t want to abandon my community, and I don’t think my network would come with me if I left. I don’t have enough influence or charisma to lead an exodus myself.

    But maybe the Azimuth project does? Azimuth has credibility among the science community, and it represents the values of a broad range of concerned citizens. I wonder if there is interest in holding a National Deactivation Day, where everyone in the community agrees to shut down their FB and Google profiles and stage a mass digital walk out to another social networking service. If downloading 10TB of science data for offsite storage makes a political statement, maybe losing 10,000 active users sounds more like the roar of an angry crowd.

    10,000 concerned scientists leaving Facebook and Google+ in one fell stroke could easily be framed in the media as a “brain drain” event that makes vivid the threats motivating the Backup Project, and would underscore the way that social media has compromised our intellectual integrity.

    If there is interest in staging a social media walk out in conjunction with the Azimuth effort to offload data, I’d be happy to help organize, coordinate, and advocate for the effort, and to shut down my own streams on the day of exodus. Of course I’d need help, but this also gives everyone another way to contribute to the effort. Perhaps I can look into running a Diaspora pod to collect the digital refugees and seed alternative networks that sit off the major stacks.

    What does everyone think?

    • John Baez says:

      Unfortunately I seem to be up to my neck in helping out this data backup project! So, I can’t do too much to help this walkout you’re proposing, but I’ll announce it and join it if it seems to have a groundswell of support (whatever that means).

  19. domenico says:

    I am thinking that the problem will be greater in the future.
    There are many nations that could have a problem with politics controlling the diffusion of climate change data and information (there have been some problems in Canada, Australia and America).
    So a supranational organization should, in the future, distribute the current open data, in collaboration with the national agencies: it is only an open distribution of data, a distribution of knowledge, which it is important to preserve.
    I am thinking that the United Nations, with private funds (so as not to depend on national funds, which could influence the choices: the cost of downloading and administration is very low for an agency), could distribute data that could be certified; but that would take some time, and we have an immediate problem with the results of this election, so this effort is right; but it is worthwhile to think about the future.

  20. John Baez says:

    Elsewhere, Scott Maxwell wrote:

    My backup of cdiac.ornl.gov has completed (~ 54GB downloaded). I’ll start uploading it to our shared space, then, since nobody seems to much care *what* we fetch as long as we fetch *something*, I’ll start downloading something else that’s going neglected. :-)

    I’m also generating SHA-256 checksums to be uploaded along with the data, and probably also kept in a separate repository (maybe in our Google Drive space).

    • John Baez says:

      I replied:
      Scott Maxwell wrote:

      My backup of cdiac.ornl.gov has completed (~ 54GB downloaded).

      Excellent!

      I’ll start uploading it to our shared space, then, since nobody seems to much care *what* we fetch as long as we fetch *something*, I’ll start downloading something else that’s going neglected. :-)

      I care. Please consult the new (partial) list of databases that have already been backed up, and avoid doing those:

      https://climate.daknob.net/

      Also, I suggest focusing on .gov sites at first.

      I’m also generating SHA-256 checksums to be uploaded along with the data, and probably also kept in a separate repository (maybe in our Google Drive space).

      Excellent!

  21. I have created us a Bitbucket.org public Git repository for tracking the backup effort:
    https://bitbucket.org/azimuth-backup/azimuth-inventory

    Please read Git tutorials and set up your Git clients. Also create SSH PKI keys and Bitbucket.org accounts. Add your SSH keys to your Bitbucket.org account.

    You can also use the Bitbucket repository issue tracking and wiki even without Git client.

    Please create a wiki page in there to track which sources we are backing up.

    • The Azimuth inventory wiki home page is here:
      https://bitbucket.org/azimuth-backup/azimuth-inventory/wiki/Home

      Let me know your Bitbucket account names, so I can give you access permissions.

      The above repository is public and intended for tracking our shared backup effort.

      There will be separate repositories for private server configuration data later on.

    • I hope to have all of my part of the effort documented on the Bitbucket Wiki by close of day Tuesday, along hopefully with all keys and things set.

      What's a PKI key and how is it used? I know people at work use them for locking up sections of networks handling things like credit cards, but I don't work with those directly.

      • Please always refer to the best existing tutorials on the web instead of writing new ones. Only when you cannot find an existing tutorial should you create new instructions. In your instructions, just give links to the best documents that already exist.

        If you are on Windows, install the PuTTY package, for example using the “Windows MSI installer package for everything except PuTTYtel”, but use any approach you like.

        http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

        Generate keys with puttygen and load them to the PuTTY key agent, Pageant.

        You can also use WinSCP for file transfer. It is compatible with Pageant, so you can use your keys with both simply by loading them to Pageant.

        On Linux, read the OpenSSH tutorials. Generate keys with ssh-keygen and load them to your agent with ssh-add.
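
        As a sketch, on Linux the whole key setup is just a few commands (the email address is only a label on the key):

        ssh-keygen -t rsa -b 4096 -C "you@example.org"   # writes ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub
        eval "$(ssh-agent -s)"                           # start an agent if one isn't running
        ssh-add ~/.ssh/id_rsa                            # load the private key into the agent
        cat ~/.ssh/id_rsa.pub                            # paste this public key into your Bitbucket account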

        Please search the web for more information. I am happy to assist more later on. Right now I need to go to work. See you later.

  22. Just a reminder: grabbing copies of Web sites via Wget is just one part of the story. There are also a dozen or more FTP directories to clone, and some of my efforts have been on those.

  23. I have now added Bitbucket.org access to Borislav and Jelle.
    Currently there are two repositories in the Azimuth Backup Project on Bitbucket.org:

    private_cmdb – for sensitive data that must not be open (for example, access controls). Please add your SSH keys to the authorized_keys file after you get Bitbucket.org access; a sketch of this is at the end of this comment.

    azimuth-inventory – public information for coordinating the backup effort. You can use the Bitbucket.org Wiki for this repo directly from the browser, without needing to learn Git. (You can also use it with Git.)

    https://bitbucket.org/azimuth-backup/azimuth-inventory/wiki/Home

    Please register your download effort in the Wiki, to avoid duplicate effort, and to share how you do it.
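
    As a sketch, registering your key in private_cmdb would go roughly like this once you have access (the remote URL assumes the usual Bitbucket SSH pattern):

    git clone git@bitbucket.org:azimuth-backup/private_cmdb.git
    cd private_cmdb
    cat ~/.ssh/id_rsa.pub >> authorized_keys   # append your public key
    git commit -am "Add my SSH public key"
    git push origin master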

    • Yeah, I'd like to add whatever I need to add to private_cmdb, but I am stuck at the Git part. I don't know if I need to download Git separately (it seems I do), and then add after cloning and then upload, or what. Naturally, it would help tremendously if I had studied the related docs, but that hasn't happened yet, except for peeks.

      I'll leave that until I can, and work towards documenting on the Wiki what I have done, as I can. I did not keep detailed records of the commands used in my earliest efforts, as is desired now.

  24. I really think we’d benefit from some live interaction, especially since “time is of the essence” here. It doesn’t look easy to create an open chat room using any of the big player chat platforms. For example, not everybody has or wants a Google account, though hangouts would be perfect for me.

    With some searching, I found a service that I haven't used before and set up a room in case people are interested:

    http://www.chatzy.com/23286872238036

    The interface of this thing looks pretty old and clumsy, and it isn't ideal, but it shows up at the top of Google searches.

    Alternatively, I can easily set up an XMPP server with a chat group and a web page for the chat, or help embed it within the Azimuth website, since there is client software that supports WordPress (see https://conversejs.org/). Again, the idea would be to avoid the hassle of registering and/or installing client software. People should be able to just pop in and out using their browser, IMO.

    If anyone has better ideas, please share them.

    Boris

    • I will soon add more documentation in the Bitbucket Wiki (linked above) and repositories.

      We already have Hip Chat configured at:

      https://azimuth-backup.hipchat.com/chat/room/3420336

      It receives notifications from the other Bitbucket resources so you can easily see what others are working on.

      You may need Bitbucket and Atlassian accounts. All free of course.

    • virtualadept says:

      Slack seems to be what all the cool kids are using these days.

      • I use Slack at work. Eh, it’s okay, but insecure, and I far prefer something integrated into BitBucket.org which is what we are using.

          • I couldn't log in to HipChat from the link above; it doesn't even offer me the option to register. I'm sure I can figure it out, and if others settle on HipChat, I'll follow. But my point is that it should be easy and accessible. Atlassian's software is heavy and clumsy; it has improved a lot over the past year, but it's still pretty clumsy IMHO.

          I don’t think security is a concern at this point. We need it to be quick and easy, and open. In case we end up with trolls and/or impersonators, we could reconsider.

            I think Slack would be a much better idea; it does require a login, but at least the process is very smooth and quick. And it has a bunch of extra tools and integrations as well, and it would be solid in the long term in case this community outlives the immediate backing-up task.

          Let me know if you guys are ok with Slack and I’ll set it up.

        • We are already using Hip Chat. It has been documented, and integrated with Bitbucket, so that you already see all project changes there.

          Atlassian has different kinds of tools. The ones we have chosen are very lightweight.

  25. Add new data sources according to these instructions:

    Climate data sources – How to add an entry

    https://bitbucket.org/azimuth-backup/azimuth-inventory/wiki/Climate%20data%20sources

    This process allows us to coordinate the work of identifying and properly specifying new data sources before starting the download.

  26. Sakari Maaranen,

    It doesn't look like I have the option to reply to your last reply about chatting. OK, HipChat it is. How can I log in, then? With the link you've provided, there is no option to register, and my Bitbucket login credentials don't work.

    Best,
    Boris

    • John Baez says:

      When it seems you can’t reply to a comment, move up the comment tree, find the last comment you can comment on, and comment on that. Sorry.

      I have sent you information that should allow you to join our Hip Chat group.

      By the way, I’ll usually avoid live discussions because they eat up time and I’m pretty happy with communications as they are. You have not been cc’ed on most of our emails, but if we start doing that you may be satisfied with the information flow.

  27. Borislav Iordanov,

    I have clarified the instructions for joining Hip Chat at:

    https://bitbucket.org/azimuth-backup/azimuth-inventory

    Several people have already joined the room.

  28. Mehmet B. says:

    I've emailed this to Eric Holthaus personally (and he has acknowledged that the issue has been forwarded to people who do security). There needs to be a movement from the horse's mouth (the scientists or the gov sites) to put the SHA signatures into the blockchain.

    This is the only way to prevent, verifiably and definitively, any "fake" databases.

    I don’t know why this isn’t being discussed more.

    • Mehmet,

      I think there are two good reasons why this could be sticky.

      First, scientists in government looking ahead to a hostile administration might reasonably want to rock the boat as little as possible to maximize their job retention prospects. No one can properly fault them for having those concerns.

      Second, it is important, as I have written elsewhere, that this effort not be too well publicly coordinated, because the effort itself could be subjected to denial-of-service attack, were details known.

      Jan

    • While the blockchain as a technology can be used to provide an audit log for the data, I would still consider it a technology, not the technology, until it has developed further and matured enough not to require the complete chain all the way back to the genesis block.

      1. Message digests? Absolutely.
      2. Digital signatures? Of course.
      3. With timestamps? Certainly.
      4. The blockchain? Why not, among others.

      A sufficient level of trust does not require a mathematical proof of authenticity all the way back to the inception of any particular system.

      Having reasonable guarantees distributed widely enough in a decentralized fashion is enough. Future derivatives of blockchain technology and similar should rely on sufficiently many verifiable sources, instead of requiring an absolute chain to an absolute origin and everyone keeping a perfect record of all history up to the current point in time.

      A mature system will allow for an imperfect record and still work. Note that this can already be achieved to a reasonable level with message digests, digital signatures and timestamps alone, when widely deployed. Having a blockchain doesn't hurt, but it shouldn't be considered a necessary part of the publication process. We can do 1-3 in any case, with or without 4.

  29. As indicated elsewhere, this is not the kind of project where the best approach is to do a lot of version forks, to use a software development analogy, and then see what survives. It is, and needs to be, inherently syncretistic. Accordingly, I think it best to work together with those already doing this kind of thing.

    • @whut says:

      I talked to the person at the EarthCube booth when I was at the AGU early last month. I told him that I didn’t see anything in place on the web site and he agreed that there was nothing apart from architectural plans on how to get started. In other words, they are still looking for input from participants in the consortium.

      One place to look for advice is the PO.DAAC site maintained by NASA JPL. What they do is store volumetric data of the ocean, which is really a huge amount of data. I have worked with the maintainer of that site on another project and when I talked to him several years ago, he said most of his effort goes into maintenance of the data and making sure that users can read it with the available software. Once the data is stored, that’s only the start of the effort.

      I also talked to other JPL scientists at the AGU and they are doing interesting data mining work in conjunction with this data. So my take is that we continue to support NASA and tell our congress to continue to fund them.
