Along with the bad news there is some good news:
• Over 380 people have pledged over $14,000 to the Azimuth Backup Project on Kickstarter, greatly surpassing our conservative initial goal of $5,000.
• Given our budget, we currently aim to back up 40 terabytes of data, and we are well on our way to this goal. You can see what we’ve done at Our Progress, and what we’re still doing at the Issue Tracker.
• I have gotten a commitment from Danna Gianforte, the head of Computing and Communications at U. C. Riverside, that eventually the university will maintain a copy of our data. (This commitment is based on my earlier estimate that we’d have 20 terabytes of data, so I need to see if 40 is okay.)
• I have gotten offers from two other people who say they too can hold our data.
I’m hoping that the data at U. C. Riverside will be made publicly available through a server. The other offers may involve the data being held ‘secretly’ until such time as it becomes needed; that has its own complementary advantages.
However, the interesting problem that confronts us now is: how to spend our money?
You can see how we’re currently spending it on our Budget and Spending page. Basically, we’re paying a firm called Hetzner for servers and storage boxes.
We could simply continue to do this until our money runs out. I hope that long before then, U. C. Riverside will have taken over some responsibilities. If so, there would be a long period where our money would largely pay for a redundant backup. Redundancy is good, but perhaps there is something better.
Two members of our team, Sakari Maaranen and Greg Kochanski, have thoughts on this matter which I’d like to share. Sakari posted his thoughts on Google+, while Greg posted his in an email which he’s letting me share here.
Please read these and offer us your thoughts! Maybe you can help us decide on the best strategy!
For the record, here are my views on our strategy for using the budget that the Azimuth Climate Data Backup Project now has.
People have contributed this money specifically to this effort.
Some non-government entities have offered “free hosting”. Of course the project should take any and all free offers to host our data. Accepting those offers would not spend our budget, however, and the hosts are still paying for the storage even if they offer it to us “for free”.
As far as spending goes, I think we should think in terms of 1) terabyte-months and 2) sufficient redundancy, and achieve those as cost-efficiently as possible. We should not just hand the money to any takers, but aim for the best bang for the buck. We owe that to the people who have contributed so far.
For example, if we burned through the cash quickly on expensive storage, I would consider that a failure. Instead, we must plan the best use of the budget towards our mission.
What we have promised people is that we will back up and serve these data sets with the money they have given us. Let’s do exactly that.
We are currently serving the mission at approximately €0.006 per gigabyte-month, at least for as long as we have volunteers working for free. The cost would be somewhat higher if we paid for professional maintenance, which is a reasonable assumption if we plan for long-term service; volunteer work cannot be guaranteed forever, even if it works for now.
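To make that rate concrete, here is a rough back-of-envelope sketch of how long the budget could last; the dollar-to-euro conversion and the round budget figure are illustrative assumptions, not exact numbers.

```python
# Back-of-envelope estimate of how long the budget lasts at our current rate.
# The exchange rate and round budget figure below are illustrative assumptions.
budget_usd = 14_000           # roughly what has been pledged so far
usd_to_eur = 0.95             # assumed exchange rate
data_gb = 40_000              # the 40-terabyte target, in gigabytes
eur_per_gb_month = 0.006      # our current storage and serving rate

monthly_cost_eur = data_gb * eur_per_gb_month                  # about €240 per month
months_funded = (budget_usd * usd_to_eur) / monthly_cost_eur   # about 55 months

print(f"≈ €{monthly_cost_eur:.0f}/month, ≈ {months_funded / 12:.1f} years of funding")
```

On those assumptions a single copy at the current rate lasts a bit over four years, before allowing for extra redundancy, professional maintenance, or serving costs.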
This is one view and the question is open to public discussion.
Some misc thoughts.
1) As I see it, we have made some promise of serving the data (“create a better interface for getting it”) which can be an expensive thing.
UI coding isn’t all that easy, and takes some time.
Beyond that, we’ve promised to back up the data, and once you say “backup”, you’ve also made an implicit promise to make the data available.
2) I agree that if we have a backup, it is a logical extension to take continuous backups, but I wouldn’t say it’s necessary.
Perhaps the way to think about it is to ask the question: what do our donors likely want?
3) Clearly they want to preserve the data, in case it disappears from the Federal sites. So, that’s job 1. And, if it does disappear, we need to make it available.
3a) Making it available will require some serving CPU, disk, and network. We may need to worry about DDoS attacks, though perhaps we could get free coverage from Akamai or Google Project Shield.
Not all the data we’re collecting is in strictly servable form. Some of the databases, for example, aren’t usefully servable in the form we collect them, and we know some links will be broken because of missing pages, or because of wget’s design flaw.*
[* Wget stores http://a/b/c as a file a/b/c, where a/b is a directory; it stores http://a/b as a file a/b, where a/b is an ordinary file.
Both cannot exist simultaneously on disk, so when a site has both URLs, wget drops one of the pages.]
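A minimal sketch of that collision on disk (the directory names and this Python snippet are mine, for illustration only; they are not part of wget):

```python
import os, tempfile

root = tempfile.mkdtemp()
site = os.path.join(root, "example.com")        # hypothetical mirrored site
os.makedirs(os.path.join(site, "a"))

# Save http://example.com/a/b as an ordinary file...
with open(os.path.join(site, "a", "b"), "w") as f:
    f.write("contents of /a/b\n")

# ...then try to save http://example.com/a/b/c, which needs a/b to be a directory.
try:
    os.makedirs(os.path.join(site, "a", "b"))
except FileExistsError as err:
    print("collision:", err)   # a/b is already a plain file, so one page must be dropped
```

Whichever page is dropped, some internal links in the mirror end up broken.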
Points 3 & 3a imply that we need to keep some money in the bank until either the websites are taken down, or we decide that the threat has abated. So, we need to figure out how much money to keep as a serving reserve. It doesn’t sound like UCR has committed to serve the data, though you could perhaps ask.
Beyond the serving reserve, I think we are free to do better backups (i.e. more than one data collection), and change detection.