Along with the bad news there is some good news:
• Over 380 people have pledged over $14,000 to the Azimuth Backup Project on Kickstarter, greatly surpassing our conservative initial goal of $5,000.
• Given our budget, we currently aim at backing up 40 terabytes of data, and we are well on our way to this goal. You can see what we’ve done at Our Progress, and what we’re still doing at the Issue Tracker.
• I have gotten a commitment from Danna Gianforte, the head of Computing and Communications at U. C. Riverside, that eventually the university will maintain a copy of our data. (This commitment is based on my earlier estimate that we’d have 20 terabytes of data, so I need to see if 40 is okay.)
• I have gotten offers from two other people who say they, too, can hold our data.
I’m hoping that the data at U. C. Riverside will be made publicly available through a server. The other offers may involve the data being held ‘secretly’ until such time as it becomes needed; that has its own complementary advantages.
However, the interesting problem that confronts us now is: how to spend our money?
You can see how we’re currently spending it on our Budget and Spending page. Basically, we’re paying a firm called Hetzner for servers and storage boxes.
We could simply continue to do this until our money runs out. I hope that long before then, U. C. Riverside will have taken over some responsibilities. If so, there would be a long period where our money would largely pay for a redundant backup. Redundancy is good, but perhaps there is something better.
Two members of our team, Sakari Maaranen and Greg Kochanski, have thoughts on this matter which I’d like to share. Sakari posted his thoughts on Google+, while Greg posted his in an email which he’s letting me share here.
Please read these and offer us your thoughts! Maybe you can help us decide on the best strategy!
Sakari Maaranen
For the record, here are my views on how we should use the budget that the Azimuth Climate Data Backup Project now has.
People have contributed that money specifically to this effort.
Some non-government entities have offered “free hosting”. Of course the project should take any and all free offers to host our data. Accepting those offers would not be spending our budget, however; the hosts are still paying for the storage, even if they offer it to us “for free”.
As far as spending goes, I think we should think in terms of 1) terabyte-months and 2) sufficient redundancy, and do that as cost-efficiently as possible. We should not just hand the money to any takers, but aim for the best bang for the buck. We owe that to the people who have contributed.
For example, if we burn through the cash quickly on expensive storage, I would consider that a failure. Instead, we must plan for the best use of the budget towards our mission.
What we have promised people is that we will back up and serve these data sets with the money they have given us. Let’s do exactly that.
We are currently serving the mission at approximately €0.006 per gigabyte-month, at least for as long as we have volunteers willing to work for free. The cost could be slightly higher if we paid for professional maintenance, which is a reasonable assumption if we plan for long-term service. Volunteer work cannot be guaranteed forever, even if it works for now.
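For a rough sense of scale, here is a back-of-the-envelope sketch using the €0.006 per gigabyte-month figure above. The numbers are illustrative only, not the project’s actual budget:

```python
# Rough cost sketch based on the ~EUR 0.006 per gigabyte-month figure above.
# Illustrative numbers only, not the project's actual budget.

rate_eur_per_gb_month = 0.006      # approximate current storage cost
data_tb = 40                       # target data volume in terabytes
data_gb = data_tb * 1000           # 1 TB = 1000 GB

monthly_cost = data_gb * rate_eur_per_gb_month
three_year_cost = monthly_cost * 36

print(f"Monthly cost for {data_tb} TB: ~EUR {monthly_cost:.0f}")   # ~EUR 240
print(f"Cost over 3 years:            ~EUR {three_year_cost:.0f}") # ~EUR 8640
```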
This is one view and the question is open to public discussion.
Greg Kochanski
Some miscellaneous thoughts.
1) As I see it, we have made some promise of serving the data (“create a better interface for getting it”), which can be an expensive thing.
UI coding isn’t all that easy, and takes some time.
Beyond that, we’ve promised to back up the data, and once you say “backup”, you’ve also made an implicit promise to make the data available.
2) I agree that if we have a backup, it is a logical extension to take continuous backups, but I wouldn’t say it’s necessary.
Perhaps the way to think about it is to ask the question, “What do our donors likely want?”
3) Clearly they want to preserve the data, in case it disappears from the Federal sites. So, that’s job 1. And, if it does disappear, we need to make it available.
3a) Making it available will require some serving CPU, disk, and network. We may need to worry about DDoS attacks, though perhaps we could get free coverage from Akamai or Google Project Shield.
3b) Making it available may imply paying some students to write Javascript and HTML to put up a front-end to allow people to access the data we are collecting.
Not all the data we’re collecting is in strictly servable form. Some of the databases, for example, aren’t usefully servable in the form we collect them, and we know some links will be broken because of missing pages or because of wget’s design flaw.*
[* Wget stores http://a/b/c as a file a/b/c, where a/b is a directory. Wget stores http://a/b as a file a/b. A path cannot be both a file and a directory, so the two cannot exist simultaneously on disk; when both occur, wget drops one.]
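To make the collision concrete, here is a small sketch (the URLs are hypothetical) that checks a list of archived URLs for cases where one URL’s file path would also be needed as a directory by another URL. It only approximates wget’s actual naming rules:

```python
# Sketch: detect URL pairs that collide under wget's file layout,
# i.e. one URL maps to a file path that another URL needs as a directory.
# The URL list below is hypothetical.
from urllib.parse import urlparse

def wget_path(url: str) -> str:
    """Approximate the host/path file name wget would use for a URL."""
    p = urlparse(url)
    return p.netloc + (p.path if p.path else "/index.html")

def find_collisions(urls):
    paths = {wget_path(u): u for u in urls}
    collisions = []
    for path, url in paths.items():
        prefix = path.rstrip("/") + "/"
        for other_path, other_url in paths.items():
            if other_path != path and other_path.startswith(prefix):
                # `url` wants `path` as a file; `other_url` needs it as a directory
                collisions.append((url, other_url))
    return collisions

urls = [
    "http://a.example.gov/b",      # wget saves this as file a.example.gov/b
    "http://a.example.gov/b/c",    # ...but this needs a.example.gov/b as a directory
]
for file_url, dir_url in find_collisions(urls):
    print(f"collision: {file_url} (file) vs {dir_url} (needs directory)")
```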
Points 3 & 3a imply that we need to keep some money in the bank until either the websites are taken down, or we decide that the threat has abated. So, we need to figure out how much money to keep as a serving reserve. It doesn’t sound like UCR has committed to serve the data, though you could perhaps ask.
Beyond the serving reserve, I think we are free to do better backups (i.e. more than one data collection), and change detection.
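As a very minimal sketch of the serving layer Greg mentions in points 3a and 3b, here is the low end of “making it available”: Python’s built-in static file server pointed at the archive directory. The path is hypothetical, and a real deployment would add an index, search, TLS, and rate limiting, none of which is shown here:

```python
# Minimal sketch: serve an archive directory read-only over HTTP.
# The directory name is hypothetical; a real front-end would add an
# index, search, TLS, and rate limiting.
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

ARCHIVE_DIR = "/srv/azimuth-backup"   # hypothetical path to the archived data

handler = partial(SimpleHTTPRequestHandler, directory=ARCHIVE_DIR)
server = HTTPServer(("0.0.0.0", 8080), handler)

print("Serving archive (read-only) on port 8080 ...")
server.serve_forever()
```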
Thank you for doing this! I suggest setting aside some portion of the funds for “publicity” in the future. Once data disappears from government websites, people will need to be able to find out where it is and what you’re archiving. It will take time to spread the word, and it will have to continue for years. Volunteers can help, but you may need to hire individuals or services.
Hi! Thanks, I’ll consider that. We seem to be moving toward a setup where backups of US government environmental data are listed here:
• https://climate.daknob.net/
If climate data is actually removed from government websites, this location will probably get free publicity in the news. When we make our backups publicly available, they will be listed here. So, we might not need extra publicity. We’ll see.
The more general point about saving some money for unknown future contingencies seems very robustly reasonable.
I asked my internet activist brother about this, and he had this to say:
An obvious thing to do would be to upload it to the Internet Archive (archive.org) as a series of items in a collection. Their mission is to save everything of value for future generations.
For information of archival value, they offer unlimited storage and unlimited bandwidth, forever, for free. This data probably qualifies.
40 terabytes is a drop in the bucket for them. That’s 10 disk drives (they use 8 TB drives and make two copies of each item).
The Archive manages and serves up dozens of petabytes of data currently. (A petabyte is 1000 terabytes.)
The Archive has also been doing a .gov-wide crawl of the entire Federal web presence. They do this for every change of administration. This all goes into the Wayback Machine (https://web.archive.org).
I suggest that he talk with the Archive (info@archive.org or jason@archive.org), offer to either upload the data to them or mail them the data on disk drives, and offer to make a donation to the Archive to offset the long-term cost of storing the data and making it publicly downloadable.
Jason Scott is their free-range archivist, who sometimes handles intake on data collections like this (he also manages a merry bunch of archivist volunteers who did things like downloading every website on Geocities when Geocities was about to pull the plug on the whole server farm). Info@archive.org is their general intake address, from which the message would get sent to the right people to handle it.
That may or may not address the question of creating a good interface, but it definitely covers the basic preservation aspect.
Jeff
I don’t know how archive.org is funded, but it seems that archiving institutions, like the NEH, are also in peril. And even international institutions like the UN may face financial difficulties.
The Internet Archive is independent of government funding:
The Internet Archive is a 501(c)(3) non-profit that was founded to build an Internet library. Its purposes include offering permanent access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format.
I have another take on the issue. As long as we are setting up a large-scale effort to archive data, why not start fresh with a different philosophy on how to access the data?
I think in terms of what Tim Berners-Lee is advocating — essentially an open data initiative with attached semantics. It’s very easy to understand if you follow the example below for ★-rating of stored data:
http://5stardata.info
By simply backing up the data (★, ★★, ★★★), we aren’t adding much value apart from applying an insurance policy. But by wrapping semantic info around the data (★★★★, ★★★★★), we can provide extra context that can make it much more useful for discovery and reasoning applications. That’s the needed foundation for doing knowledge-based environmental context modeling via earth science data.
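As a toy illustration of the jump from ★★★ to ★★★★/★★★★★: the same measurement as a plain CSV row versus as linked data with URIs and a shared vocabulary. This is a sketch only; the URIs, property names, and values below are made up, not an agreed schema:

```python
# Sketch: the same record as (a) a plain CSV row (3-star open data)
# and (b) linked data with URIs and a shared vocabulary (4/5-star).
# All URIs, property names, and values here are illustrative placeholders.
import csv, io, json

# (a) 3-star: machine-readable and non-proprietary, but no shared semantics
csv_text = "station,date,co2_ppm\nMauna Loa,2016-12-01,404.9\n"
row = next(csv.DictReader(io.StringIO(csv_text)))

# (b) 4/5-star: identify things with URIs and reuse a common vocabulary,
# so the record can be linked to other datasets and reasoned about
jsonld = {
    "@context": {
        "obs": "http://example.org/vocab/observation#",   # hypothetical vocabulary
        "station": {"@id": "obs:station", "@type": "@id"},
        "date": "obs:date",
        "co2_ppm": "obs:co2ConcentrationPpm",
    },
    "@id": "http://example.org/obs/mauna-loa/2016-12-01",
    "station": "http://example.org/stations/mauna-loa",
    "date": row["date"],
    "co2_ppm": float(row["co2_ppm"]),
}
print(json.dumps(jsonld, indent=2))
```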
It’s pie-in-the-sky thinking as of right now, but there’s no doubt that we will eventually get to a 5★ state. It takes a lot of discipline, which is why the jury is still out on whether machines will eventually do all this automatically and we just have to wait it out.
I keep repeating: the Azimuth Backup project is not a backup but an emergency measure.
A lot of the environmental data consists of continuing measurements. And this data needs to be tended, checked etc.
If you grab chunks of data from websites and store them somewhere, then making use of the contents later becomes a historical effort: you will have gaps, omissions, and major losses of content, and it might be impossible to recalibrate, etc. You might be able to squeeze out something in the intermediate time range, but the longer the data sits in storage, the more difficult it will be to recover anything.
There is another thing about the semantic web: I think it is important to keep CSVs on the side. Semantic web content blows up data size, and as we already see in this discussion, this may become relevant.
So far the semantic web is also not very standardized, and its variants are partly rather incompatible with one another. CSVs might be large as well, which makes standardized compression and decompression, and their availability, an issue, but still.
Another thing is, who is supposed to provide the context?
I already see the deep learners in the front line, but I think there might be a fundamental, kind of overlooked, issue here, which needs to be discussed elsewhere.
I forgot to add that linked content is only as good as the corresponding links.
Establishing the veracity of the data is certainly important, and this project will have to deal with it too.
What happens when more “alternative facts” archives show up? Consider this one I found from last week: http://www.sealevel.info/
This looks pretty impressive in its scope, but then you find it is run by an AGW denialist. Just last week he was able to advertise it on the biggest denialist blog, WUWT, which let him write a top-level post describing how it works. The guy who runs it is apparently a North Carolina political activist trying to demonstrate that sea-level rise is not an issue for the barrier islands and other regions along the NC seacoast. The floodgates are now open to lots of graphs from Trumpsters showing little sea-level rise.
Having some sort of metadata or semantic information attached to the data will prevent this kind of manipulation. I would suspect that if the data does start disappearing from government servers, then that is just the tip of the iceberg of the problems to come. Having backups of the data is a great initial move, but the attacks will come from other fronts as well.
Being a part-time oil depletion analyst myself, I know we have long had a difficult time trying to estimate fossil-fuel production and reserve data. So much potential manipulation of the numbers goes on at those sites that we have always been prepared for it. That data always goes through a corporate filter before getting public exposure, so, IMO, this climate data issue looks like child’s play.
I had a brief look at the sealevel.info site.
Not every critic of certain climate change assertions is automatically a Trumpster. Moreover, at first glance his computer renderings of the sea levels seem correct. He has collected a lot of references, so he has really looked into the issue a lot. I haven’t found anything on ocean acidification, though, but I might have overlooked that.
He seems to do all this programming and research alongside his computer business. Among other things, the information about sea levels is probably also important for his local community in coastal North Carolina.
Well, on the WUWT blog the sea-level data manipulator dude refers to everyone who believes in AGW as “climate alarmists”.
The danger of taking sea-level rise data at face value, as this guy is doing, is that the rise has huge inertia; once it gets a head of steam, it’s too late to do anything about it.
Dave Burton writes on his sealevel.info site: [quoted passages omitted]
From a short glimpse into the 2010 Report, the possibility of accelerating sea levels was taken into account, so that a global sea-level rise for 2045 of between about 7.7 inches (hard to read off from Fig. 2) for the linear projection (green curve) and about 10–11 inches for the accelerating projection (red curve) was given.
The curves extend until 2100. As far as I know from other projection curves, curves are usually drawn over such a long interval for the purpose of displaying the overall shape, and not as a real prognosis beyond 2045. Apparently the fact that the North Carolina plain is rising was not taken into account in this report. In the 2015 report the overall sea-level projections for 2045 range from 7.7 [5.7 to 9.8] to 8.7 [6.5 to 11.0] inches (Table 8). So the lower boundary has been extended from 7.7 inches down to 5.7 inches. There are no curves until 2100, probably in order to avoid a misunderstanding in interpretation. Taking the possible rise of North Carolina into account, the projections for 2045 are now (Table 10 of the 2015 report) between a lowest level of 3.2 inches and 10.6 inches, instead of between about 7.7 inches and about 10–11 inches as in the 2010 report.
I don’t know what ruinous regulatory changes were planned or are still planned, but it is clear that a sea-level prognosis might influence future planning even without regulations, depending on how seriously the prognosis is taken. And the local people in North Carolina have to deal with the corresponding rather immediate consequences; for example, some people might stop investing in areas that may eventually be easily flooded, so it’s clear that such a prognosis gets scrutinized. The problem, though, is that you need a rather high level of knowledge, and often also computer software and hardware, in order to seriously question and scrutinize such a prognosis. So Dave Burton is, so to say, more or less forced to read all those climate science papers, probably without being paid for this work, alongside his computer business job. More sitting and reading and delving through technical papers instead of enjoying the North Carolina shore; maybe that explains a bit of his anger about “alarmists”.
As a matter of fact, however, seen from an overall viewpoint, the whole sea-level assessment is a free risk analysis, expensively paid for by scientists and “hobby” scientists and their corresponding funding agencies (if there is funding).
The issue is that Dave Burton has created a climate data repository that is integrated into an anti-AGW propaganda web site. I took a look at the “Papers” category and it’s loaded with pseudo-scientific citations.
Perhaps I shouldn’t fear that Burton’s site will grow into anything more than a freak-show curiosity. But we still need to rank sites according to value. That’s why Jim Stuttard’s link below to Google’s work is important.
Paul wrote:
I had a brief look at WUWT, but I haven’t looked at the papers category, because I am not a climate scientist, so I wouldn’t be able to judge anyway what is respected in that area and what is not. I found the current article on WUWT interesting, however. It looked scientifically OK to me, and so I updated my webpage about ocean acidification.
“… because I am not a climate scientist”
Neither was Pierre Laplace, yet …
Nad: if I were you, I would avoid referring to the famous climate reality denier Anthony Watts on your blog. Even a broken clock is correct twice a day, so there may be interesting information in some article he wrote. But if you want climate scientists to trust what you say, or even bother to read it, this is one of the worst possible sources! You can go straight to whatever peer-reviewed literature he cites, and talk about that instead.
• John Cook, Watts Up With That concludes Greenland is not melting without looking at any actual ice mass data, Skeptical Science, 13 July 2010.
• Stefan Rahmstorf, The most popular deceptive climate graph, RealClimate, 8 December 2014.
John wrote:
I found the information about the foraminifera on his blog, and I usually try to cite the blogs where I found something. That doesn’t mean that I approve of what is written on the respective blog.
If a climate scientist doesn’t trust what I write because I cite my sources, then sorry, I don’t need that respective scientist’s trust.
Apart from that, given your citations it seems Anthony Watts is quite respected, or at least people discuss his findings and such. When I wrote to author Hans Joachim Schellnhuber about those “mathematical nuances”, he didn’t bother to answer. As the head of the PIK institute he is probably drowning in managerial work, so maybe I should have written to one of the younger authors, but of the authors he is the one in Potsdam, which is just about one and a half hours by public transport from here.
Anyway, in one of the blog posts you cited, Stefan Rahmstorf (who is, by the way, also at PIK Potsdam) discusses a visualization, and I don’t agree with many of his objections. There are many aspects one has to take into account with visualizations, and usually it is good to have a couple of different visualizations, because they prioritize different things, people read them in different ways, etc.
So I think it is good that he also shows a visualization by reader Bernd Herd, who seems to run a computer business called herdsoft. But then I asked myself: who paid for all those visualizations?
Anyway, concerning the Greenland ice and the temperatures: I have a problem with averaged temperatures. I had remarked and shown in the forum how average temperatures are, for example, sensitive to historical events; moreover, averaging produces one big sauce with not much to read off. In fact “global warming” is very different at different places; at some places surface temperatures are even in decline. I still think, though, that on average temperatures seem to be rising. And frankly, having the methane and solar radiation discussion in mind, I currently rather fear that things could be even worse. But so far this is only a diffuse feeling, and last but not least, I look at all this because I want to become a bit less worried.
Surface air temperatures are strongly connected to sea temperatures, so in particular, in order to see this better, I plotted various geographical temperatures from the CRUTEM4 station data set. Among other things, I did this to infer the speed of the AMOC, which you can see nicely in how pronounced “bumps” travel through the temperature set. Of course this could be visualized better, but I am not a programmer. Also clearly visible is how the current travels along only one side of the Greenland coast.
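A toy sketch of the “traveling bumps” idea, using synthetic data only (not CRUTEM4, and not nad’s actual analysis): if the same temperature anomaly shows up at a downstream station with a delay, the lag of the peak cross-correlation gives a crude speed estimate once you assume a distance along the current.

```python
# Toy sketch: infer a transport speed from how a temperature "bump"
# travels between two stations. Synthetic series, hypothetical distance.
import numpy as np

rng = np.random.default_rng(0)
months = np.arange(600)                                  # 50 years, monthly

bump = np.exp(-((months - 300) / 12.0) ** 2)             # an anomaly "bump"
upstream = bump + 0.1 * rng.standard_normal(600)
downstream = np.roll(bump, 24) + 0.1 * rng.standard_normal(600)  # arrives ~24 months later

# Lag of maximum cross-correlation between the two series
lags = np.arange(-120, 121)
corr = [np.corrcoef(upstream, np.roll(downstream, -lag))[0, 1] for lag in lags]
best_lag_months = lags[int(np.argmax(corr))]

distance_km = 3000.0                                     # hypothetical along-current distance
speed_km_per_year = distance_km / (best_lag_months / 12.0)
print(f"bump arrives ~{best_lag_months} months later -> ~{speed_km_per_year:.0f} km/year")
```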
To me it looks as if the surface temperatures in Greenland have been rising slightly since around 1997, but not by much, and in the 1930s temperatures were also high. The fact that temperatures are not rising much and that they were high in the 1930s was pointed out by Steve Goddard, and that looks right to me.
Unfortunately the CRUTEM4 file stopped in 2011.
Can such a small temperature rise lead to an accelerating ice loss as John Cook suggests?
Or could there also be other causes? In particular, why is Greenland rising so fast; could something be due to volcanic activity? Has there been a change in temperature measurement in Greenland since the 1930s, like different instruments, etc.? I don’t know, and John Cook doesn’t say much about this in the blog post you cited.
Nad wrote:
Anthony Watts is the hero of all climate change deniers, so huge numbers of such people read his blog. It may be the most read blog on the topic of global warming. As a result, climate scientists constantly feel the need to debunk his claims. However, he’s not on the Wikipedia List of scientists opposing the mainstream scientific assessment of global warming, presumably because they don’t consider him a scientist.
Anyway, do what you want, but it’s good to know a bit of the network of climate change deniers. Some of Watts’ work has been funded by the Heartland Institute, who are important in that network.
I sure hope somebody got this Department of the Interior web page. It’s gone now: https://www.doi.gov/climate/
@Brookings Library,
Actually it’s not, or, at least, it has returned in some form.
Is this the same page you saw before?
I’m interested in this, because it is important to have tangible (provable!) evidence of any changes being made to government climate pages, and what the changes are.
Hi John – Thank you for starting this project. Although you’ve reached your KS goal, have you estimated how much funding would be ideal for your project? As you’ve noted, time seems to be of the essence.
Now that you’ve started archiving and have a better idea of the scope, are you looking at $20,000, $50,000, or $??? to wrap up all the data and keep it safe? Thanks!
-john
PS, yep, I’m a JB too…
When it comes to the short term, what’s needed is not money so much as people able to create backups of the 67 databases that have not been backed up. This requires a certain level of skill: HTTP sites are more varied in nature than FTP sites, and need to be studied before one can write a script to copy them. But the Azimuth Backup Project is not seeking to back up all of these databases; there are lots of other people out there doing this stuff, so we are currently aiming to grab 40 terabytes.
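For readers wondering what “writing a script to copy them” involves, here is a deliberately simple sketch: a breadth-first crawl restricted to one host, saving each fetched page to disk. The start URL is hypothetical, and real government sites need per-site study (query strings, pagination, robots.txt, rate limits) that this sketch ignores:

```python
# Very simplified sketch of mirroring an HTTP site: breadth-first crawl
# restricted to one host, saving each page to disk. Real backups need
# per-site tuning; the start URL below is hypothetical.
import os
import urllib.parse
import urllib.request
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def mirror(start_url, out_dir, max_pages=50):
    host = urllib.parse.urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                body = resp.read()
        except OSError:
            continue                     # unreachable page; skip in this sketch
        # Save under a path derived from the URL (wget-style collisions ignored)
        parsed = urllib.parse.urlparse(url)
        rel = parsed.path.lstrip("/")
        if not rel or rel.endswith("/"):
            rel += "index.html"
        path = os.path.join(out_dir, parsed.netloc, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(body)
        # Queue same-host links found in the page
        extractor = LinkExtractor()
        extractor.feed(body.decode("utf-8", errors="replace"))
        for href in extractor.links:
            nxt = urllib.parse.urljoin(url, href).split("#")[0]
            if urllib.parse.urlparse(nxt).netloc == host and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)

# mirror("http://data.example.gov/", "mirror_out")   # hypothetical start URL
```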
(Of course with larger amounts of money one might be able to hire people to download and store everything, but there’s still the organizational issue of finding such people and managing them. Right now we have a relatively small team of highly competent self-motivated people who are able to manage themselves and willing to work for free. I don’t think we have the organizational skills to hire and manage a bunch of less self-motivated people—not in time, anyway.)
I think we have enough money to store 40 terabytes for about 3 years. In the longer term, or even before then, I think we can find people willing to store the data for us for free. So, to me the interesting question is whether we should spend our money to keep holding on to the data ourselves for as long as possible, or spend it on something else that might help the world more.
Jeff mentioned the possibility of storing the data on portable HDs. Joining forces with the Wayback Machine experts might be a win-win if they were paid. If 8 TB drives are maybe ~$400, then 40 TB would be ~$2,000 (maybe somebody knows the best deal?). Until the government servers have their content scrubbed, and the possibility of, say, UCR and some other institutions offering free hosting is settled, I don’t see the need to spend (more) money on server rent. When the downloads have finished might be the time to start something like an “Experiments in climate data access” project page to discuss detailed design options.
@whut: Google Research is investigating metadata for public data sets. They mention the Knowledge Graph as a possible tool.
https://research.googleblog.com/2017/01/facilitating-discovery-of-public.html
Jim, great find! Yes, that’s what I’m thinking about:
“How can we efficiently use content descriptions that providers already describe in a declarative way using W3C standards for describing semantics of Web resources and linked data?”
For now, we have to have someone mechanically add this information, but if we can make this automatic yet rigorous in semantic intent, then we can speed up the process. Identifying declarative intent is the key, as that is the best way of inferring the semantic information.
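As one concrete (hypothetical) example of the kind of declarative description meant here: schema.org’s Dataset vocabulary, emitted as JSON-LD, is a W3C-compatible way to describe an archived dataset so that crawlers can discover it. All names, URLs, and values below are made-up placeholders, not the project’s actual metadata:

```python
# Sketch: generate schema.org/Dataset metadata (as JSON-LD) for an archived
# data set, the kind of declarative description search engines can index.
# All names, URLs, and values below are illustrative placeholders.
import json

def dataset_metadata(name, description, source_url, download_url, variables):
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "isBasedOn": source_url,            # the original government source
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": download_url,     # where the backup copy is served
            "encodingFormat": "text/csv",
        },
        "variableMeasured": variables,
        "license": "https://example.org/license-tbd",   # placeholder
    }

meta = dataset_metadata(
    name="Example backed-up climate dataset",
    description="Mirror of a public climate data set (illustrative entry).",
    source_url="https://www.example.gov/original-dataset",
    download_url="https://backup.example.org/datasets/example.csv",
    variables=["air temperature", "CO2 concentration"],
)
print(json.dumps(meta, indent=2))
```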
With 30 minutes to go on the Kickstarter campaign, we have broken the $20,000 mark: four times our original goal. I thank everyone who contributed! Our work is just begun. Next Monday I’ll talk to people at Computing & Communications at U. C. Riverside, to initiate the process of copying our data to their storage space.
http://www.fel.duke.edu/~scafetta/pdf/opinion0308.pdf: this might be the other view, which I disagree with.
So, I have another thought on how the Azimuth Backup Archive could be used. The U. C. Irvine servers host a pretty famous machine learning repository and archive. These are standard datasets for which the answers are known, used to test nascent ML algorithms and implementations.
What Azimuth Backup has can be usefully thought of as a large collection of documents. As with Amazon Cloud Drive or Google Drive (hi there, @Scott Maxwell), a challenge, as I understand it, is to reduce storage use by detecting duplicate documents and replacing them with pointers. I have no proprietary insight into this from either company, but I am familiar with some of the algorithms typically used for this purpose, e.g., the famous minHash. Now, minHash is not exactly suitable because it is approximate, but there are other algorithms which are, or which at least do not create false positives, and no one wants to replace a document with a link to another which is not really the same.
It is probable that our copies of datasets contain duplicates. These may have been introduced operationally, since we don’t really know or have insight into the architecture of the many government Web sites and have little possibility of cooperating with their operations people; but more likely it happens because these government sites are cross-linked in many ways, with NOAA citing things on NASA sites, NASA citing things on NOAA sites, the Forest Service citing things on both, and so on.
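A minimal sketch of the exact, false-positive-free flavor of deduplication: hash file contents and group identical digests. This is not minHash (which finds near-duplicates) and not any particular provider’s method; it just shows the basic idea on a local directory tree.

```python
# Sketch: find exact duplicate files in an archive tree by hashing contents.
# Identical SHA-256 digests mean identical bytes, so (unlike minHash) this
# produces no false positives; it will not catch near-duplicates.
import hashlib
import os
import sys
from collections import defaultdict

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_digest[sha256_of(path)].append(path)
            except OSError:
                pass   # unreadable file; skip in this sketch
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for digest, paths in find_duplicates(root).items():
        print(digest[:12], *paths)
```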
So an interesting and useful additional possibility for our datasets would be to open them, on a read-only basis, to universities and courses that might want to use them as a publicly accessible real dataset for mining in this manner: for detecting duplicates, but possibly for other purposes as well. There would be no truth data, but then, as in statistics projects with real datasets in advanced classes, that makes the challenge a real example of doing work in this area.
We might “license” the data, as U.C. Irvine does for theirs and as CAIDA does for their Internet data, requesting users and courses to deposit copies of papers, presentations, classwork, and publications at the site, so we develop a track record of results and of how the data is being used.
Happy to discuss.
I’d been hoping we’d make the data available, on a read-only basis, to everyone—through U.C. Riverside, for example. So one question is whether you’re suggesting we should make it available only to certain people.
No, no, no intent to limit access at all. I was just trying to sketch another use case other than the primary data the datasets represent.
Okay. Then the ‘licensing’ procedure would presumably be optional. I’m all for encouraging research groups to use the data in the manner you suggest; I just don’t have much of a sense of how much interest there’d be in doing that.
There is a course at Stanford run by Professors Leskovec, Rajaraman, and Ullman which is the perfect candidate for these data. Of course, we are not ready to offer it for the present session, but they are “prospective customers.” Yes, the licensing would be optional, as it is at Irvine and CAIDA.
One project, called the End of Term (EOT) Presidential Harvest, was created in 2008 by the Library of Congress, the Internet Archive (also known as the Wayback Machine) and other archivists. At the end of every presidential term, the team of data specialists, professional librarians and volunteers searches and indexes all government information available, mainly at .gov or .mil websites. The 2008 and 2012 EOT efforts indexed and copied millions of webpages of government information.
The effort has heightened this year due to uncertainty about the Trump administration’s plans. Just days into the new administration, substantive changes have been made at the Department of Agriculture and the Environmental Protection Agency websites. And any potential disappearance of health data from the Centers for Disease Control and Prevention, the U.S. Census Bureau or the National Institutes of Health, for example, could affect researchers at the School of Medicine and elsewhere.
Nigam Shah, MBBS, PhD, an associate professor of medicine, who specializes in analyzing biomedical data, said he knows many Stanford researchers whose work depends on government health data, including the National InPatient Sample dataset and the National Health and Nutrition Examination Survey.
But downloading databases can be hit or miss, Jacobs said. “The internet is a messy place.” For example, the data in a database might be hidden behind search or GIS interfaces. And if those databases are taken offline, for all practical purposes, they cease to exist, Jacobs said.
For this year’s effort, 300 volunteers nationally (up from 30 in years past) have jumped in to help since mid-October, Jacobs said. The EOT team is planning to work through the end of March, so if you have a favorite database, you can still nominate it to be archived.
There are also other efforts to save government data — such as Climate Mirror, Data Refuge and the Azimuth Climate Data Backup Project.