Statebox: A Universal Language of Distributed Systems

22 January, 2018

guest post by Christian Williams

A short time ago, on the Croatian island of Zlarin, there gathered a band of bold individuals—rebels of academia and industry, whose everyday thoughts and actions challenge the separations of the modern world. They journeyed from all over to learn of the grand endeavor of another open mind, an expert functional programmer and creative hacktivist with significant mathematical knowledge: Jelle |yell-uh| Herold.

The Dutch computer scientist has devoted his life to helping our species and our planet: from consulting on business process optimization to winning a Greenpeace hackathon, from updating the Netherlands' telecommunications to building a website that helps individuals find ways to heal the earth, Jelle has gained a comprehensive perspective on the interconnected era. Through a diverse and innovative career, he has garnered crucial insights into software design and network computation—most profoundly, he has realized that it is imperative that these immense forces of global change develop thoughtful, comprehensive systematization.

Jelle understood that initiating such a grand ambition requires a massive amount of work, and the cooperation of many individuals, fluent in different fields of mathematics and computer science. Enter the Zlarin meeting: after a decade of consideration, Jelle has now brought together proponents of categories, open games, dependent types, Petri nets, string diagrams, and blockchains toward a singular end: a universal language of distributed systems—Statebox.

Statebox is a programming language formed and guided by fundamental concepts and principles of theoretical mathematics and computer science. The aim is to develop the canonical process language for distributed systems, and thereby elucidate the way these should actually be designed. The idea invokes the deep connections of these subjects in a novel and essential way, to make code simple, transparent, and concrete. Category theory is both the heart and pulse of this endeavor; more than a theory, it is a way of thinking universally. We hope the project helps to demonstrate the importance of this perspective, and encourages others to join.

The language is designed to be self-optimizing, open, adaptive, terminating, error-cognizant, composable, and most distinctively—visual. Petri nets are the natural representation of decentralized computation and concurrency. By utilizing them as program models, the entire language is diagrammatic, and this allows one to inspect the flow of the process executed by the program. While most languages only compile into illegible machine code, Statebox compiles directly into diagrams, so that the user immediately sees and understands the concrete realization of the abstract design. We believe that this immanent connection between the “geometric” and “algebraic” aspects of computation is of great importance.

Compositionality is a rightfully popular contemporary term, indicating the preservation of type under composition of systems or processes. This is essential to the universality of the type, and it is intrinsic to categories, which underpin Petri nets. A pertinent example is that composition allows for a form of abstraction in which programs do not require complete specification. This is parametricity: a program becomes executable when the functions are substituted with valid terms. Every term has a type, and one cannot connect pieces of code that have incompatible inputs and outputs—the compiler would simply produce an error. The intent is to preserve a simple mathematical structure that imposes as little as possible, while still ensuring the rationality of the code. We can then more easily and reliably write tools providing automatic proofs of termination and type-correctness. Many more aspects will be explained as we go along, and in more detail in future posts.
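To make this concrete, here is a minimal sketch of how typed composition rules out invalid connections at compile time. It is written in Haskell rather than in Statebox itself, and the function names are invented for illustration.

```haskell
-- Hypothetical pieces of a pipeline; names and bodies are illustrative only.
parse :: String -> Int
parse = length            -- stand-in implementation

double :: Int -> Int
double = (* 2)

describe :: Bool -> String
describe b = if b then "yes" else "no"

-- Valid composition: the output type of `parse` matches the input type of `double`.
pipeline :: String -> Int
pipeline = double . parse

-- Invalid composition: uncommenting this is rejected by the compiler,
-- because `parse` produces an Int while `describe` expects a Bool.
-- broken :: String -> String
-- broken = describe . parse
```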

Statebox is more than a specific implementation. It is an evolving aspiration, expressing an ideal, a source of inspiration, signifying a movement. We fully recognize that we are at the dawn of a new era, and do not assume that the current presentation is the best way to fulfill this ideal—but it is vital that this kind of endeavor gains the hearts and minds of these communities. By learning to develop and design by pure theory, we make a crucial step toward universal systems and knowledge. Formalisms are biased, fragile, transient—thought is eternal.

Thank you for reading, and thank you to John Baez—|bi-ez|, some there were not aware—for allowing me to write this post. Azimuth and its readers represent what scientific progress can and should be; it is an honor to speak to you. My name is Christian Williams, and I have just begun my doctoral studies with Dr. Baez. He received the invitation from Jelle and could not attend, and was generous enough to let me substitute. Disclaimer: I am just a young student with big dreams, with insufficient knowledge to do justice to this huge topic. If you can forgive some innocent confidence and enthusiasm, I would like to paint a big picture, to explain why this project is important. I hope to delve deeper into the subject in future posts, and in general to celebrate and encourage the cognitive revolution of Applied Category Theory. (Thank you also to Anton and Fabrizio for providing some of this writing when I was not well; I really appreciate it.)

Statebox Summit, Zlarin 2017, was awesome. Wish you could’ve been there. Just a short swim in the Adriatic from the old city of Šibenik |shib-enic|, there lies the small, green island of Zlarin |zlah-rin|, with just a few hundred kind inhabitants. Jelle’s friend, and part of the Statebox team, Anton Livaja and his family graciously allowed us to stay in their houses. Our headquarters was a hotel, one of the few places open in the fall. We set up in the back dining room for talks and work, and for food and sunlight we came to the patio and were brought platters of wonderful, wholesome Croatian dishes. As we burned the midnight oil, we enjoyed local beer, and already made history—the first Bitcoin transaction of the island, with a progressive bartender, Vinko.

Zlarin is a lovely place, but we haven’t gotten to the best part—the people. All who attended are brilliant, creative, and spirited. Everyone’s eyes had a unique spark to light. I don’t think I’ve ever met such a fascinating group in my life. The crew: Jelle, Anton, Emi Gheorghe, Fabrizio Genovese, Daniel van Dijk, Neil Ghani, Viktor Winschel, Philipp Zahn, Pawel Sobocinski, Jules Hedges, Andrew Polonsky, Robin Piedeleu, Alex Norta, Anthony di Franco, Florian Glatz, Fredrik Nordvall Forsberg. These innovators have provocative and complementary ideas in category theory, computer science, open game theory, functional programming, and the blockchain industry; and they came to share an important goal. These are people who work earnestly to better humanity, motivated by progress, not profit. Talking with them gave me hope, that there are enough intelligent, open-minded, and caring people to fix this mess of modern society. In our short time together, we connected—now, almost all continue to contribute and grow the endeavor.

Why is society a mess? The present human condition is absurd. We are in a cognitive renaissance, yet our world is in peril. We need to realize a deeper harmony of theory and practice—we need ideas that dare to dream big, that draw on the vast wealth of contemporary thought to guide and unite subjects in one mission. The way of the world is only a reflection of how we choose to think, and for more than a century we have delved endlessly into thought itself. If we truly learn from our thought, knowledge and application become imminently interrelated, not increasingly separate. It is imperative that we abandon preconception, pretense and prejudice, and ask with naive sincerity: “How should things be, really, and how can we make it happen?”

This pertains more generally to the irresponsibly ad hoc nature of society—we find ourselves entrenched in inadequate systems. Food, energy, medicine, finance, communications, media, governance, technology—our deepening dependence on centralization is our greatest vulnerability. Programming practice is the perfect example of the gradual failure of systems when their design is left to wander in abstraction. As business requirements evolved, technological solutions were created haphazardly, the priority being immediate return over comprehensive methodology, which resulted in ‘duct-taped’ systems, such as the Windows OS. Our entire world now depends on unsystematic software, giving rise to so much costly disorganization, miscommunication, and worse, bureaucracy. Statebox aims to move beyond the misguided formalisms which came out of this type of degeneration, and to design a language which corresponds naturally to essential mathematical concepts—to create systems which are rational, principled, universal. To explain why Statebox represents to us such an important ideal, we must first consider its closest relative, the elephant in the technological room: blockchain.

Often the best ideas are remarkably simple—in 2008, an unknown person under the alias Satoshi Nakamoto published the whitepaper Bitcoin: A Peer-to-Peer Electronic Cash System. In just a few pages, a protocol was proposed which underpins a new kind of computational network, called a blockchain, in which interactions are immediate, transparent, and permanent. This is a personal interpretation—the paper focuses on the application given in its title. In the original financial context, immediacy is one’s ability to directly transact with anyone, without intermediaries, such as banks; transparency is one’s right to complete knowledge of the economy in which one participates, meaning that each node owns a copy of the full history of the network; permanence is the irrevocability of one’s transactions. These core aspects are made possible by an elegant use of cryptography and game theory, which essentially removes the need for trusted third parties in the authorization, verification, and documentation of transactions. Per word, it’s almost peerless in modern influence; the short and sweet read is recommended.
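As a rough illustration of the cryptographic side (and emphatically not Bitcoin's actual data format), the toy sketch below chains blocks together by hashing each block along with the digest of its predecessor; it assumes Haskell with the cryptonite and bytestring packages. Rewriting any earlier payload changes every later digest, which is the mechanism behind permanence.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Toy hash-linked ledger; illustrative only.
import           Crypto.Hash           (Digest, SHA256, hash)
import           Data.ByteString       (ByteString)
import qualified Data.ByteString.Char8 as BS

data Block = Block
  { prevHash :: ByteString  -- hex digest of the previous block
  , payload  :: ByteString  -- the recorded transactions (here, plain text)
  } deriving Show

-- Digest of a block, covering both its payload and its link to the past.
blockHash :: Block -> ByteString
blockHash b = BS.pack (show (hash (prevHash b <> payload b) :: Digest SHA256))

-- A two-block chain: altering the first payload would invalidate the second link.
chain :: [Block]
chain =
  let b0 = Block "genesis"      "Alice pays Bob 1"
      b1 = Block (blockHash b0) "Bob pays Carol 1"
  in  [b0, b1]
```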

The point of this simplistic explanation is that blockchain is about more than economics. The transaction could be any cooperation, the value could be any social good—when seen as a source of consensus, the blockchain protocol can be expanded to assimilate any data and code. After several years of competing cryptocurrencies, the importance of this deeper idea was gradually realized. There arose specialized tools to serve essential purposes in some broader system, and only recently have people dared to conceive of what that broader system could be. In 2014, a wunderkind named Vitalik Buterin created Ethereum, a fully programmable blockchain. Its language, Solidity, is a Turing-complete language for smart contracts: autonomous programs which enable interactions and enact rules on the network. With this framework, one can not only transact with others, but implement any kind of process; one can build currencies, websites, or organizations—decentralized applications, constructed with smart contracts, could be just about anything.

There is understandably great confidence and excitement for these ventures, and many are receiving massive public investment. Seriously, the numbers are staggering—but most of it is pure hype. There is talk of the first global computer, the internet of value, a single shared source of truth, and other speculative descriptions. But compared to the ambition, the actual theory is woefully underdeveloped. So far, implementations make almost no use of the powerful ideas of mathematics. There are still basic flaws in blockchain itself, the foundation of almost all decentralized technology. For example, the two viable candidates for transaction verification are called Proof of Work and Proof of Stake: the former requires unsustainable consumption of resources, namely hardware and electricity, and the latter is susceptible to centralization. Scalability is a major problem, which also drives up the cost and slows the speed of transactions. A major Ethereum dApp, The DAO (a decentralized autonomous organization), was hacked in 2016.

These statements are absolutely not meant to disregard all of the great work of this community; they are primarily rhetoric to distinguish the high ideals of Statebox. I lack the eloquence to make the point diplomatically, let alone the knowledge to give a real account of this huge endeavor. We now return to the rhetoric.

What seems to be lost in the commotion is the simple recognition that we do not yet really know what we should make, nor how to do so. The whole idea is simply too big—the space of possibility is almost completely unknown, because this innovation can open every aspect of society to reform. But as usual, people try to ignore their ignorance, imagining it will disappear, and millions clamor about things we do not yet understand. Most involved are seeing decentralization as an exciting business venture, rather than our best hope to change the way of this broken world; they want to cash in on another technological wave. Of the relatively few idealists, most still retain the assumptions and limitations of the blockchain.

For all this talk, there is little discussion of how to even work toward the ideal abstract design. Most mathematics associated to blockchain is statistical analysis of consensus, while we’re sitting on a mountain of powerful categorical knowledge of systems. At the summit, Prof. Neil Ghani said “it’s like we’re on the Moon, talking about going to Mars, while everyone back on Earth still doesn’t even have fire.” We have more than enough conceptual technology to begin developing an ideal and comprehensive system, if the right minds come together. Theory guides practice, practice motivates theory—the potential is immense.

Fortunately, there are those who have this big picture in mind. Long before the blockchain craze, Jelle saw both the fundamental importance of distributed systems and the need for academic-industrial symbiosis. In the mid-2000s, he used Petri nets to create process tools for businesses. Employees could design and implement any kind of abstract workflow to communicate and produce more effectively. Jelle would provide consultation to optimize these processes and integrate them into the businesses' existing infrastructure—as a workflow executed, it would generate tasks, emails, and forms, and send them to designated individuals to be completed for the next iteration. Many institutions would have to shell out millions of dollars to IBM or Fujitsu for this kind of software, and his was more flexible and intuitive. This left a strong impression on Jelle regarding the power of Petri nets and the impact of deliberate design.

Many experiences like this gradually instilled in Jelle a conviction to expand his knowledge and begin planning bold changes to the world of programming. He attended mathematics conferences, and would discuss with theorists from many relevant subjects. On the island, he told me that it was actually one of Baez’s talks about networks which finally inspired him to go for this huge idea. By sincerely and openly reaching out to the whole community, Jelle made many valuable connections. He invited these thinkers to share his vision—theorists from all over Europe, and some from overseas, gathered in Croatia to learn and begin to develop this project—and it was a great success.

By now you may be thinking: alright, kid, spill the beans already. Here they are, right into your brain—well, most will be in the next post, but we should at least have a quick overview of some of the main ideas not already discussed.

The notion of open system complements compositionality. The great difference between closure and openness, in society as well as theory, was a central theme in many of our conversations during the summit. Although we try to isolate and suspend life and cognition in abstraction, the real, concrete truth is what flows through these ethereal forms. Every system in Statebox is implicitly open, and this impels design to idealize the inner and outer connections of processes. Open systems are central to the Baez Network Theory research team. There are several ways to categorically formalize open systems; the best are still being developed, but the first main example can be found in The Algebra of Open and Interconnected Systems by Brendan Fong, an early member of the team.

Monoidal categories, as this blog knows well, represent systems with both series and parallel processes. One of the great challenges of this new era of interconnection is distributed computation—getting computers to work together as a single supercomputer, and monoidal categories are essential to this. Here, objects are data types and morphisms are computations, while composition is serial and the tensor is parallel. As Dr. Baez has demonstrated with years of great original research, monoidal categories are essential to understanding the complexity of the world. If we can connect our knowledge of natural systems to social systems, we can learn to integrate valuable principles—a key example being complete resource cognizance.
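As a tiny Haskell sketch of this structure (standing in for Statebox's own setting): ordinary types and functions form such a category, with (>>>) as serial composition and (***) as the tensor acting on pairs.

```haskell
import Control.Arrow ((***), (>>>))
import Data.Char     (toUpper)

increment :: Int -> Int
increment = (+ 1)

shout :: String -> String
shout = map toUpper

-- Serial composition: one computation after another (composition of morphisms).
serial :: Int -> Int
serial = increment >>> increment

-- Parallel composition: two independent computations side by side
-- (the tensor of morphisms, acting on a pair of inputs).
parallel :: (Int, String) -> (Int, String)
parallel = increment *** shout

-- parallel (41, "ok") == (42, "OK")
```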

Petri nets are presentations of free strict symmetric monoidal categories, and as such they are ideal models of “normal” computation, i.e. associative, unital, and commutative. Open Petri nets are the workhorses of Statebox. They are the morphisms of a category which is itself monoidal—and via openness it is even richer and more versatile. Most importantly, it is compact closed, which introduces a simple but crucial duality into computation—input-output interchange—that is impossible in conventional cartesian closed computation, and actually brings the paradigm closer to quantum computation.
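The following is a deliberately crude sketch, not Statebox's actual representation, of what an open Petri net and its sequential composition might look like as data: composing glues the output interface of one net to the input interface of the next, and mismatched interfaces are simply refused, mirroring composition of morphisms.

```haskell
-- Illustrative only: a minimal open Petri net and its sequential composition.
type Place      = String
type Transition = (String, [Place], [Place])  -- (name, input places, output places)

data OpenNet = OpenNet
  { places      :: [Place]
  , transitions :: [Transition]
  , inBoundary  :: [Place]   -- interface exposed as inputs
  , outBoundary :: [Place]   -- interface exposed as outputs
  } deriving Show

-- Glue the output boundary of one net to the input boundary of the next.
-- If the interfaces do not match, the composite simply does not exist.
compose :: OpenNet -> OpenNet -> Maybe OpenNet
compose a b
  | outBoundary a /= inBoundary b = Nothing
  | otherwise = Just OpenNet
      { places      = places a ++ [p | p <- places b, p `notElem` inBoundary b]
      , transitions = transitions a ++ transitions b
      , inBoundary  = inBoundary a
      , outBoundary = outBoundary b
      }
```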

Petri nets represent processes in an intuitive, consistent, and decentralized way. These will be multi-layered via the notion of operad and a resourceful use of Petri net tokens, representing the interacting levels of a system. Compositionality makes exploring their state space much easier: the state space of a big process can be constructed from those of smaller ones, a technique that more often than not avoids state space explosion, a long-standing problem in Petri net analysis. The correspondence between open Petri nets and a logical calculus, called place/transition calculus, allows the user to perform queries on the Petri net, and a revolutionary technique called information-gain computing greatly reduces response time.

Dependently typed functional programming is the exoskeleton of this beautiful beast; in particular, the underlying language is Idris. Dependent types arose out of both theoretical mathematics and computer science, and they are beginning to be recognized as very general, powerful, and natural in practice. Functional programming is a similarly pure and elegant paradigm for “open” computation. They are fascinating and inherently categorical, and deserve whole blog posts in the future.
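Statebox's underlying language, Idris, makes dependent types first-class; their flavour can be approximated in Haskell with GADTs. In the sketch below the length of a vector is part of its type, so taking the head of an empty vector is a compile-time error rather than a runtime crash.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}
-- Approximating dependent types in Haskell; Idris expresses this directly.
data Nat = Z | S Nat

-- A list whose length is tracked in its type.
data Vec (n :: Nat) a where
  VNil  :: Vec 'Z a
  VCons :: a -> Vec n a -> Vec ('S n) a

-- Only defined on provably non-empty vectors: `vhead VNil` does not type-check.
vhead :: Vec ('S n) a -> a
vhead (VCons x _) = x

-- A length-two vector, with the length visible in the type.
example :: Vec ('S ('S 'Z)) Int
example = VCons 1 (VCons 2 VNil)
```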

Even economics has opened its mind to categories. Statebox is very fortunate to have several of these pioneers—open game theory is a categorical, compositional version of game theory, which allows the user to dynamically analyze and optimize code. Jules’ choice of the term “teleological category” is prescient; it is about more than just efficiency—it introduces the possibility of building principles into systems, by creating game-theoretical incentives which can guide people to cooperate for the greater good, and gradually lessen the influence of irrational, selfish priorities.

Categories are the language by which Petri nets, functional programming, and open games can communicate—and amazingly, all of these theories are unified in an elegant representation called string diagrams. These allow the user to forget the formalism, and reason purely in graphical terms. All the complex mathematics goes under the hood, and the user only needs to work with nodes and strings, which are guaranteed to be formally correct.

Category theory also models the data structures that are used by Statebox: Typedefs is a very lightweight—but also very expressive—data structure that is at the very core of Statebox. It is based on initial F-algebras, and can be easily interpreted in a plethora of pre-existing solutions, enabling seamless integration with existing systems. One of the core features of Typedefs is that serialization is categorically internalized in the data structure, meaning that every operation involving types can receive a unique hash and be recorded on the blockchain’s public ledger. This is one of the many components that make Statebox fail-resistant: every process and event is accounted for on the public ledger, and the whole history of a process can be rolled back and analyzed thanks to the blockchain technology.
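The phrase “initial F-algebra” has a very concrete reading in a functional language. The Haskell sketch below shows only the underlying idea, not the Typedefs implementation: a type is described by a functor, its values live in that functor's fixed point, and folding with a catamorphism is the unique map out of the initial algebra. One such fold yields a canonical serialization, which could then be hashed.

```haskell
{-# LANGUAGE DeriveFunctor #-}
-- The idea behind "types as initial F-algebras"; illustrative only.

-- The fixed point of a functor: the carrier of its initial algebra.
newtype Fix f = Fix (f (Fix f))

-- A tiny "type definition": natural numbers, described by the functor 1 + X.
data NatF x = ZeroF | SuccF x deriving Functor

type Nat = Fix NatF

-- The fold (catamorphism): the unique algebra morphism out of the initial algebra.
cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg (Fix t) = alg (fmap (cata alg) t)

-- One algebra interprets Nat as Int ...
toInt :: Nat -> Int
toInt = cata step
  where step ZeroF     = 0
        step (SuccF n) = n + 1

-- ... another gives a canonical serialization, which could then be hashed.
serialize :: Nat -> String
serialize = cata step
  where step ZeroF     = "Z"
        step (SuccF s) = 'S' : s

-- Example: toInt two == 2, serialize two == "SSZ".
two :: Nat
two = Fix (SuccF (Fix (SuccF (Fix ZeroF))))
```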

The Statebox team is currently working on a monograph that will neatly present how all the pertinent categorical theories work together in Statebox. This is a formidable task that will take months to complete, but will also be the cleanest way to understand how Statebox works, and which mathematical questions have still to be answered to obtain a working product. It will be a thorough document that also considers important aspects such as our guiding ethics.

The team members are devoted to creating something positive and different, explicitly and solely to better the world. The business paradigm is based on the principle that innovation should be open and collaborative, rather than competitive and exclusive. We want to share ideas and work with you. There are many blooming endeavors which share the ideals that have been described in this article, and we want them all to learn from each other and build off one another.

For example, Statebox contributor and visionary economist Viktor Winschel has a fantastic project called Oicos. The great proponent of applied category theory, David Spivak, has an exciting and impressive organization called Categorical Informatics. Mike Stay, a past student of Dr. Baez, has started a company called Pyrofex, which is developing categorical distributed computation. There are also somewhat related languages for blockchain, such as Simplicity, and innovative distributed systems such as Iota and RChain. Even Ethereum is beginning to utilize categories, with Casper. And of course there are research groups, such as Network Theory and Mathematically Structured Programming, as well as so many important papers, such as Algebraic Databases. This is just a slice of everything going on; as far as I know there is not yet a comprehensive account of all the great applied category theory and distributed innovations being developed. Inevitably these endeavors will follow the principle they share, and come together in a big way. Statebox is ready, willing, and able to help make this reality.

If you are interested in Statebox, you are welcomed with open arms. You can contact Jelle, Fabrizio, Emi, or Anton; they can provide more information, connect you to the discussion, or anything else. There will be a second summit in 2018, about six months from now, details to be determined. We hope to see you there. Future posts will keep you updated, and explain more of the theory and design of Statebox. Thank you very much for reading.

P.S. Found unexpected support in Šibenik! Great bar—once a reservoir.

Azimuth Backup Project (Part 3)

22 January, 2017


Along with the bad news there is some good news:

• Over 380 people have pledged over $14,000 to the Azimuth Backup Project on Kickstarter, greatly surpassing our conservative initial goal of $5,000.

• Given our budget, we currently aim at backing up 40 terabytes of data, and we are well on our way to this goal. You can see what we’ve done at Our Progress, and what we’re still doing at the Issue Tracker.

• I have gotten a commitment from Danna Gianforte, the head of Computing and Communications at U. C. Riverside, that eventually the university will maintain a copy of our data. (This commitment is based on my earlier estimate that we’d have 20 terabytes of data, so I need to see if 40 is okay.)

• I have gotten two offers from other people, saying they too can hold our data.

I’m hoping that the data at U. C. Riverside will be made publicly available through a server. The other offers may involve the data being held ‘secretly’ until such time as it becomes needed; that has its own complementary advantages.

However, the interesting problem that confronts us now is: how to spend our money?

You can see how we’re currently spending it on our Budget and Spending page. Basically, we’re paying a firm called Hetzner for servers and storage boxes.

We could simply continue to do this until our money runs out. I hope that long before then, U. C. Riverside will have taken over some responsibilities. If so, there would be a long period where our money would largely pay for a redundant backup. Redundancy is good, but perhaps there is something better.

Two members of our team, Sakari Maaranen and Greg Kochanski, have thoughts on this matter which I’d like to share. Sakari posted his thoughts on Google+, while Greg posted his in an email which he’s letting me share here.

Please read these and offer us your thoughts! Maybe you can help us decide on the best strategy!

Sakari Maaranen

For the record, my views on our strategy of using the budget that the Azimuth Climate Data Backup Project now has.

People have contributed it to this effort specifically.

Some non-government entities have offered “free hosting”. Of course the project should take any and all free offers to host our data. Those would not be spending our budget however. And they are still paying for it, even if they offered it to us “for free”.

As far as it comes to spending, I think we should think in terms of 1) terabyte-months, and 2) sufficient redundancy, and do that as cost-efficiently as possible. We should not just dump the money to any takers, but think of the best bang for the buck. We owe that to the people who have contributed now.

For example, if we burn the cash quick to expensive storage, I would consider that a failure. Instead, we must plan for the best use of the budget towards our mission.

What we have promised to the people is that we back up and serve these data sets, by the money they have given to us. Let’s do exactly that.

We are currently serving the mission at approximately €0.006 per gigabyte-month, at least for as long as we have volunteers to work for free. The cost could be slightly higher if we paid for professional maintenance, which should be a reasonable assumption if we plan for long-term service. Volunteer work cannot be guaranteed forever, even if it works temporarily.

This is one view and the question is open to public discussion.

Greg Kochanski

Some misc thoughts.

1) As I see it, we have made some promise of serving the data (“create a better interface for getting it”) which can be an expensive thing.

UI coding isn’t all that easy, and takes some time.

Beyond that, we’ve promised to back up the data, and once you say “backup”, you’ve also made an implicit promise to make the data available.

2) I agree that if we have a backup, it is a logical extension to take continuous backups, but I wouldn’t say it’s necessary.

Perhaps the way to think about it is to ask the question, “what do our donors likely want”?

3) Clearly they want to preserve the data, in case it disappears from the Federal sites. So, that’s job 1. And, if it does disappear, we need to make it available.

3a) Making it available will require some serving CPU, disk, and network. We may need to worry about DDOS attacks, though perhaps we could get free coverage from Akamai or Google Project Shield.

3b) Making it available may imply paying some students to write Javascript and HTML to put up a front-end to allow people to access the data we are collecting.

Not all the data we’re collecting is in strictly servable form. Some of the databases, for example, aren’t usefully servable in the form we collect, and we know some links will be broken because of missing pages, or because of wget’s design flaw.*

[* Wget stores http://a/b/c as a file, a/b/c, where a/b is a directory. Wget stores http://a/b as a file a/b, where a/b is a file.

Therefore, both cannot exist simultaneously on disk. If they do, wget drops one.]

Points 3 & 3a imply that we need to keep some money in the bank until either the websites are taken down, or we decide that the threat has abated. So, we need to figure out how much money to keep as a serving reserve. It doesn’t sound like UCR has committed to serve the data, though you could perhaps ask.

Beyond the serving reserve, I think we are free to do better backups (i.e. more than one data collection), and change detection.

Saving Climate Data (Part 4)

21 January, 2017

At noon today in Washington DC, while Trump was being inaugurated, all mentions of “climate change” and “global warming” were eliminated from the White House website.

Well, not all. The word “climate” still shows up here:

President Trump is committed to eliminating harmful and unnecessary policies such as the Climate Action Plan….

There are also reports that all mentions of climate change will be scrubbed from the website of the Environmental Protection Agency, or EPA.

From Motherboard

Let me quote from this article:

• Jason Koebler, All references to climate change have been deleted from the White House website, Motherboard, 20 January 2017.

Scientists and professors around the country had been rushing to download and rehost as much government science as was possible before the transition, based on a fear that Trump’s administration would neglect or outright delete government information, databases, and web applications about science. Last week, the Radio Motherboard podcast recorded an episode about these efforts, which you can listen to below, or anywhere you listen to podcasts.

The Internet Archive, too, has been keeping a close watch on the White House website; President Obama’s climate change page had been archived every single day in January.

So far, nothing on the Environmental Protection Agency’s website has changed under Trump, but a report earlier this week from Inside EPA, a newsletter and website that reports on the agency, suggested that pages about climate are destined to be cut within the first few weeks of his presidency.

Scientists I’ve spoken to who are archiving websites say they expect scientific data on the NASA, NOAA, Department of Energy, and EPA websites to be neglected or deleted eventually. They say they don’t expect agency sites to be updated immediately, but expect it to play out over the course of months. This sort of low-key data destruction might not be the type of censorship people typically think about, but scientists are treating it as such.

From Technology Review

Greg Egan pointed out another good article, in MIT’s magazine:

• James Temple, Climate data preservation efforts mount as Trump takes office, Technology Review, 20 January 2017.

Quoting from that:

Dozens of computer science students at the University of California, Los Angeles, will mark Inauguration Day by downloading federal climate databases they fear could vanish under the Trump Administration.

Friday’s hackathon follows a series of grassroots data preservation efforts in recent weeks, amid increasing concerns the new administration is filling agencies with climate deniers likely eager to cut off access to scientific data that undermine their policy views. Those worries only grew earlier this week, when Inside EPA reported that the Environmental Protection Agency transition team plans to scrub climate data from the agency’s website, citing a source familiar with the team.

Earlier federal data hackathons include the “Guerrilla Archiving” event at the University of Toronto last month, the Internet Archive’s Gov Data Hackathon in San Francisco at the beginning of January, and the DataRescue Philly event at the University of Pennsylvania last week.

Much of the collected data is being stored in the servers of the End of Term Web Archive, a collaborative effort to preserve government websites at the conclusion of presidential terms. The University of Pennsylvania’s Penn Program in Environmental Humanities launched the separate DataRefuge project, in part to back up environmental data sets that standard Web crawling tools can’t collect.

Many of the groups are working off a master list of crucial data sets from NASA, the National Oceanic and Atmospheric Administration, the U.S. Geological Survey, and other agencies. Meteorologist and climate journalist Eric Holthaus helped prompt the creation of that crowdsourced list with a tweet early last month.

Other key developments driving the archival initiatives included reports that the transition team had asked Energy Department officials for a list of staff who attended climate change meetings in recent years, and public statements from senior campaign policy advisors arguing that NASA should get out of the business of “politically correct environmental monitoring.”

“The transition team has given us no reason to believe that they will respect scientific data, particularly when it’s inconvenient,” says Gretchen Goldman, research director in the Center for Science and Democracy at the Union of Concerned Scientists. These historical databases are crucial to ongoing climate change research in the United States and abroad, she says.

To be clear, the Trump camp hasn’t publicly declared plans to erase or eliminate access to the databases. But there is certainly precedent for state and federal governments editing, removing, or downplaying scientific information that doesn’t conform to their political views.

Late last year, it emerged that text on Wisconsin’s Department of Natural Resources website was substantially rewritten to remove references to climate change. In addition, an extensive Congressional investigation concluded in a 2007 report that the Bush Administration “engaged in a systematic effort to manipulate climate change science and mislead policymakers and the public about the dangers of global warming.”

In fact these Bush Administration efforts were masterminded by Myron Ebell, who Trump chose to lead his EPA transition team!


In fact, there are wide-ranging changes to federal websites with every change in administration for a variety of reasons. The Internet Archive, which collaborated on the End of Term project in 2008 and 2012 as well, notes that more than 80 percent of PDFs on .gov sites disappeared during that four-year period.

The organization has seen a surge of interest in backing up sites and data this year across all government agencies, but particularly for climate information. In the end, they expect to collect well more than 100 terabytes of data, close to triple the amount in previous years, says Jefferson Bailey, director of Web archiving.

In fact the Azimuth Backup Project alone may gather about 40 terabytes!

From Inside EPA

And then there’s this view from inside the Environmental Protection Agency:

• Dawn Reeves, Trump transition preparing to scrub some climate data from EPA Website, Inside EPA, 17 January 2017

The incoming Trump administration’s EPA transition team intends to remove non-regulatory climate data from the agency’s website, including references to President Barack Obama’s June 2013 Climate Action Plan, the 2014 and 2015 strategies to cut methane, and other data, according to a source familiar with the transition team.

Additionally, Obama’s 2013 memo ordering EPA to establish its power sector carbon pollution standards “will not survive the first day,” the source says, a step that rule opponents say is integral to the incoming administration’s pledge to roll back the Clean Power Plan and new source power plant rules.

The Climate Action Plan has been the Obama administration’s government-wide blueprint for addressing climate change and includes information on cutting domestic greenhouse gas (GHG) emissions, including both regulatory and voluntary approaches; information on preparing for the impacts of climate change; and information on leading international efforts.

The removal of such information from EPA’s website — as well as likely removal of references to such programs that link to the White House and other agency websites — is being prepped now.

The transition team’s preparations fortify concerns from agency staff, environmentalists and many scientists that the Trump administration is going to destroy reams of EPA and other agencies’ climate data. Scientists have been preparing for this possibility for months, with many working to preserve key data on private websites.

Environmentalists are also stepping up their efforts to preserve the data. The Sierra Club Jan. 13 filed a Freedom of Information Act request seeking reams of climate-related data from EPA and the Department of Energy (DOE), including power plant GHG data. Even if the request is denied, the group said it should buy them some time.

“We’re interested in trying to download and preserve the information, but it’s going to take some time,” Andrea Issod, a senior attorney with the Sierra Club, told Bloomberg. “We hope our request will be a counterweight to the coming assault on this critical pollution and climate data.”

While Trump has pledged to take a host of steps to roll back Obama EPA climate and other high-profile actions on his first day in office, transition and other officials say the date may slip.

“In truth, it might not [happen] on the first day, it might be a week,” the source close to the transition says of the removal of climate information from EPA’s website. The source adds that in addition to EPA, the transition team is also looking at such information on the websites of DOE and the Interior Department.

Additionally, incoming Trump press secretary Sean Spicer told reporters Jan. 17 that not much may happen on Inauguration Day itself, but to expect major developments the following Monday, Jan. 23. “I think on [Jan. 23] you’re going to see a big flurry of activity” that is expected to include the disappearance of at least some EPA climate references.

Until Trump is inaugurated on Jan. 20, the transition team cannot tell agency staff what to do, and the source familiar with the transition team’s work is unaware of any communications requiring language removal or beta testing of websites happening now, though it appears that some of this work is occurring.

“We can only ask for information at this point until we are in charge. On [Jan. 20] at about 2 o’clock, then they can ask [staff] to” take actions, the source adds.

Scope & Breadth

The scope and breadth of the information to be removed is unclear. While it is likely to include executive actions on climate, it does not appear that the reams of climate science information, including models, tools and databases on the EPA Office of Research & Development’s (ORD) website will be impacted, at least not immediately.

ORD also has published climate, air and energy strategic research action plans, including one for 2016-2019 that includes research to assess impacts; prevent and reduce emissions; and prepare for and respond to changes in climate and air quality.

But other EPA information maintained on its websites, including its climate change page and its “What is EPA doing about climate change” page that references the Climate Action Plan, the 2014 methane strategy, and a 2015 oil and gas methane reduction strategy, is an expected target.

Another possible target is new information EPA just compiled—and hosted a Jan. 17 webinar to discuss—on climate change impacts to vulnerable communities.

One former EPA official who has experience with transitions says it is unlikely that any top Obama EPA official is on board with this. “I would think they would be violently against this. . . I would think that the last thing [EPA Administrator] Gina McCarthy would want to do would to be complicit in Trump’s effort to purge the website” of climate-related work, and that if she knew she would “go ballistic.”

But the former official, the source close to the transition team and others note that EPA career staff is fearful and may be undertaking such prep work “as a defensive maneuver to avoid getting targeted,” the official says, adding that any directive would likely be coming from mid-level managers rather than political appointees or senior level officials.

But while the former official was surprised that such work might be happening now, the fact that it is only said to be targeting voluntary efforts “has a certain ring of truth to it. Someone who is knowledgeable would draw that distinction.”

Additionally, one science advocate says, “The people who are running the EPA transition have a long history of sowing misunderstanding about climate change and they tend to believe in a vast conspiracy in the scientific community to lie to the public. If they think the information is truly fraudulent, it would make sense they would try to scrub it. . . . But the role of the agency is to inform the public . . . [and not to satisfy] the musings of a band of conspiracy theorists.”

The source was referring to EPA transition team leader Myron Ebell, a long-time climate skeptic at the Competitive Enterprise Institute, along with David Schnare, another opponent of climate action, who is at the Energy & Environment Legal Institute.

And while “a new administration has the right to change information about policy, what they don’t have the right to do is change the scientific information about policies they wish to put forward and that includes removing resources on science that serve the public.”

The advocate adds that many state and local governments rely on EPA climate information.

EPA Concern

But there has been plenty of concern that such a move would take place, especially after transition team officials last month sought the names of DOE employees who worked on climate change, raising alarms and cries of a “political witch hunt” along with a Dec. 13 letter from Sen. Maria Cantwell (D-WA) that prompted the transition team to disavow the memo.

Since then, scientists have been scrambling to preserve government data.

On Jan. 10, High Country News reported that on a Saturday last month, 150 technology specialists, hackers, scholars and activists assembled in Toronto for the “Guerrilla Archiving Event: Saving Environmental Data from Trump” where the group combed the internet for key climate and environmental data from EPA’s website.

“A giant computer program would then copy the information onto an independent server, where it will remain publicly accessible—and safe from potential government interference.”

The organizer of the event, Henry Warwick, said, “Say Trump firewalls the EPA,” pulling reams of information from public access. “No one will have access to the data in these papers” unless the archiving took place.

Additionally, the Union of Concerned Scientists released a Jan. 17 report, “Preserving Scientific Integrity in Federal Policy Making,” urging the Trump administration to retain scientific integrity. It wrote in a related blog post, “So how will government science fare under Trump? Scientists are not just going to wait and see. More than 5,500 scientists have now signed onto a letter asking the president-elect to uphold scientific integrity in his administration. . . . We know what’s at stake. We’ve come too far with scientific integrity to see it unraveled by an anti-science president. It’s worth fighting for.”

The Irreversible Momentum of Clean Energy

17 January, 2017

The president of the US recently came out with an article in Science. It’s about climate change and clean energy:

• Barack Obama, The irreversible momentum of clean energy, Science, 13 January 2017.

Since it’s open-access, I’m going to take the liberty of quoting the whole thing, minus the references, which provide support for a lot of his facts and figures.

The irreversible momentum of clean energy

The release of carbon dioxide (CO2) and other greenhouse gases (GHGs) due to human activity is increasing global average surface air temperatures, disrupting weather patterns, and acidifying the ocean. Left unchecked, the continued growth of GHG emissions could cause global average temperatures to increase by another 4°C or more by 2100 and by 1.5 to 2 times as much in many midcontinent and far northern locations. Although our understanding of the impacts of climate change is increasingly and disturbingly clear, there is still debate about the proper course for U.S. policy — a debate that is very much on display during the current presidential transition. But putting near-term politics aside, the mounting economic and scientific evidence leave me confident that trends toward a clean-energy economy that have emerged during my presidency will continue and that the economic opportunity for our country to harness that trend will only grow. This Policy Forum will focus on the four reasons I believe the trend toward clean energy is irreversible.


The United States is showing that GHG mitigation need not conflict with economic growth. Rather, it can boost efficiency, productivity, and innovation. Since 2008, the United States has experienced the first sustained period of rapid GHG emissions reductions and simultaneous economic growth on record. Specifically, CO2 emissions from the energy sector fell by 9.5% from 2008 to 2015, while the economy grew by more than 10%. In this same period, the amount of energy consumed per dollar of real gross domestic product (GDP) fell by almost 11%, the amount of CO2 emitted per unit of energy consumed declined by 8%, and CO2 emitted per dollar of GDP declined by 18%.

The importance of this trend cannot be overstated. This “decoupling” of energy sector emissions and economic growth should put to rest the argument that combatting climate change requires accepting lower growth or a lower standard of living. In fact, although this decoupling is most pronounced in the United States, evidence that economies can grow while emissions do not is emerging around the world. The International Energy Agency’s (IEA’s) preliminary estimate of energy related CO2 emissions in 2015 reveals that emissions stayed flat compared with the year before, whereas the global economy grew. The IEA noted that “There have been only four periods in the past 40 years in which CO2 emission levels were flat or fell compared with the previous year, with three of those — the early 1980s, 1992, and 2009 — being associated with global economic weakness. By contrast, the recent halt in emissions growth comes in a period of economic growth.”

At the same time, evidence is mounting that any economic strategy that ignores carbon pollution will impose tremendous costs to the global economy and will result in fewer jobs and less economic growth over the long term. Estimates of the economic damages from warming of 4°C over preindustrial levels range from 1% to 5% of global GDP each year by 2100. One of the most frequently cited economic models pins the estimate of annual damages from warming of 4°C at ~4% of global GDP, which could lead to lost U.S. federal revenue of roughly $340 billion to $690 billion annually.

Moreover, these estimates do not include the possibility of GHG increases triggering catastrophic events, such as the accelerated shrinkage of the Greenland and Antarctic ice sheets, drastic changes in ocean currents, or sizable releases of GHGs from previously frozen soils and sediments that rapidly accelerate warming. In addition, these estimates factor in economic damages but do not address the critical question of whether the underlying rate of economic growth (rather than just the level of GDP) is affected by climate change, so these studies could substantially understate the potential damage of climate change on the global macroeconomy.

As a result, it is becoming increasingly clear that, regardless of the inherent uncertainties in predicting future climate and weather patterns, the investments needed to reduce emissions — and to increase resilience and preparedness for the changes in climate that can no longer be avoided — will be modest in comparison with the benefits from avoided climate-change damages. This means, in the coming years, states, localities, and businesses will need to continue making these critical investments, in addition to taking common-sense steps to disclose climate risk to taxpayers, homeowners, shareholders, and customers. Global insurance and reinsurance businesses are already taking such steps as their analytical models reveal growing climate risk.


Beyond the macroeconomic case, businesses are coming to the conclusion that reducing emissions is not just good for the environment — it can also boost bottom lines, cut costs for consumers, and deliver returns for shareholders.

Perhaps the most compelling example is energy efficiency. Government has played a role in encouraging this kind of investment and innovation. My Administration has put in place (i) fuel economy standards that are net beneficial and are projected to cut more than 8 billion tons of carbon pollution over the lifetime of new vehicles sold between 2012 and 2029 and (ii) 44 appliance standards and new building codes that are projected to cut 2.4 billion tons of carbon pollution and save $550 billion for consumers by 2030.

But ultimately, these investments are being made by firms that decide to cut their energy waste in order to save money and invest in other areas of their businesses. For example, Alcoa has set a goal of reducing its GHG intensity 30% by 2020 from its 2005 baseline, and General Motors is working to reduce its energy intensity from facilities by 20% from its 2011 baseline over the same timeframe. Investments like these are contributing to what we are seeing take place across the economy: Total energy consumption in 2015 was 2.5% lower than it was in 2008, whereas the economy was 10% larger.

This kind of corporate decision-making can save money, but it also has the potential to create jobs that pay well. A U.S. Department of Energy report released this week found that ~2.2 million Americans are currently employed in the design, installation, and manufacture of energy-efficiency products and services. This compares with the roughly 1.1 million Americans who are employed in the production of fossil fuels and their use for electric power generation. Policies that continue to encourage businesses to save money by cutting energy waste could pay a major employment dividend and are based on stronger economic logic than continuing the nearly $5 billion per year in federal fossil-fuel subsidies, a market distortion that should be corrected on its own or in the context of corporate tax reform.


The American electric-power sector — the largest source of GHG emissions in our economy — is being transformed, in large part, because of market dynamics. In 2008, natural gas made up ~21% of U.S. electricity generation. Today, it makes up ~33%, an increase due almost entirely to the shift from higher-emitting coal to lower-emitting natural gas, brought about primarily by the increased availability of low-cost gas due to new production techniques. Because the cost of new electricity generation using natural gas is projected to remain low relative to coal, it is unlikely that utilities will change course and choose to build coal-fired power plants, which would be more expensive than natural gas plants, regardless of any near-term changes in federal policy. Although methane emissions from natural gas production are a serious concern, firms have an economic incentive over the long term to put in place waste-reducing measures consistent with standards my Administration has put in place, and states will continue making important progress toward addressing this issue, irrespective of near-term federal policy.

Renewable electricity costs also fell dramatically between 2008 and 2015: the cost of electricity fell 41% for wind, 54% for rooftop solar photovoltaic (PV) installations, and 64% for utility-scale PV. According to Bloomberg New Energy Finance, 2015 was a record year for clean energy investment, with those energy sources attracting twice as much global capital as fossil fuels.

Public policy — ranging from Recovery Act investments to recent tax credit extensions — has played a crucial role, but technology advances and market forces will continue to drive renewable deployment. The levelized cost of electricity from new renewables like wind and solar in some parts of the United States is already lower than that for new coal generation, without counting subsidies for renewables.

That is why American businesses are making the move toward renewable energy sources. Google, for example, announced last month that, in 2017, it plans to power 100% of its operations using renewable energy — in large part through large-scale, long-term contracts to buy renewable energy directly. Walmart, the nation’s largest retailer, has set a goal of getting 100% of its energy from renewables in the coming years. And economy-wide, solar and wind firms now employ more than 360,000 Americans, compared with around 160,000 Americans who work in coal electric generation and support.

Beyond market forces, state-level policy will continue to drive clean-energy momentum. States representing 40% of the U.S. population are continuing to move ahead with clean-energy plans, and even outside of those states, clean energy is expanding. For example, wind power alone made up 12% of Texas’s electricity production in 2015 and, at certain points in 2015, that number was >40%, and wind provided 32% of Iowa’s total electricity generation in 2015, up from 8% in 2008 (a higher fraction than in any other state).


Outside the United States, countries and their businesses are moving forward, seeking to reap benefits for their countries by being at the front of the clean-energy race. This has not always been the case. A short time ago, many believed that only a small number of advanced economies should be responsible for reducing GHG emissions and contributing to the fight against climate change. But nations agreed in Paris that all countries should put forward increasingly ambitious climate policies and be subject to consistent transparency and accountability requirements. This was a fundamental shift in the diplomatic landscape, which has already yielded substantial dividends. The Paris Agreement entered into force in less than a year, and, at the follow-up meeting this fall in Marrakesh, countries agreed that, with more than 110 countries representing more than 75% of global emissions having already joined the Paris Agreement, climate action “momentum is irreversible”. Although substantive action over decades will be required to realize the vision of Paris, analysis of countries’ individual contributions suggests that meeting medium-term respective targets and increasing their ambition in the years ahead — coupled with scaled-up investment in clean-energy technologies — could increase the international community’s probability of limiting warming to 2°C by as much as 50%.

Were the United States to step away from Paris, it would lose its seat at the table to hold other countries to their commitments, demand transparency, and encourage ambition. This does not mean the next Administration needs to follow identical domestic policies to my Administration’s. There are multiple paths and mechanisms by which this country can achieve — efficiently and economically — the targets we embraced in the Paris Agreement. The Paris Agreement itself is based on a nationally determined structure whereby each country sets and updates its own commitments. Regardless of U.S. domestic policies, it would undermine our economic interests to walk away from the opportunity to hold countries representing two-thirds of global emissions — including China, India, Mexico, European Union members, and others — accountable. This should not be a partisan issue. It is good business and good economics to lead a technological revolution and define market trends. And it is smart planning to set long term emission-reduction targets and give American companies, entrepreneurs, and investors certainty so they can invest and manufacture the emission-reducing technologies that we can use domestically and export to the rest of the world. That is why hundreds of major companies — including energy-related companies from ExxonMobil and Shell, to DuPont and Rio Tinto, to Berkshire Hathaway Energy, Calpine, and Pacific Gas and Electric Company — have supported the Paris process, and leading investors have committed $1 billion in patient, private capital to support clean-energy breakthroughs that could make even greater climate ambition possible.


We have long known, on the basis of a massive scientific record, that the urgency of acting to mitigate climate change is real and cannot be ignored. In recent years, we have also seen that the economic case for action — and against inaction — is just as clear, the business case for clean energy is growing, and the trend toward a cleaner power sector can be sustained regardless of near-term federal policies.

Despite the policy uncertainty that we face, I remain convinced that no country is better suited to confront the climate challenge and reap the economic benefits of a low-carbon future than the United States and that continued participation in the Paris process will yield great benefit for the American people, as well as the international community. Prudent U.S. policy over the next several decades would prioritize, among other actions, decarbonizing the U.S. energy system, storing carbon and reducing emissions within U.S. lands, and reducing non-CO2 emissions.

Of course, one of the great advantages of our system of government is that each president is able to chart his or her own policy course. And President-elect Donald Trump will have the opportunity to do so. The latest science and economics provide a helpful guide for what the future may bring, in many cases independent of near-term policy choices, when it comes to combatting climate change and transitioning to a clean energy economy.

Saving Climate Data (Part 3)

23 December, 2016

You can back up climate data, but how can anyone be sure your backups are accurate? Let’s suppose the databases you’ve backed up have been deleted, so that there’s no way to directly compare your backup with the original. And to make things really tough, let’s suppose that faked databases are being promoted as competitors with the real ones! What can you do?

One idea is ‘safety in numbers’. If a bunch of backups all match, and they were made independently, it’s less likely that they all suffer from the same errors.

Another is ‘safety in reputation’. If a bunch of backups of climate data are held by academic institutes of climate science, and another bunch are held by climate-change-denying organizations (conveniently listed here), you probably know which one you trust more. (And this is true even if you’re a climate change denier, though your answer may be different than mine.)

But a third idea is to use a cryptographic hash function. In very simplified terms, this is a method of taking a database and computing a fairly short string from it, called a ‘digest’.


A good hash function makes it hard to change the database and get a new one with the same digest. So, if the person owning a database computes and publishes the digest, anyone can check that your backup is correct by computing its digest and comparing it to the original.

It’s not foolproof, but it works well enough to be helpful.

Of course, it only works if we have some trustworthy record of the original digest. But the digest is much smaller than the original database: for example, in the popular method called SHA-256, the digest is 256 bits long. So it’s much easier to make copies of the digest than to back up the original database. These copies should be stored in trustworthy ways—for example, the Internet Archive.
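
To make this concrete, here is a minimal sketch in Python of the check described above: compute the SHA-256 digest of your backup and compare it with the digest the custodians published. The filename and the published digest below are placeholders, not real values.

import hashlib

def sha256_digest(path, chunk_size=8192):
    """Compute the SHA-256 digest of a file, reading it in chunks
    so that even a very large database never has to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder values: substitute your own backup file and the digest
# actually published by the data's custodians.
published_digest = "paste the published digest here"
backup_digest = sha256_digest("my_climate_backup.tar.gz")

if backup_digest == published_digest:
    print("Backup matches the published digest.")
else:
    print("Backup does NOT match: it was corrupted or altered.")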

When Sakari Maaranen made a backup of the University of Idaho Gridded Surface Meteorological Data, he asked the custodians of that data to publish a digest, or ‘hash file’. One of them responded:

Sakari and others,

I have made the checksums for the UofI METDATA/gridMET files (1979-2015) as both md5sums and sha256sums.

You can find these hash files here:

After you download the files, you can check the sums with:

md5sum -c hash.md5

sha256sum -c hash.sha256

Please let me know if something is not ideal and we’ll fix it!

Thanks for suggesting we do this!

Sakari replied:

Thank you so much! This means everything to public mirroring efforts. If you’d like to help further promoting this Best Practice, consider getting it recognized as a standard when you do online publishing of key public information.

1. Publishing those hashes is already a major improvement on its own.

2. Publishing them on a secure website offers people further guarantees that there has not been any man-in-the-middle.

3. Digitally signing the checksum files offers the best easily achievable guarantees of data integrity by the person(s) who sign the checksum files.

Please consider having these three steps included in your science organisation’s online publishing training and standard Best Practices.

Feel free to forward this message to whom it may concern. Feel free to rephrase as necessary.

As a separate item, public mirroring instructions for how to best download your data and/or public websites would further guarantee permanence of all your uniquely valuable science data and public contributions.

Right now we should get this message viral through the government funded science publishing people. Please approach the key people directly – avoiding the delay of using official channels. We need to have all the uniquely valuable public data mirrored before possible changes in funding.

Again, thank you for your quick response!
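
To illustrate Sakari’s third point, digitally signing the checksum files, here is a small sketch using an Ed25519 key from the Python ‘cryptography’ library. It is only meant to show the idea; it is not what the gridMET custodians actually did, and in practice most people would reach for standard tools like GPG.

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The data custodian generates a key pair once and keeps the private half secret;
# the public half is published somewhere people already trust.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

# Sign the checksum file (e.g. the hash.sha256 file mentioned above).
checksums = open("hash.sha256", "rb").read()
signature = private_key.sign(checksums)

# Anyone holding the public key can verify the checksum file is unaltered.
try:
    public_key.verify(signature, checksums)
    print("Checksum file is authentic.")
except InvalidSignature:
    print("Checksum file has been tampered with.")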

There are probably lots of things to be careful about. Here’s one. Maybe you can think of more, and ways to deal with them.

What if the data keeps changing with time? This is especially true of climate records, where new temperatures and so on are added to a database every day, or month, or year. Then I think we need to ‘time-stamp’ everything. The owners of the original database need to keep a list of digests, with the time each one was made. And when you make a copy, you need to record the time it was made.
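
As a minimal sketch of that bookkeeping, again in Python and with hypothetical filenames: each time the database changes, append its digest and the current UTC time to a small log, which is itself tiny and easy to copy into trustworthy archives.

import datetime
import hashlib
import json

def record_digest(data_path, log_path="digest_log.jsonl"):
    """Append a time-stamped SHA-256 digest of data_path to a log file,
    one JSON record per line."""
    h = hashlib.sha256()
    with open(data_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    record = {
        "file": data_path,
        "sha256": h.hexdigest(),
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
    return record

# For example, run this whenever new temperature records are added:
# record_digest("monthly_temperatures.csv")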

Azimuth Backup Project (Part 2)

20 December, 2016


I want to list some databases that are particularly worth backing up. But to do this, we need to know what’s already been backed up. That’s what this post is about.

Azimuth backups

Here is information as of now (21:45 GMT 20 December 2016). I won’t update this information. For up-to-date information see

Azimuth Backup Project: Issue Tracker.

For up-to-date information on the progress of each of the individual databases listed below, see my summary of what’s happening now.

Here are the databases that we’ve backed up:

• NASA GISTEMP website at — downloaded by Jan and uploaded to Sakari’s datarefuge server.

• NOAA Carbon Dioxide Information Analysis Center (CDIAC) data at — downloaded by Jan and uploaded to Sakari’s datarefuge server.

• NOAA Carbon Tracker website at — downloaded by Jan, uploaded to Sakari’s datarefuge server.

These are still in progress, but I think we have our hands on the data:

• NOAA Precipitation Frequency Data at — downloaded by Borislav, not yet uploaded to Sakari’s datarefuge server.

• NOAA Carbon Dioxide Information Analysis Center (CDIAC) website at http://cdiac.ornl.gov — downloaded by Jan, uploaded to Sakari’s datarefuge server, but there’s evidence that the process was incomplete.

• NOAA website at https://www.ncdc.noaa.gov — downloaded by Jan, who is now attempting to upload it to Sakari’s datarefuge server.

• NOAA National Centers for Environmental Information (NCEI) website at https://www.ncdc.noaa.gov — downloaded by Jan, who is now attempting to upload it to Sakari’s datarefuge server, but there are problems.

• Ocean and Atmospheric Research data at — downloaded by Jan, now attempting to upload it to Sakari’s datarefuge server.

• NOAA NCEP/NCAR Reanalysis ftp site at — downloaded by Jan, now attempting to upload it to Sakari’s datarefuge server.

I think we’re getting these now, more or less:

• NOAA National Centers for Environmental Information (NCEI) ftp site at — in the process of being downloaded by Jan, “Very large. May be challenging to manage with my facilities”.

• NASA Planetary Data System (PDS) data at https://pds.nasa.gov — in the process of being downloaded by Sakari.

• NOAA tides and currents products website (which includes the sea level trends data) at — download in progress.

• NOAA National Centers for Environmental Information (NCEI) satellite datasets website at — download in progress.

• NASA JASON3 sea level data at — download in progress.

• U.S. Forest Service Climate Change Atlas website at — download in progress.

• NOAA Global Monitoring Division website at — download in progress.

• NOAA Global Monitoring Division ftp data at — Jan is downloading this.

• NOAA National Data Buoy Center website at — download in progress.

• NASA-ESDIS Oak Ridge National Laboratory Distributed Active Archive (DAAC) on Biogeochemical Dynamics data at — download in progress.

• NASA-ESDIS Oak Ridge National Laboratory Distributed Active Archive (DAAC) on Biogeochemical Dynamics website at — download in progress.

Other backups

Other backups are listed at

The Climate Mirror Project.

This nicely provides the sizes of various backups, and other useful information. Some are ‘signed and verified’ with cryptographic keys, but I’m not sure exactly what that means, and the details matter.

About 90 databases are listed here, along with some size information and some information about whether people have already backed them up or are in process:

Gov. Climate Datasets (Archive). (Click on the tiny word “Datasets” at the bottom of the page!)


Azimuth Backup Project (Part 1)

16 December, 2016


This blog page is to help organize the Azimuth Environmental Data Backup Project, or Azimuth Backup Project for short. This is part of the larger but decentralized, frantic and somewhat disorganized project discussed elsewhere:

Saving Climate Data (Part 2), Azimuth, 15 December 2016.

Here I’ll just say what we’re doing at Azimuth.

Jan Galkowski is a statistician and engineer at Akamai Technologies, a company in Cambridge, Massachusetts, whose content delivery network is one of the world’s largest distributed computing platforms, responsible for serving at least 15% of all web traffic. He has begun copying some of the publicly accessible US government climate databases. On 11 December he wrote:

John, so I have just started trying to mirror all of CDIAC [the Carbon Dioxide Information Analysis Center]. We’ll see. I’ll put it in a tarball, and then throw it up on Google. It should keep everything intact. Using WinHTTrack. I have coordinated with Eric Holthaus via Twitter, creating, per your suggestion, a new personal account which I am using exclusively to follow the principals.

Once CDIAC is done, and checked over, I’ll move on to other sites.

There are things beyond our control, such as paper records, or records which are online but are not within visibility of the public.

Oh, and I’ve formally requested time off from work for latter half of December so I can work this on vacation. (I have a number of other projects I want to work in parallel, anyway.)

By 14 December he was wanting some more storage space. He asked David Tanzer and me:

Do either of you have a large Google account, or the “unlimited storage” option at Amazon?

I’m using WebDrive, a commercial product. What I’m (now) doing is defining an FTP map at a .gov server, and then a map to my Amazon Cloud Drive. I’m using Windows 7, so these appear as standard drives (or mounts, in *nix terms). I navigate to an appropriate place on the Amazon Drive, and then I proceed to copy from .gov to Amazon.

There is no compression, and, in order to be sure I don’t abuse the .gov site, I’m deliberately passing this over a wireless network in my home, which limits the transfer rate. If necessary, and if the .gov site permits, I could hardwire the workstation to our FIOS router and get appreciably faster transfer. (I often do that for large work files.)

The nice thing is I get to work from home 3 days a week, so I can keep an eye on this. And I’m taking days off just to do this.

I’m thinking about how I might get a second workstation in the act.

The Web sites themselves I’m downloading, as mentioned, using HTTrack. I intended to tarball-up the site structure and then upload to Amazon. I’m still working on CDIAC at ORNL. For future sites, I’m going to try to get HTTrack to mirror directly to Amazon using one of the mounts.

I asked around for more storage space, and my request was kindly answered by Scott Maxwell. Scott lives in Pasadena, California, and he used to work for NASA: he even had a job driving a Mars rover! He is now a site reliability engineer at Google, and he works on Google Drive. Scott is setting up a 10-terabyte account on Google Drive, which Jan and others will be able to use.

Meanwhile, Jan noticed some interesting technical problems: for some reason WebDrive is barely using the capacity of his network connection, so things are moving much more slowly than they could in theory.

Most recently, Sakari Maaranen offered his assistance. Sakari is a systems architect at Ubisecure, a firm in Finland that specializes in identity management, advanced user authentication, authorization, single sign-on, and federation. He wrote:

I have several terabytes worth in Helsinki (can get more) and a gigabit connection. I registered my offer but they [the DataRefuge people] didn’t reply though. I’m glad if that means you have help already and don’t need a copy in Helsinki.

I replied saying that the absence of a reply probably means that they’re overwhelmed by offers of help and are struggling to figure out exactly what to do. Scott said:

Hey, Sakari! Thank you for the generous offer!

I’m setting these guys up with Google Drive storage, as at least a short-term solution.

IMHO, our first order of business is just to get a copy of the data into a location we control—one that can’t easily be taken away from us. That’s the rationale for Google Drive: it fits into Jan’s existing workflow, so it’s the lowest-friction path to getting a copy of the data that’s under our control.

How about if I propose this: we let Jan go ahead with the plan of backing up the data in Drive. Then I’ll evaluate moving it from there to whatever other location we come up with. (Or copying instead of moving! More copies is better. :-) How does that sound to you?

I admit I haven’t gotten as far as thinking about Web serving at all—and it’s not my area of expertise anyway. Maybe you’d be kind enough to elaborate on your thoughts there.

Sakari responded with some information about servers. In late January, U. C. Riverside may help me with this—until then they are busy trying to get more storage space, for wholly unrelated reasons. But right now it seems the main job is to identify important data and get it under our control.

There are a lot of ways you could help.

Computer skills. Personally I’m not much use with anything technical about computers, but the rest of the Azimuth Data Backup gang probably has technical questions that some of you out there could answer… so, I encourage discussion of those questions. (Clearly some discussions are best done privately, and at some point we may encounter unfriendly forces, but this is a good place for roaming experts to answer questions.)

Security. Having a backup of climate data is not very useful if there are also fake databases floating around and you can’t prove yours is authentic. How can we create a kind of digital certificate that our database matches what was on a specific website at a specific time? We should do this if someone here has the expertise. (A rough sketch of one possible approach appears at the end of this post.)

Money. If we wind up wanting to set up a permanent database with a nice front end, accessible from the web, we may want money. We could do a Kickstarter campaign. People may be more interested in giving money now than later, unless the political situation immediately gets worse after January 20th.

Strategy. We should talk a bit about what to do next, though too much talk tends to prevent action. Eventually, if all goes well, our homegrown effort will be overshadowed by others, at least in sheer quantity. About 3 hours ago Eric Holthaus tweeted “we just got a donation of several petabytes”. If it becomes clear that others are putting together huge, secure databases with nice front ends, we can either quit or—better—cooperate with them, and specialize on something we’re good at and enjoy.
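
Here is a rough sketch, in Python and purely as an illustration, of the kind of ‘digital certificate’ mentioned under Security above: hash the downloaded snapshot, record where and when it was taken, and sign that record with a key whose public half is published somewhere people trust. The filenames and URL are hypothetical, and a real effort would more likely use standard tools such as GPG.

import datetime
import hashlib
import json

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def certify_snapshot(snapshot_path, source_url, private_key):
    """Return a signed record asserting that snapshot_path, with the given
    SHA-256 digest, was taken from source_url at the current UTC time."""
    h = hashlib.sha256()
    with open(snapshot_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    record = json.dumps({
        "source": source_url,
        "sha256": h.hexdigest(),
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }, sort_keys=True).encode()
    return record, private_key.sign(record)

# Hypothetical usage: the key pair belongs to whoever vouches for the snapshot.
key = Ed25519PrivateKey.generate()
record, signature = certify_snapshot("cdiac_snapshot.tar.gz", "http://cdiac.ornl.gov", key)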