## Azimuth Backup Project (Part 5)

5 October, 2017

I haven’t spoken much about the Azimuth Climate Data Backup Project, but it’s going well, and I’ll be speaking about it soon, here:

International Open Access Week, Wednesday 25 October 2017, 9:30–11:00 a.m., University of California, Riverside, Orbach Science Library, Room 240.

“Open in Order to Save Data for Future Research” is the 2017 event theme.

Open Access Week is an opportunity for the academic and research community to learn about the potential benefits of sharing what they’ve learned with colleagues, and to help inspire wider participation in helping to make “open access” a new norm in scholarship, research and data planning and preservation.

The Open Access movement is made of up advocates (librarians, publishers, university repositories, etc.) who promote the free, immediate, and online publication of research.

The program will provide information on issues related to saving open data, including climate change and scientific data. The panelists also will describe open access projects in which they have participated to save climate data and to preserve end-of-term presidential data, information likely to be and utilized by the university community for research and scholarship.

The program includes:

• Brianna Marshall, Director of Research Services, UCR Library: Brianna welcomes guests and introduces panelists.

• John Baez, Professor of Mathematics, UCR: John will describe his activities to save US government climate data through his collaborative effort, the Azimuth Climate Data Backup Project. All of the saved data is now open access for everyone to utilize for research and scholarship.

• Perry Willett, Digital Preservation Projects Manager, California Digital Library: Perry will discuss the open data initiatives in which CDL participates, including the end-of-term presidential web archiving that is done in partnership with the Library of Congress, Internet Archive and University of North Texas.

• Kat Koziar, Data Librarian, UCR Library: Kat will give an overview of DASH, the UC system data repository, and provide suggestions for researchers interested in making their data open.

This will be the eighth International Open Access Week program hosted by the UCR Library.

The event is free and open to the public. Light refreshments will be served.

## Complex Adaptive System Design (Part 4)

22 August, 2017

Last time I introduced typed operads. A typed operad has a bunch of operations for putting together things of various types and getting new things of various types. This is a very general idea! But in the CASCADE project we’re interested in something more specific: networks. So we want operads whose operations are ways to put together networks and get new networks.

That’s what our team came up with: John Foley of Metron, my graduate students Blake Pollard and Joseph Moeller, and myself. We’re writing a couple of papers on this, and I’ll let you know when they’re ready. These blog articles are kind of sneak preview—and a gentle introduction, where you can ask questions.

For example: I’m talking a lot about networks. But what is a ‘network’, exactly?

There are many kinds. At the crudest level, we can model a network as a simple graph, which is something like this:

There are some restrictions on what counts as a simple graph. If the vertices are agents of some sort and the edges are communication channels, these restrictions imply:

• We allow at most one channel between any pair of agents, since there’s at most one edge between any two vertices of our graph.

• The channels do not have a favored direction, since there are no arrows on the edges of our graph.

• We don’t allow a channel from an agent to itself, since an edge can’t start and end at the same vertex.

For other purposes we may want to drop some or all of these restrictions. There is an appalling diversity of options! We might want to allow multiple channels between a pair of agents. For this we could use multigraphs. We might want to allow directed channels, where the sender and receiver have different capabilities: for example, signals may only be able to flow in one direction. For this we could use directed graphs. And so on.

We will also want to consider graphs with colored vertices, to specify different types of agents—or colored edges, to specify different types of channels. Even more complicated variants are likely to become important as we proceed.

To avoid sinking into a mire of special cases, we need the full power of modern mathematics. Instead of separately studying all these various kinds of networks, we need a unified notion that subsumes all of them.

To do this, the Metron team came up with something called a ‘network model’. There is a network model for simple graphs, a network model for multigraphs, a network model for directed graphs, a network model for directed graphs with 3 colors of vertex and 15 colors of edge, and more.

You should think of a network model as a kind of network. Not a specific network, just a kind of network.

Our team proved that for each network model $G$ there is an operad $O_G$ whose operations describe how to put together networks of that kind. We call such operads ‘network operads’.

I want to make all this precise, but today let me just show you one example. Let’s take $G$ to be the network model for simple graphs, and look at the network operad $O_G.$ I won’t tell you what kind of thing $G$ is yet! But I’ll tell you about the operad $O_G$.

Types. Remember from last time that an operad has a set of ‘types’. For $O_G$ this is the set of natural numbers, $\mathbb{N}.$ The reason is that a simple graph can have any number of vertices.

Operations. Remember that an operad has sets of ‘operations’. In our case we have a set of operations $O_G(t_1,\dots,t_n ; t)$ for each choice of $t_1,\dots,t_n, t \in \mathbb{N}.$

An operation $f \in O_G(t_1,\dots,t_n; t)$ is a way of taking a simple graph with $t_1$ vertices, a simple graph with $t_2$ vertices,… and so on, and sticking them together, perhaps adding new edges, to get a simple graph with

$t = t_1 + \cdots + t_n$

vertices.

Let me show you an operation

$f \in O_G(3,4,2;9)$

This will be a way of taking three simple graphs—one with 3 vertices, one with 4, and one with 2—and sticking them together, perhaps adding edges, to get one with 9 vertices.

Here’s what $f$ looks like:

It’s a simple graph with vertices numbered from 1 to 9, with the vertices in bunches: {1,2,3}, {4,5,6,7} and {8,9}. It could be any such graph. This one happens to have an edge from 3 to 6 and an edge from 1 to 2.

Here’s how we can actually use our operation. Say we have three simple graphs like this:

Then we can use our operation to stick them together and get this:

Notice that we added a new edge from 3 to 6, connecting two of our three simple graphs. We also added an edge from 1 to 2… but this had no effect, since there was already an edge there! The reason is that simple graphs have at most one edge between vertices.

But what if we didn’t already have an edge from 1 to 2? What if we applied our operation $f$ to the following simple graphs?

Well, now we’d get this:

This time adding the edge from 1 to 2 had an effect, since there wasn’t already an edge there!

In short, we can use this operad to stick together simple graphs, but also to add new edges within the simple graphs we’re sticking together!

When I’m telling you how we ‘actually use’ our operad to stick together graphs, I’m secretly describing an algebra of our operad. Remember, an operad describes ways of sticking together things together, but an ‘algebra’ of the operad gives a particular specification of these things and describes how we stick them together.

Our operad $O_G$ has lots of interesting algebras, but I’ve just shown you the simplest one. More precisely:

Things. Remember from last time that for each type, an algebra specifies a set of things of that type. In this example our types are natural numbers, and for each natural number $t \in \mathbb{N}$ I’m letting the set of things $A(t)$ consist of all simple graphs with vertices $\{1, \dots, t\}.$

Action. Remember that our operad $O_G$ should have an action on $A$, meaning a bunch of maps

$\alpha : O_G(t_1,...,t_n ; t) \times A(t_1) \times \cdots \times A(t_n) \to A(t)$

I just described how this works in some examples. Some rules should hold… and they do.

To make sure you understand, try these puzzles:

Puzzle 1. In the example I just explained, what is the set $O_G(t_1,\dots,t_n ; t)$ if $t \ne t_1 + \cdots + t_n?$

Puzzle 2. In this example, how many elements does $O_G(1,1;2)$ have?

Puzzle 3. In this example, how many elements does $O_G(1,2;3)$ have?

Puzzle 4. In this example, how many elements does $O_G(1,1,1;3)$ have?

Puzzle 5. In the particular algebra $A$ that I explained, how many elements does $A(3)$ have?

Next time I’ll describe some more interesting algebras of this operad $O_G.$ These let us describe networks of mobile agents with range-limited communication channels!

## Complex Adaptive System Design (Part 3)

17 August, 2017

It’s been a long time since I’ve blogged about the Complex Adaptive System Composition and Design Environment or CASCADE project run by John Paschkewitz. For a reminder, read these:

Complex adaptive system design (part 1), Azimuth, 2 October 2016.

Complex adaptive system design (part 2), Azimuth, 18 October 2016.

A lot has happened since then, and I want to explain it.

I’m working with Metron Scientific Solutions to develop new techniques for designing complex networks.

The particular problem we began cutting our teeth on is a search and rescue mission where a bunch of boats, planes and drones have to locate and save people who fall overboard during a boat race in the Caribbean Sea. Subsequently the Metron team expanded the scope to other search and rescue tasks. But the real goal is to develop very generally applicable new ideas on designing and ‘tasking’ networks of mobile agents—that is, designing these networks and telling the agents what to do.

We’re using the mathematics of ‘operads’, in part because Spivak’s work on operads has drawn a lot of attention and raised a lot of hopes:

An operad is a bunch of operations for sticking together smaller things to create bigger ones—I’ll explain this in detail later, but that’s the core idea. Spivak described some specific operads called ‘operads of wiring diagrams’ and illustrated some of their potential applications. But when we got going on our project, we wound up using a different class of operads, which I’ll call ‘network operads’.

Here’s our dream, which we’re still trying to make into a reality:

Network operads should make it easy to build a big network from smaller ones and have every agent know what to do. You should be able to ‘slap together’ a network, throwing in more agents and more links between them, and automatically have it do something reasonable. This should be more flexible than an approach where you need to know ahead of time exactly how many agents you have, and how they’re connected, before you can tell them what to do.

You don’t want a network to malfunction horribly because you forgot to hook it up correctly. You want to focus your attention on optimizing the network, not getting it to work at all. And you want everything to work so smoothly that it’s easy for the network to adapt to changing conditions.

To achieve this we’re using network operads, which are certain special ‘typed operads’. So before getting into the details of our approach, I should say a bit about typed operads. And I think that will be enough for today’s post: I don’t want to overwhelm you with too much information at once.

In general, a ‘typed operad’ describes ways of sticking together things of various types to get new things of various types. An ‘algebra’ of the operad gives a particular specification of these things and the results of sticking them together. For now I’ll skip the full definition of a typed operad and only highlight the most important features. A typed operad $O$ has:

• a set $T$ of types.

• collections of operations $O(t_1,...,t_n ; t)$ where $t_i, t \in T$. Here $t_1, \dots, t_n$ are the types of the inputs, while $t$ is the type of the output.

• ways to compose operations. Given an operation
$f \in O(t_1,\dots,t_n ;t)$ and $n$ operations

$g_1 \in O(t_{11},\dots,t_{1 k_1}; t_1),\dots, g_n \in O(t_{n1},\dots,t_{n k_n};t_n)$

we can compose them to get

$f \circ (g_1,\dots,g_n) \in O(t_{11}, \dots, t_{nk_n};t)$

These must obey some rules.

But if you haven’t seen operads before, you’re probably reeling in horror—so I need to rush in and save you by showing you the all-important pictures that help explain what’s going on!

First of all, you should visualize an operation $f \in O(t_1, \dots, t_n; t)$ as a little gizmo like this:

It has $n$ inputs at top and one output at bottom. Each input, and the output, has a ‘type’ taken from the set $T.$ So, for example, if you operation takes two real numbers, adds them and spits out the closest integer, both input types would be ‘real’, while the output type would be ‘integer’.

The main thing we do with operations is compose them. Given an an operation $f \in O(t_1,\dots,t_n ;t),$ we can compose it with $n$ operations

$g_1 \in O(t_{11},\dots,t_{1 k_1}; t_1), \quad \dots, \quad g_n \in O(t_{n1},\dots,t_{n k_n};t_n)$

by feeding their outputs into the inputs of $f,$ like this:

The result is an operation we call

$f \circ (g_1, \dots, g_n)$

Note that the input types of $f$ have to match the output types of the $g_i$ for this to work! This is the whole point of types: they forbid us from composing operations in ways that don’t make sense.

This avoids certain stupid mistakes. For example, you can take the square root of a positive number, but you may not want to take the square root of a negative number, and you definitely don’t want to take the square root of a hamburger. While you can land a plane on an airstrip, you probably don’t want to land a plane on a person.

The operations in an operad are quite abstract: they aren’t really operating on anything. To render them concrete, we need another idea: operads have ‘algebras’.

An algebra $A$ of the operad $O$ specifies a set of things of each type $t \in T$ such that the operations of $O$ act on these sets. A bit more precisely, an algebra consists of:

• for each type $t \in T,$ a set $A(t)$ of things of type $t$

• an action of $O$ on $A,$ that is, a collection of maps

$\alpha : O(t_1,...,t_n ; t) \times A(t_1) \times \cdots \times A(t_n) \to A(t)$

obeying some rules.

In other words, an algebra turns each operation $f \in O(t_1,...,t_n ; t)$ into a function that eats things of types $t_1, \dots, t_n$ and spits out a thing of type $t.$

When we get to designing systems with operads, the fact that the same operad can have many algebras will be useful. Our operad will have operations describing abstractly how to hook up networks to form larger networks. An algebra will give a specific implementation of these operations. We can use one algebra that’s fairly fine-grained and detailed about what the operations actually do, and another that’s less detailed. There will then be a map between from the first algebra to the second, called an ‘algebra homomorphism’, that forgets some fine-grained details.

There’s a lot more to say—all this is just the mathematical equivalent of clearing my throat before a speech—but I’ll stop here for now.

And as I do—since it also takes me time to stop talking—I should make it clear yet again that I haven’t even given the full definition of typed operads and their algebras! Besides the laws I didn’t write down, there’s other stuff I omitted. Most notably, there’s a way to permute the inputs of an operation in an operad, and operads have identity operations, one for each type.

To see the full definition of an ‘untyped’ operad, which is really an operad with just one type, go here:

• Wikipedia, Operad theory.

They just call it an ‘operad’. Note that they first explain ‘non-symmetric operads’, where you can’t permute the inputs of operations, and then explain operads, where you can.

If you’re mathematically sophisticated, you can easily guess the laws obeyed by a typed operad just by looking at this article and inserting the missing types. You can also see the laws written down in Spivak’s paper, but with some different terminology: he calls types ‘objects’, he calls operations ‘morphisms’, and he calls typed operads ‘symmetric colored operads’—or once he gets going, just ‘operads’.

You can also see the definition of a typed operad in Section 2.1 here:

• Donald Yau, Operads of wiring diagrams.

What I would call a typed operad with $S$ as its set of types, he calls an ‘$S$-colored operad’.

I guess it’s already evident, but I’ll warn you that the terminology in this subject varies quite a lot from author to author: for example, a certain community calls typed operads ‘symmetric multicategories’. This is annoying at first but once you understand the subject it’s as ignorable as the fact that mathematicians have many different accents. The main thing to remember is that operads come in four main flavors, since they can either be typed or untyped, and they can either let you permute inputs or not. I’ll always be working with typed operads where you can permute inputs.

Finally, I’ll say that while the definition of operad looks lengthy and cumbersome at first, it becomes lean and elegant if you use more category theory.

Next time I’ll give you an example of an operad: the simplest ‘network

## Saving Climate Data (Part 6)

23 February, 2017

Scott Pruitt, who filed legal challenges against Environmental Protection Agency rules fourteen times, working hand in hand with oil and gas companies, is now head of that agency. What does that mean about the safety of climate data on the EPA’s websites? Here is an inside report:

• Dawn Reeves, EPA preserves Obama-Era website but climate change data doubts remain, InsideEPA.com, 21 February 2017.

For those of us who are backing up climate data, the really important stuff is in red near the bottom.

The EPA has posted a link to an archived version of its website from Jan. 19, the day before President Donald Trump was inaugurated and the agency began removing climate change-related information from its official site, saying the move comes in response to concerns that it would permanently scrub such data.

However, the archived version notes that links to climate and other environmental databases will go to current versions of them—continuing the fears that the Trump EPA will remove or destroy crucial greenhouse gas and other data.

The archived version was put in place and linked to the main page in response to “numerous [Freedom of Information Act (FOIA)] requests regarding historic versions of the EPA website,” says an email to agency staff shared by the press office. “The Agency is making its best reasonable effort to 1) preserve agency records that are the subject of a request; 2) produce requested agency records in the format requested; and 3) post frequently requested agency records in electronic format for public inspection. To meet these goals, EPA has re-posted a snapshot of the EPA website as it existed on January 19, 2017.”

The email adds that the action is similar to the snapshot taken of the Obama White House website.

The archived version of EPA’s website includes a “more information” link that offers more explanation.

For example, it says the page is “not the current EPA website” and that the archive includes “static content, such as webpages and reports in Portable Document Format (PDF), as that content appeared on EPA’s website as of January 19, 2017.”

It cites technical limits for the database exclusions. “For example, many of the links contained on EPA’s website are to databases that are updated with the new information on a regular basis. These databases are not part of the static content that comprises the Web Snapshot.” Searches of the databases from the archive “will take you to the current version of the database,” the agency says.

“In addition, links may have been broken in the website as it appeared” on Jan. 19 and those will remain broken on the snapshot. Links that are no longer active will also appear as broken in the snapshot.

“Finally, certain extremely large collections of content… were not included in the Snapshot due to their size” such as AirNow images, radiation network graphs, historic air technology transfer network information, and EPA’s searchable news releases.”

#### ‘Smart’ Move

One source urging the preservation of the data says the snapshot appears to be a “smart” move on EPA’s behalf, given the FOIA requests it has received, and notes that even though other groups like NextGen Climate and scientists have been working to capture EPA’s online information, having it on EPA’s site makes it official.

But it could also be a signal that big changes are coming to the official Trump EPA site, and it is unclear how long the agency will maintain the archived version.

The source says while it is disappointing that the archive may signal the imminent removal of EPA’s climate site, “at least they are trying to accommodate public concerns” to preserve the information.

A second source adds that while it is good that EPA is seeking “to address the widespread concern” that the information will be removed by an administration that does not believe in human-caused climate change, “on the other hand, it doesn’t address the primary concern of the data. It is snapshots of the web text.” Also, information “not included,” such as climate databases, is what is difficult to capture by outside groups and is what really must be preserved.

“If they take [information] down” that groups have been trying to preserve, then the underlying concern about access to data remains. “Web crawlers and programs can do things that are easy,” such as taking snapshots of text, “but getting the data inside the database is much more challenging,” the source says.

The first source notes that EPA’s searchable databases, such as those maintained by its Clean Air Markets Division, are used by the public “all the time.”

The agency’s Office of General Counsel (OGC) Jan. 25 began a review of the implications of taking down the climate page—a planned wholesale removal that was temporarily suspended to allow for the OGC review.

But EPA did remove some specific climate information, including links to the Clean Power Plan and references to President Barack Obama’s Climate Action Plan. Inside EPA captured this screenshot of the “What EPA Is Doing” page regarding climate change. Those links are missing on the Trump EPA site. The archive includes the same version of the page as captured by our screenshot.

Inside EPA first reported the plans to take down the climate information on Jan. 17.

After the OGC investigation began, a source close to the Trump administration said Jan. 31 that climate “propaganda” would be taken down from the EPA site, but that the agency is not expected to remove databases on GHG emissions or climate science. “Eventually… the propaganda will get removed…. Most of what is there is not data. Most of what is there is interpretation.”

The Sierra Club and Environmental Defense Fund both filed FOIA requests asking the agency to preserve its climate data, while attorneys representing youth plaintiffs in a federal climate change lawsuit against the government have also asked the Department of Justice to ensure the data related to its claims is preserved.

The Azimuth Climate Data Backup Project and other groups are making copies of actual databases, not just the visible portions of websites.

## Azimuth Backup Project (Part 4)

18 February, 2017

The Azimuth Climate Data Backup Project is going well! Our Kickstarter campaign ended on January 31st and the money has recently reached us. Our original goal was $5000. We got$20,427 of donations, and after Kickstarter took its cut we received $18,590.96. Next time I’ll tell you what our project has actually been doing. This time I just want to give a huge “thank you!” to all 627 people who contributed money on Kickstarter! I sent out thank you notes to everyone, updating them on our progress and asking if they wanted their names listed. The blanks in the following list represent people who either didn’t reply, didn’t want their names listed, or backed out and decided not to give money. I’ll list people in chronological order: first contributors first. Only 12 people backed out; the vast majority of blanks on this list are people who haven’t replied to my email. I noticed some interesting but obvious patterns. For example, people who contributed later are less likely to have answered my email yet—I’ll update this list later. People who contributed more money were more likely to answer my email. The magnitude of contributions ranged from$2000 to $1. A few people offered to help in other ways. The response was international—this was really heartwarming! People from the US were more likely than others to ask not to be listed. But instead of continuing to list statistical patterns, let me just thank everyone who contributed. Daniel Estrada Ahmed Amer Saeed Masroor Jodi Kaplan John Wehrle Bob Calder Andrea Borgia L Gardner Uche Eke Keith Warner Dean Kalahan James Benson Dianne Hackborn Walter Hahn Thomas Savarino Noah Friedman Eric Willisson Jeffrey Gilmore John Bennett Glenn McDavid Brian Turner Peter Bagaric Martin Dahl Nielsen Broc Stenman Gabriel Scherer Roice Nelson Felipe Pait Kenneth Hertz Luis Bruno Andrew Lottmann Alex Morse Mads Bach Villadsen Noam Zeilberger Buffy Lyon Josh Wilcox Danny Borg Krishna Bhogaonker Harald Tveit Alvestrand Tarek A. Hijaz, MD Jouni Pohjola Chavdar Petkov Markus Jöbstl Bjørn Borud Sarah G William Straub Frank Harper Carsten Führmann Rick Angel Drew Armstrong Jesimpson Valeria de Paiva Ron Prater David Tanzer Rafael Laguna Miguel Esteves dos Santos Sophie Dennison-Gibby Randy Drexler Peter Haggstrom Jerzy Michał Pawlak Santini Basra Jenny Meyer John Iskra Bruce Jones Māris Ozols Everett Rubel Mike D Manik Uppal Todd Trimble Federer Fanatic Forrest Samuel, Harmos Consulting Annie Wynn Norman and Marcia Dresner Daniel Mattingly James W. Crosby Jennifer Booth Greg Randolph Dave and Karen Deeter Sarah Truebe Tieg Zaharia Jeffrey Salfen Birian Abelson Logan McDonald Brian Truebe Jon Leland Nicole Sarah Lim James Turnbull John Huerta Katie Mandel Bruce Bethany Summer Heather Tilert Anna C. Gladstone Naom Hart Aaron Riley Giampiero Campa Julie A. Sylvia Pace Willisson Bangskij Peter Herschberg Alaistair Farrugia Conor Hennessy Stephanie Mohr Torinthiel Lincoln Muri Anet Ferwerda Hanna Michelle Lee Guiney Ben Doherty Trace Hagemann Ryan Mannion Penni and Terry O'Hearn Brian Bassham Caitlin Murphy John Verran Susan Alexander Hawson Fabrizio Mafessoni Anita Phagan Nicolas Acuña Niklas Brunberg Adam Luptak V. Lazaro Zamora Branford Werner Niklas Starck Westerberg Luca Zenti and Marta Veneziano Ilja Preuß Christopher Flint George Read Courtney Leigh Katharina Spoerri Daniel Risse Hanna Charles-Etienne Jamme rhackman41 Jeff Leggett RKBookman Aaron Paul Mike Metzler Patrick Leiser Melinda Ryan Vaughn Kent Crispin Michael Teague Ben Fabian Bach Steven Canning Betsy McCall John Rees Mary Peters Shane Claridge Thomas Negovan Tom Grace Justin Jones Jason Mitchell Josh Weber Rebecca Lynne Hanginger Kirby Dawn Conniff Michael T. Astolfi Kristeva Erik Keith Uber Elaine Mazerolle Matthieu Walraet Linda Penfold Lujia Liu Keith Samar Tareem Henrik Almén Michael Deakin Rutger Ockhorst Erin Bassett James Crook Junior Eluhu Dan Laufer Carl Robert Solovay Silica Magazine Leonard Saers Alfredo Arroyo García Larry Yu John Behemonth Eric Humphrey Svein Halvor Halvorsen Karim Issa Øystein Risan Borgersen David Anderson Bell III Ole-Morten Duesend Adam North and Gabrielle Falquero Robert Biegler Qu Wenhao Steffen Dittmar Shanna Germain Adam Blinkinsop John WS Marvin (Dread Unicorn Games) Bill Carter Darth Chronis Lawrence Stewart Gareth Hodges Colin Backhurst Christopher Metzger Rachel Gumper Mariah Thompson Falk Alexander Glade Johnathan Salter Maggie Unkefer Shawna Maryanovich Wilhelm Fitzpatrick Dylan “ExoByte” Mayo Lynda Lee Scott Carpenter Charles D, Payet Vince Rostkowski Tim Brown Raven Daegmorgan Zak Brueckner Christian Page Adi Shavit Steven Greenberg Chuck Lunney Adriel Bustamente Natasha Anicich Bram De Bie Edward L Gray Detrick Robert Sarah Russell Sam Leavin Abilash Pulicken Isabel Olondriz James Pierce James Morrison April Daniels José Tremblay Champagne Chris Edmonds Hans & Maria Cummings Bart Gasiewiski Andy Chamard Andrew Jackson Christopher Wright Crystal Collins ichimonji10 Alan Stern Alison W Dag Henrik Bråtane Martin Nilsson William Schrade ## Give the Earth a Present: Help Us Save Climate Data 28 December, 2016 We’ve been busy backing up climate data before Trump becomes President. Now you can help too, with some money to pay for servers and storage space. Please give what you can at our Kickstarter campaign here: If we get$5000 by the end of January, we can save this data until we convince bigger organizations to take over. If we don’t get that much, we get nothing. That’s how Kickstarter works. Also, if you donate now, you won’t be billed until January 31st.

I will make public how we spend this money. And if we get more than \$5000, I’ll make sure it’s put to good use. There’s a lot of work we could do to make sure the data is authenticated, made easily accessible, and so on.

### The idea

The safety of US government climate data is at risk. Trump plans to have climate change deniers running every agency concerned with climate change. So, scientists are rushing to back up the many climate databases held by US government agencies before he takes office.

We hope he won’t be rash enough to delete these precious records. But: better safe than sorry!

The Azimuth Climate Data Backup Project is part of this effort. So far our volunteers have backed up nearly 1 terabyte of climate data from NASA and other agencies. We’ll do a lot more! We just need some funds to pay for storage space and a server until larger institutions take over this task.

### The team

Jan Galkowski is a statistician with a strong interest in climate science. He works at Akamai Technologies, a company responsible for serving at least 15% of all web traffic. He began downloading climate data on the 11th of December.

• Shortly thereafter John Baez, a mathematician and science blogger at U. C. Riverside, joined in to publicize the project. He’d already founded an organization called the Azimuth Project, which helps scientists and engineers cooperate on environmental issues.

• When Jan started running out of storage space, Scott Maxwell jumped in. He used to work for NASA—driving a Mars rover among other things—and now he works for Google. He set up a 10-terabyte account on Google Drive and started backing up data himself.

• A couple of days later Sakari Maaranen joined the team. He’s a systems architect at Ubisecure, a Finnish firm, with access to a high-bandwidth connection. He set up a server, he’s downloading lots of data, he showed us how to authenticate it with SHA-256 hashes, and he’s managing many other technical aspects of this project.

There are other people involved too. You can watch the nitty-gritty details of our progress here:

## Azimuth Backup Project (Part 1)

16 December, 2016

This blog page is to help organize the Azimuth Environmental Data Backup Project, or Azimuth Backup Project for short. This is part of the larger but decentralized, frantic and somewhat disorganized project discussed elsewhere:

Saving Climate Data (Part 2), Azimuth, 15 December 2016.

Here I’ll just say what we’re doing at Azimuth.

Jan Galkowski is a statistician and engineer at Akamai Technologies, a company in Cambridge Massachusetts whose content delivery network is one of the world’s largest distributed computing platforms, responsible for serving at least 15% of all web traffic. He has begun copying some of the publicly accessible US government climate databases. On 11 December he wrote:

John, so I have just started trying to mirror all of CDIAC [the Carbon Dioxide Information Analysis Center]. We’ll see. I’ll put it in a tarball, and then throw it up on Google. It should keep everything intact. Using WinHTTrack. I have coordinated with Eric Holthaus via Twitter, creating, per your suggestion, a new personal account which I am using exclusively to follow the principals.

Once CDIAC is done, and checked over, I’ll move on to other sites.

There are things beyond our control, such as paper records, or records which are online but are not within visibility of the public.

Oh, and I’ve formally requested time off from work for latter half of December so I can work this on vacation. (I have a number of other projects I want to work in parallel, anyway.)

By 14 December he was wanting some more storage space. He asked David Tanzer and me:

Do either of you have a large Google account, or the “unlimited storage” option at Amazon?

I’m using WebDrive, a commercial product. What I’m (now) doing is defining an FTP map at a .gov server, and then a map to my Amazon Cloud Drive. I’m using Windows 7, so these appear as standard drives (or mounts, in *nix terms). I navigate to an appropriate place on the Amazon Drive, and then I proceed to copy from .gov to Amazon.

There is no compression, and, in order to be sure I don’t abuse the .gov site, I’m deliberately passing this over a wireless network in my home, which limits the transfer rate. If necessary, and if the .gov site permits, I could hardwire the workstation to our FIOS router and get appreciably faster transfer. (I often do that for large work files.)

The nice thing is I get to work from home 3 days a week, so I can keep an eye on this. And I’m taking days off just to do this.

I’m thinking about how I might get a second workstation in the act.

The Web sites themselves I’m downloading, as mentioned, using HTTrack. I intended to tarball-up the site structure and then upload to Amazon. I’m still working on CDIAC at ORNL. For future sites, I’m going to try to get HTTrack to mirror directly to Amazon using one of the mounts.

I asked around for more storage space, and my request was kindly answer by Scott Maxwell. Scott lives in Pasadena California and he used to work for NASA: he even had a job driving a Mars rover! He is now a site reliability engineer at Google, and he works on Google Drive. Scott is setting up a 10-terabyte account on Google Drive, which Jan and others will be able to use.

Meanwhile, Jan noticed some interesting technical problems: for some reason WebDrive is barely using the capacity of his network connection, so things are moving much more slowly than they could in theory.

Most recently, Sakari Maaranen offered his assistance. Sakari is a systems architect at Ubisecure, a firm in Finland that specializes in identity management, advanced user authentication, authorization, single sign-on, and federation. He wrote:

I have several terabytes worth in Helsinki (can get more) and a gigabit connection. I registered my offer but they [the DataRefuge people] didn’t reply though. I’m glad if that means you have help already and don’t need a copy in Helsinki.﻿

I replied saying that the absence of a reply probably means that they’re overwhelmed by offers of help and are struggling to figure out exactly what to do. Scott said:

Hey, Sakari! Thank you for the generous offer!

I’m setting these guys up with Google Drive storage, as at least a short-term solution.

IMHO, our first order of business is just to get a copy of the data into a location we control—one that can’t easily be taken away from us. That’s the rationale for Google Drive: it fits into Jan’s existing workflow, so it’s the lowest-friction path to getting a copy of the data that’s under our control.

How about if I propose this: we let Jan go ahead with the plan of backing up the data in Drive. Then I’ll look evaluate moving it from there to whatever other location we come up with. (Or copying instead of moving! More copies is better. :-) How does that sound to you?

I admit I haven’t gotten as far as thinking about Web serving at all—and it’s not my area of expertise anyway. Maybe you’d be kind enough to elaborate on your thoughts there.

Sakari responded with some information about servers. In late January, U. C. Riverside may help me with this—until then they are busy trying to get more storage space, for wholly unrelated reasons. But right now it seems the main job is to identify important data and get it under our control.

There are a lot of ways you could help.

Computer skills. Personally I’m not much use with anything technical about computers, but the rest of the Azimuth Data Backup gang probably has technical questions that some of you out there could answer… so, I encourage discussion of those questions. (Clearly some discussions are best done privately, and at some point we may encounter unfriendly forces, but this is a good place for roaming experts to answer questions.)

Security. Having a backup of climate data is not very useful if there are also fake databases floating around and you can’t prove yours is authentic. How can we create a kind of digital certificate that our database matches what was on a specific website at a specific time? We should do this if someone here has the expertise.

Money. If we wind up wanting to set up a permanent database with a nice front end, accessible from the web, we may want money. We could do a Kickstarter campaign. People may be more interested in giving money now than later, unless the political situation immediately gets worse after January 20th.

Strategy. We should talk a bit about what to do next, though too much talk tends to prevent action. Eventually, if all goes well, our homegrown effort will be overshadowed by others, at least in sheer quantity. About 3 hours ago Eric Holthaus tweeted “we just got a donation of several petabytes”. If it becomes clear that others are putting together huge, secure databases with nice front ends, we can either quit or—better—cooperate with them, and specialize on something we’re good at and enjoy.