## Give the Earth a Present: Help Us Save Climate Data

28 December, 2016

We’ve been busy backing up climate data before Trump becomes President. Now you can help too, with some money to pay for servers and storage space. Please give what you can at our Kickstarter campaign here:

If we get $5000 by the end of January, we can save this data until we convince bigger organizations to take over. If we don’t get that much, we get nothing. That’s how Kickstarter works. Also, if you donate now, you won’t be billed until January 31st. So, please help! It’s urgent. I will make public how we spend this money. And if we get more than $5000, I’ll make sure it’s put to good use. There’s a lot of work we could do to make sure the data is authenticated, made easily accessible, and so on.

### The idea

The safety of US government climate data is at risk. Trump plans to have climate change deniers running every agency concerned with climate change. So, scientists are rushing to back up the many climate databases held by US government agencies before he takes office.

We hope he won’t be rash enough to delete these precious records. But: better safe than sorry!

The Azimuth Climate Data Backup Project is part of this effort. So far our volunteers have backed up nearly 1 terabyte of climate data from NASA and other agencies. We’ll do a lot more! We just need some funds to pay for storage space and a server until larger institutions take over this task.

### The team

• Jan Galkowski is a statistician with a strong interest in climate science. He works at Akamai Technologies, a company responsible for serving at least 15% of all web traffic. He began downloading climate data on the 11th of December.

• Shortly thereafter John Baez, a mathematician and science blogger at U. C. Riverside, joined in to publicize the project. He’d already founded an organization called the Azimuth Project, which helps scientists and engineers cooperate on environmental issues.

• When Jan started running out of storage space, Scott Maxwell jumped in. He used to work for NASA—driving a Mars rover among other things—and now he works for Google. He set up a 10-terabyte account on Google Drive and started backing up data himself.

• A couple of days later Sakari Maaranen joined the team. He’s a systems architect at Ubisecure, a Finnish firm, with access to a high-bandwidth connection. He set up a server, he’s downloading lots of data, he showed us how to authenticate it with SHA-256 hashes, and he’s managing many other technical aspects of this project.

There are other people involved too. You can watch the nitty-gritty details of our progress here:

## Saving Climate Data (Part 3)

23 December, 2016

You can back up climate data, but how can anyone be sure your backups are accurate? Let’s suppose the databases you’ve backed up have been deleted, so that there’s no way to directly compare your backup with the original. And to make things really tough, let’s suppose that faked databases are being promoted as competitors with the real ones! What can you do?

One idea is ‘safety in numbers’. If a bunch of backups all match, and they were made independently, it’s less likely that they all suffer from the same errors.

Another is ‘safety in reputation’. If a bunch of backups of climate data are held by academic institutes of climate science, and others are held by climate change denying organizations (conveniently listed here), you probably know which one you trust more. (And this is true even if you’re a climate change denier, though your answer may be different than mine.)

But a third idea is to use a cryptographic hash function. In very simplified terms, this is a method of taking a database and computing a fairly short string from it, called a ‘digest’.

A good hash function makes it hard to change the database and get a new one with the same digest. So, if the people who own a database compute and publish its digest, anyone can check that a backup is correct by computing the backup’s digest and comparing it with the published one.

It’s not foolproof, but it works well enough to be helpful.

Of course, it only works if we have some trustworthy record of the original digest. But the digest is much smaller than the original database: for example, in the popular method called SHA-256, the digest is 256 bits long. So it’s much easier to make copies of the digest than to back up the original database. These copies should be stored in trustworthy ways—for example, the Internet Archive.
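To make this concrete, here’s a minimal sketch in Python of computing such a digest; the file name and the comparison in the final comment are hypothetical:

```python
# Minimal sketch: compute the SHA-256 digest of a backup file so it can be
# compared with the digest published by the data's custodians.
import hashlib

def sha256_digest(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, reading in
    chunks so that multi-gigabyte database dumps fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage: compare our copy's digest with the published one.
# print(sha256_digest("gridmet_backup.tar") == published_digest)
```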

When Sakari Maaranen made a backup of the University of Idaho Gridded Surface Meteorological Data, he asked the custodians of that data to publish a digest, or ‘hash file’. One of them responded:

Sakari and others,

I have made the checksums for the UofI METDATA/gridMET files (1979-2015) as both md5sums and sha256sums.

You can find these hash files here:

https://www.northwestknowledge.net/metdata/data/hash.md5

https://www.northwestknowledge.net/metdata/data/hash.sha256

md5sum -c hash.md5

sha256sum -c hash.sha256

Please let me know if something is not ideal and we’ll fix it!

Thanks for suggesting we do this!

Sakari replied:

Thank you so much! This means everything to public mirroring efforts. If you’d like to help promote this Best Practice further, consider getting it recognized as a standard when you do online publishing of key public information.

1. Publishing those hashes is already a major improvement on its own.

2. Publishing them on a secure website offers people further guarantees that there has not been any man-in-the-middle.

3. Digitally signing the checksum files offers the best easily achievable guarantees of data integrity by the person(s) who sign the checksum files.

Please consider having these three steps included in your science organisation’s online publishing training and standard Best Practices.

Feel free to forward this message to whom it may concern. Feel free to rephrase as necessary.

As a separate item, public mirroring instructions for how to best download your data and/or public websites would further guarantee permanence of all your uniquely valuable science data and public contributions.

Right now we should get this message viral through the government funded science publishing people. Please approach the key people directly – avoiding the delay of using official channels. We need to have all the uniquely valuable public data mirrored before possible changes in funding.

Again, thank you for your quick response!

There are probably lots of things to be careful about. Here’s one. Maybe you can think of more, and ways to deal with them.

What if the data keeps changing with time? This is especially true of climate records, where new temperatures and so on are added to a database every day, or month, or year. Then I think we need to ‘time-stamp’ everything. The owners of the original database need to keep a list of digests, with the time each one was made. And when you make a copy, you need to record the time it was made.
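As a minimal sketch of such time-stamping, assuming the owners simply keep an append-only text log of digests (the file names here are hypothetical):

```python
# Minimal sketch: an append-only log of (UTC timestamp, digest, file) records,
# so a database that grows over time has a digest for each date.
import hashlib
from datetime import datetime, timezone

def log_digest(data_path, log_path="digests.log"):
    h = hashlib.sha256()
    with open(data_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    stamp = datetime.now(timezone.utc).isoformat()
    with open(log_path, "a") as log:
        log.write(f"{stamp}  {h.hexdigest()}  {data_path}\n")
```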

## Azimuth Backup Project (Part 1)

16 December, 2016

This blog page is to help organize the Azimuth Environmental Data Backup Project, or Azimuth Backup Project for short. This is part of the larger but decentralized, frantic and somewhat disorganized project discussed elsewhere:

Saving Climate Data (Part 2), Azimuth, 16 December 2016.

Here I’ll just say what we’re doing at Azimuth.

Jan Galkowski is a statistician and engineer at Akamai Technologies, a company in Cambridge, Massachusetts whose content delivery network is one of the world’s largest distributed computing platforms, responsible for serving at least 15% of all web traffic. He has begun copying some of the publicly accessible US government climate databases. On 11 December he wrote:

John, so I have just started trying to mirror all of CDIAC [the Carbon Dioxide Information Analysis Center]. We’ll see. I’ll put it in a tarball, and then throw it up on Google. It should keep everything intact. Using WinHTTrack. I have coordinated with Eric Holthaus via Twitter, creating, per your suggestion, a new personal account which I am using exclusively to follow the principals.

Once CDIAC is done, and checked over, I’ll move on to other sites.

There are things beyond our control, such as paper records, or records which are online but are not within visibility of the public.

Oh, and I’ve formally requested time off from work for latter half of December so I can work this on vacation. (I have a number of other projects I want to work in parallel, anyway.)

By 14 December he was wanting some more storage space. He asked David Tanzer and me:

Do either of you have a large Google account, or the “unlimited storage” option at Amazon?

I’m using WebDrive, a commercial product. What I’m (now) doing is defining an FTP map at a .gov server, and then a map to my Amazon Cloud Drive. I’m using Windows 7, so these appear as standard drives (or mounts, in *nix terms). I navigate to an appropriate place on the Amazon Drive, and then I proceed to copy from .gov to Amazon.

There is no compression, and, in order to be sure I don’t abuse the .gov site, I’m deliberately passing this over a wireless network in my home, which limits the transfer rate. If necessary, and if the .gov site permits, I could hardwire the workstation to our FIOS router and get appreciably faster transfer. (I often do that for large work files.)

The nice thing is I get to work from home 3 days a week, so I can keep an eye on this. And I’m taking days off just to do this.

I’m thinking about how I might get a second workstation in the act.

The Web sites themselves I’m downloading, as mentioned, using HTTrack. I intended to tarball-up the site structure and then upload to Amazon. I’m still working on CDIAC at ORNL. For future sites, I’m going to try to get HTTrack to mirror directly to Amazon using one of the mounts.

I asked around for more storage space, and my request was kindly answered by Scott Maxwell. Scott lives in Pasadena, California, and he used to work for NASA: he even had a job driving a Mars rover! He is now a site reliability engineer at Google, and he works on Google Drive. Scott is setting up a 10-terabyte account on Google Drive, which Jan and others will be able to use.

Meanwhile, Jan noticed some interesting technical problems: for some reason WebDrive is barely using the capacity of his network connection, so things are moving much more slowly than they could in theory.

Most recently, Sakari Maaranen offered his assistance. Sakari is a systems architect at Ubisecure, a firm in Finland that specializes in identity management, advanced user authentication, authorization, single sign-on, and federation. He wrote:

I have several terabytes worth in Helsinki (can get more) and a gigabit connection. I registered my offer but they [the DataRefuge people] didn’t reply though. I’m glad if that means you have help already and don’t need a copy in Helsinki.

I replied saying that the absence of a reply probably means that they’re overwhelmed by offers of help and are struggling to figure out exactly what to do. Scott said:

Hey, Sakari! Thank you for the generous offer!

I’m setting these guys up with Google Drive storage, as at least a short-term solution.

IMHO, our first order of business is just to get a copy of the data into a location we control—one that can’t easily be taken away from us. That’s the rationale for Google Drive: it fits into Jan’s existing workflow, so it’s the lowest-friction path to getting a copy of the data that’s under our control.

How about if I propose this: we let Jan go ahead with the plan of backing up the data in Drive. Then I’ll evaluate moving it from there to whatever other location we come up with. (Or copying instead of moving! More copies is better. :-) How does that sound to you?

I admit I haven’t gotten as far as thinking about Web serving at all—and it’s not my area of expertise anyway. Maybe you’d be kind enough to elaborate on your thoughts there.

Sakari responded with some information about servers. In late January, U. C. Riverside may help me with this—until then they are busy trying to get more storage space, for wholly unrelated reasons. But right now it seems the main job is to identify important data and get it under our control.

There are a lot of ways you could help.

Computer skills. Personally I’m not much use with anything technical about computers, but the rest of the Azimuth Data Backup gang probably has technical questions that some of you out there could answer… so, I encourage discussion of those questions. (Clearly some discussions are best done privately, and at some point we may encounter unfriendly forces, but this is a good place for roaming experts to answer questions.)

Security. Having a backup of climate data is not very useful if there are also fake databases floating around and you can’t prove yours is authentic. How can we create a kind of digital certificate that our database matches what was on a specific website at a specific time? If someone here has the expertise, we should do this. (A sketch of one possible approach appears after this list.)

Money. If we wind up wanting to set up a permanent database with a nice front end, accessible from the web, we may want money. We could do a Kickstarter campaign. People may be more interested in giving money now than later, unless the political situation immediately gets worse after January 20th.

Strategy. We should talk a bit about what to do next, though too much talk tends to prevent action. Eventually, if all goes well, our homegrown effort will be overshadowed by others, at least in sheer quantity. About 3 hours ago Eric Holthaus tweeted “we just got a donation of several petabytes”. If it becomes clear that others are putting together huge, secure databases with nice front ends, we can either quit or—better—cooperate with them, and specialize in something we’re good at and enjoy.
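On the security point above, here’s a hedged sketch of one possible approach, along the lines Sakari suggested: digitally sign a checksum file, so anyone holding the public key can verify who vouched for the digests. It assumes the third-party Python `cryptography` package, and the checksum contents are made up for illustration:

```python
# Hedged sketch: sign a checksum file with an Ed25519 key so that anyone
# holding the public key can verify who vouched for the digests.
# Requires the third-party 'cryptography' package.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

# Stand-in for the contents of a published hash.sha256 file.
checksums = b"9f86d081...  gridmet_1979.nc\n"

signature = private_key.sign(checksums)

# Verification raises cryptography.exceptions.InvalidSignature on tampering.
public_key.verify(signature, checksums)
print("signature verified")
```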

## Saving Climate Data (Part 2)

16 December, 2016

I want to get you involved in the Azimuth Environmental Data Backup Project, so click on that for more. But first some background.

A few days ago, many scientists, librarians, archivists, computer geeks and environmental activists started to make backups of US government environmental databases. We’re trying to beat the January 20th deadline just in case.

Backing up data is always a good thing, so there’s no point in arguing about politics or the likelihood that these backups are needed. The present situation is just a nice reason to hurry up and do some things we should have been doing anyway.

As of 2 days ago the story looked like this:

Saving climate data (Part 1), Azimuth, 13 December 2016.

A lot has happened since then, but much more needs to be done. Right now you can see a list of 90 databases to be backed up here:

Gov. Climate Datasets (Archive). (Click on the tiny word “Datasets” at the bottom of the page!)

Despite the word ‘climate’, the scope includes other environmental databases, and rightly so. Here is a list of databases that have been backed up:

By going here and clicking “Start Here to Help”, you can nominate a dataset for rescue, claim a dataset to rescue, let everyone know about a data rescue event, or help in some other way (which you must specify). There’s also other useful information on this page, which was set up by Nick Santos.

The overall effort is being organized by the Penn Program in the Environmental Humanities, or ‘PPEHLab’ for short, headed by Bethany Wiggin. If you want to know what’s going on, it helps to look at their blog:

However, the people organizing the project are currently overwhelmed with offers of help! People worldwide are proceeding to take action in a decentralized way! So, everything is a bit chaotic, and nobody has an overall view of what’s going on.

I can’t overstate this: if you think that ‘they’ have a plan and ‘they’ know what’s going on, you’re wrong. ‘They’ is us. Act accordingly.

Here’s a list of news articles, a list of ‘data rescue events’ where people get together with lots of computers and do stuff, and a bit about archives and archivists.

### News

Here are some things to read:

• Jason Koebler, Researchers are preparing for Trump to delete government science from the web, Vice, 13 December 2016.

• Brady Dennis, Scientists frantically copying U.S. climate data, fearing it might vanish under Trump, Washington Post, 13 December, 2016. (Also at the Chicago Tribune.)

• Eric Holthaus, Why I’m trying to preserve federal climate data before Trump takes office, Washington Post, 13 December 2016.

• Nicole Mortillaro, U of T heads ‘guerrilla archiving event’ to preserve climate data ahead of Trump presidency, CBC News, 14 December 2016.

• Audie Kornish and Eric Holthaus, Scientists race to preserve climate change data before Trump takes office, All Things Considered, National Public Radio, 14 December 2016.

### Data rescue events

There’s one in Toronto:

Guerrilla archiving event, 10 am – 4 pm EST, Saturday 17 December 2016. Location: Bissell Building, 4th Floor, 140 St. George St. University of Toronto.

There will be one in Philadelphia:

DataRescuePenn Data Harvesting, Friday–Saturday 13–14 January 2017. Location: not determined yet, probably somewhere at the University of Pennsylvania, Philadelphia.

I hear there will also be events in New York City and Los Angeles, but I don’t know details. If you do, please tell me!

### Archives and archivists

Today I helped catalyze a phone conversation between Bethany Wiggin, who heads the PPEHLab, and Nancy Beaumont, head of the Society of American Archivists. Digital archivists have a lot of expertise in saving information, so their skills are crucial here. Big wads of disorganized data are not very useful.

In this conversation I learned that some people are already in contact with the Internet Archive. This archive always tries to save US government websites and databases at the end of each presidential term. Their efforts are not limited to environmental data, and they save not only webpages but entire databases, e.g. data in ftp sites. You can nominate sites to be saved here:

• Internet Archive, End of Presidential Term Harvest 2016.

• Internet Archive blog, Preserving U.S. Government Websites and Data as the Obama Term Ends, 15 December 2016.

## Saving Climate Data (Part 1)

13 December, 2016

I try to stay out of politics on this website. This post is not mainly about politics. It’s a call to action. We’re trying to do something rather simple and clearly worthwhile. We’re trying to create backups of US government climate data.

The background is, of course, political. Many signs point to a dramatic change in US climate policy:

• Oliver Milman, Trump’s transition: sceptics guide every agency dealing with climate change, The Guardian, 12 December 2016.

So, scientists are now backing up large amounts of climate data, just in case the Trump administration tries to delete it after he takes office on January 20th:

• Brady Dennis, Scientists are frantically copying U.S. climate data, fearing it might vanish under Trump, Washington Post, 13 December 2016.

Of course saving the data publicly available on US government sites is not nearly as good as keeping climate programs fully funded! New data is coming in all the time from satellites and other sources. We need it—and we need the experts who understand it.

Also, it’s possible that the Trump administration won’t go so far as trying to delete big climate science databases. Still, I think it can’t be a bad thing to have backups. Or as my mother always said: better safe than sorry!

Quoting the Washington Post article:

Alarmed that decades of crucial climate measurements could vanish under a hostile Trump administration, scientists have begun a feverish attempt to copy reams of government data onto independent servers in hopes of safeguarding it from any political interference.

The efforts include a “guerrilla archiving” event in Toronto, where experts will copy irreplaceable public data, meetings at the University of Pennsylvania focused on how to download as much federal data as possible in the coming weeks, and a collaboration of scientists and database experts who are compiling an online site to harbor scientific information.

“Something that seemed a little paranoid to me before all of a sudden seems potentially realistic, or at least something you’d want to hedge against,” said Nick Santos, an environmental researcher at the University of California at Davis, who over the weekend began copying government climate data onto a nongovernment server, where it will remain available to the public. “Doing this can only be a good thing. Hopefully they leave everything in place. But if not, we’re planning for that.”

[…]

“What are the most important .gov climate assets?” Eric Holthaus, a meteorologist and self-proclaimed “climate hawk,” tweeted from his Arizona home Saturday evening. “Scientists: Do you have a US .gov climate database that you don’t want to see disappear?”

Within hours, responses flooded in from around the country. Scientists added links to dozens of government databases to a Google spreadsheet. Investors offered to help fund efforts to copy and safeguard key climate data. Lawyers offered pro bono legal help. Database experts offered to help organize mountains of data and to house it with free server space. In California, Santos began building an online repository to “make sure these data sets remain freely and broadly accessible.”

In Philadelphia, researchers at the University of Pennsylvania, along with members of groups such as Open Data Philly and the software company Azavea, have been meeting to figure out ways to harvest and store important data sets.

At the University of Toronto this weekend, researchers are holding what they call a “guerrilla archiving” event to catalogue key federal environmental data ahead of Trump’s inauguration. The event “is focused on preserving information and data from the Environmental Protection Agency, which has programs and data at high risk of being removed from online public access or even deleted,” the organizers said. “This includes climate change, water, air, toxics programs.”

The event is part of a broader effort to help San Francisco-based Internet Archive with its End of Term 2016 project, an effort by university, government and nonprofit officials to find and archive valuable pages on federal websites. The project has existed through several presidential transitions.

I hope that small “guerrilla archiving” efforts will be dwarfed by more systematic work, because it’s crucial that databases be copied along with all relevant metadata—and some sort of cryptographic certificate of authenticity, if possible. However, getting lots of people involved is bound to be a good thing, politically speaking.

If you have good computer skills, good understanding of databases, or lots of storage space, please get involved. Efforts are being coordinated by Bethany Wiggin and others at the Data Refuge Project:

• PPEHLab (Penn Program in the Environmental Humanities), DataRefuge.

You can contact them at DataRefuge@ppehlab.org. Nick Santos is also involved, and if you want to get “more plugged into the project” you can contact him here. They are trying to build a climate database mirror website here:

At the help form on this website you can nominate a dataset for rescue, claim a dataset to rescue, let them know about a data rescue event, or help in some other way (which you must specify).

PPEHLab and Penn Libraries are organizing a data rescue event this Thursday:

• PPEHLab, DataRefuge meeting, 14 December 2016.

At the American Geophysical Union meeting in San Francisco, where more than 20,000 earth and climate scientists gather from around the world, there was a public demonstration today starting at 1:30 PST:

Rally to stand up for science, 13 December 2016.

And the “guerrilla archiving” hackathon in Toronto is this Saturday—see below. If you know people with good computer skills in Toronto, get them to check it out!

### Guerrilla archiving in Toronto

Here are details on this:

Guerrilla Archiving Hackathon

Date: 10am-4pm, December 17, 2016

Location: Bissell Building, 4th Floor, 140 St. George St. University of Toronto

RSVP and up-to-date information: Guerilla archiving: saving environmental data from Trump.

Bring: laptops, power bars, and snacks. Coffee and pizza provided.

This event collaborates with the Internet Archive’s End of Term 2016 project, which seeks to archive the federal online pages and data that are in danger of disappearing during the Trump administration. Our event is focused on preserving information and data from the Environmental Protection Agency, which has programs and data at high risk of being removed from online public access or even deleted. This includes climate change, water, air, toxics programs. This project is urgent because the Trump transition team has identified the EPA and other environmental programs as priorities for the chopping block.

The Internet Archive is a San Francisco-based nonprofit digital library which aims at preserving and making universally accessible knowledge. Its End of Term web archive captures and saves U.S. Government websites that are at risk of changing or disappearing altogether during government transitions. The Internet Archive has asked volunteers to help select and organize information that will be preserved before the Trump transition.

End of Term web archive: http://eotarchive.cdlib.org/2016.html

New York Times article: “Harvesting Government History, One Web Page at a Time”.

Activities:

• Identifying endangered programs and data

• Seeding the End of Term webcrawler with priority URLs

• Identifying and mapping the location of inaccessible environmental databases

• Hacking scripts to make hard-to-reach databases accessible to the webcrawler

• Building a toolkit so that other groups can hold similar events

Skills needed: We need all kinds of people — and that means you!

• People who can locate relevant webpages for the Internet Archive’s webcrawler

• People who can identify data targeted for deletion by the Trump transition team and the organizations they work with

• People with knowledge of government websites and information, including the EPA

• People with library and archive skills

• People who are good at navigating databases

• People interested in mapping where inaccessible data is located at the EPA

• Hackers to figure out how to extract data and URLs from databases (in a way that Internet Archive can use)

• People with good organization and communication skills

• People interested in creating a toolkit for reproducing similar events

Contacts: michelle.murphy@utoronto.ca, p.keilty@utoronto.ca

## Science, Models and Machine Learning

3 September, 2014

guest post by David Tweed

The members of the Azimuth Project have been working on both predicting and understanding the El Niño phenomenon, along with writing expository articles. So far we’ve mostly talked about the physics and data of the El Niño, along with looking at one method of actually trying to predict El Niño events. Since there’s going to be more data exploration using methods more typical of machine learning, it’s a good time to briefly describe the mindset and highlight some of the differences between different kinds of predictive models. Here we’ll concentrate on the concepts rather than the fine details and particular techniques.

We also stress there’s not a fundamental distinction between machine learning (ML) and statistical modelling and inference. There are certainly differences in culture, background and terminology, but in terms of the actual algorithms and mathematics used there’s a great commonality. Throughout the rest of the article we’ll talk about ‘machine learning models’, but could equally have used ‘statistical models’.

For our purposes here, a model is any object which provides a systematic procedure for taking some input data and producing a prediction of some output. There’s a spectrum of models, ranging from physically based models at one end to purely data driven models at the other. As a very simple example, suppose you commute by car from your place of work to your home and you want to leave work in order to arrive home at 6:30 pm. You can tackle this by building a model which takes as input the day of the week and gives you back a time to leave.

• There’s the data driven approach, where you try various leaving times on various days and record whether or not you get home by 6:30 pm. You might find that the traffic is lighter on weekend days so you can leave at 6:10 pm, while on weekdays you have to leave at 5:45 pm, except on Wednesdays when you have to leave at 5:30 pm. Since you’ve just crunched the data, you have no idea why this works, but it’s a very reliable rule when you use it to predict when you need to leave. (A minimal sketch of such a lookup model appears after this list.)

• There’s the physical model approach, where you attempt to infer how many people are doing what on any given day and then figure out what that implies for the traffic levels and thus what time you need to leave. In this case you find out that there’s a mid-week sports game on Wednesday evenings which leads to even higher traffic. This not only predicts that you’ve got to leave at 5:30 pm on Wednesdays but also lets you understand why. (Of course this is just an illustrative example; in climate modelling a physical model would be based upon actual physical laws such as conservation of energy, conservation of momentum, Boyle’s law, etc.)
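Here’s that minimal sketch of the data-driven version, with the leaving times taken from the story above:

```python
# Minimal sketch of the data-driven commute model: a pure lookup table
# from day of week to leaving time, 'learned' by trying times and
# recording what worked. It predicts reliably but explains nothing.
leave_time = {
    "Mon": "17:45", "Tue": "17:45", "Wed": "17:30",
    "Thu": "17:45", "Fri": "17:45", "Sat": "18:10", "Sun": "18:10",
}

def when_to_leave(day):
    """Input: day of the week. Output: predicted time to leave work."""
    return leave_time[day]

print(when_to_leave("Wed"))  # 17:30, but the model can't say why
```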

There are trade-offs between the two types of approach. Data driven modelling is a relatively simple process. In contrast, by proceeding from first principles you’ve got a more detailed framework which is equally predictive, but at the cost of having to investigate a lot of complicated underlying effects. Physical models have one interesting advantage: nothing in a data driven model prevents it from violating physical laws (e.g., not conserving energy), whereas a physically based model obeys the physical laws by design. This is seldom a problem in practice, but worth keeping in mind.

The situation with data driven techniques is analogous to one of those adverts: there’s the big message about how “using a data driven technique can change your life for the better” while the voiceover gabbles out all sorts of small print. The remainder of this post will describe some of the basic principles in that small print.

### Preprocessing and feature extraction

There’s a popular misconception that machine learning works well when you simply collect some data and throw it into a machine learning algorithm. In practice that kind of approach often yields a model that is quite poor. Almost all successful machine learning applications are preceded by some form of data preprocessing. Sometimes this is simply rescaling so that different variables have similar magnitudes, are zero centred, etc.
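As a minimal sketch of that simplest kind of preprocessing (the variables and numbers are invented for illustration):

```python
# Minimal sketch: rescale each variable to zero mean and unit variance,
# so variables measured on very different scales contribute comparably.
import numpy as np

X = np.array([[1013.2, 28.4],    # e.g. pressure (hPa) and temperature (°C);
              [ 998.7, 31.0],    # the numbers are made up for illustration
              [1005.1, 26.9]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))   # ~[0, 0]: zero centred
print(X_std.std(axis=0))    # [1, 1]: similar magnitudes
```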

However, there are often steps that are more involved. For example, many machine learning techniques have what are called ‘kernel variants’ which involve (in a way whose details don’t matter here) using a nonlinear mapping from the original data to a new space which is more amenable to the core algorithm. There are various kernels with the right mathematical properties, and frequently the choice of a good kernel is made either by experimentation or from knowledge of the physical principles. Here’s an example (from Wikipedia’s entry on the support vector machine) of how a good choice of kernel can convert a dataset that is not linearly separable into one that is:
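A hedged sketch of the same effect in code, assuming scikit-learn: points on two concentric circles can’t be separated by any straight line, but an RBF-kernel SVM separates them easily.

```python
# Hedged sketch: a linear SVM fails on concentric-circle data, while an
# RBF-kernel SVM (implicitly mapping to a better space) separates it.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear.score(X, y))  # near chance (~0.5)
print("RBF kernel accuracy:   ", rbf.score(X, y))     # close to 1.0
```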

An extreme example of preprocessing is explicitly extracting features from the data. In ML jargon, a feature “boils down” some observed data into a directly useful value. For example, in the work by Ludescher et al that we’ve been looking at, they don’t feed all the daily time series values into their classifier, but take the correlations between different points over a year as the basic features to consider. Since the individual days’ temperatures are incredibly noisy and there are so many of them, extracting features from them gives more useful input data. While these extraction functions could theoretically be learned by the ML algorithm, this is quite a complicated function to learn. By explicitly choosing to represent the data using this feature, the amount the algorithm has to discover is reduced, and hence the likelihood of it finding an excellent model is dramatically increased.
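A rough sketch of that kind of feature extraction, on toy data (the actual procedure of Ludescher et al is more elaborate):

```python
# Rough sketch: reduce a year of noisy daily values at two grid points to a
# single correlation feature, instead of feeding 730 raw numbers to a model.
import numpy as np

rng = np.random.default_rng(0)
days = 365
site_a = rng.normal(size=days)                  # daily anomalies at site A
site_b = 0.6 * site_a + rng.normal(size=days)   # site B, partly coupled to A

feature = np.corrcoef(site_a, site_b)[0, 1]     # one number per pair per year
print(f"correlation feature: {feature:.2f}")
```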

### Limited amounts of data for model development

Some of the problems that we describe below would vanish if we had unlimited amounts of data to use for model development. However, in real cases we often have a strictly limited amount of data we can obtain. Consequently we need methodologies to address the issues that arise when data is limited.

### Training sets and test sets

The most common way to work with collected data is to split it into a training set and a test set. The training set is used in the process of determining the best model parameters, while the test set—which is not used in any way in determining those model parameters—is then used to see how effective the model is likely to be on new, unseen data. (The test and training sets need not be equally sized. There are some fitting techniques which need to further subdivide the training set, so having more training than test data works out best.) This division of data acts to further reduce the effective amount of data used in determining the model parameters.
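A minimal sketch of the split, assuming scikit-learn (the 70/30 ratio and the synthetic data are just illustrative choices):

```python
# Minimal sketch: fit on the training set only, then measure performance
# on the held-out test set to estimate behaviour on new, unseen data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # test set untouched here
print("score on unseen test data:", model.score(X_test, y_test))
```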

After we’ve made this split we have to be careful how much of the test data we scrutinise in any detail, since once it has been investigated it can’t meaningfully be used for testing again, although it can still be used for future training. (Examining the test data is often informally known as burning data.) That only applies to detailed inspection however; one common way to develop a model is to look at some training data and then train the model (also known as fitting the model) on that training data. It can then be evaluated on the test data to see how well it does. It’s also then okay to purely mechanically train the model on the test data and evaluate it on the training data to see how “stable” the performance is. (If you get dramatically different scores then your model is probably flaky!) However, once we start to look at precisely why the model failed on the test data—in order to change the form of the model—the test data has now become training data and can’t be used as test data for future variants of that model. (Remember, the real goal is to accurately predict the outputs for new, unseen inputs!)

### Random patterns in small sample sets

Suppose we’re modelling a system which has a true probability distribution $P$. We can’t directly observe this, but we have some samples $S$ obtained from observation of the system, and hence drawn from $P$. Clearly there are problems if we generate this sample in a way that biases which parts of the distribution we sample from: it wouldn’t be a good idea to get training data featuring heights in the American population by only handing out surveys in the locker rooms of basketball facilities. But if we take care to avoid sampling bias as much as possible, then we can make various kinds of estimates of the distribution that we think $S$ comes from.

Let’s consider the estimate $P'$ implied by $S$ under some particular technique. It would be nice if $P = P'$, wouldn’t it? And indeed many good estimators have the property that as the size of $S$ tends to infinity, $P'$ will tend to $P$. However, for finite sizes of $S$, and especially for small sizes, $P'$ may have some spurious detail that’s not present in $P$.

As a simple illustration of this, my computer has a pseudorandom number generator which generates essentially uniformly distributed random numbers between 0 and 32767. I just asked for 8 numbers and got

2928, 6552, 23979, 1672, 23440, 28451, 3937, 18910.

Note that we’ve got one subset of 4 values (2928, 6552, 1672, 3937) within the interval of length 5012 between 1540 and 6552, and another subset of 3 values (23440, 23979 and 28451) in the interval of length 5012 between 23440 and 28451. For this uniform distribution, the expected number of values falling within a given interval of that size is about 1.2. Readers will be familiar with how, for a small sample, the observed value of a random quantity can vary a lot around its expectation, with the variation shrinking only as the sample size increases, so this isn’t a surprise. However, it does highlight that even completely unbiased sampling from the true distribution will typically give rise to extra ‘structure’ within the distribution implied by the samples.

For example, here are the results from one way of estimating the probability from the samples:

The green line is the true density while the red curve shows the probability density obtained from the samples, with clearly spurious extra structure.
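A rough sketch of how such a picture arises, assuming numpy and scipy: a kernel density estimate built from just 8 uniform samples shows bumps that the flat true density doesn’t have.

```python
# Rough sketch: a density estimate from only 8 samples of a uniform
# distribution shows spurious structure the true (flat) density lacks.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)
samples = rng.uniform(0, 32767, size=8)      # tiny sample, as in the post

kde = gaussian_kde(samples)                  # estimated density P'
flat = 1.0 / 32767                           # true density P, a flat line

for x in np.linspace(0, 32767, 9):
    print(f"x={x:7.0f}  estimated={kde(x)[0]:.2e}  true={flat:.2e}")
```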

### Generalization

Almost all modelling techniques, while not necessarily estimating an explicit probability distribution from the training samples, can be seen as building functions that are related to that probability distribution.

For example, a ‘thresholding classifier’ for dividing input into two output classes will place the threshold at the optimal point for the distribution implied by the samples. As a consequence, one important aim in building machine learning models is to capture the features that are present in the true probability distribution while not learning details so fine that they are likely to be spurious artifacts of the small sample size. If you think about this, it’s a bit counter-intuitive: you deliberately don’t want to perfectly reflect every single pattern in the training data. Indeed, specialising a model too closely to the training data is given the name over-fitting.

This brings us to generalization. Strictly speaking generalization is the ability of a model to work well upon unseen instances of the problem (which may be difficult for a variety of reasons). In practice however one tries hard to get representative training data so that the main issue in generalization is in preventing overfitting, and the main way to do that is—as discussed above—to split the data into a set for training and a set only used for testing.

One factor that’s often related to generalization is regularization, which is the general term for adding constraints to the model to prevent it being too flexible. One particularly useful kind of regularization is sparsity. Sparsity refers to the degree to which a model has empty elements, typically represented as 0 coefficients. It’s often possible to incorporate a prior into the modelling procedure which will encourage the model to be sparse. (Recall that in Bayesian inference the prior represents our initial ideas of how likely various different parameter values are.) There are some cases where we have various detailed priors about sparsity for problem specific reasons. However the more common case is having a ‘general modelling’ belief, based upon experience in doing modelling, that sparser models have a better generalization performance.

As an example of using sparsity promoting priors, we can look at linear regression. For standard regression with $E$ examples of $y^{(i)}$ against $P$ dimensional vectors $x^{(i)}$ we’re considering the total error

$\min_{c_1,\dots, c_P} \frac{1}{E}\sum_{i=1}^E (y^{(i)} - \sum_{j=1}^P c_j x^{(i)}_j)^2$

while with the $l_1$ prior we’ve got

$\min_{c_1,\dots, c_P} \frac{1}{E} \sum_{i=1}^E (y^{(i)} - \sum_{j=1}^P c_j x^{(i)}_j)^2 + \lambda \sum_{j=1}^P |c_j|$

where the $c_j$ are the coefficients to be fitted and $\lambda$ is the prior weight. We can see how the prior weight affects the sparsity of the $c_j$s:

On the $x$-axis is $\lambda$ while the $y$-axis is the coefficient value. Each line represents the value of one particular coefficient as $\lambda$ increases. You can see that for very small $\lambda$ – corresponding to a very weak prior – all the weights are non-zero, but as it increases – corresponding to the prior becoming stronger – more and more of them have a value of 0.
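A hedged sketch of this behaviour, assuming scikit-learn, whose Lasso fits a penalty of this $l_1$ form (its `alpha` plays the role of $\lambda$, up to a scaling convention):

```python
# Hedged sketch: as the l1 prior weight grows, the lasso drives more
# coefficients exactly to zero, giving a sparser model.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)  # 2 real effects

for alpha in [0.001, 0.1, 0.5, 1.0]:   # alpha ~ the prior weight lambda
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"lambda={alpha:5.3f}  nonzero coefficients: {(coefs != 0).sum()}")
```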

There are a couple of other reasons for wanting sparse models. The obvious one is speed of model evaluation, although this is much less significant with modern computing power. A less obvious reason is that one can often only effectively utilise a sparse model, e.g., if you’re attempting to see how the input factors should be physically modified in order to affect the real system in a particular way. In this case we might want a good sparse model rather than an excellent dense model.

### Utility functions and decision theory

While there are some situations where a model is sought purely to develop knowledge of the universe, in many cases we are interested in models in order to direct actions. For example, having forewarning of El Niño events would enable all sorts of mitigation actions. However, these actions are costly so they shouldn’t be undertaken when there isn’t an upcoming El Niño. When presented with an unseen input the model can either match the actual output (i.e., be right) or differ from the actual output (i.e., be wrong). While it’s impossible to know in advance if a single output will be right or wrong – if we could tell that we’d be better off using that in our model – from the training data it’s generally possible to estimate the fractions of predictions that will be right and will be wrong in a large number of uses. So we want to link these probabilities with the effects of actions taken in response to model predictions.

We can do this using a utility function and a loss function. The utility function maps each possible output to a numerical value proportional to the benefit from taking actions when that output was correctly anticipated. The loss function maps outputs to a number proportional to the loss from the actions when the output was incorrectly predicted by the model. (There is evidence that human beings often have inconsistent utility and loss functions, but that’s a story for another day…)

There are three common ways the utility and loss functions are used:

• Maximising the expected value of the utility (for the fraction where the prediction is correct) minus the expected value of the loss (for the fraction where the prediction is incorrect).

• Minimising the expected loss while ensuring that the expected utility is at least some value.

• Maximising the expected utility while ensuring that the expected loss is at most some value.

Once we’ve chosen which one we want, it’s often possible to actually tune the fitting of the model to optimize with respect to that criterion.
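As a minimal sketch of the first criterion, with made-up utility and loss values: sweep a classifier’s decision threshold and keep the one with the best expected payoff.

```python
# Minimal sketch: choose the decision threshold maximising expected utility
# (for correct predictions) minus expected loss (for incorrect ones).
# The utility and loss values are invented placeholders.
import numpy as np

rng = np.random.default_rng(0)
p_event = rng.uniform(size=1000)            # model's predicted probabilities
actual = rng.uniform(size=1000) < p_event   # simulated true outcomes

UTILITY, LOSS = 10.0, 4.0                   # payoff if right, cost if wrong

def expected_value(threshold):
    correct = (p_event > threshold) == actual
    return np.where(correct, UTILITY, -LOSS).mean()

best = max(np.linspace(0.05, 0.95, 19), key=expected_value)
print(f"best threshold {best:.2f}, expected value {expected_value(best):.2f}")
```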

Of course sometimes when building a model we don’t know enough details of how it will be used to get accurate utility and loss functions (or indeed know how it will be used at all).

### Inferring a physical model from a machine learning model

It is certainly possible to take a predictive model obtained by machine learning and use it to figure out a physically based model; this is one way of performing data mining. However, in practice there are a few reasons why it’s necessary to take some care when doing this:

• The variables in the training set may be related by some non-observed latent variables which may be difficult to reconstruct without knowledge of the physical laws that are in play. (There are machine learning techniques which attempt to reconstruct unknown latent variables, but this is a much more difficult problem than estimating known but unobserved latent variables.)

• Machine learning models have a maddening ability to find variables that are predictive only because of the way the data was gathered. For example, in a vision system aimed at finding tanks, all the images of tanks were taken during one day on a military base when there was accidentally a speck of grime on the camera lens, while all the images of things that weren’t tanks were taken on other days. A neural net cunningly learned that to decide if it was being shown a tank it should look for the shadow from the grime.

• It’s common to have some groups of very highly correlated input variables. In that case a model will generally learn a function which utilises an arbitrary linear combination of the correlated variables and an equally good model would result from using any other linear combination. (This is an example of the statistical problem of ‘identifiability’.) Certain sparsity encouraging priors have the useful property of encouraging the model to select only one representative from a group of correlated variables. However, even in that case it’s still important not to assign too much significance to the particular division of model parameters in groups of correlated variables.

• One can often come up with good machine learning models even when physically important variables haven’t been collected in the training data. A related issue is that if all the training data is collected from a particular subspace, factors that aren’t important there won’t be found. For example, if in a collision system to be modelled all the data is collected at low speeds, the machine learning model won’t learn about relativistic effects that only have a big effect at a substantial fraction of the speed of light.

### Conclusions

All of the ideas discussed above are really just ways of making sure that work developing statistical/machine learning models for a real problem is producing meaningful results. As Bob Dylan (almost) sang, “to live outside the physical law, you must be honest; I know you always say that you agree”.

## Unreliable Biomedical Research

13 January, 2014

Amgen, an American drug company, tried to replicate 53 landmark studies in cancer, and was able to reproduce the original results in only 6 cases—even though they worked with the original researchers!

That’s not all. Scientists at the pharmaceutical company Bayer were able to reproduce the published results in just a quarter of 67 studies!

How could things be so bad? The picture here shows two reasons:

If most interesting hypotheses are false, a lot of positive results will be ‘false positives’. Negative results may be more reliable. But few people publish negative results, so we miss out on those!

And then there’s wishful thinking, sloppiness and downright fraud. Read this Economist article for more on the problems—and how to fix them:

Trouble at the lab, Economist, 18 October 2013.

That’s where I got the picture above.