El Niño Project (Part 5)

 

And now for some comic relief.

Last time I explained how to download some weather data and start analyzing it, using programs written by Graham Jones. When you read that, did you think “Wow, that’s easy!” Or did you think “Huh? Run programs in R? How am I supposed to do that?”

If you’re in the latter group, you’re like me. But I managed to do it. And this is the tale of how. It’s a blow-by-blow account of my first steps, my blunders, my fears.

I hope that if you’re intimidated by programming, my tale will prove that you too can do this stuff… provided you have smart friends, or read this article.

More precisely, this article explains how to:

• download and run software that runs the programming language R;

• download temperature data from the National Center for Atmospheric Research;

• use R to create a file of temperature data for a given latitude/longitude rectangle for a given time interval.

I will not attempt to explain how to program in R.

If you want to copy what I’m doing, please remember that a few details depend on the operating system. Since I don’t care about operating systems, I use a Windows PC. If you use something better, some details will differ for you.

Also: at the end of this article there are some very basic programming puzzles.

A sad history

First, let me explain a bit about my relation to computers.

I first saw a computer at the Lawrence Hall of Science in Berkeley, back when I was visiting my uncle in the summer of 1978. It was really cool! They had some terminals where you could type programs in BASIC and run them.

I got especially excited when he gave me the book Computer Lib/Dream Machines by Ted Nelson. It espoused the visionary idea that people could write texts on computers all around the world—“hypertexts” where you could click on a link in one and hop to another!

I did more programming the next year in high school, sitting in a concrete block room with a teletype terminal that was connected to a mainframe somewhere far away. I stored my programs on paper tape. But my excitement gradually dwindled, because I was having more fun doing math and physics using just pencil and paper. My own brain was easier to program than the machine. I did not start a computer company. I did not get rich. I learned quantum mechanics, and relativity, and Gödel’s theorem.

Later I did some programming in APL in college, and still later I did a bit in Mathematica in the early 1990s… but nothing much, and nothing sophisticated. Indeed, none of these languages would be the ones you’d choose to explore sophisticated ideas in computation!

I’ve just never been very interested… until now. I now want to do a lot of data analysis. It will be embarrassing to keep asking other people to do all of it for me. I need to learn how to do it myself.

Maybe you’d like to do this stuff too—or at least watch me make a fool of myself. So here’s my tale, from the start.

Downloading and running R

To use the programs written by Graham, I need to use R, a language currently popular among statisticians. It is not the language my programmer friends would want me to learn—they’d want me to use something like Python. But tough! I can learn that later.

To download R to my Windows PC, I cleverly type download R into Google, and go to the top website it recommends:

http://cran.r-project.org/bin/windows/base/

I click the big fat button on top saying

Download R 3.1.0 for Windows

and get asked to save a file R-3.1.0-win.exe. I save it in my Downloads folder; it takes a while to download since it’s 57 megabytes. When I get it, I click on it and follow the easy default installation instructions. My Desktop window now has a little icon on it that says R.

Clicking this, I get an interface where I can type commands after a red

>

symbol. Following Graham’s advice, I start by trying

> 2^(1:8)

which generates a list of powers of 2 from 2^1 to 2^8, like this:

[1] 2 4 8 16 32 64 128 256

Then I try

> mean(2^(1:8))

which gives the arithmetic mean of this list. Somewhat more fun is

> plot(rnorm(20))

which plots a bunch of points, apparently 20 standard normal deviates.

When I hear “20 standard normal deviates” I think of the members of a typical math department… but no, those are deviants. Standard normal deviates are random numbers chosen from a Gaussian distribution of mean zero and variance 1.
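If you want to see the Gaussian hiding behind those deviates, here’s a tiny follow-up you can type at the same prompt. It’s plain R, nothing to do with Graham’s programs:

> x <- rnorm(10000)   # draw 10,000 standard normal deviates
> mean(x); sd(x)      # should come out close to 0 and 1
> hist(x)             # a bell-shaped histogram centered at 0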

Downloading climate data

To do something more interesting, I need to input data.

The papers by Ludescher et al use surface air temperatures in a certain patch of the Pacific, so I want to get ahold of those. They’re here:

NCEP/NCAR Reanalysis 1: Surface.

NCEP is the National Centers for Environmental Prediction, and NCAR is the National Center for Atmospheric Research. They have a bunch of files here containing worldwide daily average temperatures on a 2.5 degree latitude × 2.5 degree longitude grid (that’s 144 × 73 grid points), from 1948 to 2010. And if you go here, the website will help you get data from within a chosen rectangle in a grid, for a chosen time interval.

These are NetCDF files. NetCDF stands for Network Common Data Form:

NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.

According to my student Blake Pollard:

… the method of downloading a bunch of raw data via ftp (file transfer protocol) is a great one to become familiar with. If you poke around on ftp://ftp.cdc.noaa.gov/Datasets or some other ftp servers maintained by government agencies you will find all the data you could ever want. Examples of things you can download for free: raw multispectral satellite images, processed data products, ‘re-analysis’ data (which is some way of combining analysis/simulation to assimilate data), sea surface temperature anomalies at resolutions much higher than 2.5 degrees (although you pay for that in file size). Also, believe it or not, people actually use NetCDF files quite widely, so once you know how to play around with those you’ll find the world quite literally at your fingertips!

I know about ftp: I’m so old that I know this was around before the web existed. Back then it meant “faster than ponies”. But I need to get R to accept data from these NetCDF files: that’s what scares me!

Graham said that R has a “package” called RNetCDF for using NetCDF files. So, I need to get ahold of this package, download some files in the NetCDF format, and somehow get R to eat those files with the help of this package.

At first I was utterly clueless! However, after a bit of messing around, I notice that right on top of the R interface there’s a menu item called Packages. I boldly click on this and choose Install Package(s).

I am rewarded with an enormous alphabetically ordered list of packages… obviously statisticians have lots of stuff they like to do over and over! I find RNetCDF, click on that and click something like “OK”.

I’m asked if I want to use a “personal library”. I click “no”, and get an error message. So I click “yes”. The computer barfs out some promising text:

utils:::menuInstallPkgs()
trying URL 'http://cran.stat.nus.edu.sg/bin/windows/contrib/3.1/RNetCDF_1.6.2-3.zip'
Content type 'application/zip' length 548584 bytes (535 Kb)
opened URL
downloaded 535 Kb

package ‘RNetCDF’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\JOHN\AppData\Local\Temp\Rtmp4qJ2h8\downloaded_packages

Success!
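For what it’s worth, the same install can be done without any menus, straight from the R prompt. This is standard R, not anything specific to Windows:

> install.packages("RNetCDF")   # fetch the package from a CRAN mirror
> library(RNetCDF)              # load it into the current session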

But now I need to figure out how to download a file and get R to eat it and digest it with the help of RNetCDF.

At this point my deus ex machina, Graham, descends from the clouds and says:

You can download the files from your browser. It is probably easiest to do that for starters. Put
ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/
into the browser, then right-click a file and Save link as…

This code will download a bunch of them:

for (year in 1950:1979) {
download.file(url=paste0("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.", year, ".nc"), destfile=paste0("air.sig995.", year, ".nc"), mode="wb")
}

It will put them into the “working directory”, probably C:\Users\JOHN\Documents. You can find the working directory using getwd(), and change it with setwd(). But you must use / not \ in the filepath.

Compared to UNIX, the Windows operating system has the peculiarity of using \ instead of / in path names, but R uses the UNIX conventions even on Windows.

So, after some mistakes, in the R interface I type

> setwd("C:/Users/JOHN/Documents/My Backups/azimuth/el nino")

and then type

> getwd()

to see if I’ve succeeded. I’m rewarded with

[1] "C:/Users/JOHN/Documents/My Backups/azimuth/el nino"

Good!

Then, following Graham’s advice, I cut-and-paste this into the R interface:

for (year in 1950:1979) {
download.file(url=paste0("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.", year, ".nc"), destfile=paste0("air.sig995.", year, ".nc"), mode="wb")
}

It seems to be working! A little bar appears showing how each year’s data is getting downloaded. It chugs away, taking a couple minutes for each year’s worth of data.
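In hindsight, a quick sanity check at this point would have saved me some grief later (see below). The built-in list.files command shows which .nc files actually made it into the working directory:

> length(list.files(pattern = "air\\.sig995\\..*\\.nc$"))

If the number that comes back is smaller than the number of years you asked for, one of the downloads silently failed.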

Using R to process NetCDF files

Okay, now I’ve got all the worldwide daily average temperatures on a 2.5 degree latitude × 2.5 degree longitude grid from 1950 to 1979.

The world is MINE!
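Before handing these files to Graham’s script, it’s reassuring to peek inside one. The RNetCDF package provides open.nc and print.nc for that. Here’s a minimal sketch, assuming the 1950 file is sitting in the working directory:

> library(RNetCDF)
> nc <- open.nc("air.sig995.1950.nc")
> print.nc(nc)    # lists the dimensions (longitude, latitude, time) and the variables in the file
> close.nc(nc)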

But what do I do with it? Graham’s advice is again essential, along with a little R program, or script, that he wrote:

The R script netcdf-convertor.R from

https://github.com/azimuth-project/el-nino/tree/master/R

will eat the file, digest it, and spit it out again. It contains instructions.

I go to this URL, which is on GitHub, a popular free web-based service for software development. You can store programs here, edit them, and GitHub will help you keep track of the different versions. I know almost nothing about this stuff, but I’ve seen it before, so I’m not intimidated.

I click on the blue thing that says netcdf-convertor.R and see something that looks like the right script. Unfortunately I can’t see how to download it! I eventually see a button I’d overlooked, cryptically labelled “Raw”. I realize that since I don’t want a roasted or oven-broiled piece of software, I should click on this. I indeed succeed in downloading netcdf-convertor.R this way. Graham later says I could have done something better, but oh well. I’m just happy nothing has actually exploded yet.
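Another option, if you’d rather not leave R at all, is to fetch the script with the same download.file command used earlier for the data. I’m guessing at GitHub’s usual address pattern for raw files here, so treat this URL as an assumption:

> download.file(url="https://raw.githubusercontent.com/azimuth-project/el-nino/master/R/netcdf-convertor.R", destfile="netcdf-convertor.R")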

Once I’ve downloaded this script, I open it using a text editor and look at it. At the top are a bunch of comments written by Graham:


######################################################
######################################################

# You should be able to use this by editing this
# section only.

setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino")

lat.range <- 13:14
lon.range <- 142:143

firstyear <- 1957
lastyear <- 1958

outputfilename <- paste0("Scotland-", firstyear, "-", lastyear, ".txt")

######################################################
######################################################

# Explanation

# 1. Use setwd() to set the working directory
# to the one containing the .nc files such as
# air.sig995.1951.nc.
# Example:
# setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino")

# 2. Supply the latitude and longitude range. The
# NOAA data is every 2.5 degrees. The ranges are
# supplied as the number of steps of this size.
# For latitude, 1 means North Pole, 73 means South
# Pole. For longitude, 1 means 0 degrees East, 37
# is 90E, 73 is 180, 109 is 90W or 270E, 144 is
# 2.5W.

# These roughly cover Scotland.
# lat.range <- 13:14
# lon.range <- 142:143

# These are the area used by Ludescher et al,
# 2013. It is 27x69 points which are then
# subsampled to 9 by 23.
# lat.range <- 24:50
# lon.range <- 48:116

# 3. Supply the years
# firstyear <- 1950
# lastyear <- 1952

# 4. Supply the output name as a text string.
# paste0() concatenates strings which you may find
# handy:
# outputfilename <- paste0("Pacific-", firstyear, "-", lastyear, ".txt")

######################################################
######################################################

# Example of output

# S013E142 S013E143 S014E142 S014E143
# Y1950P001 281.60000272654 281.570002727211 281.60000272654 280.970002740622
# Y1950P002 280.740002745762 280.270002756268 281.070002738386 280.49000275135
# Y1950P003 280.100002760068 278.820002788678 281.120002737269 280.070002760738
# Y1950P004 281.070002738386 279.420002775267 281.620002726093 280.640002747998
# ...
# Y1950P193 285.450002640486 285.290002644062 285.720002634451 285.75000263378
# Y1950P194 285.570002637804 285.640002636239 286.070002626628 286.570002615452
# Y1950P195 285.92000262998 286.220002623275 286.200002623722 286.620002614334
# ...
# Y1950P364 276.100002849475 275.350002866238 276.37000284344 275.200002869591
# Y1950P365 276.990002829581 275.820002855733 276.020002851263 274.72000288032
# Y1951P001 278.220002802089 277.470002818853 276.700002836064 275.870002854615
# Y1951P002 277.750002812594 276.890002831817 276.650002837181 275.520002862439
# ...
# Y1952P365 280.35000275448 280.120002759621 280.370002754033 279.390002775937

# There is one row for each day, and 365 days in
# each year (leap days are omitted). In each row,
# you have temperatures in Kelvin for each grid
# point in a rectangle.

# S13E142 means 13 steps South from the North Pole
# and 142 steps East from Greenwich. The points
# are in reading order, starting at the top-left
# (Northmost, Westmost) and going along the top
# row first.

# Y1950P001 means year 1950, day 1. (P because
# longer periods might be used later.)

######################################################
######################################################

The instructions are admirably detailed concerning what I should do, but they don't say where the output will appear when I do it. This makes me nervous. I guess I should just try it. After all, the program is not called DestroyTheWorld.

Unfortunately, at this point a lot of things start acting weird.

It's too complicated and boring to explain in detail, but basically, I keep getting a file missing error message. I don't understand why this happens under some conditions and not others. I try lots of experiments.

Eventually I discover that one year of temperature data failed to download—the year 1949, right after the first year available! So, I'm getting the error message whenever I try to do anything involving that year of data.

To fix the problem, I simply download the 1949 data by hand from here:

ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/

(You can open ftp addresses in a web browser just like http addresses.) I put it in my working directory for R, and everything is fine again. Whew!
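Equivalently, the single missing file can be fetched with the same download.file command used in the loop above:

> download.file(url="ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.1949.nc", destfile="air.sig995.1949.nc", mode="wb")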

By the time I get this file, I sort of know what to do—after all, I've spent about an hour trying lots of different things.

I decide to create a file listing temperatures near where I live in Riverside from 1948 to 1979. To do this, I open Graham's script netcdf-convertor.R in a word processor and change this section:

setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino")
lat.range <- 13:14
lon.range <- 142:143
firstyear <- 1957
lastyear <- 1958
outputfilename <- paste0("Scotland-", firstyear, "-", lastyear, ".txt")

to this:

setwd("C:/Users/JOHN/Documents/My Backups/azimuth/el nino")
lat.range <- 23:23
lon.range <- 98:98
firstyear <- 1948
lastyear <- 1979
outputfilename <- paste0("Riverside-", firstyear, "-", lastyear, ".txt")

Why? Well, I want it to put the file in my working directory. I want the years from 1948 to 1979. And I want temperature data from where I live!

Googling the info, I see Riverside, California is at 33.9481° N, 117.3961° W. 34° N is about 56 degrees south of the North Pole, which is 22 steps of size 2.5°. And because some idiot decided everyone should count starting at 1 instead of 0 even in contexts like this, the North Pole itself is step 1, not step 0… so Riverside is latitude step 23. That's why I write:

lat.range <- 23:23

Similarly, 117.5° W is 242.5° E, which is 97 steps of size 2.5°… which counts as step 98 according to this braindead system. That's why I write:

lon.range <- 98:98

Having done this, I save the file netcdf-convertor.R under another name, Riverside.R.
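If you want to do the same for your own town, here is a little helper that automates the arithmetic above. It’s just my own sketch (lat.step and lon.step are not part of Graham’s script), and it picks the nearest grid point:

> lat.step <- function(lat) round((90 - lat) / 2.5) + 1    # North Pole (90 N) is step 1, South Pole is step 73
> lon.step <- function(lon) round((lon %% 360) / 2.5) + 1  # 0 E is step 1, 90 E is 37, 2.5 W is 144
> lat.step(33.95)
[1] 23
> lon.step(-117.40)
[1] 98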

And then I do some stuff that it took some fiddling around to discover.

First, in my R interface I go to the menu item File, at far left, and click on Open script. It lets me browse around, so I go to my working directory for R and choose Riverside.R. A little window called R editor opens up in my R interface, containing this script.

I'm probably not doing this optimally, but I can now right-click on the R editor and see a menu with a choice called Select all. If I click this, everything in the window turns blue. Then I can right-click again and choose Run line or selection. And the script runs!

Voilà!

It huffs and puffs, and then stops. I peek in my working directory, and see that a file called

Riverside-1948-1979.txt

has been created. When I open it, it has lots of lines, starting with these:

S023E098
Y1948P001 279.95
Y1948P002 280.14
Y1948P003 282.27
Y1948P004 283.97
Y1948P005 284.27
Y1948P006 286.97

As Graham promised, each line has a year and day label, followed by a vector… which in my case is just a single number, since I only wanted the temperature in one location. I’m hoping this is the temperature near Riverside, in kelvin.
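By the way, there’s a slicker way to run a whole script, which I only learned later from the comments below: the source command reads and runs a file in one go, assuming it sits in the working directory:

> source("Riverside.R")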

A small experiment

To see if this is working, I’d like to plot these temperatures and see if they make sense. Unfortunately I have no idea how to get R to take a file containing data of the sort I have and plot it! I need to learn how, but right now I’m exhausted, so I use another method to get the job done— a method that’s too suboptimal and embarrassing to describe here. (Hint: it involves the word “Excel”.)

I do a few things, but here’s the most interesting one—namely, not very interesting. I plot the temperatures for 1963:

[Figure: my plot of the daily temperatures near Riverside for 1963.]
I compare it to some publicly available data, not from Riverside, but from nearby Los Angeles:

[Figure: publicly available Los Angeles temperature data for 1963.]
As you can see, there was a cold day on January 13th, when the temperature dropped to 33°F. That seems to be visible on the graph I made, and looking at the data from which I made the graph, I see the temperature dropped to 251.4 kelvin on the 13th: that’s -7°F, very cold for here. It does get colder around Riverside than in Los Angeles in the winter, since it’s a desert, with temperatures not buffered by the ocean. So, this does seem compatible with the public records. That’s mildly reassuring.

But other features of the graph don’t match, and I’m not quite sure if they should or not. So, all this is very tentative and unimpressive. However, I’ve managed to get over some of my worst fears, download some temperature data, and graph it! Now I need to learn how to use R to do statistics with this data, and graph it in a better way.

Puzzles

You can help me out by answering these puzzles. Later I might pose puzzles where you can help us write really interesting programs. But for now it’s just about learning R.

Puzzle 1. Given a text file with lots of lines of this form:

S023E098
Y1948P001 279.95
Y1948P002 280.14
Y1948P003 282.27
Y1948P004 283.97

write an R program that creates a huge vector, or list of numbers, like this:

279.95, 280.14, 282.27, 283.97, ...

Puzzle 2: Extend the above program so that it plots this list of numbers, or outputs it to a new file.

If you want to test your programs, here’s the actual file:

Riverside-1948-1979.txt

More puzzles

If those puzzles are too easy, here are two more. I gave these last time, but everyone was too wimpy to tackle them.

Puzzle 3. Modify the software so that it uses the same method to predict El Niños from 1980 to 2013. You’ll have to adjust two lines in netcdf-convertor-ludescher.R:

firstyear <- 1948
lastyear <- 1980

should become

firstyear <- 1980
lastyear <- 2013

or whatever range of years you want. You’ll also have to adjust names of years in ludescher-replication.R. Search the file for the string 19 and make the necessary changes. Ask me if you get stuck.

Puzzle 4. Right now we average the link strength over all pairs (i,j) where i is a node in the El Niño basin defined by Ludescher et al and j is a node outside this basin. The basin consists of the red dots here:

[Figure: map of the Pacific temperature grid, with the El Niño basin shown as red dots.]
What happens if you change the definition of the El Niño basin? For example, can you drop those annoying two red dots that are south of the rest, without messing things up? Can you get better results if you change the shape of the basin?

To study these questions you need to rewrite ludescher-replication.R a bit. Here’s where Graham defines the El Niño basin:

ludescher.basin <- function() {
  lats <- c( 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6)
  lons <- c(11,12,13,14,15,16,17,18,19,20,21,22,16,22)
  stopifnot(length(lats) == length(lons))
  list(lats=lats,lons=lons)
}

These are lists of latitude and longitude coordinates: (5,11), (5,12), (5,13), etc. A coordinate like (5,11) means the little circle that’s 5 down and 11 across in the grid on the above map. So, that’s the leftmost point in Ludescher’s El Niño basin. By changing these lists, you can change the definition of the El Niño basin.
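For example, here’s the mechanical part of dropping those two southern dots: they are the two entries with latitude coordinate 6, so you just delete them from both lists. This is only a sketch of the edit; whether it improves the results is exactly the question of the puzzle:

ludescher.basin <- function() {
  lats <- c( 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5)
  lons <- c(11,12,13,14,15,16,17,18,19,20,21,22)
  stopifnot(length(lats) == length(lons))
  list(lats=lats,lons=lons)
}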

Next time I’ll discuss some criticisms of Ludescher et al’s paper, but later we will return to analyzing temperature data, looking for interesting patterns.

41 Responses to El Niño Project (Part 5)

  1. Walter Blackstock says:

    Not having Windows, I tried this on my ancient Thinkpad T42 running Xubuntu 12.04 LTS. The package installer would make it straightforward I thought! Well, not exactly, but after many mis-steps, I think I’m there and can run the example script.

    My approach will not be optimum, but here is a summary which may help others. It’s essentially taken from here:
    http://cran.r-project.org/bin/linux/ubuntu/README

    1. To ensure the latest version, add a local cran mirror to the software repository at /etc/apt/sources.list
    2. Add the key:
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
    3. Update:
    sudo apt-get update
    4. Install core packages:
    sudo apt-get install r-base
    sudo apt-get install r-base-dev

    From John’s description, the Windows version appears to have a GUI: for Ubuntu it’s the command line.

5. sudo R gives the prompt > and the simple numerical tests work! Getting the package RNetCDF proved more difficult – no drop-down menu!

> install.packages('RNetCDF') failed. It needs to be downloaded and compiled, but how? Some dependencies are missing:

sudo apt-get install libnetcdf-dev libudunits2-dev udunits-bin and try again: the missing package compiles! I did this one package at a time but they could be concatenated.

    see https://stackoverflow.com/questions/11319698/how-to-install-r-packages-rnetcdf-and-ncdf-on-ubuntu

6. The rest follows closely the description given by John: > setwd and make the needed directory changes.
    The data download was without problem, but how to run the script from the R-prompt was not obvious to me.

After some searching, I used
> source("path_to_test_file.R")
which generated a text output file that looks OK. Maybe a .txt extension would also work?

    So far so good. I’m most grateful to Graham and John for making this available. Exciting!

    • John Baez says:

      Great to hear from you again, Walter! I’m glad you provided some help for all the Ubuntu users out there.

For those not in the know, Walter is one of the original Azimuthers; I met him when I first came to Singapore in 2010.

    • Pedro J says:

SAGE [http://www.sagemath.org/] is quite a nice piece of software for maths and includes R [see https://help.ubuntu.com/community/SAGE to install it in Ubuntu, or just try it online]. The GUI is as simple as a notebook running in the browser. In fact, John Baez, even though you have said you are not interested in other ways of doing things, Sage is really easy to use for playing with advanced maths without too much programming knowledge.

      • John Baez says:

        I’ve played a little bit with SAGE, but it has some big fans among the Azimuth crew. It might be good to use it on the cloud, especially if we want people to easily be able to run our programs without installing any software. I see now that R is also available at the same site.

  2. 1) More explanations of technical stuff should be written in this style. Clearly written, describing (some) mistakes you made, occasional humor. I particularly enjoyed the comment about math departments.

    2) Over the years I have been frustrated by downloading various computer languages and having a difficult time getting started with them. I am now pretty good at Mathematica but that was helped by a grant. My current frustration is the two languages that they use in Homotopy Type Theory (Agda and Coq), but I admit I have not yet gritted my teeth and sat down to thrash around enough to learn them.

    • John Baez says:

      I’m glad you liked this approach to explaining stuff. I imagine that if I get better at using computers I will lose the ability to describe the little things that experts find obvious but which trip up beginners. But I think everyone can do better by working a bit harder to remember and explain these little things.

  3. Darin Caggiano says:

    Python’s really nice for data munging, so I’m going to cheat a little bit and use it (Spyder specifically) instead of R for the first part of this.

    Here’s a time series data file of temperatures in Riverside.
    https://drive.google.com/file/d/0B40rcrJW2RqaSjZJSTJsd0xFVzg/edit?usp=sharing

    Here’s the python script to generate it (The Requests module is required for this script. I use pip for python modules, but there are other ways to install modules).
    https://drive.google.com/file/d/0B40rcrJW2RqacXNuUl9YMVdZYTQ/edit?usp=sharing

    Loading it as a data frame and plotting it is fairly straightforward in R.

    data <- read.csv(file="/pth/to/datafile.txt",header=FALSE)

    plot(data, type='l')

    Which returns a not very pretty time series graph of temperatures in Riverside.

    https://drive.google.com/file/d/0B40rcrJW2RqaZE9BWnhndG5OUEk/edit?usp=sharing

    I think we can construct an R script file in our python script and call it using subprocess to make this whole process a one-liner.

  4. xander says:

    What a fun way to learn R. ;)

    The installation for Mac is incredibly simple—binaries can be downloaded from http://cran.r-project.org/bin/macosx/ and installed with a double-click. The application bundle is installed in /Applications/ by default. The RNetCDF package can be installed and activated from the “Packages & Data” menu, at which point the rest of the tutorial above goes off without a hitch.

    As Walter notes,
    > source("path_to_file.R")
    can be used to run scripts.

Having never used R before, I am getting used to the syntax and its notion of data structures. However, after banging away for a while, I think that both of the puzzles have pretty straightforward solutions:

    Puzzle 1: I think that something along the lines of
    > data <- read.table("Riverside-1948-1979.txt")
    gets the job done. This reads the file into a data frame, which seems to be something like a perl hash.

    Puzzle 2:
    plot(1:length(data[[1]]),data[[1]])
    Creates the plot, though it really isn’t much to look at. It may be slightly better as
    > plot(1:length(data[[1]]),data[[1]],type="l")
    The x-axis is days, the y-axis is temperature.

    • vzemlys says:

I am curious, did the RNetCDF package work for you without problems? I had to install the udunits library to get it working, otherwise it just threw an error on load.

      For plotting see my version: https://github.com/mpiktas/el-nino/blob/master/R/puzzle1.R. Parsing dates is not that hard in R (given the library lubridate) and then x axis looks much nicer.

      • xander says:

The RNetCDF library worked without modification, but I am running a fairly old and customized install of 10.6, so it is possible that the udunits library that you found missing had already been installed. It should also be noted that the R GUI for installing packages on my system has a check box labeled “Install Dependencies”—perhaps I solved the problem without even realizing it by checking that box?

    • John Baez says:

      Yes, RNetCDF worked for me just as I explained in my story. But if you read other comments in this thread, you’ll see Benjamin completely gave up on it and used something called ncdf instead! Other people had other experiences.

      Thanks for your solution! “Lubridate” seems to be a handy thing.

  5. It is very useful to be able to share joint work!

    One more tip … I always try to record the entirety of an R console session in a file having an “.Rhistory” suffix. That’s a slight abuse of the meaning of these, since in the R world these are supposed to consist of incoming commands only. I also am sure to include a date in the filename, and, if there are multiple working sessions in a day, suffixes as to the run version. I find these incredibly useful because they record what I did and how precisely I got a result.

    There’s a comparable facility for Python in IPython. Python has the advantage of bridging the statistical analyst’s world and that of serious software developers and production. Python, with Numpy and Scipy and variants, is also becoming a useful tool for scientific calculation. Ray Pierrehumbert offers Python scripts for doing problems in his Principles of Planetary Climate along with his textbook.

    But Python presently lacks the rich libraries and packages R has for data analysis and calculation. For a Bayesian, like myself, I find pykalman and emcee indispensible. But I still do most serious work in R. I often use Python to “predigest” and “precondition” data for use in R. But, then, again, I also do “final calculations” using the Gibbs sampling facility JAGS which has a nice R interface. Python also has an interface to JAGS but, so far, my attitude is, why bother? It’s going to be the same thing either way.

Python also falls short in that there is no ability to “save a workspace” as you can in R, which is something, I saw, John did not emphasize above.

    By the way, if you are interested in global temperatures rather than ENSO, I have the HadCRUT4 data available for R. Check out the blog post describing it.

    Professor Frank Harrell, Jr, of Vanderbilt University has worked a lot on three aspects of all this. First, he and colleagues have contributed absolutely first rate, well documented packages to CRAN. (Some of the others are quite wanting in comparison.) Second, he has championed Bayesian methods in medicine and biostatistics. (This has gotten some results.) Third, he has worked and argued for reproducible research, using R, Sweave, and LaTeX. I have learned an awful lot from him. I pay attention to him because, in terms of statistical structure, his problems are quite similar to the ones I have in my professional work. (That’s as opposed to work in climate and geophysical things.)

  6. Benjamin Antieau says:

    After a couple of hours, I was able to get (a modification of) the code provided by Graham Jones to work on OS X (specifically OS 10.9 Mavericks). The obstruction to it working out of the box is that despite a lot of work and googling, I could not successfully get RNetCDF to work at all. The problem seemed to be in one of the dependencies, udunits. For those interested, I followed the steps at the page to install netcdf and udunits, in the hopes that this would solve the problem. No joy.

    Let me describe how I did make it work using the R package ncdf in place of RNetCDF. Today was my first day working with either R or CDF, so I know little about R and nothing about the benefits of one cdf package over the other. (But, see the discussion at this site.)

The first parts of the original post were the same: downloading R, downloading the data, and putting the script in the working directory.

    Once I decided to use ncdf all that was necessary was to modify Jones’s code slightly. I’ve put the new code on my homepage because it’s too late to figure out how to use github. Compared to the original script, I simply changed the library used, and I made some changes to the function make.Kvals.for.year. This needed to be updated to reflect the bindings in the new library.

    The result yields a file just like John’s.

    • John Baez says:

      Benjamin wrote:

      After a couple of hours, I was able to get (a modification of) the code provided by Graham Jones to work on OS X (specifically OS 10.9 Mavericks). The obstruction to it working out of the box is that despite a lot of work and googling, I could not successfully get RNetCDF to work at all.

      Ouch! Thanks for fighting through this. I’m glad it didn’t happen to me, since it might have been enough to crush me completely. For what it’s worth, I’m using a Toshiba laptop with Windows 7 Pro. I downloaded the RNetCDF package by working within the “RGui” interface that shows up when I click “R” on my icons, clicking on the menu item “Packages” and then “Install Packages”. Apparently all the dependencies were taken care of automatically and I was able to succeed while maintaining most of my ignorance.

      I’ll take the liberty of adding your variant to my GitHub El Niño repository, citing you.

      • Benjamin Antieau says:

        Thanks!

        I installed RNetCDF via the Install Packages feature of R, but it did not result in a useable library for some reason.

  7. John Baez says:

    Daniel Bowman sent an email saying:

    I don’t want to take the time to figure out how to post code on your blog, so I’ll just send along the R code with comments as an attachment. Hope it helps. Happy R’ing.

    Dan.

    P.S. If you find yourself doing much R programming, you’ll want to refer to this: http://www.burns-stat.com/documents/books/the-r-inferno/

    Here is his code:

    # If you like learning by experimenting, you probably want to get a quick reference card, e.g.,
    # http://cran.r-project.org/doc/contrib/Short-refcard.pdf
    #
    # Now, on to the puzzles.  First set your directory to where the file is:
    setwd("~/src/hack/R/climate") # set working directory 
    riverside<-read.table("Riverside-1948-1979.txt",header=TRUE)
str(riverside)		# it's a data.frame; these are like tables in databases
    head(riverside)		# first few observations
    summary(riverside)	# summary stats
    # Puzzle 1:
    bigLongVector<-riverside[,1] # pull out first column; think "project" in relational algebra
    # bigLongVector<-riverside[,"S023E098"]  # alternate
    # bigLongVector<-riverside$S023E098	 # yet another alternate
    str(bigLongVector)
    head(bigLongVector)
    summary(bigLongVector)
    # Puzzle 2:
    plot(bigLongVector,type="l")  					       # displayed on screen
    pdf(file="bigLongVectorPlot.pdf") 				       # prepares new graphic device; see also png() and jpeg()
    plot(bigLongVector,type="l",main="Riverside Daily Temps (1948-1979)")  # makes plot
    dev.off()							       # turns off graphic device; finishes plot
    #
    # But we can do more interesting things...First, let's get the date information out of the row labels:
    riverside$YMD<-strptime(row.names(riverside),"Y%YP%j")  # YMD format
    # Now, let's load the lubridate library (you may have to install it first) to make dealing
    # with dates easier.
    # install.packages("lubridate")
    library(lubridate)
    riverside$year<-year(riverside$YMD)
    riverside$month<-month(riverside$YMD)
    str(riverside)
    head(riverside)
    # Now, we can get some summary statistics by year and month
    riversideSumStats<-aggregate(S023E098~year+month,data=riverside,FUN="summary")
    str(riversideSumStats) # it's a data.frame
    head(riversideSumStats) # min, lower quartile, median, mean, upper quartile, and max
    str(riversideSumStats$S023E098) # it's a matrix...
    class(riversideSumStats$S023E098) # see?
    # Now, lets make some plots
    with(riversideSumStats[riversideSumStats$month==1,],matplot(year,S023E098,type="l",main="January in Riverside",ylab="Temp (K)"))
    with(riversideSumStats[riversideSumStats$month==7,],matplot(year,S023E098,type="l",main="July in Riverside",ylab="Temp (K)"))
    # Use help() to learn about commands and libraries:
    help(summary)
    help(lubridate)
    help(jpeg)
    help(matplot)
    # type a function name without arguments to see its code:
    matplot
    
    • John Baez says:

      Great, thanks! Many of these things work for me. They’re interesting and fun.

      Maybe someone who finds it not too bothersome could answer some of these questions:

      1. What’s going on in this line?

      riverside$YMD<-strptime(row.names(riverside),"Y%YP%j")
      

      I had created something called riverside with many lines of this form:

      Y1979P365   299.31
      

      When I did

      riverside$YMD<-strptime(row.names(riverside),"Y%YP%j")

      these lines changed form to lines like

      Y1979P365   299.31 1979-12-31
      

      So that was the effect. But I don’t understand the syntax riverside$YMD, which seems at first to be assigning values to a variable riverside$YMD. (It’s not; it’s changing the value of riverside.)

      2. When I get around to doing

      riversideSumStats<-aggregate(S023E098~year+month,data=riverside,FUN="summary")

      it doesn’t work. I get this error message:

      Error in eval(expr, envir, enclos) : object 'S023E098' not found

      Indeed, what is S023E098 supposed to be?

      • Dan says:

        Did you use header=TRUE in the read.table command? The first line of the file you posted is S023E098, which I’m guessing is the coordinates of the temperature readings. So, when you read the file into R with header=TRUE, that will become the column label of the temperature readings in the data.frame. By default, the first column in the file is assumed to be the row labels, which don’t get a name. That’s why I accessed them with the row.names command. The strptime command should have created a new column in the data.frame called YMD. Try head(riverside). You should see the column labels….

      • vzemlys says:

The variable riverside is a data.frame object. A data.frame is simply a table with named columns. The columns can be accessed via the $ construct. I.e., if the data.frame dt has a column named A, you can access it by writing dt$A. So the code

        riverside$YMD<-strptime(row.names(riverside),"Y%YP%j")

tries to access the column YMD in the data.frame riverside and assign it the value of the right-hand-side expression. Since there is no such column, it simply creates it.

  8. James Curran says:

    Here are the answers to puzzles 1 and 2:

data = readLines("Riverside-1948-1979.txt")[-1] ## -1 drops the first line

    ## split the data into two parts based on whitespace
    split.data = strsplit(data, "[[:space:]]+")

    ## extract the numbers by `applying' a function to every list element
    values = sapply(split.data, function(x) as.numeric(x[2]))

    plot(values, type = 'l')

    ## commented out in case you don't want to save
    ## write.csv(values, 'values.csv')

  9. vzemlys says:

    To make the RNetCDF work in Mac OS X, simply download and install the udunits library. Open your terminal and do the following:

    1. wget ftp://ftp.unidata.ucar.edu/pub/udunits/udunits-2.1.24.tar.gz
    2. tar -xzvf udunits-2.1.24.tar.gz
    3. cd udunits-2.1.24
    4. ./configure
    5. make
    6. sudo make install

This assumes that you have installed the relevant Apple development files, i.e. Xcode.

    After this RNetCDF works like a charm. Tested on OS X Mavericks and R 3.1.1.

  10. vzemlys says:

    You do not download all of the data. In the line for (year in 1950:1979) change 1950:1979 into 1948:2014. This will download all available data, which is necessary for predicting the El Nino for the years 1980 and beyond.

  11. vzemlys says:

Ok, I did some code refactoring to solve puzzle 3. I’ve put all my changes into a repository https://github.com/mpiktas/el-nino, which is a fork of https://github.com/johncarlosbaez/el-nino. The file puzzle3.R contains all the relevant code. You change the years and the name of the output file at the beginning and source the file with source("puzzle3.R").

    What I did:

    1. Separate functions from the actual analysis
    2. Set all the variables in puzzle3.R and feed them to the analysis files.
    3. Adapt the functions and analysis files so that step 2 is possible.

I have a question about the variable m. Now it is set to the value 200. This value is used in calculating standard deviations. From what I see in the code, the window for the calculation is n+m observations. The value n is 365, so that makes 565 days. We start the calculations from the year number 2. This makes us lose some observations at the beginning of the year 0. Is this OK? This might be a stupid question, because I’m just looking at the code and not seeing the logic behind it.

    I also added the file to download all the data files. The data files are downloaded in the directory data. All the intermediate files are put there also.

To use puzzle3.R do the following.

    1. Get all the code from the GitHub repository (there is a link with Download ZIP)
    2. Start R and set the working directory to the R directory of the extracted code.
    3. Create a data directory as an R subdirectory and put the data files there. Or uncomment the line source("dowload_data.R") in the file puzzle3.R
    4. Change the appropriate years and latitude and longitude ranges in the file puzzle3.R
    5. Source the file puzzle3.R, by issuing the command source("puzzle3.R") in R.
    6. Wait and check the results.
    7. Do not forget to comment out the data download line in puzzle3.R for further analyses.

    I’ve also added my own solution to puzzles 1 and 2 in the file puzzle1.R. Change the input file and explore the results.

    • Graham Jones says:

“This makes us lose some observations at the beginning of the year 0. Is this OK?”

      Yes. It just means the output starts in 1950, whereas it could go back a few months into 1949. Ludescher et al only produce results from 1950 onwards (judging by their graph).

  12. vzemlys says:

Did more of the code refactoring. Now everything is a function and you can compare the results, i.e. you can run the replication and save it to a separate variable. I’ve added a rudimentary comparison function. See the file puzzle4.R

I’ve also added the file challenge2.R with challenge 2 from the previous post. It is now possible to use a grid which is not coarse.

I’ve looked into the code and unfortunately this type of code (i.e. looping through 3d arrays) will not run fast in R. Either it needs to be vectorised or rewritten with Rcpp. The speed improvements could be significant. Unfortunately I am not that well versed in C++ and Rcpp, so at the moment I will not try to optimize anything.

  13. xander says:

    I may be betraying my complete lack of familiarity with R with this comment, but it appears to me that none of the scripts that have been written actually export the results so that they can be examined later. Am I missing something? And, if not, is the following snippet, added to the end of ludescher-replication.R the right way to export the results (as in, [1] is the code correct or is there a better way of handling it? and [2] am I saving the correct data?):

    write.table(data.frame(time=time.axis,S=S),file="/path/to/output/",quote=F)

This seems to create a whitespace-delimited table of indices, times (in decimal years—there seem to be functions in R that could pretty-ify that, but that is beyond the scope of the question I am asking at the moment), and link strengths.

    Again, apologies if this is a really dumb question. :\

    • John Baez says:

      I’m not very good at this stuff; I ran “ludescher-replication.R” and then ran a line like yours, but less sophisticated, to export the data.

      Yes, the vector “S” is the data you want to save. As explained in Part 4, S is a list of “average link strengths”, computed at 10-day intervals, starting from day 730 and going until day 12040, where day 1 is the first of January 1948.

      If you can export them along with the relevant year and day information, that would be better than what I did! Something like 1948D070 for the 70th day of 1948 would be fine, for example.

      • xander says:

        The package lubridate seems to get the job done. You’ll need to make sure that is installed (the process should be similar to installing the RNetCDF package). I’ve added the lines


        library(lubridate)

        ...

        time.date <- date_decimal(time.axis)
time.out <- sprintf("Y%4dD%03d",year(time.date),yday(time.date))
        outfile <- paste0("Pacific-",firstyear,"-",lastyear,"-AvgLinkStrength.txt")
        write.table(data.frame(time=time.out,S=S),file=outfile,quote=F)

        to my copy of the script. I am not familiar with GitHub, so I just put up a link to my Dropbox: https://www.dropbox.com/sh/y4t9g0xrvrrl8xc/AAAFf85GTAF8Nd5RerA_KA9Aa

        I’m currently running the program for the range 1980-2013, so the output should be ready in about 40 minutes.

        • xander says:

          Or maybe better, since it doesn’t use the lubridate library:

          time.out <- sprintf("Y%4dD%03d",floor(time.axis),floor(365*(time.axis-floor(time.axis))))
          outfile <- paste0("Pacific-",firstyear,"-",lastyear,"-AvgLinkStrength.txt")
          write.table(data.frame(time=time.out,S=S),file=outfile,quote=F)

        • vzemlys says:

R recognises dates (when plotting for example), so it is handy to convert character to date. A side bonus is that you can format the date the way you want using the function format.Date, i.e. there is no need to extract the year and day yourself. Here is an example: format.Date(Sys.Date(), format="%Y-%j") will output the year and the day of the year. To see all the possible options for output see ?strftime.

    • Graham Jones says:

      Actually, I’m not very good at this stuff either. If I can write out some data, make some sense of it in a text editor, and read it back in again without it having turned into something else, I’m happy!

    • vzemlys says:

Yes, this is the way to export tabular data in R. You can export to CSV, which can be read in Excel immediately, with the function write.csv (or write.csv2 for folks with a comma as the decimal point). By default R writes the row indices in the output; you can omit them by setting row.names=FALSE.

      • David Winsemius says:

        The tip about setting row.names=FALSE is a good one for those who are exporting to Excel. If the default value for that parameter is used the headers will be off by one column when the data comes into Excel. It’s easy enough to fix by just inserting a single cell in the “upper left corner” and picking shift right in the dialog box that will pop up.

        I’ve checked the “notify me of new posts via email”. I’m a pretty experienced R programmer and would like to support this effort. I’m not particularly experienced with the spatial statistics packages, though, so would be learning along side all of you. I’ll go back and try to catch up on what I have missed.

        @John; If you want any help on R plotting routines, I’m in the same timezone as you. I’m pretty good with base graphics and lattice, but not so good with ggplot2 (but those don’t look like ggplot graphs).

  14. domenico says:

I am thinking that there is a problem with scientific data files, because you need both the file and a program to read it, and after many years it can be impossible to read a file without the right executable and the right operating system.
    I am thinking that a universal header could be simpler (for example the part of the C program that reads the data, or the complete C source program), one that each operating system could use automatically to convert the file into a text file (so that the executable and the data are concatenated): a double click on a file could compile a simple executable that converts it to a text file, and no other language would be necessary.
    I am thinking that the Network Common Data Form needs a library, whereas a source code approach could be more robust.

  15. In case you’re interested, here are the other parts of this series:

    El Niño project (part 1): basic introduction to El Niño and our project here.

    El Niño project (part 2): introduction to the physics of El Niño.

    El Niño project (part 3): summary of the work of Ludescher et al.

    El Niño project (part 4): how Graham Jones replicated the work by Ludescher et al, using software written in R.

    El Niño project (part 5): how to download R and use it to get files of climate data.

    El Niño project (part 6): Steve Wenner’s statistical analysis of the work of Ludescher et al.

    El Niño project (part 7): the definition of El Niño.
