post by Nadja Kutz
This blog article is about the temperature data used in the reports of the Intergovernmental Panel on Climate Change (IPCC). I present the results of an investigation into the completeness of global land surface temperature records. There are noticeable gaps in the data records, but I leave the discussion of the implications of these gaps to the readers.
The data used in the newest IPCC report, namely the Fifth Assessment Report (AR5), seems at the time of writing not yet to be available at the IPCC data distribution centre.
The temperature databases used for the previous report, AR4, are listed here on the website of the IPCC. These databases are:
• NCDC (probably as a guess using the data set GHCNM v3),
• GISTEMP, and
• the collection of Lugina et al.
The temperature collection CRUTEM3 was put together by the Climatic Research Unit (CRU) at the University of East Anglia. According to the CRU temperature page, the CRUTEM3 data, and in particular the CRUTEM3 land air temperature anomalies on a 5° × 5° grid-box basis, have now been superseded by the so-called CRUTEM4 collection.
Since the CRUTEM collection appeared to be an important data source for the IPCC, I started by investigating the land air temperature data collection CRUTEM4. In what follows, only the availability of so-called land air temperature measurements will be investigated. (The collections often also contain sea surface temperature (SST) measurements.)
Usually only ‘temperature grid data’ or other averaged data is used for the climate assessments. Here ‘grid’ means that data is averaged over regions that cover the earth in a grid. However, the data is originally generated by temperature measuring stations around the world. So, I was interested in this original data and its quality. For the CRUTEM collection the latest station data is called the CRUTEM4 station data collection.
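To make the idea of grid averaging concrete, here is a minimal Python sketch that bins station anomalies into 5° × 5° boxes and averages each box. The `(lat, lon, anomaly)` tuple layout is my own assumption for illustration, not the actual CRUTEM file format:

```python
import math
from collections import defaultdict

def grid_average(stations):
    """Average station temperature anomalies over 5-degree grid boxes.

    `stations` is a list of (lat, lon, anomaly) tuples; this layout is
    an assumption for illustration, not the actual CRUTEM format.
    """
    boxes = defaultdict(list)
    for lat, lon, anomaly in stations:
        # Integer indices of the 5-degree box containing the station.
        i = math.floor((lat + 90) / 5)
        j = math.floor((lon + 180) / 5)
        boxes[(i, j)].append(anomaly)
    # Each grid-box value is the plain mean over its stations.
    return {box: sum(vals) / len(vals) for box, vals in boxes.items()}
```

Real grid products go further than this, for instance area-weighting the boxes when forming hemispheric or global means, since boxes cover less area towards the poles.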
I downloaded the station data file, which is a simple text file, from the bottom of the CRUTEM4 station data page. At first glance I noticed that there are big gaps in the file in some regions of the world. The file is huge, though: it contains monthly measurements starting in January 1701 and ending in 2011, for altogether 4634 stations. Finding gaps so quickly in such a huge file was disconcerting enough that I persuaded my husband Tim Hoffmann to help me investigate this station data in a more accessible way, via a visualization.
The visualization takes a long time to load, and due to some unfortunate software configuration issues (not on our side) it sometimes doesn’t work at all. Please open it now in a separate tab while reading this article:
• Nadja Kutz and Tim Hoffmann, Temperature data from stations around the globe, collected by CRUTEM 4.
For those who are too lazy to explore the data themselves, or in case the visualization is not working, here are some screenshots from the visualization which document the missing data in the CRUTEM4 dataset.
The images should speak for themselves, but an additional explanation is provided after them. In particular, it looks as if the deterioration of the CRUTEM4 data set was greater in the years 2000-2009 than in the years 1980-2000.
Now you could say: okay, we know that there are budget cuts in the UK, and so probably the University of East Anglia was subject to those, but what about all these other collections in the world? This will be addressed after the images.
These screenshots cover various regions of the world for the month of January in the years 1980, 2000 and 2009. Each station is represented by a small rectangle around its coordinates. The color of a rectangle indicates the monthly temperature value for that station: blue is the coldest, red the hottest. Black rectangles are what CRU calls ‘missing data’, denoted by -999 in the file. I prefer instead to call it ‘invalid’ data, in order to distinguish it from the data that is missing because stations have been closed down. In the visualization, closed-down stations are encoded by a transparent rectangle, with their markers still shown.
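To make the encoding concrete, here is a small Python sketch of how one might read a station-year row of monthly values, mapping the -999 sentinel to None. The whitespace-separated ‘year followed by 12 integers’ layout is an assumption for illustration, not the exact CRUTEM4 file format:

```python
INVALID = -999  # the sentinel CRU uses for what I call 'invalid' data

def parse_monthly_row(line):
    """Parse one station-year row of monthly values.

    Assumes a whitespace-separated 'year followed by 12 integers'
    layout, which is an illustration, not the exact CRUTEM4 format.
    Invalid readings (-999) are mapped to None.
    """
    fields = line.split()
    year = int(fields[0])
    months = [None if int(v) == INVALID else int(v) for v in fields[1:13]]
    return year, months
```

With such a parser, counting the None entries per year and region is all that is needed to reproduce the ‘black rectangle’ statistics shown in the screenshots.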
We couldn’t find the reasons for this invalid data. At the end of the post John Baez has provided some more literature on this question. It is worth noting that satellites can replace surface measurements only to a certain degree, as was highlighted by Stefan Rahmstorf in a blog post on RealClimate:
the satellites cannot measure the near-surface temperatures but only those overhead at a certain altitude range in the troposphere. And secondly, there are a few question marks about the long-term stability of these measurements (temporal drift).
What about other collections?
Apart from the already mentioned collections, which were used in the IPCC’s AR4 report, there are actually some more institutional collections, and I also found some private weather collections. Among those private collections, however, I haven’t found any that goes as far back in time as CRUTEM4, though some of them might be more complete in terms of actual data than the collections that reach further back in time.
After discussing our visualization on the Azimuth Forum, it turned out that Nick Stokes, who runs the blog MOYHU in Australia, had had the same idea as me, but already in 2011: in that year he had visualized station data, using Google Earth and several different temperature collections.
If you have Google Earth installed then you can see his visualizations here:
• Nick Stokes, click here.
The link is from the documentation page of Nick Stokes’ website.
What are the major collections?
As far as we can tell, the major global collections of temperature data that go back to the 18th, 19th, or at least early 20th century seem to be the following. First, there are the collections already mentioned, which were also used in the AR4 report:
• The CRUTEM collection from the University of East Anglia (UK).
• the GISTEMP collection from the Goddard Institute for Space Studies (GISS) at NASA (US).
• the collection of Lugina et al, which is a cooperative project involving NCDC/NOAA (US) (see also below), the University of Maryland (US), St. Petersburg State University (Russia) and the State Hydrological Institute, St. Petersburg, (Russia).
• the GHCN collection from NOAA.
Then there are these:
• the Berkeley Earth collection, called BEST
• The GSOD (Global Summary Of the Day) and Global Historical Climatology Network (GHCN) collections. Both are run by the National Climatic Data Center (NCDC) at the National Oceanic and Atmospheric Administration (NOAA) (US). It is not clear to me to what extent these two databases overlap with those of Lugina et al, which were made in cooperation with NCDC/NOAA. It is also not clear to me whether the GHCN collection was used for the AR4 report (it seems so). There is currently also an only partially working visualization of the GSOD data here. The sparse data in specific regions (see images above) is also apparent in this visualization.
• There is a comparatively new initiative called International Surface Temperatures Initiative (ISTI) which gathers collections in a databank and seeks to provide temperature data “from hourly to century timescales”. As written on their blog, this data seems not to be quality controlled:
The ISTI dataset is not quality controlled, so, after re-reading section 3.3 of Lawrimore et al 2011, I implemented an extremely simple quality control scheme, MADQC.
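The post only names the scheme, so as a hedged illustration, here is what a minimal quality check of that flavour, based on the median absolute deviation (MAD), could look like in Python. The threshold and the 1.4826 scaling (which makes the MAD comparable to a standard deviation for normally distributed data) are my own choices, not the published MADQC parameters:

```python
import statistics

def madqc(values, threshold=5.0):
    """Flag outliers by their distance from the median, measured in
    units of the scaled median absolute deviation (MAD).

    A minimal sketch of a MAD-based check; the threshold and the
    1.4826 scaling are illustrative choices, not the parameters of
    the actual MADQC scheme.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # All values essentially identical: nothing to flag.
        return [False] * len(values)
    return [abs(v - med) / (1.4826 * mad) > threshold for v in values]
```

The appeal of a median-based check is robustness: a single wildly wrong reading barely moves the median, so it gets flagged instead of dragging the mean along with it.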
What did you visualize?
As far as I understand, the visualization by Nick Stokes—which you just opened—represents the collections BEST (before 1850-2010), GSOD (1921-2010) and GHCN v2 (before 1850-1990) from NOAA, and CRUTEM3 (before 1850-2000).
CRUTEM3 is also visualized in another way by Clive Best. In Clive Best’s visualization, it seems, however, that apart from the station name one has no access to further data, like station temperatures, etc. Moreover, it is not possible to set a recent time range, which is important for checking how much the dataset changed in recent times.
Unfortunately, this limited ability to set a time range also holds true for two visualizations by Nick Stokes here and here. In his first visualization, which is more exhaustive than the second, the following datasets are shown: GHCN v3 and an adjusted version of it (GADJ), a preliminary dataset from ISTI, BEST and CRUTEM 4. So his first visualization seems quite exhaustive also with respect to newer data. Unfortunately, as mentioned, setting the time range didn’t work properly (at least when I tested it). The same holds for his second visualization of GHCN v3 data. So I was only able to trace the deterioration of recent data manually (for example, by clicking on individual stations).
Tim and I visualized CRUTEM4, that is, the updated version of CRUTEM3.
What did you not visualize?
Newer datasets after 2011/2012, for example from the aforementioned ISTI or from the private collections, are not covered in the two visualizations you just opened.
Moreover, in the visualizations mentioned here, there is no coverage of the GISS collection, which however now uses NOAA’s GHCN v3 collections. The historical data of GISS could, however, be different from the other collections. The visualizations may also not cover the Lugina et al. collection, which was mentioned above in the context of the IPCC report. Lugina et al. could, however, be similar to GSOD (and GHCN) due to the cooperation. Moreover, GHCN v3 could be substantially more exhaustive than CRUTEM or GHCN v2 (as shown in Nick Stokes’ visualization). However, GHCN v3 was, like CRUTEM4, released in the spring of 2011.
GHCN v3 is also represented in Nick Stokes’ visualizations (here and here). Upon manually investigating it, it didn’t seem to contain much crucial additional data not found in CRUTEM4. Since this manual exploration was not exhaustive, I may be wrong—but I don’t think so.
Hence, to our knowledge, quite a lot of the available data is visualized in the two visualizations you just opened; it seems to be “almost all” (?) of the far-back-reaching, quality-controlled global surface temperature data collections as of 2011 or 2012. If you know of other similar collections, please let us know.
As mentioned above, the private collections and in particular the ISTI collection may contain much more data. At the time of writing we don’t know to what extent those newer collections will be taken into account for the new IPCC reports, in particular for the AR5 report. Moreover, it is not clear how quality control will be ensured for those newer collections.
In conclusion, the previous IPCC reports seem to have been informed by the collections described here. Thus the coverage problems you see here need to be taken into account in discussions about the scientific basis of previous climate assessments.
Hopefully the visualizations from Nick Stokes and from Tim and me are ready for exploration! You can start to explore them yourself, and in particular see that the ‘deterioration of data’ is—just as in our CRUTEM4 visualization—also visible in Nick’s collections.
Note: I would like to thank people at the Azimuth Forum for pointing out references, and in particular Nick Stokes and Nathan Urban.
The effects of missing data
supplement by John Baez
There have always been fewer temperature recording stations in Arctic regions than other regions. The following paper initiated a controversy over how this fact affects our picture of the Earth’s climate:
• Kevin Cowtan and Robert G. Way, Coverage bias in the HadCRUT4 temperature series and its impact on recent temperature trends, Quarterly Journal of the Royal Meteorological Society 140 (2014), 1935–1944.
Here is some discussion:
• Kevin Cowtan, Robert Way, and Dana Nuccitelli, Global warming since 1997 more than twice as fast as previously estimated, new study shows, Skeptical Science, 13 November 2013.
• Stefan Rahmstorf, Global warming since 1997 underestimated by half, RealClimate, 13 November 2013 in which it is highlighted that satellites can replace surface measurements only to a certain degree.
• Anthony Watts’ protest about Cowtan, Way and the Arctic, HotWhopper, 15 November 2013.
• Victor Venema, Temperature trend over last 15 years is twice as large as previously thought, Variable Variability, 13 November 2013.
However, these posts seem to say little about the increasing amount of ‘missing data’.
I have tried various other modes of visualisation. Here is a collection of WebGL active plots for the last 25 years (you can rotate, zoom etc, also display the stations). Here is a Google Maps version which lets you select stations by various criteria like years of operation. The pop-up information includes links to the GHCN station information. Here is a recent version which instead shows the effects of adjustment on trends at each station.
And here, in an HTML 5 style, is an active plot showing trends at individual stations over various periods. Again you can show the stations in question.
“However, these posts seem to say little about the increasing amount of ‘missing data’.”
There isn’t an increasing amount of missing data. There is increasing discipline about what is put into the collections. In the early years of both CRUTEM and GHCN, they were concerned to compile the vast amount of data that had fairly recently been digitised. GHCN was a big collaborative grant-funded project. But then it had to be maintained on an ongoing basis, and so they naturally asked – how much do we need? And they could drop a lot of stations and still have enough. I’ve written in detail about this process here.
Thanks for this post, Nadja! It’s especially useful, for us nonexperts, to have your list of various temperature databases, with links. I should copy that over to the Azimuth Wiki and keep improving it.
The question of why we see increasing numbers of black squares, and what they really mean—that’s something I hope we can resolve in the discussion here.
Is there any way to read papers, memos, etc. about this “increasing discipline”? The comments on your blog article include this:
• Thomas Peterson, Harald Daan and Philip Jones, Initial selection of a GCOS surface network, Bulletin of the American Meteorological Society, 78 (1997), 2145–2152.
which looks very interesting. I’ll have to read it. But I’m not sure this will help me understand the black squares appearing later than 1997. In your blog article you write:
Later you said “culling” wasn’t the right word. But what I’m really curious about is exactly what happened and why it happened… as explained by the people who did it. If this data is being used to do science, there should be papers about how the data was chosen.
Well, it’s not clear “who did it”. There were two separate projects – the historical phase, which ended, and the maintained phase. There was a gap; presumably it took time for someone in NOAA to get the organisation to commit to it.
GHCN is now basically just CLIMAT forms and USHCN. USHCN was a major collaborative exercise in planning; there is an NRC review here. CLIMAT forms require international cooperation, and I’m sure at one level there was a constraint on what some countries are prepared to do. WMO resolution 40 has the gory details.
To give an idea on “culling”, in 1990, Turkey lost about 200 stations. Now Turkey is not a very small country, but still. The original GHCN represented countries that had the best-kept records. Now they try to represent area.
ps I wouldn’t focus too much on the black squares. That comes from looking at snapshot grids. In 2009, for example, it looks as if most of Canada is missing. That was actually just because they were submitting CLIMAT forms with the requisite data, but omitting calculated norms which the protocol requires. That got fixed. Maybe the GISS graphic didn’t.
Nick thanks for the links. I think you may have a different emphasis in your visualizations. My focus was on the different time periods and the availability of data in those.
Well, every reader may judge for him- or herself by looking at the visualizations. If you want my opinion: I think this is rather catastrophic. In particular, I wouldn’t be surprised if the “global warming hiatus” is connected to the gaps.
Upon only briefly looking into your visualization, by randomly investigating the operation times of some Canadian stations, I see e.g. the year 1989 popping up a lot. That is, those stations seem to have been closed.
The Canadian stations weren’t closed. You’ll find their data on BEST or ISTI, or even GHCN Daily.
But GHCN is a compilation for recurrent up-to-date collection of global indices. That is, for calculating a numerical spatial integral. For that you need enough stations, suitably placed, to interpolate.
In the GHCN inventory of 7280 stations, there are 1921 from the US, 847 from Canada, 254 from Turkey, and 57 from Brazil. There is no need to maintain 847 stations from Canada.
What I find frustrating about the approach here is that it is totally unquantitative. People calculate the uncertainty of those integrals. Famously, in the discussion of whether 2014 was the hottest year, some bloggers were shocked to discover that GISS and NOAA had error bars. HADCRUT gives the 95% range as ±0.1 °C, and GISS and NOAA give sd ranges of about half that. Those CIs are based mainly on coverage: that is, on the total uncertainty associated with the interpolated values.
That depends on the number and distribution of stations. BEST claims to reduce the range to maybe half by having tens of thousands of stations. But BEST comes out months in arrears. It’s a calculated trade-off. You need to do that calculation.
A common starting point is Brohan 2006.
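As a toy illustration of how such a coverage-based confidence interval scales with station count (not the actual method of Brohan et al. 2006, which also accounts for spatial correlation and area weighting), one can sketch the standard error of a mean over effectively independent stations:

```python
import math

def coverage_se(station_sd, n_effective):
    """Standard error of a mean over n_effective effectively
    independent stations, each with anomaly standard deviation
    station_sd. A toy model only: real coverage uncertainties also
    account for spatial correlation and area weighting.
    """
    return station_sd / math.sqrt(n_effective)
```

On this crude model, halving the uncertainty range requires roughly four times as many effectively independent stations, which is exactly the kind of trade-off mentioned above.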
I tried to check whether e.g. the Thule Op station, marked as no longer operating in your visualization (termination 1981), can be found in GHCN Daily. The readme.txt says:
I think I found “ghcnd-stations.txt.”
The “all” subdirectory in here is currently not accessible though.
And I didn’t want to download the tar files.
So far I have just briefly looked into the inventory file for Thule, which has station ID GLW00017602, and for Thule Op. Unfortunately, it seems this inventory file doesn’t contain opening and closing dates, but I interpret the dates as the time periods for which the quantities TMIN, etc. were measured. The inventory doesn’t seem to contain the periods of the daily temperature records, though. For Thule one has:
(FRGB = Base of frozen ground layer (cm), FRGT = Top of frozen ground layer (cm), WT are weather types as described in readme.txt)
For Thule the dates are so far in the past that it seems very likely that this station is closed in GHCN Daily, just as it is in GHCN v3.
The same holds for the Thule Op site, which according to your visualization closed in 1981.
And as I said, I had manually checked your BEST visualization, and it looked as if there was not much difference from the CRUTEM/GHCN data.
I don’t know much about the new ISTI data but they said themselves that quality control is an issue (see blog post).
Apart from that, in my blog post I was looking at the data which was used for the last IPCC assessment report, since it is this data which entered the climate change recommendations, and not BEST, ISTI or GHCN Daily.
So even if the data had been available there (which I currently doubt very much), it wouldn’t have made a difference in the evaluations, it seems.
Well, Thule Op is in Greenland. And it seems that it did close in 1975, or that is when GHCN Daily data stops. There is a gadget here that will help with looking up GHCN Daily. On the first row of buttons above the frame is one for GHCN Daily. Click and you’ll see a table (may take some secs) with buttons saying USC, Australia,…World. Under World is a menu, with various countries, and down the bottom “Other”. Select Other, click World, and there is a long list of links, also with start-end dates. Search for THULE (Ctrl-F), there is THULE OP SITE. Click to bring up raw data.
John, by the way, thanks again for the layout fixing. Funnily, the images appear foggier on your blog than when I look at them directly.
Yes, sorry. I didn’t zoom into your visualization, because (just as for mine) zooming doesn’t work too smoothly, and so I randomly chose one of the tags in your visualization which looked approximately like it was located in northern Canada… and oops, ended up in Greenland.
Thule is an outdated name, but I should have gotten a clue from the name of the Thule airbase or the GLW abbreviation in the station’s ID.
But speaking of closed-down Greenland stations, maybe one should investigate the Danish temperature station management as well.
So now I picked MACKAR INLET, NW, which according to your GHCN v3 visualization closed in 1989, and indeed it seems TMAX, at least, was measured until 1992:
So if there is no fault in the closing date, then maybe there are still some stations living on in GHCN Daily which are not in GHCN v3. Interesting, but not relevant here, since, as I said, I looked at the files which were used by the IPCC.
For MACKAR INLET, NW I clicked on Canada. “NW” sounds like Northwest Territories, but by its location on your map the station seems to be in Nunavut, which does not appear as a region under your Canada pull-down menu, nor within any of the other Canadian pull-down regions, if I interpret your region abbreviations BC, NWT, Alb, Sas, Man, Ont, Marit correctly (I couldn’t identify Marit; is it an abbreviation for some maritime regions?). That is, I couldn’t find MACKAR in that table. (By the way, Yukon, Newfoundland etc. also seem to be missing.)
“Nunavut, which does not appear as a region under your Canada pull-down menu”
NWT is the old name for Nunavut, used by GHCN. Marit is maritime provinces. I’m just using (in order) the GHCN classifications indicated by the 3rd digit (001… etc) in the station code.
“I looked at files which were used by the IPCC”
What does “used by the IPCC” mean? I don’t think the IPCC is doing anything with these files.
Aha, OK. I won’t check this now, but I find that a bit strange. But then where is MACKAR?
As written above, the data sets listed on the IPCC website as used for the 4th Assessment Report are:
• NCDC (probably as a guess using the data set GHCNM v3),
• GISTEMP, and
• the collection of Lugina et al.
I wrote there “(probably as a guess using the data set GHCNM v3)”
because the document linked on the AR4 website says:
which I would interpret as meaning that they are using the monthly data, which is in GHCN v3, which you visualized. I am, though, not fully sure. That is, they link to a data file (which I can’t currently read), which seems to be based on the GHCN v3 station data, given the information in the link and the document, but maybe not. That is, I haven’t found any clear statement that this is so. Maybe this is written in the article cited there, to which I don’t have access.
I realized that some folks in the U.S. support the people who say that climate change is only a fantasy.
Now, with the prolonged cold weather in some of the U.S. states (Maryland), they found the opportunity to attack environmentalists again by saying: you see, we have snow, so where is global warming now?
The truth is that these kinds of people are producing and sustaining the major polluting activities on this planet, because they benefit greatly from them and don’t care about the others who live here along with them.
In connection to this you might now want to look at this recent article:
Science publishes new NOAA analysis: Data show no recent slowdown in global warming