Decadal predictions are at the forefront of climate research whereby global climate models (GCMs) involving coupled atmospheric and oceanic components are initialised with the best available information about the current state of the oceans and subsequently used to simulate the next decade or so (Meehl et al., 2014). GCMs have more traditionally been used for climate change projections to provide a picture of a climate several decades into the future (Chang et al., 2012), and although they also can produce simulations for the near future, they typically do not start with initial conditions that match the current state of the ocean and atmosphere. Because climate projections usually do not involve initial conditions that provide the best snapshot of current conditions, a large fraction of the uncertainty in the Atlantic in the first few decades is likely due to internal variability, as suggested by uncertainty partitioning (Hawkins and Sutton, 2009). Recent results from model experiments indicate some decadal predictive skill in the North Atlantic (Doblas-Reyes et al., 2013; Kirtman et al., 2013; Garca-Serrano et al., 2015), western Pacific (Mochizuki et al., 2010; Chikamoto et al., 2015) and the Indian Ocean (Guemas et al., 2013), however, the skill is lower in other regions of the world’s oceans. One cause of long-term predictability may be a result of the slow advection of heat in the Atlantic Ocean (Sutton and Allen, 1997; Robson et al., 2014); however, there may potentially also be other slow physical processes that play a role in the North Atlantic, and it is this internal variability that decadal forecasts try to predict.

GCMs are designed to simulate large-scale features and have a minimum skillful scale that may be of the order of a few grid-boxes (Takayabu et al., 2016). Downscaling is required to represent smaller-scale features and involves a variety of methods that utilise dependencies between large-scale and small-scale processes and conditions. It includes both nested high-resolution limited area models (also referred to as ‘dynamical downscaling’) as well as statistical methods as in empirical-statistical downscaling (ESD) (Takayabu et al., 2016). ESD uses statistical methods such as regression on empirical data (historical observations) to calibrate a simple model that describes the dependency between the large-scale predictor and the small-scale predictand. It can be tailored specifically to particular needs, and so-called model output statistics (MOS) represents one type of ESD where the downscaling models are calibrated with a combination of GCM output and local observations (Wilks, 1995). The benefit with MOS and calibrating the model output against the observations is that it can correct for systematic biases, e.g. if the anomalies are too weak or strong or if the action is systematically displaced in space. Moreover, MOS is a subcategory of ESD, but where the model results are used in the calibration of the downscaling models rather than a third data set (e.g. reanalyses). Hence, the difference between MOS and ‘traditional’ ESD is that the former uses model output as predictors in the calibration of the downscaling models against historical observations, whereas the latter uses historical observations as predictors.


Current status

Most work on decadal forecasts has so far been based on direct model output and ensembles of GCM output from a single or multiple models. For instance, Meehl et al. (2014) provided a comprehensive review of decadal forecasting, but did not address potential benefits associated with ESD or MOS when applied to such forecasts. Likewise, traditional validation of seasonal-to-decadal forecasts tend to involve direct comparison with reanalyses or observations (Smith et al., 2013; Guemas et al., 2015) or estimation of reliability of the Brier score (Wilks, 1995), however, statistical post-processing, such as ESD or MOS, may potentially enhance the forecast skills as it potentially can by-pass systematic bias and deterioration from local small-scale noise. Some large-scale features may be more predictable than others and may enhance predictive skills if used as predictors in ESD. Furthermore, there is a minimum skillful scale connected to the design of climate models and their ability to represent real-world phenomena and processes. Hibino and Takayabu (2016) studied the effect of aggregating simulated daily precipitation over different temporal and spatial scales on the signal-to-noise ratio (SNR) for the detection of change, and reported increased SNR with longer scales. Improved ability for detecting signal with greater scales can be exploited for decadal forecasts, as it is expected that ESD and MOS emphasise the large-scales, which may be more predictable than small-scale variability, and hence may potentially enhance the prediction skills by filtering out small-scale noise.

There have been few scientific studies where downscaling has been applied to seasonal or decadal forecasts (Caron et al., 2014; Benestad and Mezghani, 2015), but the limited literature may be due to few available seasonal-to-decadal forecasts. It is possible to use ESD and MOS as a diagnostic tool to assess connections between large and small scales, and here ESD was used for checking the link between the scales in the historical data, whereas MOS was used to evaluate the skill of the decadal forecasts. MOS may offer a value-added contribution as model drift from initial state is a difficult obstacle when it comes to decadal forecasting (Meehl et al., 2014). Bias adjustment has often been used to produce skillful forecasts (Kharin et al., 2012), however, it is possible to by-pass a bias adjustment altogether through MOS. In addition to using MOS as an analytical tool for evaluating decadal forecasts, we also used MOS to provide an answer to the question of whether the use of downscaling improves the decadal forecast quality over the raw model output.

The objective of this study was to downscale the wet-day mean precipitation (μ) and the wet-day frequency (fw), as in Benestad and Mezghani (2015). Here, we assumed a wet day was when more than 1 mm/day precipitation was recorded in a rain gauge measurement. These quantities can be used in probabilistic forecasts since the probability of heavy precipitation may be approximated by an exponential distribution where Pr(X>x)=fwex/μ (Benestad, 2007; Benestad and Mezghani, 2015) (here x is the daily precipitation amount). Furthermore, the total precipitation amount is the product of the two and the total number of days (n): ptot=nfwμ. The greatest potential for downscaling such probabilities appears to be the connection between the mean sea-level pressure (SLP) and fw, as the connection between the amounts μ and the large-scale conditions may be more complicated (Benestad and Mezghani, 2015). The wet-day frequency responds more to shorter time scales (about 5 years) than the wet-day mean precipitation (about 15 years and longer). A comprehensive analysis of skill needs to account for different time scales and was in this case carried out for annual time scales, three-year aggregates, pentads (five-year means) and nine-year periods.



To test whether there is a link between large-scale surface air temperature (TAS) or mean SLP anomalies and local precipitation statistics, ESD was first applied with predictors from the NCEP/NCAR reanalysis 1 over the period 1950–2017 (Kalnay et al., 1996). The predictands (wet-day mean μ and frequency fw) were aggregated over the forecast interval as in Benestad and Mezghani (2015), but were based on the EOBS (version 19.0) gridded precipitation and temperature data for the period 1950–2017 (Linden and Mitchell, 2009). Here, the term ‘forecast interval’ refers to the number of years ahead that each forecast was extended from the initial lead time, e.g. for an interval of one year we used the annual mean temperature and for a three-year interval we used the three-year average. A more precise mathematical formulation of the forecasted quantity is x(t)¯, t[t1, t2] where t1 is the lead time (one year for all the cases presented here) and t2 is the last date of the forecasted quantity. The forecasts were aggregated to estimate a mean value for the interval in order to represent the associated time scale. Likewise, corresponding aggregated quantities were estimated from the EOBS data. The validation involved a leave-one-out (LOO) cross-validation (Wilks, 1995). Since past studies suggest that the circulation pattern seems to affect the frequency of rainy days, we used SLP as predictor for fw (Benestad and Mezghani, 2015; Benestad et al., 2016). For μ, we tried predictors that could capture the sources of moisture (evaporation), which included the TAS and the saturation water vapour pressure es (see the Appendix for more details).

The ensemble mean forecasts were based on the GFDL-CM2.1 model (Delworth et al., 2006) and assessed against the NCEP/NCAR reanalysis to indicate what skill could be expected from the MOS. The assessment of the decadal forecasts involved (i) calculating the correlation between the ensemble mean forecast and the reanalysis, and (ii) a diagnostic of the ensemble spread. For the former, the correlation was estimated over the years of forecasts for each matching pair of grid-boxes in the reanalysis and ensemble means forecast without accounting for the initial drift associated with the prescription of initial conditions. The correlation score should not be sensitive to differences in the mean as long as the drift was similar for the entire set of forecasts for one given forecast interval.

The ensemble spread, on the other hand was quantified through the application of common empirical orthogonal functions (EOFs) (Barnett, 1999; Benestad, 2001). We used common EOFs to quantify the ensemble spread because they provide an elegant and suitable method for comparing different multivariate datasets (Benestad et al., 2017). Common EOFs have not been used much previously within the climate research community for comparing multivariate datasets. Here, they were used to identify the spatial pattern that explains most of the variance within the ensemble and the time evolution associated with this for each of the ensemble members. Here, the term EOF refers to the spatial pattern whereas the principal component (PC) is the time series describing its evolution over time. Combined, the EOFs and PCs bear some similarity with the Fast Fourier Transform because it is possible to reconstruct the original data from the set of EOFs, singular values and PCs.

The common EOFs provided a basis for the MOS where the PCs associated with forecast ensemble were used as input (predictors) in a stepwise multiple regression analysis against the predictand representing local precipitation and temperature. The predictands were also PCs, but were associated with the EOFs of the gridded EOBS data. For the decadal forecasts, there were 10 different predictions (i.e. 10 ensemble members) for the same date/year, whereas for the observations, there was only one data point for each date. To make use of all the information embedded in the ensemble in the evaluation of the forecasts, the same set of observations had to be matched with each ensemble member. We also made use of a multiple regression analysis to assess how well the forecasts matched the observations. Each member of the forecast ensemble (predictors) was included in the regression analysis by creating a joint data matrix made up with concatenated forecasts sorted according to the date (see the Appendix). Each member was matched with corresponding observations by repeating the same observations (predictands) for each ensemble member. The motivation for using a regression with multiple ensemble members was that a large spread was expected to affect the calibration of the stepwise multiple regression used in the MOS in a similar fashion as a high noise fraction, and hence lower the expected variance captured by the regression model.

We expected that the decadal predictability resided in the oceans rather than over land, however, the variability in temperature was greater over land with more pronounced short-term variability and contained more noise. Hence, the experiments were repeated with a land-sea mask to leave out the contribution from land surfaces to assess the ensemble spread in the maritime temperatures.

The application of MOS was designed to emphasise the large scales and their effect on the local conditions, and one potential benefit in using MOS was to increase the SNR if the large scales were more skillfully predicted than the small scales. In this case, the decadal forecasts (TAS, SLP or es derived from TAS) were used as predictors in both the calibration process as well as for prediction, and the MOS was based on a stepwise multiple regression analysis, which could account for systematic biases in the forecast fields (e.g. different means/drift, variance or variations offset in space).

The question of how scales are related in time and space has not yet been settled, and a range of different forecast intervals was tested to cover phenomena that are potentially associated with various time scales. For instance, do changes need longer time if they involve larger areas and spatial scales? A related issue was the question of how robust were the forecast skill estimates, which was explored through a comparison of different time scales (prediction period over which the mean was estimated). The assessment therefore involved many trials with different intervals over which the forecasts were aggregated. The question concerning the dependency between temporal and spatial scales was explored in terms of a spectral analysis applied to regional averages from the reanalysis over a range of domains with different area. The smallest domain was taken to be grid box centred in the Atlantic (15° W/30° N) and the larger domains were taken to be centred around this grid box (each region i contains (2i+1)2 grid boxes where i[0, 1, 2,]). Time series were estimated for each of these by taking the area average, which were then used to estimate the power spectra for the different domain sizes.

In summary, the steps taken in this study were:

  • Compare variance in TAS and SLP associated with various spatial and temporal scales to assess the presence of large-scale slow processes that may be predictable.
  • Aggregate annual fw and μ for gridded precipitation and the annual mean for the gridded temperature (EOBS).
  • Apply ESD to reanalyses and gridded precipitation to assess dependencies to large-scale SLP, TAS or es.
  • Generate maps of correlation between reanalysis and the forecasts filtered through a limited number of EOFs to assess skill associated with spatial scales.
  • Generate common EOFs from data matrices that combine (reanalyses and) all members of the ensemble for analysis and use in MOS.
  • Apply MOS to the common EOFs to downscale the aggregated statistics from EOBS and assess predictive skill and account for biases/drift.

More information about the method is provided in the Appendix.




Scale dependency

The first part of the study involved an evaluation of how the variance varies with temporal and spatial scales. Results from the scale analysis would provide a context for the subsequent evaluation of the decadal forecasts and the MOS results. We used spectral analyses for comparing the temporal and spatial scales, and Fig. 1a shows results for the power spectra for the surface temperature applied to successively larger domains, and the results indicated pronounced power for a combination of long time scales averaged over a large area. The long-term variability over large regions was consistent with a general warming trend; however, the results also suggested that large-scale fluctuations were associated with ∼20 year time scales. For SLP, on the other hand (Fig. 1b), most of the spectral power was found on the spatial small scales. This reflects shifts in the mass of the air column, which is more pronounced over smaller than over larger volumes. These results provided some support for the initial assumption and one motivation for using MOS in conjunction with decadal forecasts: large-scale temperature anomalies tend to be associated with longer time scales (higher values in the upper right section). For SLP, on the other hand, the most pronounced variations tend to be associated with synoptic cyclones and smaller spatial scales (higher values in the lower section) (Neu, 2012).

Fig. 1.  

Heat maps showing power spectra along the x-axis for different spatial scales along the y-axis for TAS (a) and SLP (b). Yellow colour signifies high and red low power. The temporal scales are given as months and the area as number of grid boxes. The contours give a smoother representation of the colour scaling. Data: NCEP/NCAR reanalysis 1.


Link between large and small scale conditions

Establishing how spatial scales are connection to variability is a prerequisite for downscaling approaches such as MOS, which makes use of large-scale conditions (predictors) and how the local scales (predictands) depend on these. To make this assessment, we used ESD as a diagnostic tool, as the application of MOS to decadal forecasts was based on the assumption that the local precipitation statistics depends on the large-scale conditions. This assumption was first tested by applying conventional ESD to the NCEP/NCAR reanalysis 1 and the gridded precipitation parameters μ and fw from the EOBS data. Hence, the first step to assess the effect of applying MOS to decadal forecasts did not involve MOS but a downscaling exercise between the best representation of actual large-scale conditions from a reanalysis and aggregated historical rainfall statistics. The results suggested that there was indeed a connection between the large scales and local precipitation statistics, and Fig. 2 shows the results of a LOO cross-validation for the wet-day frequency. The figure only shows the results for the leading predictand EOF, however, we carried out a multiple regression for each of the four leading predictands PC respectively, using a set of seven predictor PCs as input. There was a particularly close association between the large-scale SLP and the wet-day frequency fw (LOO correlation of 0.81), but a similar cross-validation analysis suggested more modest skills for the wet-day mean μ when using the saturation vapour pressure es over the ocean as the predictor (LOO correlation of 0.44; Fig. 3). Here, only the results for the leading predictand EOF is shown as the higher order patterns were associated with lower proportion of the total covariance (see the Appendix). In addition to that, the cross-validation gave low skill scores for μ, the ESD-based reproduction of the leading EOF pattern for μ did not capture the observed trend associated with this pattern. The maritime TAS and the vapour saturation pressure es, which is a function of TAS, were both related to the sea surface temperature (SST), and the predictor pattern for es associated with higher μ over Europe exhibited stronger weights in the Mediterranean and the western part of the Atlantic. In summary, the results from both the spectral analysis of scales and the ESD on historical observations indicated that MOS potentially might be a feasible strategy for improving decadal forecasts.

Fig. 2.  

Results from downscaling wet-day frequency indicated a close association between large-scale annual mean SLP anomalies and variations in the leading EOF of the annual fw. The upper left panel shows the pattern of the leading EOF estimated for the annual fw, upper right shows the SLP anomalies which are associated with variations in the leading PC (combination of spatial weights from a stepwise multiple regression) derived from NCAR/NCEP reanalysis 1 SLP, lower left shows a scatter plot between the original PC and predictions in terms of LOO cross-validation, and the lower right shows the original time series (black) and the calibrated results (red).

Fig. 3.  

Results from downscaling wet-day mean precipitation indicated a weak link between large-scale annual mean es anomalies derived from NCAR/NCEP reanalysis 1 TAS and variations in the leading EOF of the annual μ. The upper left panel shows the pattern of the leading EOF estimated for the annual μ, upper right shows the es anomalies, which are associated with variations in the leading PC derived from NCAR/NCEP reanalysis 1 TAS, lower left shows a scatter plot between the original PC and predictions in terms of LOO cross-validation, and the lower right shows the original time series (black) and the calibrated results (red).


Forecast skill for large-scale anomalies


Ensemble mean and correlation with reanalysis

The application of MOS can only provide added-value if there is some skill in predicting the large-scale anomalies on decadal time scales. Hence, another assumption to test was the hypothesised improved skill of the decadal forecasts for large-scale TAS and SLP patterns compared to small-scale anomalies. In this case, we used correlations to quantify the predictive skill for the ensemble mean, and we filtered the spatial scales from the forecasts and the historical observations by estimating the EOFs of the fields and then only retaining a few of the leading modes to keep the main modes of variability. A correlation analysis indicated low (negative) skill between the ensemble mean TAS forecast and TAS from the NCEP/NCAR reanalysis off the east coast of the US east coast near Cape Hatteras where the Gulf Stream separates from the continent, but high values in the eastern and northern part of the ocean basin (Fig. 4). The skill of the decadal forecasts was assessed through maps showing correlations between forecasts and reanalysis data for the same locations through matching grid-boxes. Over the Atlantic region, the skill in predicting SST is generally associated with changes in radiative forcing over the tropical and subtropical regions and with internal variability (slow, dynamically driven changes in ocean circulation) over the sub-polar gyre region (Garca-Serrano et al., 2012, 2015; Doblas-Reyes et al., 2013; Caron et al., 2015).

Fig. 4.  

Correlation maps between three-year mean TAS from the NCEP/NCAR reanalysis 1 and corresponding quantities from the decadal ensemble mean forecast, both filtered through 3 EOFs. The results indicated that the three-year temperature in the west off the east coast of the USA has a low correlation with the ensemble mean of the forecast, but there were high correlations in the east Fig. 4a. Figure 4b, map of correlation differences between forecast filtered through the three leading EOFs and unfiltered forecasts to assess the degree predictability of associated with the leading EOF modes (large scales).

Part of the skill was associated with the long-term warming captured by the decadal forecasts based on the GFDL CM2.1 coupled ocean-atmosphere climate model. The question whether the skill was connected to the larger spatial scale anomalies was tested through an assessment where the forecasts were filtered through a set of leading EOFs: the first 20 EOFs were computed and then a subset of the leading modes were kept and used to reconstruct the forecasts. A subset of the three leading modes is presented in Fig. 4, but this assessment was also repeated with different subsets and with similar results (not shown). The leading EOF modes represented covariance patterns that usually involved larger spatial and temporal scales than the higher-order modes. A comparison between correlation scores for the filtered forecasts with the original forecasts against the NCAR/NCEP reanalysis suggests that there were regions with both higher and lower correlation scores (Fig. 4b). It was mainly in the central North Atlantic where small-scale variability seemed to reduce the correlation.


Ensemble members and common EOFs

We assessed the potential skill of the decadal forecasts in more detail by quantifying the spread of the different ensemble members in terms of the main modes of variability. A large spread would indicate less predictability than a smaller spread, and we used common EOFs to compare the ensemble members with the historical fields (see the Appendix). Hence, an additional assessment of skill involved a comparison between the PCs from the common EOFs, and the results indicated that there was a moderate spread among the ensemble members, suggesting that the forecasts were only moderately constrained by the known initial conditions and the boundary conditions (Fig. 5). In this analysis, we had combined temperature anomalies from the NCEP/NCAR reanalysis with corresponding temperature anomalies from the ten ensemble members from the decadal forecasts for corresponding times and spatial grid. The joint data matrix was then used to compute common EOFs in order to produce a set of spatial covariance patterns common for all and a set of PCs, which represented each of the data sets in the joint data matrix. Ten of the datasets were generated by the GCM while only one was from the reanalysis, and hence, the GCM results had a dominating influence on the leading spatial covariance pattern. All the curves showing the leading PC refer to the same spatial pattern (upper left panel), and the singular values suggested that this leading mode could account for almost 52% of the variance, which also indicated that the predicted spatial covariance structure was similar to that found in the reanalysis (see the Appendix for more on the common EOFs). The four leading EOFs could explain most of the total variance, suggesting that the model was able to produce anomalies with similar characteristics as those in the reanalysis data. The spatial structure associated with the common EOFs indicated that the strongest variability was in the Arctic near the edge of the sea-ice and that there was least action off the east coast of the USA, where the Gulf stream branches off the American continent and where the correlation between the ensemble mean forecast and the NCEP/NCAR reanalysis was the smallest (Fig. 4a). One explanation for the reduced skill in this region may be that the ocean model lacked sufficient spatial resolution to adequately represent the western boundary current, the separation of the Gulf Stream from the continent, and the ocean current meanders. It is also interesting to note that the decadal forecasts (red curves) predicted a similar long-term trend as the observations (black curve).

Fig. 5.  

Common EOFs estimated for the joint annual mean TAS NCEP/NCAR reanalysis and the forecasts. The upper left panel shows the common spatial anomaly structure common for the reanalysis and all ensemble members, the upper right shows the variance associated with each EOF (singular values) with cumulated variance (red), and the lower panel shows the PC where the black curve represents the reanalysis and the red curves indicate the ten ensemble members. The leading EOF accounted for more than 50% of the variance, and stood out as the most important mode. The PCs suggested that there was a weak-to-moderate scatter (with respect to the decadal variations) between the ensemble members, and hence some indication of predictive skill.

A similar assessment as for TAS was also carried out for the SLP results, and the results suggest that the atmospheric circulation patterns were less constrained by the ocean state compared to the TAS (Fig. 6a). Moreover, the spread between the ensemble members was more pronounced in the SLP than for TAS, which was evident from the common EOF results (Fig. 7). The leading EOF accounted for most of the variance both in SLP and TAS (Figs. 5 and 7), and for SLP, the spatial EOF pattern suggested that there was a linkage with the North Atlantic Oscillation (NAO). The NAO is well-known to be associated with pronounced variations in the winter-time precipitation and temperature over northern Europe, and may be influenced by a range of conditions including the stratosphere (Thompson et al., 2002; Cagnazzo and Manzini, 2009; Ripesi et al., 2012). On a longer timescale of nine years, the magnitude of the correlation was greater than for the three-year timescale (Fig. 6b), and one possible explanation may be that the reduced sample size of non-overlapping forecasts over the 1961–2011 interval results in more pronounced sampling fluctuations. Figure 6 suggests that there was no indication of improved skill with longer timescales for SLP. Having established that large scales were prominent for TAS but not for SLP, that there was a dependency between large and small scales in the historical observations, and that the decadal forecasts made with GFDL CM2.1 exhibited some skill for TAS but less so, for SLP, we now have a context for evaluating MOS applied to the decadal forecasts. We proceeded with applying MOS to the decadal forecasts.

Fig. 6.  

(a) Same as Fig. 4a, but showing correlation maps for NCAR/NCEP reanalysis 1 and forecasted three-year mean SLP. The correlations were substantially lower than for TAS, suggesting less skillful forecasts when it comes to atmospheric circulation patterns. (b) Correlation maps for observed and forecasted nine-year mean SLP.

Fig. 7.  

Common EOFs estimated for the joint annual SLP NCEP/NCAR reanalysis and the forecasts. The upper left panel shows the common spatial anomaly structure common for the reanalysis and all ensemble members, the upper right shows the variance associated with each EOF (singular values) with cumulated variance (red), and the lower panel shows the PC where the black curve represents the reanalysis and the red curves indicate the ten ensemble members. The leading EOF accounted for about 50% of the variance, and stood out as the most important mode. There was a substantial spread in way the PC was captured by the ensemble members.


Evaluation of MOS

The added value of applying MOS to gridded precipitation statistics was tested through LOO cross-validation applied to the decadal forecast for both wet-day mean and frequency, but none of these predictions were associated with much skill. The lack of predictive skill in precipitation over Europe was also noted by Guemas et al. (2015). A correlation analysis between the ensemble mean of the LOO results for the wet-day frequency fw and corresponding statistics from the gridded EOBS data returned scores scattered around zero (–0.02) for timescales of three years (Fig. 8a), but slightly higher for longer timescales (nine years: 0.14) (Fig. 8b). The same diagnostics for the wet-day mean μ showed similar mean correlation scores as for the wet-day frequency (Fig. 9). One possible reason for the poor prediction of the wet-day frequency may be that the forecasts of SLP were associated with pronounced ensemble spread, and a correlation analysis suggested little resemblance with the observed SLP field in terms of the NCEP/NCAR reanalysis 1 (Figs. 6 and 7).

Fig. 8.  

The correlation between (a) the three-year and (b) nine-year wet-day frequency estimated from the EOBS data and the mean MOS results taken over the entire ensemble. The results indicated low skill for the prediction of three-year fw.

Fig. 9.  

The correlation between the (a) three-year and (b) nine-year wet-day mean precipitation estimated from the EOBS data and the mean MOS results taken over the entire ensemble. The correlations were more heterogeneous for μ than for fw, with several regions that had higher values. However, there were also regions with weak scores.

In addition to using common EOFs to compare the raw decadal forecasts with the reanalysis data, they were also used to assess the MOS results against the observations, where common EOFs were calculated for the ensemble mean LOO results for μ and corresponding wet-day mean precipitation from the EOBS data (not shown). The PCs of the common EOFs suggested that both the decadal forecasts and the observations contained a long-term increasing trend, which can be associated with the past warming. For the wet-day frequency, the part of the leading PC representing the ensemble mean MOS results had significantly reduced variance compared to the observations, suggesting poor skills (R2; not shown).

The annual mean temperature over Europe is influenced by the large-scale temperature structure over the North Atlantic, as demonstrated with LOO cross-validation for a downscaling exercise using NCEP/NCAR reanalysis 1 TAS (both maritime and continental) as predictor (Fig. 10), and the leading EOF for the annual temperature pattern accounted for 66% of the total variance in the EOBS data over Europe. The combination of spatial weights from the multiple regression in the ESD cross-validation exercise reproduced a similar spatial pattern (upper left panel) as the leading EOF of annual mean temperature from EOBS (upper left), and a correlation of 0.97 in the cross-validation indicated a strong link between the two (lower left). The MOS was repeated for the gridded annual mean of the daily mean temperature from the EOBS set. Two experiments were carried out, one where the predictor was taken to be the marine TAS over the North Atlantic and where the temperature over land was masked, and one using both maritime and continental temperatures. The first experiment, which included maritime temperatures, was expected to yield better scores only if the temperatures over the ocean are more predictable, as the temperature over land often has higher variance and is more strongly affected by circulation patterns. Hence, a MOS that uses maritime temperatures as predictors may enhance the skill for decadal predictions. However, the evaluation of the MOS for annual mean TAS forecasts indicated that there was low skill for three-year time scales over northern and western Europe where the correlation skill scores were negative (Fig. 11a). For time scales of 3–5 years, there was an enhancement of the correlation in the southern part of Europe (not shown), and for time scales around 9 years and greater, the correlation was high for most of Europe with a mean correlation of 0.49 (Fig. 11b). These skill scores reflected the long-term trend, which was reproduced in the decadal forecasts, and ephemeral variations over northern Europe with time scale <5 years and low predictive skill suggested that the shorter time scales were strongly influenced by SLP and the NAO.

Fig. 10.  

Results from the ESD experiment indicated a strong link between large-scale and local annual mean TAS anomalies. The upper left panel shows the pattern of the leading EOF estimated for the annual TAS from EOBS, upper right shows the TAS anomalies from the NCEP/NCAR reanalysis 1 associated with variations in the leading PC, lower left shows a scatter plot between the original PC and predictions in terms of LOO cross-validation, and the lower right shows the original time series (black) and the calibrated results (red).

Fig. 11.  

The correlation between the (a) three-year and (b) nine-year mean TAS estimated from the EOBS data and the mean MOS results taken over the entire ensemble. The results indicated low skill for the prediction of three-year TAS. Only the maritime air temperatures from the decadal predictions were used to downscale the gridded temperature over Europe.

The analysis presented in Figs. 4–11 was repeated for forecasts for the annual, two-year means, four-year means, six-year means and so on up to nine-year aggregated value, and the skills varied gradually with time scale in accordance to the differences between the three/five/nine-year aggregates (not shown).



The skill assessment here was different to the common strategy for decadal forecasts, which often involves a comparison between the skill of the initialised forecasts with that of the non-initialised forecasts. In this case, the comparison was made between aggregated statistics from short and long forecast intervals and corresponding intervals of observations because we were concerned with a question concerning scales as well as the merit of initialisation of the model simulations.

One reason for low predictive skill on decadal time scales over Europe was that the temperature and precipitation statistics were closely linked to the atmospheric circulation regime, and the large-scale SLP was only loosely constrained in terms of ocean heat distribution. Furthermore, the SLP was associated with shorter time scales and had a particularly strong link to the wet-day frequency, but also to the temperature. Here, the temperature and precipitation statistics were aggregated for all seasons, however, it is possible that there may be seasonal differences (Benestad and Mezghani, 2015).

One may question if it makes sense to assess whether MOS can add any value when it is already known that the GCM cannot predict large-scale circulation patterns of satisfying skill. In this case, we tried to distill more information from the ensemble than the traditional ensemble mean and emphasise the large-scale co-variance to get a better test. Furthermore, MOS can improve skill if the model systematically displaces phenomena in terms of their location but still manages to predict their timing, i.e. the activity of low pressure systems along the storm tracks or temperature anomalies connected to ocean currents. These results bring up another question about how much skill should a GCM have before downscaling can add more value. Trial and error provide steps towards learning more about these issues, and both successes and failures are important to report as a neglect to publish negative outcomes results in a publication bias (Joober et al., 2012; Mlinari et al., 2017). The lesson from this study was the importance of the atmospheric circulation and its degree of predictability on decadal scales. The most prominent large-scale SLP anomaly over the North Atlantic resembles the NAO pattern (Fig. 7), which has been notoriously difficult to forecast. It may be influenced by the position of the polar jet (Baldwin et al., 2007), the stratosphere (Thompson et al., 2002; Cagnazzo and Manzini, 2009; Ripesi et al., 2012), and the sea-ice extent (Petoukhov and Semenov, 2010; Orsolini et al., 2011; Francis and Vavrus, 2012), in addition to the North Atlantic SST, and it co-varies with the North Atlantic storm tracks. The large-scale maritime TAS over the North Atlantic, on the other hand, is more strongly constrained by the ocean state and its effect on the European mainland is mediated through atmospheric advection.

The analysis of the decadal forecasts presented here cannot support the hypothesis that MOS adds value in terms of making use of large-scale anomalies associated with higher predictive skills than local small-scale variations, although this does not rule out the possibility that MOS may enhance such forecasts in general. The GCM, on which these results were based, may not necessarily be representative of the best decadal forecasts or decadal forecasting in general. It is possible that better models exist with higher spatial resolution and better ability to represent the oceanic state and the ocean currents, and their influence on the overlying atmosphere. The Achilles heel of decadal forecasts seems to be connected to the atmospheric circulation, which in this GCM appeared to be less predictable. The GFDL-CM2.1 model simulates a motion within the atmosphere that is only weakly influenced by the underlying boundary conditions, such as heat sources, and this weak constraint limits the skill of decadal predictions over the continental Europe. There is also a region off the eastern coast of North America with low predictability where the Gulf current separates from the continent.

The scales in time and space appear to be related, but differ for TAS and SLP. Large-scale temperature anomalies are associated with longer time scales such as long-term trends; however, there are also some indications of variations with 20–30 year time scales. For SLP, on the other hand, the aggregation over space does not translate into longer time scales, but reduced variance. The SLP provides a measure of the air mass residing above the sea, and an average over a large area reflects the total mass within a segment of the atmosphere. The total air mass is not expected to change much over time; however, on smaller spatial scales, the air column is affected by atmospheric waves or temperatures at different heights. There is also little indication that large scales in terms of leading EOFs are associated with higher skills than small-scale variability (higher-order EOFs), and there is also a weak link between the saturation water vapour pressure, which is a function of TAS, and the wet-day mean precipitation. Both the TAS and the wet-day mean precipitation exhibit a long-term trend associated with the general warming, and the higher skill scores for μ predictions on longer time scales may partly be due to a common trend.

It is important to keep in mind that while different scales reflect different physical processes, the statistical character of aggregated data is also affected by the choice of time intervals and spatial representation, which also have an effect on the sample size. The standard error of the mean is σ/n1 (Barde and Barde, 2012) and the diminishing error with sample size may be one reason why it may be easier to detect a signal in longer intervals and over larger regions (Hibino and Takayabu, 2016). Hence, for the case of decadal forecasting, an improved skill can be expected both from emphasising the slow processes as well as the higher accuracy in estimating the mean value.



Decadal forecasts derived from a GCM were subject to MOS-based ESD over the North Atlantic and European region, but the analysis of skill suggested low predictability on three-year timescales, mainly due to the fact that the atmospheric circulation was weakly constrained by the oceanic state. Precipitation and temperature over Europe were strongly influenced by such circulation patterns, and in particular variability resembling the NAO and corresponding large-scale mean SLP anomalies. There was also little indication that large-scale anomalies were more predictable than small-scale variability according to these forecasts, hence undermining the assumption for MOS enhancing the predictive skill on decadal time scales. Higher skill was found for forecasts with longer timescale of nine years; however, the higher scores may have been partly inflated by common long-term warming trends.