The performance of an ensemble forecast, as measured by scoring rules, depends on its number of members. Under the assumption of ensemble member exchangeability, ensemble-adjusted scores provide unbiased estimates of the ensemble-size effect. In this study, the concept of ensemble-adjusted scores is revisited and exploited in the general context of multi-model ensemble forecasting. In particular, an ensemble-size adjustment is proposed for the continuous ranked probability score in a multi-model ensemble setting. The method requires that the ensemble forecasts satisfy generalised multi-model exchangeability conditions. These conditions do not require the models themselves to be exchangeable. The adjusted scores are tested here on a dual-resolution ensemble, an ensemble which combines members drawn from the same numerical model but run at two different grid resolutions. It is shown that the performance of different ensemble combinations can be robustly estimated based on a small subset of members from each model. At no additional cost, the ensemble-size effect is investigated not only by considering the pooling of potential extra members but also by including the impact of optimal weighting strategies. With simple and efficient tools, the proposed methodology paves the way for predictive verification of multi-model ensemble forecasts; the derived statistics can provide guidance for the design of future operational ensemble configurations without having to run additional ensemble forecast experiments for all the potential configurations.

Ensemble systems provide a framework for probabilistic forecasting in numerical weather prediction. A collection of forecasts with the same target serves as a basis for the generation of probabilistic products. In this framework, it is well established that the ensemble size, that is, the number of forecasts available at the product-generation stage, has an impact on the quality of the ensemble probabilistic products. This is for example the case when we consider a cumulative distribution function (CDF) generated from

More generally, ensemble-adjusted scores provide a means to estimate the ensemble-size effect on forecast performance assuming ensemble member exchangeability and stationarity of the error statistics. The concept of score adjustment allows one to derive an unbiased estimate of a score
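To make the idea concrete, here is a minimal sketch of such an adjustment for a single ensemble of exchangeable members (function names are ours, not from the paper): the kernel form of the CRPS is computed from m available members, and the spread term is rescaled by (1 − 1/M) to estimate, without bias, the expected score of a hypothetical M-member ensemble.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Kernel (energy) form of the CRPS for one ensemble forecast."""
    members = np.asarray(members, dtype=float)
    mae = np.abs(members - obs).mean()
    spread = np.abs(members[:, None] - members[None, :]).mean()
    return mae - 0.5 * spread

def crps_adjusted(members, obs, target_size):
    """Unbiased estimate of the CRPS that an ensemble of `target_size`
    exchangeable members would achieve, computed from a smaller
    subset of m members (m >= 2)."""
    members = np.asarray(members, dtype=float)
    m = members.size
    mae = np.abs(members - obs).mean()
    # Unbiased estimate of the expected distance between two distinct members.
    pairwise = np.abs(members[:, None] - members[None, :]).sum() / (m * (m - 1))
    return mae - 0.5 * (1.0 - 1.0 / target_size) * pairwise
```

When the target size equals the subset size, the adjusted score reduces to the plain kernel CRPS; for larger target sizes, the estimate decreases, reflecting the expected benefit of additional members.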

The first objective of this paper is to revisit the concept of ensemble-adjusted scores and its applicability in the general context of multi-model ensembles. The multi-model ensemble approach refers here to the combination of forecasts from

The benefit of the multi-model ensemble approach and the rationale explaining its success were investigated successively in Hagedorn et al. (

The second objective of this paper is to propose a new approach for ensemble-weighting optimisation. We show that optimal weights can be derived directly from the kernel representation of the CRPS. As a result, the ensemble-size effect and the weighting strategy can be analysed simultaneously. This is illustrated here in the particular case of a two-ensemble combination. An exhaustive analysis of weighted and unweighted ensemble combinations is performed without the need to run large ensemble experiments or complex post-processing methods. This novel approach to forecast verification is coined predictive verification.

This paper is organised as follows: the concept of ensemble-adjusted scores in a multi-model setting is described and tested in

Consider forecasting a continuous outcome

Following Gneiting and Raftery (

Now consider that we are in a multi-model setting. The multi-model ensemble comprises M_i members from each contributing model i

We can form the EDF for each of the contributing models

If we choose

Similarly to the single ensemble case, we would like to measure the expected ensemble-size effect on forecast performance in a multi-model ensemble setting. Not only is exchangeability of the ensemble members from any one model required but also

When the generalised multi-model exchangeability conditions are satisfied, an unbiased estimator of the CRPS for a multi-model ensemble with

The adjusted CRPS in a multi-model setting is denoted

The concepts developed in Section 2.1 (and later in Section 3) are tested on a dual-resolution ensemble experiment. A dual-resolution ensemble is a particular case of a multi-model ensemble because the different contributing ensembles share the same underlying model. However, this specificity is neither required for the application of predictive verification as developed here, nor does it impact the interpretation of the results. The choice of a multi-resolution ensemble to illustrate our method stems from recent interest at ECMWF in this type of configuration, but the method will also work with more traditional multi-model ensembles (as long as they satisfy the generalised multi-model exchangeability conditions).
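Under these conditions, the adjusted CRPS for a pooled multi-model ensemble can be computed from small member subsets. The following sketch (our function names; it assumes at least two members per model subset) applies the kernel form of the CRPS with unbiased estimates of the within-model and cross-model member distances:

```python
import numpy as np

def crps_multimodel_adjusted(subsets, obs, target_sizes):
    """Unbiased CRPS estimate for a pooled multi-model ensemble with
    target_sizes[i] members from model i, computed from small member
    subsets (one 1-D array per model, each with >= 2 members).
    Assumes the generalised multi-model exchangeability conditions."""
    subsets = [np.asarray(s, dtype=float) for s in subsets]
    M = np.asarray(target_sizes, dtype=float)
    Mtot = M.sum()
    k = len(subsets)
    # alpha[i]: unbiased estimate of the mean absolute error of model i
    alpha = np.array([np.abs(s - obs).mean() for s in subsets])
    # beta[i, j]: unbiased estimate of the expected absolute difference
    # between two distinct members from models i and j
    beta = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            d = np.abs(subsets[i][:, None] - subsets[j][None, :])
            if i == j:
                m = subsets[i].size
                beta[i, j] = d.sum() / (m * (m - 1))  # exclude self-pairs
            else:
                beta[i, j] = d.mean()
    # Kernel form of the CRPS for the pooled target ensemble
    score = (M * alpha).sum() / Mtot
    pair_weights = np.outer(M, M) - np.diag(M)  # M_i*M_j off-diag, M_i*(M_i-1) on-diag
    score -= 0.5 * (pair_weights * beta).sum() / Mtot**2
    return score
```

When the target sizes equal the subset sizes, the estimate reduces to the plain kernel CRPS of the pooled members.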

In our test example, the forecast dataset comprises forecasts from the same numerical model (the ECMWF Integrated Forecasting System) but run at two different resolutions: TCo639 (∼18 km) and TCo399 (∼29 km). They are referred to as the higher resolution (Hres) and the lower resolution (Lres) members, respectively. A dual-resolution ensemble combines

In order to answer this question, several forecast combinations with the same computational cost are assessed and their performance compared. An ensemble with an operational-like setup, that is, a (0,50) ensemble comprising 50 Hres members only, is used as the reference ensemble forecast. Other ensemble combinations are compared with this reference. These baseline combinations correspond to the (40,40), (120,20), (160,10), and (200,0) ensembles. Results of this type of analysis are documented in Leutbecher and Ben Bouallègue (

Here, the question is whether the same results and conclusion can be obtained using ensemble-adjusted scores. The ensemble-adjusted approach potentially allows one to reduce drastically the cost of an ensemble experiment. In the case of a positive answer to the above question, score adjustment would provide the framework for the analysis of the performance of all potential ensemble combinations at a lower experiment computational cost (see

Using Expression (6), performance analysis of multi-model combinations is based on verification statistics computed from small subsets of Hres and Lres members. Subsets of the type (2), (4), and (8) are tested, where each forecast of the subset is selected in order to be exchangeable with the other members. Leutbecher (

Dual-resolution ensemble performance is assessed for several surface weather variables but only results for 2 m temperature forecasts are shown here. Results for 10 m wind speed and 24 h accumulated total precipitation were analysed as well but are only briefly discussed here because the results are qualitatively similar. The chosen verification period covers the boreal summer (JJA) 2016, and the forecasts are compared with SYNOP measurements distributed over the northern hemisphere.

Performance in terms of CRPS is computed for each baseline combination, (40,40), (120,20), (160,10), and (200,0), and compared with the performance of the reference forecast (0,50). The CRPS difference is normalised by the mean CRPS of the reference forecast over the verification period and is simply denoted

Now, the CRPS for each of the baseline/reference combinations is estimated based on a (8) subset of members. In other words, we compute

Gain in experiment computational time and accuracy of the score estimates based on adjusted scores for different sizes of ensemble subset.

Computational resources for experimental testing are generally limited. In order to decrease the computational cost of numerical experimentation, one can think of reducing the length of the testing period. This alternative is also considered here. Scores computed from the full-size dual-resolution ensembles but over reduced sets of randomly selected verification days are compared with scores averaged over the full 92-day verification period (JJA 2016). Following the same procedure as for the ensemble-adjusted scores, correlations between normalised score differences and their estimates are computed for a range of verification window lengths. The results are reported in

Same as

Besides the accuracy of the ensemble-adjusted scores, the level of confidence associated with score differences is also important when verification results serve as a basis for decision-making regarding future ensemble configurations. In

Finally in this section, we would like to highlight the importance of the generalised multi-model exchangeability condition. This condition is required in order to have a valid unbiased score estimator as discussed in

Relative difference between CRPS and adjusted CRPS for a (40,40) ensemble as a function of the forecast lead time. The score adjustments are based on a (

In this section, the concept of ensemble-adjusted scores is exploited to efficiently assess the ensemble-size effect on multi-model ensemble performance. This is illustrated in the context of the design of a dual-resolution ensemble as described in

We first consider flexibility in terms of ensemble size in the design of a multi-model ensemble. This is for instance the case in the dual-resolution ensemble example: A decision can be made about the number of Lres and Hres forecasts to be combined, with computer power limited by current or foreseeable resources. Using ensemble-adjusted scores, the forecast performance of any combination is estimated from a small set of Hres and Lres forecasts. These estimates of the scores for many different configurations are available virtually for free once the required verification statistics from one representative small ensemble have been computed. There is no additional cost in terms of numerical experimentation or in terms of repeated computations of verification statistics.
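As a concrete illustration, the iso-cost combinations can be enumerated once a cost ratio between members is assumed. The ratio of 4 used below is our reading of the baseline combinations (40,40), (120,20), (160,10), (200,0) and the (0,50) reference, which all sit on the same budget line:

```python
# Enumerate (n_lres, n_hres) combinations with a fixed computational
# budget, assuming one Lres member costs 1/4 of one Hres member
# (an assumption consistent with the baseline combinations above).
COST_RATIO = 4   # Hres member cost / Lres member cost
BUDGET = 50      # budget expressed in Hres-member equivalents

def equal_cost_combinations(budget=BUDGET, ratio=COST_RATIO):
    combos = []
    for n_hres in range(budget + 1):
        remaining = budget - n_hres
        combos.append((remaining * ratio, n_hres))  # (n_lres, n_hres)
    return combos

combos = equal_cost_combinations()
# The baseline combinations all lie on this iso-cost line:
for c in [(40, 40), (120, 20), (160, 10), (200, 0), (0, 50)]:
    assert c in combos
```

Each entry of the returned list is one candidate configuration whose adjusted score can then be estimated from the same small subsets, without any new model runs.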

As suggested by results in

In

Parallel lines to the descending diagonal indicate results for combinations with equal computational costs. For example, results for combinations that require twice and half the current level of computer resources are indicated with dotted lines. Focussing on the current computational cost (dashed diagonal), one can consider running a (200,0) ensemble (top left corner), or a (0,50) ensemble (bottom right corner), or any intermediate combination. The ensemble performances of all these ensemble combinations with equal computational cost are reported on the right panel in

In

We consider now the case where the number of forecasts to be combined is fixed. The question is whether the ensemble performance can be improved by applying appropriate weighting to each combined model. We propose here an analytical expression of the optimal weights which is directly derived from the kernel representation of the CRPS. In contrast to post-processing methods generally in use, no numerical optimisation procedure is required in the proposed approach. Nevertheless, prior knowledge of forecast performance based on historical data is needed.
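A minimal sketch of this idea follows (our notation: alpha for the per-model mean absolute errors, beta for the matrix of expected absolute differences between members of each model pair; the constraint is assumed to be that the model weights sum to one). The constrained quadratic problem is solved as one small linear system, with no iterative optimisation:

```python
import numpy as np

def optimal_weights(alpha, beta):
    """Model weights minimising the kernel-form CRPS
        s(w) = w @ alpha - 0.5 * w @ beta @ w
    subject to sum(w) = 1, by solving the KKT linear system.
    alpha: per-model mean absolute errors; beta: matrix of expected
    absolute differences between members of each model pair."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    k = alpha.size
    # Stationarity: beta @ w - lam * 1 = alpha;  constraint: 1 @ w = 1
    kkt = np.zeros((k + 1, k + 1))
    kkt[:k, :k] = beta
    kkt[:k, k] = -1.0
    kkt[k, :k] = 1.0
    rhs = np.append(alpha, 1.0)
    sol = np.linalg.solve(kkt, rhs)
    return sol[:k]

def kernel_crps(w, alpha, beta):
    """Kernel-form CRPS of the weighted model combination."""
    w = np.asarray(w, dtype=float)
    return w @ alpha - 0.5 * w @ beta @ w
```

Because absolute-distance matrices are conditionally negative definite, the stationary point is a genuine minimum on the constraint set: the optimal weights give a kernel-form CRPS no larger than, for example, equal weighting.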

The weighting strategy proposed here is useful for showing the relative merits of the different model ensembles and their expected impact on multi-model ensemble performance. Potentially, further improvement could be achieved by additionally considering bias correction of the ensemble members. Such corrections are left for future studies.

In the following, we discuss the case where forecasts from two models are directly combined (

Considering the constraint

The optimal weighting depends directly on the number of members combined from model 1 and model 2, that is, M_1 and M_2, respectively. Optimal weights can also be estimated accounting for the ensemble-size effect. In order to derive optimal weights for a (M_1, M_2) ensemble based on a subset (m_1, m_2) of forecasts, one must solve

Using the kernel representation of the adjusted CRPS in a multi-model context, an analytical solution of

From

Exploiting these results, the performance of dual-resolution ensembles can now be examined considering optimal weighting. For each (

Weight associated with the Hres model as a function of the number of Hres (Lres) members q (p) considering a fixed computational cost equivalent to running 50 Hres forecasts: weights when ensemble pooling is applied (grey line), optimal weights estimated from a (200,50) ensemble (black line), optimal weights estimated from a (

In

From the results presented on the right panel in

More generally, the multi-model ensemble performance under simple pooling and under optimal weighting provide complementary information. For the design of an ensemble system, the assessment of raw performance and of the potential performance after optimal weighting are both relevant. Both can be performed efficiently based on the ensemble-adjusted CRPS.

Ensemble-adjusted scores allow one to account for the ensemble-size effect on ensemble forecast performance. This paper revisits the ensemble-adjusted score concept in the context of multi-model ensemble forecasting. An unbiased estimator of the continuous ranked probability score as a function of the ensemble size is proposed and its robustness tested on dual-resolution ensemble forecasts. It is shown that adjusted scores

From a research testing perspective, the use of ensemble-size adjusted scores can represent a substantial saving in terms of the computational cost of numerical experimentation. In our illustrative example, a decrease of up to a factor of 10 in the experiment cost (by running fewer members) does not considerably deteriorate the quality of the analysis: The unbiased score estimates are highly correlated with scores computed from the full-size ensemble. It is shown that this strategy is more efficient than a strategy consisting of drastically reducing the verification sample in terms of the number of forecast start dates. The latter can be detrimental to a robust assessment of ensemble performance.

Ensemble-adjusted scores find applications in the design of multi-model ensemble systems. This is also illustrated here with a dual-resolution ensemble where an optimal combination of higher- and lower-resolution forecasts is targeted. Not only simple pooling of forecasts but also optimal weighting of the contributing models can be investigated, accounting for the ensemble-size effect. Based on linear algebra, optimal weights are directly derived from the CRPS kernel representation. Applying optimal weighting strategies helps to demonstrate the potential performance of optimally combined ensemble forecasts. The derivation of optimal weights, in a non-iterative fashion, can be applied without restriction to any combination of ensemble members.

At low experiment computational cost and with limited verification effort, it is possible to draw a full picture of expected performance in terms of CRPS as a function of the number of members from each contributing model. The optimal ensemble configuration can be easily identified for a given computational cost, with and without weighting of the members. This new type of ensemble skill investigation is coined predictive verification and aims to provide a framework for making informed decisions on future multi-model ensemble configurations.

The authors are grateful to Francisco J. Doblas-Reyes for drawing their attention to the main topic of this manuscript during the Annual Seminar held at ECMWF in September 2017.

No potential conflict of interest was reported by the authors.

In the following, the mathematical details that were omitted from the main text are provided. The required exchangeability conditions and the proof that the score estimator given these conditions is unbiased are provided first. Then, the general solution for the optimal weights is derived.

It is convenient to introduce a compact notation for the derivations that follow. The verification statistics that need to be aggregated in order to apply the score adjustment in the kernel representation of the CRPS are the mean absolute errors α_i and the mean absolute differences between members β_ij

The matrix

The CRPS kernel representation for the multi-model ensemble with weights w_i can be expressed in terms of the statistics α_i and β_ij

Similarly, the adjusted score according to (

Now, we focus on the conditions required to render the adjusted score given by Expressions (

We extend the notion of exchangeability to the multi-model setting as follows. For each of the

An ensemble composed of members from

Consider an ensemble generation method that can generate multi-model exchangeable ensembles of different sizes, say M_j

For an ensemble that satisfies the generalised multi-model exchangeability conditions given in A2.1 and A2.2, the expected values of the distance between members and the distance between members and verification depend only on the model indices

These conditions are the key ingredient for the following proof.

With the conditions expressed in

This implies that the expected adjusted spread matrix satisfies

Therefore,

So

Now, we describe how to obtain the optimal weights

Optimal weights are sought subject to the constraint

This yields linear equations for the weights and the Lagrange multiplier

Solving (A18) for the weights and inserting in (A19) yields the Lagrange multiplier

Now, the optimum weights can be computed as

If we consider the combination of only two models (

From

Consider forecasting a binary outcome

Suppose that we want to estimate

This result holds under the same generalised multi-model exchangeability conditions as the CRPS result (see
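For completeness, a sketch of the binary case (our function name; the formula follows from applying the CRPS kernel form to 0/1 members, for which the ensemble CRPS reduces to the Brier score of the member relative frequency):

```python
def brier_adjusted(members, obs, target_size):
    """Unbiased estimate of the Brier score of a target_size-member
    ensemble, computed from a small subset of binary members (0/1),
    under the same exchangeability assumptions as the CRPS adjustment.
    `obs` is the binary outcome (0 or 1); at least 2 members needed."""
    m = len(members)
    e = sum(members)  # number of members predicting the event
    mae = sum(abs(x - obs) for x in members) / m
    # Unbiased estimate of E|x - x'| for distinct members: 2e(m-e)/(m(m-1))
    pairwise = 2.0 * e * (m - e) / (m * (m - 1))
    return mae - 0.5 * (1.0 - 1.0 / target_size) * pairwise
```

When the target size equals the subset size, this reduces to the ordinary Brier score of the ensemble relative frequency, (e/m - obs)**2.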