1.

## Introduction

Consider a probabilistic weather forecast F issued at a location with climatology G for that time of the year, both expressed in the form of a cumulative probability distribution function (cdf). The forecast F is the conditional distribution based on the information at hand when the forecast is issued, while G is the corresponding unconditional distribution function of the same random variable of interest. We set our focus on the point of intersection between forecast and climatology cdfs, F and G respectively. The projection of the forecast–climate intersection onto the probability level axis is called the crossing-point forecast. The corresponding crossing-point observation is the observed event expressed in terms of its climatological frequency rather than its absolute value. This transformation, the projection onto the probability level axis, is referred to as a projection in ‘probability space’.

The assessment of the crossing-point forecast requires the design of an error function. The score proposed in this article is not (strictly speaking) a new score, but is directly derived from the diagonal score, a scoring rule recently introduced by Ben Bouallègue et al. (2018). This manuscript sheds new light on the interpretation of this score and clarifies the link with a score routinely used in the meteorological community, namely the stable and equitable error in probability space (SEEPS; Rodwell et al., 2010). SEEPS serves as a headline score for assessing and communicating trends in precipitation forecast performance at the European Centre for Medium-Range Weather Forecasts (ECMWF).

The concept of ‘score in probability space’ was first developed in the context of deterministic forecast verification as an attempt to overcome the pitfalls of traditional scores that generally discourage forecasts of extreme values, in particular for variables with skewed distributions such as precipitation (Potts et al., 1996; Ward and Folland, 1991). Rather than comparing forecast and observation in ‘measurement space’, the comparison takes place after projection into ‘probability space’. This manuscript intends to show how this concept can be formulated in a probabilistic forecasting context and applied to the verification of ensemble forecasts.

The manuscript is organised as follows: Section 2 presents the score definition, its properties and its relationship with other existing and better established scores. Section 3 presents applications in terms of crossing-point forecasts, an analysis of the score sensitivity in the context of ensemble forecast verification, as well as the concomitant challenges for the score computation. Following the Conclusion in Section 4, mathematical derivations and a scoring algorithm are detailed in the Appendices.

2.

## The error function

2.1.

### Definition

Figure 1 sets the scene. We consider the situation where we have access to the following pieces of information: a climatology known to everyone (or unconditional cdf, denoted G), a probabilistic forecast issued by a forecaster (or conditional cdf, denoted F), and the outcome of a random process (or verification, denoted y).

Fig. 1.

The necessary ingredients: a climatology cumulative probability distribution function (grey), a probabilistic forecast (blue) and a verification (red). The projections of the circle and square onto the probability space, τf and τy, are called the forecast and verification crossing-points, respectively.

We consider the following single intersection (SI) condition: two cumulative distribution functions F and G satisfy the single intersection condition if there exists one and only one f such that:

((1))
$x\ge f⇒F\left(x\right)\ge G\left(x\right)$
and
((2))
$x<f⇒F\left(x\right)\le G\left(x\right)$

In this condition, the intersection point between the forecast F and the climatology G is denoted (f,τf) with ${\tau }_{f}:=G\left(f\right),$ while the intersection between verification y and climatology is denoted (y,τy) with ${\tau }_{y}:=G\left(y\right).$ We refer to τf and τy as crossing-point forecast and crossing-point verification, respectively. The crossing-point verification τy is the climate quantile level corresponding to the observation y, while the crossing-point forecast is introduced as a new type of probabilistic forecast. Illustrations of crossing-point forecasts and corresponding verification based on synthetic and real data are provided in Fig. 1 and Section 3, respectively.

In the SI condition, τf and τy are the unique projections of the forecast F and verification y, respectively, in probability space. In this context, the question that arises is how to define an error function that can be applied to a forecast τf and a verification τy in order to assess crossing-point forecasts appropriately.

Our proposition is the following. Given a forecast τ, an observation y and the distribution G, the scoring function

((3))
${S}_{G}\left(\tau ,y\right)=\left\{\begin{array}{ccc}G{\left(y\right)}^{2}-{\tau }^{2}& \text{if}& G\left(y\right)\ge \tau \\ {\left(1-G\left(y\right)\right)}^{2}-{\left(1-\tau \right)}^{2}& \text{if}& G\left(y\right)<\tau \end{array}\right.$
is a consistent scoring function for the crossing-point functional ${T}_{G}:F\to {\tau }_{f}.$

For convenience, we denote the corresponding score simply by S, with $S\left({\tau }_{f},{\tau }_{y}\right)$ the result of the comparison between a forecast τf and a verification τy following Eq. (3). By contrast, we also show why more simplistic error functions, such as a naive score defined as the squared difference ${\left({\tau }_{f}-{\tau }_{y}\right)}^{2},$ are not appropriate for the verification of probabilistic forecasts.
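As a minimal sketch, Eq. (3) can be written directly in terms of the two crossing-points, with τy standing for G(y). The function names `score_s` and `naive_score` are illustrative choices and do not come from the original work.

```python
import numpy as np


def score_s(tau_f, tau_y):
    """Score S of Eq. (3), with tau_y = G(y); inputs are quantile levels in [0, 1]."""
    tau_f = np.asarray(tau_f, dtype=float)
    tau_y = np.asarray(tau_y, dtype=float)
    under = tau_y**2 - tau_f**2                    # branch G(y) >= tau
    over = (1.0 - tau_y)**2 - (1.0 - tau_f)**2     # branch G(y) < tau
    return np.where(tau_y >= tau_f, under, over)


def naive_score(tau_f, tau_y):
    """Naive squared difference in probability space, for comparison only."""
    return (np.asarray(tau_f, dtype=float) - np.asarray(tau_y, dtype=float))**2


# Same absolute error (0.05) on either side of a verification at tau_y = 0.1
s_under = float(score_s(0.05, 0.1))   # forecast below the verification
s_over = float(score_s(0.15, 0.1))    # forecast above the verification
```

With τy = 0.1, the forecast at 0.15 receives a larger penalty than the forecast at 0.05, whereas the naive score assigns the same penalty to both.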

2.2.

### Illustrations

But first, let’s get familiar with the error function defined in Eq. (3). Figure 2 illustrates how $S\left({\tau }_{f},{\tau }_{y}\right)$ evolves as a function of the relationship between crossing-point forecast τf and crossing-point verification τy. In Fig. 2a, τy is fixed and τf varies in the interval [0,1]. This plot allows us to better visualize how a forecast is penalized given a verification. For example, when ${\tau }_{y}=0.5,$ the penalty of the score S increases rapidly with departures of τf from the verification (solid line). In other cases, when ${\tau }_{y}<0.5$ for example, we see how under-forecasting (${\tau }_{f}<{\tau }_{y}$) is less penalized than over-forecasting (${\tau }_{f}>{\tau }_{y}$) in this type of situation (dotted and dashed lines). By symmetry, the opposite is true when ${\tau }_{y}>0.5$ (not shown).

Fig. 2.

Getting familiar with the error function. (a) S as a function of τf for three different values of τy: 0.1 (dashed line), 0.3 (dotted line) and 0.5 (solid line). (b) S as a function of τy for three different values of τf: 0.1 (dashed line), 0.3 (dotted line) and 0.5 (solid line).

In Fig. 2b, we see how S varies as a function of τy in [0,1] for given τf. For the interpretation of this plot, we can recall that, by definition, the verification τy is uniformly distributed on [0,1]. One interesting aspect is that the integral of the error function S over all ${\tau }_{y}\in$ [0,1] is independent of the fixed forecast. When the same probabilistic forecast is issued every time (τf is a constant), the mean score S over all possible cases does not depend on the forecast. This score property is called equitability (Gandin and Murphy, 1992). So S is said to be equitable because the expected value of the score S is the same for all non-informative (constant) forecasts (see Appendix A for a formal illustration).
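The equitability property can be checked numerically. The sketch below averages S over a fine, evenly spaced grid of τy values (a stand-in for the uniform distribution) for a few constant forecasts; the grid size and the chosen forecast values are arbitrary assumptions.

```python
import numpy as np


def score_s(tau_f, tau_y):
    """Score S of Eq. (3), with tau_y = G(y)."""
    return np.where(tau_y >= tau_f,
                    tau_y**2 - tau_f**2,
                    (1.0 - tau_y)**2 - (1.0 - tau_f)**2)


# Midpoints of 100000 equal-width bins on [0, 1]: approximates tau_y ~ U(0, 1)
n_bins = 100_000
tau_y = (np.arange(n_bins) + 0.5) / n_bins

# Mean score of several constant (non-informative) crossing-point forecasts
constant_forecasts = (0.0, 0.1, 0.3, 0.5, 0.9, 1.0)
mean_scores = {tau_f: float(score_s(tau_f, tau_y).mean())
               for tau_f in constant_forecasts}
```

Integrating Eq. (3) over τy gives 1/3 whatever the value of τf, and every entry of `mean_scores` reproduces this value up to the discretisation error.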

A second set of illustrations is provided in Fig. 3. This time, the aim is to illustrate the score sensitivity to predefined typical forecast discrepancies in a controlled environment. For this purpose, the simple toy-model proposed by Lerch et al. (2017) is used. Observations and forecasts are drawn from the same normal distribution:

((4))
$\mathcal{N}\left(\mu ,{\alpha }^{2}\right) \text{with} \mu \sim \mathcal{N}\left(0,1-{\alpha }^{2}\right), \alpha \in \left(0,1\right)$
where large (small) values of the parameter α indicate low (high) predictability. The climatology is the normal distribution $\mathcal{N}\left(0,1\right).$ For now, and without loss of generality, we set $\alpha =0.5.$ So far, the observation is statistically indistinguishable from a draw of the forecast distribution.
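A short simulation makes the toy model of Eq. (4) concrete. It is only a sketch: the seed and sample size are arbitrary assumptions. Each case draws a mean μ and then an observation from the corresponding forecast distribution; pooling all observations should recover the climatology $\mathcal{N}\left(0,1\right),$ since the variances $1-{\alpha }^{2}$ and ${\alpha }^{2}$ add up to 1.

```python
import numpy as np

rng = np.random.default_rng(42)    # arbitrary seed, for reproducibility only
alpha = 0.5                        # predictability parameter used in the text
n_cases = 200_000                  # arbitrary number of synthetic cases

# One forecast mean per case: mu ~ N(0, 1 - alpha^2)
mu = rng.normal(0.0, np.sqrt(1.0 - alpha**2), size=n_cases)

# One observation per case, drawn from the forecast distribution N(mu, alpha^2)
y = rng.normal(mu, alpha)

# Pooled over all cases, the observations should follow the climatology N(0, 1)
clim_mean = float(y.mean())
clim_var = float(y.var())
```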

Fig. 3.

Comparing two scores in ‘probability space’ using a toy-model while varying the forecast bias (a) and the multiplicative spread factor σ (b). Normalised expected value of the score S (solid line) is compared with normalised expected value of a naive score in probability space computed as ${\left({\tau }_{\mathrm{f}}-{\tau }_{\mathrm{y}}\right)}^{2}$ (dotted line).

Two experiments are performed in order to make the probabilistic forecast deviate from perfect calibration. First, synthetic data sets are generated adding a bias b, varying in $\left[-0.5,3\right],$ to the mean forecast. Second, data are generated by controlling the level of forecast variance with a multiplicative factor σ. The spread multiplicative factor σ is varied in $\left[0.5,2\right].$ So, the forecast distribution follows a distribution $\mathcal{N}\left(b,{\sigma }^{2}\right),$ where perfect calibration corresponds to b = 0 and σ = 1 in each experiment. Normalised scores as a function of b and σ are plotted in Fig. 3a,b, respectively.

In Fig. 3, we compare the forecast performance, as b and σ vary, considering two scores: S and the naive score (squared difference) in probability space. For both scores, the minimum is reached when b = 0. A saturation for large biases is also visible with S reaching a plateau when b > 2 (the bias exceeds two times the climate standard deviation). However, when varying the forecast variance, S reaches its minimum when σ = 1, that is when the forecast is perfectly calibrated, while the naive score has its minimum for the lowest tested value of σ. So, on the one hand, this plot clearly illustrates that the naive score favours forecasts that are not well-calibrated and so can be deemed inappropriate for forecast comparison. On the other hand, this plot also hints that S proceeds from a proper scoring rule. Further discussion on the properties of S follows.

2.3.

### Interpretation

The choice of the error function in Eq. (3) is not fortuitous but results from a derivation of the diagonal score, a proper score introduced by Ben Bouallègue et al. (2018). We show in Appendix B that the diagonal score expressed in terms of τf and τy is equivalent, up to a factor 2, to the score S when the crossing-point forecast exists and is unique. In the original study cited above, the diagonal score is interpreted as a score ‘tailored to vulnerable users, […] those more exposed to stress as the weather event severity increases’. More formally, it is defined as the integral of the diagonal elementary score over all quantile levels.

As a new result, the diagonal elementary score is established as a proper score for an interval probabilistic forecast (Mitchell and Ferro, 2017). An example of an interval probabilistic forecast is ‘there is [10–20]% chance of rain tomorrow’. In Fig. 1, it is clear why this type of probabilistic forecast is relevant here. For any event (defined as the variable of interest exceeding a threshold), a probability forecast is reduced to an interval probabilistic forecast. Focusing on the crossing-point, the only information retained from a probability forecast, say ${p}_{f}=1-F\left(t\right),$ where t is the given threshold defining an exceedance event, is whether this probability pf is greater or lower than the climatological probability of occurrence denoted p0 (with ${p}_{0}=1-G\left(t\right)$). So, for any binary event, we focus on a forecast which takes the form of a probability interval: [0,p0] or (p0,1]. In Appendix C, we show that a proper score for such an interval probabilistic forecast is the diagonal elementary score characterised by the error matrix in Table 1.

The error matrix in Table 1 indicates the asymmetrical penalties associated with the diagonal elementary score. This error matrix is key to understanding the relationship between the diagonal score and decision-making based on a standard cost/loss model (a detailed discussion on that point can be found in the work by Ben Bouallègue et al., 2018). This table also explicitly shows the relationship between S and SEEPS. Table 1 is equivalent, up to a constant factor ${p}_{0}\left(1-{p}_{0}\right),$ to Table X in the work by Rodwell et al. (2010) which shows the ‘two-category equitable error matrix for a score that SEEPS can be built from’. While SEEPS focuses on three categories (dry weather, light rain and heavy rain), the score S is built on a finer (possibly exhaustive) description of the climate distribution. Sensitivity of the score S to the climatology definition in terms of quantiles is discussed in Section 3.3.

3.

## Applications

3.1.

### Crossing-point forecasts

A crossing-point forecast is derived from the comparison of a conditional with an unconditional probability distribution. The two cdfs are summarised into a single number, one characteristic of a probabilistic forecast. This number provides information about the forecast level of risk with respect to the climatological level of risk. There is no focus on one particular event (for example, the risk of temperature exceeding 30 °C) or on a particular climate quantile (for example, the 95th percentile), but rather a scanning of all possible events/quantile levels. In the SI condition, the crossing-point corresponds to the pivotal point where the probability forecast for an exceedance event switches from higher to lower than climatological frequency. So, in plain words, the crossing-point forecast is associated with the worst-case scenario which is more likely based on the information at hand than without, that is, the largest threshold such that the corresponding exceedance event is assigned an above-climatological probability by the current probabilistic forecast. By convention, we express the crossing-point forecast in terms of a quantile level (a number between 0 and 1) but it could also be communicated in terms of a return-period (Prates and Buizza, 2011) or a quantile value (Hawkins and Kochar, 1991).

In Appendix D, we argue that the scoring function SG is consistent for the crossing-point forecast, and the diagonal score is the corresponding proper score. The idea of consistency between a score and a forecast directive (or functional) relates to the concept of elicitability (Gneiting, 2011). A statistical functional is called elicitable if there is a ‘scoring function or loss function such that the correct forecast of the functional is the unique minimizer of the expected score’ (Fissler et al., 2019). For example, the distributional mean is elicitable with the squared error as a consistent loss function. These mathematical tools and related concepts help to draw robust conclusions from the comparison of competing forecasts in a probabilistic framework.
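Consistency can be illustrated with a small Monte Carlo experiment. This is a sketch under assumptions that are not taken from the original study: a hypothetical forecast $\mathcal{N}\left(0.5,{0.5}^{2}\right),$ a climatology $\mathcal{N}\left(0,1\right),$ a crossing point located numerically on a grid, and candidate forecasts restricted to the levels 0.01, …, 0.99. The naive squared difference is included for contrast.

```python
import math
import numpy as np

erf = np.vectorize(math.erf)


def normal_cdf(x, mean=0.0, sd=1.0):
    """Normal cdf evaluated through the error function."""
    z = (np.asarray(x, dtype=float) - mean) / (sd * math.sqrt(2.0))
    return 0.5 * (1.0 + erf(z))


def score_s(tau_f, tau_y):
    """Score S of Eq. (3), with tau_y = G(y)."""
    return np.where(tau_y >= tau_f,
                    tau_y**2 - tau_f**2,
                    (1.0 - tau_y)**2 - (1.0 - tau_f)**2)


m, s = 0.5, 0.5    # hypothetical forecast N(m, s^2); climatology is N(0, 1)

# Crossing point located numerically: first grid point where F >= G
x = np.linspace(-5.0, 5.0, 10001)
first = np.argmax(normal_cdf(x, m, s) >= normal_cdf(x))
tau_f = float(normal_cdf(x[first]))

# Observations drawn from the forecast itself, then projected with G
rng = np.random.default_rng(1)    # arbitrary seed
tau_y = normal_cdf(rng.normal(m, s, size=100_000))

# Mean score of every candidate crossing-point forecast
candidates = np.arange(1, 100) / 100.0
mean_s = np.array([score_s(t, tau_y).mean() for t in candidates])
mean_naive = np.array([((t - tau_y)**2).mean() for t in candidates])
best_s = float(candidates[mean_s.argmin()])
best_naive = float(candidates[mean_naive.argmin()])
```

In this setup, the mean of S is minimised close to the crossing-point level (about 0.84), while the mean naive score is minimised close to the mean of τy (about 0.67), which is a different quantity.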

We illustrate the concept of crossing-point forecast (and its consistent assessment) with an example based on the operational ensemble prediction system (ENS) run at ECMWF. More specifically, we analysed a 2 m temperature forecast, in the form of model grid-box averages, considering instantaneous values at 12 UTC. The spatial grid resolution of the ensemble forecast is approximately 18 km, but the forecast is here interpolated on a 0.25° × 0.25° grid. The verification corresponds to the analysis on the validity date. The climatology is a model-climatology derived from reforecasts. Section 3.2 details how crossing-point forecasts are generated.

Figure 4 compares qualitatively (by visual inspection) and quantitatively (by computing S) the forecast and verification crossing-points of 2 m temperature valid on 1 June 2020. The focus is on the European-North African domain. In Fig. 4a, the western and northern parts of the domain appear at ‘higher risk than normal’ for large positive anomalies (crossing-point close or equal to 1), while Southern and Central Europe are dominated by a signal of ‘higher risk than normal’ for large cold anomalies (crossing-point close or equal to 0). By comparison with Fig. 4b, we appreciate the overall agreement in terms of spatial structures between crossing-point forecast and verification, but we also note that the verification map displays values more equally distributed over the interval [0,1] than the forecast map. Because S is a scoring function, it can be computed for each pair of forecast and verification (each model grid-point). As shown in Fig. 4c, large-scale poor forecast performance affects only North-Eastern Europe in this example.

Fig. 4.

(a) Two metre temperature crossing-point forecast derived from ENS at day 5, (b) crossing-point observation on 1 June 2020 and (c) corresponding score S.

3.2.

### Computation

When τf and τy are known, the computation of the score S is straightforward. However, this is generally not the case, and, to the best of the author’s knowledge, there is no closed-form expression for the intersection point between two cdfs, even in the case of well-defined distributions such as normal distributions. In this context, two different approaches can be followed: (1) a pragmatic approach to find the crossing-point forecast τf, or (2) the direct computation of the diagonal score in its original formulation. Both approaches are discussed below.

The first option is to be pragmatic and to estimate τf and compute $S\left({\tau }_{f},{\tau }_{y}\right)$ based on Eq. (3). Consider, for example, an ensemble forecast with members ${e}_{1},\dots ,{e}_{M}$ and a climatology defined by unique quantiles ${q}_{1},\dots ,{q}_{{n}_{q}}$ at increasing levels ${\tau }_{1},\dots ,{\tau }_{{n}_{q}}\in \left(0,1\right).$ The forecast probability of exceedance ${p}_{i},i=1,\dots ,{n}_{q},$ is derived by counting the number of ensemble members exceeding the respective quantile qi, that is, ${p}_{i}=\frac{1}{M}{\sum }_{k=1}^{M}\mathbb{I}\left[{e}_{k}>{q}_{i}\right].$ Then the crossing-point forecast is found by comparing the exceedance probabilities pi to the climate probability levels $1-{\tau }_{i}.$ Let $j\in \left\{1,\dots ,{n}_{q}+1\right\}$ be the smallest index i such that ${p}_{i}\le 1-{\tau }_{i},$ with $j={n}_{q}+1$ if no such index exists. The crossing-point forecast τf is set equal to $\frac{1}{2}\left({\tau }_{j}+{\tau }_{j-1}\right),$ where ${\tau }_{0}=0$ and ${\tau }_{{n}_{q}+1}=1$ are used at the boundaries.
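The counting procedure described above can be sketched as follows. The function name `crossing_point_forecast` and the three-quantile climatology in the usage lines are hypothetical and serve only as an illustration.

```python
import numpy as np


def crossing_point_forecast(members, quantiles, levels):
    """Estimate tau_f from ensemble members and a quantile-based climatology.

    members   : ensemble members e_1..e_M
    quantiles : unique climate quantiles q_1..q_nq (increasing)
    levels    : corresponding levels tau_1..tau_nq in (0, 1)
    """
    members = np.asarray(members, dtype=float)
    quantiles = np.asarray(quantiles, dtype=float)
    levels = np.asarray(levels, dtype=float)

    # p_i: fraction of members strictly exceeding q_i
    p = (members[:, None] > quantiles[None, :]).mean(axis=0)

    # Levels padded with tau_0 = 0 and tau_{nq+1} = 1 for the boundary cases
    padded = np.concatenate(([0.0], levels, [1.0]))

    # j: smallest 1-based index with p_j <= 1 - tau_j, or nq + 1 if none exists
    hits = np.nonzero(p <= 1.0 - levels)[0]
    j = int(hits[0]) + 1 if hits.size else levels.size + 1

    return 0.5 * (padded[j] + padded[j - 1])


# Hypothetical climatology with three quantiles
clim_q = [-1.0, 0.0, 1.0]
clim_tau = [0.25, 0.5, 0.75]
tau_warm = crossing_point_forecast([2.0, 2.0, 2.0, 2.0], clim_q, clim_tau)
tau_cold = crossing_point_forecast([-2.0, -2.0, -2.0, -2.0], clim_q, clim_tau)
tau_mid = crossing_point_forecast([-0.5, 0.5, 0.5, 0.5], clim_q, clim_tau)
```

In the first two cases no crossing is found within the climate quantiles, so the estimate falls in the outermost interval (0.875 and 0.125, respectively); the third case crosses between the levels 0.5 and 0.75 and yields 0.625.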

This simple approach is followed to produce the illustrative example in Fig. 4. In our examples, M = 50 and τi takes values 1%, 2%,…, 98% and 99%. This approach can be refined by interpolating around the intersection point as illustrated in Fig. 5a. In addition, extrapolation could be performed using extreme value theory for a finer assessment of crossing-points close to the distribution tails. The application of this latter step goes beyond the scope of this study.

Fig. 5.

Same as Figure 1 but based on a real data set: 2 m temperature ENS forecasts at day 5 valid at three different stations illustrating: (a) the single intersection condition, (b) a case of multiple (2) intersections and (c) a case with zero intersections. The climatology is site-specific, based on a 30-year observation record covering the period 1980–2009.

The second option consists in directly computing the diagonal score which does not require the estimation of τf. In order to facilitate the application of this approach for the verification of ensemble forecasts, an algorithm is provided in Appendix E. Besides the ensemble forecast and a verifying observation, the diagonal score computation requires as input a climatology defined by a set of quantiles. The score is computed as the mean diagonal elementary score over all unique climate quantile levels. Climatological distributions of weather variables such as precipitation are censored distributions and the first two loops of the algorithm are dedicated to identifying the set of unique quantile values within the variable bounds. In Section 3.3, we discuss the sensitivity of the score with respect to the ensemble size as well as the climatology definition.
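The averaging step can be sketched as below. This is not the algorithm of Appendix E, and it rests on two assumptions: the climate quantile values are strictly increasing (no censoring, so no de-duplication is needed), and the elementary penalties are τi for a missed exceedance and 1 − τi for a false alarm. These penalties are inferred from Eq. (3), since integrating them over the levels between τf and τy gives S/2, and they should be checked against Table 1. The function name is hypothetical.

```python
import numpy as np


def mean_elementary_score(members, y, quantiles, levels):
    """Mean elementary score over climate quantile levels (illustrative sketch)."""
    members = np.asarray(members, dtype=float)
    quantiles = np.asarray(quantiles, dtype=float)
    levels = np.asarray(levels, dtype=float)

    # Forecast exceedance probability for each climate quantile
    p = (members[:, None] > quantiles[None, :]).mean(axis=0)

    # Interval forecast: is the exceedance probability above climatology?
    forecast_above = p > 1.0 - levels
    # Outcome: did the exceedance event occur?
    event = y > quantiles

    miss = event & ~forecast_above          # assumed penalty: tau_i
    false_alarm = ~event & forecast_above   # assumed penalty: 1 - tau_i
    penalties = np.where(miss, levels, 0.0) + np.where(false_alarm, 1.0 - levels, 0.0)
    return float(penalties.mean())


# Hypothetical climatology with three quantiles
clim_q = [-1.0, 0.0, 1.0]
clim_tau = [0.25, 0.5, 0.75]
score_hit = mean_elementary_score([2.0, 2.0, 2.0, 2.0], 2.0, clim_q, clim_tau)
score_false = mean_elementary_score([2.0, 2.0, 2.0, 2.0], -2.0, clim_q, clim_tau)
score_miss = mean_elementary_score([-2.0, -2.0, -2.0, -2.0], 2.0, clim_q, clim_tau)
```

With only three levels the result is a coarse approximation of the integral over all quantile levels; a finer climatology brings the mean closer to S/2 in the SI condition.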

The relationship between S and the diagonal score holds only in the SI condition. However, in practice, multiple intersection points can coexist for a single forecast–climatology pair as illustrated in Fig. 5b. How often this situation is encountered in real applications is examined with the help of two distinct datasets, one dealing with temperature at 2 m above the ground and one with daily precipitation. Over Europe, focusing on June 2018, pairs of ensemble forecast and climate distributions are analysed at approximately 1500 synoptic stations: for each pair, the number of intersections between the forecast and climate distributions is counted. The distribution of the number of intersections per pair is displayed in Fig. 6.
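The text does not specify how intersections were counted. One plausible sketch, given an ensemble and a quantile-based climatology, is to count the sign changes of the difference between forecast and climatological exceedance probabilities across the climate quantiles. The function name, the treatment of exact ties (ignored) and the example values are assumptions.

```python
import numpy as np


def count_intersections(members, quantiles, levels):
    """Count sign changes of p_i - (1 - tau_i) across climate quantiles."""
    members = np.asarray(members, dtype=float)
    quantiles = np.asarray(quantiles, dtype=float)
    levels = np.asarray(levels, dtype=float)

    # Forecast exceedance probability minus climatological exceedance probability
    p = (members[:, None] > quantiles[None, :]).mean(axis=0)
    signs = np.sign(p - (1.0 - levels))

    # Exact ties carry no sign information and are ignored (an assumption)
    signs = signs[signs != 0]
    return int(np.count_nonzero(signs[1:] != signs[:-1]))


# Hypothetical three-quantile climatology
clim_q = [-1.0, 0.0, 1.0]
clim_tau = [0.25, 0.5, 0.75]
n_zero = count_intersections([2.0, 2.0, 2.0, 2.0], clim_q, clim_tau)
n_single = count_intersections([-0.5, 0.5, 0.5, 0.5], clim_q, clim_tau)

# Hypothetical heavy-tailed ensemble against a four-quantile climatology
n_multi = count_intersections([-3.0, -3.0, 0.0, 0.0, 0.0, 0.0, 3.0, 3.0],
                              [-2.0, -1.0, 1.0, 2.0], [0.2, 0.4, 0.6, 0.8])
```

The three examples return 0, 1 and 3 intersections, respectively.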

Fig. 6.

Distribution of intersection points between each pair of forecast and climate probability distributions. Results for (a) 2 m temperature, and (b) daily precipitation at day 2, 5, and 10, in July 2018, at station level over Europe.

The prevalence of the SI condition in both the temperature and precipitation datasets is illustrated in Fig. 6a,b, respectively. Multiple intersections represent 4% (5%), 11% (13%) and 22% (28%) of all cases at day 2, 5 and 10, respectively, for 2 m temperature (daily precipitation). Cases with zero intersections (0 category) are the limit cases of the SI condition: the crossing-point is defined and can in principle take a value in $\left[0,{\tau }_{1}\right)$ or $\left({\tau }_{{n}_{q}},1\right].$ An example of a case with zero intersections is provided in Fig. 5c. Based on the results in Fig. 6, we infer that the SI condition is more often associated with forecasts at shorter lead times, that is, forecasts with a sharper probability distribution. At longer lead times, multiple intersections are more frequent. The number of cases with zero intersections also rises with the forecast horizon, leading to a larger number of crossing-points taking value 0 or 1. For longer lead times, the forecast can become similar to climatology and a single crossing-point is difficult to identify in that case. In terms of score, S converges to the score value for non-informative forecasts, that is random or constant crossing-point forecasts (see the discussion on equitability in Section 2.2).

3.3.

### Sensitivity to ensemble size and climatology definition

We focus now on the case where the forecast F and the climatology G are empirical distribution functions. For example, F can be derived from an ensemble forecast and G based on a set of quantiles. Illustrations in Figs. 4 and 5 are based on an ensemble forecast with 50 members and a climatology defined by 99 quantile levels (1%, 2%,…,98%, 99%). We recall that the ensemble size is denoted M and the number of quantiles is denoted nq.

The score sensitivity to M and nq is analysed using both the score S and the diagonal score algorithm in Appendix E. Figure 7 shows the diagonal score as a function of the size of the ensemble forecast for three different climate representations defined by equidistant quantiles on [0,1] with intervals $\frac{1}{{n}_{q}},$ with nq taking the values 5, 10 and 50. Results obtained with S are shown only for nq = 50 for the sake of plot readability. More precisely, Fig. 7a,b compare results for 2 m temperature and daily precipitation ENS forecasts (European domain, day 5 in the lead time, Summer 2018), respectively. The respective scores when M = 50 are used as reference. As a consequence, all curves converge to 1 for M = 50.

Fig. 7.

Score sensitivity to ensemble size and climatology definition: score as a function of the number of ensemble members M (SM) relative to the score when M = 50 (S50) for different climate definitions. Results for scores computed with the diagonal score are shown in black, results based on Eq. (3) in grey, for 2 m temperature (a) and daily precipitation (b).

The ensemble size is a critical parameter in the design of an ensemble system (Leutbecher, 2019). Figure 7 shows the positive impact of increasing the ensemble size on the forecast performance. No qualitative differences appear between the results obtained with S and the ones obtained with the diagonal score. Comparing 2 m temperature and daily precipitation plots, the ensemble size has a smaller impact on the scores in the former case. The score also converges more rapidly with increasing M values for 2 m temperature forecasts. In addition, we note that more quantile levels in the climate definition (e.g. nq = 50 rather than nq = 5) allow a finer estimation of the ensemble size effect on the score. Figure 7 clearly illustrates that results might differ as a function of the score computation approach and setup, i.e. the number nq of climate quantile levels itself. Therefore, it is important to communicate this information along with the forecast performance results.

4.

## Conclusion

The crossing-point forecast is defined by the intersection point between a forecast cumulative distribution and the corresponding climatology. The crossing-point is a summary of a probabilistic forecast into a single number conveying information about the worst-case scenario which is more likely in the forecast than in the climatology. Is the predicted chance of suffering a loss, due to the occurrence of an exceedance event, higher than that event’s climatological frequency? The crossing-point forecast indicates the limit case for which the answer is positive. In weather forecasting, this type of information could be highly relevant for vulnerable users and more generally for users with an interest in high-impact events. Further investigations on the crossing-point forecast interpretation and potential applications are encouraged.

A scoring function consistent for the crossing-point forecast exists. A simple error function that applies to the forecast and observed crossing-points is formulated. The resulting score is proper and equitable, which makes its application appealing for the comparison of competing forecasts. The link with other scores and concepts is also highlighted. The proposed score is equivalent to the diagonal score in the case of the single intersection condition (when a unique forecast crossing-point exists). Moreover, this work helps to generalise the concept of ‘score in probability space’ to the context of ensemble forecast verification.

In practice, one can encounter situations where multiple crossing-points coexist in a single forecast or where forecast and climate distributions (partly) overlap. Such situations are common in ensemble weather forecasting. Suggestions on how to tackle such practical challenges are provided. In addition, the analysis of the score sensitivity to ensemble size and climatology definition (the number of climate quantiles used) indicates the clear benefits of a finer representation of both the ensemble and the climate distributions. Finally, an algorithm is provided in order to foster verification applications based on the diagonal score. A systematic comparison with other proper scoring rules for the evaluation and ranking of ensemble forecasts is also encouraged as future work.