## Introduction

Despite the sophistication of current operational numerical weather prediction systems, thunderstorms remain relatively unpredictable at fine scales beyond a few hours (Clark et al., **2009**). Most modern numerical atmospheric models can simulate key physical features of deep convective systems that produce thunderstorms, either by implicitly modelling subgrid convection in large scale models, or by explicitly resolving 3D convective clouds in non-hydrostatic models at kilometric resolutions. A discussion of the merits of both approaches is provided in Weisman et al. (**2008**). Beside the limited realism of numerical models, thunderstorm prediction is hampered by the usually poor predictability of deep convective clouds (Walser et al., **2004**): rapid error growth in the simulation of convective systems can lead to large uncertainties in their location, timing and intensity. Thunderstorm predictability has been shown to depend on the meteorological context, for instance ‘air mass’ (i.e. weakly forced) convection tends to be less predictable than synoptically forced events (Keil et al., **2014**). Sobash et al. (**2011**) explored storm risk ensemble predictions using spatial smoothing to account for location errors. Several authors have proposed lightning risk diagnostics for numerical model output, but the published results have so far been restricted to relatively large spatial and temporal scales due to the high forecast uncertainty (e.g. Casati and Wilson, **2007**; Schmeits et al., **2008**; Yair et al., **2010**; Collins and Tissot, **2015**; Gijben et al., **2017**; Simon et al., **2018**).

Ensemble prediction can help users interpret highly uncertain weather forecasts (Richardson, **2000**; Zhu et al., **2002**). Ensembles simulate the real-time propagation of uncertainties in the prediction process: pseudo-random error structures called perturbations are injected into the numerical weather prediction systems. Perturbations include some representation of uncertainties in the initial conditions (e.g. Descamps and Talagrand, **2007**) and in the model design (see review in Leutbecher et al., **2017**). Using a perturbation sample, a set of forecasts called ensemble members is computed in real time to simulate the probability distribution of uncertainties in the forecast products. The size of current operational ensembles (typically 10–50 members) is a compromise between model realism and statistical accuracy under the constraint of affordable computing cost. This size is arguably much smaller than the ensemble size needed to properly sample the space spanned by forecast errors (Leutbecher, **2018**). In some applications, the ensemble size can be increased by including older predictions into the product generation (Lu et al., **2007**; Osinski and Bouttier, **2018**).

In single-model ensembles, the implementation of perturbation schemes can be constrained by the architecture of the numerical models and data assimilations used. A possible workaround is to mix multiple physics packages, multiple models or multiple ensembles in the member generation (e.g. Ziehmann, **2000**; Clark et al., **2008**; Park et al., **2008**; Hagedorn et al., **2012**). These approaches have been shown to provide benefits, although they could possibly be superseded one day by single-model ensembles thanks to ongoing research to improve perturbation schemes.

Ensembles are limited by our ability to represent error sources in the initial conditions and model design, because the perturbation setup is always constrained by the architecture of the numerical models and data assimilations used. This issue can be somewhat alleviated by mixing multiple physics packages, multiple models or multiple ensembles in the member generation (e.g. Ziehmann, **2000**; Clark et al., **2008**; Park et al., **2008**; Hagedorn et al., **2012**).

An important application of ensemble prediction is point probabilistic forecasts, i.e. real time estimation of the likelihood of future meteorological events, at predefined locations in space and time. These forecasts can be verified a posteriori using a variety of statistical measures such as reliability and resolution (Jolliffe and Stephenson, **2011**). Ultimately, the usefulness of probabilistic forecasts depends on the user. Here, we will focus on binary forecasts of a particular meteorological event: thunderstorm occurrence. Its quality will be measured using the frequency of non-detections and false alarms, as summarized by the ROC diagram (relative operating characteristic, Mason and Graham, **1999**) averaged over many point forecasts. Other, more user-specific scores could be used, such as the potential economic value (Richardson, **2000**).

Various ensemble post-processing techniques have been proposed to improve probabilistic forecasts using historical data: dressing (Bröcker and Smith, **2008**), Bayesian model averaging (Raftery et al., **2005**; Berrocal et al., **2007**), ensemble model output statistics (EMOS, Gneiting et al., **2005**), random forest quantiles (Taillardat et al., **2016**), among others. Simon et al. (**2018**) presented a statistical thunderstorm forecasting technique based on a generalized additive model (GAM) algorithm. These techniques tend to require large homogeneous learning datasets, which may be difficult to obtain for rare events. In most weather centres, model upgrades occur frequently (at least annually), in which case the learning datasets have to be updated using potentially expensive reforecast techniques (Hamill et al., **2006**, **2008**). This can be problematic in a production setting that uses model output from several weather centres, each upgrading their own systems on independent schedules.

This paper presents an original technique for point probabilistic thunderstorm forecasts,. It deals with the above issues by combining multiple ensembles with a simple calibration technique. Our goal is to check if the end user value of ensembles can be improved by calibrating their output, using techniques that require little learning data. We will combine the following post-processing operators: each is relatively well known, but they have to our knowledge not yet been integrated as a single algorithm:

- a neighbourhood operator that allows for some spatial tolerance in the use of model-generated thunderstorms, following the ideas of Theis et al. (
**2005**), Berrocal et al. (**2007**) and Schwartz and Sobash (**2017**); - a kernel dressing in order to smooth the ensemble probability distribution function (PDF) in parameter space (Berrocal et al.,
**2007**; Bröcker and Smith,**2008**); - a calibration of the model diagnostic used to define the occurrence of thunderstorms: we use a much simplified version of existing calibration techniques (e.g. Gneiting et al.,
**2005**; Ben Bouallègue,**2013**; Simon et al.,**2018**), which can be understood as a bias correction of the modelled thunderstorm intensity; - an optimal weighting of the ensembles that are combined to produce the thunderstorm forecasts, which brings model diversity into the e190924thunprob-revised.pdfnd products (see e.g. Hagedorn et al.,
**2012**; Beck et al.,**2016**).

As explained below, these operators involve tuning parameters that will be optimized in terms of forecast error statistics (i.e. thunderstorm non-detection and false alarms rates), while requiring that forecast probabilities be reasonably well calibrated. We will demonstrate the performance of the results on several combinations of ensemble prediction systems that cover a wide range of forecast horizons, from a few hours to several days. The paper is organised as follows: the observations and ensemble forecasts are presented in Section 2. The parameter tuning procedure is explained in Section 3, and its variability is explored in Section 4. The performance of the optimized forecasts is studied in Section 5, before the discussion and concluding remarks in Section 6.

## Observation and forecast data

### Thunderstorm observations

In this paper, we regard thunderstorm occurrence as a binary field, without consideration of event intensity. Different users may be sensitive to different aspects of thunderstorm activity, such as heavy accumulation, hail, gusts, cloud-to-ground lightning strikes, etc. Thus, there are several possible ways of defining thunderstorm observations. In this work we use lightning sensors and radar reflectivities to define thunderstorms observations, because these measurements are readily available over our domain of interest, in Western Europe. The lightning data is provided by the Météorage company, based on LS700X Vaisala sensors, with some filtering to eliminate non-meteorological signals. After data processing, the reported detection rate in this area is of the order of 90% for cloud-to-ground strikes, and 50% for intracloud strikes (Pédeboy and Schulz, **2014**). The radar data is provided by the Météo-France PANTHERE network of ground-based polarimetric radars, which is designed to provide good coverage over mainland France and Corsica. Depending on their development stage, thunderstorm clouds can affect much larger zones than their electrically active areas; conversely, thunderstorms can have significant lightning activity but little precipitation. Some mostly produce intracloud flashes that are imperfectly detected by the lightning sensors. We combine lightning and radar data as explained below, in order to minimize the impact of these complexities on the verification. In regions without these observing systems, satellite based data could be used, such as cloud-top diagnostics (Karagiannidis et al., **2019**) or optical lightning sensors (Goodman et al., **2013**).

We are interested in predicting thunderstorm impacts at the hourly scale, with a resolution of a few kilometres: a thunderstorm will be deemed to occur if a lightning strike is observed within 10 km and 30 min of the observation, or the maximum radar reflectivity in this neighbourhood is greater than 35dBZ (this threshold is commonly used for radar thunderstorm detection; see Li et al. (**2012**) and references therein). This criterion is applied at all full hours and on each point of a regular latitude-longitude grid of 20 km mesh, in order to generate a set of pseudo-observations. Forecast verification will be performed on the domain represented in Fig. 1. There are 1748 pseudo-observations at each hour i.e. about 1.3 million data points per month. The period studied here is June to August 2018, during which thunderstorm activity was observed at 2% of the data points. An example of thunderstorm pseudo-observation coverage is presented in Fig. 1.

### Numerical model forecasts

We investigate the predictive value of four ensemble prediction systems, selected for a typical user that requires forecasts over Europe in the early morning (around 04UTC). Their timings are summarized in Fig. 2.

The lagged Arome ensemble (named AROLAG here) is a pseudo-ensemble built by taking the three most recent, deterministic Arome-France forecasts of the real-time operational Météo-France production. The Arome-France model is depicted in Seity et al. (**2011**), with an horizontal resolution of 1.3 km in 2018, over a slightly larger geographical domain than depicted in Fig. 1.

Each day, an AROLAG ensemble started at 00UTC on day D combines the Arome-France forecast started from the analyses at 00, 18 and 12UTC on D, D-1 and D-1 respectively. The forecast range of AROLAG is limited to 36 h by the oldest Arome-France run used.

- the Arome-France-EPS ensemble (named AromeEPS) is a 12-member ensemble based on perturbations of the Arome-France model at a resolution of 2.5 km in 2018. It is documented in Bouttier et al. (
**2012**), Raynaud and Bouttier (**2016**), and Bouttier et al. (**2016**). The Arome-EPS system is updated every six hours, the forecasts considered here are based on the D-1 analysis at 21UTC, with a maximum forecast range of 51 h (i.e. 48 h with respect to 00UTC). - the Arpège ensemble (named PEARP) is a 35-member ensemble based on perturbations of the global Arpège model. It is documented in Descamps et al. (
**2015**). Its resolution was 10 km - over Europe in 2018. The PEARP forecasts considered here are based on the D-1 analysis at 18UTC, with a maximum forecast range of 108 h i.e. 102 h with respect to 00UTC (the PEARP run based on 00UTC is too short to deliver forecasts beyond the range of the Arome systems).
- the ECMWF IFS ensemble (named IFSens) is a 51-member ensemble based on perturbations of the IFS model. A comprehensive documentation of the ECMWF models is maintained at www.ecmwf.int; the IFSens resolution was 18km in 2018. We only use the 50 IFSens perturbed members based on the 00UTC analysis, up to the 99 h range.

We consider three derived ensemble systems, called ensemble blends, each defined by the union of two of the above ensembles. Blend members are labelled as if they all started from 00UTC. Each blend is defined as follows:

- the ‘AromeEPS + AROLAG’ blend combines the 12 + 3 = 15 members of both systems. It can be used over ranges 3–36 h.
- the ‘AromeEPS + PEARP’ blend combines 12 + 35 = 47 members. It can be used over ranges 3–48 h.
- the ‘PEARP + IFSens’ blend combines 35 + 50 = 85 members. We will study it over ranges 9–93 h.

The timings of the three ensembles are summarized in Fig. 2 and Table 1; the tuning windows displayed in Fig. 2 will be explained in Section 3.

### Verification method

Probability scores rely on the comparison of forecasts and observations of a binary variable, the thunderstorm occurrence. Thunderstorm observations are defined using lightning and radar data as explained in Section 2.1. Thunderstorm forecasts are defined as the probability *p* that a scalar thunderstorm activity diagnostic, *x*, exceeds a predefined threshold *u*: at observation point *j*, the forecast probability is denoted *p*(*j*) = *P*(*x*(*j*) > *u*). Classically, in ensemble prediction, the forecast PDF *P* is defined by counting the number of members *i* that exceed the threshold at this point:

This is equivalent to defining the PDF as a sum of Dirac distributions located at the *n* ensemble member values (assumed equally likely):

In Section 3, we will use a more general definition of *P*, but the forecast probability will remain a function of the thunderstorm member fields predicted by the ensemble members *x _{i}*(

*j*).

Variable *x* is defined as an empirical diagnostic because current numerical models do not realistically simulate lightning activity or maximum reflectivities in thunderstorm cells. Instead, they represent the effects of deep convection in a more or less implicit way, depending on the resolution and physical parameterizations used in each model. Thunderstorm activity can be diagnosed using functions of the model variables. Studies on the realism of thunderstorms in the Arome and IFS models used here can be found in Brousseau et al. (**2016**) and Lopez (2016), respectively. PEARP and IFS lack the necessary horizontal resolution to realistically simulate deep convective cells: in these models, thunderstorms will be diagnosed using parameterizations of subgrid convection. We have chosen the following predictors of thunderstorm activity:

- in the Arome-based systems (AROLAG and AromeEPS),
*x*is the maximum simulated radar reflectivity in each column, which is an empirical function of the model hydrometeors. Reflectivity is expressed in mm/h. A study on the predictive value of Arome maximum reflectivity is provided in Osinski and Bouttier (**2018**). - in the PEARP system,
*x*is a CAPE (convective available potential energy) diagnostic, in Standard Units normalized by 100. This CAPE computation is tied to the Arpège model parametrisations of subgrid convection, which depend on the PEARP member as explained in Descamps et al. (**2015**). By design, large values of the PEARP CAPE diagnostic indicate that the model has indeed triggered deep precipitating convection. - in the IFS system,
*x*is the ‘instantaneous total lightning density’ diagnostic described in Lopez (**2016**), in Standard Units normalized by 100.

The precise normalizations used do not matter, because they will be modified by the *u* threshold re-tuning in the statistical procedure explained in Section 3. Their relative values matter because we will use a common re-tuning in each ensemble blend, so it is important that the above normalizations approximately lead to thunderstorm forecasts that cover similar geographical areas. Indeed, a superficial check has shown that the observed thunderstorm frequencies are similar to the forecast frequencies of reflectivity greater than 10 mm/h in Arome (approximately 35dBZ), CAPE greater than 1000SI in PEARP, and lightning density greater than 100SI in the IFS members. In other words, forecasting thunderstorms when *x >* 10 regardless of the numerical model used leads to approximately consistent forecast frequencies: *u =* 10 is our first guess for the threshold *u* used in Eq. (3). Forecast biases being model-dependent, it would seem better to tune a different *u* for each model. Throughout this paper, *u* is the same for the ensembles used in each blend, in order to limit the number of tunable parameters. The validity of this choice will be examined in Section 4 that looks at the *u* values that would be obtained if they were separately tuned for each ensemble.

The quality of each forecasting system will be assessed using scores of predicted thunderstorm probabilities at each observation point. Unless otherwise mentioned, the scores are averaged monthly over all points at three hourly frequency, using forecasts started at 00UTC on each day. The period considered here (June to August 2018) had significant thunderstorm activity over more than half of the days, in both observations and forecasts. The statistical procedure used in this study would be more difficult to apply over areas or seasons with weaker thunderstorm activity, because the scores work by counting thunderstorm prediction errors: a large enough number of meteorologically independent observed and forecast thunderstorms is needed in order to obtain robust estimates of thunderstorm detection and false alarm rates. Thunderstorm events involve meteorological structures (e.g. upper-air thalwegs) that typically extend over one day and the whole geographical domain considered here. Thus, the effective sample size used to assess each forecasting system is more or less the number of days with significant thunderstorm activity within the considered three-month period – about 50 in our study. In a less thunderstorm-prone season or area (e.g. winter in western Europe, or the dry season in subtropics), it may be necessary to gather much more than three months of historical data to obtain a similar sample size. If a small sample size is used, there is a risk that score averages are not representative of the actual quality of the forecasting system, because overfitting the sample data may prevent them from being relevant for other dates.

Bootstrap significance testing of daily score differences has been used to check the validity of the conclusions. We assume that the domain-averaged score on each day is an independent datum i.e. we neglect the serial correlation between scores computed on different dates.

## Parameter optimization method

As will be demonstrated in the next section, the forecasts defined by applying Eq. (1) at each point have little predictive value. We will improve them by applying five ensemble post-processing steps: a neighbourhood method, a probability dressing step, an ensemble weighting, a threshold adjustment, and a reliability calibration. For clarity, we start by mathematically defining each step separately, the complete post-processing will then be defined by their combination.

### Ensemble post-processing operators

The *neighbourhood* operator implements a tolerance on thunderstorm location. A member is assumed to predict thunderstorms at point *j* if it simulates a thunderstorm anywhere in a 2D neighbourhood of *j*. For instance, Osinski and Bouttier (**2018**) applied random shifts to the precipitation fields; Theis et al. (**2005**) considered the distribution of precipitation in a space-time neighbourhood. Schwartz and Sobash (**2017**) compared various neighbourhood methods and explained the differences between neighbourhood post-processing and neighbourhood verification. The goal here is to apply neighbourhood post-processing for the production of point forecasts; there will be no spatial tolerance in the score computation, because we are interested in the perception of forecast quality by non-expert users that only judge forecasts by what happens at their location (defined by a point in space and time, like our verifying observations). Mathematically, the neighbourhood post-processing works by replacing the forecast thunderstorm diagnostic of member *i* at point *j*, *x _{i}*(

*j*), by

*N*is the neighbourhood post-processing operator,

_{r}*D*is the horizontal distance on the sphere and

*r*is a tunable neighbourhood radius. In other words, field

*x*is replaced at each point by its maximum in a disk of radius

_{i}*r*. Each forecast system configuration uses a single radius at all locations and forecast ranges. The max function is used because it is computationally cheap and it has no tunable parameter; in a future study, it could be interesting to test more sophisticated neighbourhood operators, such as a spatial quantile, a time tolerance or a non-circular neighbourhood to account for geographical heterogeneities. In terms of the Schwartz and Sobash (

**2017**) terminology, our neighbourhood post-processing belongs to the class of 'neighbourhood ensemble probability’ (NEP) methods, with the key difference that we use a maximum operator instead of a spatial averaging: this choice will be justified below by its benefits on the scores, even though we will still interpret the post-processing output as point (i.e. not areal) probabilities.

The *dressing* operator is an empirical modification of the PDF at each forecast point: instead of considering a discrete set of ensemble members, we define the probabilities as a sum of rectangular function (named kernels) that encompass each member value. It is equivalent to smoothing the probabilities in parameter space: for instance, if a member predicts a value of *x* = 9.99, the probability that *x* exceeds 10 should intuitively be interpreted as non-zero. Kernel smoothing is used in statistics to build non-discrete probability functions that are more general than the parametric functions often used in e.g. EMOS ensemble calibration (e.g. Gneiting et al., **2005**, Scheuerer, **2014**). The kernel width drives the amount of smoothness and dispersion of the PDFs. Thunderstorm activity *x* is a positive quantity that is often equal to zero, so we define the kernel width as a multiplicative function of the ensemble value itself. Our dressing does not change the probability that *x* is zero. It is mainly used to extend the upper ‘tails’ of the probability functions beyond the maximum that is simulated by the raw ensemble. Dressing also has a smoothing effect on the PDFs. Mathematically, dressing works by replacing Eq. (2) with

*K*= 1 in the interval [1/(1 +

_{d}*d*), 1 +

*d*] and zero elsewhere. The tunable parameter is

*d*, which controls the kernel width.

*d*measures the relative position of the kernel edges with respect to

*x*(

_{i}*j*), so that e.g. a kernel with

*d*= 0.5 = 50% gives non-zero weight to values up to 50% larger than

*x*(

_{i}*j*).

The *ensemble weighting* operator is only used when combining two ensembles *a* and *b*: if their members are *x _{a}* and

*x*of respective sizes

_{b}*n*and

_{a}*n*, the PDF defined by Eq. (2) becomes

_{b}*w*.

The *reliability calibration* operator is a remapping of output probabilities inside the [0,1] interval, so that their reliability diagram is moved closer to the diagonal (Jolliffe and Stephenson, **2011**). Our approach can be regarded as a simplified version of the Flowerdew (**2014**) procedure, because we only calibrate for a single parameter threshold. The calibration operator works by replacing the output probabilities *p* with

*s*< 1 is a tuning parameter. The denominator ensures that

*p*^ is a smooth increasing function that remains in the interval [0, 1]. In the limit of small

*p*,

*C*is equivalent to the linear correction

_{s}*p^ = s p*. We shall see that forecast thunderstorm probabilities are nearly always small, so that in practice one can regard

*C*as a linear rescaling of

_{s}*p*to make it reliable,

*s*being the slope of the correction.

The basic impact of the neighbourhood, dressing and adjustment of threshold *u* are illustrated in Fig. 3, on the AromeEPS ensemble. Thunderstorm probability forecasts from the raw ensemble are very poor, and much improved by application of the neighbourhood tolerance with a conservative radius of 20 km. The forecasts can be further improved by adding dressing or by modifying threshold *u*. These operations mostly impact the upper part of the ROC curve, i.e. they change the quality of the lowest non-zero probabilities. Their effects can interact with each other, so that manually finding optimal values for the parameters set (*r*, *d*, *w*, *u*) is not trivial. In the following section we present a method to tune these parameters automatically.

### Tuning of the post-processing

The complete post-processing procedure is the sequence of the above operators in the following order: at each output point,

- the member values are defined using the neighbourhood operator
*N*on each ensemble member field_{r}*x*(Eq. (4)) - the PDF of
*x*is constructed from the member values using dressing kernel*K*(Eq. (5))_{d} - if two ensembles are being used, their PDFs are combined using weight
*w*(Eq. (6)) - the forecast thunderstorm probability is the integral of the resulting PDF below threshold
*u*(Eq. (3)) - the probabilities are calibrated by applying function
*C*(Eq. (7))._{s}

The operations are local to each post-processing point, except the neighbourhood operators. There are five tunable parameters: the radius *r,* dressing kernel width *d*, relative weight *w*, threshold *u*, and calibration slope *s*. Noting that *x* is positive, the complete post-processing equation can be written

The adjustable parameters will be tuned over some training periods, in order to minimize the forecast errors while preserving the reliability of the end product. Many metrics have been proposed to measure the performance of probabilistic forecasts of a binary variable (Jolliffe and Stephenson, **2011**); here, we choose to maximize ROCA, the area under the ROC curve. It is an increasing function of the POD (probability of detection) and a decreasing function of the FAR (false alarm rate); ROCA = 0.5 for a set of random forecasts (i.e. without any predictive value), and to 1 for a perfect forecasting system (with perfect detection and no false alarms). The area is computed numerically using the above definition of the forecast PDF at all verification points. In order to reduce the numerical costs, ROCA is only computed over a subset of forecast ranges, called the tuning window, as defined by Table 1.

The reliability of the thunderstorm probability forecasts is measured using the quadratic distance between the reliability curve (Jolliffe and Stephenson, **2011**) and the diagonal. Thus, the optimal calibration slope *s* can be estimated by fitting a linear regression to the reliability curve. ROCA is insensitive to *s* because the *C _{s}* operator is just a relabelling of the forecast probabilities. Thus, although changes to (

*r*,

*d*,

*w*,

*u*) affect the reliability, changing

*s*does not change the shape of the ROC curve, and we can decouple the ROCA optimization from the reliability calibration:

- first, the four parameters (
*r*,*d*,*w*,*u*) are tuned to maximize ROCA, - then,
*s*is tuned to optimize the reliability using a linear regression.

This organization of the computations can be applied to performance measures that are different from ROCA, provided they only depend on the ROC and FAR statistics. Chapter 3 of Jolliffe and Stephenson (**2011**) list various alternatives such as the Heidke or Peirce skill scores, the critical success index, etc. The potential economic value score (or PEV, Richardson, **2000**) can be used to optimize the forecast probabilities for users that have specific costs associated to non-detections and false alarms.

The optimization of objective function ROCA(*r*, *d*, *w*, *u*) is not trivial, because it does not always have a unique maximum; in our implementation it was not even continuous because of threshold processes in the numerical compression of forecast fields. Nevertheless we shall see that the problem is tractable in the sense that an acceptable optimization is achievable by smoothing out the smaller details of the objective function. First, we use the fact that the optimization domain is bounded by physical constraints:

- radius
*r*is positive, and less than a few hundred kilometres (otherwise the geographical structure of the thunderstorm forecasts would be blurred out); - dressing parameter
*d*and threshold*u*are positive, and constrained to be less than 100% for*d*and 30 for*u*, in order to prevent the numerical optimization from wandering too far away from the physical quantities predicted by the numerical models; - weight
*w*belongs to interval [0,1] by design.

An approximate optimization is then performed using a surrogate function approximation as described in the Appendix. The behaviour of the optimization is illustrated in Fig. 4, which is representative of the ensemble blends and training periods considered in this study. The figure shows that the optimum of the surrogate function belongs to the interior of the search domain, and that there is a clear optimum in terms of parameters *r*, *u*, and *d*. The unicity of the optimum *w* is less clear: near the ROCA optimum, there is little sensitivity to variations of *w*, and the surrogate function exhibits wiggles that are artefacts of the interpolating algorithm. It indicates that our procedure cannot numerically optimize the ROCA with a better precision than a few %, due to limitations of the optimization technique used. In the following section, we will measure the numerical uncertainty on each tuning parameter, using the interval over which the surrogate function does not decrease by less than 2%.

In this section, we have described the method for producing the thunderstorm forecasts, which involves an automatic parameter tuning step. In the next section, the behaviour of the tunings will be studied, before moving on to the performance of the thunderstorm forecasts themselves.

## Variability of the optimized parameters

The parameter tuning procedure has been applied to the three ensembles blends, independently over three calendar months: June, July and August 2018. In order to save computing time, the ROCA score has only been computed over a small set of forecast ranges, as indicated in Table 1. The resulting parameter values are shown in Fig. 5, with uncertainty bars. The parameters are rather stable, with all optimum values contained inside other month’s values, except for PEARP + IFSens in July. In all other cases, each individual parameter trained over one month can be applied to the other ones without degrading the ROCA by more than 2% (one could show that this remains true when applying a four-parameter set on a month that differs from the one over which it has been optimized, because the function ROCA(*r*, *d*, *w*, *u*) is rather flat around its optimum). It means that the tuning procedure can be applied in real time, provided at least one month of training data is used. Trials with shorter training periods (not shown) exhibited noisier results.

According to Fig. 5, the parameters are not very sensitive to the ensemble configuration used: the only exceptions are the dressing parameter for PEARP + IFSens in June and August, and the weight parameter for AromeEPS + PEARP. This lack of sensitivity is interesting given the differences between the Arome, PEARP and IFS systems. The optimal spatial tolerance radius *r* is of the order of 55 km, which can be interpreted as the approximate average resolution at which thunderstorms are predictable at the used ranges (although the exact predictable scale may vary as a function of time and space), since according to our metric, including finer scales in the post-processed product does not improve the average forecast score. Similarly, the 40% optimum for dressing parameter *d* suggests that it is the typical relative error in the prediction of thunderstorm intensity near the thunderstorm detection threshold.

According to the optimization of threshold parameter *u*, numerical model output is typically associated with electrical activity when its rain rate exceeds 6 mm/h, or when the PEARP CAPE or IFSens lightning diagnostic exceeds 600. These values are mostly relevant for weak thunderstorms, because the parameter optimization is performed over a large population of thunderstorm events, weak or not, and weak thunderstorms are much more frequent that heavy ones. The algorithm favours using weak predicted values of precipitation or CAPE, because the ensembles tend to under-predict thunderstorms (e.g. because of a too small ensemble size, or a lack of ensemble spread): we are dealing with relatively rare events, so it is ‘easier’ for the tuning to increase ROCA by increasing the POD than by reducing the FAR, which is already small before the optimization. This effect can be seen in Fig. 3: the ROC curves, except for the connections to the trivial (0, 0) and (1, 1) points, are compressed towards the left part of the diagram, because the FAR tends to be much smaller than the POD statistic (by definition, the FAR is normalized by the number of times the event was not observed, which is much larger than the number of times it was observed). In a nutshell, the choice of ROCA as a measure of performance implies that the focus of the optimization is on improving the forecast of the lowest probabilities, due to the rarity of the event.

Ensemble weights follow the convention that the second ensemble in each blend name has more relative weight if *w* is higher: when e.g. AromeEPS + AROLAG has an optimal weight of 66%, it means that the set of three AROLAG members has twice the weight of the 12 AromeEPS members. In this case each AROLAG member receives (60/3)/(40/12) = 6 times the weight of each AromeEPS member. Noting that the AROLAG model resolution is 1.3 km versus 2.5 km in AromeEPS, we conclude that the higher resolution members produce better forecasts, but they are not necessarily computationally cost-effective, since an AROLAG member costs over six times more than an AromeEPS member (this result should not be over-interpreted, though, because the error bars on the weights are quite large).

The interpretation of ensembles weights as measures of relative quality leads to the conclusion that (1) AROLAG is better than AromeEPS, (2) AromeEPS is better than PEARP and (3) IFSens is better than PEARP. The ensembles with the lower weights should not yet be regarded as useless, because in most configurations tested here, the combination of two ensembles performs significantly better than each of them, as shown by the fact that the optimal weights are always between 25 and 75%. This is consistent with previous studies on multi-ensembles such as Hagedorn et al. (**2012**): after calibration, the combination of multiple ensembles is usually better than single-ensemble systems. In our study, the ROCA value may not have a well-defined optimum, but it clearly drops for weights close to 0 or 100% (Fig. 4). As will be shown in the next section, the drop happens because the implied decrease in effective ensemble size hinders the ROC diagram from precisely sampling very low and very high probability events. One concludes that blending multiple ensembles improves the forecasts, but the weights used for the blending do not need to be precisely optimized. The performance of individual ensembles vs. ensemble blends will be further investigated in the next section.

As explained in section 3, the reliability calibration is performed after the optimization of (*r*, *d*, *w*, *u*) because it does not change the ROCA score. The effect of this calibration is shown in Fig. 6, using as an example the optimum (*r*, *d*, *w*, *u*) settings for one month. The raw thunderstorm forecasts have poor reliability, which can be mostly corrected using our simple calibration: the corrected reliability diagram is nearly on the diagonal, except for the highest probability events (which do not matter much in practice, because they are rare). Further reliability improvements could be achieved using better calibration methods, but they have not been pursued in this work because our focus here is on improving the ROC statistics.

Figure 6 shows the typical behaviour of the reliability calibration: the raw probabilities were overconfident (i.e. flatter than the diagonal), with a slope of *s* = 0.165. Applying function *C _{s}*(see Eq. (7)) reduces the forecast probabilities so that they nearly lie along the diagonal. An important consequence is a loss of sharpness, i.e. a narrowing of the forecast probabilities that are issued: after calibration, thunderstorm probabilities will rarely exceed 35%. Limitations of current numerical forecasting systems prevent them from making more confident forecasts. Higher forecast probabilities could probably be issued in more specific conditions e.g. at very short ranges using nowcasting tools, or in areas where thunderstorm events are particularly predictable. For instance, the Relàmpago del Catatumbo in Venezuela, or Hector the Convector in Australia are known to be very predictable in some seasons, because local weather and geographical features trigger quasi-periodic intense convection. The calibration coefficients for all blends and months considered in this work are shown in Table 2. From one month to the other, the calibration coefficient of each system changes by 5–20%, which is an indication of the calibration accuracy one can expect in a real-time production setting.

Figure 7 shows the impact of the post-processing on the same case as shown in Fig. 2: the raw Arome-EPS thunderstorm probabilities are computed by counting at each point the number of members that predicted thunderstorms. It leads to a very detailed probability map, with probabilities below 10% except in small areas next to the Bordeaux city (indicated on the maps), where they locally exceed 20%. Unfortunately, there was no thunderstorm there: storms occurred about 50 km to the SW and NE, where the predicted values are very small: in this region, a naive point forecast user would conclude that the prediction was mostly wrong. The three blends, on the other hand, rightly assigned probabilities greater than 10% over a wider area. The AromeEPS + AROLAG blend provides the most detailed map, with two zones of thunderstorm probabilities greater than 20%, and probabilities that rapidly drop to zero away from the thunderstorm-prone areas. The AromeEPS + PEARP and PEARP + IFSens blends are much smoother because there is less informative detail in the PEARP and IFSens ensembles. The AromeEPS + PEARP probabilities are everywhere lower than 20%, because although the raw PEARP ensemble predicts high probabilities over vast areas (not shown), they are much reduced by the calibration because they imply many false alarms. The PEARP + IFSens blend produces higher probabilities on a better defined region thanks to the IFSens system, at the expense of missing the northernmost part of the thunderstorms (using Fig. 2 as ground truth). In this event, the thunderstorms cells travelled in a SW flux; the maps suggest that the forecasts were better at predicting their trajectories than the timing of their motion, so the forecasts would probably have been improved if we had used a time-neighbourhood post-processing operator, to bring additional blurring in the SW-NE direction.

## Comparison between single- and multi-ensemble tunings

In this section we investigate two questions regarding multi-ensemble forecasts: should the parameter tunings (*r*, *d*, *u*) be model-specific? How are single- and multi-ensemble tunings related?

The first question can partly be addressed by checking if the tunings would be different in single-ensemble systems. The algorithm used is the same as for the blends, except that the Latin hypercube sampling is done in a 3D space, instead of 4D, since parameter *w* is only used for multi-ensembles. Figure 8 shows the optimal parameters for the AROLAG, AromeEPS, PEARP and IFSens systems over the month of June 2018. The AromeEPS and PEARP values shown have been optimized over ranges 12–30 h and 42–60 h, respectively, which are the ones used in the AromeEPS + AROLAG and PEARP + IFSens blends. These ranges are slightly inconsistent with the ones used for the AromeEPS + PEARP blend (24–42 h range), but the corresponding parameters are not displayed because they produce very similar tunings.

There are significant differences between the ensembles. The differences between AROLAG and AromeEPS are as large as with the other ensembles, although they use very similar forecast models. It shows that the neighbourhood radius, dressing and threshold do not only depend on physical properties of the thunderstorm diagnostic used; they are impacted by statistical properties of the ensembles such as spread and ensemble size. Parameters (*r*, *d*, *u*) can account for missing spread in the ensembles: *r* is a measure of spatial tolerance, so it can be expected to be smaller for ensembles (such as PEARP) that have larger spatial spread. Parameters *d* and *u* are measures of intensity tolerance, so they are related to intensity spread in the ensembles. They can also act as amplitude bias corrections on the ensemble output: the ROC area being sensitive to low forecast probabilities, an ensemble that under forecast thunderstorms (in the sense that its diagnostic *x* tends to have low values when thunderstorms are observed) can be improved, either by increasing *d* to widen the upper tail of the ensemble PDF, or by lowering threshold *u* to increase the frequency of thunderstorm predictions in the members.

A comparison between Figs. 5 and 8 shows that there is not a simple relationship between the single-ensemble and the blended ensemble parameter values. In particular, the blended ensemble values are not necessarily inside the interval of the contributing ensemble values. A possible explanation is that the tunings can be affected by the dispersion between ensembles, which is in general different from the dispersion of each ensemble. For instance, blending two under-dispersive ensembles may produce an over-dispersive blend if their members behave very differently from the other ensemble.

The parameter optimization used here optimizes the blends without taking into account the specific properties of the contributing ensembles. Better results could perhaps have been obtained by allowing more degrees of freedom in the optimization. For instance, the amplitude bias correction of field *x* uses a single parameter *u*: it would make sense to tune a different correction for each contributing ensemble, because the three diagnostics used to define *x* (Arome precipitation rate, PEARP CAPE and IFS lightning diagnostic) have different physical meanings. Thus, our algorithm should only be regarded as a baseline configuration that could be improved by increasing its complexity.

## Do ensemble blends outperform single ensembles?

We now investigate how ensemble blending improves over the use of single ensembles. It has been shown in Section 4 that the optimized value of weight *w* is strictly between 0 and 100%. By construction of the optimization algorithm, it means that a blended ensemble is always better than its contributing ensembles, in terms of the chosen performance metric. If a contributor was better than the blend that uses it, the optimization algorithms should have set *w* to 0% or 100%. Still, there may be reasons why a blend may not actually outperform its contributors in practice:

- the optimization might converge to an intermediate value of
*w*, even when 0% or 100% perform best, because of errors in the computation of the surrogate function, for instance if there are not enough sample points, or the interpolation algorithm has produced a surrogate with a maximum that is very different from the true ROC area maximum; - the parameters optimized at the specified forecast ranges used may not be optimal for other ranges;
- the parameters optimized for a given month may not be optimal for another month.

These issues are related to the overfitting problem in statistics (also called *variance* in the machine learning literature). In order to mitigate them, the following results will all be based on out-of-sample verification scores: whenever the optimized (*r*, *d*, *w*, *u*) parameters are used, we will use optimizations performed over a different month than the one over which the score is computed. Thus, the scores shown are representative of the performance than would have been obtained in a real time setting.

Figure 9 compares probabilistic scores of each ensemble blend with their respective contributing ensembles. Each has been post-processed and tuned independently. The ROC area and the PEV (potential economic value) diagram are shown over an interval of forecast ranges (much wider than the ranges used for the tuning). ROC and PEV emphasize different aspects of forecast error because we are dealing with relatively rare events: the ROC area is sensitive to the performance of the smallest non-zero forecast probabilities, whereas PEV graphically emphasizes the highest forecast probabilities, which show up as the ‘tail’ on the right of the PEV diagrams.

In Fig. 9, the AromeEPS + AROLAG scores (top row) indicate that there are only small differences between the blend and its post-processed contributors (AromeEPS and AROLAG). The blend is very close to the AROLAG pseudo-ensemble, with few statistically significant ROCA differences. AROLAG, a three-member poor man's ensemble, looks nearly as good (if not better, although the score differences have little statistical significance) as the numerically more expensive AromeEPS 12-member ensemble. The PEV curve reveals that the blending clearly outperforms AromeEPS for users with cost-loss ratio between 0.1 and 0.25. For higher cost-loss ratios, none of the systems has any forecast value.

The AromeEPS + PEARP blend (second row of Fig. 9) shows that PEARP degrades the forecast blend, since AromeEPS alone produces better ROCA and PEV scores. The differences are statistically significant. PEARP + IFSens significantly outperforms both PEARP and IFSens, except from a few forecast ranges. The improvement is clear for all cost-loss ratios. There is a (semi-)diurnal cycle in the ROC area scores, which suggests that the tuning of weight *w* might benefit from being optimized separately for each time of the day. Remembering that the ranges are counted from 00UTC, the ROCA curves suggest that thunderstorm forecast performance is minimal in the early morning (near ranges 30, 54 and 78 h), and relatively higher in the afternoon. This cycle may be due to variations in physical properties of the convection during the day, but it could also be a side effect of our optimizing a single set of parameters (*r*, *d*, *w*, *u*) for all ranges: during summer, thunderstorm activity has a peak in the afternoon, so it is possible that the parameter tuning is biased towards afternoon thunderstorms, and thus not optimal for the rest of the day.

The lower left panel of Fig. 9 (i.e. ROC area for the PEARP + IFSens blend) shows a decreasing trend of the score as a function of range. Using the commonly quoted value of ROCA = 0.6 as a limit below which a forecast is no longer regarded as usable in practice, a visual linear fit to the ROC area curves suggests that the average thunderstorm predictability horizon is about 5 to 6 days over Western Europe, an estimate that is consistent with the work of Simon et al. (**2018**).

We have shown that multi-ensemble blends usually outperform single ensembles, but not always. At specific ranges, and for some classes of users (e.g. with specific cost-loss ratios), single ensembles can be better. The situation can arise when mixing two ensembles with very different forecast performance (such as AromeEPS and PEARP): the worse one can degrade some aspects of the blend, despite the tuning of parameter *w* that is supposed to weight the ensembles according to their relative performance. In this case, using multi-ensembles cannot justified by our scores, but it may be desirable for other reasons, such as the resilience of the forecasting system against a missing contributing ensemble, or the forecast continuity across a wide set of ranges.

Figure 10 shows the ROC area scores of the three multi-ensemble blends, compared over all considered forecast ranges. It demonstrates the complementarity between the high-resolution Arome-based blends that provide the best forecasts at short ranges, and the global PEARP and IFSens systems that cover longer ranges. The Arome-EPS + PEARP blend fails to provide forecasts of intermediate quality, probably because the Arome and Arpège models used are too different to be blended using our simple technique: Fig. 8a showed that PEARP had the smallest optimal radius *r* of all systems. The spatial neighbourhood operator has been found to be the most important component of the ensemble post-processing, and it may not be possible to find a single radius that performs well for both AromeEPS and PEARP. To correct this issue, one could perhaps use a better PEARP thunderstorm diagnostic (e.g. along the lines of the Lopez (**2016**)), or directly blend AromeEPS with IFSens.

The reliability of the multi-ensemble blends has been checked as follows, after out-of-sample calibration: over the three-month period, the average fraction of points with observed thunderstorms was 2.2%, and the average forecast probability of thunderstorm occurrence was 3.1% in AromeEPS + AROLAG, 2.6% in AromeEPS + PEARP, and 2.9% PEARP + IFSens at forecast ranges between 9 and 30 hours. These numbers mean that the calibration works well on the optimized blends, because the output probabilities are quite reliable.

## Summary, discussion and conclusions

We have presented a new ensemble post-processing technique that combines different aspects of probabilistic forecasting: spatial tolerance using a neighbourhood operator, smoothing of the forecast density functions using a dressing operator, weighting between several ensembles, and adjustment of the threshold that diagnoses a binary event of interest (thunderstorm occurrence) from the NWP (numerical weather prediction) model output. These operators are controlled using only four tuning parameters that can be optimized on a rather short (about 1 month) training period, which makes the approach suitable for real-time application in operational meteorological institutes. The optimization is done by maximizing the ROC area, which is a measure of the end user value of probabilistic forecasts in terms of false alarm and detection rates. The optimization technique uses Latin hypercube sampling and a surrogate model algorithm, with a diagnostic of tolerance intervals around the estimated optimum parameter values. Output probabilities can be calibrated using an a posteriori rescaling, although more elaborate calibrations could be used.

The post-processing technique has been tested on thunderstorm forecasts. Only thunderstorm occurrence is predicted, not its intensity. The corresponding pseudo-observations have been generated from lightning and radar measurements. Three multi-ensemble systems, called ‘blends’, have been post-processed during 92 days of summer 2018, over mainland France. The dataset includes a wide variety of forecast ranges (from 9 to 93 h) and model resolutions (from 1.3 to 25 km horizontal mesh), using four operational ensembles: a poor man’s ensemble of lagged deterministic forecasts, a high-resolution limited area ensemble (AromeEPS), and two global ensembles (PEARP and IFSens).

The optimization of the post-processing parameters, as well as the calibration, appear to have enough statistical robustness for real-time operational applications. The robustness comes at the expense of neglecting variations between models, ensemble systems, and forecast times. Our diagnostics have shown that these variations may have significant implications. The optimized, blended superensembles have reasonable thunderstorm forecasting abilities, although they could probably be improved by including more tuning parameters to better account for the neglected parameter variability. These modifications to the post-processing system would require more computational resources and larger training datasets.

We recommend to further improve the proposed algorithm by making the (*u*, *w*, *d*) parameters dependent on the model type used (e.g. Arome, Arpège or IFS models), and by including some dependency with respect to diurnal time, which seems to be important. It would also be interesting for the tunings to depend on forecast range and on geographical location (preliminary testing has shown that our algorithm leads to different tunings if it is restricted to the Mediterranean area). The neighbourhood operator was limited to the space dimension in this study: it should be complemented by some time tolerance, in particular for models with an imperfect diurnal cycle of summer convection. Other model predictors of thunderstorm occurrence should be tested, in particular the PEARP CAPE diagnostic used here could be improved, because CAPE is only loosely related to the actual triggering of thunderstorms in numerical models.

Over regions and periods with less thunderstorm activity than in this paper, the automatic parameter tuning would be more difficult, because a minimum number of thunderstorm events is needed to achieve statistical stability: in an operational setting, the learning algorithm would need to be carefully warmed up at the beginning of each convective season, in particular if the NWP systems used have changed since the previous season. In production settings where reforecasts are not available, one could adjust the size of the learning dataset so that it always contains enough observed and forecast thunderstorm events to achieve statistical robustness. In some parts of the globe, the availability of ground-based lightning and radar data may be problematic, in which case thunderstorm products derived from satellite observations should be useful alternatives.

We have shown that the statistical post-processing has a large impact on the performance of the post-processed multi-ensembles. Probabilistic forecasts based on direct ensemble output (without any post-processing) usually benefit from higher model resolution and larger ensemble size, but as we have seen, this is not always true after post-processing: in some conditions, a poor man’s ensemble or a low-resolution global model can outperform more expensive NWP systems. The probabilistic forecast quality and the benefit of multi-ensembles also depends on the user cost-loss ratio – that is, on the relative cost that is attached to false alarms and to missed events. In a nutshell, multi-ensembles do not necessarily beat single-ensemble systems in all respects, but with a suitable post-processing they can be an attractive way of combining output from multiple systems, for specific user needs, and on a broad range of forecast horizons.

It would be useful to extend the approach used here to forecast violent thunderstorms. This would require a complexification of the observation definition (taking into account important meteorological variables such as gusts, hail and rain accumulation) and of the model diagnostics used. The rarity, and often low predictability, of these events will require larger training datasets, possibly including nowcasting products in the blending. This will be the topic of future work.