1. Introduction

Global Bayesian optimization (GBO) is an optimization method that is widely used in the tech industry, e.g. for tuning parameters that define advertising schemes on web pages such as Yelp, for optimizing server compiler flags at Facebook (Letham et al., 2019), or for tuning machine learning algorithms (Snoek et al., 2012). These engineering applications have in common that evaluations of the objective function (the function one needs to optimize) are ‘expensive’, either computationally, e.g. when tuning machine learning algorithms, or quite literally, e.g. when new advertisement arrangements are tested on a subset of website users.

The fact that GBO has been successful in such difficult problems suggests that it can find (near) optimal solutions after only a small number of evaluations of the objective function. GBO achieves a near optimal solution by constructing, refining and optimizing an emulator of the objective function (the emulator is typically a Gaussian process model). The power of GBO then stems from judiciously choosing where to evaluate the objective function to refine and improve the emulator. This is done by resolving a trade-off between exploiting regions where the objective function is known to be large, and exploring unknown regions to encounter new, and possibly more relevant, extrema of the objective function. We note that the main computational cost of GBO arises from evaluations of the objective function (all other computations are manageable), and that no derivatives or adjoints are required.

In short, GBO is an optimization method that can be used for the efficient numerical solution of optimization problems for which

  1. derivatives and adjoints are difficult or impossible to compute;
  2. evaluations of the objective function are expensive and a near optimal, or at least useful, solution can be found after only a few evaluations of the objective function.

Optimization problems with the above two characteristics occur frequently in the Earth sciences and it is natural to ask if and how one may be able to use GBO in Earth science problems. An example of the use of GBO in a geophysical problem is the recent work of Pirot et al. (2019), where GBO is used for underground contaminant source localization. A second example is the work of Abbas et al. (2016), which leverages GBO in the context of simultaneous state and parameter estimation problems.

In this paper, we continue this line of work and present numerical experiments in which we test the usefulness of GBO in three problems in ensemble data assimilation (DA). The goal in DA is to update forecasts of a numerical model by observations. A prominent example is (global) numerical weather prediction where model-based forecasts are updated in view of meteorological observations. The DA problem is usually formulated within a Bayesian framework, in which the forecast defines a prior, which is updated to a posterior distribution via a likelihood that takes the observations into account. Ensemble DA methods are a suite of numerical methods that use Monte Carlo techniques to perform this inference. Examples of ensemble DA methods include the ensemble Kalman filter (EnKF), see e.g. Evensen (2009a), and particle filters, see, e.g. Doucet et al. (2001) and Farchi and Bocquet (2018). Variational methods, which aim at finding the model state that maximizes the posterior distribution, are an alternative to ensemble DA (Talagrand and Courtier, 1987). Variational methods have been combined with the ensemble approach in hybrid schemes (Lorenc, 2003; Bonavita et al., 2012).

The computational requirements of ensemble DA are connected to the required ensemble size: each ensemble member typically requires a simulation of the underlying computational model, which is expensive. For this reason, one must keep the ensemble size small. A small ensemble size, in turn, introduces sampling errors which can be reduced by ‘localization’ and ‘inflation’. The essential idea of localization is to reduce sampling error by deleting spurious long-range correlations (Hamill et al., 2001; Houtekamer and Mitchell, 2001). Inflation is another technique used to compensate for the tendency of ensemble DA schemes to underestimate actual errors, see, e.g. Kotsuki et al. (2017) and Anderson and Anderson (1999). The basic idea in inflation is to artificially increase the variances and covariances in the ensemble. The various parameters that define the localization and inflation are ‘tuned’, which often means that one performs tests with a limited number of combinations of localization and inflation parameters, and then uses the ones that gave the best results (see, e.g. Lorenc (2017) [Section 3]). We refer to this technique of optimizing localization and inflation parameters as a ‘grid-search’ (because one searches for a good solution on a grid of parameters). Alternatives to a grid-search, including data-driven, adaptive and optimal localization methods, are discussed in Section 3.2.

The rest of this paper is dedicated to investigating the use of GBO within ensemble DA by describing how to use GBO in three specific problems:

  1. tuning localization and inflation in ensemble DA;
  2. solving DA problems where some or all model parameters are uncertain or completely unknown;
  3. tuning localization and inflation, while simultaneously also estimating model parameters.

The first problem is motivated by the fact that tuning localization and inflation becomes increasingly difficult as new multi-scale localization schemes are put to use, see, e.g. Buehner (2012) and Buehner and Shlyaeva (2015). For multi-scale localization methods, the number of parameters that need to be tuned may exceed what is feasible with simple grid-search techniques, and GBO may represent a useful alternative. Data-driven or adaptive/optimal localization schemes have not yet been put forward for multi-scale localization.

The second problem is motivated by the fact that, on short enough time intervals, linear or nearly linear ensemble DA methods (EnKF or hybrid methods) are successful because the state estimation is nearly linear on such short time scales, see e.g. Morzfeld and Hodyss (2019). Parameter estimation, however, is typically highly nonlinear, see, e.g. Vukicevic and Posselt (2008) and Posselt et al. (2012). Combining GBO with ensemble DA allows us to separate the simultaneous state/parameter estimation problem into its (approximately) linear and nonlinear components. Specifically, we use ensemble DA for the state estimation and, using GBO, create an additional outer loop to find optimal values for the model’s parameters. The third problem is motivated by the fact that the localization and inflation parameters depend on the model parameters. For example, a larger inflation can compensate for errors in the model parameters (because a large inflation essentially increases the assumed model errors relative to the observation errors). Conversely, even with appropriate model parameters, large forecast errors can occur if the localization/inflation parameters are inappropriate.

We present numerical experiments with the usual Lorenz test problems (Lorenz, 1963, 1996) and with an EnKF (Evensen, 2009a) serving as an example of an ensemble DA system. Our study is a proof-of-concept, but more work is needed, in particular with more realistic or real models, to better understand the benefits and limitations of the use of GBO within ensemble DA. It is clear, however, that the GBO framework is flexible and easy to use with any ensemble DA technique (variational or hybrid). The reason is GBO’s ‘offline’ structure, i.e. the fact that GBO operates in an outer loop in which one calls an ensemble DA system to evaluate the objective function. Moreover, the overall computational cost of a GBO technique is determined by the computational cost of evaluations of the objective function (running an ensemble DA system on a set of observations). All other required computations (updating/optimizing the emulator and computing where to evaluate the objective function) are negligible.

Before proceeding, we summarize some notation we use throughout this paper. Lowercase bold characters represent vectors, lowercase non-bold characters are scalars, and capital bold characters represent matrices. The dimensions are denoted by $N$ with an index that references the vector, e.g. $N_a$ is the dimension of the vector $\mathbf{a}$. We also use a condensed indexing notation where $b_{1:n}$ should be interpreted as the set $\{b_1, \dots, b_n\}$. This indexing notation extends to scalar functions by interpreting $f(b_{1:n})$ as the column vector $[f(b_1), \dots, f(b_n)]^T$.

2. Global Bayesian optimization

In this section, we review GBO from a mathematical perspective. As indicated in the introduction, GBO is a derivative-free optimization method, which means that no gradients or adjoints are required. The optimization relies solely on evaluations of the objective function. The optimization of the objective function is achieved by building and refining an emulator of the objective function. Where the function evaluations take place is decided by resolving a trade-off between refining the emulator in regions where the objective function is known to be large, and exploring unknown regions.

Throughout this paper, we denote the objective function by $J(\phi)$, where $\phi$ is typically vector valued, but the objective function itself is scalar. We assume that $J(\phi) \le 0$ for any $\phi$ and, thus, we seek to maximize the objective function. In all cases we consider, the objective function is connected to a nonlinear least squares fit to observations (see Equation (25)). GBO computes the maximizer $\hat{\phi} = \arg\max_{\phi} J(\phi)$ by iterating two steps. Step one builds the emulator based on all available function evaluations, and step two proposes a new parameter $\phi^*$, which is determined by optimizing an ‘acquisition function’. Optimization of the acquisition function resolves the trade-off between refining the emulator in regions where $J$ is known to be large, and exploring the parameter space more broadly to find new and higher maxima. During the iteration, one learns the structure of the objective function and can make informed guesses about where the function is large (ideally maximally large). We now describe each of the key GBO ingredients: the emulator, which is usually a Gaussian process (GP) model, its update via regression, and the acquisition function. For simplicity, we present the case of a scalar parameter $\phi$, but the extension to vector-valued parameters is straightforward. We refer to Frazier and Wang (2016) for more details about GBO.

2.1. Gaussian processes and their updates by regression

A Gaussian process is a collection of random variables such that any finite subset has a multivariate Gaussian distribution. A simple example of a GP is a multivariate Gaussian random variable (this corresponds to a finite indexing set). Allowing the indexing set to be (uncountably) infinite, one can extend the notion of Gaussian distributions to random functions. In practice, one always uses finite dimensional approximations of functions by discretizing on a grid, but the theory is easier to grasp when considering the infinite dimensional limit, which is independent of the discretization one may choose.

A GP is defined by its mean function and a kernel function that describes the covariances within the function. A kernel $K$ is a positive definite function that defines the spatial correlation between any two points within the domain. There are many examples of suitable kernel functions, see, e.g. Williams and Rasmussen (2006), and one popular choice is a Gaussian kernel with added white noise

((1))
$$K(\phi, \phi') = \gamma^2 \exp\left(-\frac{|\phi - \phi'|^2}{2\ell^2}\right) + \kappa^2\,\delta(\phi, \phi')$$
where $\delta(\cdot,\cdot)$ is the Dirac delta function and where $\gamma$, $\kappa$ and $\ell$ are positive parameters that define the kernel. We discuss how to find the ‘hyper-parameters’ $\gamma$, $\kappa$ and $\ell$ further below. To keep things simple, and since this study presents a proof-of-concept, we will use only the above kernel, but one can easily envision using other kernels, or even optimizing the kernel structure (see, e.g. Lunderman et al. (2018)).

One then emulates, or models, the objective function by a Gaussian process with the specified kernel:

((2))
$$\hat{J}(\phi) = \mathrm{GP}\left(\mu(\phi), K(\phi, \phi')\right),$$
and updates the GP using evaluations of the objective function J and regression. Suppose we have evaluated the objective function at NG parameters ϕ1:NG, then regression gives
((3))
$$\hat{J}(\phi \mid \phi_{1:N_G}) = \mathrm{GP}\left(\hat{\mu}(\phi), \hat{K}(\phi, \phi')\right),$$
where
((4))
$$\hat{\mu}(\phi) = \mu(\phi) + K(\phi, \phi_{1:N_G})\, K(\phi_{1:N_G}, \phi_{1:N_G})^{-1}\left[J(\phi_{1:N_G}) - \mu(\phi_{1:N_G})\right],$$
((5))
$$\hat{K}(\phi, \phi') = K(\phi, \phi') - K(\phi, \phi_{1:N_G})\, K(\phi_{1:N_G}, \phi_{1:N_G})^{-1} K(\phi_{1:N_G}, \phi').$$

This (Bayesian) update can be implemented efficiently with standard linear algebra and, since each function evaluation can be processed sequentially, the computations remain straightforward (see Rasmussen (2004) for more details).
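
As an illustration, the kernel of Equation (1) and the regression update of Equations (4) and (5) can be sketched in a few lines of Python. This is a minimal sketch under our own naming conventions (NumPy only, constant prior mean); it is not the implementation used in the numerical experiments below.

```python
import numpy as np

def kernel(phi_a, phi_b, gamma, kappa, ell):
    """Gaussian kernel with added white noise, Equation (1)."""
    d2 = (np.atleast_1d(phi_a)[:, None] - np.atleast_1d(phi_b)[None, :]) ** 2
    white = (d2 == 0.0).astype(float)            # discrete stand-in for delta(phi, phi')
    return gamma**2 * np.exp(-d2 / (2.0 * ell**2)) + kappa**2 * white

def gp_update(phi, phi_train, J_train, gamma, kappa, ell, mu=0.0):
    """Conditional mean and covariance of the GP emulator,
    Equations (4) and (5), for a constant prior mean mu."""
    K_tt = kernel(phi_train, phi_train, gamma, kappa, ell)
    K_st = kernel(phi, phi_train, gamma, kappa, ell)
    K_ss = kernel(phi, phi, gamma, kappa, ell)
    w = np.linalg.solve(K_tt, J_train - mu)                 # K^{-1} [J - mu]
    mean = mu + K_st @ w                                    # Equation (4)
    cov = K_ss - K_st @ np.linalg.solve(K_tt, K_st.T)       # Equation (5)
    return mean, cov
```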

The hyper-parameters ($\gamma$, $\kappa$ and $\ell$) can be obtained by minimizing the negative log marginal likelihood defined as

((6))
$$\mathcal{L}(\gamma, \kappa, \ell) = -\log p\left(J(\phi_{1:N_G}) \mid \gamma, \kappa, \ell\right) = \frac{1}{2}\, J(\phi_{1:N_G})^T \mathbf{K}^{-1} J(\phi_{1:N_G}) + \frac{1}{2}\log|\mathbf{K}| + \frac{N_G}{2}\log 2\pi,$$
where $\mathbf{K} = K(\phi_{1:N_G}, \phi_{1:N_G}; \gamma, \kappa, \ell)$ is an $N_G \times N_G$ matrix that discretizes the kernel (for a given set of $\gamma$, $\kappa$, $\ell$) and where $|\mathbf{K}|$ is its determinant. It is advisable to re-do this optimization and adjust the hyper-parameters of the kernel before each update, since the function evaluations and updates may provide additional information about what appropriate hyper-parameters ought to be.
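
For illustration, the hyper-parameter fit can be sketched as a numerical minimization of Equation (6). The sketch below assumes the kernel helper from the previous listing and SciPy’s L-BFGS-B optimizer; the log-parametrization (to enforce positivity) and the initial guess are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, phi_train, J_train):
    """Negative log marginal likelihood of Equation (6) (zero prior mean)."""
    gamma, kappa, ell = np.exp(log_params)           # positivity via log-parameters
    K = kernel(phi_train, phi_train, gamma, kappa, ell)
    _, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, J_train)
    n = len(J_train)
    return 0.5 * J_train @ alpha + 0.5 * logdet + 0.5 * n * np.log(2.0 * np.pi)

def fit_hyperparameters(phi_train, J_train):
    """Minimize Equation (6) over (gamma, kappa, ell)."""
    x0 = np.log([1.0, 1e-3, 1.0])                    # crude initial guess
    res = minimize(neg_log_marginal_likelihood, x0,
                   args=(phi_train, J_train), method="L-BFGS-B")
    return np.exp(res.x)                             # gamma, kappa, ell
```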

2.2. Acquisition function

We have described above how to emulate the objective function by a GP and how to update this GP in view of evaluations of the objective function. What remains is to find a way to suggest new parameters at which to evaluate the objective function so as to improve its GP emulator. GBO finds new parameters for function evaluations via optimization of an acquisition function. An acquisition function, $A$, has the form

((7))
$$A(\phi) = A\left(J_{\max}, \mu(\phi), K(\phi, \phi)\right),$$
where $J_{\max} = \max_k\, J(\phi_k)$. The parameter $\phi^*$ at which the objective function is evaluated next is selected by maximizing the acquisition function over $\phi$:
((8))
$$\phi^* = \arg\max_{\phi}\, A\left(J_{\max}, \mu(\phi), K(\phi, \phi)\right).$$

In this work, we use Expected Improvement (EI) as the acquisition function because it is one of the simplest choices and it turned out to be sufficient in the problems we tried, but more sophisticated techniques, e.g. using knowledge-gradients, may be used in the future. EI is defined as

((9))
$$\mathrm{EI}(\phi) = \mathbb{E}\left[\left(J(\phi) - J_{\max}\right)^{+}\right],$$
where $(\cdot)^{+} = \max\{\cdot, 0\}$ and $\mathbb{E}[\,\cdot\,] = \mathbb{E}[\,\cdot \mid \phi_{1:N_G}, J(\phi_{1:N_G})]$ is the expectation conditioned on the known function realizations (see, e.g. Jones et al. (1998)). Under our assumptions, EI has the closed form representation
((10))
$$\mathrm{EI}(\phi) = \left(\mu(\phi) - J_{\max}\right)\Phi\!\left(\frac{\mu(\phi) - J_{\max}}{\sigma(\phi)}\right) + \sigma(\phi)\,\varphi\!\left(\frac{\mu(\phi) - J_{\max}}{\sigma(\phi)}\right),$$
where $\Phi(\cdot)$ and $\varphi(\cdot)$ are the normal cumulative distribution function (cdf) and probability density function (pdf), and where $\sigma^2(\phi) = K(\phi, \phi)$. Thus, maximizing EI is relatively straightforward.

Note that the maximizer of EI will have a large predicted value (relative to the known maximum $J_{\max}$) and a large amount of associated uncertainty (represented by $\sigma(\phi)$). Thus, using optimization of EI to propose new parameters for function evaluations resolves the trade-off between exploiting regions where the objective is known to be large and exploring unknown and uncertain regions that may contain higher maxima (see also Frazier and Wang (2016), Section 3.4.1).
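
To make this concrete, a sketch of the closed-form EI of Equation (10), together with a simple multi-start maximization (as described in Section 4.4), could look as follows. It reuses the gp_update helper sketched in Section 2.1 with fixed hyper-parameters, and the function names are ours.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def expected_improvement(phi, phi_train, J_train, gamma, kappa, ell):
    """Closed-form EI of Equation (10) at the point(s) phi."""
    mean, cov = gp_update(phi, phi_train, J_train, gamma, kappa, ell)
    sigma = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
    z = (mean - J_train.max()) / sigma
    return (mean - J_train.max()) * norm.cdf(z) + sigma * norm.pdf(z)

def propose_next(phi_train, J_train, bounds, hyper, n_starts=20, seed=None):
    """Multi-start maximization of EI over a scalar parameter phi."""
    rng = np.random.default_rng(seed)
    gamma, kappa, ell = hyper
    best_phi, best_ei = None, -np.inf
    for phi0 in rng.uniform(bounds[0], bounds[1], n_starts):
        res = minimize(lambda p: -expected_improvement(p, phi_train, J_train,
                                                       gamma, kappa, ell)[0],
                       x0=[phi0], bounds=[bounds], method="L-BFGS-B")
        if -res.fun > best_ei:
            best_phi, best_ei = res.x[0], -res.fun
    return best_phi
```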

2.3. Summary of the GBO algorithm

The GBO algorithm can be summarized as follows. First, pick a GP model by specifying a kernel and associated hyper-parameters. Second, initialize the algorithm by picking a (small) number of points and evaluating the objective function at these points. For example, one can use a quasi-random Sobol sequence to obtain a few $\phi$. One then optimizes the hyper-parameters of the kernel and uses regression to update the mean function and covariance kernel in view of the new function evaluations. The algorithm proceeds by iterating the following steps:

  1. Maximize expected improvement to obtain a ϕ* and evaluate the objective function at ϕ*.
  2. Optimize the marginal log-likelihood to obtain appropriate hyper-parameters for the GP model’s kernel in view of the new function evaluation.
  3. Use regression to update the GP’s mean function and covariance kernel model in view of all function evaluations done thus far.

This iteration proceeds until a stopping criterion is reached. In many engineering applications, the stopping criterion is simply ‘we run GBO until we run out of time’. In the numerical examples below, we use a similar strategy and simply perform a pre-specified number of GBO steps. This corresponds to a priori specifying a computational budget and then accepting the results one obtains as appropriate, which is realistic. Other stopping criteria include monitoring how the maximum function value evolves over the iterations and stopping after reaching some threshold, or stopping when no improvement is made for a large number of iterations. Once the iteration is completed, the approximate maximizer of the objective function is

((11))
$$\hat{\phi} := \arg\max_{\phi} J(\phi) \approx \arg\max_{k} J(\phi_k).$$

The GBO algorithm is illustrated by a flowchart in Figure 1.

Fig. 1

Flowchart of the GBO algorithm. The two panels on the right illustrate (i) how maximization of expected improvement results in additional function evaluations that explore the space effectively; and (ii) the effects of a typical update of the GP mean and covariance based on a single, additional function evaluation.
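
For readers who prefer code to flowcharts, the whole iteration can also be sketched compactly. The sketch below assumes the helper functions from the previous listings (fit_hyperparameters, propose_next), a scalar parameter with box bounds, and an expensive objective function J supplied by the user; it mirrors Figure 1 but is not the exact implementation used in the numerical examples.

```python
import numpy as np
from scipy.stats import qmc   # quasi-random Sobol sequences

def gbo(J, bounds, n_init=4, n_iter=25, seed=0):
    """Minimal GBO loop (cf. Figure 1): Sobol initialization, then iterate
    (i) maximize EI, (ii) refit kernel hyper-parameters,
    (iii) GP regression (done inside propose_next with the updated training set)."""
    lo, hi = bounds
    phi_train = lo + (hi - lo) * qmc.Sobol(d=1, seed=seed).random(n_init).ravel()
    J_train = np.array([J(p) for p in phi_train])        # expensive evaluations
    hyper = fit_hyperparameters(phi_train, J_train)      # initial kernel fit
    for _ in range(n_iter):
        phi_new = propose_next(phi_train, J_train, bounds, hyper)   # step (i)
        phi_train = np.append(phi_train, phi_new)
        J_train = np.append(J_train, J(phi_new))         # one more expensive evaluation
        hyper = fit_hyperparameters(phi_train, J_train)  # step (ii)
    k = np.argmax(J_train)                               # Equation (11)
    return phi_train[k], J_train[k]
```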

3. Review of ensemble data assimilation

The goal in data assimilation is to estimate the state of a numerical model at ‘time’ k, where k is an integer indexing a discrete time variable. The state estimate is based on previous observations, collected up to time k − 1, a numerical model which we write as M, and a new observation collected at time k. We assume that the model state and observation satisfy

((12))
yk=Hxk+ζk,
where $\zeta_k \sim N(0, R)$ are independent identically distributed (iid) random variables and $R$ is the observation error covariance matrix; $H$ is a matrix that maps the state to the observations, and the number of observations is usually much smaller than the number of state variables. We write the underlying numerical model as
((13))
$$x_k = M(\theta;\, x_{k-1}),$$
where θ are model parameters. In ‘pure’ state estimation problems, the model parameters are assumed to be known, but one can extend the framework to account for partially or entirely unknown model parameters (see below). Making use of the Bayesian framework, the relevant posterior distribution for state estimation is
((14))
$$p(x_k \mid y_{1:k}) \propto p(y_k \mid x_k)\, p(x_k \mid y_{1:k-1}).$$

It is now clear that data assimilation is a sequential process, in which state estimates are updated as observations become available, see, e.g. Carrassi et al. (2018) for more details and explanations.

3.1. Ensemble Kalman filter

The ensemble Kalman filter (EnKF) is an ensemble DA method that approximates the posterior distribution by using a Monte Carlo approximation within the Kalman filter formalism (Evensen, 2009a). Specifically, let $\{x_1, x_2, \dots, x_{N_e}\}$ be an ensemble of size $N_e$, approximately distributed according to the distribution $p(x_{k-1} \mid y_{1:k-1})$. The model can be applied to each ensemble member to generate a forecast ensemble

((15))
$$x_j^f = M(\theta;\, x_j), \qquad j = 1, \dots, N_e.$$

The covariance matrix computed from the forecast ensemble, Pf, is called the ‘forecast covariance’. The forecast covariance and the observation yk are used to generate an analysis ensemble

((16))
$$x_j^a = x_j^f + \mathbf{K}\left(y_k - \left(H x_j^f + \zeta_j\right)\right), \qquad j = 1, \dots, N_e,$$
where $\zeta_j \sim N(0, R)$ are iid random variables, and where
((17))
$$\mathbf{K} = \mathbf{P}^f H^T\left(H \mathbf{P}^f H^T + R\right)^{-1},$$
is the Kalman gain. The analysis ensemble is approximately distributed according to the posterior distribution $p(x_k \mid y_{1:k})$. When the next observation becomes available, the entire process is repeated.
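
A minimal sketch of one cycle of the stochastic EnKF described above (Equations (15)–(17)), with ensemble members stored as columns; localization and inflation (Section 3.2) are omitted here, the model is passed in as a function, and the names are ours.

```python
import numpy as np

def enkf_forecast(model, theta, ens_a):
    """Propagate each analysis member with the model, Equation (15)."""
    return np.column_stack([model(theta, x) for x in ens_a.T])

def enkf_analysis(ens_f, y, H, R, rng):
    """Stochastic EnKF update of a forecast ensemble, Equations (16) and (17).
    ens_f: (Nx, Ne) forecast ensemble, y: (Ny,) observation."""
    Ne = ens_f.shape[1]
    Pf = np.cov(ens_f)                                    # forecast covariance
    K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)        # Kalman gain, Eq. (17)
    zeta = rng.multivariate_normal(np.zeros(len(y)), R, Ne).T
    return ens_f + K @ (y[:, None] - (H @ ens_f + zeta))  # Eq. (16)
```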

The EnKF described here is referred to as the ‘stochastic’ implementation of the EnKF; other implementations, referred to as ensemble transform filters, are also heavily used, see, e.g. Hunt et al. (2007). The GBO techniques we describe below pertain to all implementations of the EnKF, but we consider only the stochastic implementation as an example to illustrate our ideas.

3.2. Localization and inflation

When applying the EnKF, one must keep computational requirements in mind, especially if the model is complex and computationally intensive to run. In this case, a large component of the computational cost of an EnKF is generating the forecast ensemble by applying the model to each ensemble member. To keep the computational requirements low, the ensemble size should be chosen as small as possible and is typically on the order of hundreds, often several orders of magnitude smaller than the dimension of the state or the observations. Localization and inflation are two techniques that enable using the EnKF with modest ensemble sizes, and it is widely accepted that localization and inflation are critical components of an EnKF, see, e.g. Hamill et al. (2009).

A small ensemble size causes errors in the sample-based estimates of the forecast covariance matrix, which often results in unrealistic long-range correlations. The basic idea of localization is to reduce spurious correlations, see, e.g. Hamill et al. (2001) and Houtekamer and Mitchell (2001). There are several localization techniques one can use, see, e.g. Shlyaeva et al. (2019), but here we focus on localization of the forecast covariance by a Schur-, or element-wise, product. One first determines a symmetric positive definite (SPD) localization matrix $\mathbf{L}$ with the property that correlations of any pair of state variables that are ‘far’ from each other are dampened. A simple example is the matrix $\mathbf{L}$ whose elements $\mathbf{L}_{i,j}$ are given by

((18))
$$\mathbf{L}_{i,j} = \exp\left(-\frac{1}{2}\left(\frac{z_{i,j}}{L}\right)^2\right),$$
where $z_{i,j}$ is the distance between the $i$th and $j$th state variables and where $L$ is a length scale that determines how quickly correlations decay. The localized forecast covariance is defined by
((19))
$$\mathbf{P}^f_{\mathrm{loc}} = \mathbf{L} \circ \mathbf{P}^f,$$
where the open circle, $\circ$, denotes the element-wise product of $\mathbf{L}$ and $\mathbf{P}^f$. The localized forecast covariance $\mathbf{P}^f_{\mathrm{loc}}$ then replaces the forecast covariance within the EnKF algorithm.

Inflation is another method used to keep the EnKF applicable and accurate when the ensemble size is small. More specifically, inflation is used to prevent underestimation of variances and covariances due to a small ensemble size, which may cause the EnKF to generate analysis ensembles whose covariance underestimates actual errors (Anderson and Anderson, 1999). In this work, we use multiplicative inflation as an example of one of many inflation techniques (see, e.g. Luo and Hoteit (2011) for an alternative). As above, we denote the forecast ensemble by $x_1^f, \dots, x_{N_e}^f$ and write the forecast mean as $\bar{x}^f = (1/N_e)\sum_{j=1}^{N_e} x_j^f$. The inflated ensemble, which has a larger associated covariance matrix than the ‘original’ ensemble if $\alpha > 1$, is given by

((20))
$$x_{j,\mathrm{infl}}^f = \bar{x}^f + \alpha\left(x_j^f - \bar{x}^f\right).$$

The inflated ensemble $x_{j,\mathrm{infl}}^f$, $j = 1, \dots, N_e$, is then updated via (16) within the EnKF algorithm, using the localized forecast covariance matrix when computing the Kalman gain.
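
A sketch of the localization matrix of Equation (18), the Schur product of Equation (19) and the multiplicative inflation of Equation (20); the periodic distance is our own assumption, chosen to match the Lorenz ’96 setting used later.

```python
import numpy as np

def localization_matrix(Nx, L):
    """Gaussian localization matrix of Equation (18) on a periodic
    one-dimensional grid with unit spacing and length scale L."""
    idx = np.arange(Nx)
    z = np.abs(idx[:, None] - idx[None, :])
    z = np.minimum(z, Nx - z)               # periodic distance between grid points
    return np.exp(-0.5 * (z / L) ** 2)

def localize(Pf, Lmat):
    """Element-wise (Schur) product, Equation (19)."""
    return Lmat * Pf

def inflate(ens_f, alpha):
    """Multiplicative inflation about the ensemble mean, Equation (20)."""
    xbar = ens_f.mean(axis=1, keepdims=True)
    return xbar + alpha * (ens_f - xbar)
```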

We note that localization and inflation require that one specifies parameters that define the degree of localization (the length scale $L$) and inflation (the inflation parameter $\alpha$). Moreover, localization and inflation are usually intertwined: fixing the inflation and optimizing the localization (or vice versa) does not yield results that are as good as a joint optimization over localization and inflation parameters. Perhaps the simplest technique to find appropriate localization/inflation parameters is to perform a ‘grid-search’, where one tries a number of pairs of localization/inflation parameters (often informed by the physics of the problem) and then uses the one that gives the best results, e.g. those that minimize EnKF forecast errors (see, e.g. Nerger (2015) and Lorenc (2017)). A grid-search requires that one performs a data assimilation experiment for each pair of localization/inflation parameters and, therefore, can accrue a large computational cost. In fact, the computational requirements of a grid-search may become excessive when multi-scale localization techniques, which require several parameters to define the localization, are used (Buehner, 2012; Buehner and Shlyaeva, 2015).

Besides a grid-search, one can also use a data-driven approach to obtain a suitable localization (De La Chevrotière and Harlim, 2017). The basic idea is to first construct a model for the localization of the forecast covariance during an ‘offline training’ procedure, and to then deploy this localization in the ensemble DA. The training involves performing a DA for a specified number of training observations. A data-driven approach is thus similar in spirit to the grid-search because a set of observations is required and processed during training. Similarly, one can also use machine learning to find appropriate localizations (Moosavi et al., 2019). Empirical localization functions (Lei et al., 2015) can also be used; these are constructed, empirically, from a training set of observations.

The computational cost of a grid-search or data-driven method can be high, because a large number of DA runs are performed during the grid-search/training phase. Alternatively, one can use adaptive localization schemes where the localization is adapted in view of the observations (and assumed errors). Examples of adaptive localization techniques include Anderson (2012), Zhen and Zhang (2014), Anderson (2016) and Popov and Sandu (2019). The computational cost of an adaptive localization is low and we emphasize that no training is required when using an adaptive strategy. Optimal localization methods are similar in spirit to adaptive techniques, in that these methods can be employed directly (without using any training or optimization). In an optimal localization scheme, the localization is chosen to optimize a specified set of criteria. The optimality is based on mathematical assumptions, e.g. about sampling, model, or observation errors, and the validity of these assumptions is difficult to ensure in realistic DA problems. Examples of optimal localization methods include Ménétrier et al. (2015), Flowerdew (2015) and Destouches et al. (2020). The tuning of localization ‘on-the-fly’ (no training required) using information from past observations is discussed in Flowerdew (2017). Finally, we note that one can also adaptively choose the inflation, see, e.g. Li et al. (2009) and El Gharamti et al. (2019).

3.3. Augmented state EnKF for simultaneous state and parameter estimation

The EnKF framework can be extended to simultaneous state and parameter estimation, where one estimates the state of the model, $x$, and unknown model parameters $\theta$, see e.g. Evensen (2009b), Bocquet and Sakov (2013) and Ruckstuhl and Janjić (2018). This technique, often called the augmented state EnKF, works as follows. One defines an ‘augmented state’

((21))
$$\tilde{x} = \left(x^T, \theta^T\right)^T,$$
which contains the state and parameters. One then also augments the model by simply writing
((22))
$$\tilde{M}(\tilde{x}) = \begin{pmatrix} M(\theta;\, x) \\ \theta \end{pmatrix}.$$

Thus, the augmented model evolves the state $x$ via the numerical model $M$ as before, and the model parameters $\theta$ remain constant ($\theta_k = \theta_{k-1}$). Since the relationship between parameters and observations is unknown, the observation matrix $\tilde{H}$ is augmented with a zero matrix:

((23))
$$\tilde{H} = \left(H \;\; 0\right).$$

The EnKF can now be used as described above and it generates estimates (in the form of ensembles) for the state x and the parameters θ. To localize an augmented state EnKF, one can, for example, augment the localization matrix by

((24))
$$\tilde{\mathbf{L}} = \begin{bmatrix} \mathbf{L} & B \\ B^T & C \end{bmatrix},$$
where $\mathbf{L}$ is the same localization matrix as before and where $B = \mathbf{1}_{N_x \times N_\theta}$ (an $N_x \times N_\theta$ matrix whose elements are all equal to one) and $C = I_{N_\theta}$. More generally, one can consider localization matrices that dampen correlations between model states and model parameters in more sophisticated ways, but tuning (or adapting, or optimally choosing) the augmented localization matrix can become cumbersome. Below, we will use the augmented state EnKF (with localization as above) as a benchmark method to test and verify the results we obtain via GBO applied to simultaneous state and parameter estimation problems.
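
The augmentation in Equations (21)–(24) amounts to stacking arrays; a minimal sketch (our own names, assuming the state-only quantities $H$ and the localization matrix are already defined):

```python
import numpy as np

def augment(x, theta, H, Lmat):
    """Augmented state, observation matrix and localization matrix,
    Equations (21), (23) and (24)."""
    Nx, Ntheta = len(x), len(theta)
    x_aug = np.concatenate([x, theta])                            # Eq. (21)
    H_aug = np.hstack([H, np.zeros((H.shape[0], Ntheta))])        # Eq. (23)
    B = np.ones((Nx, Ntheta))
    L_aug = np.block([[Lmat, B], [B.T, np.eye(Ntheta)]])          # Eq. (24)
    return x_aug, H_aug, L_aug

def augmented_model(model, x_aug, Nx):
    """Augmented model of Equation (22): evolve the state, keep theta constant."""
    x, theta = x_aug[:Nx], x_aug[Nx:]
    return np.concatenate([model(theta, x), theta])
```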

Other numerical methods for simultaneous state and parameter estimation are often based on Monte Carlo (MC) sampling and particle filters. Particle filters are fully nonlinear/non-Gaussian state estimation algorithms. Examples of techniques that combine particle filters with Markov chain Monte Carlo (MCMC) for simultaneous state and parameter estimation include particle Markov chain Monte Carlo (PMCMC) (Andrieu et al., 2010) and SMC2 (Chopin et al., 2013). One can envision using these algorithms within a GBO framework, but we leave such ideas for future work.

4. The use of GBO in ensemble DA

We use GBO to estimate model parameters, to tune localization and inflation, or to do both. The basic idea is to couple GBO to an EnKF (or another ensemble DA system) by defining the objective function in terms of the forecast error of the EnKF, averaged over a pre-specified time interval. More specifically, the objective function is defined as

((25))
$$J(\phi) = -\frac{1}{N_T - N_B}\sum_{j=N_B}^{N_T} \left\|y_j - H\bar{x}_j^f(\phi)\right\|_2^2,$$
where $\|\cdot\|_2$ is the 2-norm, $y_j$ is the observation at time $j$, and $\bar{x}_j^f$ is the ensemble average of the forecast ensemble at time $j$ (the forecast ensemble at time $j$ is the analysis ensemble at time $j-1$ propagated to time $j$ via the model), so that maximizing $J(\phi)$ amounts to minimizing the time-averaged forecast error. We emphasize that the observations $y_j$ have already been collected and serve as a ‘training’ set of observations. Using the training observations $y_j$, GBO finds appropriate parameters and, once training is complete, these parameters are deployed in an ensemble DA system. Moreover, the total number of training observations is $N_T$, but we disregard the first $N_B - 1$ observations when defining the objective function to account for a possible spin-up period of the EnKF, during which errors may be large because of effects of the initial ensemble.
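
As a sketch of the coupling, the objective function of Equation (25) is simply a wrapper around one EnKF run over the training observations. The helper enkf_run below is hypothetical and stands for whichever ensemble DA system is used; it is assumed to return the forecast ensemble means for all training times, given the parameters phi.

```python
import numpy as np

def objective(phi, y_train, H, N_B, enkf_run):
    """Objective of Equation (25): negative time-averaged squared forecast
    misfit of an EnKF run that uses the parameters phi."""
    xbar_f = enkf_run(phi, y_train)     # one full DA run over the training data (expensive)
    N_T = len(y_train)
    misfit = sum(np.sum((y_train[j] - H @ xbar_f[j]) ** 2)
                 for j in range(N_B, N_T))            # disregard the spin-up period
    return -misfit / (N_T - N_B)                      # minus sign: larger J = smaller error
```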

In the definition of the objective function in (25), we use ϕ to denote the parameters we are interested in, which may be the model parameters θ, the parameters that define the localization/inflation, or both (see below). Thus, the objective function we use within GBO is the same for all three tasks (parameter estimation, tuning of localization/inflation, or both), but the ‘input’ parameters, i.e. the variables we optimize over, vary for the three tasks. This is illustrated in Figure 2, where we show (simplified) flowcharts for using GBO for model parameter estimation and for tuning localization and inflation (see Figure 1 for a more detailed flowchart of the various steps in GBO). The third case, in which we estimate model parameters and localization and inflation, is self-explanatory upon inspection of these two flowcharts.

Fig. 2. 

Flowchart of the GBO used for parameter estimation (left) and tuning of localization/inflation (right). Note that the algorithms are essentially the same for both tasks, and differ only in the parameters being handed back and forth between the ensemble DA (EnKF) and the GBO iterations. We highlight this difference by using different colors for the parameter estimation (blue) and the tuning of localization/inflation (purple).

With the objective function defined in terms of the EnKF forecast errors, GBO operates as described above and finds a maximizer of the objective function by evaluating it a few times and by updating and revising the underlying emulator (the GP model of the objective function) in view of the function evaluations. Evaluation of the objective function requires that we apply the EnKF (with fixed ensemble size) to a set of $N_T$ observations, distributed over the time interval $1, \dots, N_T$. Evaluating the objective function is, thus, computationally expensive. In fact, all other computations are negligible in comparison to the evaluation of the objective function. This is exactly the scenario GBO is designed for: optimizing an objective function that is (computationally) expensive to evaluate.

One advantage of GBO that helps with this computational challenge is its flexibility. Recall that the GBO update (expected improvement maximization, kernel optimization, and regression) can be done independently of the EnKF analysis, and the update does not require any additional model runs. Thus, one can use an existing DA framework to compute the various forecast errors and then hand these errors over to a GBO code to perform the GBO update, which in turn provides a new set of parameters to perform the DA with. One can in fact use different computers for the GBO update (which runs on a personal computer or workstation) and the DA (which might require high-performance computing resources). Moreover, by design, GBO should only require a few function evaluations so that the required computations remain manageable.

4.1. Parameter estimation

When using GBO to estimate model parameters, one in fact performs a simultaneous state and parameter estimation for the training set of observations: the states are estimated by the EnKF, using the model parameters determined by GBO. Perhaps the main advantage of using GBO for the estimation of model parameters is its flexibility in terms of how the various computational tasks are performed. This flexibility allows us to effectively separate the linear/Gaussian characteristics of many state estimation problems from the highly nonlinear and non-Gaussian characteristics of parameter estimation problems, see, e.g. Vukicevic and Posselt (2008). As indicated in the introduction, on the typically short time intervals between successive observations, state estimation is a nearly linear problem, so that linear, or nearly linear, ensemble DA methods (such as the EnKF) can work well, see e.g. Morzfeld and Hodyss (2019). The highly nonlinear parameter estimation, however, is solved by an ‘outer loop’ of the GBO optimization, which does not rely on underlying assumptions of linearity or Gaussianity. One should be careful with the terminology here: a Gaussian process model for the objective function does not imply that this function is (nearly) Gaussian.

In applications with a large state dimension, the DA system should include localization and inflation, even if the number of model parameters is small. In this case, one can use adaptive or optimal techniques when implementing the DA system (since localization/inflation parameters are intertwined with model parameters, see also below). The flexibility of GBO is advantageous here because the objective function can be defined by forecast errors of an ensemble DA system that makes use of (adaptive) localization and inflation.

Finally we emphasize that while we only consider an EnKF for the state estimation in this work, it is straightforward to modify the framework we discuss to be used with other ensemble DA techniques, which include other versions of the EnKF, but also hybrid-variational/ensemble methods or even particle filters. Here, again, the flexibility of the GBO technique is important, because GBO is largely agnostic to how the state estimation is performed.

4.2. Tuning localization and inflation

If the model parameters are (assumed to be) known, one can use GBO to determine optimal, or at least appropriate, configurations of localization and inflation. The reason is that the localization and inflation have a large effect on the forecast errors of the EnKF and, for that reason, the GBO objective function is sensitive to the localization/inflation, so that GBO can be used to search for appropriate parameters that define the localization and inflation. Using GBO in this context is similar in spirit to a data-driven technique for localization, or to finding empirical localization functions, since it requires repeatedly performing a DA on a training set of observations (each evaluation of the objective function in GBO requires a DA run). The computational cost of the training can be large, and is clearly larger than deploying adaptive or optimal localization strategies which do not require any training. An advantage of using GBO could be that it couples the search for optimal localization and inflation parameters (the optimization is done jointly over both), while empirical localization functions or data-driven techniques leave inflation aside.

We anticipate that GBO may become useful in the future, when localization techniques move to being multi-scale or involve a large number of localization scales that may vary throughout the domain. The reason is that, in this case, many parameters are needed to specify the multi-scale or spatially varying localization. Even for simple localization schemes with only a single parameter, one may find that GBO brings about significant computational savings over a grid-search (which is still the standard, in particular for finding inflation). The reason is that GBO refines a model for the objective function as this function is evaluated. Thus, the search for optimal localization/inflation parameters is not ‘blind’, but takes the structure of the function being optimized into account (though perhaps less directly than gradient-based optimization methods). Further below, we illustrate computational savings of GBO-based tuning of localization and inflation over a grid-search approach in simple numerical examples.

4.3. Estimating parameters while tuning localization and inflation

If the model parameters are unknown, then the degree of localization/inflation needed depends on what model parameters are used. For example, if the model is poor (bad choice of model parameters), then the required inflation is typically large. On the other hand, even an appropriate choice of the model parameters may lead to large forecast errors if the localization/inflation are not appropriate. It is thus natural, and in fact advisable, to simultaneously tune localization and inflation while searching for model parameters. The GBO framework naturally allows for this option by letting the parameters one optimizes over be the concatenation of the model parameters and the parameters that define the localization and inflation.

4.4. Algorithmic details

Recall that the three main steps in GBO are:

  1. maximize expected improvement;
  2. optimize the kernel (hyper-parameters);
  3. perform a regression.

Step (iii) is straightforward linear algebra (see Equations (4) and (5)), and steps (i) and (ii) are intertwined by the choice of the kernel function. In this paper, we focus on the family of kernels defined in (1). In particular, the reason for adding ‘white noise’ ($\kappa^2\,\delta(\phi,\phi')$) to the Gaussian kernel is our implementation of the EnKF. Recall that we use a stochastic implementation of the EnKF, which means that the forecast errors depend (mildly) on the stochastic perturbations used within the EnKF. This in turn implies that the objective function (the forecast error of the EnKF) is a random function, and the addition of the white-noise kernel helps ameliorate the effects of this stochasticity (see also Rasmussen (2004)).

One can, of course, consider more general families of kernels (see, e.g. Rasmussen (2004)), or one can optimize over the structure and composition of the kernel (Lunderman et al., 2018), but we are content with using a single and relatively simple kernel because this choice seemed sufficient in the numerical examples we tried for this proof-of-concept study. More realistic or real applications, however, may benefit from considering a broader family of kernels that is more capable of emulating complex functional dependencies.

With the kernel family fixed, one can derive an analytic expression of the expected improvement (see Equation (10)). Step (i), i.e. maximization of EI, is thus relatively straightforward and can be done, e.g. by using a derivative-free BFGS method (Byrd et al., 1995), (or one can compute gradients of EI and use another optimization method). To avoid being trapped in local extrema, we perform a whole suite of BFGS optimizations, each initialized by a different starting value, and then pick the parameter ϕ that leads to the highest local maximum. Again, this simple approach was sufficient in all examples we tried, but more sophisticated techniques may lead to better results.

Step (ii), i.e. optimizing the kernel, requires optimization of the negative log marginal likelihood, which also depends on the kernel family. With our choice of kernel family, one can derive the negative log marginal likelihood, $\mathcal{L}(\gamma, \kappa, \ell)$, similar to the expression given in (6). Recall that we parametrize the kernel family by a length scale $\ell$ and the parameters $\gamma$ and $\kappa$. Throughout, we fix $\kappa = 10^{-3}$, because only the ratio of $\gamma$ and $\kappa$ is important. We thus optimize the kernel over two parameters ($\ell$ and $\gamma$) and we implement this optimization via a BFGS technique (using the same code as when optimizing EI).

Finally, we use a space-filling, quasi-random Sobol sequence to initialize the GBO (Joe and Kuo, 2008). This means that we specify a (small) number of parameters $\phi$, evaluate the objective function at these points, and then build the initial GP model based on these function evaluations. We use Sobol sequences because these are a common choice, but other techniques for finding an initial set of parameters can of course also be used.

In the numerical examples below, we stop the GBO iterations after a specified number of iterations. This means that we a priori decide on a ‘computational budget’ we wish to invest and accept the (approximate) solution after this number of iterations. It is also possible to stop the iteration using other criteria, e.g. when only little progress is made or when a desired error criterion is met. Such strategies may reveal the same, or a similar, solution to the one we obtain with a specified number of iterations, but after fewer iterations. We are, however, not so concerned about the required computations (because these computations occur during a training phase) and think of a pre-specified computational budget as being a realistic case in many applications.

5. Numerical examples

We present numerical examples to illustrate the use of GBO in ensemble DA. The examples are ‘synthetic’, in the sense that we use a numerical model to generate noisy observations, which are then used in GBO to estimate model parameters, tune localization and inflation, or both. We use the classical Lorenz models (Lorenz, 1963, 1996) with classical parameters (see below). We simulate each model using a 4th order Runge-Kutta method with time step Δt=0.01 for T = 400 time units. Observations are collected at regular time intervals ΔT=0.1:

((26))
$$y_k = x_k + \zeta_k, \qquad \zeta_k \sim N(0, I), \ \text{iid}.$$

This corresponds to setting H=I and R=I in our earlier review of ensemble DA. The procedure results in 4000 observations that we use in our GBO algorithms. We stop the GBO after completing 100 iterations, and also report results after the first 10, 25, 50 and 75 iterations to track how the iterations improve the estimates. We then study the GBO results by computing the time-average of the forecast root mean square error (RMSE), where the forecast RMSE is defined by

((27))
$$\mathrm{RMSE}_j = \frac{1}{\sqrt{N_y}}\left\|y_j - \bar{x}_j^f\right\|_2.$$

Here $\bar{x}_j^f$ is the EnKF forecast mean (see above) and $N_y = N_x$ is the number of observations, which in all examples is equal to the number of state variables of the model.

The results one obtains, both in terms of the time-averaged forecast RMSE and the estimates of the parameters the GBO generates, depend to some extent on the noise that defines the observations (see Equation (26)). We address this issue by running each numerical experiment 50 times and using different random perturbations of the observations in each experiment. We then report the results of these experiments by showing ‘box-and-whisker’ plots of the time-averaged forecast RMSE. Each plot thus shows the minimum, the maximum, the sample median, the first and third quartiles, and outliers.
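
For concreteness, the following sketch shows how such synthetic observations can be generated for this setup (RK4 with Δt = 0.01, observations of the full state every ΔT = 0.1, unit observation noise). The tendency function is passed in, here illustrated with the Lorenz ’96 model of Equation (29) below; the initial condition and random seed are our own arbitrary choices, and the Lorenz ’63 setup is analogous.

```python
import numpy as np

def lorenz96(x, F=8.0):
    """Lorenz '96 tendencies (Equation (29) below) on a periodic domain."""
    return -np.roll(x, 1) * (np.roll(x, 2) - np.roll(x, -1)) - x + F

def rk4_step(f, x, dt):
    """One step of the classical 4th order Runge-Kutta method."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def synthetic_observations(f, x0, dt=0.01, dT=0.1, T=400.0, seed=0):
    """Integrate the model and observe the full state every dT time units
    with additive N(0, I) noise, Equation (26)."""
    rng = np.random.default_rng(seed)
    steps_per_obs = int(round(dT / dt))
    x, obs = np.array(x0, dtype=float), []
    for _ in range(int(round(T / dT))):              # 4000 observations in total
        for _ in range(steps_per_obs):
            x = rk4_step(f, x, dt)
        obs.append(x + rng.standard_normal(len(x)))  # y_k = x_k + zeta_k
    return np.array(obs)

# Example: Lorenz '96 with 40 state variables and an arbitrary initial state.
y_train = synthetic_observations(lorenz96, np.random.default_rng(0).standard_normal(40))
```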

5.1. Parameter estimation for Lorenz ’63

We consider the Lorenz ‘63 model (Lorenz, 1963),

((28))
$$\frac{dx}{dt} = \sigma(y - x), \qquad \frac{dy}{dt} = x(\rho - z) - y, \qquad \frac{dz}{dt} = xy - \beta z,$$
where $\sigma = 10$, $\rho = 28$ and $\beta = 8/3$, and demonstrate how to use GBO to estimate these model parameters from observations. Localization and inflation are not tuned by GBO. In fact, we do not use any localization in this example and use a multiplicative inflation of $\alpha = 1.025$, based on preliminary numerical tests.

In the parameter estimation, we assume the following bounds for the model parameters: $\sigma \in [5, 13]$, $\rho \in [25, 30]$ and $\beta \in [1, 5]$. We initialize the GBO algorithm with $N_G = 4$ parameter samples, which we obtain from a Sobol sequence over the hyper-cube defined by the parameter bounds. We then perform 100 iterations of GBO. The EnKF is run with an ensemble size $N_e = 25$. Figure 3 summarizes our numerical results. Figure 3(a) shows how forecast RMSE decreases as we increase the number of iterations in the GBO. The reduction in forecast RMSE is due to improved parameter estimates, which are shown in Figure 3(b)–(d). For reference, we also use an augmented state EnKF and an EnKF that uses a model with the ‘true’ model parameters. We note that the GBO quickly reduces forecast RMSE and finds parameter estimates that are near the true parameters. After about 50 iterations, the results do not change much, indicating that GBO has converged, and forecast errors are comparable to those of an EnKF that makes use of the ‘true’ model. After about 25 iterations, the GBO yields forecast errors that are comparable to those of an augmented state EnKF. In summary, we find that the GBO can reliably estimate the parameters of the Lorenz ’63 model given a set of observations. This is perhaps not so surprising, since this problem is relatively simple. Nonetheless, the positive results are encouraging and we think that a simple example helps with understanding the algorithm and how it functions in an ensemble DA and parameter estimation context.

Fig. 3. 

Summary of results for the parameter estimation of Lorenz ’63. (a) Box-and-whisker plot of time averaged forecast RMSE. Dashed blue: RMSE of the augmented state EnKF. Dashed green: RMSE of an EnKF using the ‘true’ model parameters. Statistical variation of RMSE for the EnKF or augmented state EnKF is too small to be visible on the scale of the plot. (b)–(d) Box-and-whisker plots of the model parameters σ, β and ρ. The limits of the y-axes in (b)–(d) are set to reflect the assumed bounds on the model parameters. In each box-and-whisker plot, the orange line is the median, the box spans the first and third quartiles, the whiskers are the minimum/maximum and dots represent outliers (if applicable).

5.2. Tuning inflation and localization in Lorenz ’96

We consider the Lorenz ‘96 model (Lorenz, 1996)

((29))
$$\frac{dx_k}{dt} = -x_{k-1}\left(x_{k-2} - x_{k+1}\right) - x_k + F,$$
where $k = 1, \dots, 40$, $F = 8$ is a forcing term, and where we assume a periodic domain so that $x_{-1} = x_{39}$, $x_0 = x_{40}$ and $x_{41} = x_1$. We assume that the forcing term $F = 8$ is known and wish to tune the localization and inflation of an EnKF (ensemble size $N_e = 40$) applied to the Lorenz ’96 model. We constrain our search for localization and inflation using the bounds $L \in (0, 20]$ for the localization length-scale and $\alpha \in [0.9, 2]$ for the multiplicative inflation. The GBO is initialized with $N_G = 2$ parameter samples and, as before, we perform 100 iterations. The results are summarized in Figure 4. We note that, as before, GBO converges quickly and reduces forecast RMSE after only a few steps (see Figure 4(a)). The inflation parameter $\alpha$ is determined to be slightly larger than one, and the variation about the mean across the 50 experiments is small (see Figure 4(c)). The localization parameter $L$ shows more variation over the 50 experiments (see Figure 4(b)), which is due to the fact that the cost function is relatively ‘flat’ (various localization parameters lead to similar forecast errors). This ‘flatness’ of the cost function may be problem dependent.

Fig. 4. 

Summary of results for tuning localization and inflation in Lorenz ’96. (a) Box-and-whisker plot of time averaged forecast RMSE. Dashed green: time-averaged forecast RMSE of an EnKF with localization/inflation tuned by a grid-search (statistical variation of RMSE is too small to be visible on the scale of the plot). (b) and (c) show box-and-whisker plots of localization (L) and inflation (α) parameters as a function of GBO iterations. The limits of the y-axes in (b) and (c) are set to reflect the assumed bounds on the localization/inflation parameters. In each box-and-whisker plot, the orange line is the median, the box spans the first and third quartiles, the whiskers are the minimum/maximum and dots represent outliers (if applicable).

We compare the results of GBO with an EnKF where we tune localization and inflation using the grid-search approach. Specifically, we use a uniform 23 × 20 grid over the rectangle $[0.9, 2] \times (0, 20]$ defined by the bounds for the inflation and localization parameters. Thus, this grid-search requires 460 EnKF runs, but we obtain a similar forecast RMSE (see Figure 4(a)) after 25 GBO iterations, which require a total of 27 EnKF runs (two for initialization and 25 iterations). Beyond 25 iterations, the localization and inflation tuned by GBO result in smaller forecast errors than what can be obtained by the grid-search. In summary, even for a ‘simple’ localization and inflation scheme with only two parameters, the GBO approach can lead to significant computational savings when compared to a grid-search technique.

Our study does not address possible computational savings compared to techniques in which the localization is tuned via data-driven methods or empirical localization functions. It is clear, however, that the training phase of GBO (or other, similar techniques) is more expensive than deploying an adaptive/optimal localization or inflation strategy (which requires no tuning). Our main goal is to show a proof-of-concept that GBO can be helpful when tuning localization and inflation. More specific ideas, such as choosing localization (inflation) adaptively while using GBO for the inflation (localization), are left for future studies in the context of specific scientific problems. We further expect that the computational savings brought about by using GBO may become more important if the tuning of localization and inflation needs to be repeated, e.g. in view of new observations, changes to the resolution of the model, or other significant modifications of the DA system. In fact, the use of GBO may enable a more frequent re-tuning of the localization and inflation, which is currently avoided (in operational centers) due to a large computational cost.

5.3. Parameter estimation while tuning localization and inflation in Lorenz ’96

We again consider the Lorenz ’96 model, but now estimate the forcing term F, while simultaneously tuning the localization and inflation. Our results are summarized in Figure 5. We note that the GBO iterations reduce forecast RMSE (see Figure 5(a)) and that after about 50 iterations, GBO performs as well as an augmented state EnKF, for which we tuned the localization and inflation parameters via a grid-search. We note that the tuning of the augmented state EnKF relied on 460 runs with the augmented state EnKF (we used the same grid for the localization and inflation as in the previous example). The GBO thus leads to a significant reduction in the computational requirements when compared to the augmented state EnKF. We also compare the forecast RMSE of GBO to the forecast RMSE of a tuned EnKF which uses a model with the correct forcing term (F = 8), and note that the GBO leads to a forcing term and associated localization/inflation that make the RMSE of the ‘uncertain’ model comparable to that of the ‘true’ model. All EnKFs use an ensemble size Ne = 40. Figures 5(b)–(d) illustrate how the parameters (model parameter F and localization/inflation parameters L and α) vary with the number of GBO iterations. As before, we note a large variation of the localization parameter over the 50 experiments, which again is due to the fact that the objective function is relatively flat (several localization parameters lead to similar forecast RMSE). We further note that the forcing parameter F is correctly and accurately identified after about 10–25 iterations, so that GBO spends most of the iterations beyond 25 fine-tuning the localization and inflation.

Fig. 5. 

Summary of results for tuning localization and inflation while estimating a model parameter in Lorenz ’96. (a) Box-and-whisker plot of time averaged forecast RMSE. Dashed blue: RMSE of the augmented state EnKF. Dashed green: RMSE of an EnKF using the ‘true’ model parameters. Statistical variation of RMSE for the EnKF or augmented state EnKF is too small to be visible on the scale of the plot. (b)–(d) show box-and-whisker plots of the forcing F, localization (L) and inflation (α) parameters as a function of GBO iterations. The limits of the y-axes in (b)–(d) are set to reflect the assumed bounds on the model and localization/inflation parameters. In each box-and-whisker plot, the orange line is the median, the box spans the first and third quartiles, the whiskers are the minimum/maximum and dots represent outliers (if applicable).

In summary, we have demonstrated on a simple example how GBO can be used for parameter estimation while simultaneously tuning the localization and inflation. The GBO approach leads to significant computational savings when compared to an augmented state EnKF with its localization/inflation tuned via a grid-search. We note that our numerical experiments are not exhaustive: one can easily come up with schemes where one uses adaptive localization or inflation while estimating model parameters, or various other combinations (e.g. using only adaptive localization, tuning inflation and model parameters, etc.). As indicated before, our numerical experiments should be viewed as a proof-of-concept that helps assess the potential of GBO for important tasks in ensemble DA.

6. Conclusions

Global Bayesian optimization (GBO) is a derivative-free optimization technique that is used to optimize objective functions that are expensive to evaluate (computationally or otherwise). The basic idea of GBO is to construct and revise an emulator of the objective function and to use this emulator for the function’s optimization. GBO is widely used in the tech-industry and in engineering, and, to a lesser extent, in the Earth sciences. The use of GBO for parameter and state estimation problems was pioneered by Abbas et al. (2016) and we continue this line of work by discussing the use of GBO in ensemble data assimilation (DA), where the goal is to estimate a model’s state based on noisy observations.

We specifically discussed three related problems in ensemble DA and how these can be tackled by GBO:

  1. parameter estimation problems in which some, or all, of the parameters that define the model used in the ensemble DA are uncertain or unknown;
  2. tuning of the ensemble DA (localization and inflation);
  3. estimation of model parameters while simultaneously tuning the ensemble DA.

We believe that problem (ii) is particularly relevant in atmospheric sciences, especially when multi-scale aspects of the ensemble DA are more directly accounted for. We used GBO in all three problems by defining the objective function in terms of forecast errors of the ensemble DA. In this way, we couple the GBO to an ensemble DA framework indirectly by the definition of the objective function. This formulation makes the computations flexible and one can couple an existing GBO code (which runs on a personal computer or workstation) to an existing ensemble DA code (which may require high-performance computing), which makes GBO easy to use in complex and realistic applications. Moreover, the GBO is agnostic to which ensemble DA method is used and one can easily couple GBO to ensemble Kalman filters (EnKF), or ensemble-variational methods.

The GBO has proved effective for all three tasks in simple numerical experiments with the Lorenz models (L63 and L96). We anticipate that using GBO for the tuning of localization and inflation may become important when the community moves to multi-scale techniques (which are needed in view of an ever increasing resolution of the underlying numerical models). For parameter estimation, the flexibility of GBO is advantageous and essentially decouples the linear aspects of the problem (the state estimation) from its nonlinear and non-Gaussian aspects (the parameter estimation). We used simple numerical examples with the classical Lorenz models (Lorenz, 1963, 1996) to illustrate our ideas and to demonstrate the computational savings one can achieve by using GBO. We hope that this work sparks interest in the use of GBO in ensemble DA and that GBO-based methods will be applied to more realistic, or real, problems in the future.