
For modelling geophysical systems, large-scale processes are described through a set of coarse-grained dynamical equations while small-scale processes are represented via parameterizations. This work proposes a method for identifying the best possible stochastic parameterization from noisy data. State-of-the-art sequential estimation methods such as Kalman and particle filters do not achieve this goal successfully because both suffer from the collapse of the posterior distribution of the parameters. To overcome this intrinsic limitation, we propose two statistical learning methods. They are based on the combination of the ensemble Kalman filter (EnKF) with either the expectation–maximization (EM) algorithm or the Newton–Raphson (NR) method, used to maximize a likelihood associated with the parameters to be estimated. The EM and NR algorithms are applied primarily in the statistics and machine learning communities and are brought here into the context of data assimilation for the geosciences. The methods are derived using a Bayesian approach for a hidden Markov model and are applied to infer deterministic and stochastic physical parameters from noisy observations in coarse-grained dynamical models. Numerical experiments are conducted using the Lorenz-96 dynamical system with one and two scales as a proof of concept. The imperfect coarse-grained model is represented by a one-scale Lorenz-96 system in which a stochastic parameterization is incorporated to account for the small-scale dynamics. The algorithms are able to identify the optimal stochastic parameterization with good accuracy under moderate observational noise. The proposed EnKF-EM and EnKF-NR are promising efficient statistical learning methods for developing stochastic parameterizations in high-dimensional geophysical models.

The statistical combination of observations of a dynamical model with a priori information of physical laws allows the estimation of the full state of the model even when it is only partially observed. This is the main aim of data assimilation (Kalnay,

There are several multi-scale physical systems which are modelled through coarse-grained equations. The most paradigmatic cases are climate models (Stensrud,

One standard methodology to estimate physical model parameters from observations in data assimilation techniques, such as the traditional Kalman filter, is to augment the state space with the parameters (Jazwinski et al.,

The collapse of the parameter posterior distribution found in both ensemble Kalman filters (Delsole and Yang,

The expectation–maximization (EM) algorithm (Dempster et al.,

In this work, we combine the ensemble Kalman filter (Evensen,

This work is organized as follows. Section

A hidden Markov model is defined by a stochastic non-linear dynamical model

where

The observations at time

where

In the EM literature, the term ‘parameters’ is used for all the parameters of the likelihood function including the moments of the statistical distributions. Here, the parameters of the likelihood function are referred to more specifically as

The estimation method we derived is based on maximum likelihood estimation. Given a set of independent and identically distributed (iid) observations from a probability density function represented by

The estimation technique used in this work is a batch method: a set of observations taken along a time interval is used to estimate the model state trajectory that is closest to them, considering measurement and model errors with a least-square criterion to be established below. The simultaneous use of observations distributed in time is essential to capture the interplay of the several likelihood parameters included in the estimation problem. The required minimal length

The EM algorithm maximizes the log-likelihood of observations as a function of the likelihood parameters ^{1}

An analytical form for the log-likelihood function, (

Let us introduce in the integral (

We assume that the support of

If we choose

where

From (

where we use the notation

The joint distribution of a hidden Markov model using the definition of the conditional probability distribution reads

The model state probability density function can be expressed as a product of the transition density from

The observations are mutually independent when conditioned on the current state (see (

Then, replacing (

If we now assume a Gaussian hidden Markov model, and that the covariances

The Markov hypothesis implies that model error is not correlated in time. Otherwise, we would have cross terms in the model error summation of (

We consider (

In this Gaussian state-space model, the maximum of the intermediate function in the EM algorithm, (

Note that

Differentiating (

In the case of the observation error covariance, the solution is:

Therefore, we can summarize the EM algorithm for a hidden Markov model as:
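To make this loop concrete, the following is a minimal runnable sketch for a scalar linear Gaussian state-space model, where the E-step is an exact Kalman filter plus RTS smoother and the M-step updates the model-noise variance Q in closed form. The model, parameter values and variable names are illustrative only; in the paper's setting the E-step is carried out with the EnKF rather than the exact filter.

```python
import numpy as np

def kalman_em_Q(y, a=0.9, R=0.25, Q0=2.0, n_iter=30):
    """EM estimation of the model-noise variance Q in the scalar
    linear Gaussian state-space model
        x_k = a x_{k-1} + w_k,   w_k ~ N(0, Q)
        y_k = x_k + v_k,         v_k ~ N(0, R)
    E-step: Kalman filter + RTS smoother; M-step: closed form."""
    K = len(y)
    Q = Q0
    loglik = []
    for _ in range(n_iter):
        # --- E-step, forward pass (Kalman filter) ---
        mf = np.zeros(K); Pf = np.zeros(K)   # forecast mean/variance
        ma = np.zeros(K); Pa = np.zeros(K)   # analysis mean/variance
        m, P, ll = 0.0, 10.0, 0.0            # broad prior on x_0
        for k in range(K):
            if k == 0:
                mf[k], Pf[k] = m, P          # prior acts as first forecast
            else:
                mf[k], Pf[k] = a * m, a * a * P + Q
            S = Pf[k] + R                    # innovation variance
            ll += -0.5 * (np.log(2 * np.pi * S) + (y[k] - mf[k])**2 / S)
            G = Pf[k] / S                    # Kalman gain
            ma[k] = mf[k] + G * (y[k] - mf[k])
            Pa[k] = (1.0 - G) * Pf[k]
            m, P = ma[k], Pa[k]
        # --- E-step, backward pass (RTS smoother) ---
        ms = ma.copy(); Ps = Pa.copy(); J = np.zeros(K)
        for k in range(K - 2, -1, -1):
            J[k] = Pa[k] * a / Pf[k + 1]     # smoother gain
            ms[k] = ma[k] + J[k] * (ms[k + 1] - mf[k + 1])
            Ps[k] = Pa[k] + J[k]**2 * (Ps[k + 1] - Pf[k + 1])
        # --- M-step: Q <- mean expected squared model error ---
        num = 0.0
        for k in range(1, K):
            Exx = Ps[k] + ms[k]**2                      # E[x_k^2]
            Exm = J[k - 1] * Ps[k] + ms[k] * ms[k - 1]  # E[x_k x_{k-1}]
            Emm = Ps[k - 1] + ms[k - 1]**2              # E[x_{k-1}^2]
            num += Exx - 2.0 * a * Exm + a * a * Emm
        Q = num / (K - 1)
        loglik.append(ll)
    return Q, loglik

# synthetic observations generated with known true parameters
rng = np.random.default_rng(0)
K, a, Q_true, R = 2000, 0.9, 0.5, 0.25
x = np.zeros(K)
for k in range(1, K):
    x[k] = a * x[k - 1] + rng.normal(0.0, np.sqrt(Q_true))
y = x + rng.normal(0.0, np.sqrt(R), K)
Q_hat, ll = kalman_em_Q(y)
```

Because the M-step here is the exact maximizer of the intermediate function, the recorded log-likelihood sequence is non-decreasing, which is the defining property of the EM iteration.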

The basic steps of this EM algorithm are depicted in Fig. ^{2}

where

The EM algorithm applied to a linear Gaussian state-space model using the Kalman filter was first proposed by Shumway and Stoffer (

The EM algorithm is highly versatile and can be readily implemented. However, it requires the optimal value in the maximization step to be computed analytically, which limits the range of its applications. If physical deterministic parameters of a non-linear model need to be estimated, an analytical expression for the optimal likelihood parameter values may not be available. Another approach to find an estimate of the likelihood parameters consists in maximizing an approximation of the likelihood function
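This alternative can be illustrated on the same kind of scalar state-space model: instead of a closed-form M-step, the innovation-form log-likelihood is maximized directly by a Newton–Raphson iteration with finite-difference derivatives (a derivative-free optimizer plays this role later in the paper). All names and numerical values below are illustrative assumptions, not the paper's.

```python
import numpy as np

def innovation_loglik(Q, y, a=0.9, R=0.25):
    """Innovation-form log-likelihood of the scalar model
    x_k = a x_{k-1} + w_k, y_k = x_k + v_k, as a function of the
    model-noise variance Q (the EnKF-based likelihood approximation
    plays this role in higher dimensions)."""
    m, P, ll = 0.0, 10.0, 0.0
    for k, obs in enumerate(y):
        mf = m if k == 0 else a * m
        Pf = P if k == 0 else a * a * P + Q
        S = Pf + R                           # innovation variance
        ll += -0.5 * (np.log(2 * np.pi * S) + (obs - mf)**2 / S)
        G = Pf / S
        m, P = mf + G * (obs - mf), (1.0 - G) * Pf
    return ll

def newton_raphson_Q(y, Q0=2.0, n_iter=25, h=1e-4):
    """Newton-Raphson maximization of the log-likelihood in Q with
    finite-difference derivatives and simple backtracking."""
    f = lambda Q: innovation_loglik(Q, y)
    Q = Q0
    for _ in range(n_iter):
        g = (f(Q + h) - f(Q - h)) / (2.0 * h)           # gradient
        H = (f(Q + h) - 2.0 * f(Q) + f(Q - h)) / (h * h)  # Hessian
        step = -g / H if H < 0.0 else 0.1 * g  # gradient-ascent fallback
        while Q + step <= 0.0 or f(Q + step) < f(Q):
            step *= 0.5                        # keep Q > 0, ll non-decreasing
            if abs(step) < 1e-12:
                break
        Q += step
    return Q

# synthetic observations with known true Q
rng = np.random.default_rng(0)
K, a, Q_true, R = 1000, 0.9, 0.5, 0.25
x = np.zeros(K)
for k in range(1, K):
    x[k] = a * x[k - 1] + rng.normal(0.0, np.sqrt(Q_true))
y = x + rng.normal(0.0, np.sqrt(R), K)
Q_hat = newton_raphson_Q(y)
```

The same structure carries over when the likelihood is only available through an ensemble filter: each function evaluation is replaced by one filter pass over the observation window.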

Following Carrassi et al. (

with the convention

Replacing (

If we assume Gaussian distributions and linear dynamical and observational models, the integrand in (

and the prior forecast distribution,

where

The resulting approximation of the observation likelihood function which is obtained replacing (

where

The evaluation of the model evidence (

For all the cases in which we can find an analytical expression for the maximization step of the EM algorithm, we can also derive a gradient of the likelihood function (Cappé et al.,

(a) Flowchart of the EM algorithm (left panel). (b) NR flowchart (right panel). Each column of the matrix

A first set of numerical experiments consists of twin experiments in which we first generate a set of noisy observations using the model with known parameters. Then, the maximum likelihood estimators are computed using the same model with the synthetic observations. Since we know the true parameters, we can evaluate the error in the estimation and the performance of the proposed algorithms. A second set of experiments applies the method for model identification. The (imperfect) model represents the multi-scale system through a set of coarse-grained dynamical equations and an unknown stochastic physical parameterization. The model-identification experiments are imperfect model experiments in which we seek to determine the stochastic physical parameterization of the small-scale variables from observations. In particular, the ‘nature’ or true model is the two-scale Lorenz-96 model and it is used to generate the synthetic observations, while the imperfect model is the one-scale Lorenz-96 model forced by a physical parameterization which has to be identified. This parameterization should represent the effects of small-scale variables on the large-scale variables. In this way, the coarse-grained one-scale model with a physical parameterization with tunable deterministic and stochastic parameters is adjusted to account for the (noisy) observed data. We evaluate whether the EM algorithm and the NR method are able to determine the set of optimal parameters, assuming they exist.

The synthetic observations are taken from the known nature integration by, see (

with

In the twin experiments, we use the one-scale Lorenz-96 system and a physical parameterization that represents subgrid-scale effects. The nature integration is conducted with this model and a set of ‘true’ physical parameter values. These parameters characterize both deterministic and stochastic processes. By virtue of the perfect model assumption, the model used in the estimation experiments is exactly the same as the one used in the nature integration, except that the physical parameter values are assumed to be unknown. Although for simplicity we call this a ‘twin experiment’, it could be thought of as a model selection experiment with parametric model error in which we know the ‘perfect functional form of the dynamical equations’ but the model parameters are completely unknown and need to be selected from noisy observations.
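A minimal sketch of such a nature run is given below: the one-scale Lorenz-96 model with a linear deterministic parameterization, additive stochastic model noise, and noisy observations of every variable. The parameterization form and all numerical values are illustrative choices, not the ones used in the experiments.

```python
import numpy as np

def l96_tendency(x, F=8.0, a0=2.0, a1=0.1):
    """One-scale Lorenz-96 tendency with a linear deterministic
    parameterization a0 + a1*x standing in for subgrid-scale effects
    (illustrative parameter values)."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F - (a0 + a1 * x)

def rk4_step(x, dt, **kw):
    """One fourth-order Runge-Kutta step of the deterministic part."""
    k1 = l96_tendency(x, **kw)
    k2 = l96_tendency(x + 0.5 * dt * k1, **kw)
    k3 = l96_tendency(x + 0.5 * dt * k2, **kw)
    k4 = l96_tendency(x + dt * k3, **kw)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def nature_and_obs(n=8, n_steps=500, dt=0.05, q=0.3, r=1.0, seed=0):
    """Stochastic nature run (additive model noise of amplitude
    q*sqrt(dt) after each deterministic step, Euler-Maruyama style)
    plus synthetic observations with noise of standard deviation r."""
    rng = np.random.default_rng(seed)
    x = 8.0 + 0.1 * rng.standard_normal(n)
    for _ in range(1000):                  # spin-up onto the attractor
        x = rk4_step(x, dt)
    traj = np.empty((n_steps, n)); obs = np.empty((n_steps, n))
    for k in range(n_steps):
        x = rk4_step(x, dt) + q * np.sqrt(dt) * rng.standard_normal(n)
        traj[k] = x
        obs[k] = x + r * rng.standard_normal(n)
    return traj, obs

traj, obs = nature_and_obs()
```

In the twin-experiment setting, `obs` would be handed to the estimation algorithms while the true values of `a0`, `a1` and `q` are withheld and treated as unknowns.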

The equations of the one-scale Lorenz-96 model are:

where

We have included in the one-scale Lorenz-96 model a

where a noise term,

has been added to

In the model-identification experiments, the nature integration is conducted with the two-scale Lorenz-96 model (Lorenz,

with

where

The imperfect model used in the model-identification experiments is the one-scale Lorenz-96 model (

Log-likelihood function as a function of (a) model noise for three true observational noise values,

Convergence of the NR maximization as a function of the iteration of the outer loop (inner loops are composed of

Convergence of the EM algorithm as a function of the iteration for different observation time lengths (evidencing window). An experiment with

Estimated model noise as a function of the iteration in the EM algorithm. (a) Mean diagonal model noise (true value is 1.0). (b) Mean off-diagonal absolute model noise value (true value is 0.0).

As used in previous works (see e.g. Wilks

For the model-identification experiments, we aim to mimic the dynamics of the large-scale equations of the two-scale Lorenz-96 system with the one-scale Lorenz-96 system (

The EnKF implementation we use is the ensemble transform Kalman filter (Hunt et al.,

The optimization method used in the NR maximization is ‘newuoa’ (Powell,

(a) Estimated mean deterministic parameters,

(a) Estimated deterministic parameters as a function of the EM iterations for the model-identification experiment. Twenty experiments with random initial deterministic and stochastic parameters are shown. (b) Estimated stochastic parameters. (c) Log-likelihood function.

The nature integration is obtained from the one-scale Lorenz-96 model (

A first experiment examines the log-likelihood (

We conducted a second experiment using the same observations but the estimation of model noise covariance matrix is performed through the NR method. The control space is of 8

(a) Log-likelihood as a function of the

The convergence of the EM algorithm applied for the estimation of model noise covariance matrix only (8

(a) Scatterplot of the true small-scale effects in the two-scale Lorenz-96 model as a function of a large-scale variable (coloured dots) and scatterplot of the deterministic parameterization with optimal parameters (white dots). (b) Scatterplot from the stochastic parameterization with optimal parameters obtained with the EM algorithm and (c) with the NR method. (d) Scatterplot given by a constrained random walk with optimal EM parameters.

Comparing the standard

A second set of twin experiments evaluates the estimation of deterministic and stochastic parameters from a physical parameterization. The model used to generate the synthetic observations is (

Figure

A similar experiment was conducted with NR maximization for the same synthetic observations. A scaling of

As a proof-of-concept model-identification experiment, we now use synthetic observations with an additive observational noise of

NR maximization is applied to the same set of synthetic observations. The mean estimated deterministic and stochastic parameters are

Long integrations (

Two novel methods to estimate and characterize physical parameterizations in stochastic multi-scale dynamical systems have been introduced: the expectation–maximization algorithm and Newton–Raphson likelihood maximization, each combined with the ensemble Kalman filter (EnKF-EM and EnKF-NR, respectively). These new methods are suitable for the estimation of both stochastic and deterministic parameters from sparse and noisy observations. Both methods determine the maximum of the observation likelihood, also known as model evidence, given a set of spatio-temporally distributed observations, using the ensemble Kalman filter to combine observations with model predictions. The methods are first evaluated in a controlled model experiment in which the true parameters are known, and then on the two-scale Lorenz-96 dynamics represented with a stochastic coarse-grained model. The performance of the methods is excellent, even in the presence of moderate observational noise. The methods require neither inflation factors nor any other tunable parameters, because the methodology includes an additive model noise term or stochastic parameters, which compensate for the underestimation of the forecast error covariance. The level of model noise to be added is not chosen arbitrarily but is the one that maximizes the observation likelihood.

The estimation based on the expectation–maximization algorithm gives very promising results in these medium-sized experiments (

The computational cost of the algorithm is directly related to the number of iterations needed for convergence. Each iteration requires the application of an ensemble Kalman filter and a smoother (which needs an extra inversion through singular value decomposition). In the model-identification experiments, 50 EM iterations were chosen as a secure option, with a minimal iteration number of 20 for coarse convergence. In an operational high-dimensional data assimilation system, the application of 20–50 ensemble Kalman filters would be prohibitive. On the other hand, these experiments would be computationally feasible for model identification, during the model development phase, even for high-dimensional systems or for tuning the data assimilation scheme.

The estimation based on the NR method also presents good convergence for the twin experiment with an additive stochastic parameter. For the more realistic model-identification experiments, the model evidence exhibits some noise, which may affect the convergence. The derivative-free optimization requires about 10 iterations of

The EM algorithm assumes a Gaussian additive model error term, which leads to an analytical expression for the maximization step. Besides, the derivation of the likelihood function in the NR method also assumes Gaussian additive model and observation errors. The methods could be extended to non-Gaussian statistics, in which case the maximization step in the EM algorithm can be conducted through an iterative optimization method. For cases with multimodal statistics, the application of a particle filter (van Leeuwen,

Both estimation methods can be applied to a set of different dynamical models to address which one is more reliable given a set of noisy observations: the so-called ‘model selection’ problem. A comparison of the likelihood from the different models with the optimal parameters gives a measure of the model fidelity to the observations. Majda and Gershgorin (

Hannart et al. (

The increase of data availability in many areas has fostered the number of applications of the ensemble Kalman filter. In particular, it has been used for influenza forecasting (Shaman et al.,

The authors wish to acknowledge the members of the DADA CNRS team for insightful discussions, in particular Alexis Hannart, Michael Ghil and Juan Ruiz.

No potential conflict of interest was reported by the authors.

We use the notation ‘; ’,

In principle what is required in (

The ensemble Kalman filter determines the probability density function of a dynamical model conditioned on a set of past observations, i.e.

where

To obtain the estimated hidden state, called

The optimal ensemble member weights

All the quantities in (

To determine each ensemble member of the analysis state, the ensemble transform Kalman filter uses the square root of the analysis covariance matrix; thus it belongs to the so-called square-root filters,

where the perturbations of

The analysis state is evolved to the time of the next available observation
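The analysis step described above can be sketched as follows for a linear observation operator, following the ensemble-space (square-root) formulation, with the symmetric square root used to generate the analysis perturbations. The dimensions and values in the toy check are arbitrary illustrative choices.

```python
import numpy as np

def etkf_update(Xf, y, H, R):
    """Ensemble transform Kalman filter analysis step in ensemble
    (weight) space, square-root form.
    Xf: (n, m) forecast ensemble; y: (p,) observation;
    H: (p, n) linear observation operator; R: (p, p) obs-error covariance."""
    n, m = Xf.shape
    xb = Xf.mean(axis=1)
    Xb = Xf - xb[:, None]                 # forecast perturbations
    Yb = H @ Xb                           # perturbations mapped to obs space
    Rinv = np.linalg.inv(R)
    # analysis covariance in ensemble space
    Pa_t = np.linalg.inv((m - 1) * np.eye(m) + Yb.T @ Rinv @ Yb)
    wa = Pa_t @ Yb.T @ Rinv @ (y - H @ xb)          # mean weight vector
    # symmetric square root of (m-1)*Pa_t gives the perturbation weights
    evals, evecs = np.linalg.eigh((m - 1) * Pa_t)
    Wa = evecs @ np.diag(np.sqrt(np.maximum(evals, 0.0))) @ evecs.T
    return xb[:, None] + Xb @ (wa[:, None] + Wa)    # analysis ensemble

# toy check: 3 state variables, first two observed, 20 members
rng = np.random.default_rng(1)
Xf = 2.0 + rng.standard_normal((3, 20))
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
y = np.array([1.0, 3.0])
Xa = etkf_update(Xf, y, H, 0.01 * np.eye(2))
```

With small observation error, the analysis mean of the observed components is drawn close to the observations and the ensemble spread in those components is reduced, as expected from the Kalman update.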

The smoother determines the probability density function of a dynamical model conditioned on a set of past and future observations, i.e.

where

In practice, the pseudo inversion of the forecast state perturbation matrix