A- A+
Alt. Display

# Stochastic Parameterization Using Compressed Sensing: Application to the Lorenz-96 Atmospheric Model

## Abstract

Growing set of optimization and regression techniques, based upon sparse representations of signals, to build models from data sets has received widespread attention recently with the advent of compressed sensing. This paper deals with the parameterization of the Lorenz-96 (Lorenz, 1995) model with two time-scales that mimics mid-latitude atmospheric dynamics with microscopic convective processes. Compressed sensing is used to build models (vector fields) to emulate the behavior of the fine-scale process, so that explicit simulations become an online benchmark for parameterization. We apply compressed sensing, where the sparse recovery is achieved by constructing a sensing/dictionary matrix from ergodic samples generated by the Lorenz-96 atmospheric model, to parameterize the unresolved variables in terms of resolved variables. Stochastic parameterization is achieved by auto-regressive modelling of noise. We utilize the ensemble Kalman filter for data assimilation, where observations (direct measurements) are assimilated in the low-dimensional stochastic parameterized model to provide predictions. Finally, we compare the predictions of compressed sensing and Wilks’ polynomial regression to demonstrate the potential effectiveness of the proposed methodology.

Keywords:
How to Cite: Mukherjee, A., Aydogdu, Y., Ravichandran, T. and Namachchivaya, N.S., 2022. Stochastic Parameterization Using Compressed Sensing: Application to the Lorenz-96 Atmospheric Model. Tellus A: Dynamic Meteorology and Oceanography, 74(2022), pp.300–317. DOI: http://doi.org/10.16993/tellusa.42
Published on 26 Apr 2022
Accepted on 21 Feb 2022            Submitted on 21 Feb 2022

## 1 Introduction

Global weather and climate models are the best available tools for projecting likely climate changes. However, current climate model projections are still considerably uncertain, due to uncertainties and errors in mathematical models, “curse of dimensionality”, model initialization, as well as inadequate observations and computational errors. Multi-layer models are adequate to capture the dynamics of large-scale atmospheric phenomena. Solving for small-scale processes with help of large eddy simulation (LES) models, on a high-resolution grid embedded (with grid spacing of a few tens of meters and can explicitly simulate the clouds and turbulence represented by parameterization schemes) within the low-resolution grid is very useful, but it’s applicability remains limited to its high computational cost. In this case, learning methods are used to develop parameterization schemes that utilize the explicitly simulated clouds and turbulence. In practice, low-resolution models resolve the large-scale processes while the effects of the small-scale processes are often parameterized in terms of the large-scale variables.

In 1995, Ed Lorenz came up with a simple model that captured the essence of this problem with two stacked layers, which is referred to as the Lorenz-96 (Lorenz, 1995) model. Even though this simple model is still a long way from the Primitive Equations that we need to eventually study, it is hoped that the parameterization schemes and filtering methods presented here can be ultimately adapted for the realistic models that need to be computationally solved for weather and climate predictions. We use the Lorenz-96 model to demonstrate the sparse optimization methods and regression techniques to build models from data. This model, as discussed in Section 2, has nonlinearities resembling the advective nonlinearities of fluid dynamics and a multiscale coupling of slow and fast variables similar to what is seen in the two-layer atmospheric models. Resolving all the relevant length and time scales in simulations of the turbulent atmospheric circulations remains out of reach due to computational constraints. Hence, the small-scale processes must be represented indirectly, through parameterization schemes which estimate their net impact on the resolved atmospheric state.

The objective of this paper is to develop a data-driven method to approximate the effects of small-scale processes with simplified functions called parameterizations which are characterized by compressed sensing (CS). Such parameterization schemes are inherently approximate, and the development of schemes which produce realistic simulations is a central challenge of model development. Most parameterization schemes depend critically on various parameters whose values cannot be determined a priori but must instead be estimated through data. Recent developments in optimization and sparse representations in high-dimensional spaces have provided powerful new techniques.

This paper deals with the parameterization of the Lorenz-96 model with two time-scale simplified ordinary differential equations describing advection, damping and forcing. We assume that the model that expresses the resolved atmospheric variables of the known physics or the resolved modes are well understood. Furthermore, Lorenz in introducing this multiscale model showed that the small-scale features of the atmosphere act like forcing. We present a CS and sparse recovery approach for constructing approximations to the Lorenz-96 model system that blends physics-informed candidate functions with functions that enforce sparse structure. The ultimate aim is to adapt the mathematical methods developed here for the realistic models that can be solved computationally to examine the turbulent atmospheric circulations. In the past few years, data-driven parameterization of the Lorenz-96 model using machine learning and deep learning techniques (Gagne et al., 2020) has shown promising results.

The aim of any parameterization is to develop enhanced models that capture the essence of dynamics and produce reliable predictions. Besides deterministic parameterization, stochastic parameterization was applied to the Lorenz-96 model with several kinds of noise and provided promising results (Arnold et al., 2013). In our problem, we extract the useful elements for parameterization by using CS that relies on the restricted isometry property (RIP) (Candès and Wakin, 2006). The RIP is satisfied due to the chaotic and ergodic nature of the Lorenz-96 model (numerically verified), showing that CS would be a great fit for stochastic parametrization. In this way, we are able to capture the key components that allow us to represent the unresolved variables in terms of resolved variables with only a few elements. After the parameterization of the deterministic part with compressed sensing, the stochastic part is achieved by auto-regressive modelling of noise. These parameterized models are then used for the predictions by using ensemble Kalman filter, where the observations are assimilated through the slow time scale.

The paper is organized as follows. In Section 2, we will introduce the 40-dimensional Lorenz-96 model, that was originally suggested by (Lorenz, 1995), and describe the characterization of complex dynamics including statistical properties, Lyapunov exponents, marginal probability density functions (PDFs) and auto-correlation functions (ACFs). Section 3 will present the details on data-driven reduction methods such as regression and compressed sensing with sparse optimization. In Section 4, stochastic parameterization using the compressed sensing approach will be illustrated on Lorenz-96 model. Section 5 will describe a general ensemble filtering algorithm, where the forecasting step is straightforward because the Markov kernel is known, while the update step is much more sophisticated. This nonlinear filtering method will be applied to parameterized models. Finally, conclusions and future work discussion will follow in Section 6. To ensure that our results are reproducible, we have provided our code in (Mukherjee et al., 2021).

## 2 The Lorenz-96 Atmospheric Model

Let us now turn to a simple atmospheric model, which nonetheless exhibits many of the difficulties arising in realistic models, to gain insight into predictability and data assimilation. The Lorenz-96 model was originally introduced in (Lorenz, 1995) to mimic multiscale mid-latitude atmospheric dynamics for an unspecified scalar meteorological quantity at K equidistant grid points along a latitude circle:

(1)
${\stackrel{˙}{X}}_{k}=-{X}_{k-1}\left({X}_{k-2}-{X}_{k+1}\right)-{X}_{k}+F-\frac{\mathrm{hc}}{b}\sum _{j=1}^{J}{Z}_{j,k}$
(2)
${\stackrel{˙}{Z}}_{j,k}=-cb{Z}_{j+1,k}\left({Z}_{j+2,k}-{Z}_{j-1,k}\right)-c{Z}_{j,k}+\frac{\mathrm{hc}}{b}{X}_{k}$

Equation (1) describes the dynamics of some atmospheric quantity X, and Xk can represent the value of this variable at time t in the kth sector defined over a latitude circle in the mid-latitude region. (Note that we use subscripts k and j to conform with the typical spatial indexing notation used for the Lorenz-96 model. In sections that follow, subscripts k and j will be used as discrete time indices, not to be confused with the spatial indices of the Lorenz model). The latitude circle is divided into K sectors. For convenience, the vector field associated with the slow components or the resolved variables is assumed a priori known with representable physics, which is denoted by

(3)
${\mathcal{G}}_{k}\left(X\right):=-{X}_{k-1}\left({X}_{k-2}-{X}_{k+1}\right)-{X}_{k}+F$

Figure 1 illustrates the behavior of a generic slow state Xk (shown in the inner ring with K = 8 for illustration), the fast states in the 1st sector, that is, Z1,1, …, ZJ,1 (shown in the outer ring with J = 7 for illustration), and the fast scale forcing that enters (1); for the Xk component the fast scale forcing is

(4)
${U}_{k}=-\frac{\mathrm{hc}}{b}\sum _{j=1}^{J}{Z}_{j,k}$
Figure 1

A recreation of a similar image by Wilks (2005), that illustrates the two timescale Lorenz-96 model.

Each Xk is coupled to its neighbors Xk+1, Xk–1, and Xk–2 by (1). Xk and Zj,k are periodic on the spatial domain, and (1) and (2) applies for all values of k and j by letting Xk+K = Xk, Zj,k+K = Zj,k, Zj+J,k = Zj,k+1 so, for example, in Figure 1 we have X9 = X1, Z1,9 = Z1,1, and Z8,1 = Z1,2. Even though this model is not a simplification of any atmospheric system, it is designed to satisfy three basic properties: contains a quadratic discrete advection-like term that conserves the total energy (i.e. it does not contribute to dE/dt) just like many atmospheric models; it has linear dissipation (the –Xk term) that decreases the energy defined as $E={\sum }_{k=1}^{K}{X}_{k}$, and an external forcing term F > 0 that can increase or decrease the total energy.

The original K-dimensional system was extended to study the influence of multiple spatio-temporal scales on the predictability of atmospheric flow by dividing each segment k into J subsectors (J = 10), and introducing a fast variable, Zj,k given by (2), associated with each subsector. Thus, each Xk represents a slowly-varying, large amplitude atmospheric quantity, with J fast-varying, low amplitude, similarly coupled quantities, Zj,k, associated with it. In the context of climate modelling, the slow component is also known as the resolved climate modes while the rapidly varying component is known as the unresolved non-climate modes. The coupling terms between neighbors model advection between sectors and subsectors, while the coupling between each sector and its subsectors model damping within the model. The model is subjected to linear external forcing, F, on the slow timescale.

We consider 40-dimensional slow variable components and 400-dimensional fast variable components. With K = 40, one can think of X as representing an unspecified scalar meteorological quantity equally spaced along a mid-latitude belt of roughly 30,000 km, thereby giving the model the correct synodic scale to mimic Rossby waves. Our main research interest in this paper is to develop a dimensional reduction and parameterization method based on compressed sensing (Candès and Wakin, 2006; Donoho, 2006). We compare this compressed sensing scheme based on sparse regression with Wilks’ polynomial regression (Wilks, 2005).

### 2.1 Characterization of complex dynamics

According to (Devaney, 1989), a chaotic dynamical system defined on a metric space X by a continuous map f is said to be chaotic on X if it exhibits the following three essential features; (a) f is transitive, (b) the periodic points of f are dense in X, and (c) f has sensitive dependence on initial conditions. These systems are completely deterministic in that if the current state of the system X0 is known, all future states at time t, denoted by Xt, t > 0, are known. However, even small numerical errors in the computation of X0 can lead to very large differences in the values of Xt, so while these systems are deterministic, they are not predictable. Hence, chaotic systems exhibit a large departure in the long-term evolution for very tiny changes in initial conditions, and hence to study the limiting behavior of these systems, one cannot study the global evolution “orbit by orbit”. Instead, one studies the statistical properties of these systems, as we describe in this section.

#### 2.1.1 Statistical properties of chaotic systems

Although individual orbits of chaotic dynamical systems are by definition unpredictable, the average behavior of typical trajectories can often be given a precise statistical description. Anosov (1969), Pesin (2004) and Sinai (1977) showed that a wide class of “chaotic” maps exhibit natural ergodic invariant measures, the maps being “chaotic” in the sense that the orbits of two points, which may be arbitrarily close to each other with respect to a metric, may evolve very differently over time (positive Lyapunov exponents as discussed below). Loosely speaking, ergodicity is the “law of large numbers” for dynamical systems. While the ergodic theorems prove convergence to the mean, natural questions arise about the deviation of these processes from the mean and about other statistical properties such as rates of mixing. For a given invariant measure, and a class of observables, the correlation functions tell whether (and how fast) the system “mixes”, i.e. “forgets” its initial conditions.

Many statistical properties, such as the Central Limit Theorem and the Law of Iterated Logarithm, displayed by independent identically distributed stochastic processes, have been shown to hold for stochastic processes generated by chaotic dynamical systems. One of the better known techniques for establishing such results was proved by Young (1998), where it was shown that for such (chaotic) dynamical systems with certain “towers” property (a powerful tool for coding the dynamics of a system in such a way so as to enable the computation of its statistical properties), behave as Markov chains on spaces with infinitely many states. Should such dynamical systems mix sufficiently quickly, the process satisfies the central limit theorem and the completely deterministic process behaves, at least in the first two moments, as an independent stochastic process. Furthermore, the degree of independence is characterized by the rate at which the process mixes.

#### 2.1.2 Lyapunov Exponents (LEs)

Among the features listed above, sensitivity to initial conditions is widely known to be associated with chaotic dynamical systems, and a quantitative measure of this sensitivity is given by the Lyapunov exponents (LEs) which are the most commonly used quantities to characterize chaotic phenomena. The LEs are asymptotic measures characterizing the average rate of divergence (or convergence) of small perturbations to the solutions of a dynamical system. The value of the maximum LE is an indicator of the chaotic or regular nature of orbits.

Considering the phase space, which is assumed to be ℝd or a Riemannian manifold, and denoting by M throughout, the Lebesgue or the Riemannian measure on M is denoted by m, and the dynamics are generated by iterating a self-map of M, written f: M ↺. Given a differentiable map f : M ↺, a point xM and a tangent vector v at x, we define

(5)

where $D{f}_{x}^{n}\left(v\right)$ is the derivative of the nth iteration of f, (i.e., f ◦ … ◦ f), with respect to variable x at the vector v. And if the limit does not exist, then the limit is replaced by inf or sup and we write $\underset{¯}{\lambda }\left(x,v\right)$ and $\overline{\lambda }\left(x,v\right)$, respectively. Thus, λ(x, v) > 0 implies the exponential growth of $|D{f}_{x}^{n}\left(v\right)|$ and can be interpreted to mean exponential divergence of nearby orbits. The computational procedure for Lyapunov exponents using the standard method can be found in (Wolf et al., 1985; Skokos, 2010).

When the chaotic system is dissipative, solutions will eventually reside on an attractor. The dimension of this attractor, i.e., the number of “active” degrees of freedom, is also known as the Kaplan-Yorke dimension DKY (Eckmann and Ruelle, 1985; Beeson and Namachchivaya, 2020) and is given as

(6)
${D}_{\mathrm{KY}}=r+\frac{1}{|{\stackrel{˜}{\lambda }}_{r+1}|}\sum _{i=1}^{r}{\stackrel{˜}{\lambda }}_{i}$

where ${\stackrel{˜}{\lambda }}_{1} ⩾ {\stackrel{˜}{\lambda }}_{2} ⩾ \cdots ⩾ {\stackrel{˜}{\lambda }}_{d}$ are the Lyapunov exponents counted with multiplicity (d = dim(M)) and r is the largest integer such that ${\sum }_{i=1}^{r}{\stackrel{˜}{\lambda }}_{i}>0$.

The Kolmogorov-Sinai entropy HKS is a measure of the diversity of the trajectories generated by the dynamical system. This entropy measure is determined through the Pesin formula, which provides an upper bound to HKS (Eckmann and Ruelle, 1985) as

(7)
${H}_{\mathrm{KS}}=\sum _{{\stackrel{˜}{\lambda }}_{i}>0}{\stackrel{˜}{\lambda }}_{i}.$

Complexity of the dynamics of the Lorenz-96 model varies considerably with different values of the constant forcing term F. It is shown in (Majda et al., 2005), the model is in weakly chaotic regimes for F = 5, 6, strongly chaotic regime for the values of F from 8 to 10, and turbulent regimes for F = 12, 16, 24. We will present our results for a specific value of forcing F = 10.

In our numerical simulations, we begin from random initial conditions and use a fourth-order Runge-Kutta time integration with a time step of 0.01. Typically, we integrate forward for 500 time units before starting the calculation of Lyapunov exponents to ensure that all transients have decayed. At this point, we begin integrating the tangent space equations using the Gram-Schmidt procedure update every τ = 0.2 time units. With K = 40, F = 10, h = 0 (just the slow variables), c = 10 and b = 10, the summary of Lyapunov Exponents obtained from the above simulation of the Lorenz-96 system is given in Table 1. The time evolution of the Lyapunov exponents for the Lorenz-96 model is shown in Figure 2. As noted before, the calculation of the Lyapunov exponents begins at 500 time units and Figure 2 depicts the convergence of the Lyapunov exponents around 1000 time units.

Table 1

Summary of Lyapunov Exponents.

 Largest Lyapunov exponent λ1 2.3098 Error doubling time 0.3 time units Number of strictly positive Lyapunov exponents 14 Number of neutral, λ ∈ [–1E –2, 1E –2], exponents 1 Number of strictly negative Lyapunov exponents 26 Kaplan-Yorke dimension 29.4694 Kolmogorov-entropy 14.8409
Figure 2

Time evolution of Lyapunov exponents for the Lorenz-96 system.

#### 2.1.3 Marginal Probability Density Functions (PDFs) and Auto-correlation Functions (ACFs)

Typically, the PDFs are computed using bin-counting also known as histogram methods. For estimating the PDF of a continuous variable x, standard histogram methods simply partition x into distinct bins of width Δi, and then the number of observations ni of x falling into bin i is converted into a normalized probability density for that bin.

More smoother estimates for PDFs can be obtained by using the Gaussian product kernel function which results in the estimation of the PDF known as the kernel density estimation (KDE) (Bishop, 2006):

(8)
${p}_{X}\left(x\right)=\frac{1}{N}\sum _{j=1}^{N}\frac{1}{{\left(2\pi {\lambda }^{2}\right)}^{1/2}}\text{exp}\left\{-\frac{‖{x}_{j}-x{‖}^{2}}{2{\lambda }^{2}}\right\}$

where λ represents the standard deviation of the Gaussian components and N is the total number of data points. Figure 3 depicts the marginal PDFs of some (randomly selected) slow variables X1, X8, X19 and X31 simulated as described above in Section 2. The convergence of the marginal PDFs shown in Figure 3 demonstrates the existing symmetry among the slow variables.

Figure 3

PDFs of the slow variables X1, X8, X19 and X31.

The strongly mixing character of the slow dynamics can be inferred from the correlation function, which means that the correlation function decays quickly to zero or near zero values. The discrete version of the auto-correlation functions (ACFs) for the slow variables is given by

(9)
${C}_{k,k}\left(m\Delta t\right)=\frac{1}{M-m}\sum _{h=1}^{M-m}\left({X}_{k}\left(h\Delta t\right)-\overline{X}\right)\left({X}_{k}\left(\left(h+m\right)\Delta t\right)-\overline{X}\right)$

where

(10)
$\overline{X}=\frac{1}{M}\sum _{m=1}^{M}{X}_{k}\left(m\Delta t\right)$

and these formulae can be given similarly for the fast variables. Figure 4 shows the ACF for the slow variable X1 which decays very quickly and then oscillates around zero illustrating the strongly mixing character of the dynamics.

Figure 4

ACF of the slow variable X1.

## 3 Data-driven Reduction Methods

The objective of this section is to compare model-based and data-driven tools for dimensional reduction and reducing uncertainties in prediction. It is impossible to effectively “learn” from high-dimensional data unless there is some kind of implicit or explicit low-dimensional structure – this can be made mathematically precise in different ways. The methods presented below provide the low-dimensional structure either because of the presence of multi-scales in the model or due to the measurement matrix chosen to satisfy with high probability the so called “Restricted Isometry Property (RIP)” condition introduced in compressed sensing (Candès and Wakin, 2006).

Compressed sensing theory addresses the problem of recovering high-dimensional but sparse vectors from a limited number of measurements – the number of observed measurements m is significantly smaller than the dimension n of the original vector. We present a compressed sensing approach for sparse regression problem and using the ideas from concentration of measure, extend it to a case where the samples are generated from a chaotic time series, which are sufficiently mixing, thus ergodic, as shown in subsection 2.1. Here, the sparse recovery is achieved by constructing a matrix Θ in section 3.2 from ergodic samples generated by the Lorenz-96, such that it is suitable for reconstructing sparse parameter vectors via l1-minimization.

### 3.1 Regression

This parameterization was proposed by Wilks (Wilks, 2005), and it uses a polynomial regression to parametrize the effects (4) of unresolved variables:

(11)
${U}_{k}\left(t\right)=P\left({X}_{k}\left(t\right)\right)+{e}_{k}\left(t\right)$

where P is a 4th degree polynomial in Xk and is given as:

(12)
$P\left({X}_{k}\right)={a}_{0}+{a}_{1}{X}_{k}+{a}_{2}{X}_{k}^{2}+{a}_{3}{X}_{k}^{3}+{a}_{4}{X}_{k}^{4}$

And ek represents noise, which can be written as a first-order autoregressive model:

(13)
${e}_{k}\left({t}_{i}\right)=\phi {e}_{k}\left({t}_{i-1}\right)+\sigma {z}_{k,i}$

where ${z}_{k,i}\stackrel{i.i.d}{~}N\left(0, 1\right)$ for all i, k. The method of calculating φ and σ is explained in Appendix A.

### 3.2 Compressed Sensing and Sparse Recovery

Over the past two decades, researchers have focused on sparsity as one type of low-dimensional structure. Given the recent advances in both compressed sensing (Candès and Wakin, 2006; Donoho, 2006) and sparse regression (Tibshirani, 1996) it has become computationally feasible to extract system dynamics from large, multimodal datasets. The term sparse in signal processing context refers to the case where signals (or any type of data, in general) have merely few non-zero components with respect to the total number of components. Sparsity plays a key role in optimization and data sciences. In the context of our work, these techniques rely heavily on the fact that many dynamical systems can be represented by governing equations that are sparse in the space of all possible functions of a given algebraic structure.

Compressed sensing (CS) is a technique for sampling and reconstructing signals as sparse signals, that is, the reconstructed signals have few non-zero components. More precisely, k-sparse signals are those that can be represented by k << n significant coefficients over an n-dimensional basis. The central goal of CS is to capture attributes of a signal using very few measurements, which distinguishes CS from other dimensionality reduction techniques – a methodology for the recovery of sparse vectors from a small number of linear measurements. Hence, this allows for polynomial-time reconstruction of the sparse signal (Donoho, 2006).

In (Donoho, 2006) and (Candès and Tao, 2006), the original signal X is projected onto a lower-dimensional subspace via a random projection scheme, called the sensing matrix. This is used to reconstruct a signal Y ∈ ℝm. More precisely, this broader objective is exemplified by the important special case in which one is interested in finding a parameterization vector S ∈ ℝn using the (noisy) observation or the measurement data

(14)
$Y=\Theta S+\eta$

where Θ (=: Θ(X)) ∈ ℝm×n with k < m < n is the dictionary or sensing matrix and η is the measurement noise. In CS, we estimate a good sensing matrix Θ and a vector S that reconstructs the signal Y from X.

In general, this problem cannot be solved uniquely. However, if S is k-sparse i.e., if it has up to k nonzero entries, the theory of CS shows that it is possible to reconstruct Y uniquely, from m linear measurements even when m << n, by exploiting the sparsity of S. This can be achieved by finding the sparsest signal consistent with the vector of measurements (Donoho, 2006), i.e.

(15)

where ‖S‖0 denotes the l0 norm for S (i.e., the number of non-zero entries of S), while ɛ denotes a parameter that depends on the level of measurement noise η. It can be shown that the l0 minimization method can exactly reconstruct the original signal in the absence of noise using a properly chosen sensing matrix Θ whenever m > 2k. However, l0 minimization problem (15) is non-convex and combinatorial, which is NP-hard.

Hence, instead of problem (15) we consider its l1 convex relaxation which may be stated as (Chen et al., 2003)

(16)

where ‖·‖1 represents the l1 norm, which is a convex function and problem (16) is a convex optimization problem which can accurately approximate the signal X (solution to problem (15)) in polynomial time if the sensing matrix Θ is chosen to satisfy the necessary RIP condition with high probability (Candès et al., 2005; Candès and Wakin, 2006). Loosely speaking, if Θ satisfies the RIP, then the measurement matrix approximately preserves the Euclidean length of every signal. Equivalently, all subsets of k columns taken from Θ are nearly orthogonal. One should note that the l1 minimization in (16) is closely related to the Lasso problem (Tibshirani, 1996).

(17)
$\mathrm{arg} \underset{S\in {ℝ}^{n}}{\mathrm{min}}\frac{1}{2}‖Y-\Theta S{‖}_{2}^{2}+ \lambda ‖S{‖}_{1}$

where $\lambda ⩾ 0$ is a regularization parameter. If ɛ and λ in (16) and (17) satisfy some special conditions, the two problems are equivalent; however, characterizing the relationships between ɛ and λ is difficult, except for the special case of orthogonal sensing matrices Θ.

Even though there are several objectives in compressed sensing, we are mainly interested in estimating a good sensing matrix and efficiently reconstructing the k-sparse vector S from the measurements. Hence, the RIP plays an important role in our calculations. Unfortunately, direct verification of RIP for a given matrix is not feasible. Nevertheless, it has been shown that there are some matrices, particularly random matrices with independent and identically distributed (i.i.d) entries, satisfy RIP with high probability.

We now enforce this sparse structure within learning. One of the related problems in the context of sparsity is “dictionary learning”. We refer to the columns of the m × n matrix Θ, as a dictionary, and they are assumed to be a set of vectors capable of providing a highly succinct representation for the signal. One has the freedom to determine the particular dictionary elements; however, the choice of monomial terms provides a model for the interactions between each of the components of the time-series and is used for model identification of dynamical systems in the general setting. In dictionary learning, which we shall use to determine Θ, the aim is to find a dictionary that can sparsely represent a given set of training data for further analysis –“sparsifying matrix” from a given set of training data. Regularized optimization with polynomial dictionaries is used to approximate the generating function of some unknown dynamic process. When the dictionary is large enough so that the “true” function is in the span of the candidate space, the solutions produced by sparse optimization are guaranteed to be accurate. This means that any of the training data should be presentable using linear combinations of few columns from the dictionary. In Lorenz-96, the sensing matrix Θ is formed as the collection of matrix of all monomials (stored column-wise) of resolved modes and due to the ergodic nature of the simulated time series the entries satisfy RIP with high probability.

In this paper, the effects of unresolved processes (2) on the resolved modes (1) in the Lorenz-96 model are denoted by ${u}_{k}=-\frac{\mathrm{hc}}{b}{\sum }_{j=1}^{J}{Z}_{j,k}$. The CS method takes snapshot data x(t) ∈ ℝn and attempts to discover the structure of a potentially highly nonlinear mapping of the form

(18)
$u\left(t\right)\approx f\left(x\left(t\right)\right)$

where the n-dimensional vector x(t) = [x1(t) x2(t) … xn(t)]T represents the state of the system at time t. The error e(t) = u(t) – f(x(t)) at time t will be used later for auto-regression. The basis of this construction lies on the fact that in the multi scale Lorenz-96 model, the unresolved dynamics depends on the value of the resolved variables x.

In particular, this method searches for a sparse approximation of the function mapping f(x) = [f1(x) f2(x) … fn(x)]T ∈ ℝn using a dictionary of basis functions yielding

(19)
${f}_{k}\left(x\right)\approx \sum _{j=1}^{p}{\theta }_{j}\left(x\right){s}_{\mathrm{jk}}=\Theta \left(x\right){s}_{k}$

where Θ(x) = [θ1(x), θ2(x), …, θp(x)] ∈ ℝ1×p form the dictionary of basis functions, and sk = [s1k s2k … spk]T ∈ ℝp is the vector of coefficients. Sparsity is expressed “not” in terms of an orthonormal basis but in terms of an over-complete dictionary. The majority of the coefficients sjk are zero while the remaining nonzero entries identify the active terms contributing to the sparse representation of f(x).

To learn the function mapping f from data, time-histories of vectors x(t) and u(t) are collected by performing simulations for the Lorenz-96 system. The input data matrix X ∈ ℝm×n is made of m snapshots of the state vector sampled at several times t1, t2, …, tm where subscripts 1,2, …, n indicates the elements of the state vector. The output data matrix U ∈ ℝm×n follows the same arrangement, and these two data matrices are given as follows:

(21)

#### 3.2.1 Dictionary of Basis Functions

First, we construct an augmented dictionary matrix Θ(X) ∈ ℝm×p where each column represents a basis function that can be a potential candidate for the terms in f to be discovered from data. p is the maximal number of n-multivariate monomials of degree at most d. As stated in (Brunton et al., 2016), there is huge flexibility in choosing the basis functions to populate the dictionary matrix Θ(X). For example, the dictionary matrix Θ(X) may consist of constant, polynomial and trigonometric terms as shown in (20).

(20)

The dictionary matrix Θ(X) is constructed by stacking together, column-wise, candidate nonlinear functions of X. Here, higher order polynomials are denoted as ${X}^{{P}_{d}}$ where d is the order of the polynomial considered. For example, element 1 is a column vector of ones, element X is as defined above, element ${X}^{{P}_{2}}$ is the matrix containing the set of all quadratic polynomial functions of the state vector x, and is constructed as shown in (21).

#### 3.2.2 Sparse Optimization

Sparse regression is performed to approximately solve

(22)
$U\approx \Theta \left(X\right)S$

to determine the vector coefficients S = [s1 s2sn] ∈ ℝp×n that determines the active terms in f. Since only a few of the candidate functions are expected to have an effect on the function mapping f, all coefficient vectors si are expected to be sparse. In this work, we use the Lasso algorithm (Tibshirani, 1996) for solving the above sparse regression problem.

The goal of the sparse optimization to solve the regression problem (22) can be defined using the Lasso form (17) for k = 1, …, n

(23)
${s}_{k}=\underset{{s}_{k}^{\prime }}{\mathrm{argmin}} ||{u}_{k}-\Theta \left(X\right){s}_{k}^{\prime }|{|}_{2}^{2}+\lambda ||{s}_{k}^{\prime }|{|}_{1}$

which uses l1 norm constraint on the coefficient vector sk. The term $\lambda ||{s}_{k}^{\prime }|{|}_{1}$ is the regularization term, which penalizes coefficients different from 0 in a linear fashion. It is the actual promoter of sparsity in the minimization problem.

In the Lasso method, given a collection of m time snapshot pairs ${\left(x\left({t}_{i}\right),u\left({t}_{i}\right)\right)}_{i=1}^{m}$, we estimate the coefficients in (22) by solving the following optimization problem for k = 1, …, n

(24)
$\underset{{s}_{k}}{\mathrm{minimize}} \sum _{i=1}^{m}{\left({u}_{k}\left({t}_{i}\right)-\sum _{j=1}^{p}{\theta }_{j}\left(x\left({t}_{i}\right)\right){s}_{\mathrm{jk}}\right)}^{2}+\lambda \sum _{j=1}^{p}|{s}_{\mathrm{jk}}|$

for some Lagrange multiplier $\lambda ⩾ 0$. In this paper, we use λ = 10–3.

## 4 Stochastic Parameterization: Application to Lorenz-96

Having introduced the data-driven methods for model reduction in the presence of time-scale separation, let us now turn to a simple atmospheric model, which nonetheless exhibits many of the difficulties arising in realistic models, to gain insight into predictability and data assimilation. The dynamics of the Lorenz-96 equation are shown in (1) and (2). Let Xk and Zj,k be the numerical solutions of the Lorenz-96 equations, generated using the 4th order Runge-Kutta method, for k = 1, …, K, j = 1, …, J. We define Uk as in Equation (4):

${U}_{k}=-\frac{\mathrm{hc}}{b}\sum _{j=1}^{J}{Z}_{j,k}$

The resolved modes in (1) is now modified as:

(25)
${\stackrel{˙}{X}}_{k}=-{X}_{k-1}\left({X}_{k-2}-{X}_{k+1}\right)-{X}_{k}+F+{U}_{k} ={\mathcal{G}}_{k}\left(X\right)+{U}_{k}$

Our goal is to use the methods discussed in Section 3 to form parameterizations, in other words, express Uk as a function of X1, …, XK and compare each of them. Based on the ergodic properties of the Lorenz-96 system shown in Section 2.1, the Lyapunov exponents stabilize at t = 500. The training set is set to ttrain = [500, 500 + h, 500 + 2h, …, 1000] where h = 0.01. The test set is set to ttest = [1000, 1000 + h, 1000 + 2h, …, 2000].

Using polynomial regression to solve stochastic parameterizations in the Lorenz-96 system was proposed by Wilks (2005). The goal of this method is to find a degree-4 polynomial P such that Uk can be approximated by P(Xk) for k = 1, …, K. This requires solving a least squares problem:

(26)
$\sum _{k=1}^{K}\sum _{i\in {t}_{\mathrm{train}}}\left({U}_{k}\left({t}_{i}\right)-P\left({X}_{k}\left({t}_{i}\right)\right){\right)}^{2}$

The polynomial obtained from this method is:

(27)
$\begin{array}{l}P\left({X}_{k}\right)=-0.0003142{X}_{k}^{4}+0.004882{X}_{k}^{3}\\ +0.005197{X}_{k}^{2}-0.47362{X}_{k}-0.03779\end{array}$

The goal of compressed sensing in this context is to find a sparse matrix S that solves for Y = ΘS. Let Y be the matrix with columns U1, …, UK, each divided by its L2 norm. Every column of Θ is a basis function of X1, …, Xk divided by its L2 norm. The goal of compressed sensing is to find unique equations for every Uk for k = 1, …, K. Since regression gave us a polynomial of degree 4, our goal is to express Uk as a function of X values of order 4. However, with K = 40, this means the Θ matrix will have 135750 columns, and thus, compressed sensing will be computationally expensive. Each of the parameterizations we get are shown in the “CompressedSensingEquations” folder in (Mukherjee et al., 2021).

A computationally less expensive approach will be to find a solution of order 2. The 40-dimensional CS model is used for comparison of distributions, but we present below the single equation with averaged coefficients that will be used later for filtering:

(28)
${f}_{k}\left(X\right)=-0.376{X}_{k}+0.0166{X}_{k}^{2}$

This parameterization shows that every fk is expressed as a function of Xk and ${X}_{k}^{2}$, thus showing that every basis function XiXj where ij is not relevant, and the Θ matrix can be reduced to a matrix with Xk and ${X}_{k}^{2}$ as the columns.

We then explore solutions of order 3. Given K = 40, ${X}^{{P}_{3}}$ has 11480 columns, thus finding a solution of order 3 is computationally expensive. However, creating multiple parameterizations, each with a subset of columns of ${X}^{{P}_{3}}$ is computationally less expensive. Thus, in this section, we create 100 parameterizations, each of which the Θ matrix contains ${X}^{{P}_{3}}$ and 100 random columns from ${X}_{k}^{3}$. If we average the coefficients of the model, we get:

(29)
${f}_{k}\left(X\right)=-0.0338{X}_{k}+0.00452{X}_{k}^{2}+0.00142{X}_{k}^{3}$

This shows that every fk is expressed as a function of Xk, ${X}_{k}^{2}$ and ${X}_{k}^{3}$, thus showing that all remaining columns in ${X}^{{P}_{3}}$ are not relevant. Based on the results of the previous models, we expect ${X}_{1}^{4}, \dots , {X}_{K}^{4}$ to be the only relevant columns from ${X}^{{P}_{4}}$. This information captures the underlying structure in the Lorenz-96 equation since, in equation (2), ${\stackrel{˙}{Z}}_{j,k}$ only depends on Xk.

Thus, our Θ matrix contains the basis functions $1, {X}_{k}, {X}_{k}^{2}, {X}_{k}^{3}, {X}_{k}^{4}$ for k = 1, …, K. We will call this parameterization the raw degree-4 compressed sensing model. Since this algorithm will be used for testing purposes, the compressed sensing algorithm also calculates biases for each fk. If we average the coefficients, we get:

(30)
$\begin{array}{l}{f}_{k}\left(X\right)=-0.416{X}_{k}+0.00277{X}_{k}^{2}\\ +0.00509{X}_{k}^{3}-0.000306{X}_{k}^{4}-0.0243\end{array}$

The biases are calculated by the compressed sensing algorithm as [2.173746, 1.393404, …, –2.973427] and their mean is –0.0243. The problem with this parameterization is that there is too much variation in the biases.

We investigate other ways to modify the biases. In the following model, the biases are strictly set to zero while keeping other coefficients constant. If we average the coefficients, we get:

(31)
$\begin{array}{l}{f}_{k}\left(X\right)=-0.474{X}_{k}+0.00277{X}_{k}^{2}\\ +0.00509{X}_{k}^{3}-0.000306{X}_{k}^{4}\end{array}$

In the next model, the biases are strictly set to the average of the biases while keeping other coefficients constant. If we average the coefficients, we get the same equation as (30). Both of these models perform equally well in predicting the trajectory of the system and significantly better than the raw 4-dimensional model. The model where biases are strictly set to the average of the biases performs better in the Kullback-Leibler divergence measure, but does not outperform linear regression.

A small variability was introduced in the biases with the goal of outperforming linear regression in terms of the Kullback-Leibler divergence measure. The biases were selected by sampling from N(μ, σ2), where μ is the average of the biases (μ = –0.0243). We used σ = 0.07 as it gave the lowest prediction error on the training set. This leads to a new model, which outperforms all other models. The model is shown in Appendix B. If we average the coefficients, we get:

(32)
$\begin{array}{l}{f}_{k}\left(X\right)=-0.474{X}_{k}+0.00277{X}_{k}^{2}\\ +0.00509{X}_{k}^{3}-0.000306{X}_{k}^{4}-0.0241\end{array}$

We will use (32) as our compressed sensing model in the later sections.

The equations of the resolved variables associated with the vector field denoted by 𝒢k(X) are equivariant with respect to a cyclic permutation of the variables. Therefore, the system with dimension K is completely determined by the equation for the kth variable. Even though the unresolved dynamics depend on the value of the resolved variables X, while computing the parameterization of Uk by CS, there were no restrictions placed on the form of the polynomial on the sparse regression. The consequences of Θ(X) satisfying RIP (hence, sparsity) are quite profound as shown in results provided in Appendix B. Compared to regression, compressed sensing also has the advantage that each unresolved variable Uk is given a unique equation that best represents its dynamics as a function of resolved variables.

### 4.1 Comparison of Trajectories

Section 2.1.2 shows that the largest Lyapynov exponent is 2.3. This shows that the error-doubling time is $\frac{\text{ln}\left(2\right)}{2*2.3}\approx 0.3$. To show the practicality of each method applied to Lorenz-96, we predict the trajectory of each Xk for a time interval of three times the error doubling time (0.9) after the end of the training set. The measure of inaccuracy used in this paper is the Mean Squared Prediction Error (MSPE) (Pindyck and Rubinfeld, 1991).

(33)
$\mathrm{MSPE}=\frac{1}{K}\sum _{k=1}^{K}\frac{1}{I}\sum _{i\in I}\left({\stackrel{^}{X}}_{k}\left({t}_{i}\right)-{X}_{k}\left({t}_{i}\right){\right)}^{2}$

where Xk is derived from (1) and (2) and ${\stackrel{^}{X}}_{k}$ is the approximation to Xk derived from a parametrized model (25) through the 4th order Runge-Kutta method, I is the time interval where we compare the trajectories. In this case, I = [1000, 1000 + h, 1000 + 2h, …, 1000 + 0.9] where h = 0.01. The initial conditions are set as: ${\stackrel{^}{X}}_{k}\left(1000\right)={X}_{k}\left(1000\right), k=1, \dots , K$ The MSPEs derived for each method are shown in Table 2.

Table 2

Evaluation of each method.

METHOD MSPE AVERAGE K-L DIVERGENCE

Regression 0.03606 0.06729

Compressed sensing (Raw) 5.9993 1.9891

Compressed sensing [1] (Setting biases to zero) 0.03398 0.10099

Compressed sensing [2] (Setting biases to average bias) 0.03440 0.07351

Compressed sensing [3] (Adding noise to average bias) 0.03387 0.05919

### 4.2 Comparison of Probability Density Functions

This section will compare the probability density functions (PDFs) of ${\stackrel{^}{X}}_{k}$ and Xk for k = 1, …, K during the time interval ttest. The PDFs are computed using the kernel density estimation, given in (8). PDFs are computed for every Xk and ${\stackrel{^}{X}}_{k}$ using their data in the time interval [500,2000].

The Kullback-Leibler divergence was introduced by Kullback and Leibler (1951) and measures the discrepancy between two PDFs. The Kullback-Leibler divergence is calculated as

(34)
${D}_{\mathrm{KL}}\left(P||Q\right)=\sum _{x\in X}P\left(x\right)\text{log}\left(\frac{P\left(x\right)}{Q\left(x\right)}\right)$

where P is the PDF calculated from the data of Xk and Q is the PDF calculated from ${\stackrel{^}{X}}_{k}$. We average the Kullback-Leibler divergence in the PDFs of ${\stackrel{^}{X}}_{k}$ and Xk for all k to compare each model. The average Kullback-Leibler Divergence derived for each method is shown in Table 2.

### 4.3 Auto-regressive Models in Residuals

So far, we have investigated the deterministic part in stochastic parameterization. The goal of this subsection is to investigate the stochastic part in stochastic parameterization to further improve the regression and compressed sensing models explored so far. This is done by obtaining φ, σ and σe from the first-order AR model in (13). The method of calculating these values is explained in Appendix A. Throughout this subsection, we will refer to the compressed sensing model with noise added to the average bias as the compressed sensing model.

The values for φ, σ and σe that we obtain are shown in Table 3. Given the values of φ and σ, we can run a numerical simulation of the reduced dynamical system with AR residuals. The equation is shown below:

(35)
${\stackrel{˙}{X}}_{k}=-{X}_{k-1}\left({X}_{k-2}-{X}_{k+1}\right)-{X}_{k}+F-{f}_{k}\left(X\left(t\right)\right)+{e}_{k}\left({t}_{i}\right)$

where fk(X(t)) represents either the Wilks’ parameterization (27) or the CS parameterization (32) and ek(ti) is generated using the first order AR model below:

(36)
$\begin{array}{l}{e}_{k}\left({t}_{i}\right)=\phi {e}_{k}\left({t}_{i-1}\right)+\sigma {z}_{k}\left({t}_{i}\right)\\ =\phi {e}_{k}\left({t}_{i-1}\right)+{\sigma }_{e}{\left(1-{\phi }^{2}\right)}^{1/2}{z}_{k}\left({t}_{i}\right)\end{array}$

zk(ti) is sampled randomly from N(0, 1).

Table 3

AR evaluation of residuals.

METHOD φ σ σe MSPE AVE. K-L DIVERGENCE

Regression 0.9453 0.2265 0.6945 0.02624 0.07692

Compressed sensing 0.9981 0.1710 2.775 0.02066 0.07142

Numerical simulations are done using the 4th order Runge-Kutta method. ek(t0) is set to 0 for all k. Similar to section 4.2, we evaluate our model by assessing its trajectory as well as its PDF. The results are summarized in Table 3.

A comparison of Table 3 with Table 2 shows that using an AR model in compressed sensing and regression has significantly improved the prediction of the trajectory of resolved variables as the MSPE is lower. Compressed sensing shows the lowest MSPE and Kullback-Leibler divergence.

The goal of this section was to create a deterministic and AR model to improve the prediction of resolved variables. Based on the results shown, compressed sensing improves the prediction significantly compared to regression. The coefficients of the deterministic model as well as the values of φ, σ and σe will be useful in the following section where filtering methods will be used to further improve the prediction of resolved variables.

## 5 Nonlinear Filtering Applied to Parameterized Models

The second component of this paper deals with nonlinear filtering or data assimilation (DA). The main focus of filtering is to combine computational models with sensor data (observations) to predict the dynamics of large-scale evolving systems. Filtering provides a recursive algorithm for estimating a signal or state of a random dynamical system based on noisy measurements. The signal that is represented by a Markov process cannot be accessed or observed directly and is to be “filtered” from the trajectory of the observation process which is statistically related to the signal. Suppose we make a forecast about the behavior at a future time of a complex system with some uncertainties (randomness) and there is near-continuous data available from remote sensing instrumentation networks. Then, as new information becomes available through observations, it is natural to ask how to best incorporate this new information into the prediction. Data assimilation can be simply defined as the combination of observations with the model output to produce an optimal estimate of a dynamical model.

The standard notation for the discrete DA problems from the time tk to tk+1 can be formulated as follows:

(37)
${x}^{f}\left({t}_{k+1}\right)=M\left[{x}^{f}\left({t}_{k}\right)\right]+{w}_{k}$

where x is the model state, M is the model dynamics and w is the model error. The operator M is represented with matrix in discrete case and differential operator in continuous case. In this section, for notational simplicity, x represents the model state X as described in previous sections. Observations at time tk can be defined by:

(38)
${y}_{k}=H\left[{x}^{t}\left({t}_{k}\right)\right]+{v}_{k}$

where y is the observation and H is the observation operator, which is identity 𝕀 in our problem. v is the white noise with associated covariance matrix describing the observation error.

Data assimilation methods consist of analysis (or correction/filtering) and forecast (or prediction) steps. At time tk, we have the output state of the background forecast ${x}_{k}^{f}$, and observation state yk. xt is the truth generated by equation 1 and 2. The (linear or nonlinear) analysis step is based on the outputs of ${x}_{k}^{f}$ and yk and produces analysis ${x}_{k}^{a}$. Then, applying the model dynamics on the analysis step generates the next forecast output as ${x}_{k+1}^{f}$.

### 5.1 Ensemble Kalman Filter

It is well documented that particle filters (Gordon et al., 1993) suffer from the well-known problem of sample degeneracy (Bengtsson et al., 2008; Snyder et al., 2008). Recently, (Yeong et al., 2020) presented reduced-order nonlinear filtering schemes based on a theoretical framework that combined stochastic parametrization based on homogenization and nonlinear filtering for the implementation of particle filters. In contrast, the ensemble Kalman filter (EnKF) (Evensen, 1994; Evensen, 2003) can handle some of these difficulties, where the dimensions of states and observations are large and the number of replicates is small. However, EnKF incorrectly treats the non-Gaussian features of the forecast distribution that arise in certain nonlinear systems. Nevertheless, we apply the ensemble Kalman filter for data assimilation, because they are easy to implement in large systems.

For the Kalman filter (Kalman, 1960), the modelling and observation errors are assumed to be independent with Gaussian/normal probability distributions:

(39)
${w}_{k}~N\left(0, {Q}_{k}\right) \text{and} {v}_{k}~N\left(0, {R}_{k}\right)$

where Q and R are the covariance matrices of the model error and observation error, respectively. The Kalman filter is of the form:

(40)
${x}_{k}^{a}={x}_{k}^{f}+{K}_{k}\left({y}_{k}-{H}_{k}{x}_{k}^{f}\right)$

where Kk is known as the Kalman gain and $\left({y}_{k}-{H}_{k}{x}_{k}^{f}\right)$ is called the innovation. After some manipulations, the Kalman gain can be obtained as:

(41)
${K}_{k}={P}_{k}^{f}{H}^{T}{\left(H{P}_{k}^{f}{H}^{T}+R\right)}^{-1}$

where ${P}_{k}^{f}$ is the forecast error covariance matrix.

The idea behind EnKF (Evensen, 2003) is the propagation and analysis of ensemble members instead of full-rank covariance matrices. Unlike Kalman filter, the EnKF is applicable for nonlinear dynamical models and computationally efficient for high-dimensional systems (Asch et al., 2016; Law et al., 2015). The forecast step of EnKF can be formulated as follows:

(42)
$\begin{array}{ll}{\stackrel{^}{x}}_{j+1}^{n}\hfill & =\Psi \left({x}_{j}^{n}\right)+{e}_{j}^{n}\hfill \\ {\stackrel{^}{m}}_{j+1}\hfill & =\frac{1}{N}\sum _{n=1}^{N}{\stackrel{^}{x}}_{j+1}^{n}\hfill \\ {\stackrel{^}{C}}_{j+1}\hfill & =\frac{1}{N-1}\sum _{n=1}^{N}\left({\stackrel{^}{x}}_{j+1}^{\left(n\right)}-{\stackrel{^}{m}}_{j+1}\right){\left({\stackrel{^}{x}}_{j+1}^{\left(n\right)}-{\stackrel{^}{m}}_{j+1}\right)}^{T}\hfill \end{array}$

where x is the model state that is propagated by Ψ and ej is the auto-regressive modeling of the noise as described before. m is the model state mean, C is the model covariance matrix and n = 1, …, N is the number of ensembles. Then, the analysis step can be performed as follows:

(43)
$\begin{array}{ll}{S}_{j+1}\hfill & =H{\stackrel{^}{C}}_{j+1}{H}^{T}+\Gamma \hfill \\ {K}_{j+1}\hfill & ={\stackrel{^}{C}}_{j+1}{H}^{T}{S}_{j+1}^{-1}\hfill \\ {x}_{j+1}^{n}\hfill & =\left(I-{K}_{j+1}H\right){\stackrel{^}{x}}_{j+1}^{\left(n\right)}+{K}_{j+1}{y}_{j+1}^{\left(n\right)}\hfill \\ {y}_{j+1}^{\left(n\right)}\hfill & ={y}_{j+1}+{\eta }_{j+1}^{n},\hfill \end{array}$

where Γ and η are i.i.d’s draws from normal distribution and yj’s are the observation states. In our problem, the truth is generated by the slow variables of non-parametric Lorenz-96 system, which is given by (1) and (2). Then, observations (direct measurements) are produced by adding small sensor noise to the slow variables X. And, model used for the propagation of ensemble Kalman filter is either given by compressed sensing or polynomial regression.

### 5.2 Application to Lorenz-96

In this section, we compare the filtering results of Wilks’ (polynomial) parameterization and compressed sensing parameterization. For each filtering implementation, 40 ensembles are used and observations are included in every 20 time steps. We observe that compressed sensing dynamics provides better prediction results compared to other parameterization methods. There are several software that can be used for filtering problems. Here, we benefit from the open source provided in (Pawar and San, 2020) regarding our parameterized Lorenz-96 models. Observations are generated by adding normal noise to the non-parameterized Lorenz-96 model (truth) and filtering trajectories are obtained by using the set of equations given in (42) and (43) for all cases.

#### 5.2.1 Autoregressive Wilks’ parameterization

Here, we analyze filtering results of Wilks’ parameterization with auto-regressive models. Instead of using random noise for the prediction, we use the auto-regressive modelling of noise as obtained in Section 4. While keeping the parametrized model same, we include modelled noise to the ensemble dynamics as the following:

(44)
$\begin{array}{l}{\stackrel{˙}{X}}_{k}={\mathcal{G}}_{k}\left(X\right)-0.03779-0.47362{X}_{k}+0.005197{X}_{k}^{2}\\ +0.004882{X}_{k}^{3}-0.0003142{X}_{k}^{4}+{e}_{k}\left(t\right)\end{array}$

where 𝒢k(X) is defined in (3) and

(45)
${e}_{k}\left({t}_{i}\right)=0.9453{e}_{k}\left({t}_{i-1}\right)+0.2265{z}_{k}\left({t}_{i}\right)$

zk(ti) is sampled randomly from N(0, 1).

Figure 5 shows the trajectories of truth and observations generated from the truth for the first 3 components of Lorenz-96 model. Figure 7 depicts the truth and prediction (filtering) results of 40 components and MSPE is 0.00940 which is slightly better than deterministic parameterization.

Figure 5

True trajectories, observations and EnKF with Wilks’ parametrized model.

#### 5.2.2 Autoregressive Compressed Sensing

In addition to compressed sensing method in Section 4, we consider auto-regressive modelling of noise. While keeping the parametrized model same, we include modelled noise to the ensemble dynamics as the following:

(46)
$\begin{array}{l}{\stackrel{˙}{X}}_{k}={\mathcal{G}}_{k}\left(X\right)-0.0241-0.474{X}_{k}+0.00277{X}_{k}^{2}\\ +0.00509{X}_{k}^{3}-0.000306{X}_{k}^{4}+{e}_{k}\left(t\right)\end{array}$

where 𝒢k(X) is defined in (3) and

(47)
${e}_{k}\left({t}_{i}\right)=0.9981{e}_{k}\left({t}_{i-1}\right)+0.1710{z}_{k}\left({t}_{i}\right)$

zk(ti) is sampled randomly from N(0,1). Figure 6 shows the trajectories of truth and observations generated from the truth for the first 3 components of Lorenz-96 model. Figure 8 depicts the truth and prediction (filtering) results of 40 components and MSPE is 0.00940 which is slightly better than deterministic parameterization.

Figure 6

True trajectories, observations and EnKF with compressed sensing model.

In Figures 7 to 8, we see the depictions of filtering results for 40 slow components of Lorenz96 model. The first plot of each figure represent the true trajectories of components where the observations are generated. The second plot of each figure show the filtering (prediction) results for the system. Finally, the third plot of each figure depict the L1 errors between the truth and prediction results. All the results about the figures shown in this section are discussed in the previous sections related to the individual parameterization methods.

Figure 7

Depiction of Lorenz-96 true and EnKF prediction trajectories for 40 components (Auto-regressive Wilks’ parameterization).

Figure 8

Depiction of Lorenz-96 true and EnKF prediction trajectories for 40 components (Auto-regressive compressed sensing).

Table 4 shows the parameterization methods with the MSPE errors. As a conclusion, stochastic parameterization methods provide slightly better prediction results compared to deterministic parameterizations.

Table 4

Parameterization methods and MSPE results including random noise and modelled noise with autoregression.

METHOD MSPE

Wilks’ parameterization 0.01006

Compressed sensing 0.01005

Autoregressive Wilks’ parameterization 0.00940

Autoregressive compressed sensing 0.00940

## 6 Conclusion

The sheer number of calculations required in directly solving large-scale global weather and climate models becomes computationally overwhelming. Hence, in this paper, using the multi-scale Lorenz-96 system as a test problem, we introduced a data-driven stochastic-parameterization framework in which the equations of small-scales are integrated and incorporated using compressed sensing to reduce the cost of data assimilation. We presented a data-driven, SPR technique, where we learned to describe the corresponding unknown or unresolved physics from the data due to the fact the governing equations of the unresolved physics are too computationally expensive to solve. Since the atmospheric model considered in this paper is inherently chaotic, it has several positive Lyapunov exponents and any small error in the initial conditions grow in finite time.

Random matrices with independent and identically distributed (i.i.d) entries, satisfy RIP with high probability, and the RIP plays an important role in our calculations. We provided a characterization of the dynamical system based on the Lyapunov exponents for the F = 10 case. In this case, the model is strongly chaotic and decorrelation times also suggested the model to be ergodic. Since the dictionary Θ was formed as the collection of matrices of all monomials (stored column-wise) of resolved modes and due to the ergodic nature of the simulated time series, the entries satisfied the RIP with high probability. The construction of the optimization problem and computational method were detailed in section 4.

For parameterization of Uk, while previous works mostly focused on suggesting pre-defined polynomial or other model form (or structure) and then estimating parameters, this work is the first attempt to use CS for automatically and simultaneously discovering polynomial or model form (or structure) and estimating parameters for Uk as illustrated by the results in section4 and Appendix B.

Specifically, a reduced-order ensemble Kalman filter (EnKF) was applied to the stochastically parametrized model. For the EnKF, particle locations were corrected based on their respective distances from the observation (innovation). Magnitude of the correction was proportional to the error covariance and inverse to the observation noise covariance (Kalman gain). In between observations, particles in the EnKF sample evolve according to the original signal dynamics, so particle trajectories are expected to deviate from the truth as time moves further away from the last observation correction.

Our results show that the compressed sensing model produces the best deterministic parameterization results. After incorporating auto-regression to model residuals, the stochastic parameterizations outperform deterministic parameterizations in terms of prediction accuracy. Compressed sensing with auto-regression produces the best stochastic parameterization results.

Solving for small-scale processes with help of LES models, on a high-resolution grid embedded (with grid spacing of a few tens of meters and can explicitly simulate the clouds and turbulence represented by parameterization schemes) within the low-resolution grid is very useful, but its applicability remains limited to its high computational cost. A long term goal is to create a data-driven scheme, in which LES models embedded in selected grid columns of a global model explicitly simulate subgrid-scale processes which are represented by parameterization schemes in the other columns. The result is a seamless algorithm which only uses a small scale grid on a few selected columns that simulates interactively within a running global simulation, and is far cheaper than eddy-resolving simulation throughout the global simulation.

Appendix

Appendix A and B. DOI: https://doi.org/10.16993/tellusa.42.s1

## Acknowledgements

The authors acknowledge partial support for this work from Natural Sciences and Engineering Research Council (NSERC) Discovery grant 50503-10802, TECSIS/Fields-CQAM Laboratory for Inference and Prediction and NSERC-CRD grant 543433-19. The authors would like to thank Professor Peter Baxendale, Department of Mathematics, University of Southern California, for stimulating discussions, contributions and suggestions on many topics of this paper.

## Competing Interests

The authors have no competing interests to declare.

## References

1. Anosov, DV. 1969. Geodesic flows on closed Riemann manifold with negative curvature. Volume 90 of Trundy ordena Lenina Mathematicheskogo Instituta imeni V.A. Steklova. American Mathematical Society.

2. Arnold, HM, Moroz, IM and Palmer, TN. 2013. Stochastic parametrizations and model uncertainty in the Lorenz ‘96 system Philosophic Transactions of the Royal Society A, 371: 1991. DOI: https://doi.org/10.1098/rsta.2011.0479

3. Asch, M, Bocquet, M and Nodet, M. 2016. Data assimilation: methods, algorithms, and applications. Society for Industrial and Applied Mathematics. DOI: https://doi.org/10.1137/1.9781611974546

4. Beeson, R and Sri Namachchivaya, N. 2020. Particle filtering for chaotic dynamical systems using future right-singular vectors. Nonlinear Dyn, 102: 679–696. DOI: https://doi.org/10.1007/s11071-020-05727-y

5. Bengtsson, T, Bickle, P and Li, B. 2008. Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems. IMS Collections: Probability and Statistics Essays in Honor of David A. Freedman, 2: 316–334. DOI: https://doi.org/10.1214/193940307000000518

6. Bishop, CM. 2006. Pattern Recognition and Machine Learning. Springer.

7. Brunton, SL, Proctor, JL and Kutz, JN. 2016. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15): 3932–3937. DOI: https://doi.org/10.1073/pnas.1517384113

8. Candès, EJ, Romberg, JK and Tao, T. 2005. Decoding by linear programming. IEEE Transactions on information theory, 51(12): 4203–4215. DOI: https://doi.org/10.1109/TIT.2005.858979

9. Candès, EJ and Tao, T. 2006. Near-optimal signal recovery from random projections: Universal encoding strategies. IEEE Transactions on Information Theory, 52(12): 5406–5425. DOI: https://doi.org/10.1109/TIT.2006.885507

10. Candès, EJ and Wakin, MB. 2006. An introduction to compressed sampling. IEEE signal processing magazine, 25(2): 21–30. DOI: https://doi.org/10.1109/MSP.2007.914731

11. Chen, SS, Donoho, DL and Saunders, MA. 2003. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1): 33–61. DOI: https://doi.org/10.1137/S1064827596304010

12. Devaney, RL. 1989. An Introduction to Chaotic Dynamical Systems. Addison-Wesley.

13. Donoho, DL. 2006. Compressed sensing. IEEE Transactions on information theory, 52(4): 1289–1306. DOI: https://doi.org/10.1109/TIT.2006.871582

14. Eckmann, P and Ruelle, D. 1985. Ergodic theory of chaos and strange attractors. Rev. Mod. Phys., 57(3): 617–656. DOI: https://doi.org/10.1103/RevModPhys.57.617

15. Evensen, G. 1994. Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. Journal of Geophysical Research, 99: 10143–10162. DOI: https://doi.org/10.1029/94JC00572

16. Evensen, G. 2003. The ensemble Kalman filter: Theoretical formulation and practical implementation. Ocean dynamics, 53(4): 343–367. DOI: https://doi.org/10.1007/s10236-003-0036-9

17. Gagne, DJ, Christensen, H, Subramanian, A and Monahan, AH. 2020. Machine learning for stochastic parameterization: Generative adversarial networks in the lorenz ‘96 model. Journal of Advances in Modeling Earth Systems, 12(3). DOI: https://doi.org/10.1029/2019MS001896

18. Gordon, NJ, Salmond, DJ and Smith, AFM. 1993. Novel approach to nonlinear/non-gaussian Bayesian state estimation. IEEE Proceedings, F(2): 107–113. DOI: https://doi.org/10.1049/ip-f-2.1993.0015

19. Kalman, RE. 1960. A new approach to linear filtering and prediction problems. DOI: https://doi.org/10.1115/1.3662552

20. Kullback, S and Leibler, RA. 1951. On Information and Sufficiency. The Annals of Mathematical Statistics, 22: 79–86. DOI: https://doi.org/10.1214/aoms/1177729694

21. Law, K, Stuart, A and Zygalakis, K. 2015. Data assimilation. 214. Cham, Switzerland: Springer. DOI: https://doi.org/10.1007/978-3-319-20325-6

22. Lorenz, EN. 1995. Predictability: A problem partly solved. ECMWF Seminar Proceedings on Predictability, ECMWF.

23. Majda, A, Abramov, R and Grote, M. 2005. Information Theory and Stochastic for Multiscale Nonlinear Systems (CRM Monograph Series), 25. New York: American Mathematical Society. DOI: https://doi.org/10.1090/crmm/025

24. Mukherjee, A, Aydogdu, Y and Ravichandran, T. 2021. Stochastic Parameterization using Compressed Sensing: Application to the Lorenz-96 Atmospheric Model. GitHub Repository. https://github.com/amartyamukherjee/StochasticParameterization-ApplicationToLorenz96.

25. Pawar, ASE and San, O. 2020. A Hands-On Introduction to Dynamical Data Assimilation with Python. Fluids, 5(4): 225. DOI: https://doi.org/10.3390/fluids5040225

26. Pesin, YB. 2004. Lectures on partial hyperbolicity and stable ergodicity. EuropeanMathematical Society. DOI: https://doi.org/10.4171/003

27. Pindyck, RS and Rubinfeld, DL. 1991. Forecasting with Time-Series Models. Econometric Models & Economic Forecasts 3rd ed., 516–535.

28. Sinai, YG. 1977. Introduction to ergodic theory. Volume 18 of Mathematical Notes. Princeton University Press.

29. Skokos, Ch. 2010. The Lyapunov Characteristic Exponents and Their Computation. Lect. Notes Phys., 790: 63–135. DOI: https://doi.org/10.1007/978-3-642-04458-8_2

30. Snyder, C, Bengtsson, T, Bickel, P and Anderson, J. 2008. Obstacles to high-dimensional particle filtering. Monthly Weather Review, 136(12): 4629–4640. DOI: https://doi.org/10.1175/2008MWR2529.1

31. Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1): 267–288. DOI: https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

32. Wilks, DS. 2005. Effects of stochastic parameterizations in the Lorenz ‘96 system. Royal Meteorological Society, 131: 389–407. DOI: https://doi.org/10.1256/qj.04.03

33. Wolf, AS, Swinney, JB and Vastano, HL. 1985. Determining Lyapunov exponents from a time series. Physica, 16D: 285–317. DOI: https://doi.org/10.1016/0167-2789(85)90011-9

34. Yeong, HC, Beeson, R, Sri Namachchivaya, N and Perkowski, N. 2020. Particle Filters with Nudging in Multiscale Chaotic Systems: with Application to the Lorenz-96 Atmospheric Model. Journal of Nonlinear Science, 30: 1519–1552. DOI: https://doi.org/10.1007/s00332-020-09616-x

35. Young, LS. 1998. Statistical properties of dynamical systems with some hyperbolicity. The Annals of Mathematics, 147(3): 585–650. DOI: https://doi.org/10.2307/120960