In the traditional cycling of forecasts and data assimilation (DA) for numerical weather prediction, the DA step for the global model has occurred every 6 or 12 hours. This was appropriate for an era when data was concentrated at the main synoptic times, and the limited area models (LAMs) for which the global model provides boundary conditions were cycled every 6 hours. However, in recent years data has become dominated by sources which are essentially continuous in time, and centres such as the Met Office will soon cycle their highest resolution LAM every hour.
By increasing the frequency of global analyses (e.g. to every hour) global forecasts can be based on more recent data, which is not only desirable in itself but provides timely lateral boundary conditions (LBCs) for high resolution LAMs. In one study of the Met Office’s 1.5 km LAM covering the British Isles (Tang et al., 2013) it was found that replacing 3-hour and 6-hour old LBCs by 3-hour and fresh LBCs improved the UK index (a basket of scores measuring forecast skill) by 1.5% (Bruce Macpherson, pers comm). Furthermore, by having more frequent analyses the analysis increments will be smaller, which will improve the validity of the linear approximations in DA schemes. More frequent analyses may also improve the affordability of DA methods as the computational load is distributed more evenly in time.
We will see that, because of the delay in receiving some data, to ensure that all the data which are received are also assimilated, the assimilation windows will need to overlap. We obtain the optimal solution to this problem, which involves manipulating simultaneously all the states in the window and their joint errors. We explore the relation between the optimal solution and simplified methods which are closer to current methods for large-scale DA.
We show that the current use of largely climatological prior error covariances may pose a challenge for high frequency cycling, and discuss how this may be overcome.
An immediate issue is that observations are not received instantaneously. For example, by 09Z on 18 June 2015 the Met Office had received over 80 million observations valid between 09Z on 15 June and 03Z on 18 June 2015, including around 0.8 million surface, 2.1 million aircraft and sonde, 12.4 million satwind and 14 million ATOVS observations. The delay between validity time and receipt for these observation types is recorded in Fig. 1. We see that to receive 95% of aircraft and sonde, surface, ATOVS and satwind observation took respectively 0.6, 1.5, 3.4 and 4.1 hours.
This presents a quandary for traditional cycling which aims to produce an analysis every 6 (at some centres every 12) hours. For definiteness consider 4D-Var (e.g. Li and Navon, 2001) with a 6-hour window [T-3,T+3], which generates an analysis at T-3 (in this discussion the units are hours).
One could perform the analysis at T+3 using all observations available by T+3, which would minimise the time delay to produce the analysis, but observations received after T+3 would not be assimilated. Alternatively, one could perform the analysis at T+7, by which time (in view of Fig. 1) almost all the observations valid in the window have been received, but the analysis is only available 4 hours after the end of the window and 10 hours after its beginning. To generate an estimate of state at T+7 we could run a 10-hour forecast from the analysis, but compared with the estimate of state at T-3 this will be degraded by model error.
Centres such as the Met Office mitigate these issues by performing each analysis twice, a ‘late cut-off’ analysis currently at about $T+6$, and an ‘early cut-off’ analysis at about T+3. This is illustrated somewhat schematically in Fig. 2, which shows two adjacent non-overlapping windows [3Z,9Z) and [9Z,15Z) (where [${T}_{1}$,${T}_{2}$) denotes the time interval ${T}_{1}\le t<{T}_{2}$). For example, at 8Z we receive observations valid between 4Z and 8Z. Considering the window [3Z,9Z), for the ‘early cut-off’ run we perform the data assimilation at 9Z and use the observations in the dark blue region, and for the ‘late cut-off’ run perform it at about 12Z and use virtually all observations ever valid in the window (combined light and dark blue region).
In principle the late cut-off analysis could make use of the early cut-off one to reduce its work load, as is done in the ‘quasi-continuous’ approach of Järvinen et al. (1996) and Veerse and Thépaut (1998), but at the Met Office the late cut-off analyses start again from scratch, making no use of the work done for the early cut-off analysis.
Having both early and late cut-off analyses goes some way to mitigating the shortcomings of 6-hourly cycling. However, the analyses are still 6 hours apart, which makes them insufficiently timely for some purposes, notably (considering the comments in Section 1) the LBCs for hourly LAM analyses; the analysis increment is much larger than would be the case with an hourly update, so nonlinearity can be a significant problem, especially for the linear model in 4D-Var; and the approach is inefficient insofar as the early cut-off analyses are not used as part of a cycle.
In Fig. 3 we illustrate how we would like to deal with the same case: each hour we assimilate all observations received in the last hour, e.g. at 12Z we assimilate the observations received between 11Z and 12Z (green region); these are valid between 7Z and 12Z. In principle we do not re-assimilate the observations valid between 7Z and 12Z received at earlier times (blue, red, yellow and cyan regions in Fig. 3) as the information from these observations has been transferred to previous analyses and thereby to the background for this cycle.
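The hourly batching described above can be sketched as follows. This is only an illustration under our own assumptions: observations are represented as hypothetical `(validity_time, receipt_time, label)` tuples with times in hours, and an observation received in the interval (h-1, h] is assigned to the cycle run at hour h.

```python
import math
from collections import defaultdict


def batch_by_receipt_hour(observations):
    """Group observations by the hourly cycle that assimilates them.

    Each observation is a (validity_time, receipt_time, label) tuple
    (a hypothetical layout). An observation received in (h-1, h] is
    assigned to the cycle at hour h, whatever its validity time.
    """
    batches = defaultdict(list)
    for valid, received, label in observations:
        cycle_hour = math.ceil(received)
        batches[cycle_hour].append((valid, received, label))
    return dict(batches)


# Four illustrative observations (made-up times).
obs = [(7.2, 11.5, "a"), (11.9, 12.0, "b"), (10.0, 10.4, "c"), (8.0, 11.1, "d")]
batches = batch_by_receipt_hour(obs)
labels_at_12 = sorted(label for _, _, label in batches[12])
```

Each observation enters exactly one batch, so nothing is re-assimilated; information from earlier batches reaches the current cycle only through the background.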
In the context of global NWP, which provides among other outputs LBCs for hourly cycling of LAMs, an hourly update cycle is natural, but in principle the updates could be as short as one model time step. For the purposes of this paper we will refer to any cycling where observations are assimilated as soon as they are received, or as in Fig. 3 within some short time of receipt, as rapid update cycling (RUC).
We should note that the term ‘Rapid Update Cycle’ has been employed in the past to denote specific rapidly cycled NWP systems, for example by the National Centers for Environmental Prediction in the USA. In that case it referred to an operational regional forecast-analysis system over North America, where data was assimilated by 3D-Var (originally optimal interpolation) using non-overlapping windows of length one hour (Benjamin et al., 2004b; Benjamin et al., 2004a).
To examine the rapid update cycling problem further we will idealise it slightly by supposing that observations are valid at exact multiples of a time increment $\mathit{\delta}t$ (as opposed to continuously in time), and become available after delays of $0,\mathit{\delta}t,2\mathit{\delta}t,\dots ,N\mathit{\delta}t$. We will suppose that observations received at time $k\mathit{\delta}t$ are
$${\mathbf{y}}_{k}^{(k)},{\mathbf{y}}_{k-1}^{(k)},\dots ,{\mathbf{y}}_{k-N}^{(k)}.\qquad (1)$$
Superscripts denote when the observations are received and subscripts their validity time, the longest delay being $N\mathit{\delta}t$.
The first problem is to develop an optimal method for assimilating observations as soon as they are available. At time $k\mathit{\delta}t$ we seek to estimate ${\mathbf{x}}_{k},\dots ,{\mathbf{x}}_{k-N}$, given observations (1) and our previous estimate of ${\mathbf{x}}_{k-1},\dots ,{\mathbf{x}}_{k-N-1}$.
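We will suppose that for $i=k,k-1,\dots ,k-N$ we have observation operators ${\mathbf{h}}_{i}^{(k)}$ such that
$${\mathbf{y}}_{i}^{(k)}={\mathbf{h}}_{i}^{(k)}({\mathbf{x}}_{i})+{\mathit{\nu}}_{i}^{(k)}\qquad (2)$$
and for each i a model ${\mathbf{f}}_{i}$
$${\mathbf{x}}_{i+1}={\mathbf{f}}_{i}({\mathbf{x}}_{i})+{\mathit{\omega}}_{i}\qquad (3)$$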
where the distributions of the errors ${\mathit{\nu}}_{i}^{(k)}$, ${\mathit{\omega}}_{i}$ are supposed known.
The optimal method is obtained by formulating the problem in such a way that standard estimation theory can be applied.
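We will use the convention that the underlined vector ${\underline{\mathbf{x}}}_{k}$ denotes the concatenation of the $N+1$ vectors $\{{\mathbf{x}}_{i},k-N\le i\le k\}$, ordered from oldest to newest:
$${\underline{\mathbf{x}}}_{k}=\left(\begin{array}{c}{\mathbf{x}}_{k-N}\\ \vdots \\ {\mathbf{x}}_{k}\end{array}\right)$$
and similarly the underlined matrix ${\underline{A}}_{k}$ is formed from matrices $\{{A}_{i,j},k-N\le i,j\le k\}$ of compatible size:
$${\underline{A}}_{k}=\left(\begin{array}{ccc}{A}_{k-N,k-N}& \cdots & {A}_{k-N,k}\\ \vdots & \ddots & \vdots \\ {A}_{k,k-N}& \cdots & {A}_{k,k}\end{array}\right)$$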
For example, if $\{{\mathbf{x}}_{i},k-N\le i\le k\}$ are vectors of length n and $\{{A}_{i,j},k-N\le i,j\le k\}$ are matrices of size $n\times n$, then ${\underline{\mathbf{x}}}_{k}$, ${\underline{A}}_{k}$ are of size respectively $n(N+1)\times 1$ and $n(N+1)\times n(N+1)$.
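The sizes quoted above can be checked with a short NumPy sketch. The values of `n` and `N`, the random contents, and the helper `sub_block` are ours and purely illustrative; we assume the blocks are ordered from oldest ($k-N$) to newest ($k$), which is our reading of the convention.

```python
import numpy as np

n, N = 3, 2  # state size and maximum delay (illustrative values)
rng = np.random.default_rng(0)

# Window of N+1 states, ordered x_{k-N}, ..., x_k (assumed ordering).
states = [rng.standard_normal(n) for _ in range(N + 1)]
x_bar = np.concatenate(states)

# (N+1) x (N+1) grid of n x n blocks A_{i,j}, assembled into one matrix.
blocks = [[rng.standard_normal((n, n)) for _ in range(N + 1)]
          for _ in range(N + 1)]
A_bar = np.block(blocks)


def sub_block(A, i, j, n):
    """Return the (i, j) n x n sub-block of A, with zero-based block indices."""
    return A[i * n:(i + 1) * n, j * n:(j + 1) * n]
```

Here `x_bar` has length $n(N+1)$ and `A_bar` has shape $n(N+1)\times n(N+1)$, and `sub_block` recovers any individual ${A}_{i,j}$.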
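Define ${\underline{\mathbf{y}}}_{k}$ to be the observations received at time $k\mathit{\delta}t$, so
$${\underline{\mathbf{y}}}_{k}=\left(\begin{array}{c}{\mathbf{y}}_{k-N}^{(k)}\\ \vdots \\ {\mathbf{y}}_{k}^{(k)}\end{array}\right)$$
We seek the conditional expectations of ${\underline{\mathbf{x}}}_{k}$ and ${\underline{\mathbf{x}}}_{k+1}$ given observations received up to time $k\mathit{\delta}t$
$$E[{\underline{\mathbf{x}}}_{k}|{\underline{\mathbf{y}}}_{0},{\underline{\mathbf{y}}}_{1},\dots ,{\underline{\mathbf{y}}}_{k}],\quad E[{\underline{\mathbf{x}}}_{k+1}|{\underline{\mathbf{y}}}_{0},{\underline{\mathbf{y}}}_{1},\dots ,{\underline{\mathbf{y}}}_{k}].\qquad (6)$$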
Note that, given ${\underline{\mathbf{x}}}_{k}$, and assuming ${\mathit{\omega}}_{k}$ has zero mean, the best estimate of ${\underline{\mathbf{x}}}_{k+1}$ before observations received at $k\mathit{\delta}t$ are assimilated is
If we now define
Then we may write (2) and (3) as respectively
which are in the standard form for observation and signal map equations in estimation theory (e.g. Jazwinski, 1970).
It is illuminating to first work out the details in the simplest case, where in (2) and (3) the observation operators ${\mathbf{h}}_{i}^{(k)}$ and model ${\mathbf{f}}_{i}$ are linear and the errors ${\mathit{\nu}}_{i}^{(k)},{\mathit{\omega}}_{i}$ are zero-mean, Gaussian and uncorrelated.
In this case
where
for some matrices ${R}_{i}^{(k)},{Q}_{i}$ (where $\sim N(\mathit{\mu},\mathrm{\Sigma})$ denotes normally distributed with mean $\mathit{\mu}$ and variance $\mathrm{\Sigma}$), and setting
and
with
and the problem of finding the conditional expectations (6) of ${\underline{\mathbf{x}}}_{k}$ and ${\underline{\mathbf{x}}}_{k+1}$ given observations received up to time k, which may be denoted ${\underline{\widehat{\mathbf{x}}}}_{k|k}$, ${\underline{\widehat{\mathbf{x}}}}_{k+1|k}$, is solved by a standard Kalman Filter, as in Table 1.
The basic objects manipulated are whole windows of states and observations and the covariances of the errors in these objects. The symbols in Table 1 are whole-window analogues of their usual values, e.g. ${\underline{P}}_{k|k-1}$ and ${\underline{P}}_{k|k}$ are $n(N+1)\times n(N+1)$ prior and posterior error covariance matrices of the estimated ${\underline{\mathbf{x}}}_{k}$. In special circumstances simplification is possible. For example, if new observations only occur in the final time slot, i.e. if the only observations which become available at k are ${\mathbf{y}}_{k}^{(k)}$ and there are no observations ${\mathbf{y}}_{k-1}^{(k)},\dots ,{\mathbf{y}}_{k-N}^{(k)}$, it is straightforward to show that (15)–(19) simplifies to a conventional Kalman smoother.
For linear ${M}_{i},{H}_{i}^{(k)}$ the algorithm (15)–(19) finds $E[{\underline{\mathbf{x}}}_{k}|{\underline{\mathbf{y}}}_{0},{\underline{\mathbf{y}}}_{1},\dots ,{\underline{\mathbf{y}}}_{k}],\phantom{\rule{0.277778em}{0ex}}E[{\underline{\mathbf{x}}}_{k+1}|{\underline{\mathbf{y}}}_{0},{\underline{\mathbf{y}}}_{1},\dots ,{\underline{\mathbf{y}}}_{k}]$ if the errors ${\mathit{\nu}}_{i}^{(k)},{\mathit{\omega}}_{i}$ are zero-mean, Gaussian and uncorrelated. As noted in Anderson and Moore (1979), if we restrict attention to analysis-prediction equations of form (16) and (18) then we may drop the Gaussian assumption on ${\mathit{\nu}}_{i}^{(k)},{\mathit{\omega}}_{i}$, merely requiring them to be zero-mean and uncorrelated, and (15)–(19) still minimises the expected error variance.
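A whole-window Kalman filter cycle of this kind can be sketched in NumPy as below. This is a minimal illustration under our own assumptions rather than a transcription of Table 1: we assume the window is ordered oldest to newest, that the window transition matrix shifts every state down one slot and applies ${M}_{k}$ only to the newest state, and that model error enters only the newest block. All function names are hypothetical.

```python
import numpy as np


def window_operators(M_k, Q_k, N):
    """Build assumed whole-window transition and model-error covariance.

    The new window (x_{k-N+1}, ..., x_{k+1}) is the old window shifted by
    one slot, with the newest state obtained as M_k x_k plus model error.
    """
    n = M_k.shape[0]
    size = n * (N + 1)
    M_bar = np.zeros((size, size))
    for i in range(N):
        # New block i is a copy of old block i+1 (pure shift, no error).
        M_bar[i * n:(i + 1) * n, (i + 1) * n:(i + 2) * n] = np.eye(n)
    M_bar[N * n:, N * n:] = M_k
    Q_bar = np.zeros((size, size))
    Q_bar[N * n:, N * n:] = Q_k
    return M_bar, Q_bar


def kalman_analysis(x_prior, P_prior, y, H_bar, R_bar):
    """Standard Kalman analysis step applied to the whole window."""
    S = H_bar @ P_prior @ H_bar.T + R_bar
    K = np.linalg.solve(S, H_bar @ P_prior).T  # equals P H^T S^{-1}
    x_post = x_prior + K @ (y - H_bar @ x_prior)
    P_post = (np.eye(len(x_prior)) - K @ H_bar) @ P_prior
    return x_post, P_post


def window_forecast(x_post, P_post, M_bar, Q_bar):
    """Standard Kalman prediction step applied to the whole window."""
    return M_bar @ x_post, M_bar @ P_post @ M_bar.T + Q_bar


# Tiny example: n = 2 variables, maximum delay N = 1.
n, N = 2, 1
M_k = np.array([[1.0, 0.1], [0.0, 1.0]])
M_bar, Q_bar = window_operators(M_k, 0.01 * np.eye(n), N)
x_prior, P_prior = np.zeros(n * (N + 1)), np.eye(n * (N + 1))
# One delayed observation of x_{k-1} and one prompt observation of x_k.
H_bar = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]])
R_bar = 0.25 * np.eye(2)
y = np.array([1.0, 2.0])
x_post, P_post = kalman_analysis(x_prior, P_prior, y, H_bar, R_bar)
x_next, P_next = window_forecast(x_post, P_post, M_bar, Q_bar)
```

The point of the sketch is that the delayed observation of ${\mathbf{x}}_{k-1}$ is handled in exactly the same way as the prompt one, at the cost of carrying an $n(N+1)\times n(N+1)$ covariance.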
For large-scale data assimilation variational methods are almost universally used, so it is of interest to cast the optimal RUC analysis step (16) in variational form. Define
where $\widehat{A}$ is the bottom right $Nn\times Nn$ submatrix of ${\underline{P}}_{k-1|k-1}$. Then the analysis step (16) is equivalent to
where $\underline{\mathit{\delta}}$ minimises
This is proved in Appendix 1. The ${J}_{b}$ term (20) constrains the N states in the intersection between the old and new windows by the inverse of the ‘background error covariance matrix’ $\widehat{A}$. This ‘big B’ of size $Nn\times Nn$ is formed by taking the analysis error covariance from the previous stage and shearing off the oldest row and column.
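As a numerical illustration of the kind of equivalence involved, the following sketch checks the generic identity between a Kalman gain update and the minimiser of a quadratic cost with a full prior covariance. It is not the split ${J}_{b}$ form with $\widehat{A}$ derived here; the matrices are random and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 6, 3  # window state size and number of observations (illustrative)

G = rng.standard_normal((m, m))
P = G @ G.T + m * np.eye(m)       # symmetric positive definite prior covariance
H = rng.standard_normal((p, m))   # linear observation operator
R = 0.5 * np.eye(p)               # observation error covariance
d = rng.standard_normal(p)        # innovation y - H x_prior

# Kalman form of the increment: delta = K d.
K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
delta_kf = K @ d

# Variational form: minimise
#   J(delta) = 0.5 delta^T P^{-1} delta + 0.5 (d - H delta)^T R^{-1} (d - H delta)
# by solving the normal equations (Hessian) delta = H^T R^{-1} d.
R_inv = np.linalg.inv(R)
hessian = np.linalg.inv(P) + H.T @ R_inv @ H
delta_var = np.linalg.solve(hessian, H.T @ R_inv @ d)
```

The two increments agree to rounding error, which is the Sherman-Morrison-Woodbury identity in matrix form.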
We saw in Section 3.1 how the problem of assimilating data immediately it becomes available can be cast into a standard signal model/observation model form (9) and (10), to which we can then apply well-established theory, e.g. Jazwinski (1970). In particular, given (9) and (10) we can compute (sequentially in k) the conditional pdfs $p({\underline{\mathbf{x}}}_{k}|{\underline{\mathbf{y}}}_{0},\dots ,{\underline{\mathbf{y}}}_{k})$ and $p({\underline{\mathbf{x}}}_{k+1}|{\underline{\mathbf{y}}}_{0},\dots ,{\underline{\mathbf{y}}}_{k})$. The novel feature for us is that ${\underline{\mathbf{f}}}_{k}$ and ${\underline{\mathbf{h}}}_{k}$ in (9) and (10) have a special structure which leads to significant simplifications, in particular enabling us to express these pdfs in terms of the original (as opposed to block) variables.
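In general, given the prior pdf $p\left({\underline{\mathbf{x}}}_{k}|{\underline{\mathbf{y}}}_{0},\dots ,{\underline{\mathbf{y}}}_{k-1}\right)$ and the conditional pdf of the observations given the state $p\left({\underline{\mathbf{y}}}_{k}|{\underline{\mathbf{x}}}_{k}\right)$, Bayes’ theorem tells us that the posterior pdf $p\left({\underline{\mathbf{x}}}_{k}|{\underline{\mathbf{y}}}_{0},\dots ,{\underline{\mathbf{y}}}_{k}\right)$ is
$$p\left({\underline{\mathbf{x}}}_{k}|{\underline{\mathbf{y}}}_{0},\dots ,{\underline{\mathbf{y}}}_{k}\right)=\frac{1}{\mathcal{N}}\,p\left({\underline{\mathbf{y}}}_{k}|{\underline{\mathbf{x}}}_{k}\right)p\left({\underline{\mathbf{x}}}_{k}|{\underline{\mathbf{y}}}_{0},\dots ,{\underline{\mathbf{y}}}_{k-1}\right)\qquad (25)$$
where the normalisation $\mathcal{N}$ is
$$\mathcal{N}={\int }_{\mathcal{U}}p\left({\underline{\mathbf{y}}}_{k}|\underline{\mathbf{x}}\right)p\left(\underline{\mathbf{x}}|{\underline{\mathbf{y}}}_{0},\dots ,{\underline{\mathbf{y}}}_{k-1}\right)\,d\underline{\mathbf{x}}\qquad (26)$$
and the domain of integration $\mathcal{U}={R}^{(N+1)n}$. We will suppose the basic process satisfies the Markov property $p({\mathbf{x}}_{k}|{\mathbf{x}}_{k-1},{\mathbf{x}}_{k-2},\dots )=p\left({\mathbf{x}}_{k}|{\mathbf{x}}_{k-1}\right)$ which implies the same for underlined states $p\left({\underline{\mathbf{x}}}_{k}|{\underline{\mathbf{x}}}_{k-1},{\underline{\mathbf{x}}}_{k-2},\dots \right)=p\left({\underline{\mathbf{x}}}_{k}|{\underline{\mathbf{x}}}_{k-1}\right)$. Given $p({\underline{\mathbf{x}}}_{k-1}|{\underline{\mathbf{y}}}_{0},\dots ,{\underline{\mathbf{y}}}_{k-1})$ (the posterior pdf at $k-1$) and $p\left({\underline{\mathbf{x}}}_{k}|{\underline{\mathbf{x}}}_{k-1}\right)$, the Chapman-Kolmogorov equation then gives us for the prior pdf $p\left({\underline{\mathbf{x}}}_{k}|{\underline{\mathbf{y}}}_{0},\dots ,{\underline{\mathbf{y}}}_{k-1}\right)$
$$p\left({\underline{\mathbf{x}}}_{k}|{\underline{\mathbf{y}}}_{0},\dots ,{\underline{\mathbf{y}}}_{k-1}\right)={\int }_{\mathcal{U}}p\left({\underline{\mathbf{x}}}_{k}|{\stackrel{~}{\underline{\mathbf{x}}}}_{k-1}\right)p\left({\stackrel{~}{\underline{\mathbf{x}}}}_{k-1}|{\underline{\mathbf{y}}}_{0},\dots ,{\underline{\mathbf{y}}}_{k-1}\right)\,d{\stackrel{~}{\underline{\mathbf{x}}}}_{k-1}\qquad (27)$$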
We could cycle (27) and (25) to obtain the posterior pdf for every k. However, as mentioned there are simplifications arising in this case.
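The cycling of a prediction step and a Bayes update can be illustrated with a discrete-state surrogate, in which the integrals become sums. This sketch is ours: the three-state system, the transition matrix and the likelihood values are made up, and no use is made of the special structure discussed in the text.

```python
import numpy as np


def predict(posterior, transition):
    """Discrete Chapman-Kolmogorov step.

    transition[j, i] is the probability of moving from state i to state j,
    so each column sums to one.
    """
    return transition @ posterior


def update(prior, likelihood):
    """Discrete Bayes update: multiply by the likelihood and normalise."""
    unnormalised = likelihood * prior
    return unnormalised / unnormalised.sum()


posterior_prev = np.array([0.5, 0.3, 0.2])
transition = np.array([[0.8, 0.1, 0.0],
                       [0.2, 0.8, 0.2],
                       [0.0, 0.1, 0.8]])
likelihood = np.array([0.1, 0.7, 0.2])  # p(y | state) for the new observation

prior = predict(posterior_prev, transition)
posterior = update(prior, likelihood)
```

Repeating `predict` and `update` for successive k gives the posterior at every step; the normalising sum plays the role of $\mathcal{N}$.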
By virtue of the fact that in (8) the ith sub-vector of ${\underline{\mathbf{h}}}_{k}$ depends only on ${\mathbf{x}}_{k-N+i-1}$, $p\left({\underline{\mathbf{y}}}_{k}|{\underline{\mathbf{x}}}_{k}\right)$, the conditional pdf of the observations given the state, factors into
Additionally, the transition pdf $p\left({\underline{\mathbf{x}}}_{k}|{\stackrel{~}{\underline{\mathbf{x}}}}_{k-1}\right)$ may be written
Combined with (27) this gives
We may cycle (30) and (25), (28) to obtain the posterior pdf for every k. For example, if the observation operators ${\mathbf{h}}_{i}^{(k)}$ and model ${\mathbf{f}}_{k}$ are linear, and the errors ${\mathit{\nu}}_{i}^{(k)}$, ${\mathit{\omega}}_{i}$ are Gaussian, then one may check that (30), (25) and (28) imply that the posterior pdf
We illustrate some of the foregoing with a small example, in which the model is the 40-dimensional chaotic model proposed by Lorenz (1996):
$$\frac{d{x}_{i}}{dt}=\left({x}_{[i+1]}-{x}_{[i-2]}\right){x}_{[i-1]}-{x}_{i}+F,\quad i=1,\dots ,n\qquad (31)$$
with $F=8$, $n=40$, where [i] denotes $i \bmod n$. This system is integrated using fourth order Runge–Kutta with a time step of 0.05/6 during which time errors grow at a rate corresponding to order one hour in an atmospheric system. We therefore refer to time step k as time k. The truth is obtained by integrating (31) and to each component of x adding Gaussian model error every time step with variance ${\mathit{\sigma}}_{q}^{2}$.
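A minimal sketch of this model and its integration is given below. The function names, the random seed and the way the model error is added after each step are our assumptions; the forcing, dimension and time step follow the values quoted in the text.

```python
import numpy as np

N_GRID = 40
FORCING = 8.0
DT = 0.05 / 6  # time step quoted in the text


def lorenz96_tendency(x, forcing=FORCING):
    """dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F with cyclic indices."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing


def rk4_step(x, dt=DT, forcing=FORCING):
    """One classical fourth order Runge-Kutta step."""
    k1 = lorenz96_tendency(x, forcing)
    k2 = lorenz96_tendency(x + 0.5 * dt * k1, forcing)
    k3 = lorenz96_tendency(x + 0.5 * dt * k2, forcing)
    k4 = lorenz96_tendency(x + dt * k3, forcing)
    return x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def truth_step(x, sigma_q, rng):
    """Advance the truth: model step plus Gaussian error of variance sigma_q**2."""
    return rk4_step(x) + sigma_q * rng.standard_normal(x.shape)


rng = np.random.default_rng(0)
x_truth = FORCING + 0.01 * rng.standard_normal(N_GRID)  # perturbed rest state
for _ in range(10):
    x_truth = truth_step(x_truth, sigma_q=0.182, rng=rng)
```

The uniform state ${x}_{i}=F$ is an equilibrium of the deterministic model, which gives a convenient check on the implementation.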
We suppose that at time k eight observations have just become available, at:
points ${x}_{[k+1]}$, ${x}_{[k+11]}$, ${x}_{[k+21]}$ and ${x}_{[k+31]}$, valid at time $k-1$
point ${x}_{[k+6]}$, valid at time $k-2$
point ${x}_{[k+16]}$, valid at time $k-3$
point ${x}_{[k+26]}$, valid at time $k-4$
point ${x}_{[k+36]}$, valid at time $k-5$
In this example we suppose there are no ‘instantaneously available’ observations (i.e. available at time k and also valid at k). Each observation has Gaussian error with variance ${\mathit{\sigma}}_{o}^{2}$ where ${\mathit{\sigma}}_{o}=0.546$. Every grid point is observed every 5 time steps and the observation network repeats itself exactly every 10 time steps. The system is well-observed and for the values of ${\mathit{\sigma}}_{o},{\mathit{\sigma}}_{q}$ used here the departure from linearity is small enough that the foregoing linear theory can be well applied to the linearised model.
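The observation network listed above can be enumerated with a short sketch. The function name and the zero-based reading of $[i]$ as `i % n` are our assumptions.

```python
from collections import Counter

N_GRID = 40


def observations_received_at(k, n=N_GRID):
    """List (grid_index, validity_time) for the eight observations received at k."""
    # Four observations with a delay of one time step.
    received = [((k + offset) % n, k - 1) for offset in (1, 11, 21, 31)]
    # One observation each with delays of two to five time steps.
    received += [((k + offset) % n, k - delay)
                 for offset, delay in ((6, 2), (16, 3), (26, 4), (36, 5))]
    return received


# Count how often each grid point is observed over 40 consecutive receipt times.
counts = Counter(index
                 for k in range(100, 100 + N_GRID)
                 for index, _ in observations_received_at(k))
```

Over any 40 consecutive receipt times each grid point appears eight times, i.e. once every 5 time steps on average, and the longest delay is 5 time steps.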
Consider first traditional non-overlapping window strategies. Suppose we have a 6-hour cycle with 6-hour windows. For the window $[k-6,k)$ we wish to assimilate observations valid in $k-6\le t<k$. For non-overlapping windows we use an optimal^{1} smoother, i.e. 4D-Var (Li and Navon, 2001) with model error correctly accounted for and correct cycling of background error covariances. For the window $[k-6,k)$ this produces analyses $\{{\mathbf{x}}_{k-6}^{a},\dots ,{\mathbf{x}}_{k-1}^{a}\}$.
As discussed in Section 2, the number of observations which are valid in the window and available for the analysis increases the longer the interval of time between the end of the window and when the analysis is performed, which we term the lag. For the present example the number of observations available at each time in the window if the analysis is performed at $k,k+2,k+4$ is shown in Table 2.
Taking for example the lag=2 case, at times k and $k+1$ the most recently available analyses are those in the window $[k-12,k-6)$ with last analysis at $k-7$, while at $k+2,\dots ,k+5$ the most recently available analysis is at $k-1$. Table 3 shows the validity time of the most recent analysis available at times $k,\dots ,k+5$ for lags of 0, 2 and 4 hours.
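The bookkeeping behind this can be sketched as follows. The function is ours and rests on stated assumptions: windows end at multiples of the window length, the analysis for a window ending at e is performed (and instantly available) at e + lag, and its last analysed state is valid at e - 1.

```python
def latest_analysis_validity(t, lag, window=6):
    """Validity time of the most recent analysis available at time t.

    Assumes non-overlapping windows ending at multiples of `window`, with
    the analysis for the window ending at e becoming available at e + lag.
    """
    latest_window_end = window * ((t - lag) // window)
    return latest_window_end - 1


k = 12  # an illustrative window boundary
lag2 = [latest_analysis_validity(k + s, lag=2) for s in range(6)]
lag4 = [latest_analysis_validity(k + s, lag=4) for s in range(6)]
lag0 = [latest_analysis_validity(k + s, lag=0) for s in range(6)]
```

For lag=2 this gives a latest analysis at $k-7$ for times k and $k+1$ and at $k-1$ for $k+2,\dots ,k+5$, as described above; the required forecast length at time t is t minus the returned validity time.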
Fig. 4 shows the RMS forecast error (using ${\mathit{\sigma}}_{q}=0.182$ and averaged over 2000 cycles) in forecasts valid at times $t=k,k+1,\dots ,k+5$ taken from the latest available analysis using lag=0 (in black), lag=2 (in blue) and lag=4 (in green). Fig. 4 illustrates a point about non-overlapping windows made in Section 2 above, that we choose between a short lag between observations and when the analysis is performed, giving timely analyses but not using all the observations, or a longer lag using more observations, but which at any given time requires longer forecasts which will be more degraded by model error.
For the lag=2 and lag=4 cases we also show (dashed lines) the RMS forecast error for times between the end of the analysis window and the time the analysis is performed.
Since in our example the longest delay in receipt of observations is 5 h, for the optimal RUC method of Section 3 we have $N=5$. At time j this produces analyses $\{{\mathbf{x}}_{j-5}^{a},\dots ,{\mathbf{x}}_{j}^{a}\}$.
A comparison of the observation usage of non-overlapping windows and optimal RUC was illustrated in Figs. 2 and 3. In optimal RUC all observations are used, as soon as they are received.
The red curve in Fig. 4 shows the RMS error in the RUC analysis at ${\mathbf{x}}_{j}^{a}$ for $j=k,\dots ,k+5$. From the foregoing we know this will always be less than the RMS error at j using any available analysis using a non-overlapping window with any lag. Note however that this error can be greater than that from the lagged analyses run at a later time, e.g. in this example for the lag-2 and lag-4 analyses at time k. This is because the lagged analyses are using observations not available at time k.
In Section 3 above we derived the optimal solution to the problem of assimilating data as soon as it becomes available, and saw that if the maximum delay is N and the state is described by n variables that this involved manipulating vectors of size $n(N+1)$ and their error covariances of size $n(N+1)\times n(N+1)$.
If observation and model errors are uncorrelated in time then optimal data assimilation methods for non-overlapping windows only involve vectors and matrices of size n and $n\times n$.
For large-scale systems manipulating vectors of size $n(N+1)$ and more particularly matrices of size $n(N+1)\times n(N+1)$ may not be manageable. Furthermore, NWP centres already have methods implemented for non-overlapping windows (we will refer to these as ‘traditional methods’) and will naturally seek ways of adapting these to the RUC problem. Hence a topic of practical importance is the relationship between the optimal solution to RUC and traditional methods applied to RUC.
For simplicity we restrict attention to the case of linear forecast and observation operators where, as in Section 3.2, the errors are Gaussian and uncorrelated. We will designate the optimal solution for RUC in this case (i.e. Table 1) as Method 0. We will develop suboptimal methods for RUC based on traditional methods for non-overlapping windows, with Method 3 a ‘naive’ application of such a method to RUC, and Methods 2 and 1 adaptations of this which are progressively closer to the optimal solution. We will then examine the relation between the four methods.
Suppose that at time k we have prior estimates
and we have just received observations
A natural extension of the 4D-Var method (e.g. Li and Navon (2001)) as used for non-overlapping windows is to form analyses at $j=k-N,\dots ,k$
where
minimises
Method 3: The most similar to traditional 4D-Var, in which the only state saved from the above analyses is the first one ${\mathbf{x}}_{k-N}^{a}$ (i.e. at the beginning of the window). Denoting the model evolution from $k-N$ to j by ${M}_{k-N}^{j}={M}_{j-1}\dots {M}_{k-N+1}{M}_{k-N}$ this gives us priors
The number of prior states which are simply analysis states from the previous cycle is therefore 0, 1 and N for Methods 3, 2 and 1, respectively. The four RUC methods (with the optimal one of Section 3.2 labelled ‘Method 0’) are summarised in Table 4. The formation of backgrounds is shown in more detail for $N=2$ in Table 5.
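The formation of backgrounds in Methods 1–3 can be sketched as below. This reflects our reading of the counts 0, 1 and N of retained analysis states: Method 3 forecasts every background from the first analysis of the old window, Method 2 keeps the first state of the new window and forecasts the rest from it, and Method 1 keeps all N overlapping analyses and forecasts only the newest state. The function name and data layout are hypothetical.

```python
import numpy as np


def next_backgrounds(analyses, models, method):
    """Form backgrounds x^b_{k-N+1}, ..., x^b_{k+1} for the next cycle.

    analyses: list [x^a_{k-N}, ..., x^a_k] from the current cycle.
    models:   list [M_{k-N}, ..., M_k], where M_j maps time j to time j+1.
    method:   1, 2 or 3 (assumed meaning as described in the lead-in).
    """
    N = len(analyses) - 1
    keep = {1: N, 2: 1, 3: 0}[method]  # analysis states reused as priors
    backgrounds = list(analyses[1:keep + 1])
    state = analyses[keep]
    # Forecast the remaining backgrounds from the last retained state.
    for j in range(keep, N + 1):
        state = models[j] @ state
        backgrounds.append(state)
    return backgrounds


N = 2
M = np.array([[1.0, 0.1], [-0.1, 1.0]])
models = [M] * (N + 1)
analyses = [np.array([1.0, 0.0]), np.array([0.9, 0.2]), np.array([1.1, -0.3])]
backgrounds = {m: next_backgrounds(analyses, models, m) for m in (1, 2, 3)}
```

If the analyses happen to be exactly consistent with the model, so that each is the forecast of the one before, all three methods give the same backgrounds.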
To compare the various methods we will need the covariance of their errors. We may write (33) and (34) as
where
where we use the notation ${({\underline{U}}_{k})}_{\underline{i},\underline{j}}$ to denote the $(i,j)$th $n\times n$ submatrix of ${\underline{U}}_{k}$. Denoting the truth by ${\underline{\mathbf{x}}}_{k}^{t}$ and
it follows that for all Methods 1–3 the analysis error covariance ${\underline{A}}_{k}$ is, from (38)
We note this depends both on ${\underline{B}}_{k}$ and (via ${\underline{K}}_{k}$ and ${\underline{B}}_{k}^{v}$) the prescribed B.
The background error covariance ${\underline{B}}_{k+1}$ depends on which method is used. For method $\ell $ we may write
where ${\underline{M}}_{k}^{(1)}={\underline{M}}_{k}^{(0)}={\underline{M}}_{k}$ as specified in (12) and ${\underline{M}}_{k}^{(2)},{\underline{M}}_{k}^{(3)}$ are illustrated for $N=2$ in Table 6. Since the error in ${\underline{\mathbf{x}}}_{k+1}^{b}$ using method $\ell $ is
where ${\underline{\mathit{\epsilon}}}_{k}^{q}={\underline{M}}_{k}^{(\ell )}{\underline{\mathbf{x}}}_{k}^{t}-{\underline{\mathbf{x}}}_{k+1}^{t}$ we can express ${\underline{B}}_{k+1}$ in terms of ${\underline{M}}_{k}^{(\ell )}$, ${\underline{A}}_{k}$, model error covariance and the cross-covariance of analysis error ${\underline{\mathit{\epsilon}}}_{k}^{a}$ and model error ${\underline{\mathit{\epsilon}}}_{k}^{q}$:
In order to cycle Methods 1–3 we need to specify B in (34). For the rest of this section, for the purposes of comparing the four methods, we will suppose that in Methods 1–3 we use $B={({\underline{B}}_{k})}_{\underline{1},\underline{1}}$, which can be obtained from (45) (as above, underlined subscripts refer to submatrices, so ${({\underline{B}}_{k})}_{\underline{1},\underline{1}}$ is the top left $n\times n$ submatrix of ${\underline{B}}_{k}$). It can be shown that for Methods 1 and 2 ${({\underline{B}}_{k})}_{\underline{1},\underline{1}}={({\underline{A}}_{k-1})}_{\underline{2},\underline{2}}$.
An important comparison is between optimal Method 0 and suboptimal Method 1. They share the same background step (43) with the same ${\underline{M}}_{k}$. In both cases the analysis step may be written in the form (38), though whereas in Method 0 the gain ${\underline{K}}_{k}$
uses true ${\underline{B}}_{k}$, i.e. the covariance of the error in ${\underline{\mathbf{x}}}_{k}^{b}$, for Method 1 the gain ${\underline{K}}_{k}$ is
where ${\underline{B}}_{k}^{v}$ is defined in (40). Because of the similarity in the structures of Methods 0 and 1 there is a simple and strong relation between their errors. If the sequence of background error covariances using Methods 0 and 1 are designated respectively ${\underline{B}}_{k}$ and ${\underline{\stackrel{~}{B}}}_{k}$, and we start from the same prior error covariance ${\underline{\stackrel{~}{B}}}_{N}={\underline{B}}_{N}$, then for all $k\ge N$ the difference ${\underline{\stackrel{~}{B}}}_{k}-{\underline{B}}_{k}$ is positive semi-definite, usually written
$${\underline{\stackrel{~}{B}}}_{k}\ge {\underline{B}}_{k}.$$
This is proved in Appendix 2.
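One standard ingredient of results of this kind can be checked numerically: for a single analysis step, a gain built from a mis-specified background error covariance cannot give a smaller analysis error covariance than the optimal gain. The sketch below illustrates only this single-step fact, not the cycled result or its proof; the matrices are random and the names are ours.

```python
import numpy as np


def gain(B, H, R):
    """Gain K = B H^T (H B H^T + R)^{-1} computed from a given covariance B."""
    return B @ H.T @ np.linalg.inv(H @ B @ H.T + R)


def analysis_covariance(B_true, K, H, R):
    """Analysis error covariance for an arbitrary gain K (Joseph form)."""
    I_KH = np.eye(B_true.shape[0]) - K @ H
    return I_KH @ B_true @ I_KH.T + K @ R @ K.T


rng = np.random.default_rng(2)
m, p = 6, 3
G = rng.standard_normal((m, m))
B_true = G @ G.T + np.eye(m)           # true background error covariance
B_assumed = np.diag(np.diag(B_true))   # mis-specified: correlations dropped
H = rng.standard_normal((p, m))
R = 0.5 * np.eye(p)

A_opt = analysis_covariance(B_true, gain(B_true, H, R), H, R)
A_sub = analysis_covariance(B_true, gain(B_assumed, H, R), H, R)
min_eig = np.linalg.eigvalsh(A_sub - A_opt).min()
```

The smallest eigenvalue of `A_sub - A_opt` is non-negative up to rounding, since the difference equals $(K-{K}_{opt})S{(K-{K}_{opt})}^{T}$ with $S=H{B}_{true}{H}^{T}+R$.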
We have four RUC methods, the optimal and three suboptimal ones. We can cycle each as described above, for the suboptimal methods using $B={({\underline{B}}_{k})}_{\underline{1},\underline{1}}$ (Section 5.2).
A limiting case which exhibits some of the differences between them, in particular how information is saved from previous cycles, is obtained by letting model error covariance ${Q}_{k}\to \infty $ for all k.
For simplicity suppose $\underline{H}=I$ and ${R}_{k}$ and ${Q}_{k}$ are independent of k. If $Q=\infty $ then after N cycles for Method 0, one cycle for Methods 1 and 2, and immediately for Method 3, all knowledge of the initial background state ${\mathbf{x}}_{0}^{b}$ and its error covariance is lost. In Table 7 we show the analysed state ${\underline{\mathbf{x}}}_{j+N}^{a}$ produced by the four methods for any $j\ge N$ if model error is infinite, and the corresponding background error and analysis error covariances.
The optimal Method 0 retains all the observation information ever received; at time $j+N$ the estimate of state at any time between j and $j+N$ is simply the average of all the observations ever received valid at that time. At the other extreme, Method 3 ‘forgets’ all the observation information from previous cycles: at time $j+N$ the estimate of state at any time between j and $j+N$ is just the value of the observation valid at that time and received at time $j+N$. Methods 1 and 2 retain observation information from the previous cycle only at the initial time. These different behaviours are reflected in the analysis error covariances shown in Table 7.
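The two extremes can be illustrated for a single scalar state with $H=I$ and constant observation error variance. This sketch is ours and does not reproduce Table 7: it accumulates observations of one validity time in information form, starting from zero prior precision (the infinite model error limit), and compares the result with using only the latest observation.

```python
def accumulate_observations(observations, obs_var):
    """Optimal estimate from successive observations with no prior information.

    Starting from zero precision, each observation adds 1/obs_var to the
    precision; the estimate is the running mean of the observations.
    """
    precision, information = 0.0, 0.0
    for y in observations:
        precision += 1.0 / obs_var
        information += y / obs_var
    return information / precision, 1.0 / precision


# Three observations of the same state, received in successive cycles (made up).
obs, obs_var = [1.0, 2.0, 4.5], 0.5
retain_all_mean, retain_all_var = accumulate_observations(obs, obs_var)
latest_only_mean, latest_only_var = obs[-1], obs_var
```

Retaining everything gives the mean of the observations with variance $R/m$ after m observations, whereas forgetting previous cycles leaves the latest observation with variance R.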
Comparing Methods 1 and 2, while in the limit $Q\to \infty $ Method 1 analyses are no better than those of Method 2, we note in Table 7 that it has better backgrounds than Method 2.
Suppose that ${Q}_{k}=0$ for all k, and we are given an estimate ${\mathbf{x}}_{0}^{b}$ of ${\mathbf{x}}_{0}$ with error covariance ${B}_{0}$. Estimate the remaining states in the window $\{0,\dots ,N\}$ by
If ${Q}_{k}=0$ for all k then Methods 1–3 simplify to their ‘strong constraint’ forms, in which (34) simplifies to
Crucially, in the limit $Q\to 0$ Methods 0–3 coincide, so in the absence of model error the ‘suboptimal’ Methods 1–3 are in fact optimal. This is proved in Appendix 3.
We may apply linearised versions of Methods 0–3 to our nonlinear chaotic example of Section 4. In an attempt to mitigate the effects of linearisation error one can formulate outer loop-style iterations for these strategies, which may be worth implementing in more non-linear systems (for the examples here they made negligible difference). Alternatively one could use the ‘best linear approximation’ (Payne, 2013).
In our example of Section 4, at time k observations have just become available which are valid at $k-1,\dots ,k-5$ so $N=5$. The optimal Method 0 and suboptimal Methods 1–3 all provide analyses ${\mathbf{x}}_{k}^{a},{\mathbf{x}}_{k-1}^{a},\dots ,{\mathbf{x}}_{k-5}^{a}$. (Since in our example no observations are instantaneously available, ${\mathbf{x}}_{k}^{a}$ is here a forecast from ${\mathbf{x}}_{k-1}^{a}$).
Each strategy is cycled 10,000 times and the first hundred cycles disregarded. Figure 5 shows the RMS error in the analyses at $k-5,\dots ,k$, for the optimal Method 0 (black), and suboptimal Methods 1 (blue), 2 (green) and 3 (red), for ${\mathit{\sigma}}_{q}=1.82$ and 0.455 (upper and lower set respectively).
As expected from the foregoing, the errors $\mathcal{E}$ are ordered
Furthermore, the analyses and therefore their errors converge as ${\mathit{\sigma}}_{q}\to 0$.
A significant difference between the methods of the preceding sections and those used for large scale systems is that in the latter the prior error covariance (usually denoted B) is not cycled, but is either constant or is a convex combination of a constant and an estimate of cycled B (see (47) below).
It is important to note that insofar as B is fixed it is advantageous to assimilate data as far as possible simultaneously in larger units rather than split it up into smaller volumes and assimilate it in smaller units. Intuitively, by assimilating many observations simultaneously the deficiencies of the fixed B are reduced.
Illustrating this point is complicated by the fact that increasing observation batch size tends to involve making other changes which themselves have an impact. If we compare cycling using non-overlapping windows with RUC then at no instant in time are the two methodologies assimilating the same observations (see Section 2). If we use 4D-Var to compare cycling with windows of length 1 (assimilating one observation every time step) with windows of length 2 (two observations every two time steps) then the latter has the advantage of covariances evolved through the window, which is a different point to the one being made.
If we compare assimilating two observations simultaneously every time step with assimilating one after the other, in the latter case we have to decide what B to use for the second observation. In Appendix 4 we show that if in a scalar system we assimilate two observations every cycle, and have a choice between assimilating them
We may contrast this result concerning the use of fixed background covariances with the fact that, if B is chosen optimally every cycle, and ${B}_{1}$ and ${B}_{2}$ are chosen optimally every cycle, the two strategies will produce identical (and optimal) results.
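A single-cycle scalar illustration of this contrast is sketched below. It is our own simplified setting, not the cycled result of Appendix 4: the true background error variance is P, both observations have independent errors of variance r, and we compare the analysis error variance from simultaneous and sequential assimilation for different prescribed background variances.

```python
def simultaneous_variance(P, B, r):
    """Analysis error variance when both observations are assimilated at once.

    P is the true background error variance, B the prescribed one.
    """
    w = B / (2.0 * B + r)  # weight given to each observation
    return (1.0 - 2.0 * w) ** 2 * P + 2.0 * w ** 2 * r


def sequential_variance(P, B1, B2, r):
    """Analysis error variance when the observations are assimilated in turn,
    with prescribed background variances B1 and then B2."""
    k1 = B1 / (B1 + r)
    k2 = B2 / (B2 + r)
    return (((1.0 - k1) * (1.0 - k2)) ** 2 * P
            + ((1.0 - k2) * k1) ** 2 * r
            + k2 ** 2 * r)


P, r = 1.0, 1.0
at_once = simultaneous_variance(P, B=P, r=r)
# B updated to the posterior variance P r / (P + r) before the second observation.
in_turn_cycled = sequential_variance(P, B1=P, B2=P * r / (P + r), r=r)
# The same fixed B reused for the second observation.
in_turn_fixed = sequential_variance(P, B1=P, B2=P, r=r)
```

With these values the simultaneous and properly cycled sequential analyses both have error variance 1/3, while reusing the fixed B for the second observation gives 3/8.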
This means that if B is fixed then, in this respect, RUC is at a disadvantage compared with conventional cycling as now there are more cycles with fewer observations used every cycle. In practice this effect may be dwarfed by the advantages of RUC as discussed in Section 2. If not, the obvious remedy is to improve the cycling of the background error covariances.
As noted above, the effect is due to using a fixed B, and is removed if B is cycled properly. This is unattainable in current large-scale NWP, but centres such as the Met Office already employ a ‘hybrid’ B (Clayton et al., 2013)
$$B={\mathit{\beta}}_{c}^{2}{B}_{c}+{\mathit{\beta}}_{e}^{2}{B}_{e}\qquad (47)$$
where ${B}_{c}$ is fixed but ${B}_{e}$ is an estimate (from an ensemble) of the true prior error covariance, with ${\mathit{\beta}}_{c}^{2}+{\mathit{\beta}}_{e}^{2}=1$. For best performance ${\mathit{\beta}}_{e}\to 1$ as ensemble size increases, with the fixed part having no weight in the limit of an infinitely large ensemble.
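A hybrid covariance of this kind can be sketched as follows, assuming it takes the form $B={\mathit{\beta}}_{c}^{2}{B}_{c}+{\mathit{\beta}}_{e}^{2}{B}_{e}$ with the weights constrained as stated. The function name and the synthetic ensemble are ours, and the sketch omits the localisation that would normally be applied to an ensemble covariance in a large system.

```python
import numpy as np


def hybrid_covariance(B_c, ensemble, beta_e):
    """Combine a fixed covariance with an ensemble sample covariance.

    ensemble has shape (members, n); the weights satisfy
    beta_c**2 + beta_e**2 = 1, so only beta_e is supplied.
    """
    beta_c_sq = 1.0 - beta_e ** 2
    perturbations = ensemble - ensemble.mean(axis=0)
    B_e = perturbations.T @ perturbations / (ensemble.shape[0] - 1)
    return beta_c_sq * B_c + beta_e ** 2 * B_e


rng = np.random.default_rng(3)
n, members = 4, 20
B_c = np.eye(n)                                # fixed (climatological) part
ensemble = rng.standard_normal((members, n))   # synthetic ensemble
B_hybrid = hybrid_covariance(B_c, ensemble, beta_e=np.sqrt(0.5))
```

Setting `beta_e=0` recovers the fixed ${B}_{c}$ and `beta_e=1` the pure ensemble estimate, the limit appropriate to an infinitely large ensemble.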
There are other possible ways of introducing adaptivity into B, such as the ‘ensemble-variation integrated localised’ method (Auligné et al., 2016) and the so-called variational Kalman filters (Auvinen et al., 2010). In the latter the limited-memory quasi-Newton method is used to build a low-storage approximation to the Hessian of the analysis cost function, which therefore approximates the inverse of the analysis error covariance matrix, and can also be used to evolve the covariance forward to approximate B at the next analysis time.
Rapid update cycling (RUC) is the process by which we assimilate observations into a model as soon as they become available, or more practically, within some (short) time interval $\mathit{\delta}t$. We have seen that if the greatest delay in receiving observations is $N\mathit{\delta}t$ then the optimal solution to RUC at time k involves manipulating the $(N+1)n$-vectors $\underline{\mathbf{x}}$ formed from ${\mathbf{x}}_{k-N},\dots ,{\mathbf{x}}_{k}$ and the moments of the errors in $\underline{\mathbf{x}}$, such as their $(N+1)n\times (N+1)n$ error covariances.
Compared with ‘traditional’ cycling RUC makes more timely use of observations, which is particularly important for the provision of LBCs for LAMs. Another advantage of RUC is that the increments are smaller and hence linearisation error is reduced.
We have purposely concentrated on fundamental topics and avoided such practically important matters as efficiency and cost. The fact that for each analysis observation volumes and increments are smaller, and we always have a recent one available, suggests that it should be possible to reduce the cost per analysis.^{2} On contemporary HPCs where the increased power comes through higher numbers of processors rather than increased clock speeds this is more important than the total cost per day.^{3}
We may adapt ‘traditional’ methods designed for non-overlapping windows to RUC. These methods are suboptimal, but in all cases considered in this paper (Methods 1, 2 and 3 in Section 4) they coincide with the optimal solution in the limit where model error vanishes.
Assimilating observations in smaller batches can be disadvantageous if climatological background error variances are used. This potentially poses a challenge for RUC, which could perhaps be met by improved cycling of error covariances.
No potential conflict of interest was reported by the author.
2 Note also that there is scope for preconditioning using the work already done for recent analyses. This preconditioning could be based on Hessian eigenvectors (Fisher and Courtier, 1995), or on the vectors approximating the Hessian in the limited memory quasi-Newton method (Courtier et al., 1998).
3 We have also noted that the window length of RUC is determined by the longest delay in receiving observations, which in the operational example in Section 1 implies a window of 4 hours compared with the current 6 hours.
The author thanks Mike Cullen and Andrew Lorenc for useful discussions on this topic, and the referees appointed by Tellus for their comments.
Anderson, B. and Moore, J. 1979. Optimal Filtering. Prentice-Hall, Englewood Cliffs, NJ. 357 pp.
Auligné, T., Ménétrier, B., Lorenc, A. C. and Buehner, M. 2016. Ensemble-variational integrated localized data assimilation. Mon. Weather Rev. 144, 3677–3696. DOI: https://doi.org/10.1175/mwr-d-15-0252.1.
Auvinen, H., Bardsley, J. M., Haario, H. and Kauranne, T. 2010. The Variational Kalman filter and an efficient implementation using limited memory BFGS. Int. J. Numer. Methods Fluids 64, 314–335. DOI: https://doi.org/10.1002/fld.2153.
Benjamin, S. G., Dévényi, D., Weygandt, S. S., Brundage, K. J., Brown, J. M., and co-authors. 2004a. An hourly assimilation-forecast cycle: the RUC. Mon. Weather Rev. 132, 495–518. DOI: https://doi.org/10.1175/1520-0493.
Benjamin, S. G., Grell, G. A., Brown, J. M., Smirnova, T. G. and Bleck, R. 2004b. Mesoscale weather prediction with the RUC hybrid isentropic terrain-following coordinate model. Mon. Weather Rev. 132, 473–494. DOI: https://doi.org/10.1175/1520-0493.
Boyd, S. and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press, New York, NY. 716 pp.
Clayton, A. M., Lorenc, A. C. and Barker, D. M. 2013. Operational implementation of a hybrid ensemble/4D-Var global data assimilation system at the Met Office. Q. J. R. Meteorol. Soc. 139, 1445–1461. DOI: https://doi.org/10.1002/qj.2054.
Courtier, P., Andersson, E., Heckley, W., Vasiljevic, D., Hamrud, M., and co-authors. 1998. The ECMWF implementation of three-dimensional variational assimilation (3D-Var). I: formulation. Q. J. R. Meteorol. Soc. 124, 1783–1807. DOI: https://doi.org/10.1002/qj.49712455002.
Fisher, M. and Courtier, P. 1995. Estimating the covariance matrices of analysis and forecast error in variational data assimilation. Tech. Memo. 220, 28. ECMWF, Shinfield Park, Reading, UK.
Gallier, J. 2010. The Schur complement and symmetric positive semidefinite (and definite) matrices. Penn Eng. Online at: www.cis.upenn.edu/~jean/schur-comp.pdf
Järvinen, H., Thépaut, J.-N. and Courtier, P. 1996. Quasi-continuous variational data assimilation. Q. J. R. Meteorol. Soc. 122, 515–534. DOI: https://doi.org/10.1002/qj.49712253011.
Jazwinski, A. H. 1970. Stochastic Processes and Filtering Theory. Academic Press, New York. 376 pp.
Li, Z. and Navon, I. M. 2001. Optimality of variational data assimilation and its relationship with the Kalman filter and smoother. Q. J. R. Meteorol. Soc. 127, 661–683. DOI: https://doi.org/10.1002/qj.49712757220.
Lorenz, E. 1996. Predictability -- a problem partly solved. Proceedings, Seminar on Predictability Vol. 1, ECMWF, Reading, UK, pp. 1–18.
Payne, T. J. 2013. The linearisation of maps in data assimilation. Tellus A: Dyn. Meteorol. Oceanogr. 65. DOI: https://doi.org/10.3402/tellusa.v65i0.18840.
Tang, Y., Lean, H. W. and Bornemann, J. 2013. The benefits of the Met Office variable resolution NWP model for forecasting convection. Met. Apps 20, 417–426. DOI: https://doi.org/10.1002/met.1300.
Veerse, F. and Thépaut, J.-N. 1998. Multiple-truncation incremental approach for four-dimensional variational data assimilation. Q. J. R. Meteorol. Soc. 124, 1889–1908. DOI: https://doi.org/10.1002/qj.49712455006.
Continuing with the notation of Section 3.3 we readily obtain identities (A1)–(A3) below:
If the sequences of background error covariances using Methods 0 and 1 are designated, respectively, ${\underline{B}}_{k}$ and ${\underline{\stackrel{~}{B}}}_{k}$, and we start from the same prior error covariance ${\underline{\stackrel{~}{B}}}_{N}={\underline{B}}_{N}$, then for all $k\ge N$
and therefore
Now consider the matrix
Since $\underline{X}$ is the sum of two positive semi-definite matrices it is positive semi-definite. Note also that since $\underline{R}$ is positive definite, ${({\underline{B}}^{v})}^{-1}\underline{\stackrel{~}{B}}{({\underline{B}}^{v})}^{-1}+{\underline{R}}^{-1}$ is invertible, so its Moore–Penrose inverse is its usual matrix inverse.
The term in square brackets in (B2) is the Schur complement of ${({\underline{B}}^{v})}^{-1}\underline{\stackrel{~}{B}}{({\underline{B}}^{v})}^{-1}+{\underline{R}}^{-1}$ in $\underline{X}$ so by Theorem 4.3 of Gallier (2010) is positive semi-definite. Therefore if ${\underline{B}}_{k}^{-1}\succeq {\underline{\stackrel{~}{B}}}_{k}^{-1}$ then ${\underline{A}}_{k}^{-1}-{\underline{\stackrel{~}{A}}}_{k}^{-1}$ is the sum of two positive semi-definite matrices, so is positive semi-definite. We conclude that if ${\underline{B}}_{k}\preceq {\underline{\stackrel{~}{B}}}_{k}$ then ${\underline{A}}_{k}\preceq {\underline{\stackrel{~}{A}}}_{k}$. Since both methods satisfy (43) with the same ${\underline{M}}_{k}$, both satisfy the same relation
Therefore ${\underline{A}}_{k}\preceq {\underline{\stackrel{~}{A}}}_{k}$ implies ${\underline{B}}_{k+1}\preceq {\underline{\stackrel{~}{B}}}_{k+1}$ and the claim follows. $\square $
Set ${\underline{V}}_{k}$ to be the first n columns of the $n(N+1)\times n(N+1)$ matrix ${\underline{U}}_{k}$, i.e.
Because $Q=0$ it follows from (40) that
We suppose inductively that for some $k\ge N$ Method 0 and Methods 1–3 have the same ${\underline{\mathbf{x}}}_{k}^{b},{\underline{B}}_{k}$ and that
This is true for $k=N$ by construction. For all methods we therefore have
Using the ‘Kalman identity’
it follows from (17), (42) that for all methods
and therefore from (16), (38) that in all cases
Recall that the superscript $\ell $ in ${\underline{M}}^{(\ell )}$ in (43) denotes the method used, with ${\underline{M}}_{k}^{(1)}={\underline{M}}_{k}^{(0)}={\underline{M}}_{k}$ as defined for the optimal solution in (12). For all $\ell =0,1,2,3$
Let ${\stackrel{~}{\underline{\mathbf{x}}}}_{k}^{a}$ denote the vector in square brackets in (C5) and ${\stackrel{~}{\underline{A}}}_{k}$ the matrix in square brackets in (C3C4). It follows from (18), (19) and (43), (45) that both for Method 0 and for Methods $\ell =1,2,3$ we have
Suppose we have a linear system, observations in some time interval [0, T] and all errors are Gaussian. If we assimilate the observations using an optimal method, such as 4D-Var with correctly cycled prior and posterior error covariances, using m assimilation windows
then the estimate of state at time T is independent of m and of how we choose
However, if instead of properly cycling the error covariances the background error covariances are fixed, it is often advantageous to assimilate data in larger batches.
We illustrate this by considering a case where x is a scalar quantity, which evolves in time according to
for some constant $\mathit{\mu}$ (which is supposed known, so there is no model error) with $|\mathit{\mu}|>1$. At each time i we wish to assimilate an observation ${y}_{1}(i)$ with error variance ${r}_{i}$ and an observation ${y}_{2}(i)$ with error variance ${s}_{i}$. To make the problem analytically tractable we will suppose that $\{({r}_{i},{s}_{i}),i=1,\dots ,\infty \}$ are drawn (independently of i) from $\{({R}_{j},{S}_{j}),j=1,\dots ,k\}$, where for any i, $\{{r}_{i}={R}_{j},{s}_{i}={S}_{j}\}$ with probability ${p}_{j},j=1,\dots ,k$.
We compare two assimilation strategies using 4D-Var with a non-cycled background:
Simultaneous (batch size of 2): at each time i we assimilate ${y}_{1}(i)$ and ${y}_{2}(i)$ simultaneously using fixed background error covariance b, i.e. ${x}_{a}={x}_{b}+\mathit{\delta}$ where $\mathit{\delta}$ minimises
We will show the following: we can choose b so that, however ${b}_{1}$ and ${b}_{2}$ are chosen, the mean square error using the simultaneous method is lower than that using the sequential method (and strictly lower if $k\ge 2$ and ${R}_{1}\ne {R}_{2}$).
where the inequality is strict if $k\ge 2$ and ${R}_{1}\ne {R}_{2}$.
This Lemma proves the claim, and is itself proved in (3a)–(3d).
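The claim can be illustrated numerically. The sketch below rests on our own assumptions rather than on the appendix equations: the state is multiplied by $\mathit{\mu}$ each step with no model error, both observations measure x directly, and the mean square error is the stationary expected analysis error variance, obtained from the linear recursion for that variance. The values of `MU` and `CASES` are hypothetical. The fixed b for the simultaneous method is set by a fixed-point iteration that makes b equal to the stationary mean background error variance, which is also an assumption of the sketch. The sequential method is only sampled on a grid of $({b}_{1},{b}_{2})$, so this illustrates the claim rather than proving it.

```python
MU = 1.2   # hypothetical growth factor, |MU| > 1, no model error
# Hypothetical (R_j, S_j, p_j): observation error variances and probability.
CASES = [(0.5, 1.0, 0.5), (2.0, 1.0, 0.5)]

def stationary_mse(weights):
    """weights(r, s) returns (w0, w1, w2), the weights given to the
    background and to the two observations. The expected analysis error
    variance obeys E[A] = rho * E[A] + sigma, so the stationary value is
    sigma / (1 - rho), or infinity if rho >= 1."""
    rho = sigma = 0.0
    for r, s, p in CASES:
        w0, w1, w2 = weights(r, s)
        rho += p * MU ** 2 * w0 ** 2
        sigma += p * (w1 ** 2 * r + w2 ** 2 * s)
    return sigma / (1.0 - rho) if rho < 1.0 else float("inf")

def simultaneous(b):
    """Both observations assimilated together with fixed variance b."""
    def weights(r, s):
        d = 1.0 / b + 1.0 / r + 1.0 / s
        return (1.0 / b) / d, (1.0 / r) / d, (1.0 / s) / d
    return weights

def sequential(b1, b2):
    """First observation with fixed b1, then the second with fixed b2."""
    def weights(r, s):
        k1 = b1 / (b1 + r)
        k2 = b2 / (b2 + s)
        return (1 - k1) * (1 - k2), (1 - k2) * k1, k2
    return weights

# Fixed b for the simultaneous method (assumption: fixed-point choice).
b = 1.0
for _ in range(500):
    b = MU ** 2 * sum(p / (1.0 / b + 1.0 / r + 1.0 / s) for r, s, p in CASES)

mse_sim = stationary_mse(simultaneous(b))

# Sample the sequential method on a logarithmic grid of (b1, b2).
grid = [10 ** (-2 + 4 * i / 40) for i in range(41)]
mse_seq_best = min(stationary_mse(sequential(b1, b2))
                   for b1 in grid for b2 in grid)

print(mse_sim, mse_seq_best)
```

Setting the two ${R}_{j}$ equal in `CASES` removes the advantage of the simultaneous method, since the sequential method can then reproduce its weights with a suitable fixed ${b}_{2}$.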
At any extremum of $\mathrm{\Gamma}$ Equation (D11) must hold, and in particular for every $i=1,\dots ,k$
The denominator is positive for all points in $\mathcal{D}$, therefore at any extremum in $\mathcal{D}$
Inserting this requirement with $j=1,2$ into the expressions for ${\mathit{\rho}}_{\mathit{seq}},{\mathit{\sigma}}_{\mathit{seq}}$ in (D5) it follows we would need
which for ${b}_{1}>0$ is only possible if ${R}_{1}={R}_{2}$.