In data assimilation, the aim is to combine observations with a priori information in a way which takes into account a statistical representation of their respective errors. In the Geosciences, the a priori information commonly comes from a sophisticated physical and dynamical model of the phenomena of interest, for example a numerical weather prediction.
Despite the non-linearity of these models, assimilation methods based on linearising the model and assuming Gaussian statistics have proved a powerful tool. Such methods include 4DVar, which is in operational use at many meteorological centres (Rabier, 2005). However, the assimilation is restricted to time and spatial scales where the non-linearity is small (Pires et al., 1996; Evensen, 1997).
At higher resolutions (e.g. convective scales), these geophysical models are highly non-linear, which potentially gives rise to significantly non-Gaussian a priori error distributions. This has led to increasing interest in methods of data assimilation that do not rely on assumptions of near-linearity of the model or Gaussian error distributions.
Two reviews highlighting the recent developments in non-linear (non-Gaussian) data assimilation have been given by van Leeuwen (2009) and Bocquet et al. (2010). Many of the methods discussed in these papers are based on the direct application of Bayes’ theorem in which the posterior distribution, p(x∣y), (the probability distribution of the state given the observations), is derived from the multiplication and normalisation of the prior, p(x), with the likelihood, p(y∣x) (the probability distribution of the observations given the state):
The analysis is then often defined as the mode or mean of the posterior distribution.
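As a sketch of this direct Bayesian update, the multiplication and normalisation of eq. (1) can be carried out numerically on a one-dimensional grid; the prior mean and variance, observation value and error variance below are illustrative assumptions, not values from any operational system:

```python
import numpy as np

# Direct application of Bayes' theorem, eq. (1), on a 1-D grid.
x = np.linspace(-10.0, 10.0, 2001)   # discretised state space
dx = x[1] - x[0]

prior = np.exp(-0.5 * (x - 1.0) ** 2 / 2.0)       # p(x): mean 1, variance 2
likelihood = np.exp(-0.5 * (x - 2.5) ** 2 / 1.0)  # p(y|x): obs 2.5, variance 1

posterior = prior * likelihood       # multiplication ...
posterior /= posterior.sum() * dx    # ... and normalisation

# One common definition of the analysis: the mean of the posterior.
posterior_mean = (x * posterior).sum() * dx
```

For this conjugate Gaussian pair the grid result reproduces the closed-form posterior mean, (x_b/B + y/R)/(1/B + 1/R) = 2.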
The assimilation of observations is expected to give us a better understanding of the state; however, it is clear that some observations are more useful than others. For example, some observations may be more accurate, and others may provide information about a larger part of state space. It is therefore important to understand the impact that individual observations and subsets of observations have on the posterior distribution.
In Gaussian data assimilation (Gaussian prior and likelihood function), many measures of observation impact on the analysis are operationally used, for example:
The aim of this study is to look at the impact a non-Gaussian prior has on the observation impact as measured by three different measures: the sensitivity of the analysis to the observations, mutual information and relative entropy. It is assumed throughout that the observation error has a Gaussian distribution.
In the next section, these three measures will be derived for Gaussian data assimilation. In Section 3, a simple model of the non-Gaussian prior will be presented. The effect that this prior has on the sensitivity of the analysis to the observations will then be studied and compared with the impact on the mutual information and relative entropy. In Section 4.1, these differences will be illustrated with the Lorenz '63 model. Finally, in Section 5, a summary of the findings and conclusions will be presented.
If the prior and likelihood can be assumed to be Gaussian [given by N(x_{b}, B) and N(y, R), respectively] and the function mapping from state to observation space is linear (written as matrix H), then the posterior distribution is also Gaussian and can be characterised solely by its mean (the analysis), x_{a}, and its error covariance matrix, P_{a}:
Here, K is known as the Kalman gain matrix, which is a function of B, R and H:
The posterior (analysis error) covariance matrix is given by:
This is independent of the means of the prior and likelihood.
For a derivation of eqs. (2) and (4), refer to, for example, Kalnay (2003).
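A minimal numerical sketch of eqs. (2)-(4), using a hypothetical three-variable state observed at its first component only (the values of B, R, H and the observation are illustrative assumptions):

```python
import numpy as np

# Gaussian analysis update, eqs. (2)-(4), for a 3-variable state
# observed at its first component only.
B = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 3.0]])          # prior (background) error covariance
R = np.array([[1.0]])                    # observation error covariance
H = np.array([[1.0, 0.0, 0.0]])          # linear observation operator

K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)   # Kalman gain, eq. (3)

x_b = np.zeros(3)                        # prior mean
y = np.array([1.5])                      # observed value
x_a = x_b + K @ (y - H @ x_b)            # analysis, eq. (2)
P_a = (np.eye(3) - K @ H) @ B            # analysis error covariance, eq. (4)
```

Note that P_a depends only on B, R and H, not on the means of the prior and likelihood, as stated above.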
The linear relationship between the analysis and the observations, given by eq. (2), allows for a direct interpretation of the analysis sensitivity to the observations in terms of the Kalman gain matrix:
This is a p×p matrix, where p is the size of the observation space. This allows for the evaluation of the impact of individual observations on the analysis projected onto observation space. In this context, S was studied by Cardinali et al. (2004). From eqs. (5) and (3), it can be seen that accurate independent observations of features of which we have little prior knowledge have the greatest impact on the analysis, because the Kalman gain will be large (Cardinali et al., 2004).
The trace of S gives the degrees of freedom for signal, d_{s}. This evaluates the expected fit of the analysis to x_{b} normalised by the error covariance matrix, B. That is d_{s}=E[(x_{a}−x_{b})^{T}B^{−1}(x_{a}−x_{b})]=trace(S) (Rodgers, 2000). The diagonal elements of the sensitivity matrix are bounded by 0 and 1 if R is diagonal, and therefore d_{s} lies between 0 and p. The closer d_{s} is to p, the greater the observation impact.
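The sensitivity matrix and the degrees of freedom for signal can be illustrated with a small hypothetical example; the prior and observation error variances below are assumptions chosen so that the bounds are easy to verify:

```python
import numpy as np

# Sensitivity matrix S = HK and degrees of freedom for signal d_s = trace(S).
B = np.diag([4.0, 0.25])   # prior error variances: one poorly, one well known
R = np.eye(2)              # independent observations with unit error variance
H = np.eye(2)              # both variables observed directly

K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
S = H @ K                  # p x p sensitivity of the analysis to the observations
d_s = np.trace(S)          # degrees of freedom for signal
```

With a diagonal R, each diagonal element of S lies between 0 and 1: the poorly known first variable contributes 0.8 to d_s, the well-known second variable only 0.2.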
Mutual information measures the reduction in entropy when an observation is made, that is, the difference between the entropy of the prior and that of the posterior. In information theory, entropy is a measure of the uncertainty associated with a random variable. For a probability distribution p(χ), entropy can be defined as H(χ)=−∫p(χ) ln p(χ) dχ. The entropy of a conditional probability distribution, p(χ∣z), is defined as H(χ∣z)=−∫∫p(z)p(χ∣z) ln p(χ∣z) dχ dz. This is the expected entropy of χ when conditioning on z. (Note that, given these definitions, entropy is dependent on the choice of units for the variable χ.)
When p(χ) is a Gaussian, the entropy associated with χ depends only on its covariance matrix, C_{χ}. The entropy in this case is given by (1/2)ln[(2πe)^{n}|C_{χ}|], where n is the size of the vector χ and ∣*∣ denotes the determinant (Rodgers, 2000). Mutual information for a Gaussian prior and posterior is therefore given by:
Mutual information for Gaussian data assimilation is therefore a measure of the difference in the determinant of the prior and posterior covariance matrices. Hence, it can be interpreted as the difference between a measure of the hypervolumes enclosed by iso-probability surfaces of the prior and posterior (Tarantola, 2005).
Mutual information can be rewritten in terms of the eigenvalues of the sensitivity matrix, S, presented in the last section. Firstly, note that, using eq. (4), the determinant of (I_{n}−KH) is equal to the determinant of (I_{p}−HK), where I_{n} and I_{p} are the identity matrices of dimension n×n and p×p, respectively. This leads to:
where λ_{i} is the ith eigenvalue of S (ordered in descending magnitude) and r≤min(n, p) is the rank of S. This links mutual information to the sensitivity of the analysis to the observations and hence to the degrees of freedom for signal [trace (S)]. It is a scalar interpretation of the observation impact, and therefore the impact of individual observations may not be easily quantified. However, mutual information can be shown to be additive with successive observations (see Appendix A.1).
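The equality between the determinant form of mutual information and its expression in the eigenvalues of S can be checked numerically; the sketch below uses an assumed two-variable prior observed through the sum of its components:

```python
import numpy as np

# Mutual information from determinants versus from the eigenvalues of S.
B = np.array([[2.0, 0.8],
              [0.8, 1.0]])           # prior error covariance
R = np.array([[0.5]])                # observation error variance
H = np.array([[1.0, 1.0]])           # a single observation of the sum

K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
P_a = (np.eye(2) - K @ H) @ B

MI_det = 0.5 * np.log(np.linalg.det(B) / np.linalg.det(P_a))

lam = np.linalg.eigvals(H @ K).real  # eigenvalues of the sensitivity matrix
MI_eig = -0.5 * np.sum(np.log(1.0 - lam))
```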
Relative entropy measures the gain in information of the posterior relative to the prior.
Relative entropy can be thought of as a measure of the ‘distance’ between p(x∣y) and p(x). However, it is not a true distance because it is not symmetric (Cover and Thomas, 1991).
When both the prior and posterior are Gaussian, relative entropy is given by (Bishop, 2006):
The first term is known as the signal term, which measures the change in the mean of the distribution. The rest is known as the dispersion term, which measures the change in the covariances. This can be shown to equal MI minus half the degrees of freedom for signal given by the trace of the sensitivity matrix. The dispersion term can therefore be written in terms of the eigenvalues of the sensitivity matrix whilst the signal term depends on the value of the observations and the prior mean.
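The decomposition into a signal term plus a dispersion term equal to MI minus half the degrees of freedom for signal can be verified against the direct Kullback-Leibler divergence between the two Gaussians (an illustrative setup, not one of the paper's experiments):

```python
import numpy as np

# Relative entropy as signal term plus dispersion term (MI - d_s/2),
# checked against the direct Kullback-Leibler divergence.
B = np.array([[2.0, 0.8],
              [0.8, 1.0]])
R = np.array([[0.5]])
H = np.array([[1.0, 1.0]])
x_b = np.zeros(2)
y = np.array([2.0])

K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
x_a = x_b + K @ (y - H @ x_b)
P_a = (np.eye(2) - K @ H) @ B
Binv = np.linalg.inv(B)

signal = 0.5 * (x_a - x_b) @ Binv @ (x_a - x_b)       # change in the mean
MI = 0.5 * np.log(np.linalg.det(B) / np.linalg.det(P_a))
d_s = np.trace(H @ K)
dispersion = MI - 0.5 * d_s                           # change in the covariance
RE = signal + dispersion

# Direct KL divergence of the posterior Gaussian from the prior Gaussian.
n = 2
RE_direct = 0.5 * (np.trace(Binv @ P_a)
                   + (x_a - x_b) @ Binv @ (x_a - x_b) - n
                   + np.log(np.linalg.det(B) / np.linalg.det(P_a)))
```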
The dependence of relative entropy on both the mean and variance of the posterior makes it an attractive measure, as it gives a more complete description of the observation impact. It can also be shown that relative entropy is invariant under a general non-linear change of variables (Kleeman, 2011). In Section 3.3, it is seen that mutual information may be written as a measure of the relative entropy of p(x, y) with respect to p(x)p(y). Writing mutual information in this way shows that if the same non-linear transformation is applied to x and y then mutual information is also left invariant.
A comparison of the degrees of freedom for signal, mutual information and relative entropy for Gaussian assimilation was performed by Xu et al. (2009). It was concluded that, in application to optimal radar scan configuration, it made little difference which measure was used. In the next section, we shall look at how a non-Gaussian prior affects these three measures.
From the previous section, it is seen that in Gaussian data assimilation the impact of the observations as measured by the three different measures is dependent on the ratio of the prescribed error variances of the prior and observations as well as their means for relative entropy. When the prior is non-Gaussian, the additional structure in the prior will be shown to be important for calculating the impact of the observations.
A study performed by Bocquet (2008) compared the information content of observations when the prior is assumed to be Gaussian and Bernoulli for the inverse modelling of a pollutant source. For this case study, a tracer gas was released from a point source over Northern France; observations of the gas were then made at locations across Europe. The measures of observation impact included the analysis error variance, the sensitivity matrix, mutual information and the degrees of freedom for signal. It was found that the more realistic non-Gaussian prior, which took into account the positivity of the released mass, allowed for observations far from the source to have a far greater impact on the analysis, giving a more accurate retrieval.
In this work, it is intended to use a much simpler idealised setup to understand the difference between the analysis sensitivity, mutual information and relative entropy when the prior is no longer Gaussian. An initially Gaussian prior may become non-Gaussian due to a non-linear forecast model. In particular, the model dynamics may lead to a skewed or multimodal distribution. The particle filter (PF) is an example of an assimilation scheme that tries to represent the non-linear evolution of the prior (van Leeuwen, 2009). This will be used in Section 4.1 to illustrate the effect of the prior structure on the observation impact. Firstly, we shall look at the case when the prior can be modelled as a Gaussian mixture.
A Gaussian mixture allows for the representation of a wide range of non-Gaussian prior distributions. It is given by:
where the weights w_{i} sum to one (Bishop, 2006) and x_{bi} and B_{i} are the mean and covariance of the ith Gaussian component, respectively.
In this study, p(x) is simplified to a two-component Gaussian mixture (i.e. N=2) in one dimension (i.e. n=1). To reduce the number of parameters describing the prior further, the variances of the two component Gaussians are equal. This allows the prior to be described by four free parameters:
Although restrictive, a large range of non-Gaussian priors can be modelled by this mixture, see, for example, Fig. 5.
The likelihood function is then taken to be Gaussian with mean µ_{y}, interpreted as the measured value, and variance kσ^{2}, where k is a scalar. This is equivalent to assuming we have direct observations of x.
Using Bayes’ theorem, eq. (1), this implies that the posterior distribution is also a two-component Gaussian mixture with updated weights, means and variances:
where, for i=1, 2:
Here, we define the analysis as the mean of the posterior. In many cases, this will not correspond to the mode, as the posterior may be bimodal [or multimodal in the case of the more general prior described by eq. (10)]. This makes the mode more difficult to define uniquely, and the mode will have infinite sensitivity to the observations when it transfers from one peak to another.
When the prior is non-Gaussian, the simple linear relationship between the mean of the posterior and the observations [as seen in eq. (2)] breaks down. This can be seen in the following.
The mean of the posterior, the analysis, is given by:
Recall that the updated weights and means are functions of µ_{y}, given by eqs. (11) and (12), respectively. The sensitivity of the analysis may be computed as:
With a little manipulation this can be written in terms of the parameters describing the prior:
From eq. (13) it is seen that S is a function of the observation value due to the appearance of the exponent a_{i}.
An illustration of S as a function of µ_{y} is given in Fig. 1 for k=2, σ^{2}=1, w=0.25, µ_{1}=−1.5, µ_{2}=1.5. On the left, the prior given by these parameters is plotted; it is both negatively skewed and bimodal. On the right, S is given.
S is seen to be a symmetric function about a maximum at µ_{y}=µ_{0} (solid grey vertical line). Away from µ_{y}=µ_{0} the sensitivity asymptotes to [1/(k+1)] [(1/3) in this case].
The value of µ_{y} for which the analysis shows the greatest sensitivity, µ_{0}, is given in terms of the parameters describing the prior and likelihood as:
This is found by solving ∂S/∂µ_{y}=0 for µ_{y}.
It can be shown that for this value of µ_{y} the Gaussian components in the posterior have equal weight, that is, ŵ_{1}=ŵ_{2} from eq. (11). Therefore, the posterior is symmetric and the analysis (given by the mean) is in the well between the two Gaussian components of the posterior. At this point, a small change in the observed value will dramatically change the shape of the posterior and the value of its mean. Note that this is also true for the posterior mode, which has infinite sensitivity in the region where observations give a symmetric posterior distribution.
An implication of this is that for a fixed prior and observation error variance, the observation that has maximum impact on the analysis value also gives the largest analysis error variance. In fact it can be proved:
for Gaussian mixture priors of any order and Gaussian likelihoods, see Appendix A.2. Note that S is unbounded as [(µ_{1}−µ_{2})^{2}/σ^{2}] increases. Therefore, the posterior variance may be much greater than the observation error variance when the prior describes two highly probable but distinct regimes.
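This identity can be checked numerically for the two-component mixture of Section 3. The sketch below implements the component-wise Bayes update and compares a finite-difference estimate of the sensitivity with the posterior variance divided by the observation error variance, using the parameter values quoted for Fig. 1:

```python
import numpy as np

# Two-component mixture prior with the Fig. 1 parameters:
# k = 2, sigma^2 = 1, w = 0.25, mu_1 = -1.5, mu_2 = 1.5.
w, mu1, mu2, sigma2, k = 0.25, -1.5, 1.5, 1.0, 2.0
obs_var = k * sigma2
mus = np.array([mu1, mu2])
ws = np.array([w, 1.0 - w])

def posterior_moments(mu_y):
    """Mean and variance of the two-component mixture posterior."""
    s = sigma2 + obs_var                      # variance of p(y | component i)
    a = np.exp(-0.5 * (mu_y - mus) ** 2 / s)  # exponential weight factors
    wh = ws * a
    wh /= wh.sum()                            # updated weights
    mh = (obs_var * mus + sigma2 * mu_y) / s  # updated component means
    vh = sigma2 * obs_var / s                 # updated (common) component variance
    mean = wh @ mh
    var = wh @ (vh + mh ** 2) - mean ** 2
    return mean, var

mu_y, eps = 0.7, 1e-5
S_fd = (posterior_moments(mu_y + eps)[0]
        - posterior_moments(mu_y - eps)[0]) / (2.0 * eps)  # sensitivity by finite differences
S_id = posterior_moments(mu_y)[1] / obs_var                # posterior var / obs error var
```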
Also plotted in Fig. 1 is the sensitivity of the analysis to the observations when the prior is approximated by a Gaussian distribution, S^{G}, (dashed line). This is not a function of the value of the observation as shown in Section 2.1. For this case:
where σ_{G}^{2} is the variance of the prior, σ_{G}^{2}=σ^{2}+w(1−w)(µ_{1}−µ_{2})^{2}. Substituting this into eq. (16) gives:
This is bounded by [1/(k+1)] and 1. Note that throughout this paper *^{G} refers to the value of * derived when approximating the prior as a Gaussian.
From Fig. 1 we see that, when the full prior is used to assimilate the observation, the analysis may be either more or less sensitive to the observation than when the prior is approximated by a Gaussian. The degree to which the sensitivity is affected depends on the value of the observation. When µ_{−}<µ_{y}<µ_{+} (marked on Fig. 1 by the dashed grey vertical lines) the Gaussian approximation results in an analysis that is less sensitive to the observations than when the full prior is used, and vice versa when the observation is outside of this region. The dependence of µ_{−} and µ_{+} on the parameters of the prior and likelihood, and the magnitude of the disagreement between S and S^{G}, will be discussed further in the next section.
Recall that mutual information is given by the prior entropy minus the conditional entropy:
Despite the dependence of the posterior error variance on the observations, the conditional entropy is independent of the value of the observations. Therefore, mutual information, unlike the sensitivity of the posterior mean to the observations, is independent of the value of the observations. This is shown in Fig. 2 where the measures are normalised by their Gaussian approximations. The analysis sensitivity is given in black and the mutual information is given in red. In this example, the Gaussian approximation to mutual information is about 102% of the true value, a very small error.
Also plotted in Fig. 2 is the relative entropy normalised by its Gaussian approximation, RE^{G}. As seen in Section 2.3, relative entropy combines the effect of the observations on both the position and shape of the posterior distribution. This explains the asymmetry in RE/RE^{G} as:
Using relative entropy to measure how the observation impact changes when the full prior is used may now lead to different conclusions than when the analysis sensitivity is used.
The large variation of these two measures as a function of the observation makes their interpretation more difficult for a single experiment. For some applications such as the design of new observation systems it may be more useful to look at the average impact.
The expected value of relative entropy can be shown to be equal to mutual information, by writing mutual information in its equivalent form:
Here mutual information is interpreted as a measure of how ‘close’ two variables are to being independent, that is, the error in approximating p(x, y) by p(x)p(y) (Cover and Thomas, 1991). In this form, MI can be seen to be the expected value of relative entropy over the marginal distribution of the observations [see eq. (8)]. The marginal distribution, p(y), is given by:
where
In Fig. 2, we can compare mutual information (red) to the expected analysis sensitivity (black dashed). It is seen that for this case mutual information is marginally closer to its Gaussian approximation than the expected sensitivity is. In both cases, the Gaussian approximation to the prior overestimates the average impact of the observations, as this assumes the prior has less information (less structure) than it in fact does.
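The statement that mutual information equals the expectation of relative entropy over p(y) can be verified by quadrature for the two-component mixture prior; the sketch below uses the Fig. 1 parameters with an assumed grid (limits and spacing are illustrative choices):

```python
import numpy as np

# Quadrature check that E_y[RE] = MI for the two-component mixture prior
# (Fig. 1 parameters: k = 2, sigma^2 = 1, w = 0.25, mu_1 = -1.5, mu_2 = 1.5).
w, mu1, mu2, sigma2, k = 0.25, -1.5, 1.5, 1.0, 2.0
s = (1.0 + k) * sigma2                       # variance of p(y | component i)

def gauss(z, m, v):
    return np.exp(-0.5 * (z - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

x = np.linspace(-12.0, 12.0, 1201)           # shared grid for x and y
dx = x[1] - x[0]
prior = w * gauss(x, mu1, sigma2) + (1.0 - w) * gauss(x, mu2, sigma2)

exp_RE = 0.0          # expectation of relative entropy over p(y)
exp_H_post = 0.0      # expected posterior entropy
for y in x:
    p_y = w * gauss(y, mu1, s) + (1.0 - w) * gauss(y, mu2, s)  # marginal p(y)
    post = prior * gauss(y, x, k * sigma2)
    post /= post.sum() * dx                  # posterior for this observation
    re_y = np.sum(post * np.log(post / prior)) * dx
    exp_RE += p_y * re_y * dx
    exp_H_post += p_y * (-np.sum(post * np.log(post)) * dx) * dx

H_prior = -np.sum(prior * np.log(prior)) * dx
MI = H_prior - exp_H_post                    # eq. (17) form of mutual information
```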
The relationship between the three measures of observation impact shown in Fig. 2 extends to a wider range of prior distributions described by the simple two-component Gaussian mixture. In Fig. 3, the analysis sensitivity and relative entropy are plotted for a range of values of µ_{2}−µ_{1} when w=(1/2) (top row) and for a range of values of w when µ_{2}−µ_{1}=3 (bottom row).
For this range of prior distributions, the magnitude of the error in the Gaussian approximation to the relative entropy is smaller than the error in the analysis sensitivity. The tilt in the error fields seen for both measures as w is varied (bottom row) follows from the equation for µ_{0}, eq. (14).
It is seen in Fig. 3 that the magnitude of the error in the Gaussian approximation increases as the prior becomes more non-Gaussian, that is more bi-modal (µ_{2}−µ_{1} increases) and more skewed [|w−(1/2)| increases], as one would expect.
The range of values for which the sensitivity is underestimated (red contours), given by µ_{+}−µ_{−}, is only weakly a function of the non-Gaussianity of the prior. It is more greatly influenced by the variance of the likelihood. As the error variance of the observations increases, the magnitude of S decreases (as would be expected, poor observations have a weaker impact). However, the range of values of µ_{y} for which S is greater than S^{G} also increases.
In Fig. 4, mutual information normalised by its Gaussian approximation (left) can be compared with the average analysis sensitivity normalised by its Gaussian approximation (right) for a range of values of µ_{2}−µ_{1} (y-axis) and w (x-axis).
As expected from Fig. 3, in which the error in S/S^{G} was seen to be generally larger than that in RE/RE^{G}, the error in the Gaussian approximation of MI is less than the error in the Gaussian approximation of the averaged sensitivity. In both measures, the Gaussian approximation always overestimates the observation impact, but only by a marginal amount, increasing as the prior becomes more skewed and more bimodal. The peak in the skewness is marked by the grey lines in Fig. 4.
The effect of the non-Gaussian prior on the measures of observation impact is now illustrated using the low-dimensional Lorenz 1963 model (Lorenz, 1963), given by:
Using the parameters σ=10, ρ=28 and β=8/3, eq. (20) gives rise to a chaotic system with a strange attractor. The solution is seen to orbit around two equilibrium points, giving two ‘regimes’.
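A minimal sketch of the model integration used here, with the standard chaotic parameters and the time step quoted below in the text (the initial condition is an arbitrary illustration):

```python
import numpy as np

# Fourth-order Runge-Kutta integration of the Lorenz '63 system, eq. (20),
# with sigma = 10, rho = 28, beta = 8/3 and a time step of 0.01.
SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0

def lorenz63(x):
    """Right-hand side of the Lorenz '63 equations."""
    return np.array([SIGMA * (x[1] - x[0]),
                     x[0] * (RHO - x[2]) - x[1],
                     x[0] * x[1] - BETA * x[2]])

def rk4_step(x, dt=0.01):
    k1 = lorenz63(x)
    k2 = lorenz63(x + 0.5 * dt * k1)
    k3 = lorenz63(x + 0.5 * dt * k2)
    k4 = lorenz63(x + dt * k3)
    return x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

x = np.array([1.0, 1.0, 1.0])   # an arbitrary initial condition
for _ in range(50):             # 50 time steps between observations, as in the text
    x = rk4_step(x)
```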
We can represent a prior distribution of the state, x=(χ_{1}, χ_{2}, χ_{3})^{T}, by a large number, N_{p}, of weighted ‘particles’ (e.g. van Leeuwen, 2009):
where δ(*) is the Dirac delta function, i is the particle index and t is the observation time index. The number of particles, N_{p}, used in this work is 10 000 to avoid sampling issues.
At the initial time, all weights are equal to (1/N_{p}). In this work, the initial prior is taken to be Gaussian with covariance matrix given by:
and therefore the initial particles are drawn from N(x_{T},B), where x_{T} is the known truth at initial time.
This initial prior distribution is then evolved forward in time to the first observation by propagating each particle forward using a fourth-order Runge-Kutta discretisation of the equations given by eq. (20) with a time step of 0.01. Observations, y, are relatively sparse, made every 50 time steps, and of only the χ_{1} variable. The observation error is assumed to be Gaussian with mean zero and a large error variance. At the time of the first observation, the non-linearity of the Lorenz '63 system gives rise to a new prior distribution of χ_{1}, which is no longer Gaussian.
At the time of the observation, the weights of each particle (which were initially all equal) are updated using:
The particles with updated weights now represent the posterior distribution, which is given by:
The particles are then propagated forward to the next observation, and the weights are again updated. It is desirable to have a large number of particles with non-negligible weight, so that the posterior distribution is accurately represented. A common problem with this standard PF for a limited sample size is filter divergence, when over time all the weight falls onto only a few particles. In this example, the large sample size ensures that the effective number of particles is greater than 25 up to the 10th observation time. Note that no resampling of the particles is used.
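The standard PF update described above (each weight multiplied by the Gaussian likelihood of the observation and then renormalised) can be sketched as follows; the prior ensemble, observed value and observation error variance are stand-in assumptions, not the experiment's actual values:

```python
import numpy as np

# Standard particle-filter weight update for a direct Gaussian observation
# of chi_1: multiply each weight by the likelihood and renormalise.
rng = np.random.default_rng(0)
Np = 10_000
particles = rng.normal(0.0, 2.0, size=(Np, 3))   # stand-in prior ensemble
weights = np.full(Np, 1.0 / Np)                  # equal weights before the update

y, obs_var = 1.0, 8.0                            # stand-in observation of chi_1
likelihood = np.exp(-0.5 * (y - particles[:, 0]) ** 2 / obs_var)

weights *= likelihood
weights /= weights.sum()                         # normalised posterior weights

# Effective sample size: a common diagnostic for filter divergence.
n_eff = 1.0 / np.sum(weights ** 2)
```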
In Fig. 5, the priors are plotted as histograms for the first 10 observation times, and the observations are given by the blue stars.
The priors are far from Gaussian (black line). In particular, at observation times 4 and 9 the prior appears to be bimodal. At each assimilation time, the observed state is well represented by the particles. This is in agreement with the large effective number of particles (see definition above).
In Fig. 6a, the analysis of χ_{1} (red) is plotted alongside the true trajectory (grey) and the observations (black crosses). The analysis, given by the mean of the particles, gives a fairly good estimate of the truth until the 350th time step (7th observation), when there is a large uncertainty as to the sign of χ_{1}, as seen in Fig. 5.
For these non-idealised prior distributions, we can now calculate the impact of the observations using the three different measures and compare them with the observation impact when the priors are approximated by a Gaussian. Firstly, the Gaussian approximations to the observation impacts are calculated using eqs. (5), (6) and (9) in Section 2. x_{b} and B are calculated directly from the particles at the observation time using the weights before they have been updated, for example:
x_{a} and p_{a} are then calculated using eqs. (2) and (4), where H=(1, 0, 0).
In this example, the observation error variance is constant, and so for the Gaussian approximations to the observation impact only the spread in the prior distribution, given by the variance, is important; this is plotted in Fig. 6b. In Fig. 6c, the Gaussian approximations to the observation impacts are plotted. As expected, in all cases the observation impact is large when the prior variance is large (observation times 4, 7 and 9).
The strong dependence of each measure on the ratio of the observation error variance to the prior error variance means that each measure is seen to be in good agreement, as was also seen by Xu et al. (2009). Even relative entropy (blue) is dominated by the change in the prior variance rather than the value of the observation. This is not necessarily expected, as for Gaussian statistics relative entropy is a quadratic function of the observed value [see eq. (9)]. As such the consistency of relative entropy with the other measures may change dramatically depending on the underlying model used. The results given in Kleeman (2002) illustrate this for the application of quantifying predictability.
In order to calculate the sensitivity of the analysis to the observations in the full non-Gaussian case, small perturbations to the observations, δµ_{y}, are made at each observation time; the analysis sensitivity is then approximated by the change in the analysis of χ_{1}, δµ_{a}, that is:
However, using eq. (15), we may also approximate the sensitivity from the spread in the particles with the updated weights, and therefore:
where the mean value of χ_{1} is computed from the updated particles at observation time t. The agreement between these two approximations to the sensitivity is seen to be good in Fig. 6d, comparing the black solid line (δµ_{a}/δµ_{y}) with the grey-dashed line.
The relative entropy may be calculated directly from the weights:
and is given by the blue line in Fig. 6d.
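Assuming equal prior weights of 1/N_p, one standard form of this weight-based relative entropy is the discrete Kullback-Leibler divergence of the updated weights from the prior weights; the sketch below uses a stand-in prior sample and observation rather than the experiment's values:

```python
import numpy as np

# Relative entropy from the weights: with equal prior weights 1/Np, the
# discrete KL divergence of the updated weights from the prior weights
# reduces to sum(w_hat * ln(Np * w_hat)).
rng = np.random.default_rng(1)
Np = 10_000
chi1 = rng.normal(0.0, 2.0, size=Np)     # stand-in prior sample of chi_1
y, obs_var = 1.0, 8.0                    # stand-in observation

w_hat = np.exp(-0.5 * (y - chi1) ** 2 / obs_var)
w_hat /= w_hat.sum()                     # updated, normalised weights

RE = np.sum(w_hat * np.log(Np * w_hat))  # relative entropy, posterior vs prior
```

By Gibbs' inequality this quantity is non-negative, as relative entropy must be.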
Mutual information is approximated using quadrature by:
where
δy was taken to be 1.
For the non-Gaussian representation of the prior the observation impacts are seen to roughly follow the pattern of higher observation impact when the prior variance is large (Fig. 6d). However, there are some notable differences between these and the Gaussian approximations to these measures:
The differences between the Gaussian and PF approximations are seen more clearly by plotting their ratio, as given in Fig. 7a.
As also seen in Section 3, the Gaussian approximation always overestimates the mutual information (red-dashed line) and the averaged sensitivity (black-dashed line). The error in these averaged measures roughly increases with time, as the Gaussian approximation to the prior becomes increasingly poor (see Fig. 7b). In particular, at observation time 9, S^{G} is approximately twice the averaged sensitivity. The error in the Gaussian approximation to the sensitivity and relative entropy is seen to fluctuate. As already shown in Section 3, these errors are expected to depend strongly on the exact value of the observation.
In Fig. 8, the four measures are plotted as a function of the observation value and then normalised by their Gaussian approximation (similar to Fig. 2) for observation time 7 (when the error in relative entropy is largest) and time 9 (when the sensitivity is much smaller than expected given the increase in the prior variance at this time; see Fig. 6d).
At each of these times there is indeed a large range of values for the normalised S and RE as a function of observation value.
The error in the Gaussian approximation of S has the largest range at observation time 9 when the prior is clearly bimodal (see Fig. 5), ranging from approximately three times too large to about four times too small. For the realisation of the observation used (represented by the vertical black dash-dot line in Fig. 8a), the sensitivity is much smaller than would be expected.
At observation time 7, the root mean square error (RMSE) of the Gaussian fit is better (see Fig. 7), and the Gaussian approximation to S has a smaller error range. In this case, the Gaussian approximation to S ranges from being approximately six times too large to about 1.5 times too small. The observed value at this time (vertical black dash-dot line in Fig. 8b) leads to the sensitivity being underestimated.
For the normalised relative entropy, the range of values is more uniform across the observation times, ranging from approximately 0.6 to 1.8 at observation time 9 and from 0.6 to 1.7 at observation time 7. However, the realisation of the observations at each of the times results in a very different error in the Gaussian approximation.
Fig. 8 can be understood to some extent using the theory for a prior given by a Gaussian mixture developed in Section 3. The goodness of fit of the simplified Gaussian mixture to the priors for these cases is summarised in Table 1; also see red lines in Fig. 5.
At observation time 9, the simplified Gaussian mixture fit to the prior has the smallest RMSE. Therefore, as expected from Section 3, the sensitivity is symmetrical about a maximum; however, it does not tend to its expected asymptotic value, which in this case is 0.375, instead continuing to decrease to a much smaller value. At observation time 7, the two-component Gaussian mixture fit is poorer, and the normalised S is no longer symmetrical as a function of observation value.
The aim of this work has been to give a detailed study of the effect of a non-Gaussian prior on the impact of the observations on the analysis. For simplicity this has been restricted to one dimension.
A non-Gaussian prior was modelled as a two-component Gaussian mixture with equal variance, allowing for skewness and bi-modality. Describing the prior in this way allowed for the sensitivity of the mean of the posterior, our analysis, to the observation to be derived analytically.
The sensitivity of the analysis was shown to be a strong function of the value of the observations and equal to the posterior variance divided by the observation error variance. This result extends to the case of all smooth priors that can be described as a Gaussian mixture and Gaussian likelihoods. This means that an observation for which the analysis is very sensitive may not necessarily be a good observation in terms of minimising the posterior variance. The difference between relative entropy and its Gaussian approximation was also shown to be a strong function of the observation value. Mutual information, however, is independent of the value of observation used and was seen to be in good agreement with its Gaussian approximation.
Applying these measures of observation impact to the PF technique for solving the Lorenz '63 equations, it is seen that the non-Gaussian prior breaks down the agreement between these measures of observation impact because of their strong dependence on the value of the observation. Averaging these measures over observation space was shown to bring them closer to their Gaussian approximations. However, the Gaussian approximation was always seen to overestimate the averaged values. This is because the Gaussian approximation to the prior underestimates the information in the prior and therefore in an averaged sense overestimates the impact of the observations.
In conclusion when calculating the observation impact in a non-Gaussian assimilation system it is important to give careful consideration to what you wish to measure:
In practice, when the model is non-linear, the full prior and posterior PDFs are never known and can only be sampled by expensive techniques such as the PF. A limited ensemble size makes an accurate measurement of relative entropy difficult (Haven et al., 2005). In the work of Majda et al. (2002), lower bounds are given for relative entropy using a maximum entropy approximation to the PDFs using the sample moments. The aim of Haven et al. (2005) was to give a minimum relative entropy estimate with a statistical level of confidence implied by the sample. Similar difficulty is also faced when calculating mutual information when the observation space is large (Hoffmann et al., 2006).
All conclusions made are applicable for Gaussian observation errors. Non-Gaussian observation errors due to a non-linear map between observation and state space or non-Gaussian measurement errors are left for future work.
This work has been funded by the National Centre for Earth Observation part of the Natural Environment Research Council, project code H5043718. The authors would also like to thank the three anonymous reviewers.
Mutual information has the attractive quality that it is additive over successive observations. For example, suppose that at time t+1 a new set of observations, y^{new}, is made, so that the total set of observations up to this time is given by the vector y^{t+1}=(y^{t}, y^{new}). Then the mutual information given the total set of observations, MI^{t+1}, is equal to the sum of the mutual information given the previous set of observations, MI^{t}, and the mutual information given the new observations, MI^{new}:

MI^{t+1} = MI^{t} + MI^{new}.
Proof. From the definition of mutual information [eq. (17)] we have:

MI^{t+1} = ∫∫ p(x, y^{t+1}) ln [ p(x∣y^{t+1}) / p(x) ] dx dy^{t+1}.

This may be expanded using Bayes' theorem, p(x∣y^{t+1}) = p(y^{new}∣x, y^{t}) p(x∣y^{t}) / p(y^{new}∣y^{t}), to give:

MI^{t+1} = ∫∫ p(x, y^{t+1}) ln [ (p(x∣y^{t}) / p(x)) (p(y^{new}∣x, y^{t}) / p(y^{new}∣y^{t})) ] dx dy^{t+1}.

The log term may then be separated:

MI^{t+1} = ∫∫ p(x, y^{t+1}) ln [ p(x∣y^{t}) / p(x) ] dx dy^{t+1} + ∫∫ p(x, y^{t+1}) ln [ p(y^{new}∣x, y^{t}) / p(y^{new}∣y^{t}) ] dx dy^{t+1}.

This can be simplified using the identity ∫ p(x, y^{t}, y^{new}) dy^{new} = p(x, y^{t}):

MI^{t+1} = ∫∫ p(x, y^{t}) ln [ p(x∣y^{t}) / p(x) ] dx dy^{t} + ∫∫ p(x, y^{t+1}) ln [ p(y^{new}∣x, y^{t}) / p(y^{new}∣y^{t}) ] dx dy^{t+1}.

The first term is MI^{t}, and the second is the mutual information between x and y^{new} conditioned on y^{t}, so this is equal to MI^{t} + MI^{new} (Cover and Thomas, 1991).
Q.E.D.
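The additivity property can be made concrete in the linear-Gaussian case, where mutual information has the closed form 0.5 ln(prior variance / posterior variance). The sketch below (illustrative values) applies two successive direct observations of x and checks that the information from the full set equals the sum from the two stages.

```python
import math

# Illustrative check of MI additivity in the linear-Gaussian case.
s0 = 2.0            # prior variance
r1, r2 = 1.0, 0.5   # error variances of the first and second observations

s1 = 1.0 / (1.0 / s0 + 1.0 / r1)   # posterior variance after y^t
s2 = 1.0 / (1.0 / s1 + 1.0 / r2)   # posterior variance after y^{t+1}

mi_t = 0.5 * math.log(s0 / s1)       # MI^t
mi_new = 0.5 * math.log(s1 / s2)     # MI^new (conditional on y^t)
mi_total = 0.5 * math.log(s0 / s2)   # MI^{t+1}

print(mi_total, mi_t + mi_new)  # equal: MI^{t+1} = MI^t + MI^new
```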
It can be proved that, when the likelihood is Gaussian and the prior is a Gaussian mixture, the sensitivity of the mean of the posterior to the observation is equal to the analysis error variance divided by the observation error variance. That is:

∂μ_{a}/∂y = σ_{a}^{2}/σ_{o}^{2},

where μ_{a} and σ_{a}^{2} are the mean and variance of the posterior, y is the observation and σ_{o}^{2} is the observation error variance.

Proof. Let the prior, p(x), be given by a Gaussian mixture:

p(x) = Σ_{i} w_{i} N(x; μ_{i}, σ_{i}^{2}),

where Σ_{i} w_{i} = 1. From Bayes' theorem, with the Gaussian likelihood p(y∣x) = N(y; x, σ_{o}^{2}), the posterior may be similarly expressed as:

p(x∣y) = Σ_{i} ŵ_{i} N(x; μ̂_{i}, σ̂_{i}^{2}).

The updated weights are given by:

ŵ_{i} = w_{i} c_{i} / Σ_{j} w_{j} c_{j},   (35)

where

c_{i} = N(y; μ_{i}, σ_{i}^{2} + σ_{o}^{2}).

The updated means are given by:

μ̂_{i} = μ_{i} + [σ_{i}^{2}/(σ_{i}^{2} + σ_{o}^{2})](y − μ_{i}).   (36)

The updated variances are given by:

σ̂_{i}^{2} = σ_{i}^{2}σ_{o}^{2}/(σ_{i}^{2} + σ_{o}^{2}).

The mean of the posterior is given by:

μ_{a} = Σ_{i} ŵ_{i} μ̂_{i}.   (37)

Differentiating eq. (37) with respect to the observation value gives:

∂μ_{a}/∂y = Σ_{i} [ŵ_{i} ∂μ̂_{i}/∂y + μ̂_{i} ∂ŵ_{i}/∂y].   (38)

From eq. (36), ∂μ̂_{i}/∂y can be seen to be σ_{i}^{2}/(σ_{i}^{2} + σ_{o}^{2}), which is equal to σ̂_{i}^{2}/σ_{o}^{2}.

The second term requires more manipulation. Differentiating eq. (35) gives:

∂ŵ_{i}/∂y = ŵ_{i} [(1/c_{i}) ∂c_{i}/∂y − Σ_{j} ŵ_{j} (1/c_{j}) ∂c_{j}/∂y]   (39)

and

∂c_{i}/∂y = −c_{i} d_{i},

where d_{i} = (y − μ_{i})/(σ_{i}^{2} + σ_{o}^{2}). We can rewrite eq. (39) using d_{i} = (y − μ̂_{i})/σ_{o}^{2}:

∂ŵ_{i}/∂y = (ŵ_{i}/σ_{o}^{2}) [μ̂_{i} − Σ_{j} ŵ_{j} μ̂_{j}].   (40)

Substituting ∂μ̂_{i}/∂y = σ̂_{i}^{2}/σ_{o}^{2} and eq. (40) into eq. (38):

∂μ_{a}/∂y = (1/σ_{o}^{2}) [Σ_{i} ŵ_{i} σ̂_{i}^{2} + Σ_{i}Σ_{j} ŵ_{i} ŵ_{j} μ̂_{i}(μ̂_{i} − μ̂_{j})].   (41)

The variance of the posterior is given by:

σ_{a}^{2} = Σ_{i} ŵ_{i} σ̂_{i}^{2} + Σ_{i} ŵ_{i} (μ̂_{i} − μ_{a})^{2}.   (42)

The second term can be rewritten as:

Σ_{i} ŵ_{i} (μ̂_{i} − μ_{a})^{2} = Σ_{i} ŵ_{i} μ̂_{i}^{2} − (Σ_{i} ŵ_{i} μ̂_{i})^{2}.

Therefore ∂μ_{a}/∂y = σ_{a}^{2}/σ_{o}^{2} holds if:

Σ_{i}Σ_{j} ŵ_{i} ŵ_{j} μ̂_{i}(μ̂_{i} − μ̂_{j}) = Σ_{i} ŵ_{i} μ̂_{i}^{2} − (Σ_{i} ŵ_{i} μ̂_{i})^{2}.

Or equivalently, writing the right-hand side as a double sum using Σ_{j} ŵ_{j} = 1:

Σ_{i}Σ_{j} ŵ_{i} ŵ_{j} μ̂_{i}(μ̂_{i} − μ̂_{j}) = Σ_{i}Σ_{j} ŵ_{i} ŵ_{j} μ̂_{i}^{2} − Σ_{i}Σ_{j} ŵ_{i} ŵ_{j} μ̂_{i} μ̂_{j}.

This is true because the two sides agree term by term for all i, j, since:

ŵ_{i} ŵ_{j} μ̂_{i}(μ̂_{i} − μ̂_{j}) = ŵ_{i} ŵ_{j} μ̂_{i}^{2} − ŵ_{i} ŵ_{j} μ̂_{i} μ̂_{j}.
Q.E.D.
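The result can also be checked numerically. The sketch below (arbitrary illustrative parameters, not taken from the paper's experiments) implements the Gaussian-mixture posterior update described above and compares a finite-difference estimate of ∂μ_{a}/∂y with σ_{a}^{2}/σ_{o}^{2}.

```python
import numpy as np

# Numerical check (illustrative parameter values) of the result proved above:
# for a Gaussian-mixture prior and Gaussian likelihood, the sensitivity of the
# posterior mean to the observation equals posterior variance / obs variance.

w = np.array([0.3, 0.7])     # prior mixture weights (sum to 1)
mu = np.array([-1.0, 2.0])   # prior component means
s2 = np.array([0.5, 1.5])    # prior component variances
so2 = 0.8                    # observation error variance

def posterior_moments(y):
    """Mean and variance of the Gaussian-mixture posterior (update equations
    of the appendix: weights, means and variances of each component)."""
    c = np.exp(-0.5 * (y - mu) ** 2 / (s2 + so2)) / np.sqrt(2 * np.pi * (s2 + so2))
    wh = w * c / np.sum(w * c)               # updated weights
    muh = mu + s2 / (s2 + so2) * (y - mu)    # updated means
    s2h = s2 * so2 / (s2 + so2)              # updated variances
    mean = np.sum(wh * muh)
    var = np.sum(wh * (s2h + muh ** 2)) - mean ** 2
    return mean, var

y = 0.5
h = 1e-5
m_plus, _ = posterior_moments(y + h)
m_minus, _ = posterior_moments(y - h)
sensitivity_fd = (m_plus - m_minus) / (2 * h)   # finite-difference d(mean)/dy

_, var_a = posterior_moments(y)
print(sensitivity_fd, var_a / so2)  # the two values agree
```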
Bishop, C. 2006. Pattern Recognition and Machine Learning. Springer, New York.
Bocquet, M. 2008. Inverse modelling of atmospheric tracers: non-Gaussian methods and second-order sensitivity analysis. Nonlin. Process. Geophys. 15, 127–143.
Bocquet, M., Pires, C. A. and Wu, L. 2010. Beyond Gaussian statistical modeling in geophysical data assimilation. Mon. Weather Rev. 138, 2997–3023.
Cardinali, C., Pezzulli, S. and Andersson, E. 2004. Influence-matrix diagnostics of a data assimilation system. Q. J. R. Met. Soc. 130, 2767–2786.
Cover, T. M. and Thomas, J. A. 1991. Elements of Information Theory (Wiley Series in Telecommunications). John Wiley and Sons, New York.
Evensen, G. 1997. Advanced data assimilation for strongly nonlinear dynamics. Mon. Weather Rev. 125, 1342–1354.
Eyre, J. E. 1990. The information content of data from satellite sounding systems: a simulation study. Q. J. R. Met. Soc. 116, 401–434.
Fisher, M. 2003. Estimation of entropy reduction and degrees of freedom for signal for large variational analysis systems. Technical Report, ECMWF.
Haven, K., Majda, A. and Abramov, R. 2005. Quantifying predictability through information theory: small sample estimation in a non-Gaussian framework. J. Comput. Phys. 206, 334–362.
Hoffmann, G. M., Waslander, S. L. and Tomlin, C. J. 2006. Mutual information methods with particle filters for mobile sensor network control. In: Proceedings of the 45th IEEE Conference on Decision & Control, 13–15 December 2006, San Diego, CA. IEEE, USA, pp. 1019–1024.
Kalnay, E. 2003. Atmospheric Modeling, Data Assimilation and Predictability. Cambridge University Press, Cambridge.
Kleeman, R. 2002. Measuring dynamical prediction utility using relative entropy. J. Atmos. Sci. 59, 2057–2072.
Kleeman, R. 2011. Information theory and dynamical system predictability. Entropy. 13, 612–649.
Lorenz, E. N. 1963. Deterministic nonperiodic flow. J. Atmos. Sci. 20, 130–141.
Majda, A., Kleeman, R. and Cai, D. 2002. A mathematical framework for quantifying predictability through relative entropy. Methods Appl. Anal. 9, 425–444.
Pires, C. A., Vautard, R. and Talagrand, O. 1996. On extending the limits of variational assimilation in nonlinear chaotic systems. Tellus. 48A, 96–121.
Rabier, F. 2005. Overview of global data assimilation developments in numerical weather-prediction centres. Q. J. R. Met. Soc. 131, 3215–3233.
Rodgers, C. D. 2000. Inverse Methods for Atmospheric Sounding. World Scientific Publishing, Singapore.
Tarantola, A. 2005. Inverse Problem Theory. SIAM, Philadelphia.
van Leeuwen, P. J. 2009. Particle filtering in geophysical systems. Mon. Weather Rev. 137, 4089–4114.
Xu, Q. 2007. Measuring information content from observations for data assimilation: relative entropy versus Shannon entropy difference. Tellus. 59A, 198–209.
Xu, Q., Wei, L. and Healy, S. 2009. Measuring information content from observations for data assimilation: connection between different measures and application to radar scan design. Tellus. 61A, 144–153.