## Introduction

According to recent records of sea ice concentration and thickness, there has been a significant decline in Arctic ice extent, and thinning of the ice cover, over the past two decades (Meier et al., **2014**; Renner et al., **2014**). It also has been recently shown that there is a significant correlation between reductions in ice concentration in the seasonal ice zone and increases in shipping activity (Pizzolato et al., **2016**). However, the safety of navigation in ice covered regions depends in part on the availability of accurate estimates of the small scale details of ice concentration, ice thickness and ice pressure. Ships will tend to navigate through narrow openings in the ice cover, and if there are none present they will attempt to find a route where the ice appears to be thin and/or undeformed (Loptien and Axell, **2014**). Sudden changes in ice thickness (e.g. deformation ridges) and ice pressure play a significant role in vessel besetting events (Kubat et al., **2013**).

Ideally, sea ice forecasts would be made in a similar manner to weather forecasts, using data assimilation (Carrieres et al., **2017**). Data assimilation is a method for combining available observations with a background state from a numerical forecast model, to find the best estimate of the current state of the system, which is called the analysis (Lahoz et al., **2010**). The analysis is then used to initialise the forecast model to run a short-term forecast, the output of which is used as the background state for the next data assimilation cycle. Using a data assimilation system would allow physically-based forecasts to be made of features in the ice cover that cannot be directly observed, such as regions of high ice pressure.

Data assimilation schemes involve an interpolation of observational information to a model grid. To carry out this interpolation certain criteria should generally be satisfied, such as smoothness of the underlying fields (Wahba and Wendelberger, **1980**; Lorenc, **1986**). This poses a specific problem for sea ice because even though at large scales sea ice seems to be a nonrigid continuum, at small scales it includes important sharp features or discontinuities (Thomas et al., **2008**).

One way to move towards a better representation of these sharp features in the state estimate is to revisit the definition of the data assimilation problem. Data assimilation can be considered an inverse problem where the objective is to find the best state, $\mathbf{x}\in {\mathbb{R}}^{n}$, that satisfies the relationship $\mathbf{y}=H(\mathbf{x})+\epsilon $, where $\mathbf{y}\in {\mathbb{R}}^{m}$ represents the available observations, $H:{\mathbb{R}}^{n}\to {\mathbb{R}}^{m}$ is the observation operator, which maps the state vector to the observation space and can be a physical or mathematical model, and $\epsilon \in {\mathbb{R}}^{m}$ is the observation error. The standard objective function in an inverse problem is defined as the *l*_{2}-norm of the misfit between the observations and model state as $\left|\right|\mathbf{y}-H(\mathbf{x})|{|}_{2}^{2}$. Since, usually there are not enough observations available (i.e. the dimension of **y** is less than **x**) and/or the observations have errors, the data assimilation problem may be very ill-posed. This problem can be solved by adding regularisation terms to the standard objective function. For example, in variational data assimilation (VDA) an *l*_{2} norm is added to the objective function that is the difference between the optimal state and a background state, where the latter is typically from a forecast model.

The relationship between VDA and Tikhonov regularisation was discussed by Johnson by reformulating the objective function of VDA as a regularisation problem (Johnson et al., **2005**). In later studies, where sharp features in the analysis were desired, the *l*_{2}-norm term in the objective function was replaced with the *l*_{1}-norm (Freitag et al., **2010**; Budd et al., **2011**). This follows similar practice in image deblurring problems, where an original image is estimated from a given blurred image. The *l*_{1}-norm regularisation yields better performance than *l*_{2} norm regularisation when sharp edges need to be recovered (Hansen et al., **2006**). The reason for the edge preserving property of the *l*_{1}-norm is that edges in images lead to outliers in the regularisation term, which are penalised less in an *l*_{1}-norm than an *l*_{2}-norm. A similar *l*_{1} regularisation was taken in Foufoula-Georgiou et al. (**2014**) and Ebtehaj et al. (**2015**), where it was demonstrated that precipitation states and global fields of temperature, humidity and geopotential height admit a nearly sparse representation in a specific domain (derivative or wavelet domain). In the interest of retaining this sparsity in a data assimilation framework, the VDA problem was solved by applying *l*_{1}-norm regularisation on the analysis state derivative. In a similar spirit, Freitag et al. (**2013**) added a term to the standard *l*_{2} objective function constraining the *l*_{1}-norm of the gradient of the analysis vector to follow a Laplacian density for experiments involving sharp fronts. Ebtehaj et al. (**2014**) extended this study by proposing a generalised regularisation framework and applying this framework to an advection-diffusion equation enforcing sparsity in wavelet and spectral domains. They found this *l*_{1}-*l*_{2}-norm regularisation framework performed better than the standard *l*_{2}-norm regularisation method for their problem in idealised 4-D-Var experiments.

In this study, evidence of sparsity in the ice thickness derivative domain is demonstrated using both airborne survey data, and submarine sonar data. The impact of using a mixed variational regularisation framework, in which the *l*_{2} term is augmented with an *l*_{1} term in the derivative domain, to retain this sparsity in sea ice thickness analyses, is demonstrated using both data fusion and data assimilation experiments. For the data assimilation experiments a 1-D sea ice model is used (Stonebridge et al., **2018**), whereas the data fusion experiments do not use a prognostic model. Note similar studies were carried out in Ebtehaj and Foufoula-Georgiou (**2013**) and Freitag et al. (**2013**) where the data assimilation was carried out using a 1-D heat equation, and 1-D advection equation, respectively. In addition to using a more complicated physical model, our experiments examine not only the impact of background error correlations on the analyses, but also the impact of observation error correlations. The latter is motivated by both past (Daley, **1993**) and recent (Rainwater et al., **2015**) studies demonstrating the role of observation error correlations in retaining small scale information in the analyses.

The article is constructed as follows: In Section 2 the data is described and presence of sparsity in the derivative field of data is statistically illustrated. In Section 3 the assimilation framework and the formulation of the *l*_{1}-norm regularisation problem is presented. Following that, in Section 4 the experimental set-up is described, while in Section 5 the results of the experiments are shown and discussed. Finally in Section 6 concluding remarks are given.

## Sparsity of ice thickness observations

At the small scales, similar to observations of precipitation and other geophysical variables (Foufoula-Georgiou et al., **2014**; Ebtehaj et al., **2015**), sea ice no longer behaves as a continuum (Flato, **1998**), and exhibits sharp features or discontinuities. To visualise sharp features in the sea ice state, ice thickness measurements obtained using an airborne electromagnetic (AEM) sensor during an aircraft survey over the Beaufort Sea are examined. The AEM sensor measures the ice + snow thickness, which we will refer to as the ice thickness (Haas and Howell, **2015**), with a spatial distance of 4.5 m–6 m (ground distance) between measurement points. The data used here were acquired on 20 April 2015 in the Beaufort Sea. The measurement sample points indicating the flight path are shown in Fig. 1a. For more information about AEM ice thickness measurements see (Haas et al., **2009**).

In the Beaufort Sea, the ice cover is highly heterogeneous, and consists of a mixture of thin and thick ice, first-year and multi-year ice, with many ridges. The ridges are spatially localised regions with higher ice thickness than the surrounding ice. While the AEM measurement is known to underestimate the maximum thickness of pressure ridges, it has high accuracy over level ice ($\pm 0.1\hspace{0.17em}\mathrm{m}$) (Haas et al., **2009**, **2010**) and it can represent the variability in ice thickness at small spatial scales, as can be seen in Fig. 1b.

To demonstrate the sparsity in the derivative field of AEM ice thickness, various methods are used. In a sparse distribution, the distribution is more peaked at zero, indicating many points with value zero, with fatter tails, as compared to a Gaussian distribution. The histogram of the first order spatial derivative of the AEM ice thickness measurements (taking into account all data points in Fig. 1b is shown in Fig. 2a. Overlaid on this distribution are fitted Gaussian, Laplacian and generalised Gaussian distributions. The generalised Gaussian distribution is defined as

*p*and

*σ*are non-negative parameters describing the shape and width of the density, respectively. For special cases of

*p*= 1 and

*p*= 2, this probability distribution corresponds to Laplacian and Gaussian distribution, respectively. The number of bins in the histograms of Fig. 2 was set to the maximum of the Sturges and Freedman Diaconis estimators (Sturges,

**1926**; Freedman and Diaconis,

**1981**) and the AEM measurement points were resampled to have an equal spacing of 7 m as this simplifies application of the methods used.

It can be seen in Fig. 2a that the data exhibits a symmetric representation in the derivative space with a large mass around zero. The large number of zero coefficients in the derivative domain is caused by the uniformity of ice thickness over a large area, whereas the large tails are associated with the ridges and/or openings in the ice cover. Note the density of the histogram values around zero is greater than the fitted Gaussian distribution.

For comparison, histograms of the first order spatial derivatives of sea ice draft from submarine upward looking sonar (National Snow and Ice Data Center (NSIDC), 1998) and ice thickness retrieved from Cryosat data^{1} (Laxon et al., **2013**) are also shown in Fig. 2. Submarine data in Fig. 2b is from an archive of data acquired in Arctic ocean in July 2005 and is resampled at 10 m. It was expected that these data would be less sparse than the AEM data, since the underside of the ice is known to be irregular, containing fewer regions than the surface that can be considered level (Rothrock and Thorndike, **1980**). This is supported by the histogram in Fig. 2b, where it can be seen that while the fit is closest to the Laplacian distribution, derivative coefficients have lower concentration around zero than that of the AEM data showing that AEM data peak is sharper. As another example of ice thickness sparsity, Fig. 2c shows the histogram of the derivative field of the Cryosat sea ice thickness measurements at 5 km grid from March and April of 2015. The distribution of ice thickness derivatives is highly peaked around zero, showing a better fit to Laplacian and Generalised Gaussian distributions rather than a Gaussian distribution.

To quantitatively describe the goodness of fit of the represented distributions in Fig. 2 to the ice thickness derivative field distribution, Kullback-Leibler divergence (*D _{KL}*) (Kullback and Leibler,

**1951**) is used.

*D*is a measure of the distance between two distributions. For the discrete probability distributions P and Q it is defined as

_{KL}*i*is the size of the discrete probability distributions defined by the number of bins in the histogram. Table 1 represents the

*D*values for the three datasets in Fig. 2. For AEM data, the

_{KL}*D*values of two subsets of data, selected as smooth and deformed first year ice (FYI) corresponding to the black and red boxes in Fig. 1, respectively, are also included. For each distribution,

_{KL}*D*was calculated by substituting the histogram bin values for

_{KL}*P*and the fitted distribution values for each bin by

*Q*into (2). The

*D*values of the first column (AEM_all) reflect what can be seen in the histogram, which is that the distribution of the AEM ice thickness derivatives is statistically closer to a Laplacian or generalised Gaussian distribution (with

_{KL}*p*= 0.8 in (1)) than a Gaussian distribution. While the

*D*values for the Cryosat data are relatively large, the Laplacian distribution is a better fit in terms of

_{KL}*D*values to the AEM data and other datasets compared to Gaussian distribution.

_{KL}To further evaluate the fit of the proposed distributions to the AEM data, a Q-Q plot (Chambers, **1983**) is used. In a Q-Q plot the sample data points are plotted against the quantiles of an assumed theoretical distribution in such a manner that the points should form approximately a straight line. Departure from the straight line indicates departure from the specified distribution. Fig. 3 shows that Laplacian and generalised Gaussian distribution are better fits to the data in comparison with a Gaussian distribution.

## Data assimilation method

The assimilation method used here is a form of a VDA method, formulated as an inverse problem with regularisation, that will be discussed in detail in the following sections.

### Variational data assimilation

The VDA scheme can be defined as the following minimisation problem,

In (3), $\mathbf{x}\in {\mathbb{R}}^{n}$ is the current best approximation to the true state, ${\mathbf{x}}_{b}\in {\mathbb{R}}^{n}$ is the background state, $\mathbf{y}\in {\mathbb{R}}^{m}$ is the observation vector and $H\in {\mathbb{R}}^{m\times n}$ is the observation operator that maps the state vector to the observation space. The background and observation error covariance matrices are given by $\mathbf{B}\in {\mathbb{R}}^{n\times n}$ and $\mathbf{R}\in {\mathbb{R}}^{m\times m}$, respectively, and the analysis state, ${\mathbf{x}}_{a}\in {\mathbb{R}}^{n}$ is the state that corresponds to when the objective function, $J(\mathbf{x})$ reaches a minimum.

### Regularisation in variational data assimilation

In a study conducted by Johnson et al. (**2005**) the classic VDA problem was reformulated to the standard form of Tikhonov regularisation (Tikhonov et al., **1977**). Accordingly, by letting $\mathbf{B}={\sigma}_{b}^{2}{\mathbf{C}}_{\mathbf{B}}$ and $\mathbf{R}={\sigma}_{o}^{2}{\mathbf{C}}_{\mathbf{R}}$ where ${\mathbf{C}}_{\mathbf{B}}$ and ${\mathbf{C}}_{\mathbf{R}}$ are the error correlation matrices and ${\sigma}_{b}^{2}$ and ${\sigma}_{o}^{2}$ are the background and observation error variances, and using a change of variable $\mathbf{z}={\mathbf{C}}_{\mathbf{B}}{}^{-1/2}(\mathbf{x}-{\mathbf{x}}_{b})$, the objective function in (3) can be reformulated as (Freitag et al., **2010**)

**H**is the linearised observation operator. If we define

**f**and

**G**as $\mathbf{f}={\mathbf{C}}_{\mathbf{R}}{}^{-1/2}(\mathbf{y}-H{\mathbf{x}}_{b})$ and $\mathbf{G}={\mathbf{C}}_{\mathbf{R}}{}^{-1/2}\mathbf{H}{\mathbf{C}}_{\mathbf{B}}{}^{1/2}$, the objective function can be written as

Therefore, by solving

We note the objective function in Equations (4) and (5) uses the scalar parameter *μ*, which assumes spatially homogeneous error variances. An alternative formulation that does not require this assumption is given in Appendix A.

### Mixed l_{1}–l_{2}-norm regularisation

Freitag et al. (**2013**) proposed their mixed regularisation framework with an objective function as

*l*

_{1}-norm (Laplacian distribution), and $\delta >0$ is another regularisation parameter and has to be chosen empirically. As

*δ*increases, the gradient of the solution will be more sparse. Details about how to set this parameter are discussed in Section 4.3.

In (8) the first order derivative can be approximated by applying matrix $D\in {\mathbb{R}}^{(n-1)\times n}$ to the state vector, where **D** can be written as

This operator corresponds to a first-order differencing, with equal spacing between data points. Other approximations to the first order derivative operator were tested, but the results were not found to be sensitive to this choice. This representation in Equation (9) is chosen for simplicity.

As discussed in Section 2 the ice thickness observations exhibit sparsity in the derivative domain and have a closer fit to both the Laplacian and generalised Gaussian distribution as compared with the Gaussian distribution. Even though the generalised Gaussian distribution with *p* = 0.8 is the best fit for these observations, assuming this distribution for the regularisation results in a non-convex optimisation, which might not converge easily. Therefore, assuming the Laplacian distribution as the prior model for the gradient field is the approach taken in this study.

## Experimental method

To evaluate the advantage of using the mixed regularisation framework, two different regularised VDA systems are evaluated. The first system, *l*_{2} regularisation, minimises the objective function given by (5), while the second system, *l*_{1}–*l*_{2} regularisation, minimises the objective function described by (8). Note that the *l*_{1}–*l*_{2} regularisation method when *δ* = 0 is equivalent to the *l*_{2} method. The minimisation problems are reformulated to a quadratic programming using an approach introduced by Fu et al. (**2006**) and solved by Python CVXOPT package (Andersen et al., **2013**). These two systems were analysed by first carrying out a set of data fusion experiments, where the analysis is not cycled through a model. The objective of the first set of experiments is to look at the impact of the mixed regularisation on the analyses, without the complications introduced by model physics and resampling of the AEM data to the model grid. The data fusion experiments are followed by a set of data assimilation experiments where a toy (1-D) sea ice model is used (Appendix B).

For both the data fusion and data assimilation experiments the background error covariance matrix is defined as

*σ*, and ${\mathbf{C}}_{\mathbf{B}}$ is the background error correlation matrix. The error correlation function is calculated based on the compactly supported fifth-order piecewise polynomial correlation function proposed by Gaspari and Cohn (

_{b}**1999**). Note that in this formulation the error correlation function is an explicit function of the specified length scale and the spacing between points. The observation error covariance matrix,

**R**, is defined in the same way as the

**B**matrix using error standard deviation

*σ*and the same correlation function. Both the

_{o}*l*

_{1}and

*l*

_{1}–

*l*

_{2}systems require calculation of the square root of the background error correlation matrix. This is a computationally expensive operation, that scales as $O({n}^{3})$, where

*n*is the dimension of the problem. For this study,

*n*is small ($n\approx 400$) and the square root is computed using Cholesky decomposition, which takes about 5% of the total time of an analysis cycle. For a larger, operational sea-ice data assimilation system, a method to compute the background error correlation matrix square root using a diffusion equation is described in Caya et al. (

**2010**). A similar method could be used for the

*l*

_{1}–

*l*

_{2}method described here. An additional, potentially expensive, part of the algorithm is the inversion of the square root of the observation error correlation matrix. Large scale systems typically assume a diagonal observation error correlation matrix, which makes this operation trivial. Methods for large scale systems utilising non-diagonal observation error correlation matrices are still a topic of research in the operational data assimilation community.

The accuracy of the analyses is evaluated using six measures: (1) mean absolute error (MAE), (2) root mean squared error (RMSE), (3) kurtosis, $\text{KURT}[{\mathbf{x}}_{\mathbf{a}}]=\frac{E\left[{(\mathbf{x}-E[\mathbf{x}])}^{4}\right]}{{(E\left[{(\mathbf{x}-E[\mathbf{x}])}^{2}\right])}^{2}}$, where $E[\xb7]$ represents the expectation operator, (4) MAE of the derivative field (DIFF_MAE) (5) RMSE of the derivative field (DIFF_RMSE) and (6) kurtosis of the derivative field ($\text{KURT}[D{\mathbf{x}}_{\mathbf{a}}]$). Kurtosis of the state is used to show tail of the state’s probability distribution so that, an analysis with high kurtosis has retained more sharp features. Reported results are averaged over 40 simulations where the additive random noise for background and observation states are different for each simulation. Note that the background and observation error standard deviation ratio (*μ*) was held constant at *μ* = 1 for all experiments. Details specific to the data fusion and data assimilation experiments are now given.

### Data fusion experiments

The deformed FYI subset of the AEM data shown in Fig. 1 is selected as the true state (${\mathbf{x}}_{t}\in {\mathbb{R}}^{n}$ where *n* = 1800). This sample of the data, shown in Fig. 4, was found to include more sharp features as compared to the smooth FYI sample. The background (${\mathbf{x}}_{b}\in {\mathbb{R}}^{n}$) and observations ($\mathbf{y}\in {\mathbb{R}}^{n}$), consist of the true state with additive random noise sampled from a Gaussian distribution with mean zero and covariance matrix **B** and **R**, respectively. For both background and observation error covariance matrices, experiments were carried out over a range of correlation length scales, measured in meters, corresponding to $L=\{0,20,50,100,500\}$ and a range of *δ* values. It was found the values of the error measures (given below) increased with increasing *δ* for $\delta >1$. Hence the performance of the *l*_{1}–*l*_{2}-norm regularisation method is reported for a range of *δ* values between 0 and 1.

### Data assimilation experiments

For the data assimilation experiments the initial true state was based on the entire set of AEM ice thickness measurements acquired on 20 April, shown in Fig. 1b. The data points were averaged over segments of 1 km in length. This was done to obtain a state at a spatial resolution closer to that of interest in operational sea ice forecasting (Carrieres et al., **2017**; Batrak and Müller, **2018**; Mohammadi-Aragh et al., **2018**), in addition to one where the continuum assumption used in the ice model should be valid Flato (**1998**). When this true state was used to initialise the sea ice model it was found the ice did not move very significantly over the 72 h period. This was likely because the ice concentration was close to 100% over the entire domain and the model was run with thermodynamics off to simplify analysis of the results. This means the ice concentration can only change by openings in the ice cover, which may be difficult to obtain due to the high ice pressure when the ice concentration is close to 100% and the lack of solid boundaries in the domain. To create open water regions in the ice cover the thickness everywhere was reduced by 0.5 m and the points with negative thickness were set to zero. This state was used to initialise the ice model which was run forward in time to provide true states at *t* > 0, referred to in this paper as the true model run (Appendix B). While this state differs from the actual truth in part due to model error, such an approach should be sufficient for evaluating the difference between *l*_{2} and *l*_{1}–*l*_{2} regularisation methods. Each of the experiments (truth, *l*_{1}–*l*_{2} and *l*_{2}) was run with identical atmosphere and ocean states, which provide the forcing to the sea ice model.

The data assimilation experiments were initialised with a background state consisting of ${\mathbf{x}}_{b}={\mathbf{x}}_{t}+{\epsilon}_{b}$ where *ε _{b}* is a sample generated from the

**B**matrix described in the previous section. Every 6 h the model state was updated by assimilating a set of observations generated by perturbing the current state of the true model run at that time according to $\mathbf{y}={\mathbf{x}}_{t}+{\epsilon}_{o}$ where

*ε*is a sample generated from the

_{o}**R**matrix described in Section 4. Data assimilation experiments were carried out for background and observation correlation length scales corresponding to 0 and 10 km.

### Choice of regularisation parameter

Identifying an appropriate value of the regularisation parameter is an important task in a regularisation problem. The method used to choose the parameter is typically problem-dependent. There are several methods that find the value of the regularisation parameter assuming that the norm of the noise in observations is known (Golub and Von Matt, **1997**; Calvetti et al., **2000**). Conversely, there are approaches that do not need the observation noise to be known explicitly (Akaike, **1980**; Wahba, **1990**).

Here, we follow one of the latter approaches, known as the L-curve method. In this method it is assumed that the objective function can be written in the form $\left|\right|\mathbf{y}-\mathbf{H}\mathbf{x}|{|}_{2}^{2}+\lambda |\left|\mathbf{x}\right|{|}_{2}^{2}$ with regularisation parameter *λ*. The L-curve is the log-log plot of the norm of the regularised solution, $\left|\right|{\mathbf{x}}_{\mathbf{a}}|{|}_{2}^{2}$, versus the norm of the corresponding residual vector, $\left|\right|\mathbf{y}-\mathbf{H}{\mathbf{x}}_{\mathbf{a}}|{|}_{2}^{2}$, evaluated at the minimisation solution, ${\mathbf{x}}_{\mathbf{a}}$. In this method, the value corresponding to the point of maximum curvature in the obtained L-shape curve is selected as the optimal regularisation parameter (Hansen, **1992**; Hansen and O’Leary, **1993**; Lawson and Hanson, **1995**) because at this point there is a balance between the norm of the solution vector and the fit to the observations.

For the problem studied here, a different choice of axes is made for the L-curve due to the fact that the objective function has both *l*_{1} and *l*_{2} terms. Recalling the minimisation problem in (8), the corresponding L-curve is defined as the plot of $\frac{1}{2}\text{log}\hspace{0.17em}\left|\right|D{\mathbf{x}}_{0}|{|}_{1}$ versus $\frac{1}{2}\text{log}\hspace{0.17em}(||\mathbf{f}-\mathbf{G}\mathbf{z}|{|}_{2}^{2}+{\mu}^{2}\left|\right|\mathbf{z}|{|}_{2}^{2})$. In this case, the point of maximum curvature represents the balance between minimising the *l*_{1}-norm of the solution in derivative space, and the *l*_{2}-norm for the Tikkonov regularisation problem.

The L-curve is shown in Fig. 5a for ${L}_{b}=0\hspace{0.17em}\mathrm{m}$ and for two different observation error correlation length scales, ${L}_{o}=0\hspace{0.17em}\mathrm{m},{L}_{o}=20\hspace{0.17em}\mathrm{m}$ where the analysis used to evaluate the functions comprising the L-curve were obtained following the data fusion experimental set-up described in Section 4.2. The corresponding curvature plots are shown in Fig. 5b,c. Based these results, it was decided $\delta \approx 0.4$ must be chosen for the *l*_{1}–*l*_{2} regularisation method for the data fusion experiments, since this value corresponds to the maximum curvature in the L-curve.

Since in these experiments we have a known true state, the validity of the L-curve results could be verified by comparing the analysis states resulting from different *δ* values. Using the RMSE between the true state (taken as the data from the deformed FYI in Fig. 1) and the analyses as the error measure (see Section 5.1 for further details). Two sets of experiments were carried out. In the first set, the observation error covariance matrix was diagonal (${L}_{o}=0\hspace{0.17em}\mathrm{m}$) while *L _{b}* was varied over the range of 0 m–500 m, and in the second set, the background error covariance matrix was diagonal (${L}_{b}=0\hspace{0.17em}\mathrm{m}$) and

*L*was varied over the range 0 m–500 m. Note that the spacing of the analysis grid is 7 m, hence these length scales range from 0 to 70 times the analysis grid spacing. Results are represented in Fig. 6. In both cases the benefit of using the

_{o}*l*

_{1}–

*l*

_{2}approach as compared to

*l*

_{2}can be seen for the shorter length scales (i.e. when

*L*or

_{b}*L*is lower than 50 m) for

_{o}*δ*values in the range of $0.3<\delta <0.7$, consistent with the results from the L-curve.

For the data assimilation experiment, similar to the data fusion experiments, *δ* values of $\delta =0.01,0.05,0.08,0.1,0.2,0.4,0.5$ were tested. The optimal *δ* value was found to be 0.05, hence results using this value are shown here.

## Results

### Results from data fusion experiments

The analysis for the data fusion experiments corresponding to the case when both background and observation errors are uncorrelated is shown in Fig. 7a. It can be seen the *l*_{1}–*l*_{2} analysis is less noisy than that from *l*_{2} and the sharp features in the true state are captured more accurately. The analysis for the case when the background error is uncorrelated and ${L}_{o}=500m$ is represented in Fig. 7b. Again, in Fig. 7b, the analysis from the *l*_{1}–*l*_{2} method is less noisy than that from the *l*_{2} method and the sharp features are captured more accurately. However, a benefit is no longer visibly apparent when the background error correlation length scale is ${L}_{b}=500m$, as in Fig. 7c,d.

Table 2 represents the analysis errors measures described in Section 5 when $\delta =0.4$. Note the kurtosis of the true state, $\text{KURT}[{\mathbf{x}}_{t}]$, and its derivative, $\text{KURT}[D{\mathbf{x}}_{t}]$, were 3.12 and 10.58, respectively. Kurtosis is higher when the tails of the probability distribution are fatter which can be interpreted as sharp features in the state when kurtosis is computed on the derivative field of the state. If kurtosis of the analysis derivative is less than the true state derivative kurtosis, it shows that sharp features have been damped. Results presented in Table 2 confirms that when the background error is uncorrelated, the *l*_{1}–*l*_{2}-norm method outperforms the *l*_{2}-norm method in all terms of error measures. This is similar to what has been illustrated in Fig. 7. This improvement for the MAE and RMSE measures is about 25% while the improvement is more than 40% for the ice thickness derivative fields (DIFF_MAE and DIFF_RMSE). In addition, the kurtosis shows that when the background error is uncorrelated, the *l*_{1}–*l*_{2} method retains sharp features in the analysis better than *l*_{2} method. When the observation error length scale increases, the kurtosis of both methods approaches the true state however, the *l*_{1}–*l*_{2} method is always better. When ${L}_{b}=20\hspace{0.17em}\mathrm{m}$, the *l*_{1}–*l*_{2} method is slightly better than the *l*_{2} method regardless of the observation error correlation length scale. However, when the background error correlation length scale is ${L}_{b}=500\hspace{0.17em}\mathrm{m}$, the resulting analysis states are relatively smooth and DIFF_MAE and DIFF_RMSE are almost zero for all cases. Finally, it should be noted that while the kurtosis of the analysis states for both methods are similar, in all cases kurtosis of the analysis derivative is always better for the *l*_{1}–*l*_{2}-norm method, which shows the ability of this method in retaining the sharp features in ice thickness.

### Results from data assimilation experiments

The root-mean-squared (RMS) ice thickness errors from the data assimilation experiments are shown in Fig. 8 for various combinations of background and observation error correlation length scales. In all cases the errors are reduced when *l*_{1}–*l*_{2} is used as compared with *l*_{2}. Time traces of the data assimilation states (see Figs. 13 and 14) showed that this growth in errors, which is also seen in the ice concentration and ice velocity (Fig. 9) is related to the development of openings in the ice cover, which are better captured with *l*_{1}–*l*_{2} than *l*_{2}.

Histograms of the analysis states (ice thickness) and their derivative fields at each analysis time are presented in Figs. 10 and 11, respectively for the case where ${L}_{b}=0\hspace{0.17em}\mathrm{k}\mathrm{m}$ and ${L}_{o}=0\hspace{0.17em}\mathrm{k}\mathrm{m}$. In both figures, the first row illustrates the histogram of the true state. In addition, for better comparison, the histogram of the true state is overlayed on the histograms of the data assimilation method. In Fig. 10 (top row) it is difficult to see substantial differences between the ice thickness histograms, although the derivative fields in Fig. 11, show that for the *l*_{1}–*l*_{2} method, there is a higher level of sparsity on the derivative field of analysis as compared to the *l*_{2} method. This is consistent with the expected results, since the presence of the *l*_{1} term on the ice thickness derivative should result in more zeros in this field.

A representative result from the data assimilation experiments is shown in Fig. 12. This result is obtained at $t=64\mathrm{h}$ for *L _{b}* = 0 km and

*L*= 0 km. It can be seen in Fig. 12 that the reduction in ice concentration at spatial index 389 is captured more accurately for

_{o}*l*

_{1}–

*l*

_{2}as compared to

*l*

_{2}. The time trace of model states for this spatial index is shown in Fig. 13. Note the states are in agreement when the ice is thick and the concentration is high, but differences develop as the ice thins and the concentration drops. A similar result is shown in Fig. 14, where the reduction in ice concentration and thickness, is better captured with the

*l*

_{1}–

*l*

_{2}regularisation. Note also that both time traces show the ice velocity is also in better agreement with the true state for

*l*

_{1}–

*l*

_{2}.

Similar results were found for other background and observation error length scales and also for the case when the ice cover was consolidated (although these events were observed less frequently for the consolidated ice cover). In general, openings in the ice cover and related changes in the ice velocity are better captured with *l*_{1}–*l*_{2} than *l*_{2}. This may be due to the relationship between ice thickness, ice strength and bulk viscosity, and the role this plays in the momentum equation (see Appendix B). The exact details of this interaction are complicated and will depend on the chosen parameters and numerical methods used. This will be investigated in a future study.

## Concluding remarks

This study has demonstrated for the first time that sea ice thickness exhibits a sparse representation in the derivative domain. This has been demonstrated using (1) sea ice thickness measurements from an AEM sensor over the Beaufort Sea; (2) submarine upward looking sonar data; and (3) sea ice thickness from Cryosat. To retain this feature when using sea ice thickness data in a data fusion or data assimilation scheme, the use of an additional term in the objective function to constrain sparsity on the derivative of the ice thickness state, is evaluated. This *l*_{1}–*l*_{2} formulation is compared with the standard *l*_{2} regularisation first using data fusion, and then by carrying out data assimilation experiments using a toy sea-ice model.

For data fusion a clear benefit to the *l*_{1}–*l*_{2} formulation is seen when the background correlation error length scale is small (on the order of twice the analysis grid spacing). Based on previous studies (Caya et al., **2010**), it can be expected that in the vicinity of a sharp feature (e.g. ice edge) the background error correlation length scale may be in this range. This data fusion result could be relevant for the generation of merged sea ice products, where sharp features are desired (see for example the ice concentration used at the lower boundary in (Batrak and Müller, **2018**)).

For data assimilation a clear benefit is also seen to the *l*_{1}–*l*_{2} regularisation, although the impact of the error correlation length scales on the difference between the *l*_{1}–*l*_{2} method and the *l*_{2} method is less clear. This may be due to the spatial averaging of ice thickness that was required to increase the scale of the data, or it could be due to the model dynamics. For example, changes in the ice thickness will impact the bulk viscosity in the momentum equation, which would interact with the numerical methods used in the model. This will be investigated in a later study. However, based on the preliminary results shown here, it can be noted that the *l*_{1}–*l*_{2} method is superior with regards to capturing openings in the ice cover than the conventional *l*_{2} method. This was observed for a variety of error correlation length scales, values of the regularisation parameter (*μ*), and model initial conditions.