## Introduction

The aim of this paper is to simulate realistic oceanographic scenarios using a modern finite-volume scheme on GPUs. Modern operational ocean models, such as the Regional Ocean Modeling System (ROMS) (Shchepetkin and McWilliams, **2005**) and the Nucleus for European Modelling of the Ocean (NEMO) (the NEMO team, **2008**), are based on solving the primitive equations that describe conservation of mass, momentum, temperature, and salinity. Even though these ocean models are heavily optimised, their complex physical and mathematical properties make them very computationally demanding. The NorKyst-800 model system (Albretsen et al., **2011**), for example, is based on the ocean model ROMS (using the Rutgers-ROMS 3.7 dynamic kernel) and covers the Norwegian coast with 800 m horizontal resolution. It is run operationally by the Norwegian Meteorological Institute on a daily basis. The complete model consists of 42 vertical layers with 2600 × 900 grid cells each, and a 24 hour forecast takes 45 minutes when using 512 CPU cores (Christensen, **2020**). It is infeasible to use such models for flexible on-demand simulations during search-and-rescue operations, oil spills, heavy lifting at sea, storm surges, etc. In these cases, fast on-demand simulations based on the shallow-water equations can be of significant value for decision makers. Furthermore, such fast simulations enable running ensembles with thousands of members, as required for fully non-linear data assimilation (Holm et al., **2020**).

We simulate the shallow-water equations in a rotating frame of reference, targeting efficient simulation of ocean currents based on the operational NorKyst-800 data. Our model equations are essentially vertical integrations of the primitive equations and assume constant density and a vertically integrated momentum (see more detailed derivations in, e.g. Røed (**2019**)). Our work starts with the recent high-resolution finite-volume scheme by Chertock et al. (**2018**), and extends this scheme to support bottom shear stress, wind forcing, projection-free spatial variations in the north vector, spatially varying Coriolis parameter, land mask, and a moving wet-dry boundary. We present how to reconstruct a piecewise bilinear bathymetry from the piecewise constant bathymetry in the NorKyst-800 data, nesting of the model into the NorKyst-800 data using linearly interpolated boundary conditions, and generation of both higher and lower spatial resolution of the initial conditions.

We simulate increasingly challenging scenarios along the Norwegian coast, and compare with reference results from the operational model. As the shallow-water equations are particularly well suited for tidal and storm-surge predictions, we also compare our tidal-wave predictions at several locations with the reference dataset and observations. All datasets and source code used in this work are publicly available under open licenses (Kluyver et al., **2016**).^{1}

The rest of the paper is organised as follows: Section 2 gives a high-level description of the shallow-water equations and how they can be solved numerically. In Section 3, we present all the extensions that are required to tackle relevant oceanographic problems. The performance and numerical accuracy of our scheme are assessed in Section 4, where we show perfect weak scaling of the GPU implementation and run a grid convergence test with the new terms introduced in the scheme. We present operational-grade simulations in Section 5, and finally summarise the work in Section 6.

### Related work

The aim of this work is to develop a coastal ocean forecasting system that is useful for short-term predictions of real-world ocean currents and sea-surface level in coastal waters. To this end, we use a shallow-water model in a rotating frame of reference. Such models are used operationally for forecasting of storm surges (Flowerdew et al., **2010**; Guide to storm surge forecasting, **2011**; Bertin, **2016**) and tsunamis (Rakowsky et al., **2013**; Castro et al., **2015**).

Shallow-water models can be discretised and solved in different ways. They are often discretised using classical finite-difference methods on staggered C-grids within operational meteorology and oceanography (see, e.g. Shchepetkin and McWilliams, **2005**; the NEMO team, **2008**; Røed, **2019**) as they are especially well suited for consistent specifications of boundary conditions and nested models. Finite-volume methods have also been used, but require more care when implementing boundary conditions (Natvig, **2006**).

Finite-volume schemes have traditionally required computationally expensive Riemann solvers that resolve the Riemann problem at the interface between two cells, but many modern schemes use solvers that replace the Riemann solver by an efficient approximation. One example is the family of schemes based on the work of Kurganov and Tadmor (**2000**) who proposed a numerical scheme that was well balanced for lake at rest steady-state solutions. This scheme was further developed to handle dry states (Kurganov and Levy, **2002**; Kurganov and Petrova, **2007**), and recently Chertock et al. (**2018**) adapted the scheme to rotating shallow-water flow in which steady-states in geostrophic balance are represented correctly. We have previously demonstrated that this scheme captures important properties required for oceanographic applications and compared it against classical finite-difference approaches (Holm et al., **2020**).

Hyperbolic conservation laws, such as the shallow-water equations, are often solved using explicit discretizations in time. Such schemes can be represented using stencil computations to compute the numerical fluxes, which is an embarrassingly parallel algorithmic primitive (Asanovic et al., **2006**). Stencil computations are perfectly suited for implementation on massively parallel accelerator hardware, such as GPUs (see, e.g. Hagen et al., **2007**; de la Asunción et al., **2011**; Brodtkorb et al., **2012**; Parna et al., **2018**), and when implemented correctly, the use of GPUs not only increases the computational efficiency (Brodtkorb et al., May **2010**), but also typically the energy efficiency (see, e.g. Huang et al., **2009**; Qi et al., **2014**; Dong et al., **2014**; Holm et al., **2020**). The use of GPUs for shallow water and oceanographic simulations is currently also included in large commercial and academic software packages such as ClawPack (Qin et al., **2019**), Telemac (Grasset et al., **2019**), TUFLOW (Huxley and Syme, **2016**) and MIKE (MIKE Powered by DHI, **2019**).

In an operational setting, it is important to have a sound treatment of boundary conditions and source terms. In this work, we apply the flow relaxation scheme (Martinsen and Engedahl, **1987**) for boundary conditions, but a more thorough discussion on the state-of-the-art on the topic can be found in Klingbeil et al. (**2018**).

## Mathematical formulation

We use the shallow-water equations with source terms to simulate ocean dynamics. Since the equations are purely barotropic, they are not expected to generate or preserve features such as mesoscale eddies in the long term. These features arise from baroclinic instabilities, typically caused by variations in temperature and salinity that are not included in the present mathematical model, but the equations are nevertheless capable of capturing the most important physics required to simulate short-term ocean dynamics. We can write the equations with source terms as

Here, *h* is water depth, *hu* momentum along the *x*-axis, and *hv* momentum along the *y*-axis (see also Fig. 1). Furthermore, *g* is the gravitational acceleration, *H* is the equilibrium ocean depth measured from a reference sea-surface level, *r* is the bottom drag coefficient, *f* is the Coriolis parameter, and *τ*_{x} and *τ*_{y} are the wind stress along the *x*- and *y*-axis, respectively. In vector form, we can write the equations as

Here, $Q={[h,hu,hv]}^{T}$ is the vector of conserved variables and *B*, *S*, *C*, and *W* are source terms representing varying bathymetry, bed shear stress, Coriolis force, and wind drag, respectively. Note that this model does not include the atmospheric pressure gradient, which is important in, e.g., storm surge modelling.

By using a finite-volume discretisation on a Cartesian grid to solve (2), we get the following semi-discrete formulation

To evolve the solution in time, we use a second-order strong stability-preserving Runge-Kutta scheme (Gottlieb et al., **2001**),

where *M* represents the right-hand side of (3). The time step is restricted by the CFL condition,^{2} given by

The dominant term for oceanographic applications is typically the speed of gravity waves $\sqrt{gh},$ and we scale $\Delta t$ by a factor ${C}_{\mathit{CFL}}\in (0,1]$ in our simulations to account for Coriolis and wind drag.
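As an illustration, the CFL-restricted time step can be sketched as below. The function and parameter names are our own, and the scheme's exact CFL constant is folded into the safety factor ${C}_{\mathit{CFL}}$:

```python
import numpy as np

def cfl_timestep(h, u, v, dx, dy, g=9.81, c_cfl=0.8):
    """Largest stable time step for an explicit shallow-water scheme.

    The signal speed in each direction is bounded by |velocity| + sqrt(g*h),
    and the resulting step is scaled by a safety factor c_cfl in (0, 1]
    to account for, e.g., Coriolis and wind drag.
    """
    c = np.sqrt(g * h)                 # gravity wave speed in each cell
    dt_x = dx / np.max(np.abs(u) + c)  # most restrictive cell, x-direction
    dt_y = dy / np.max(np.abs(v) + c)  # most restrictive cell, y-direction
    return c_cfl * min(dt_x, dt_y)

# Example: 100 m deep basin at rest on an 800 m grid
h = np.full((4, 4), 100.0)
u = np.zeros((4, 4))
v = np.zeros((4, 4))
dt = cfl_timestep(h, u, v, dx=800.0, dy=800.0)
```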

### Spatial discretisation

The starting point for our simulator is the scheme proposed by Chertock et al. (**2018**), from now on referred to as CDKLM. It is a modern central-upwind scheme that uses a reconstruction of the physical variables to be able to capture steady-state solutions important for physical oceanography, and includes source terms for varying bathymetry and Coriolis. The original paper includes several idealised test cases and examples demonstrating that the steady states in geostrophic balance are well captured. However, there are several shortcomings that need to be addressed before the scheme can be used for real-world simulations.

The CDKLM scheme is based on the same principles as the MUSCL scheme (van Leer, **1979**) to obtain a high-order, total variation diminishing discretisation. For each cell in the domain, we first reconstruct a piecewise (linear) function for our physical variables (see Fig. 1). At the interface between two cells, we evaluate the slope limited function from the right and left cell, and use these two values to compute the flux across the interface. The scheme uses the Riemann-solver-free central-upwind flux function,

(Kurganov et al., **2001**). Here, *a*^{+} and *a*^{−} are the largest positive and negative wave speeds at the interface, respectively.
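The slope-limited reconstruction step can be sketched with a generalised minmod limiter as below. This is a minimal NumPy version under our own naming, with an illustrative limiter parameter θ = 1.3:

```python
import numpy as np

def minmod(a, b, c):
    """Generalised minmod: the smallest-magnitude value if all three
    arguments share a sign, and zero otherwise."""
    s = (np.sign(a) + np.sign(b) + np.sign(c)) / 3.0
    return np.where(np.abs(s) == 1.0,
                    s * np.minimum(np.abs(a), np.minimum(np.abs(b), np.abs(c))),
                    0.0)

def limited_slopes(q, theta=1.3):
    """Slope-limited derivatives for the interior cells of a 1D array of
    cell averages, using backward, central, and forward differences."""
    backward = theta * (q[1:-1] - q[:-2])
    central = (q[2:] - q[:-2]) / 2.0
    forward = theta * (q[2:] - q[1:-1])
    return minmod(backward, central, forward)

q = np.array([0.0, 1.0, 2.0, 1.0, 0.0])  # a hat-shaped profile
slopes = limited_slopes(q)
```

At the local extremum the limiter returns a zero slope, which is what makes the reconstruction total variation diminishing.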

When adding source terms, it becomes challenging to preserve steady states such as ‘lake at rest’, i.e.

where *η* is the sea-surface deviation from a mean equilibrium level, given by

Kurganov and Levy (**2002**) presented a discretisation of the bottom slope source term *B*(*Q*) and a matching reconstruction based on water elevation to capture such steady states. For oceanographic simulation, an important steady state is the so-called geostrophic balance, in which the angular momentum caused by the rotational reference frame is balanced by the pressure gradient:

The novel idea in the CDKLM scheme is to reconstruct the face values ${Q}_{i+1/2}^{l}$ and ${Q}_{i+1/2}^{r}$ based on this geostrophic balance so that the resulting flux ${F}_{i+1/2}$ also balances the Coriolis force for steady-state solutions. This makes the scheme suitable for simulating ocean currents in a rotating frame of reference.

## Efficient simulation of real-world ocean currents

Our simulator is implemented using CUDA (NVIDIA, **2019**) coupled with PyCUDA (Klöckner et al., **2012**) to access the GPU from Python, and we have previously shown that using Python instead of C++ has a negligible performance impact when used like this (Holm et al., **2020**). Using the Python ecosystem for tasks such as pre-processing input data and post-processing results has significantly increased our productivity, while still maintaining high performance.

A simulation starts by reading initial conditions, forcing terms, and other grid data from NetCDF files hosted by *thredds* servers at the Norwegian Meteorological Institute. We continue by initialising the simulator, which internally allocates memory and uploads initial conditions on the GPU. Finally, the main loop steps the simulator forward in time, and occasionally (e.g., every simulated hour) downloads and writes results to NetCDF files on disc.

The rest of this section covers additions and improvements to the CDKLM scheme required for simulating real-world scenarios.

### Efficient parallel formulation of CDKLM

The original CDKLM scheme uses a recursive formulation of the potential energies used for the well-balanced flux reconstruction. This turns the reconstruction procedure into a global operation with a data-dependency spanning the whole domain, making the scheme prohibitively expensive on parallel architectures. The recursive expression can be calculated somewhat efficiently using a prefix sum, as suggested by the authors themselves, but by carefully reformulating the recursive terms it can also be rephrased as a local operation.

The scheme reconstructs *h*, *hu*, and *hv* on each side of a face to calculate the fluxes in (6). However, to capture the rotating steady-state solutions in geostrophic balance, the reconstruction is based on the potential energies,

Here, *K* is the potential energy due to the Coriolis force along the abscissa, and *L* is the potential along the ordinate. *U* and *V* are the primitives of the Coriolis force, given by

Note that setting *K*_{x} = 0 and *L*_{y} = 0 is equivalent to the geostrophic balance from (9). In the following, we detail the reconstruction along the abscissa, but the same derivations apply in the ordinate. The recursive terms appear from the discretisation of *U* and *V*, which are defined on the cell faces as

The Coriolis parameter ${f}_{k,j}$ is allowed to change along both the *x*- and *y*-axis, even though *f* only varies with latitude (for example, the NorKyst-800 grid is rotated to follow the Norwegian coastline). This is because we will use a spatially varying latitude, as described in more detail in Section 3.7. Values of *V* in the cell centres are obtained by taking the mean of the face values, so that the cell-average value of *K* from (10) becomes

By reconstructing the (limited) slopes of *K* within each cell, the face values of *h* are found by combining (13) and (8), so that

The reconstructed values (14) and (15) of *h* at the cell faces quickly become a performance bottleneck as long as they depend on calculating the recursive relation in (12). As it turns out, however, we never need to explicitly compute the recursive terms. First of all, notice that (14) and (15) only depend on derivatives of *K* and not on *L*. Similarly, the reconstructions of ${h}_{i,j}^{N}$ and ${h}_{i,j}^{S}$ depend on *L* and *L*_{y}. This means that we are only interested in derivatives of *K* (and *L*) in the same direction as the recursive terms for *V* (and *U*). The derivatives are limited using the generalised minmod function applied to the backward, central, and forward differences. The backward difference is given by

We see that ${V}_{i-1/2,j}$ cancels, and focus on the remaining *V* values. By applying the recursive expression in (12) twice on ${V}_{i+1/2,j}$ we get

Inserted back into (16), the backward difference needed to evaluate ${({K}_{x})}_{i,j}$ can be written as

After obtaining ${({K}_{x})}_{i,j}$ we look at the reconstruction of ${h}_{i,j}^{E}$ from (14). We start by inserting the expression for *K* from (13) into (14) and gather all *V*-terms,

Here, the values for ${\eta}_{i,j},{H}_{i,j}$ and ${({K}_{x})}_{i,j}$ are known, and by using the recursive expression in (12) we get that

Inserting (20) back into (19), we get

Again, the same derivation can be applied on the other three face values as well. This enables us to express the reconstruction step in terms of only local variables, which unlocks an embarrassingly parallel numerical algorithm.

### Accuracy, precision and GPUs

Today’s GPUs can be categorised in two major classes: GPUs meant for the consumer market, and GPUs meant for supercomputers and other professional users. For the most part, these GPUs are close to identical, but there are some key differences. One is price: professional GPUs can cost over ten times more than equivalent consumer market GPUs, making the latter attractive from a financial standpoint. Another difference is the feature set: professional GPUs offer more memory, error-correcting code (ECC) memory, and significantly higher performance for features such as double precision.

Double precision is not a prioritised feature on consumer market GPUs, which typically offer less than 5% of the performance of single-precision arithmetic. On professional GPUs, on the other hand, double-precision performance is 50% of single precision, which is the same ratio as on CPUs.^{3} If we are able to utilise single-precision calculations, we should therefore be able to get the highest performance at the minimum monetary cost. We have previously shown that using single precision is sufficiently accurate for most simulation scenarios for the shallow-water equations (Brodtkorb et al., **2010**), and recent studies have shown a 40% increase in performance for complex forecasting models when using single precision (Váňa et al., **2017**).

It is well known that floating-point numbers have a round-off error that increases for larger numbers. A single-precision floating-point number is represented as

where *s* is the sign bit, *e* is the exponent, represented using 8 bits, and *f* is the fractional part, represented by 23 bits. This formulation means that the distance between two consecutive floating-point numbers is smallest close to zero and increases as *e* increases. This is important to note in oceanographic simulations, as the water depth, *h*, can often be thousands of metres, whereas the relevant tidal wave height is perhaps 0.5 metres. The loss in precision due to the water depth makes single-precision floating point prohibitively inaccurate if we implement the shallow-water equations as formulated in (1). To counter this loss in precision, we could use double precision (at twice the computational and memory cost), but a better alternative is to reformulate the problem to represent the relevant physical quantities. This can be done by basing our simulation on *η* from (8) instead of *h*, and using ${[\eta ,hu,hv]}^{T}$ as our vector of conserved variables. By combining this approach with single-precision floating-point numbers, we get more than sufficient accuracy for the relevant oceanographic simulations.
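The effect of simulating on *η* rather than *h* can be demonstrated directly in NumPy; the 4000 m depth and 0.1 mm surface perturbation below are illustrative values:

```python
import numpy as np

H = np.float32(4000.0)    # equilibrium depth, metres
d_eta = np.float32(1e-4)  # a 0.1 mm change in surface elevation

# Simulating on h = H + eta: the perturbation vanishes entirely, because
# the spacing between consecutive float32 values near 4000 is about 2.4e-4.
h = H + d_eta
lost = (h == H)

# Simulating on eta directly keeps roughly seven significant digits.
eta = np.float32(0.0) + d_eta
kept = (eta > 0)
```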

When the water depth becomes close to zero, the calculation of the particle velocity, $u=hu/h$ can become severely ill conditioned as the expression is prone to large round-off errors. To counter this, Kurganov (**2018**) and Kurganov and Petrova (**2007**) have proposed to *desingularize* the term using expressions such as

(**2012**). Furthermore, the desingularization of *u* here is not very well behaved. Instead, we propose to use the following formulation

The advantages of this formulation are that (i) *κ* now directly controls what magnitudes of *h* to desingularize, (ii) it improves the numerical stability as it avoids the square root and the problematic *h*^{4} term, (iii) it yields a smooth transition between normal and desingularized values, and (iv) the effective magnitude of ${h}^{*}$ is controlled. Figure 2 shows the three desingularization strategies and clearly indicates that the desingularized quantity ${h}^{*}$ is better behaved with our proposed approach.
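For reference, the desingularization of Kurganov and Petrova (**2007**) referred to above can be sketched as follows; the formulation proposed in the text differs from this classical one, and the function names and the value of *κ* here are our own:

```python
import numpy as np

def u_naive(h, hu):
    """Plain division: severely ill conditioned as h -> 0."""
    return hu / h

def u_desing_kp(h, hu, kappa=1e-2):
    """Desingularised velocity in the style of Kurganov and Petrova (2007):
    u = sqrt(2) * h * hu / sqrt(h**4 + max(h**4, kappa**4)).
    For h >> kappa this reduces to hu / h; as h -> 0 it tends to zero
    instead of blowing up."""
    denom = np.sqrt(h**4 + np.maximum(h**4, kappa**4))
    return np.sqrt(2.0) * h * hu / denom

h = np.array([1.0, 1e-8])
hu = np.array([0.5, 1e-8])
u = u_desing_kp(h, hu)
```

Note how the naive division returns a spurious velocity of 1 m/s in the nearly dry cell, while the desingularised value is effectively zero.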

### Global wall boundary conditions

We have implemented wall and periodic boundary conditions to be able to run test cases to assess the numerical correctness of our simulator. For reflective von Neumann-type wall boundary conditions, we use so-called ghost cells that mirror the ocean state across the boundary to enforce a zero flux out of the domain. In the absence of Coriolis forces, wall boundary conditions at *x* = 0 are constructed by setting the ghost cell values according to

Naïve implementation of these conditions with the CDKLM scheme leads to spurious gravity waves along the boundary. Special care is therefore required to balance the Coriolis potential energies in the reconstruction of *h*, so that

Here, *H* cancels immediately, whereas *η* cancels by the use of (27). To make the last terms cancel, we need to set ${f}_{-1,j}{v}_{-1,j}=-{f}_{0,j}{v}_{0,j},$ which means that in addition to (27), we need to enforce

Note also that this results in $({K}_{x})=0$ across the boundary from (18).
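A ghost-cell update implementing these wall conditions at the western boundary might look as follows. This is a sketch under our own naming; the sign flip of the Coriolis parameter in the ghost cells is modelled on the derivation above:

```python
import numpy as np

def apply_wall_bc_west(eta, hu, hv, f, ng=1):
    """Fill ng ghost columns at the western (x = 0) boundary.

    The ocean state is mirrored so that the flux through the wall
    vanishes: eta and hv are copied, hu changes sign.  To keep the
    Coriolis potential balanced across the wall, the ghost-cell
    Coriolis parameter also changes sign."""
    for k in range(ng):
        src = 2 * ng - 1 - k  # interior column mirrored about the wall
        eta[:, k] = eta[:, src]
        hu[:, k] = -hu[:, src]
        hv[:, k] = hv[:, src]
        f[:, k] = -f[:, src]
    return eta, hu, hv, f

eta = np.arange(12.0).reshape(3, 4)
hu = np.ones((3, 4))
hv = np.full((3, 4), 2.0)
f = np.full((3, 4), 1.2e-4)
eta, hu, hv, f = apply_wall_bc_west(eta, hu, hv, f, ng=1)
```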

### Moving wet-dry boundary and land mask

The original CDKLM scheme breaks down on land as the gravitational wave speed $\sqrt{gh}$ becomes imaginary with negative depths. To support a moving wet-dry boundary, we use a similar approach as Kurganov and Petrova (**2007**) by adjusting the slope of *h* so that all the reconstructed point values become non-negative. For example, if we get ${h}_{i,j}^{E}<0$ from (21), we adjust the slope to become

The moving wet-dry boundary is required for simulating, e.g. tsunamis and other phenomena in which the run-up is important. This approach nonetheless violates the steady-state reconstruction, causing non-physical waves and often also very small time steps. For static wet-dry boundaries, we have therefore also implemented wall boundaries following a static land mask, i.e. ensuring zero flux across the shoreline. In our approach, we explicitly set the flux between two neighbouring cells to zero if either of them is masked. The mask is represented by a unique no-data value in the bathymetry, and thus causes no increase in the memory footprint. It should be noted that both of these approaches reduce the reconstruction to first order along the wet-dry boundary.
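The slope adjustment for positivity can be sketched for a single cell as below, assuming the cell-average depth itself is non-negative (function and parameter names are our own; the scheme applies this per reconstructed face):

```python
def positivity_limited_slope(h_bar, slope, dx):
    """Adjust the reconstruction slope in a cell so that both face values
    h_bar -/+ slope*dx/2 stay non-negative (cf. Kurganov and Petrova, 2007).

    If the original slope would produce a negative face value, it is
    replaced by the steepest slope that puts the lower face at exactly
    zero (and the opposite face at 2*h_bar)."""
    if h_bar - abs(slope) * dx / 2.0 < 0.0:
        slope = (2.0 * h_bar / dx) * (1.0 if slope > 0 else -1.0)
    return slope

# A shallow cell (0.1 m average) with a too-steep slope gets limited
s = positivity_limited_slope(h_bar=0.1, slope=1.0, dx=1.0)
```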

### Wind forcing and bed friction source terms

The original formulation of CDKLM includes the bed slope source term, *B*(*Q*), and Coriolis force, *C*(*Q*), and we extend the scheme to also include the bed shear stress, *S*(*Q*), and wind forcing, *W*(*Q*). Bed shear stress is discretised using a semi-implicit formulation in the strong stability preserving Runge-Kutta scheme,

in which we evaluate half of the friction using *Q*^{n}, and half of the friction using the predictive step ${Q}^{*}.$

The wind stress comes from the atmospheric wind transferring momentum to the water column. We take the approach of Large and Pond (**1981**) and use

where *W*_{10} is the wind speed at 10 metres, and *C*_{D} is the drag coefficient computed as

The NorKyst-800 model stores its output, including the wind forcing, every hour. Because the time step of our simulator is significantly smaller than one hour, we use linear interpolation in time to get the most accurate wind stress source term for each time step. Given the wind stresses ${W}_{10}^{a}$ at time *t*_{a} and ${W}_{10}^{b}$ at time *t*_{b}, we compute the wind stress at time $t\in [{t}_{a},{t}_{b}]$ as a convex combination of the two.

In space, we use specialised GPU texture hardware for bilinear interpolation (see e.g. Brodtkorb et al., May **2010**), resulting in a trilinearly interpolated wind stress. This is highly efficient, as the GPU has dedicated hardware for bilinear spatial interpolation that also includes caching. An added benefit is that the grid size of the wind stress data does not need to match the underlying simulation grid. This further provides us with a choice to reduce the accuracy and memory footprint of the forcing fields, and thus trade better performance against accuracy.
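The temporal part of this interpolation is a simple convex combination; on the GPU it is handled by texture hardware, but arithmetically it amounts to:

```python
import numpy as np

def wind_stress_at(t, t_a, t_b, W_a, W_b):
    """Convex combination of two hourly wind-stress fields at t in [t_a, t_b]."""
    w = (t - t_a) / (t_b - t_a)  # interpolation weight in [0, 1]
    return (1.0 - w) * W_a + w * W_b

# A quarter of the way between two hourly fields
W_a = np.zeros((2, 2))
W_b = np.full((2, 2), 4.0)
W = wind_stress_at(t=900.0, t_a=0.0, t_b=3600.0, W_a=W_a, W_b=W_b)
```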

### Nesting the model into NorKyst-800

A common technique in operational oceanography is to nest high-resolution local models within lower-resolution models that cover a larger domain. For instance, the NorKyst-800 model is nested within TOPAZ, which is a 12-16 km resolution HYCOM-based coupled ocean and sea ice model with data assimilation (Xie et al., **2017**). In this work, we nest our simulations into the NorKyst-800 model by setting the initial ocean state and boundary conditions appropriately. We use the so-called flow relaxation scheme (Davies, **1976**), in which we have a relaxation zone^{4} between the internal computational domain and the external boundary conditions. In this region, we solve the shallow-water equations as normal to obtain an internal solution ${Q}_{i,j}^{\mathit{int}},$ and then gradually nudge it towards an external solution ${Q}^{\mathit{ext}}$ using

Here, ${d}_{i,j}$ is distance in number of cells to the external boundary, and the parameter *d*_{0} is typically set to 2 or 3 to control the fall-off of the $\mathrm{tanh}$ relaxation function.
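One plausible realisation of the relaxation-zone update is sketched below; the exact $\mathrm{tanh}$ fall-off is given by the equation above, so the specific profile used here is an assumption for illustration only:

```python
import numpy as np

def relax(q_int, q_ext, d, d0=3.0):
    """Blend internal and external solutions in a relaxation zone.

    d is the distance to the external boundary in cells (d = 1 at the
    outermost relaxed cell).  The tanh profile below is one plausible
    fall-off controlled by d0: the weight on the external solution is
    1 at the boundary and decays towards 0 in the interior."""
    alpha = 1.0 - np.tanh((d - 1.0) / d0)
    return alpha * q_ext + (1.0 - alpha) * q_int
```

At the outermost cell the state equals the external solution, while far inside the zone the internal solution dominates.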

The implementation of the boundary conditions follows the same ideas as for the wind stress, using textures for linear interpolation in time and space. We use float4^{5} textures to hold our physical variables, $\eta ,hu,hv,$ because float3 is not supported as a texture format. We furthermore only use two textures, as we pack the north and south boundaries into one texture, and similarly for the east and west.

### Coriolis force

The original CDKLM scheme requires that the grid is aligned with the cardinal directions, which significantly restricts the use of the scheme. For higher latitudes this becomes especially pronounced, as both the latitude and direction towards north will vary across the domain. We have therefore extended the scheme to support both a spatially varying north-vector and a spatially varying latitude, which is required to be able to run realistic simulations for the areas covered by the NorKyst-800. To account for the varying north vector, we project the momentum onto the local north and east vectors,

We then evaluate the Coriolis source term *C*(*Q*) so that

the result is expressed in the (*x*, *y*)-coordinate system. Note that we also need to use the same approach in the reconstruction of *h*, as discussed in Section 3.1, to maintain the rotational steady states.

Our spatially varying Coriolis parameter is computed from the latitude, ${l}_{i,j},$ of a cell

Here, *f* is linearised around a reference point, $({x}_{\mathit{ref}},{y}_{\mathit{ref}}),$ with a slope *β* in the direction of $\overrightarrow{n}.$ Setting *β* to zero yields a constant Coriolis parameter throughout the domain.
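The linearised Coriolis parameter can be sketched as follows, with the direction-of-north vector assumed to be of unit length and the function name our own:

```python
def coriolis_beta_plane(x, y, f0, beta, x_ref, y_ref, north):
    """Linearised Coriolis parameter: f0 plus beta times the distance
    from the reference point measured along the (unit) north vector.
    Setting beta to zero yields a constant f throughout the domain."""
    nx, ny = north
    dist_north = (x - x_ref) * nx + (y - y_ref) * ny
    return f0 + beta * dist_north

# North aligned with +y: f grows with y only (illustrative values)
f = coriolis_beta_plane(x=0.0, y=100e3, f0=1.2e-4, beta=2e-11,
                        x_ref=0.0, y_ref=0.0, north=(0.0, 1.0))
```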

Figure 3 demonstrates how the angle to north plays an important role in the generation of planetary Rossby waves, caused by variations in the Coriolis force using (39). The figure shows the resulting *η* after long simulations initialised with a rotating Gaussian bump in geostrophic balance in the middle of the domain, similar to the case presented in Holm et al. (**2020**). Here, we have used a domain consisting of 350 × 350 cells with $\Delta x=\Delta y=20\text{ km},$ a depth of $H=50\text{ m},$ and the centre of the domain corresponding to approximately 33° north, with $f=8\cdot {10}^{-5}$ and $\beta =2\cdot {10}^{-11}.$ The simulation ends after approximately 35 days. The figure clearly shows the effect of varying the north vector and hence the Coriolis source term.

### Bathymetry

The bathymetry in the NorKyst-800 model is given as cell midpoint values, and cannot be used directly in our framework, as CDKLM requires the bathymetry to be defined as a piecewise bilinear surface specified by point values at the cell intersections. It is still desirable that the averages of the resulting four intersection values are equal to the NorKyst-800 cell average values, as large deviations can disturb the balanced relations in the provided ocean state. Constructing such a piecewise bilinear surface from the cell averages is an ill-posed problem, and a simple averaging of cell values excessively smears important features such as deep fjords and straits.

We have therefore devised a pragmatic iterative algorithm, which is based on reconstruction of slopes for each cell, similar to the reconstruction procedure in our numerical shallow-water scheme. Our initial guess is computed as follows:

1. Compute the local gradient in each cell so that

2. Evaluate the piecewise planar function at the four corners of each cell,

3. Average the estimate from the four cells meeting at one intersection point so that

After obtaining the initial guess, we start an iterative procedure consisting of two stages. The first stage minimises the difference between computed and true midpoint values, and the second dampens oscillations created by the first stage. We start by computing the difference between our estimate and the target bathymetry as

This brings our updated midpoints of ${H}^{*}(i,j)$ closer to the target ${H}_{i,j},$ but it also results in unwanted oscillations, as there is nothing that limits the slopes of this piecewise bilinear surface. We therefore smooth ${H}^{*},$ and repeat the procedure until the update between two subsequent iterations is sufficiently small. Around ten iterations is typically sufficient to create a piecewise bilinear surface that lies close to the target. Figure 4 shows the result of our approach applied to the bathymetry from NorKyst-800 in the innermost parts of Vestfjorden (see also Section 5.2).
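The two-stage iteration can be sketched as follows. The correction stage follows the description above, while the exact smoothing stencil of our implementation is not reproduced here, so the five-point averaging below is an assumption:

```python
import numpy as np

def refine_intersections(H_i, H_target, iterations=10):
    """Nudge intersection values H_i ((m+1) x (n+1)) so that the average
    of each cell's four corners matches the target cell values H_target
    (m x n), then damp the oscillations this introduces.

    Stage 1 distributes each cell's midpoint error onto its corners;
    stage 2 smooths the interior points with a simple 5-point average."""
    for _ in range(iterations):
        # Stage 1: midpoint correction
        mid = (H_i[:-1, :-1] + H_i[:-1, 1:] + H_i[1:, :-1] + H_i[1:, 1:]) / 4.0
        diff = H_target - mid
        corr = np.zeros_like(H_i)
        count = np.zeros_like(H_i)
        for di, dj in ((0, 0), (0, 1), (1, 0), (1, 1)):
            corr[di:di + diff.shape[0], dj:dj + diff.shape[1]] += diff
            count[di:di + diff.shape[0], dj:dj + diff.shape[1]] += 1.0
        H_i = H_i + corr / count
        # Stage 2: damp oscillations on interior intersection points
        sm = H_i.copy()
        sm[1:-1, 1:-1] = (H_i[:-2, 1:-1] + H_i[2:, 1:-1] + H_i[1:-1, :-2]
                          + H_i[1:-1, 2:] + H_i[1:-1, 1:-1]) / 5.0
        H_i = sm
    return H_i
```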

CDKLM is a fully two-dimensional scheme, yet it exhibits a grid orientation bias close to the land mask. For example, we have experienced that the scheme has difficulties in propagating momentum through narrow straits and canals that are only one or two cells wide. To ameliorate this, we can use binary erosion on the land mask to erode one row of cells at a time, as shown in Fig. 5. We extrapolate *η* into these areas using grey dilation,^{6} and set the water depth equal to the land value, which in the NorKyst-800 model is 5 metres. The process can be repeated to erode the land mask even more, and is required for tidal waves to properly propagate in features such as narrow fjords and straits. We have found that a single iteration is sufficient to propagate tides in and out of the fjords in the NorKyst-800 model.
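One step of binary erosion can be sketched in plain NumPy as below; this is a minimal 4-neighbour version under our own naming, and the neighbourhood used in our implementation may differ:

```python
import numpy as np

def erode_land(mask):
    """One step of 4-neighbour binary erosion of a land mask (True = land):
    a land cell stays land only if all four neighbours are land, so one
    row of coastal cells becomes wet per application."""
    padded = np.pad(mask, 1, mode='edge')
    return (padded[1:-1, 1:-1] & padded[:-2, 1:-1] & padded[2:, 1:-1]
            & padded[1:-1, :-2] & padded[1:-1, 2:])

# A 3x3 island of land surrounded by water shrinks to a single cell
mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 1:4] = True
eroded = erode_land(mask)
```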

### Refinement and coarsening of grid

The possibility to run simulations at different resolutions is attractive to quickly obtain low-resolution preliminary results, and also to resolve more of the dynamics for more detailed and accurate results with higher resolution. We have included functionality in our simulation framework to increase and decrease the grid resolution by factors of two to support both of these scenarios. We use a static grid refinement/coarsening performed on the initial conditions themselves, but the same approach can be used for efficient adaptive mesh refinement, following Sætra et al. (**2015**).

The physical variables are represented by cell average values, and refinement is based on a slope-limited reconstruction within each cell, which we evaluate at the refined grid cell centres. As the bathymetry is specified by a piecewise bilinear surface, the high-resolution representation of the bathymetry evaluates the surface on the refined cell intersections. Low-resolution representation of the physical variables on a coarsened grid is simply the average of the original cell values, and intersection values for the bathymetry are sampled from the corresponding original intersection values. With a factor two coarsening, both these operations are well-defined.
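With a factor-two coarsening, the cell-average and intersection-value operations amount to the following (function names are our own):

```python
import numpy as np

def coarsen(q):
    """Average 2x2 blocks of cell values into one coarse cell;
    this conserves the mean of the physical variables."""
    return (q[0::2, 0::2] + q[0::2, 1::2] + q[1::2, 0::2] + q[1::2, 1::2]) / 4.0

def coarsen_intersections(H_i):
    """Coarse intersection values are sampled from every other
    original intersection value of the piecewise bilinear bathymetry."""
    return H_i[0::2, 0::2]

q = np.arange(16.0).reshape(4, 4)
qc = coarsen(q)
```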

Note that the boundary conditions and wind forcing terms are independent of the mesh resolution due to our use of GPU textures.

## Performance and accuracy of GPU implementation

We have previously evaluated the standard CDKLM scheme on several test cases which target oceanographic dynamics relevant for the shallow-water equations when implemented on a Cartesian grid (Holm et al., **2020**). In this work, however, we have reformulated the scheme in terms of *η* instead of *h*, added several new source terms, and enabled varying latitude and north vector. It is therefore important to revisit the grid convergence test. We also measure computational performance of our simulation framework through assessing its weak scaling.

### Convergence of the numerical scheme

To evaluate numerical convergence, we define a benchmark case covering $500\text{ km}\times 500\text{ km}.$ We model the bathymetry after the Matlab peaks function, with *x* and *y* given in km, so that the depth varies between 34 and 181 metres (see Fig. 6). We initialise the sea-surface level with a bump centred in the domain, where *c* controls the size of the bump, set to 300 km in our case. We assume that the North pole is located at $(3000\text{ km},3000\text{ km}),$ which means that the angle between the *y*-axis and the north vector varies from ${33.7}^{\circ}$ to ${56.3}^{\circ}$ across the domain and that the latitude varies from ${56.0}^{\circ}$ to ${67.4}^{\circ}$ north. These parameters roughly correspond to the North Sea. Including land inevitably yields only first-order accurate fluxes, and we have therefore deliberately not included dry areas in the grid convergence test.
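As a concrete illustration, a benchmark of this type can be set up as below: a bathymetry modelled after the standard Matlab peaks function and a smooth sea-surface bump of size *c* centred in the domain. The rescaling constants and function names here are our own illustrative choices, not the exact expressions used in the paper.

```python
import numpy as np

def peaks(x, y):
    """The standard Matlab peaks function on its natural [-3, 3] range."""
    return (3 * (1 - x)**2 * np.exp(-x**2 - (y + 1)**2)
            - 10 * (x / 5 - x**3 - y**5) * np.exp(-x**2 - y**2)
            - np.exp(-(x + 1)**2 - y**2) / 3)

def make_benchmark(n, L=500.0e3, c=300.0e3):
    """Build bathymetry and initial sea-surface level on an n x n grid.

    The depth rescaling and the 20 cm bump amplitude are illustrative
    assumptions, chosen to give a plausible shelf-sea depth range.
    """
    x = (np.arange(n) + 0.5) * (L / n)        # cell-centre coordinates [m]
    X, Y = np.meshgrid(x, x)
    # Map the physical domain onto peaks' natural [-3, 3] x [-3, 3] range
    P = peaks(6 * X / L - 3, 6 * Y / L - 3)
    H = 107.5 - 73.5 * P / np.abs(P).max()    # depth in metres (illustrative)
    # Smooth bump centred in the domain, size controlled by c
    r2 = (X - L / 2)**2 + (Y - L / 2)**2
    eta = 0.2 * np.exp(-r2 / c**2)
    return H, eta
```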

The simulation is run for 800 seconds, after which we compute the difference over the whole domain between each solution and a reference solution computed on a 1024 × 1024 grid. Figure 6 and Table 1 show that we achieve second-order accuracy in both the *L*_{1} and the *L*_{2} norm, whilst the convergence in ${L}_{\infty}$ is somewhat slower.
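The observed convergence order follows from comparing errors on successively refined grids: if the error behaves as $e(h) \approx C h^{p}$ and the grid spacing is halved at each level, then $p = \log_2(e_k / e_{k+1})$. A small helper of our own naming:

```python
import numpy as np

def convergence_rates(errors, factor=2.0):
    """Observed convergence order between successive refinement levels.

    Assumes e(h) ~ C * h^p with the grid spacing reduced by `factor`
    at each level, giving p = log(e_k / e_{k+1}) / log(factor).
    """
    e = np.asarray(errors, dtype=float)
    return np.log(e[:-1] / e[1:]) / np.log(factor)
```

For a second-order scheme, halving the grid spacing should reduce the error by a factor of four, yielding rates close to 2.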

### Computational performance

To assess the computational performance, we continue to use the same benchmark, but run the simulation twice as long. We also measure the cost of downloading the results from the GPU to the CPU. Figure 7 shows the results for domains with 16^{2} to 16384^{2} cells, running on a Tesla P100 GPU. We see that for the larger domain sizes, we obtain perfect weak scaling, as a factor two grid refinement leads to a factor eight increase in computational complexity (the time step size is cut in half due to the CFL condition, and we have four times the number of cells). Cases with fewer than 512^{2} cells are dominated by overheads, and hence the cost of a single time step is close to constant. The problem size required to make overheads negligible and obtain perfect weak scaling varies between different GPUs and their configurations. The time for downloading data scales linearly with the problem size, as expected, and small domains again show the same overhead-dominated behaviour as seen for the computational performance (Table 2).
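The factor-eight cost model quoted above can be made explicit: a simulation over a fixed physical time on an $n \times n$ grid costs $\mathcal{O}(n^{2})$ work per time step and needs $\mathcal{O}(n)$ time steps by the CFL condition, so the total cost grows like $n^{3}$. A tiny sketch of this ideal scaling model (our own naming, not part of the framework):

```python
def expected_cost(n_side, t_ref, n_ref):
    """Ideal weak-scaling cost model for a fixed simulated time span.

    Work per step ~ n^2 cells, number of steps ~ n (CFL halves the
    time step per refinement), so total cost ~ n^3 relative to a
    reference run of cost t_ref on an n_ref x n_ref grid.
    """
    return t_ref * (n_side / n_ref) ** 3
```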

Our current implementation is designed for running massive ensembles (Holm et al., **2020**), and therefore does not utilise multiple GPUs for a single simulation. However, we have previously shown that finite-volume schemes such as the one discussed here are very well suited for multi-GPU simulations (Sætra and Brodtkorb, **2010**).

## Real-world simulations

Using the NorKyst-800 model system, the Norwegian Meteorological Institute issues daily ocean forecasts for the complete Norwegian coast and makes them publicly available through THREDDS servers and various web pages, as well as transferring them to other authorities. In addition to the 3D velocities of the ocean currents, these forecasts also provide the 2D depth-averaged currents, which we use as initial conditions along with the sea-surface level. The forecasts also contain wind forcing, bathymetry, land masks, longitudes, latitudes, and the angle towards north for each cell. This means that it is simply a matter of accessing the file and cutting out the correct sections to be able to initialise our simulator. To start a simulation within our framework, we simply need the full URL of the specific forecast we want to nest our simulation within, and to specify the desired sub-domain.
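The "cut out the correct sections" step can be sketched as below. The variable names follow common ROMS output conventions (`zeta`, `ubar`, `vbar`, `h`, `angle`, `mask_rho`), but both these names and the commented-out forecast URL are assumptions to be checked against the actual NorKyst-800 THREDDS catalogue; the function name is our own.

```python
import numpy as np

# Assumed ROMS/NorKyst-800 variable names -- verify against the real files.
FIELDS = {"eta": "zeta", "ubar": "ubar", "vbar": "vbar",
          "depth": "h", "angle": "angle", "mask": "mask_rho"}

def cut_subdomain(ds, x0, x1, y0, y1, t=0):
    """Extract initial conditions for the sub-domain [y0:y1, x0:x1]
    from an opened forecast dataset `ds` (a mapping of variable arrays)."""
    out = {}
    for key, var in FIELDS.items():
        v = ds[var]
        if np.ndim(v) == 3:                    # time-dependent field
            out[key] = np.asarray(v[t, y0:y1, x0:x1])
        else:                                  # static field
            out[key] = np.asarray(v[y0:y1, x0:x1])
    return out

# Typical usage (requires network access, URL is a hypothetical example):
# from netCDF4 import Dataset
# url = "https://thredds.met.no/..."          # path to a specific forecast
# with Dataset(url) as nc:
#     ic = cut_subdomain(nc.variables, 100, 500, 100, 500)
```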

We run our simulations on the three different sub-domains shown in Fig. 8. The simulation cases run independently, but use the same global data set for initialisation and forcing. Our first simulation is in the Norwegian Sea with no land values within the domain, and is therefore a fairly simple test for using real initial and boundary conditions. The second case is centred on the Lofoten archipelago in Northern Norway, which is dominated by a complex coastline consisting of narrow fjords and a large number of islands, and serves as a challenging test for our reconstruction of the bathymetry and land mask. Finally, we run almost the complete domain, leaving only 25 cells in each direction to serve as boundary conditions. With the entire Norwegian coastline, it becomes important to account for the difference in the north vector throughout the domain (see the longitudinal lines and north arrows in Fig. 8), and we should see the tides following the coast as Kelvin waves. The last case is also used to make sea-level predictions at five different locations, indicated as stars in Fig. 8.

All three cases are run with three different resolutions: a high-resolution grid refined to 400 m cells, the original 800 m resolution of NorKyst-800, and a low-resolution grid consisting of 1600 m cells. We use an adaptive time step, and update the time step every 20 minutes according to the CFL condition in (5) scaled by ${C}_{\mathit{CFL}}=0.8.$ Each dataset provided by the NorKyst-800 forecasts covers 23 hours, which means that we have boundary conditions to run simulations for that time range only.
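The adaptive time-step update can be sketched as follows. Since the CFL condition (5) is not reproduced here, this uses the standard shallow-water bound on wave speeds as a stand-in, with the paper's scaling ${C}_{\mathit{CFL}}=0.8$; the function name and arguments are our own.

```python
import numpy as np

def cfl_dt(eta, hu, hv, H, dx, dy, g=9.81, c_cfl=0.8):
    """Largest stable time step from a standard shallow-water CFL bound,
    scaled by c_cfl. The exact condition (5) in the text may differ.

    eta: sea-surface deviation, hu/hv: volume transports, H: depth.
    """
    h = np.maximum(eta + H, 1e-6)          # total water depth, avoid dry cells
    u, v = hu / h, hv / h                  # particle velocities
    c = np.sqrt(g * h)                     # gravity wave speed
    max_x = np.max(np.abs(u) + c)
    max_y = np.max(np.abs(v) + c)
    return c_cfl * min(dx / max_x, dy / max_y)
```

On the 800 m grid this gives time steps on the order of tens of seconds for typical shelf depths; recomputing it every 20 minutes keeps the GPU kernels free of per-step global reductions.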

### Case 1: Norwegian Sea

The first case runs an open-water domain in the Norwegian Sea and tests our ability to use the NorKyst data as initial and boundary conditions. The sea-surface level changes significantly during the simulation time range as the tidal waves enter and exit the domain, and this dynamic depends largely on the boundary conditions. The velocity field, on the other hand, is dominated by pre-existing, slowly rotating features, and will not change much during the simulation time range. This enables us to test both that the initial conditions are correctly captured, and how well our simulator maintains complex rotating structures over time.

Figure 9 shows simulation results of the Norwegian Sea for all three grid resolutions after 23 hours, compared to the reference solution from NorKyst-800. First, we see that the contour lines for the sea-surface level in the left column show the same values at similar places for all simulations. This shows that the boundary conditions allow tidal waves to propagate correctly in and out of the domain as intended. The right column shows the corresponding particle velocities. Note that the initial states for both the high-resolution and original runs are visually indistinguishable from the NorKyst-800 model state, whereas the low-resolution initial conditions slightly smear out the sharper features of the velocity. At 23 hours, we see that the high-resolution model to a large degree maintains the sharp features from NorKyst-800. This shows that even though we run a barotropic model, we are able to maintain some of the baroclinic signals present in the reference model. We suggest that this is due to a combination of the initialisation of the baroclinic motion and a well-balanced numerical scheme for geostrophic flow. With the original resolution, however, the particle velocities resemble a slightly blurred image of the reference solution, and in the low-resolution simulation, the features are blurred even more. We also start to see clear signs of grid effects, both internally and from the boundary.

### Case 2: Lofoten

The second case is centred around the Lofoten archipelago in Northern Norway. This region has a complex coastline consisting of a large number of fjords, straits and islands. In addition to the aspects tested in the first case, this case tests the reconstruction of the bathymetry at intersections and the handling of dry zones through the complex land mask. We expect to see a Kelvin wave following the coast, but because the Lofoten archipelago protrudes into the Atlantic, part of the wave will be trapped and amplified in Vestfjorden between Lofoten and the mainland. Furthermore, as can be seen from the depth map in Fig. 8, the edge of the continental shelf crosses through the upper right corner of the sub-domain, and we can here see a line of eddies driven by baroclinic instabilities.

We again run simulations with the three different grid resolutions, and Fig. 10 shows the final simulation state after 23 hours, with a high tide in Vestfjorden. All three simulations show that the contour lines for the sea-surface level are slightly off compared to the reference solution, with marginally more water entering the fjord. As in Case 1, the low-resolution simulation is unable to preserve the fine structures in the currents and shows grid effects. The original and high grid resolutions are also unable to preserve the exact eddy structure found in the reference solution, but we can instead more clearly see a consistent coastal current with natural irregularities. This is consistent with what we can expect using a barotropic model. These results also show a larger area with high particle velocity around the smallest islands in our simulations, possibly due to the reconstructed bathymetry, which widens some straits through the land-mask erosion.

### Case 3: Norway

Our third case consists of almost the complete domain covered by the NorKyst-800 model. We leave out the 25 outermost cells and use these as boundary conditions. With the complete domain, we cover a large enough area to see the Kelvin waves entering in the southwestern part of the domain and travelling north.

Figure 11 shows the simulation results after 23 hours in the same manner as the earlier figures. We now see from the sea-surface level (left column) that there is a low tide in the middle of the domain, whereas there are high tides both in the north at the Russian border and in the Oslo fjord in the lower left corner. The contour lines are similar for all three grid resolutions, and the particle velocities (right column) are in general consistent with the reference solution from NorKyst-800. Due to the scale of the domain, it is harder to compare the fine details in Fig. 11, but the large-scale features from NorKyst-800 are found in both the high and original grid resolutions. As in the other cases, the low-resolution simulation maintains fewer details. It should be noted that by zooming into the areas simulated in Cases 1 and 2, we have seen that the results from running the complete domain are, for each resolution, similar to the results shown in Figs. 9 and 10.

Table 3 reports the run time for simulating this case on a laptop, a desktop, and a server GPU. The laptop contains a dedicated GeForce 840M GPU, which is powerful for a laptop but on the low end of the GPU performance spectrum. It completes the simulation in slightly less than 54 minutes for the original grid resolution of NorKyst-800, which is comparable to the time required for 512 CPU cores to run the NorKyst-800 simulation (Christensen, **2020**).^{7} On a desktop with a slightly old GeForce GTX 780 gaming GPU, the same simulation is carried out in approximately 10 minutes. The Tesla P100 is a GPU often found in supercomputers, and thus belongs to the very high end of the performance (and price) spectrum; it runs the 23-hour forecast in approximately 3 minutes. Note that for the high-resolution grid, the time step is on average just 0.46 seconds, meaning that the results shown in Figs. 9–11 are obtained after 180 000 time steps, using only single-precision floating-point operations.

Table 4 takes a closer look at the simulation time of Case 3 by comparing the time spent initialising, running the actual simulation for one hour, and writing the results for one time step to disc. The initialisation time is measured under the assumption that the NetCDF file from NorKyst-800 is available locally on disc, as downloading from the THREDDS server depends on the internet connection and is therefore of less interest. First of all, we see that initialising the original resolution costs approximately the same as half an hour of simulation. With higher or lower resolution, initialisation is slightly more expensive, as the initial currents, sea-surface level and bathymetry need to undergo pre-processing. Furthermore, we see that initialisation constitutes a considerable fraction of the total run time when using a low grid resolution, but less so with high resolution. The table also shows that simulation dominates over writing results to disc for all resolutions. Note that the cost of writing to file scales with the number of cells, meaning that the fraction of the total run time spent writing the results to file decreases with the square root of the number of cells. This matches the measurements presented in Section 4.2.
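The square-root argument can be made explicit with a small cost model of our own: for an $n \times n$ grid, simulating a fixed time span costs $\mathcal{O}(n^{3})$ (as in the weak-scaling discussion) while writing one state to file costs $\mathcal{O}(n^{2})$, so the I/O fraction decays like $1/n$, i.e. with the square root of the number of cells.

```python
def io_fraction(n_side, a_compute=1.0, b_write=1.0):
    """Illustrative cost model (coefficients a, b are arbitrary):
    compute ~ n^3 (n^2 cells x n steps), write ~ n^2, so the
    I/O fraction behaves like 1/n for large n."""
    compute = a_compute * n_side**3
    write = b_write * n_side**2
    return write / (compute + write)
```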

### Tidal forecast verification and validation

The oceanographic phenomena best captured and maintained by the shallow-water equations are arguably tides and storm surges. We therefore use Case 3 to generate tidal forecasts for five locations along the Norwegian coast (see Fig. 8) and compare these with the hourly values from the NorKyst-800 dataset. The locations have been chosen based on the availability of actual sea-level observation stations, which record data every 5 minutes, and we use these to show the realism of the forecasts generated by both NorKyst-800 and the current model. It should be noted that the sea-level variation depends on more physical processes than those modelled by NorKyst-800 and our code, such as atmospheric pressure and other friction terms (see also, e.g., Döös et al., **2004**). We therefore expect a discrepancy between the forecasts and the observations, and it is most relevant for us to use the NorKyst-800 result as a reference solution. The five locations (see Fig. 8) are Bergen and Honningsvåg, close to the open sea; Oslo and Narvik, located in the innermost parts of long fjords; and Bodø, which is affected by how the Lofoten archipelago catches the tidal waves.

Figure 12 shows the tidal forecasts generated from Case 3 for each of the five locations, sorted from south to north. First of all, we see that our forecasts at Honningsvåg are in very good agreement with NorKyst-800. The same is the case in Bergen, even though our simulations here consistently give slightly too low values for *η*. At Bodø, we see that NorKyst-800 gives a delay in the transition from high to low tide between hours 14 and 16, and we can see weak indications of this in our high-resolution simulation as well. All these locations are at the coast, which makes the tides here relatively easy to capture, but it should still be noted that they are all protected by islands. The tides in Oslo and Narvik are harder to capture, as these two cities are located at the innermost parts of two long fjords, and we get larger differences between the grid resolutions. In Narvik, we get a large improvement by using the high-resolution grid over the original and low resolutions, and overall, our results are fairly close to the reference solution. Oslo is a particularly hard case, as it is positioned within a very narrow and shallow fjord. This leads to larger relative discrepancies here, even though the absolute difference is roughly the same as in Narvik. We see that the shape of our high-resolution result differs slightly from the reference solution. The tidal signal predicted by the low-resolution grid in Oslo is very weak, as the fjord almost becomes closed off at this resolution. Finally, note that our solutions are well behaved from the very start, meaning that we manage to initialise our simulations in a balanced state.

## Summary

We have presented a GPU simulation framework suitable for real-world oceanographic applications. The framework is based on a modern high-resolution finite-volume method for the shallow-water equations (Chertock et al., **2018**) with new adaptations and extensions to handle moving wet-dry fronts, land masks, bottom friction, wind forcing, varying north vector, varying latitude, and external boundary conditions. The numerical algorithm is also improved to facilitate efficient implementation on GPUs using single-precision floating-point operations, and includes an efficient reformulation of a problematic recursive term. The framework is designed to be initialised from operational ocean forecasts issued by the Norwegian Meteorological Institute from a three-dimensional ocean model.

We have presented second-order grid convergence for a synthetic but challenging benchmark containing complex topography and varying north vector. We have also shown that the computational performance follows the expected weak scaling.

Finally, we have validated the simulation framework through three different real-world cases along the Norwegian coastline, initialised from ocean forecasts produced by the NorKyst-800 operational model with an $800\mathrm{m}\times 800\mathrm{m}$ horizontal resolution. Our results demonstrate that the framework manages to maintain fine rotational structures, especially when increasing the grid resolution to $400\mathrm{m}\times 400\mathrm{m}.$ On a coarse grid (1600 m grid cells), however, the features are lost and the simulations get dominated by grid effects. The tidal forecasts show that we are able to predict the observed tides relatively well at coastal locations on any of the three grid resolutions used, when compared to the NorKyst-800 reference data. We see, however, that within long fjords, our results are improved when increasing the grid resolution, compared to using the original horizontal resolution of the NorKyst-800 model.

## CRediT author statement

A. R. Brodtkorb: Conceptualization, Methodology, Software, Investigation, Writing - Original Draft, Writing - Review & Editing, Visualization, Project administration, Supervision. H. H. Holm: Conceptualization, Methodology, Software, Investigation, Writing - Original Draft, Writing - Review & Editing, Visualization.