## Ensemble Data Assimilation with auto-correlated model error

Data assimilation is a mathematical method to combine forecasts with observations, in order to improve the accuracy of the original forecast. Normally data assimilation methods are performed with the perfect-model assumption (weak-constraint). However, there are different sources that can produce model error, such as missing description of the dynamic system and numerical discretisation. Therefore, in recent years, the model error has been more often considered in the data assimilation process (strong-constraint settings).

There are several data assimilation methods applied in various fields. My PhD project mainly focuses on the ensemble/Monte Carlo formulation of the Kalman Filter-based methods, more specifically, the ensemble Kalman Smoother (EnKS). Different from the filter, a smoother updates the state of the system using observations from the past, present and possibly the future. The smoother does not only improve the forecast at the observation time, instead, it updates the whole simulation period.

The main purpose of my research is to investigate the performance of the data assimilation methods with auto-correlated model error. We want to know what will happen if we propose a misspecified auto-correlation in the model error, for both state update and parameter estimation. We start our project with a very simple linear auto-regressive model. As for the auto-correlation in model error, we propose an exponential decaying decorrelation. Naturally, the system has a exponential decaying parameter ωr, and the parameter we use in the forecast and data assimilation is ωg which can be different from the real one.

A simple example can illuminate the decorrelation issue. In Figure 1, we show results of a smoothing process for a simple one-dimensional system over a time window of 20 nature time steps. We use an ensemble Kalman Smoother with two different observation densities in time. The memories in the nature model and the forecasts models do not coincide. We can see that with ωr = 0.0, when the actual model error is a white-in-time random variable, the evolution of the true state of the system behaves rather erratically with the present model settings. If we do not know the memory and assume the model error is a bias in the data assimilation process (ωg → ∞), the estimation made by the data assimilation method is not even close to the truth, even with very dense observations in the simulation period, as shown in the left two subplots in Figure 1. On the other hand, if the model error in the true model evolution behaves like a bias, and we assume that the model error is white in time in the data assimilation process, the results are quite different with different observation frequencies. As shown in two subplots on the right in Figure 1, with very frequent observations, we can see a fairly good performance of the data assimilation process, but with a single observation, the estimation is still not accurate. Figure 1: Plots of the trajectories of the true state of the system (black), three posterior ensemble members (pink, randomly chosen from 200 members), and the posterior ensemble mean (red). The left subplots show results for a true white noise model error and an assumed bias model error for two observation densities. Note that the posterior estimates are poor in both cases. The right subplots depict a bias true model error and an assumed white noise model error. The result with one observation is poor, while if many observations are present the assimilation result is consistent within the ensemble spread.

In order to evaluate the performance of the EnKS, we need to compare the root-mean-square error (RMSE) with the ensemble spread of the posterior. The best performance of the EnKS is when ratio of RMSE over the spread is equal to 1.0. The results are shown in Figure 2. As we can see, the Kalman Smoother works well when ωg ωr for all the cases, with the ratio of RMSE over the spread equal to 1.0. With relatively high observational frequency, 5 observations or more in the simulation window, the RMSE is larger than the spread when ωg > ωr, and vice versa. In a further investigation, the mismatch between the two timescales ωr and ωg barely has any impact on the RMSE. The ratio is dominated by the ensemble spread. Figure 2: Ratio of MSE over the posterior variance for the 1-dimensional system, calculated using numerical evaluation of the exact analytical expressions. The different panels show results for different numbers of observations.

Then, we move to estimating the parameter encoded in the auto-correlation of the model error. We estimate the exponential decaying parameter by the state augmentation using the EnKS, and the results are shown in Figure 3. Instead of the exponential parameter, ωg, we use the log scale of the memory timescale to avoid negative memory estimates. The initial log-timescale values are drawn from a normal distribution: ln ωgi ∈ N (ln ωg, 1.0). Hence we assume that the prior distribution of the memory time scale is lognormal distributed. According to Figure 3, with an increasing number of windows we obtain better estimates. Also, the convergence is faster with more observations. And in some cases the solution did not converge to the correct value. This is not surprising given the highly nonlinear character of the parameter estimation problem, especially with only one observation per window. When we observe every time step the convergence is much faster, and the variance in the estimate decreases, as shown in the lower two subplots. In this case we always found fast convergence with different first guess and true timescale combinations, demonstrating that more observations bring us closer to the truth, and hence make the parameter estimation problem more linear. Figure 3: PDFs of the prior (blue) and posterior (reddish colours) estimated ωg, using an increasing number of assimilation windows. The different panels show results for different observation densities and prior mean larger (a, c) or smaller (b, d) than ωr. The vertical black line denotes the true value ωr.

As the results of the experiments show, the influence of an incorrect decorrelation timescale in the model error can be significant. We found that when the observation density is high, state augmentation is sufficient to obtain converging results. The new element is that online estimation is possible beyond a relatively simple bias estimate of the model error.

As a next step we will explore the influence of incorrectly specified model errors in nonlinear systems, and a more complex auto-correlation in the model error.