Beyond least squares analysis: Regression considering correlation

Long-term climate data records are vital for the study and quantification of environmental and climatic change. Practically, the several-decade observation periods required to establish climatic trends and variability fall well outside the duration of any single Earth Observation satellite mission. As such, producing suitable datasets from remote sensing data requires the combination of observations from an extended series of a given sensor; these series are known as fundamental climate data records (FCDRs). A primary goal of the FIDUCEO project is to produce four FCDRs in a metrologically rigorous manner, for sensor series observing in visible, infrared and microwave spectral domains.

Most of the sensors studied in the project were not designed to operate at the level of accuracy required for climate research, and so sensor-to-sensor differences impact derived geophysical products. We therefore attempt to perform a consistent in-flight retrospective recalibration of all the sensors in a series, in a process referred to as harmonisation. This involves taking the data from the periods during which sensors operated concurrently and finding instances, known as match-ups, where two sensors simultaneously observe the same location on the Earth with compatible viewing geometry (within a given tolerance). Match-ups are also found against modern reference sensors whose calibrations are considered to be of high quality. The recalibration then involves calibrating the whole sensor series to the reference sensor. This recalibration is a large non-linear regression problem, solving for new calibration parameters in the measurement equation of each sensor based on the information provided by the match-ups.

Ordinary least squares (LSQ) is a commonly used approach for regression. However, it can only consider uncertainties in the derived quantity (simplistically, the y-axis), and it treats all the observations as having independent errors. For the FIDUCEO harmonisation approach we want to be able to respect the uncertainties associated with all measured values and the error correlation between match-ups. Aside from these philosophical requirements, in practice LSQ solutions have been found to cause biases which will affect the long-term stability of the series, and therefore the ability to determine a climate trend, whereas more robust errors-in-variables (EIV) regression models, which can consider uncertainties associated with all variables including the ‘x-axis’, perform much better.
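
Schematically, the difference between the two approaches can be written as a pair of cost functions. The notation here is introduced purely for illustration and is not taken from the project: $f$ is the model with parameters $a$, $(x_i, y_i)$ are the measured pairs with standard uncertainties $u(x_i)$ and $u(y_i)$, and $\xi_i$ are the unknown true values of the x-variable. A least squares fit that accounts only for the uncertainty in y minimises

$$
S_{\mathrm{LSQ}}(a) = \sum_i \frac{\left(y_i - f(x_i; a)\right)^2}{u(y_i)^2}
$$

whereas an EIV fit with independent errors also estimates the $\xi_i$ and minimises

$$
S_{\mathrm{EIV}}(a, \xi) = \sum_i \left[ \frac{\left(x_i - \xi_i\right)^2}{u(x_i)^2} + \frac{\left(y_i - f(\xi_i; a)\right)^2}{u(y_i)^2} \right]
$$

In the first form the measured $x_i$ are treated as exact, which is the source of the bias when they are in fact noisy.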

A simple set of simulations can illustrate the fundamental problem. We have taken a simple straight line equation ($Y = A + B \times X$) with fixed values A = 0.0, B = 1.0 in the range 0.0 < X < 1.0 and have generated X,Y pairs where noise of 0.05 has been added to both X and Y values. We have then fitted a straight line to the data in two ways:

  1. Only the uncertainty associated with Y has been included in the fitting process (LSQ)
  2. Uncertainties associated with both X and Y have been included explicitly with Orthogonal Distance Regression (ODR)

The first two columns of Figure 1 show the distribution of the deviation of the fitted parameters (denoted as p[0] and p[1]) from the true values (A, B), expressed relative to the estimated uncertainty associated with p[0] and p[1] based on the solution’s covariance matrix. Also shown in red are the predicted normal distributions for a statistically consistent set of values relative to the truth. Figure 1 makes it clear that LSQ is not capable of returning the correct values of A and B, whereas the ODR solution is completely consistent within the estimated uncertainty. The right-hand set of plots shows the deviation of the Y value estimated from the fits from the true Y value for an X of 2.0. Again the LSQ fits are biased whereas the ODR values are not.

Figure 1. Left-hand plots show the error in the fitted parameters for a straight line model from their true values in terms of their deviation relative to their estimated uncertainty. The red curve shows a normal distribution centred at zero, which is the expected distribution for a correct fit. The ODR example is completely statistically consistent with the true values, whereas the LSQ case is biased. The right-hand pair of plots shows the error in Y from the fitting process and again shows the bias in the LSQ method.
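
The simulation described above can be approximated with a short, standard-library-only Python sketch. This is not the code used to produce Figure 1. Several choices are assumptions made for illustration: the true X values are drawn uniformly from (0, 1), the "noise of 0.05" is treated as a Gaussian standard deviation, and the number of points and trials is arbitrary. In place of a full ODR package, the sketch uses the closed-form orthogonal regression slope, which is what an orthogonal distance fit of a straight line reduces to when the X and Y uncertainties are equal. All function names are hypothetical.

```python
import math
import random

# Values quoted in the text; SIGMA is assumed to be a Gaussian standard deviation.
A_TRUE, B_TRUE, SIGMA = 0.0, 1.0, 0.05


def simulate(n, rng):
    """Generate n noisy (x, y) pairs from Y = A + B * X with noise on both axes."""
    x_true = [rng.uniform(0.0, 1.0) for _ in range(n)]  # assumed uniform in (0, 1)
    x = [xt + rng.gauss(0.0, SIGMA) for xt in x_true]
    y = [A_TRUE + B_TRUE * xt + rng.gauss(0.0, SIGMA) for xt in x_true]
    return x, y


def moments(x, y):
    """Return the means and centred sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return mx, my, sxx, syy, sxy


def fit_lsq(x, y):
    """Ordinary least squares: treats x as exact. Returns (intercept, slope)."""
    mx, my, sxx, _, sxy = moments(x, y)
    slope = sxy / sxx
    return my - slope * mx, slope


def fit_orthogonal(x, y):
    """Orthogonal regression for equal x and y uncertainties.

    Returns (intercept, slope). Assumes sxy != 0, which holds for this setup.
    """
    mx, my, sxx, syy, sxy = moments(x, y)
    d = syy - sxx
    slope = (d + math.sqrt(d * d + 4.0 * sxy * sxy)) / (2.0 * sxy)
    return my - slope * mx, slope


def mean_slopes(trials=400, n=100, seed=0):
    """Average the fitted slope from both methods over many simulated data sets."""
    rng = random.Random(seed)
    lsq_total, orth_total = 0.0, 0.0
    for _ in range(trials):
        x, y = simulate(n, rng)
        lsq_total += fit_lsq(x, y)[1]
        orth_total += fit_orthogonal(x, y)[1]
    return lsq_total / trials, orth_total / trials


if __name__ == "__main__":
    lsq_b, orth_b = mean_slopes()
    print(f"mean LSQ slope:        {lsq_b:.4f}")
    print(f"mean orthogonal slope: {orth_b:.4f}")
```

Under these assumptions the true X values have variance 1/12, so the standard regression-dilution factor for the LSQ slope is (1/12) / (1/12 + 0.05²) ≈ 0.97. The averaged LSQ slope should therefore sit roughly 3% below B = 1.0, while the orthogonal fit should average close to 1.0.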

This is not, however, the end of the story, as existing EIV implementations, such as ODR, still do not capture the error correlation structure between the data for the match-ups and so cannot provide an optimal solution. In the FIDUCEO project we are developing novel methods for a rigorous, metrological solution to the EIV regression which fully respects the match-up error correlation structure.
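
One generic way to write a regression problem that does account for error correlation is sketched below. It is a textbook form, not the specific formulation being developed in FIDUCEO, and the symbols are assumptions introduced here for illustration. All measured values from all match-ups are stacked into a single vector $z$ with error covariance matrix $V$, and $\zeta$ denotes the corresponding estimated true values, which are constrained to satisfy the sensor measurement equations for calibration parameters $a$:

$$
S(a, \zeta) = \left(z - \zeta\right)^{\mathsf{T}} V^{-1} \left(z - \zeta\right)
$$

When $V$ is diagonal this reduces to a sum of independently weighted squared residuals. The off-diagonal elements of $V$ are what carry the error correlation between match-ups, and with tens of millions of match-ups the size of $V$ is one reason the problem is computationally demanding.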

The implementation of this methodology is challenging due to the possibly complex error and geophysical correlation structures and high data volume (up to tens of millions of match-ups). Harmonisation tools are under development and the testing and validation of this new software is under way utilising both simulated and real data from the AVHRR (Advanced Very High Resolution Radiometer) sensor, which has flown on both NOAA and EUMETSAT/MetOp satellite series from 1979 through to the present day.
