José Gómez-Dans edited this page Mar 2, 2018 · 7 revisions

An example of using KaFKA

(You can see a small example of this here)

KaFKA relies on a few things:

  1. The existence of an observation operator (e.g. RT model) interfaced by a Gaussian Process emulator (not necessarily, but it helps)
  2. The provision of an observations object that reads in the EO data, and provides an estimate of uncertainty.
  3. A class used for writing the output to disk. This can probably use information from the observations to define e.g. locations and whatnot.

Defining the problem

The region of interest/state mask/state grid and the temporal grid

KaFKA will produce inferences on parameters over a region, defined in both space and time. For time, the current implementation uses a temporal grid: a plain Python list where each element is a datetime.datetime object. The inference assumes the land surface state doesn't change between time steps (list elements). The first period of inference takes all observations from the 0th element of the list up to (and possibly including; worth checking in the code) the 1st element, the second from the 1st to the 2nd, and so on.
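
Such a temporal grid is just a list of datetime.datetime objects. A minimal sketch (the dates and weekly spacing here are purely illustrative):

```python
import datetime

# A temporal grid: a plain Python list of datetime.datetime objects.
# Here, a hypothetical weekly grid starting 1 April 2017.
start = datetime.datetime(2017, 4, 1)
temporal_grid = [start + datetime.timedelta(days=7 * i) for i in range(13)]

# Observations falling between temporal_grid[k] and temporal_grid[k + 1]
# are assimilated assuming a static land surface over that period.
```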

The spatial mask defines both the area and spatial resolution of where parameters will be inferred. As such, it is designed to be a True/False mask over some area. In this case, some locations will form part of the inference, and some will not (allowing the user to mask roads, rivers, lakes, urban areas, ...). An advantageous format for this is to use a geospatial dataset (a GeoTIFF file). This stores the projection as well as extent and spatial resolution, while also providing the actual pixel mask. The GeoTIFF file can then be used to automatically reproject other datasets (observations, priors, ...) or to define the shape of the output geospatial files.

The state vector and the covariance information

KaFKA uses a normal pdf to encode all information related to the state vector, which means that the state can be fully stored as a so-called mean vector and a covariance (or inverse covariance) matrix. The causal nature of the system implies that for each time step, we'll have a vector and a matrix covering all pixels. We have a preference for using inverse covariance matrices, as they're often sparse.

The state vector for a single pixel is determined by a vector of N parameters p1, ..., pN. When considering many pixels, each pixel will be characterised by a set of N parameters, and hence the total state vector for P pixels will be a vector of size P*N with the following ordering: p1,1, ..., pN,1, p1,2, ..., pN,2, ..., p1,P, ..., pN,P. In other words, we store all the parameters related to one pixel in succession, and then stack the individual per-pixel vectors sequentially. The pixel ordering is given by the True elements of the state mask after flattening.

The (inverse) covariance matrix follows the same ordering along the main diagonal. This means that in the case of no inter-pixel interactions, the matrix will be block diagonal, with P blocks of size N, hence the total size will be (N*P) x (N*P).
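
The ordering and the block-diagonal structure can be sketched as follows (N=3 parameters and P=4 pixels are illustrative values; the per-block contents here are made up):

```python
import numpy as np
import scipy.sparse as sp

N, P = 3, 4  # parameters per pixel, number of pixels (illustrative values)

# Per-pixel state vectors stacked pixel by pixel: parameter n of pixel p
# lives at flat index p * N + n.
x = np.arange(N * P, dtype=float)
pixel_2 = x[2 * N:(2 + 1) * N]  # all N parameters for pixel index 2

# With no inter-pixel interactions, the inverse covariance is block
# diagonal: P blocks of size N x N, giving an (N*P) x (N*P) matrix.
per_pixel_blocks = [np.eye(N) * (p + 1) for p in range(P)]
P_inv = sp.block_diag(per_pixel_blocks, format="csr")
```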

The observations class

The observations class must provide one method and one attribute (both are used by the inference engine and are expected to exist):

  • get_band_data(the_date, band_no), which returns the observations, their uncertainty and the relevant emulator as a namedtuple (or similar class) with attributes observations, uncertainty, mask, metadata and emulator (the actual contents still need to be defined)
  • dates: a collection (e.g. a list) of datetime.datetime objects with the dates of the different observations.

A barebones class looks like this:

class Observations(object):
    def __init__(self):
        # datetime.datetime objects, one per available observation
        self.dates = []

    def get_band_data(self, the_date, band_no):
        # Read in the data for `the_date` and `band_no`, and package
        # them up as a namedtuple (see below)
        # [...]
        return data_chunk

There are examples of this kind of class here

But generally speaking, we can define the return object from get_band_data as

return_object = namedtuple("observation_return_object",
                           "observations uncertainty mask metadata emulator")

Currently, the observations and the mask are 2D arrays (remember they refer to a single band and time step), resampled to match the extent and resolution of the state mask. The mask refers to pixels that are invalid due to the observations (e.g. clouds and so on), not due to the state mask. The metadata usually contains information such as imaging geometry; broadly speaking, information required to run the observation operator but that is not part of the state parameters. The emulator is a reference to an emulator for the relevant band. Uncertainty here is provided as a CSR-format sparse matrix, and it stores the inverse observation covariance for the current band. In future versions, information about the point spread function (PSF) should also be included, although I've yet to figure out how this is best encoded...
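
Putting the above together, a returned data chunk might be built like this. This is a hedged sketch: the array sizes, the metadata keys and the placeholder emulator are all made up for illustration, and a real implementation would read these from actual EO files:

```python
from collections import namedtuple

import numpy as np
import scipy.sparse as sp

ObservationData = namedtuple(
    "ObservationData", "observations uncertainty mask metadata emulator")

ny, nx = 4, 5  # must match the extent/resolution of the state mask
observations = np.random.rand(ny, nx)  # single band, single time step
mask = np.ones((ny, nx), dtype=bool)   # True where the observation is valid
mask[0, 0] = False                     # e.g. a cloudy pixel

# Inverse observation covariance for this band as a CSR sparse matrix
# (diagonal here, assuming uncorrelated per-pixel uncertainties).
inv_obs_cov = sp.eye(ny * nx, format="csr") / (0.01 ** 2)

metadata = {"sza": 30.0, "vza": 10.0}  # hypothetical imaging geometry
emulator = None                        # stand-in for a per-band GP emulator

data_chunk = ObservationData(observations, inv_obs_cov, mask,
                             metadata, emulator)
```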

The previous discussion suggests that an Observations class will need some setup that will need to be aware of the state mask, and will be able to reproject the actual observations to match that mask, as done here.

Prior information

In a system such as that envisaged by KaFKA, normality assumptions result in all prior information being defined by a mean state vector and an associated inverse covariance matrix. Element ordering and more details of this are given above.

Further, there are two ways of providing prior information. One is to query e.g. a database to provide the prior mean and inverse covariance matrix data, and another is to base the prior information on the last inferred state. While the former approach is more generic, the latter is more practical for parameters where little information is available, or for parameters with temporal trajectories that can be easily parameterised by a simple (linear) model. The first type is what we later term "the prior class", and in this case, it can be completely independent of any past inferences made by KaFKA. The state propagator, however, takes the last inference as a starting point. Both approaches are detailed below.

Using the priors with LinearKalman

Quite simply, you just pass the prior class instance and/or the state propagation function to the LinearKalman constructor:

    kf = LinearKalman(bhr_data, output, mask,
                      create_nonlinear_observation_operator, parameter_list,
                      state_propagation=state_propagator,
                      prior=the_prior,
                      linear=False)

Both the state_propagator function and the prior instance are passed in. Details of their individual interfaces are given below. If you don't want to use one or the other, you can set it to None (but at least one of them must always be set).

The prior class

Again, the class is quite flexible, and only requires a single method to exist with a defined interface:

  • process_prior(self, time, inv_cov=True), where time is a timestep as a datetime.datetime object, and the inv_cov flag chooses between returning the inverse covariance matrix or the covariance matrix. The method should return the state mean and either the covariance or the inverse covariance matrix (as a sparse matrix). The ordering of the elements is as defined above.

Note that if the prior information is retrieved from files, the prior class will likely need access to the state mask (so you can e.g. use on-the-fly reprojecting/resampling like a champ). But that might not be necessary in all circumstances.
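
A minimal sketch of such a prior class, assuming the simplest possible case of a time-invariant prior (class name, contents and sizes are all hypothetical; a real prior would query a database and match the state mask):

```python
import datetime

import numpy as np
import scipy.sparse as sp

class ConstantPrior(object):
    """A minimal, hypothetical prior: the same Gaussian for every time step."""

    def __init__(self, mean, inv_cov):
        self.mean = mean        # state mean vector, ordered as above
        self.inv_cov = inv_cov  # sparse inverse covariance matrix

    def process_prior(self, time, inv_cov=True):
        # `time` is a datetime.datetime; unused here, but a real prior
        # could return different values per time step.
        if inv_cov:
            return self.mean, self.inv_cov
        # Invert only when the covariance is explicitly requested.
        return self.mean, sp.csr_matrix(np.linalg.inv(self.inv_cov.toarray()))

n_elements = 6  # N parameters x P pixels for a toy problem
prior = ConstantPrior(np.zeros(n_elements),
                      sp.eye(n_elements, format="csr") * 4.0)
mu, p_inv = prior.process_prior(datetime.datetime(2017, 5, 1))
```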

The state propagator

Between two time steps, the state can change according to a dynamical model. This is also part of the prior, and in KaFKA we take this information to be independent of the other, more traditional type of prior information. The strategy for propagating the state can differ substantially from problem to problem, so KaFKA expects a function (functions are first-class objects in Python, so we just pass one around) to propagate the state mean vector and inverse covariance matrix between two steps. This is done by setting the state_propagation argument of LinearKalman to this function (or None if you don't want to consider this).

The general interface to the state propagator is

state_propagator(x_analysis, P_analysis, 
                 P_analysis_inverse, M_matrix, Q_matrix)

where x_analysis is the state mean vector at the last time step; P_analysis is the covariance matrix associated with x_analysis, but will more often than not be None; and P_analysis_inverse is the inverse covariance matrix associated with x_analysis, which can also be None (at least one of P_analysis and P_analysis_inverse needs to be defined). M_matrix is the dynamic model propagating the state from one time step to the next. This is very flexible, provided that the model is linear. Q_matrix is the model uncertainty inflation matrix: as a consequence of using an inaccurate model, we expect additional uncertainty to creep into the estimate of the state, and this matrix encodes that concept.

Note that the simplest possible state propagation is one where the model (M_matrix) is identity and the Q_matrix is a diagonal matrix. This basically means that we assume the mean state doesn't change, but we just increase uncertainty around it.
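
This simplest case can be sketched as follows. Note this is an assumed implementation, not KaFKA's own: the return convention (mean, covariance, inverse covariance, mirroring the inputs) is my guess, and the dense matrix inversion is for clarity only; a real implementation would exploit sparsity:

```python
import numpy as np
import scipy.sparse as sp

def identity_state_propagator(x_analysis, P_analysis, P_analysis_inverse,
                              M_matrix, Q_matrix):
    """Simplest propagation: identity model, inflated uncertainty.

    The mean is unchanged (M is the identity) and the covariance is
    inflated by Q, so the forecast inverse covariance is (P + Q)^-1.
    Hypothetical sketch; dense inversion for clarity only."""
    x_forecast = M_matrix @ x_analysis  # identity M => mean unchanged
    if P_analysis is None:
        P_analysis = np.linalg.inv(P_analysis_inverse.toarray())
    P_forecast = P_analysis + Q_matrix.toarray()
    P_forecast_inverse = sp.csr_matrix(np.linalg.inv(P_forecast))
    return x_forecast, None, P_forecast_inverse

n = 4
x = np.ones(n)
M = sp.eye(n, format="csr")            # identity dynamic model
Q = sp.eye(n, format="csr") * 0.1      # diagonal uncertainty inflation
x_f, _, P_f_inv = identity_state_propagator(
    x, None, sp.eye(n, format="csr"), M, Q)  # previous inverse cov = I
```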

Combining the prior and the state propagation

In case both the prior and the state propagator are available, they will be combined internally (they are two independent multivariate gaussians), but note that they truly must be (statistically speaking) independent.
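
For two independent multivariate Gaussians, this combination amounts to adding the inverse covariances and precision-weighting the means. A small numeric sketch of that algebra (not KaFKA's internal code; the numbers are made up):

```python
import numpy as np

# Two independent Gaussian sources of information on the same state:
# N(mu_a, A^-1) from e.g. a climatological prior,
# N(mu_b, B^-1) from propagating the previous inference.
mu_a = np.array([0.5, 0.2])
A = np.diag([4.0, 1.0])  # inverse covariance of source a
mu_b = np.array([0.3, 0.4])
B = np.diag([1.0, 4.0])  # inverse covariance of source b

# Their product is Gaussian with:
C = A + B                                       # combined inverse covariance
mu_c = np.linalg.solve(C, A @ mu_a + B @ mu_b)  # precision-weighted mean
```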

The data dumper class

This is the class used for writing the output to disk, and it too only requires a single compulsory method:

  • dump_data(timestep, x_analysis, P_analysis, P_analysis_inv, state_mask), which dumps the data to disk. This method doesn't return anything. Currently, the format isn't defined. Clearly, a raster file per parameter would be sensible and in line with common use (e.g. a GeoTIFF for parameter 1). The covariance matrix could be stored as a combination of the main diagonal (the variances), written as individual files of type e.g. float32, and the per-pixel upper-diagonal elements (ignoring interactions between pixels). For the latter, one could store the correlations multiplied by 100 (i.e. quantised to steps of 0.01), so that each quantised correlation fits in a signed byte. Two assumptions here: that we can quantise the correlations (I'm happy with that), and that between-pixel interactions are ignored (OK for the time being, might be an issue later on). At some point, maybe investigate matrix compression techniques.
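
The quantisation step described above can be sketched like this (the function name is hypothetical, and the raster-writing part of dump_data is left out):

```python
import numpy as np

def quantise_correlations(corr):
    """Quantise correlations in [-1, 1] to a signed byte.

    Following the scheme sketched above: multiply by 100 (a step of
    0.01 in correlation), so every value fits in an int8."""
    return np.clip(np.round(corr * 100), -100, 100).astype(np.int8)

corr = np.array([-1.0, -0.123, 0.0, 0.4567, 1.0])
q = quantise_correlations(corr)           # int8 values in [-100, 100]
recovered = q.astype(np.float32) / 100.0  # back to correlations, 0.01 step
```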

Setting up the inference engine

We use the LinearKalman class. The setup takes the observations and output objects, as well as a description of a "state mask". The state mask is a 2D boolean array that flags the pixels over the region of interest where the inference is to be done.