Architecture#

Currently arviz_stats has some top level general functionality, and then submodules that take care of the actual computations. Submodules can be completely independent or build on top of one another.

Architecture diagram of ArviZ stats

Top level functionality#

This includes the top level functions that accept multiple types as input, the accessors for dataarray, Dataset and Datatree objects, the dispatcher mechanism and some general dataclasses and utilities like ELPDData.

Computation submodules#

Computation submodules are structured into two main classes: an array facing class and a dataarray facing class. All submodules should have both available, however, if two modules are somewhat similar, one can limit itself to inherit a class from the other without any dedicated implementation.

The array_stats class instance#

The array facing class takes array_like inputs, and aims to have an API similar to NumPy/SciPy. It can be used independently of the dataarray class (to the point of not needing to have arviz_base nor xarray installed) but it is (consequently) lower level interface. There are more required arguments, no rcParams integration…

Within this class, functions are generally defined as “atomic” methods which take the lowest dimension array possible and the proper array functions, also called ufuncs, which take arbitrary dimensionality arrays and perform batched computation as needed. This conversion from atomic functions to array functions happens through utility functions implemented within the same module.

To make integration with the dataarray facing class easier, all array functions should take axes (or equivalent arguments such as chain_axis and draw_axis) which should allow either integers or sequences of them for the functions to work over, batching over the rest of the axes. It is also imperative that whenever new axes are added to an array, these are added as ending dimensions; otherwise interfacing with xarray via xarray.apply_ufunc won’t behave correctly.

The dataarray_stats class instance#

The dataarray facing class, builds on top of the array facing one, and takes DataArray inputs aiming to have a more xarray-like and arvizian API. Among other things, this means that the order of the dimensions shouldn’t matter, only their names, it should use defaults defined in arviz_base.rcParams. As the array facing class API is defined and should be common between submodules, that means that this class can very often be limited to an instance of the base array facing class.

On dim, dims and sample_dims#

At one point in time, dim and dims coexisted as equivalent arguments, some functions had one, other the other but their behaviour was exactly the same. Then the xarray developers decided to enforce consistency and use dim everywhere. We decided to have arviz-stats follow that convention.

sample_dims is not another name for the same argument but a different concept altogether. Both in xarray and in ArviZ sample_dims are expected to be present in all variables of the input. Let’s see an example.

Suppose we have a xarray.Dataset ds with multiple variables:

  • mu (chain: 4, draw: 100)

  • theta (chain: 4, draw: 100, school: 8)

It is perfectly valid to do ds.mean(dim=["chain", "school"]) even though “school” is not a dimension in mu. In the output both mu and theta will have only the “draw” dimension. That is, the operation applied to each variable was different, we reduced a single dimension, “chain”, in mu and two dimensions, “chain” and “school” in theta. A similar thing would happen with ds.azstats.kde(dim=["chain", "school"]), we would compute the KDE over the “chain” dimension only for mu and over the stacked “chain”+”school” dimension for theta.

On the other hand, ds.to_stacked_array(sample_dims=["chain", "school"]) or ds.azstats.ess(sample_dims=["chain", "school"]) is not valid. sample_dims must be present in all variables of the input, but mu doesn’t have a the “school” dimension. With dim we can make many combinations of “chain”, “draw” and “school”, but with sample_dims there are only 3 valid options: "chain", "draw", or ["chain", "draw"].

As a final note, keep in mind that while there are more valid combinations of dim than there are for sample_dims that doesn’t mean that any combination of “chain”, “draw” and “school” will be valid as dim. This will depend on the function. There are functions that need at least 1 dimension to operate over. The function to compute the KDE for example needs >=1 dimensions to reduce whereas the mean also works on 0d arrays.

Consequently, dim="school" would be valid for .mean but not for .kde as that second case would imply computing the KDE over nothing for mu which is not supported. There are still many more valid cases than sample_dims though: ["chain", "draw", "school"] or ["draw", "school"] would not be valid as sample_dims but are valid dim values for .kde.

Specific implementations#

Base (aka numpy+scipy)#

This is the core backend which should have most functionality available and that defines the general API for both array and dataarray facing classes.

Numba#

The numba submodule builds on top of the base submodule, using numba to both accelerate computations and generate better behaved ufuncs, ensuring compatibility with Dask for example.

Note

In a large percentage of cases, functions from arviz-stats are used to compute plot elements. Therefore, while reimplementing the most expensive operations can speed things up, be sure to profile both stats and plotting functions to make sure it will actually provide a noticeable speed-up before dedicating too much time to it.