Overview#
ArviZ-stats is the computing library of ArviZ. It can be used through two interfaces:

- The xarray interface, which is used to compute statistics on xarray objects.
- The array interface, which is used to compute statistics on NumPy arrays.
The array interface is low-level and has minimal dependencies; it is intended for advanced users and developers of third-party libraries. In contrast, the xarray interface is designed for end users and simplifies usage by automating common tasks and handling metadata. In this overview we focus on the xarray interface. For a description of the array interface, please refer to the array interface guide.
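As a quick taste of the array interface, the sketch below computes the effective sample size directly on a NumPy array. The import path and the chain_axis/draw_axis arguments are assumptions based on the array interface guide, so check that guide for the authoritative API.

import numpy as np
from arviz_stats.base import array_stats  # assumed import path, see the array interface guide

rng = np.random.default_rng(0)
samples = rng.normal(size=(4, 500))  # hypothetical 4 chains x 500 draws
array_stats.ess(samples, chain_axis=0, draw_axis=1)  # assumed signature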
Getting Started with ArviZ-stats#
To start, we’ll import ArviZ.
import arviz_base as azb
import arviz_stats as azs
To make the example easier to follow we are going to skip the modeling step and assume we have already built and fitted a Bayesian model. ArviZ provides several pre-saved models, so we can use one of those to demonstrate the statistics functionality. In this case we will use the centered_eight model, which is a classic, but even if you are not familiar with it, you can still follow along and see how the computations work.
data = azb.load_arviz_data("centered_eight")
A common task in Bayesian data analysis is to compute posterior summaries. For instance, we can use the summary function to summarize the posterior distribution of a model. By default this function computes a series of statistics, including the mean, standard deviation, and credible intervals (defaulting to the equal-tailed interval, eti) for each variable in the model, as well as a series of diagnostics, including the effective sample size (ess), the R-hat statistic (r_hat), and the Monte Carlo standard error (mcse).
azs.summary(data)
label | mean | sd | eti94_lb | eti94_ub | ess_bulk | ess_tail | r_hat | mcse_mean | mcse_sd |
---|---|---|---|---|---|---|---|---|---|
mu | 4.49 | 3.49 | -1.97 | 10.54 | 240.99 | 484.87 | 1.02 | 0.23 | 0.16 |
theta[Choate] | 6.46 | 5.87 | -3.51 | 19.09 | 365.05 | 689.57 | 1.01 | 0.30 | 0.35 |
theta[Deerfield] | 5.03 | 4.88 | -4.07 | 14.61 | 427.32 | 810.53 | 1.01 | 0.23 | 0.18 |
theta[Phillips Andover] | 3.94 | 5.69 | -7.77 | 13.67 | 514.72 | 753.27 | 1.01 | 0.23 | 0.24 |
theta[Phillips Exeter] | 4.87 | 5.01 | -4.78 | 14.39 | 337.18 | 817.24 | 1.01 | 0.26 | 0.22 |
theta[Hotchkiss] | 3.67 | 4.95 | -6.19 | 12.17 | 365.35 | 693.51 | 1.01 | 0.25 | 0.20 |
theta[Lawrenceville] | 3.97 | 5.19 | -6.55 | 13.16 | 521.46 | 942.11 | 1.01 | 0.22 | 0.21 |
theta[St. Paul's] | 6.58 | 5.10 | -2.25 | 17.51 | 275.68 | 487.26 | 1.01 | 0.30 | 0.25 |
theta[Mt. Hermon] | 4.77 | 5.74 | -6.35 | 15.57 | 451.86 | 721.84 | 1.01 | 0.26 | 0.27 |
tau | 4.12 | 3.10 | 0.93 | 11.80 | 66.57 | 44.00 | 1.06 | 0.26 | 0.33 |
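Several of these defaults can be tuned per call. For instance, combining arguments that appear later in this overview, we can restrict the output to the statistics block and use a 90% credible interval:

azs.summary(data, var_names=["mu"], kind="stats", ci_prob=0.9)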
In this example there are 3 variables (or parameters): mu, tau, and theta. The first two are unidimensional, while theta is a vector of length 8, so in total we get 10 rows. If we want to focus on a specific variable or a subset of variables, we can use the var_names argument to specify which ones to display. Simply pass the names of the variables you’d like to summarize.
azs.summary(data, var_names=["mu", "tau"])
label | mean | sd | eti94_lb | eti94_ub | ess_bulk | ess_tail | r_hat | mcse_mean | mcse_sd |
---|---|---|---|---|---|---|---|---|---|
mu | 4.49 | 3.49 | -1.97 | 10.54 | 240.99 | 484.87 | 1.02 | 0.23 | 0.16 |
tau | 4.12 | 3.10 | 0.93 | 11.80 | 66.57 | 44.00 | 1.06 | 0.26 | 0.33 |
Alternatively, we can negate the variables for which we do not want to calculate summaries.
azs.summary(data, var_names=["~theta"])
label | mean | sd | eti94_lb | eti94_ub | ess_bulk | ess_tail | r_hat | mcse_mean | mcse_sd |
---|---|---|---|---|---|---|---|---|---|
mu | 4.49 | 3.49 | -1.97 | 10.54 | 240.99 | 484.87 | 1.02 | 0.23 | 0.16 |
tau | 4.12 | 3.10 | 0.93 | 11.80 | 66.57 | 44.00 | 1.06 | 0.26 | 0.33 |
Let’s say we now want to do something else, like computing the root mean square error (RMSE) between the model predictions and the observed data. We can use the metrics function to compute this statistic.
azs.metrics(data)
rmse(mean=9.6, se=2.6)
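To make this result less of a black box, here is a rough sketch of what an RMSE computation involves, written with plain xarray operations; the exact estimator used by metrics, including how the standard error is obtained, may differ.

# Squared error of each posterior predictive draw against the observed data,
# averaged over schools and square-rooted: one RMSE value per draw.
pp = data.posterior_predictive["obs"]  # dims: chain, draw, school
obs = data.observed_data["obs"]        # dims: school
rmse_per_draw = ((pp - obs) ** 2).mean(dim="school") ** 0.5
# Summarize across draws, to compare with the output of metrics.
print(float(rmse_per_draw.mean()), float(rmse_per_draw.std()))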
ArviZ and DataTree#
So far we have seen two functions; they both use the same input data but generate different results. This is a common pattern in ArviZ: we pass an object like data and ArviZ performs different computations based on the function we call. This allows us to easily switch between functions without having to change the underlying data structure. Internally, ArviZ selects the necessary subset of the data to perform the computations, so we don’t need to worry about how the data is organized or how to extract the relevant information for each function, unless we want to.
In this example, data is a DataTree object: a flexible and efficient structure for storing and manipulating data in ArviZ. While DataTrees are used extensively in ArviZ, they are not specific to it; they are actually a core data structure in the xarray library. We rely on DataTrees because they offer a powerful way to represent multi-dimensional data with labeled axes, making it straightforward to work with complex datasets, including those generated during a Bayesian workflow, such as posterior samples, prior samples, and sampling statistics. This structure promotes clear organization and simplifies both data access and manipulation.
To keep this introduction short and simple we are going to focus on some very general concepts related to DataTrees in the context of ArviZ. Details on how to work with DataTrees, and a description of other data structures used by ArviZ, can be found in the Working With DataTree guide.
At a high level, a DataTree is a hierarchical structure made of groups. The number of groups varies, but within ArviZ we have a naming convention for them. For instance, the posterior samples are stored in a group called posterior, the prior samples in a group called prior, and the sampling statistics (quantities related to the inner workings of the samplers) in a group called sample_stats. This allows for a clear organization of the data, making it easy to access and manipulate.
We can see that for the centered_eight
model we have several groups:
data.groups
('/',
'/posterior',
'/posterior_predictive',
'/log_likelihood',
'/sample_stats',
'/prior',
'/prior_predictive',
'/observed_data',
'/constant_data')
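Groups can be accessed either as attributes or with dictionary-style indexing, both standard DataTree behaviour:

data.posterior      # attribute access
data["posterior"]   # equivalent dictionary-style access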
Each group can contain multiple variables; we already saw that we have mu, theta, and tau. And each variable can have multiple dimensions. For instance, the variable mu has a dimension called chain, which represents the different chains used in the MCMC sampling process, and a dimension called draw, which represents the different samples drawn from the posterior distribution. theta has these two dimensions plus a dimension called school, which represents each of the 8 schools in the model.
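Because each group is a regular xarray object, we can inspect these dimensions directly:

data.posterior["theta"].sizes
# Frozen({'chain': 4, 'draw': 500, 'school': 8})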
data.posterior.data_vars
Data variables:
mu (chain, draw) float64 16kB 7.872 3.385 9.1 ... 1.767 3.486 3.404
theta (chain, draw, school) float64 128kB 12.32 9.905 ... 6.762 1.295
tau (chain, draw) float64 16kB 4.726 3.909 4.844 ... 2.741 2.932 4.461
This is a more complete view of the content of the posterior group:
data.posterior
<xarray.DatasetView> Size: 165kB
Dimensions:  (chain: 4, draw: 500, school: 8)
Coordinates:
  * chain    (chain) int64 32B 0 1 2 3
  * draw     (draw) int64 4kB 0 1 2 3 4 5 6 7 ... 493 494 495 496 497 498 499
  * school   (school) <U16 512B 'Choate' 'Deerfield' ... 'Mt. Hermon'
Data variables:
    mu       (chain, draw) float64 16kB 7.872 3.385 9.1 ... 1.767 3.486 3.404
    theta    (chain, draw, school) float64 128kB 12.32 9.905 ... 6.762 1.295
    tau      (chain, draw) float64 16kB 4.726 3.909 4.844 ... 2.741 2.932 4.461
Attributes:
    created_at:                 2022-10-13T14:37:37.315398
    arviz_version:              0.13.0.dev0
    inference_library:          pymc
    inference_library_version:  4.2.2
    sampling_time:              7.480114936828613
    tuning_steps:               1000
Exploring groups, dimensions and coordinates#
By default the xarray interface will use one or more of the available groups. For instance, the summary function will use the posterior group, while the metrics function will use both the posterior_predictive and the observed_data groups. These two functions combine the chain and draw dimensions to produce the final computations. That explains, for example, why we get a single mean for each variable when calling summary even though the posterior is represented using 2000 samples (4 chains × 500 draws). But notice that the dimension school is not “reduced”; instead we get a different “mean” (and other quantities) for each school.
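For instance, the mean column of the first summary corresponds to the following plain xarray reduction (a sketch of the concept, not of how summary is implemented internally):

# Reduce chain and draw while keeping school: one mean per school.
data.posterior["theta"].mean(dim=("chain", "draw"))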
All these defaults can be changed, as we already saw with the var_names argument. Usually ArviZ functions will have a group argument. The metrics function does not have one, as its computations are always done using the posterior_predictive and the observed_data groups, but summary has it and allows you to specify many different groups.

Let’s set group to posterior_predictive. Can you anticipate what we will get?
azs.summary(data, group="posterior_predictive")
label | mean | sd | eti94_lb | eti94_ub | ess_bulk | ess_tail | r_hat | mcse_mean | mcse_sd |
---|---|---|---|---|---|---|---|---|---|
obs[Choate] | 6.65 | 15.86 | -22.79 | 37.42 | 1370.99 | 1393.53 | 1.00 | 0.43 | 0.32 |
obs[Deerfield] | 4.88 | 11.25 | -16.51 | 25.31 | 1279.14 | 1941.95 | 1.00 | 0.31 | 0.22 |
obs[Phillips Andover] | 4.26 | 16.42 | -26.62 | 35.04 | 1673.19 | 1725.26 | 1.00 | 0.40 | 0.28 |
obs[Phillips Exeter] | 5.00 | 12.27 | -18.39 | 28.23 | 1646.09 | 1679.22 | 1.00 | 0.30 | 0.22 |
obs[Hotchkiss] | 3.46 | 9.93 | -14.98 | 21.75 | 1600.17 | 1734.48 | 1.01 | 0.25 | 0.18 |
obs[Lawrenceville] | 3.42 | 12.22 | -20.03 | 26.15 | 1537.25 | 1715.13 | 1.00 | 0.31 | 0.22 |
obs[St. Paul's] | 6.36 | 11.16 | -15.47 | 27.07 | 1440.17 | 1544.38 | 1.00 | 0.29 | 0.20 |
obs[Mt. Hermon] | 4.92 | 18.90 | -32.06 | 39.69 | 1948.58 | 1917.24 | 1.00 | 0.43 | 0.31 |
We get predictions of the model at each observation. In this example, we have 8 observations, so we get 8 summaries.
ArviZ does not come with guarantees that any possible combination of arguments you may want to try will work, or that the generated result will be sensible or useful. The defaults usually are, but beyond them you are free to explore the data as you wish and take responsibility for the results.
For instance
azs.summary(data, group="sample_stats")
will not work, because the sample_stats
group does not contain samples to summarize. However,
azs.summary(data, group="log_likelihood")
will work, and it will compute the summary for the log-likelihood per observation. How useful this is depends on the context of the analysis.
Sample_dims#
Now, we are going to discuss the sample_dims
argument. This argument is used to specify which dimensions of the data should be reduced when doing calculations. An example will help us understand this better. Let’s say we want to use the summary
function to summarize the observed data. If we do:
azs.summary(data, group="observed_data")
We will get an error; instead, we should write:
azs.summary(data, group="observed_data", sample_dims=["school"], kind="stats")
label | mean | sd | eti94_lb | eti94_ub |
---|---|---|---|---|
obs | 8.75 | 9.77 | -2.58 | 25.9 |
We need to add these two arguments for two separate reasons:

1. By default, sample_dims is set to ["chain", "draw"], but those dimensions are not present (or meaningful) in the observed_data group. By passing sample_dims=["school"], we tell ArviZ to reduce the school dimension, which is the only dimension present in the observed_data group.
2. Conceptually, it does not make sense to compute sampling diagnostics on observed data, as they are not samples generated using MCMC methods. Additionally, we would run into issues if we tried to compute the r_hat diagnostic, as it assumes we have at least two chains. As a cross-check, the same reduction can be written directly with xarray, as sketched below.
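Small numerical differences with respect to the summary output (for example in the ddof used for the standard deviation) are possible.

obs = data.observed_data["obs"]
print(float(obs.mean(dim="school")), float(obs.std(dim="school")))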
If we were interested in summarizing the posterior distribution for mu per chain, we could do:
azs.summary(data,
var_names=["mu"],
kind="stats",
sample_dims=["draw"])
label | mean | sd | eti94_lb | eti94_ub |
---|---|---|---|---|
mu[0] | 4.25 | 3.40 | -3.01 | 10.43 |
mu[1] | 4.18 | 3.24 | -1.54 | 10.14 |
mu[2] | 4.66 | 3.81 | -2.05 | 11.15 |
mu[3] | 4.85 | 3.43 | -1.55 | 10.45 |
We can also compute the summary for sample_dims=["draw", "school"]:
azs.summary(data,
kind="stats",
sample_dims=["draw", "school"])
label | mean | sd | eti94_lb | eti94_ub |
---|---|---|---|---|
mu[0] | 4.25 | 3.40 | -3.01 | 10.43 |
mu[1] | 4.18 | 3.24 | -1.54 | 10.14 |
mu[2] | 4.66 | 3.81 | -2.05 | 11.15 |
mu[3] | 4.85 | 3.43 | -1.55 | 10.45 |
theta[0] | 4.63 | 5.04 | -4.96 | 14.09 |
theta[1] | 4.64 | 5.31 | -5.41 | 15.12 |
theta[2] | 5.17 | 5.87 | -6.04 | 16.27 |
theta[3] | 5.21 | 5.38 | -4.87 | 15.40 |
tau[0] | 3.68 | 2.71 | 1.05 | 10.89 |
tau[1] | 4.25 | 3.15 | 0.90 | 11.44 |
tau[2] | 4.66 | 3.27 | 1.37 | 13.96 |
tau[3] | 3.91 | 3.17 | 1.13 | 11.60 |
Notice that draw is present in all three variables, but school is only present in theta. Thus, we get a summary for mu and tau that is reduced over the draw dimension, while for theta we get a summary that is reduced over both draw and school.
Coords#
We can use the coords argument to specify which coordinates we want to use for the computations. This is useful to subset the data along one or more dimensions. For instance, if we want to summarize the posterior distribution of only a subset of schools, we can do:
azs.summary(data,
var_names=["mu", "theta"],
coords={"school": ["Choate", "Deerfield"]},
)
label | mean | sd | eti94_lb | eti94_ub | ess_bulk | ess_tail | r_hat | mcse_mean | mcse_sd |
---|---|---|---|---|---|---|---|---|---|
mu | 4.49 | 3.49 | -1.97 | 10.54 | 240.99 | 484.87 | 1.02 | 0.23 | 0.16 |
theta[Choate] | 6.46 | 5.87 | -3.51 | 19.09 | 365.05 | 689.57 | 1.01 | 0.30 | 0.35 |
theta[Deerfield] | 5.03 | 4.88 | -4.07 | 14.61 | 427.32 | 810.53 | 1.01 | 0.23 | 0.18 |
Notice that the coords argument is used to select specific schools, and thus it does not affect variables without the school dimension, like mu.
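For comparison, the coords argument corresponds to an xarray-style label-based selection:

# Select two schools by label; mu is unaffected since it lacks the school dimension.
data.posterior["theta"].sel(school=["Choate", "Deerfield"])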
Global defaults#
ArviZ has a set of defaults that apply globally, meaning you can change them in one place and affect the behaviour of many functions. ArviZ’s functions, like summary, take the default values for sample_dims and other arguments from here. They are stored in azb.rcParams. Can you spot the default values for the interval kind and its probability?
azb.rcParams
RcParams({'data.http_protocol': 'https',
'data.index_origin': 0,
'data.log_likelihood': False,
'data.sample_dims': ('chain', 'draw'),
'data.save_warmup': False,
'plot.backend': 'matplotlib',
'plot.bokeh.bounds_x_range': 'auto',
'plot.bokeh.bounds_y_range': 'auto',
'plot.bokeh.figure.dpi': 60,
'plot.bokeh.figure.height': 500,
'plot.bokeh.figure.width': 500,
'plot.bokeh.layout.order': 'default',
'plot.bokeh.layout.sizing_mode': 'fixed',
'plot.bokeh.layout.toolbar_location': 'above',
'plot.bokeh.marker': 'cross',
'plot.bokeh.output_backend': 'webgl',
'plot.bokeh.tools': 'reset,pan,box_zoom,wheel_zoom,lasso_select,undo,save,hover',
'plot.density_kind': 'kde',
'plot.max_subplots': 40,
'stats.ci_kind': 'eti',
'stats.ci_prob': 0.94,
'stats.ic_compare_method': 'stacking',
'stats.ic_pointwise': True,
'stats.ic_scale': 'log',
'stats.information_criterion': 'loo',
'stats.module': 'base',
'stats.point_estimate': 'mean'})
If we don’t like the default value of 0.94 for the credible interval, we can change it globally:
azb.rcParams["stats.ci_prob"] = 0.95
This will change the default value for the credible interval to 0.95 for all subsequent calls to summary and other functions that use this parameter. Alternatively, we can change it per call:
azs.summary(data, ci_prob=0.95)
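If we only want the new default within a limited scope, a context manager can be used. Here we assume arviz_base exposes rc_context, as legacy ArviZ does; treat this as an assumption and check the arviz-base documentation.

# Temporarily override the default; outside the block the global value applies again.
with azb.rc_context({"stats.ci_prob": 0.9}):
    print(azs.summary(data, var_names=["mu"], kind="stats"))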
See also
- arviz-base for data related functionality, including converters from different PPLs.
- arviz-plots for visualizing statistical summaries, diagnostics, and other estimators.
- The Exploratory Analysis of Bayesian Models online book.
- The array interface tutorial, for a description of the array interface.