Overview#

ArviZ-stats is the computing library of ArviZ. It can be used through two interfaces:

  • The xarray interface, which is used to compute statistics on xarray objects.

  • The array interface, which is used to compute statistics on numpy arrays.

The array interface is low-level and has minimal dependencies; it is intended for advanced users and developers of third-party libraries. In contrast, the xarray interface is designed for end users and simplifies usage by automating common tasks and handling metadata. In this overview, we will focus on the xarray interface. For a description of the array interface, please refer to the array interface guide.
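To give a quick taste of the array interface before moving on, here is a minimal, self-contained sketch. Both the entry point (arviz_stats.base.array_stats) and the chain_axis/draw_axis argument names are assumptions based on the array interface guide, so treat this as illustrative and check that guide for the exact API.

import numpy as np
from arviz_stats.base import array_stats  # assumed entry point

# With raw arrays we must state ourselves which axes hold chains and draws.
samples = np.random.default_rng(0).normal(size=(4, 500))  # (chain, draw)
array_stats.ess(samples, chain_axis=0, draw_axis=1)  # assumed signature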

Getting Started with ArviZ-stats#

To start, we’ll import the two ArviZ packages we’ll be using: arviz_base and arviz_stats.

import arviz_base as azb

import arviz_stats as azs

To make the example easier to follow, we are going to skip the modeling step and assume we have already built and fit a Bayesian model. ArviZ provides several pre-saved models, so we can use one of those to demonstrate the statistics functionality. In this case we will use the centered_eight model, which is a classic; even if you are not familiar with it, you can still follow along and see how the computations work.

data = azb.load_arviz_data("centered_eight")

A common task in Bayesian data analysis is to compute posterior summaries. For instance, we can use the summary function to summarize the posterior distribution of a model. By default this function computes a series of statistics, including the mean, standard deviation, and a credible interval (by default an equal-tailed interval, eti) for each variable in the model. It also computes a series of diagnostics, including the effective sample size (ess), the R-hat statistic (r_hat), and the Monte Carlo standard error (mcse).

azs.summary(data)
summary mean sd eti94_lb eti94_ub ess_bulk ess_tail r_hat mcse_mean mcse_sd
label
mu 4.49 3.49 -1.97 10.54 240.99 484.87 1.02 0.23 0.16
theta[Choate] 6.46 5.87 -3.51 19.09 365.05 689.57 1.01 0.30 0.35
theta[Deerfield] 5.03 4.88 -4.07 14.61 427.32 810.53 1.01 0.23 0.18
theta[Phillips Andover] 3.94 5.69 -7.77 13.67 514.72 753.27 1.01 0.23 0.24
theta[Phillips Exeter] 4.87 5.01 -4.78 14.39 337.18 817.24 1.01 0.26 0.22
theta[Hotchkiss] 3.67 4.95 -6.19 12.17 365.35 693.51 1.01 0.25 0.20
theta[Lawrenceville] 3.97 5.19 -6.55 13.16 521.46 942.11 1.01 0.22 0.21
theta[St. Paul's] 6.58 5.10 -2.25 17.51 275.68 487.26 1.01 0.30 0.25
theta[Mt. Hermon] 4.77 5.74 -6.35 15.57 451.86 721.84 1.01 0.26 0.27
tau 4.12 3.10 0.93 11.80 66.57 44.00 1.06 0.26 0.33

In this example, there are 3 variables (or parameters): mu, tau, and theta. The first two are scalars, while theta is a vector of length 8; thus, in total we get 10 rows. If we want to focus on a specific variable or a subset of variables, we can use the var_names argument to specify which ones to display. Simply pass the names of the variables you’d like to summarize.

azs.summary(data, var_names=["mu", "tau"])
summary mean sd eti94_lb eti94_ub ess_bulk ess_tail r_hat mcse_mean mcse_sd
label
mu 4.49 3.49 -1.97 10.54 240.99 484.87 1.02 0.23 0.16
tau 4.12 3.10 0.93 11.80 66.57 44.00 1.06 0.26 0.33

or negate those for which we do not want to calculate summaries.

azs.summary(data, var_names=["~theta"])
summary mean sd eti94_lb eti94_ub ess_bulk ess_tail r_hat mcse_mean mcse_sd
label
mu 4.49 3.49 -1.97 10.54 240.99 484.87 1.02 0.23 0.16
tau 4.12 3.10 0.93 11.80 66.57 44.00 1.06 0.26 0.33

Let’s say that we now want to do something else, like computing the root mean squared error (rmse) between the model predictions and the observed data. We can use the metrics function to compute this statistic.

azs.metrics(data)
rmse(mean=9.6, se=2.6)
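To make the connection with the data explicit, the sketch below reproduces, approximately, what this rmse corresponds to: reducing the posterior predictive samples to one prediction per observation and comparing against the observed data. This is an illustrative approximation; the exact definition used by metrics may differ.

import numpy as np

# One prediction per observation: average over the sampling dimensions...
pred_mean = data.posterior_predictive["obs"].mean(dim=["chain", "draw"])
# ...then take the root mean squared difference with the observations.
rmse = float(np.sqrt(((pred_mean - data.observed_data["obs"]) ** 2).mean()))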

ArviZ and DataTree#

So far we have seen two functions; they both use the same input data but generate different results. This is a common pattern in ArviZ: we pass an object like data and ArviZ performs different computations based on the function we call. This allows us to easily switch between different functions without having to change the underlying data structure. Internally, ArviZ selects the necessary subset of the data to perform the computations. Thus, we don’t need to worry about the details of how the data is organized or how to extract the relevant information for each function, unless we want to.
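For instance, assuming the diagnostics are also exposed as top-level functions of arviz_stats (as in its API reference), they follow the same pattern: same input object, different computation.

# Both calls take the same DataTree; each extracts the pieces it needs.
azs.ess(data)   # effective sample size, per variable
azs.rhat(data)  # R-hat convergence diagnostic, per variable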

In this example, data is a DataTree object: a flexible and efficient structure for storing and manipulating data in ArviZ. While DataTrees are used extensively in ArviZ, they are not specific to it; they are actually a core data structure in the xarray library. We rely on DataTrees because they offer a powerful way to represent multi-dimensional data with labeled axes, making it straightforward to work with complex datasets, including those generated during a Bayesian workflow, such as posterior samples, prior samples, and sampling statistics. This structure promotes clear organization and simplifies both data access and manipulation.

To keep this introduction short and simple we are going to focus our attention on some very general concepts related to DataTrees in the context of ArviZ. Details on how to work with DataTrees, and a description of other data-structures used by ArviZ, can be found in the Working With DataTree guide.

At a high level, a DataTree is a hierarchical structure with groups. The number of groups varies, but within ArviZ we have a convention for their names. For instance, the posterior samples are stored in a group called posterior, the prior samples in a group called prior, and the sampling statistics (quantities related to the inner workings of the samplers) in a group called sample_stats. This allows for a clear organization of the data, making it easy to access and manipulate.
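As a minimal sketch of this convention (with made-up values, and assuming a recent xarray version where DataTree is available as xr.DataTree), a tree following it can be built directly:

import numpy as np
import xarray as xr

# Fake posterior samples with the conventional (chain, draw) dimensions.
rng = np.random.default_rng(0)
posterior = xr.Dataset({"mu": (("chain", "draw"), rng.normal(size=(4, 500)))})
# Following the ArviZ convention, the samples go under a "posterior" group.
tree = xr.DataTree.from_dict({"posterior": posterior})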

We can see that for the centered_eight model we have several groups:

data.groups
('/',
 '/posterior',
 '/posterior_predictive',
 '/log_likelihood',
 '/sample_stats',
 '/prior',
 '/prior_predictive',
 '/observed_data',
 '/constant_data')

Each group can contain multiple variables. We already saw that we have mu, theta, and tau. And each variable can have multiple dimensions. For instance, the variable mu has a dimension called chain, which represents the different chains used in the MCMC sampling process, and a dimension called draw, which represents the different samples drawn from the posterior distribution. theta has these two dimensions plus a dimension called school, which represents each one of the 8 schools in the model.

data.posterior.data_vars
Data variables:
    mu       (chain, draw) float64 16kB 7.872 3.385 9.1 ... 1.767 3.486 3.404
    theta    (chain, draw, school) float64 128kB 12.32 9.905 ... 6.762 1.295
    tau      (chain, draw) float64 16kB 4.726 3.909 4.844 ... 2.741 2.932 4.461

This is a more complete view of the contents of the posterior group:

data.posterior
<xarray.DatasetView> Size: 165kB
Dimensions:  (chain: 4, draw: 500, school: 8)
Coordinates:
  * chain    (chain) int64 32B 0 1 2 3
  * draw     (draw) int64 4kB 0 1 2 3 4 5 6 7 ... 493 494 495 496 497 498 499
  * school   (school) <U16 512B 'Choate' 'Deerfield' ... 'Mt. Hermon'
Data variables:
    mu       (chain, draw) float64 16kB 7.872 3.385 9.1 ... 1.767 3.486 3.404
    theta    (chain, draw, school) float64 128kB 12.32 9.905 ... 6.762 1.295
    tau      (chain, draw) float64 16kB 4.726 3.909 4.844 ... 2.741 2.932 4.461
Attributes:
    created_at:                 2022-10-13T14:37:37.315398
    arviz_version:              0.13.0.dev0
    inference_library:          pymc
    inference_library_version:  4.2.2
    sampling_time:              7.480114936828613
    tuning_steps:               1000

Exploring groups, dimensions and coordinates#

By default the xarray interface will use one or more of the available groups. For instance, the summary function will use the posterior group, while the metrics function will use both the posterior_predictive and the observed_data groups. Both functions reduce over the chain and draw dimensions to produce the final computations. That explains, for example, why we get a single mean for each variable when calling summary even though the posterior is represented using 2000 samples (4 chains x 500 draws). But notice that the dimension school is not “reduced”; instead we get a different “mean” (and other quantities) for each school.
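We can reproduce this reduction manually with plain xarray: averaging theta over chain and draw leaves one value per school, matching the mean column of the summary table above.

# Reducing over chain and draw keeps the school dimension intact,
# so we get one mean per school, as in the summary table.
data.posterior["theta"].mean(dim=["chain", "draw"])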

All these defaults can be changed, as we already saw with the var_names argument. Usually ArviZ functions will have a group argument. For instance, the metrics function does not have it, as the computations are always done using the posterior_predictive and the observed_data groups, but summary has it and allows you to specify different groups.

Let’s set group to posterior_predictive. Can you anticipate what we will get?

azs.summary(data, group="posterior_predictive")
summary mean sd eti94_lb eti94_ub ess_bulk ess_tail r_hat mcse_mean mcse_sd
label
obs[Choate] 6.65 15.86 -22.79 37.42 1370.99 1393.53 1.00 0.43 0.32
obs[Deerfield] 4.88 11.25 -16.51 25.31 1279.14 1941.95 1.00 0.31 0.22
obs[Phillips Andover] 4.26 16.42 -26.62 35.04 1673.19 1725.26 1.00 0.40 0.28
obs[Phillips Exeter] 5.00 12.27 -18.39 28.23 1646.09 1679.22 1.00 0.30 0.22
obs[Hotchkiss] 3.46 9.93 -14.98 21.75 1600.17 1734.48 1.01 0.25 0.18
obs[Lawrenceville] 3.42 12.22 -20.03 26.15 1537.25 1715.13 1.00 0.31 0.22
obs[St. Paul's] 6.36 11.16 -15.47 27.07 1440.17 1544.38 1.00 0.29 0.20
obs[Mt. Hermon] 4.92 18.90 -32.06 39.69 1948.58 1917.24 1.00 0.43 0.31

We get predictions of the model at each observation. In this example, we have 8 observations, so we get 8 summaries.

ArviZ does not come with guarantees that any possible combination of arguments you may want to try will work, or that the generated result will be sensible or useful. The defaults usually are, but beyond them you are free to explore the data as you wish, and you are responsible for the results.

For instance

azs.summary(data, group="sample_stats")

will not work, because the sample_stats group does not contain samples to summarize. However,

azs.summary(data, group="log_likelihood")

will work, and it will compute the summary for the log-likelihood per observation. How useful this is depends on the context of the analysis.

Sample_dims#

Now, we are going to discuss the sample_dims argument. This argument is used to specify which dimensions of the data should be reduced when doing calculations. An example will help us understand this better. Let’s say we want to use the summary function to summarize the observed data. If we do:

azs.summary(data, group="observed_data")

We will get an error. Instead, we should write:

azs.summary(data, group="observed_data", sample_dims=["school"], kind="stats")
summary mean sd eti94_lb eti94_ub
label
obs 8.75 9.77 -2.58 25.9

We need to add these two arguments for two separate reasons.

  • By default, sample_dims is set to ["chain", "draw"], but those dimensions are not present (or meaningful) in the observed_data group. By passing sample_dims=["school"], we tell ArviZ to reduce the school dimension, which is the only dimension present in the observed_data group.

  • Conceptually, it does not make sense to compute sampling diagnostics on observed data, as they are not samples generated using MCMC methods; passing kind="stats" restricts the output to the statistics. Additionally, we would run into issues if we tried to compute the r_hat diagnostic, as it assumes we have at least two chains.

If we were interested in summarizing the posterior distribution for mu per chain, we could do:

azs.summary(data,
            var_names=["mu"],
            kind="stats",
            sample_dims=["draw"])
summary mean sd eti94_lb eti94_ub
label
mu[0] 4.25 3.40 -3.01 10.43
mu[1] 4.18 3.24 -1.54 10.14
mu[2] 4.66 3.81 -2.05 11.15
mu[3] 4.85 3.43 -1.55 10.45

We can also compute the summary for sample_dims=["draw", "school"]:

azs.summary(data,
            kind="stats",
            sample_dims=["draw", "school"])
summary mean sd eti94_lb eti94_ub
label
mu[0] 4.25 3.40 -3.01 10.43
mu[1] 4.18 3.24 -1.54 10.14
mu[2] 4.66 3.81 -2.05 11.15
mu[3] 4.85 3.43 -1.55 10.45
theta[0] 4.63 5.04 -4.96 14.09
theta[1] 4.64 5.31 -5.41 15.12
theta[2] 5.17 5.87 -6.04 16.27
theta[3] 5.21 5.38 -4.87 15.40
tau[0] 3.68 2.71 1.05 10.89
tau[1] 4.25 3.15 0.90 11.44
tau[2] 4.66 3.27 1.37 13.96
tau[3] 3.91 3.17 1.13 11.60

Notice that draw is present in all 3 variables, but school is only present in theta. Thus, for mu and tau we get a summary reduced over the draw dimension (one row per chain), while for theta we get a summary reduced over both draw and school.
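Written with plain xarray, the same reductions make explicit why theta loses its school dimension here while mu and tau keep one value per chain.

# theta is averaged over both draw and school; mu only over draw,
# as it has no school dimension. Both keep the chain dimension.
data.posterior["theta"].mean(dim=["draw", "school"])
data.posterior["mu"].mean(dim="draw")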

Coords#

We can use the coords argument to specify which coordinates we want to use for the computations. This is useful to subset the data along one or more dimensions. For instance, if we want to summarize the posterior distribution for only a subset of schools, we can do:

azs.summary(data,
            var_names=["mu", "theta"],
            coords={"school": ["Choate", "Deerfield"]},
            )
summary mean sd eti94_lb eti94_ub ess_bulk ess_tail r_hat mcse_mean mcse_sd
label
mu 4.49 3.49 -1.97 10.54 240.99 484.87 1.02 0.23 0.16
theta[Choate] 6.46 5.87 -3.51 19.09 365.05 689.57 1.01 0.30 0.35
theta[Deerfield] 5.03 4.88 -4.07 14.61 427.32 810.53 1.01 0.23 0.18

Notice that the coords argument is used to select specific schools and thus it does not affect variables without the school dimension, like mu.
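Under the hood this corresponds, roughly, to a label-based selection in xarray before the reduction takes place:

# Roughly what coords does: select labels along a dimension before reducing.
data.posterior["theta"].sel(school=["Choate", "Deerfield"])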

Global defaults#

ArviZ has a set of defaults that apply globally, meaning that you can change them in one place and it will affect the behaviour of many functions. ArviZ’s functions, like summary, take their default values for sample_dims and other arguments from here. The defaults are stored in azb.rcParams. Can you spot the default values for the interval kind and its probability?

azb.rcParams
RcParams({'data.http_protocol': 'https',
          'data.index_origin': 0,
          'data.log_likelihood': False,
          'data.sample_dims': ('chain', 'draw'),
          'data.save_warmup': False,
          'plot.backend': 'matplotlib',
          'plot.bokeh.bounds_x_range': 'auto',
          'plot.bokeh.bounds_y_range': 'auto',
          'plot.bokeh.figure.dpi': 60,
          'plot.bokeh.figure.height': 500,
          'plot.bokeh.figure.width': 500,
          'plot.bokeh.layout.order': 'default',
          'plot.bokeh.layout.sizing_mode': 'fixed',
          'plot.bokeh.layout.toolbar_location': 'above',
          'plot.bokeh.marker': 'cross',
          'plot.bokeh.output_backend': 'webgl',
          'plot.bokeh.tools': 'reset,pan,box_zoom,wheel_zoom,lasso_select,undo,save,hover',
          'plot.density_kind': 'kde',
          'plot.max_subplots': 40,
          'stats.ci_kind': 'eti',
          'stats.ci_prob': 0.94,
          'stats.ic_compare_method': 'stacking',
          'stats.ic_pointwise': True,
          'stats.ic_scale': 'log',
          'stats.information_criterion': 'loo',
          'stats.module': 'base',
          'stats.point_estimate': 'mean'})

If we don’t like the default value of 0.94 for the credible interval, we can change it globally:

azb.rcParams["stats.ci_prob"] = 0.95

This will change the default value for the credible interval to 0.95 for all subsequent calls to summary and other functions that use this parameter. Alternatively, we can change it per call:

azs.summary(data, ci_prob=0.95)
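If we only need a different default temporarily, a context manager is handy; this assumes arviz_base exposes rc_context, mirroring the legacy ArviZ API.

# Temporarily override a default; the previous value is restored on exit.
with azb.rc_context(rc={"stats.ci_prob": 0.9}):
    azs.summary(data, var_names=["mu"])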

See also