arviz_stats.update_subsample

arviz_stats.update_subsample(loo_orig, data, observations=None, var_name=None, reff=None, log_weights=None, seed=315, method='lpd', log_lik_fn=None, param_names=None, log=True)

Update a sub-sampled PSIS-LOO-CV object with new observations.

Extends a sub-sampled PSIS-LOO-CV result by adding new observations to the sub-sample without recomputing values for previously sampled observations. This allows for incrementally improving the sub-sampled PSIS-LOO-CV estimate with additional observations.

The sub-sampling method is described in [1].

Parameters:
loo_orig : ELPDData

Original PSIS-LOO-CV result created with loo_subsample with pointwise=True.

data : xarray.DataTree or InferenceData

Input data. It should contain the posterior and the log_likelihood groups.

observations : int or ndarray, optional

The additional observations to use:

  • An integer specifying the number of new observations to randomly sub-sample without replacement.

  • An array of integer indices specifying the exact new observations to use.

  • If None or 0, returns the original PSIS-LOO-CV result unchanged.
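The three forms of observations can be pictured with plain NumPy. The sketch below is not the library's internal code. It assumes that an integer means "draw that many indices from the observations not yet in the sub-sample" and that an array is used exactly as given. The helper name pick_new_observations and its arguments are hypothetical.

```python
import numpy as np


def pick_new_observations(n_total, already_sampled, observations, seed=315):
    """Hypothetical helper: resolve `observations` into an array of new indices."""
    already_sampled = np.asarray(already_sampled, dtype=int)
    if observations is None or (np.isscalar(observations) and observations == 0):
        # Nothing to add: the sub-sample stays as it is.
        return np.array([], dtype=int)
    if np.isscalar(observations):
        # Integer: draw that many indices, without replacement, from the
        # observations that are not yet in the sub-sample (assumption).
        remaining = np.setdiff1d(np.arange(n_total), already_sampled)
        rng = np.random.default_rng(seed)
        return np.sort(rng.choice(remaining, size=int(observations), replace=False))
    # Array: use exactly the indices supplied by the caller.
    return np.asarray(observations, dtype=int)


old = np.array([0, 2, 5, 7])                                  # indices already sub-sampled
new_random = pick_new_observations(8, old, 2)                 # two random new indices
new_exact = pick_new_observations(8, old, np.array([1, 3]))   # explicit new indices
```

With 8 observations and 4 already sampled, the integer form draws from the 4 remaining indices, while the array form leaves the choice entirely to the caller.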

var_name : str, optional

The name of the variable in the log_likelihood group that stores the pointwise log-likelihood data to use for the LOO computation.

reff : float, optional

Relative MCMC efficiency, ess / n, i.e. the number of effective samples divided by the number of actual samples. Computed from the trace by default.
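As a small worked example of the definition, using hypothetical numbers (an assumed effective sample size of 1500 from 4 chains of 500 draws):

```python
# Hypothetical numbers, chosen only to illustrate reff = ess / n.
n_chains, n_draws = 4, 500
n_samples = n_chains * n_draws   # n: actual number of posterior samples
ess = 1500.0                     # effective sample size (assumed value)

reff = ess / n_samples
print(reff)  # 0.75
```

A value computed this way could be passed as reff instead of letting it be computed from the trace.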

log_weights : xarray.DataArray or ELPDData, optional

Smoothed log weights, given either as an xarray.DataArray or as an ELPDData object. Defaults to None, in which case they are computed using the PSIS-LOO method.

seed : int, optional

Seed for random sampling.

method : str, optional

Method used for approximating the pointwise log predictive density:

  • lpd: Use standard log predictive density approximation (default)

  • plpd: Use point log predictive density approximation which requires a log_lik_fn.

log_lik_fn : callable, optional

Function that computes the log-likelihood for observations given posterior parameters. Required when method="plpd", or when method="lpd" and a custom likelihood is needed. The signature is log_lik_fn(observations, datatree), where observations is a DataArray of observed data and datatree is a DataTree object. For method="plpd", posterior means are computed automatically and passed in the posterior group; for method="lpd", the full posterior samples are passed. All other groups are left unchanged and can be accessed directly. It is recommended to also pass the names of the posterior parameters that the log-likelihood function needs (see param_names).
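A minimal sketch of a function with this signature, for a normal likelihood. The variable names "theta" (in the posterior group) and "sigma" (in constant_data) are hypothetical, and the object passed in below is a plain stand-in for a DataTree, used only so the snippet runs without any model output.

```python
from types import SimpleNamespace

import numpy as np


def log_lik_fn(observations, datatree):
    """Normal log-likelihood; the names "theta" and "sigma" are hypothetical."""
    theta = datatree.posterior["theta"]      # posterior means when method="plpd"
    sigma = datatree.constant_data["sigma"]  # auxiliary data, one value per observation
    return -0.5 * np.log(2 * np.pi * sigma**2) - (observations - theta) ** 2 / (2 * sigma**2)


# Stand-in for a DataTree (assumption): groups exposed as attributes holding dicts.
fake_tree = SimpleNamespace(
    posterior={"theta": np.array([1.0, 2.0])},
    constant_data={"sigma": np.array([1.0, 2.0])},
)
y = np.array([1.0, 4.0])
ll = log_lik_fn(y, fake_tree)  # one log-likelihood value per observation
```

With real inputs, observations is a DataArray and the groups are xarray objects, so the same arithmetic would broadcast by dimension name; that behaviour is not exercised by this stand-in.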

param_names : list, optional

List of parameter names to extract from the posterior. If None, all parameters are used.

log : bool, optional

Whether the log_lik_fn returns log-likelihood (True) or likelihood (False). Default is True.

Returns:
ELPDData

Object with the following attributes:

  • elpd: updated approximated expected log pointwise predictive density (elpd)

  • se: standard error of the elpd (includes approximation and sampling uncertainty)

  • p: effective number of parameters

  • n_samples: number of samples in the posterior

  • n_data_points: total number of data points (N)

  • warning: True if the estimated shape parameter k of the Pareto distribution is > good_k for any observation in the subsample.

  • elpd_i: DataArray with pointwise elpd values (filled with NaNs for non-subsampled points), only if pointwise=True.

  • pareto_k: DataArray with Pareto shape values for the subsample (filled with NaNs for non-subsampled points), only if pointwise=True.

  • scale: scale of the elpd results (“log”, “negative_log”, or “deviance”).

  • good_k: Threshold for Pareto k warnings.

  • approx_posterior: True if approximate posterior was used.

  • subsampling_se: Standard error estimate from subsampling uncertainty only.

  • subsample_size: Number of observations in the subsample (original + new).

  • log_p: Log density of the target posterior.

  • log_q: Log density of the proposal posterior.

  • thin: Thinning factor for posterior draws.

  • log_weights: Smoothed log weights.
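Because elpd_i and pareto_k are filled with NaNs for non-subsampled points, the sub-sample can be recovered from the NaN pattern. The sketch below uses a hypothetical plain NumPy array in place of the returned DataArray; the values are made up for illustration.

```python
import numpy as np

# Hypothetical pointwise values for N = 8 observations, 6 of them sub-sampled.
elpd_i = np.array([-3.9, np.nan, -3.7, -4.1, np.nan, -3.8, -3.6, -4.0])

sampled = np.isfinite(elpd_i)          # True where an observation is in the sub-sample
subsample_size = int(sampled.sum())    # number of sub-sampled observations
sampled_idx = np.flatnonzero(sampled)  # positions of the sub-sampled observations
```

Here subsample_size is 6 and sampled_idx lists the six positions that hold a finite value.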

Warning

When using custom log-likelihood functions with auxiliary data (e.g., measurement errors, covariates, or any observation-specific parameters), that data must be stored in the constant_data group of your DataTree/InferenceData object. During subsampling, data from this group is automatically aligned with the subset of observations being evaluated. This ensures that when computing the log-likelihood for observation i, the corresponding auxiliary data is correctly matched.

If auxiliary data is not properly placed in this group, indexing mismatches will occur, leading to incorrect likelihood calculations.
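The kind of mismatch described in this warning can be pictured by hand with NumPy. This is only an illustration with made-up values and hypothetical names; for data stored in constant_data, the alignment is performed by the library itself.

```python
import numpy as np

# Illustrative values: 8 observations with one auxiliary value (sigma) each.
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

subsample = np.array([1, 4, 6])  # indices of the observations being evaluated

y_sub = y[subsample]
sigma_aligned = sigma[subsample]            # matches each y_sub entry to its own sigma
sigma_misaligned = sigma[: len(subsample)]  # wrong: takes the first three sigmas instead
```

Only sigma_aligned pairs observation i with its own auxiliary value; sigma_misaligned has the right length but belongs to observations 0, 1 and 2.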

See also

loo

Exact PSIS-LOO cross-validation.

loo_subsample

PSIS-LOO-CV with subsampling.

References

[1]

Magnusson, M., Riis Andersen, M., Jonasson, J., & Vehtari, A. (2019). Bayesian Leave-One-Out Cross-Validation for Large Data. Proceedings of the 36th International Conference on Machine Learning, PMLR 97:4244–4253. https://proceedings.mlr.press/v97/magnusson19a.html. arXiv preprint: https://arxiv.org/abs/1904.10679

Examples

Calculate initial sub-sampled PSIS-LOO-CV using 4 observations, then update with 2 more:

In [1]: from arviz_stats import loo_subsample, update_subsample
   ...: from arviz_base import load_arviz_data
   ...: data = load_arviz_data("non_centered_eight")
   ...: initial_loo = loo_subsample(data, observations=4, var_name="obs", pointwise=True)
   ...: updated_loo = update_subsample(initial_loo, data, observations=2)
   ...: updated_loo
   ...: 
Out[1]: 
Computed from 2000 by 6 subsampled log-likelihood
values from 8 total observations.

         Estimate   SE subsampling SE
elpd_loo     -30.8  1.4            0.2
p_loo          1.0

------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)        6  100.0%
   (0.70, 1]   (bad)         0    0.0%
    (1, Inf)   (very bad)    0    0.0%