Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

summary with one line per _named_ parameter #43

Open
jgabry opened this issue Nov 8, 2019 · 9 comments · May be fixed by #343
Open

summary with one line per _named_ parameter #43

jgabry opened this issue Nov 8, 2019 · 9 comments · May be fixed by #343
Labels
feature New feature or request

Comments

@jgabry
Copy link
Member

jgabry commented Nov 8, 2019

The discussion in #32 reminded me that one thing @andrewgelman and others have been wanting for a while (and that I think would also be useful as an alternative to the standard summary output) is a summary where each vector/matrix/array parameter only occupies a single line just like scalars. That is, each line corresponds to a named parameter rather than a parameter element.

(This could be its own thing or just an option to a custom print method for our existing summary objects.)

The quantities to display are debatable (e.g. maybe min of all the individual elements’ ESS, max of the Rhats, etc), so it would be good to get input from Andrew and anyone else interested.

@jgabry jgabry added the feature New feature or request label Nov 8, 2019
@jgabry jgabry changed the title Concise summary for non-scalar parameters summary with one line per _named_ parameter Nov 8, 2019
@jgabry
Copy link
Member Author

jgabry commented Nov 12, 2019

One possibility is something like what I put below (I think @andrewgelman and @avehtari and I discussed something like this one time when Aki was visiting). Not sure which columns we include, but a concise table like this would be a really nice option for models with many parameters:

variable   min_mean max_mean max_rhat min_ess_bulk min_ess_tail
alpha[1]       4.5     4.5    1.0         881         300
tau[1]         3.8     3.8    1.1         386         311
omega[9]       1.5     7.4    1.0         552         272
Sigma[7,7]    -1.7     5.1    1.0         421         202

Here the numbers inside brackets indicate the dimensions (we could also omit if a scalar instead of putting [1]). At first people could confuse this with just showing the 9th element of omega, for example, but I think we could make it clear in the doc (and this also wouldn't be the default output so they would be intentionally selecting this). The columns min_mean and max_mean here are the min and max of the "mean" columns if the summary was computed for the individual elements (to give a sense of how much the means can differ) but I'm not sure if we want to include that or not.

@andrewgelman Did we end up coming up with a better idea than putting the dimensions in brackets?

Anyone else have any thoughts on this?

@avehtari
Copy link
Collaborator

Since brackets are used for indexing, this can cause confusion for R users. Would it be too much to have a column dim as the second column?

variable dim min_mean max_mean max_rhat min_ess_bulk min_ess_tail
alpha    1       4.5     4.5    1.0         881         300
tau      1       3.8     3.8    1.1         386         311
omega    9       1.5     7.4    1.0         552         272
Sigma    7,7    -1.7     5.1    1.0         421         202

@paul-buerkner
Copy link
Collaborator

dim separately may be a little ugly and not not too helpful for filtering I would say as the structure would vary across variables, that is, sometimes 1 number sometimes 2 or more separated with commas.

@jgabry
Copy link
Member Author

jgabry commented Nov 13, 2019

What if we just replace the brackets to a different symbol to refer to dimension? There are lots of possibilities. Here are just a few:

variable   min_mean max_mean max_rhat min_ess_bulk min_ess_tail
alpha (1)       4.5     4.5    1.0         881         300
tau   (1)       3.8     3.8    1.1         386         311
omega (9)       1.5     7.4    1.0         552         272
Sigma (7,7)    -1.7     5.1    1.0         421         202
variable   min_mean max_mean max_rhat min_ess_bulk min_ess_tail
alpha {1}       4.5     4.5    1.0         881         300
tau   {1}       3.8     3.8    1.1         386         311
omega {9}       1.5     7.4    1.0         552         272
Sigma {7,7}    -1.7     5.1    1.0         421         202
variable   min_mean max_mean max_rhat min_ess_bulk min_ess_tail
alpha <1>       4.5     4.5    1.0         881         300
tau   <1>       3.8     3.8    1.1         386         311
omega <9>       1.5     7.4    1.0         552         272
Sigma <7,7>    -1.7     5.1    1.0         421         202

@mjskay
Copy link
Collaborator

mjskay commented Nov 13, 2019

@paul-buerkner I'm not sure I understand the filtering issue with dim in a separate column --- it seems to me it would be easier to filter if dims were not a part of the variable column? E.g. filter(., variable == "Sigma")?

@paul-buerkner
Copy link
Collaborator

@mjskay you have a good point there. Not sure how to best represent the dimension, but you are right that filtering after variable should be easily possible.

@andrewgelman
Copy link

andrewgelman commented Nov 13, 2019 via email

@jsocolar
Copy link
Contributor

jsocolar commented May 21, 2021

Hey guys, I'm still playing with ways to summarize the huge draws array that I'm working with, and realized that this issue is pretty relevant as well. I've approached the problem as below. If the approach looks not-crazy then I could try to clean it up and turn it into a PR. This would involve better error handling, inferring the dimensions of the nonscalar parameters, and wrangling the output into whatever format you want.

The default (which is tailored to Stan) is to "roll up" the summaries for any parameter with a [ in its name. There's also an option to pass the parameter names for rolling up, again with the assumption that they'll show up in the draws array as ^parameter_name\\[.*. The logic for passing parameter names is that nonscalar parameters come in (at least) two different flavors. One is big random effects and covariance matrices that ought to be rolled up. The other is vectors of regression coefficients probably should not be rolled up.

Functions:

summary_rollup <- function (draws_summary, rollup_vars = NULL,
                            min_only = c("ess_bulk", "ess_tail"),
                            max_only = "rhat") {
  vars <- draws_summary$variable
  
  if (is.null(rollup_vars)) {
    vars_nonscalar <- grepl("\\[", vars)
  } else {
    vars_nonscalar <- as.logical(colSums(do.call(rbind, lapply(paste0("^", rollup_vars, "\\["), function(x){grepl(x, vars)}))))
  }
  ds_scalar <- draws_summary[!vars_nonscalar, ]
  ds_nonscalar <- draws_summary[vars_nonscalar, ]
  varnames_nonscalar <- gsub("\\[(.*)", "", ds_nonscalar$variable)
  summary_names <- names(draws_summary)[-1]
  names_minmax <- summary_names[!(summary_names %in% c(min_only, max_only))]
  split_nonscalar <- split(ds_nonscalar, varnames_nonscalar)[unique(varnames_nonscalar)] # [unique(varnames_nonscalar)] preserves the order of the names
  
  min_max <- do.call(rbind, lapply(split_nonscalar, rollup_helper_minmax, names = names_minmax))
  min_only <- do.call(rbind, lapply(split_nonscalar, rollup_helper_min, names = min_only))
  max_only <- do.call(rbind, lapply(split_nonscalar, rollup_helper_max, names = max_only))
  
  nonscalar_out <- cbind(min_max, max_only, min_only)
  
  out <- list(unrolled = ds_scalar, rolled = nonscalar_out)
  out
}

rollup_helper_minmax <- function(x, names){
  x <- x[, names]
  mm <- c(apply(x, 2, function(x) {c(min(x), max(x))}))
  names(mm) <- paste0(rep(names(x), each = 2), c("_min", "_max"))
  mm
}

rollup_helper_min <- function(x, names){
  x <- x[, names]
  min_only <- apply(x, 2, min)
  names(min_only) <- paste0(names(x), "_min")
  min_only
}

rollup_helper_max <- function(x, names){
  x <- x[, names]
  max_only <- apply(x, 2, max)
  names(max_only) <- paste0(names(x), "_max")
  max_only
}

Example:

ds <- summarise_draws(example_draws())
ds2 <- summarise_draws(2 * example_draws())
ds2$variable <- paste0("new_", ds2$variable)
draws_summary <- rbind(ds, ds2)

summary_rollup(draws_summary)
summary_rollup(draws_summary, rollup_vars = c("theta", "new_theta"))
summary_rollup(draws_summary, rollup_vars = "theta")

@mjskay
Copy link
Collaborator

mjskay commented May 21, 2021

This looks like it could be very useful to me! Two quick thoughts if we go down this road:

  1. To be consistent with other functions I'd consider a verb-ish name, maybe rollup_summary() instead of summary_rollup()
  2. For dimension parsing, we probably want to have a consistent scheme for doing this wherever we have to do it, suggesting we at the very least should have an internal helper function and maybe even expose this externally if we think it would be useful. This will allow us to have consistent rules for doing it, and maybe add options in the future for people that need to parse non-standard dimension naming schemes. The beginnings of such a function could be made by factoring out the dimension-parsing code from as_draws_rvars.draws_matrix(), which already handles a number of weird cases (like missing elements, dimensions that don't start at 1, and non-numeric dimensions). Discussion of this aspect specifically should probably happen on Coding matrix/vector variables in tidy form #61

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants