Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FR/Bug: Calculate Sigma (get_sigma) appropiately for vglm (and vgam) models #578

Open
iago-pssjd opened this issue Jun 7, 2022 · 4 comments
Labels
3 investigators ❔❓ Need to look further into this issue Bug 🐛 Something isn't working Low priority 😴 This issue can be easily workaround or happens only in edge cases

Comments

@iago-pssjd
Copy link

iago-pssjd commented Jun 7, 2022

First, it is a bug, since, as stated in ?get_sigma

By default, get_sigma() tries to extract sigma by calling stats::sigma(). If the model-object has no sigma() method, the next step is calculating sigma as square-root of the model-deviance divided by the residual degrees of freedom. Finally, if even this approach fails, and x is a mixed model, the residual standard deviation is accessed using the square-root from get_variance_residual().

Therefore, when the model has no sigma() method get_sigma tries a calculation that not always is suitable, at least for many of vglm models. For example (from help("tobit", package = "VGAM")):

tdata <- data.frame(x2 = seq(-1, 1, length = (nn <- 100)))
set.seed(1)
Lower <- 1; Upper <- 4  # For the nonstandard Tobit model
tdata <- transform(tdata,
                   Lower.vec = rnorm(nn, Lower, 0.5),
                   Upper.vec = rnorm(nn, Upper, 0.5))
meanfun1 <- function(x) 0 + 2*x
meanfun2 <- function(x) 2 + 2*x
meanfun3 <- function(x) 2 + 2*x
meanfun4 <- function(x) 3 + 2*x
tdata <- transform(tdata,
  y1 = rtobit(nn, mean = meanfun1(x2)),  # Standard Tobit model
  y2 = rtobit(nn, mean = meanfun2(x2), Lower = Lower, Upper = Upper),
  y3 = rtobit(nn, mean = meanfun3(x2), Lower = Lower.vec, Upper = Upper.vec),
  y4 = rtobit(nn, mean = meanfun3(x2), Lower = Lower.vec, Upper = Upper.vec))
fit1 <- vglm(y1 ~ x2, tobit, data = tdata)
get_sigma(fit1)
[1] 2.694853
attr(,"class")
[1] "insight_aux" "numeric"  

while the right sigma value is

exp(coef(fit1)["(Intercept):2"]) # look also at coef(fit1, matrix = TRUE)
(Intercept):2 
    0.9395564 

(see https://stats.oarc.ucla.edu/r/dae/tobit-models/ and the book "Vector Generalized Linear and Additive Models" by Thomas Yee). In fact, in this package there are 2 points to consider:

  • on one hand (section 1.2.1 of the book), sigma is not treated as a scale parameter, like in GLM's, but it is generally (not always) modelled (as also happens with gamlss package), therefore it is found as an intercept;
  • on the other hand (sections 1.4 and 3.5.1 of the book), it allows for multiple and multivariate responses. In this case, should be sigma maybe a list, with each element being a matrix corresponding to the multivariate responses (using the intercepts (Intercept):2, (Intercept):4, etc.)? (I refer to the distinction between multiple and multivariate responses made in section 3.5.1 of the book, in the second half of page 115). This second point would be related to this issue I recently added to performance: FR: improve formatting of multivariate models by model_performance performance#430

Thanks!

@iago-pssjd iago-pssjd changed the title FR/Bug: Calculate Sigma (get_sigma) appropiately for vglm models FR/Bug: Calculate Sigma (get_sigma) appropiately for vglm (and vgam) models Jun 7, 2022
@strengejacke strengejacke added Bug 🐛 Something isn't working 3 investigators ❔❓ Need to look further into this issue labels Jun 7, 2022
@strengejacke
Copy link
Member

Any example for multivariate models?

strengejacke added a commit that referenced this issue Jun 24, 2022
@iago-pssjd
Copy link
Author

Hi @strengejacke.

I assume you ask for the multivariate response models. I cite a paragraph in the same page of the book I mentioned above

Some care is needed to distinguish between multiple responses and multivariate responses. Multiple responses are treated independently of each other, and are to be inputted side-by-side on the LHS of the formula using cbind().Eachresponse may be a vector, of dimension Q1. Multivariate responses correspond to Q1 > 1, and are handled by some full-likelihood model that takes into account their joint distribution.

The example given by the author is the family binormal, so

set.seed(123); nn <- 1000
bdata <- data.frame(x2 = runif(nn), x3 = runif(nn))
bdata <- transform(bdata, y1 = rnorm(nn, 1 + 2 * x2),
                          y2 = rnorm(nn, 3 + 4 * x2))
fit1 <- vglm(cbind(y1, y2) ~ x2,
             binormal(), data = bdata, trace = FALSE)
coef(fit1, matrix = TRUE)
               mean1    mean2 loglink(sd1) loglink(sd2) rhobitlink(rho)
(Intercept) 1.021178 2.929393  0.009053151  -0.02282574      0.05231844
x2          2.042807 4.101541  0.000000000   0.00000000      0.00000000

In this case sigma would be the exp of the diagonal matrix 0.009053151, -0.02282574 . (Further, we would have an intercept-vector 1.021178, 2.929393 and an x2-slope vector 2.042807, 4.101541). Other examples would be with the families trinormal or bistudentt (as shown in the help pages of these functions). More implamented bivariate distributions are given in table 13.1 (page 382) of the book.

Then, first to get the parameters and other coefficients of VGAM models, and second, to distinguish between multiple responses and multivariate responses, one should look at the vglm-object@family. Therefore, the way to get them, and in particular sigma, is not so immediate as

insight/R/get_sigma.R

Lines 96 to 101 in a81c92c

.get_sigma.VGAM <- function(x, verbose = TRUE, ...) {
s <- tryCatch(exp(stats::coef(x)[["(Intercept):2"]]),
function(e) NULL)
class(s) <- c("insight_aux", class(s))
s
}

Let's look some other examples.

  1. Multiple responses and multivariate responses simultaneously (got from ?bistudentt)
nn <- 1000
mydof <- logloglink(1, inverse = TRUE)
ymat <- cbind(rt(nn, df = mydof), rt(nn, df = mydof))
bdata <- data.frame(y1 = ymat[, 1], y2 = ymat[, 2],
                    y3 = ymat[, 1], y4 = ymat[, 2], x2 = runif(nn))
## Not run:  plot(ymat, col = "blue") 
fit1 <- vglm(cbind(y1, y2, y3, y4) ~ 1,  # 2 responses, e.g., (y1,y2) is the 1st
             fam = bistudentt,  # crit = "coef",  # Sometimes a good idea
             data = bdata, trace = FALSE)
coef(fit1, matrix = TRUE)
            logloglink(df1) rhobitlink(rho1) logloglink(df2) rhobitlink(rho2)
(Intercept)        1.072883      -0.01766264        1.072882      -0.01765367

In this case, as we had two responses, we would have two sigma matrices, both having 1 in all the elements of the main diagonal, and having as secondary diagonal the values rho1 (maybe = rhobitlink(-0.01766264, inverse = TRUE)?) and rho2 (maybe = rhobitlink(-0.01765367, inverse = TRUE)?) respectively.

  1. Multiple responses (got from ?tobit)
tdata <- data.frame(x2 = seq(-1, 1, length = (nn <- 100)))
set.seed(1)
Lower <- 1; Upper <- 4  # For the nonstandard Tobit model
tdata <- transform(tdata,
                   Lower.vec = rnorm(nn, Lower, 0.5),
                   Upper.vec = rnorm(nn, Upper, 0.5))
meanfun1 <- function(x) 0 + 2*x
meanfun2 <- function(x) 2 + 2*x
meanfun3 <- function(x) 2 + 2*x
meanfun4 <- function(x) 3 + 2*x
tdata <- transform(tdata,
  y1 = rtobit(nn, mean = meanfun1(x2)),  # Standard Tobit model
  y2 = rtobit(nn, mean = meanfun2(x2), Lower = Lower, Upper = Upper),
  y3 = rtobit(nn, mean = meanfun3(x2), Lower = Lower.vec, Upper = Upper.vec),
  y4 = rtobit(nn, mean = meanfun3(x2), Lower = Lower.vec, Upper = Upper.vec))
fit4 <- vglm(cbind(y3, y4) ~ x2,
             tobit(Lower = rep(with(tdata, Lower.vec), each = 2),
                   Upper = rep(with(tdata, Upper.vec), each = 2),
                   byrow.arg = TRUE),
             data = tdata, crit = "coeff", trace = FALSE)
coef(fit4, matrix = TRUE)
                 mu1 loglink(sd1)      mu2 loglink(sd2)
(Intercept) 1.950757   0.04030946 1.944842  0.008811054
x2          1.899930   0.00000000 2.122347  0.000000000

In this case, we would have two sigma values, one for each response, being coef(fit4, matrix = TRUE)["(Intercept)","loglink(sd1)"] (or also coef(fit4)["(Intercept):2"]) and coef(fit4, matrix = TRUE)["(Intercept)","loglink(sd2)"] (or also coef(fit4)["(Intercept):4"])

  1. A more trivial example (from ?vglm): same as for glm
d.AD <- data.frame(treatment = gl(3, 3),
                         outcome = gl(3, 1, 9),
                         counts = c(18,17,15,20,10,20,25,13,12))
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson,
                 data = d.AD)
vglm.D93 <- vglm(counts ~ outcome + treatment, family = poissonff,
                 data = d.AD, trace = FALSE)
sigma(glm.D93)
[1] 1.13238
sigma(vglm.D93)
[1] 1.13238

Finally, other arguments given to family may be relevant, like zero, nointercept and other (table 18.6, page 518 in the book), even this may appear summarized in some way by the vglm-object@family object.

@iago-pssjd
Copy link
Author

Another thought related to this issue is if sigma should be reported among the "performance" values of a model, when it is being estimated by the model. Probably, reporting it as until now, among the "parameters", is better, but would look even better if we could detect it by means of reading loglink(sd) or sigma (or some other alternative) instead of (Intercept):2.

@strengejacke strengejacke added the Low priority 😴 This issue can be easily workaround or happens only in edge cases label Jun 28, 2022
@strengejacke
Copy link
Member

I think this will take some time until I (or someone else) can look into this, seems not to be very straightforward. If you are able to, we of course also appreciate PRs on this issue :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 investigators ❔❓ Need to look further into this issue Bug 🐛 Something isn't working Low priority 😴 This issue can be easily workaround or happens only in edge cases
Projects
None yet
Development

No branches or pull requests

2 participants