Suggestion for data sets with missing data #112

Irazall · 2021-05-31T16:27:10Z

Dear developers,

First, thank you for this great package. I am using it for my PhD thesis. I found some counterintuitive behavior and would like to suggest a change. This "bug" occurred while I was fitting a logistic regression on data with missing values: my McFadden R^2 fell, and it took me a while to figure out why. I have attached a reproducible example. It shows that the R^2 differs depending on whether you pass in a model fitted on a data set with missing values or one fitted on the same data restricted to complete cases.
This is counterintuitive because the glm function does not distinguish between these two data sets: it already drops all observations with missing data before fitting. So the McFadden R^2 should not change either. Mathematically, the difference arises because observations with missing data are handled differently in the full model and in the empty (intercept-only) model. I therefore suggest applying complete.cases before calculating the log-likelihood of the two models, so that the result is more intuitive.
Let me know what you think about this suggestion. Thank you in advance!

Find below my reproducible example. I hope it is done correctly, as I am new to reprex.

```r
# Delete environment
rm(list = ls())

# Package names
packages <- c("ISLR", "blorr")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Packages loading
invisible(lapply(packages, library, character.only = TRUE))

# set seed for reproducibility
set.seed(176)

# remove columns not needed for regression
dataset <- subset(Smarket, select = -c(Year, Today))

# define function that creates NAs and execute it
createNAs <- function (x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}
dataset <- createNAs(dataset, 0.1)

# do first regression without complete cases
glm.fit1 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = dataset, family = binomial)
blr_rsq_mcfadden(glm.fit1)
#> [1] 0.4616006

# do second regression with complete cases
dataset <- dataset[complete.cases(dataset),]
glm.fit2 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = dataset, family = binomial)
blr_rsq_mcfadden(glm.fit2)
#> [1] 0.003519264

# NOTE THAT THERE IS A DIFFERENCE BETWEEN THE TWO MC FADDEN R^2!
```
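
The mechanism described above can be sketched in base R without blorr. McFadden's R^2 is 1 - logLik(full) / logLik(null), so it is only meaningful when both log-likelihoods are computed on the same observations. The sketch below is an illustration under assumptions, not blorr's actual implementation: it uses simulated data instead of Smarket, `mcfadden_same_rows` is a hypothetical helper name, and the "mismatched" calculation is only a guess at how such a discrepancy could arise.

```r
set.seed(1)

# Simulated data (an assumption for illustration; not the Smarket data).
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.5 * d$x))
d$x[sample.int(n, 40)] <- NA          # introduce missing predictor values

# glm drops incomplete rows itself, so both fits use the same 160 rows.
fit_na <- glm(y ~ x, data = d, family = binomial)
fit_cc <- glm(y ~ x, data = d[complete.cases(d), ], family = binomial)

# Hypothetical helper (not part of blorr): fit the intercept-only model on
# exactly the rows the full model used, taken from its model frame.
mcfadden_same_rows <- function(model) {
  y <- model.response(model.frame(model))
  null <- glm(y ~ 1, family = family(model))
  1 - as.numeric(logLik(model)) / as.numeric(logLik(null))
}

r2_na <- mcfadden_same_rows(fit_na)
r2_cc <- mcfadden_same_rows(fit_cc)

# Mismatched variant: the intercept-only model is fitted on the original
# data, so it keeps the 40 rows that the full model had to drop.
null_all <- glm(y ~ 1, data = d, family = binomial)
r2_mismatch <- 1 - as.numeric(logLik(fit_na)) / as.numeric(logLik(null_all))

print(c(same_rows_na = r2_na, same_rows_cc = r2_cc, mismatched = r2_mismatch))
```

With the helper, both fits give the same value, while the mismatched variant is inflated because the null log-likelihood sums over more observations than the full one. Taking the rows from `model.frame(model)` rather than calling `complete.cases` on the whole data set also avoids dropping rows that are only missing in columns the formula does not use.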

aravindhebbali · 2021-06-01T03:51:34Z

Hi @Irazall

Thank you very much for bringing this to our attention. Based on your suggestion, we have decided to review the blorr API using data sets with missing data and fix any bugs that arise.
