Suggestion for data sets with missing data #112

Open
Irazall opened this issue May 31, 2021 · 1 comment

Irazall commented May 31, 2021

Dear developers,

First, thanks for this great package. I am using it for my PhD thesis. I found something very counterintuitive and would like to suggest a change. This "bug" occurred while I was running a logistic regression on data with missing values. My McFadden R^2 dropped, and it took me a while to figure out why. I have attached a reproducible example. It shows that the R^2 differs depending on whether the model is fitted to a data set with missing values or to the same data set with the incomplete rows removed.
This is counterintuitive because the glm function does not distinguish between these two data sets: it already deletes all observations with missing data. So the McFadden R^2 should not change either. Mathematically, the difference arises because observations with missing data are handled differently in the full model and in the empty (intercept-only) model. I therefore suggest applying complete.cases before calculating the log-likelihood of the two models, which would make the result more intuitive.
Let me know what you think about this suggestion. Thank you in advance!

Find below my reproducible example. I hope this is done correctly, as I am new to reprex.

# Delete environment
rm(list = ls())

# Package names
packages <- c("ISLR", "blorr")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (!all(installed_packages)) {
  install.packages(packages[!installed_packages])
}

# Packages loading
invisible(lapply(packages, library, character.only = TRUE))

# set seed for reproducibility
set.seed(176)

# remove columns not needed for regression
dataset <- subset(Smarket, select = -c(Year, Today))

# define function that creates NAs and execute it
createNAs <- function (x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}
dataset <- createNAs(dataset, 0.1)

# do first regression without complete cases
glm.fit1 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = dataset, family = binomial)
blr_rsq_mcfadden(glm.fit1)
#> [1] 0.4616006

# do second regression with complete cases
dataset <- dataset[complete.cases(dataset),]
glm.fit2 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = dataset, family = binomial)
blr_rsq_mcfadden(glm.fit2)
#> [1] 0.003519264

# NOTE THAT THERE IS A DIFFERENCE BETWEEN THE TWO MC FADDEN R^2!
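
The suggestion above can be sketched in base R. McFadden's R^2 is 1 - logLik(full) / logLik(null), so both log-likelihoods need to come from the same rows. The sketch below assumes the discrepancy arises because the empty model is refitted on the original data, where only rows with a missing response are dropped; that is an assumption about the mechanism, not something checked against the blorr source. The helper `mcfadden_same_rows` and the simulated data frame `d` are hypothetical names introduced for illustration.

```r
# Hypothetical helper (not part of blorr): McFadden's R^2 with the full and
# the intercept-only model fitted to exactly the same rows.
mcfadden_same_rows <- function(model) {
  # model.frame() returns the rows glm() actually used after dropping NAs
  used <- model.frame(model)
  # Refit the intercept-only model on those rows only
  null_model <- update(model, . ~ 1, data = used)
  1 - as.numeric(logLik(model)) / as.numeric(logLik(null_model))
}

# Simulated data (assumption: pure noise, so the true R^2 is close to 0)
set.seed(1)
n <- 500
d <- data.frame(y = rbinom(n, 1, 0.5), x1 = rnorm(n), x2 = rnorm(n))
d$x1[sample.int(n, 100)] <- NA  # inject missing predictor values
d$x2[sample.int(n, 100)] <- NA

fit_na <- glm(y ~ x1 + x2, data = d, family = binomial)
fit_cc <- glm(y ~ x1 + x2, data = d[complete.cases(d), ], family = binomial)

# Naive refit of the empty model: y has no NAs, so all 500 rows are kept
naive_null <- update(fit_na, . ~ 1)
naive <- 1 - as.numeric(logLik(fit_na)) / as.numeric(logLik(naive_null))

# The naive value is inflated; the same-rows values agree with each other
c(nobs_full = nobs(fit_na), nobs_naive_null = nobs(naive_null))
c(naive = naive,
  same_rows_na = mcfadden_same_rows(fit_na),
  same_rows_cc = mcfadden_same_rows(fit_cc))
```

For ungrouped binary data, `1 - fit_na$deviance / fit_na$null.deviance` should give the same value as the helper, since glm computes the null deviance on the rows it actually used.
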
@aravindhebbali
Member

Hi @Irazall

Thank you very much for bringing this to our attention. Based on your suggestion, we have decided to review the blorr API against data sets with missing data and to fix any bugs that turn up.
