Suggestion of new function: describe_missing()
#561
base: main
Conversation
Thank you, I think it would be good to have describe_missing(), but the way it is implemented and documented looks very field-specific to me. I find the output of skimr::skim() easier to understand, with n_missing and complete_rate for instance. I'm also not familiar at all with aggregating stats on missing values across several variables (e.g. Ozone:Wind), and the default output looks unexpected to me (I'd rather expect one row per variable).
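For reference, here is a minimal base-R sketch of that kind of one-row-per-variable summary (a count of NAs and the share of complete values, similar in spirit to n_missing and complete_rate). This is not skimr or datawizard code; the function name is just a placeholder:

# Placeholder sketch: one row per variable, with the number of NAs
# and the share of complete values for that variable.
per_variable_missing <- function(data) {
  data.frame(
    variable = names(data),
    n_missing = vapply(data, function(x) sum(is.na(x)), integer(1)),
    complete_rate = vapply(data, function(x) mean(!is.na(x)), numeric(1)),
    row.names = NULL
  )
}

per_variable_missing(airquality)  # airquality has NAs in Ozone and Solar.R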
R/describe_missing.R
#' @description Provides a detailed description of missing values in a data frame.
#' This function reports both absolute and percentage missing values of specified
#' column lists or scales, following recommended guidelines. Some authors recommend
#' reporting item-level missingness per scale, as well as a participant's maximum
#' number of missing items by scale. For example, Parent (2013) writes:
#'
#' *I recommend that authors (a) state their tolerance level for missing data by scale
#' or subscale (e.g., "We calculated means for all subscales on which participants gave
#' at least 75% complete data") and then (b) report the individual missingness rates
#' by scale per data point (i.e., the number of missing values out of all data points
#' on that scale for all participants) and the maximum by participant (e.g., "For Attachment
#' Anxiety, a total of 4 missing data points out of 100 were observed, with no participant
#' missing more than a single data point").*
This sounds a bit too focused on survey data, while this function can be interesting for all kinds of data. I'd rather keep the first one or two sentences here and move the rest to a specific section in 'Details' (but even there, this seems very field-specific).
I moved everything after "Some authors recommend" to @details.
Also, the way I see it is that a lot of packages and functions can report basic missing data features, like skimr::skim() (that's the "easy" part). What is missing is a way to handle, as you highlight, survey data in that field-specific way. I thought it still fits with datawizard even if it offers additional field-specific features, although we can probably try to make it more general for other users. In the details section, I added a paragraph giving more context about scales as used in psychology:
#' In psychology, it is common to ask participants to answer questionnaires in
#' which people answer several questions about a specific topic. For example,
#' people could answer 10 different questions about how extroverted they are.
#' In turn, researchers calculate the average for those 10 questions (called
#' items). These questionnaires are called (e.g., Likert) "scales" (such as the
#' Rosenberg Self-Esteem Scale, also known as the RSES).
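To make the item/scale idea concrete, here is a small, purely illustrative sketch (made-up data and column names, not part of the function) of the scoring rule the Parent (2013) quote describes: averaging a participant's items only when at least 75% of them are complete.

# Four hypothetical extroversion items for four participants
items <- data.frame(
  extroversion_1 = c(4, 5, NA, 2),
  extroversion_2 = c(3, NA, NA, 4),
  extroversion_3 = c(5, 4, NA, 3),
  extroversion_4 = c(4, 4, 1, NA)
)
# Share of non-missing items per participant
complete_share <- rowMeans(!is.na(items))
# Score the scale only for participants with at least 75% complete data
extroversion <- ifelse(complete_share >= 0.75, rowMeans(items, na.rm = TRUE), NA)
extroversion
# participants 1, 2 and 4 get a score (4, ~4.33, 3); participant 3 is NA
# because 3 of their 4 items are missing.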
I suppose one question we have to answer is: do we want describe_missing to only report basic missing info that is field-general, a bit more like skim(), or do we also want it to include the features specific to the survey format? (Or, said another way, should we remove or keep the survey feature?)
R/describe_missing.R
#' missing more than a single data point").*
#'
#' @param data The data frame to be analyzed.
#' @param vars Variable (or lists of variables) to check for missing values (NAs).
We use select, exclude, etc. in all other data frame functions, I think we should here as well.
Here it works a little bit differently than select elsewhere. vars takes a list of character vectors (such as list(c("openness_1", "openness_2", "openness_3"), c("extroversion_1", "extroversion_2", "extroversion_3"))) to take into account the nested structure of the items / columns. I can rename it to select, but do you think it will create confusion or expectations that it should rely on and work with .select_nse? Or should we include select and exclude in addition to vars? I'm not sure how .select_nse could accommodate the nested structure like I'm doing right now 🤔
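For what it's worth, here is a rough sketch (a hypothetical helper, not the actual implementation) of how scale-name prefixes could be expanded into that nested list-of-column-groups structure, which might be one way to keep a simpler user-facing argument:

# Hypothetical helper: expand scale-name prefixes into the nested
# list of column groups that vars/select would receive.
expand_scales <- function(data, scales) {
  lapply(scales, function(scale) names(data)[startsWith(names(data), scale)])
}

# e.g., with columns openness_1:openness_3 and extroversion_1:extroversion_3,
# expand_scales(df, c("openness", "extroversion")) would return
# list(c("openness_1", "openness_2", "openness_3"),
#      c("extroversion_1", "extroversion_2", "extroversion_3"))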
R/describe_missing.R
#' @keywords missing values NA guidelines
#' @return A dataframe with the following columns:
#' - `var`: Variables selected.
#' - `items`: Number of items for selected variables.
I think unique_values instead of items would be clearer.
Hmm, so in this case "number of items" refers to the number of columns selected for each "scale" or combination of variables. Maybe I should use that instead, as I'm afraid unique_values would suggest unique responses for a given column.
It is indeed specific, as in psychology we tend to think of variables as made of several "items". So items 1-10 create a variable such as a personality trait, "extroversion". I'm not sure what to call it because "variable" might be confused with "scale" (i.e., a composite score). Maybe I could just rename that output column "columns", but I'm open to your suggestions if you have more. A more accurate name (for psychology) would be n_items, so perhaps we can do n_columns?
Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com>
Thanks for the feedback and comments! We can definitely rename the column names for more clarity (e.g., to use names like n_missing).
There is one row per variable / scale, but each variable / scale can be defined by multiple items / columns, and so the output has to be able to accommodate that (the current strategy is to use the first:last column range notation, as in Ozone:Wind). But if I understand correctly, you would like the default, instead of reporting for all columns as an aggregate (i.e., always exactly 1 row), to report one row per column, for all columns. Although for large datasets this would create a long output, that could work.
Ok so I changed the default so that when no scales or variables are specified, all columns are reported on separate rows. However, this behaviour is overridden if scales or variables are specified:

library(datawizard)
# Use the entire data frame
set.seed(15)
fun <- function() {
c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA)
}
df <- data.frame(
ID = c("idz", NA),
openness_1 = fun(), openness_2 = fun(), openness_3 = fun(),
extroversion_1 = fun(), extroversion_2 = fun(), extroversion_3 = fun(),
agreeableness_1 = fun(), agreeableness_2 = fun(), agreeableness_3 = fun()
)
describe_missing(df)
#> variable n_columns n_missing cells missing_percent complete_percent
#> 1 ID 1 7 14 50.00 50.00
#> 2 openness_1 1 4 14 28.57 71.43
#> 3 openness_2 1 4 14 28.57 71.43
#> 4 openness_3 1 3 14 21.43 78.57
#> 5 extroversion_1 1 6 14 42.86 57.14
#> 6 extroversion_2 1 6 14 42.86 57.14
#> 7 extroversion_3 1 5 14 35.71 64.29
#> 8 agreeableness_1 1 3 14 21.43 78.57
#> 9 agreeableness_2 1 4 14 28.57 71.43
#> 10 agreeableness_3 1 3 14 21.43 78.57
#> 11 Total 10 45 140 32.14 67.86
#> missing_max missing_max_percent all_missing
#> 1 1 100 7
#> 2 1 100 4
#> 3 1 100 4
#> 4 1 100 3
#> 5 1 100 6
#> 6 1 100 6
#> 7 1 100 5
#> 8 1 100 3
#> 9 1 100 4
#> 10 1 100 3
#> 11 10 100 2
# If the questionnaire items start with the same name,
# one can list the scale names directly:
describe_missing(df, scales = c("ID", "openness", "extroversion", "agreeableness"))
#> variable n_columns n_missing cells missing_percent
#> 1 ID 1 7 14 50.00
#> 2 openness_1:openness_3 3 11 42 26.19
#> 3 extroversion_1:extroversion_3 3 17 42 40.48
#> 4 agreeableness_1:agreeableness_3 3 10 42 23.81
#> 5 Total 10 45 140 32.14
#> complete_percent missing_max missing_max_percent all_missing
#> 1 50.00 1 100 7
#> 2 73.81 3 100 3
#> 3 59.52 3 100 3
#> 4 76.19 3 100 3
#> 5 67.86 10 100 2
# Otherwise you can provide nested columns manually:
describe_missing(df,
select = list(
c("ID"),
c("openness_1", "openness_2", "openness_3"),
c("extroversion_1", "extroversion_2", "extroversion_3"),
c("agreeableness_1", "agreeableness_2", "agreeableness_3")
)
)
#> variable n_columns n_missing cells missing_percent
#> 1 ID 1 7 14 50.00
#> 2 openness_1:openness_3 3 11 42 26.19
#> 3 extroversion_1:extroversion_3 3 17 42 40.48
#> 4 agreeableness_1:agreeableness_3 3 10 42 23.81
#> 5 Total 10 45 140 32.14
#> complete_percent missing_max missing_max_percent all_missing
#> 1 50.00 1 100 7
#> 2 73.81 3 100 3
#> 3 59.52 3 100 3
#> 4 76.19 3 100 3
#> 5 67.86 10 100 2

Created on 2024-12-16 with reprex v2.1.1
I feel like most unresolved comments and questions regarding the documentation and the implementation are related to the scope of this function. I'd rather have a "generalist" function à la skimr::skim(). @easystats/core-team, what do you think? Are you interested in having some of those field-specific features in this function?
I tend to agree. This function should be more general purpose - and maybe a psych-centric wrapper can be housed in @rempsyc's package (I also just now noticed your handle is the name of the package 😅)
Codecov Report
Attention: Patch coverage is 97.98% (2 of the 99 added lines are missing coverage).
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #561      +/-   ##
==========================================
+ Coverage   91.14%   91.25%   +0.11%
==========================================
  Files          76       77       +1
  Lines        6045     6144      +99
==========================================
+ Hits         5510     5607      +97
- Misses        535      537       +2

☔ View full report in Codecov by Sentry.
If I understand, the main outstanding issue is what to do with the
Alright, in this case, I think I can introduce
Fixes #454