diff --git a/04-roadmap.Rmd b/04-roadmap.Rmd index fd71cd7..0ed7a4b 100644 --- a/04-roadmap.Rmd +++ b/04-roadmap.Rmd @@ -31,9 +31,10 @@ efficient estimators of the statistical estimand. The roadmap is a six-stage process: -1. Define the data as a random variable with a probability distribution, $O \sim P_0$ +1. Define the data as a random variable with a probability distribution, $O \sim + P_0$ 2. Specify the statistical model $\M$ realistically, such that $P_0 \in \M$ -3. Translate the scientific question of interest into a statistical target +3. Translate the scientific question of interest into a statistical target parameter $\Psi$ and establish the target population 4. Choose an estimator $\hat{\Psi}$ for $\Psi$ under realistic $\M$ 5. Construct a measure of uncertainty for the estimate $\hat{\Psi}(P_n)$ @@ -44,154 +45,151 @@ The roadmap is a six-stage process: The dataset we are confronted with is the collection of the results of a scientific (or natural) experiment. We can view the data as a _random variable_; -that is, if the same experiment were to be repeated, we should expect to -see a different realization of the data generated by the same underlying law -governing the experiment in question. In particular, if the experiment were -repeated many times, the underlying probability distribution generating the -data, $P_0$, would be revealed. The observed data on a single unit, $O$, -may be thought of as being drawn from this probability distribution $P_0$. Most -often, we have $n$ _independent and identically distributed_ (i.i.d.) -observations of the random variable $O$ in our dataset. Then, the observed data -is the collection -$O_1, \ldots, O_n$, where the subscripts denote the individual observational -units. While not all data are i.i.d., this is certainly the most common case in -applied data analysis. There are a number of techniques for handling non-i.i.d. -data, including establishing conditional independence, such that -conditional on some variable (e.g., subject ID for repeated measures data) the -i.i.d. assumption holds, and incorporating inferential corrections for -repeated or clustered observations, to name but a few. +that is, if the same experiment were to be repeated, we should expect to see a +different realization of the data generated by the same underlying law governing +the experiment in question. In particular, if the experiment were repeated many +times, the underlying probability distribution generating the data, $P_0$, would +be revealed. The observed data on a single unit, $O$, may be thought of as being +drawn from this probability distribution $P_0$. Most often, we have $n$ +_independent and identically distributed_ (i.i.d.) observations of the random +variable $O$ in our dataset. Then, the observed data is the collection $O_1, +\ldots, O_n$, where the subscripts denote the individual observational units. +While not all data are i.i.d., this is certainly the most common case in applied +data analysis. There are a number of techniques for handling non-i.i.d. data, +including establishing conditional independence, such that conditional on some +variable (e.g., subject ID for repeated measures data) the i.i.d. assumption +holds, and incorporating inferential corrections for repeated or clustered +observations, to name but a few. #### The empirical probability measure, $P_n$ {-} With $n$ i.i.d. observations in hand, we can define an empirical probability measure, $P_n$. 
The empirical probability measure is an approximation of the true probability measure, $P_0$, allowing us to learn from the observed data. -For example, we can define the empirical probability measure of a set of -variables, say $W$, to be the proportion of observations that belong in $W$. +For example, we can define the empirical probability measure of a set of +variables, say $W$, to be the proportion of observations that belong in $W$. That is, \begin{equation*} P_n(W) = \frac{1}{n}\sum_{i=1}^{n} \I(O_i \in W) \end{equation*} -In order to understand the scope for learning from a particular dataset, we -next need to ask *"What do we know about the process that led to the data's +In order to understand the scope for learning from a particular dataset, we next +need to ask *"What do we know about the process that led to the data's generation?"* This brings us on to Step 2. ### (2) Defining the statistical model $\M$ such that $P_0 \in \M$ {-} -The statistical model $\M$ is the set of all possible probability -distributions that could describe the process by which our observed data have -been generated, appropriately constrained by background scientific knowledge. -Often, $\M$ is necessarily very large (i.e., non-parametric), reflecting the fact -that statistical knowledge about $P_0$ is limited. +The statistical model $\M$ is the set of all possible probability distributions +that could describe the process by which our observed data have been generated, +appropriately constrained by background scientific knowledge. Often, $\M$ is +necessarily very large (i.e., non-parametric), reflecting the fact that +statistical knowledge about $P_0$ is limited. -If $P_0$ is described by a finite number of parameters, then the statistical -model is referred to as _parametric_. Such an assumption is made, for example, -by the proposition that $O$ has a Normal distribution with mean $\mu$ -and variance $\sigma^2$. More generally, a parametric model may be defined as +If $P_0$ is described by a finite number of parameters, then the statistical +model is referred to as _parametric_. Such an assumption is made, for example, +by the proposition that $O$ has a Normal distribution with mean $\mu$ and +variance $\sigma^2$. More generally, a parametric model may be defined as \begin{equation*} \M(\theta) = \{P_{\theta} : \theta \in \R^d \}, \end{equation*} which describes a constrained statistical model consisting of all distributions -$P_{\theta}$ that are indexed by some finite, $d$-dimensional parameter $\theta$. - -The assumption that $P_0$ has a specific, parametric form is made quite -commonly. Unfortunately, this is even the case when such -assumptions are not supported by domain knowledge about the data-generating -process. This -practice of oversimplification in the current, and traditional, culture of -statistical data analysis typically complicates or entirely thwarts any attempt -to reliably answer the scientific question at hand. Why, you ask? Consider how -much knowledge one must have to _know_ (beyond a shadow of a doubt) that the +$P_{\theta}$ that are indexed by some finite, $d$-dimensional parameter +$\theta$. + +The assumption that $P_0$ has a specific, parametric form is made quite +commonly. Unfortunately, this is even the case when such assumptions are not +supported by domain knowledge about the data-generating process. 
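To make the stakes concrete, consider a minimal simulated sketch (purely illustrative; the lognormal "truth" and the target $P_0(O > 5)$ below are assumptions of this example, not part of the roadmap). The empirical measure $P_n$ defined above involves no modeling choices, whereas a plug-in based on an unsupported Normal model can misstate the same quantity no matter how large $n$ grows:

```{r}
# Illustrative sketch: the empirical measure P_n(W) versus a misspecified
# parametric plug-in, for the set W = {O > 5}, under a hypothetical skewed P_0.
set.seed(34)
n <- 5000
o <- rlnorm(n, meanlog = 1, sdlog = 0.8)  # hypothetical draws from a lognormal P_0

# empirical measure: P_n(W) = (1/n) * sum_i I(O_i in W)
p_n <- mean(o > 5)

# plug-in under the (unsupported) assumption that P_0 is Normal
p_norm <- 1 - pnorm(5, mean = mean(o), sd = sd(o))

# the truth under this simulation's P_0
p_0 <- 1 - plnorm(5, meanlog = 1, sdlog = 0.8)

round(c(truth = p_0, empirical = p_n, normal_model = p_norm), 3)
```

The empirical answer converges to the truth, while the Normal-model answer remains biased at any sample size because the assumed model never contained $P_0$.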
This practice +of oversimplification in the current, and traditional, culture of statistical +data analysis typically complicates or entirely thwarts any attempt to reliably +answer the scientific question at hand. Why, you ask? Consider how much +knowledge one must have to _know_ (beyond a shadow of a doubt) that the data-generating distribution underlying a given dataset is, in fact, governed by -just two parameters, as is the case with the ubiquitously -relied upon Normal distribution. Similarly, main terms Cox proportional hazards, -logistic regression, and linear models imply a highly constrained statistical -model, and if any of the assumptions are unwarranted then there will be bias in -their result (except when treatment is randomized). The philosophy used to justify -parametric assumptions is rooted in misinterpretations of the often-quoted -saying of George Box, that "All models are wrong but some are useful", which has +just two parameters, as is the case with the ubiquitously relied upon Normal +distribution. Similarly, main terms Cox proportional hazards, logistic +regression, and linear models imply a highly constrained statistical model, and +if any of the assumptions are unwarranted then there will be bias in their +result (except when treatment is randomized). The philosophy used to justify +parametric assumptions is rooted in misinterpretations of the often-quoted +saying of George Box, that "All models are wrong but some are useful", which has been irresponsibly used to encourage the data analyst to make arbitrary modeling -choices. However, when one makes such unfounded assumptions, it is more likely -that $\M$ does not contain $P_0$, in which case the statistical model -is said to be misspecified. Statistical model misspecification introduces a -bias that leads to misleading, unrealiable results and inference. - -The result of unwarranted assumptions and oversimplifications is a practice of -statistical data science in which starkly disparate answers to the same -scientific problem emerge. Practically, this is owed to the application of distinct -statistical techniques under differing modeling decisions and assumptions made -(but not communicated well) by different data analysts. Even in the nascent days -of statistical data analysis, it was recognized that it is "far +choices. However, when one makes such unfounded assumptions, it is more likely +that $\M$ does not contain $P_0$, in which case the statistical model is said to +be misspecified. Statistical model misspecification introduces a bias that leads +to misleading, unreliable results and inference. + +The result of unwarranted assumptions and oversimplifications is a practice of +statistical data science in which starkly disparate answers to the same +scientific problem emerge. Practically, this is owed to the application of +distinct statistical techniques under differing modeling decisions and +assumptions made (but not communicated well) by different data analysts. Even in +the nascent days of statistical data analysis, it was recognized that it is "far better [to develop] an approximate answer to the right question...than an exact answer to the wrong question, which can always be made precise" [@tukey1962future], though traditional statistics failed to heed this advice for -a number of decades [@donoho2017fifty]. The roadmap avoids -this bias by defining the statistical model through a representation of the true -data-generating distribution underlying the observed data. 
The ultimate goal is -to formulate the statistical estimation problem _precisely_ (up to the -constraints imposed by available scientific knowledge), so that one -can then tailor the estimation procedure to the motivating scientific problem. - -It is crucial that the domain scientist(s) have absolute -clarity about what is _actually known_ about the process/experiment that -generated the data, and that this is communicated to data scientists with as -much detail as possible. This knowledge is rarely ground truth -itself, but instead comes in the form of scientific conventions, accepted -hypotheses, and operational assumptions. +a number of decades [@donoho2017fifty]. The roadmap avoids this bias by defining +the statistical model through a representation of the true data-generating +distribution underlying the observed data. The ultimate goal is to formulate the +statistical estimation problem _precisely_ (up to the constraints imposed by +available scientific knowledge), so that one can then tailor the estimation +procedure to the motivating scientific problem. + +It is crucial that the domain scientist(s) have absolute clarity about what is +_actually known_ about the process/experiment that generated the data, and that +this is communicated to data scientists with as much detail as possible. This +knowledge is rarely ground truth itself, but instead comes in the form of +scientific conventions, accepted hypotheses, and operational assumptions. -It is then the data scientist's responsibility to translate the domain knowledge -into statistical knowledge about $P_0$, and then to define the statistical -model $\M$ so that it respects what is known about $P_0$ and makes no further -restrictions. In this manner, we can ensure that $P_0$ is contained in $\M$, +It is then the data scientist's responsibility to translate the domain knowledge +into statistical knowledge about $P_0$, and then to define the statistical model +$\M$ so that it respects what is known about $P_0$ and makes no further +restrictions. In this manner, we can ensure that $P_0$ is contained in $\M$, which we refer to generally as defining a _realistic_ statistical model $\M$. -Defining $\M$ realistically requires a shift in the paradigm of statistical problem -solving. Instead of considering the methods/software one is familiar with and -then trying to solve most problems with that toolbox, one must obtain a deep -understanding of the experiment and scientific question first and then formulate -a plan for learning from the data in a way that respects this. This requires -statisticians to have not only solid methodological and -theoretical foundations, but good communication skills, as -several meetings with domain experts are typically required to review details -of the study, possibly refine of the question of interest, translate technical -details, and interpret the findings in a way that is statistically correct and -agreeable with non-statistician domain experts. Unfortunately, communication -between statisticians and non-statistician researchers is often fraught with -misinterpretation. This is to be expected, as each have their own expertise, -but proper communication about the underlying science and the motivating study -can help to ensure each have appropriate context for a given statistical data -analysis. 
The roadmap provides a principled mechanism for learning from data -realistically, so that what is learned from the data represents a reliable and -reproducible approximation of the answer to the scientific question of interest. -As the roadmap provides a rigorous method for translating scientific knowledge and -questions into a statistical framework that can be used to learn from data, it -is an invaluable tool to guide communication between statisticians and -non-statistician domain scientists. This brings us to our next step in the +Defining $\M$ realistically requires a shift in the paradigm of statistical +problem solving. Instead of considering the methods/software one is familiar +with and then trying to solve most problems with that toolbox, one must obtain a +deep understanding of the experiment and scientific question first and then +formulate a plan for learning from the data in a way that respects this. This +requires statisticians to have not only solid methodological and theoretical +foundations, but also good communication skills, as several meetings with domain +experts are typically required to review details of the study, possibly refine +the question of interest, translate technical details, and interpret the +findings in a way that is statistically correct and agreeable with +non-statistician domain experts. Unfortunately, communication between +statisticians and non-statistician researchers is often fraught with +misinterpretation. This is to be expected, as each has their own expertise, but +proper communication about the underlying science and the motivating study can +help to ensure each has appropriate context for a given statistical data +analysis. The roadmap provides a principled mechanism for learning from data +realistically, so that what is learned from the data represents a reliable and +reproducible approximation of the answer to the scientific question of interest. +As the roadmap provides a rigorous method for translating scientific knowledge +and questions into a statistical framework that can be used to learn from data, +it is an invaluable tool to guide communication between statisticians and +non-statistician domain scientists. This brings us to our next step in the roadmap, *"What are we trying to learn from the data?"* ### (3) The statistical target parameter $\Psi$ and statistical estimand $\psi_0$ {-} -The statistical target parameter, -$\Psi$, is defined as a mapping from the -statistical model, $\M$, to the parameter space. Usually, the parameter -space is a real number (but not necessarily so), in which case we can -formally define the target parameter as the mapping $\Psi: \M \rightarrow \R$. -The statistical estimand may be seen as a representation of the quantity that -we wish to learn from the data, the answer to a well-specified --- often causal --- -question of interest about a particular target population. In contrast to -ordinary statistical estimands, causal estimands require an extra set of -assumptions to allow for their _identification from the observed data_. Based -on causal models [@pearl2009causality; @hernan2022causal], identification -assumptions are untestable and must be justified through a combination of -knowledge about the system under study or the process by which the experiment -was conducted. These assumptions are described in greater detail in the -following section on [causal target parameters](#causal).
+The statistical target parameter, $\Psi$, is defined as a mapping from the +statistical model, $\M$, to the parameter space. Usually, the parameter space is +the real line (but not necessarily so), in which case we can formally define the +target parameter as the mapping $\Psi: \M \rightarrow \R$. The statistical +estimand may be seen as a representation of the quantity that we wish to learn +from the data, the answer to a well-specified -- often causal -- question of +interest about a particular target population. In contrast to ordinary +statistical estimands, causal estimands require an extra set of assumptions to +allow for their _identification from the observed data_. Based on causal models +[@pearl2009causality; @hernan2022causal], identification assumptions are +untestable and must be justified through a combination of knowledge about the +system under study and the process by which the experiment was conducted. These +assumptions are described in greater detail in the following section on [causal +target parameters](#causal). For a simple example, consider a dataset containing observations of a survival time on every adult, for which our question of interest is "What's the @@ -201,39 +199,39 @@ probability that an adult lives longer than five years?" We have, \begin{equation*} \psi_0 = \Psi(P_0) = \E_{P_0}[\I(O > 5)] = \int_5^{\infty} dP_0(o). \end{equation*} -This answer to this question is the **statistical estimand, $\Psi(P_0)=\psi_0$**, -which is the quantity we wish to learn from the data. As discussed above, -back-and-forth communication between domain scientists and statisticians is -often required to define $\M$ realistically, and to finalize $\Psi$ and the -target population such that the question is supported in the data. -For instance, say we are interested in learning the average effect of a headache -medication for treating migraines in adults and we learn that no one with high -blood pressure can receive the medication. In the next meeting with domain -scientists, we might suggest that the target population be modified to adults -without high blood pressure or ask a question involving a dynamic treatment -such that within $\Psi$ adults with high blood pressure are never considered -as individuals who could receive treatment. Once we have defined -$O$, $\M$ realistically and $\Psi$, we have formally defined the statistical -estimation problem. Next comes Step 4: "_How do we learn from the data the -approximate answer to the question of interest?_" +The answer to this question is the **statistical estimand, +$\Psi(P_0)=\psi_0$**, which is the quantity we wish to learn from the data. As +discussed above, back-and-forth communication between domain scientists and +statisticians is often required to define $\M$ realistically, and to finalize +$\Psi$ and the target population such that the question is supported in the +data. For instance, say we are interested in learning the average effect of a +headache medication for treating migraines in adults and we learn that no one +with high blood pressure can receive the medication. In the next meeting with +domain scientists, we might suggest that the target population be modified to +adults without high blood pressure or ask a question involving a dynamic +treatment such that within $\Psi$ adults with high blood pressure are never +considered as individuals who could receive treatment. Once we have defined $O$, +a realistic $\M$, and $\Psi$, we have formally defined the statistical +estimation problem.
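Before turning to estimation, the mapping view of $\Psi$ can be made concrete with a small sketch (our own illustration; the candidate distributions are hypothetical stand-ins for elements of $\M$):

```{r}
# The target parameter Psi as a mapping from distributions to numbers:
# Psi(P) = P(O > 5), with each distribution represented here by its CDF.
Psi <- function(cdf) 1 - cdf(5)

# applying Psi to two hypothetical candidate laws in M yields two values
Psi(function(x) pexp(x, rate = 1 / 4))                # an exponential candidate
Psi(function(x) plnorm(x, meanlog = 1, sdlog = 0.8))  # a lognormal candidate
```

The statistical estimand $\psi_0$ is simply this mapping evaluated at the true (unknown) $P_0$.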
Next comes Step 4: "_How do we learn from the data the +approximate answer to the question of interest?_" ### (4) The estimator $\hat{\Psi}$ and estimate $\psi_n$ {-} To obtain a good approximation of the statistical estimand, we need an estimator -$\hat{\Psi}, an _a priori_-specified algorithm defined as a mapping from the set -of the set of possible empirical distributions $P_n$ (which live in a -non-parametric statistical model $\M_{NP}$) to the parameter space for our -target parameter of interest: $\hat{\Psi} : \M_{NP} \rightarrow \R$. In other -words, $\hat{\Psi} is a function that takes as input the observed data, a -realization of $P_n$, and then outputs a value in the parameter space. Where -the estimator may be seen as an operator that maps the observed data's -corresponding empirical distribution to a value in the parameter space, the -numerical output produced by such a function is the -**estimate, $\hat{\Psi}(P_n)=\psi_n$**. Thus, $\psi_n$ is an -element of the parameter space as informed by the empirical probability -distribution $P_n$ of the observed data $O_1, \ldots, O_n$. If we plug in a -realization of $P_n$ (based on a sample size $n$ of the random variable $O$), we -get back an estimate $\psi_n$ of the true parameter value $\psi_0$. +$\hat{\Psi}$, an _a priori_-specified algorithm defined as a mapping from the set +of possible empirical distributions $P_n$ (which live in a +non-parametric statistical model $\M_{NP}$) to the parameter space for our +target parameter of interest: $\hat{\Psi} : \M_{NP} \rightarrow \R$. In other +words, $\hat{\Psi}$ is a function that takes as input the observed data, a +realization of $P_n$, and then outputs a value in the parameter space. Where the +estimator may be seen as an operator that maps the observed data's corresponding +empirical distribution to a value in the parameter space, the numerical output +produced by such a function is the **estimate, $\hat{\Psi}(P_n)=\psi_n$**. Thus, +$\psi_n$ is an element of the parameter space as informed by the empirical +probability distribution $P_n$ of the observed data $O_1, \ldots, O_n$. If we +plug in a realization of $P_n$ (based on a sample size $n$ of the random +variable $O$), we get back an estimate $\psi_n$ of the true parameter value +$\psi_0$. -As we have motivated in step 2, it is imperative to consider realistic -statistical models for estimation. Therefore, flexible estimators that allow -for parts of the data-generating process to be unrestricted are necessary. -Semiparametric statistical theory and empirical process theory provide a -framework for constructing, benchmarking, and understanding the behavior of -estimators that depend on flexible estimation strategies in realistic -statistical models. In general, desirable properties of an estimator are that -it is regular asymptotically linear (RAL) and efficient, thereby admitting a -Normal limit distribution that has minimal variance. Substitution/plug-in RAL -estimators are also advantageous: they are guaranteed to remain within the -bounds of $\M$ and, relative to estimators that are not plug-in, have improved -bias and variance in finite samples. In-depth discussion of the theory and -these properties are available in the literature [e.g., @kennedy2016semiparametric; @vdl2011targeted]. We review a few key concepts in the following step. +As we have motivated in step 2, it is imperative to consider realistic +statistical models for estimation.
Therefore, flexible estimators that allow for +parts of the data-generating process to be unrestricted are necessary. +Semiparametric statistical theory and empirical process theory provide a +framework for constructing, benchmarking, and understanding the behavior of +estimators that depend on flexible estimation strategies in realistic +statistical models. In general, desirable properties of an estimator are that it +is regular asymptotically linear (RAL) and efficient, thereby admitting a Normal +limit distribution that has minimal variance. Substitution/plug-in RAL +estimators are also advantageous: they are guaranteed to remain within the +bounds of $\M$ and, relative to estimators that are not plug-in, have improved +bias and variance in finite samples. In-depth discussion of the theory and these +properties are available in the literature [e.g., @kennedy2016semiparametric; +@vdl2011targeted]. We review a few key concepts in the following step. In order to quantify the uncertainty in our estimate of the target parameter, part of the process of conducting statistical inference, an understanding of the -sampling distribution of our estimator is necessary. This brings us to Step -5: "_How confident should we be in our statistical answer to the scientific +sampling distribution of our estimator is necessary. This brings us to Step 5: +"_How confident should we be in our statistical answer to the scientific question?_" ### (5) A measure of uncertainty for the estimate $\psi_n$ {-} Since the estimator $\hat{\Psi}$ is a function of the empirical distribution $P_n$, the estimator itself is a random variable with a sampling distribution. -Therefore, if we repeat the experiment of drawing $n$ observations, we would -every time end up with a different realization of our estimate. The hypothetical +Therefore, if we repeat the experiment of drawing $n$ observations, we would +end up with a different realization of our estimate every time. The hypothetical distribution of these estimates is the sampling distribution of the estimator. A primary goal in the construction of estimators is to be able to derive their -asymptotic sampling distribution through a theoretical analysis involving -empirical process theory. In this regard, an important property of the -estimators on which we focus is their asymptotic linearity. In particular, -asymptotic linearity states that the difference between the estimator and the -target parameter (i.e., the truth) can be represented, asymptotically, as an -average of i.i.d. random variables plus an asymptotically negligible remainder +asymptotic sampling distribution through a theoretical analysis involving +empirical process theory. In this regard, an important property of the +estimators on which we focus is their asymptotic linearity. In particular, +asymptotic linearity states that the difference between the estimator and the +target parameter (i.e., the truth) can be represented, asymptotically, as an +average of i.i.d. random variables plus an asymptotically negligible remainder term: \begin{equation*} \hat{\Psi}(P_n) - \Psi(P_0) = \frac{1}{n} \sum_{i=1}^n IC(P_0)(O_i) + o_p(n^{-1/2}), \end{equation*} -where the influence curve (IC) is a function of the observed data $O$ but the -function itself is defined by the underlying data-generating distribution $P_0$.
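As a deliberately simple illustration (a sketch continuing the five-year survival example, not the handbook's code): for the plug-in estimator $\psi_n = P_n(O > 5)$, the influence curve is $IC(P_0)(O) = \I(O > 5) - \psi_0$, and plugging in $\psi_n$ gives estimated influence curve values whose sample variance yields a standard error:

```{r}
# Influence curve-based inference for the plug-in estimator of P_0(O > 5).
set.seed(286)
o <- rlnorm(1000, meanlog = 1, sdlog = 0.8)  # hypothetical survival times

psi_n <- mean(o > 5)                  # plug-in estimate, psi_n = P_n(O > 5)
ic_hat <- as.numeric(o > 5) - psi_n   # estimated influence curve values
se_n <- sd(ic_hat) / sqrt(length(o))  # IC-based standard error estimate

# Wald-style 95% CI implied by the Normal limit distribution
psi_n + c(-1, 1) * qnorm(0.975) * se_n
```

This is exactly the construction formalized next.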
-Based on this asymptotic approximation, the Central Limit Theorem can be used to +where the influence curve (IC) is a function of the observed data $O$ but the +function itself is defined by the underlying data-generating distribution $P_0$. +Based on this asymptotic approximation, the Central Limit Theorem can be used to show \begin{equation*} @@ -299,111 +298,108 @@ example, a standard $(1 - \alpha)$ CI takes the form \begin{equation*} \psi_n \pm Z \frac{\hat{\sigma}_{IC}}{\sqrt{n}} \ , \end{equation*} -where $Z$ is the $(1 - \alpha / 2)^\text{th}$ -quantile of the standard Normal distribution. Following convention, we will -often be interested in constructing 95% two-tailed CIs, corresponding to -probability mass $\alpha/2 = 0.025$ in each tail of the limit distribution; -thus, we will take $Z \approx 1.96$ as the quantile. -Steps (1)--(5) of the roadmap define the statistical analysis plan, all of which -can be done before any data is revealed. The last step of the roadmap involves -interpreting the results obtained in step (4) and (5) and therefore requires -the data to be analyzed; however, any additional analysis that may take place -as part of step (6) can be pre-specified as well. This final step of the -roadmap addresses the question, "_what is the interpretation and robustness -of the study's findings, and what conclusions can be drawn from them?_" +where $Z$ is the $(1 - \alpha / 2)^\text{th}$ quantile of the standard Normal +distribution. Following convention, we will often be interested in constructing +95% two-tailed CIs, corresponding to probability mass $\alpha/2 = 0.025$ in each +tail of the limit distribution; thus, we will take $Z \approx 1.96$ as the +quantile. + +Steps (1)--(5) of the roadmap define the statistical analysis plan, all of which +can be done before any data is revealed. The last step of the roadmap involves +interpreting the results obtained in steps (4) and (5) and therefore requires the +data to be analyzed; however, any additional analysis that may take place as +part of step (6) can be pre-specified as well. This final step of the roadmap +addresses the question, "_What is the interpretation and robustness of the +study's findings, and what conclusions can be drawn from them?_" ### (6) Make substantive conclusion {-} -Making the substantive conclusion involves interpreting the study findings. It -also provides an opportunity to ask follow-up questions that might be -addressed later and/or discuss issues that can inform future studies. -Statistical estimands $\psi_0$ can have statistical (noncausal) and causal -interpretations. Both are often of interest and can be provided. The target -population should be clearly mentioned in the interpretation, regardless of -whether it's a purely statistical or causal interpretation, to curtail -extrapolation of results. -The major distinction between statistical versus causal interpretations is that -the latter relies on untestable so-called "identifiability" assumptions. In the -following section, we review these -assumptions one-by-one. Here, we focus on the interpretation and robustness of -the study findings with respect to them. Specifically, causal target parameters -cannot be estimated from observed data without additional identifiability -assumptions, and so the validity of a result's causal interpretation -hinges on them holding in the data. The more these assumptions do not hold, the -larger the _causal gap_, the difference between the statistical estimand and -the causal estimand.
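A toy simulation (entirely hypothetical) makes the causal gap tangible: when a common cause drives both treatment and outcome, the unadjusted statistical contrast differs from the causal effect it is often mistaken for:

```{r}
# Hypothetical simulation of a causal gap: a confounder W affects both
# treatment A and outcome Y, so E(Y | A = 1) - E(Y | A = 0) is not the ATE.
set.seed(73)
n <- 1e5
w <- rbinom(n, 1, 0.5)                 # confounder
a <- rbinom(n, 1, plogis(-1 + 2 * w))  # treatment probability depends on W
y <- rnorm(n, mean = a + 2 * w)        # true causal effect of A on Y is 1

unadjusted <- mean(y[a == 1]) - mean(y[a == 0])  # statistical contrast
c(unadjusted = unadjusted, causal_gap = unadjusted - 1)
```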
In a perfect randomized control trial with no loss to -follow-up, the causal gap will be zero as the statistical and causal estimands -are equivalent. In @diaz2013sensitivity, a non-parametric sensitivity analysis -for assessing the impact of a hypothesized causal gaps on estimates and -inference is proposed. In @gruber2022targeted and @gruber2022evaluating, -there are example implementations of the methods proposed in -@diaz2013sensitivity; in particular, the difference between -adjusted and unadjusted effect estimates is used to define a range of -possible causal gaps relative to this difference. If the question of interest -is causal, then such a model-free sensitivity analysis (possibly as a -complement to other sensitivity analyses) is recommended to assess the -robustness of the study findings. + +Making the substantive conclusion involves interpreting the study findings. It +also provides an opportunity to ask follow-up questions that might be addressed +later and/or discuss issues that can inform future studies. Statistical +estimands $\psi_0$ can have statistical (noncausal) and causal interpretations. +Both are often of interest and can be provided. The target population should be +clearly mentioned in the interpretation, regardless of whether it's a purely +statistical or causal interpretation, to curtail extrapolation of results. +The major distinction between statistical and causal interpretations is that +the latter relies on untestable so-called "identifiability" assumptions. In the +following section, we review these assumptions one-by-one. Here, we focus on the +interpretation and robustness of the study findings with respect to them. +Specifically, causal target parameters cannot be estimated from observed data +without additional identifiability assumptions, and so the validity of a +result's causal interpretation hinges on them holding in the data. The more +these assumptions do not hold, the larger the _causal gap_, the difference +between the statistical estimand and the causal estimand. In a perfect +randomized controlled trial with no loss to follow-up, the causal gap will be zero +as the statistical and causal estimands are equivalent. In @diaz2013sensitivity, +a non-parametric sensitivity analysis for assessing the impact of a hypothesized +causal gap on estimates and inference is proposed. In @gruber2022targeted and +@gruber2022evaluating, there are example implementations of the methods proposed +in @diaz2013sensitivity; in particular, the difference between adjusted and +unadjusted effect estimates is used to define a range of possible causal gaps +relative to this difference. If the question of interest is causal, then such a +model-free sensitivity analysis (possibly as a complement to other sensitivity +analyses) is recommended to assess the robustness of the study findings. ## Summary of the Roadmap {#roadmap-summary} Data collected across $n$ i.i.d. units, $O_1, \ldots, O_n$, may be viewed as a -collection of random variables arising from the -same underlying probability distribution $\P_0$. This is expressed by denoting -the collection of data as being generated as $O_1, \ldots, O_n \sim P_0$.
-Domain knowledge about the experiment that generated the data (e.g., if the -treatment was randomized, if the treatment decision or loss to follow-up -depended on a subset of covariates, time ordering in which the variables were -added to the data) is translated by the statistician / data scientist to define -the statistical model $\M$, a postulated space of candidate probability -distributions that is supposed to contain $P_0$. In particular, the roadmap -emphasizes the critical role of defining $\M$ such that $P_0$ is guaranteed to -be encapsulated by it, $P_0 \in \M$. By only limiting $\M$ based on domain -knowledge about the experiment (i.e., reality) --- opposed to constraining it -unrealistically (e.g., assuming a restrictive functional form, like a main -terms linear/logistic model, describes $P_0$) --- it can be ensured that -$P_0 \in \M$, and we refer to this as defining a realistic statistical model. -Often, knowledge that can be used to constrain $\M$ is very limited, and so -$\M$ must be very large to define it such that $P_0 \in \M$; hence, realistic -statistical models are often termed semi- or non-parametric, since they are -too large to be indexed by a finite-dimensional set of parameters. Necessarily, -our statistical query must begin with, "What are we trying to learn from the data?", -a question whose answer is captured by the statistical target parameter, $\Psi$, -a function defined by the true data-generating distribution $P_0$, that -maps $\M$ into the statistical estimand, $\psi_0$. At this stage, the -statistical estimation problem is formally defined, allowing for the use of -statistical theory to guide the construction of estimators, which are -algorithms that approximate the answer the question of interest by learning -from the data. Desirable properties of an estimator are that it is unbiased, -efficient, plug-in, and robust in finite samples. If the question of interest -is causal, then a model-free sensitivity analysis is recommended to assess the -robustness of the study's findings under various hypothesized causal gaps. +collection of random variables arising from the same underlying probability +distribution $\P_0$. This is expressed by denoting the collection of data as +being generated as $O_1, \ldots, O_n \sim P_0$. Domain knowledge about the +experiment that generated the data (e.g., if the treatment was randomized, if +the treatment decision or loss to follow-up depended on a subset of covariates, +the time ordering in which the variables were added to the data) is translated by +the statistician / data scientist to define the statistical model $\M$, a +postulated space of candidate probability distributions that is supposed to +contain $P_0$. In particular, the roadmap emphasizes the critical role of +defining $\M$ such that $P_0$ is guaranteed to be encapsulated by it, $P_0 \in +\M$. By only limiting $\M$ based on domain knowledge about the experiment (i.e., +reality) --- as opposed to constraining it unrealistically (e.g., assuming a +restrictive functional form, like a main terms linear/logistic model, describes +$P_0$) --- it can be ensured that $P_0 \in \M$, and we refer to this as defining +a realistic statistical model. Often, knowledge that can be used to constrain +$\M$ is very limited, and so $\M$ must be very large to define it such that $P_0 +\in \M$; hence, realistic statistical models are often termed semi- or +non-parametric, since they are too large to be indexed by a finite-dimensional +set of parameters.
Necessarily, our statistical query must begin with, "What are +we trying to learn from the data?", a question whose answer is captured by the +statistical target parameter, $\Psi$, a mapping on $\M$ that, applied to the true +data-generating distribution $P_0$, yields the statistical +estimand, $\psi_0$. At this stage, the statistical estimation problem is +formally defined, allowing for the use of statistical theory to guide the +construction of estimators, which are algorithms that approximate the answer to the +question of interest by learning from the data. Desirable properties of an +estimator are that it is unbiased, efficient, plug-in, and robust in finite +samples. If the question of interest is causal, then a model-free sensitivity +analysis is recommended to assess the robustness of the study's findings under +various hypothesized causal gaps. ## Causal Target Parameters {#causal} In many cases, we are interested in problems that ask questions regarding the _causal effect_ of an intervention, whether an assigned treatment (e.g., a -prescribed drug) or a "naturally occurring" exposure (e.g., pollution from -a nearby factory), on a future outcome of interest. These causal -effects may be defined as summaries of the population of interest (e.g., -population mean of a particular outcome) under contrasting interventions -(e.g., comparing the treated to the untreated condition). -For example, a causal effect could be -defined as the mean difference of a disease outcome between two -_causal contrasts_, counterfactual cases in which the study population were set -to uniformly experience low pollution levels for some pollutant, and in which -the same population were set to uniformly experience high levels of the same pollutant. +prescribed drug) or a "naturally occurring" exposure (e.g., pollution from a +nearby factory), on a future outcome of interest. These causal effects may be +defined as summaries of the population of interest (e.g., population mean of a +particular outcome) under contrasting interventions (e.g., comparing the treated +to the untreated condition). For example, a causal effect could be defined as +the mean difference of a disease outcome between two _causal contrasts_, +counterfactual cases in which the study population were set to uniformly +experience low pollution levels for some pollutant, and in which the same +population were set to uniformly experience high levels of the same pollutant. There are different ways of operationalizing the theoretical experiments that generate the counterfactual data necessary for describing such causal contrasts of interest. We could simply assume that the counterfactual outcomes exist in theory for all treatment contrasts of interest [@neyman1938contribution; @rubin2005causal; @imbens2015causal], which may be encoded in so-called "science -tables". Alternatively, we could consider interventions on structural causal models (SCMs) -[@pearl1995causal; @pearl2009causality], which may be represented by directed -acyclic graphs (DAGs). Both frameworks allow for the known or hypothesized set -of relationships between variables in the system under study to be encoded and -mathematically formalized. +tables". Alternatively, we could consider interventions on structural causal +models (SCMs) [@pearl1995causal; @pearl2009causality], which may be represented +by directed acyclic graphs (DAGs).
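As a minimal sketch of the SCM view (the structural equations below are assumptions of this toy illustration, not of any particular study), each variable is generated by a deterministic function of its parents and exogenous noise, and an intervention simply replaces one of these functions:

```{r}
# A toy SCM: exogenous errors U and structural equations f_W, f_A, f_Y.
set.seed(11)
n <- 1e5
u_w <- runif(n); u_a <- runif(n); u_y <- rnorm(n)

f_w <- function(u_w) as.numeric(u_w < 0.4)
f_a <- function(w, u_a) as.numeric(u_a < plogis(-0.5 + w))
f_y <- function(w, a, u_y) 2 * w + a + u_y

w <- f_w(u_w)
a <- f_a(w, u_a)     # "natural" (observed) treatment assignment
y <- f_y(w, a, u_y)  # observed outcome

# intervening on the SCM: set A deterministically, holding the errors fixed
y1 <- f_y(w, 1, u_y)  # counterfactual outcome under A = 1
y0 <- f_y(w, 0, u_y)  # counterfactual outcome under A = 0
mean(y1 - y0)         # causal contrast (ATE); equal to 1 here by construction
```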
Both frameworks allow for the known or +hypothesized set of relationships between variables in the system under study to +be encoded and mathematically formalized. ### The Causal Model {-} @@ -591,5 +587,51 @@ different identifiability assumptions and the resulting estimands can be functions of different components of $P_0$. We discuss several more complex estimands in subsequent chapters. +