From 1553beae09fa0d9921813a482430e1e500180773 Mon Sep 17 00:00:00 2001 From: Nima Hejazi Date: Wed, 5 Jul 2023 13:50:53 -0700 Subject: [PATCH] fix math mode and enforce linebreaks --- 04-roadmap.Rmd | 36 +++++----- 05-origami.Rmd | 175 +++++++++++++++++++++++----------------------- 09-tmle3shift.Rmd | 2 +- 3 files changed, 108 insertions(+), 105 deletions(-) diff --git a/04-roadmap.Rmd b/04-roadmap.Rmd index 0ed7a4b..482f6be 100644 --- a/04-roadmap.Rmd +++ b/04-roadmap.Rmd @@ -111,7 +111,7 @@ regression, and linear models imply a highly constrained statistical model, and if any of the assumptions are unwarranted then there will be bias in their result (except when treatment is randomized). The philosophy used to justify parametric assumptions is rooted in misinterpretations of the often-quoted -saying of George Box, that "All models are wrong but some are useful", which has +saying of George Box, that "All models are wrong but some are useful," which has been irresponsibly used to encourage the data analyst to make arbitrary modeling choices. However, when one makes such unfounded assumptions, it is more likely that $\M$ does not contain $P_0$, in which case the statistical model is said to @@ -222,7 +222,7 @@ $\hat{\Psi}$, an _a priori_-specified algorithm defined as a mapping from the se of the set of possible empirical distributions $P_n$ (which live in a non-parametric statistical model $\M_{NP}$) to the parameter space for our target parameter of interest: $\hat{\Psi} : \M_{NP} \rightarrow \R$. In other -words, $\hat{\Psi} is a function that takes as input the observed data, a +words, $\hat{\Psi}$ is a function that takes as input the observed data, a realization of $P_n$, and then outputs a value in the parameter space. Where the estimator may be seen as an operator that maps the observed data's corresponding empirical distribution to a value in the parameter space, the numerical output @@ -233,26 +233,26 @@ plug in a realization of $P_n$ (based on a sample size $n$ of the random variable $O$, we get back an estimate $\psi_n$ of the true parameter value $\psi_0$. As we have motivated in step 2, it is imperative to consider realistic statistical models for estimation. Therefore, flexible estimators that allow for parts of the data-generating process to be unrestricted are necessary. -Semiparametric statistical theory and empirical process theory provide a -framework for constructing, benchmarking, and understanding the behavior of -estimators that depend on flexible estimation strategies in realistic -statistical models. In general, desirable properties of an estimator are that it -is regular asymptotically linear (RAL) and efficient, thereby admitting a Normal -limit distribution that has minimal variance. Substitution/plug-in RAL -estimators are also advantageous: they are guaranteed to remain within the -bounds of $\M$ and, relative to estimators that are not plug-in, have improved -bias and variance in finite samples. In-depth discussion of the theory and these -properties are available in the literature [e.g., @kennedy2016semiparametric; +Semiparametric theory and empirical process theory provide a framework for +constructing, benchmarking, and understanding the behavior of estimators that +depend on flexible estimation strategies in realistic statistical models. In +general, desirable properties of an estimator are that it is regular +asymptotically linear (RAL) and efficient, thereby admitting a Normal limit +distribution that has minimal variance. 
Substitution/plug-in RAL estimators are +also advantageous: they are guaranteed to remain within the bounds of $\M$ and, +relative to estimators that are not plug-in, have improved bias and variance in +finite samples. In-depth discussion of the theory and these properties are +available in the literature [e.g., @kennedy2016semiparametric; @vdl2011targeted]. We review a few key concepts in the following step. In order to quantify the uncertainty in our estimate of the target parameter, diff --git a/05-origami.Rmd b/05-origami.Rmd index 98b0ca5..695d6d4 100644 --- a/05-origami.Rmd +++ b/05-origami.Rmd @@ -32,44 +32,45 @@ By the end of this chapter you will be able to: `r if (knitr::is_latex_output()) '\\end{VT1}'` ## Introduction -Following the [_Roadmap for Targeted Learning_](#roadmap), we start to -elaborate on the estimation step in the current chapter. In order to generate an -estimate of the target parameter, we need to decide how to evaluate the quality -of our estimation procedure's performance. The performance, or error, of -any algorithm (estimator) corresponds to its generalizability to independent -datasets arising from the same data-generating process. Assessment of -the performance of an algorithm is extremely important --- it provides a -quantitative measure of how well the algorithm performs, and it guides the -choice of selecting among a set (or "library") of algorithms. In order to -assess the performance of an algorithm, or a library of them, we introduce the concept of a -**loss function**, which defines the **risk** or the **expected prediction error**. -Our goal is to estimate the true performance (risk) of our estimator. -In the next chapter, we elaborate on how to estimate the performance of a library -of algorithms in order to choose the best-performing one. In the following, we propose a -method to do so using the observed data and **cross-validation** procedures implemented -in the `origami` package [@coyle2018origami; @coyle-cran-origami]. +Following the [_Roadmap for Targeted Learning_](#roadmap), we start to elaborate +on the estimation step in the current chapter. In order to generate an estimate +of the target parameter, we need to decide how to evaluate the quality of our +estimation procedure's performance. The performance, or error, of any algorithm +(estimator) corresponds to its generalizability to independent datasets arising +from the same data-generating process. Assessment of the performance of an +algorithm is extremely important --- it provides a quantitative measure of how +well the algorithm performs, and it guides the choice of selecting among a set +(or "library") of algorithms. In order to assess the performance of an +algorithm, or a library of them, we introduce the concept of a **loss +function**, which defines the **risk** or the **expected prediction error**. +Our goal is to estimate the true performance (risk) of our estimator. In the +next chapter, we elaborate on how to estimate the performance of a library of +algorithms in order to choose the best-performing one. In the following, we +propose a method to do so using the observed data and **cross-validation** +procedures implemented in the `origami` package [@coyle2018origami; +@coyle-cran-origami]. ## Background -Ideally, in a data-rich scenario (i.e., one with unlimited observations), we would -split our dataset into three parts: +Ideally, in a data-rich scenario (i.e., one with unlimited observations), we +would split our dataset into three parts: 1. the training set, 2. 
the validation set, and @@ -77,35 +78,35 @@ split our dataset into three parts: The training set is used to fit algorithm(s) of interest; we evaluate the performance of the fit(s) on a validation set, which can be used to estimate -prediction error (e.g., for algorithm tuning or selection). The final error of the -selected algorithm is obtained by using the test (or holdout) set, which is +prediction error (e.g., for algorithm tuning or selection). The final error of +the selected algorithm is obtained by using the test (or holdout) set, which is kept entirely separate such that the algorithms never encounter these observations until the final model evaluation step. One might wonder, with training data readily available, why not use the training error to evaluate the proposed algorithm's performance? Unfortunately, the training error is a biased -estimate of a fitted algorithm's generalizability, since it uses the same data +estimate of a fitted algorithm's generalizability, since it uses the same data for fitting and evaluation. Since data are often scarce, separating a dataset into training, validation and test sets can prove too limiting, on account of decreasing the available data for use in training by too much. In the absence of a large dataset and a -designated test set, we must resort to methods that estimate the algorithm's -true performance by efficient sample re-use. Re-sampling methods, like the +designated test set, we must resort to methods that estimate the algorithm's +true performance by efficient sample re-use. Re-sampling methods, like the bootstrap, involve repeatedly sampling from the training set and fitting the -algorithms to these samples. While often computationally -intensive, re-sampling methods are particularly useful for evaluating an -algorithm and selecting among a set of them. In addition, they provide -more insight on the variability and robustness of a fitted algorithm, relative -to fitting an algorithm only once to all of the training data. +algorithms to these samples. While often computationally intensive, re-sampling +methods are particularly useful for evaluating an algorithm and selecting among +a set of them. In addition, they provide more insight on the variability and +robustness of a fitted algorithm, relative to fitting an algorithm only once to +all of the training data. ### Introducing: cross-validation @@ -128,11 +129,11 @@ data-generating distribution. For further details on the theoretical results, we suggest consulting @vdl2003unified, @vdl2004asymptotic, @dudoit2005asymptotics and @vaart2006oracle. -The `origami` package provides a suite of tools for cross-validation. In the -following, we describe different types of cross-validation schemes readily available -in `origami`, introduce the general structure of the `origami` package, and demonstrate -the use of these procedures in various applied settings. ---- +The `origami` package provides a suite of tools for cross-validation. In the +following, we describe different types of cross-validation schemes readily +available in `origami`, introduce the general structure of the `origami` +package, and demonstrate the use of these procedures in various applied +settings. ## Estimation Roadmap: How does it all fit together? @@ -143,12 +144,12 @@ particular, the unified loss-based estimation framework [@vdl2003unified; which relies on cross-validation for estimator construction, selection, and performance assessment, consists of three main steps: -1. **Loss function**: +1. 
**Loss function**: Define the target parameter as the minimizer of the expected loss (risk) for a -full data loss function chosen to represent the desired performance measure. By -full data, we refer to the complete data including missingness process, for example. -Map the full data loss function into an observed data loss function, having -the same expected value and leading to an estimator of risk. +full data loss function chosen to represent the desired performance measure. By +full data, we refer to the complete data including missingness process, for +example. Map the full data loss function into an observed data loss function, +having the same expected value and leading to an estimator of risk. 2. **Algorithms**: Construct a finite collection of candidate estimators of the parameter of @@ -162,35 +163,37 @@ the overall performance of the resulting estimator. ## Example: Cross-validation and Prediction Having introduced the [Estimation Roadmap](#roadmap), we can more precisely -define our objective using prediction as an example. -Let the observed data be defined as $O = (W, Y)$, where a unit -specific data structure can be written as $O_i = (W_i, Y_i)$, for $i = 1, \ldots, n$. -We denote $Y_i$ as the outcome/dependent variable of interest, and $W_i$ as a -$p$-dimensional set of covariate (predictor) variables. We assume -the $n$ units are independent, or conditionally independent, and identically -distributed. Let $\psi_0(W)$ denote the target parameter of interest, the -quantity we wish to estimate (estimand). For this example, we are interested in estimating the -conditional expectation of the outcome given the covariates, $\psi_0(W) = \E(Y -\mid W)$. Following the [Estimation Roadmap](#roadmap), we choose the -appropriate loss function, $L$, such that $\psi_0(W) = \text{argmin}_{\psi} -\E_0[L(O, \psi(W))]$. Note that $\psi_0(W)$, the true target parameter, is a -minimizer of the risk (expected value of the chosen loss function). -The appropriate loss function for conditional expectation with continuous outcome -could be a mean squared error, for example. Then we can define $L$ as -$L(O, \psi(W)) = (Y_i -\psi(W_i)^2$. Note that there can be many different algorithms which -estimate the estimand (many different $\psi$s). How do we know how well each of the candidate -estimators of $\psi_0(W)$ are doing? To pick the best-performing -candidate estimator and assess its overall performance, we -use cross-validation. Observations in the training set are used to fit -(or train) the estimator, while those in validation set are used to assess -the risk of (or validate) it. +define our objective using prediction as an example. Let the observed data be +defined as $O = (W, Y)$, where a unit specific data structure can be written as +$O_i = (W_i, Y_i)$, for $i = 1, \ldots, n$. We denote $Y_i$ as the +outcome/dependent variable of interest, and $W_i$ as a $p$-dimensional set of +covariate (predictor) variables. We assume the $n$ units are independent, or +conditionally independent, and identically distributed. Let $\psi_0(W)$ denote +the target parameter of interest, the quantity we wish to estimate (estimand). +For this example, we are interested in estimating the conditional expectation of +the outcome given the covariates, $\psi_0(W) = \E(Y \mid W)$. Following the +[Estimation Roadmap](#roadmap), we choose the appropriate loss function, $L$, +such that $\psi_0(W) = \text{argmin}_{\psi} \E_0[L(O, \psi(W))]$. 
Note that +$\psi_0(W)$, the true target parameter, is a minimizer of the risk (expected +value of the chosen loss function). The appropriate loss function for +conditional expectation with a continuous outcome could be the mean squared error, +for example. Then we can define $L$ as $L(O, \psi(W)) = (Y - \psi(W))^2$. Note +that there can be many different algorithms which estimate the estimand (many +different $\psi$s). How do we know how well each of the candidate estimators of +$\psi_0(W)$ is doing? To pick the best-performing candidate estimator and +assess its overall performance, we use cross-validation. Observations in the +training set are used to fit (or train) the estimator, while those in the validation +set are used to assess the risk of (or validate) it. Next, we introduce notation flexible enough to represent any cross-validation -scheme. In particular, we define a **split vector**, $B_n = (B_n(i): i = 1, \ldots, n) \in -\{0,1\}^n$. -#Note that such a split vector is independent of the empirical distribution $P_n$, as in $B_n$ is not a function of $P_n$, but $P_0$. -A realization of $B_n$ defines a random split of the data into training and validation -subsets such that if +scheme. In particular, we define a **split vector**, $B_n = (B_n(i): i = 1, +\ldots, n) \in \{0,1\}^n$. + +A realization of $B_n$ defines a random split of the data into training and +validation subsets such that if $$B_n(i) = 0, \ \ \text{i sample is in the training set}$$ $$B_n(i) = 1, \ \ \text{i sample is in the validation set.}$$ We can further define $P_{n, B_n}^0$ and $P_{n, B_n}^1$ as the empirical @@ -214,11 +217,12 @@ maybe it gets annoying for time-series examples. just a thought... ## Cross-validation schemes in `origami` -A variety of different partitioning schemes exist, each tailored to the -salient details of the problem of interest, including data size, prevalence of -the outcome, and dependence structure (between units or across time). -In the following, we describe different cross-validation schemes available in the -`origami` package, and we go on to demonstrate their use in practical data analysis examples. +A variety of different partitioning schemes exist, each tailored to the salient +details of the problem of interest, including data size, prevalence of the +outcome, and dependence structure (between units or across time). In the +following, we describe different cross-validation schemes available in the +`origami` package, and we go on to demonstrate their use in practical data +analysis examples. ### WASH Benefits Study Example {-} @@ -519,10 +523,9 @@ and validation folds? --> ### Cross-validation for Time-series Data diff --git a/09-tmle3shift.Rmd b/09-tmle3shift.Rmd index a3df533..3d3b832 100644 --- a/09-tmle3shift.Rmd +++ b/09-tmle3shift.Rmd @@ -326,7 +326,7 @@ readily admits the conclusion that $\psi_n - \psi_0 = (P_n - P_0) \cdot D(P_0) + R(\hat{P}^{\star}, P_0)$. Under the additional condition that the remainder term $R(\hat{P}^{\star}, -P_0)$ decays as $o_P \left( \frac{1}{\sqrt{n}} \right),$ we have that $$\psi_n +P_0)$ decays as $o_P \left( \frac{1}{\sqrt{n}} \right)$, we have that $$\psi_n - \psi_0 = (P_n - P_0) \cdot D(P_0) + o_P \left( \frac{1}{\sqrt{n}} \right),$$ which, by a central limit theorem, establishes a Gaussian limiting distribution for the estimator:
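
To make the cross-validated risk estimation described in the `05-origami.Rmd` changes above concrete, here is a minimal sketch of a V-fold cross-validated estimate of the squared-error risk $\E_0[(Y - \psi(W))^2]$, assuming `origami`'s `make_folds()`, `folds_vfold`, `cross_validate()`, `training()`, and `validation()` interface. The simulated data frame `dat`, the working model `lm(y ~ w)`, and the fold-level function `cv_lm_risk()` are illustrative assumptions, not part of the chapters being patched.

```r
library(origami)

# simulated data (hypothetical): outcome y and a single covariate w
set.seed(4197)
n <- 500
dat <- data.frame(w = rnorm(n))
dat$y <- 2 * dat$w + rnorm(n)

# fold-level function: fit on the training split (observations with B_n(i) = 0)
# and evaluate the squared-error loss on the validation split (B_n(i) = 1)
cv_lm_risk <- function(fold, data) {
  train <- training(data)
  valid <- validation(data)
  fit <- lm(y ~ w, data = train)
  preds <- predict(fit, newdata = valid)
  list(mse = mean((valid$y - preds)^2))
}

# V-fold split vectors B_n, then the per-fold validation losses
folds <- make_folds(dat, fold_fun = folds_vfold, V = 10)
results <- cross_validate(cv_fun = cv_lm_risk, folds = folds, data = dat)

# cross-validated risk estimate: average loss across validation folds
mean(results$mse)
```

Averaging the validation-fold losses gives the cross-validated risk estimate used, as in the chapter, to compare candidate estimators of $\psi_0(W)$.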