From 1553beae09fa0d9921813a482430e1e500180773 Mon Sep 17 00:00:00 2001 From: Nima Hejazi Date: Wed, 5 Jul 2023 13:50:53 -0700 Subject: [PATCH] fix math mode and enforce linebreaks --- 04-roadmap.Rmd | 36 +++++----- 05-origami.Rmd | 175 +++++++++++++++++++++++----------------------- 09-tmle3shift.Rmd | 2 +- 3 files changed, 108 insertions(+), 105 deletions(-) diff --git a/04-roadmap.Rmd b/04-roadmap.Rmd index 0ed7a4b..482f6be 100644 --- a/04-roadmap.Rmd +++ b/04-roadmap.Rmd @@ -111,7 +111,7 @@ regression, and linear models imply a highly constrained statistical model, and if any of the assumptions are unwarranted then there will be bias in their result (except when treatment is randomized). The philosophy used to justify parametric assumptions is rooted in misinterpretations of the often-quoted -saying of George Box, that "All models are wrong but some are useful", which has +saying of George Box, that "All models are wrong but some are useful," which has been irresponsibly used to encourage the data analyst to make arbitrary modeling choices. However, when one makes such unfounded assumptions, it is more likely that $\M$ does not contain $P_0$, in which case the statistical model is said to @@ -222,7 +222,7 @@ $\hat{\Psi}$, an _a priori_-specified algorithm defined as a mapping from the se of the set of possible empirical distributions $P_n$ (which live in a non-parametric statistical model $\M_{NP}$) to the parameter space for our target parameter of interest: $\hat{\Psi} : \M_{NP} \rightarrow \R$. In other -words, $\hat{\Psi} is a function that takes as input the observed data, a +words, $\hat{\Psi}$ is a function that takes as input the observed data, a realization of $P_n$, and then outputs a value in the parameter space. Where the estimator may be seen as an operator that maps the observed data's corresponding empirical distribution to a value in the parameter space, the numerical output @@ -233,26 +233,26 @@ plug in a realization of $P_n$ (based on a sample size $n$ of the random variable $O$, we get back an estimate $\psi_n$ of the true parameter value $\psi_0$. As we have motivated in step 2, it is imperative to consider realistic statistical models for estimation. Therefore, flexible estimators that allow for parts of the data-generating process to be unrestricted are necessary. -Semiparametric statistical theory and empirical process theory provide a -framework for constructing, benchmarking, and understanding the behavior of -estimators that depend on flexible estimation strategies in realistic -statistical models. In general, desirable properties of an estimator are that it -is regular asymptotically linear (RAL) and efficient, thereby admitting a Normal -limit distribution that has minimal variance. Substitution/plug-in RAL -estimators are also advantageous: they are guaranteed to remain within the -bounds of $\M$ and, relative to estimators that are not plug-in, have improved -bias and variance in finite samples. In-depth discussion of the theory and these -properties are available in the literature [e.g., @kennedy2016semiparametric; +Semiparametric theory and empirical process theory provide a framework for +constructing, benchmarking, and understanding the behavior of estimators that +depend on flexible estimation strategies in realistic statistical models. In +general, desirable properties of an estimator are that it is regular +asymptotically linear (RAL) and efficient, thereby admitting a Normal limit +distribution that has minimal variance. 
Substitution/plug-in RAL estimators are +also advantageous: they are guaranteed to remain within the bounds of $\M$ and, +relative to estimators that are not plug-in, have improved bias and variance in +finite samples. In-depth discussion of the theory and these properties are +available in the literature [e.g., @kennedy2016semiparametric; @vdl2011targeted]. We review a few key concepts in the following step. In order to quantify the uncertainty in our estimate of the target parameter, diff --git a/05-origami.Rmd b/05-origami.Rmd index 98b0ca5..695d6d4 100644 --- a/05-origami.Rmd +++ b/05-origami.Rmd @@ -32,44 +32,45 @@ By the end of this chapter you will be able to: `r if (knitr::is_latex_output()) '\\end{VT1}'` ## Introduction -Following the [_Roadmap for Targeted Learning_](#roadmap), we start to -elaborate on the estimation step in the current chapter. In order to generate an -estimate of the target parameter, we need to decide how to evaluate the quality -of our estimation procedure's performance. The performance, or error, of -any algorithm (estimator) corresponds to its generalizability to independent -datasets arising from the same data-generating process. Assessment of -the performance of an algorithm is extremely important --- it provides a -quantitative measure of how well the algorithm performs, and it guides the -choice of selecting among a set (or "library") of algorithms. In order to -assess the performance of an algorithm, or a library of them, we introduce the concept of a -**loss function**, which defines the **risk** or the **expected prediction error**. -Our goal is to estimate the true performance (risk) of our estimator. -In the next chapter, we elaborate on how to estimate the performance of a library -of algorithms in order to choose the best-performing one. In the following, we propose a -method to do so using the observed data and **cross-validation** procedures implemented -in the `origami` package [@coyle2018origami; @coyle-cran-origami]. +Following the [_Roadmap for Targeted Learning_](#roadmap), we start to elaborate +on the estimation step in the current chapter. In order to generate an estimate +of the target parameter, we need to decide how to evaluate the quality of our +estimation procedure's performance. The performance, or error, of any algorithm +(estimator) corresponds to its generalizability to independent datasets arising +from the same data-generating process. Assessment of the performance of an +algorithm is extremely important --- it provides a quantitative measure of how +well the algorithm performs, and it guides the choice of selecting among a set +(or "library") of algorithms. In order to assess the performance of an +algorithm, or a library of them, we introduce the concept of a **loss +function**, which defines the **risk** or the **expected prediction error**. +Our goal is to estimate the true performance (risk) of our estimator. In the +next chapter, we elaborate on how to estimate the performance of a library of +algorithms in order to choose the best-performing one. In the following, we +propose a method to do so using the observed data and **cross-validation** +procedures implemented in the `origami` package [@coyle2018origami; +@coyle-cran-origami]. ## Background -Ideally, in a data-rich scenario (i.e., one with unlimited observations), we would -split our dataset into three parts: +Ideally, in a data-rich scenario (i.e., one with unlimited observations), we +would split our dataset into three parts: 1. the training set, 2. 
the validation set, and @@ -77,35 +78,35 @@ split our dataset into three parts: The training set is used to fit algorithm(s) of interest; we evaluate the performance of the fit(s) on a validation set, which can be used to estimate -prediction error (e.g., for algorithm tuning or selection). The final error of the -selected algorithm is obtained by using the test (or holdout) set, which is +prediction error (e.g., for algorithm tuning or selection). The final error of +the selected algorithm is obtained by using the test (or holdout) set, which is kept entirely separate such that the algorithms never encounter these observations until the final model evaluation step. One might wonder, with training data readily available, why not use the training error to evaluate the proposed algorithm's performance? Unfortunately, the training error is a biased -estimate of a fitted algorithm's generalizability, since it uses the same data +estimate of a fitted algorithm's generalizability, since it uses the same data for fitting and evaluation. Since data are often scarce, separating a dataset into training, validation and test sets can prove too limiting, on account of decreasing the available data for use in training by too much. In the absence of a large dataset and a -designated test set, we must resort to methods that estimate the algorithm's -true performance by efficient sample re-use. Re-sampling methods, like the +designated test set, we must resort to methods that estimate the algorithm's +true performance by efficient sample re-use. Re-sampling methods, like the bootstrap, involve repeatedly sampling from the training set and fitting the -algorithms to these samples. While often computationally -intensive, re-sampling methods are particularly useful for evaluating an -algorithm and selecting among a set of them. In addition, they provide -more insight on the variability and robustness of a fitted algorithm, relative -to fitting an algorithm only once to all of the training data. +algorithms to these samples. While often computationally intensive, re-sampling +methods are particularly useful for evaluating an algorithm and selecting among +a set of them. In addition, they provide more insight on the variability and +robustness of a fitted algorithm, relative to fitting an algorithm only once to +all of the training data. ### Introducing: cross-validation @@ -128,11 +129,11 @@ data-generating distribution. For further details on the theoretical results, we suggest consulting @vdl2003unified, @vdl2004asymptotic, @dudoit2005asymptotics and @vaart2006oracle. -The `origami` package provides a suite of tools for cross-validation. In the -following, we describe different types of cross-validation schemes readily available -in `origami`, introduce the general structure of the `origami` package, and demonstrate -the use of these procedures in various applied settings. ---- +The `origami` package provides a suite of tools for cross-validation. In the +following, we describe different types of cross-validation schemes readily +available in `origami`, introduce the general structure of the `origami` +package, and demonstrate the use of these procedures in various applied +settings. ## Estimation Roadmap: How does it all fit together? @@ -143,12 +144,12 @@ particular, the unified loss-based estimation framework [@vdl2003unified; which relies on cross-validation for estimator construction, selection, and performance assessment, consists of three main steps: -1. **Loss function**: +1. 
**Loss function**: Define the target parameter as the minimizer of the expected loss (risk) for a -full data loss function chosen to represent the desired performance measure. By -full data, we refer to the complete data including missingness process, for example. -Map the full data loss function into an observed data loss function, having -the same expected value and leading to an estimator of risk. +full data loss function chosen to represent the desired performance measure. By +full data, we refer to the complete data including missingness process, for +example. Map the full data loss function into an observed data loss function, +having the same expected value and leading to an estimator of risk. 2. **Algorithms**: Construct a finite collection of candidate estimators of the parameter of @@ -162,35 +163,37 @@ the overall performance of the resulting estimator. ## Example: Cross-validation and Prediction Having introduced the [Estimation Roadmap](#roadmap), we can more precisely -define our objective using prediction as an example. -Let the observed data be defined as $O = (W, Y)$, where a unit -specific data structure can be written as $O_i = (W_i, Y_i)$, for $i = 1, \ldots, n$. -We denote $Y_i$ as the outcome/dependent variable of interest, and $W_i$ as a -$p$-dimensional set of covariate (predictor) variables. We assume -the $n$ units are independent, or conditionally independent, and identically -distributed. Let $\psi_0(W)$ denote the target parameter of interest, the -quantity we wish to estimate (estimand). For this example, we are interested in estimating the -conditional expectation of the outcome given the covariates, $\psi_0(W) = \E(Y -\mid W)$. Following the [Estimation Roadmap](#roadmap), we choose the -appropriate loss function, $L$, such that $\psi_0(W) = \text{argmin}_{\psi} -\E_0[L(O, \psi(W))]$. Note that $\psi_0(W)$, the true target parameter, is a -minimizer of the risk (expected value of the chosen loss function). -The appropriate loss function for conditional expectation with continuous outcome -could be a mean squared error, for example. Then we can define $L$ as -$L(O, \psi(W)) = (Y_i -\psi(W_i)^2$. Note that there can be many different algorithms which -estimate the estimand (many different $\psi$s). How do we know how well each of the candidate -estimators of $\psi_0(W)$ are doing? To pick the best-performing -candidate estimator and assess its overall performance, we -use cross-validation. Observations in the training set are used to fit -(or train) the estimator, while those in validation set are used to assess -the risk of (or validate) it. +define our objective using prediction as an example. Let the observed data be +defined as $O = (W, Y)$, where a unit specific data structure can be written as +$O_i = (W_i, Y_i)$, for $i = 1, \ldots, n$. We denote $Y_i$ as the +outcome/dependent variable of interest, and $W_i$ as a $p$-dimensional set of +covariate (predictor) variables. We assume the $n$ units are independent, or +conditionally independent, and identically distributed. Let $\psi_0(W)$ denote +the target parameter of interest, the quantity we wish to estimate (estimand). +For this example, we are interested in estimating the conditional expectation of +the outcome given the covariates, $\psi_0(W) = \E(Y \mid W)$. Following the +[Estimation Roadmap](#roadmap), we choose the appropriate loss function, $L$, +such that $\psi_0(W) = \text{argmin}_{\psi} \E_0[L(O, \psi(W))]$. 
Note that +$\psi_0(W)$, the true target parameter, is a minimizer of the risk (expected +value of the chosen loss function). The appropriate loss function for +conditional expectation with a continuous outcome could be the mean squared error, +for example. Then we can define $L$ as $L(O, \psi(W)) = (Y - \psi(W))^2$. Note +that there can be many different algorithms which estimate the estimand (many +different $\psi$s). How do we know how well each of the candidate estimators of +$\psi_0(W)$ is doing? To pick the best-performing candidate estimator and +assess its overall performance, we use cross-validation. Observations in the +training set are used to fit (or train) the estimator, while those in the validation +set are used to assess the risk of (or validate) it. Next, we introduce notation flexible enough to represent any cross-validation -scheme. In particular, we define a **split vector**, $B_n = (B_n(i): i = 1, \ldots, n) \in -\{0,1\}^n$. -#Note that such a split vector is independent of the empirical distribution $P_n$, as in $B_n$ is not a function of $P_n$, but $P_0$. -A realization of $B_n$ defines a random split of the data into training and validation -subsets such that if +scheme. In particular, we define a **split vector**, $B_n = (B_n(i): i = 1, +\ldots, n) \in \{0,1\}^n$. + +A realization of $B_n$ defines a random split of the data into training and +validation subsets such that if $$B_n(i) = 0, \ \ \text{i sample is in the training set}$$ $$B_n(i) = 1, \ \ \text{i sample is in the validation set.}$$ We can further define $P_{n, B_n}^0$ and $P_{n, B_n}^1$ as the empirical @@ -214,11 +217,12 @@ maybe it gets annoying for time-series examples. just a thought... ## Cross-validation schemes in `origami` -A variety of different partitioning schemes exist, each tailored to the -salient details of the problem of interest, including data size, prevalence of -the outcome, and dependence structure (between units or across time). -In the following, we describe different cross-validation schemes available in the -`origami` package, and we go on to demonstrate their use in practical data analysis examples. +A variety of different partitioning schemes exist, each tailored to the salient +details of the problem of interest, including data size, prevalence of the +outcome, and dependence structure (between units or across time). In the +following, we describe different cross-validation schemes available in the +`origami` package, and we go on to demonstrate their use in practical data +analysis examples. ### WASH Benefits Study Example {-} @@ -519,10 +523,9 @@ and validation folds? --> ### Cross-validation for Time-series Data diff --git a/09-tmle3shift.Rmd b/09-tmle3shift.Rmd index a3df533..3d3b832 100644 --- a/09-tmle3shift.Rmd +++ b/09-tmle3shift.Rmd @@ -326,7 +326,7 @@ readily admits the conclusion that $\psi_n - \psi_0 = (P_n - P_0) \cdot D(P_0) + R(\hat{P}^{\star}, P_0)$. Under the additional condition that the remainder term $R(\hat{P}^{\star}, -P_0)$ decays as $o_P \left( \frac{1}{\sqrt{n}} \right),$ we have that $$\psi_n +P_0)$ decays as $o_P \left( \frac{1}{\sqrt{n}} \right)$, we have that $$\psi_n - \psi_0 = (P_n - P_0) \cdot D(P_0) + o_P \left( \frac{1}{\sqrt{n}} \right),$$ which, by a central limit theorem, establishes a Gaussian limiting distribution for the estimator:
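
To make the cross-validated risk estimation described in the `05-origami.Rmd` changes above concrete, here is a minimal sketch of a V-fold cross-validated estimate of the squared-error risk $\E_0[(Y - \psi(W))^2]$, assuming `origami`'s `make_folds()`, `folds_vfold`, `cross_validate()`, `training()`, and `validation()` interface. The simulated data frame `dat`, the working model `lm(y ~ w)`, and the fold-level function `cv_lm_risk()` are illustrative assumptions, not part of the chapters being patched.

```r
library(origami)

# simulated data (hypothetical): outcome y and a single covariate w
set.seed(4197)
n <- 500
dat <- data.frame(w = rnorm(n))
dat$y <- 2 * dat$w + rnorm(n)

# fold-level function: fit on the training split (observations with B_n(i) = 0)
# and evaluate the squared-error loss on the validation split (B_n(i) = 1)
cv_lm_risk <- function(fold, data) {
  train <- training(data)
  valid <- validation(data)
  fit <- lm(y ~ w, data = train)
  preds <- predict(fit, newdata = valid)
  list(mse = mean((valid$y - preds)^2))
}

# V-fold split vectors B_n, then the per-fold validation losses
folds <- make_folds(dat, fold_fun = folds_vfold, V = 10)
results <- cross_validate(cv_fun = cv_lm_risk, folds = folds, data = dat)

# cross-validated risk estimate: average loss across validation folds
mean(results$mse)
```

Averaging the validation-fold losses gives the cross-validated risk estimate used, as in the chapter, to compare candidate estimators of $\psi_0(W)$.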