fix math mode and enforce linebreaks
nhejazi committed Jul 5, 2023
1 parent aa273d3 commit 1553bea
Showing 3 changed files with 108 additions and 105 deletions.
36 changes: 18 additions & 18 deletions 04-roadmap.Rmd
@@ -111,7 +111,7 @@ regression, and linear models imply a highly constrained statistical model, and
if any of the assumptions are unwarranted then there will be bias in their
result (except when treatment is randomized). The philosophy used to justify
parametric assumptions is rooted in misinterpretations of the often-quoted
saying of George Box, that "All models are wrong but some are useful," which has
been irresponsibly used to encourage the data analyst to make arbitrary modeling
choices. However, when one makes such unfounded assumptions, it is more likely
that $\M$ does not contain $P_0$, in which case the statistical model is said to
@@ -222,7 +222,7 @@ $\hat{\Psi}$, an _a priori_-specified algorithm defined as a mapping from the se
of the set of possible empirical distributions $P_n$ (which live in a
non-parametric statistical model $\M_{NP}$) to the parameter space for our
target parameter of interest: $\hat{\Psi} : \M_{NP} \rightarrow \R$. In other
words, $\hat{\Psi}$ is a function that takes as input the observed data, a
realization of $P_n$, and then outputs a value in the parameter space. Where the
estimator may be seen as an operator that maps the observed data's corresponding
empirical distribution to a value in the parameter space, the numerical output
@@ -233,26 +233,26 @@ plug in a realization of $P_n$ (based on a sample size $n$ of the random
variable $O$), we get back an estimate $\psi_n$ of the true parameter value
$\psi_0$.
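
To make this concrete, here is a minimal illustrative sketch in R (not drawn
from the handbook's own code): for the simplest target parameter, the mean
$\psi_0 = \E[O]$, the estimator is literally a function mapping a realization
of $P_n$ to a point in the parameter space.

```r
# A minimal sketch: an estimator is a mapping from the observed data -- a
# realization of P_n -- to the parameter space. For the simple target
# parameter psi_0 = E[O], the plug-in estimator evaluates the sample mean.
psi_hat <- function(o) {
  mean(o)  # psi_n: a single point in the parameter space (the real line)
}

set.seed(34729)
o <- rnorm(100, mean = 2)  # n = 100 observations of O, with true psi_0 = 2
psi_n <- psi_hat(o)        # an estimate of psi_0
```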
<!--
nh: idk how i feel about the above; it's a bit repetitive and feels imprecise
nh: the use of $\hat{\Psi}(\cdot)$ seems to imply an approximate mapping, i.e.,
$\hat{\Psi}(\cdot)$ is _not_ $\Psi(\cdot)$ -- almost sounds as though we're answering
an approximate version of the question represented by the mapping $\Psi(\cdot)$
rp: Yes, with $\hat{\Psi}$ we are approximating $\Psi$, $\hat{\Psi}$ is an
estimator of $\Psi$
-->
As we have motivated in step 2, it is imperative to consider realistic
statistical models for estimation. Therefore, flexible estimators that allow for
parts of the data-generating process to be unrestricted are necessary.
Semiparametric theory and empirical process theory provide a framework for
constructing, benchmarking, and understanding the behavior of estimators that
depend on flexible estimation strategies in realistic statistical models. In
general, desirable properties of an estimator are that it is regular
asymptotically linear (RAL) and efficient, thereby admitting a Normal limit
distribution that has minimal variance. Substitution/plug-in RAL estimators are
also advantageous: they are guaranteed to remain within the bounds of $\M$ and,
relative to estimators that are not plug-in, have improved bias and variance in
finite samples. In-depth discussion of the theory and these properties is
available in the literature [e.g., @kennedy2016semiparametric;
@vdl2011targeted]. We review a few key concepts in the following step.

In order to quantify the uncertainty in our estimate of the target parameter,
175 changes: 89 additions & 86 deletions 05-origami.Rmd
@@ -32,80 +32,81 @@ By the end of this chapter you will be able to:
`r if (knitr::is_latex_output()) '\\end{VT1}'`

<!--
RP:
suggestion to modify some LOs (I am also open to not having LOs).
LOs should be expressed in terms of the reader and use action verbs.
Some ideas for action verbs related to "understanding" are here, within
"Action Verbs Aligned with Blooms Taxonomy" section:
https://academiceffectiveness.gatech.edu/assessment-toolkit/developing-student-learning-outcome-statements
Here's another helpful article:
https://www.thoughtco.com/common-mistakes-when-writing-learning-objectives-7786
As an example, for LO 2, it can be rephrased by thinking about the answers to
these questions and targeting the LO towards them. What specifically does a
reader need to understand about a loss, risk, CV? Why is this important for the
reader to understand?
-->

## Introduction

Following the [_Roadmap for Targeted Learning_](#roadmap), we start to elaborate
on the estimation step in the current chapter. In order to generate an estimate
of the target parameter, we need to decide how to evaluate the quality of our
estimation procedure's performance. The performance, or error, of any algorithm
(estimator) corresponds to its generalizability to independent datasets arising
from the same data-generating process. Assessment of the performance of an
algorithm is extremely important --- it provides a quantitative measure of how
well the algorithm performs, and it guides selection among a set
(or "library") of algorithms. In order to assess the performance of an
algorithm, or a library of them, we introduce the concept of a **loss
function**, which defines the **risk** or the **expected prediction error**.
Our goal is to estimate the true performance (risk) of our estimator. In the
next chapter, we elaborate on how to estimate the performance of a library of
algorithms in order to choose the best-performing one. In the following, we
propose a method to do so using the observed data and **cross-validation**
procedures implemented in the `origami` package [@coyle2018origami;
@coyle-cran-origami].

## Background

Ideally, in a data-rich scenario (i.e., one with unlimited observations), we
would split our dataset into three parts:

1. the training set,
2. the validation set, and
3. the test (or holdout) set.

The training set is used to fit algorithm(s) of interest; we evaluate the
performance of the fit(s) on a validation set, which can be used to estimate
prediction error (e.g., for algorithm tuning or selection). The final error of
the selected algorithm is obtained by using the test (or holdout) set, which is
kept entirely separate such that the algorithms never encounter these
observations until the final model evaluation step. One might wonder, with
training data readily available, why not use the training error to evaluate the
proposed algorithm's performance? Unfortunately, the training error is a biased
estimate of a fitted algorithm's generalizability, since it uses the same data
for fitting and evaluation.
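
As an illustration, here is a minimal R sketch of such a three-way partition;
the 60/20/20 proportions and the linear model are arbitrary choices for the
example, not recommendations.

```r
# A minimal sketch of a training/validation/test split on a built-in dataset;
# the 60/20/20 proportions and lm() learner are illustrative choices only.
set.seed(34729)
idx <- sample(seq_len(nrow(mtcars)))  # shuffle the 32 observations
train <- mtcars[idx[1:19], ]          # ~60%: fit the algorithm(s)
valid <- mtcars[idx[20:25], ]         # ~20%: estimate error for tuning/selection
test  <- mtcars[idx[26:32], ]         # ~20%: held out for final evaluation only

fit <- lm(mpg ~ wt, data = train)
valid_mse <- mean((valid$mpg - predict(fit, valid))^2)  # selection error
test_mse  <- mean((test$mpg - predict(fit, test))^2)    # final error estimate
```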

Since data are often scarce, separating a dataset into training, validation and
test sets can prove too limiting, on account of decreasing the available data
for use in training by too much. In the absence of a large dataset and a
designated test set, we must resort to methods that estimate the algorithm's
true performance by efficient sample re-use. Re-sampling methods, like the
bootstrap, involve repeatedly sampling from the training set and fitting the
algorithms to these samples. While often computationally intensive, re-sampling
methods are particularly useful for evaluating an algorithm and selecting among
a set of them. In addition, they provide more insight on the variability and
robustness of a fitted algorithm, relative to fitting an algorithm only once to
all of the training data.
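
For instance, a minimal sketch of the bootstrap in R, refitting a simple
algorithm on repeated samples to gauge the variability of its fit (the learner
and number of replicates are arbitrary illustrative choices):

```r
# A minimal bootstrap sketch: repeatedly resample the data with replacement,
# refit the algorithm, and inspect the spread of the resulting fits.
set.seed(34729)
boot_slopes <- replicate(200, {
  samp <- sample(seq_len(nrow(mtcars)), replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[samp, ]))["wt"]
})
sd(boot_slopes)  # variability of the fitted slope across re-samples
```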

<!--
RP:
What is meant by "scarce", "by too much", "large dataset" here? We also might
want to use CV even when we have thousands of observations, so our assessment
of the algorithm isn't hinging on a single split. Is the data's size the
motivating reason for re-sampling? Ah-ha! I knew you had it somewhere! I think
the (some of) message that you're getting across in the paragraph around L380
should be included here.
-->

### Introducing: cross-validation
@@ -128,11 +129,11 @@ data-generating distribution. For further details on the theoretical results, we
suggest consulting @vdl2003unified, @vdl2004asymptotic, @dudoit2005asymptotics
and @vaart2006oracle.

The `origami` package provides a suite of tools for cross-validation. In the
following, we describe different types of cross-validation schemes readily
available in `origami`, introduce the general structure of the `origami`
package, and demonstrate the use of these procedures in various applied
settings.
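
As a preview of that general structure, here is a minimal sketch of the
fold-based workflow, assuming `origami`'s standard `make_folds()` and
`cross_validate()` interface (demonstrated in detail in the examples below):

```r
library(origami)

# A minimal sketch of origami's workflow: create folds, define a fold-wise
# function that trains and validates, then run it across all folds.
set.seed(34729)
folds <- make_folds(n = nrow(mtcars), fold_fun = folds_vfold, V = 5)

cv_mse <- function(fold, data) {
  train <- training(data)    # this fold's training observations
  valid <- validation(data)  # this fold's validation observations
  fit <- lm(mpg ~ wt, data = train)
  list(mse = mean((valid$mpg - predict(fit, valid))^2))
}

results <- cross_validate(cv_fun = cv_mse, folds = folds, data = mtcars)
mean(results$mse)  # cross-validated estimate of the risk
```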

## Estimation Roadmap: How does it all fit together?

@@ -143,12 +144,12 @@ particular, the unified loss-based estimation framework [@vdl2003unified;
which relies on cross-validation for estimator construction,
selection, and performance assessment, consists of three main steps:

1. **Loss function**:
Define the target parameter as the minimizer of the expected loss (risk) for a
full data loss function chosen to represent the desired performance measure. By
full data, we refer to the complete data including missingness process, for
example. Map the full data loss function into an observed data loss function,
having the same expected value and leading to an estimator of risk.

2. **Algorithms**:
Construct a finite collection of candidate estimators of the parameter of
@@ -162,35 +163,37 @@ the overall performance of the resulting estimator.
## Example: Cross-validation and Prediction

Having introduced the [Estimation Roadmap](#roadmap), we can more precisely
define our objective using prediction as an example. Let the observed data be
defined as $O = (W, Y)$, where a unit specific data structure can be written as
$O_i = (W_i, Y_i)$, for $i = 1, \ldots, n$. We denote $Y_i$ as the
outcome/dependent variable of interest, and $W_i$ as a $p$-dimensional set of
covariate (predictor) variables. We assume the $n$ units are independent, or
conditionally independent, and identically distributed. Let $\psi_0(W)$ denote
the target parameter of interest, the quantity we wish to estimate (estimand).
For this example, we are interested in estimating the conditional expectation of
the outcome given the covariates, $\psi_0(W) = \E(Y \mid W)$. Following the
[Estimation Roadmap](#roadmap), we choose the appropriate loss function, $L$,
such that $\psi_0(W) = \text{argmin}_{\psi} \E_0[L(O, \psi(W))]$. Note that
$\psi_0(W)$, the true target parameter, is a minimizer of the risk (expected
value of the chosen loss function). The appropriate loss function for
conditional expectation with continuous outcome could be a mean squared error,
for example. Then we can define $L$ as $L(O, \psi(W)) = (Y_i - \psi(W_i))^2$. Note
that there can be many different algorithms which estimate the estimand (many
different $\psi$s). How do we know how well each of the candidate estimators of
$\psi_0(W)$ is doing? To pick the best-performing candidate estimator and
assess its overall performance, we use cross-validation. Observations in the
training set are used to fit (or train) the estimator, while those in the
validation set are used to assess the risk of (or validate) it.
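
To fix ideas, here is a minimal R sketch, on simulated data, of evaluating one
candidate $\psi$ by its empirical risk under the squared error loss; the
data-generating process and the (deliberately crude) candidate are illustrative
choices only.

```r
# A minimal sketch: the empirical risk of a candidate psi for E(Y | W) under
# the squared error loss L(O, psi(W)) = (Y - psi(W))^2, on simulated data.
set.seed(34729)
n <- 500
w <- runif(n)
y <- 2 * w + rnorm(n)                 # true psi_0(w) = 2 * w
psi_candidate <- function(w) 2.5 * w  # one (deliberately crude) candidate
risk_n <- mean((y - psi_candidate(w))^2)  # empirical analogue of E_0[L]
```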

Next, we introduce notation flexible enough to represent any cross-validation
scheme. In particular, we define a **split vector**, $B_n = (B_n(i): i = 1,
\ldots, n) \in \{0,1\}^n$.
<!--
Note that such a split vector is independent of the empirical distribution
$P_n$, as in $B_n$ is not a function of $P_n$, but $P_0$.
-->
A realization of $B_n$ defines a random split of the data into training and
validation subsets such that if
$$B_n(i) = 0, \ \ \text{the } i\text{th sample is in the training set;}$$
$$B_n(i) = 1, \ \ \text{the } i\text{th sample is in the validation set.}$$
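
For concreteness, one realization of such a split vector in R (an arbitrary
80/20 random split, chosen purely for illustration):

```r
# A minimal sketch of one realization of the split vector B_n for n = 10;
# B_n(i) = 1 with probability 0.2 is an arbitrary illustrative choice.
set.seed(34729)
n <- 10
B_n <- rbinom(n, size = 1, prob = 0.2)
which(B_n == 0)  # indices of the training subset
which(B_n == 1)  # indices of the validation subset
```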
We can further define $P_{n, B_n}^0$ and $P_{n, B_n}^1$ as the empirical
@@ -214,11 +217,12 @@ maybe it gets annoying for time-series examples. just a thought...

## Cross-validation schemes in `origami`

A variety of different partitioning schemes exist, each tailored to the salient
details of the problem of interest, including data size, prevalence of the
outcome, and dependence structure (between units or across time). In the
following, we describe different cross-validation schemes available in the
`origami` package, and we go on to demonstrate their use in practical data
analysis examples.
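
As a sketch of how a few such schemes are requested through `origami`'s
`fold_fun` argument (function names as provided by the package; the specific
argument values here are illustrative, not recommendations):

```r
library(origami)

# A sketch of several partitioning schemes via origami's fold_fun argument.
n <- 100
f_vfold <- make_folds(n, fold_fun = folds_vfold, V = 10)  # V-fold CV
f_loo   <- make_folds(n, fold_fun = folds_loo)            # leave-one-out CV
f_mc    <- make_folds(n, fold_fun = folds_montecarlo,
                      V = 10, pvalidation = 0.2)          # Monte Carlo CV
f_ts    <- make_folds(n, fold_fun = folds_rolling_origin,
                      first_window = 50, validation_size = 10,
                      gap = 0, batch = 10)                # time-series CV
```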

### WASH Benefits Study Example {-}

@@ -519,10 +523,9 @@ and validation folds?
-->

<!--
RP:
Should we add stratified cross-validation and clustered cross-validation
examples with origami? I think these are both pretty common
-->

### Cross-validation for Time-series Data
2 changes: 1 addition & 1 deletion 09-tmle3shift.Rmd
@@ -326,7 +326,7 @@ readily admits the conclusion that
$\psi_n - \psi_0 = (P_n - P_0) \cdot D(P_0) + R(\hat{P}^{\star}, P_0)$.
Under the additional condition that the remainder term $R(\hat{P}^{\star},
P_0)$ decays as $o_P \left( \frac{1}{\sqrt{n}} \right)$, we have that $$\psi_n
- \psi_0 = (P_n - P_0) \cdot D(P_0) + o_P \left( \frac{1}{\sqrt{n}} \right),$$
which, by a central limit theorem, establishes a Gaussian limiting distribution
for the estimator.
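A sketch of the standard form of this limit, using that the influence function
$D(P_0)$ has mean zero under $P_0$, so that $(P_n - P_0) \cdot D(P_0) =
P_n D(P_0)$:
$$\sqrt{n} (\psi_n - \psi_0) = \sqrt{n} \, P_n D(P_0) + o_P(1)
\rightsquigarrow N\left(0, \text{Var}_{P_0}\{D(P_0)(O)\}\right).$$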
