```{r varimp-plot}
# plot variable importance
importance_plot(x = washb_varimp)
```

According to the `sl3` variable importance measures, which were assessed by the
mean squared error (MSE) difference under permutations of each covariate, the
fitted SL's (`sl_fit`) most important variables for predicting weight-for-height
z-score (`whz`) are child age (`aged`) and household assets (`assets`), the
latter reflecting the socio-economic status of the study's subjects.
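
To make the permutation scheme concrete, a hand-rolled version of such a
measure might look as follows. This is a minimal sketch of our own for
illustration: the helper `permute_importance` and the direct use of the task's
`data`, `nodes`, and `Y` fields are our choices here, not `sl3`'s variable
importance API.

```{r varimp-permute-sketch, eval = FALSE}
library(data.table)
library(sl3)

# permutation importance: risk difference after shuffling each covariate
permute_importance <- function(fit, task) {
  mse <- function(pred, obs) mean((obs - pred)^2)
  mse_full <- mse(fit$predict(task), task$Y)
  covars <- task$nodes$covariates
  risk_diffs <- sapply(covars, function(w) {
    dat_perm <- copy(task$data)
    dat_perm[[w]] <- sample(dat_perm[[w]])  # break the covariate-outcome link
    task_perm <- sl3_Task$new(
      dat_perm, covariates = covars, outcome = task$nodes$outcome
    )
    mse(fit$predict(task_perm), task$Y) - mse_full
  })
  sort(risk_diffs, decreasing = TRUE)
}

# e.g., permute_importance(sl_fit, washb_task)
```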

## Conditional Density Estimation

In certain scenarios it may be useful to estimate the conditional density of a
dependent variable, given predictors/covariates that precede it. In the context
of causal inference, this arises most readily when working with
continuous-valued treatments. Specifically, conditional density estimation (CDE)
is necessary when estimating the treatment mechanism for a continuous-valued
treatment, often called the _generalized propensity score_. Compared to the
classical propensity score (PS) for binary treatments (the conditional
probability of receiving the treatment given covariates), $\mathbb{P}(A = 1 \mid
W)$, the generalized PS is the conditional density of treatment $A$, given
covariates $W$, $\mathbb{P}(A \mid W)$.
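
To fix notation side by side (the symbol $g$ for the treatment mechanism is
ours):
$$
g(W) = \mathbb{P}(A = 1 \mid W) \quad \text{(binary } A\text{)}, \qquad
g(a, W) = \mathbb{P}(A = a \mid W) \quad \text{(continuous } A\text{)},
$$
where, in the continuous case, $\mathbb{P}(A = a \mid W)$ is understood as the
conditional density of $A$ evaluated at $a$, given $W$.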

CDE often requires specialized approaches tied to very specific algorithmic
implementations. To our knowledge, general and flexible algorithms for CDE have
been proposed only sparsely in the literature. We have implemented two such
approaches in `sl3`: a semiparametric CDE approach that makes certain
assumptions about the constancy of (higher) moments of the underlying
distribution, and a second approach that exploits the relationship between the
conditional hazard and density functions to allow CDE via pooled hazard
regression. Both approaches are flexible in that they allow the use of arbitrary
regression functions or machine learning algorithms for the estimation of
nuisance quantities (the conditional mean or the conditional hazard,
respectively). We elaborate on these two frameworks below. Importantly, per
@dudoit2005asymptotics and related works, a loss function appropriate for
density estimation is the negative log-density loss $L(\cdot) =
-\log(p_n(\cdot))$.
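
Concretely, under this loss a candidate density estimator is scored by the
average negative log of the density values it assigns to the observed data; a
minimal sketch (the function name below is ours, not part of `sl3`):

```{r negloglik-loss-sketch, eval = FALSE}
# negative log-density loss: smaller is better, so candidates assigning
# higher density to the observed values are preferred
neg_log_density_loss <- function(pred_density) {
  -log(pred_density)
}

# empirical risk of a candidate, given its predicted conditional densities
# p_i at the n observations: mean(neg_log_density_loss(p_i))
```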

### Moment-restricted location-scale

This family of semiparametric CDE approaches exploits the general form
$\rho((Y - \mu(X)) / \sigma(X))$, where $Y$ is the dependent variable of
interest (e.g., treatment $A$ in the PS), $X$ are the predictors (e.g.,
covariates $W$ in the PS), $\rho$ is a specified marginal density function, and
$\mu(X) = \E(Y \mid X)$ and $\sigma(X) = \E[(Y - \mu(X))^2 \mid X]$ are nuisance
functions of the dependent variable that may be estimated flexibly. CDE
procedures formulated within this framework may be characterized as belonging
to a _conditional location-scale_ family, that is, in which $p_n(Y \mid X) =
\rho((Y - \mu_n(X)) / \sigma_n(X))$. While CDE with conditional location-scale
families is not without potential disadvantages (e.g., the restriction on the
density's functional form could lead to misspecification bias), this strategy
is flexible in that it allows for arbitrary machine learning algorithms to be
used in estimating the conditional mean of $Y$ given $X$, $\mu(X) = \E(Y \mid
X)$, and the conditional variance of $Y$ given $X$, $\sigma(X) = \E[(Y -
\mu(X))^2 \mid X]$.
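
As a concrete instance, take $\rho$ to be the standard normal density; the
fitted conditional density may then be evaluated directly from the two nuisance
estimates. The sketch below is ours, with `sigma_hat` denoting the estimated
conditional standard deviation (the square root of the variance nuisance
above); the division by `sigma_hat` is the change-of-variables factor that
makes the rescaled density integrate to one.

```{r locscale-gaussian-sketch, eval = FALSE}
# Gaussian location-scale conditional density: evaluate the standard normal
# density at the standardized residual, then rescale by the conditional scale
cond_dens_locscale <- function(y, mu_hat, sigma_hat) {
  dnorm((y - mu_hat) / sigma_hat) / sigma_hat
}

# e.g., density of y = 1 given x with mu_hat(x) = 0.5 and sigma_hat(x) = 2
cond_dens_locscale(y = 1, mu_hat = 0.5, sigma_hat = 2)
```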

In settings with limited data, the additional structure imposed by the
assumption that the target density belongs to a location-scale family may prove
advantageous by smoothing over areas of low support in the data. However, in
practice, it is impossible to know whether and when this assumption holds. This
procedure is not a novel contribution of our own (and we have been unable to
locate a formal description of it in the literature); nevertheless, we provide
an informal algorithm sketch below. This algorithm assumes access to $n$
independent and identically distributed (i.i.d.) copies of an observed data
random variable $O = (Y, X)$, an _a priori_-specified kernel function $\rho$, a
candidate regression procedure $f_{\mu}$ to estimate $\mu(X)$, and a candidate
regression procedure $f_{\sigma}$ to estimate $\sigma(X)$.

1. Estimate $\mu(X) = \E[Y \mid X]$, the conditional mean of $Y$ given $X$, by
regressing $Y$ on $X$ with the candidate regression procedure $f_{\mu}$.
2. Estimate $\sigma(X) = \E[(Y - \mu(X))^2 \mid X]$ by regressing the squared
residuals $(Y - \hat{\mu}(X))^2$ on $X$ with the candidate regression procedure
$f_{\sigma}$ (or, more simply, by taking their marginal mean).
3. Form the conditional density estimate
$p_n(Y \mid X) = \rho((Y - \mu_n(X)) / \sigma_n(X))$.

This algorithm sketch encompasses two forms of this CDE approach, which diverge
at the second step above. To simplify the approach, one may elect to estimate
only the conditional mean $\mu(X)$, leaving the conditional variance to be
assumed constant (i.e., estimated simply as the marginal mean of the squared
residuals, $\E[(Y - \hat{\mu}(X))^2]$). This subclass of CDE approaches has
_homoscedastic error_, based on the variance assumption made. The conditional
variance can instead be estimated as the conditional mean of the squared
residuals $(Y - \hat{\mu}(X))^2$ given $X$, $\E[(Y - \hat{\mu}(X))^2 \mid X]$,
where the candidate algorithm $f_{\sigma}$ is used to evaluate the expectation.
Both approaches have been implemented in `sl3`, in the learner
`Lrnr_density_semiparametric`. The `mean_learner` argument specifies $f_{\mu}$
and the optional `var_learner` argument specifies $f_{\sigma}$. We demonstrate
CDE with this approach below.

```{r cde-using-locscale, eval = FALSE}
# semiparametric density estimator with homoscedastic errors (HOSE);
# the particular mean learner here is illustrative -- any sl3 regression
# learner may be substituted
hose_lrnr <- Lrnr_density_semiparametric$new(
  mean_learner = Lrnr_glm$new()
)

# semiparametric density estimator with heteroscedastic errors (HESE),
# additionally specifying a (likewise illustrative) variance learner
hese_lrnr <- Lrnr_density_semiparametric$new(
  mean_learner = Lrnr_glm$new(),
  var_learner = Lrnr_glm$new()
)

# SL over the candidate conditional density learners, combined under the
# negative log-density loss
sl_dens_lrnr <- Lrnr_sl$new(
  learners = list(hose_lrnr, hese_lrnr),
  metalearner = Lrnr_solnp_density$new()
)
```

### Pooled hazard regression

Another approach for CDE available in `sl3`, and originally proposed in
@diaz2011super, leverages the relationship between the (conditional) hazard and
density functions. To develop their CDE framework, @diaz2011super proposed
discretizing a continuous dependent variable $Y$ with support $\mathcal{Y}$
based on a number of bins $T$ and a binning procedure (e.g., cutting
$\mathcal{Y}$ into $T$ bins of exactly the same length). The tuning parameter
$T$ conceptually corresponds to the choice of bandwidth in classical kernel
density estimation. Following discretization, each unit is represented by a
collection of records, and the number of records representing a given unit
depends on the rank of the bin (along the discretized support) into which the
unit falls.

To take an example, an instantiation of this procedure might divide the support
of $Y$ into, say, $T = 4$ bins of equal length (note this requires $T+1$ cut
points): $[\alpha_1, \alpha_2), [\alpha_2, \alpha_3), [\alpha_3, \alpha_4),
[\alpha_4, \alpha_5]$ (n.b., the rightmost interval is fully closed while the
others are only partially closed). Next, an artificial, repeated-measures
dataset would be created in which each unit would be represented by up to $T$
records. To better see this structure, consider an individual unit $O_i = (Y_i,
X_i)$ whose $Y_i$ value is within $[\alpha_3, \alpha_4)$, the third bin. This
unit would be represented by three distinct records $\{Y_{ij},
X_{ij}\}_{j=1}^{3}$, where $Y_{i1} = Y_{i2} = 0$ and $Y_{i3} = 1$, and the
$X_{ij}$ are three exact copies of $X_i$. This representation in terms of
multiple records for the same unit allows the conditional hazard probability
of $Y_i$ falling in a given bin along the discretized support to be evaluated
via standard binary regression techniques.
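
This expansion is mechanical, and a few lines of base R make it concrete. The
helper below is a sketch of our own for illustration, not a function of `sl3`
or `haldensify`.

```{r pooled-haz-expand-sketch, eval = FALSE}
# expand one unit (y, x) into its hazard-regression records, given the
# bin cut points alpha_1 < ... < alpha_{T+1}
expand_unit <- function(y, x, breaks) {
  bin <- findInterval(y, breaks, rightmost.closed = TRUE)
  data.frame(
    bin_id = seq_len(bin),                     # bins 1, ..., t
    in_bin = as.integer(seq_len(bin) == bin),  # binary outcome: 0, ..., 0, 1
    x = rep(x, bin)                            # exact copies of the covariates
  )
}

# a unit whose y falls in the third of T = 4 equal-length bins on [0, 1]
# yields three records with outcomes 0, 0, 1
expand_unit(y = 0.6, x = 1.5, breaks = seq(0, 1, length.out = 5))
```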

In fact, this proposal reformulates the binary regression problem into a
corresponding set of hazard regressions: $\mathbb{P}(Y \in [\alpha_{t-1},
\alpha_t) \mid Y \geq \alpha_{t-1}, X)$, the conditional probability that $Y$
falls in the bin $[\alpha_{t-1}, \alpha_t)$, given that it did not fall in any
earlier bin. We provide an informal sketch of this algorithm below.
1. Divide the support $\mathcal{Y}$ of $Y$ into $T$ bins via cut points
$\alpha_1 < \cdots < \alpha_{T+1}$.
2. Generate the artificial repeated-measures dataset, representing each unit
$i$ by one record per bin, up to and including the bin into which
$Y_i$ falls.
3. Estimate the hazard probability, conditional on $X$, of bin membership
$\mathbb{P}(Y_i \in [\alpha_{t-1}, \alpha_t) \mid X)$ using any binary
regression estimator or appropriate machine learning algorithm.
4. Rescale the conditional hazard probability estimates to the conditional
density scale by dividing the probability of membership in a bin by the width
of the bin into which $Y_i$ falls, for each observation $i = 1, \ldots, n$.
If the support
set is partitioned into bins of equal size (approximately $n/T$ samples in
each bin), this amounts to rescaling by a constant. If the support set is
partitioned into bins of equal range, then the rescaling might vary across
bins (see the sketch following this list).
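
As a sketch of this final rescaling step (the helper name is ours): the
probability that $Y$ lands in bin $t$ is the estimated hazard for bin $t$ times
the probability of having escaped all earlier bins, and dividing that bin
probability by the bin's width puts the estimate on the density scale.

```{r haz-to-dens-sketch, eval = FALSE}
# map estimated conditional hazards for the T bins to a conditional density
haz_to_density <- function(hazards, bin_widths) {
  surv <- cumprod(c(1, 1 - hazards[-length(hazards)]))  # escaped earlier bins
  bin_probs <- hazards * surv                           # P(Y in bin t | X)
  bin_probs / bin_widths                                # density scale
}

# four equal-length bins on [0, 1]; the resulting values integrate to one
haz_to_density(hazards = c(0.2, 0.3, 0.5, 1), bin_widths = rep(0.25, 4))
```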

A key element of this proposal is the flexibility to use any binary regression
procedure or appropriate machine learning algorithm to estimate $\mathbb{P}(Y
\in [\alpha_{t-1}, \alpha_t) \mid X)$, facilitating the incorporation of
flexible techniques like ensemble learning [@breiman1996stacked; @vdl2007super].
This extreme degree of flexibility integrates perfectly with the underlying
design principles of `sl3`; however, we have not yet implemented this approach
in its full generality. A version of this CDE approach, which limits the
original proposal by replacing the use of arbitrary binary regression with the
highly adaptive lasso (HAL) algorithm [@benkeser2016hal], is supported in the
[`haldensify` package](https://github.com/nhejazi/haldensify)
[@hejazi2020haldensify] (the HAL implementation in `haldensify` is provided by
the [`hal9001` package](https://github.com/tlverse/hal9001) [@coyle2020hal9001;
@hejazi2020hal9001]). This CDE algorithm, which uses `haldensify`, is
incorporated as the learner `Lrnr_haldensify` in `sl3`, as we demonstrate
below.

```{r cde-using-pooledhaz, eval = FALSE}
# learners used for conditional densities for (g_n)
# a minimal sketch: the tuning values below (numbers of bins and the lasso
# regularization sequence) are illustrative choices, not recommendations
haldensify_lrnr <- Lrnr_haldensify$new(
  n_bins = c(5, 10),
  lambda_seq = exp(seq(-1, -10, length = 100))
)
```
