From c27cfcf0b73d63ab8b46ce7e9f1519cc7440a749 Mon Sep 17 00:00:00 2001
From: Nima Hejazi
Date: Wed, 5 Jul 2023 15:53:26 -0700
Subject: [PATCH] fix missing math mode

---
 06-sl3.Rmd | 203 ++++++++++++++++++++++++++---------------------------
 1 file changed, 101 insertions(+), 102 deletions(-)

diff --git a/06-sl3.Rmd b/06-sl3.Rmd
index 9686ebf..d1e5e14 100644
--- a/06-sl3.Rmd
+++ b/06-sl3.Rmd
@@ -1805,67 +1805,67 @@ if (knitr::is_latex_output()) {
# plot variable importance
importance_plot(x = washb_varimp)
```
-According to the `sl3` variable importance measures, which were assessed by
-the mean squared error (MSE) difference under permutations of each covariate,
-the fitted SL's (`sl_fit`) most important variables for predicting
-weight-for-height z-score (`whz`) are child age (`aged`) and household assets
-(`assets`) that reflect the socio-economic status of the study's subjects.
+According to the `sl3` variable importance measures, which were assessed by the
+mean squared error (MSE) difference under permutations of each covariate, the
+fitted SL's (`sl_fit`) most important variables for predicting weight-for-height
+z-score (`whz`) are child age (`aged`) and household assets (`assets`), which
+reflect the socio-economic status of the study's subjects.

## Conditional Density Estimation

-In certain scenarios it may be useful to estimate the conditional density of a
-dependent variable, given predictors/covariates that precede it. In the
-context of causal inference, this arises most readily when working with
-continuous-valued treatments. Specifically, conditional density estimation (CDE)
-is necessary when estimating the treatment mechanism for a continuous-valued
-treatment, often called the _generalized propensity score_. Compared the
-classical propensity score (PS) for binary treatments (the conditional
-probability of receiving the treatment given covariates),
-$\mathbb{P}(A = 1 \mid W)$, the generalized PS is the conditional density of
-treatment $A$, given covariates $W$, $\mathbb{P}(A \mid W)$.
-
-CDE often requires specialized approaches tied to very specific algorithmic
-implementations. To our knowledge, general and flexible algorithms for
-CDE have been proposed only sparsely in the literature. We have implemented two
-such approaches in `sl3`: a semiparametric CDE approach that makes certain
-assumptions about the constancy of (higher) moments of the underlying
-distribution, and second approach that exploits the relationship between the
-conditional hazard and density functions to allow CDE via pooled hazard
-regression. Both approaches are flexible in that they allow
-the use of arbitrary regression functions or machine learning algorithms for the
-estimation of nuisance quantities (the conditional mean or the conditional
-hazard, respectively). We elaborate on these two frameworks below. Importantly,
-per @dudoit2005asymptotics and related works, a loss function appropriate for
-density estimation is the negative log-density loss $L(\cdot) = -\log(p_n(\cdot))$.
+In certain scenarios it may be useful to estimate the conditional density of a
+dependent variable, given predictors/covariates that precede it. In the context
+of causal inference, this arises most readily when working with
+continuous-valued treatments. Specifically, conditional density estimation (CDE)
+is necessary when estimating the treatment mechanism for a continuous-valued
+treatment, often called the _generalized propensity score_. Compared to the
+classical propensity score (PS) for binary treatments (the conditional
+probability of receiving the treatment given covariates), $\mathbb{P}(A = 1 \mid
+W)$, the generalized PS is the conditional density of treatment $A$, given
+covariates $W$, $\mathbb{P}(A \mid W)$.
+
+CDE often requires specialized approaches tied to very specific algorithmic
+implementations. To our knowledge, general and flexible algorithms for CDE have
+been proposed only sparsely in the literature. We have implemented two such
+approaches in `sl3`: a semiparametric CDE approach that makes certain
+assumptions about the constancy of (higher) moments of the underlying
+distribution, and a second approach that exploits the relationship between the
+conditional hazard and density functions to allow CDE via pooled hazard
+regression. Both approaches are flexible in that they allow the use of arbitrary
+regression functions or machine learning algorithms for the estimation of
+nuisance quantities (the conditional mean or the conditional hazard,
+respectively). We elaborate on these two frameworks below. Importantly, per
+@dudoit2005asymptotics and related works, a loss function appropriate for
+density estimation is the negative log-density loss $L(\cdot) =
+-\log(p_n(\cdot))$.

### Moment-restricted location-scale

This family of semiparametric CDE approaches exploits the general form $\rho(Y -
-\mu(X) / \sigma(X))$, where $Y$ is the dependent variable of interest (e.g.,
-treatment $A$ in the PS), $X$ are the predictors (e.g., covariates $W$ in the
-PS), \rho$ is a specified marginal density function, and $\mu(X) = \E(Y \mid X)$
-and $\sigma(X) = \E[(Y - \mu(X))^2 \mid X]$ are nuisance functions of the
-dependent variable that may be estimated flexibly. CDE procedures formulated
-within this framework may be characterized as belonging to a
-_conditional location-scale_ family, that is, in which
-$p_n(Y \mid X) = \rho((Y - \mu_n(X)) / \sigma_n(X))$. While CDE with
-conditional location-scale families is not without potential disadvantages
-(e.g., the restriction on the density's functional form could lead to
-misspecification bias), this strategy is flexible in that it allows for
-arbitrary machine learning algorithms to be used in estimating the conditional
-mean of $Y$ given $X$, \mu(X) = \E(Y \mid X)$, and the conditional variance
-of $Y$ given $X$, $\sigma(X) = \E[(Y - \mu(X))^2 \mid X]$.
+\mu(X) / \sigma(X))$, where $Y$ is the dependent variable of interest (e.g.,
+treatment $A$ in the PS), $X$ are the predictors (e.g., covariates $W$ in the
+PS), $\rho$ is a specified marginal density function, and $\mu(X) = \E(Y \mid
+X)$ and $\sigma(X) = \E[(Y - \mu(X))^2 \mid X]$ are nuisance functions of the
+dependent variable that may be estimated flexibly. CDE procedures formulated
+within this framework may be characterized as belonging to a _conditional
+location-scale_ family, that is, in which $p_n(Y \mid X) = \rho((Y - \mu_n(X)) /
+\sigma_n(X))$. While CDE with conditional location-scale families is not without
+potential disadvantages (e.g., the restriction on the density's functional form
+could lead to misspecification bias), this strategy is flexible in that it
+allows for arbitrary machine learning algorithms to be used in estimating the
+conditional mean of $Y$ given $X$, $\mu(X) = \E(Y \mid X)$, and the conditional
+variance of $Y$ given $X$, $\sigma(X) = \E[(Y - \mu(X))^2 \mid X]$.

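Before turning to the informal algorithm sketch and the `sl3` learners below,
the following minimal sketch illustrates the location-scale idea numerically.
It is not part of `sl3` and all object names are hypothetical: it assumes a
standard normal kernel for $\rho$, uses plain `glm()` fits as stand-ins for
arbitrary regression procedures, and includes the usual $1 / \sigma(X)$
change-of-variables factor when evaluating the density.

```{r cde-locscale-sketch}
# minimal location-scale CDE sketch (illustrative only, not the sl3 internals)
set.seed(45)
n <- 500
w <- runif(n, 0, 2)                         # covariate
a <- 2 * w + rnorm(n, sd = 0.5 + 0.25 * w)  # continuous "treatment"

# step 1: estimate mu(W) = E[A | W] with any regression procedure
mu_fit <- glm(a ~ w)
mu_hat <- predict(mu_fit)

# step 2: estimate sigma^2(W) = E[(A - mu(W))^2 | W] by regressing the squared
# residuals on W; replacing this with the marginal mean of the squared
# residuals gives the homoscedastic-error variant
resid_sq <- (a - mu_hat)^2
var_fit <- glm(resid_sq ~ w)
sigma_hat <- sqrt(pmax(predict(var_fit), 1e-6))

# step 3: evaluate the conditional density with a standard normal kernel; the
# 1 / sigma_hat factor is the change-of-variables scaling
dens_hat <- dnorm((a - mu_hat) / sigma_hat) / sigma_hat

# negative log-density loss, the appropriate risk for density estimation
mean(-log(dens_hat))
```
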
In settings with limited data, the additional structure imposed by the
assumption that the target density belongs to a location-scale family may prove
-advantageous by smoothing over areas of low support in the data. However, in
-practice, it is impossible to know whether and when this assumption holds. This
-procedure is not a novel contribution of our own (and we have been unable to
-locate a formal description of it in the literature); nevertheless, we provide
-an informal algorithm sketch below. This algorithm considers access to $n$
-independendent and identically distributed (i.i.d.) copies of an observed data
-random variable $O = (Y, X)$, an _a priori_-specified kernel function $\rho$, a
-candidate regression procedure $f_{\mu}$ to estimate $\mu(X)$, and a candidate
+advantageous by smoothing over areas of low support in the data. However, in
+practice, it is impossible to know whether and when this assumption holds. This
+procedure is not a novel contribution of our own (and we have been unable to
+locate a formal description of it in the literature); nevertheless, we provide
+an informal algorithm sketch below. This algorithm assumes access to $n$
+independent and identically distributed (i.i.d.) copies of an observed data
+random variable $O = (Y, X)$, an _a priori_-specified kernel function $\rho$, a
+candidate regression procedure $f_{\mu}$ to estimate $\mu(X)$, and a candidate
regression procedure $f_{\sigma}$ to estimate $\sigma(X)$.

1. Estimate $\mu(X) = \E[Y \mid X]$, the conditional mean of $Y$ given $X$, by
@@ -1881,17 +1881,17 @@ regression procedure $f_{\sigma}$ to estimate $\sigma(X)$.

This algorithm sketch encompasses two forms of this CDE approach, which diverge
at the second step above. To simplify the approach, one may elect to estimate
-only the conditional mean $\mu(X)$, leaving the conditional variance to be
-assumed constant (i.e., estimated simply as the marginal mean of the
-residuals $\E[(Y - \hat{\mu}(X))^2]$). This subclass of CDE approaches have
-_homoscedastic error_ based on the variance assumption made. The conditional
-variance can instead by estimated as the conditional mean of the residuals
-$(Y - \hat{\mu}(X))^2$ given $X$, $\E[(Y - \hat{\mu}(X))^2 \mid X]$, where the
-candidate algorithm $f_{\sigma}$ is used to evaluate the expectation.
-Both approaches have been implemented in `sl3`, in the learner
-`Lrnr_density_semiparametric`. The `mean_learner` argument specifies
-$f_{\mu}$ and the optional `var_learner` argument specifies $f_{\sigma}$. We
-demonstrate CDE with this approach below.
+only the conditional mean $\mu(X)$, leaving the conditional variance to be
+assumed constant (i.e., estimated simply as the marginal mean of the residuals
+$\E[(Y - \hat{\mu}(X))^2]$). This subclass of CDE approaches has _homoscedastic
+error_, based on the variance assumption made. The conditional variance can
+instead be estimated as the conditional mean of the residuals $(Y -
+\hat{\mu}(X))^2$ given $X$, $\E[(Y - \hat{\mu}(X))^2 \mid X]$, where the
+candidate algorithm $f_{\sigma}$ is used to evaluate the expectation. Both
+approaches have been implemented in `sl3`, in the learner
+`Lrnr_density_semiparametric`. The `mean_learner` argument specifies $f_{\mu}$
+and the optional `var_learner` argument specifies $f_{\sigma}$. We demonstrate
+CDE with this approach below.

```{r cde-using-locscale, eval = FALSE}
# semiparametric density estimator with homoscedastic errors (HOSE)

sl_dens_lrnr <- Lrnr_sl$new(
@@ -1913,32 +1913,32 @@ sl_dens_lrnr <- Lrnr_sl$new(

### Pooled hazard regression

-Another approach for CDE available in `sl3`, and originally proposed in
-@diaz2011super, leverages the relationship between the (conditional) hazard and
-density functions. To develop their CDE framework, @diaz2011super proposed
-discretizing a continuous dependent variable $Y$ with support $\mathcal{Y}$
-based on a number of bins $T$ and a binning procedure (e.g., cutting
+Another approach for CDE available in `sl3`, and originally proposed in
+@diaz2011super, leverages the relationship between the (conditional) hazard and
+density functions. To develop their CDE framework, @diaz2011super proposed
+discretizing a continuous dependent variable $Y$ with support $\mathcal{Y}$
+based on a number of bins $T$ and a binning procedure (e.g., cutting
$\mathcal{Y}$ into $T$ bins of exactly the same length). The tuning parameter
-$T$ conceptually corresponds to the choice of bandwidth in classical kernel
-density estimation. Following discretization, each unit is represented by
-a collection of records, and the number of records representing a given unit
-depends on the rank of the bin (along the discretized support) into which the
+$T$ conceptually corresponds to the choice of bandwidth in classical kernel
+density estimation. Following discretization, each unit is represented by a
+collection of records, and the number of records representing a given unit
+depends on the rank of the bin (along the discretized support) into which the
unit falls.

-To take an example, an instantiation of this procedure might divide the support
-of $Y$ into, say, $T = 4$, bins of equal length (note this requires $T+1$ cut
-points): $[\alpha_1, \alpha_2), [\alpha_2, \alpha_3), [\alpha_3, \alpha_4),
-[\alpha_4, \alpha_5]$ (n.b., the rightmost interval is fully closed while the
-others are only partially closed). Next, an artificial, repeated measures
-dataset would be created in which each unit would be represented by up to $T$
-records. To better see this structure, consider an individual unit
-$O_i = (Y_i, X_i)$ whose $Y_i$ value is within $[\alpha_3, \alpha_4)$, the
-third bin. This unit would be represented by three distinct records:
-$\{Y_{ij}, X_{ij}\}_{j=1}^3$, where $\{\{Y_{ij} = 0\}_{j=1}^2$, $Y_{i3} = 1\}$
-and three exact copies of $X_i$, $\{X_{ij}\}_{j=1}^3$. This representation in
-terms of multiple records for the same unit allows for the conditional hazard
-probability of $Y_i$ falling in a given bin along the discretized support to
-be evaluated via standard binary regression techniques.
+To take an example, an instantiation of this procedure might divide the support
+of $Y$ into, say, $T = 4$ bins of equal length (note this requires $T+1$ cut
+points): $[\alpha_1, \alpha_2), [\alpha_2, \alpha_3), [\alpha_3, \alpha_4),
+[\alpha_4, \alpha_5]$ (n.b., the rightmost interval is fully closed while the
+others are only partially closed). Next, an artificial, repeated measures
+dataset would be created in which each unit would be represented by up to $T$
+records. To better see this structure, consider an individual unit $O_i = (Y_i,
+X_i)$ whose $Y_i$ value is within $[\alpha_3, \alpha_4)$, the third bin. This
+unit would be represented by three distinct records: $\{Y_{ij},
+X_{ij}\}_{j=1}^3$, where $\{\{Y_{ij} = 0\}_{j=1}^2$, $Y_{i3} = 1\}$ and three
+exact copies of $X_i$, $\{X_{ij}\}_{j=1}^3$.
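
To make the record expansion concrete before continuing, here is a small
illustrative sketch (hypothetical code, not the `sl3` or `haldensify`
implementation) that builds the artificial repeated-measures dataset for
$T = 4$ equal-length bins and fits the discretized hazard with a plain
logistic regression standing in for an arbitrary binary regression.

```{r cde-pooledhaz-sketch}
# illustrative record expansion for pooled hazard regression (hypothetical
# helper code, not the sl3 / haldensify implementation)
set.seed(27)
n <- 50
x <- rnorm(n)
y <- x + rnorm(n)

# T = 4 bins of equal length over the support of y (T + 1 cut points)
n_bins <- 4
alpha <- seq(min(y), max(y), length.out = n_bins + 1)
bin_id <- findInterval(y, alpha, rightmost.closed = TRUE)

# each unit contributes one record per bin, up to and including the bin that
# contains y_i; the indicator in_bin is 1 only for that final record
long_data <- do.call(rbind, lapply(seq_len(n), function(i) {
  data.frame(
    id = i,
    bin = seq_len(bin_id[i]),
    in_bin = as.numeric(seq_len(bin_id[i]) == bin_id[i]),
    x = x[i]
  )
}))

# the discretized hazard of falling in bin t, given that no earlier bin was
# reached and given x, can now be fit with any binary regression procedure
haz_fit <- glm(in_bin ~ factor(bin) + x, family = binomial(), data = long_data)
head(long_data)
```
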
+This representation in terms of multiple records for the same unit allows for
+the conditional hazard probability of $Y_i$ falling in a given bin along the
+discretized support to be evaluated via standard binary regression techniques.

In fact, this proposal reformulates the binary regression problem into a
corresponding set of hazard regressions: $\mathbb{P} (Y \in [\alpha_{t-1},
@@ -1968,30 +1968,29 @@ informal sketch of this algorithm below.
   $Y_i$ falls.
3. Estimate the hazard probability, conditional on $X$, of bin membership
   $\mathbb{P}(Y_i \in [\alpha_{t-1}, \alpha_t) \mid X)$ using any binary
-   regression estimator or appropriate machine learning algorithm.
+   regression estimator or appropriate machine learning algorithm.
4. Rescale the conditional hazard probability estimates to the conditional
   density scale by dividing the cumulative hazard by the width of the bin into
   which $X_i$ falls, for each observation $i = 1, \ldots, n$. If the support
-   set is partitioned into bins of equal size (approximately $n/T$ samples in
-   each bin), this amounts to rescaling by a constant. If the support
-   set is partitioned into bins of equal range, then the rescaling might vary
-   across bins.
+   set is partitioned into bins of equal size (approximately $n/T$ samples in
+   each bin), this amounts to rescaling by a constant. If the support set is
+   partitioned into bins of equal range, then the rescaling might vary across
+   bins.

A key element of this proposal is the flexibility to use any binary regression
-procedure or appropriate machine learning algorithm to estimate $\mathbb{P}(Y
-\in [\alpha_{t-1}, \alpha_t) \mid X)$, facilitating the incorporation of
-flexibletechniques like ensemble learning [@breiman1996stacked; @vdl2007super].
-This extreme degree of flexibility integrates perfectly with the underlying
-design principles of `sl3`; however, we have not yet implemented this approach
-in its full generality. A version of this CDE approach, which limits the
-original proposal by replacing the use of arbitrary binary regression with the
+procedure or appropriate machine learning algorithm to estimate $\mathbb{P}(Y
+\in [\alpha_{t-1}, \alpha_t) \mid X)$, facilitating the incorporation of
+flexible techniques like ensemble learning [@breiman1996stacked; @vdl2007super].
+This extreme degree of flexibility integrates perfectly with the underlying
+design principles of `sl3`; however, we have not yet implemented this approach
+in its full generality. A version of this CDE approach, which limits the
+original proposal by replacing the use of arbitrary binary regression with the
highly adaptive lasso (HAL) algorithm [@benkeser2016hal] is supported in the
[`haldensify` package](https://github.com/nhejazi/haldensify)
-[@hejazi2020haldensify] (the HAL implementation in `haldensify` is provided the
-[`hal9001` package](https://github.com tlverse/hal9001)
-[@coyle2020hal9001; @hejazi2020hal9001]). This CDE algorithm that uses
-`haldensify` is incorporated as learner `Lrnr_haldensify` in `sl3`, as we
-demonstrate below.
+[@hejazi2020haldensify] (the HAL implementation in `haldensify` is provided by
+the [`hal9001` package](https://github.com/tlverse/hal9001) [@coyle2020hal9001;
+@hejazi2020hal9001]). This CDE algorithm, which uses `haldensify`, is
+incorporated as the learner `Lrnr_haldensify` in `sl3`, as we demonstrate below.

```{r cde-using-pooledhaz, eval = FALSE}
# learners used for conditional densities for (g_n)