Result validity despite low predictive performances #215
-
Hi, to address this question I tried to replicate your analysis by re-running it and evaluating the performance of the ML algorithms you used. You can find my attempt in a Colab notebook, in which I removed all the comments and added a performance-evaluation step before the confidence-interval study. In this case as well, the ML algorithms do not seem to perform very well (a maximum R^2 of about 0.3), yet you labeled your analysis as valid. It would be great to have a discussion about this topic. Thanks a lot for your help.
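A minimal sketch of the kind of check described above, assuming a DoubleML PLR model (parameter names follow DoubleML >= 0.5) on the package's simulated CCDDHNR2018 data; the random-forest learners, seed, and data are my assumptions, not the original notebook:

```python
# Sketch (assumed setup, not the original notebook): fit a DoubleML PLR model
# on simulated data, then compute R^2 of the cross-fitted nuisance predictions,
# i.e. the kind of performance evaluation described above.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from doubleml import DoubleMLData, DoubleMLPLR
from doubleml.datasets import make_plr_CCDDHNR2018

np.random.seed(42)
df = make_plr_CCDDHNR2018(n_obs=500, return_type='DataFrame')
dml_data = DoubleMLData(df, y_col='y', d_cols='d')

dml_plr = DoubleMLPLR(dml_data,
                      ml_l=RandomForestRegressor(n_estimators=200),  # learns E[Y|X]
                      ml_m=RandomForestRegressor(n_estimators=200))  # learns E[D|X]
dml_plr.fit(store_predictions=True)

# Cross-fitted nuisance predictions, stored with shape (n_obs, n_rep, n_treat)
l_hat = dml_plr.predictions['ml_l'][:, 0, 0]
m_hat = dml_plr.predictions['ml_m'][:, 0, 0]
print('R^2 of ml_l (outcome regression):  ', r2_score(df['y'], l_hat))
print('R^2 of ml_m (treatment regression):', r2_score(df['d'], m_hat))
print(dml_plr.summary)  # the CI study would follow from here
```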
-
Hi Dario, could you elaborate on how you determine "bad" ML estimates? Of course, it might lead to issues if the machine learning model is not able to learn the nuisance elements, e.g. the propensity score.
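One quick diagnostic for the propensity-score case mentioned above is to look at the cross-fitted estimates for extreme values. This sketch assumes DoubleML's IRM model on the package's simulated data; the learners, seed, and thresholds are my choices for illustration:

```python
# Sketch: inspect cross-fitted propensity score estimates for extreme values,
# which indicate that ml_m struggles to learn P(D=1|X) or that overlap is poor.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from doubleml import DoubleMLIRM
from doubleml.datasets import make_irm_data

np.random.seed(42)
dml_data = make_irm_data(n_obs=500, return_type='DoubleMLData')
dml_irm = DoubleMLIRM(dml_data,
                      ml_g=RandomForestRegressor(n_estimators=200),
                      ml_m=RandomForestClassifier(n_estimators=200))
dml_irm.fit(store_predictions=True)

m_hat = dml_irm.predictions['ml_m'][:, 0, 0]  # cross-fitted P(D=1|X)
print('propensity score range:', m_hat.min(), m_hat.max())
print('share outside [0.01, 0.99]:', np.mean((m_hat < 0.01) | (m_hat > 0.99)))
```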
I am sorry for the late response. I agree with most of the points.

For evaluating the learners, I would recommend using `evaluate_learners()`, as this returns a cross-fitted value of the metric.

I think the important distinction is that usually the relevant confounding variables have to be included (in most cases, variables affecting both the treatment and the outcome). This does not mean that these variables have to be important predictors of the outcome.
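As a sketch of that recommendation, reusing the `dml_plr` object from the earlier snippet: the R^2 wrapper follows the custom-metric pattern from the DoubleML documentation, and the NaN mask is only needed for models whose nuisance predictions cover a subset of the observations (e.g. IRM):

```python
import numpy as np
from sklearn.metrics import r2_score

def r2(y_true, y_pred):
    # Some models only predict on a subset of observations, leaving NaNs elsewhere
    subset = np.logical_not(np.isnan(y_true))
    return r2_score(y_true[subset], y_pred[subset])

# Returns one cross-fitted metric value per nuisance learner and repetition
print(dml_plr.evaluate_learners(metric=r2))
# e.g. {'ml_l': array([[...]]), 'ml_m': array([[...]])}
```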