How to obtain y and d residuals for plotting #161
Hello, I am using DoubleMLPLR to estimate the effect of 8 candidate treatment variables (d1:d8, with use_other_treat_as_covariate=TRUE) on an outcome measure (y) while adjusting for a set of other covariates. After fitting, I found a significant effect of d1, and now I would like to plot the residualized outcome measure (y' ~ covariates) against the residualized d1 measure (d1' ~ covariates). Are the residuals for y' and d1' available somewhere in the fitted object? While I could plot the original y vs. d1, it seems like plotting the residuals might be a more accurate representation of the results from double ML. Thanks!
Replies: 2 comments
Hi @PhilipSpechler, thanks for your question. I think we do not directly export the residuals of the nuisance parts, but you can compute them on your own. To do this, you can export the predictions from the fitting stage and construct the residuals accordingly. Below is an example based on the code from the section on simultaneous inference in our user guide. To save the nuisance predictions, make sure that you specify the option store_predictions=True when calling fit().

import doubleml as dml
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LassoCV
np.random.seed(1234)
n_obs = 500
n_vars = 100
X = np.random.normal(size=(n_obs, n_vars))
theta = np.array([3., 3., 3.])
y = np.dot(X[:, :3], theta) + np.random.standard_normal(size=(n_obs,))
dml_data = dml.DoubleMLData.from_arrays(X[:, 10:], y, X[:, :10])
learner = LassoCV()
ml_l = clone(learner)
ml_m = clone(learner)
dml_plr = dml.DoubleMLPLR(dml_data, ml_l, ml_m)
dml_plr.fit(store_predictions = True)
print(dml_plr.summary)

The predictions are then stored in a NumPy array with dimensions (number of observations x number of cross-fitting repetitions x number of treatment variables). You can access these arrays to calculate and plot the residuals:

# Predictions for nuisance part 'ml_l' and 'ml_m' stored in an array with dimensions (n_obs x n_rep x n_treat)
print(dml_plr.predictions['ml_l'].shape)
print(dml_plr.predictions['ml_m'].shape)
# Compute residuals for ml_l = E[Y|X]
residuals_ml_l_d1 = dml_data.y - dml_plr.predictions['ml_l'][:,:,0]
# Compute residuals for ml_m = E[D_1 | X] (for first treatment variable)
residuals_ml_m_d1 = dml_data.data[dml_data.d_cols[0]].values - dml_plr.predictions['ml_m'][:,:,0]
# Generate a scatter plot of the residuals
import matplotlib.pyplot as plt
plt.scatter(residuals_ml_m_d1, residuals_ml_l_d1)
plt.show()

I hope this helps you a bit. @MalteKurz, if you'd like to add anything here, feel free to edit/comment.

Best,
Philipp
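As a side note on why the residual-vs-residual plot is informative: in a partially linear model, the slope of a line through that scatter is the quantity the partialling-out approach targets. The sketch below illustrates this with plain NumPy only. It is not DoubleML code: it uses ordinary least squares instead of LassoCV, does no cross-fitting, and all names (`ols_residuals`, `res_y`, `res_d`, `theta_hat`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1234)
n_obs, n_covs = 500, 5

# Synthetic partially linear data: d depends on X, y depends on d and X
X = rng.normal(size=(n_obs, n_covs))
d = X @ rng.normal(size=n_covs) + rng.normal(size=n_obs)
theta_true = 3.0
y = theta_true * d + X @ rng.normal(size=n_covs) + rng.normal(size=n_obs)

# Design matrix with an intercept column
X1 = np.column_stack([np.ones(n_obs), X])


def ols_residuals(target, design):
    """Residuals of a least-squares fit of target on design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


res_y = ols_residuals(y, X1)  # y with the covariates partialled out
res_d = ols_residuals(d, X1)  # d with the covariates partialled out

# Slope of the residual-on-residual regression (no intercept needed,
# because both residual vectors already have mean zero)
theta_hat = (res_d @ res_y) / (res_d @ res_d)

# For comparison: coefficient on d in the full regression of y on [d, X1]
full_coef, *_ = np.linalg.lstsq(np.column_stack([d, X1]), y, rcond=None)

print(theta_hat, full_coef[0])
```

With OLS nuisance fits, the two printed numbers agree (Frisch-Waugh-Lovell). With machine-learning nuisance estimates and cross-fitting, as in DoubleML, the same ratio computed from the stored residuals should be close to the reported coefficient, but I would treat that as something to check rather than a guarantee.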
Just a small correction. To calculate the correct residuals, the predictions have to be reshaped:

# Predictions for nuisance part 'ml_l' and 'ml_m' stored in an array with dimensions (n_obs x n_rep x n_treat)
print(dml_plr.predictions['ml_l'].shape)
print(dml_plr.predictions['ml_m'].shape)
# Compute residuals for ml_l = E[Y|X]
residuals_ml_l_d1 = dml_data.y - dml_plr.predictions['ml_l'][:,:,0].reshape(-1)
# Compute residuals for ml_m = E[D_1 | X] (for first treatment variable)
residuals_ml_m_d1 = dml_data.data[dml_data.d_cols[0]].values - dml_plr.predictions['ml_m'][:,:,0].reshape(-1)
# Generate a scatter plot of the residuals
import matplotlib.pyplot as plt
plt.scatter(residuals_ml_m_d1, residuals_ml_l_d1)
plt.show()

But with version

# Target values for nuisance part 'ml_l' and 'ml_m' stored in an array with dimensions (n_obs x n_rep x n_treat)
print(dml_plr.nuisance_targets['ml_l'].shape)
print(dml_plr.nuisance_targets['ml_m'].shape)
# Compute residuals for ml_l = E[Y|X] for all treatments
residuals_ml_l = dml_plr.nuisance_targets['ml_l'] - dml_plr.predictions['ml_l']
# Compute residuals for ml_m = E[D | X] (for all treatment variables)
residuals_ml_m = dml_plr.nuisance_targets['ml_m'] - dml_plr.predictions['ml_m']

And the corresponding plot would look something like this:

import pandas as pd
import seaborn as sns
df = pd.melt(pd.DataFrame(residuals_ml_l[:, 0, :]), var_name="Treatment", value_name="Residual ml_l")
df["Residual ml_m"] = pd.melt(pd.DataFrame(residuals_ml_m[:, 0, :]))["value"]
g = sns.FacetGrid(df, col="Treatment", col_wrap=3)
g.map(sns.scatterplot, "Residual ml_m", "Residual ml_l")

Best,
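For anyone wondering why the reshape matters: slicing the predictions with `[:, :, 0]` keeps the repetition axis, so the result has shape (n_obs, n_rep), and subtracting it from a 1-D outcome vector triggers NumPy broadcasting instead of an element-wise difference. A minimal sketch with a random stand-in array (the `predictions` array here is fake data with the same layout as the stored predictions, not output from DoubleML):

```python
import numpy as np

n_obs, n_rep, n_treat = 6, 1, 3
rng = np.random.default_rng(0)

y = rng.normal(size=n_obs)  # outcome, shape (n_obs,)
# Stand-in for the stored predictions, shape (n_obs, n_rep, n_treat)
predictions = rng.normal(size=(n_obs, n_rep, n_treat))

pred_d1 = predictions[:, :, 0]  # shape (n_obs, n_rep) = (6, 1)

# (6,) minus (6, 1) broadcasts to a (6, 6) matrix: not what we want
wrong = y - pred_d1

# Flattening the slice gives the intended element-wise residuals
right = y - pred_d1.reshape(-1)

# Indexing one repetition explicitly avoids the extra axis altogether
also_right = y - predictions[:, 0, 0]

print(wrong.shape, right.shape, also_right.shape)
```

Note that `reshape(-1)` only lines up with `y` when there is a single repetition. With n_rep > 1 it would produce n_obs * n_rep values, so picking a repetition by index (as in `also_right`) is probably the safer habit.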