# Difference-in-differences
[List of packages](https://github.com/lnsongxf/DiD-1)
Examples in marketing
- [@liaukonyte2015television]: effect of TV ads on online shopping
- [@wang2018border]: effect of political ad source and message tone on vote shares and turnout, using discontinuities in the level of political ads at media-market borders
- [@datta2018changing]: effect of streaming services on total music consumption, using the timing of users' adoption of a music streaming service
- [@janakiraman2018effect]: effect of data breach announcements on customer spending, using the timing of a data breach and variation in whether a customer's information was breached in that event
- [@israeli2018online]: effect of digital monitoring and enforcement on violations, using enforcement of minimum advertised price policies
- [@ramani2019effects]: how firms respond to foreign direct investment liberalization, using India's 1991 reform
- [@pattabhiramaiah2019paywalls]: how paywalls affect readership
- [@akca2020value]: effect of aggregators on airline business
- [@lim2020competitive]: effect of nutritional labels on the nutritional quality of other brands in a category, using variation in the timing of label adoption across categories
- [@guo2020let]: effect of payment disclosure laws on physician prescription behavior, using the timing of the Massachusetts open payment law as the exogenous shock
- [@he2022market]: causal impact of fake reviews on sales and average ratings, using an Amazon policy change
- [@peukert2022regulatory]: impact of the European General Data Protection Regulation on website usage
Examples in econ:
- [@rosenzweig2000natural]
- [@angrist2001instrumental]
- [@fuchs2016natural]: macro
Show the mechanism via
- [Mediation Under DiD] analysis: see [@habel2021variable]
- Moderation analysis: see [@goldfarb2011online]
Steps to trust DID:
1. Visualize the treatment rollout (e.g., `panelView`).
2. Document the number of treated units in each cohort (e.g., control and treated).
3. Visualize the trajectory of average outcomes across cohorts (if you have multiple periods).
4. Conduct an event-study analysis, with and without covariates, to assess [Parallel Trends](#prior-parallel-trends-test).
5. For the case with covariates, check for overlap in covariates between treated and control groups to ensure control-group validity (e.g., if the control group is much smaller than the treated group, you might not have overlap and will have to extrapolate).
6. Conduct sensitivity analysis for parallel trend violations (e.g., `honestDiD`).
## Visualization
```{r}
library(panelView)
library(fixest)
library(tidyverse)
base_stagg <- fixest::base_stagg |>
# treatment status
dplyr::mutate(treat_stat = dplyr::if_else(time_to_treatment < 0, 0, 1)) |>
select(id, year, treat_stat, y)
head(base_stagg)
panelView::panelview(
y ~ treat_stat,
data = base_stagg,
index = c("id", "year"),
xlab = "Year",
ylab = "Unit",
display.all = F,
gridOff = T,
by.timing = T
)
# alternative specification
panelView::panelview(
Y = "y",
D = "treat_stat",
data = base_stagg,
index = c("id", "year"),
xlab = "Year",
ylab = "Unit",
display.all = F,
gridOff = T,
by.timing = T
)
# Average outcomes for each cohort
panelView::panelview(
data = base_stagg,
Y = "y",
D = "treat_stat",
index = c("id", "year"),
by.timing = T,
display.all = F,
type = "outcome",
by.cohort = T
)
```
## Simple Dif-n-dif
- A tool developed intuitively to study "natural experiments," but its uses are much broader.
- [Fixed Effects Estimator] is the foundation for DID
- Why is dif-in-dif attractive? Identification strategy: Inter-temporal variation between groups
- **Cross-sectional estimator** helps avoid omitted (unobserved) **common trends**
- **Time-series estimator** helps overcome omitted (unobserved) **cross-sectional differences**
Consider
- $D_i = 1$ treatment group
- $D_i = 0$ control group
- $T= 1$ After the treatment
- $T =0$ Before the treatment
|                   | After (T = 1)          | Before (T = 0)        |
|-------------------|------------------------|-----------------------|
| Treated $D_i =1$  | $E[Y_{1i}(1)|D_i = 1]$ | $E[Y_{0i}(0)|D_i=1]$  |
| Control $D_i = 0$ | $E[Y_{0i}(1)|D_i =0]$  | $E[Y_{0i}(0)|D_i=0]$  |
The missing counterfactual is $E[Y_{0i}(1)|D_i=1]$.
**The Average Treatment Effect on Treated (ATT)**
$$
\begin{aligned}
E[Y_1(1) - Y_0(1)|D=1] &= \{E[Y(1)|D=1] - E[Y(1)|D=0] \} \\
&- \{E[Y(0)|D=1] - E[Y(0)|D=0] \}
\end{aligned}
$$
More elaboration:
- For the treatment group, we isolate the difference between being treated and not being treated. If the untreated group would have been affected in a different way, the DiD design and estimate would tell us nothing.
- Alternatively, because we can't observe treatment variation in the control group, we can't say anything about the treatment effect on this group.
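The ATT identity above is just a difference of four cell means, and it coincides exactly with the OLS interaction coefficient in a saturated regression. A minimal sketch on simulated data (all numbers here are made up; the true effect is set to 3):

```{r}
set.seed(1)
n    <- 1000
d    <- rbinom(n, 1, 0.5)              # treatment group indicator
post <- rbinom(n, 1, 0.5)              # post-period indicator
# group gap of 2, common time trend of 1, true treatment effect of 3
y <- 2 * d + 1 * post + 3 * d * post + rnorm(n)

# difference-in-differences of the four cell means
did <- (mean(y[d == 1 & post == 1]) - mean(y[d == 1 & post == 0])) -
    (mean(y[d == 0 & post == 1]) - mean(y[d == 0 & post == 0]))
did # close to 3

# identical to the OLS interaction coefficient
coef(lm(y ~ d * post))["d:post"]
```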
**Extension**
1. **More than 2 groups** (multiple treatments and multiple controls) and more than 2 periods (pre and post)
$$
Y_{igt} = \alpha_g + \gamma_t + \beta I_{gt} + \delta X_{igt} + \epsilon_{igt}
$$
where
- $\alpha_g$ is the group-specific fixed effect
- $\gamma_t$ = time specific fixed effect
- $\beta$ = dif-in-dif effect
- $I_{gt}$ = interaction terms (n treatment indicators x n post-treatment dummies) (capture effect heterogeneity over time)
This specification is the "two-way fixed effects DiD" - **TWFE** (i.e., 2 sets of fixed effects: group + time).
- However, if you have a [Staggered Dif-n-dif] (i.e., treatment is applied at different times to different groups), TWFE can be severely biased.
2. **Long-term Effects**
To examine the dynamic treatment effects (that are not under rollout/staggered design), we can create a centered time variable,
+------------------------+------------------------------------------------+
| Centered Time Variable | Period |
+========================+================================================+
| ... | |
+------------------------+------------------------------------------------+
| $t = -1$ | 2 periods before treatment period |
+------------------------+------------------------------------------------+
| $t = 0$ | Last period right before treatment period |
| | |
| | Remember to use this period as reference group |
+------------------------+------------------------------------------------+
| $t = 1$ | Treatment period |
+------------------------+------------------------------------------------+
| ... | |
+------------------------+------------------------------------------------+
By interacting this factor variable, we can examine the dynamic effect of treatment (i.e., whether it's fading or intensifying)
$$
\begin{aligned}
Y &= \alpha_0 + \alpha_1 Group + \alpha_2 Time \\
&+ \beta_{-T_1} Treatment + \beta_{-(T_1 - 1)} Treatment + \dots + \beta_{-1} Treatment \\
&+ \beta_1 Treatment + \dots + \beta_{T_2} Treatment
\end{aligned}
$$
where
- $\beta_0$ is used as the reference group (i.e., drop from the model)
- $T_1$ is the pre-treatment period
- $T_2$ is the post-treatment period
With more variables (i.e., interaction terms), coefficients estimates can be less precise (i.e., higher SE).
3. DiD on relationships, not levels. Technically, we can apply the DiD research design not only to outcome variables, but also to coefficient estimates from some other regression model estimated before and after a policy is implemented.
Goal:
1. Pre-treatment coefficients should be non-significant $\beta_{-T_1}, \dots, \beta_{-1} = 0$ (similar to the [Placebo Test])
2. Post-treatment coefficients are expected to be significant $\beta_1, \dots, \beta_{T_2} \neq0$
- You can now examine the trend in post-treatment coefficients (i.e., increasing or decreasing)
```{r}
library(tidyverse)
library(fixest)
od <- causaldata::organ_donations %>%
# Treatment variable
dplyr::mutate(California = State == 'California') %>%
# centered time variable
dplyr::mutate(center_time = as.factor(Quarter_Num - 3))
# where period 3 is the last period before treatment (the reference period)
class(od$California)
class(od$State)
cali <- feols(Rate ~ i(center_time, California, ref = 0) |
State + center_time,
data = od)
etable(cali)
iplot(cali, pt.join = T)
coefplot(cali)
```
## Notes
- [Matching Methods]
- Match treatment and control based on pre-treatment observables
- Modify SEs appropriately [@heckman1997matching]. It might be easier to just use [Doubly Robust DiD] [@sant2020doubly], where you only need either the matching or the regression component to be correct in order to identify your treatment effect
- Whereas group fixed effects control for time-invariant group effects, they do not control for selection bias (i.e., certain groups are more likely to be treated than others). Hence, with these backdoors open (i.e., selection bias) between (1) the propensity to be treated and (2) the post-treatment evolution of the outcome, matching can potentially close these backdoors.
- Be careful when matching time-varying covariates because you might encounter "regression to the mean" problem, where pre-treatment periods can have an unusually bad or good time (that is out of the ordinary), then the post-treatment period outcome can just be an artifact of the regression to the mean [@daw2018matching]. This problem is not of concern to time-invariant variables.
- Matching and DiD can use pre-treatment outcomes to correct for selection bias. From real-world data and simulations, [@chabe2015analysis] found that matching generally underestimates the average causal effect and gets closer to the true effect as more pre-treatment outcomes are used. When selection bias is symmetric around the treatment date, DiD is still consistent when implemented symmetrically (i.e., the same number of periods before and after treatment). In cases where selection bias is asymmetric, MC simulations show that symmetric DiD still performs better than matching.
- It's always good to show results with and without controls because
- If the controls are fixed within group or within time, they should be absorbed by those fixed effects
- If the controls are dynamic across groups and over time, your parallel trends assumption is not plausible.
- Under causal inference, $R^2$ is not so important.
For count data, one can use the fixed-effects Poisson pseudo-maximum likelihood estimator (PPML) [@athey2006identification; @puhani2012treatment] (for applied papers, see @burtch2018can in management and @he2021end in marketing). It also allows for robust standard errors under over-dispersion [@wooldridge1999quasi].
- This estimator outperforms log OLS when data have many zeros [@silva2011further], since log OLS can produce biased estimates [@o2010not] under heteroskedasticity [@silva2006log].
- For those thinking of negative binomial with fixed effects, there isn't an estimator right now [@allison20027].
For [Zero-valued Outcomes], we have to distinguish the treatment effect on the intensive (outcome: 10 to 11) vs. extensive margins (outcome: 0 to 1), and we can't readily interpret the treatment coefficient of log-transformed outcome regression as percentage change [@chen2023logs]. Alternatively, we can either focus on
- **Proportional treatment effects**: $\theta_{ATT\%} = \frac{E(Y_{it}(1) | D_i = 1, Post_t = 1) - E(Y_{it}(0) |D_i = 1, Post_t = 1)}{E(Y_{it}(0) | D_i = 1 , Post_t = 1)}$ (i.e., the percentage change in the treated group's average post-treatment outcome). Instead of relying on the parallel trends assumption in levels, we can rely on a parallel trends assumption in ratios [@wooldridge2023simple].
- We can use Poisson QMLE to estimate the treatment effect: $Y_{it} = \exp(\beta_0 + \beta_1 D_i \times Post_t + \beta_2 D_i + \beta_3 Post_t + X_{it}) \epsilon_{it}$ and $\hat{\theta}_{ATT \%} = \exp(\hat{\beta}_1) - 1$.
- To examine whether the parallel trends assumption in ratios holds, we can also estimate a dynamic version of the Poisson QMLE: $Y_{it} = \exp(\lambda_t + \beta_2 D_i + \sum_{r \neq -1} \beta_r D_i \times 1(RelativeTime_t = r))$; we would expect $\exp(\hat{\beta}_r) - 1 = 0$ for $r < 0$.
- Even if we see the plot of these coefficients are 0, we still should run sensitivity analysis [@rambachan2023more] to examine violation of this assumption (see [Prior Parallel Trends Test](#prior-parallel-trends-test)).
- **Log Effects with Calibrated Extensive-margin value**: due to problems with the mean-value interpretation of the proportional treatment effect when outcomes are heavy-tailed, we might be interested in the extensive-margin effect. Then, we can explicitly model how much weight we put on the intensive vs. extensive margin [@chen2023logs, p. 39].
1. **Proportional treatment effects**
```{r simulate a dataset}
set.seed(123) # For reproducibility
n <- 500 # Number of observations per group (treated and control)
# Generating IDs for a panel setup
ID <- rep(1:n, times = 2)
# Defining groups and periods
Group <- rep(c("Control", "Treated"), each = n)
Time <- rep(c("Before", "After"), times = n)
Treatment <- ifelse(Group == "Treated", 1, 0)
Post <- ifelse(Time == "After", 1, 0)
# Step 1: Generate baseline outcomes with a zero-inflated model
lambda <- 20 # Average rate of occurrence
zero_inflation <- 0.5 # Proportion of zeros
Y_baseline <-
ifelse(runif(2 * n) < zero_inflation, 0, rpois(2 * n, lambda))
# Step 2: Apply DiD treatment effect on the treated group in the post-treatment period
Treatment_Effect <- Treatment * Post
Y_treatment <-
    ifelse(Treatment_Effect == 1, rpois(2 * n, lambda = 2), 0)
# Incorporating a simple time trend, ensuring outcomes are non-negative
Time_Trend <- ifelse(Time == "After", rpois(2 * n, lambda = 1), 0)
# Step 3: Combine to get the observed outcomes
Y_observed <- Y_baseline + Y_treatment + Time_Trend
# Ensure no negative outcomes after the time trend
Y_observed <- ifelse(Y_observed < 0, 0, Y_observed)
# Create the final dataset
data <-
data.frame(
ID = ID,
Treatment = Treatment,
Period = Post,
Outcome = Y_observed
)
# Viewing the first few rows of the dataset
head(data)
```
```{r estimate the proportional treatment effects}
library(fixest)
res_pois <-
fepois(Outcome ~ Treatment + Period + Treatment * Period,
data = data,
vcov = "hetero")
etable(res_pois)
# Average percentage change
exp(coefficients(res_pois)["Treatment:Period"]) - 1
# SE using delta method
exp(coefficients(res_pois)["Treatment:Period"]) *
sqrt(res_pois$cov.scaled["Treatment:Period", "Treatment:Period"])
```
In this example, the DiD coefficient is not significant. However, if it were significant, we could interpret the coefficient as roughly a 3 percent increase in the post-treatment period due to the treatment.
```{r estimate the proportional treatment effects in event-study form}
library(fixest)
# base_did is a simulated dataset shipped with fixest
base_did_log0 <- base_did |>
mutate(y = if_else(y > 0, y, 0))
res_pois_es <-
fepois(y ~ x1 + i(period, treat, 5) | id + period,
data = base_did_log0,
vcov = "hetero")
etable(res_pois_es)
iplot(res_pois_es)
```
This is the "ratio" version of the parallel trends assumption, as in @wooldridge2023simple:
$$
\frac{E(Y_{it}(0) |D_i = 1, Post_t = 1)}{E(Y_{it}(0) |D_i = 1, Post_t = 0)} = \frac{E(Y_{it}(0) |D_i = 0, Post_t = 1)}{E(Y_{it}(0) |D_i =0, Post_t = 0)}
$$
which means without treatment, the average percentage change in the mean outcome for treated group is identical to that of the control group.
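The ratio condition can be illustrated with simulated untreated Poisson outcomes whose means are multiplicative in group and time effects (a sketch with made-up parameters): both groups' post/pre mean ratios equal $e^{0.3}$ despite different levels.

```{r}
set.seed(7)
n    <- 1e5
d    <- rbinom(n, 1, 0.5)               # group indicator
post <- rbinom(n, 1, 0.5)               # period indicator
# multiplicative group and time effects, no treatment effect
mu <- exp(0.2 + 0.5 * d + 0.3 * post)
y  <- rpois(n, mu)

r_treat <- mean(y[d == 1 & post == 1]) / mean(y[d == 1 & post == 0])
r_ctrl  <- mean(y[d == 0 & post == 1]) / mean(y[d == 0 & post == 0])
c(r_treat, r_ctrl) # both close to exp(0.3)
```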
2. **Log Effects with Calibrated Extensive-margin value**
If we want to study the treatment effect on a concave transformation of the outcome that is less influenced by those in the distribution's tail, then we can perform this analysis.
Steps:
1. Normalize the outcome so that 1 represents the minimum non-zero, positive value (i.e., divide the outcome by its minimum non-zero, positive value).
2. Estimate the treatment effects for the new outcome
$$
m(y) =
\begin{cases}
\log(y) & \text{for } y >0 \\
-x & \text{for } y = 0
\end{cases}
$$
The choice of $x$ depends on what the researcher is interested in:
| Value of $x$ | Interest |
|--------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| $x = 0$ | The treatment effect in logs where all zero-valued outcomes are set to equal the minimum non-zero value (i.e., we exclude the extensive-margin change between 0 and $y_{min}$ ) |
| $x>0$ | Setting the change between 0 and $y_{min}$ to be valued as the equivalent of a $x$ log point change along the intensive margin. |
```{r}
library(fixest)
base_did_log0_cali <- base_did_log0 |>
# get min
mutate(min_y = min(y[y > 0])) |>
# normalized the outcome
mutate(y_norm = y / min_y)
my_regression <-
function(x) {
base_did_log0_cali <-
base_did_log0_cali %>% mutate(my = ifelse(y_norm == 0,-x,
log(y_norm)))
my_reg <-
feols(
fml = my ~ x1 + i(period, treat, 5) | id + period,
data = base_did_log0_cali,
vcov = "hetero"
)
return(my_reg)
}
xvec <- c(0, .1, .5, 1, 3)
reg_list <- purrr::map(.x = xvec, .f = my_regression)
iplot(reg_list,
pt.col = 1:length(xvec),
pt.pch = 1:length(xvec))
legend("topleft",
col = 1:length(xvec),
pch = 1:length(xvec),
legend = as.character(xvec))
etable(
reg_list,
headers = list("Extensive-margin value (x)" = as.character(xvec)),
digits = 2,
digits.stats = 2
)
```
We have the dynamic treatment effects for different hypothesized extensive-margin values of $x \in \{0, 0.1, 0.5, 1, 3\}$
The first column is when the zero-valued outcome is set equal to $y_{min, y>0}$ (i.e., there is no difference between the minimum outcome and a zero outcome; $x = 0$)
For this particular example, as the extensive-margin value increases, the effect magnitude increases. The second column is when we assume an extensive-margin change from 0 to $y_{min, y >0}$ is equivalent to a 10 (i.e., $0.1 \times 100$) log-point change along the intensive margin.
## Standard Errors
Serial correlation is a big problem in DiD because [@bertrand2004much]
1. DiD often uses long time series
2. Outcomes are often highly positively serially correlated
3. Minimal variation in the treatment variable over time within a group (e.g., state).
To overcome this problem:
- Parametric corrections (e.g., a standard AR correction) do not perform well.
- Nonparametric corrections (e.g., the **block bootstrap**, which keeps all observations from the same group, such as a state, together) work well when the number of groups is large.
- Removing the time-series dimension (i.e., aggregating the data into 2 periods: pre and post) still works with a small number of groups (see [@donald2007inference] for more notes on small-sample aggregation).
- Empirical and arbitrary variance-covariance matrix corrections work only in large samples.
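The block-bootstrap suggestion can be sketched in base R (simulated data; the key point is that whole groups are resampled with replacement, so within-group serial correlation is preserved):

```{r}
set.seed(42)
n_g <- 50                                   # number of groups (e.g., states)
n_t <- 10                                   # number of periods
df  <- expand.grid(g = 1:n_g, t = 1:n_t)
df$treat <- as.integer(df$g <= n_g / 2 & df$t > n_t / 2)
# AR(1) group-level shocks induce serial correlation within each group
shock <- replicate(n_g, as.numeric(arima.sim(list(ar = 0.8), n_t)))
df$y  <- 1 + 2 * df$treat + shock[cbind(df$t, df$g)]

did_est <- function(d) coef(lm(y ~ treat + factor(g) + factor(t), data = d))["treat"]

# block bootstrap: resample entire groups, not individual observations
block_boot_se <- function(data, B = 200) {
  ests <- replicate(B, {
    gs <- sample(unique(data$g), replace = TRUE)
    d  <- do.call(rbind, lapply(seq_along(gs), function(i) {
      di   <- data[data$g == gs[i], ]
      di$g <- i        # relabel so duplicated groups stay distinct blocks
      di
    }))
    did_est(d)
  })
  sd(ests)
}
block_boot_se(df)
```

In practice, cluster-robust standard errors at the group level (e.g., `vcov = ~g` in `fixest::feols`) are the common shortcut to the same idea.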
## Examples
Example by [Philipp Leppert](https://rpubs.com/phle/r_tutorial_difference_in_differences) replicating [Card and Krueger (1994)](https://davidcard.berkeley.edu/data_sets.html)
Example by [Anthony Schmidt](https://bookdown.org/aschmi11/causal_inf/difference-in-differences.html)
### Example by @doleac2020unintended
- "Ban the box" policies prohibit employers from asking about criminal records on job applications; the intent is to give ex-offenders better access to jobs
- Even with the box banned, employers don't simply change their behavior; the unintended consequence is that employers statistically discriminate based on race
3 types of ban the box
1. Public employer only
2. Private employer with government contract
3. All employers
Main identification strategy
- If any county in a Metropolitan Statistical Area (MSA) adopts ban the box, the whole MSA is treated. Or if the state adopts ban the box, every county is treated
Under [Simple Dif-n-dif]
$$ Y_{it} = \beta_0 + \beta_1 Post_t + \beta_2 Treat_i + \beta_3 (Post_t \times Treat_i) + \epsilon_{it} $$
But if there is no common post time, then we should use [Staggered Dif-n-dif]
$$ \begin{aligned} E_{imrt} &= \alpha + \beta_1 BTB_{imt} W_{imt} + \beta_2 BTB_{mt} + \beta_3 BTB_{mt} H_{imt}\\ &+ \delta_m + D_{imt} \beta_5 + \lambda_{rt} + \delta_m\times f(t) \beta_7 + e_{imrt} \end{aligned} $$
where
- $i$ = person; $m$ = MSA; $r$ = region (US regions, e.g., Midwest); $t$ = year
- $W$ = White; $B$ = Black; $H$ = Hispanic
- $\beta_1 BTB_{imt} W_{imt} + \beta_2 BTB_{mt} + \beta_3 BTB_{mt} H_{imt}$ are the 3 dif-n-dif variables ($BTB$ = "ban the box")
- $\delta_m$ = dummy for MSA
- $D_{imt}$ = control for people
- $\lambda_{rt}$ = region by time fixed effect
- $\delta_m \times f(t)$ = linear time trend within MSA (but we should not need this if we have good pre-trend)
If we include $\lambda_r$ and $\lambda_t$ separately, we get broader fixed effects, while $\lambda_{rt}$ gives us deeper and narrower fixed effects.
Before running this model, we have to drop all other races. And $\beta_1, \beta_2, \beta_3$ are not collinear because they are all interaction terms with $BTB_{mt}$.
If we just want to estimate the model for black men, we will modify it to be
$$ E_{imrt} = \alpha + BTB_{mt} \beta_1 + \delta_m + D_{imt} \beta_5 + \lambda_{rt} + (\delta_m \times f(t)) \beta_7 + e_{imrt} $$
$$ \begin{aligned} E_{imrt} &= \alpha + BTB_{m(t-3)} \theta_1 + BTB_{m(t-2)} \theta_2 + BTB_{mt} \theta_4 \\ &+ BTB_{m(t+1)}\theta_5 + BTB_{m(t+2)}\theta_6 + BTB_{m(t+3)}\theta_7 \\ &+ \delta_m + D_{imt}\beta_5 + \lambda_{rt} + (\delta_m \times f(t))\beta_7 + e_{imrt} \end{aligned} $$
We leave $BTB_{m(t-1)}\theta_3$ out as the reference category to avoid perfect collinearity.
So the pre-BTB coefficients ($\theta_1, \theta_2$) should be similar to the omitted baseline $\theta_3$ (i.e., the same pre-trend). Remember, we only run this for places that adopted BTB.
If $\theta_2$ is statistically different from $\theta_3$ (the baseline), there could be a problem, but it could also make sense if there are announcement (anticipation) effects before implementation.
### Example from [Princeton](https://www.princeton.edu/~otorres/DID101R.pdf)
```{r}
library(foreign)
library(dplyr)
mydata = read.dta("http://dss.princeton.edu/training/Panel101.dta") %>%
# create a dummy variable to indicate the time when the treatment started
dplyr::mutate(time = ifelse(year >= 1994, 1, 0)) %>%
# create a dummy variable to identify the treatment group
dplyr::mutate(treated = ifelse(country == "E" |
country == "F" | country == "G" ,
1,
0)) %>%
# create an interaction between time and treated
dplyr::mutate(did = time * treated)
```
estimate the DID estimator
```{r}
didreg = lm(y ~ treated + time + did, data = mydata)
summary(didreg)
```
The `did` coefficient is the difference-in-differences estimator. The treatment has a negative effect.
### Example by @card1993minimum
Found that an increase in the minimum wage increased employment.
Experimental Setting:
- New Jersey (treatment) increased minimum wage
- Penn (control) did not increase minimum wage
| | | After | Before | |
|-----------|-----|-------|--------|-------------------|
| Treatment | NJ | A | B | A - B |
| Control | PA | C | D | C - D |
| | | A - C | B - D | (A - B) - (C - D) |
where
- A - B = treatment effect + effect of time (additive)
- C - D = effect of time
- (A - B) - (C - D) = dif-n-dif
**The identifying assumptions**:
- Can't have **switchers**
- PA is the control group
- is a good counterfactual
- is what NJ would look like if they hadn't had the treatment
$$
Y_{jt} = \beta_0 + NJ_j \beta_1 + POST_t \beta_2 + (NJ_j \times POST_t)\beta_3+ X_{jt}\beta_4 + \epsilon_{jt}
$$
where
- $j$ = restaurant
- $NJ$ = dummy where $1 = NJ$, and $0 = PA$
- $POST$ = dummy where $1 = post$, and $0 = pre$
Notes:
- We don't need $\beta_4$ in the model for $\beta_3$ to be unbiased, but including it improves the efficiency of our estimates
- If we use $\Delta Y_{jt}$ as the dependent variable, we don't need $POST_t \beta_2$ anymore
- Alternative model specification is that the authors use NJ high wage restaurant as control group (still choose those that are close to the border)
- The reason they can't control for everything (both PA and NJ high-wage restaurants) is that the causal treatment effect would then be hard to interpret
- Dif-n-dif utilizes similarity in the pre-trend of the dependent variable. However, this is neither a necessary nor a sufficient condition for the identifying assumption.
- It's not sufficient because there can be multiple treatments (technically, you could include more controls, but your treatment can't interact)
- It's not necessary because trends can become parallel after treatment
- However, we can never be certain; we just try to find evidence consistent with our theory so that dif-n-dif can work.
- Notice that we don't need the pre-treatment **levels of the dependent variable** to be the same (e.g., the same average wage in both NJ and PA); dif-n-dif only needs the **pre-trend (i.e., slope)** to be the same for the two groups.
### Example by @butcher2014effects
Theory:
- Highest achieving students are usually in hard science. Why?
- It is hard to give students the benefit of the doubt in hard science courses
- Choice also depends on how unpleasant a major is and how easy it is to get a job; degrees with lower market value typically try to make students feel more pleasant
Under OLS
$$
E_{ij} = \beta_0 + X_i \beta_1 + G_j \beta_2 + \epsilon_{ij}
$$
where
- $X_i$ = student attributes
- $\beta_2$ = causal estimate (from grade change)
- $E_{ij}$ = Did you choose to enroll in major $j$
- $G_j$ = grade given in major $j$
Examine $\hat{\beta}_2$
- Negative bias: Endogenous response because department with lower enrollment rate will give better grade
- Positive bias: hard science already has the best students (i.e., highest ability), so without them its grades could be even lower
Under dif-n-dif
$$
Y_{idt} = \beta_0 + POST_t \beta_1 + Treat_d \beta_2 + (POST_t \times Treat_d)\beta_3 + X_{idt} + \epsilon_{idt}
$$
where
- $Y_{idt}$ = grade average
|              | Intercept | Treat | Post | Treat\*Post |
|--------------|-----------|-------|------|-------------|
| Treat Pre    | 1         | 1     | 0    | 0           |
| Treat Post   | 1         | 1     | 1    | 1           |
| Control Pre  | 1         | 0     | 0    | 0           |
| Control Post | 1         | 0     | 1    | 0           |
Here $\beta_0$ is the average outcome for the pre-period control group.
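The design matrix above can be verified directly with `model.matrix()` (rows ordered Treat Pre, Treat Post, Control Pre, Control Post):

```{r}
cells <- data.frame(treat = c(1, 1, 0, 0), post = c(0, 1, 0, 1))
# columns: (Intercept), treat, post, treat:post
model.matrix(~ treat * post, data = cells)
```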
A more general specification of the dif-n-dif is that
$$
Y_{idt} = \alpha_0 + (POST_t \times Treat_d) \alpha_1 + \theta_d + \delta_t + X_{idt} + u_{idt}
$$
where
- $(\theta_d + \delta_t)$ is richer and uses more degrees of freedom than $Treat_d \beta_2 + Post_t \beta_1$ (because the fixed effects subsume the Treat and Post dummies)
- $\alpha_1$ should be equivalent to $\beta_3$ (if your model assumptions are correct)
## One Difference
The regression formula is as follows [@liaukonyte2023frontiers]:
$$
y_{ut} = \beta \text{Post}_t + \gamma_u + \gamma_{w(t)} + \gamma_l + \gamma_{g(u)p(t)} + \epsilon_{ut}
$$
where
- $y_{ut}$: Outcome of interest for unit $u$ in time $t$.
- $\text{Post}_t$: Dummy variable representing a specific post-event period.
- $\beta$: Coefficient measuring the average change in the outcome after the event relative to the pre-period.
- $\gamma_u$: Fixed effects for each unit.
- $\gamma_{w(t)}$: Time-specific fixed effects to account for periodic variations.
- $\gamma_l$: Dummy variable for a specific significant period (e.g., a major event change).
- $\gamma_{g(u)p(t)}$: Group-by-period fixed effects for flexible trends that may vary across categories (e.g., geographical regions) and periods.
- $\epsilon_{ut}$: Error term.
This model can be used to analyze the impact of an event on the outcome of interest while controlling for various fixed effects and time-specific variations, but using units themselves pre-treatment as controls.
## Two-way Fixed-effects
A generalization of the dif-n-dif model is the two-way fixed-effects model, where you have multiple groups and time effects. But it is not a design-based, non-parametric causal estimator [@imai2021use]
When applying TWFE to multiple groups and multiple periods, the supposedly causal coefficient is the weighted average of all two-group/two-period DiD estimators in the data where some of the weights can be negative. More specifically, the weights are proportional to group sizes and treatment indicator's variation in each pair, where units in the middle of the panel have the highest weight.
The canonical/standard TWFE only works when
- Effects are homogeneous across units and across time periods (i.e., no dynamic changes in the effects of treatment). See [@goodman2021difference; @de2020two; @sun2021estimating; @borusyak2021revisiting] for details. Similarly, it relies on the assumption of **linear additive effects** [@imai2021use]
- You have to argue why treatment heterogeneity is not a problem (e.g., plot treatment timing and decompose the treatment coefficient using the [Goodman-Bacon Decomposition]) and know the percentage of observations that are never treated (as the never-treated share increases, the bias of TWFE decreases; with 80% of the sample never treated, the bias is negligible). The problem worsens when you have long-run effects.
- Need to manually drop two relative time periods if everyone is eventually treated (to avoid multicollinearity). Software might drop periods arbitrarily, and if it drops a post-treatment period, it will create bias. The usual choice is to drop periods $-1$ and $-2$.
- Treatment heterogeneity can arise because (1) it might take some time for a treatment to produce measurable changes in outcomes, or (2) the effect can differ for each period after treatment (phase-in or increasing effects).
- 2 time periods.
Within this setting, TWFE works because, using the baseline (e.g., control units whose treatment status is unchanged across time periods), the comparison can be
- Good for
- Newly treated units vs. control
- Newly treated units vs not-yet treated
- Bad for
- Newly treated vs. already treated (because already treated cannot serve as the potential outcome for the newly treated).
- Reliance on strict exogeneity (i.e., no time-varying confounders and no feedback from past outcomes to treatment) [@imai2019should]
- Reliance on specific functional forms (i.e., treatment effect homogeneity and no carryover or anticipation effects) [@imai2019should]
Note: Notation for this section is consistent with [@arkhangelsky2021double]
$$
Y_{it} = \alpha_i + \lambda_t + \tau W_{it} + \beta X_{it} + \epsilon_{it}
$$
where
- $Y_{it}$ is the outcome
- $\alpha_i$ is the unit FE
- $\lambda_t$ is the time FE
- $\tau$ is the causal effect of treatment
- $W_{it}$ is the treatment indicator
- $X_{it}$ are covariates
When $T = 2$, TWFE reduces to the traditional DiD model.
Under the following assumptions, $\hat{\tau}_{OLS}$ is unbiased:
1. homogeneous treatment effect
2. parallel trends assumptions
3. linear additive effects [@imai2021use]
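A quick simulated check (not from the references above) that with $T = 2$ the TWFE coefficient coincides with the simple 2x2 DiD in means:

```r
# Sketch: in a balanced two-period panel, the TWFE estimate of tau equals
# the difference-in-differences of group means.
set.seed(2)
n <- 200
d <- data.frame(id = rep(1:n, each = 2), t = rep(0:1, n))
d$treated <- rep(rbinom(n, 1, 0.5), each = 2)  # treatment group indicator
d$W <- d$treated * d$t                         # treated only in period 1
d$y <- 1 + 0.5 * d$t + 0.3 * d$treated + 2 * d$W + rnorm(2 * n)

# TWFE: unit and time fixed effects
twfe <- coef(lm(y ~ W + factor(id) + factor(t), data = d))["W"]

# 2x2 DiD in group means
m <- with(d, tapply(y, list(treated, t), mean))
did <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

all.equal(unname(twfe), did)  # the two estimates coincide
```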
**Remedies for TWFE's shortcomings**
- [@goodman2021difference]: diagnostic robustness tests of the TWFE DiD and identify influential observations to the DiD estimate ([Goodman-Bacon Decomposition])
- [@callaway2021difference]: 2-step estimation with a bootstrap procedure that can account for autocorrelation and clustering:
- the parameters of interest are the group-time average treatment effects, where each group is defined by when it was first treated ([Multiple periods and variation in treatment timing])
- Comparing post-treatment outcomes of groups treated in a period against a similar group that is never treated (using matching).
- Treatment status cannot switch (once treated, stay treated for the rest of the panel)
- Package: `did`
- [@sun2021estimating]: a specialization of [@callaway2021difference] in the event-study context.
- They include lags and leads in their design
- have cohort-specific estimates (similar to group-time estimates in [@callaway2021difference])
- They propose the "interaction-weighted" estimator.
- Package: `fixest`
- [@imai2021use]
- Different from [@callaway2021difference] because they allow units to switch in and out of treatment.
- Based on matching methods, to have weighted TWFE
- Package: `wfe` and `PanelMatch`
- [@gardner2022two]: two-stage DiD
- `did2s`
- In cases with an unaffected unit (i.e., never-treated), using the exposure-adjusted difference-in-differences estimators can recover the average treatment effect [@de2020two]. However, if you want to examine treatment effect heterogeneity (in cases where the true heterogeneous treatment effects vary by exposure rate), exposure-adjusted DiD still fails [@sun2022linear].
- [@arkhangelsky2021double]: see below
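As a concrete example of the [@callaway2021difference] remedy, the `did` package ships a simulated data set, `mpdta`, that can illustrate the workflow (a sketch of the API, not this chapter's data):

```r
# Sketch: group-time ATTs (Callaway & Sant'Anna) on did's built-in mpdta data,
# then aggregated into an event-study (dynamic) summary.
library(did)

data(mpdta)
gt <- att_gt(yname  = "lemp",         # outcome
             tname  = "year",         # time variable
             idname = "countyreal",   # unit id
             gname  = "first.treat",  # period first treated (0 = never)
             data   = mpdta,
             control_group = "nevertreated")

summary(aggte(gt, type = "dynamic"))  # event-study aggregation
```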
To be robust against
1. time- and unit-varying effects
We can use the reshaped inverse probability weighting (RIPW)- TWFE estimator
With the following assumptions:
- SUTVA
- Binary treatment: $\mathbf{W}_i = (W_{i1}, \dots, W_{iT})$ where $\mathbf{W}_i \sim \mathbf{\pi}_i$, the generalized propensity score (i.e., each unit's treatment path follows its own distribution $\pi_i$ regardless of the period)
Then, the unit-time specific effect is $\tau_{it} = Y_{it}(1) - Y_{it}(0)$
Then the Doubly Average Treatment Effect (DATE) is
$$
\tau(\xi) = \sum_{t=1}^T \xi_t \left(\frac{1}{n} \sum_{i = 1}^n \tau_{it} \right)
$$
where
- $\frac{1}{n} \sum_{i = 1}^n \tau_{it}$ is the unweighted effect of treatment across units (i.e., time-specific ATE).
- $\xi = (\xi_1, \dots, \xi_T)$ are user-specified weights for each time period.
- This estimand is called DATE because it's weighted (averaged) across both time and units.
A special case of DATE is when both time and unit-weights are equal
$$
\tau_{eq} = \frac{1}{nT} \sum_{t=1}^T \sum_{i = 1}^n \tau_{it}
$$
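As a numerical sanity check (with hypothetical $\tau_{it}$ values, not estimates), the equal-weight DATE is simply the grand mean of the $\tau_{it}$ matrix:

```r
# Sketch: DATE with uniform time weights equals the grand mean of tau_{it}.
set.seed(3)
n <- 50; TT <- 8
tau <- matrix(rnorm(n * TT, mean = 1), nrow = n, ncol = TT)  # hypothetical tau_{it}

xi <- rep(1 / TT, TT)                 # equal time weights
date_xi <- sum(xi * colMeans(tau))    # tau(xi): weighted time-specific ATEs
tau_eq  <- mean(tau)                  # grand mean over units and periods

all.equal(date_xi, tau_eq)  # identical when xi is uniform
```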
Borrowing the idea of the inverse propensity-weighted least squares estimator in the cross-sectional case, we reweight the objective function by the treatment assignment mechanism:
$$
\hat{\tau} \triangleq \arg \min_{\tau} \sum_{i = 1}^n (Y_i -\mu - W_i \tau)^2 \frac{1}{\pi_i (W_i)}
$$
where
- the first term is the least squares objective
- the second term is the inverse of the propensity score
In the panel data case, the IPW estimator will be
$$
\hat{\tau}_{IPW} \triangleq \arg \min_{\tau} \sum_{i = 1}^n \sum_{t =1}^T (Y_{i t}-\alpha_i - \lambda_t - W_{it} \tau)^2 \frac{1}{\pi_i (W_i)}
$$
Then, to have DATE that users can specify the structure of time weight, we use reshaped IPW estimator [@arkhangelsky2021double]
$$
\hat{\tau}_{RIPW} (\Pi) \triangleq \arg \min_{\tau} \sum_{i = 1}^n \sum_{t =1}^T (Y_{i t}-\alpha_i - \lambda_t - W_{it} \tau)^2 \frac{\Pi(W_i)}{\pi_i (W_i)}
$$
where $\Pi$ is a data-independent distribution that depends on the support of the treatment path $\mathbb{S} = \cup_i \text{Supp}(W_i)$
This generalization reduces to
- the IPW-TWFE estimator when $\Pi \sim Unif(\mathbb{S})$
- the randomized experiment case when $\Pi = \pi_i$
To choose $\Pi$, we don't need data; we only need the set of possible treatment assignments in your setting.
- For most practical problems (DiD, staggered, transient), we have closed form solutions
- For a generic solver, we can use nonlinear programming (e.g., the BFGS algorithm)
As argued in [@imai2021use], TWFE is not a non-parametric approach, so it can be subject to incorrect model assumptions (i.e., model dependence).
- Hence, they advocate for matching methods for time-series cross-sectional data [@imai2021use]
- Use `wfe` and `PanelMatch` to apply their paper.
The `xtreg2way` package is based on [@somaini2016algorithm]
```{r}
# dataset
library(bacondecomp)
df <- bacondecomp::castle
```
```{r}
# devtools::install_github("paulosomaini/xtreg2way")
library(xtreg2way)
# output <- xtreg2way(y,
#                     data.frame(x1, x2),
#                     iid,
#                     tid,
#                     w,
#                     noise = "1",
#                     se = "1")

# equivalently
output <- xtreg2way(l_homicide ~ post,
                    df,
                    iid = df$state, # group id
                    tid = df$year,  # time id
                    # w,            # vector of weights
                    se = "1")
output$betaHat
output$aVarHat
# to save time, you can use your structure in the
# last output for a new set of variables
# output2 <- xtreg2way(y, x1, struc=output$struc)
```
Standard errors estimation options
| Set | Estimation |
|----------------------|---------------------------------------------------------------------------------------------|
| `se = "0"` | Assumes homoskedasticity and no within-group or serial correlation |
| `se = "1"` (default) | Robust to heteroskedasticity and serial correlation [@arellano1987computing] |
| `se = "2"` | Robust to heteroskedasticity, but assumes no within-group or serial correlation |
| `se = "11"` | Arellano SE with the df correction performed by Stata `xtreg` [@somaini2021twfem] |
Alternatively, you can also do it manually or with the `plm` package, but you have to be careful with how the SEs are estimated
```{r}
library(multiwayvcov) # get vcov matrix
library(lmtest) # robust SEs estimation
# manual
output3 <- lm(l_homicide ~ post + factor(state) + factor(year),
              data = df)

# get variance-covariance matrix
vcov_tw <- multiwayvcov::cluster.vcov(output3,
                                      cbind(df$state, df$year),
                                      use_white = FALSE,
                                      df_correction = FALSE)

# get coefficients
coeftest(output3, vcov_tw)[2, ]
```
```{r}
# using the plm package
library(plm)
output4 <- plm(l_homicide ~ post,
               data = df,
               index = c("state", "year"),
               model = "within",
               effect = "twoways")
# get coefficients
coeftest(output4, vcov = vcovHC, type = "HC1")
```
As you can see, differences stem from SE estimation, not the coefficient estimate.
## Multiple periods and variation in treatment timing
This is an extension of the DiD framework to settings where you have
- more than 2 time periods
- different treatment timing
When treatment effects are heterogeneous across time or units, the standard [Two-way Fixed-effects] is inappropriate.
Notation is consistent with `did` [package](https://cran.r-project.org/web/packages/did/vignettes/multi-period-did.html) [@callaway2021difference]
- $Y_{it}(0)$ is the potential outcome for unit $i$ in time period $t$ if never treated
- $Y_{it}(g)$ is the potential outcome for unit $i$ in time period $t$ if it's treated in period $g$
- $Y_{it}$ is the observed outcome for unit $i$ in time period $t$
$$
Y_{it} =
\begin{cases}
Y_{it}(0) & \forall i \in \text{never-treated group} \\
1\{G_i > t\} Y_{it}(0) + 1\{G_i \le t \} Y_{it}(G_i) & \forall i \in \text{other groups}
\end{cases}
$$
- $G_i$ is the time period when $i$ is treated
- $C_i$ is a dummy when $i$ belongs to the **never-treated** group
- $D_{it}$ is a dummy for whether $i$ is treated in period $t$
**Assumptions**:
- Staggered treatment adoption: once treated, a unit cannot be untreated (revert)
- Parallel trends assumptions (conditional on covariates):
- Based on never-treated units: $E[Y_t(0)- Y_{t-1}(0)|G= g] = E[Y_t(0) - Y_{t-1}(0)|C=1]$
- Without treatment, the average potential outcome trend for group $g$ equals that of the never-treated (control) group, which requires (1) enough data on the never-treated group and (2) that the control group is similar to the eventually treated group.
- Based on not-yet treated units: $E[Y_t(0) - Y_{t-1}(0)|G = g] = E[Y_t(0) - Y_{t-1}(0)|D_s = 0, G \neq g]$
- Units not yet treated by time $s$ ($s \ge t$) can be used as comparison groups when calculating the average treatment effects for the group first treated in time $g$
- Additional assumption: pre-treatment trends across groups [@marcus2021role]
- Random sampling
- Irreversibility of treatment (once treated, cannot be untreated)
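The irreversibility assumption can be checked directly in the data; a sketch using the `castle` data loaded earlier in this chapter:

```r
# Sketch: verify that no state switches from treated (post = 1) back to
# untreated (post = 0) at any point in the panel.
library(bacondecomp)

df <- bacondecomp::castle
df <- df[order(df$state, df$year), ]

# for each state, flag any drop in the treatment dummy over time
reverts <- tapply(df$post, df$state, function(d) any(diff(d) < 0))
any(reverts)  # FALSE would indicate no unit ever reverts
```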