CPS File Progress Report #90
Comments
@andersonfrailey said in taxdata issue #90 wrt the being-developed CPS data file:
Thanks for the progress report! Sounds like things are progressing nicely. I always thought one of the big advantages of CPS data is that you know everybody's age, right?
Without these variables (especially without the EIC variable) the tax estimates generated by this CPS file are going to be way off. Because they are not on the missing variable list, I'm assuming you have values for the following variables, right?
|
@martinholmer, you are correct that a big advantage of the CPS is knowing everybody's age. The two variables you listed first, And we do have the values for the second set listed. |
@andersonfrailey said:
Great! |
For earlier discussion (from 03-Nov-2016 to 02-Dec-2016) of issues related to the development of this CPS input file for Tax-Calculator, see Tax-Calculator issue 1030. |
Quick update on the CPS file. Here is a notebook where you can see some comparisons between the CPS and PUF file after running Tax-Calc with each for 2017. Some notable differences between the two can be seen in regular tax and income tax liability. That could be partially due to a couple of missing income items such as investment income. You can also see that itemized deductions are significantly lower in the CPS. As noted before, John is imputing state and local tax deductions so once we have that the difference should close some, but we will still be missing a few deductions. I also want to add farm income (e02100) and capital gain distributions (e01100) to the missing variables list. After looking at the aggregates for each, they are significantly higher in the CPS than the PUF. This may simply be due to a mislabeling when prepping the CPS. I'm looking into it further. That being said, I am a little more concerned about the drop in tax liabilities given that these two income sources were so much higher than they should have been and yet liabilities still fall. If there is anything specific you would like to see in my next update, please let me know. |
I've added a few distribution plots for the CPS data to the notebook. It looks to me like significant stage 3 adjustment will be needed. |
John sent me a preliminary version of state and local tax deduction imputation. The notebook has been updated to include a chart comparing totals for a number of itemized deductions. |
@andersonfrailey Just to confirm: are all blow-up factors used for the CPS tax-unit file the same ones used for the PUF? And is the same true of the stage-I factors used to calculate weights? If that's the case, I think we might need to develop a separate stage-I factor, or adjust the base-year file before feeding it to TC, because of the large gap for several income/expense items. |
@Amy-Xu asked:
Unless I'm confused, "blow-up factors" and "stage-I factors" are two different names for the same thing. And yes, they affect the baseline (CBO-derived) projection, so they must be the same for all micro datasets, by definition. @Amy-Xu continued:
This is not possible given the logic above. The stage2 weights can differ for the CPS and PUF datasets and the stage3 adjustment factors can be different for CPS and PUF, but not stage1 factors (otherwise the two datasets would be using different CBO baseline projections, which doesn't make any sense). |
@martinholmer said
Right, I'm aware of that, so I said 'just to confirm'.
For this part, it's true that we derive most factors to match the CBO baseline, but not every factor is derived from a CBO projection. Instead, some of the factors are tuned to fit PUF data. For example, we originally assumed that the personal income factor (ATXPY) applies to most PUF variables outside AGI, which is a very reasonable assumption. But then we realized that applying ATXPY is not proper for e19200, which is the interest paid that is used to calculate the itemized deduction. It's not proper because the growth rate of its major component, the home mortgage interest deduction, deviates from the personal income growth rate. So we added one extra factor (AIPD) for this variable only, to blow up this variable at the rate given by SOI tables prior to 2014, and then apply ATXPY after 2014. This particular factor yields a better itemized deduction, which is not predicted by CBO explicitly, and therefore brings our results closer to the CBO projected baseline. When it comes to CPS data, as you can see in Anderson's notebook, home mortgage interest differs from the PUF total by about 200 billion (the total is ~350 billion in the PUF), which means that if we use the exact same factor (originally derived from SOI tables) for the CPS, we will significantly overestimate the interest paid deduction, and therefore overestimate the total number of itemizers, which would drag our CPS tax-unit results away from the CBO baseline. The same would apply to the state and local deduction. Yes, I'm aware that having two sets of factors merely to adjust itemized deductions is too much work and may or may not be efficient for the goal we want to achieve (adjusting the baseline to resemble the CBO projection). So alternatively I proposed,
which I consider a more doable way of adjusting the itemized deduction and related expense projections. Does that make sense? |
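For readers following the growfactor discussion, here is a minimal sketch of the two-regime growth path described for e19200: a variable-specific factor (AIPD) in the early years, then the general personal income factor (ATXPY). The factor values, the `grow_e19200` helper, and the choice to treat 2014 as the last AIPD year are all assumptions made for illustration; they are not taken from the taxdata code.

```python
# Hypothetical annual growth factors (placeholders, not real SOI or CBO values).
AIPD = {2010: 0.95, 2011: 0.93, 2012: 0.94, 2013: 0.96, 2014: 0.98}
ATXPY = {2015: 1.03, 2016: 1.04, 2017: 1.04}

# Assumption: the variable-specific factor applies through 2014 inclusive.
SWITCH_YEAR = 2014


def grow_e19200(base_value, base_year, target_year):
    """Age an interest-paid amount from base_year to target_year."""
    value = base_value
    for year in range(base_year + 1, target_year + 1):
        # Use the variable-specific factor first, then the general one.
        factor = AIPD[year] if year <= SWITCH_YEAR else ATXPY[year]
        value *= factor
    return value
```

Because the same path is applied to every record, a dataset whose base-year total is already too high stays too high in every projection year, which is the concern raised about using the PUF-tuned factor on the CPS file.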
As in perform an adjustment similar to what we do in stage 3? Or did you have something else in mind? |
@andersonfrailey I was thinking of tweaking the stage I factors a bit, like AIPD, but I'm perfectly fine with a Stage III adjustment as well. The goal is to see whether adjusting the itemized deductions would fix the itemizer/standard deductor numbers, and further help the AMT numbers. |
@Amy-Xu said in issue #90 wrt the CPS dataset:
Yes, it is true that not all of the stage1 growfactors are taken from a CBO projection. The story you tell about why mortgage interest needed its own growfactor (because of the sharp drop in mortgage interest rates following the 2008-2009 financial crash and house-price deflation) is true. But that growfactor reflects the reality of the macroeconomic situation in the years after 2009. Therefore, it is not a logical candidate for varying across micro datasets. And we seem to be in agreement on that point because you say:
Perhaps @andersonfrailey could do a one-time adjustment (to close the gap) at the end of the python code that creates the raw CPS dataset. And then we can see if the gap remains closed in all projection years. If the gap becomes too large, @andersonfrailey could use the stage3 adjustment methodology he created to manage the gap in subsequent years. Or can stage3 methods be used to close the gap in all years? I can't remember whether stage3 can deal with the first year. Does this seem like a sensible way to proceed in the effort to make the CPS dataset generate results that are reasonably close to those generated by the |
@Amy-Xu @martinholmer, Stage 3 adjustments wouldn't be the best way to go about this because all stage 3 does is adjust the distribution of the variable, not the total amount. In my opinion, a one time adjustment like @martinholmer proposed would work best. To be clear, this CPS file is being produced by SAS files provided by John, but this adjustment could be added at the end of the CPS version of |
@andersonfrailey said in issue #90 wrt the new CPS-only input dataset:
Sounds sensible to me. What do you think, @Amy-Xu ? |
@andersonfrailey said:
@martinholmer followed with:
Yes that sounds good to me! |
@andersonfrailey I personally feel it would be valuable to ask John whether this level (we're talking about ~150b difference for both state & local and home mortgage) of discrepancy is expected while implementing this adjustment. |
@Amy-Xu I agree. I'll shoot him an email. |
In response to my email John pointed out that he did the imputations for every tax unit, where the matched file would only have values for the records that had filed so I was comparing two things that weren't really the same. I've updated the notebook to account for this and the state and local deduction looks much better. |
As mentioned previously, many of the income variables in the CPS-PUF will need substantial adjustments to get their distribution right. Unlike the PUF file, the CPS-PUF contains records from the 2013, 2014, and 2015 CPS, which correspond to 2012, 2013, and 2014 tax years. The 2013 and 2014 files are aged to match the 2015 file (2014 tax year). Because so many of the variables need adjusting and they're already equivalent to the 2014 tax year, I'm considering adding a one time adjustment to the Thoughts? |
By adjustment, do you mean augment or shrink the total to match PUF? How many variables do we need to adjust? @andersonfrailey |
@Amy-Xu, neither. In this case I'm just talking about fixing the distribution of the variables. Specifically I'd say interest income and ordinary and qualified dividends definitely need adjustments and business income could benefit as well. The actual totals for the income variables are actually relatively close to the PUF already. |
I see, that sounds valuable. At the same time, it might also be helpful to investigate what is the major source dragging individual income tax liability off by almost 40%. The distribution of those income variables could be a reason, but not sure whether they're big problems given totals are close. |
I spent the day working on adjusting the distribution of interest and business income and ordinary and qualified dividends. I don't believe it's possible to get the distribution as accurate as we were able to with the PUF, but there is improvement. AGI is reported in the CPS, but I quickly realized that it would not be very helpful. The highest reported AGI was just over $2 million, which would put nobody in the top two income bins, as defined in the IRS SOI data mentioned previously in this thread. I instead added up wages, interest income, dividends, alimony, business income, pensions, rental income, farm income, unemployment compensation, and social security income. The new distributions can be seen in the updated notebook. There was a slight improvement in individual income tax liability. It is now short by about 30% rather than 40%. |
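The income measure described above (a sum of components, then assignment to income bins) could be sketched roughly as follows. The field names and the bin edges are hypothetical stand-ins chosen for illustration; the actual SOI bin definitions and CPS variable names are not given in this thread.

```python
from bisect import bisect_right

# Hypothetical field names for the income components listed in the comment.
INCOME_COMPONENTS = [
    "wages", "interest", "dividends", "alimony", "business",
    "pensions", "rental", "farm", "unemployment", "social_security",
]

# Illustrative bin edges in dollars; not the actual SOI bin definitions.
BIN_EDGES = [0, 25_000, 50_000, 100_000, 500_000,
             1_000_000, 5_000_000, 10_000_000]


def income_measure(record):
    """Sum the available income components; missing ones count as zero."""
    return sum(record.get(name, 0.0) for name in INCOME_COMPONENTS)


def income_bin(income):
    """Return the bin index: 0 is below the first edge, 8 is the top bin."""
    return bisect_right(BIN_EDGES, income)
```

With these illustrative edges, an income of about $2 million lands in bin 6, leaving the top two bins (7 and 8) empty, which mirrors the problem noted with reported AGI.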
The number of itemizers and standard deductors look much much better, but it seems from the last chart for itemized deduction, interest paid in CPS is gone in this version? @andersonfrailey |
@Amy-Xu that's because I'm working on slightly tweaking the final prep scripts and left I'm reading through the IRS documentation and there are a couple of instances where HMIE is not fully deductible, so I'm working that into final prep to see the effects on the aggregate total. |
@andersonfrailey Sounds great. Thanks! |
Update on above comment. All of the instances where home mortgage interest isn't deductible are related to when you took out the mortgage and total home equity so it doesn't look like I'll be able to adjust |
@andersonfrailey Along the same lines, I was thinking that if HMIE wasn't imputed for the purpose of the itemized deduction, it might include the interest of standard deductors, even though this item is not deductible for them. So the total of HMIE might not reflect how much is actually taken into account in the itemized deduction. Do you see what I mean? Have you looked at the total for just the records that take the itemized deduction? |
Updated notebook shows the result of taking the HMIE variable in the CPS and shrinking it for everyone by the ratio of home interest in table 2.1 of the SOI stats to HMIE. |
@andersonfrailey Looking at cell 20 in your notebook, it seems the adjusted interest paid deduction is quite a bit lower than PUF or SOI number. Did you use the total of HMIE or the total of itemizers' HMIE to create the ratio? |
@Amy-Xu I compared total HMIE in the CPS with the HMIE portion of the Interest Paid deduction reported by SOI. I'll use the HMIE of just itemizers in the CPS and see what happens. |
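To make the choice of denominator concrete, here is a small sketch that computes the shrink ratio both ways. The record layout (`hmie`, `weight`, `itemizer`) and the `shrink_ratio` helper are hypothetical, and the numbers are invented for illustration.

```python
def shrink_ratio(records, soi_target, itemizers_only=False):
    """Ratio of an SOI target total to the weighted HMIE total.

    With itemizers_only=True, only records flagged as itemizers
    contribute to the weighted total in the denominator.
    """
    total = sum(
        r["hmie"] * r["weight"]
        for r in records
        if r["itemizer"] or not itemizers_only
    )
    if total <= 0:
        raise ValueError("weighted HMIE total must be positive")
    return soi_target / total


# Invented sample records; not taken from the CPS file.
sample = [
    {"hmie": 1000.0, "weight": 10.0, "itemizer": True},
    {"hmie": 500.0, "weight": 20.0, "itemizer": False},
    {"hmie": 2000.0, "weight": 5.0, "itemizer": True},
]
```

In this invented sample, an SOI target of 15,000 gives a ratio of 0.5 over all records but 0.75 over itemizers only. Applying 0.5 to everyone leaves the itemizers' weighted total at 10,000, below the target, which is one way the adjusted deduction could end up lower than the SOI number.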
I guess this version confuses me in that the CPS total itemized deduction is lower than the PUF's, but somehow the CPS data has more itemizers. In a previous version, where the CPS had higher aggregates for both interest paid and state and local income taxes, it made sense that the CPS would end up with more itemizers than it should. But in this version the total itemized deduction in the CPS is lower than in the PUF, and yet we still end up with more itemizers? |
@andersonfrailey Thanks! |
@Amy-Xu, using |
@andersonfrailey I see. My whole point is just to see whether there is a way to get interest paid deduction close to SOI, and theoretically number of itemizers will be closer as a result. If the ratio of |
@Amy-Xu, I'm sure there's a way to get HMIE close to SOI interest paid. In theory this would help minimize |
After playing with the interest paid deduction for a while, I've updated the notebook to show the results of using Despite I'll post more as I work on this. cc @MattHJensen |
I suspect the reason why the number of itemizers in the CPS outruns the PUF is that the CPS has more people with small amounts of itemized deductions than the PUF, and fewer people with large amounts. One way to verify this is to tabulate the number of itemizers by itemized deduction amount, @andersonfrailey. If this is true, it means we might need to apply different scalers to people with different itemized deduction amounts. For example, disqualify some itemizers by reducing their deductions to levels below 6k, and augment the others' itemized deductions to match the total admin itemized deduction number. Even though this approach might get us a better total liability number at some point, it seems to me that it will take longer and is susceptible to accusations of being too manipulative. @MattHJensen What do you think is the best way to deal with the itemized deduction at this point? |
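The tabulation suggested above might look something like the following sketch, which produces weighted itemizer counts and weighted deduction totals per deduction-size bin. The bin edges, the `(deduction, weight)` record layout, and the `tabulate_itemizers` helper are assumptions for illustration only.

```python
from bisect import bisect_right
from collections import defaultdict

# Illustrative deduction-size bin edges in dollars (hypothetical).
BIN_EDGES = [0, 6_000, 12_000, 25_000, 50_000, 100_000]


def tabulate_itemizers(records):
    """Weighted itemizer counts and deduction totals by deduction bin.

    Each record is a (deduction, weight) pair; records with no
    positive deduction are skipped.
    """
    counts = defaultdict(float)
    amounts = defaultdict(float)
    for deduction, weight in records:
        if deduction <= 0:
            continue
        # Bin i covers [BIN_EDGES[i], BIN_EDGES[i + 1]); the last bin is open.
        bin_index = bisect_right(BIN_EDGES, deduction) - 1
        counts[bin_index] += weight
        amounts[bin_index] += deduction * weight
    return dict(counts), dict(amounts)
```

Running the same function on the CPS and PUF records and comparing the two tables bin by bin would show whether the CPS is heavier in the low bins and lighter in the high ones.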
@Amy-Xu said
How do the distributions of the components (medical expenses, interest paid, etc.) of the itemized deduction amount in the CPS compare to those of the PUF? To me, it would make more sense to scale those components than to scale the full itemized deduction amount. |
@hdoupe Take a look at Cell 22 in the latest notebook Anderson posted in his comment. I believe what Anderson has been doing is scaling just the interest paid component of the itemized deduction, rather than the full itemized deduction. If you trace back through a few of Anderson's previous comments, you can see that scaling back one item increases the total number of itemizers beyond the desired level, at least as I understand it. The increased number of itemizers decreases the total number of standard deductors and the total standard deduction, which drags individual income tax liability further away from where it's supposed to be. Does that make sense? |
On Wed, 12 Jul 2017, Henry Doupe wrote:
@Amy-Xu said
I suspect the reason why the number of itemizers in CPS outruns
PUF is that CPS has more people w/ smaller amount of itemized
deduction than PUF, and less people with large amount of
itemized deduction. One way to verify is to tabulate itemizer
number by itemize deduction amount @andersonfrailey. If this is
true, it means we might need to apply different scalers to
people w/ different itemize deduction amount. For example,
disqualify some itemizers by reducing their deduction to levels
below 6k, and augment the others itemize deduction to match the
total admin itemized deduction number
How do the distributions of the components (medical expenses, interest paid,
etc.) of the itemized deduction amount in the CPS compare to those of the
PUF? To me, it would make more sense to scale those components than to scale
the full itemized deduction amount.
It would help to know where the itemized deduction imputation comes from.
Is it the CEX?
dan
|
@Amy-Xu Ah, ok thanks. That makes sense. Sorry, I should've looked before I commented. |
@andersonfrailey Since we have already been applying a scaler, the adjustment could be accused of data manipulation as it is. One brute-force way to solve this problem, I think, is to apply one scaler to get the total number of itemizers in the right place, and then apply another scaler to augment these itemizers' interest paid and state and local deductions to match the total itemized deduction. But it's Matt's call whether to do this or not. @hdoupe no worries. |
@feenberg Dan, John has this documentation for the CPS tax unit online at: http://www.quantria.com/assets/img/TechnicalDocumentationV4-2.pdf This documentation (on page 14) seems to indicate that the two biggest itemized deductions, home mortgage interest and state and local income taxes, are imputed from the Survey of Consumer Finances (Federal Reserve Board) and a proprietary state calculator, respectively. Other items are matched from SOI etc. Do you think this mortgage interest issue might be rooted in the original CPS income distribution? The CPS doesn't have enough high-income earners, so the mortgage interest amounts matched from the SCF are tilted toward the lower end. |
@Amy-Xu it looks like you're right about there being more itemizers, but smaller totals, as can be seen in the plots below. |
@Amy-Xu asked:
My view is that we should leave everything as is and focus on documenting the file and how to recreate the file from scratch from initial datasets (CPS, SCF, etc.), including all imputations. Only after the file is documented and we have scripts to create it from scratch should we focus on tuning things. At that point we'll have a much better idea of whether the imputations themselves should be adjusted or whether the file should receive 'post processing' adjustments like those discussed above. |
@MattHJensen said:
What do you mean by 'as is'? You asked to include HMIE as e19200 in this comment. Without any scaling adjustment, this HMIE/e19200 will increase the difference in individual income tax liability between SOI and CPS to more than 30%. Do you mean to include HMIE or not? |
@Amy-Xu asked:
If others agree with me that home mortgage interest expense is a component of "total interest paid", then I think we should include HMIE in e19200. If others don't, then I'd like to discuss that further.
My point is that we should establish correspondences between as many of the variables in the CPS file and variables in the PUF as we can, and then hold off on further adjustments until we have documented and can reproduce the CPS file. Otherwise we will run into more conversations like this: q. Why did you adjust the interest deduction amount in post processing? |
@MattHJensen said:
It is a component of total interest paid, and the most important one. I would love if we could get the other components as well, but I could settle for at least being able to distinguish between HMIE and non-HMIE deductible interest expense. |
@andersonfrailey Could you explain again why you thought HMIE is not equivalent to e19200 so everyone can see? |
@Amy-Xu I came to that conclusion by talking to John about it. It is a component of e19200 as @codykallen said, but not a direct map like other variables are. I generally agree with the points @MattHJensen and @Amy-Xu have made about excessive data manipulation. I suppose it would be best to leave it as is and make a note in the documentation of the dataset about it and the effects it may have on the results coming out of tax-calculator. |
Updated notebook just includes some additional information requested by @Amy-Xu on the number of benefits participants. |
@andersonfrailey said:
Thanks for the additional tabulations. I'm puzzled by the distribution of benefits by wage percentile among participants. For means-tested programs, I would expect to see benefits decline as wages rise. But what we see for SSI is nothing like that. And we see just the opposite for Social Security, which has an individual level "earnings test" (that reduces benefits as wages rise). I do not expect to see such a decline for benefits that are not means-tested, such as Medicare. What am I missing? Is there some explanation that makes sense out of what is puzzling me? |
@martinholmer Thanks for looking into the charts. At this point, it might be hard to explain the trends in the participation and benefit charts because these charts are not the final version yet, and we might still need to tweak them here and there. That being said, I'll try to explain what looks sensible to me regarding the three observations you made about SSI, Social Security, and Medicare respectively. Much of this is just speculation on my part. I'm happy to continue the discussion and hear more feedback, since there's no official tax-unit distribution we could compare with.
|
The CPS files and documentation have been merged into master so I am closing this issue. |
This issue is just an overview of the progress we've made in preparing the CPS-based file for use in Tax-Calculator.
John gave me the files needed to create the CPS file along with an associated weights file that covers the years 2015-2027. The SAS scripts create tax-units from the CPS in the same manner used to create the CPS tax-units that are then merged with the 2009 IRS-PUF file to create the final PUF currently used. After that, the following files are adjusted for top-coding:
Then the following are imputed:
Finally, the following are targeted at a state level:
There are some variables that are currently missing from the file as well:
nu18, n1821, and n21
can be found, and I'm editing the SAS files to do so. We're waiting for John to get us imputations for state and local taxes as well. I'm also digging more into the CPS to see if there are any other variables that can be found. @Amy-Xu and I are analyzing the final files to make sure the results after using them in tax-calc make sense.
I will use this issue to post updates as more progress is made.
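Since the CPS records everyone's age, the three age-count variables could in principle be derived per tax unit along the lines of the sketch below. The age boundaries used here (under 18, 18 through 20, 21 and over) are an assumption inferred from the variable names and should be checked against the Tax-Calculator variable definitions; the `age_counts` helper is hypothetical and is not the SAS logic used to build the file.

```python
def age_counts(ages):
    """Count tax-unit members in three age groups.

    Assumed boundaries: nu18 is under 18, n1821 is 18 through 20,
    and n21 is 21 and over. Verify these against the official
    variable definitions before relying on them.
    """
    return {
        "nu18": sum(1 for age in ages if age < 18),
        "n1821": sum(1 for age in ages if 18 <= age <= 20),
        "n21": sum(1 for age in ages if age >= 21),
    }
```

Under these assumed boundaries every person falls into exactly one group, so the three counts sum to the size of the tax unit.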
@martinholmer @MattHJensen @codykallen