Revamp CPS final prep #110
Conversation
@andersonfrailey, A couple of questions: (1) Where are you putting the code that rounds the CPS weights to the nearest whole number and converts those rounded numbers to integers? You had said you were planning on doing that. (2) Don't you want to include the raw and final … Let me put it another way: I want to be able to use the taxdata repo on my local computer to check that I can produce the same …
@martinholmer thanks for reminding me about rounding the weights file. The latest commit includes compressed raw CPS and weights files. There is no need to decompress them before running the scripts. It also rounds the weights file before exporting. Both files are also compressed when exported.
@andersonfrailey, Thanks for the quick response. Here is what I tried: …
I thought the CPS data didn't have any adjustment ratios (that is, …
@andersonfrailey, Looking at the … Also, in order to be consistent with the PUF directory structure, shouldn't the …
@andersonfrailey, I just used the cps.csv.gz file you emailed me yesterday. My impression is that it is the same as what you sent on Monday except without all the benefit variables. But when I use yesterday's file to simulate taxes, I get quite different results than when using Monday's file. For example, for 2017, payroll taxes are unchanged at $1177.7 billion, but income taxes are now noticeably lower: $1246.6 billion rather than $1341.9 billion, a decline of about 7 percent. So when I compare the two files, I see that not only were the benefit variables dropped, but there was a new …
@andersonfrailey, Thanks for adding the … A question about the two …
@martinholmer there is an adjustment file. The key change was that I added the adjustment to final prep rather than having a separate script for it, because it's only being done for one year and no additional file is being created, as is the case with the PUF. This was discussed in issue #90. The latest commit includes that file as well. And …
@andersonfrailey said:
OK, fine. No need to do too much on that. I understand the sequence of events.
@andersonfrailey said:
OK. The script that creates the AGI bins (not "age bins", right?) should be included in the taxdata repo (I guess in the …
@andersonfrailey also said: …
@martinholmer latest commit moves all the files related to the weights to the …
@andersonfrailey, Sorry I missed this before, but wouldn't it be more accurate to multiply the weights by 100.0 before rounding to integers? I think that will be noticeably more accurate than doing the rounding first. I should have seen this before. Sorry for this extra bit of work.
@martinholmer no problem. Latest commit switches the two lines.
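The order of those two lines matters because rounding first discards the fractional part of each weight that the scaling would otherwise preserve. A minimal illustration with a hypothetical sample weight:

```python
# A hypothetical CPS sample weight with a fractional part.
weight = 1234.567

# Rounding first and then scaling by 100 loses the fraction entirely:
rounded_first = round(weight) * 100   # 123500

# Scaling by 100 first and then rounding keeps the weight accurate
# to the nearest hundredth of its original value:
scaled_first = round(weight * 100)    # 123457

print(rounded_first, scaled_first)
```

For this weight, rounding first introduces an error of 43 weight-hundredths, which accumulates across millions of records when the weights are used to compute aggregate tax liabilities.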
@andersonfrailey, With respect to your latest version of #110, aren't we still missing the final …
@martinholmer good catch. I left it out of the first set of …
@andersonfrailey, Turns out there is just one more fly in the ointment. :( When pandas uses gzip compression, it follows the gzip default of putting the timestamp of the gzipped file into the header of the .gz file. This means that when I tried to replicate your work on my computer, the two .gz files produced by the scripts were viewed as different by git (because the embedded timestamp generated on my computer was later than the one generated on your computer and included in the .gz file in the pull request). The gzip utility has an option to exclude the timestamp: … So, here is what I think we must do to avoid the false indication that the newly generated .gz files are different from those in the repository. I suggest that (in both the scripts) we replace the single …
And do the analogous one-for-two change for the CPS weights. In both scripts you'll have to add an … Does this make sense? Is there another way to avoid this problem?
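The one-for-two replacement described above might look like this sketch. The DataFrame contents and file names are illustrative, and it assumes the gzip utility is on the PATH:

```python
import subprocess
import pandas as pd

# Hypothetical final DataFrame; the real scripts build this from the raw CPS.
cps = pd.DataFrame({"RECID": [1, 2], "s006": [1234567, 8901234]})

# Instead of one to_csv(..., compression="gzip") call, which embeds the
# current timestamp in the .gz header, first write a plain CSV...
cps.to_csv("cps.csv", index=False)

# ...and then compress it with gzip's -n option, which omits the
# timestamp and original file name from the header, so the .gz output
# is byte-identical across machines and runs.
subprocess.check_call(["gzip", "-n", "-f", "cps.csv"])
```

An alternative that avoids shelling out: the standard-library `gzip.GzipFile` accepts an `mtime` argument that can be pinned to 0 for the same deterministic effect.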
@martinholmer thanks for pointing this out. What you've suggested makes sense, and I'm not familiar with other ways to avoid this issue. I've made the edits to the code and am running them now to make sure it works as expected.
@andersonfrailey, Thanks for all the work. I was able to generate on my computer the exact same two cps*.csv.gz files as included in this pull request #110. Merging into the master branch.
Most of the logic in `cps_data/finalprep.py` stays the same. I mostly just moved chunks of code into functions and added a few features that were missing beforehand. Code related to benefits data has been commented out because we have decided to first release `cps.csv` with only tax data. It will be uncommented when we add benefit data to `cps.csv`.
@MattHJensen @martinholmer @Amy-Xu