
Revamp CPS final prep #110

Merged: 7 commits, Jul 20, 2017

Conversation

andersonfrailey
Collaborator

Most of the logic in cps_data/finalprep.py stays the same. I mostly just moved chunks of code into functions and added a few features that were missing beforehand.

Code related to benefits data has been commented out because we have decided to first release cps.csv with only tax data. It will be uncommented when we add benefit data to cps.csv.

@MattHJensen @martinholmer @Amy-Xu

@martinholmer
Contributor

martinholmer commented Jul 19, 2017

@andersonfrailey, A couple of questions:

(1) Where are you putting the code that rounds the CPS weights to the nearest whole number and converts those rounded numbers to integers? You had said you were planning on doing that.

(2) Don't you want to include the raw and final cps.csv and cps_weights.csv files in pull request #110? They contain public data, so there is no need to "hide" them like we do the puf.csv file. By putting them in the taxdata repo, anybody can use the raw data (from John) to reproduce what we do to get the final CPS input files. Given that there are no data restrictions, isn't this what we want for our open-source project?

Let me put it another way. I want to be able to use the taxdata repo on my local computer to check that I can produce the same cps.csv and cps_weights.csv files as you produced on your computer and sent to me.

@MattHJensen

@andersonfrailey
Collaborator Author

@martinholmer thanks for reminding me about rounding the weights file.

Latest commit includes compressed raw CPS and weights files. There is no need to decompress them before running the scripts. It also rounds the weights file before exporting. Both files are also compressed when exported.
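The round-and-export step described above can be sketched as follows (a minimal illustration only: the column names `WT2014`/`WT2015` are hypothetical, not the actual taxdata column layout, and the real weights file has many more year columns):

```python
import pandas as pd

# Hypothetical two-column weights table standing in for cps_weights_raw.
weights = pd.DataFrame({"WT2014": [1234.6, 887.2],
                        "WT2015": [1250.1, 890.9]})

# Round every weight to the nearest whole number and store as integers.
weights = weights.round().astype(int)

# Export compressed; pandas to_csv accepts compression='gzip'.
weights.to_csv("cps_weights.csv.gz", index=False, compression="gzip")
```

pandas can also read such gzipped CSVs back directly with `pd.read_csv("cps_weights.csv.gz")`, which is why no manual decompression is needed before running the scripts.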

@martinholmer
Contributor

@andersonfrailey, Thanks for the quick response. Here is what I tried:

$ cd cps_data
$ python finalprep.py
Traceback (most recent call last):
  File "finalprep.py", line 378, in <module>
    sys.exit(main())
  File "finalprep.py", line 11, in main
    adj_targets = pd.read_csv('adjustment_targets.csv')
  File "/Users/mrh/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/mrh/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 405, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/mrh/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 764, in __init__
    self._make_engine(self.engine)
  File "/Users/mrh/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 985, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Users/mrh/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1605, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 394, in pandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:4209)
  File "pandas/_libs/parsers.pyx", line 710, in pandas._libs.parsers.TextReader._setup_parser_source (pandas/_libs/parsers.c:8873)
IOError: File adjustment_targets.csv does not exist

I thought the CPS data didn't have any adjustment ratios (that is, cps_ratios.csv does not exist).
What did I do wrong?

@martinholmer
Contributor

@andersonfrailey, Looking at the cps_data/finalprep.py script, it would appear as if you need to add the adjustment_targets.csv file to PR #110.

Also, in order to be consistent with the PUF directory structure, shouldn't the cps_weights_raw.csv.gz file and the code that modifies it and writes the cps_weights.csv.gz file all be in the taxdata/cps_stage2 directory?

@martinholmer
Contributor

@andersonfrailey, I just used the cps.csv.gz file you emailed me yesterday. My impression is that it is the same as what you sent on Monday except without all the benefit variables. But when I use yesterday's file to simulate taxes, I get much different results than I get when using Monday's file. For example, for 2017 payroll taxes are unchanged at $1177.7 billion, but income taxes are now noticeably lower: $1246.6 billion rather than $1341.9 billion, which is a decline of about 7 percent.

So when I compare the two files, I see that not only were the benefit variables dropped, but a new e19200 variable was added. What's the story on this? Is this a new variable you've added just this week?

@martinholmer
Contributor

@andersonfrailey, Thanks for adding the adjustment_targets.csv file to PR #110.

A question about the two cps*raw.csv.gz files:
Are they files that come straight from John O'Hare's website?
Or has there been some processing of the O'Hare files to get what is in the two cps*raw.csv.gz files?

@andersonfrailey
Collaborator Author

@martinholmer there is an adjustment file. The key change is that I added the adjustment step to final prep rather than having a separate script for it, because it's only being done for one year and no additional file is being created, as is the case with the PUF. This was discussed in issue #90. Latest commit includes that file as well.

And e19200 is the interest deduction; it was added recently. I'm trying to find a link, but @MattHJensen , @Amy-Xu , and I had a lengthy discussion on this. I'll link to it as soon as I find it.

@andersonfrailey
Collaborator Author

@martinholmer

Are they files that come straight from John O'Hare's website?
Or has there been some processing of the O'Hare files to get what is in the two cps*raw.csv.gz files?

cps_raw.csv.gzip was created using the scripts John sent with our two additions mentioned in the CPS documentation - adding benefits variables and counting the number of people in specified age bins. cps_weights_raw.csv comes straight from John.

@martinholmer
Contributor

@andersonfrailey said:

And e19200 is the interest deduction; it was added recently. I'm trying to find a link, but @MattHJensen , @Amy-Xu , and I had a lengthy discussion on this. I'll link to it as soon as I find it.

OK, fine. No need to do too much on that. I understand the sequence of events.

@martinholmer
Contributor

@andersonfrailey said:

cps_raw.csv.gzip was created using the scripts John sent with our two additions mentioned in the CPS documentation - adding benefits variables and counting the number of people in specified age bins.

OK. The script that creates the AGI bins (not "age bins", right?) should be included in the taxdata repo (I guess in the taxdata/cps_data directory), so that the transformation of John's cps_ohare.csv.gzip file into the cps_raw.csv.gzip file can be reproduced. Or maybe it's better to put this one step into the cps_data/finalprep.py script so that everything is in one place. What do you think?

@andersonfrailey also said:

cps_weights_raw.csv comes straight from John.

OK, then the file should be renamed cps_weights_ohare.csv, and the README.md in the cps_stage2 directory should say that and describe the simple script that transforms it into the cps_weights.csv.gz file. That way your CPS work is parallel with your PUF work. Does this make sense?

@andersonfrailey
Collaborator Author

@martinholmer latest commit moves all the files related to the weights to the cps_stage2 directory and updates that README.

@martinholmer
Contributor

martinholmer commented Jul 20, 2017

@andersonfrailey, Sorry I missed this before, but wouldn't it be more accurate to multiply the weights by 100.0 before rounding to integers? I think that will be noticeably more accurate than doing the rounding first. I should have seen this before. Sorry for this extra bit of work.

@andersonfrailey
Collaborator Author

@martinholmer no problem. Latest commit switches the two lines.
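The ordering matters because rounding first throws away the fractional part of each weight before it is scaled up. A toy arithmetic illustration (not the actual taxdata code, just the two orderings applied to one made-up weight):

```python
# A hypothetical sample weight with a fractional part.
w = 123.456

# Multiply by 100 first, then round: the two decimal places survive.
multiply_then_round = round(w * 100.0)   # keeps the .45 precision

# Round first, then multiply: the fraction is lost before scaling.
round_then_multiply = round(w) * 100.0   # fraction discarded

print(multiply_then_round, round_then_multiply)
```

The first ordering preserves weight totals to two decimal places; the second can shift each record's weight by up to 50 (in hundredths), which is why the two lines were swapped.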

@martinholmer
Contributor

@andersonfrailey, With respect to your latest version of #110, aren't we still missing the final cps.csv.gz file in the taxdata/cps_data directory? We want to add that to the pull request just as you have added the cps_weights.csv.gz file in the taxdata/cps_stage2 directory.

@andersonfrailey
Collaborator Author

@martinholmer good catch. I left it out of the first set of git add statements. Added.

@martinholmer
Contributor

@andersonfrailey, Turns out there is just one more fly in the ointment. :(

It turns out that when pandas uses gzip compression it follows the gzip default of putting the timestamp of the gzipped file into the header of the .gz file. This means that when I tried to replicate your work on my computer, the two .gz files produced by the scripts were viewed as different by git (because the embedded timestamp generated on my computer was later than the one generated on your computer and included in the .gz file in the pull request).

It turns out that the gzip utility has an option to exclude the timestamp: gzip -n ... will suppress the default inclusion of the timestamp. However, we don't have that option from pandas.

So, here is what I think we must do to avoid the false indication that the newly generated .gz files are different from those in the repository. I suggest that (in both scripts) we replace the single to_csv(..., compression='gzip') statement with two statements:

data.to_csv('cps.csv', index=False)
subprocess.check_call(["gzip", "-n", "cps.csv"])

And do the analogous one-for-two change for the cps weights. In both scripts you'll have to add an import subprocess statement at the top.

Does this make sense? Is there another way to avoid this problem?
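One pure-Python alternative, sketched here under the assumption that shelling out to gzip is undesirable: the standard-library gzip module accepts an mtime argument, and mtime=0 fixes the header timestamp so repeated runs produce byte-identical files (the e00200 column below is just a placeholder for the real data):

```python
import gzip
import pandas as pd

# Placeholder frame standing in for the prepared CPS data.
data = pd.DataFrame({"e00200": [50000, 72000]})

# Render the CSV in memory, then compress it with a fixed (zero)
# timestamp in the gzip header, making the output reproducible.
csv_bytes = data.to_csv(index=False).encode("utf-8")
with gzip.GzipFile("cps.csv.gz", mode="wb", mtime=0) as gz:
    gz.write(csv_bytes)
```

With the header timestamp pinned, two runs on different machines compress identical CSV bytes to identical .gz bytes, so git no longer reports a spurious difference.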

@andersonfrailey
Collaborator Author

@martinholmer thanks for pointing this out. What you've suggested makes sense and I'm not familiar with other ways to avoid this issue. I've made the edits to the code and am running them now to make sure it works as expected.

@martinholmer martinholmer merged commit 6e0400b into PSLmodels:master Jul 20, 2017
@martinholmer
Contributor

@andersonfrailey, Thanks for all the work. I was able to generate on my computer the exact same two cps*csv.gz files as included in this pull request #110. Merging into the master branch.
