
Revamp CPS final prep #110

Merged: 7 commits, Jul 20, 2017

Conversation

andersonfrailey
Collaborator

Most of the logic in cps_data/finalprep.py stays the same. I mostly just moved chunks of code into functions and added a few features that were missing beforehand.

Code related to benefits data has been commented out because we have decided to first release cps.csv with only tax data. It will be uncommented when we add benefit data to cps.csv.

@MattHJensen @martinholmer @Amy-Xu

@martinholmer
Contributor

martinholmer commented Jul 19, 2017

@andersonfrailey, A couple of questions:

(1) Where are you putting the code that rounds the CPS weights to the nearest whole number and converts those rounded numbers to integers? You had said you were planning on doing that.

(2) Don't you want to include the raw and final cps.csv and cps_weights.csv files in pull request #110? They contain public data, so there is no need to "hide" them like we do the puf.csv file. By putting them in the taxdata repo, anybody can use the raw data (from John) to reproduce what we do to get the final CPS input files. Given that there are no data restrictions, isn't this what we want for our open-source project?

Let me put it another way. I want to be able to use the taxdata repo on my local computer to check that I can produce the same cps.csv and cps_weights.csv files as you produced on your computer and sent to me.

@MattHJensen

@andersonfrailey
Collaborator Author

@martinholmer thanks for reminding me about rounding the weights file.

Latest commit includes compressed raw CPS and weights files. There is no need to decompress them before running the scripts. It also rounds the weights file before exporting. Both files are also compressed when exported.
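The round-and-export step described above can be sketched as follows (a minimal illustration only: the column names `WT2014`/`WT2015` are hypothetical, not the actual taxdata column layout, and the real weights file has many more year columns):

```python
import pandas as pd

# Hypothetical two-column weights table standing in for cps_weights_raw.
weights = pd.DataFrame({"WT2014": [1234.6, 887.2],
                        "WT2015": [1250.1, 890.9]})

# Round every weight to the nearest whole number and store as integers.
weights = weights.round().astype(int)

# Export compressed; pandas to_csv accepts compression='gzip'.
weights.to_csv("cps_weights.csv.gz", index=False, compression="gzip")
```

pandas can also read such gzipped CSVs back directly with `pd.read_csv("cps_weights.csv.gz")`, which is why no manual decompression is needed before running the scripts.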

@martinholmer
Contributor

@andersonfrailey, Thanks for the quick response. Here is what I tried:

$ cd cps_data
$ python finalprep.py
Traceback (most recent call last):
  File "finalprep.py", line 378, in <module>
    sys.exit(main())
  File "finalprep.py", line 11, in main
    adj_targets = pd.read_csv('adjustment_targets.csv')
  File "/Users/mrh/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/mrh/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 405, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/mrh/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 764, in __init__
    self._make_engine(self.engine)
  File "/Users/mrh/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 985, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Users/mrh/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1605, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 394, in pandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:4209)
  File "pandas/_libs/parsers.pyx", line 710, in pandas._libs.parsers.TextReader._setup_parser_source (pandas/_libs/parsers.c:8873)
IOError: File adjustment_targets.csv does not exist

I thought the CPS data didn't have any adjustment ratios (that is, cps_ratios.csv does not exist).
What did I do wrong?

@martinholmer
Contributor

@andersonfrailey, Looking at the cps_data/finalprep.py script, it would appear as if you need to add the adjustment_targets.csv file to PR #110.

Also, in order to be consistent with the PUF directory structure, shouldn't the cps_weights_raw.csv.gz file and the code that modifies it and writes the cps_weights.csv.gz file all be in the taxdata/cps_stage2 directory?

@martinholmer
Contributor

@andersonfrailey, I just used the cps.csv.gz file you emailed me yesterday. My impression is that it is the same as what you sent on Monday except without all the benefit variables. But when I use yesterday's file to simulate taxes, I get much different results than I get when using Monday's file. For example, for 2017 payroll taxes are unchanged at $1177.7 billion, but income taxes are now noticeably lower: $1246.6 billion rather than $1341.9 billion, which is a decline of about 7 percent.

So when I compare the two files, I see that not only were the benefit variables dropped, but a new e19200 variable was added. What's the story on this? Is this a new variable you've added just this week?

@martinholmer
Contributor

@andersonfrailey, Thanks for adding the adjustment_targets.csv file to PR #110.

A question about the two cps*raw.csv.gz files:
Are they files that come straight from John O'Hare's website?
Or has there been some processing of the O'Hare files to get what is in the two cps*raw.csv.gz files?

@andersonfrailey
Collaborator Author

@martinholmer there is an adjustment file. The key change is that I added the adjustment step to final prep rather than having a separate script for it, because it's only being done for one year and no additional file is being created, as is the case with the PUF. This was discussed in issue #90. Latest commit includes that file as well.

And e19200 is the interest deduction; it was added recently. I'm trying to find a link, but @MattHJensen , @Amy-Xu , and I had a lengthy discussion on this. I'll link to it as soon as I find it.

@andersonfrailey
Collaborator Author

@martinholmer

Are they files that come straight from John O'Hare's website?
Or has there been some processing of the O'Hare files to get what is in the two cps*raw.csv.gz files?

cps_raw.csv.gzip was created using the scripts John sent with our two additions mentioned in the CPS documentation - adding benefits variables and counting the number of people in specified age bins. cps_weights_raw.csv comes straight from John.

@martinholmer
Contributor

@andersonfrailey said:

And e19200 is the interest deduction; it was added recently. I'm trying to find a link, but @MattHJensen , @Amy-Xu , and I had a lengthy discussion on this. I'll link to it as soon as I find it.

OK, fine. No need to do too much on that. I understand the sequence of events.

@martinholmer
Contributor

@andersonfrailey said:

cps_raw.csv.gzip was created using the scripts John sent with our two additions mentioned in the CPS documentation - adding benefits variables and counting the number of people in specified age bins.

OK. The script that creates the AGI bins (not "age bins", right?) should be included in the taxdata repo (I guess in the taxdata/cps_data directory), so that the transformation of John's cps_ohare.csv.gzip file into the cps_raw.csv.gzip file can be reproduced. Or maybe it's better to put this one step into the cps_data/finalprep.py script so that everything is in one place. What do you think?

@andersonfrailey also said:

cps_weights_raw.csv comes straight from John.

OK, then the file should be renamed cps_weights_ohare.csv, and the README.md in the cps_stage2 directory should say that and describe the simple script that transforms it into the cps_weights.csv.gz file. That way your CPS work is parallel with your PUF work. Does this make sense?

@andersonfrailey
Collaborator Author

@martinholmer latest commit moves all the files related to the weights to the cps_stage2 directory and updates that README.

@martinholmer
Contributor

martinholmer commented Jul 20, 2017

@andersonfrailey, Sorry I missed this before, but wouldn't it be more accurate to multiply the weights by 100.0 before rounding to integers? I think that will be noticeably more accurate than doing the rounding first. I should have seen this before. Sorry for this extra bit of work.

@andersonfrailey
Collaborator Author

@martinholmer no problem. Latest commit switches the two lines.
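The ordering matters because rounding first throws away the fractional part of each weight before it is scaled up. A toy arithmetic illustration (not the actual taxdata code, just the two orderings applied to one made-up weight):

```python
# A hypothetical sample weight with a fractional part.
w = 123.456

# Multiply by 100 first, then round: the two decimal places survive.
multiply_then_round = round(w * 100.0)   # keeps the .45 precision

# Round first, then multiply: the fraction is lost before scaling.
round_then_multiply = round(w) * 100.0   # fraction discarded

print(multiply_then_round, round_then_multiply)
```

The first ordering preserves weight totals to two decimal places; the second can shift each record's weight by up to 50 (in hundredths), which is why the two lines were swapped.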

@martinholmer
Contributor

@andersonfrailey, With respect to your latest version of #110, aren't we still missing the final cps.csv.gz file in the taxdata/cps_data directory? We want to add that to the pull request just as you have added the cps_weights.csv.gz file in the taxdata/cps_stage2 directory.

@andersonfrailey
Collaborator Author

@martinholmer good catch. I left it out of the first set of git add statements. Added.

@martinholmer
Contributor

@andersonfrailey, Turns out there is just one more fly in the ointment. :(

It turns out that when pandas uses gzip compression it follows the gzip default of putting the timestamp of the gzipped file into the header of the .gz file. This means that when I tried to replicate your work on my computer, the two .gz files produced by the scripts were viewed as different by git (because the embedded timestamp generated on my computer was later than the one generated on your computer and included in the .gz file in the pull request).

It turns out that the gzip utility has an option to exclude the timestamp: gzip -n ... will suppress the default inclusion of the timestamp. However, we don't have that option from pandas.

So, here is what I think we must do to avoid the false indication that the newly generated .gz files are different from those in the repository. I suggest that (in both scripts) we replace the single to_csv(..., compression='gzip') statement with two statements:

data.to_csv('cps.csv', index=False)
subprocess.check_call(["gzip", "-n", "cps.csv"])

And do the analogous one-for-two change for the cps weights. In both scripts you'll have to add an import subprocess statement at the top.

Does this make sense? Is there another way to avoid this problem?
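One pure-Python alternative, sketched here under the assumption that shelling out to gzip is undesirable: the standard-library gzip module accepts an mtime argument, and mtime=0 fixes the header timestamp so repeated runs produce byte-identical files (the e00200 column below is just a placeholder for the real data):

```python
import gzip
import pandas as pd

# Placeholder frame standing in for the prepared CPS data.
data = pd.DataFrame({"e00200": [50000, 72000]})

# Render the CSV in memory, then compress it with a fixed (zero)
# timestamp in the gzip header, making the output reproducible.
csv_bytes = data.to_csv(index=False).encode("utf-8")
with gzip.GzipFile("cps.csv.gz", mode="wb", mtime=0) as gz:
    gz.write(csv_bytes)
```

With the header timestamp pinned, two runs on different machines compress identical CSV bytes to identical .gz bytes, so git no longer reports a spurious difference.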

@andersonfrailey
Collaborator Author

@martinholmer thanks for pointing this out. What you've suggested makes sense and I'm not familiar with other ways to avoid this issue. I've made the edits to the code and am running them now to make sure it works as expected.

@martinholmer martinholmer merged commit 6e0400b into PSLmodels:master Jul 20, 2017
@martinholmer
Contributor

@andersonfrailey, Thanks for all the work. I was able to generate on my computer the exact same two cps*csv.gz files as included in this pull request #110. Merging into the master branch.
