-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
169 lines (137 loc) · 5.04 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
---
output: github_document
editor_options:
markdown:
wrap: 72
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# claimsdb
<!-- badges: start -->
[![CRAN
status](https://www.r-pkg.org/badges/version/claimsdb)](https://CRAN.R-project.org/package=claimsdb)
<!-- badges: end -->
**claimsdb** provides easy access to a sample of health insurance
enrollment and claims data from the [CMS Data Entrepreneurs' Synthetic
Public Use File
(DE-SynPUF)](https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF),
as a set of relational tables or as an in-memory database using
[DuckDB](https://duckdb.org). This package is inspired by and based on
the [starwarsdb](https://github.com/gadenbuie/starwarsdb) package.
The data are structured as actual Medicare claims data but are fully
"synthetic," after a process of alterations meant to reduce the risk of
re-identification of real Medicare beneficiaries. The synthetic process
that CMS adopted changes the co-variation across variables, so analysts
should be cautious about drawing inferences about the actual Medicare
population.
The data included in **claimsdb** comes from 500 randomly selected 2008
Medicare beneficiaries from Sample 2 of the DE-SynPUF, and it includes
all the associated claims for these members for 2008-2009. CMS provides
[resources](https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DESample02),
including a codebook, FAQs, and other documents with more information
about this data.
[![Source: CMS Linkable 2008–2010 Medicare
DE-SynPUF](man/figures/README-diagram.PNG "Source: CMS Linkable 2008–2010 Medicare DE-SynPUF")](https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/Downloads/SynPUF_DUG.pdf)
**Formats:** [Metadata](#health-insurance-claims-data), [Local
Tables](#local-tables), [Remote Database
Tables](#remote-database-tables)
## Installation
You can install the development version of **claimsdb** from
[GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("jfangmeier/claimsdb")
```
## Health Insurance Claims Data
```{r echo=FALSE, warning=FALSE}
library(tibble)
```
All of the tables are available when you load **claimsdb**
```{r message=FALSE, warning=FALSE}
library(dplyr)
library(claimsdb)
```
The tables are also available by loading just the data from **claimsdb**
```{r eval=FALSE}
data(package = "claimsdb")
```
This includes a `schema` table that describes each of the tables and the
included variables from the [CMS
DE-SynPUF](https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF).
```{r}
schema
```
```{r}
schema %>%
filter(TABLE == "inpatient") %>%
pull(PROPERTIES)
```
## Local Tables
Using the sample data in the tables, you can ask questions such as,
*what were the average and median prescription drug costs for males and
females in 2008 and 2009?*
```{r}
rx_costs <- pde %>%
mutate(BENE_YEAR = lubridate::year(SRVC_DT)) %>%
group_by(BENE_YEAR, DESYNPUF_ID) %>%
summarize(TOTAL_RX_COST = sum(TOT_RX_CST_AMT, na.rm = T), .groups = "drop")
bene %>%
transmute(
BENE_YEAR,
DESYNPUF_ID,
BENE_SEX_IDENT = case_when(
BENE_SEX_IDENT_CD == "1" ~ "Male",
TRUE ~ "Female")
) %>%
left_join(
rx_costs, by = c("BENE_YEAR", "DESYNPUF_ID")
) %>%
mutate(TOTAL_RX_COST = ifelse(is.na(TOTAL_RX_COST), 0, TOTAL_RX_COST)) %>%
group_by(BENE_YEAR, BENE_SEX_IDENT) %>%
summarize(
MEAN_RX_COST = mean(TOTAL_RX_COST, na.rm = T),
MEDIAN_RX_COST = median(TOTAL_RX_COST, na.rm = T))
```
## Remote Database Tables
Many organizations store claims data in a remote database, so
**claimsdb** also includes all of the tables as an in-memory DuckDB
database. This can be a great way to practice working with this type of
data, including building queries with dplyr code using
[dbplyr](https://dbplyr.tidyverse.org/).
```{r}
con <- claims_connect()
bene_rmt <- tbl(con, "bene")
pde_rmt <- tbl(con, "pde")
pde_rmt
rx_costs_rmt <- pde_rmt %>%
# note below that lubridate functions do not currently work on remote databases,
# so you need to use date/time functions appropriate for the database.
mutate(BENE_YEAR = date_part('year', SRVC_DT)) %>%
group_by(BENE_YEAR, DESYNPUF_ID) %>%
summarize(TOTAL_RX_COST = sum(TOT_RX_CST_AMT, na.rm = T), .groups = "drop") %>%
ungroup()
rx_costs_rmt
bene_rmt %>%
transmute(
BENE_YEAR,
DESYNPUF_ID,
BENE_SEX_IDENT = case_when(
BENE_SEX_IDENT_CD == "1" ~ "Male",
TRUE ~ "Female")
) %>%
left_join(
rx_costs_rmt, by = c("BENE_YEAR", "DESYNPUF_ID")
) %>%
mutate(TOTAL_RX_COST = ifelse(is.na(TOTAL_RX_COST), 0, TOTAL_RX_COST)) %>%
group_by(BENE_YEAR, BENE_SEX_IDENT) %>%
summarize(
MEAN_RX_COST = mean(TOTAL_RX_COST, na.rm = T),
MEDIAN_RX_COST = median(TOTAL_RX_COST, na.rm = T))
```