Household surveys, particularly DHS and MICS, are a key global health resource and their results are often relied upon to inform programmatic and strategic decision-making. Unfortunately, these data are scattered across hundreds of pages in a survey report and locked away in PDF format. The goal here is to detail an approach to break out of PDF prison and enjoy the sweet freedom of tidy data frames.
Here is a typical data table from a DHS report - note the liberal use of multi-level headers and row headers:
Sidenote: the DHS Program implemented the Indicator Data API with excellent documentation. This, of course, is ideal as one can interface with DHS survey data without ever touching a PDF 🎉
🚨 Please see my fetchdhs
package to do just that! 🚨
We will be using tabulizer
, a
powerful package that wraps the Tabula
java library, to convert PDF pages to machine-readable text. Then, we
will rely on various tidyverse
functions
to manipulate the text data and produce a tidy data frame of our desired
table.
Let’s try running the table we have just seen through tabulizer
and
see what we get.
suppressPackageStartupMessages(library(tidyverse))
library(here)
library(tabulizer)
source <- "dhs_uganda_2016"
raw_txt <- extract_tables(here("report-pdfs", str_c(source,".pdf")), pages = 211, guess = FALSE)
raw_txt[[1]][1:25,]
## [1] "Table 10.4 Possession and observation of vaccination cards, according to background characteristics"
## [2] "Percentage of children age 12-23 months and children age 24-35 months who ever had a vaccination card, and percentage with a vaccination card"
## [3] "seen, according to background characteristics, Uganda DHS 2016"
## [4] "Children age 12-23 months Children age 24-35 months"
## [5] "Percentage who Percentage with a Percentage who Percentage with a"
## [6] "Background ever had a vaccination card Number of ever had a vaccination card Number of"
## [7] "characteristic vaccination card1 seen1 children vaccination card1 seen1 children"
## [8] "Sex"
## [9] "Male 96.7 70.6 1,477 96.4 58.3 1,463"
## [10] "Female 97.1 68.8 1,382 97.2 59.6 1,426"
## [11] ""
## [12] "Birth order"
## [13] "1 98.2 66.3 694 97.1 56.2 615"
## [14] "2-3 97.3 70.9 970 97.9 61.3 1,010"
## [15] "4-5 98.2 71.5 636 96.8 56.4 607"
## [16] "6+ 93.2 69.8 558 94.7 60.0 658"
## [17] ""
## [18] "Residence"
## [19] "Urban 96.5 63.5 670 97.1 54.2 613"
## [20] "Rural 97.1 71.6 2,189 96.7 60.2 2,277"
## [21] ""
## [22] "Region"
## [23] "South Central 95.8 65.4 360 95.3 48.2 386"
## [24] "North Central 96.9 67.0 313 95.1 53.3 282"
## [25] "Kampala 95.0 56.2 143 98.5 54.2 109"
Well, we can work with this! The rows of data are clearly identifiable and anything is better than keying in all those data by hand 😭
What we see above is a character matrix of scraped text from
extract_tables()
. To move from this mess to a data frame, let’s employ
some regex
to split the row labels from the data values. Fortunately,
values following row labels are typically the same format: between 0-100
to the tenths place, e.g. 98.6. So, to split the row labels from the
data values, we use a regex
with that pattern.
We also tighten up the data frame by filtering out completely empty rows
and then slicing the data frame per keys
, which are specific rows that
mark the boundaries of data within the table. In this case, we see that
the rows Sex
and Total
mark the top and bottom rows that contain
data values.
# this regex looks for the first space preceding the predominant
# initial data value pattern found in DHS and MICS tables, e.g. 98.6
str_split_regex <- regex("
\\s(?= # initial space followed by any:
\\(?\\d{1,3}[,.]\\d\\)?| # coverage value (0-100 to the tenths place; optional parens)
^\\d+$| # denominator value (whole numbers)
\\(?\\*\\)?| # special character * (denotes small n; optional parens)
na\\s) # explicit NA followed by a space
", comments = TRUE)
keys <- c("Sex","Total")
df <- raw_txt %>%
# string manipulation
unlist() %>%
str_squish() %>% # remove extraneous whitespace
str_split_fixed(str_split_regex, n = 2) %>% # split row labels from data values
cbind(., str_split_fixed(.[, 2], "\\s", n = 6)) %>% # split data values (6 columns for this table)
.[,-2] %>% # drop column of raw data values
# build df
as_tibble() %>%
na_if("") %>%
filter(!is.na(.[[1]])) %>%
slice(str_which(.[[1]], keys[1])[1]:str_which(.[[1]], keys[2])[1])
df
## # A tibble: 43 x 7
## V1 V2 V3 V4 V5 V6 V7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Sex <NA> <NA> <NA> <NA> <NA> <NA>
## 2 Male 96.7 70.6 1,477 96.4 58.3 1,463
## 3 Female 97.1 68.8 1,382 97.2 59.6 1,426
## 4 Birth order <NA> <NA> <NA> <NA> <NA> <NA>
## 5 1 98.2 66.3 694 97.1 56.2 615
## 6 2-3 97.3 70.9 970 97.9 61.3 1,010
## 7 4-5 98.2 71.5 636 96.8 56.4 607
## 8 6+ 93.2 69.8 558 94.7 60.0 658
## 9 Residence <NA> <NA> <NA> <NA> <NA> <NA>
## 10 Urban 96.5 63.5 670 97.1 54.2 613
## # ... with 33 more rows
# let's make some quick functions to use later on
split_raw_txt <- function(raw_txt, str_split_regex, n_cols) {
raw_txt %>%
unlist() %>%
str_squish() %>%
str_replace_all("(\\s\\d+)[\\s,.](\\d{3}\\)?)","\\1\\2") %>% # harmonize 1,000 separator
str_replace_all("(\\d)\\,(\\d\\)?\\b)","\\1.\\2") %>% # harmonize decimal separator
str_split_fixed(str_split_regex, n = 2) %>%
cbind(., str_split_fixed(.[, 2], "\\s", n = length(n_cols))) %>%
.[,-2]
}
build_df <- function(split_txt, keys) {
split_txt %>%
as_tibble() %>%
na_if("") %>%
filter(!is.na(.[[1]])) %>%
slice(str_which(.[[1]], keys[1])[1]:str_which(.[[1]], keys[2])[1])
}
Success! Oh, but what about those pesky row headers?
I like to think of the row headers as the grouping variable for the sets
of row labels. We leverage the existing structure of the data frame,
where row headers only contain NA
s, to create the row_grp
variable
and filter out the original header rows. Note that we check for any
summary row by detecing the string Total
, in this case, to ensure that
row label is retained.
df_row_headr <- df %>%
mutate(row_grp = if_else(is.na(.[[2]]) | str_detect(.[[1]],str_c(keys, collapse = "|")), .[[1]], NA_character_)) %>%
fill(row_grp) %>%
filter(!is.na(.[2])) %>%
mutate(row_grp = if_else(is.na(row_grp), .[[1]], row_grp)) %>%
select(row_grp, everything())
df_row_headr
## # A tibble: 36 x 8
## row_grp V1 V2 V3 V4 V5 V6 V7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Sex Male 96.7 70.6 1,477 96.4 58.3 1,463
## 2 Sex Female 97.1 68.8 1,382 97.2 59.6 1,426
## 3 Birth order 1 98.2 66.3 694 97.1 56.2 615
## 4 Birth order 2-3 97.3 70.9 970 97.9 61.3 1,010
## 5 Birth order 4-5 98.2 71.5 636 96.8 56.4 607
## 6 Birth order 6+ 93.2 69.8 558 94.7 60.0 658
## 7 Residence Urban 96.5 63.5 670 97.1 54.2 613
## 8 Residence Rural 97.1 71.6 2,189 96.7 60.2 2,277
## 9 Region South Central 95.8 65.4 360 95.3 48.2 386
## 10 Region North Central 96.9 67.0 313 95.1 53.3 282
## # ... with 26 more rows
# let's wrap this into a function as well
clean_row_headr <- function(df, keys) {
df %>%
mutate(row_grp = if_else(is.na(.[[2]]) | str_detect(.[[1]], str_c(keys, collapse = "|")), .[[1]], NA_character_)) %>%
fill(row_grp) %>%
filter(!is.na(.[2])) %>%
mutate(row_grp = if_else(is.na(row_grp), .[[1]], row_grp)) %>%
select(row_grp, everything())
}
Unfortunately, tabulizer
does not reliably parse column headers from
DHS and MICS tables due to their often irregular, multi-level structure.
Because of this, we write a function make_headr
to assemble the
desired column headers
. We feed the function each level of the column
headers: parent or
child.
make_col_headr <- function(child, parent, sep = "_") {
if (class(child) == "list") {
child %>%
map2(., parent, ~str_c(.x, .y, sep = sep)) %>%
unlist()
} else {
parent %>%
map(~str_c(child, .x, sep = sep)) %>%
unlist()
}
}
# In the above image, we see that Table 10.4 has a set of 3 column headers
# repeated twice - once for each age group: 12-23 month and 24-35 months
headr_list <- list(
child = c("vacc-card-ever", "vacc-card-seen","denom"),
parent = c("age12-23m","age24-35m")
)
headers <- make_col_headr(headr_list$child, headr_list$parent)
headers
## [1] "vacc-card-ever_age12-23m" "vacc-card-seen_age12-23m" "denom_age12-23m"
## [4] "vacc-card-ever_age24-35m" "vacc-card-seen_age24-35m" "denom_age24-35m"
# we can now set the appropriate column headers
df_row_headr %>%
set_names(c("row_grp", "row_lbl", headers))
## # A tibble: 36 x 8
## row_grp row_lbl `vacc-card-ever… `vacc-card-seen… `denom_age12-23… `vacc-card-ever…
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Sex Male 96.7 70.6 1,477 96.4
## 2 Sex Female 97.1 68.8 1,382 97.2
## 3 Birth … 1 98.2 66.3 694 97.1
## 4 Birth … 2-3 97.3 70.9 970 97.9
## 5 Birth … 4-5 98.2 71.5 636 96.8
## 6 Birth … 6+ 93.2 69.8 558 94.7
## 7 Reside… Urban 96.5 63.5 670 97.1
## 8 Reside… Rural 97.1 71.6 2,189 96.7
## 9 Region South … 95.8 65.4 360 95.3
## 10 Region North … 96.9 67.0 313 95.1
## # ... with 26 more rows, and 2 more variables: `vacc-card-seen_age24-35m` <chr>,
## # `denom_age24-35m` <chr>
tidy_df <- df_row_headr %>%
set_names(c("row_grp", "row_lbl", headers)) %>%
gather("indicator", "value", -row_grp, -row_lbl) %>%
separate(indicator, c("indicator","denom_lbl"), sep = "_")
tidy_df
## # A tibble: 216 x 5
## row_grp row_lbl indicator denom_lbl value
## <chr> <chr> <chr> <chr> <chr>
## 1 Sex Male vacc-card-ever age12-23m 96.7
## 2 Sex Female vacc-card-ever age12-23m 97.1
## 3 Birth order 1 vacc-card-ever age12-23m 98.2
## 4 Birth order 2-3 vacc-card-ever age12-23m 97.3
## 5 Birth order 4-5 vacc-card-ever age12-23m 98.2
## 6 Birth order 6+ vacc-card-ever age12-23m 93.2
## 7 Residence Urban vacc-card-ever age12-23m 96.5
## 8 Residence Rural vacc-card-ever age12-23m 97.1
## 9 Region South Central vacc-card-ever age12-23m 95.8
## 10 Region North Central vacc-card-ever age12-23m 96.9
## # ... with 206 more rows
In practice, you would be looking to extract a series of tables across
several PDFs. To accomplish this, we create a tibble of arguments to
pass to a new function scrape_table()
and use pmap()
to iterate.
We put the helper functions split_raw_txt()
build_df()
make_col_headr()
to work inside the scrape_table()
function.
scrape_table <- function(source_folder, source, tbl_id, pg, keys, str_split_regex, headers) {
# build column headers
col_hdrs <- headers %>%
pmap(make_col_headr) %>%
unlist()
# scrape table
orig <- extract_tables(here(source_folder, str_c(source, ".pdf")), pages = pg, guess = FALSE) %>%
split_raw_txt(str_split_regex, col_hdrs) %>%
build_df(keys) %>%
clean_row_headr(keys) %>%
set_names(c("row_grp", "row_lbl", col_hdrs))
# reshape table
rshp <- orig %>%
gather("indicator", "value", -row_grp, -row_lbl) %>%
mutate(source = str_remove(source, "\\.pdf"),
tbl_id = tbl_id) %>%
select(source, tbl_id, everything())
list(orig, rshp)
}
# run a table and inspect output
headr_list <- list(
child = list(c("vacc-card-ever","vacc-card-seen","denom")),
parent = list(c("age12-23m","age24-35"))
)
scrape_table(
source_folder = "report-pdfs",
source = "dhs_uganda_2016",
tbl_id = "tbl_10.4",
pg = 211,
keys = c("Sex", "Total"),
headers = headr_list,
str_split_regex = "\\s(?=\\d+.\\d)"
)
## [[1]]
## # A tibble: 36 x 8
## row_grp row_lbl `vacc-card-ever… `vacc-card-seen… `denom_age12-23… `vacc-card-ever…
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Sex Male 96.7 70.6 1477 96.4
## 2 Sex Female 97.1 68.8 1382 97.2
## 3 Birth … 1 98.2 66.3 694 97.1
## 4 Birth … 2-3 97.3 70.9 970 97.9
## 5 Birth … 4-5 98.2 71.5 636 96.8
## 6 Birth … 6+ 93.2 69.8 558 94.7
## 7 Reside… Urban 96.5 63.5 670 97.1
## 8 Reside… Rural 97.1 71.6 2189 96.7
## 9 Region South … 95.8 65.4 360 95.3
## 10 Region North … 96.9 67.0 313 95.1
## # ... with 26 more rows, and 2 more variables: `vacc-card-seen_age24-35` <chr>,
## # `denom_age24-35` <chr>
##
## [[2]]
## # A tibble: 216 x 6
## source tbl_id row_grp row_lbl indicator value
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 dhs_uganda_2016 tbl_10.4 Sex Male vacc-card-ever_age12-23m 96.7
## 2 dhs_uganda_2016 tbl_10.4 Sex Female vacc-card-ever_age12-23m 97.1
## 3 dhs_uganda_2016 tbl_10.4 Birth order 1 vacc-card-ever_age12-23m 98.2
## 4 dhs_uganda_2016 tbl_10.4 Birth order 2-3 vacc-card-ever_age12-23m 97.3
## 5 dhs_uganda_2016 tbl_10.4 Birth order 4-5 vacc-card-ever_age12-23m 98.2
## 6 dhs_uganda_2016 tbl_10.4 Birth order 6+ vacc-card-ever_age12-23m 93.2
## 7 dhs_uganda_2016 tbl_10.4 Residence Urban vacc-card-ever_age12-23m 96.5
## 8 dhs_uganda_2016 tbl_10.4 Residence Rural vacc-card-ever_age12-23m 97.1
## 9 dhs_uganda_2016 tbl_10.4 Region South Central vacc-card-ever_age12-23m 95.8
## 10 dhs_uganda_2016 tbl_10.4 Region North Central vacc-card-ever_age12-23m 96.9
## # ... with 206 more rows
Using scrape_table()
and pmap()
, we will scrape and tidy 8 tables
out of 4 survey reports: DHS
Uganda 2016;
MICS Democratic People’s Republic of
Korea 2017; DHS
Haiti 2016-2017;
and MICS
Paraguay 2016.
split_regex <- "\\s(?=\\(?\\d{1,2}[,.]\\d|^\\d+$|\\(?\\*\\)?|na\\s)" # same regex outline above
# create nested df of column headers
headr_df <- tribble(
~tbl_id, ~child, ~parent,
"hti_tbl_25.2", list(c("hiv-positive","denom"), c("hiv-positive","denom"), c("hiv-positive","denom")), c("female","male","all"),
"hti_tbl_28", list(c("fever-2wks-prior","denom"), c("treatment-advice-sought","finder-heel-blood-test","antibiotics","denom")),
c("age0-4m","age0-4m-fever"),
"prk_tc_1.1", list(c("vacc-record","mother-report","either","vacc-by-12-mos"),
c("vacc-record","mother-report","either","vacc-by-24-mos")),
c("age12-23m", "age24-35m"),
"prk_tc_2.1", c("episode-diarrhoea","ari-syptoms","episode-fever","denom"), "age0-59m",
"pry_nu_5", c("breastfeed-median-any","breastfeed-median-excl","breastfeed-median-predom","denom"), "age0-35m",
"pry_mt_2", c("computer-ever-used","computer-used-12m","computer-used-1perweek-lastmonth",
"internet-ever-used","internet-used-12m","internet-used-1perweek-lastmonth", "denom"), "age15-24y",
"uga_tbl_10.14", c("per-attend-ece","denom"),"age12-23m",
"uga_tbl_11.8", c("any-anaemia","mild-anaemia","moderate-anaemia","severe-anaemia","denom"), "age6-59m"
) %>%
nest(child, parent, .key = "headers")
# ensure tribble column names match scrape_table() argument names
tbl_scrape_args <- tribble(
~source, ~tbl_id, ~pg, ~keys,
"dhs_haiti_2017", "hti_tbl_25.2", 62, c("Milieu de résidence", "Ensemble 15-64"),
"dhs_haiti_2017", "hti_tbl_28", 66, c("Milieu de Résidence", "Ensemble"),
"mics_nkorea_2017", "prk_tc_1.1", 80, c("BCG", "^Number"),
"mics_nkorea_2017", "prk_tc_2.1", 84, c("Total", "40 percent highest"),
"mics_paraguay_2016", "pry_nu_5", 75, c("Mediana$", "Media$"),
"mics_paraguay_2016", "pry_mt_2", 241, c("Total", "idioma$"),
"dhs_uganda_2016", "uga_tbl_10.14", 221, c("Age", "Total"),
"dhs_uganda_2016", "uga_tbl_11.8", 247, c("Age in months", "Total")
) %>%
# add arguments that do not vary
mutate(source_folder = "report-pdfs",
str_split_regex = split_regex) %>%
# pull in column headers
left_join(., headr_df, by = "tbl_id")
# pass args tibble through pmap()
tbl_scrapes <- tbl_scrape_args %>%
mutate(output = pmap(., scrape_table))
tbl_scrapes
## # A tibble: 8 x 8
## source tbl_id pg keys source_folder str_split_regex headers output
## <chr> <chr> <dbl> <list> <chr> <chr> <list> <list>
## 1 dhs_haiti_… hti_tbl_… 62 <chr [… report-pdfs "\\s(?=\\(?\\d{1,2}[,.]\\d|^… <tibble [… <list …
## 2 dhs_haiti_… hti_tbl_… 66 <chr [… report-pdfs "\\s(?=\\(?\\d{1,2}[,.]\\d|^… <tibble [… <list …
## 3 mics_nkore… prk_tc_1… 80 <chr [… report-pdfs "\\s(?=\\(?\\d{1,2}[,.]\\d|^… <tibble [… <list …
## 4 mics_nkore… prk_tc_2… 84 <chr [… report-pdfs "\\s(?=\\(?\\d{1,2}[,.]\\d|^… <tibble [… <list …
## 5 mics_parag… pry_nu_5 75 <chr [… report-pdfs "\\s(?=\\(?\\d{1,2}[,.]\\d|^… <tibble [… <list …
## 6 mics_parag… pry_mt_2 241 <chr [… report-pdfs "\\s(?=\\(?\\d{1,2}[,.]\\d|^… <tibble [… <list …
## 7 dhs_uganda… uga_tbl_… 221 <chr [… report-pdfs "\\s(?=\\(?\\d{1,2}[,.]\\d|^… <tibble [… <list …
## 8 dhs_uganda… uga_tbl_… 247 <chr [… report-pdfs "\\s(?=\\(?\\d{1,2}[,.]\\d|^… <tibble [… <list …
Great! We have a data frame that includes all the input parameters as
well as a list-column containing the scraped and reshaped tables. Let’s
take a look at the output to make sure we are on track and then use
map_dfr
to row-bind all reshaped data frames into a single
master_df
:
# inspect some table output
tbl_scrapes[['output']][[4]][[1]]
## # A tibble: 26 x 6
## row_grp row_lbl `episode-diarrhoea_a… `ari-syptoms_age0… `episode-fever_age… `denom_age0-59m`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Total Total 11.4 0.5 6.8 2275
## 2 Sex Male 12.3 0.4 6.0 1164
## 3 Sex Female 10.5 0.6 7.6 1111
## 4 Area Urban 10.1 0.6 6.9 1361
## 5 Area Rural 13.4 0.4 6.6 914
## 6 Province Ryanggang 10.5 0.9 8.3 71
## 7 Province North Ham… 12.7 1.3 5.8 221
## 8 Province South Ham… 10.6 0.0 9.2 298
## 9 Province Kangwon 9.9 2.2 6.5 148
## 10 Province Jagang 12.8 0.4 7.3 131
## # ... with 16 more rows
tbl_scrapes[['output']][[4]][[2]]
## # A tibble: 104 x 6
## source tbl_id row_grp row_lbl indicator value
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 mics_nkorea_2017 prk_tc_2.1 Total Total episode-diarrhoea_age0-59m 11.4
## 2 mics_nkorea_2017 prk_tc_2.1 Sex Male episode-diarrhoea_age0-59m 12.3
## 3 mics_nkorea_2017 prk_tc_2.1 Sex Female episode-diarrhoea_age0-59m 10.5
## 4 mics_nkorea_2017 prk_tc_2.1 Area Urban episode-diarrhoea_age0-59m 10.1
## 5 mics_nkorea_2017 prk_tc_2.1 Area Rural episode-diarrhoea_age0-59m 13.4
## 6 mics_nkorea_2017 prk_tc_2.1 Province Ryanggang episode-diarrhoea_age0-59m 10.5
## 7 mics_nkorea_2017 prk_tc_2.1 Province North Hamgyong episode-diarrhoea_age0-59m 12.7
## 8 mics_nkorea_2017 prk_tc_2.1 Province South Hamgyong episode-diarrhoea_age0-59m 10.6
## 9 mics_nkorea_2017 prk_tc_2.1 Province Kangwon episode-diarrhoea_age0-59m 9.9
## 10 mics_nkorea_2017 prk_tc_2.1 Province Jagang episode-diarrhoea_age0-59m 12.8
## # ... with 94 more rows
# row-bind all reshaped data frames from 'output' list-column
full_rshp <- tbl_scrapes[["output"]] %>%
map_dfr(2) %>%
separate(source, c("svy_type", "country", "year"), sep = "_", remove = FALSE) %>%
separate(indicator, c("indicator","denom_grp"), sep = "_") %>%
mutate(value = as.numeric(value))
full_rshp
## # A tibble: 1,079 x 10
## source svy_type country year tbl_id row_grp row_lbl indicator denom_grp value
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 dhs_haiti… dhs haiti 2017 hti_tbl_… Milieu de ré… Urbain hiv-posit… female 2.2
## 2 dhs_haiti… dhs haiti 2017 hti_tbl_… Milieu de ré… Rural hiv-posit… female 2.4
## 3 dhs_haiti… dhs haiti 2017 hti_tbl_… Département Aire Métrop… hiv-posit… female 2.1
## 4 dhs_haiti… dhs haiti 2017 hti_tbl_… Département Reste Ouest hiv-posit… female 1.6
## 5 dhs_haiti… dhs haiti 2017 hti_tbl_… Département Sud-Est hiv-posit… female 3.2
## 6 dhs_haiti… dhs haiti 2017 hti_tbl_… Département Nord hiv-posit… female 2.7
## 7 dhs_haiti… dhs haiti 2017 hti_tbl_… Département Nord-Est hiv-posit… female 2.1
## 8 dhs_haiti… dhs haiti 2017 hti_tbl_… Département Artibonite hiv-posit… female 3.3
## 9 dhs_haiti… dhs haiti 2017 hti_tbl_… Département Centre hiv-posit… female 1.5
## 10 dhs_haiti… dhs haiti 2017 hti_tbl_… Département Sud hiv-posit… female 2
## # ... with 1,069 more rows
# implement a self-join to match denominator value for each observation
master_df <- full_rshp %>%
left_join(x = filter(., !str_detect(indicator, "denom")),
y = filter(., str_detect(indicator, "denom")),
by = c("source", "tbl_id", "row_grp", "row_lbl", "denom_grp")) %>%
select(source, svy_type = svy_type.x, country = country.x, year = year.x,
tbl_id, row_grp, row_lbl, indicator = indicator.x, value = value.x,
denom_grp, denom_value = value.y)
master_df
## # A tibble: 802 x 11
## source svy_type country year tbl_id row_grp row_lbl indicator value denom_grp denom_value
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 dhs_hai… dhs haiti 2017 hti_tb… Milieu d… Urbain hiv-posi… 2.2 female 4479
## 2 dhs_hai… dhs haiti 2017 hti_tb… Milieu d… Rural hiv-posi… 2.4 female 5118
## 3 dhs_hai… dhs haiti 2017 hti_tb… Départem… Aire Mét… hiv-posi… 2.1 female 2438
## 4 dhs_hai… dhs haiti 2017 hti_tb… Départem… Reste Ou… hiv-posi… 1.6 female 1559
## 5 dhs_hai… dhs haiti 2017 hti_tb… Départem… Sud-Est hiv-posi… 3.2 female 502
## 6 dhs_hai… dhs haiti 2017 hti_tb… Départem… Nord hiv-posi… 2.7 female 1044
## 7 dhs_hai… dhs haiti 2017 hti_tb… Départem… Nord-Est hiv-posi… 2.1 female 355
## 8 dhs_hai… dhs haiti 2017 hti_tb… Départem… Artiboni… hiv-posi… 3.3 female 1373
## 9 dhs_hai… dhs haiti 2017 hti_tb… Départem… Centre hiv-posi… 1.5 female 612
## 10 dhs_hai… dhs haiti 2017 hti_tb… Départem… Sud hiv-posi… 2 female 626
## # ... with 792 more rows
We just built a function that scraped 8 tables of data from 4 different PDFs and created a tidy data frame 👏 👏 👏