This week we're exploring the CIA World Factbook! The dataset comes from the CIA Factbook, Country Comparisons, 2014, via the {openintro} R package, via the {usdatasets} R package, via this post on LinkedIn.
The World Factbook provides basic intelligence on the history, people, government, economy, energy, geography, environment, communications, transportation, military, terrorism, and transnational issues for 265 world entities.
Which countries have the highest number of internet users per square kilometer? Which countries have the highest percentage of internet users?
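As a starting point for these two questions, here is a minimal sketch that derives both measures from the `area`, `internet_users`, and `population` columns described in the data dictionary. The helper name `add_internet_metrics()` and the two new column names are assumptions made for illustration; they are not part of the dataset or of any package.

```r
library(dplyr)

# Hypothetical helper (not part of the dataset or any package): adds two
# derived columns to a data frame that has the documented `area`,
# `internet_users`, and `population` columns.
add_internet_metrics <- function(data) {
  data |>
    dplyr::mutate(
      # Internet users per square kilometer of total area
      internet_users_per_km2 = .data$internet_users / .data$area,
      # Internet users as a percentage of the total population
      internet_users_pct = 100 * .data$internet_users / .data$population
    )
}

# Possible usage once `cia_factbook` has been loaded:
# cia_factbook |>
#   add_internet_metrics() |>
#   dplyr::slice_max(internet_users_per_km2, n = 10)
```

Rows with a missing `area`, `internet_users`, or `population` produce `NA` in the derived columns, so it may be worth filtering those out before ranking.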
You might want to join this dataset with past TidyTuesday datasets that featured country information!
# pak::pak("r4ds/ttmeta")
library(tidyverse)
library(ttmeta)
country_datasets <- ttmeta::tt_datasets_metadata |>
  dplyr::mutate(
    has_country = purrr::map_lgl(
      .data$variable_details,
      \(var_dets) {
        !is.null(var_dets) &&
          any(stringr::str_detect(tolower(var_dets$variable), "country"))
      }
    )
  ) |>
  dplyr::filter(has_country)
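Once you have picked one of those datasets, the join itself could look roughly like the sketch below. Country names are often spelled or capitalized differently between sources, so this version joins on a lowercased, trimmed key. The helper `join_by_country()`, its arguments, and the `country_key` column are all hypothetical names introduced here for illustration.

```r
library(dplyr)

# Hypothetical helper: left-join two country-level tables on a normalized
# (lowercased, whitespace-trimmed) copy of their country-name columns.
join_by_country <- function(x, y, x_col = "country", y_col = "country") {
  x_keyed <- x |>
    dplyr::mutate(
      country_key = tolower(trimws(as.character(.data[[x_col]])))
    )
  y_keyed <- y |>
    dplyr::mutate(
      country_key = tolower(trimws(as.character(.data[[y_col]])))
    ) |>
    # Drop y's original name column so only x's spelling is kept
    dplyr::select(-dplyr::all_of(y_col))

  dplyr::left_join(x_keyed, y_keyed, by = "country_key")
}

# Possible usage, where `other_data` stands in for a past dataset:
# joined <- join_by_country(cia_factbook, other_data, y_col = "Country")
```

Normalizing case and whitespace will not reconcile genuinely different names for the same country, so checking the unmatched rows with `dplyr::anti_join()` on the same key is a sensible follow-up.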
# Option 1: tidytuesdayR package
## install.packages("tidytuesdayR")
tuesdata <- tidytuesdayR::tt_load('2024-10-22')
## OR
tuesdata <- tidytuesdayR::tt_load(2024, week = 43)
cia_factbook <- tuesdata$cia_factbook
# Option 2: Read directly from GitHub
cia_factbook <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-10-22/cia_factbook.csv')
- Explore the data, watching out for interesting relationships. We would like to emphasize that you should not draw conclusions about causation in the data. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our suggestion is to use the data provided to practice your data tidying and plotting techniques, and to consider for yourself what nuances might underlie these relationships.
- Create a visualization, a model, a shiny app, or some other piece of data-science-related output, using R or another programming language.
- Share your output and the code used to generate it on social media with the #TidyTuesday hashtag.
- Submit your own dataset!
variable | class | description |
---|---|---|
country | character | Name of the country (factor with 259 levels in the source data). |
area | integer | Total area of the country (in square kilometers). |
birth_rate | double | Birth rate (number of live births per 1,000 people). |
death_rate | double | Death rate (number of deaths per 1,000 people). |
infant_mortality_rate | double | Infant mortality rate (number of deaths of infants under one year old per 1,000 live births). |
internet_users | integer | Number of internet users. |
life_exp_at_birth | double | Life expectancy at birth (in years). |
maternal_mortality_rate | integer | Maternal mortality rate (number of maternal deaths per 100,000 live births). |
net_migration_rate | double | Net migration rate (number of migrants per 1,000 people). |
population | integer | Total population of the country. |
population_growth_rate | double | Population growth rate (multiplier). |
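
By default `readr::read_csv()` guesses numeric columns as double, so the count columns will not come back as integers unless you ask for them. The sketch below spells out column types that follow the data dictionary, treating `country` as text. The object name `cia_col_types` is an assumption made for illustration.

```r
library(readr)

# Column specification mirroring the data dictionary.
# `cia_col_types` is a hypothetical name, not part of any package.
cia_col_types <- readr::cols(
  country = readr::col_character(),
  area = readr::col_integer(),
  birth_rate = readr::col_double(),
  death_rate = readr::col_double(),
  infant_mortality_rate = readr::col_double(),
  internet_users = readr::col_integer(),
  life_exp_at_birth = readr::col_double(),
  maternal_mortality_rate = readr::col_integer(),
  net_migration_rate = readr::col_double(),
  population = readr::col_integer(),
  population_growth_rate = readr::col_double()
)

# Possible usage with the GitHub URL from the loading code:
# cia_factbook <- readr::read_csv(
#   "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-10-22/cia_factbook.csv",
#   col_types = cia_col_types
# )
```

If any value fails to parse as the requested type, readr reports it through `readr::problems()` rather than stopping, which makes mismatches easy to inspect.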
# Mostly clean data provided by the {usdatasets} R package
# (https://cran.r-project.org/package=usdatasets). No cleaning was necessary
# beyond converting two columns to integer.
# pak::pak("usdatasets")
library(dplyr)
library(usdatasets)
cia_factbook <- usdatasets::cia_factbook_tbl_df |>
  dplyr::mutate(
    dplyr::across(
      c("area", "internet_users"),
      as.integer
    )
  )