Skip to content

Latest commit

 

History

History
117 lines (80 loc) · 2.99 KB

readme.md

File metadata and controls

117 lines (80 loc) · 2.99 KB

Women of 2020

The data this week comes from the BBC by way of Joshua Feldman.

The BBC has revealed its list of 100 inspiring and influential women from around the world for 2020.

This year 100 Women is highlighting those who are leading change and making a difference during these turbulent times.

The list includes Sanna Marin, who leads Finland's all-female coalition government, Michelle Yeoh, star of the new Avatar and Marvel films and Sarah Gilbert, who heads the Oxford University research into a coronavirus vaccine, as well as Jane Fonda, a climate activist and actress.

And in an extraordinary year - when countless women around the world have made sacrifices to help others - one name on the 100 Women list has been left blank as a tribute.

Get the data here

# Get the Data

# Read in with tidytuesdayR package 
# Install from CRAN via: install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest

# Either ISO-8601 date or year/week works!

tuesdata <- tidytuesdayR::tt_load('2020-12-08')
tuesdata <- tidytuesdayR::tt_load(2020, week = 50)

women <- tuesdata$women

# Or read in the data manually

women <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-12-08/women.csv')

Data Dictionary

women.csv

variable class description
name character Name of woman
img character Link to headshot
category character Category for award
country character Country of residence
role character Role/Career
description character Description of the woman and their achievements

Cleaning Script

# Load packages

library(rvest)
library(tidyverse)

# Load web page

bbc_women <- html("https://www.bbc.co.uk/news/world-55042935")

# Save even and odd indices for data extraction later

odd_index <- seq(1,200,2)
even_index <- seq(2,200,2)

# Extract name

name <- bbc_women %>% 
  html_nodes("article h4") %>% 
  html_text()

# Extract image

img <- bbc_women %>% 
  html_nodes(".card__header") %>% 
  html_nodes("img") %>% 
  html_attr("src")

img <- img[odd_index]

# Extract category

category <- bbc_women %>% 
  html_nodes("article .card") %>% 
  str_extract("card category--[A-Z][a-z]+") %>% 
  str_remove_all("card category--")

# Extract country & role

country_role <- bbc_women %>% 
  html_nodes(".card__header__strapline__location") %>% 
  html_text()

country <- country_role[odd_index]
role <- country_role[even_index]

# Extract description

description <- bbc_women %>% 
  html_nodes(".first_paragraph") %>% 
  html_text()

# Finalise data frame

df <- data.frame(
  name,
  img,
  category,
  country,
  role,
  description
)

# Export

write.csv(df, "data.csv", row.names=FALSE)