Skip to content

Latest commit

 

History

History
154 lines (118 loc) · 5.88 KB

readme.md

File metadata and controls

154 lines (118 loc) · 5.88 KB

Louisiana Tech was a women’s basketball powerhouse in the 1980s and ’90s. The Techsters lost the 1998 title game to Tennessee. GARY CASKEY / REUTERS

NCAA Women's Basketball Tournament

The data this week comes from FiveThirtyEight. The original raw data is on their [GitHub]. More details about the NCAA Women's Basketball Tournament which expended to 64 teams in 1994. There are some additional data points at that Wikipedia link if you're curious!

Note that for their dataviz, they converted seed to a 100 point scale based off the average wins/seed. Note that FiveThirtyEight used Simple Rating System scores from Sports-Reference, but I've simplified it into simply the average per seed.

A quick table of this as seen below:

To measure this, we awarded “seed points” in proportion to a given seed number’s expected wins in the tournament, calibrated to a 100-point scale where the No. 1 seed gets 100 points, No. 2 gets 70 points, and so forth.

This aligns to (based off the averages):

Seed Points
1st 100
2nd 72.7
3rd 54.5
4th 48.5
5th 33.3
6th 33.3
7th 27.3
8th 21.2
9th 18.2
10th 18.2
11th 18.2
12th 15.2
13th 9.09
14th 6.06
15th 3.03
16th 0

Their modeled fit:

You could see the quick plot of these points, but again note that this will vary a bit from the FiveThirtyEight table as they included SRS score in their equation.

tibble(
  seed = c(1:16),
  exp_wins = c(3.3, 2.4, 1.8, 1.6, 1.1, 1.1, 0.9, 0.7, 0.6, 0.6, 0.6, 0.5, 0.3, 0.2, 0.1, 0)
  
) %>% 
  mutate(
    points = exp_wins/3.3 * 100
  ) %>% 
  ggplot(aes(x = seed, y = points)) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 0, color = "black") +
  scale_y_continuous(breaks = seq(0, 100, by = 20)) +
  coord_cartesian(ylim = c(0, 100)) +
  scale_x_continuous(breaks = seq(1, 16, by = 3)) +
  theme_minimal() +
  labs(
    x = "Tournament Seed", y = "Seed Points",
    title = "How much is that seed worth?"
  )

Thus, to get the points for each team/tournament season you can multiply the teams initial seed (1 - 16) by the assigned points as defined in the above table of 100 - 0.

Get the data here

# Get the Data

# Read in with tidytuesdayR package 
# Install from CRAN via: install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest

# Either ISO-8601 date or year/week works!

tuesdata <- tidytuesdayR::tt_load('2020-10-06')
tuesdata <- tidytuesdayR::tt_load(2020, week = 41)

tournament <- tuesdata$tournament

# Or read in the data manually

tournament <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-10-06/tournament.csv')

Data Dictionary

Seed Point Value as defined in a separate FiveThirtyEight Article

Seed Points
1st 100
2nd 72.7
3rd 54.5
4th 48.5
5th 33.3
6th 33.3
7th 27.3
8th 21.2
9th 18.2
10th 18.2
11th 18.2
12th 15.2
13th 9.09
14th 6.06
15th 3.03
16th 0

tournament.csv

To get the points for each team/tournament season you can multiply the teams initial seed (1 - 16) by the assigned points as defined in the above table of 100 - 0.

variable class description
year double Tournament year
school character School name
seed double Seed rank
conference character Conference name
conf_w double Conference wins
conf_l double Conference losses
conf_percent double Conference win/loss percent
conf_place character Conference placement (ie, 1st, 2nd, etc)
reg_w double Regular season wins
reg_l double Regular season losses
reg_percent double Regular season win/loss percent
how_qual character How qualified - Whether the school qualified with an automatic bid (by winning its conference or conference tournament) or an at-large bid
x1st_game_at_home character Whether the school played its first-round tournament games on its home court.
tourney_w double Tournament wins
tourney_l double Tournament games losses
tourney_finish character Tournament finish - The round of the final game for each team. OR=opening-round loss (1983 only); 1st=first-round loss; 2nd=second-round loss; RSF=loss in the Sweet 16; RF=loss in the Elite Eight; NSF=loss in the national semifinals; N2nd=national runner-up; Champ=national champions
full_w double Total sum of wins
full_l double Total sum of losses
full_percent double Total sum win/loss percent

Cleaning Script

library(tidyverse)

raw_df <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/ncaa-womens-basketball-tournament/ncaa-womens-basketball-tournament-history.csv")

clean_tourn <- raw_df %>% 
  janitor::clean_names() %>% 
  mutate(across(c(seed, conf_w:conf_percent, full_percent), parse_number))

clean_tourn %>% 
  write_csv("2020/2020-10-06/tournament.csv")