---
title: "The Walking Dead IMDB web-scrape."
output: html_document
---
# Background
The Walking Dead is my favorite ever TV show, and I'm sad enough that I rate all of the episodes on IMDB. I recently had to brush the dust off some old notes on web-scraping in R using rvest for a work-related project. It then crossed my mind: why not have some fun with it, web-scrape the IMDB scores and visualise them to see how the show has progressed in the eyes of many? I find it entertaining and I get to practise some R, so it's a win-win. As a bonus, I can brush up my rvest knowledge and skills and create a project from it!
While seeking an answer to a problem I ran into when first scraping IMDB, I came across this great blog, which I took inspiration from and learnt a few things from; credit where credit is due: <https://eliocamp.github.io/codigo-r/en/2021/03/star-trek-rakings/>.
### Rvest
rvest is an incredibly easy R package for web-scraping structured and unstructured data. Fortunately, IMDB is relatively structured, so it is just a matter of identifying the correct divs and tags where all of the data is stored. There is some slight data wrangling, but that is it. The data is then good to go and play with.
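The whole workflow boils down to three verbs: read the page, select the nodes, extract the text or attribute. Here is a minimal sketch of the pattern (the selector is just a placeholder; the real classes are identified via dev tools below):
```{r, eval=FALSE}
library(rvest)
page <- read_html("https://www.imdb.com/title/tt1520211/episodes?season=1")
page %>%
  html_nodes(".some-css-selector") %>% # placeholder; use dev tools to find the real class
  html_text()
```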
## Initial set up.
Before embarking on a web-scrape you need to investigate the website using dev tools. This is to understand how the website is structured, how many weblinks you may need to scrape and, most importantly, where the divs and tags holding your data sit. Understanding what data is available, and what exactly you want to scrape, is also a necessity.
In this case, I want to grab every episode's user rating. This information can be found on each individual season's weblink, <https://www.imdb.com/title/tt1520211/episodes?season=1>, then season 2, 3 and so on. This is simple enough.
Dive under the hood by inspecting the dev tools (right-click on the element, click Inspect, and you will see the HTML code).
```{r, include=FALSE}
# load libraries
# Set up -------------------------
rm(list = ls()) # remove everything from the global environment.
library(tidyverse)
library(openxlsx)
library(readxl)
library(janitor)
library(stringr)
library(rvest) # the most common web-scraping R package (and very easy to pick up!)
library(tictoc) # simple functions to monitor code chunk run time
library(data.table)
library(plotly)
path <- setwd(stringr::str_extract(rstudioapi::getActiveDocumentContext()$path, ".+[/]")) # set working directory to where this file is saved.
`%notin%` <- Negate(`%in%`) # custom negate function
```
```{r}
knitr::include_graphics("img/wd_img_html_ep1.PNG")
```
There are multiple pieces of data we can extract.
Firstly, having highlighted the first episode and inspected it, reading through the HTML we can see the information is stored within the div with class "info". Within that div there are further divs breaking the section down. Following them through, the information we require is stored within the div with class "ipl-rating-widget". This can be broken down further by the <span> tags within the "ipl-rating-star" element, with each piece of data given a specific class.
A lot of divs, but reading through line by line the data we require can easily be identified, as the website is structured and labelled nicely for us to scrape.
The first items I have highlighted in blue: the episode title and a hyperlink which directs you to an additional page with more episode statistics (more on this in the demographics section). These are stored within the <strong> tag, and the hyperlink is identified using the href attribute.
The second bits of data, which I have highlighted in red, are the episode rating and the number of votes, stored within the <span> tags.
# Code testing
```{r}
url <- "https://www.imdb.com/title/tt1520211/episodes?season=1"
scraped_html <- rvest::read_html(url) # scrape HTML code
```
```{r}
# scrape episode title.
# Stored within the strong and a tags within the list_item div.
titles <-
scraped_html %>%
html_nodes(".list_item strong a") %>%
html_text()
print(titles)
```
Great, we know this has worked: season one has six episodes and six items were scraped. Simple.
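If you would rather check this programmatically than count the printed output, a quick assertion does the job:
```{r}
stopifnot(length(titles) == 6) # season one has six episodes, so expect six titles
```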
Before scraping the ratings, we need to store the web-link id of each episode, which links to its demographic data.
```{r}
# scrape each episode's hyperlink id.
# with this id we can build the unique web-link necessary to scrape from.
scrape_ids <-
  scraped_html %>%
  html_nodes(".list_item strong a") %>%
  html_attr("href")
# clean up the strings so we extract only what we need to generate a new webpage.
scrape_ids <- strsplit(scrape_ids, "/") %>% # split string on "/"
  vapply(function(x) x[3], character(1)) # extract the third item as a character
print(scrape_ids)
```
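Each of these ids slots straight into a ratings web-link; as a quick illustration (this is the same pattern used in the demographics section later):
```{r}
# build the ratings page web-link for the first episode
paste0("https://www.imdb.com/title/", scrape_ids[1], "/ratings/")
```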
Next, scrape the rating and number of votes, plus the final bit of information: the episode airdate. We are then complete and ready to combine this info into a usable format.
```{r}
# scrape each episode's rating using the specific class identifying the correct span.
# the rating sits within the div with class "info",
# then within the div with class "ipl-rating-star small";
# the final class identifies the span the rating itself is stored in.
ratings <-
  scraped_html %>%
  html_nodes(".info .ipl-rating-star.small .ipl-rating-star__rating") %>%
  html_text() %>%
  as.numeric() # convert scraped text to numeric.
# scrape episode votes
votes <-
scraped_html %>%
html_nodes(".info .ipl-rating-star.small .ipl-rating-star__total-votes") %>%
html_text() %>%
stringr::str_remove_all(.,"\\(|\\)|,") %>% # remove brackets and comma from scraped string
as.numeric() # convert text to numeric
# scrape episode air date
# this is a nice easy one as the div class is unique for this data item.
dates <-
scraped_html %>%
html_nodes(".airdate") %>%
html_text() %>%
lubridate::dmy() # convert to date type.
line1 <- paste0(titles[1], '\n', dates[1], '\n', ratings[1], '\n', votes[1])
cat(line1) # print each item on a separate line
```
We now have all the episode information we need. We can easily combine this to create a data table or tibble, and then bind all seasons' episode ratings together. Simple.
```{r}
df <-
tibble(
season = 1, # season 1 as the example
ep_no = "", # create episode number column
ep_title = titles,
ep_date = dates,
rating = ratings,
votes = votes
) %>%
mutate(ep_no = row_number()) # create episode number.
```
### Extract all seasons
And that is season one's episode ratings done; testing complete. Each season's webpage follows the same structure and will produce the same data format, which is perfect. We can wrap this in a function and run it for each season.
```{r}
get_wd_season_data <- function(seasonNo){
print(paste0("Web-scraping ratings data from season ",seasonNo))
#' function input is the season number.
# define web-link with season number input.
url <- paste0("https://www.imdb.com/title/tt1520211/episodes?season=",seasonNo)
scraped_html <- rvest::read_html(url) # scrape HTML
titles <- # scrape episode titles
scraped_html %>%
html_nodes(".list_item strong a") %>%
html_text()
ratings <-
scraped_html %>%
html_nodes(".info .ipl-rating-star.small .ipl-rating-star__rating") %>%
html_text() %>%
as.numeric() # convert scraped text to numeric.
# scrape episode votes
votes <-
scraped_html %>%
html_nodes(".info .ipl-rating-star.small .ipl-rating-star__total-votes") %>%
html_text() %>%
stringr::str_remove_all(.,"\\(|\\)|,") %>% # remove brackets and comma from scraped string
as.numeric() # convert text to numeric
# scrape episode air dates
dates <-
scraped_html %>%
html_nodes(".airdate") %>%
html_text() %>%
lubridate::dmy()
df <-
tibble(
season = seasonNo,
ep_no = "",
ep_title = titles,
ep_date = dates,
rating = ratings,
votes = votes
) %>%
mutate(ep_no = row_number())
}
# create season input.
seasons <- seq(1,10,1)
tic()
wd_data <- lapply(seasons,get_wd_season_data) %>% bind_rows()
toc()
```
All seasons' ratings data have now been scraped, cleaned and combined ready for use. Web-scraping in R really is simple and easy to do using rvest.
```{r}
print(tail(wd_data,5))
```
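One optional courtesy when looping over many pages: pause briefly between requests so the server is not hammered. A minimal sketch wrapping the same function (the half-second delay is an arbitrary choice, not something IMDB mandates):
```{r, eval=FALSE}
get_wd_season_data_polite <- function(seasonNo) {
  Sys.sleep(0.5) # brief pause between requests
  get_wd_season_data(seasonNo)
}
wd_data <- lapply(seasons, get_wd_season_data_polite) %>% bind_rows()
```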
# Demographic data
We identified that each episode has an individual hyperlink which takes users to a page with the demographic breakdown of the ratings for that episode. We previously scraped the unique id for each episode that we need to scrape the demography data. First we will examine and test the initial scrape of one episode before automating the process with a function.
## Code testing
As before, and with any web-page you scrape, you need to investigate the underlying HTML/CSS to identify which divs, tags and classes your data are stored within.
```{r}
knitr::include_graphics("img/wd_img_html_ep1demo.png")
```
There are three distinct tables of data on each web-link. The first is the ratings broken down by rating number, which will be interesting to plot across seasons. The second is a demographic table of ratings, and the third, off the bottom of the screenshot, captures US and non-US viewers. Conveniently, all of this data is stored in nice neat HTML tables, which makes scraping this page simpler.
```{r}
# scrape web-page html
url2 <- paste0("https://www.imdb.com/title/",scrape_ids[[1]],"/ratings/")
scraped_html2 <- read_html(url2)
#' the ratings website stores all data
#' conveniently within table divs.
#' all tables stored in list.
scraped_tables <-
scraped_html2 %>%
html_nodes("table") %>%
html_table()
# scrapes 3 unique tables
# ratings table
rating <- scraped_tables[[1]] %>% # extract ratings table from list
clean_names() %>%
mutate(
season = 1, # create season number column
ep_number = 1 # create episode number column
) %>%
relocate(c("season","ep_number"), .before = "rating") %>%
rename(rating_prop = x)
print(head(rating,5))
```
``` {r}
# compile demography data
# demog data requires splitting into two columns using separate
demo <- scraped_tables[[2]] %>% # extract demog data from list
clean_names() %>%
mutate(across(everything(), ~ stringr::str_squish(.x))) %>% # clean strings
pivot_longer(cols=c(-x), names_to = "demog", values_to = "val") %>%
separate(col = val, into = c("demo_rating","demo_votes"),sep = " ") %>%
mutate(
season = 1,
ep_number = 1
) %>%
relocate(c("season","ep_number"), .before = "x") %>%
rename(gender = x)
print(head(demo,5))
```
```{r}
# scrape us rating table
us_vote <- scraped_tables[[3]] %>% # extract us-rating data from list
clean_names() %>%
mutate(
season = 1,
ep_number = 1 # hard-coded for this single-episode test
) %>%
select(-1) %>%
mutate(across(everything(), ~ stringr::str_squish(.x))) %>%
relocate(c("season","ep_number"), .before = "us_users") %>%
pivot_longer(
cols = us_users:non_us_users,
names_to = "user_type", values_to = "val"
) %>%
separate(col = val, into = c("rating","votes"), sep = " ")
print(head(us_vote,2))
```
So far we have scraped all of season 1's data and demonstrated scraping one episode's demography data. Now we can create a function to extract a full season's demography data across every episode.
```{r}
get_demog_data <- function(seasonNo,ids){
# empty data tables to insert scraped data.
demo_df <- data.table()
rating_df <- data.table()
us_df <- data.table()
for(idNo in 1:length(ids)){
    #' In the initial IMDB scrape each episode had a unique
    #' hyperlink linking to the vote data broken down by demographics.
    #' The web-link only changes by its unique id number, so
    #' each link is scraped using the unique id from each href.
    #' Episodes are ordered 1:N, so the iteration number
    #' acts as the episode identifier.
print(paste0("Web-scraping season ",seasonNo," episode ",idNo," demographic breakdown"))
url2 <- paste0("https://www.imdb.com/title/",ids[[idNo]],"/ratings/")
scraped_html2 <- read_html(url2)
#' the ratings website stores all data
#' conveniently within table divs.
#' all tables stored in list.
scraped_tables <-
scraped_html2 %>%
html_nodes("table") %>%
html_table()
# demography table.
# data for rating and number of votes is within one cell.
# convert table to long format then split out into two columns.
demo <- scraped_tables[[2]] %>%
clean_names() %>%
mutate(across(everything(), ~ stringr::str_squish(.x))) %>%
pivot_longer(cols=c(-x), names_to = "demog", values_to = "val") %>%
separate(col = val, into = c("demo_rating","demo_votes"),sep = " ") %>%
mutate(
season = seasonNo,
ep_number = idNo
) %>%
relocate(c("season","ep_number"), .before = "x") %>%
rename(gender = x)
rating <- scraped_tables[[1]] %>%
clean_names() %>%
mutate(
season = seasonNo,
ep_number = idNo # idNo is the iteration number
) %>%
relocate(c("season","ep_number"), .before = "rating") %>%
rename(rating_prop = x)
us_vote <- scraped_tables[[3]] %>%
clean_names() %>%
mutate(
season = seasonNo,
ep_number = idNo # idNo is the iteration number
) %>%
select(-1) %>%
mutate(across(everything(), ~ stringr::str_squish(.x))) %>%
relocate(c("season","ep_number"), .before = "us_users") %>%
pivot_longer(
cols = us_users:non_us_users,
names_to = "user_type",
values_to = "val"
) %>%
separate(col = val, into = c("rating","votes"), sep = " ")
# bind data together for each episode breakdown
demo_df <- bind_rows(demo_df,demo)
rating_df <- bind_rows(rating_df,rating)
us_df <- bind_rows(us_df, us_vote)
}
# store all data within list.
demo_list <- list(demo = demo_df,
rating = rating_df,
us_df = us_df)
return(demo_list) # function returns list as object
} # end function
```
## Extract full datasets
The above function uses the scraped season data and ids to compile each episode's demographic breakdown, placing each dataset within a list to be called later. The inputs required are the season number and the unique web-link ids from that season. This function can therefore be combined with the same code we tested into one full function, which can then be run to scrape all seasons' ratings and demographic breakdowns.
```{r}
get_full_series_data <- function(seasonNo){
options(warn = -1) # turn off warning messages for the function (a negative value is needed; -0 is just the default 0)
print(paste0("Web-scraping ratings data from season ",seasonNo))
url <-
paste0("https://www.imdb.com/title/tt1520211/episodes?season=",seasonNo)
scraped_html <- read_html(url)
# scrape date of each episode
scraped_date <-
scraped_html %>%
html_nodes(".airdate") %>%
html_text() %>%
lubridate::dmy()
# scrape episode title
scraped_title <-
scraped_html %>%
html_nodes(".clear .list_item .info strong a") %>%
html_attr("title")
# scrape rating and votes
# Both are within the same div so can scrape together.
scraped_rating <-
scraped_html %>%
html_nodes(".clear .list_item .info .ipl-rating-star.small") %>%
html_text() %>%
stringr::str_squish()
# scrape each episode's hyperlink id
scraped_id <-
scraped_html %>%
html_nodes(".clear .list_item .info strong a") %>%
html_attr("href")
  # ids for the web-links of rating/demographic breakdowns.
  ids <- strsplit(scraped_id, "/") %>% # split string on "/"
    vapply(function(x) x[3], character(1)) # extract the third item as a character
df <-
tibble(
season=seasonNo,
ep ="",
ep_title = scraped_title,
ep_date = scraped_date,
rating = scraped_rating
) %>%
mutate(
ep = row_number(), # row number for each episode
rating = str_remove_all(rating,"\\(|\\)") # remove brackets
) %>%
    separate(
      col = rating,
      into = c("ep_rating", "ep_votes"),
      sep = " "
    ) # split the rating col out into two.
demog_data <- get_demog_data(seasonNo,ids)
data_list <- list(data = df, demog = demog_data)
return(data_list)
}
seasons <- seq(1, 10, 1)
tic()
wd_full_data <- lapply(seasons,get_full_series_data)
toc()
```
Now all the data is scraped. The data is stored within a list split into two parts: the first item, "data", holds the full episode ratings for each season; the second contains the three demography datasets. These are split across seasons, so they need binding together to make complete datasets. This can be done simply using lapply and the data.table package's rbindlist.
```{r}
# bind rating data
wd_data <-
rbindlist(lapply(wd_full_data,function(x)x$data),use.names=T,fill=T) %>%
mutate(ep_votes = str_remove(ep_votes,",")) %>%
mutate(across(c(ep_votes:ep_rating), ~ as.numeric(.x))) %>%
mutate(ep_no = row_number()) # to enable time-series plot.
# bind demography data
wd_data_demog <-
rbindlist(lapply(wd_full_data,function(x)x$demog$demo),use.names=T,fill=T) %>%
mutate(demo_votes = str_remove(demo_votes,",")) %>%
mutate(across(c(demo_rating:demo_votes), ~ as.numeric(.x)))
# bind rating breakdown data
wd_data_ratings <-
rbindlist(lapply(wd_full_data,function(x)x$demog$rating),use.names=T,fill=T)%>%
mutate(votes = str_remove(votes,",")) %>%
mutate(rating_prop = str_remove(rating_prop,"%")) %>%
mutate(across(c(rating_prop:votes), ~ as.numeric(.x)))
# bind us-rating data
wd_data_ratings_us <-
rbindlist(lapply(wd_full_data,function(x)x$demog$us_df),use.names=T,fill=T) %>%
mutate(votes = str_remove(votes,",")) %>%
mutate(across(c(rating:votes), ~ as.numeric(.x)))
# save data output
save(wd_data,wd_data_demog,wd_data_ratings,wd_data_ratings_us, file = "data_output/wd_data_full.RData")
```
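In a later session the saved datasets can be pulled straight back in without re-scraping:
```{r, eval=FALSE}
# reload all four datasets into the global environment
load("data_output/wd_data_full.RData")
```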
## Walking Dead viewership
This I have always found interesting. The Walking Dead at its peak was the most popular and most viewed show out there, larger than Game of Thrones, and deservedly so. How the ratings changed over time and then collapsed is quite stark. I want to see how the number of ratings, the actual ratings and viewership potentially correlate. We can easily scrape viewership figures from Wikipedia to create a new dataset.
```{r}
wiki_url <- "https://en.wikipedia.org/wiki/List_of_The_Walking_Dead_episodes"
wiki_html <- read_html(wiki_url)
# the tables we need are easily identifiable via the table div's class:
wiki_scrape <-
wiki_html %>%
html_nodes(".wikitable.plainrowheaders.wikiepisodetable") %>%
html_table()
# inspecting the scraped tables, we only need the first 11 items:
wiki_scrape <- wiki_scrape[1:11]
# inspecting the data further we can simply bind each table together as they are all in the same format.
# First we need to create the season number and clean the viewership data text.
# can simply create a function and apply this across each df in the list.
seasons <- seq(1, 11, 1)
func <- function(x, y) {
  x %>%
    clean_names() %>%
    mutate(
      season = y, # create season no.
      u_s_viewers_millions = # strip the bracketed footnotes from the text.
        as.numeric(gsub("\\[[^\\]]*\\]", "", u_s_viewers_millions, perl = TRUE))
    )
}
views <- mapply(FUN = func, x = wiki_scrape, y = seasons, SIMPLIFY = FALSE) %>%
  rbindlist(fill = TRUE) %>%
  select(-written_by_160, -original_air_date_161) # remove redundant columns
```
The Walking Dead IMDB ratings have now been scraped.
Next, how to visualise this information.
# Visualising The Walking Dead episodic rating history
There is a great amount of data compiled. I will explore different aspects separately, but ultimately I want to see clearly how IMDB ratings have changed over time and how they compare to my own ratings; then, how these ratings vary by age group and gender, and what interesting trends can be identified.
## Episode rating time-series
First, load my Walking Dead ratings as an input.
```{r}
my_scores = read_excel("wd_my_scores.xlsx")
wd_data %>%
ggplot(aes(x=ep_no,y=ep_rating))+
geom_line(
aes(x=ep_no,y=ep_rating,
color = as.character(season),
group=1),
size=1.2)+ #group = 1 ensures lines are connected across groups.
geom_line(
data = function(x) copy(x)[, ep_rating := mean(ep_rating), by = as.character(season)], # add a horizontal line for each season's average.
aes(color = as.character(season)),
size = 1) +
geom_line(
data=my_scores,
aes(x=ep_no,y=my_rating,group=season),
color="grey50",size=0.6,alpha=0.8)+
theme_minimal()+
scale_y_continuous("",limits = c(1, 10), n.breaks = 7)+ # fix y-axis scale
xlab("")+ggtitle("The Walking Dead Episodic Rating History")+
#scale_color_manual(values=c(colr))
scale_color_brewer(palette = "Paired",guide="none")+
theme(
plot.title = element_text(size = 12, hjust = 0.5),
axis.text.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()
)
# note: in order to group the data for colour, the x-axis comprises a simple 1:N sequence in steps of 1.
```
Clearly my scores peak much higher than the average rater's. I was mesmerised by seasons 3 to 6, specifically the first half of season 3, the end of season 4 and the majority of season 5 moving into season 6; this being peak Walking Dead for me. Then, strangely, my views tended to move towards the average following the decline from season 7. Season 9 was a genuinely good season and deserved more viewers, but the damage was done after the slow, laborious seasons 7 and 8 (which, at times, were still good overall).
**Back to plots..**
Alternatively, we can use ggplot's facet_grid to plot each individual season's ratings and place them side-by-side. I like both; each has its benefits.
```{r}
wd_data%>%
ggplot(aes(x=ep_no,y=ep_rating))+
geom_line(aes(color=as.character(season),group=1),size=1)+
facet_grid(~season, scales = "free_x",switch = 'x')+
theme_light()+
scale_y_continuous("",limits = c(3, 10), n.breaks = 7)+ # fix y-axis scale
xlab("")+ggtitle("The Walking Dead Episodic Rating History")+
scale_color_brewer(palette = "Paired",guide="none")+
theme(
panel.spacing = unit(-0.45, "lines"), # edit facet panel spaces so closer together.
axis.text.x = element_blank(),
axis.ticks.x=element_blank(), # remove x axis ticks
strip.placement = "bottom", # move facet labels to bottom of plot
plot.title = element_text(size = 11, hjust = 0.5),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(), # remove grid lines
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank()
)
```