Looking for a way to reinforce machine learning concepts, I happened upon Chris Albon’s Machine Learning Flashcards on Twitter. After reading the accompanying website, I noticed that Chris links to a Python repo that scrapes the flashcard images from his Twitter feed. I thought I would try to do the same using R, so here we go:
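In case you are following along, here are the packages this post leans on (my own summary of what the code below uses; load whichever you don’t already have attached):

library(rtweet)    # pulling tweets
library(tidyverse) # dplyr, tidyr, stringr, purrr, readr, forcats, ggplot2
library(magick)    # reading and writing the flashcard images
library(here)      # project-relative file paths
library(httr)      # raw calls to the 30-Day API
library(jsonlite)  # parsing JSON responses
library(janitor)   # tabyl() and the adorn_*() helpers
library(infer)     # simulation-based inference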
First, we have a go at pulling tweets from Chris’s feed via rtweet:
albon_tweets <- get_timeline(user = "chrisalbon", n = 3200) # max return value
head(albon_tweets)
#> # A tibble: 6 x 88
#> user_id status_id created_at screen_name text source display_text_wi… reply_to_status…
#> <chr> <chr> <dttm> <chr> <chr> <chr> <dbl> <chr>
#> 1 115185… 11012538… 2019-02-28 22:52:29 chrisalbon "@jG… Twitt… 69 110125104169218…
#> 2 115185… 11012529… 2019-02-28 22:48:55 chrisalbon @dpa… Twitt… 54 <NA>
#> 3 115185… 11012409… 2019-02-28 22:01:14 chrisalbon As a… Twitt… 74 <NA>
#> 4 115185… 11011904… 2019-02-28 18:40:30 chrisalbon Vari… Machi… 32 <NA>
#> 5 115185… 11008252… 2019-02-27 18:29:18 chrisalbon Deri… Machi… 34 <NA>
#> 6 115185… 11004606… 2019-02-26 18:20:41 chrisalbon Why … Machi… 56 <NA>
#> # … with 80 more variables: reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
#> # is_retweet <lgl>, favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#> # urls_url <list>, urls_t.co <list>, urls_expanded_url <list>, media_url <list>, media_t.co <list>,
#> # media_expanded_url <list>, media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#> # ext_media_expanded_url <list>, ext_media_type <chr>, mentions_user_id <list>,
#> # mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
#> # quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>, quoted_retweet_count <int>,
#> # quoted_user_id <chr>, quoted_screen_name <chr>, quoted_name <chr>, quoted_followers_count <int>,
#> # quoted_friends_count <int>, quoted_statuses_count <int>, quoted_location <chr>, quoted_description <chr>,
#> # quoted_verified <lgl>, retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
#> # retweet_source <chr>, retweet_favorite_count <int>, retweet_retweet_count <int>, retweet_user_id <chr>,
#> # retweet_screen_name <chr>, retweet_name <chr>, retweet_followers_count <int>,
#> # retweet_friends_count <int>, retweet_statuses_count <int>, retweet_location <chr>,
#> # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>, place_name <chr>,
#> # place_full_name <chr>, place_type <chr>, country <chr>, country_code <chr>, geo_coords <list>,
#> # coords_coords <list>, bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
#> # description <chr>, url <chr>, protected <lgl>, followers_count <int>, friends_count <int>,
#> # listed_count <int>, statuses_count <int>, favourites_count <int>, account_created_at <dttm>,
#> # verified <lgl>, profile_url <chr>, profile_expanded_url <chr>, account_lang <chr>,
#> # profile_banner_url <chr>, profile_background_url <chr>, profile_image_url <chr>
Okay, after reading Twitter’s standard search API documentation, we see that it only returns a sampling of a user’s tweets published in the past 7 days. So, we will be retrieving just a sample of Chris’s handy flashcards.
Also, there were nearly 90 variables returned so let’s take a glimpse and zero in on what is essential for this specific task:
glimpse(albon_tweets)
#> Observations: 436
#> Variables: 88
#> $ user_id <chr> "11518572", "11518572", "11518572", "11518572", "11518572", "11518572", "11…
#> $ status_id <chr> "1101253837317382144", "1101252937769238528", "1101240938813370368", "11011…
#> $ created_at <dttm> 2019-02-28 22:52:29, 2019-02-28 22:48:55, 2019-02-28 22:01:14, 2019-02-28 …
#> $ screen_name <chr> "chrisalbon", "chrisalbon", "chrisalbon", "chrisalbon", "chrisalbon", "chri…
#> $ text <chr> "@jGage718 @cercerilla @statwonk @digitalocean That doesnt even sound like …
#> $ source <chr> "Twitter for iPhone", "Twitter for iPhone", "Twitter for iPhone", "Machine …
#> $ display_text_width <dbl> 69, 54, 74, 32, 34, 56, 121, 74, 126, 75, 67, 4, 65, 3, 25, 48, 81, 194, 88…
#> $ reply_to_status_id <chr> "1101251041692184576", NA, NA, NA, NA, NA, NA, "1100119723310804994", NA, "…
#> $ reply_to_user_id <chr> "637197054", "14839109", NA, NA, NA, NA, NA, "110474012", NA, "22803302", "…
#> $ reply_to_screen_name <chr> "jGage718", "dpatil", NA, NA, NA, NA, NA, "MattGallagher0", NA, "ArmsContro…
#> $ is_quote <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ is_retweet <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…
#> $ favorite_count <int> 5, 4, 92, 15, 23, 20, 0, 0, 125, 0, 4, 1, 0, 1, 1, 2, 2, 55, 10, 0, 0, 11, …
#> $ retweet_count <int> 0, 0, 8, 0, 5, 1, 120, 0, 11, 0, 0, 0, 76, 0, 0, 0, 0, 4, 1, 168, 3, 3, 10,…
#> $ hashtags <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ symbols <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ urls_url <list> ["jobs.lever.co/devoted/", "twitter.com/jgage718/statu…", "twitter.com/jet…
#> $ urls_t.co <list> ["https://t.co/cKjUKd8pvo", "https://t.co/tKyVxVG1sf", "https://t.co/Plg5V…
#> $ urls_expanded_url <list> ["https://jobs.lever.co/devoted/", "https://twitter.com/jgage718/status/11…
#> $ media_url <list> [NA, NA, NA, "http://pbs.twimg.com/media/D0g22vCVYAASrTY.png", "http://pbs…
#> $ media_t.co <list> [NA, NA, NA, "https://t.co/vUoLldeqiA", "https://t.co/kweIzXBfjR", "https:…
#> $ media_expanded_url <list> [NA, NA, NA, "https://twitter.com/chrisalbon/status/1101190424276819968/ph…
#> $ media_type <list> [NA, NA, NA, "photo", "photo", "photo", NA, NA, NA, NA, NA, NA, "photo", N…
#> $ ext_media_url <list> [NA, NA, NA, "http://pbs.twimg.com/media/D0g22vCVYAASrTY.png", "http://pbs…
#> $ ext_media_t.co <list> [NA, NA, NA, "https://t.co/vUoLldeqiA", "https://t.co/kweIzXBfjR", "https:…
#> $ ext_media_expanded_url <list> [NA, NA, NA, "https://twitter.com/chrisalbon/status/1101190424276819968/ph…
#> $ ext_media_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ mentions_user_id <list> [<"637197054", "2777112211", "944231", "457033547">, "14839109", NA, NA, N…
#> $ mentions_screen_name <list> [<"jGage718", "cercerilla", "statwonk", "digitalocean">, "dpatil", NA, NA,…
#> $ lang <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "fr", "en…
#> $ quoted_status_id <chr> NA, "1101251379660836870", "1101200154961469440", NA, NA, NA, NA, NA, NA, N…
#> $ quoted_text <chr> NA, "What they didn’t tell you about the data team is that you get all the …
#> $ quoted_created_at <dttm> NA, 2019-02-28 22:42:43, 2019-02-28 19:19:10, NA, NA, NA, NA, NA, NA, NA, …
#> $ quoted_source <chr> NA, "Twitter for iPhone", "TweetDeck", NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ quoted_favorite_count <int> NA, 15, 1138, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2571,…
#> $ quoted_retweet_count <int> NA, 1, 571, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 770, NA…
#> $ quoted_user_id <chr> NA, "637197054", "18955413", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ quoted_screen_name <chr> NA, "jGage718", "jetjocko", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ quoted_name <chr> NA, "Justin Gage", "Adam Rogers", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ quoted_followers_count <int> NA, 2231, 14911, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 33…
#> $ quoted_friends_count <int> NA, 910, 1020, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 494,…
#> $ quoted_statuses_count <int> NA, 17459, 26848, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1…
#> $ quoted_location <chr> NA, "New York, NY", "California", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ quoted_description <chr> NA, "Product Data @digitalocean for that bread, freelance data content for …
#> $ quoted_verified <lgl> NA, FALSE, TRUE, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, TR…
#> $ retweet_status_id <chr> NA, NA, NA, NA, NA, NA, "1100432336905068544", NA, NA, NA, NA, NA, "1099900…
#> $ retweet_text <chr> NA, NA, NA, NA, NA, NA, "applying for a data science job in 2010 vs. applyi…
#> $ retweet_created_at <dttm> NA, NA, NA, NA, NA, NA, 2019-02-26 16:28:08, NA, NA, NA, NA, NA, 2019-02-2…
#> $ retweet_source <chr> NA, NA, NA, NA, NA, NA, "Twitter Web App", NA, NA, NA, NA, NA, "Twitter for…
#> $ retweet_favorite_count <int> NA, NA, NA, NA, NA, NA, 549, NA, NA, NA, NA, NA, 390, NA, NA, NA, NA, NA, N…
#> $ retweet_retweet_count <int> NA, NA, NA, NA, NA, NA, 120, NA, NA, NA, NA, NA, 76, NA, NA, NA, NA, NA, NA…
#> $ retweet_user_id <chr> NA, NA, NA, NA, NA, NA, "47436444", NA, NA, NA, NA, NA, "1558406653", NA, N…
#> $ retweet_screen_name <chr> NA, NA, NA, NA, NA, NA, "jrmontag", NA, NA, NA, NA, NA, "EpiEllie", NA, NA,…
#> $ retweet_name <chr> NA, NA, NA, NA, NA, NA, "Josh Montague \U0001f4ca\U0001f389", NA, NA, NA, N…
#> $ retweet_followers_count <int> NA, NA, NA, NA, NA, NA, 2946, NA, NA, NA, NA, NA, 4717, NA, NA, NA, NA, NA,…
#> $ retweet_friends_count <int> NA, NA, NA, NA, NA, NA, 2619, NA, NA, NA, NA, NA, 1172, NA, NA, NA, NA, NA,…
#> $ retweet_statuses_count <int> NA, NA, NA, NA, NA, NA, 40331, NA, NA, NA, NA, NA, 12631, NA, NA, NA, NA, N…
#> $ retweet_location <chr> NA, NA, NA, NA, NA, NA, "Golden + Boulder (CO)", NA, NA, NA, NA, NA, "", NA…
#> $ retweet_description <chr> NA, NA, NA, NA, NA, NA, "counts \U0001f425\U0001f4ac, drinks \u2615️, takes…
#> $ retweet_verified <lgl> NA, NA, NA, NA, NA, NA, FALSE, NA, NA, NA, NA, NA, FALSE, NA, NA, NA, NA, N…
#> $ place_url <chr> "https://api.twitter.com/1.1/geo/id/67b98f17fdcf20be.json", "https://api.tw…
#> $ place_name <chr> "Boston", "Boston", "Boston", NA, NA, NA, NA, "California", "California", "…
#> $ place_full_name <chr> "Boston, MA", "Boston, MA", "Boston, MA", NA, NA, NA, NA, "California, USA"…
#> $ place_type <chr> "city", "city", "city", NA, NA, NA, NA, "admin", "admin", "admin", "admin",…
#> $ country <chr> "United States", "United States", "United States", NA, NA, NA, NA, "United …
#> $ country_code <chr> "US", "US", "US", NA, NA, NA, NA, "US", "US", "US", "US", "US", NA, "US", "…
#> $ geo_coords <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA…
#> $ coords_coords <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA…
#> $ bbox_coords <list> [<-71.19142, -70.98600, -70.98600, -71.19142, 42.22780, 42.22780, 42.39954…
#> $ status_url <chr> "https://twitter.com/chrisalbon/status/1101253837317382144", "https://twitt…
#> $ name <chr> "Chris Albon", "Chris Albon", "Chris Albon", "Chris Albon", "Chris Albon", …
#> $ location <chr> "San Francisco", "San Francisco", "San Francisco", "San Francisco", "San Fr…
#> $ description <chr> "Using data to fight for something that matters. Data science @DevotedHealt…
#> $ url <chr> "https://t.co/CQhzAA24cn", "https://t.co/CQhzAA24cn", "https://t.co/CQhzAA2…
#> $ protected <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ followers_count <int> 31818, 31818, 31818, 31818, 31818, 31818, 31818, 31818, 31818, 31818, 31818…
#> $ friends_count <int> 705, 705, 705, 705, 705, 705, 705, 705, 705, 705, 705, 705, 705, 705, 705, …
#> $ listed_count <int> 1345, 1345, 1345, 1345, 1345, 1345, 1345, 1345, 1345, 1345, 1345, 1345, 134…
#> $ statuses_count <int> 485, 485, 485, 485, 485, 485, 485, 485, 485, 485, 485, 485, 485, 485, 485, …
#> $ favourites_count <int> 10316, 10316, 10316, 10316, 10316, 10316, 10316, 10316, 10316, 10316, 10316…
#> $ account_created_at <dttm> 2007-12-26 01:49:09, 2007-12-26 01:49:09, 2007-12-26 01:49:09, 2007-12-26 …
#> $ verified <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
#> $ profile_url <chr> "https://t.co/CQhzAA24cn", "https://t.co/CQhzAA24cn", "https://t.co/CQhzAA2…
#> $ profile_expanded_url <chr> "http://ChrisAlbon.com", "http://ChrisAlbon.com", "http://ChrisAlbon.com", …
#> $ account_lang <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en…
#> $ profile_banner_url <chr> "https://pbs.twimg.com/profile_banners/11518572/1503261622", "https://pbs.t…
#> $ profile_background_url <chr> "http://abs.twimg.com/images/themes/theme1/bg.png", "http://abs.twimg.com/i…
#> $ profile_image_url <chr> "http://pbs.twimg.com/profile_images/736992518110224384/fmqQxFEr_normal.jpg…
Right off the bat I see machinelearningflashcards.com under the urls_url column, which will be helpful in filtering the data. Additionally, media_url conveniently contains the flashcard image URL. I will be sure to grab those variables in addition to the Twitter handle and accompanying text for reference. Using a series of stringr functions, I strip out the name of the flashcard from text to be used later on. Lastly, a glance at the flashcard_name column shows that some flashcards are repeated, so let’s use distinct() to keep only the unique records, noting the convenient .keep_all = TRUE argument to retain all columns of the dataframe.
flash_df <- albon_tweets %>%
  select(screen_name, text, urls_url, media_url) %>%
  unnest(urls_url, .preserve = media_url) %>%
  filter(str_detect(urls_url, "machinelearning"), # keep tweets containing the flashcard url
         !is.na(media_url)) %>%                   # drop any tweets lacking an image
  unnest(media_url) %>%
  mutate(
    flashcard_name = text %>%
      str_extract(".+?(?=\\shttps)") %>%
      str_to_lower() %>%
      str_replace_all("\\s", "-")
  ) %>%
  distinct(flashcard_name, .keep_all = TRUE)
flash_df
#> # A tibble: 22 x 5
#> screen_name text urls_url media_url flashcard_name
#> <chr> <chr> <chr> <chr> <chr>
#> 1 chrisalbon Variance https://t.co/eZ2bbpDzwV… machinelearning… http://pbs.twimg.com/m… variance
#> 2 chrisalbon Derivative https://t.co/eZ2bbpDz… machinelearning… http://pbs.twimg.com/m… derivative
#> 3 chrisalbon Why Is It Called A Cost Function… machinelearning… http://pbs.twimg.com/m… why-is-it-called-a-…
#> 4 chrisalbon Meanshift Clustering By Analogy … machinelearning… http://pbs.twimg.com/m… meanshift-clusterin…
#> 5 chrisalbon Interquartile Range https://t.co… machinelearning… http://pbs.twimg.com/m… interquartile-range
#> 6 chrisalbon Training And Test Error https://… machinelearning… http://pbs.twimg.com/m… training-and-test-e…
#> 7 chrisalbon Saddle Point https://t.co/eZ2bbp… machinelearning… http://pbs.twimg.com/m… saddle-point
#> 8 chrisalbon Frobenius Norm https://t.co/eZ2b… machinelearning… http://pbs.twimg.com/m… frobenius-norm
#> 9 chrisalbon Hidden Layer https://t.co/eZ2bbp… machinelearning… http://pbs.twimg.com/m… hidden-layer
#> 10 chrisalbon Matthews Correlation Coefficient… machinelearning… http://pbs.twimg.com/m… matthews-correlatio…
#> # … with 12 more rows
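As an aside, here is what that stringr name-cleaning chain does to a single made-up tweet (the example text is hypothetical, not one of Chris’s actual tweets):

"Loss Function https://t.co/xxxxxxxxxx" %>%
  str_extract(".+?(?=\\shttps)") %>%  # everything before the trailing link
  str_to_lower() %>%
  str_replace_all("\\s", "-")
# expected result: "loss-function"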
Now that we have the flash_df dataframe containing the image URLs and names, let’s make a quick function to read and write the flashcard images using magick and then feed those parameters into pwalk to iterate through the flashcards that we identified above:
grab_flash <- function(flash_url, flash_name, folder) {
  flash_url %>%
    image_read() %>%
    image_write(here(folder, str_c(flash_name, ".png")))
}

params <- list(pull(flash_df, media_url), pull(flash_df, flashcard_name))
pwalk(params, ~grab_flash(.x, .y, "ml-flashcard-images"))
Trolling through #rstats Twitter, I came across this tweet soliciting responses on how people draw an X:
Interestingly, @SMASEY observed: “General consensus is that Americans do 7 & 8 while UK does 5 & 6. Probably how we were taught.” Is that the case? Let’s see what the data say.
The standard search API only returns data from the previous 7 days, which presents a problem, as this tweet is from January 20, 2019. There is, however, a way forward using the 30-Day API, which requires a registered developer account, a registered app, and a developer environment. See here for a walkthrough of that process. Note that while we can still search for free, we are now constrained to 100 tweets per request with a cap of 250 requests per month (25,000 tweets at most).
Digging into Twitter’s historical data means that we will have to leave the simplicity of rtweet behind and roll our own function to pull data from the 30-Day API.
pull_tweets <- function(url,               # 30-day search stem
                        dev_env,           # <YOUR_DEV_ENV_NAME>
                        tkn,               # <YOUR_BEARER_TOKEN> (see rtweet::bearer_token)
                        search = NULL,     # search terms
                        start_date = NULL, # <YYYYMMDDHHmm>
                        stop_date = NULL,  # <YYYYMMDDHHmm>
                        max_req = NULL     # integer to limit requests (250/month cap)
                        ) {
  # construct url
  thirty_url <- str_c(url, dev_env, ".json")

  # initial call
  res <- GET(thirty_url,
             query = list(query = search, fromDate = start_date, toDate = stop_date),
             add_headers(Authorization = tkn))

  out <- fromJSON(read_lines(res[["content"]]), flatten = TRUE) %>% .[['results']]
  nxt_tkn <- fromJSON(read_lines(res[["content"]])) %>% .[['next']]

  output_init <- list(list(df = out, nxt = nxt_tkn))
  print("call_1")

  # loop until max_req limit or `next` token unavailable
  i <- 1
  output_loop <- list()

  while (!is.null(nxt_tkn) && i <= max_req - 1) {
    res <- GET(thirty_url,
               query = list(query = search, fromDate = start_date, toDate = stop_date, `next` = nxt_tkn),
               add_headers(Authorization = tkn))

    out <- fromJSON(read_lines(res[["content"]]), flatten = TRUE) %>% .[['results']]
    nxt_tkn <- fromJSON(read_lines(res[["content"]])) %>% .[['next']]

    output_loop[[i]] <- list(df = out, nxt = nxt_tkn)

    i <- i + 1
    print(str_c("call_", i))
  }

  append(output_init, output_loop)
}
Let’s use pull_tweets to grab all responding tweets from the US and the UK. Boilerplate code is provided below should you wish to try using your own credentials.
pull_tweets(url = "https://api.twitter.com/1.1/tweets/search/30day/",
            dev_env = <YOUR_DEV_ENV_NAME>,
            tkn = <YOUR_BEARER_TOKEN>,
            search = "to:SMASEY place_country:US", # `place_country:GB` for UK tweets
            max_req = 10)                          # set a limit on number of requests
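The list returned by each call was then written to disk so it could be re-read later without spending more of the monthly request budget. A minimal sketch of that step (the object name us_tweets and the file name us_tweets.rds are placeholders; the how-to-draw-x_data folder matches what gets read back in below):

# assuming `us_tweets` holds the list returned by the US call above
write_rds(us_tweets, here("how-to-draw-x_data", "us_tweets.rds"))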
Here we read in the data collected using pull_tweets, wrangle the output into a tidy dataframe, and use some coarse regex to extract the answer, which should be an integer between 1 and 8.
tweets_df <- fs::dir_ls(here("how-to-draw-x_data"), glob = "*.rds") %>%
  map(read_rds) %>%        # read in output files
  map_depth(2, "df") %>%   # grab df inside each nested list
  flatten_dfr() %>%        # flatten list and row bind
  filter(str_detect(text, "@SMASEY\\shttps+", negate = TRUE)) %>%
  mutate(ans = text %>%
           str_extract("\\s\\d{1}(?!\\d)") %>%
           str_squish()) %>%
  na.omit()
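As a quick illustration of that coarse regex, here is how it behaves on a made-up reply (the example text is hypothetical):

str_squish(str_extract("@SMASEY definitely 7 for me", "\\s\\d{1}(?!\\d)"))
# expected: "7" (a single digit preceded by whitespace and not followed by another digit)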
In total, it looks like we grabbed 434 tweets: 260 from the US and 174 from the UK. Let’s take a quick look at the distribution of answer choices by country.
# histograms
ggplot(tweets_df, aes(x = as.numeric(ans), fill = country_code)) +
  geom_histogram(binwidth = 1, color = "white") +
  geom_text(stat = 'count', aes(label = ..count.., vjust = -0.2)) +
  scale_x_continuous(breaks = seq(1, 8, 1)) +
  coord_cartesian(clip = "off") +
  facet_wrap(~country_code, nrow = 2) +
  theme_minimal(base_size = 10) +
  labs(x = "answer choice", y = "count") +
  guides(fill = FALSE, color = FALSE)
Okay, it is clear that 7 & 8 are the most popular answers in both countries followed by 5 & 6. Answer choices 1-4 are infrequent (fewer than ~4% of responses in both cases). However, it is not clear what, if any, association exists between country and answer choice.
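If you want the exact percentages behind that statement, a quick tally will do; a minimal sketch, assuming the ans and country_code columns created above:

tweets_df %>%
  count(country_code, ans) %>%
  group_by(country_code) %>%
  mutate(prop = n / sum(n)) %>% # share of each answer choice within a country
  arrange(country_code, ans)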
Let’s use the infer package to run a simulation-based test to investigate whether there is an association between location and answer choice. The infer package’s intuitive design makes it straightforward for us to: (1) calculate our chi-squared statistic; (2) simulate a null distribution through permutation; and (3) calculate the proportion of replicates with a chi-squared statistic as or more extreme than the observed statistic, to determine significance.
# for reproducibility
set.seed(2)
# set up df
chisq_df <- tweets_df %>%
  mutate(ans_fct = fct_lump(ans, prop = 0.05), # use fct_lump() for infrequent factor levels
         country_code = factor(country_code))
# calculate test statistic
obs_chisq <- chisq_stat(chisq_df, ans_fct ~ country_code)
obs_chisq
#> # A tibble: 1 x 1
#> stat
#> <dbl>
#> 1 24.4
# generate null distribution
null_dist_chisq <- chisq_df %>%
  specify(ans_fct ~ country_code) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 3000, type = "permute") %>%
  calculate(stat = "Chisq")

# inspect test statistic within null distribution
visualize(null_dist_chisq) +
  shade_p_value(obs_stat = obs_chisq, direction = "greater")
# grab p-value
pval_chisq <- get_p_value(null_dist_chisq, obs_stat = obs_chisq, direction = "greater")
pval_chisq
#> # A tibble: 1 x 1
#> p_value
#> <dbl>
#> 1 0.000333
After generating a simulation-based null distribution of chi-squared statistics, we see that there is a 0.03% chance of observing a chi-squared value at least as large as 24.4 in a world where there is no association between country and ans. So, we observe strong evidence in support of a significant association between location and how one draws an X.
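That simulated p-value is just the share of permutation replicates with a statistic at least as large as the one we observed; a quick sanity check, assuming the stat column that infer’s calculate() returns:

mean(null_dist_chisq$stat >= pull(obs_chisq, stat))
# should agree with the value from get_p_value() above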
Well, we’ve seen that there is some association between how an X is drawn and location, so let’s be a bit more specific and tease apart this observation that 7 & 8 might be particularly American. To do this, we will set up a dataframe with a collapsed factor variable for answers 7 & 8, take a peek at the proportions of answers 7 & 8 by location, and then test any observed difference in proportions across countries.
# set up df
props_df <- tweets_df %>%
  mutate(ans_clps = fct_collapse(ans,
                                 one_six = c("1","2","3","4","5","6"),
                                 svn_egt = c("7","8")),
         country_code = factor(country_code))

# calculate counts and proportions
props_df %>%
  tabyl(country_code, ans_clps) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()
#> country_code one_six svn_egt
#> GB 33.9% (59) 66.1% (115)
#> US 17.3% (45) 82.7% (215)
We observe a difference of ~17 percentage points for answers 7 & 8 between US and UK responses. Let’s employ the same infer workflow from before to run a simulation-based test on this observed difference in proportions.
# for reproducibility
set.seed(20)
# calculate test statistic
d_hat <- props_df %>%
  specify(ans_clps ~ country_code, success = "svn_egt") %>%
  calculate(stat = "diff in props", order = c("US", "GB"))

# generate null distribution
null_dist_props <- props_df %>%
  specify(ans_clps ~ country_code, success = "svn_egt") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 3000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("US", "GB"))

# inspect test statistic within null distribution
visualize(null_dist_props) +
  shade_p_value(obs_stat = d_hat, direction = "two_sided")
# grab p-value
pval_props <- get_p_value(null_dist_props, obs_stat = d_hat, direction = "two_sided")
pval_props
#> # A tibble: 1 x 1
#> p_value
#> <dbl>
#> 1 0.000667
We see that there is a 0.07% chance of a test statistic at least as extreme as ±0.166 in a world where there is no difference in proportions by location. So, we observe strong evidence in support of a significant difference in responses of 7 & 8 by location, with a higher proportion of 7 & 8’s from the US.
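One way to compute that two-sided p-value by hand is to take the share of permuted differences at least as large in absolute value as the observed ~0.166; note that infer’s get_p_value() may use a slightly different (doubled one-tail) convention, so treat this only as a rough check:

mean(abs(null_dist_props$stat) >= abs(pull(d_hat, stat)))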