Commit

Merge pull request #24 from posit-conf-2023/target
Changes slides to open in new tabs + exercise cleanup
stephhazlitt authored Sep 12, 2023
2 parents e82a891 + 0744b31 commit 9fadf3e
Showing 7 changed files with 19 additions and 11 deletions.
@@ -1,7 +1,7 @@
{
"hash": "53f610ff8cc8524ff8fdda04614a7b6f",
"hash": "46f80887ae90f6136973650347ae57b7",
"result": {
"markdown": "---\ntitle: \"Hello Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi <- open_dataset(here::here(\"data/nyc-taxi\"))\nnyc_taxi\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset with 122 Parquet files\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1155795912\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(year %in% 2014:2017) |> \n group_by(year) |>\n summarize(\n all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n ) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 4\n year all_trips shared_trips pct_shared\n <int> <int> <int> <dbl>\n1 2014 165114361 48816505 29.6\n2 2015 146112989 43081091 29.5\n3 2016 131165043 38163870 29.1\n4 2017 113495512 32296166 28.5\n```\n:::\n:::\n\n\n::: {#exercise-hello-nyc-taxi .callout-tip}\n## Exercises: First {dplyr} pipeline with Arrow\n\n::: panel-tabset\n## Problems\n\n1. Calculate the total number of rides for every month in 2019\n2. 
About how long did this query of 1.15 billion rows take?\n\n## Solution 1\n\nTotal number of rides for every month in 2019:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n filter(year == 2019) |>\n count(month) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 2\n month n\n <int> <int>\n 1 1 7667255\n 2 11 6877463\n 3 12 6895933\n 4 10 7213588\n 5 2 7018750\n 6 3 7832035\n 7 4 7432826\n 8 5 7564884\n 9 6 6940489\n10 7 6310134\n11 8 6072851\n12 9 6567396\n```\n:::\n:::\n\n\n## Solution 2\n\nCompute time for querying the 1.15 billion rows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n filter(year == 2019) |>\n group_by(month) |>\n summarize(longest_trip = max(trip_distance, na.rm = TRUE)) |>\n arrange(month) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 2.844 0.175 0.331 \n```\n:::\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tictoc)\n\ntic()\nnyc_taxi |> \n filter(year == 2019) |>\n group_by(month) |>\n summarize(longest_trip = max(trip_distance, na.rm = TRUE)) |>\n arrange(month) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 2\n month longest_trip\n <int> <dbl>\n 1 1 832.\n 2 2 702.\n 3 3 237.\n 4 4 831.\n 5 5 401.\n 6 6 45977.\n 7 7 312.\n 8 8 602.\n 9 9 604.\n10 10 308.\n11 11 701.\n12 12 19130.\n```\n:::\n\n```{.r .cell-code}\ntoc()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n0.379 sec elapsed\n```\n:::\n:::\n\n:::\n:::\n",
"markdown": "---\ntitle: \"Hello Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi <- open_dataset(here::here(\"data/nyc-taxi\"))\n```\n:::\n\n\n\n\n\n\n::: {#exercise-hello-nyc-taxi .callout-tip}\n## Exercises: First {dplyr} pipeline with Arrow\n\n::: panel-tabset\n## Problems\n\n1. Calculate the total number of rides for every month in 2019\n2. About how long did this query of 1.15 billion rows take?\n\n## Solution 1\n\nTotal number of rides for every month in 2019:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n filter(year == 2019) |>\n count(month) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 2\n month n\n <int> <int>\n 1 1 7667255\n 2 11 6877463\n 3 10 7213588\n 4 12 6895933\n 5 3 7832035\n 6 5 7564884\n 7 2 7018750\n 8 4 7432826\n 9 6 6940489\n10 7 6310134\n11 9 6567396\n12 8 6072851\n```\n:::\n:::\n\n\n## Solution 2\n\nCompute time for querying the 1.15 billion rows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n filter(year == 2019) |>\n group_by(month) |>\n summarize(longest_trip = max(trip_distance, na.rm = TRUE)) |>\n arrange(month) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 2.552 0.307 0.705 \n```\n:::\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tictoc)\n\ntic()\nnyc_taxi |> \n filter(year == 2019) |>\n group_by(month) |>\n summarize(longest_trip = max(trip_distance, na.rm = TRUE)) |>\n arrange(month) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 2\n month longest_trip\n <int> <dbl>\n 1 1 832.\n 2 2 702.\n 3 3 237.\n 4 4 831.\n 5 5 401.\n 6 6 45977.\n 7 7 312.\n 8 8 602.\n 9 9 604.\n10 10 308.\n11 11 701.\n12 12 19130.\n```\n:::\n\n```{.r .cell-code}\ntoc()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n0.492 
sec elapsed\n```\n:::\n:::\n\n:::\n:::\n",
"supporting": [
"1_hello_arrow-exercises_files"
],
@@ -1,7 +1,7 @@
{
"hash": "00263065051d39e9d1c386cd2685b77b",
"hash": "847d8f98a2ba0c8cb036b54821c7fd5c",
"result": {
"markdown": "---\ntitle: \"Data Manipulation Part 1 - Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\nlibrary(stringr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi <- open_dataset(here::here(\"data/nyc-taxi\"))\nnyc_taxi\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset with 120 Parquet files\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\n```\n:::\n:::\n\n\n::: {#exercise-compute-collect .callout-tip}\n# Using `compute()` and `collect()`\n\n::: panel-tabset\n## Problem\n\n1. How many taxi fares in the dataset had a total amount greater than \\$100?\n\n2. How many distinct pickup locations (distinct combinations of the `pickup_latitude` and `pickup_longitude` columns) are in the dataset since 2016? 
Use `nrow()` to work this out.\n\n## Solution 1\n\n\n::: {.cell hash='2_data_manipulation_1-exercises_cache/html/compute-collect-1_22f5a7e3ca42f31be95226ca75ff8140'}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(total_amount > 100) |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1518869\n```\n:::\n:::\n\n\n## Solution 2\n\n\n::: {.cell hash='2_data_manipulation_1-exercises_cache/html/compute-collect-2_d78c6443e0ec2328bb8e454462d97d56'}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(year >= 2016) |>\n distinct(pickup_longitude, pickup_latitude) |>\n compute() |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 29105801\n```\n:::\n:::\n\n:::\n:::\n\n::: {#exercise-dplyr-api .callout-tip}\n# Using the dplyr API in arrow\n\n::: panel-tabset\n## Problem\n\n1. Use the `dplyr::filter()` and `stringr::str_ends()` functions to return a subset of the data which is a) from September 2020, and b) the value in `vendor_name` ends with the letter \"S\".\n\n2. Try to use the `stringr` function `str_replace_na()` to replace any `NA` values in the `vendor_name` column with the string \"No vendor\" instead. What happens, and why?\n\n3. Bonus question: see if you can find a different way of completing the task in question 2.\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(str_ends(vendor_name, \"S\"), year == 2020, month == 9) |>\n collect()\n```\n:::\n\n\n## Solution 2 and 3\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n mutate(vendor_name = stringr::str_replace_na(vendor_name, \"No vendor\")) |>\n head() |>\n collect()\n```\n:::\n\n\nThis won't work as `stringr::str_replace_na()` hasn't been implemented in Arrow. 
You could try using `mutate()` and `ifelse()` here instead.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n mutate(vendor_name = ifelse(is.na(vendor_name), \"No vendor\", vendor_name)) |>\n head() |>\n collect()\n```\n:::\n\n\nOr, if you only needed a subset of the data, you could apply the function after collecting it into R memory.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(year == 2019, month == 10) |> # smaller subset of the data\n collect() |>\n mutate(vendor_name = stringr::str_replace_na(vendor_name, \"No vendor\"))\n```\n:::\n\n\n\n:::\n:::\n",
"markdown": "---\ntitle: \"Data Manipulation Part 1 - Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\nlibrary(stringr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi <- open_dataset(here::here(\"data/nyc-taxi\"))\n```\n:::\n\n\n::: {#exercise-compute-collect .callout-tip}\n# Using `compute()` and `collect()`\n\n::: panel-tabset\n## Problem\n\n1. How many taxi fares in the dataset had a total amount greater than \\$100?\n\n2. How many distinct pickup locations (distinct combinations of the `pickup_latitude` and `pickup_longitude` columns) are in the dataset since 2016? Use `nrow()` to work this out.\n\n## Solution 1\n\n\n::: {.cell hash='2_data_manipulation_1-exercises_cache/html/compute-collect-1_22f5a7e3ca42f31be95226ca75ff8140'}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(total_amount > 100) |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1518869\n```\n:::\n:::\n\n\n## Solution 2\n\n\n::: {.cell hash='2_data_manipulation_1-exercises_cache/html/compute-collect-2_d78c6443e0ec2328bb8e454462d97d56'}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(year >= 2016) |>\n distinct(pickup_longitude, pickup_latitude) |>\n compute() |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 29105801\n```\n:::\n:::\n\n:::\n:::\n\n::: {#exercise-dplyr-api .callout-tip}\n# Using the dplyr API in arrow\n\n::: panel-tabset\n## Problem\n\n1. Use the `dplyr::filter()` and `stringr::str_ends()` functions to return a subset of the data which is a) from September 2020, and b) the value in `vendor_name` ends with the letter \"S\".\n\n2. Try to use the `stringr` function `str_replace_na()` to replace any `NA` values in the `vendor_name` column with the string \"No vendor\" instead. What happens, and why?\n\n3. 
Bonus question: see if you can find a different way of completing the task in question 2.\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(str_ends(vendor_name, \"S\"), year == 2020, month == 9) |>\n collect()\n```\n:::\n\n\n## Solution 2 and 3\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n mutate(vendor_name = stringr::str_replace_na(vendor_name, \"No vendor\")) |>\n head() |>\n collect()\n```\n:::\n\n\nThis won't work as `stringr::str_replace_na()` hasn't been implemented in Arrow. You could try using `mutate()` and `ifelse()` here instead.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n mutate(vendor_name = ifelse(is.na(vendor_name), \"No vendor\", vendor_name)) |>\n head() |>\n collect()\n```\n:::\n\n\nOr, if you only needed a subset of the data, you could apply the function after collecting it into R memory.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(year == 2019, month == 10) |> # smaller subset of the data\n collect() |>\n mutate(vendor_name = stringr::str_replace_na(vendor_name, \"No vendor\"))\n```\n:::\n\n:::\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"