From ac5b819ce50e15e9207e8cb261f527fa99debffc Mon Sep 17 00:00:00 2001 From: Stephanie Hazlitt Date: Tue, 19 Sep 2023 13:38:53 -0500 Subject: [PATCH 1/4] actual order on the day --- _quarto.yaml | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/_quarto.yaml b/_quarto.yaml index bcd0f2c..0b4c785 100644 --- a/_quarto.yaml +++ b/_quarto.yaml @@ -37,12 +37,12 @@ website: - text: "Data Engineering with Arrow" href: materials/3_data_engineering.qmd target: "_blank" - - text: "Manipulating Data with Arrow (Part II)" - href: materials/4_data_manipulation_2.qmd - target: "_blank" - text: "Arrow In-Memory Workflows" href: materials/5_arrow_single_file.qmd target: "_blank" + - text: "Manipulating Data with Arrow (Part II)" + href: materials/4_data_manipulation_2.qmd + target: "_blank" - text: "Wrapping Up: Arrow & R Together" href: materials/6_wrapping_up.qmd target: "_blank" @@ -60,10 +60,11 @@ website: href: materials/2_data_manipulation_1-exercises.qmd - text: "Data Engineering Exercises" href: materials/3_data_engineering-exercises.qmd - - text: "Data Manipulation Part II Exercises" - href: materials/4_data_manipulation_2-exercises.qmd - text: "Arrow In-Memory Workflow Exercises" href: materials/5_arrow_single_file-exercises.qmd + - text: "Data Manipulation Part II Exercises" + href: materials/4_data_manipulation_2-exercises.qmd + format: html: From 7fe781b7049afd2d00cb862ad9d64d5fd54b00dd Mon Sep 17 00:00:00 2001 From: Stephanie Hazlitt Date: Tue, 19 Sep 2023 13:39:08 -0500 Subject: [PATCH 2/4] rm redundant group_by --- materials/3_data_engineering-exercises.qmd | 2 -- 1 file changed, 2 deletions(-) diff --git a/materials/3_data_engineering-exercises.qmd b/materials/3_data_engineering-exercises.qmd index 8229309..129b1b5 100644 --- a/materials/3_data_engineering-exercises.qmd +++ b/materials/3_data_engineering-exercises.qmd @@ -208,7 +208,6 @@ Total number of Checkouts in September of 2019 using partitioned Parquet data by #| label: seattle-partitioned-other-dplyr open_dataset(here::here("data/seattle-library-checkouts-type")) |> filter(CheckoutYear == 2019, CheckoutMonth == 9) |> - group_by(CheckoutYear) |> summarise(TotalCheckouts = sum(Checkouts)) |> collect() |> system.time() @@ -220,7 +219,6 @@ Total number of Checkouts in September of 2019 using partitioned Parquet data by #| label: seattle-partitioned-partitioned-filter-dplyr open_dataset(here::here("data/seattle-library-checkouts")) |> filter(CheckoutYear == 2019, CheckoutMonth == 9) |> - group_by(CheckoutYear) |> summarise(TotalCheckouts = sum(Checkouts)) |> collect() |> system.time() From 97b57bd94689b28dea4d4bb9aca57cccad17e184 Mon Sep 17 00:00:00 2001 From: Stephanie Hazlitt Date: Tue, 19 Sep 2023 13:39:24 -0500 Subject: [PATCH 3/4] use correct object labels --- materials/4_data_manipulation_2.qmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/materials/4_data_manipulation_2.qmd b/materials/4_data_manipulation_2.qmd index 13c526d..5242aaf 100644 --- a/materials/4_data_manipulation_2.qmd +++ b/materials/4_data_manipulation_2.qmd @@ -273,14 +273,14 @@ nyc_taxi |> collect() ``` -## Schema for the `nyc_taxi` table +## Schema for the `nyc_taxi` Dataset ```{r} #| label: get-schema schema(nyc_taxi) ``` -## Schema for the `nyc_taxi_zones` table +## Schema for the `nyc_taxi_zones` Table ```{r} #| label: schema-2 From c22f17d10ee1b070cd30cc890cae64cda02a28bf Mon Sep 17 00:00:00 2001 From: Stephanie Hazlitt Date: Tue, 19 Sep 2023 13:39:34 -0500 Subject: [PATCH 4/4] rebuild site --- 
.../3_data_engineering-exercises/execute-results/html.json | 4 ++-- .../materials/4_data_manipulation_2/execute-results/html.json | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/_freeze/materials/3_data_engineering-exercises/execute-results/html.json b/_freeze/materials/3_data_engineering-exercises/execute-results/html.json index f2f3f91..641546b 100644 --- a/_freeze/materials/3_data_engineering-exercises/execute-results/html.json +++ b/_freeze/materials/3_data_engineering-exercises/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "b5e1cbc1cd97d5404a8e61f27b29b1c9", + "hash": "bc9577166bdb00f8d925743bf438c721", "result": { - "markdown": "---\ntitle: \"Data Engineering with Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source \n---\n\n\n# Schemas\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\"\n)\n```\n:::\n\n\n::: {#exercise-schema .callout-tip}\n# Data Types & Controlling the Schema\n\n::: panel-tabset\n## Problems\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` instead of the `<null>` interpreted by Arrow.\n\n2. Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n## Solution 1\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-schema-1_f724b866de89b0d5657421eb6e893446'}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\",\n skip = 1,\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(),\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\n```\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\",\n skip = 1,\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = utf8(),\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\n```\n:::\n\n\n## Solution 2\n\nThe number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr_11e076db4356ccb8c1472bac17b0ebbe'}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n:::\n:::\n\n\nor\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr-2_10dd993dcc5049ff1548b3e2b15e01f3'}\n\n```{.r .cell-code}\nseattle_csv |> 
count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 18 × 2\n CheckoutYear n\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n:::\n:::\n\n\nTiming the query:\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr-timed_1f17f2738b0ea5175f9f30b8824ab034'}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 10.853 1.198 10.561 \n```\n:::\n:::\n\n\nQuerying 42 million rows of data stored in a CSV on disk in \\~10 seconds, not too bad.\n:::\n:::\n\n# Parquet\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- here::here(\"data/seattle-library-checkouts-parquet\")\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n::: {#exercise-dataset .callout-tip}\n# Parquet\n\n::: panel-tabset\n## Problem\n\n1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time?\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- here::here(\"data/seattle-library-checkouts-parquet\")\n\nopen_dataset(seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 2.235 0.475 0.625 \n```\n:::\n:::\n\n\nA *much* faster compute time for the query when the on-disk data is stored in the Parquet format.\n:::\n:::\n\n# Partitioning\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- here::here(\"data/seattle-library-checkouts\")\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n::: callout-tip\n# Partitioning\n\n::: panel-tabset\n## Problems\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. 
Did you find a difference in compute time?\n\n## Solution 1\n\nWriting the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_checkouttype <- here::here(\"data/seattle-library-checkouts-type\")\n\nseattle_csv |>\n group_by(CheckoutType) |>\n write_dataset(path = seattle_checkouttype,\n format = \"parquet\")\n```\n:::\n\n\n## Solution 2\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(here::here(\"data/seattle-library-checkouts-type\")) |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n group_by(CheckoutYear) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 0.890 0.088 0.326 \n```\n:::\n:::\n\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(here::here(\"data/seattle-library-checkouts\")) |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n group_by(CheckoutYear) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 0.046 0.006 0.036 \n```\n:::\n:::\n\n\nFaster compute time because the `filter()` call is based on the partitions.\n:::\n:::\n", + "markdown": "---\ntitle: \"Data Engineering with Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source \n---\n\n\n# Schemas\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\"\n)\n```\n:::\n\n\n::: {#exercise-schema .callout-tip}\n# Data Types & Controlling the Schema\n\n::: panel-tabset\n## Problems\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` instead of the `<null>` interpreted by Arrow.\n\n2. 
Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n## Solution 1\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-schema-1_f724b866de89b0d5657421eb6e893446'}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\",\n skip = 1,\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(),\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\n```\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\",\n skip = 1,\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = utf8(),\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\n```\n:::\n\n\n## Solution 2\n\nThe number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr_11e076db4356ccb8c1472bac17b0ebbe'}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n:::\n:::\n\n\nor\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr-2_10dd993dcc5049ff1548b3e2b15e01f3'}\n\n```{.r .cell-code}\nseattle_csv |> \n count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 18 × 2\n CheckoutYear n\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n:::\n:::\n\n\nTiming the query:\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr-timed_1f17f2738b0ea5175f9f30b8824ab034'}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 10.853 1.198 10.561 \n```\n:::\n:::\n\n\nQuerying 42 million rows of data stored in a CSV on disk in \\~10 seconds, not too bad.\n:::\n:::\n\n# Parquet\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- here::here(\"data/seattle-library-checkouts-parquet\")\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n::: {#exercise-dataset .callout-tip}\n# Parquet\n\n::: panel-tabset\n## Problem\n\n1. 
Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time?\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- here::here(\"data/seattle-library-checkouts-parquet\")\n\nopen_dataset(seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 2.238 0.452 0.696 \n```\n:::\n:::\n\n\nA *much* faster compute time for the query when the on-disk data is stored in the Parquet format.\n:::\n:::\n\n# Partitioning\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- here::here(\"data/seattle-library-checkouts\")\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n::: callout-tip\n# Partitioning\n\n::: panel-tabset\n## Problems\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. Did you find a difference in compute time?\n\n## Solution 1\n\nWriting the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_checkouttype <- here::here(\"data/seattle-library-checkouts-type\")\n\nseattle_csv |>\n group_by(CheckoutType) |>\n write_dataset(path = seattle_checkouttype,\n format = \"parquet\")\n```\n:::\n\n\n## Solution 2\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(here::here(\"data/seattle-library-checkouts-type\")) |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 0.907 0.087 0.333 \n```\n:::\n:::\n\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(here::here(\"data/seattle-library-checkouts\")) |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 0.039 0.006 0.032 \n```\n:::\n:::\n\n\nFaster compute time because the `filter()` call is based on the partitions.\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/materials/4_data_manipulation_2/execute-results/html.json b/_freeze/materials/4_data_manipulation_2/execute-results/html.json index 8a70971..6e42b7d 100644 --- a/_freeze/materials/4_data_manipulation_2/execute-results/html.json +++ b/_freeze/materials/4_data_manipulation_2/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "3e4b35b3bcfa580bd5a8a3c218666008", + "hash": "9100f089427c5167c7ea54fda1af02e5", "result": { - "markdown": "---\nfooter: \"[🔗 posit.io/arrow](https://posit-conf-2023.github.io/arrow)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: 
source\n---\n\n\n# Data Manipulation---Part 2 {#data-manip-2}\n\n\n::: {.cell}\n\n:::\n\n\n## What if a function binding doesn't exist - revisited!\n\n- Option 1 - find a workaround\n- Option 2 - user-defined functions (UDFs)\n\n## Why use a UDF?\n\nImplement your own custom functions!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntime_diff_minutes <- function(pickup, dropoff){\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n}\n\nnyc_taxi |>\n mutate(\n duration_minutes = time_diff_minutes(pickup_datetime, dropoff_datetime)\n ) |> \n select(pickup_datetime, dropoff_datetime, duration_minutes) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: Expression time_diff_minutes(pickup_datetime, dropoff_datetime) not supported in Arrow\nCall collect() first to pull data into R.\n```\n:::\n:::\n\n\nWe get an error as we can't automatically convert the function to arrow.\n\n# User-defined functions (aka UDFs)\n\n- Define your own functions\n- Scalar functions - 1 row input and 1 row output\n\n\n## User-defined functions - definition\n\n\n::: {.cell}\n\n```{.r .cell-code}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\nThis looks complicated, so let's look at it 1 part at a time!\n\n## User-defined functions - definition\n\nStep 1. Give the function a name\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"2\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 2. Define the body of the function - first argument *must* be `context`\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"3,4,5,6,7\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 3. Set the schema of the input arguments\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"8,9,10,11\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 4. 
Set the data type of the output\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"12\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 5. Set `auto_convert = TRUE` if using in a dplyr pipeline\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"13\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - usage\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n mutate(\n duration_minutes = time_diff_minutes(pickup_datetime, dropoff_datetime)\n ) |>\n select(pickup_datetime, dropoff_datetime, duration_minutes) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n pickup_datetime dropoff_datetime duration_minutes\n <dttm> <dttm> <int>\n1 2012-10-07 18:19:00 2012-10-07 18:29:00 10\n2 2012-10-07 18:19:00 2012-10-07 18:33:00 14\n3 2012-10-07 18:19:00 2012-10-07 18:35:00 16\n4 2012-10-07 18:19:00 2012-10-07 18:35:00 16\n5 2012-10-07 18:19:00 2012-10-07 18:42:00 23\n6 2012-10-07 18:19:00 2012-10-07 18:43:00 24\n```\n:::\n:::\n\n\n## Your Turn\n\n1. Write a user-defined function which wraps the `stringr` function `str_replace_na()`, and use it to replace any `NA` values in the `vendor_name` column with the string \"No vendor\" instead. 
(Test it on the data from 2019 so you're not pulling everything into memory)\n\n➡️ [Data Manipulation Part II Exercises Page](4_data_manipulation_2-exercises.html)\n\n## Summary\n\n- You can use UDFs to create your own bindings when they don't exist\n- UDFs must be scalar (1 row in -\\> 1 row out) and stateless (no knowledge of other rows of data)\n- Calculations done by R not Arrow, so slower than in-built bindings but still pretty fast\n\n# Joins\n\n## Joins\n\n![](images/joins.png)\n\n## Joining a reference table\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvendors <- tibble::tibble(\n code = c(\"VTS\", \"CMT\", \"DDS\"),\n full_name = c(\n \"Verifone Transportation Systems\",\n \"Creative Mobile Technologies\",\n \"Digital Dispatch Systems\"\n )\n)\n\nnyc_taxi |>\n left_join(vendors, by = c(\"vendor_name\" = \"code\")) |>\n select(vendor_name, full_name, pickup_datetime) |>\n head(3) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n vendor_name full_name pickup_datetime \n <chr> <chr> <dttm> \n1 CMT Creative Mobile Technologies 2012-01-10 23:55:50\n2 CMT Creative Mobile Technologies 2012-01-11 19:18:25\n3 CMT Creative Mobile Technologies 2012-01-11 19:19:19\n```\n:::\n:::\n\n\n## Traps for the unwary\n\nQuestion: which are the most common borough-to-borough journeys in the dataset?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones <- \n read_csv_arrow(here::here(\"data/taxi_zone_lookup.csv\")) |>\n select(location_id = LocationID,\n borough = Borough)\n\nnyc_taxi_zones\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 265 × 2\n location_id borough \n <int> <chr> \n 1 1 EWR \n 2 2 Queens \n 3 3 Bronx \n 4 4 Manhattan \n 5 5 Staten Island\n 6 6 Staten Island\n 7 7 Queens \n 8 8 Queens \n 9 9 Queens \n10 10 Queens \n# ℹ 255 more rows\n```\n:::\n:::\n\n\n## Why didn't this work?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n left_join(nyc_taxi_zones, by = c(\"pickup_location_id\" = \"location_id\")) |>\n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `compute.arrow_dplyr_query()`:\n! 
Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(pickup_location_id) of type int64 and FieldRef.Name(location_id) of type int32\n```\n:::\n:::\n\n\n## Schema for the `nyc_taxi` table\n\n\n::: {.cell}\n\n```{.r .cell-code}\nschema(nyc_taxi)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nSchema\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\n```\n:::\n:::\n\n\n## Schema for the `nyc_taxi_zones` table\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones_arrow <- arrow_table(nyc_taxi_zones)\nschema(nyc_taxi_zones_arrow)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nSchema\nlocation_id: int32\nborough: string\n```\n:::\n:::\n\n\n- `pickup_location_id` is int64 in the `nyc_taxi` table\n- `location_id` is int32 in the `nyc_taxi_zones` table\n\n## Take control of the schema\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones_arrow <- arrow_table(\n nyc_taxi_zones, \n schema = schema(location_id = int64(), borough = utf8())\n)\n```\n:::\n\n\n- `schema()` takes variable name / types as input\n- arrow has various \"type\" functions: `int64()`, `utf8()`, `boolean()`, `date32()` etc\n\n## Take control of the schema\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones_arrow <- arrow_table(\n nyc_taxi_zones, \n schema = schema(location_id = int64(), borough = utf8())\n)\nschema(nyc_taxi_zones_arrow)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nSchema\nlocation_id: int64\nborough: string\n```\n:::\n:::\n\n\n## Prepare the auxiliary tables\n\n\n::: {.cell}\n\n```{.r .cell-code}\npickup <- nyc_taxi_zones_arrow |>\n select(pickup_location_id = location_id,\n pickup_borough = borough)\n\ndropoff <- nyc_taxi_zones_arrow |>\n select(dropoff_location_id = location_id,\n dropoff_borough = borough)\n```\n:::\n\n\n- Join separately for the pickup and dropoff zones\n\n\n## Join and cross-tabulate\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tictoc)\n\ntic()\nborough_counts <- nyc_taxi |> \n left_join(pickup) |>\n left_join(dropoff) |>\n count(pickup_borough, dropoff_borough) |>\n arrange(desc(n)) |>\n collect()\ntoc()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n151.228 sec elapsed\n```\n:::\n:::\n\n\n
<br>\n\n2-3 minutes to join twice and cross-tabulate on non-partition variables, with 1.15 billion rows of data 🙂\n\n## The results\n\n\n::: {.cell}\n\n```{.r .cell-code}\nborough_counts\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 50 × 3\n pickup_borough dropoff_borough n\n <chr> <chr> <int>\n 1 <NA> <NA> 732357953\n 2 Manhattan Manhattan 351198872\n 3 Queens Manhattan 14440705\n 4 Manhattan Queens 13052517\n 5 Manhattan Brooklyn 11180867\n 6 Queens Queens 7440356\n 7 Unknown Unknown 4491811\n 8 Queens Brooklyn 3662324\n 9 Brooklyn Brooklyn 3550480\n10 Manhattan Bronx 2071830\n# ℹ 40 more rows\n```\n:::\n:::\n\n\n## Your Turn\n\n1. How many taxi pickups were recorded in 2019 from the three major airports covered by the NYC Taxis data set (JFK, LaGuardia, Newark)? (Hint: you can use `stringr::str_detect()` to help you find pickup zones with the word \"Airport\" in them)\n\n➡️ [Data Manipulation Part II Exercises Page](4_data_manipulation_2-exercises.html)\n\n## Summary\n\n- You can join Arrow Tables and Datasets to R data frames and Arrow Tables\n- The Arrow data type of join keys must always match\n\n# Window functions\n\n## What are window functions?\n\n- calculations across a \"window\" of multiple rows which relate to the current row\n- e.g. `row_number()`, `ntile()`, or calling `mutate()` after `group_by()`\n\n## Grouped summaries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year <- nyc_taxi |>\n filter(year > 2019) |>\n select(year, fare_amount)\n\nfare_by_year |>\n group_by(year) |>\n summarise(mean_fare = mean(fare_amount)) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n year mean_fare\n <int> <dbl>\n1 2020 12.7\n2 2021 13.5\n```\n:::\n:::\n\n\n## Window functions\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year |>\n group_by(year) |>\n mutate(mean_fare = mean(fare_amount)) |> \n head() |> \n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: window functions not currently supported in Arrow\nCall collect() first to pull data into R.\n```\n:::\n:::\n\n\n## Window functions - via joins\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year |>\n left_join(\n fare_by_year |>\n group_by(year) |>\n summarise(mean_fare = mean(fare_amount))\n ) |> \n arrange(desc(fare_amount)) |>\n head() |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n year fare_amount mean_fare\n <int> <dbl> <dbl>\n1 2020 998310. 12.7\n2 2021 818283. 13.5\n3 2020 671100. 12.7\n4 2020 429497. 12.7\n5 2021 398466. 13.5\n6 2020 398465. 12.7\n```\n:::\n:::\n\n\n## Window functions - via duckdb\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year |>\n group_by(year) |>\n to_duckdb() |>\n mutate(mean_fare = mean(fare_amount)) |> \n to_arrow() |>\n arrange(desc(fare_amount)) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n year fare_amount mean_fare\n <int> <dbl> <dbl>\n1 2020 998310. 12.7\n2 2021 818283. 13.5\n3 2020 671100. 12.7\n4 2020 429497. 12.7\n5 2021 398466. 13.5\n6 2020 398465. 12.7\n```\n:::\n:::\n\n\n## Your Turn\n\n1. 
How many trips in September 2019 had a longer than average distance for that month?\n\n➡️ [Data Manipulation Part II Exercises Page](4_data_manipulation_2-exercises.html)\n\n## Summary\n\n- Window functions in arrow can be achieved via joins or passing data to and from duckdb\n", + "markdown": "---\nfooter: \"[🔗 posit.io/arrow](https://posit-conf-2023.github.io/arrow)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: source\n---\n\n\n# Data Manipulation---Part 2 {#data-manip-2}\n\n\n::: {.cell}\n\n:::\n\n\n## What if a function binding doesn't exist - revisited!\n\n- Option 1 - find a workaround\n- Option 2 - user-defined functions (UDFs)\n\n## Why use a UDF?\n\nImplement your own custom functions!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntime_diff_minutes <- function(pickup, dropoff){\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n}\n\nnyc_taxi |>\n mutate(\n duration_minutes = time_diff_minutes(pickup_datetime, dropoff_datetime)\n ) |> \n select(pickup_datetime, dropoff_datetime, duration_minutes) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: Expression time_diff_minutes(pickup_datetime, dropoff_datetime) not supported in Arrow\nCall collect() first to pull data into R.\n```\n:::\n:::\n\n\nWe get an error as we can't automatically convert the function to arrow.\n\n# User-defined functions (aka UDFs)\n\n- Define your own functions\n- Scalar functions - 1 row input and 1 row output\n\n\n## User-defined functions - definition\n\n\n::: {.cell}\n\n```{.r .cell-code}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\nThis looks complicated, so let's look at it 1 part at a time!\n\n## User-defined functions - definition\n\nStep 1. Give the function a name\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"2\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 2. Define the body of the function - first argument *must* be `context`\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"3,4,5,6,7\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 3. 
Set the schema of the input arguments\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"8,9,10,11\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 4. Set the data type of the output\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"12\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 5. Set `auto_convert = TRUE` if using in a dplyr pipeline\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"13\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - usage\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n mutate(\n duration_minutes = time_diff_minutes(pickup_datetime, dropoff_datetime)\n ) |>\n select(pickup_datetime, dropoff_datetime, duration_minutes) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n pickup_datetime dropoff_datetime duration_minutes\n <dttm> <dttm> <int>\n1 2012-11-02 19:47:00 2012-11-02 20:16:00 29\n2 2012-11-02 19:47:07 2012-11-02 19:53:32 6\n3 2012-11-02 19:47:13 2012-11-02 19:53:31 6\n4 2012-11-02 19:47:35 2012-11-02 19:52:40 5\n5 2012-11-02 19:47:51 2012-11-02 20:00:19 12\n6 2012-11-02 19:48:00 2012-11-02 19:51:00 3\n```\n:::\n:::\n\n\n## Your Turn\n\n1. Write a user-defined function which wraps the `stringr` function `str_replace_na()`, and use it to replace any `NA` values in the `vendor_name` column with the string \"No vendor\" instead. 
(Test it on the data from 2019 so you're not pulling everything into memory)\n\n➡️ [Data Manipulation Part II Exercises Page](4_data_manipulation_2-exercises.html)\n\n## Summary\n\n- You can use UDFs to create your own bindings when they don't exist\n- UDFs must be scalar (1 row in -\\> 1 row out) and stateless (no knowledge of other rows of data)\n- Calculations done by R not Arrow, so slower than in-built bindings but still pretty fast\n\n# Joins\n\n## Joins\n\n![](images/joins.png)\n\n## Joining a reference table\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvendors <- tibble::tibble(\n code = c(\"VTS\", \"CMT\", \"DDS\"),\n full_name = c(\n \"Verifone Transportation Systems\",\n \"Creative Mobile Technologies\",\n \"Digital Dispatch Systems\"\n )\n)\n\nnyc_taxi |>\n left_join(vendors, by = c(\"vendor_name\" = \"code\")) |>\n select(vendor_name, full_name, pickup_datetime) |>\n head(3) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n vendor_name full_name pickup_datetime \n <chr> <chr> <dttm> \n1 CMT Creative Mobile Technologies 2012-01-20 04:32:03\n2 CMT Creative Mobile Technologies 2012-01-20 04:33:16\n3 CMT Creative Mobile Technologies 2012-01-20 04:32:38\n```\n:::\n:::\n\n\n## Traps for the unwary\n\nQuestion: which are the most common borough-to-borough journeys in the dataset?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones <- \n read_csv_arrow(here::here(\"data/taxi_zone_lookup.csv\")) |>\n select(location_id = LocationID,\n borough = Borough)\n\nnyc_taxi_zones\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 265 × 2\n location_id borough \n <int> <chr> \n 1 1 EWR \n 2 2 Queens \n 3 3 Bronx \n 4 4 Manhattan \n 5 5 Staten Island\n 6 6 Staten Island\n 7 7 Queens \n 8 8 Queens \n 9 9 Queens \n10 10 Queens \n# ℹ 255 more rows\n```\n:::\n:::\n\n\n## Why didn't this work?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n left_join(nyc_taxi_zones, by = c(\"pickup_location_id\" = \"location_id\")) |>\n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `compute.arrow_dplyr_query()`:\n! 
Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(pickup_location_id) of type int64 and FieldRef.Name(location_id) of type int32\n```\n:::\n:::\n\n\n## Schema for the `nyc_taxi` Dataset\n\n\n::: {.cell}\n\n```{.r .cell-code}\nschema(nyc_taxi)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nSchema\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\n```\n:::\n:::\n\n\n## Schema for the `nyc_taxi_zones` Table\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones_arrow <- arrow_table(nyc_taxi_zones)\nschema(nyc_taxi_zones_arrow)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nSchema\nlocation_id: int32\nborough: string\n```\n:::\n:::\n\n\n- `pickup_location_id` is int64 in the `nyc_taxi` table\n- `location_id` is int32 in the `nyc_taxi_zones` table\n\n## Take control of the schema\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones_arrow <- arrow_table(\n nyc_taxi_zones, \n schema = schema(location_id = int64(), borough = utf8())\n)\n```\n:::\n\n\n- `schema()` takes variable name / types as input\n- arrow has various \"type\" functions: `int64()`, `utf8()`, `boolean()`, `date32()` etc\n\n## Take control of the schema\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones_arrow <- arrow_table(\n nyc_taxi_zones, \n schema = schema(location_id = int64(), borough = utf8())\n)\nschema(nyc_taxi_zones_arrow)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nSchema\nlocation_id: int64\nborough: string\n```\n:::\n:::\n\n\n## Prepare the auxiliary tables\n\n\n::: {.cell}\n\n```{.r .cell-code}\npickup <- nyc_taxi_zones_arrow |>\n select(pickup_location_id = location_id,\n pickup_borough = borough)\n\ndropoff <- nyc_taxi_zones_arrow |>\n select(dropoff_location_id = location_id,\n dropoff_borough = borough)\n```\n:::\n\n\n- Join separately for the pickup and dropoff zones\n\n\n## Join and cross-tabulate\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tictoc)\n\ntic()\nborough_counts <- nyc_taxi |> \n left_join(pickup) |>\n left_join(dropoff) |>\n count(pickup_borough, dropoff_borough) |>\n arrange(desc(n)) |>\n collect()\ntoc()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n108.192 sec elapsed\n```\n:::\n:::\n\n\n
<br>\n\n2-3 minutes to join twice and cross-tabulate on non-partition variables, with 1.15 billion rows of data 🙂\n\n## The results\n\n\n::: {.cell}\n\n```{.r .cell-code}\nborough_counts\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 50 × 3\n pickup_borough dropoff_borough n\n <chr> <chr> <int>\n 1 <NA> <NA> 732357953\n 2 Manhattan Manhattan 351198872\n 3 Queens Manhattan 14440705\n 4 Manhattan Queens 13052517\n 5 Manhattan Brooklyn 11180867\n 6 Queens Queens 7440356\n 7 Unknown Unknown 4491811\n 8 Queens Brooklyn 3662324\n 9 Brooklyn Brooklyn 3550480\n10 Manhattan Bronx 2071830\n# ℹ 40 more rows\n```\n:::\n:::\n\n\n## Your Turn\n\n1. How many taxi pickups were recorded in 2019 from the three major airports covered by the NYC Taxis data set (JFK, LaGuardia, Newark)? (Hint: you can use `stringr::str_detect()` to help you find pickup zones with the word \"Airport\" in them)\n\n➡️ [Data Manipulation Part II Exercises Page](4_data_manipulation_2-exercises.html)\n\n## Summary\n\n- You can join Arrow Tables and Datasets to R data frames and Arrow Tables\n- The Arrow data type of join keys must always match\n\n# Window functions\n\n## What are window functions?\n\n- calculations across a \"window\" of multiple rows which relate to the current row\n- e.g. `row_number()`, `ntile()`, or calling `mutate()` after `group_by()`\n\n## Grouped summaries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year <- nyc_taxi |>\n filter(year > 2019) |>\n select(year, fare_amount)\n\nfare_by_year |>\n group_by(year) |>\n summarise(mean_fare = mean(fare_amount)) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n year mean_fare\n <int> <dbl>\n1 2020 12.7\n2 2021 13.5\n```\n:::\n:::\n\n\n## Window functions\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year |>\n group_by(year) |>\n mutate(mean_fare = mean(fare_amount)) |> \n head() |> \n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: window functions not currently supported in Arrow\nCall collect() first to pull data into R.\n```\n:::\n:::\n\n\n## Window functions - via joins\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year |>\n left_join(\n fare_by_year |>\n group_by(year) |>\n summarise(mean_fare = mean(fare_amount))\n ) |> \n arrange(desc(fare_amount)) |>\n head() |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n year fare_amount mean_fare\n <int> <dbl> <dbl>\n1 2020 998310. 12.7\n2 2021 818283. 13.5\n3 2020 671100. 12.7\n4 2020 429497. 12.7\n5 2021 398466. 13.5\n6 2020 398465. 12.7\n```\n:::\n:::\n\n\n## Window functions - via duckdb\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year |>\n group_by(year) |>\n to_duckdb() |>\n mutate(mean_fare = mean(fare_amount)) |> \n to_arrow() |>\n arrange(desc(fare_amount)) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n year fare_amount mean_fare\n <int> <dbl> <dbl>\n1 2020 998310. 12.7\n2 2021 818283. 13.5\n3 2020 671100. 12.7\n4 2020 429497. 12.7\n5 2021 398466. 13.5\n6 2020 398465. 12.7\n```\n:::\n:::\n\n\n## Your Turn\n\n1. How many trips in September 2019 had a longer than average distance for that month?\n\n➡️ [Data Manipulation Part II Exercises Page](4_data_manipulation_2-exercises.html)\n\n## Summary\n\n- Window functions in arrow can be achieved via joins or passing data to and from duckdb\n", "supporting": [ "4_data_manipulation_2_files" ],