From ac5b819ce50e15e9207e8cb261f527fa99debffc Mon Sep 17 00:00:00 2001 From: Stephanie Hazlitt Date: Tue, 19 Sep 2023 13:38:53 -0500 Subject: [PATCH 1/4] actual order on the day --- _quarto.yaml | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/_quarto.yaml b/_quarto.yaml index bcd0f2c..0b4c785 100644 --- a/_quarto.yaml +++ b/_quarto.yaml @@ -37,12 +37,12 @@ website: - text: "Data Engineering with Arrow" href: materials/3_data_engineering.qmd target: "_blank" - - text: "Manipulating Data with Arrow (Part II)" - href: materials/4_data_manipulation_2.qmd - target: "_blank" - text: "Arrow In-Memory Workflows" href: materials/5_arrow_single_file.qmd target: "_blank" + - text: "Manipulating Data with Arrow (Part II)" + href: materials/4_data_manipulation_2.qmd + target: "_blank" - text: "Wrapping Up: Arrow & R Together" href: materials/6_wrapping_up.qmd target: "_blank" @@ -60,10 +60,11 @@ website: href: materials/2_data_manipulation_1-exercises.qmd - text: "Data Engineering Exercises" href: materials/3_data_engineering-exercises.qmd - - text: "Data Manipulation Part II Exercises" - href: materials/4_data_manipulation_2-exercises.qmd - text: "Arrow In-Memory Workflow Exercises" href: materials/5_arrow_single_file-exercises.qmd + - text: "Data Manipulation Part II Exercises" + href: materials/4_data_manipulation_2-exercises.qmd + format: html: From 7fe781b7049afd2d00cb862ad9d64d5fd54b00dd Mon Sep 17 00:00:00 2001 From: Stephanie Hazlitt Date: Tue, 19 Sep 2023 13:39:08 -0500 Subject: [PATCH 2/4] rm redundant group_by --- materials/3_data_engineering-exercises.qmd | 2 -- 1 file changed, 2 deletions(-) diff --git a/materials/3_data_engineering-exercises.qmd b/materials/3_data_engineering-exercises.qmd index 8229309..129b1b5 100644 --- a/materials/3_data_engineering-exercises.qmd +++ b/materials/3_data_engineering-exercises.qmd @@ -208,7 +208,6 @@ Total number of Checkouts in September of 2019 using partitioned Parquet data by #| label: seattle-partitioned-other-dplyr open_dataset(here::here("data/seattle-library-checkouts-type")) |> filter(CheckoutYear == 2019, CheckoutMonth == 9) |> - group_by(CheckoutYear) |> summarise(TotalCheckouts = sum(Checkouts)) |> collect() |> system.time() @@ -220,7 +219,6 @@ Total number of Checkouts in September of 2019 using partitioned Parquet data by #| label: seattle-partitioned-partitioned-filter-dplyr open_dataset(here::here("data/seattle-library-checkouts")) |> filter(CheckoutYear == 2019, CheckoutMonth == 9) |> - group_by(CheckoutYear) |> summarise(TotalCheckouts = sum(Checkouts)) |> collect() |> system.time() From 97b57bd94689b28dea4d4bb9aca57cccad17e184 Mon Sep 17 00:00:00 2001 From: Stephanie Hazlitt Date: Tue, 19 Sep 2023 13:39:24 -0500 Subject: [PATCH 3/4] use correct object labels --- materials/4_data_manipulation_2.qmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/materials/4_data_manipulation_2.qmd b/materials/4_data_manipulation_2.qmd index 13c526d..5242aaf 100644 --- a/materials/4_data_manipulation_2.qmd +++ b/materials/4_data_manipulation_2.qmd @@ -273,14 +273,14 @@ nyc_taxi |> collect() ``` -## Schema for the `nyc_taxi` table +## Schema for the `nyc_taxi` Dataset ```{r} #| label: get-schema schema(nyc_taxi) ``` -## Schema for the `nyc_taxi_zones` table +## Schema for the `nyc_taxi_zones` Table ```{r} #| label: schema-2 From c22f17d10ee1b070cd30cc890cae64cda02a28bf Mon Sep 17 00:00:00 2001 From: Stephanie Hazlitt Date: Tue, 19 Sep 2023 13:39:34 -0500 Subject: [PATCH 4/4] rebuild site --- 
.../3_data_engineering-exercises/execute-results/html.json | 4 ++-- .../materials/4_data_manipulation_2/execute-results/html.json | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/_freeze/materials/3_data_engineering-exercises/execute-results/html.json b/_freeze/materials/3_data_engineering-exercises/execute-results/html.json index f2f3f91..641546b 100644 --- a/_freeze/materials/3_data_engineering-exercises/execute-results/html.json +++ b/_freeze/materials/3_data_engineering-exercises/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "b5e1cbc1cd97d5404a8e61f27b29b1c9", + "hash": "bc9577166bdb00f8d925743bf438c721", "result": { - "markdown": "---\ntitle: \"Data Engineering with Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source \n---\n\n\n# Schemas\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\"\n)\n```\n:::\n\n\n::: {#exercise-schema .callout-tip}\n# Data Types & Controlling the Schema\n\n::: panel-tabset\n## Problems\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` instead of the `<null>` interpreted by Arrow.\n\n2. Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n## Solution 1\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-schema-1_f724b866de89b0d5657421eb6e893446'}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\",\n skip = 1,\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(),\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\n```\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\",\n skip = 1,\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = utf8(),\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\n```\n:::\n\n\n## Solution 2\n\nThe number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr_11e076db4356ccb8c1472bac17b0ebbe'}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n:::\n:::\n\n\nor\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr-2_10dd993dcc5049ff1548b3e2b15e01f3'}\n\n```{.r .cell-code}\nseattle_csv |> 
count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 18 × 2\n CheckoutYear n\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n:::\n:::\n\n\nTiming the query:\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr-timed_1f17f2738b0ea5175f9f30b8824ab034'}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 10.853 1.198 10.561 \n```\n:::\n:::\n\n\nQuerying 42 million rows of data stored in a CSV on disk in \\~10 seconds, not too bad.\n:::\n:::\n\n# Parquet\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- here::here(\"data/seattle-library-checkouts-parquet\")\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n::: {#exercise-dataset .callout-tip}\n# Parquet\n\n::: panel-tabset\n## Problem\n\n1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time?\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- here::here(\"data/seattle-library-checkouts-parquet\")\n\nopen_dataset(seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 2.235 0.475 0.625 \n```\n:::\n:::\n\n\nA *much* faster compute time for the query when the on-disk data is stored in the Parquet format.\n:::\n:::\n\n# Partitioning\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- here::here(\"data/seattle-library-checkouts\")\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n::: callout-tip\n# Partitioning\n\n::: panel-tabset\n## Problems\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. 
Did you find a difference in compute time?\n\n## Solution 1\n\nWriting the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_checkouttype <- here::here(\"data/seattle-library-checkouts-type\")\n\nseattle_csv |>\n group_by(CheckoutType) |>\n write_dataset(path = seattle_checkouttype,\n format = \"parquet\")\n```\n:::\n\n\n## Solution 2\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(here::here(\"data/seattle-library-checkouts-type\")) |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n group_by(CheckoutYear) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 0.890 0.088 0.326 \n```\n:::\n:::\n\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(here::here(\"data/seattle-library-checkouts\")) |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n group_by(CheckoutYear) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 0.046 0.006 0.036 \n```\n:::\n:::\n\n\nFaster compute time because the `filter()` call is based on the partitions.\n:::\n:::\n", + "markdown": "---\ntitle: \"Data Engineering with Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source \n---\n\n\n# Schemas\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\"\n)\n```\n:::\n\n\n::: {#exercise-schema .callout-tip}\n# Data Types & Controlling the Schema\n\n::: panel-tabset\n## Problems\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` instead of the `<null>` interpreted by Arrow.\n\n2. 
Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n## Solution 1\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-schema-1_f724b866de89b0d5657421eb6e893446'}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\",\n skip = 1,\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(),\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\n```\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\",\n skip = 1,\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = utf8(),\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\n```\n:::\n\n\n## Solution 2\n\nThe number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr_11e076db4356ccb8c1472bac17b0ebbe'}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n:::\n:::\n\n\nor\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr-2_10dd993dcc5049ff1548b3e2b15e01f3'}\n\n```{.r .cell-code}\nseattle_csv |> \n count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 18 × 2\n CheckoutYear n\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n:::\n:::\n\n\nTiming the query:\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr-timed_1f17f2738b0ea5175f9f30b8824ab034'}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 10.853 1.198 10.561 \n```\n:::\n:::\n\n\nQuerying 42 million rows of data stored in a CSV on disk in \\~10 seconds, not too bad.\n:::\n:::\n\n# Parquet\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- here::here(\"data/seattle-library-checkouts-parquet\")\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n::: {#exercise-dataset .callout-tip}\n# Parquet\n\n::: panel-tabset\n## Problem\n\n1. 
Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time?\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- here::here(\"data/seattle-library-checkouts-parquet\")\n\nopen_dataset(seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 2.238 0.452 0.696 \n```\n:::\n:::\n\n\nA *much* faster compute time for the query when the on-disk data is stored in the Parquet format.\n:::\n:::\n\n# Partitioning\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- here::here(\"data/seattle-library-checkouts\")\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n::: callout-tip\n# Partitioning\n\n::: panel-tabset\n## Problems\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. Did you find a difference in compute time?\n\n## Solution 1\n\nWriting the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_checkouttype <- here::here(\"data/seattle-library-checkouts-type\")\n\nseattle_csv |>\n group_by(CheckoutType) |>\n write_dataset(path = seattle_checkouttype,\n format = \"parquet\")\n```\n:::\n\n\n## Solution 2\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(here::here(\"data/seattle-library-checkouts-type\")) |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 0.907 0.087 0.333 \n```\n:::\n:::\n\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(here::here(\"data/seattle-library-checkouts\")) |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 0.039 0.006 0.032 \n```\n:::\n:::\n\n\nFaster compute time because the `filter()` call is based on the partitions.\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/materials/4_data_manipulation_2/execute-results/html.json b/_freeze/materials/4_data_manipulation_2/execute-results/html.json index 8a70971..6e42b7d 100644 --- a/_freeze/materials/4_data_manipulation_2/execute-results/html.json +++ b/_freeze/materials/4_data_manipulation_2/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "3e4b35b3bcfa580bd5a8a3c218666008", + "hash": "9100f089427c5167c7ea54fda1af02e5", "result": { - "markdown": "---\nfooter: \"[🔗 posit.io/arrow](https://posit-conf-2023.github.io/arrow)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: 
source\n---\n\n\n# Data Manipulation---Part 2 {#data-manip-2}\n\n\n::: {.cell}\n\n:::\n\n\n## What if a function binding doesn't exist - revisited!\n\n- Option 1 - find a workaround\n- Option 2 - user-defined functions (UDFs)\n\n## Why use a UDF?\n\nImplement your own custom functions!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntime_diff_minutes <- function(pickup, dropoff){\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n}\n\nnyc_taxi |>\n mutate(\n duration_minutes = time_diff_minutes(pickup_datetime, dropoff_datetime)\n ) |> \n select(pickup_datetime, dropoff_datetime, duration_minutes) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: Expression time_diff_minutes(pickup_datetime, dropoff_datetime) not supported in Arrow\nCall collect() first to pull data into R.\n```\n:::\n:::\n\n\nWe get an error as we can't automatically convert the function to arrow.\n\n# User-defined functions (aka UDFs)\n\n- Define your own functions\n- Scalar functions - 1 row input and 1 row output\n\n\n## User-defined functions - definition\n\n\n::: {.cell}\n\n```{.r .cell-code}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\nThis looks complicated, so let's look at it 1 part at a time!\n\n## User-defined functions - definition\n\nStep 1. Give the function a name\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"2\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 2. Define the body of the function - first argument *must* be `context`\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"3,4,5,6,7\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 3. Set the schema of the input arguments\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"8,9,10,11\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 4. 
Set the data type of the output\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"12\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 5. Set `auto_convert = TRUE` if using in a dplyr pipeline\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"13\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - usage\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n mutate(\n duration_minutes = time_diff_minutes(pickup_datetime, dropoff_datetime)\n ) |>\n select(pickup_datetime, dropoff_datetime, duration_minutes) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n pickup_datetime dropoff_datetime duration_minutes\n <dttm> <dttm> <int>\n1 2012-10-07 18:19:00 2012-10-07 18:29:00 10\n2 2012-10-07 18:19:00 2012-10-07 18:33:00 14\n3 2012-10-07 18:19:00 2012-10-07 18:35:00 16\n4 2012-10-07 18:19:00 2012-10-07 18:35:00 16\n5 2012-10-07 18:19:00 2012-10-07 18:42:00 23\n6 2012-10-07 18:19:00 2012-10-07 18:43:00 24\n```\n:::\n:::\n\n\n## Your Turn\n\n1. Write a user-defined function which wraps the `stringr` function `str_replace_na()`, and use it to replace any `NA` values in the `vendor_name` column with the string \"No vendor\" instead. 
(Test it on the data from 2019 so you're not pulling everything into memory)\n\n➡️ [Data Manipulation Part II Exercises Page](4_data_manipulation_2-exercises.html)\n\n## Summary\n\n- You can use UDFs to create your own bindings when they don't exist\n- UDFs must be scalar (1 row in -\\> 1 row out) and stateless (no knowledge of other rows of data)\n- Calculations done by R not Arrow, so slower than in-built bindings but still pretty fast\n\n# Joins\n\n## Joins\n\n![](images/joins.png)\n\n## Joining a reference table\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvendors <- tibble::tibble(\n code = c(\"VTS\", \"CMT\", \"DDS\"),\n full_name = c(\n \"Verifone Transportation Systems\",\n \"Creative Mobile Technologies\",\n \"Digital Dispatch Systems\"\n )\n)\n\nnyc_taxi |>\n left_join(vendors, by = c(\"vendor_name\" = \"code\")) |>\n select(vendor_name, full_name, pickup_datetime) |>\n head(3) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n vendor_name full_name pickup_datetime \n <chr> <chr> <dttm> \n1 CMT Creative Mobile Technologies 2012-01-10 23:55:50\n2 CMT Creative Mobile Technologies 2012-01-11 19:18:25\n3 CMT Creative Mobile Technologies 2012-01-11 19:19:19\n```\n:::\n:::\n\n\n## Traps for the unwary\n\nQuestion: which are the most common borough-to-borough journeys in the dataset?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones <- \n read_csv_arrow(here::here(\"data/taxi_zone_lookup.csv\")) |>\n select(location_id = LocationID,\n borough = Borough)\n\nnyc_taxi_zones\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 265 × 2\n location_id borough \n <int> <chr> \n 1 1 EWR \n 2 2 Queens \n 3 3 Bronx \n 4 4 Manhattan \n 5 5 Staten Island\n 6 6 Staten Island\n 7 7 Queens \n 8 8 Queens \n 9 9 Queens \n10 10 Queens \n# ℹ 255 more rows\n```\n:::\n:::\n\n\n## Why didn't this work?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n left_join(nyc_taxi_zones, by = c(\"pickup_location_id\" = \"location_id\")) |>\n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `compute.arrow_dplyr_query()`:\n! 
Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(pickup_location_id) of type int64 and FieldRef.Name(location_id) of type int32\n```\n:::\n:::\n\n\n## Schema for the `nyc_taxi` table\n\n\n::: {.cell}\n\n```{.r .cell-code}\nschema(nyc_taxi)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nSchema\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\n```\n:::\n:::\n\n\n## Schema for the `nyc_taxi_zones` table\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones_arrow <- arrow_table(nyc_taxi_zones)\nschema(nyc_taxi_zones_arrow)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nSchema\nlocation_id: int32\nborough: string\n```\n:::\n:::\n\n\n- `pickup_location_id` is int64 in the `nyc_taxi` table\n- `location_id` is int32 in the `nyc_taxi_zones` table\n\n## Take control of the schema\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones_arrow <- arrow_table(\n nyc_taxi_zones, \n schema = schema(location_id = int64(), borough = utf8())\n)\n```\n:::\n\n\n- `schema()` takes variable name / types as input\n- arrow has various \"type\" functions: `int64()`, `utf8()`, `boolean()`, `date32()` etc\n\n## Take control of the schema\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones_arrow <- arrow_table(\n nyc_taxi_zones, \n schema = schema(location_id = int64(), borough = utf8())\n)\nschema(nyc_taxi_zones_arrow)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nSchema\nlocation_id: int64\nborough: string\n```\n:::\n:::\n\n\n## Prepare the auxiliary tables\n\n\n::: {.cell}\n\n```{.r .cell-code}\npickup <- nyc_taxi_zones_arrow |>\n select(pickup_location_id = location_id,\n pickup_borough = borough)\n\ndropoff <- nyc_taxi_zones_arrow |>\n select(dropoff_location_id = location_id,\n dropoff_borough = borough)\n```\n:::\n\n\n- Join separately for the pickup and dropoff zones\n\n\n## Join and cross-tabulate\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tictoc)\n\ntic()\nborough_counts <- nyc_taxi |> \n left_join(pickup) |>\n left_join(dropoff) |>\n count(pickup_borough, dropoff_borough) |>\n arrange(desc(n)) |>\n collect()\ntoc()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n151.228 sec elapsed\n```\n:::\n:::\n\n\n
<br>\n\n2-3 minutes to join twice and cross-tabulate on non-partition variables, with 1.15 billion rows of data 🙂\n\n## The results\n\n\n::: {.cell}\n\n```{.r .cell-code}\nborough_counts\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 50 × 3\n pickup_borough dropoff_borough n\n <chr> <chr> <int>\n 1 <NA> <NA> 732357953\n 2 Manhattan Manhattan 351198872\n 3 Queens Manhattan 14440705\n 4 Manhattan Queens 13052517\n 5 Manhattan Brooklyn 11180867\n 6 Queens Queens 7440356\n 7 Unknown Unknown 4491811\n 8 Queens Brooklyn 3662324\n 9 Brooklyn Brooklyn 3550480\n10 Manhattan Bronx 2071830\n# ℹ 40 more rows\n```\n:::\n:::\n\n\n## Your Turn\n\n1. How many taxi pickups were recorded in 2019 from the three major airports covered by the NYC Taxis data set (JFK, LaGuardia, Newark)? (Hint: you can use `stringr::str_detect()` to help you find pickup zones with the word \"Airport\" in them)\n\n➡️ [Data Manipulation Part II Exercises Page](4_data_manipulation_2-exercises.html)\n\n## Summary\n\n- You can join Arrow Tables and Datasets to R data frames and Arrow Tables\n- The Arrow data type of join keys must always match\n\n# Window functions\n\n## What are window functions?\n\n- calculations across a \"window\" of multiple rows which relate to the current row\n- e.g. `row_number()`, `ntile()`, or calling `mutate()` after `group_by()`\n\n## Grouped summaries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year <- nyc_taxi |>\n filter(year > 2019) |>\n select(year, fare_amount)\n\nfare_by_year |>\n group_by(year) |>\n summarise(mean_fare = mean(fare_amount)) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n year mean_fare\n <int> <dbl>\n1 2020 12.7\n2 2021 13.5\n```\n:::\n:::\n\n\n## Window functions\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year |>\n group_by(year) |>\n mutate(mean_fare = mean(fare_amount)) |> \n head() |> \n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: window functions not currently supported in Arrow\nCall collect() first to pull data into R.\n```\n:::\n:::\n\n\n## Window functions - via joins\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year |>\n left_join(\n fare_by_year |>\n group_by(year) |>\n summarise(mean_fare = mean(fare_amount))\n ) |> \n arrange(desc(fare_amount)) |>\n head() |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n year fare_amount mean_fare\n <int> <dbl> <dbl>\n1 2020 998310. 12.7\n2 2021 818283. 13.5\n3 2020 671100. 12.7\n4 2020 429497. 12.7\n5 2021 398466. 13.5\n6 2020 398465. 12.7\n```\n:::\n:::\n\n\n## Window functions - via duckdb\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year |>\n group_by(year) |>\n to_duckdb() |>\n mutate(mean_fare = mean(fare_amount)) |> \n to_arrow() |>\n arrange(desc(fare_amount)) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n year fare_amount mean_fare\n <int> <dbl> <dbl>\n1 2020 998310. 12.7\n2 2021 818283. 13.5\n3 2020 671100. 12.7\n4 2020 429497. 12.7\n5 2021 398466. 13.5\n6 2020 398465. 12.7\n```\n:::\n:::\n\n\n## Your Turn\n\n1. 
How many trips in September 2019 had a longer than average distance for that month?\n\n➡️ [Data Manipulation Part II Exercises Page](4_data_manipulation_2-exercises.html)\n\n## Summary\n\n- Window functions in arrow can be achieved via joins or passing data to and from duckdb\n", + "markdown": "---\nfooter: \"[🔗 posit.io/arrow](https://posit-conf-2023.github.io/arrow)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: source\n---\n\n\n# Data Manipulation---Part 2 {#data-manip-2}\n\n\n::: {.cell}\n\n:::\n\n\n## What if a function binding doesn't exist - revisited!\n\n- Option 1 - find a workaround\n- Option 2 - user-defined functions (UDFs)\n\n## Why use a UDF?\n\nImplement your own custom functions!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntime_diff_minutes <- function(pickup, dropoff){\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n}\n\nnyc_taxi |>\n mutate(\n duration_minutes = time_diff_minutes(pickup_datetime, dropoff_datetime)\n ) |> \n select(pickup_datetime, dropoff_datetime, duration_minutes) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: Expression time_diff_minutes(pickup_datetime, dropoff_datetime) not supported in Arrow\nCall collect() first to pull data into R.\n```\n:::\n:::\n\n\nWe get an error as we can't automatically convert the function to arrow.\n\n# User-defined functions (aka UDFs)\n\n- Define your own functions\n- Scalar functions - 1 row input and 1 row output\n\n\n## User-defined functions - definition\n\n\n::: {.cell}\n\n```{.r .cell-code}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\nThis looks complicated, so let's look at it 1 part at a time!\n\n## User-defined functions - definition\n\nStep 1. Give the function a name\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"2\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 2. Define the body of the function - first argument *must* be `context`\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"3,4,5,6,7\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 3. 
Set the schema of the input arguments\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"8,9,10,11\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 4. Set the data type of the output\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"12\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - definition\n\nStep 5. Set `auto_convert = TRUE` if using in a dplyr pipeline\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"13\"}\nregister_scalar_function(\n name = \"time_diff_minutes\",\n function(context, pickup, dropoff) {\n difftime(dropoff, pickup, units = \"mins\") |>\n round() |>\n as.integer()\n },\n in_type = schema(\n pickup = timestamp(unit = \"ms\"),\n dropoff = timestamp(unit = \"ms\")\n ),\n out_type = int32(),\n auto_convert = TRUE\n)\n```\n:::\n\n\n## User-defined functions - usage\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n mutate(\n duration_minutes = time_diff_minutes(pickup_datetime, dropoff_datetime)\n ) |>\n select(pickup_datetime, dropoff_datetime, duration_minutes) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n pickup_datetime dropoff_datetime duration_minutes\n <dttm> <dttm> <int>\n1 2012-11-02 19:47:00 2012-11-02 20:16:00 29\n2 2012-11-02 19:47:07 2012-11-02 19:53:32 6\n3 2012-11-02 19:47:13 2012-11-02 19:53:31 6\n4 2012-11-02 19:47:35 2012-11-02 19:52:40 5\n5 2012-11-02 19:47:51 2012-11-02 20:00:19 12\n6 2012-11-02 19:48:00 2012-11-02 19:51:00 3\n```\n:::\n:::\n\n\n## Your Turn\n\n1. Write a user-defined function which wraps the `stringr` function `str_replace_na()`, and use it to replace any `NA` values in the `vendor_name` column with the string \"No vendor\" instead. 
(Test it on the data from 2019 so you're not pulling everything into memory)\n\n➡️ [Data Manipulation Part II Exercises Page](4_data_manipulation_2-exercises.html)\n\n## Summary\n\n- You can use UDFs to create your own bindings when they don't exist\n- UDFs must be scalar (1 row in -\\> 1 row out) and stateless (no knowledge of other rows of data)\n- Calculations done by R not Arrow, so slower than in-built bindings but still pretty fast\n\n# Joins\n\n## Joins\n\n![](images/joins.png)\n\n## Joining a reference table\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvendors <- tibble::tibble(\n code = c(\"VTS\", \"CMT\", \"DDS\"),\n full_name = c(\n \"Verifone Transportation Systems\",\n \"Creative Mobile Technologies\",\n \"Digital Dispatch Systems\"\n )\n)\n\nnyc_taxi |>\n left_join(vendors, by = c(\"vendor_name\" = \"code\")) |>\n select(vendor_name, full_name, pickup_datetime) |>\n head(3) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n vendor_name full_name pickup_datetime \n <chr> <chr> <dttm> \n1 CMT Creative Mobile Technologies 2012-01-20 04:32:03\n2 CMT Creative Mobile Technologies 2012-01-20 04:33:16\n3 CMT Creative Mobile Technologies 2012-01-20 04:32:38\n```\n:::\n:::\n\n\n## Traps for the unwary\n\nQuestion: which are the most common borough-to-borough journeys in the dataset?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones <- \n read_csv_arrow(here::here(\"data/taxi_zone_lookup.csv\")) |>\n select(location_id = LocationID,\n borough = Borough)\n\nnyc_taxi_zones\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 265 × 2\n location_id borough \n <int> <chr> \n 1 1 EWR \n 2 2 Queens \n 3 3 Bronx \n 4 4 Manhattan \n 5 5 Staten Island\n 6 6 Staten Island\n 7 7 Queens \n 8 8 Queens \n 9 9 Queens \n10 10 Queens \n# ℹ 255 more rows\n```\n:::\n:::\n\n\n## Why didn't this work?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n left_join(nyc_taxi_zones, by = c(\"pickup_location_id\" = \"location_id\")) |>\n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `compute.arrow_dplyr_query()`:\n! 
Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(pickup_location_id) of type int64 and FieldRef.Name(location_id) of type int32\n```\n:::\n:::\n\n\n## Schema for the `nyc_taxi` Dataset\n\n\n::: {.cell}\n\n```{.r .cell-code}\nschema(nyc_taxi)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nSchema\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\n```\n:::\n:::\n\n\n## Schema for the `nyc_taxi_zones` Table\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones_arrow <- arrow_table(nyc_taxi_zones)\nschema(nyc_taxi_zones_arrow)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nSchema\nlocation_id: int32\nborough: string\n```\n:::\n:::\n\n\n- `pickup_location_id` is int64 in the `nyc_taxi` table\n- `location_id` is int32 in the `nyc_taxi_zones` table\n\n## Take control of the schema\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones_arrow <- arrow_table(\n nyc_taxi_zones, \n schema = schema(location_id = int64(), borough = utf8())\n)\n```\n:::\n\n\n- `schema()` takes variable name / types as input\n- arrow has various \"type\" functions: `int64()`, `utf8()`, `boolean()`, `date32()` etc\n\n## Take control of the schema\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi_zones_arrow <- arrow_table(\n nyc_taxi_zones, \n schema = schema(location_id = int64(), borough = utf8())\n)\nschema(nyc_taxi_zones_arrow)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nSchema\nlocation_id: int64\nborough: string\n```\n:::\n:::\n\n\n## Prepare the auxiliary tables\n\n\n::: {.cell}\n\n```{.r .cell-code}\npickup <- nyc_taxi_zones_arrow |>\n select(pickup_location_id = location_id,\n pickup_borough = borough)\n\ndropoff <- nyc_taxi_zones_arrow |>\n select(dropoff_location_id = location_id,\n dropoff_borough = borough)\n```\n:::\n\n\n- Join separately for the pickup and dropoff zones\n\n\n## Join and cross-tabulate\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tictoc)\n\ntic()\nborough_counts <- nyc_taxi |> \n left_join(pickup) |>\n left_join(dropoff) |>\n count(pickup_borough, dropoff_borough) |>\n arrange(desc(n)) |>\n collect()\ntoc()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n108.192 sec elapsed\n```\n:::\n:::\n\n\n
<br>\n\n2-3 minutes to join twice and cross-tabulate on non-partition variables, with 1.15 billion rows of data 🙂\n\n## The results\n\n\n::: {.cell}\n\n```{.r .cell-code}\nborough_counts\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 50 × 3\n pickup_borough dropoff_borough n\n <chr> <chr> <int>\n 1 <NA> <NA> 732357953\n 2 Manhattan Manhattan 351198872\n 3 Queens Manhattan 14440705\n 4 Manhattan Queens 13052517\n 5 Manhattan Brooklyn 11180867\n 6 Queens Queens 7440356\n 7 Unknown Unknown 4491811\n 8 Queens Brooklyn 3662324\n 9 Brooklyn Brooklyn 3550480\n10 Manhattan Bronx 2071830\n# ℹ 40 more rows\n```\n:::\n:::\n\n\n## Your Turn\n\n1. How many taxi pickups were recorded in 2019 from the three major airports covered by the NYC Taxis data set (JFK, LaGuardia, Newark)? (Hint: you can use `stringr::str_detect()` to help you find pickup zones with the word \"Airport\" in them)\n\n➡️ [Data Manipulation Part II Exercises Page](4_data_manipulation_2-exercises.html)\n\n## Summary\n\n- You can join Arrow Tables and Datasets to R data frames and Arrow Tables\n- The Arrow data type of join keys must always match\n\n# Window functions\n\n## What are window functions?\n\n- calculations across a \"window\" of multiple rows which relate to the current row\n- e.g. `row_number()`, `ntile()`, or calling `mutate()` after `group_by()`\n\n## Grouped summaries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year <- nyc_taxi |>\n filter(year > 2019) |>\n select(year, fare_amount)\n\nfare_by_year |>\n group_by(year) |>\n summarise(mean_fare = mean(fare_amount)) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n year mean_fare\n <int> <dbl>\n1 2020 12.7\n2 2021 13.5\n```\n:::\n:::\n\n\n## Window functions\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year |>\n group_by(year) |>\n mutate(mean_fare = mean(fare_amount)) |> \n head() |> \n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: window functions not currently supported in Arrow\nCall collect() first to pull data into R.\n```\n:::\n:::\n\n\n## Window functions - via joins\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year |>\n left_join(\n fare_by_year |>\n group_by(year) |>\n summarise(mean_fare = mean(fare_amount))\n ) |> \n arrange(desc(fare_amount)) |>\n head() |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n year fare_amount mean_fare\n <int> <dbl> <dbl>\n1 2020 998310. 12.7\n2 2021 818283. 13.5\n3 2020 671100. 12.7\n4 2020 429497. 12.7\n5 2021 398466. 13.5\n6 2020 398465. 12.7\n```\n:::\n:::\n\n\n## Window functions - via duckdb\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfare_by_year |>\n group_by(year) |>\n to_duckdb() |>\n mutate(mean_fare = mean(fare_amount)) |> \n to_arrow() |>\n arrange(desc(fare_amount)) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n year fare_amount mean_fare\n <int> <dbl> <dbl>\n1 2020 998310. 12.7\n2 2021 818283. 13.5\n3 2020 671100. 12.7\n4 2020 429497. 12.7\n5 2021 398466. 13.5\n6 2020 398465. 12.7\n```\n:::\n:::\n\n\n## Your Turn\n\n1. How many trips in September 2019 had a longer than average distance for that month?\n\n➡️ [Data Manipulation Part II Exercises Page](4_data_manipulation_2-exercises.html)\n\n## Summary\n\n- Window functions in arrow can be achieved via joins or passing data to and from duckdb\n", "supporting": [ "4_data_manipulation_2_files" ],