Merge pull request #33 from posit-conf-2023/irl-updates
A few fixes from the In Real Life workshop day
stephhazlitt authored Sep 20, 2023
2 parents d411912 + c22f17d commit 5599fee
Showing 5 changed files with 12 additions and 13 deletions.
@@ -1,7 +1,7 @@
{
"hash": "b5e1cbc1cd97d5404a8e61f27b29b1c9",
"hash": "bc9577166bdb00f8d925743bf438c721",
"result": {
"markdown": "---\ntitle: \"Data Engineering with Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source \n---\n\n\n# Schemas\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\"\n)\n```\n:::\n\n\n::: {#exercise-schema .callout-tip}\n# Data Types & Controlling the Schema\n\n::: panel-tabset\n## Problems\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` instead of the `<null>` interpreted by Arrow.\n\n2. Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n## Solution 1\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-schema-1_f724b866de89b0d5657421eb6e893446'}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\",\n skip = 1,\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(),\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\n```\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\",\n skip = 1,\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = utf8(),\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\n```\n:::\n\n\n## Solution 2\n\nThe number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr_11e076db4356ccb8c1472bac17b0ebbe'}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n:::\n:::\n\n\nor\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr-2_10dd993dcc5049ff1548b3e2b15e01f3'}\n\n```{.r .cell-code}\nseattle_csv |> \n count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 18 × 2\n CheckoutYear n\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n:::\n:::\n\n\nTiming the query:\n\n\n::: {.cell 
hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr-timed_1f17f2738b0ea5175f9f30b8824ab034'}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 10.853 1.198 10.561 \n```\n:::\n:::\n\n\nQuerying 42 million rows of data stored in a CSV on disk in \\~10 seconds, not too bad.\n:::\n:::\n\n# Parquet\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- here::here(\"data/seattle-library-checkouts-parquet\")\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n::: {#exercise-dataset .callout-tip}\n# Parquet\n\n::: panel-tabset\n## Problem\n\n1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time?\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- here::here(\"data/seattle-library-checkouts-parquet\")\n\nopen_dataset(seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 2.235 0.475 0.625 \n```\n:::\n:::\n\n\nA *much* faster compute time for the query when the on-disk data is stored in the Parquet format.\n:::\n:::\n\n# Partitioning\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- here::here(\"data/seattle-library-checkouts\")\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n::: callout-tip\n# Partitioning\n\n::: panel-tabset\n## Problems\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. 
Did you find a difference in compute time?\n\n## Solution 1\n\nWriting the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_checkouttype <- here::here(\"data/seattle-library-checkouts-type\")\n\nseattle_csv |>\n group_by(CheckoutType) |>\n write_dataset(path = seattle_checkouttype,\n format = \"parquet\")\n```\n:::\n\n\n## Solution 2\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(here::here(\"data/seattle-library-checkouts-type\")) |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n group_by(CheckoutYear) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 0.890 0.088 0.326 \n```\n:::\n:::\n\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(here::here(\"data/seattle-library-checkouts\")) |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n group_by(CheckoutYear) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 0.046 0.006 0.036 \n```\n:::\n:::\n\n\nFaster compute time because the `filter()` call is based on the partitions.\n:::\n:::\n",
"markdown": "---\ntitle: \"Data Engineering with Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source \n---\n\n\n# Schemas\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\"\n)\n```\n:::\n\n\n::: {#exercise-schema .callout-tip}\n# Data Types & Controlling the Schema\n\n::: panel-tabset\n## Problems\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` instead of the `<null>` interpreted by Arrow.\n\n2. Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n## Solution 1\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-schema-1_f724b866de89b0d5657421eb6e893446'}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\",\n skip = 1,\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(),\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\n```\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(here::here(\"data/seattle-library-checkouts.csv\"),\n format = \"csv\",\n skip = 1,\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = utf8(),\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\n```\n:::\n\n\n## Solution 2\n\nThe number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr_11e076db4356ccb8c1472bac17b0ebbe'}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n:::\n:::\n\n\nor\n\n\n::: {.cell hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr-2_10dd993dcc5049ff1548b3e2b15e01f3'}\n\n```{.r .cell-code}\nseattle_csv |> \n count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 18 × 2\n CheckoutYear n\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n:::\n:::\n\n\nTiming the query:\n\n\n::: {.cell 
hash='3_data_engineering-exercises_cache/html/seattle-csv-dplyr-timed_1f17f2738b0ea5175f9f30b8824ab034'}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 10.853 1.198 10.561 \n```\n:::\n:::\n\n\nQuerying 42 million rows of data stored in a CSV on disk in \\~10 seconds, not too bad.\n:::\n:::\n\n# Parquet\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- here::here(\"data/seattle-library-checkouts-parquet\")\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n::: {#exercise-dataset .callout-tip}\n# Parquet\n\n::: panel-tabset\n## Problem\n\n1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time?\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- here::here(\"data/seattle-library-checkouts-parquet\")\n\nopen_dataset(seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 2.238 0.452 0.696 \n```\n:::\n:::\n\n\nA *much* faster compute time for the query when the on-disk data is stored in the Parquet format.\n:::\n:::\n\n# Partitioning\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- here::here(\"data/seattle-library-checkouts\")\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n::: callout-tip\n# Partitioning\n\n::: panel-tabset\n## Problems\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. 
Did you find a difference in compute time?\n\n## Solution 1\n\nWriting the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_checkouttype <- here::here(\"data/seattle-library-checkouts-type\")\n\nseattle_csv |>\n group_by(CheckoutType) |>\n write_dataset(path = seattle_checkouttype,\n format = \"parquet\")\n```\n:::\n\n\n## Solution 2\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(here::here(\"data/seattle-library-checkouts-type\")) |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 0.907 0.087 0.333 \n```\n:::\n:::\n\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(here::here(\"data/seattle-library-checkouts\")) |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n user system elapsed \n 0.039 0.006 0.032 \n```\n:::\n:::\n\n\nFaster compute time because the `filter()` call is based on the partitions.\n:::\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
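For readers skimming the escaped markdown strings above: besides refreshed cache hashes and timings, the one substantive R change in this hunk is that the two partition-comparison queries drop a redundant `group_by(CheckoutYear)` before `summarise()`, since the `filter()` call already restricts the data to a single year. A minimal before/after sketch of that change, assuming the partitioned dataset from the exercises exists under `data/seattle-library-checkouts` in an RStudio/here project:

```r
library(arrow)
library(dplyr)

# Before (old revision): grouped by CheckoutYear even though the
# filter already pins the data to a single year, yielding a
# one-group grouped summary.
open_dataset(here::here("data/seattle-library-checkouts")) |>
  filter(CheckoutYear == 2019, CheckoutMonth == 9) |>
  group_by(CheckoutYear) |>
  summarise(TotalCheckouts = sum(Checkouts)) |>
  collect()

# After (new revision): the redundant group_by() is removed, so the
# query returns a single-row total for September 2019 directly.
open_dataset(here::here("data/seattle-library-checkouts")) |>
  filter(CheckoutYear == 2019, CheckoutMonth == 9) |>
  summarise(TotalCheckouts = sum(Checkouts)) |>
  collect()
```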
