Merge pull request #28 from posit-conf-2023/cloud

Add posit cloud link in site navbar
posit-conf-2023 · Sep 16, 2023 · edf1fd0
2 parents e7516a3 + e9d8748
Showing 6 changed files with 27 additions and 17 deletions.
diff --git a/_freeze/materials/1_hello_arrow/execute-results/html.json b/_freeze/materials/1_hello_arrow/execute-results/html.json
@@ -1,10 +1,8 @@
 {
-  "hash": "dc12398a95be5b40682dd9d32ed7efa3",
+  "hash": "1b414bf37e73746adfb50ccd20403b13",
   "result": {
-    "markdown": "---\nfooter: \"[🔗 posit.io/arrow](https://posit-conf-2023.github.io/arrow)\"\nlogo: \"images/logo.png\"\nexecute:\n  echo: true\nformat:\n  revealjs: \n    theme: default\nengine: knitr\neditor: source\n---\n\n\n# Hello Arrow {#hello-arrow}\n\n## Poll: Arrow\n\n<br>\n\nHave you used or experimented with Arrow before today?\n\n- 1️⃣ No yet\n- 2️⃣ Not yet, but I have read about it!\n- 3️⃣ A little\n- 4️⃣ A lot\n\n\n## Hello Arrow<br>Demo\n\n<br>\n\n![](images/logo.png){.absolute top=\"0\" left=\"250\" width=\"600\" height=\"800\"}\n\n## Some \"Big\" Data\n\n![](images/nyc-taxi-homepage.png){.absolute left=\"200\" width=\"600\"}\n\n::: {style=\"font-size: 60%; margin-top: 550px; margin-left: 200px;\"}\n<https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page>\n:::\n\n## NYC Taxi Data\n\n-   *big* NYC Taxi data set (\\~40GBs on disk)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(\"s3://voltrondata-labs-datasets/nyc-taxi\") |>\n  filter(year %in% 2012:2021) |>\n  write_dataset(here::here(\"data/nyc-taxi\"), partitioning = c(\"year\", \"month\"))\n```\n:::\n\n\n-   *tiny* NYC Taxi data set (\\<1GB on disk)\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload.file(url = \"https://github.com/posit-conf-2023/arrow/releases/download/v0.1/nyc-taxi-tiny.zip\",\n              destfile = here::here(\"data/nyc-taxi-tiny.zip\"))\n\nunzip(\n  zipfile = here::here(\"data/nyc-taxi-tiny.zip\"),\n  exdir = here::here(\"data/\")\n)\n```\n:::\n\n\n## Posit Cloud ☁️\n\n - Join the cloud workspace via URL in the workshop Discord channel \n - You will need to create a (free) Posit Cloud account\n\n![](images/posit-cloud.png){.absolute left=\"200\" width=\"600\"}\n\n## Larger-Than-Memory Data\n\n<br>\n\n`arrow::open_dataset()`\n\n<br>\n\n::: notes\nArrow Datasets allow you to query against data that has been split across multiple files. 
This division of data into multiple files may indicate partitioning, which can accelerate queries that only touch some partitions (files). Call open_dataset() to point to a directory of data files and return a Dataset, then use dplyr methods to query it.\n:::\n\n## NYC Taxi Dataset\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\n\nnyc_taxi <- open_dataset(here::here(\"data/nyc-taxi\"))\n```\n:::\n\n\n## NYC Taxi Dataset\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1150352666\n```\n:::\n:::\n\n\n<br>\n\n1.15 billion rows 🤯\n\n## NYC Taxi Dataset: A question\n\n<br>\n\nWhat percentage of taxi rides each year had more than 1 passenger?\n\n## NYC Taxi Dataset: A dplyr pipeline\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n\nnyc_taxi |>\n  group_by(year) |>\n  summarise(\n    all_trips = n(),\n    shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n  ) |>\n  mutate(pct_shared = shared_trips / all_trips * 100) |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 4\n    year all_trips shared_trips pct_shared\n   <int>     <int>        <int>      <dbl>\n 1  2012 178544324     53313752       29.9\n 2  2013 173179759     51215013       29.6\n 3  2014 165114361     48816505       29.6\n 4  2015 146112989     43081091       29.5\n 5  2016 131165043     38163870       29.1\n 6  2017 113495512     32296166       28.5\n 7  2018 102797401     28796633       28.0\n 8  2019  84393604     23515989       27.9\n 9  2020  24647055      5837960       23.7\n10  2021  30902618      7221844       23.4\n```\n:::\n:::\n\n\n## NYC Taxi Dataset: A dplyr pipeline\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tictoc)\n\ntic()\nnyc_taxi |>\n  group_by(year) |>\n  summarise(\n    all_trips = n(),\n    shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n  ) |>\n  mutate(pct_shared = shared_trips / all_trips * 100) |>\n  collect()\ntoc()\n```\n:::\n\n\n> 6.077 sec 
elapsed\n\n## Your Turn\n\n1.  Calculate the longest trip distance for every month in 2019\n\n2.  How long did this query take to run?\n\n➡️ [Hello Arrow Exercises Page](1_hello_arrow-exercises.html)\n\n## What is Apache Arrow?\n\n::: columns\n::: {.column width=\"50%\"}\n> A multi-language toolbox for accelerated data interchange and in-memory processing\n:::\n\n::: {.column width=\"50%\"}\n> Arrow is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another\n:::\n:::\n\n::: {style=\"font-size: 70%;\"}\n<https://arrow.apache.org/overview/>\n:::\n\n## Apache Arrow Specification\n\nIn-memory columnar format: a standardized, language-agnostic specification for representing structured, table-like data sets in-memory.\n\n<br>\n\n![](images/arrow-rectangle.png){.absolute left=\"200\"}\n\n## A Multi-Language Toolbox\n\n![](images/arrow-libraries-structure.png)\n\n## Accelerated Data Interchange\n\n![](images/data-interchange-with-arrow.png)\n\n## Accelerated In-Memory Processing\n\nArrow's Columnar Format is Fast\n\n![](images/columnar-fast.png){.absolute top=\"120\" left=\"200\" height=\"600\"}\n\n::: notes\nThe contiguous columnar layout enables vectorization using the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors.\n:::\n\n## arrow 📦\n\n<br>\n\n![](images/arrow-r-pkg.png){.absolute top=\"0\" left=\"300\" width=\"700\" height=\"900\"}\n\n## arrow 📦\n\n![](images/arrow-read-write-updated.png)\n\n## Today\n\n-   Module 1: Larger-than-memory data manipulation with Arrow---Part I\n-   Module 2: Data engineering with Arrow\n-   Module 3: Larger-than-memory data manipulation with Arrow---Part II\n-   Module 4: In-memory workflows in R with Arrow\n",
-    "supporting": [
-      "1_hello_arrow_files"
-    ],
+    "markdown": "---\nfooter: \"[🔗 posit.io/arrow](https://posit-conf-2023.github.io/arrow)\"\nlogo: \"images/logo.png\"\nexecute:\n  echo: true\nformat:\n  revealjs: \n    theme: default\nengine: knitr\neditor: source\n---\n\n\n# Hello Arrow {#hello-arrow}\n\n## Poll: Arrow\n\n<br>\n\nHave you used or experimented with Arrow before today?\n\n- 1️⃣ No yet\n- 2️⃣ Not yet, but I have read about it!\n- 3️⃣ A little\n- 4️⃣ A lot\n\n\n## Hello Arrow<br>Demo\n\n<br>\n\n![](images/logo.png){.absolute top=\"0\" left=\"250\" width=\"600\" height=\"800\"}\n\n## Some \"Big\" Data\n\n![](images/nyc-taxi-homepage.png){.absolute left=\"200\" width=\"600\"}\n\n::: {style=\"font-size: 60%; margin-top: 550px; margin-left: 200px;\"}\n<https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page>\n:::\n\n## NYC Taxi Data\n\n-   *big* NYC Taxi data set (\\~40GBs on disk)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(\"s3://voltrondata-labs-datasets/nyc-taxi\") |>\n  filter(year %in% 2012:2021) |>\n  write_dataset(here::here(\"data/nyc-taxi\"), partitioning = c(\"year\", \"month\"))\n```\n:::\n\n\n-   *tiny* NYC Taxi data set (\\<1GB on disk)\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload.file(url = \"https://github.com/posit-conf-2023/arrow/releases/download/v0.1/nyc-taxi-tiny.zip\",\n              destfile = here::here(\"data/nyc-taxi-tiny.zip\"))\n\nunzip(\n  zipfile = here::here(\"data/nyc-taxi-tiny.zip\"),\n  exdir = here::here(\"data/\")\n)\n```\n:::\n\n\n## Posit Cloud ☁️\n\n - Join the cloud workspace via URL in the workshop Discord channel \n - You will need to create a (free) Posit Cloud account\n\n![](images/posit-cloud.png){.absolute left=\"200\" width=\"600\"}\n\n## Posit Cloud ☁️\n\n - Once you have joined you can come and go\n \n![](images/pc-navbar.png){.absolute width=\"800\"}\n\n## Larger-Than-Memory Data\n\n<br>\n\n`arrow::open_dataset()`\n\n<br>\n\n::: notes\nArrow Datasets allow you to query against data that has been split across multiple files. 
This division of data into multiple files may indicate partitioning, which can accelerate queries that only touch some partitions (files). Call open_dataset() to point to a directory of data files and return a Dataset, then use dplyr methods to query it.\n:::\n\n## NYC Taxi Dataset\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\n\nnyc_taxi <- open_dataset(here::here(\"data/nyc-taxi\"))\n```\n:::\n\n\n## NYC Taxi Dataset\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1150352666\n```\n:::\n:::\n\n\n<br>\n\n1.15 billion rows 🤯\n\n## NYC Taxi Dataset: A question\n\n<br>\n\nWhat percentage of taxi rides each year had more than 1 passenger?\n\n## NYC Taxi Dataset: A dplyr pipeline\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n\nnyc_taxi |>\n  group_by(year) |>\n  summarise(\n    all_trips = n(),\n    shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n  ) |>\n  mutate(pct_shared = shared_trips / all_trips * 100) |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 4\n    year all_trips shared_trips pct_shared\n   <int>     <int>        <int>      <dbl>\n 1  2012 178544324     53313752       29.9\n 2  2013 173179759     51215013       29.6\n 3  2014 165114361     48816505       29.6\n 4  2015 146112989     43081091       29.5\n 5  2016 131165043     38163870       29.1\n 6  2017 113495512     32296166       28.5\n 7  2018 102797401     28796633       28.0\n 8  2019  84393604     23515989       27.9\n 9  2020  24647055      5837960       23.7\n10  2021  30902618      7221844       23.4\n```\n:::\n:::\n\n\n## NYC Taxi Dataset: A dplyr pipeline\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tictoc)\n\ntic()\nnyc_taxi |>\n  group_by(year) |>\n  summarise(\n    all_trips = n(),\n    shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n  ) |>\n  mutate(pct_shared = shared_trips / all_trips * 100) |>\n  collect()\ntoc()\n```\n:::\n\n\n> 6.077 sec 
elapsed\n\n## Your Turn\n\n1.  Calculate the longest trip distance for every month in 2019\n\n2.  How long did this query take to run?\n\n➡️ [Hello Arrow Exercises Page](1_hello_arrow-exercises.html)\n\n## What is Apache Arrow?\n\n::: columns\n::: {.column width=\"50%\"}\n> A multi-language toolbox for accelerated data interchange and in-memory processing\n:::\n\n::: {.column width=\"50%\"}\n> Arrow is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another\n:::\n:::\n\n::: {style=\"font-size: 70%;\"}\n<https://arrow.apache.org/overview/>\n:::\n\n## Apache Arrow Specification\n\nIn-memory columnar format: a standardized, language-agnostic specification for representing structured, table-like data sets in-memory.\n\n<br>\n\n![](images/arrow-rectangle.png){.absolute left=\"200\"}\n\n## A Multi-Language Toolbox\n\n![](images/arrow-libraries-structure.png)\n\n## Accelerated Data Interchange\n\n![](images/data-interchange-with-arrow.png)\n\n## Accelerated In-Memory Processing\n\nArrow's Columnar Format is Fast\n\n![](images/columnar-fast.png){.absolute top=\"120\" left=\"200\" height=\"600\"}\n\n::: notes\nThe contiguous columnar layout enables vectorization using the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors.\n:::\n\n## arrow 📦\n\n<br>\n\n![](images/arrow-r-pkg.png){.absolute top=\"0\" left=\"300\" width=\"700\" height=\"900\"}\n\n## arrow 📦\n\n![](images/arrow-read-write-updated.png)\n\n## Today\n\n-   Module 1: Larger-than-memory data manipulation with Arrow---Part I\n-   Module 2: Data engineering with Arrow\n-   Module 3: Larger-than-memory data manipulation with Arrow---Part II\n-   Module 4: In-memory workflows in R with Arrow\n",
+    "supporting": [],
     "filters": [
       "rmarkdown/pagebreak.lua"
     ],