diff --git a/.nojekyll b/.nojekyll index 7ddc7b0..5596367 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -0da64790 \ No newline at end of file +4fa3a878 \ No newline at end of file diff --git a/search.json b/search.json index 669457b..2af84ca 100644 --- a/search.json +++ b/search.json @@ -11,21 +11,21 @@ "href": "setup.html#larger-than-memory-data-option", "title": "Packages & Data", "section": "Larger-Than-Memory Data Option", - "text": "Larger-Than-Memory Data Option\n\n1. NYC Taxi Data\nThis is the main data set we will need for the day. It’s pretty hefty—40 GB in total—and there are a couple of options for how to acquire it, depending on your internet connection speed.\n\nOption 1—the simplest option—for those with a good internet connection and happy to let things run\nIf you have a solid internet connection, and especially if you’re in the US/Canada, this option is the simplest. You can use arrow itself to download the data. Note that there are no progress bars displayed during download, and so your session will appear to hang, but you can check progress by inspecting the contents of the download directory. When we tested this with Steph’s laptop and a fast internet connection, it took 67 minutes, though results will likely vary widely.\nAfter installing arrow, run the following code:\n\nlibrary(arrow)\nlibrary(dplyr)\n\ndata_path <- here::here(\"data/nyc-taxi\") # Or set your own preferred path\n\nopen_dataset(\"s3://voltrondata-labs-datasets/nyc-taxi\") |>\n filter(year %in% 2012:2021) |> \n write_dataset(data_path, partitioning = c(\"year\", \"month\"))\n\nOnce this has completed, you can check everything has downloaded correctly by calling:\n\nopen_dataset(data_path) |>\n nrow()\n\nIt might take a moment to run (the data has over a billion rows!), but you should expect to see:\n[1] 1150352666\nIf you get an error message, your download may have been interrupted at some point. The error message will name the file which could not be read. Manually delete this file and run the nrow() code snippet again until you successfully load the remaining data. You can then download any missing files individually using option 2.\n\n\nOption 2—one file at a time via https\nIf you have a slower internet connection or are further away from the data S3 bucket location, it’s probably going to be simpler to download the data file-by-file. Or, if you had any interruptions to your download process in the previous step, you can either try instead with this method, or delete the files which weren’t downloaded properly, and use this method to just download the files you need.\nWe’ve created a script for you which downloads the data one file at a time via https. 
The script also checks for previously downloaded data, so if you encounter problems downloading any files, just delete the partially downloaded file and run again—the script will only download files which are missing.\n\ndownload_via_https <- function(data_dir, years = 2012:2021){\n\n # Set this option as we'll be downloading large files and R has a default\n # timeout of 60 seconds, so we've updated this to 30 mins\n options(timeout = 1800)\n \n # The S3 bucket where the data is stored\n bucket <- \"https://voltrondata-labs-datasets.s3.us-east-2.amazonaws.com\"\n \n # Collect any errors raised during the download process\n problems <- c()\n \n # Download the data from S3 - loops through the data files, downloading 1 file at a time\n for (year in years) {\n \n # We only have 2 months for 2022 data\n if(year ==2022){\n months = 1:2\n } else {\n months = 1:12\n }\n \n for (month in months) {\n \n # Work out where we're going to be saving the data\n partition_dir <- paste0(\"year=\", year, \"/month=\", month)\n dest_dir <- file.path(data_dir, partition_dir)\n dest_file_path <- file.path(dest_dir, \"part-0.parquet\")\n \n # If the file doesn't exist\n if (!file.exists(dest_file_path)) {\n \n # Create the partition to store the data in\n if(!dir.exists(dest_dir)){\n dir.create(dest_dir, recursive = TRUE)\n }\n \n # Work out where we are going to be retrieving the data from\n source_path <- file.path(bucket, \"nyc-taxi\", partition_dir, \"part-0.parquet\")\n \n # Download the data - save any error messages that occur\n tryCatch(\n download.file(source_path, dest_file_path, mode = \"wb\"),\n error = function(e){\n problems <- c(problems, e$message)\n }\n )\n }\n }\n }\n \n print(\"Downloads complete\")\n \n if(length(problems) > 0){\n warning(call. = FALSE, \"The following errors occurred during download:\\n\", paste(problems, collapse = \"\\n\"))\n }\n}\n\n\ndata_path <- here::here(\"data/nyc-taxi\") # Or set your own preferred path\n\ndownload_via_https(data_path)\n\nOnce this has completed, you can check everything has downloaded correctly by calling:\n\nopen_dataset(data_path) |>\n nrow()\n\nIt might take a moment to run (the data has over a billion rows), but you should expect to see:\n[1] 1150352666\nIf you get an error message, your download may have been interrupted at some point. The error message will name the file which could not be read. Manually delete this file and run the nrow() code snippet again until you successfully load the data. You can then download any missing files by re-running download_via_https(data_path).\n\n\n\n2. Seattle Checkouts by Title Data\nThis is the data we will use to explore some data storage and engineering options. It’s a good sized, single CSV file—9GB on-disk in total, which can be downloaded from the an AWS S3 bucket via https:\n\ndownload.file(\n url = \"https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv\",\n destfile = here::here(\"data/seattle-library-checkouts.csv\")\n)" + "text": "Larger-Than-Memory Data Option\n\n1. NYC Taxi Data\nThis is the main data set we will need for the day. It’s pretty hefty—40 GB in total—and there are a couple of options for how to acquire it, depending on your internet connection speed.\n\nOption 1—the simplest option—for those with a good internet connection and happy to let things run\nIf you have a solid internet connection, and especially if you’re in the US/Canada, this option is the simplest. You can use arrow itself to download the data. 
Note that there are no progress bars displayed during download, and so your session will appear to hang, but you can check progress by inspecting the contents of the download directory. When we tested this with Steph’s laptop and a fast internet connection, it took 67 minutes, though results will likely vary widely.\nAfter installing arrow, run the following code:\n\nlibrary(arrow)\nlibrary(dplyr)\n\ndata_path <- here::here(\"data/nyc-taxi\") # Or set your own preferred path\n\nopen_dataset(\"s3://voltrondata-labs-datasets/nyc-taxi\") |>\n filter(year %in% 2012:2021) |> \n write_dataset(data_path, partitioning = c(\"year\", \"month\"))\n\nOnce this has completed, you can check everything has downloaded correctly by calling:\n\nopen_dataset(data_path) |>\n nrow()\n\nIt might take a moment to run (the data has over a billion rows!), but you should expect to see:\n[1] 1150352666\nIf you get an error message, your download may have been interrupted at some point. The error message will name the file which could not be read. Manually delete this file and run the nrow() code snippet again until you successfully load the remaining data. You can then download any missing files individually using option 2.\n\n\nOption 2—one file at a time via https\nIf you have a slower internet connection or are further away from the data S3 bucket location, it’s probably going to be simpler to download the data file-by-file. Or, if you had any interruptions to your download process in the previous step, you can either try instead with this method, or delete the files which weren’t downloaded properly, and use this method to just download the files you need.\nWe’ve created a script for you which downloads the data one file at a time via https. The script also checks for previously downloaded data, so if you encounter problems downloading any files, just delete the partially downloaded file and run again—the script will only download files which are missing.\n\ndownload_via_https <- function(data_dir, years = 2012:2021){\n\n # Set this option as we'll be downloading large files and R has a default\n # timeout of 60 seconds, so we've updated this to 30 mins\n options(timeout = 1800)\n \n # The S3 bucket where the data is stored\n bucket <- \"https://voltrondata-labs-datasets.s3.us-east-2.amazonaws.com\"\n \n # Collect any errors raised during the download process\n problems <- c()\n \n # Download the data from S3 - loops through the data files, downloading 1 file at a time\n for (year in years) {\n \n # We only have 2 months for 2022 data\n if(year ==2022){\n months = 1:2\n } else {\n months = 1:12\n }\n \n for (month in months) {\n \n # Work out where we're going to be saving the data\n partition_dir <- paste0(\"year=\", year, \"/month=\", month)\n dest_dir <- file.path(data_dir, partition_dir)\n dest_file_path <- file.path(dest_dir, \"part-0.parquet\")\n \n # If the file doesn't exist\n if (!file.exists(dest_file_path)) {\n \n # Create the partition to store the data in\n if(!dir.exists(dest_dir)){\n dir.create(dest_dir, recursive = TRUE)\n }\n \n # Work out where we are going to be retrieving the data from\n source_path <- file.path(bucket, \"nyc-taxi\", partition_dir, \"part-0.parquet\")\n \n # Download the data - save any error messages that occur\n tryCatch(\n download.file(source_path, dest_file_path, mode = \"wb\"),\n error = function(e){\n problems <- c(problems, e$message)\n }\n )\n }\n }\n }\n \n print(\"Downloads complete\")\n \n if(length(problems) > 0){\n warning(call. 
= FALSE, \"The following errors occurred during download:\\n\", paste(problems, collapse = \"\\n\"))\n }\n}\n\n\ndata_path <- here::here(\"data/nyc-taxi\") # Or set your own preferred path\n\ndownload_via_https(data_path)\n\nOnce this has completed, you can check everything has downloaded correctly by calling:\n\nopen_dataset(data_path) |>\n nrow()\n\nIt might take a moment to run (the data has over a billion rows), but you should expect to see:\n[1] 1150352666\nIf you get an error message, your download may have been interrupted at some point. The error message will name the file which could not be read. Manually delete this file and run the nrow() code snippet again until you successfully load the data. You can then download any missing files by re-running download_via_https(data_path).\n\n\n\n2. Seattle Checkouts by Title Data\nThis is the data we will use to explore some data storage and engineering options. It’s a good sized, single CSV file—9GB on-disk in total, which can be downloaded from the an AWS S3 bucket via https:\n\noptions(timeout = 1800)\ndownload.file(\n url = \"https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv\",\n destfile = here::here(\"data/seattle-library-checkouts.csv\")\n)" }, { "objectID": "setup.html#tiny-data-option", "href": "setup.html#tiny-data-option", "title": "Packages & Data", "section": "Tiny Data Option", - "text": "Tiny Data Option\nIf you don’t have time or disk space to download the larger-than-memory data sets (and still have disk space do the exercises), you can run the code and exercises in the course with “tiny” versions of these data. Although the focus in this course is working with larger-than-memory data, you can still learn about the concepts and workflows with smaller data—although note you may not see the same performance improvements that you would get when working with larger data.\n\n1. Tiny NYC Taxi Data\nWe’ve created a “tiny” NYC Taxi data set which contains only 1 in 1000 rows from the original data set. So instead of working with 1.15 billion rows of data and about 40GB of files, the tiny taxi data set is 1.15 million rows and about 50MB of files. You can download the tiny NYC Taxi data directly from this repo via https:\n\ndownload.file(\n url = \"https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/nyc-taxi-tiny.zip\",\n destfile = here::here(\"data/nyc-taxi-tiny.zip\")\n)\n\n# Extract the partitioned parquet files from the zip folder:\nunzip(\n zipfile = here::here(\"data/nyc-taxi-tiny.zip\"), \n exdir = here::here(\"data/\")\n)\n\n\n\n2. Tiny Seattle Checkouts by Title Data\nWe’ve created a “tiny” Seattle Checkouts by Title data set which contains only 1 in 100 rows from the original data set. So instead of working with ~41 million rows of data in a 9GB file, the tiny Seattle checkouts data set is ~410 thousand rows and in an 90MB file. You can download the tiny Seattle Checkouts by Title data directly from this repo via https:\n\ndownload.file(\n url = \"https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/seattle-library-checkouts-tiny.csv\",\n destfile = here::here(\"data/seattle-library-checkouts-tiny.csv\")\n)" + "text": "Tiny Data Option\nIf you don’t have time or disk space to download the larger-than-memory data sets (and still have disk space do the exercises), you can run the code and exercises in the course with “tiny” versions of these data. 
Although the focus in this course is working with larger-than-memory data, you can still learn about the concepts and workflows with smaller data—although note you may not see the same performance improvements that you would get when working with larger data.\n\n1. Tiny NYC Taxi Data\nWe’ve created a “tiny” NYC Taxi data set which contains only 1 in 1000 rows from the original data set. So instead of working with 1.15 billion rows of data and about 40GB of files, the tiny taxi data set is 1.15 million rows and about 50MB of files. You can download the tiny NYC Taxi data directly from this repo via https:\n\noptions(timeout = 1800)\ndownload.file(\n url = \"https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/nyc-taxi-tiny.zip\",\n destfile = here::here(\"data/nyc-taxi-tiny.zip\")\n)\n\n# Extract the partitioned parquet files from the zip folder:\nunzip(\n zipfile = here::here(\"data/nyc-taxi-tiny.zip\"), \n exdir = here::here(\"data/\")\n)\n\n\n\n2. Tiny Seattle Checkouts by Title Data\nWe’ve created a “tiny” Seattle Checkouts by Title data set which contains only 1 in 100 rows from the original data set. So instead of working with ~41 million rows of data in a 9GB file, the tiny Seattle checkouts data set is ~410 thousand rows and in an 90MB file. You can download the tiny Seattle Checkouts by Title data directly from this repo via https:\n\noptions(timeout = 1800)\ndownload.file(\n url = \"https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/seattle-library-checkouts-tiny.csv\",\n destfile = here::here(\"data/seattle-library-checkouts-tiny.csv\")\n)" }, { "objectID": "setup.html#both-data-options-everyone", "href": "setup.html#both-data-options-everyone", "title": "Packages & Data", "section": "Both Data Options / Everyone", - "text": "Both Data Options / Everyone\n\n3. Taxi Zone Lookup CSV Table & Taxi Zone Shapefile\nYou can download the two NYC Taxi trip ancillary data files directly from this repo via https:\n\ndownload.file(\n url = \"https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/taxi_zone_lookup.csv\",\n destfile = here::here(\"data/taxi_zone_lookup.csv\")\n)\n\ndownload.file(\n url = \"https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/taxi_zones.zip\",\n destfile = here::here(\"data/taxi_zones.zip\")\n)\n\n# Extract the spatial files from the zip folder:\nunzip(\n zipfile = here::here(\"data/taxi_zones.zip\"), \n exdir = here::here(\"data/taxi_zones\")\n)" + "text": "Both Data Options / Everyone\n\n3. Taxi Zone Lookup CSV Table & Taxi Zone Shapefile\nYou can download the two NYC Taxi trip ancillary data files directly from this repo via https:\n\noptions(timeout = 1800)\ndownload.file(\n url = \"https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/taxi_zone_lookup.csv\",\n destfile = here::here(\"data/taxi_zone_lookup.csv\")\n)\n\ndownload.file(\n url = \"https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/taxi_zones.zip\",\n destfile = here::here(\"data/taxi_zones.zip\")\n)\n\n# Extract the spatial files from the zip folder:\nunzip(\n zipfile = here::here(\"data/taxi_zones.zip\"), \n exdir = here::here(\"data/taxi_zones\")\n)" }, { "objectID": "setup.html#data-on-the-day-of", diff --git a/setup.html b/setup.html index c8ea3aa..589f253 100644 --- a/setup.html +++ b/setup.html @@ -411,10 +411,11 @@

Optio

2. Seattle Checkouts by Title Data

This is the data we will use to explore some data storage and engineering options. It’s a good-sized, single CSV file—9GB on-disk in total, which can be downloaded from an AWS S3 bucket via https:

-download.file(
-  url = "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
-  destfile = here::here("data/seattle-library-checkouts.csv")
-)
+options(timeout = 1800)
+download.file(
+  url = "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
+  destfile = here::here("data/seattle-library-checkouts.csv")
+)
@@ -425,26 +426,28 @@

Tiny Data Option

1. Tiny NYC Taxi Data

We’ve created a “tiny” NYC Taxi data set which contains only 1 in 1000 rows from the original data set. So instead of working with 1.15 billion rows of data and about 40GB of files, the tiny taxi data set is 1.15 million rows and about 50MB of files. You can download the tiny NYC Taxi data directly from this repo via https:

-download.file(
-  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/nyc-taxi-tiny.zip",
-  destfile = here::here("data/nyc-taxi-tiny.zip")
-)
-
-# Extract the partitioned parquet files from the zip folder:
-unzip(
-  zipfile = here::here("data/nyc-taxi-tiny.zip"), 
-  exdir = here::here("data/")
-)
+options(timeout = 1800)
+download.file(
+  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/nyc-taxi-tiny.zip",
+  destfile = here::here("data/nyc-taxi-tiny.zip")
+)
+
+# Extract the partitioned parquet files from the zip folder:
+unzip(
+  zipfile = here::here("data/nyc-taxi-tiny.zip"), 
+  exdir = here::here("data/")
+)

2. Tiny Seattle Checkouts by Title Data

We’ve created a “tiny” Seattle Checkouts by Title data set which contains only 1 in 100 rows from the original data set. So instead of working with ~41 million rows of data in a 9GB file, the tiny Seattle checkouts data set is ~410 thousand rows in a 90MB file. You can download the tiny Seattle Checkouts by Title data directly from this repo via https:

-download.file(
-  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/seattle-library-checkouts-tiny.csv",
-  destfile = here::here("data/seattle-library-checkouts-tiny.csv")
-)
+options(timeout = 1800)
+download.file(
+  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/seattle-library-checkouts-tiny.csv",
+  destfile = here::here("data/seattle-library-checkouts-tiny.csv")
+)
@@ -454,21 +457,22 @@

Both Data Options / Everyone

3. Taxi Zone Lookup CSV Table & Taxi Zone Shapefile

You can download the two NYC Taxi trip ancillary data files directly from this repo via https:

-download.file(
-  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/taxi_zone_lookup.csv",
-  destfile = here::here("data/taxi_zone_lookup.csv")
-)
-
-download.file(
-  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/taxi_zones.zip",
-  destfile = here::here("data/taxi_zones.zip")
-)
-
-# Extract the spatial files from the zip folder:
-unzip(
-  zipfile = here::here("data/taxi_zones.zip"), 
-  exdir = here::here("data/taxi_zones")
-)
+options(timeout = 1800)
+download.file(
+  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/taxi_zone_lookup.csv",
+  destfile = here::here("data/taxi_zone_lookup.csv")
+)
+
+download.file(
+  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/taxi_zones.zip",
+  destfile = here::here("data/taxi_zones.zip")
+)
+
+# Extract the spatial files from the zip folder:
+unzip(
+  zipfile = here::here("data/taxi_zones.zip"), 
+  exdir = here::here("data/taxi_zones")
+)
diff --git a/sitemap.xml b/sitemap.xml index b73d997..fa25764 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,66 +2,66 @@ https://github.com/posit-conf-2023/arrow/setup.html - 2023-09-02T09:59:19.327Z + 2023-09-05T17:13:28.973Z https://github.com/posit-conf-2023/arrow/materials/7_continue_learning.html - 2023-09-02T09:59:18.267Z + 2023-09-05T17:13:27.789Z https://github.com/posit-conf-2023/arrow/materials/5_arrow_single_file.html - 2023-09-02T09:59:17.567Z + 2023-09-05T17:13:26.937Z https://github.com/posit-conf-2023/arrow/materials/4_data_manipulation_2.html - 2023-09-02T09:59:13.667Z + 2023-09-05T17:13:22.689Z https://github.com/posit-conf-2023/arrow/materials/3_data_engineering.html - 2023-09-02T09:59:12.491Z + 2023-09-05T17:13:21.401Z https://github.com/posit-conf-2023/arrow/materials/2_data_manipulation_1.html - 2023-09-02T09:59:10.983Z + 2023-09-05T17:13:19.729Z https://github.com/posit-conf-2023/arrow/materials/1_hello_arrow.html - 2023-09-02T09:59:09.687Z + 2023-09-05T17:13:18.289Z https://github.com/posit-conf-2023/arrow/materials/0_housekeeping.html - 2023-09-02T09:59:07.139Z + 2023-09-05T17:13:16.593Z https://github.com/posit-conf-2023/arrow/index.html - 2023-09-02T09:59:05.703Z + 2023-09-05T17:13:15.101Z https://github.com/posit-conf-2023/arrow/materials/1_hello_arrow-exercises.html - 2023-09-02T09:59:08.083Z + 2023-09-05T17:13:17.605Z https://github.com/posit-conf-2023/arrow/materials/2_data_manipulation_1-exercises.html - 2023-09-02T09:59:10.315Z + 2023-09-05T17:13:18.981Z https://github.com/posit-conf-2023/arrow/materials/3_data_engineering-exercises.html - 2023-09-02T09:59:11.691Z + 2023-09-05T17:13:20.525Z https://github.com/posit-conf-2023/arrow/materials/4_data_manipulation_2-exercises.html - 2023-09-02T09:59:13.079Z + 2023-09-05T17:13:22.065Z https://github.com/posit-conf-2023/arrow/materials/5_arrow_single_file-exercises.html - 2023-09-02T09:59:16.999Z + 2023-09-05T17:13:26.281Z https://github.com/posit-conf-2023/arrow/materials/6_wrapping_up.html - 2023-09-02T09:59:18.015Z + 2023-09-05T17:13:27.493Z https://github.com/posit-conf-2023/arrow/materials/8_closing.html - 2023-09-02T09:59:18.543Z + 2023-09-05T17:13:28.109Z