Commit

Built site for gh-pages
Quarto GHA Workflow Runner committed Sep 14, 2023
1 parent 094731a commit 80e830c
Showing 10 changed files with 354 additions and 313 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
c93ab8b4
409ec87a
92 changes: 47 additions & 45 deletions materials/1_hello_arrow-exercises.html
@@ -264,62 +264,50 @@ <h1 class="title">Hello Arrow Exercises</h1>
<div class="tab-content">
<div id="tabset-1-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-1-1-tab">
<ol type="1">
<li><p>Calculate the total number of rides for every month in 2019</p></li>
<li><p>About how long did this query of 1.15 billion rows take?</p></li>
<li><p>Calculate the longest trip distance for every month in 2019</p></li>
<li><p>How long did this query take to run?</p></li>
</ol>
</div>
<div id="tabset-1-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-2-tab">
<p>Total number of rides for every month in 2019:</p>
<p>Longest trip distance for every month in 2019:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb3"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>nyc_taxi <span class="sc">|&gt;</span> </span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">filter</span>(year <span class="sc">==</span> <span class="dv">2019</span>) <span class="sc">|&gt;</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> <span class="fu">count</span>(month) <span class="sc">|&gt;</span></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a> <span class="fu">collect</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> <span class="fu">group_by</span>(month) <span class="sc">|&gt;</span></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a> <span class="fu">summarize</span>(<span class="at">longest_trip =</span> <span class="fu">max</span>(trip_distance, <span class="at">na.rm =</span> <span class="cn">TRUE</span>)) <span class="sc">|&gt;</span></span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a> <span class="fu">arrange</span>(month) <span class="sc">|&gt;</span> </span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a> <span class="fu">collect</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 12 × 2
month n
&lt;int&gt; &lt;int&gt;
1 11 6877463
2 10 7213588
3 12 6895933
4 1 7667255
5 2 7018750
6 3 7832035
7 4 7432826
8 5 7564884
9 6 6940489
10 8 6072851
11 7 6310134
12 9 6567396</code></pre>
month longest_trip
&lt;int&gt; &lt;dbl&gt;
1 1 832.
2 2 702.
3 3 237.
4 4 831.
5 5 401.
6 6 45977.
7 7 312.
8 8 602.
9 9 604.
10 10 308.
11 11 701.
12 12 19130.</code></pre>
</div>
</div>
</div>
<div id="tabset-1-3" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-3-tab">
<p>Compute time for querying the 1.15 billion rows:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb5"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>nyc_taxi <span class="sc">|&gt;</span> </span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">filter</span>(year <span class="sc">==</span> <span class="dv">2019</span>) <span class="sc">|&gt;</span></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> <span class="fu">group_by</span>(month) <span class="sc">|&gt;</span></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a> <span class="fu">summarize</span>(<span class="at">longest_trip =</span> <span class="fu">max</span>(trip_distance, <span class="at">na.rm =</span> <span class="cn">TRUE</span>)) <span class="sc">|&gt;</span></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a> <span class="fu">arrange</span>(month) <span class="sc">|&gt;</span> </span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a> <span class="fu">collect</span>() <span class="sc">|&gt;</span> </span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a> <span class="fu">system.time</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> user system elapsed
2.837 0.322 0.925 </code></pre>
</div>
</div>
<p>or</p>
<p>Compute time:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb7"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(tictoc)</span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="fu">tic</span>()</span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a>nyc_taxi <span class="sc">|&gt;</span> </span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a> <span class="fu">filter</span>(year <span class="sc">==</span> <span class="dv">2019</span>) <span class="sc">|&gt;</span></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a> <span class="fu">group_by</span>(month) <span class="sc">|&gt;</span></span>
<span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a> <span class="fu">summarize</span>(<span class="at">longest_trip =</span> <span class="fu">max</span>(trip_distance, <span class="at">na.rm =</span> <span class="cn">TRUE</span>)) <span class="sc">|&gt;</span></span>
<span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a> <span class="fu">arrange</span>(month) <span class="sc">|&gt;</span> </span>
<span id="cb7-9"><a href="#cb7-9" aria-hidden="true" tabindex="-1"></a> <span class="fu">collect</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="sourceCode cell-code" id="cb5"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(tictoc)</span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="fu">tic</span>()</span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a>nyc_taxi <span class="sc">|&gt;</span> </span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a> <span class="fu">filter</span>(year <span class="sc">==</span> <span class="dv">2019</span>) <span class="sc">|&gt;</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a> <span class="fu">group_by</span>(month) <span class="sc">|&gt;</span></span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a> <span class="fu">summarize</span>(<span class="at">longest_trip =</span> <span class="fu">max</span>(trip_distance, <span class="at">na.rm =</span> <span class="cn">TRUE</span>)) <span class="sc">|&gt;</span></span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a> <span class="fu">arrange</span>(month) <span class="sc">|&gt;</span> </span>
<span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a> <span class="fu">collect</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 12 × 2
month longest_trip
@@ -337,9 +325,23 @@ <h1 class="title">Hello Arrow Exercises</h1>
11 11 701.
12 12 19130.</code></pre>
</div>
<div class="sourceCode cell-code" id="cb9"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="fu">toc</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="sourceCode cell-code" id="cb7"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="fu">toc</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>0.441 sec elapsed</code></pre>
<pre><code>0.461 sec elapsed</code></pre>
</div>
</div>
<p>or</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb9"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>nyc_taxi <span class="sc">|&gt;</span> </span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">filter</span>(year <span class="sc">==</span> <span class="dv">2019</span>) <span class="sc">|&gt;</span></span>
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a> <span class="fu">group_by</span>(month) <span class="sc">|&gt;</span></span>
<span id="cb9-4"><a href="#cb9-4" aria-hidden="true" tabindex="-1"></a> <span class="fu">summarize</span>(<span class="at">longest_trip =</span> <span class="fu">max</span>(trip_distance, <span class="at">na.rm =</span> <span class="cn">TRUE</span>)) <span class="sc">|&gt;</span></span>
<span id="cb9-5"><a href="#cb9-5" aria-hidden="true" tabindex="-1"></a> <span class="fu">arrange</span>(month) <span class="sc">|&gt;</span> </span>
<span id="cb9-6"><a href="#cb9-6" aria-hidden="true" tabindex="-1"></a> <span class="fu">collect</span>() <span class="sc">|&gt;</span> </span>
<span id="cb9-7"><a href="#cb9-7" aria-hidden="true" tabindex="-1"></a> <span class="fu">system.time</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> user system elapsed
3.887 0.225 0.435 </code></pre>
</div>
</div>
</div>
62 changes: 32 additions & 30 deletions materials/1_hello_arrow.html
@@ -453,9 +453,8 @@ <h2>Larger-Than-Memory Data</h2>
<p><br></p>
<p><code>arrow::open_dataset()</code></p>
<p><br></p>
<p><code>sources</code>: point to a string path or directory of data files (on disk or in a GCS/S3 bucket) and return an <code>Arrow Dataset</code>, then use <code>dplyr</code> methods to query it.</p>
<aside class="notes">
<p>Arrow Datasets allow you to query against data that has been split across multiple files. This sharding of data may indicate partitioning, which can accelerate queries that only touch some partitions (files). Call open_dataset() to point to a directory of data files and return a Dataset, then use dplyr methods to query it.</p>
<p>Arrow Datasets allow you to query against data that has been split across multiple files. This division of data into multiple files may indicate partitioning, which can accelerate queries that only touch some partitions (files). Call open_dataset() to point to a directory of data files and return a Dataset, then use dplyr methods to query it.</p>
<style type="text/css">
span.MJX_Assistive_MathML {
position:absolute!important;
@@ -490,6 +489,7 @@ <h2>NYC Taxi Dataset</h2>
</section>
<section id="nyc-taxi-dataset-a-question" class="slide level2">
<h2>NYC Taxi Dataset: A question</h2>
<p><br></p>
<p>What percentage of taxi rides each year had more than 1 passenger?</p>
</section>
<section id="nyc-taxi-dataset-a-dplyr-pipeline" class="slide level2">
@@ -498,51 +498,55 @@ <h2>NYC Taxi Dataset: A dplyr pipeline</h2>
<div class="sourceCode cell-code" id="cb6"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1"></a><span class="fu">library</span>(dplyr)</span>
<span id="cb6-2"><a href="#cb6-2"></a></span>
<span id="cb6-3"><a href="#cb6-3"></a>nyc_taxi <span class="sc">|&gt;</span></span>
<span id="cb6-4"><a href="#cb6-4"></a> <span class="fu">filter</span>(year <span class="sc">%in%</span> <span class="dv">2014</span><span class="sc">:</span><span class="dv">2017</span>) <span class="sc">|&gt;</span></span>
<span id="cb6-5"><a href="#cb6-5"></a> <span class="fu">group_by</span>(year) <span class="sc">|&gt;</span></span>
<span id="cb6-6"><a href="#cb6-6"></a> <span class="fu">summarise</span>(</span>
<span id="cb6-7"><a href="#cb6-7"></a> <span class="at">all_trips =</span> <span class="fu">n</span>(),</span>
<span id="cb6-8"><a href="#cb6-8"></a> <span class="at">shared_trips =</span> <span class="fu">sum</span>(passenger_count <span class="sc">&gt;</span> <span class="dv">1</span>, <span class="at">na.rm =</span> <span class="cn">TRUE</span>)</span>
<span id="cb6-9"><a href="#cb6-9"></a> ) <span class="sc">|&gt;</span></span>
<span id="cb6-10"><a href="#cb6-10"></a> <span class="fu">mutate</span>(<span class="at">pct_shared =</span> shared_trips <span class="sc">/</span> all_trips <span class="sc">*</span> <span class="dv">100</span>) <span class="sc">|&gt;</span></span>
<span id="cb6-11"><a href="#cb6-11"></a> <span class="fu">collect</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<span id="cb6-4"><a href="#cb6-4"></a> <span class="fu">group_by</span>(year) <span class="sc">|&gt;</span></span>
<span id="cb6-5"><a href="#cb6-5"></a> <span class="fu">summarise</span>(</span>
<span id="cb6-6"><a href="#cb6-6"></a> <span class="at">all_trips =</span> <span class="fu">n</span>(),</span>
<span id="cb6-7"><a href="#cb6-7"></a> <span class="at">shared_trips =</span> <span class="fu">sum</span>(passenger_count <span class="sc">&gt;</span> <span class="dv">1</span>, <span class="at">na.rm =</span> <span class="cn">TRUE</span>)</span>
<span id="cb6-8"><a href="#cb6-8"></a> ) <span class="sc">|&gt;</span></span>
<span id="cb6-9"><a href="#cb6-9"></a> <span class="fu">mutate</span>(<span class="at">pct_shared =</span> shared_trips <span class="sc">/</span> all_trips <span class="sc">*</span> <span class="dv">100</span>) <span class="sc">|&gt;</span></span>
<span id="cb6-10"><a href="#cb6-10"></a> <span class="fu">collect</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 4 × 4
year all_trips shared_trips pct_shared
&lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
1 2014 165114361 48816505 29.6
2 2015 146112989 43081091 29.5
3 2016 131165043 38163870 29.1
4 2017 113495512 32296166 28.5</code></pre>
<pre><code># A tibble: 10 × 4
year all_trips shared_trips pct_shared
&lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
1 2012 178544324 53313752 29.9
2 2013 173179759 51215013 29.6
3 2014 165114361 48816505 29.6
4 2015 146112989 43081091 29.5
5 2016 131165043 38163870 29.1
6 2017 113495512 32296166 28.5
7 2018 102797401 28796633 28.0
8 2019 84393604 23515989 27.9
9 2020 24647055 5837960 23.7
10 2021 30902618 7221844 23.4</code></pre>
</div>
</div>
</section>
<section id="nyc-taxi-dataset-a-dplyr-pipeline-1" class="slide level2">
<h2>NYC Taxi Dataset: A dplyr pipeline</h2>
<div class="cell">
<div class="sourceCode cell-code" id="cb8" data-code-line-numbers="11,12"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1"></a><span class="fu">library</span>(dplyr)</span>
<div class="sourceCode cell-code" id="cb8"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1"></a><span class="fu">library</span>(tictoc)</span>
<span id="cb8-2"><a href="#cb8-2"></a></span>
<span id="cb8-3"><a href="#cb8-3"></a>nyc_taxi <span class="sc">|&gt;</span></span>
<span id="cb8-4"><a href="#cb8-4"></a> <span class="fu">filter</span>(year <span class="sc">%in%</span> <span class="dv">2014</span><span class="sc">:</span><span class="dv">2017</span>) <span class="sc">|&gt;</span></span>
<span id="cb8-3"><a href="#cb8-3"></a><span class="fu">tic</span>()</span>
<span id="cb8-4"><a href="#cb8-4"></a>nyc_taxi <span class="sc">|&gt;</span></span>
<span id="cb8-5"><a href="#cb8-5"></a> <span class="fu">group_by</span>(year) <span class="sc">|&gt;</span></span>
<span id="cb8-6"><a href="#cb8-6"></a> <span class="fu">summarise</span>(</span>
<span id="cb8-7"><a href="#cb8-7"></a> <span class="at">all_trips =</span> <span class="fu">n</span>(),</span>
<span id="cb8-8"><a href="#cb8-8"></a> <span class="at">shared_trips =</span> <span class="fu">sum</span>(passenger_count <span class="sc">&gt;</span> <span class="dv">1</span>, <span class="at">na.rm =</span> <span class="cn">TRUE</span>)</span>
<span id="cb8-9"><a href="#cb8-9"></a> ) <span class="sc">|&gt;</span></span>
<span id="cb8-10"><a href="#cb8-10"></a> <span class="fu">mutate</span>(<span class="at">pct_shared =</span> shared_trips <span class="sc">/</span> all_trips <span class="sc">*</span> <span class="dv">100</span>) <span class="sc">|&gt;</span></span>
<span id="cb8-11"><a href="#cb8-11"></a> <span class="fu">collect</span>() <span class="sc">|&gt;</span> </span>
<span id="cb8-12"><a href="#cb8-12"></a> <span class="fu">system.time</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> user system elapsed
17.661 1.404 2.625 </code></pre>
</div>
<span id="cb8-11"><a href="#cb8-11"></a> <span class="fu">collect</span>()</span>
<span id="cb8-12"><a href="#cb8-12"></a><span class="fu">toc</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<blockquote>
<p>6.077 sec elapsed</p>
</blockquote>
</section>
<section id="your-turn" class="slide level2">
<h2>Your Turn</h2>
<ol type="1">
<li><p>Calculate total number of rides for each month in 2019</p></li>
<li><p>About how long did this query of 1.15 billion rows take?</p></li>
<li><p>Calculate the longest trip distance for every month in 2019</p></li>
<li><p>How long did this query take to run?</p></li>
</ol>
<p>➡️ <a href="1_hello_arrow-exercises.html">Hello Arrow Exercises Page</a></p>
</section>
@@ -612,8 +616,6 @@ <h2>Today</h2>
<li>Module 3: Larger-than-memory data manipulation with Arrow—Part II</li>
<li>Module 4: In-memory workflows in R with Arrow</li>
</ul>
<p><br></p>
<p>We will also talk about Arrow data types, file formats, controlling schemas &amp; more fun stuff along the way!</p>


<img src="images/logo.png" class="slide-logo r-stretch"><div class="footer footer-default">