Skip to content

Latest commit

 

History

History
480 lines (354 loc) · 11.5 KB

Week 2.md

File metadata and controls

480 lines (354 loc) · 11.5 KB

Week 2 - Filtering, Sorting, and Calculating Data with SQL

Table of Contents

Module Introduction

video

This module introduces a number of operators and clauses to query data:

  • WHERE
  • BETWEEN
  • IN
  • OR
  • NOT
  • LIKE
  • ORDER BY
  • GROUP BY

It also introduces wildcards, and math operators which can be used to aggregate data:

  • AVERAGE
  • COUNT
  • MAX
  • MIN

Basics of Filtering with SQL

video

Why filter

  • allows one to be specific about what they want
  • reduces the number of records you receive
  • increases query performance
  • reduces strain on the client application

WHERE clause

The WHERE clause filters queries by criteria you define.

SELECT [fields]
FROM [table]
WHERE [criteria]

These criteria are determined through using a number of operators:

  • = - equal
  • <> - not equal
  • > - greater than
  • < - less than
  • >= - greater than or equal
  • <= - less than or equal
  • BETWEEN - between an inclusive range
  • IS NULL - is a null value

To use BETWEEN we need to use it in conjunction with AND:

SELECT *
FROM my_table
WHERE size BETWEEN 5 AND 10;

Advanced Filtering: IN, OR, and NOT

video

IN operator

The IN operator specifies a range of conditions, delimited by commas, and within parentheses.

-- select only suppliers with the provided ids
SELECT *
FROM suppliers
WHERE supplierId IN (9, 10, 15);

To assert against string values, surround the strings in single quotes.

OR operator

As with the disjunction operator in many programming languages, SQL will ignore the second condition in an OR statement if the first condition is true.

SELECT *
FROM pets
WHERE name = 'sammy' OR 'hammy';

IN vs OR

IN works similarly to OR, but there are benefits to using IN:

  • one can provide a long list of options
  • IN executes faster than OR
  • the condition used in IN are not subject to a specific order
  • IN clauses can container another SELECT statement to make subqueries

NOT operator

NOT is used to exclude rows based on conditions:

SELECT *
FROM employees
WHERE NOT city = 'London' AND NOT city = 'Seattle';

Using Wildcards in SQL

video

A wildcard is a special character used to match parts of a value, such as when you know only a portion of a value.

Search patterns are made from literal text, a wildcard character, or a combination of both.

Wildcard queries use the LIKE operator. LIKE is actually a predicate, but is often referred to as an operator.

Wildcards can only be used for string data.

Using wildcards

  • %text - matches anything ending with text
  • text% - matches anything beginning with text
  • %text% - matches anything containing text
  • a%b - matches anything that is preceded by a and ends with b

Wildcards will not match NULL values.

SELECT DISTINCT(area)
FROM cities
WHERE name LIKE '%ville';

Underscore wildcard

Many RDMSs support _ as a wildcard. It functions in the same way that % has been outlined here.

Implications of using wildcards

  • wildcard queries take longer to run
  • better to use other operators if possible
  • statements with wildcards at the end of patterns take longer to run than patterns with wildcards at the beginning
  • placement of wildcads is imporant

Sorting with ORDER BY

video

Using SELECT will return data in the same order that it was captured in the database.

Furthermore, the order of data can change when it is updated or deleted.

Without specifying the order of data, its order cannot be assumed.

ORDER BY:

  • takes the name of one or more columns
  • columns are delimited by commas
  • data can be sorted by columns that are not retrieved in the SELECT statement
  • must always be the last statement in a SELECT statement
SELECT *
FROM my_table
WHERE [criteria]
ORDER BY col1, col2;

Sort direction

One can specify the direction to sort data:

  • DESC - descending
  • ASC - ascending

The direction keywords apply only to the column name they precede.

...
ORDER BY DESC name;

Sorting by columns position

Though not readable, and prone to error, one can sort by column number:

...
ORDER BY 2, 3;

Columns numbers are 1-indexed.

Math operations

video

Available operators:

  • +
  • -
  • *
  • /
SELECT
  id
  ,unit_price
  ,unit_sold
  ,unit_price * units_sold AS total_sales
FROM my_table;
SELECT
  id
  ,unit_price
  ,units_sold
  ,unit_weight
  ,(unit_price * units_sold)/unit_weight AS weight_ratio
FROM my_table;

Aggregate Functions

video

Aggregate functios allow one to summarise data.

The aggregate functions available are:

  • AVG
  • COUNT
  • MIN
  • MAX
  • SUM

DISTINCT may be used with some of the aggregate functions to include only values that are distinct.

Aggregate functions are useful for:

  • summarising data
  • finding the highest and lowest values
  • finding the total number of rows
  • finding averages

Aggregate functions are syntactic sugar for the existing math operators.

AVG function

Rows containing NULL values are ignore by the AVG function.

SELECT
  AVG(price) AS avg_price
FROM products;

COUNT function

COUNT can be used to count rows, including those containing NULL values, or individual columns, excluding those containing NULL values.

# count all rows, including those with NULL values
SELECT COUNT(*) AS customers
FROM customers;
# count customers that have emails
SELECT COUNT(email) AS customers_with_email
FROM customers;

MAX and MIN functions

NULL values are ignored by both MAX and MIN.

SELECT MAX(price) AS max_price
FROM products;
SELECT MIN(size) AS smallest_size
FROM shoes;
# get a range of shoe sizes
SELECT
  MAX(size) AS shoe_size_upper
  , MIN(size) as show_size_lower
FROM shoes;

SUM function

# simple query
SELECT SUM(price) as total_price
FROM products;
# less simple query
SELECT SUM(price * units_sold) total_sales
FROM products;

DISTINCT

If DISTINCT is not defined, ALL is assumed.

DISTINCT can't be used in conjunction with COUNT(*).

DISTINCt doesn't make sense to use with MIN or MAX since they only return a single value.

SELECT COUNT(DISTINCT price) AS distinct_prices
FROM products;

Grouping Data with SQL

video

Grouping data allows one to summarise subsets of data.

Data is grouped using two statements:

  • GROUP BY
  • HAVING

Using these statements we can aggregate data by a particular value, e.g. grouping shoes by shoe size.

When simply aggregating total number of customers in the aggregating module, we didn't need to do anything specific to get a result.

If, however, another column is specified in the SELECT statement, we'd need to indicate how we'd like the number of customers to be counted. This is where GROUP BY is useful:

# this will error - we need to specify _how_ to count customer ids when queried
with region
SELECT
  region
  ,COUNT(customer_id) AS total_customers
FROM customers;

# so we indicate that we want to count customers grouped by region
SELECT
  region
  ,COUNT(customer_id) AS total_customers
FROM customers
GROUP BY region;

Using GROUP BY

  • can contain multiple columns, delimited by commas
  • every column in the SELECT statement must be present in a GROUP BY clause, with the exception of the aggregated values
  • NULLs can be grouped together if your GROUP BY column contains NULLs

WHERE, HAVING, and GROUP BY

  • WHERE filters on rows, not groups
  • WHERE then needs to be defined before GROUP BY
  • rows eliminated by a WHERE clause will not be included in a group
  • HAVING is the equivalent to WHERE, but for groups
  • HAVING must come after ORDER BY
# count the number of orders, by customer, where there are 2 or more orders for
# each customer
SELECT
  customer_id
  .COUNT(*) AS total_orders
FROM orders
GROUP BY customer_id
HAVING COUNT(*) >= 2;

GROUP BY and ORDER BY

As with other queries, GROUP BY will not sort results. It's good practice to use ORDER BY with GROUP BY to organise the results of queries.

SELECT
  supplier_id
  ,COUNT(*) AS num_products
FROM products
WHERE price >= 4
GROUP BY supplier_id
HAVING COUNT(*) >= 2
ORDER BY name ASC;

Putting it all together

[video](Putting it All Together)

Filtering is useful because:

  • you're narrowing down your results
  • it increases query and application performance
  • makes understanding data easier by:
    • finding specific values
    • finding a range of values
    • finding blank values

Order of key SQL clauses:

# always required
SELECT [columns]
# required if selecting table from a table
FROM [table]
WHERE [criteria]
# required if calculating aggregates by a group
GROUP BY [columns]
HAVING [criteria]
ORDER BY [columns [ASC|DESC]];