Codecademy wants to create a world where anyone can build something meaningful with technology, and everyone has the learning tools, resources, and opportunities to do so. Code contains a world of possibilities — all that’s required is the curiosity and drive to learn. At Codecademy, we are committed to empowering all people, regardless of where they are in their coding journeys, to continue to learn, grow, and make an impact on the world around them.
Details
The Data Scientist Career Path is designed for you to gain the technical skills, knowledge, resources, and confidence you need to get your start as a data scientist.
After this Path, you will be able to:
- Create programs using Python 3
- Move off of the Codecademy platform with the Command Line, Jupyter Notebook, Git, and GitHub Desktop
- Acquire and query data with SQL and BeautifulSoup
- Manipulate data using NumPy and pandas
- Wrangle and tidy data with pandas
- Summarize and analyze data with scikit-learn
- Implement many different hypothesis tests using Python
- Visualize data using Matplotlib and seaborn
- Communicate data science findings effectively
- Work with text data using natural language processing
- Build, train, and test a variety of supervised and unsupervised machine learning models
- Understand the fundamentals of deep learning
- Work with aggregate functions and joins in SQL
Popular Data Science Resources
As you go through the path, we’ll suggest certain resources for you to use, like articles, videos, tutorials, and documentation. Here are some additional resources that are considered groundbreaking, significant, or classics in the industry and will help you throughout your Path:
- Documentation:
- Cheatsheet:
- Books:
- A Whirlwind Tour of Python, Jake VanderPlas
- Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, Wes McKinney
- Python Data Science Handbook, Jake VanderPlas, also available for free here
- Algorithms of Oppression: How Search Engines Reinforce Racism, Safiya Umoja Noble
- Introduction to Machine Learning with Python: A Guide for Data Scientists, Andreas C. Müller, Sarah Guido
- Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Cathy O’Neil
- Python Feature Engineering Cookbook, Soledad Galli
- Deep Learning with Python, François Chollet
Details
Data science enables us to take data and transform it into meaningful information that can help us make decisions. Data science is interdisciplinary and combines other well-known fields such as probability, statistics, analytics, and computer science. The work that you might do could range from writing up reports to building machine learning models. No matter what your interests are, data science is applicable - because these days, we have data on everything!
A fundamental part of data science is statistics. Statistics is the practice of applying mathematical calculations to sets of data to derive meaning. Statistics can give us a quick summary of a dataset, such as the average amount or how consistent a dataset is.
There are two types of statistics:
- Descriptive Statistics: Descriptive statistics describe a dataset using mathematically calculated values, such as the mean and standard deviation. For instance, a FiveThirtyEight graph charting the wage gap between American men and women in 2014 shows that at the 90th income percentile, women make 80.6% of what men make on average; that figure is a descriptive statistic.
- Inferential Statistics: Inferential statistics, on the other hand, are statistical calculations that enable us to draw conclusions about a larger population. For instance, from that same graph we can infer that at the 99th income percentile, women make less than 78% of what men make on average. We can also infer that the wage gap is smallest at the 10th income percentile because the minimum wage for men and women is the same.
Another field that informs the field of data science is probability. Probability is the mathematical study of what could potentially happen. Fun fact: the study of probability came about as a method to decode secret messages.
In data science, probability calculations are used to build models. Models are able to help us understand data that has yet to exist - either data we hadn’t previously collected or data that has yet to be created. Data scientists create models to help calculate the probability of a certain action and then they use that probability to make informed decisions.
Example of Probability Application
Given that we’re a “learn to code” website, this one may seem fairly obvious. But in all seriousness, programming is an essential part of data science. It’s also what sets data science apart from similar fields, like data analytics.
Programming is the practice of writing commands for a computer to execute. Computer Science is the discipline of interacting with computation systems.
Example of Programming Application
One of the reasons that data science is one of the most interesting fields to go into is that it doesn’t limit you to a specific industry. Interested in airplanes? Work for a company on ticket pricing! Like fashion? Create a model that predicts the latest trends.
Domain Expertise refers to the particular set of knowledge that someone cultivates in order to understand their data. You may be able to crunch the numbers, but in order to understand their meaning, you’re going to need a lot of context. Sometimes this context comes from your research, your teammates, or the knowledge that you gain after working in a particular industry.
Review! You now know the basic parts that make up the field of data science.
- Data Science—the field of taking data and transforming it into meaningful information that can help us make decisions
- Descriptive Statistics—statistics that describe the data in objective terms
- Inferential Statistics—inferences for the overall population based on data
- Probability—the likelihood that an event will happen
- Programming—the act of giving the computer instructions to perform a task
- Domain Expertise—the particular set of knowledge that someone cultivates and brings with them in order to understand their data
If you are interested in learning more about these topics, here are some additional resources:
- Blog: What does a data scientist do?
- Article: What Data Scientists Really Do, According to 35 Data Scientists
- Article: Achieving business impact with data
As you can see, there are many different applications and kinds of projects that you can do once you know a bit of data science!
- Reports - a way of presenting your process, insights, and recommendations
- Recommender Systems - a process that uses data about users and items to predict interest
- Dynamic Pricing - a strategy that takes into account factors such as demand to increase and decrease prices to drive profit
- Natural Language Processing - ways of analyzing text to gain insights as well as support applications, such as chatbots
Of course, there are many applications beyond just the ones that we’ve covered here. So no matter your interest or professional industry, data science thinking can create impact.
Details
Python is a programming language. Like other languages, it gives us a way to communicate ideas. In the case of a programming language, these ideas are “commands” that people use to communicate with a computer!
- Welcome
- Comment
- String
- Variables
- Errors
- Numbers
- Calculations
- Changing Numbers
- Exponents
- Modulo
- Concatenation
- Plus Equals
- Multi-line Strings
- Review
Two common errors that we encounter while writing Python are SyntaxError and NameError.
- A SyntaxError means there is something wrong with the way your program is written: punctuation that does not belong, a command where it is not expected, or a missing parenthesis can all trigger a SyntaxError.
- A NameError occurs when the Python interpreter sees a word it does not recognize. Code that contains something that looks like a variable but was never defined will throw a NameError.
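As a quick illustration, each of the small snippets below would raise the corresponding error if run on its own (the variable name is made up for the example):
# SyntaxError: the closing parenthesis is missing
print("Hello, world!"

# NameError: the variable message was never defined
print(message)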
Python Documentation - Built-in Types
In this documentation, you will learn about the built-in data types of Python. This is helpful if you would like to learn about data types that you can use when programming in Python.
Details
Let’s imagine we were building an application to help people plan trips! For a trip-planning application, a simple procedure could look like this:
1. Establish your origin and destination
2. Calculate the distance/route
3. Return the best route to the user
We will perform these three steps every time users have to travel between two points using our trip application. In our programs, we could rewrite the same procedures over and over (and over) for each time we want to travel, but there’s a better way! Python gives us a useful concept called functions.
- Why Functions?
- Defining a Function
- Calling Function
- Whitespace & Execution Flow
- Parameters & Arguments
- Multiple Parameters
- Types of Arguments
- Built-in vs User Defined Functions
- Variable Access
- Returns
- Multiple Returns
- Review
Here’s an example of a function definition:
def function_name():
  # function tasks go here
There are some key components we want to note here:
- The def keyword indicates the beginning of a function (also known as a function header). The function header is followed by a name in snake_case format that describes the task the function performs. It’s best practice to give your functions a descriptive yet concise name.
- Following the function name is a pair of parentheses ( ) that can hold input values known as parameters (more on parameters later in the lesson!). In this example function, we have no parameters.
- A colon : marks the end of the function header.
- Lastly, we have one or more valid Python statements that make up the function body (where we have our Python comment).
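Putting these pieces together, a minimal sketch of a complete function with a parameter and a return value might look like this (the function and argument names are just examples):
def trip_welcome(destination):
  # The function body is indented under the header
  return "Welcome! Your trip to " + destination + " is being planned."

# Call the function and print its return value
print(trip_welcome("Tokyo"))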
Python Code Challenges: Functions
Python Code Challenges: Functions (Advanced)
Details
- Boolean Expressions
- Relational Operators: Equals and Not Equals
- Boolean Variables
- If Statement
- Relational Operators II
- Boolean Operators: and
- Boolean Operators: or
- Boolean Operators: not
- Else Statements
- Else If Statements
- Review
Let’s review what we’ve learned this lesson:
- Boolean expressions are statements that can be either True or False.
- A boolean variable is a variable that is set to either True or False.
- We can create boolean expressions using relational operators:
  - ==: Equals
  - !=: Not equals
  - >: Greater than
  - >=: Greater than or equal to
  - <: Less than
  - <=: Less than or equal to
- if statements can be used to create control flow in your code.
- else statements can be used to execute code when the conditions of an if statement are not met.
- elif statements can be used to build additional checks into your if statements.
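A quick sketch of these pieces working together (the age thresholds are arbitrary example values):
age = 17

if age >= 18:
  print("You can vote and drive.")
elif age >= 16:
  print("You can drive, but not vote.")
else:
  print("You can neither vote nor drive.")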
Python Tutorial - More Control Flow Tools
In this documentation, you will learn about the different flow control statements of Python. This is helpful if you would like to conditionally execute blocks of code in a Python program.
Python Code Challenges: Control Flow
Python Code Challenges: Control Flow (Advanced)
Details
Windows has a different CLI, called Command Prompt. While this has many of the same features as Bash, Bash is much more popular. Because of the strength of the open source community and the tools they provide, mastering Bash is a better investment than mastering Command Prompt.
To use Bash on a Windows computer, we will download and install a program called Git Bash. Git Bash allows us to easily access Bash as well as another tool we’ll be using later called Git, inside the Windows environment.
For a more detailed tutorial, please watch this video.
Jupyter Notebook (sometimes called IPython Notebook) is a popular way to write and run Python code, especially for data analysis, data science and machine learning. Jupyter Notebooks are easy-to-use because they let you execute code and review the output quickly. This iterative process is central to data analytics and makes it easy to test hypotheses and record the results (just like a notebook).
A Jupyter Notebook has two parts:
- the front-end interface (what you see in your browser)
- the back-end kernel
The front-end interface loads in a web browser and consists of “cells” where you enter your code. The browser is just for display, so you do not need to be connected to the internet.
Jupyter Notebook uses a back-end kernel called IPython. The ‘I’ stands for ‘Interactive’, which means that a program or script can be broken up into smaller pieces, and those pieces can be run independently from the rest of the program.
You do not need to worry about the difference between Python and IPython. The important thing to know is that you can run small pieces of code, which can be helpful when working with data.
Windows Anaconda
Go to the Anaconda Downloads page and download the appropriate version of Anaconda (32- or 64-bit Graphical Installer).
- Double click on the .exe file and click Install.
- Read and agree to the licensing terms.
- Select if you want to install for 'Just Me' or 'All Users'. If you are installing for 'All Users', you must have Administrator privileges.
- You will be prompted to select the installation location. By default, Anaconda should try to install in your home directory. We recommend accepting this default. Click Install.
- You will be asked if you want to add Anaconda to your PATH environment variable. Do not add Anaconda to the PATH because it can interfere with other software.
- You will be asked if you want Anaconda to be your default version of Python. We recommend 'Yes'. There are some rare instances where you might not make Anaconda the default version, but they are beyond the scope of this article.
- Click the Install button.
- Go ahead and finish the installation.
Now you can verify your installation. Go to the Windows Start menu and search for ‘Anaconda Navigator’ (not the Anaconda Prompt).
If it appears, it is installed. Congratulations!
Double click on it, and the Anaconda Navigator window should appear.
Your applications screen may look slightly different from this one, but that is ok. Click the Launch button under Jupyter Notebook. A Jupyter Notebook interface will appear in your default browser.
An Anaconda Prompt might also open and display a url. If it does, do not close it until you are done working with Jupyter Notebook. If it does not appear, don’t worry - this has to do with your operating system and will not affect Jupyter Notebook’s performance.
Congratulations!! You are ready to move on to the next article and get started using Jupyter Notebook!
For many more helpful tools and features, check out the documentation, and if you want to be inspired to do more with Jupyter notebook, check out A gallery of interesting Jupyter Notebooks.
Have fun coding with Jupyter Notebook!
There are numerous ways to enhance your Jupyter Notebook and make it more readable and standardized. We have covered some of the methods, but there are many others that can be found in the Jupyter Notebook User’s Manual.
Documentation
Jupyter Notebook Documentation
In this documentation, you will learn about how to install, configure, and use Jupyter Notebook for Python programming. This is helpful if you would like to quickly validate Python code or prototype visualizations.
Details
In programming, it is common to want to work with collections of data. In Python, a list is one of the many built-in data structures that allows us to work with a collection of data in sequential order.
Suppose we want to make a list of the heights of students in a class:
- Noelle is 61 inches tall
- Ava is 70 inches tall
- Sam is 67 inches tall
- Mia is 64 inches tall
In Python, we can create a variable called heights to store these integers into a list:
heights = [61, 70, 67, 64]
Notice that:
- A list begins and ends with square brackets ([ and ]).
- Each item (e.g., 67 or 70) is separated by a comma (,).
- It’s considered good practice to insert a space after each comma, but your code will run just fine if you forget the space.
Intro to Lists in Python:
- What is a List?
- What can a List contain?
- Empty Lists
- List Methods
- Growing a List: Append
- Growing a List: Plus (+)
- Accessing List Elements
- Accessing List Elements: Negative Index
- Modifying List Elements
- Shrinking a List: Remove
- Two-Dimensional (2D) Lists
- Accessing 2D Lists
- Modifying 2D Lists
- Review
Combining Lists: The Zip Function
Use zip() to create a new variable called names_and_dogs_names that combines the owners and dogs_names lists into a zip object.
Then, create a new variable named list_of_names_and_dogs_names by calling the list() function on names_and_dogs_names.
Print list_of_names_and_dogs_names.
See the code here.
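One possible solution sketch, assuming owners and dogs_names are simple lists of strings:
owners = ["Jenny", "Alexus", "Sam", "Grace"]
dogs_names = ["Elphonse", "Dr. Doggy DDS", "Carter", "Ralph"]

# zip() pairs up elements from both lists into a zip object
names_and_dogs_names = zip(owners, dogs_names)

# Convert the zip object into a list of tuples so it can be printed
list_of_names_and_dogs_names = list(names_and_dogs_names)
print(list_of_names_and_dogs_names)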
Documentation
Python Tutorial - Sequence Types
In this documentation, you will learn about the built-in sequence types available in Python. This is helpful when you are working with a sequence of data in Python.
Python List Methods:
- .count() - A list method to count the number of occurrences of an element in a list.
- .insert() - A list method to insert an element into a specific index of a list.
- .pop() - A list method to remove an element from a specific index or from the end of a list.
- range() - A built-in Python function to create a sequence of integers.
- len() - A built-in Python function to get the length of a list.
- .sort() / sorted() - A method and a built-in function to sort a list.
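A short sketch of these methods in action (the list contents are arbitrary):
scores = [88, 72, 95, 72]

print(scores.count(72))    # 2 occurrences of 72
scores.insert(1, 100)      # insert 100 at index 1
last = scores.pop()        # remove and return the last element
print(len(scores))         # length of the list
print(sorted(scores))      # returns a new sorted list
scores.sort()              # sorts the list in place
print(list(range(5)))      # [0, 1, 2, 3, 4]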
Working with Lists in Python:
- Adding by Index: Insert
- Removing by Index: Pop
- Consecutive Lists: Range
- The Power of Range!
- Length
- Slicing Lists I
- Slicing Lists II
- Counting in a List
- Sorting Lists I
- Sorting Lists II
- Review
Python Code Challenges: Lists
Python Code Challenges: Lists (Advanced)
Details
In programming, this process of using an initialization, repetitions, and an ending condition is called a loop. In a loop, we perform a process of iteration (repeating tasks).
Programming languages like Python implement two types of iteration:
- Indefinite iteration, where the number of times the loop is executed depends on how many times a condition is met.
- Definite iteration, where the number of times the loop will be executed is defined in advance (usually based on the collection size).
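A brief sketch of each kind of iteration (the data is arbitrary):
# Definite iteration: the number of repetitions is known in advance
ingredients = ["milk", "sugar", "vanilla extract"]
for ingredient in ingredients:
  print(ingredient)

# Indefinite iteration: repeat until a condition stops being true
count = 0
while count < 3:
  print("count is", count)
  count += 1

# List comprehension: build a new list from an existing one
doubled = [number * 2 for number in [1, 2, 3]]
print(doubled)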
Intro to Loops:
- Why Loops?
- For Loops: Introduction
- For Loops: Using Range
- While Loops: Introduction
- While Loops: Lists
- Infinite Loops
- Loop Control: Break
- Loop Control: Continue
- Nested Loops
- List Comprehensions: Introduction
- List Comprehensions: Conditionals
- Review
Documentation
Python Tutorial - Data Structures : Looping Techniques
In this documentation, you will learn about the different Python looping techniques. This is helpful if you would like to repeatedly execute a code block in a program for a sequence of values.
List Comprehension - Code Challenge:
Details
In Python, the way we store something like a word, a sentence, or even a whole paragraph is as a string. A string is a sequence of characters contained within a pair of 'single quotes' or "double quotes". A string can be any length and can contain any letters, numbers, symbols, and spaces.
Intro to Strings:
- String
- They're all Lists!
- Cut Me a Slice of String
- Concatenating Strings
- More and More String Slicing
- Negative Indices
- Strings are Immutable
- Escape Characters
- Iterating through Strings
- Strings and Conditionals (Part One)
- Strings and Conditionals (Part Two)
- Review
String Methods:
- Formatting Methods
- Splitting Strings
- Splitting Strings II
- Splitting Strings III
- Joining Strings
- Joining Strings II
- .strip()
- Replace
- .find()
- .format()
- .format() II
- Review
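A few of these string methods in a quick sketch (the strings are arbitrary examples):
line = "  Purple mountain majesties  "

print(line.strip())                   # remove leading/trailing whitespace
print(line.lower())                   # formatting method: lowercase
words = line.split()                  # split on whitespace into a list
print("-".join(words))                # join the list back with dashes
print(line.replace("Purple", "Blue")) # replace a substring
print(line.find("mountain"))          # index where the substring starts
print("{} has {} words".format(line.strip(), len(words)))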
Documentation
Python Documentation - Common String Operations
In this documentation, you will learn about common string operations in Python. This is helpful if you would like to manipulate strings using Python code.
Code Challenge - String Methods:
Details
A dictionary is an unordered set of key: value pairs.
It provides us with a way to map pieces of data to each other so that we can quickly find values that are associated with one another.
Suppose we want to store the prices of various items sold at a cafe:
- Avocado Toast is 6 dollars
- Carrot Juice is 5 dollars
- Blueberry Muffin is 2 dollars
In Python, we can create a dictionary called menu to store this data:
menu = {"avocado toast": 6, "carrot juice": 5, "blueberry muffin": 2}
Notice that:
- A dictionary begins and ends with curly braces { and }.
- Each item consists of a key ("avocado toast") and a value (6).
- Each key: value pair is separated by a comma.
- It’s considered good practice to insert a space after each comma, but our code will still run without the space.
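A short sketch of working with this menu dictionary (the added items are made up):
menu = {"avocado toast": 6, "carrot juice": 5, "blueberry muffin": 2}

# Look up a value by its key
print(menu["carrot juice"])

# Add a new key: value pair
menu["banana bread"] = 3

# Overwrite an existing value
menu["avocado toast"] = 7

# .get() returns a default instead of raising a KeyError
print(menu.get("espresso", "not on the menu"))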
Intro to Dictionaries:
- Make a Dictionary
- Invalid Keys
- Empty Dictionary
- Add A Key
- Add Multiple Keys
- Overwrite Values
- List Comprehensions to Dictionaries
- Review
Documentation
Python Tutorial - Data Structures - Dictionaries
In this documentation, you will learn about Python’s built-in dictionary datatype. This is helpful if you would like to store the data as a set of key:value pairs, and later extract a value given its key.
Using Dictionaries:
Details
Intro to Classes:
Details
A module is a collection of Python declarations intended broadly to be used as a tool. Modules are also often referred to as “libraries” or “packages” — a package is really a directory that holds a collection of modules.
Usually, to use a module in a file, the basic syntax you need at the top of that file is:
from module_name import object_name
Often, a library will include a lot of code that you don’t need that may slow down your program or conflict with existing code. Because of this, it makes sense to only import what you need.
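For example, a minimal sketch using Python's built-in random module (the dice values are arbitrary):
from random import randint

# randint(a, b) returns a random integer between a and b, inclusive
roll = randint(1, 6)
print("You rolled a", roll)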
Intro to Modules:
- Modules Python Introduction
- Modules Python Random
- Modules Python Namespaces
- Modules Python Decimals
- Modules Python Files and Scope
Documentation
In this documentation you will learn about Python modules. This is helpful if you would like to split long program code into logically grouped files and import the files to reference the classes and functions within.
Details
Learn Python Files:
- Reading a File
- Iterating Through Lines
- Reading a Line
- Writing a File
- Appending to a File
- What's With "with"?
- What Is a CSV File?
- Reading a CSV File
- Reading Different Types of CSV Files
- Writing a CSV File
- Reading a JSON File
- Writing a JSON File
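A compact sketch of a few of these ideas, assuming files named welcome.txt, visitors.csv, and message.json already exist in the working directory:
import csv
import json

# Read a whole text file; "with" closes the file automatically
with open("welcome.txt") as text_file:
  contents = text_file.read()

# Read a CSV file row by row as dictionaries keyed by the header
with open("visitors.csv", newline="") as csv_file:
  for row in csv.DictReader(csv_file):
    print(row)

# Read a JSON file into Python data structures
with open("message.json") as json_file:
  data = json.load(json_file)

# Append a line to a text file
with open("log.txt", "a") as log_file:
  log_file.write("Finished reading the files.\n")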
Documentation
Python Tutorial - Input and Output - Reading and Writing Files
In this documentation you will learn about reading and writing the data in files using Python. This is helpful if you would like to read the file’s contents into the program or write data you’ve generated into a file.
Details
Data acquisition, or data mining, is the step of the Data Science Life Cycle where we identify and obtain the raw data that will later be cleaned for exploration and modeling.
Data science is viewed as such a powerful tool because data is often the most valuable resource we have when attempting to gain insight from our environment. In a business setting, once a question has been defined and objectives have been set through the business understanding phase of the data science life cycle, the right data must be acquired to answer the business needs. In the life cycle diagram, the arrows between the business understanding phase and the data acquisition phase indicate that there is often an iterative relationship between the two: you need to understand how the questions you’re asking relate to your goals before collecting data, but the data you acquire may shift your understanding of your business and send you back to the business understanding phase. Some aspects to consider when acquiring data are:
- What data is needed to achieve the business goal?
- How much data is needed to produce valuable insight and modeling?
- Where and how can this data be found?
- What legal and privacy parameters should be considered?
There are several methods you can utilize to acquire data. Methods that we will cover in this article are:
Public Data
There are several open source datasets that are hosted online that allow you to freely download data collected by others, offering solutions to a wide range of data science and machine learning applications. These public sources of data are often suitable for small to medium sized machine learning projects, concept validation, and research purposes. Some of the most commonly visited are:
- GitHub
- Kaggle
- KDnuggets
- UCI Machine Learning Repository
- US Government’s Open Data
- Five Thirty Eight
- Amazon Web Services
Private Data
There are also a number of private datasets that businesses curate themselves and that are under the ownership of the company. For instance, Netflix’s database of user preferences powers its immense recommendation systems. There are also services that allow you to purchase datasets, such as data markets like Data & Sons, or crowd-sourcing marketplaces such as Amazon’s Mechanical Turk, where one can outsource data acquisition needs ranging from data validation and research to survey participation. We often find private data used within large, production-scale settings.
Pros:
- Time: Readily available datasets can quickly move a project to the next phase of the Data Science Life Cycle.
- Cost: Public datasets can cut the costs of collecting data down to zero.
Cons:
- Messy: Data can often come in forms that require intensive cleaning and modification.
- Cost: Private services can lead to high costs in acquiring data.
Web scraping can be one of the most potent methods of data acquisition when used effectively. Web scraping at its core is the act of extracting or copying information directly from a website. This data can then be stored within a dataframe or spreadsheet and used in the same manner as any other dataset. There are two general methods for web scraping, one where you will manually scrape the data from a web page and the other where you employ a web crawler or bot that automates the process of extracting data from a website.
Python has useful libraries that allow you to implement both methods of web scraping, some of the most commonly used are BeautifulSoup, Selenium, and Scrapy. Web scraping is best used when the data you need is not available as a public or private dataset, but present on a web page. We typically employ web scraping techniques to acquire data for small to medium sized projects, but rarely in production as this can raise ownership and copyright issues.
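As a minimal sketch of the manual approach with requests and BeautifulSoup (the URL and tag choices are placeholders, and both libraries must be installed):
import requests
from bs4 import BeautifulSoup

# Fetch the page; a real project should also respect robots.txt and rate limits
response = requests.get("https://example.com")

# Parse the HTML and pull out every link's text and address
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.find_all("a"):
  print(link.get_text(), link.get("href"))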
Pros:
- Versatile: A highly adaptable method for acquiring data from the internet.
- Scalable: Distributed bots can be coordinated to retrieve large quantities of data.
Cons:
- Language Barrier: Multiple languages are involved when scraping, requiring knowledge of languages not typically used for data science.
- Legality: Excessive or improper web scraping can be illegal, disrupt a website’s functionality, and lead to your IP address being blacklisted from the site.
Application Programming Interfaces, most often referred to as APIs, are another method that can be used for data acquisition. We can imagine APIs as a more polite and ‘by the books’ way of web scraping, where in the process of acquiring data from a site we request permission to access some data and wait for a response from the site to fulfill the request. Unlike web scraping, APIs are a means of communication between two different software systems.
Typically this communication is achieved in the form of an HTTP Request/Response Cycle, where a client (you) sends a request to a website’s server for data through an API call. The server then searches within its databases for the particular data requested and responds to the client either with the data or with an error stating that the request cannot be fulfilled.
Pros:
- User & Site Friendly: APIs allow security and management of resources for the sites that data is being requested from.
- Scalable: APIs can allow for various amounts of data to be requested, up to production-scale volumes.
Cons:
- Limited: Some functions or data may not be accessible via an API.
- Cost: Some API calls can be quite expensive, leading to limitations for certain functions and projects.
What can we do when the data we need to analyze is not present on the internet or able to be outsourced? In these situations, it can be useful to know how to harvest data yourself, and there are many tools available to do so. Perhaps you would like to conduct hypothesis testing to determine public sentiment about a particular good or service, or maybe you would like to analyze data about animal populations of a local habitat. Being able to acquire the data to conduct these analyses yourself is an invaluable skill to have.
Google Forms is a simple and free way to create surveys that can be shared with others to acquire data related to a particular population. With Google Forms, you can create surveys that include video and image files, multiple choice, and short answer questions that can be posted on social media, sent out in email lists, and even printed to manually harvest data. The data acquired can then be downloaded as a CSV file and used within Python. Google also offers Google Surveys, a paid service that allows you to have a wider range of respondents and gives you more control in determining your target audience.
Pros:
- Bespoke: Data acquired manually can be made to address all of the business objective’s needs, and can need little to no cleaning to be effective.
- Community: Manually acquired data can be useful in advancing the fields of data science and machine learning, and can be further explored by others.
Cons:
- Time: Manually acquired data can take more time to collect than other methods, especially when dealing with large datasets.
- Cost: Some services and devices used to manually acquire data can be expensive and can be limiting for certain projects.
→ How to build your own dataset for data science projects
In this article, you will learn how to create your own dataset for a data science project. This is helpful if the dataset you want is not available and you need to build your own.
Details
In this lesson, we will focus on structured data, as it is the data found within relational databases.
Structured Data
Structured data is the most organized type of data. It follows a data model, which is a kind of blueprint that defines and describes the structure of the data. Structured data is typically in the form of tables that are well defined by their rows and columns, a great example being a DataFrame from the popular Python library pandas.
Some advantages of structured data are that its explicit structure aids in organized storage and accessibility. This also allows it to be easily indexed for efficient reference. With structured data, we are able to set permissions for security purposes, so that only those that meet predetermined guidelines and clearances can access the data.
Examples of structured data are:
- Online forms
- Excel Spreadsheets
- SQL Databases
Relational Database Management Systems
Another advantage of working with structured data is the availability of Relational Database Management Systems (RDBMS). RDBMS are important for data science and analytics because they provide the functionality needed for creating, reading, updating, and deleting data (often referred to as CRUD) within our database. The language that data teams most often use in an RDBMS to execute commands is Structured Query Language (SQL), pronounced “S-Q-L” or “sequel”.
Structured Query Language (SQL) is one of the most common and powerful languages for querying databases. It is fast, secure, and able to return millions of data points in just a few lines. While there are hundreds of RDBMS, some of the most common RDBMS that use SQL are:
- MySQL is a popular free and open-source SQL database. It is widely used for web applications because the MySQL Server is renowned for its speed, reliability, and ease of use on multiple platforms.
- Much like MySQL, PostgreSQL is a popular open-source SQL database. PostgreSQL is one of the oldest RDBMS, with over 30 years of development behind it, so it has an extensive community supporting it and is known for its reliability and array of features.
- Oracle is considered to be among the most popular of all RDBMS. Owned by Oracle Corporation and closed source, Oracle DB is the go-to RDBMS for corporations as it is able to scale for and support their massive workloads effectively. Licensing for Oracle DB, however, is known to be expensive and can be infeasible for certain applications.
- SQL Server is another privately-owned RDBMS that is popular, especially among corporations. While Microsoft offers SQL Server free through its SQL Server 2019 Express edition, the enterprise editions that are designed for large-scale applications with more functionality become more expensive as your application scales.
- SQLite is another popular open-source SQL database. SQLite is designed to be compact, efficient, and self-contained. SQLite is able to store a complete database in a single cross-platform disk file, so it is not necessary to connect databases to a server. These characteristics are what make SQLite the most used RDBMS, as it is found in most cell phones, computers, and several other everyday devices. Throughout our course, we will use SQLite for working with relational databases.
SQL
Structured Query Language (SQL) is a programming language designed to manage data stored in relational databases. SQL operates through simple, declarative statements. This keeps data accurate and secure, and helps maintain the integrity of databases, regardless of size.
The SQL language is widely used today across web frameworks and database applications. Knowing SQL gives you the freedom to explore your data, and the power to make better decisions. By learning SQL, you will also learn concepts that apply to nearly every data storage system.
Relational Databases
SELECT * FROM celebs;
A relational database is a database that organizes information into one or more tables. Here, the relational database contains one table.
A table is a collection of data organized into rows and columns. Tables are sometimes referred to as relations. Here the table is celebs.
A column is a set of data values of a particular type. Here, id, name, and age are the columns.
A row is a single record in a table. The first row in the celebs table has:
- An id of 1
- A name of Justin Bieber
- An age of 22
All data stored in a relational database is of a certain data type. Some of the most common data types are:
- INTEGER, a positive or negative whole number
- TEXT, a text string
- DATE, the date formatted as YYYY-MM-DD
- REAL, a decimal value
Statements
The code below is a SQL statement. A statement is text that the database recognizes as a valid command. Statements always end in a semicolon ;.
CREATE TABLE table_name (
column_1 data_type,
column_2 data_type,
column_3 data_type
);
Let’s break down the components of a statement:
- CREATE TABLE is a clause. Clauses perform specific tasks in SQL. By convention, clauses are written in capital letters. Clauses can also be referred to as commands.
- table_name refers to the name of the table that the command is applied to.
- (column_1 data_type, column_2 data_type, column_3 data_type) is a parameter. A parameter is a list of columns, data types, or values that are passed to a clause as an argument. Here, the parameter is a list of column names and the associated data types.
Create
CREATE statements allow us to create a new table in the database. You can use the CREATE statement anytime you want to create a new table from scratch. The statement below creates a new table named celebs.
CREATE TABLE celebs (
id INTEGER,
name TEXT,
age INTEGER
);
- CREATE TABLE is a clause that tells SQL you want to create a new table.
- celebs is the name of the table.
- (id INTEGER, name TEXT, age INTEGER) is a list of parameters defining each column, or attribute, in the table and its data type:
  - id is the first column in the table. It stores values of data type INTEGER.
  - name is the second column in the table. It stores values of data type TEXT.
  - age is the third column in the table. It stores values of data type INTEGER.
Insert
The INSERT statement inserts a new row into a table.
We can use the INSERT statement when we want to add new records. The statement below enters a record for Justin Bieber into the celebs table.
INSERT INTO celebs (id, name, age)
VALUES (1, 'Justin Bieber', 22);
- INSERT INTO is a clause that adds the specified row or rows.
- celebs is the table the row is added to.
- (id, name, age) is a parameter identifying the columns that data will be inserted into.
- VALUES is a clause that indicates the data being inserted.
- (1, 'Justin Bieber', 22) is a parameter identifying the values being inserted.
  - 1: an integer that will be added to the id column
  - 'Justin Bieber': text that will be added to the name column
  - 22: an integer that will be added to the age column
Select
SELECT statements are used to fetch data from a database. In the statement below, SELECT returns all data in the name column of the celebs table.
SELECT name FROM celebs;
- SELECT is a clause that indicates that the statement is a query. You will use SELECT every time you query data from a database.
- name specifies the column to query data from.
- FROM celebs specifies the name of the table to query data from. In this statement, data is queried from the celebs table.
Alter
The ALTER TABLE statement adds a new column to a table. You can use this command when you want to add columns to a table. The statement below adds a new column twitter_handle to the celebs table.
ALTER TABLE celebs
ADD COLUMN twitter_handle TEXT;
- ALTER TABLE is a clause that lets you make the specified changes.
- celebs is the name of the table that is being changed.
- ADD COLUMN is a clause that lets you add a new column to a table:
  - twitter_handle is the name of the new column being added
  - TEXT is the data type for the new column
- NULL is a special value in SQL that represents missing or unknown data. Here, the rows that existed before the column was added have NULL (∅) values for twitter_handle.
Update
The UPDATE statement edits a row in a table. You can use the UPDATE statement when you want to change existing records. The statement below updates the record with an id value of 4 to have the twitter_handle @taylorswift13.
UPDATE celebs
SET twitter_handle = '@taylorswift13'
WHERE id = 4;
- UPDATE is a clause that edits a row in the table.
- celebs is the name of the table.
- SET is a clause that indicates the column to edit.
  - twitter_handle is the name of the column that is going to be updated
  - @taylorswift13 is the new value that is going to be inserted into the twitter_handle column
- WHERE is a clause that indicates which row(s) to update with the new column value. Here the row with a 4 in the id column is the row that will have the twitter_handle updated to @taylorswift13.
Delete
The DELETE FROM statement deletes one or more rows from a table. You can use the statement when you want to delete existing records. The statement below deletes all records in the celebs table with no twitter_handle:
DELETE FROM celebs
WHERE twitter_handle IS NULL;
- DELETE FROM is a clause that lets you delete rows from a table.
- celebs is the name of the table we want to delete rows from.
- WHERE is a clause that lets you select which rows you want to delete. Here we want to delete all of the rows where the twitter_handle column IS NULL.
- IS NULL is a condition in SQL that returns true when the value is NULL and false otherwise.
Constraints
Constraints that add information about how a column can be used are invoked after specifying the data type for a column. They can be used to tell the database to reject inserted data that does not adhere to a certain restriction. The statement below sets constraints on the celebs table.
CREATE TABLE celebs (
id INTEGER PRIMARY KEY,
name TEXT UNIQUE,
date_of_birth TEXT NOT NULL,
date_of_death TEXT DEFAULT 'Not Applicable'
);
- PRIMARY KEY columns can be used to uniquely identify the row. Attempts to insert a row with an identical value to a row already in the table will result in a constraint violation, which will not allow you to insert the new row.
- UNIQUE columns have a different value for every row. This is similar to PRIMARY KEY except a table can have many different UNIQUE columns.
- NOT NULL columns must have a value. Attempts to insert a row without a value for a NOT NULL column will result in a constraint violation and the new row will not be inserted.
- DEFAULT columns take an additional argument that will be the assumed value for an inserted row if the new row does not specify a value for that column.
Intro to SQL:
SQL Commands:
- ALTER TABLE
- AND
- AS
- AVG()
- BETWEEN
- CASE
- COUNT()
- CREATE TABLE
- DELETE
- GROUP BY
- HAVING
- INNER JOIN
- INSERT
- IS NULL / IS NOT NULL
- LIKE
- LIMIT
- MAX()
- MIN()
- OR
- ORDER BY
- OUTER JOIN
- ROUND()
- SELECT
- SELECT DISTINCT
- SUM
- UPDATE
- WHERE
- WITH
Thinking in SQL vs Thinking in Python
→ Thinking in SQL vs Thinking in Python
In this article, you will learn tips on how to use Python scripts with SQL queries after only using SQL queries on their own. This is helpful if you want to integrate the flexibility and expanded capabilities of Python methods into the simplicity and directness of SQL queries.
Tutorial
In this tutorial, you will learn how to access a SQLite database in a Python environment. This is helpful for accessing and manipulating datasets that are not located on local devices. Some benefits to this include:
- Saving memory space on your local drive.
- Reducing the amount of processing power needed to manipulate large datasets.
- Providing more data security.
- Allowing for seamless transitions as the given dataset changes.
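A minimal sketch of the idea with Python's built-in sqlite3 module (the database file name and the celebs table reuse the earlier example and are assumed to exist):
import sqlite3

# Connect to a database file (created if it does not already exist)
connection = sqlite3.connect("celebs.db")
cursor = connection.cursor()

# Run an ordinary SQL query and fetch the results into Python
cursor.execute("SELECT name, age FROM celebs;")
for name, age in cursor.fetchall():
  print(name, age)

connection.close()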
Details
- Introduction
- Rules of Scraping
- Requests
- The BeautifulSoup Object
- Object Types
- Navigating by Tags
- Find All
- Select for CSS Selectors
- Reading Text
- Review
Details
A function is an object that is able to accept some sort of input, possibly modify it, and return some sort of output. In Python, a lambda function is a one-line shorthand for a function. A simple lambda function might look like this:
add_two = lambda my_input: my_input + 2
So this code:
print(add_two(3))
print(add_two(100))
print(add_two(-2))
would print:
>>> 5
>>> 102
>>> 0
Let’s break this syntax down:
- The function is stored in a variable called add_two.
- lambda declares that this is a lambda function (if you are familiar with normal Python functions, this is similar to how we use def to declare a function).
- my_input is what we call the input we are passing into add_two.
- We are returning my_input plus 2 (with normal Python functions, we use the keyword return).
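Lambdas are most useful when a small throwaway function is passed to another function. A quick sketch (the data is arbitrary):
song_lengths = [193, 240, 178, 305]

# Use a lambda as the sort key: sort by distance from 200 seconds
closest_to_200 = sorted(song_lengths, key=lambda length: abs(length - 200))
print(closest_to_200)

# Use a lambda with map() to convert seconds to minutes
in_minutes = list(map(lambda length: length / 60, song_lengths))
print(in_minutes)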
Lambda Function Code Challenge:
Details
Creating, Loading, Selecting Data with Pandas:
- Create a DataFrame I
- Create a DataFrame II
- Loading and Saving CSVs
- Inspect a DataFrame
- Select Columns
- Selecting Multiple Columns
- Select Rows
- Selecting Multiple Rows
- Select Rows with Logic I
- Select Rows with Logic II
- Select Rows with Logic III
- Setting indices
- Review
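A small sketch of creating and selecting data with pandas (the column names and values are made up):
import pandas as pd

df = pd.DataFrame({
  "name": ["Noelle", "Ava", "Sam", "Mia"],
  "height": [61, 70, 67, 64],
})

print(df.head())                 # inspect the first rows
print(df["height"])              # select a single column
print(df[["name", "height"]])    # select multiple columns
print(df[df.height > 65])        # select rows with logic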
Modifying DataFrames:
- Adding a Column I
- Adding a Column II
- Adding a Column III
- Performing Column Operations
- Reviewing Lambda Function
- Reviewing Lambda Function: If Statements
- Applying a Lambda to a Column
- Applying a Lambda to a Row
- Renaming Columns
- Renaming Columns II
- Review
Documentation - Pandas
10 minutes to Pandas User Guide
In this documentation you will learn about pandas, a Python library for data analysis. This library is helpful if you would like to perform data manipulation (slicing, dicing, joining, merging, grouping) and analysis.
Documentation - NumPy
Absolute Beginner's Guide to NumPy
In this documentation you will learn about the Python library NumPy. This is helpful if you would like to perform mathematical operations on arrays and matrices.
Details
Introduction
In this lesson, you will learn about aggregates in pandas. An aggregate statistic is a way of creating a single number that describes a group of numbers. Common aggregate statistics include the mean, median, and standard deviation.
You will also learn how to rearrange a DataFrame into a pivot table, which is a great way to compare data across two dimensions.
Aggregates:
- Calculating Column Statistics
- Calculating Aggregate Functions I
- Calculating Aggregate Functions II
- Calculating Aggregate Functions III
- Calculating Aggregate Functions IV
- Pivot Tables
- Review
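A brief sketch of a groupby and a pivot table in pandas (the data is invented for the example):
import pandas as pd

orders = pd.DataFrame({
  "shoe_type": ["sandal", "boot", "sandal", "boot"],
  "shoe_color": ["black", "black", "brown", "brown"],
  "price": [80, 120, 70, 140],
})

# Aggregate a single column statistic per group
print(orders.groupby("shoe_type").price.mean())

# Rearrange the same aggregation into a pivot table
pivot = orders.pivot_table(index="shoe_type", columns="shoe_color",
                           values="price", aggfunc="mean")
print(pivot)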
Documentation
Pandas Tutorial - Pandas.Dataframe.Groupby
In this documentation, you will learn about how to group data using pandas.dataframe.groupby. This is helpful if you would like to group large amounts of data and compute operations on those groups.
Documentation
Pandas Documentation - Reshaping and Pivot Tables
In this documentation, you will learn about reshaping dataframe objects to visualize data in different formats. This is helpful if you would like to explore the relationships among variables by representing the data in different ways.
Details
Instead, we can split our data into three tables:
- orders would contain the information necessary to describe an order: order_id, customer_id, product_id, quantity, and timestamp
- products would contain the information to describe each product: product_id, product_description, and product_price
- customers would contain the information for each customer: customer_id, customer_name, customer_address, and customer_phone_number
In this lesson, we will learn the Pandas commands that help us work with data stored in multiple tables.
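A minimal sketch of joining two such tables with pandas (the values are invented for illustration):
import pandas as pd

orders = pd.DataFrame({
  "order_id": [1, 2],
  "customer_id": [101, 102],
  "product_id": [11, 12],
})
customers = pd.DataFrame({
  "customer_id": [101, 102],
  "customer_name": ["Ines", "Theo"],
})

# Inner merge on the shared customer_id column
orders_with_names = pd.merge(orders, customers, on="customer_id")
print(orders_with_names)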
Multiple Tables:
- Intro
- Inner Merge I
- Inner Merge II
- Inner Merge III
- Merge on Specific Columns
- Merge on Specific Columns II
- Mismatched Merges
- Outer Merge
- Left and Right Merge
- Concatenate DataFrames
- Review
Documentation
Pandas Documentation - Pandas.Merge
In this documentation, you will learn about merging dataframes in Pandas. This is helpful if you would like to combine data from two dataframes into a single dataframe.
Details
A regular expression is a special sequence of characters that describe a pattern of text that should be found, or matched, in a string or document. By matching text, we can identify how often and where certain pieces of text occur, as well as have the opportunity to replace or update these pieces of text if needed.
Regular Expressions have a variety of use cases including:
- validating user input in HTML forms
- verifying and parsing text in files, code and applications
- examining test results
- finding keywords in emails and web pages
Concepts:
- Regular expressions are special sequences of characters that describe a pattern of text that is to be matched.
- We can use literals to match the exact characters that we desire.
- Alternation, using the pipe symbol |, allows us to match the text preceding or following the |.
- Character sets, denoted by a pair of brackets [], let us match one character from a series of characters.
- Wildcards, represented by the period or dot ., will match any single character (letter, number, symbol or whitespace).
- Ranges allow us to specify a range of characters in which we can make a match.
- Shorthand character classes like \w, \d and \s represent the ranges of word characters, digit characters, and whitespace characters, respectively.
- Groupings, denoted with parentheses (), group parts of a regular expression together, and allow us to limit alternation to part of a regex.
- Fixed quantifiers, represented with curly braces {}, let us indicate the exact quantity or a range of quantities of a character we wish to match.
- Optional quantifiers, indicated by the question mark ?, allow us to indicate that a character in a regex is optional, or can appear either 0 times or 1 time.
- The Kleene star, denoted with the asterisk *, is a quantifier that matches the preceding character 0 or more times.
- The Kleene plus, denoted by the plus +, matches the preceding character 1 or more times.
- The anchor symbols hat ^ and dollar sign $ are used to match text at the start and end of a string, respectively.
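Python's built-in re module lets you try these patterns out. A short sketch (the pattern and text are arbitrary examples):
import re

text = "My cat and my dog were born in 2019 and 2021."

# Alternation and grouping: match either "cat" or "dog"
print(re.findall(r"(cat|dog)", text))

# Shorthand character class and a fixed quantifier: four-digit years
print(re.findall(r"\d{4}", text))

# Anchor: does the string start with "My"?
print(bool(re.match(r"^My", text)))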
Resource
RegExr Regular Expression Builder
In this resource, you will learn how to build regular expressions. This is helpful if you wish to experiment with regular expressions before using them professionally.
Details
When we receive raw data, we have to do a number of things before we’re ready to analyze it, possibly including:
- diagnosing the “tidiness” of the data: how much data cleaning we will have to do
- reshaping the data: getting the right rows and columns for effective analysis
- combining multiple files
- changing the types of values: how we fix a column where numerical values are stored as strings, for example
- dropping or filling missing values: how we deal with data that is incomplete or missing
- manipulating strings to represent the data better
We will go through the techniques data scientists use to accomplish these goals by looking at some “unclean” datasets and trying to get them into a good, clean state.
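A small sketch of a few of these cleaning steps in pandas (the DataFrame is invented and intentionally messy):
import pandas as pd

df = pd.DataFrame({
  "name": ["Ada", "Ada", "Grace", None],
  "price": ["$12", "$12", "$30", "$25"],
})

# Drop exact duplicate rows
df = df.drop_duplicates()

# String parsing: strip the "$" and convert the column to a numeric type
df["price"] = df["price"].str.replace("$", "", regex=False).astype(int)

# Deal with missing values by dropping incomplete rows
df = df.dropna(subset=["name"])
print(df)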
Cleaning Data:
- Intro
- Diagnose the Data
- Dealing with Multiple Files
- Reshaping your Data
- Dealing with Duplicates
- Splitting by Index
- Splitting by Character
- Looking at Types
- String Parsing
- More String Parsing
- Missing Values
Documentation
In this resource you will learn how to work with null or missing values in pandas. This is helpful if you wish to learn the library’s vast array of such tools beyond NaN.
Details
Generally, variables will come in two varieties: categorical and quantitative. Categorical variables group observations into separate categories that can be ordered or unordered. Quantitative variables, on the other hand, are variables expressed numerically, whether as a count or a measurement.
Let’s dive a bit deeper into the different variable types to understand how to identify them in a dataset.
- Quantitative Variables
We can think of quantitative variables as any information about an observation that can only be described with numbers. Quantitative variables are generally counts or measurements of something (e.g., number of points earned in a game or height). They are well suited for mathematical operations and quantitative analysis, and are helpful for answering questions like “How many/much?”, “What is the average?”, or “How often?”. There are two types of quantitative variables, discrete and continuous, and they both help to serve different functions in a dataset.
a. Discrete Variables
Discrete quantitative variables are numeric values that represent counts and can only take on integer values. They represent whole units that can not be broken down into smaller pieces, and as such cannot be meaningfully expressed with decimals or fractions. Examples of discrete variables are the number of children in a person’s family or the number of coin flips a person makes. Unless we are working with quantum mechanics, we can not meaningfully have flipped a coin 3.5 times, or have 4.75 sisters.
b. Continuous Variables
Continuous quantitative variables are numeric measurements that can be expressed with decimal precision. Theoretically, continuous variables can take on infinitely many values within a given range. Examples of continuous variables are length, weight, and age which can all be described with decimal values.
- Categorical Variables
Categorical variables differ from quantitative variables in that they focus on the different ways data can be grouped rather than counted or measured. With categorical variables, we want to understand how the observations in our dataset can be grouped and separated from one another based on their attributes. When the groupings have a specific order or ranking, the variable is an ordinal categorical variable. If there is no apparent order or ranking to the categories, we refer to the variable as a nominal categorical variable.
a. Ordinal Variables
Do you remember working with a column in a dataset where the values of the column were groups that were greater or lesser than each other in some intrinsic way? Suppose there was a variable containing responses to the question “Rate your agreement with the statement: The minimum age to drive should be lowered.” The response options are “strongly disagree”, “disagree”, “neutral”, “agree”, and “strongly agree”. Because we can see an order where “strongly disagree” < “disagree” < “neutral” < “agree” < “strongly agree” in relation to agreement, we consider the variable to be ordinal.
b. Nominal Variables
Nominal categorical variables are those variables with two or more categories that do not have any relational order. Examples of nominal categories could be states in the U.S., brands of computers, or ethnicities. Notice how for each of these variables, there is no intrinsic ordering that distinguishes a category as greater than or less than another category.
c. Binary Variables
Binary or dichotomous variables are a special kind of nominal variable that have only two categories. Because there are only two possible values for binary variables, they are mutually exclusive to one another. We can imagine a variable that describes if a picture contains a cat or a dog as a binary variable. In this case, if the picture is not a dog, it must be a cat, and vice versa. Binary variables can also be described with numbers, similar to bits, with 0 or 1 values. Likewise you may find binary variables containing boolean values of True or False.
Variable Types:
- Assessing Variable Types
- Categorical Variables
- Quantitative Variables
- Matching Quantitative Variables
- Matching Categorical Variables
- Altering the Data Types of Pandas Dataframes
- The Pandas Category Data Type
- Label Encoding with .cat.codes
- One-Hot Encoding
- Review
Review
In this lesson, you have:
- Discovered the different types of variables you will encounter when working with data and their corresponding data types in Python.
- Explored datasets with .head().
- Assessed categories within variables with the .unique() method.
- Practiced ways to check the data type of variables, like the .dtypes attribute and .info() method.
- Altered data with the .replace() method.
- Learned how to change the data types of variables using the .astype() method.
- Investigated the pandas category data type.
- Developed your Label Encoding and One-Hot Encoding skills with the .cat.codes accessor and pd.get_dummies() method.
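A compact sketch of the encoding steps from this lesson (the data is made up):
import pandas as pd

df = pd.DataFrame({"shirt_size": ["S", "M", "L", "M"]})

# Convert to the pandas category data type, with an explicit order
df["shirt_size"] = pd.Categorical(df["shirt_size"],
                                  categories=["S", "M", "L"], ordered=True)

# Label encoding: each category becomes an integer code
df["size_code"] = df["shirt_size"].cat.codes

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["shirt_size"])
print(df, one_hot, sep="\n")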
Details
1) Mean
The mean, often referred to as the average, is a way to measure the center of a dataset.
The average of a set is calculated using a two-step process:
- Add all of the observations in your dataset.
- Divide the total sum from step one by the number of points in your dataset.
Mean:
2) Median
The formal definition for the median of a dataset is:
The value that, assuming the dataset is ordered from smallest to largest, falls in the middle. If there are an even number of values in a dataset, you either report both of the middle two values or their average.
There are always two steps to finding the median of a dataset:
- Order the values in the dataset from smallest to largest
- Identify the number(s) that fall(s) in the middle
Median:
3) Mode
The formal definition for the mode of a dataset is:
The most frequently occurring observation in the dataset. A dataset can have multiple modes if there is more than one value with the same maximum frequency.
While you may be able to find the mode of a small dataset by simply looking through it, if you have trouble, we recommend you follow these two steps:
- Find the frequency of every unique number in the dataset
- Determine which number has the highest frequency
Mode:
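A quick sketch that computes all three with NumPy and the built-in statistics module (the dataset is arbitrary):
import numpy as np
from statistics import multimode

data = [2, 4, 4, 7, 9, 4, 1]

print(np.mean(data))      # sum of observations divided by their count
print(np.median(data))    # middle value once the data is sorted
print(multimode(data))    # most frequent value(s); here [4]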
Documentation
In this documentation, you will learn about the different functions for statistical analysis within the Python library NumPy. This is helpful if you would like to understand ways to analyze data using NumPy.
Details
1) Variance
Variance is a descriptive statistic that describes how spread out the points in a data set are.
Variance:
2) Standard Deviation
Standard deviation is computed by taking the square root of the variance.
Standard Deviation:
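A quick sketch with NumPy (the dataset is arbitrary):
import numpy as np

data = [61, 70, 67, 64]

variance = np.var(data)            # average squared distance from the mean
standard_deviation = np.std(data)  # square root of the variance

print(variance, standard_deviation)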
Details
1) Histogram
There’s no better tool to visualize the uncertainty and chaos in data than a histogram. A histogram displays the distribution of your underlying data.
Histograms reveal interpretable trends in your data. They don’t provide a yes or no answer, but they are often used as a starting point for discussion and for informing answers to questions about your data.
Histogram:
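A minimal sketch of plotting a histogram with Matplotlib (the data here is randomly generated just for illustration):
import numpy as np
import matplotlib.pyplot as plt

# 1,000 values drawn from a normal distribution
data = np.random.normal(loc=100, scale=15, size=1000)

plt.hist(data, bins=20)
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()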
Details
1) Quartiles
A common way to communicate a high-level overview of a dataset is to find the values that split the data into four groups of equal size. By doing this, we can then say whether a new datapoint falls in the first, second, third, or fourth quarter of the data.
The values that split the data into fourths are the quartiles. Those values are called the first quartile (Q1), the second quartile (Q2), and the third quartile (Q3).
In the example from the lesson, Q1 is 10, Q2 is 13, and Q3 is 22. Those three values split the data into four groups that each contain five datapoints.
Quartiles:
2) Quantiles
Quantiles are points that split a dataset into groups of equal size. For example, let’s say you just took a test and wanted to know whether you’re in the top 10% of the class. One way to determine this would be to split the data into ten groups with an equal number of datapoints in each group and see which group you fall into.
For a class of 30 test scores, there are nine values that split the dataset into ten groups of equal size, with 3 test scores in each group.
Those nine values that split the data are quantiles! Specifically, they are the 10-quantiles, or deciles.
Quantiles:
3) Interquartile Range
The interquartile range (IQR) is a descriptive statistic that tries to solve this problem. The IQR ignores the tails of the dataset, so you know the range around which your data is centered.
In the example from the lesson, most of the data is between 0 and 15. However, there is one large negative outlier (-20) and one large positive outlier (40). This makes the range of the dataset 60 (the difference between 40 and -20). That’s not very representative of the spread of the majority of the data!
Interquartile Range:
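A minimal NumPy/SciPy sketch of quartiles, deciles, and the IQR (the outlier-heavy data below is invented to mirror the example):

```python
import numpy as np
from scipy import stats

data = np.array([-20, 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 40])

q1, q2, q3 = np.quantile(data, [0.25, 0.5, 0.75])      # quartiles
deciles = np.quantile(data, np.arange(0.1, 1.0, 0.1))  # the nine 10-quantiles
iqr = q3 - q1                                          # interquartile range
print(q1, q2, q3, iqr)
print(stats.iqr(data))                                 # same value via SciPy
```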
4) Boxplot
Boxplots are one of the most common ways to visualize a dataset. Like histograms, boxplots give you a sense of the central tendency and spread of the data.
Take a look at the boxplot on this page. This boxplot is visualizing a dataset containing the lengths of 9,975 songs. We’ll highlight some of the features of the boxplot and will dig into them in more detail in this lesson:
- The line in the center of the box is the median.
- The edges of the box are the first and third quartiles. This makes the length of the box the interquartile range — the middle 50% of your data.
- The whiskers of the boxplot extend to include most of the data. There are many different ways to calculate the length of the whiskers.
- Outliers are points that fall beyond the whiskers. Those points are represented with dots. In the boxplot we’re showing, there are many outliers.
Review
Here are some of the major takeaways from boxplots:
- The box of a boxplot visualizes the median, first quartile, and third quartile of a dataset.
- The length of the box in a boxplot visualizes the interquartile range.
- The whiskers extend from the box 1.5 times the size of the interquartile range.
- Outliers are points that fall outside of the whiskers. They’re represented by dots.
- Boxplots are especially useful for comparing the spread of multiple datasets.
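A minimal Matplotlib sketch comparing the spread of two made-up samples side by side:

```python
import numpy as np
import matplotlib.pyplot as plt

group_a = np.random.normal(240, 30, size=500)   # e.g. song lengths in seconds
group_b = np.random.normal(260, 60, size=500)

plt.boxplot([group_a, group_b])
plt.xticks([1, 2], ["Group A", "Group B"])
plt.ylabel("Length (seconds)")
plt.show()
```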
Details
When exploring data, we’re often interested in summarizing a large amount of information with a single number or visualization.
Depending on what we’re trying to understand from our data, we may need to rely on different statistics. For quantitative data, we can summarize central tendency using mean, median or mode and we can summarize spread using standard deviation, variance, or percentiles. However, when working with categorical data, we may not be able to use all the same summary statistics.
Details
1) Bar Chart
Once you have your distinct categories, a bar chart is best used to display the different value counts of each category. We can also compare means, but we would recommend using a side-by-side box plot for that instead because they give a full picture of the five-number summary.
Bar Charts vs. Histograms
Finally, we have one last thing to go over before we jump into coding our own charts. If you have ever run into histograms, you may notice that bar charts and histograms look almost identical. However, these are the key differences between them:
- Bar charts are used for categorical variables, while histograms are used for quantitative data.
- Histograms must always be arranged in a specific order because they represent a range of numerical values. For a bar chart, each bar represents the frequency of a particular category of a categorical variable. Unless the variable is ordinal, the bars could be arranged in any order.
Bar Chart:
2) Pie Chart
Pie charts are made up of slices that are combined to make a full circle. Each slice represents a proportion, and the sum of each proportion (slice) adds up to 1 (or 100%).
These slices are meant to give viewers a relative size of the data, similar to how bars in a bar chart act. The arc length (size of the slice’s angle) of each slice is proportional to the quantity it represents. This also means that the area of each slice is proportional to this quantity as well.
Pie Chart:
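A minimal Matplotlib sketch of both chart types, using invented category counts:

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]
counts = [35, 50, 15]

plt.bar(categories, counts)   # value counts per category
plt.ylabel("Count")
plt.show()

plt.pie(counts, labels=categories, autopct="%1.0f%%")  # slices as proportions of the whole
plt.axis("equal")
plt.show()
```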
Details
1) Quantitative and Categorical Variables
Examining the relationship between variables can give us key insight into our data. In this lesson, we will cover ways of assessing the association between a quantitative variable and a categorical variable.
- Intro
- Mean and Median Differences
- Side-by-Side Box Plots
- Inspecting Overlapping Histograms
- Exploring Non-Binary Categorical Variables
- Review
2) Two Quantitative Variables
When associations exist between variables, it means that information about the value of one variable gives us information about the value of the other variable. In this lesson, we will cover ways of examining an association between two quantitative variables.
3) Two Categorical Variables
Details
Details
1) Central Limit Theorem (CLT)
The Central Limit Theorem (CLT) is a powerful statistical tool that is useful in quantifying uncertainty around sample mean estimates. It is also the basis for common hypothesis tests, such as Z- and t-tests. A formal proof of the CLT requires some complex math, but this article will demonstrate it using a simulation!
It’s now time to formally define the CLT, which tells us that the sampling distribution of the mean:
- is normally distributed (for large enough sample size)
- is centered at the population mean
- has standard deviation equal to the population standard deviation divided by the square root of the sample size. This is called Standard Error.
In real life, the data scientist is still stuck with their one sample mean as a best guess for the population mean. However, they can leverage the CLT to estimate the standard error — the amount of variation in imagined, repeated samples!
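A minimal simulation sketch of the CLT using NumPy (the skewed population below is made up):

```python
import numpy as np

population = np.random.exponential(scale=10, size=100_000)  # deliberately non-normal

sample_size = 50
sample_means = [np.mean(np.random.choice(population, size=sample_size))
                for _ in range(10_000)]

print(np.mean(sample_means))                      # close to the population mean
print(np.std(sample_means))                       # close to the standard error:
print(np.std(population) / np.sqrt(sample_size))  # population std / sqrt(n)
```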
2) Introduction to Hypothesis Testing
It is important to remember the following:
- A p-value is a probability, usually reported as a decimal between zero and one.
- A small p-value means that an observed sample statistic (or something more extreme) would be unlikely to occur if the null hypothesis is true.
- A significance threshold can be used to translate a p-value into a “significant” or “non-significant” result.
- In practice, the alternative hypothesis and significance threshold should be chosen prior to data collection.
3) One Sample t Test in SciPy
One-sample t-tests are used for comparing a sample average to a hypothetical population average. For example, a one-sample t-test might be used to address questions such as:
- Is the average amount of time that visitors spend on a website different from 5 minutes?
- Is the average amount of money that customers spend on a purchase more than 10 USD?
t Test:
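A minimal SciPy sketch of the first question, with invented visit durations and a hypothesized mean of 5 minutes:

```python
import numpy as np
from scipy.stats import ttest_1samp

times = np.array([4.2, 5.1, 6.3, 3.8, 5.9, 4.7, 5.5, 6.1, 4.9, 5.3])

tstat, pval = ttest_1samp(times, popmean=5)   # H0: the true mean is 5 minutes
print(tstat, pval)
```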
4) Simulating a Binomial Test
Binomial tests are similar to one-sample t-tests in that they test a sample statistic against some population-level expectation. The difference is that:
- binomial tests are used for binary categorical data to compare a sample frequency to an expected population-level probability
- one-sample t-tests are used for quantitative data to compare a sample mean to an expected population mean.
Binomial Test:
- Intro
- Summarizing the Sample
- Simulating Randomness
- Simulating the Null Distribution I
- Simulating the Null Distribution II
- Inspecting the Null Distribution
- Confidence Intervals
- Calculating a One-Sided P-Value
- Calculating a Two-Sided P-Value
- Writing a Binomial Test Function
- Binomial Testing with SciPy
- Review
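As a minimal sketch of the SciPy version mentioned in the topic list above (the counts are invented; `binomtest` requires SciPy 1.7+, while older releases used `binom_test`):

```python
from scipy.stats import binomtest

# e.g. 41 purchases out of 500 visitors, against an expected 10% purchase rate
result = binomtest(k=41, n=500, p=0.10, alternative="two-sided")
print(result.pvalue)
```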
5) Significance Thresholds
- Intro
- Interpreting a P-Value based on a Significance Threshold
- Error Types
- Setting the Type I Error Rate
- Problems with Multiple Hypothesis Tests
Documentation
Scipy Documentation - Statistical Functions
In this documentation, you will find information about SciPy’s functions for statistical analysis. This is helpful if you would like to perform statistical analyses and/or mathematical computations on data in Python.
Details
In this lesson, we’ll use hypothesis tests to make inference about population-level associations between two variables.
We will cover four different hypothesis tests:
- Two Sample T-Tests (for an association between a quantitative variable and a binary categorical variable)
- ANOVA and Tukey Tests (for an association between a quantitative variable and a non-binary categorical variable)
- Chi-Square Tests (for an association between two categorical variables)
Association:
Assumption:
Before we use a two sample t-test, ANOVA, or Tukey’s range test, we need to be sure that the following things are true:
- The observations should be independently randomly sampled from the population
- The standard deviations of the groups should be equal
- The data should be normally distributed…ish
- The groups created by the categorical variable must be independent
Before we use a Chi-Square test, we need to be sure that the following things are true:
- The observations should be independently randomly sampled from the population
- The categories of both variables must be mutually exclusive
- The groups should be independent
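A minimal SciPy sketch of these tests on simulated groups and an invented contingency table (Tukey's range test is usually run through statsmodels and is not shown here):

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind, f_oneway, chi2_contingency

group_a = np.random.normal(10, 2, 100)
group_b = np.random.normal(11, 2, 100)
group_c = np.random.normal(12, 2, 100)

print(ttest_ind(group_a, group_b).pvalue)          # quantitative vs. binary categorical
print(f_oneway(group_a, group_b, group_c).pvalue)  # quantitative vs. non-binary categorical

table = pd.crosstab(pd.Series(["yes", "no", "yes", "yes", "no", "no"], name="clicked"),
                    pd.Series(["A", "A", "B", "B", "A", "B"], name="version"))
print(chi2_contingency(table))                     # two categorical variables
```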
Details
There are many different types of hypothesis tests that should be used in different situations. In this article, we are going to briefly review some of them and when they are appropriate.
Here is a summary table of some of the hypothesis tests we’ll cover:
1) A/B Testing - Sample Size Calculators
An A/B Test is a scientific method of choosing between two options (Option A and Option B). Some examples of A/B tests include:
- What number of sale items on a website makes customers most likely to purchase something: 25 or 50?
- What color button are customers more likely to click on: blue or green?
- Do people spend more time on a website if the background is green or orange?
For A/B tests where the outcome of interest (e.g., whether or not a customer makes a purchase) is categorical, an A/B test is conducted using a Chi-Square hypothesis test. In order to determine the sample size necessary for this kind of test, a sample size calculator requires three numbers:
- Baseline conversion rate
- Minimum detectable effect (also called the minimum desired lift)
- Statistical significance threshold
A/B Testing:
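One way to approximate what an online sample size calculator does is with statsmodels (not part of the path materials above; the baseline rate, lift, and thresholds below are invented):

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.10, 0.12)   # baseline 10%, minimum detectable lift to 12%
n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                           power=0.8, alternative="two-sided")
print(n_per_group)   # required sample size per group
```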
2) Sample Size Determination with Simulation
Details
Details
1) Introduction
Data visualization is an important component of Exploratory Data Analysis (EDA) because it allows a data analyst to “look at” their data and get to know the variables and relationships between them. In order to choose and design a data visualization, it is important to consider two things:
- The question you want to answer (and how many variables that question involves)
- The data that is available (is it quantitative or categorical?)
Univariate Analysis
Univariate analysis focuses on a single variable at a time. Univariate data visualizations can help us answer questions like:
- What is the typical price of a rental in New York City?
- What proportion of NYC rentals have a gym?
-
Quantitative Variables
Box plots (or violin plots) and histograms are common choices for visually summarizing a quantitative variable. These plots are useful because they simultaneously communicate information about minimum and maximum values, central location, and spread. Histograms can additionally illuminate patterns that can impact an analysis (e.g., skew or multimodality).
-
Categorical Variables
For categorical variables, we can use a bar plot (instead of a histogram) to quickly visualize the frequency (or proportion) of values in each category.
Bivariate Analysis
In many cases, a data analyst is interested in the relationship between two variables in a dataset. For example:
- Do apartments in different boroughs tend to cost different amounts?
- What is the relationship between the area of an apartment and how much it costs?
-
One Quantitative Variable and One Categorical Variable
Two good options for investigating the relationship between a quantitative variable and a categorical variable are side-by-side box plots and overlapping histograms.
-
Two Quantitative Variables
A scatter plot is a great option for investigating the relationship between two quantitative variables.
-
Two Categorical Variables
Side by side (or stacked) bar plots are useful for visualizing the relationship between two categorical variables.
Multivariate Analysis
Sometimes, a data analyst is interested in simultaneously exploring the relationship between three or more variables in a single visualization. Many of the visualization methods presented up to this point can include additional variables by using visual cues such as colors, shapes, and patterns.
2) A Gentle Introduction to Exploratory Data Analysis
In this article, you will learn the basic steps of exploratory data analysis and walk through an example. This is helpful if you want to ensure that all aspects of the data are being correctly interpreted in an analysis.
3) A Simple Tutorial on Exploratory Data Analysis
→ Kaggle
In this article, you will learn the different methods of exploratory data analysis (EDA), key concepts for an EDA, as well as a thorough example through an EDA. This is helpful if you want many practical examples of visual and numeric methods to explore data.
Details
1) Line Graphs in Matplotlib
Matplotlib is a Python library used to create charts and graphs. In this first lesson, you will get an overview of the basic commands necessary to build and label a line graph.
- Intro
- Basic Line Plot
- Basic Line Plot II
- Linestyles
- Axis and Labels
- Labeling the Axes
- Subplots
- Subplots Part II
- Legends
- Modify Ticks
- Figures
- Review
Documentation
In this documentation, you will learn about plotting data using Python’s Matplotlib library. This is helpful if you would like to graphically visualize data in python.
2) Different Plot Types
- Simple Bar Chart
- Simple Bar Chart II
- Side-By-Side Bars
- Stacked Bars
- Error Bars
- Fill Between
- Pie Chart
- Pie Chart Labeling
- Histogram
- Multiple Histograms
3) How to Select a Meaningful Visualization
The three steps in the data visualization process are preparing, visualizing, and styling data. When faced with a blank canvas, the second step of the process, visualizing the data, can be overwhelming. To help, we’ve created a diagram to guide the selection of a chart based on what you want to explore in your data.
When planning out a visualization, you’ll usually have an idea of what questions you’ll want to explore. However, you may initially wonder exactly which chart to use. This moment is one of the most exciting parts of the process!
During your brainstorming phase, you should consider two things:
- The focusing question you want to answer with your chart
- The type of data that you want to visualize
Depending on the focusing questions you’re trying to answer, the type of chart you select should be different and intentional in its difference. In the diagram below, we have assigned Matplotlib visualizations to different categories. These categories explore common focusing questions and types of data you may want to display in a visualization.
Chart categories
-
Composition charts
- Focusing Question:
- What are the parts of some whole?
- What is the data made of?
- Datasets that work well: Data pertaining to probabilities, proportions, and percentages can be visualized with the graphs in this composition category. Charts in this category illustrate the different data components and their percentages as part of a whole.
-
Distribution Charts
- Datasets that work well: Data in large quantities and/or with an array of attributes works well for these types of charts. Visualizations in this category will allow you to see patterns, re-occurrences, and a clustering of data points.
- Note: In statistics, a commonly seen distribution is a bell curve, also known as a normal distribution. A bell curve is a bell-shaped distribution where most of the values in the dataset crowd around the average (also known as the mean), therefore causing the curve to form. If you want to see how values in the data are “distributed” across variables, the best way to do that would be with the visualizations in this category.
-
Relationship Charts
- Focusing Question:
- How do variables relate to each other?
- Datasets that work well: Data with two or more variables can be displayed in these charts. These charts typically illustrate a correlation between two or more variables. You can communicate this relationship by mapping multiple variables in the same chart. Correlation measures the strength of a relationship between variables.
-
Comparison Charts
- Focusing Question:
- How do variables compare to each other?
- Datasets that work well: Data must have multiple variables, and the visualizations in this category allow readers to compare those items against the others. For example, a line graph that has multiple lines, each belonging to a different variable. Multi-colored bar charts are also a great way to compare items in data.
4) Recreate Graphs using Matplotlib
- Bar Chart with Error
- Side By Side Bars
- Stacked Bars
- Two Histograms on a Plot
- Labeled Pie Chart
- Line with Shaded Error
Resource
The Data Visualization Catalogue
In this resource, you will learn about different data visualization methods including how different methods are related, how to interpret them, and the limitations of each method. This is helpful if you want to expertly visualize data and can be a powerful reference to use when trying to decide which visualization method matches what you want to show.
Details
1) Introduction to Seaborn
Seaborn is a Python data visualization library that provides simple code to create elegant visualizations for statistical exploration and insight. Seaborn is based on Matplotlib, but improves on Matplotlib in several ways:
- Seaborn provides a more visually appealing plotting style and concise syntax.
- Seaborn natively understands Pandas DataFrames, making it easier to plot data directly from CSVs.
- Seaborn can easily summarize Pandas DataFrames with many rows of data into aggregated charts.
If you’re unfamiliar with Pandas, just know that Pandas is a data analysis library for Python that provides easy-to-use data structures and allows you to organize and manipulate datasets so they can be visualized. To fully leverage the power of Seaborn, it is best to prepare your data using Pandas.
Seaborn:
- Intro
- Using Pandas For Seaborn
- Plotting Bars with Seaborn
- Understanding Aggregates
- Plotting Aggregates
- Modifying Error Bars
- Calculating Different Aggregates
- Aggregating by Multiple Columns
- Review
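A minimal sketch of the Pandas-plus-Seaborn workflow described above (the DataFrame is made up; `sns.barplot()` aggregates the mean per category and adds error bars):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"store": ["A", "A", "B", "B", "C", "C"],
                   "sales": [10, 14, 20, 18, 7, 9]})

sns.barplot(data=df, x="store", y="sales")
plt.show()
```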
2) Distribution using Seaborn
In this lesson, we will explore how to use Seaborn to graph multiple statistical distributions, including box plots and violin plots.
Seaborn is optimized to work with large datasets — from its ability to natively interact with Pandas DataFrames, to automatically calculating and plotting aggregates. One of the most powerful aspects of Seaborn is its ability to visualize and compare distributions. Distributions provide us with more information about our data — how spread out it is, its range, etc.
Calculating and graphing distributions is integral to analyzing massive amounts of data. We’ll look at how Seaborn allows us to move beyond the traditional distribution graphs to plots that enable us to communicate important statistical information.
Distribution:
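A minimal sketch of the two distribution plots covered in this lesson, using simulated song lengths:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "genre": np.repeat(["rock", "jazz"], 500),
    "length": np.concatenate([np.random.normal(240, 30, 500),
                              np.random.normal(300, 45, 500)]),
})

sns.boxplot(data=df, x="genre", y="length")     # five-number summary per genre
plt.show()
sns.violinplot(data=df, x="genre", y="length")  # adds the shape of each distribution
plt.show()
```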
Documentation
In this documentation, you will learn the many elements of the Seaborn API. This is helpful if you wish to use Seaborn to visualize data in plots and grids.
Details
Details
Look at the technologies around us:
- Spellcheck and autocorrect
- Auto-generated video captions
- Virtual assistants like Amazon’s Alexa
- Autocomplete
- Your news site’s suggested articles
What do they have in common?
All of these handy technologies exist because of natural language processing! Also known as NLP, the field is at the intersection of linguistics, artificial intelligence, and computer science. The goal? Enabling computers to interpret, analyze, and approximate the generation of human languages (like English or Spanish).
- Text Preprocessing
- Parsing Text
- Language Models: Bag-of-Words
- Language Models: N-Gram and NLM
- Topic Models
- Text Similarity
- Language Prediction & Text Generation
- Advanced NLP Topics
- Challenges and Considerations
- Review
Concepts!
- Natural language processing combines computer science, linguistics, and artificial intelligence to enable computers to process human languages.
- NLTK is a Python library used for NLP.
- Text preprocessing is a stage of NLP focused on cleaning and preparing text for other NLP tasks.
- Parsing is an NLP technique concerned with breaking up text based on syntax.
- Language models are probabilistic machine models of language used for NLP comprehension tasks. Common models include bag-of-words, n-gram models, and neural language modeling.
- Topic modeling is the NLP process by which hidden topics are identified given a body of text.
- Text similarity is a facet of NLP concerned with semblance between instances of language.
- Language prediction is an application of NLP concerned with predicting language given preceding language.
Details
Text preprocessing is an approach for cleaning and preparing text data for use in a specific context. Developers use it in almost all natural language processing (NLP) pipelines, including voice recognition software, search engine lookup, and machine learning model training. It is an essential step because text data can vary widely: from its format (website, text message, voice recognition) to the people who create the text (language, dialect), there are plenty of things that can introduce noise into your data.
- Noise Removal
- Tokenization
- Normalization
- Stopword Removal
- Stemming
- Lemmatization
- Part-of-Speech Tagging
- Review
Concepts!
- Text preprocessing is all about cleaning and prepping text data so that it’s ready for other NLP tasks.
- Noise removal is a text preprocessing step concerned with removing unnecessary formatting from our text.
- Tokenization is a text preprocessing step devoted to breaking up text into smaller units (usually words or discrete terms).
- Normalization is the name we give most other text preprocessing tasks, including stemming, lemmatization, upper and lowercasing, and stopword removal.
- Stemming is the normalization preprocessing task focused on removing word affixes.
- Lemmatization is the normalization preprocessing task that more carefully brings words down to their root forms.
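A minimal NLTK sketch tying several of these steps together (the sentence is made up, and the exact resource names passed to `nltk.download()` can vary between NLTK versions):

```python
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt")      # one-time downloads of the resources used below
nltk.download("stopwords")
nltk.download("wordnet")

text = "<p>The 5 fantastic fish were swimming farther away!</p>"

cleaned = re.sub(r"<.+?>|\d", "", text).lower()                        # noise removal
tokens = word_tokenize(cleaned)                                        # tokenization
no_stops = [t for t in tokens if t not in stopwords.words("english")]  # stopword removal
print([PorterStemmer().stem(t) for t in no_stops])                     # stemming
print([WordNetLemmatizer().lemmatize(t) for t in no_stops])            # lemmatization
```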
Details
The bag-of-words language model is a simple-yet-powerful tool to have up your sleeve when working on natural language processing (NLP). The model has many, many use cases including:
- determining topics in a song
- filtering spam from your inbox
- finding out if a tweet has positive or negative sentiment
- creating word clouds
Bag-of-words (BoW) is a statistical language model based on word count. Say what?
Let’s start with that first part: a statistical language model is a way for computers to make sense of language based on probability. For example, let’s say we have the text:
"Five fantastic fish flew off to find faraway functions. Maybe find another five fantastic fish?"
A statistical language model focused on the starting letter for words might take this text and predict that words are most likely to start with the letter “f” because 11 out of 15 words begin that way. A different statistical model that pays attention to word order might tell us that the word “fish” tends to follow the word “fantastic.”
- BoW Dictionaries
- Building a Features Dictionary
- Building a BoW Vector
- It's All in the Bag
- Spam A Lot No More
- BoW Wow
- BoW Ow
- Review
Documentation
Working with Text Data | scikit-learn
In this documentation, you will learn about the scikit-learn tools for generating language models like bag-of-words using text data. This is helpful if you would like to analyze a collection of text documents.
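A minimal scikit-learn sketch of a bag-of-words model built with `CountVectorizer` (the two documents are made up; `get_feature_names_out()` assumes scikit-learn 1.0+):

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["Five fantastic fish flew off to find faraway functions",
             "Maybe find another five fantastic fish"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(documents)    # sparse matrix of word counts per document

print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(bow.toarray())                         # one row of counts per document
```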
Details
You might not even realize it, but you are building a language model in your head similar to term frequency-inverse document frequency, commonly known as tf-idf. Tf-idf is another powerful tool in your NLP toolkit that has a variety of use cases, including:
- ranking results in a search engine
- text summarization
- building smarter chatbots
Term frequency-inverse document frequency is a numerical statistic used to indicate how important a word is to each document in a collection of documents, or a corpus.
Tf-idf relies on two different metrics in order to come up with an overall score:
- Term frequency, or how often a word appears in a document. This is the same as bag-of-words’ word count.
- Inverse document frequency, which is a measure of how often a word appears in the overall corpus. By penalizing the score of words that appear throughout a corpus, tf-idf can give better insight into how important a word is to a particular document of a corpus.
- What is Tf-idf?
- Breaking It Down Part I: Term Frequency
- Breaking It Down Part II: Inverse Document Frequency
- Putting It All Together: Tf-idf
- Converting Bag-of-Words to Tf-idf
- Review
Documentation
Working with Text Data | scikit-learn | From Occurrences to Frequencies
In this documentation, you will learn how to use scikit-learn to conduct tf-idf. This is helpful if you are trying to determine topics or themes within text data.
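A minimal scikit-learn sketch using `TfidfVectorizer` on an invented corpus (a `TfidfTransformer` can likewise convert an existing bag-of-words matrix):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the quick brown fox",
          "the lazy dog",
          "the quick dog jumps over the lazy fox"]

vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(corpus)   # tf-idf score for every word in every document

print(vectorizer.get_feature_names_out())
print(scores.toarray().round(2))            # common words like "the" score relatively low
```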
Details
A word embedding is a representation of a word as a numeric vector, enabling us to compare and contrast how words are used and identify words that occur in similar contexts.
The applications of word embeddings include:
- entity recognition in chatbots
- sentiment analysis
- syntax parsing
Word Embeddings:
- Intro
- Vectors
- What is a Word Embedding?
- Distance
- Word Embeddings Are All About Distance
- Word2vec
- Gensim
Concepts!
- Vectors are containers of information, and they can have anywhere from 1-dimension to hundreds or thousands of dimensions
- Word embeddings are vector representations of a word, where words with similar contexts are represented with vectors that are closer together
- spaCy is a package that enables us to view and use pre-trained word embedding models
- The distance between vectors can be calculated in many ways, and the best way for measuring the distance between higher dimensional vectors is cosine distance
- Word2Vec is a shallow neural network model that can build word embeddings using either continuous bag-of-words or continuous skip-grams
- Gensim is a package that allows us to create and train word embedding models using any corpus of text
Documentation
spaCy Documentation - Token.similarity
In this documentation, you will learn about the spaCy library for Natural Language Processing in Python. This is helpful if you would like to process and derive insights from unstructured textual data.
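A minimal spaCy sketch of comparing embeddings with `Token.similarity()`. It assumes a model that ships with word vectors (here `en_core_web_md`, installed separately with `python -m spacy download en_core_web_md`); the sentence is made up:

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("The cat sat quietly while the kitten played outside")

cat, quietly, kitten = doc[1], doc[3], doc[6]
print(cat.vector.shape)          # the word embedding behind each token
print(cat.similarity(kitten))    # relatively high: similar contexts
print(cat.similarity(quietly))   # lower: less related
```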
Details
Details
While at IBM, Arthur Samuel developed a program that learned how to play checkers (1959). He coined the term machine learning, describing it as:
"The field of study that gives computers the ability to learn without being explicitly programmed."
What does this mean?
As programmers, we often approach problems in a methodical, logic-based way. We try to determine what our desired outputs should be, and then create the proper rules that will transform our inputs into those outputs.
Machine learning flips the script. We want the program itself to learn the rules that describe our data the best, by finding patterns in what we know and applying those patterns to what we don’t know.
These algorithms are able to learn. Their performance gets better and better with each iteration, as it uncovers more hidden trends in the data.
Machine learning can be branched out into the following categories:
-
Supervised Learning
Supervised Learning is where the data is labeled and the program learns to predict the output from the input data. For instance, a supervised learning algorithm for credit card fraud detection would take as input a set of recorded transactions. For each transaction, the program would predict if it is fraudulent or not.
Supervised learning problems can be further grouped into regression and classification problems.
Regression:
In regression problems, we are trying to predict a continuous-valued output. Examples are:
- What is the housing price in New York?
- What is the value of cryptocurrencies?
Classification:
In classification problems, we are trying to predict a discrete number of values. Examples are:
- Is this a picture of a human or a picture of an AI?
- Is this email spam?
-
Unsupervised Learning
Unsupervised Learning is a type of machine learning where the program learns the inherent structure of the data based on unlabeled examples.
Clustering is a common unsupervised machine learning approach that finds patterns and structures in unlabeled data by grouping them into clusters.
Some examples:
- Social networks clustering topics in their news feed
- Consumer sites clustering users for recommendations
- Search engines to group similar objects in one cluster
Supervised Learning
Data is labeled and the program learns to predict the output from the input data
Unsupervised Learning
Data is unlabeled and the program learns to recognize the inherent structure in the input data
Scikit-learn is a library in Python that provides many unsupervised and supervised learning algorithms. It’s built upon some of the technology you might already be familiar with, like NumPy, pandas, and Matplotlib!
As you build robust Machine Learning programs, it’s helpful to have all the sklearn commands in one place in case you forget.
-
Import and create the model:
from sklearn.linear_model import LinearRegression
your_model = LinearRegression()
Fit:
your_model.fit(x_training_data, y_training_data)
- `.coef_`: contains the coefficients
- `.intercept_`: contains the intercept
Predict:
predictions = your_model.predict(your_x_data)
- `.score()`: returns the coefficient of determination R²
-
Import and create the model:
from sklearn.naive_bayes import MultinomialNB
your_model = MultinomialNB()
Fit:
your_model.fit(x_training_data, y_training_data)
Predict:
# Returns a list of predicted classes - one prediction for every data point
predictions = your_model.predict(your_x_data)
# For every data point, returns a list of probabilities of each class
probabilities = your_model.predict_proba(your_x_data)
-
Import and create the model:
from sklearn.neighbors import KNeighborsClassifier
your_model = KNeighborsClassifier()
Fit:
your_model.fit(x_training_data, y_training_data)
Predict:
# Returns a list of predicted classes - one prediction for every data point
predictions = your_model.predict(your_x_data)
# For every data point, returns a list of probabilities of each class
probabilities = your_model.predict_proba(your_x_data)
-
Import and create the model:
from sklearn.cluster import KMeans
your_model = KMeans(n_clusters=4, init='random')
- `n_clusters`: number of clusters to form and number of centroids to generate
- `init`: method for initialization
  - `k-means++`: K-Means++ [default]
  - `random`: K-Means
- `random_state`: the seed used by the random number generator [optional]
Fit:
your_model.fit(x_training_data)
Predict:
predictions = your_model.predict(your_x_data)
-
Import and print accuracy, recall, precision, and F1 score:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
print(accuracy_score(true_labels, guesses))
print(recall_score(true_labels, guesses))
print(precision_score(true_labels, guesses))
print(f1_score(true_labels, guesses))
Import and print the confusion matrix:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(true_labels, guesses))
-
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2)
- `train_size`: the proportion of the dataset to include in the train split
- `test_size`: the proportion of the dataset to include in the test split
- `random_state`: the seed used by the random number generator [optional]
Details
Supervised learning algorithms use labeled data as input while unsupervised learning algorithms use unlabeled data. However, we can further distinguish machine learning algorithms by the output they produce. In terms of output, two main types of machine learning models exist: those for regression and those for classification.
-
Regression
Regression is used to predict outputs that are continuous. The outputs are quantities that can be flexibly determined based on the inputs of the model rather than being confined to a set of possible labels.
For example:
- Predict the height of a potted plant from the amount of rainfall
- Predict salary based on someone’s age and availability of high-speed internet
- Predict a car’s MPG (miles per gallon) based on size and model year
-
Classification
Classification is used to predict a discrete label. The outputs fall under a finite set of possible outcomes. Many situations have only two possible outcomes. This is called binary classification (True/False, 0 or 1, Hotdog / not Hotdog).
For example:
- Predict whether an email is spam or not
- Predict whether it will rain or not
- Predict whether a user is a power user or a casual user
There are also two other common types of classification: multi-class classification and multi-label classification.
Multi-class classification has the same idea behind binary classification, except instead of two possible outcomes, there are three or more.
For example:
- Predict whether a photo contains a pear, apple, or peach
- Predict what letter of the alphabet a handwritten character is
- Predict whether a piece of fruit is small, medium, or large
In this lesson, you will learn three different ways to define the distance between two points:
-
Euclidean Distance
Euclidean Distance is the most commonly used distance formula. To find the Euclidean distance between two points, we first calculate the squared distance between each dimension. If we add up all of these squared differences and take the square root, we’ve computed the Euclidean distance.
-
Manhattan Distance
Manhattan Distance is extremely similar to Euclidean distance. Rather than summing the squared difference between each dimension, we instead sum the absolute value of the difference between each dimension. It’s called Manhattan distance because it’s similar to how you might navigate when walking city blocks. If you’ve ever wondered “how many blocks will it take me to get from point A to point B”, you’ve computed the Manhattan distance.
-
Hamming Distance
Hamming Distance is another slightly different variation on the distance formula. Instead of finding the difference of each dimension, Hamming distance only cares about whether the dimensions are exactly equal. When finding the Hamming distance between two points, add one for every dimension that has different values.
-
SciPy Distances
Now that you’ve written these three distance formulas yourself, let’s look at how to compute them using Python’s SciPy library:
- Euclidean Distance: `.euclidean()`
- Manhattan Distance: `.cityblock()`
- Hamming Distance: `.hamming()`
Distance:
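A minimal sketch of those three SciPy helpers on a pair of made-up points (note that SciPy's `.hamming()` returns the proportion of differing dimensions rather than a raw count):

```python
from scipy.spatial import distance

pt_a = [1, 2, 3]
pt_b = [4, 6, 3]

print(distance.euclidean(pt_a, pt_b))   # sqrt(3**2 + 4**2 + 0**2) = 5.0
print(distance.cityblock(pt_a, pt_b))   # Manhattan distance: 3 + 4 + 0 = 7
print(distance.hamming(pt_a, pt_b))     # 2 of 3 dimensions differ -> 0.666...
```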
The purpose of machine learning is often to create a model that explains some real-world data, so that we can predict what may happen next, with different inputs.
The simplest model that we can fit to data is a line. When we are trying to find a line that fits a set of data best, we are performing Linear Regression.
We often want to find lines to fit data, so that we can predict unknowns. For example:
- The market price of a house vs. the square footage of a house. Can we predict how much a house will sell for, given its size?
- The tax rate of a country vs. its GDP. Can we predict taxation based on a country’s GDP?
- The amount of chips left in the bag vs. number of chips taken. Can we predict how much longer this bag of chips will last, given how much people at this party have been eating?
Linear Regression:
- Introduction to Linear Regression
- Points and Lines
- Loss
- Gradient Descent for Intercept
- Gradient Descent for Slope
- Put it Together
- Convergence
- Learning Rate
- Put it Together II
- Use Your Functions on Real Data
- Scikit-Learn
- Review
Documentation
Scikit-learn's Linear Regression Documentation
In this documentation, you will learn about Sklearn’s linear regression function. This is helpful if you would like to perform data prediction based on the relationship between two numeric variables.
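A minimal scikit-learn sketch of fitting a line and predicting an unknown (the months/revenue numbers are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
revenue = np.array([100, 125, 155, 180, 210, 230, 260, 290])

model = LinearRegression()
model.fit(months, revenue)

print(model.coef_, model.intercept_)   # learned slope and intercept
print(model.predict([[12]]))           # prediction for a new input
```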
Details
Linear regression is useful when we want to predict the values of a variable from its relationship with other variables. There are two different types of linear regression models (simple linear regression and multiple linear regression).
In predicting the price of a home, one factor to consider is the size of the home. The relationship between those two variables, price and size, is important, but there are other variables that factor in to pricing a home: location, air quality, demographics, parking, and more. When making predictions for price, our dependent variable, we’ll want to use multiple independent variables. To do this, we’ll use Multiple Linear Regression.
Multiple Linear Regression uses two or more independent variables to predict the values of the dependent variable.
- Intro
- StreetEasy Dataset
- Training Set vs. Test Set
- Multiple Linear Regression: Scikit-Learn
- Visualizing Results with Matplotlib
- Multiple Linear Regression Equation
- Correlations
- Evaluating the Model's Accuracy
- Rebuild the Model
Documentation
Sklearn's Train_test_split documentation
In this documentation, you will learn about the Sklearn’s train_test_split() function. This is helpful if you would like to split a dataset into training and testing datasets.
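A minimal sketch combining `train_test_split()` with a multiple linear regression (the column names and values below are invented, not the actual StreetEasy data):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "size_sqft": [500, 650, 800, 950, 1100, 1300, 1500, 1700],
    "building_age_yrs": [30, 12, 45, 5, 20, 8, 15, 3],
    "rent": [2000, 2600, 2700, 3600, 3900, 4800, 5200, 6100],
})

X = df[["size_sqft", "building_age_yrs"]]
y = df["rent"]
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)

mlr = LinearRegression().fit(x_train, y_train)
print(mlr.score(x_test, y_test))   # R² on the held-out test set
```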
Details
K-Nearest Neighbors (KNN) is a classification algorithm. The central idea is that data points with similar attributes tend to fall into similar categories.
- Introduction
- Distance Between Points - 2D
- Distance Between Points - 3D
- Data with Different Scales: Normalization
- Finding the Nearest Neighbors
- Count Neighbors
- Classify Your Favorite Movie
- Training and Validation Sets
- Choosing K
- Using sklearn
The K-Nearest Neighbors algorithm is a powerful supervised machine learning algorithm typically used for classification. However, it can also perform regression.
Documentation
Sklearn KNeighborsClassifier documentation
In this documentation, you will learn about the KNeighborsClassifier in Scikit-Learn. This is helpful if you would like to classify data into discrete classes using a k-nearest neighbors algorithm.
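A minimal scikit-learn sketch of the classifier (the movie features and labels are invented, and the features would normally be normalized first):

```python
from sklearn.neighbors import KNeighborsClassifier

# [duration_minutes, budget_millions] -> liked (1) or not (0)
features = [[100, 10], [150, 200], [90, 5], [140, 150], [95, 8], [160, 250]]
labels = [0, 1, 0, 1, 0, 1]

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(features, labels)

print(classifier.predict([[120, 120]]))   # majority vote of the 3 nearest neighbors
```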
Details
Normalizing your data is an essential part of machine learning. You might have an amazing dataset with many great features, but if you forget to normalize, one of those features might completely dominate the others. It’s like you’re throwing away almost all of your information! Normalizing solves this problem. In this article, you learned the following techniques to normalize:
- Min-max normalization: Guarantees all features will have the exact same scale but does not handle outliers well.
- Z-score normalization: Handles outliers, but does not produce normalized data with the exact same scale.
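A minimal NumPy sketch of both techniques on a made-up column (scikit-learn's `MinMaxScaler` and `StandardScaler` do the same job on whole feature matrices):

```python
import numpy as np

ages = np.array([15, 22, 35, 48, 60, 73])

min_max = (ages - ages.min()) / (ages.max() - ages.min())   # squeezed into [0, 1]
z_scores = (ages - ages.mean()) / ages.std()                # centered on 0, scaled by std

print(min_max)
print(z_scores)
```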
Sometimes your dataset is so small that splitting it 80/20 will still result in a large amount of variance. One solution to this is to perform N-Fold Cross-Validation. The central idea here is that we’re going to do this entire process N times and average the accuracy.
For example, in 10-fold cross-validation, we’ll make the validation set the first 10% of the data and calculate accuracy, precision, recall and F1 score. We’ll then make the validation set the second 10% of the data and calculate these statistics again. We can do this process 10 times, and every time the validation set will be a different chunk of the data. If we then average all of the accuracies, we will have a better sense of how our model does on average.
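A minimal scikit-learn sketch of 10-fold cross-validation, using the built-in iris dataset as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=10)
print(scores)          # one accuracy score per fold
print(scores.mean())   # the averaged estimate of model performance
```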
Insights:
- Classifying a single point can result in a true positive (truth = 1, guess = 1), a true negative (truth = 0, guess = 0), a false positive (truth = 0, guess = 1), or a false negative (truth = 1, guess = 0).
- Accuracy measures how many classifications your algorithm got correct out of every classification it made.
- Recall measures the percentage of the relevant items your classifier was able to successfully find.
- Precision measures the percentage of items your classifier found that were actually relevant. Precision and recall are tied to each other. As one goes up, the other will go down.
- F1 score is a combination of precision and recall.
- F1 score will be low if either precision or recall is low.
The decision to use precision, recall, or F1 score ultimately comes down to the context of your classification. Maybe you don’t care if your classifier has a lot of false positives. If that’s the case, precision doesn’t matter as much.
Documentation
In this documentation, you will learn about the regularization technique in Machine Learning. This is helpful to improve accuracy in machine learning models by reducing overfitting.
Details
Logistic Regression is a supervised machine learning algorithm that uses regression to predict the continuous probability, ranging from 0 to 1, of a data sample belonging to a specific category, or class. Then, based on that probability, the sample is classified as belonging to the more probable class, ultimately making Logistic Regression a classification algorithm.
- Introduction
- Linear Regression Approach
- Logistic Regression
- Log-Odds
- Sigmoid Function
- Log Loss
- Classification Thresholding
- Scikit-Learn
- Feature Importance
Review
- Logistic Regression is used to perform binary classification, predicting whether a data sample belongs to a positive (present) class, labeled 1, or the negative (absent) class, labeled 0.
- The Sigmoid Function bounds the product of feature values and their coefficients, known as the log-odds, to the range [0,1], providing the probability of a sample belonging to the positive class.
- A loss function measures how well a machine learning model makes predictions. The loss function of Logistic Regression is log-loss.
- A Classification Threshold is used to determine the probabilistic cutoff for where a data sample is classified as belonging to a positive or negative class. The standard cutoff for Logistic Regression is 0.5, but the threshold can be higher or lower depending on the nature of the data and the situation.
- Scikit-learn has a Logistic Regression implementation that allows you to fit a model to your data, find the feature coefficients, and make predictions on new data samples.
- The coefficients determined by a Logistic Regression model can be used to interpret the relative importance of each feature in predicting the class of a data sample.
Documentation
Sklearn documentation | Linear_model.LogisticRegression
In this documentation, you will learn about the sklearn class “Linear_model.LogisticRegression”. This is helpful if you would like to build a machine learning model to predict the probability of an event occurring based on data.
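A minimal scikit-learn sketch (the hours-studied data is invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

print(model.coef_, model.intercept_)   # log-odds coefficients
print(model.predict_proba([[2.0]]))    # probability of each class
print(model.predict([[2.0]]))          # class using the default 0.5 threshold
```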
Bayes’ Theorem is the basis of a branch of statistics called Bayesian Statistics, where we take prior knowledge into account before calculating new probabilities.
A Naive Bayes classifier is a supervised machine learning algorithm that leverages Bayes’ Theorem to make predictions and classifications.
- Investigate the Data
- Bayes Theorem I
- Bayes Theorem II
- Smoothing
- Classify
- Formatting the Data for scikit-learn
- Using scikit-learn
- Review
Documentation
Sklearn Documentation | feature_extraction.text.CountVectorizer
In this documentation, you will learn about the CountVectorizer class of the Sklearn.feature_extraction module. This is helpful if you would like to analyze text documents or strings using machine learning algorithms.
Documentation
Scikit-learn documentation for Multinomial Naive Bayes
In this documentation, you will learn Scikit-Learn’s methods for building a multinomial naive Bayes model. This is helpful if you wish to build a naive Bayes’ classifier for discrete data.
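A minimal sketch combining `CountVectorizer` with `MultinomialNB` on an invented spam example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["win money now", "limited offer win prize",
               "see you at lunch", "meeting moved to noon"]
train_labels = [1, 1, 0, 0]   # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_texts)

classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

test_counts = vectorizer.transform(["win a free prize"])
print(classifier.predict(test_counts))
print(classifier.predict_proba(test_counts))
```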
Details
A Support Vector Machine (SVM) is a powerful supervised machine learning model used for classification. An SVM makes classifications by defining a decision boundary and then seeing what side of the boundary an unclassified point falls on. In the next few exercises, we’ll learn how these decision boundaries get defined, but for now, know that they’re defined by using a training set of classified points. That’s why SVMs are supervised machine learning models.
- Support Vector Machines
- Optimal Decision Boundaries
- Support Vectors and Margins
- scikit-learn
- Outliers
- Kernels
- Polynomial Kernel
- Radial Basis Function Kernel
Concepts:
- SVMs are supervised machine learning models used for classification.
- An SVM uses support vectors to define a decision boundary. Classifications are made by comparing unlabeled points to that decision boundary.
- Support vectors are the points of each class closest to the decision boundary. The distance between the support vectors and the decision boundary is called the margin.
- SVMs attempt to create the largest margin possible while staying within an acceptable amount of error.
- The `C` parameter controls how much error is allowed. A large `C` allows for little error and creates a hard margin. A small `C` allows for more error and creates a soft margin.
- SVMs use kernels to classify points that aren’t linearly separable.
- Kernels transform points into higher dimensional space. A polynomial kernel transforms points into three dimensions while an rbf kernel transforms points into infinite dimensions.
- An rbf kernel has a `gamma` parameter. If `gamma` is large, the training data is more relevant, and as a result overfitting can occur.
Documentation
Scikit-Learn Documentation on SVC
In this documentation, you will learn how to create a Support Vector Machine classifier using the Scikit learn library. This is helpful if you would like to experiment with different mathematical models for classifying data.
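A minimal scikit-learn sketch of `SVC` on two made-up clusters of points:

```python
from sklearn.svm import SVC

points = [[1, 1], [1, 2], [2, 1], [7, 7], [8, 8], [7, 9]]
labels = [0, 0, 0, 1, 1, 1]

classifier = SVC(kernel="rbf", C=1.0, gamma="scale")
classifier.fit(points, labels)

print(classifier.predict([[2, 2], [8, 7]]))
print(classifier.support_vectors_)   # the points that define the decision boundary
```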
Decision trees are machine learning models that try to find patterns in the features of data points. Take a look at the tree on this page. This tree tries to predict whether a student will get an A on their next test.
- Cars
- Gini Impurity
- Information Gain
- Weighted Information Gain
- Recursive Tree Building
- Classifying New Data
- Decision Trees in scikit-learn
- Decision Tree Limitations
Review:
- Good decision trees have pure leaves. A leaf is pure if all of the data points in that leaf have the same label.
- Decision trees are created using a greedy algorithm that prioritizes finding the feature that results in the largest information gain when splitting the data using that feature.
- Creating an optimal decision tree is difficult. The greedy algorithm doesn’t always find the globally optimal tree.
- Decision trees often suffer from overfitting. Making the tree small by pruning helps to generalize the tree so it is more accurate on data in the real world.
Documentation
In this documentation, you will learn how to use a decision tree to classify data with Scikit-Learn. This is helpful if you would like to build a decision tree classifier in Python.
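A minimal scikit-learn sketch, using the built-in iris dataset and a depth limit as a simple stand-in for pruning:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

tree = DecisionTreeClassifier(max_depth=3)   # limiting depth helps avoid overfitting
tree.fit(x_train, y_train)

print(tree.score(x_test, y_test))            # accuracy on unseen data
```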
A random forest is an ensemble machine learning technique — a random forest contains many decision trees that all work together to classify new points. When a random forest is asked to classify a new point, the random forest gives that point to each of the decision trees. Each of those trees reports their classification and the random forest returns the most popular classification. It’s like every tree gets a vote, and the most popular classification wins.
Review:
- A random forest is an ensemble machine learning model. It makes a classification by aggregating the classifications of many decision trees.
- Random forests are used to avoid overfitting. By aggregating the classification of multiple trees, having overfitted trees in a random forest is less impactful.
- Every decision tree in a random forest is created by using a different subset of data points from the training set. Those data points are chosen at random with replacement, which means a single data point can be chosen more than once. This process is known as bagging.
- When creating a tree in a random forest, a randomly selected subset of features are considered as candidates for the best splitting feature. If your dataset has `n` features, it is common practice to randomly select the square root of `n` features.
Documentation
In this documentation, you will learn about the parameters, attributes, and methods of the Random Forest Classifier within Scikit-Learn. This is helpful if you would like to use multiple decision trees to classify and analyze data.
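A minimal scikit-learn sketch of a 100-tree forest on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

forest = RandomForestClassifier(n_estimators=100)   # 100 bagged decision trees vote together
forest.fit(x_train, y_train)

print(forest.score(x_test, y_test))
print(forest.feature_importances_)   # how much each feature contributed to the splits
```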
Details
Clustering
Clustering is the most well-known unsupervised learning technique. It finds structure in unlabeled data by identifying similar groups, or clusters. Examples of clustering applications are:
- Recommendation engines: group products to personalize the user experience
- Search engines: group news topics and search results
- Market segmentation: group customers based on geography, demography, and behaviors
- Image segmentation: medical imaging or road scene segmentation on self-driving cars
K-Means Clustering
K-Means is the most popular and well-known clustering algorithm, and its name captures the two ideas behind it:
- The “K” refers to the number of clusters (groups) we expect to find in a dataset.
- The “Means” refers to the average distance of data to each cluster center, also known as the centroid, which we are trying to minimize.
It is an iterative approach:
- Place `k` random centroids for the initial clusters.
- Assign data samples to the nearest centroid.
- Update centroids based on the above-assigned data samples.
Repeat Steps 2 and 3 until convergence (when points don’t move between clusters and centroids stabilize). Once we are happy with our clusters, we can take a new unlabeled datapoint and quickly assign it to the appropriate cluster.
- Iris Dataset
- Visualize Before K-Means
- Implementing K-Means: Step 1
- Implementing K-Means: Step 2
- Implementing K-Means: Step 3
- Implementing K-Means: Step 4
- Implementing K-Means: Scikit-Learn
- New Data?
- Visualize After K-Means
- Evaluation
- The Number of Clusters
K-Means++ Clustering
To recap, Step 1 of the K-Means algorithm is “Place `k` random centroids for the initial clusters”.
The K-Means++ algorithm replaces Step 1 of the K-Means algorithm and adds the following:
- The first cluster centroid is randomly picked from the data points.
- For each remaining data point, the distance from the point to its nearest cluster centroid is calculated.
- The next cluster centroid is picked according to a probability proportional to the distance of each point to its nearest cluster centroid. This makes it likely for the next cluster centroid to be far away from the already initialized centroids.
Repeat steps 2 and 3 until `k` centroids are chosen.
K-Means++ improves K-Means by placing initial centroids more strategically. As a result, it can result in more optimal clusterings than K-Means.
It can also outperform K-Means in speed. If you get very unlucky initial centroids using K-Means, the algorithm can take a long time to converge. K-Means++ will often converge quicker!
Documentation
→ Scikit-Learn Documentation on Kmeans Clustering
In this documentation, you will learn scikit-Learn’s methods for k-means clustering. This is helpful if you wish to partition observations into clusters using the nearest mean.
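A minimal scikit-learn sketch on the iris measurements used in the lesson (`n_init` is passed explicitly since its default has changed across scikit-learn versions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

samples = load_iris().data

model = KMeans(n_clusters=3, init="k-means++", n_init=10)
model.fit(samples)

print(model.labels_[:10])           # cluster assigned to each sample
print(model.cluster_centers_)       # the final centroids
print(model.predict(samples[:2]))   # assign new points to the nearest centroid
```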
What is Hierarchical Clustering
→ What is Hierarchical Clustering
In this article, you will learn about hierarchical cluster analysis. This is helpful if you want to divide the data into clusters, or groups of similar items.
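A minimal SciPy sketch of agglomerative (bottom-up) hierarchical clustering, again using the iris measurements as a stand-in dataset:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import load_iris

samples = load_iris().data

merged = linkage(samples, method="ward")               # iteratively merge the closest clusters
labels = fcluster(merged, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels[:10])

dendrogram(merged)                                     # visualize the merge hierarchy
plt.show()
```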
Details
What Is Deep Learning?
First, let’s focus on learning. If you have come across machine learning before, you might be familiar with the concept of a learning model. Learning describes the process by which models analyze data and find patterns. Our machine learning algorithm learns from these patterns to find the best representation of this data, which it then uses to make predictions about new data that it has never seen before.
Deep learning is a subfield of machine learning, and the concept of learning is pretty much the same.
- We create our model carefully
- Throw relevant data at it
- Train it on this data
- Have it make predictions for data it has never seen
Deep learning models are used with many different types of data, such as text, images, audio, and more, making them applicable to many different domains.
The deep part of deep learning refers to the numerous “layers” that transform data. This architecture mimics the structure of the brain, where each successive layer attempts to learn progressively complex patterns from the data fed into the model. This may seem a bit abstract, so let’s look at a concrete example, such as facial recognition. With facial recognition, a deep learning model takes in a photo as an input, and numerous layers perform specific steps to identify whose face is in the picture. The steps taken by each layer might be the following:
- Find the face within the image using edge detection.
- Analyze the facial features (eyes, nose, mouth, etc.).
- Compare against faces within a repository.
- Output a prediction!
Notice that without large amounts of data, deep learning models are no more powerful (and maybe even less accurate) than less complex learning models. However, with large amounts of data, deep learning models can improve performance to the point that they outperform humans in tasks such as classifying objects or faces in images or driving. Deep learning is fundamentally a feature learning system (also known as representation learning). It learns from raw data without human intervention. Hence, given massive amounts of data, a deep learning system will perform better than traditional machine learning systems that rely on feature extractions from developers.
What are Neural Networks?
A neural network is a programming model inspired by the human brain. Let’s explore how it came into existence.
Computers have been designed to excel at number-crunching tasks, something that most humans find terrifying. On the other hand, humans are naturally wired to effortlessly recognize objects and patterns, something that computers find difficult.
This juxtaposition brought up two important questions in the 1950s:
- “How can computers be better at solving problems that humans find effortless?”
- “How can computers solve such problems in the way a human brain does?”
In 1957, Frank Rosenblatt explored the second question and invented the Perceptron algorithm that allowed an artificial neuron to simulate a biological neuron! The artificial neuron could take in an input, process it based on some rules, and fire a result. But computers had been doing this for years — what was so remarkable?
The Perceptron Algorithm used multiple artificial neurons, or perceptrons, for image recognition tasks and opened up a whole new way to solve computational problems. However, as it turns out, this wasn’t enough to solve a wide range of problems, and interest in the Perceptron Algorithm along with Neural Networks waned for many years.
But many years later, the neurons fired back.
It was found that creating multiple layers of neurons, with one layer feeding its output to the next layer as input, could process a wide range of inputs, make complex decisions, and still produce meaningful results. With some tweaks, the algorithm became known as the Multilayer Perceptron, which led to the rise of Feedforward Neural Networks.
Deep Learning Math
-
Scalars, Vectors, and Matrices
To start, let us go over a couple of topics that will be integral to understanding the mathematical operations that are present in deep learning, including how data is represented:
-
Scalars:
A scalar is a single quantity that you can think of as a number. In machine learning models, we can use scalar quantities to manipulate data, and we often modify them to improve our model’s accuracy. We can also represent data as scalar values depending on what dataset we are working with. Code example:
x = 5
-
Vectors:
Vectors are arrays of numbers. In Python, we often denote vectors as NumPy arrays. Each value in the array can be identified by its index (location within the array). Code example:
x = np.array([1,2,3])
-
Matrices:
Matrices are grids of information with rows and columns. We can index a matrix just like an array; however, when indexing on a matrix, we need two arguments: one for the row and one for the column. Code example:
x = np.array([[1,2,3],[4,5,6],[7,8,9]])
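Putting these three together, here is a minimal sketch that assumes NumPy is imported as np (the snippets above leave the import implicit):

```python
import numpy as np

# Scalar: a single quantity
x_scalar = 5

# Vector: a 1-D array; each value is identified by a single index
x_vector = np.array([1, 2, 3])
print(x_vector[0])     # 1

# Matrix: a 2-D grid; indexing needs a row argument and a column argument
x_matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(x_matrix[1, 2])  # 6 (row 1, column 2)
```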
-
-
Tensors
Scalars, vectors, and matrices are foundational objects in linear algebra. Understanding the different ways they interact with each other and can be manipulated through matrix algebra is integral before diving into deep learning. This is because the data structure we use in deep learning is called a tensor, which is a generalized form of a vector and matrix: a multidimensional array.
A tensor allows for more flexibility with the type of data you are using and how you can manipulate that data.
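As a small illustration (not from the lesson itself), a rank-3 tensor can be represented in NumPy as a 3-dimensional array:

```python
import numpy as np

# A rank-3 tensor: 2 stacked matrices, each with 2 rows and 3 columns
tensor = np.array([
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
])
print(tensor.shape)     # (2, 2, 3)
print(tensor[1, 0, 2])  # 9 (second matrix, first row, third column)
```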
-
Matrix Algebra
The following gifs walk through matrix addition, scalar multiplication, matrix multiplication, and the transpose. You can perform element-wise operations on tensors using matrix algebra as well, which you can read more about here. A short NumPy sketch of these operations follows the list below.
-
Matrix Addition:
-
Scalar Multiplication:
-
Matrix Multiplication:
This is the most complicated, so spend time understanding how it is done in the applet in the Learning Environment. Feel free to pause the animation if you would like to spend more time making sense of each slide!
-
Transpose:
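Since the gifs cannot be reproduced here, the sketch below shows the same four operations with NumPy; the matrices are made up for illustration:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix addition: corresponding entries are added together
print(A + B)   # [[ 6  8] [10 12]]

# Scalar multiplication: every entry is multiplied by the scalar
print(3 * A)   # [[ 3  6] [ 9 12]]

# Matrix multiplication: dot products of the rows of A with the columns of B
print(A @ B)   # [[19 22] [43 50]]

# Transpose: rows become columns
print(A.T)     # [[1 3] [2 4]]
```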
-
-
Neural Networks Concept Overview
Let’s take a look at the journey our inputs take inside of a neural network! By an input, we mean a data point from our dataset. Our input can have many different features, so in our input layer, each node represents a different input feature. For example, if we were working with a dataset of different types of food, some of our features might be size, shape, nutrition, etc., where the value for each of these features would be held in an input node.
Besides an input layer, our neural network has two other different types of layers:
- Hidden layers are layers that come between the input layer and the output layer. They introduce complexity into our neural network and help with the learning process. You can have as many hidden layers as you want in a neural network (including zero of them).
- The output layer is the final layer in our neural network. It produces the final result, so every neural network must have only one output layer.
Each layer in a neural network contains nodes. Nodes between each layer are connected by weights. These are the learning parameters of our neural network, determining the strength of the connection between each linked node.
-
Forward Propagation
The process we have been going through is known as forward propagation. Inputs are moved forward from the input layer through the hidden layer(s) until they reach the output layer.
Check the Apps here.
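As a rough sketch of the idea (the weights, layer sizes, and sigmoid activation below are invented for illustration and are not the course's applet):

```python
import numpy as np

def sigmoid(z):
    # Activation function: squashes a weighted sum into the range (0, 1)
    return 1 / (1 + np.exp(-z))

# One input with 3 features (e.g. size, shape, nutrition)
x = np.array([0.5, 0.1, 0.9])

# Hypothetical weights: input layer (3 nodes) -> hidden layer (4 nodes) -> output layer (1 node)
W_hidden = np.random.rand(4, 3)
W_output = np.random.rand(1, 4)

# Forward propagation: weighted sums followed by activations, layer by layer
hidden_activations = sigmoid(W_hidden @ x)
output = sigmoid(W_output @ hidden_activations)
print(output)  # the network's prediction for this input (bias terms omitted for brevity)
```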
-
Loss Functions
We have seen how we get to an output! Now, what do we do with it? When a value is outputted, we calculate its error using a loss function. Our predicted values are compared with the actual values within the training data. There are two commonly used loss calculation formulas:
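The exact formulas shown in the lesson are not reproduced here; assuming the two losses in question are the usual ones, mean squared error (for regression) and binary cross-entropy (for classification), a minimal sketch looks like this:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred):
    # Penalizes confident wrong predictions for 0/1 labels
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.6])
print(mean_squared_error(y_true, y_pred))    # ~0.075
print(binary_cross_entropy(y_true, y_pred))  # ~0.30
```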
-
Backpropagation
This all seems fine and dandy so far. However, what if our output values are inaccurate? Do we cry? Try harder next time? Well, we can do that, but the good news is that there is more to our deep learning models.
This is where backpropagation and gradient descent come into play. Forward propagation deals with feeding the input values through the hidden layers to the final output layer. Backpropagation refers to computing the gradients of the loss function with respect to the weights, and gradient descent is the algorithm that uses those gradients to continuously update and refine the weights between neurons to minimize our loss function.
By gradient, we mean the rate of change with respect to the parameters of our loss function. From this, backpropagation determines how much each weight is contributing to the error in our loss function, and gradient descent will update our weight values accordingly to decrease this error.
Check the Apps here.
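As a toy illustration of the chain rule at work (a single linear neuron with a squared-error loss; the numbers are made up), the gradient tells us how much the weight and bias contribute to the error:

```python
x, y_true = 2.0, 1.0      # a single data point and its label
w, b = 0.3, 0.0           # current weight and bias
learning_rate = 0.1

# Forward propagation
y_pred = w * x + b                     # 0.6
loss = (y_pred - y_true) ** 2          # 0.16

# Backpropagation: chain rule gives the gradient of the loss w.r.t. each parameter
dloss_dpred = 2 * (y_pred - y_true)    # -0.8
dloss_dw = dloss_dpred * x             # d(pred)/dw = x
dloss_db = dloss_dpred * 1             # d(pred)/db = 1

# Gradient descent step: move each parameter against its gradient to reduce the loss
w -= learning_rate * dloss_dw
b -= learning_rate * dloss_db
print(w, b)  # 0.46 0.08 -- the updated parameters give a smaller loss
```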
-
Gradient Descent
We have the overall process of backpropagation down! Now, let’s zoom in on what is happening during gradient descent.
If we think about the concept graphically, we want to look for the minimum point of our loss function because this will yield us the highest accuracy. If we start at a random point on our loss function, gradient descent will take “steps” in the “downhill direction” towards the negative gradient. The size of the “step” taken is dependent on our learning rate. Choosing the optimal learning rate is important because it affects both the efficiency and accuracy of our results.
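A minimal sketch of those downhill steps on a one-parameter loss (the loss function, starting point, and learning rate are all invented for illustration):

```python
# Minimize the loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3)
def gradient(w):
    return 2 * (w - 3)

w = 0.0                # start at a random point on the loss curve
learning_rate = 0.1    # the size of each "step" downhill

for step in range(100):
    w -= learning_rate * gradient(w)   # step in the direction of the negative gradient

print(w)  # approaches 3, the minimum of the loss
```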
-
Stochastic Gradient Descent
This leads us to the final point about gradient descent. In deep learning models, we are often dealing with extremely large datasets. Because of this, performing backpropagation and gradient descent calculations on all of our data may be inefficient and computationally expensive no matter what learning rate we choose.
To solve this problem, a variation of gradient descent known as Stochastic Gradient Descent (SGD) was developed. Let’s say we have 100,000 data points and 5 parameters. If we did 1,000 iterations (also known as epochs in Deep Learning), we would end up with 100,000 ⋅ 5 ⋅ 1,000 = 500,000,000 computations. We do not want our computer to do that many computations on top of the rest of the learning model; it will take forever.
This is where SGD comes to play. Instead of performing gradient descent on our entire dataset, we pick out a random data point to use at each iteration. This cuts back on computation time immensely while still yielding accurate results.
-
More Variants of Gradient Descent
Just when you thought SGD solved all our problems, even more options come into the picture!
There are also other variants of gradient descent, such as the Adam optimization algorithm and mini-batch gradient descent. Adam is an adaptive learning algorithm that finds individual learning rates for each parameter. Mini-batch gradient descent is similar to SGD except that instead of iterating on one data point at a time, we iterate on small batches of fixed size.
The Adam optimizer’s adaptive learning rate has made it a popular variant of gradient descent that is commonly used in deep learning models. Mini-batch gradient descent was developed as a trade-off between GD and SGD. Since mini-batch does not depend on just one training sample, its loss curve is much smoother and it is less affected by outliers and noisy data, making it a more stable choice than SGD.
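The practical difference between these variants is mostly how many data points feed each update. Here is a sketch on made-up data (a one-parameter linear regression) comparing SGD (batch size 1) with mini-batch gradient descent:

```python
import numpy as np

# Hypothetical dataset: 100,000 points generated from y = 2x plus noise
rng = np.random.default_rng(0)
X = rng.random(100_000)
y = 2 * X + rng.normal(scale=0.1, size=100_000)

def stochastic_descent(batch_size, steps=2000, learning_rate=0.1):
    w = 0.0
    for _ in range(steps):
        # Use a random batch instead of the whole dataset at each update
        idx = rng.integers(0, len(X), size=batch_size)
        x_b, y_b = X[idx], y[idx]
        grad = np.mean(2 * (w * x_b - y_b) * x_b)  # gradient of MSE w.r.t. w
        w -= learning_rate * grad
    return w

print(stochastic_descent(batch_size=1))   # SGD: noisy updates, one data point at a time
print(stochastic_descent(batch_size=32))  # mini-batch: smoother updates from small batches
```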
Review
This overview completes the necessary mathematical intuition you need to move forward and begin coding your own learning models! To recap all the things we have learned (so many things!!!):
- Scalars, vectors, matrices, and tensors
- A scalar is a singular quantity like a number.
- A vector is an array of numbers (scalar values).
- A matrix is a grid of information with rows and columns.
- A tensor is a multidimensional array and is a generalized version of a vector and matrix.
- Matrix Algebra
- In scalar multiplication, every entry of the matrix is multiplied by a scalar value.
- In matrix addition, corresponding matrix entries are added together.
- In matrix multiplication, the dot product between the corresponding rows of the first matrix and columns of the second matrix is calculated.
- A matrix transpose turns the rows of a matrix into columns.
- In forward propagation, data is sent through a neural network to get initial outputs and error values.
- Weights are the learning parameters of a deep learning model that determine the strength of the connection between two nodes.
- A bias node shifts the activation function either left or right to create the best fit for the given data in a deep learning model.
- Activation Functions are used in each layer of a neural network and determine whether neurons should be “fired” or not based on output from a weighted sum.
- Loss functions are used to calculate the error between the predicted values and actual values of our training set in a learning model.
- In backpropagation, the gradient of the loss function is calculated with respect to the weight parameters within a neural network.
- Gradient descent updates our weight parameters by iteratively minimizing our loss function to increase our model’s accuracy.
- Stochastic gradient descent is a variant of gradient descent, where instead of using all data points to update parameters, a random data point is selected.
- Adam optimization is a variant of SGD that allows for adaptive learning rates.
- Mini-batch gradient descent is a variant of GD that uses random batches of data to update parameters instead of a random datapoint.
Details
We’ve learned how to write queries to retrieve information from the database. Now, we are going to learn how to perform calculations using SQL.
Calculations performed on multiple rows of a table are called aggregates.
Here is a quick preview of some important aggregates that we will cover in the next exercises:
- COUNT(): count the number of rows
- SUM(): the sum of the values in a column
- MAX(): the largest value
- MIN(): the smallest value
- AVG(): the average of the values in a column
- ROUND(): round the values in the column
Aggregate functions combine multiple rows together to form a single value of more meaningful information.
- GROUP BY is a clause used with aggregate functions to combine data from one or more columns.
- HAVING limits the results of a query based on an aggregate property.
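A quick sketch of how these read together, using Python's built-in sqlite3 module and a made-up orders table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('Ada', 20.0), ('Ada', 35.5), ('Grace', 12.0), ('Grace', 80.0), ('Linus', 5.0);
""")

# GROUP BY combines the rows for each customer; HAVING filters on an aggregate property
query = """
    SELECT customer, COUNT(*) AS num_orders, ROUND(AVG(amount), 2) AS avg_amount
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 30;
"""
for row in conn.execute(query):
    print(row)   # ('Ada', 2, 27.75) and ('Grace', 2, 46.0); Linus is filtered out
```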
Aggregate
Challenge
Details
Concepts
- JOIN will combine rows from different tables if the join condition is true.
- LEFT JOIN will return every row in the left table, and if the join condition is not met, NULL values are used to fill in the columns from the right table.
- A primary key is a column that serves as a unique identifier for the rows in the table.
- A foreign key is a column that contains the primary key of another table.
- CROSS JOIN lets us combine all rows of one table with all rows of another table.
- UNION stacks one dataset on top of another.
- WITH allows us to define one or more temporary tables that can be used in the final query.
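A minimal sketch that ties several of these together with sqlite3 (the customers and orders tables are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 35.5), (12, 2, 12.0);
""")

# LEFT JOIN keeps every customer, filling NULLs where there is no matching order;
# WITH defines a temporary table (customer_totals) used by the final query
query = """
    WITH customer_totals AS (
        SELECT customers.name, SUM(orders.amount) AS total
        FROM customers
        LEFT JOIN orders ON customers.customer_id = orders.customer_id
        GROUP BY customers.name
    )
    SELECT * FROM customer_totals ORDER BY name;
"""
for row in conn.execute(query):
    print(row)   # ('Ada', 55.5), ('Grace', 12.0), ('Linus', None)
```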
Multiple Tables
- Introduction
- Combining Tables with SQL
- Inner Joins
- Left Joins
- Primary Key vs Foreign Key
- Cross Join
- Union
- With
Challenge
Details
So far we’ve learned how to retrieve information from tables and perform calculations on them. But we’ve seen that GROUP BY and SUM reduce the number of rows in your query results because they are combining or grouping rows.
Window functions, on the other hand, allow you to maintain the values of your original table while displaying grouped or summative information alongside in another column. This is why many Data Scientists and Data Engineers love to use window functions for complex data analysis.
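As a small sketch of the idea (the sales table is made up; this assumes a SQLite version with window-function support, 3.25 or later):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (month INTEGER, region TEXT, amount REAL);
    INSERT INTO sales VALUES
        (1, 'East', 100), (2, 'East', 150), (1, 'West', 80), (2, 'West', 120);
""")

# The window function adds a grouped running total while keeping every original row
query = """
    SELECT month, region, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total
    FROM sales;
"""
for row in conn.execute(query):
    print(row)   # each original row plus its region's running total
```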
- Introduction
- Window Function Syntax
- PARTITION BY
- FIRST_VALUE and LAST_VALUE
- LAG
- LEAD
- ROW_NUMBER
- RANK
- NTILE
Here’s a quick preview of the concepts that will be covered in these next six exercises:
Math operators:
- + : addition
- - : subtraction
- / : division
- * : multiplication
- % : modulo (returns the remainder)
Math functions:
- ABS(): returns the absolute value of the input expression
- CAST(): converts an expression into another data type
Date and time functions:
- DATETIME(): returns the date and time of a time string
- DATE(): returns the date portion of a time string
- TIME(): returns the time portion of a time string
- STRFTIME(): returns a formatted date
Date function modifiers:
-
Modifiers to shift the date backwards to a specified part:
'start of year'
'start of month'
'start of day'
-
Modifiers to add a specified amount to the date and time of a time string:
'+-N years'
'+-N months'
'+-N days'
'+-N hours'
'+-N minutes'
'+-N seconds'
-
Format string substitutions:
- %Y: returns the year (YYYY)
- %m: returns the month (01-12)
- %d: returns the day of month (01-31)
- %H: returns the hour (00-23)
- %M: returns the minute (00-59)
- %S: returns the second (00-59)
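A short sketch of these functions and modifiers in action (the time string is made up), again run through sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
query = """
    SELECT DATETIME('2023-06-15 14:30:45')                   AS full_datetime,
           DATE('2023-06-15 14:30:45')                       AS date_only,
           TIME('2023-06-15 14:30:45')                       AS time_only,
           DATE('2023-06-15', 'start of month', '+1 months') AS next_month_start,
           STRFTIME('%Y-%m-%d %H:%M', '2023-06-15 14:30:45') AS formatted;
"""
print(conn.execute(query).fetchone())
# ('2023-06-15 14:30:45', '2023-06-15', '14:30:45', '2023-07-01', '2023-06-15 14:30')
```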
Let’s get started!
Details
- Code Challenge 1
- Code Challenge 2
- Code Challenge 3
- Code Challenge 4
- Code Challenge 5
- Code Challenge 6
- Code Challenge 7
- Code Challenge 8
- Code Challenge 9
- Code Challenge 10
- Code Challenge 11
- Code Challenge 12
- Code Challenge 13
- Code Challenge 14
-
What is SQL?
SQL, which stands for Structured Query Language, is a programming language used to communicate with data stored in a relational database management system.
Its syntax is similar to English, making it easy to write, read, and interpret. It allows you to write queries which define the subset of data you are seeking. You can save these queries, refine them, share them, and run them on different databases.
-
What is a database?
A database is a set of data stored in a computer, where the data is structured in a way that makes it easily accessible.
-
What is a relational database?
A relational database is a type of database. It uses a structure that allows us to identify and access data in relation to other pieces of data in the database.
Data in a relational database is often organized as tables.
-
What is a RDBMS?
A relational database management system (RDBMS) is a program that allows you to create, update, and administer a relational database.
Most relational database management systems use SQL language to access the databases.
The most popular RDBMS is MySQL. Others include PostgreSQL, Oracle DB, SQL Server, and SQLite.
-
What is a table?
A table is a collection of data organized into rows and columns. Tables are sometimes referred to as “relations”.
They can have hundreds, thousands, and sometimes even millions of rows of data.
-
What is a row and column in a table?
A row is a single record of data in a table.
A column is a set of data values of a particular type.
-
What is a data type?
A data type is an attribute that specifies the type of data a column holds.
Every column of a database has a data type.
Some common data types are INTEGER, TEXT, DATE, and REAL.
-
What is a primary key and foreign key?
A primary key is a column that uniquely identifies each row of a table.
Primary keys must satisfy the following requirements: no value can be NULL, each value must be unique, and a table cannot have more than one primary key column.
For example, in a customers table, the primary key would be customer_id.
A foreign key is the primary key of one table that appears in a different table. For example, if there was also an orders table, each order could store the customer it belongs to, such that the customer_id column would be the foreign key.
-
What is the difference between ALTER and UPDATE?
The ALTER statement is used to add a new column to a table. It changes the table structure.
The UPDATE statement is used to edit a row in the table. It changes existing records in the table.
-
What is a query?
A query is a SQL statement used to retrieve information stored in a database.
They allow us to communicate with the database by asking questions and returning a result set with data relevant to the question.
-
What is a subquery?
A subquery is an internal query nested inside of an external query.
They can be nested inside of SELECT, INSERT, UPDATE, or DELETE statements.
When a subquery is present, it will be executed before the external statement is run.
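For example (the employees table and values are invented), a subquery computing the average salary runs before the outer SELECT:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary REAL);
    INSERT INTO employees VALUES ('Ada', 95000), ('Grace', 105000), ('Linus', 70000);
""")

# The inner query is evaluated first, and its result feeds the outer WHERE clause
query = """
    SELECT name, salary
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees);
"""
print(conn.execute(query).fetchall())  # [('Ada', 95000.0), ('Grace', 105000.0)]
```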
-
What are constraints?
Constraints are a set of rules used to tell the database to limit the type of data that can be stored for the columns. They tell the database to reject inserted data that does not adhere to the restriction.
They add information about how a column can be used, and are invoked after the data type for a column.
Some examples of constraints are:
- PRIMARY KEY, which uniquely identifies each row and requires each value to be unique.
- UNIQUE, which requires every value in the column to be different.
- NOT NULL, which requires columns to have a value.
- DEFAULT, which takes an additional argument which will be the assumed value for an inserted row if the new row does not specify a value for that column.
-
What is a statement?
A statement is text that the database recognizes as a valid command.
They can be used to perform tasks like altering the structure of a table, updating data, or retrieving data from a database.
The structure of a statement can vary, but each must end with a semicolon (;). The number of lines used for a statement does not matter, and it can be written as a single line or split into multiple lines to make it easier to read.
-
How do you check if a field does not have a value or has a value?
When a field has no value, it is indicated with a NULL value.
To check if a field does not have a value, you can use IS NULL as the condition: WHERE col IS NULL
To check if a field has a value, you can use IS NOT NULL as the condition: WHERE col IS NOT NULL
-
What is the difference between DISTINCT and UNIQUE?
DISTINCT is a keyword used when we want to return unique values in the output. It filters out all duplicate values in the specified column.
UNIQUE is a constraint used to ensure that all values of a column are different. It is similar to PRIMARY KEY, except that a table can have many different UNIQUE columns.
-
What are aggregate functions used for?
Aggregate functions are used to perform a calculation on one or more values and return a single value of more meaningful information.
Some aggregate functions are COUNT(), SUM(), MAX(), MIN(), AVG(), and ROUND().
-
What is a join?
A join is a way to combine rows from two or more tables, based on a related column between them.
-
What is the difference between an INNER JOIN and LEFT JOIN?
An INNER JOIN is used to include combined rows from two tables that match the ON condition. The final result does not include rows with no match for the ON condition.
A LEFT JOIN is used to keep all rows from the first table, regardless of whether there is a matching row in the second table for the ON condition.
-
What is the purpose of window functions?
Window functions are used when you want to maintain the values of your original table while displaying grouped or summative information alongside them. They are similar to aggregate functions, but do not reduce the number of rows in the result by combining or grouping them into fewer rows.
-
What are indexes and why are they needed?
Indexes are a powerful tool used in the background of a database to speed up querying, by acting as a lookup table for data.
They are needed to efficiently store data for quicker retrieval, which can be paramount to the success of large tech companies that need to process data on the scale of petabytes each day.