Codecademy wants to create a world where anyone can build something meaningful with technology, and everyone has the learning tools, resources, and opportunities to do so. Code contains a world of possibilities — all that’s required is the curiosity and drive to learn. At Codecademy, we are committed to empowering all people, regardless of where they are in their coding journeys, to continue to learn, grow, and make an impact on the world around them.
Details
The Data Scientist Career Path is designed for you to gain the technical skills, knowledge, resources, and confidence you need to get your start as a data scientist.
After this Path, you will be able to:
- Create programs using Python 3
- Move off of the Codecademy platform with the Command Line, Jupyter Notebook, Git, and GitHub Desktop
- Acquire and query data with SQL and BeautifulSoup
- Manipulate data using NumPy and pandas
- Wrangle and tidy data with pandas
- Summarize and analyze data with scikit-learn
- Implement many different hypothesis tests using Python
- Visualize data using Matplotlib and seaborn
- Communicate data science findings effectively
- Work with text data using natural language processing
- Build, train, and test a variety of supervised and unsupervised machine learning models
- Understand the fundamentals of deep learning
- Work with aggregate functions and joins in SQL
Popular Data Science Resources
As you go through the path, we’ll suggest certain resources for you to use, like articles, videos, tutorials, and documentation. Here are some additional resources that are considered groundbreaking, significant, or classics in the industry and will help you throughout your Path:
- Documentation:
- Cheatsheet:
- Books:
- A Whirlwind Tour of Python, Jake VanderPlas
- Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, Wes McKinney
- Python Data Science Handbook, Jake VanderPlas, also available for free here
- Algorithms of Oppression: How Search Engines Reinforce Racism, Safiya Umoja Noble
- Introduction to Machine Learning with Python: A Guide for Data Scientists, Andreas C. Müller, Sarah Guido
- Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Cathy O’Neil
- Python Feature Engineering Cookbook, Soledad Galli
- Deep Learning with Python, François Chollet
Details
Data science enables us to take data and transform it into meaningful information that can help us make decisions. Data science is interdisciplinary and combines other well-known fields such as probability, statistics, analytics, and computer science. The work that you might do could range from writing up reports to building machine learning models. No matter what your interests are, data science is applicable - because these days, we have data on everything!
A fundamental part of data science is statistics. Statistics is the practice of applying mathematical calculations to sets of data to derive meaning. Statistics can give us a quick summary of a dataset, such as the average amount or how consistent a dataset is.
There are two types of statistics:
- Descriptive Statistics: Descriptive statistics describe a dataset using mathematically calculated values, such as the mean and standard deviation. For instance, a FiveThirtyEight graph charting the wage gap between American men and women in 2014 shows that at the 90th income percentile, women make 80.6% of what men make on average; that figure is a descriptive statistic.
- Inferential Statistics: Inferential statistics, on the other hand, are statistical calculations that enable us to draw conclusions about a larger population. For instance, from that same graph we can infer that at the 99th income percentile, women make less than 78% of what men make on average. We can also infer that the wage gap is smallest at the 10th income percentile because the minimum wage for men and women is the same.
Another field that informs the field of data science is probability. Probability is the mathematical study of what could potentially happen. Fun fact: the study of probability came about as a method to decode secret messages.
In data science, probability calculations are used to build models. Models are able to help us understand data that has yet to exist - either data we hadn’t previously collected or data that has yet to be created. Data scientists create models to help calculate the probability of a certain action and then they use that probability to make informed decisions.
Example of Probability Application
Given that we’re a “learn to code” website, this one may seem fairly obvious. But in all seriousness, programming is an essential part of data science. It’s also what sets data science apart from similar fields, like data analytics.
Programming is the practice of writing commands for a computer to execute. Computer Science is the discipline of interacting with computation systems.
Example of Programming Application
One of the reasons that data science is one of the most interesting fields to go into is that it doesn’t limit you to a specific industry. Interested in airplanes? Work for a company on ticket pricing! Like fashion? Create a model that predicts the latest trends.
Domain Expertise refers to the particular set of knowledge that someone cultivates in order to understand their data. You may be able to crunch the numbers, but in order to understand their meaning, you’re going to need a lot of context. Sometimes this context comes from your research, your teammates, or the knowledge that you gain after working in a particular industry.
Review! You now know the basic parts that make up the field of data science.
- Data Science—the field of taking data and transforming it into meaningful information that can help us make decisions
- Descriptive Statistics—statistics that describe the data in objective terms
- Inferential Statistics—inferences for the overall population based on data
- Probability—the likelihood that an event will happen
- Programming—the act of giving the computer instructions to perform a task
- Domain Expertise—the particular set of knowledge that someone cultivates and brings with them in order to understand their data
If you are interested in learning more about these topics, here are some additional resources:
- Blog: What does a data scientist do?
- Article: What Data Scientists Really Do, According to 35 Data Scientists
- Article: Achieving business impact with data
As you can see, there are many different applications and kinds of projects that you can do once you know a bit of data science!
- Reports - a way of presenting your process, insights, and recommendations
- Recommender Systems - a process that uses data about users and items to predict interest
- Dynamic Pricing - a strategy that takes into account factors such as demand to increase and decrease prices to drive profit
- Natural Language Processing - ways of analyzing text to gain insights as well as support applications, such as chatbots
Of course, there are many applications beyond just the ones that we’ve covered here. So no matter your interest or professional industry, data science thinking can create impact.
Details
Python is a programming language. Like other languages, it gives us a way to communicate ideas. In the case of a programming language, these ideas are “commands” that people use to communicate with a computer!
- Welcome
- Comment
- String
- Variables
- Errors
- Numbers
- Calculations
- Changing Numbers
- Exponents
- Modulo
- Concatenation
- Plus Equals
- Multi-line Strings
- Review
Two common errors that we encounter while writing Python are SyntaxError and NameError.
- A SyntaxError means there is something wrong with the way your program is written: punctuation that does not belong, a command where it is not expected, or a missing parenthesis can all trigger a SyntaxError.
- A NameError occurs when the Python interpreter sees a word it does not recognize. Code that contains something that looks like a variable but was never defined will throw a NameError.
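As a quick illustration, each of the small snippets below would raise the corresponding error if run on its own (the variable name is made up for the example):
# SyntaxError: the closing parenthesis is missing
print("Hello, world!"

# NameError: the variable message was never defined
print(message)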
Python Documentation - Built-in Types
In this documentation, you will learn about the built-in data types of Python. This is helpful if you would like to learn about data types that you can use when programming in Python.
Details
Let’s imagine we were building an application to help people plan trips! For a trip-planning application, a simple procedure could look like this:
1. Establish your origin and destination
2. Calculate the distance/route
3. Return the best route to the user
We will perform these three steps every time users have to travel between two points using our trip application. In our programs, we could rewrite the same procedures over and over (and over) for each time we want to travel, but there’s a better way! Python gives us a useful concept called functions.
- Why Functions?
- Defining a Function
- Calling Function
- Whitespace & Execution Flow
- Parameters & Arguments
- Multiple Parameters
- Types of Arguments
- Built-in vs User Defined Functions
- Variable Access
- Returns
- Multiple Returns
- Review
Here’s an example of a function definition:
def function_name():
  # function tasks go here
There are some key components we want to note here:
- The def keyword indicates the beginning of a function (also known as a function header). The function header is followed by a name in snake_case format that describes the task the function performs. It’s best practice to give your functions a descriptive yet concise name.
- Following the function name is a pair of parentheses ( ) that can hold input values known as parameters (more on parameters later in the lesson!). In this example function, we have no parameters.
- A colon : marks the end of the function header.
- Lastly, we have one or more valid Python statements that make up the function body (where we have our Python comment).
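Putting these pieces together, a minimal sketch of a complete function with a parameter and a return value might look like this (the function and argument names are just examples):
def trip_welcome(destination):
  # The function body is indented under the header
  return "Welcome! Your trip to " + destination + " is being planned."

# Call the function and print its return value
print(trip_welcome("Tokyo"))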
Python Code Challenges: Functions
Python Code Challenges: Functions (Advanced)
Details
- Boolean Expressions
- Relational Operators: Equals and Not Equals
- Boolean Variables
- If Statement
- Relational Operators II
- Boolean Operators: and
- Boolean Operators: or
- Boolean Operators: not
- Else Statements
- Else If Statements
- Review
Let’s review what we’ve learned this lesson:
- Boolean expressions are statements that can be either True or False.
- A boolean variable is a variable that is set to either True or False.
- We can create boolean expressions using relational operators:
  - ==: Equals
  - !=: Not equals
  - >: Greater than
  - >=: Greater than or equal to
  - <: Less than
  - <=: Less than or equal to
- if statements can be used to create control flow in your code.
- else statements can be used to execute code when the conditions of an if statement are not met.
- elif statements can be used to build additional checks into your if statements.
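A quick sketch of these pieces working together (the age thresholds are arbitrary example values):
age = 17

if age >= 18:
  print("You can vote and drive.")
elif age >= 16:
  print("You can drive, but not vote.")
else:
  print("You can neither vote nor drive.")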
Python Tutorial - More Control Flow Tools
In this documentation, you will learn about the different flow control statements of Python. This is helpful if you would like to conditionally execute blocks of code in a Python program.
Python Code Challenges: Control Flow
Python Code Challenges: Control Flow (Advanced)
Details
Windows has a different CLI, called Command Prompt. While this has many of the same features as Bash, Bash is much more popular. Because of the strength of the open source community and the tools they provide, mastering Bash is a better investment than mastering Command Prompt.
To use Bash on a Windows computer, we will download and install a program called Git Bash. Git Bash allows us to easily access Bash as well as another tool we’ll be using later called Git, inside the Windows environment.
For a more detailed tutorial, please watch this video.
Jupyter Notebook (sometimes called IPython Notebook) is a popular way to write and run Python code, especially for data analysis, data science and machine learning. Jupyter Notebooks are easy-to-use because they let you execute code and review the output quickly. This iterative process is central to data analytics and makes it easy to test hypotheses and record the results (just like a notebook).
A Jupyter Notebook has two parts:
- the front-end interface (what you see in your browser)
- the back-end kernel
The front-end interface loads in a web browser and consists of “cells” where you enter your code. The browser is just for display, so you do not need to be connected to the internet.
Jupyter Notebook uses a back-end kernel called IPython. The ‘I’ stands for ‘Interactive’, which means that a program or script can be broken up into smaller pieces, and those pieces can be run independently from the rest of the program.
You do not need to worry about the difference between Python and IPython. The important thing to know is that you can run small pieces of code, which can be helpful when working with data.
Windows Anaconda
Go to the Anaconda Downloads page and download the appropriate version of Anaconda (32- or 64-bit Graphical Installer).
- Double click on the .exe file and click Install.
- Read and agree to the licensing terms.
- Select if you want to install for 'Just Me' or 'All Users'. If you are installing for 'All Users', you must have Administrator privileges.
- You will be prompted to select the installation location. By default, Anaconda should try to install in your home directory. We recommend accepting this default. Click Install.
- You will be asked if you want to add Anaconda to your PATH environment variable. Do not add Anaconda to the PATH because it can interfere with other software.
- You will be asked if you want Anaconda to be your default version of Python. We recommend 'Yes'. There are some rare instances where you might not make Anaconda the default version, but they are beyond the scope of this article.
- Click the Install button.
- Go ahead and finish the installation.
Now you can verify your installation. Go to the Windows Start menu and search for ‘Anaconda Navigator’ (not the Anaconda Prompt).
If it appears, it is installed. Congratulations!
Double click on it, and the Anaconda Navigator window should appear.
Your applications screen may look slightly different from this one, but that is ok. Click the Launch button under Jupyter Notebook. A Jupyter Notebook interface will appear in your default browser.
An Anaconda Prompt might also open and display a url. If it does, do not close it until you are done working with Jupyter Notebook. If it does not appear, don’t worry - this has to do with your operating system and will not affect Jupyter Notebook’s performance.
Congratulations!! You are ready to move on to the next article and get started using Jupyter Notebook!
For many more helpful tools and features, check out the documentation, and if you want to be inspired to do more with Jupyter notebook, check out A gallery of interesting Jupyter Notebooks.
Have fun coding with Jupyter Notebook!
There are numerous ways to enhance your Jupyter Notebook and make it more readable and standardized. We have covered some of the methods, but there are many others that can be found in the Jupyter Notebook User’s Manual.
Documentation
Jupyter Notebook Documentation
In this documentation, you will learn about how to install, configure, and use Jupyter Notebook for Python programming. This is helpful if you would like to quickly validate Python code or prototype visualizations.
Details
In programming, it is common to want to work with collections of data. In Python, a list is one of the many built-in data structures that allows us to work with a collection of data in sequential order.
Suppose we want to make a list of the heights of students in a class:
- Noelle is 61 inches tall
- Ava is 70 inches tall
- Sam is 67 inches tall
- Mia is 64 inches tall
In Python, we can create a variable called heights to store these integers into a list:
heights = [61, 70, 67, 64]
Notice that:
- A list begins and ends with square brackets ([ and ]).
- Each item (e.g., 67 or 70) is separated by a comma (,).
- It’s considered good practice to insert a space after each comma, but your code will run just fine if you forget the space.
Intro to Lists in Python:
- What is a List?
- What can a List contain?
- Empty Lists
- List Methods
- Growing a List: Append
- Growing a List: Plus (+)
- Accessing List Elements
- Accessing List Elements: Negative Index
- Modifying List Elements
- Shrinking a List: Remove
- Two-Dimensional (2D) Lists
- Accessing 2D Lists
- Modifying 2D Lists
- Review
Combining Lists: The Zip Function
Use zip() to create a new variable called names_and_dogs_names that combines the owners and dogs_names lists into a zip object.
Then, create a new variable named list_of_names_and_dogs_names by calling the list() function on names_and_dogs_names.
Print list_of_names_and_dogs_names.
See the code here.
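One possible solution sketch, assuming owners and dogs_names are simple lists of strings:
owners = ["Jenny", "Alexus", "Sam", "Grace"]
dogs_names = ["Elphonse", "Dr. Doggy DDS", "Carter", "Ralph"]

# zip() pairs up elements from both lists into a zip object
names_and_dogs_names = zip(owners, dogs_names)

# Convert the zip object into a list of tuples so it can be printed
list_of_names_and_dogs_names = list(names_and_dogs_names)
print(list_of_names_and_dogs_names)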
Documentation
Python Tutorial - Sequence Types
In this documentation, you will learn about the built-in sequence types available in Python. This is helpful when you are working with a sequence of data in Python.
Python List Methods:
- .count() - A list method to count the number of occurrences of an element in a list.
- .insert() - A list method to insert an element into a specific index of a list.
- .pop() - A list method to remove an element from a specific index or from the end of a list.
- range() - A built-in Python function to create a sequence of integers.
- len() - A built-in Python function to get the length of a list.
- .sort() / sorted() - A method and a built-in function to sort a list.
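A short sketch of these methods in action (the list contents are arbitrary):
scores = [88, 72, 95, 72]

print(scores.count(72))    # 2 occurrences of 72
scores.insert(1, 100)      # insert 100 at index 1
last = scores.pop()        # remove and return the last element
print(len(scores))         # length of the list
print(sorted(scores))      # returns a new sorted list
scores.sort()              # sorts the list in place
print(list(range(5)))      # [0, 1, 2, 3, 4]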
Working with Lists in Python:
- Adding by Index: Insert
- Removing by Index: Pop
- Consecutive Lists: Range
- The Power of Range!
- Length
- Slicing Lists I
- Slicing Lists II
- Counting in a List
- Sorting Lists I
- Sorting Lists II
- Review
Python Code Challenges: Lists
Python Code Challenges: Lists (Advanced)
Details
In programming, this process of using an initialization, repetitions, and an ending condition is called a loop. In a loop, we perform a process of iteration (repeating tasks).
Programming languages like Python implement two types of iteration:
- Indefinite iteration, where the number of times the loop is executed depends on how many times a condition is met.
- Definite iteration, where the number of times the loop will be executed is defined in advance (usually based on the collection size).
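A brief sketch of each kind of iteration (the data is arbitrary):
# Definite iteration: the number of repetitions is known in advance
ingredients = ["milk", "sugar", "vanilla extract"]
for ingredient in ingredients:
  print(ingredient)

# Indefinite iteration: repeat until a condition stops being true
count = 0
while count < 3:
  print("count is", count)
  count += 1

# List comprehension: build a new list from an existing one
doubled = [number * 2 for number in [1, 2, 3]]
print(doubled)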
Intro to Loops:
- Why Loops?
- For Loops: Introduction
- For Loops: Using Range
- While Loops: Introduction
- While Loops: Lists
- Infinite Loops
- Loop Control: Break
- Loop Control: Continue
- Nested Loops
- List Comprehensions: Introduction
- List Comprehensions: Conditionals
- Review
Documentation
Python Tutorial - Data Structures : Looping Techniques
In this documentation, you will learn about the different Python looping techniques. This is helpful if you would like to repeatedly execute a code block in a program for a sequence of values.
List Comprehension - Code Challenge:
Details
In Python, the way we store something like a word, a sentence, or even a whole paragraph is as a string. A string is a sequence of characters contained within a pair of 'single quotes' or "double quotes". A string can be any length and can contain any letters, numbers, symbols, and spaces.
Intro to Strings:
- String
- They're all Lists!
- Cut Me a Slice of String
- Concatenating Strings
- More and More String Slicing
- Negative Indices
- Strings are Immutable
- Escape Characters
- Iterating through Strings
- Strings and Conditionals (Part One)
- Strings and Conditionals (Part Two)
- Review
String Methods:
- Formatting Methods
- Splitting Strings
- Splitting Strings II
- Splitting Strings III
- Joining Strings
- Joining Strings II
- .strip()
- Replace
- .find()
- .format()
- .format() II
- Review
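A few of these string methods in a quick sketch (the strings are arbitrary examples):
line = "  Purple mountain majesties  "

print(line.strip())                   # remove leading/trailing whitespace
print(line.lower())                   # formatting method: lowercase
words = line.split()                  # split on whitespace into a list
print("-".join(words))                # join the list back with dashes
print(line.replace("Purple", "Blue")) # replace a substring
print(line.find("mountain"))          # index where the substring starts
print("{} has {} words".format(line.strip(), len(words)))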
Documentation
Python Documentation - Common String Operations
In this documentation, you will learn about common string operations in Python. This is helpful if you would like to manipulate strings using Python code.
Code Challenge - String Methods:
Details
A dictionary is an unordered set of key: value pairs.
It provides us with a way to map pieces of data to each other so that we can quickly find values that are associated with one another.
Suppose we want to store the prices of various items sold at a cafe:
- Avocado Toast is 6 dollars
- Carrot Juice is 5 dollars
- Blueberry Muffin is 2 dollars
In Python, we can create a dictionary called menu to store this data:
menu = {"avocado toast": 6, "carrot juice": 5, "blueberry muffin": 2}
Notice that:
- A dictionary begins and ends with curly braces { and }.
- Each item consists of a key ("avocado toast") and a value (6).
- Each key: value pair is separated by a comma.
- It’s considered good practice to insert a space after each comma, but our code will still run without the space.
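A short sketch of working with this menu dictionary (the added items are made up):
menu = {"avocado toast": 6, "carrot juice": 5, "blueberry muffin": 2}

# Look up a value by its key
print(menu["carrot juice"])

# Add a new key: value pair
menu["banana bread"] = 3

# Overwrite an existing value
menu["avocado toast"] = 7

# .get() returns a default instead of raising a KeyError
print(menu.get("espresso", "not on the menu"))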
Intro to Dictionaries:
- Make a Dictionary
- Invalid Keys
- Empty Dictionary
- Add A Key
- Add Multiple Keys
- Overwrite Values
- List Comprehensions to Dictionaries
- Review
Documentation
Python Tutorial - Data Structures - Dictionaries
In this documentation, you will learn about Python’s built-in dictionary datatype. This is helpful if you would like to store the data as a set of key:value pairs, and later extract a value given its key.
Using Dictionaries:
Details
Intro to Classes:
Details
A module is a collection of Python declarations intended broadly to be used as a tool. Modules are also often referred to as “libraries” or “packages” — a package is really a directory that holds a collection of modules.
Usually, to use a module in a file, the basic syntax you need at the top of that file is:
from module_name import object_name
Often, a library will include a lot of code that you don’t need that may slow down your program or conflict with existing code. Because of this, it makes sense to only import what you need.
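For example, a minimal sketch using Python's built-in random module (the dice values are arbitrary):
from random import randint

# randint(a, b) returns a random integer between a and b, inclusive
roll = randint(1, 6)
print("You rolled a", roll)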
Intro to Modules:
- Modules Python Introduction
- Modules Python Random
- Modules Python Namespaces
- Modules Python Decimals
- Modules Python Files and Scope
Documentation
In this documentation you will learn about Python modules. This is helpful if you would like to split long program code into logically grouped files and import the files to reference the classes and functions within.
Details
Learn Python Files:
- Reading a File
- Iterating Through Lines
- Reading a Line
- Writing a File
- Appending to a File
- What's With "with"?
- What Is a CSV File?
- Reading a CSV File
- Reading Different Types of CSV Files
- Writing a CSV File
- Reading a JSON File
- Writing a JSON File
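A compact sketch of a few of these ideas, assuming files named welcome.txt, visitors.csv, and message.json already exist in the working directory:
import csv
import json

# Read a whole text file; "with" closes the file automatically
with open("welcome.txt") as text_file:
  contents = text_file.read()

# Read a CSV file row by row as dictionaries keyed by the header
with open("visitors.csv", newline="") as csv_file:
  for row in csv.DictReader(csv_file):
    print(row)

# Read a JSON file into Python data structures
with open("message.json") as json_file:
  data = json.load(json_file)

# Append a line to a text file
with open("log.txt", "a") as log_file:
  log_file.write("Finished reading the files.\n")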
Documentation
Python Tutorial - Input and Output - Reading and Writing Files
In this documentation you will learn about reading and writing the data in files using Python. This is helpful if you would like to read the file’s contents into the program or write data you’ve generated into a file.
Details
Data acquisition, or data mining, is the step of the Data Science Life Cycle where we identify and obtain the raw data that will later be cleaned for exploration and modeling.
Data science is viewed as such a powerful tool because data is often the most valuable resource we have when attempting to gain insight from our environment. In a business setting, once a question has been defined and objectives have been set through the business understanding phase of the data science life cycle, the right data must be acquired to answer the business needs. In the life cycle diagram, the arrows between the business understanding phase and the data acquisition phase indicate that there is often an iterative relationship between the two: you need to understand how the questions you’re asking relate to your goals before collecting data, but the data you acquire may shift your understanding of your business and send you back to the business understanding phase. Some aspects to consider when acquiring data are:
- What data is needed to achieve the business goal?
- How much data is needed to produce valuable insight and modeling?
- Where and how can this data be found?
- What legal and privacy parameters should be considered?
There are several methods you can utilize to acquire data. Methods that we will cover in this article are:
Public Data
There are several open source datasets that are hosted online that allow you to freely download data collected by others, offering solutions to a wide range of data science and machine learning applications. These public sources of data are often suitable for small to medium sized machine learning projects, concept validation, and research purposes. Some of the most commonly visited are:
- GitHub
- Kaggle
- KDnuggets
- UCI Machine Learning Repository
- US Government’s Open Data
- Five Thirty Eight
- Amazon Web Services
Private Data
There are also a number of private datasets that businesses curate themselves and that are under the ownership of the company. For instance, Netflix’s database of user preferences powers its immense recommendation systems. There are also services that allow you to purchase datasets, such as data markets like Data & Sons, or crowd-sourcing marketplaces such as Amazon’s Mechanical Turk, where one can outsource data acquisition needs ranging from data validation and research to survey participation. We often find private data used within large, production-scale settings.
Pros:
- Time: Readily available datasets can quickly move a project to the next phase of the Data Science Life Cycle.
- Cost: Public datasets can cut the costs of collecting data down to zero.
Cons:
- Messy: Data can often come in forms that require intensive cleaning and modification.
- Cost: Private services can lead to high costs in acquiring data.
Web scraping can be one of the most potent methods of data acquisition when used effectively. Web scraping at its core is the act of extracting or copying information directly from a website. This data can then be stored within a dataframe or spreadsheet and used in the same manner as any other dataset. There are two general methods for web scraping, one where you will manually scrape the data from a web page and the other where you employ a web crawler or bot that automates the process of extracting data from a website.
Python has useful libraries that allow you to implement both methods of web scraping, some of the most commonly used are BeautifulSoup, Selenium, and Scrapy. Web scraping is best used when the data you need is not available as a public or private dataset, but present on a web page. We typically employ web scraping techniques to acquire data for small to medium sized projects, but rarely in production as this can raise ownership and copyright issues.
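As a minimal sketch of the manual approach with requests and BeautifulSoup (the URL and tag choices are placeholders, and both libraries must be installed):
import requests
from bs4 import BeautifulSoup

# Fetch the page; a real project should also respect robots.txt and rate limits
response = requests.get("https://example.com")

# Parse the HTML and pull out every link's text and address
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.find_all("a"):
  print(link.get_text(), link.get("href"))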
Pros:
- Versatile: A highly adaptable method for acquiring data from the internet.
- Scalable: Distributed bots can be coordinated to retrieve large quantities of data.
Cons:
- Language Barrier: Multiple languages are involved when scraping, requiring knowledge of languages not typically used for data science.
- Legality: Excessive or improper web scraping can be illegal, disrupt a website’s functionality, and lead to your IP address being blacklisted from the site.
Application Programming Interfaces, most often referred to as APIs, are another method that can be used for data acquisition. We can imagine APIs as a more polite and ‘by the books’ way of web scraping, where in the process of acquiring data from a site we request permission to access some data and wait for a response from the site to fulfill the request. Unlike web scraping, APIs are a means of communication between two different software systems.
Typically this communication is achieved in the form of an HTTP Request/Response Cycle, where a client (you) sends a request to a website’s server for data through an API call. The server then searches within its databases for the particular data requested and responds to the client either with the data or with an error stating that the request cannot be fulfilled.
Pros:
- User & Site Friendly: APIs allow security and management of resources for the sites that data is being requested from.
- Scalable: APIs can allow for various amounts of data to be requested, up to production-scale volumes.
Cons:
- Limited: Some functions or data may not be accessible via an API.
- Cost: Some API calls can be quite expensive, leading to limitations for certain functions and projects.
What can we do when the data we need to analyze is not present on the internet or able to be outsourced? In these situations, it can be useful to know how to harvest data yourself, and there are many tools available to do so. Perhaps you would like to conduct hypothesis testing to determine public sentiment about a particular good or service, or maybe you would like to analyze data about animal populations of a local habitat. Being able to acquire the data to conduct these analyses yourself is an invaluable skill to have.
Google Forms is a simple and free way to create surveys that can be shared with others to acquire data related to a particular population. With Google Forms, you can create surveys that include video and image files, multiple choice, and short answer questions that can be posted on social media, sent out in email lists, and even printed to manually harvest data. The data acquired can then be downloaded as a CSV file and used within Python. Google also offers Google Surveys, a paid service that allows you to have a wider range of respondents and gives you more control in determining your target audience.
Pros:
- Bespoke: Data acquired manually can be made to address all of the business objective’s needs, and can need little to no cleaning to be effective.
- Community: Manually acquired data can be useful in advancing the fields of data science and machine learning, and can be further explored by others.
Cons:
- Time: Manually acquired data can take more time to collect than other methods, especially when dealing with large datasets.
- Cost: Some services and devices used to manually acquire data can be expensive and can be limiting for certain projects.
→ How to build your own dataset for data science projects
In this article, you will learn how to create your own dataset for a data science project. This is helpful if the dataset you want is not available and you need to build your own.
Details
In this lesson, we will focus on structured data, as it is the data found within relational databases.
Structured Data
Structured data is the most organized type of data. It follows a data model, which is a kind of blueprint that defines and describes the structure of the data. Structured data is typically in the form of tables that are well defined by their rows and columns, a great example being a DataFrame from the popular Python library pandas.
Some advantages of structured data are that its explicit structure aids in organized storage and accessibility. This also allows it to be easily indexed for efficient reference. With structured data, we are able to set permissions for security purposes, so that only those that meet predetermined guidelines and clearances can access the data.
Examples of structured data are:
- Online forms
- Excel Spreadsheets
- SQL Databases
Relational Database Management Systems
Another advantage of working with structured data is the availability of Relational Database Management Systems (RDBMS). RDBMS are important for data science and analytics because they provide the functionality needed for creating, reading, updating, and deleting data (often referred to as CRUD) within our database. The language that data teams most often use in an RDBMS to execute commands is Structured Query Language (SQL), pronounced “S-Q-L” or “sequel”.
Structured Query Language (SQL) is one of the most common and powerful languages for querying databases. It is fast, secure, and able to return millions of data points in just a few lines. While there are hundreds of RDBMS, some of the most common RDBMS that use SQL are:
- MySQL is a popular free and open-source SQL database. It is widely used for web applications because the MySQL Server is renowned for its speed, reliability, and ease of use on multiple platforms.
- Much like MySQL, PostgreSQL is a popular open-source SQL database. PostgreSQL is one of the oldest RDBMS, with over 30 years of development behind it, so it has an extensive community supporting it and is known for its reliability and array of features.
- Oracle is considered to be among the most popular of all RDBMS. Owned by Oracle Corporation and closed source, Oracle DB is the go-to RDBMS for corporations as it is able to scale for and support their massive workloads effectively. Licensing for Oracle DB, however, is known to be expensive and can be infeasible for certain applications.
- SQL Server is another privately-owned RDBMS that is popular, especially among corporations. While Microsoft offers SQL Server free through its SQL Server 2019 Express edition, the enterprise editions that are designed for large-scale applications with more functionality become more expensive as your application scales.
- SQLite is another popular open-source SQL database. SQLite is designed to be compact, efficient, and self-contained. SQLite is able to store a complete database in a single cross-platform disk file, so it is not necessary to connect databases to a server. These characteristics are what make SQLite the most used RDBMS, as it is found in most cell phones, computers, and several other everyday devices. Throughout our course, we will use SQLite for working with relational databases.
SQL
Structured Query Language (SQL) is a programming language designed to manage data stored in relational databases. SQL operates through simple, declarative statements. This keeps data accurate and secure, and helps maintain the integrity of databases, regardless of size.
The SQL language is widely used today across web frameworks and database applications. Knowing SQL gives you the freedom to explore your data, and the power to make better decisions. By learning SQL, you will also learn concepts that apply to nearly every data storage system.
Relational Databases
SELECT * FROM celebs;
A relational database is a database that organizes information into one or more tables. Here, the relational database contains one table.
A table is a collection of data organized into rows and columns. Tables are sometimes referred to as relations. Here the table is celebs.
A column is a set of data values of a particular type. Here, id, name, and age are the columns.
A row is a single record in a table. The first row in the celebs table has:
- An id of 1
- A name of Justin Bieber
- An age of 22
All data stored in a relational database is of a certain data type. Some of the most common data types are:
- INTEGER, a positive or negative whole number
- TEXT, a text string
- DATE, the date formatted as YYYY-MM-DD
- REAL, a decimal value
Statements
The code below is a SQL statement. A statement is text that the database recognizes as a valid command. Statements always end in a semicolon ;.
CREATE TABLE table_name (
column_1 data_type,
column_2 data_type,
column_3 data_type
);
Let’s break down the components of a statement:
- CREATE TABLE is a clause. Clauses perform specific tasks in SQL. By convention, clauses are written in capital letters. Clauses can also be referred to as commands.
- table_name refers to the name of the table that the command is applied to.
- (column_1 data_type, column_2 data_type, column_3 data_type) is a parameter. A parameter is a list of columns, data types, or values that are passed to a clause as an argument. Here, the parameter is a list of column names and the associated data types.
Create
CREATE statements allow us to create a new table in the database. You can use the CREATE statement anytime you want to create a new table from scratch. The statement below creates a new table named celebs.
CREATE TABLE celebs (
id INTEGER,
name TEXT,
age INTEGER
);
- CREATE TABLE is a clause that tells SQL you want to create a new table.
- celebs is the name of the table.
- (id INTEGER, name TEXT, age INTEGER) is a list of parameters defining each column, or attribute, in the table and its data type:
  - id is the first column in the table. It stores values of data type INTEGER.
  - name is the second column in the table. It stores values of data type TEXT.
  - age is the third column in the table. It stores values of data type INTEGER.
Insert
The INSERT statement inserts a new row into a table.
We can use the INSERT statement when we want to add new records. The statement below enters a record for Justin Bieber into the celebs table.
INSERT INTO celebs (id, name, age)
VALUES (1, 'Justin Bieber', 22);
- INSERT INTO is a clause that adds the specified row or rows.
- celebs is the table the row is added to.
- (id, name, age) is a parameter identifying the columns that data will be inserted into.
- VALUES is a clause that indicates the data being inserted.
- (1, 'Justin Bieber', 22) is a parameter identifying the values being inserted.
  - 1: an integer that will be added to the id column
  - 'Justin Bieber': text that will be added to the name column
  - 22: an integer that will be added to the age column
Select
SELECT statements are used to fetch data from a database. In the statement below, SELECT returns all data in the name column of the celebs table.
SELECT name FROM celebs;
- SELECT is a clause that indicates that the statement is a query. You will use SELECT every time you query data from a database.
- name specifies the column to query data from.
- FROM celebs specifies the name of the table to query data from. In this statement, data is queried from the celebs table.
Alter
The ALTER TABLE statement adds a new column to a table. You can use this command when you want to add columns to a table. The statement below adds a new column twitter_handle to the celebs table.
ALTER TABLE celebs
ADD COLUMN twitter_handle TEXT;
- ALTER TABLE is a clause that lets you make the specified changes.
- celebs is the name of the table that is being changed.
- ADD COLUMN is a clause that lets you add a new column to a table:
  - twitter_handle is the name of the new column being added
  - TEXT is the data type for the new column
- NULL is a special value in SQL that represents missing or unknown data. Here, the rows that existed before the column was added have NULL (∅) values for twitter_handle.
Update
The UPDATE statement edits a row in a table. You can use the UPDATE statement when you want to change existing records. The statement below updates the record with an id value of 4 to have the twitter_handle @taylorswift13.
UPDATE celebs
SET twitter_handle = '@taylorswift13'
WHERE id = 4;
- UPDATE is a clause that edits a row in the table.
- celebs is the name of the table.
- SET is a clause that indicates the column to edit.
  - twitter_handle is the name of the column that is going to be updated
  - @taylorswift13 is the new value that is going to be inserted into the twitter_handle column
- WHERE is a clause that indicates which row(s) to update with the new column value. Here the row with a 4 in the id column is the row that will have the twitter_handle updated to @taylorswift13.
Delete
The DELETE FROM statement deletes one or more rows from a table. You can use the statement when you want to delete existing records. The statement below deletes all records in the celebs table with no twitter_handle:
DELETE FROM celebs
WHERE twitter_handle IS NULL;
- DELETE FROM is a clause that lets you delete rows from a table.
- celebs is the name of the table we want to delete rows from.
- WHERE is a clause that lets you select which rows you want to delete. Here we want to delete all of the rows where the twitter_handle column IS NULL.
- IS NULL is a condition in SQL that returns true when the value is NULL and false otherwise.
Constraints
Constraints that add information about how a column can be used are invoked after specifying the data type for a column. They can be used to tell the database to reject inserted data that does not adhere to a certain restriction. The statement below sets constraints on the celebs table.
CREATE TABLE celebs (
id INTEGER PRIMARY KEY,
name TEXT UNIQUE,
date_of_birth TEXT NOT NULL,
date_of_death TEXT DEFAULT 'Not Applicable'
);
- PRIMARY KEY columns can be used to uniquely identify the row. Attempts to insert a row with an identical value to a row already in the table will result in a constraint violation, which will not allow you to insert the new row.
- UNIQUE columns have a different value for every row. This is similar to PRIMARY KEY except a table can have many different UNIQUE columns.
- NOT NULL columns must have a value. Attempts to insert a row without a value for a NOT NULL column will result in a constraint violation and the new row will not be inserted.
- DEFAULT columns take an additional argument that will be the assumed value for an inserted row if the new row does not specify a value for that column.
Intro to SQL:
SQL Commands:
- ALTER TABLE
- AND
- AS
- AVG()
- BETWEEN
- CASE
- COUNT()
- CREATE TABLE
- DELETE
- GROUP BY
- HAVING
- INNER JOIN
- INSERT
- IS NULL / IS NOT NULL
- LIKE
- LIMIT
- MAX()
- MIN()
- OR
- ORDER BY
- OUTER JOIN
- ROUND()
- SELECT
- SELECT DISTINCT
- SUM
- UPDATE
- WHERE
- WITH
Thinking in SQL vs Thinking in Python
→ Thinking in SQL vs Thinking in Python
In this article, you will learn tips on how to use Python scripts with SQL queries after only using SQL queries on their own. This is helpful if you want to integrate the flexibility and expanded capabilities of Python methods into the simplicity and directness of SQL queries.
Tutorial
In this tutorial, you will learn how to access a SQLite database in a Python environment. This is helpful for accessing and manipulating datasets that are not located on local devices. Some benefits to this include:
- Saving memory space on your local drive.
- Reducing the amount of processing power needed to manipulate large datasets.
- Providing more data security.
- Allowing for seamless transitions as the given dataset changes.
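A minimal sketch of the idea with Python's built-in sqlite3 module (the database file name and the celebs table reuse the earlier example and are assumed to exist):
import sqlite3

# Connect to a database file (created if it does not already exist)
connection = sqlite3.connect("celebs.db")
cursor = connection.cursor()

# Run an ordinary SQL query and fetch the results into Python
cursor.execute("SELECT name, age FROM celebs;")
for name, age in cursor.fetchall():
  print(name, age)

connection.close()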
Details
- Introduction
- Rules of Scraping
- Requests
- The BeautifulSoup Object
- Object Types
- Navigating by Tags
- Find All
- Select for CSS Selectors
- Reading Text
- Review
Details
A function is an object that is able to accept some sort of input, possibly modify it, and return some sort of output. In Python, a lambda function is a one-line shorthand for a function. A simple lambda function might look like this:
add_two = lambda my_input: my_input + 2
So this code:
print(add_two(3))
print(add_two(100))
print(add_two(-2))
would print:
>>> 5
>>> 102
>>> 0
Let’s break this syntax down:
- The function is stored in a variable called add_two.
- lambda declares that this is a lambda function (if you are familiar with normal Python functions, this is similar to how we use def to declare a function).
- my_input is what we call the input we are passing into add_two.
- We are returning my_input plus 2 (with normal Python functions, we use the keyword return).
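Lambdas are most useful when a small throwaway function is passed to another function. A quick sketch (the data is arbitrary):
song_lengths = [193, 240, 178, 305]

# Use a lambda as the sort key: sort by distance from 200 seconds
closest_to_200 = sorted(song_lengths, key=lambda length: abs(length - 200))
print(closest_to_200)

# Use a lambda with map() to convert seconds to minutes
in_minutes = list(map(lambda length: length / 60, song_lengths))
print(in_minutes)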
Lambda Function Code Challenge:
Details
Creating, Loading, Selecting Data with Pandas:
- Create a DataFrame I
- Create a DataFrame II
- Loading and Saving CSVs
- Inspect a DataFrame
- Select Columns
- Selecting Multiple Columns
- Select Rows
- Selecting Multiple Rows
- Select Rows with Logic I
- Select Rows with Logic II
- Select Rows with Logic III
- Setting indices
- Review
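A small sketch of creating and selecting data with pandas (the column names and values are made up):
import pandas as pd

df = pd.DataFrame({
  "name": ["Noelle", "Ava", "Sam", "Mia"],
  "height": [61, 70, 67, 64],
})

print(df.head())                 # inspect the first rows
print(df["height"])              # select a single column
print(df[["name", "height"]])    # select multiple columns
print(df[df.height > 65])        # select rows with logic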
Modifying DataFrames:
- Adding a Column I
- Adding a Column II
- Adding a Column III
- Performing Column Operations
- Reviewing Lambda Function
- Reviewing Lambda Function: If Statements
- Applying a Lambda to a Column
- Applying a Lambda to a Row
- Renaming Columns
- Renaming Columns II
- Review
Documentation - Pandas
10 minutes to Pandas User Guide
In this documentation you will learn about pandas, a Python library for data analysis. This library is helpful if you would like to perform data manipulation (slicing, dicing, joining, merging, grouping) and analysis.
Documentation - NumPy
Absolute Beginner's Guide to NumPy
In this documentation you will learn about the Python library NumPy. This is helpful if you would like to perform mathematical operations on arrays and matrices.
Details
Introduction
In this lesson, you will learn about aggregates in pandas. An aggregate statistic is a way of creating a single number that describes a group of numbers. Common aggregate statistics include the mean, median, and standard deviation.
You will also learn how to rearrange a DataFrame into a pivot table, which is a great way to compare data across two dimensions.
Aggregates:
- Calculating Column Statistics
- Calculating Aggregate Functions I
- Calculating Aggregate Functions II
- Calculating Aggregate Functions III
- Calculating Aggregate Functions IV
- Pivot Tables
- Review
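A brief sketch of a groupby and a pivot table in pandas (the data is invented for the example):
import pandas as pd

orders = pd.DataFrame({
  "shoe_type": ["sandal", "boot", "sandal", "boot"],
  "shoe_color": ["black", "black", "brown", "brown"],
  "price": [80, 120, 70, 140],
})

# Aggregate a single column statistic per group
print(orders.groupby("shoe_type").price.mean())

# Rearrange the same aggregation into a pivot table
pivot = orders.pivot_table(index="shoe_type", columns="shoe_color",
                           values="price", aggfunc="mean")
print(pivot)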
Documentation
Pandas Tutorial - Pandas.Dataframe.Groupby
In this documentation, you will learn about how to group data using pandas.dataframe.groupby. This is helpful if you would like to group large amounts of data and compute operations on those groups.
Documentation
Pandas Documentation - Reshaping and Pivot Tables
In this documentation, you will learn about reshaping dataframe objects to visualize data in different formats. This is helpful if you would like to explore the relationships among variables by representing the data in different ways.
Details
Instead, we can split our data into three tables:
- orders would contain the information necessary to describe an order: order_id, customer_id, product_id, quantity, and timestamp
- products would contain the information to describe each product: product_id, product_description, and product_price
- customers would contain the information for each customer: customer_id, customer_name, customer_address, and customer_phone_number
In this lesson, we will learn the Pandas commands that help us work with data stored in multiple tables.
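A minimal sketch of joining two such tables with pandas (the values are invented for illustration):
import pandas as pd

orders = pd.DataFrame({
  "order_id": [1, 2],
  "customer_id": [101, 102],
  "product_id": [11, 12],
})
customers = pd.DataFrame({
  "customer_id": [101, 102],
  "customer_name": ["Ines", "Theo"],
})

# Inner merge on the shared customer_id column
orders_with_names = pd.merge(orders, customers, on="customer_id")
print(orders_with_names)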
Multiple Tables:
- Intro
- Inner Merge I
- Inner Merge II
- Inner Merge III
- Merge on Specific Columns
- Merge on Specific Columns II
- Mismatched Merges
- Outer Merge
- Left and Right Merge
- Concatenate DataFrames
- Review
Documentation
Pandas Documentation - Pandas.Merge
In this documentation, you will learn about merging dataframes in Pandas. This is helpful if you would like to combine data from two dataframes into a single dataframe.
Details
A regular expression is a special sequence of characters that describe a pattern of text that should be found, or matched, in a string or document. By matching text, we can identify how often and where certain pieces of text occur, as well as have the opportunity to replace or update these pieces of text if needed.
Regular Expressions have a variety of use cases including:
- validating user input in HTML forms
- verifying and parsing text in files, code and applications
- examining test results
- finding keywords in emails and web pages
Concepts:
- Regular expressions are special sequences of characters that describe a pattern of text that is to be matched.
- We can use literals to match the exact characters that we desire.
- Alternation, using the pipe symbol |, allows us to match the text preceding or following the |.
- Character sets, denoted by a pair of brackets [], let us match one character from a series of characters.
- Wildcards, represented by the period or dot ., will match any single character (letter, number, symbol or whitespace).
- Ranges allow us to specify a range of characters in which we can make a match.
- Shorthand character classes like \w, \d and \s represent the ranges of word characters, digit characters, and whitespace characters, respectively.
- Groupings, denoted with parentheses (), group parts of a regular expression together, and allow us to limit alternation to part of a regex.
- Fixed quantifiers, represented with curly braces {}, let us indicate the exact quantity or a range of quantities of a character we wish to match.
- Optional quantifiers, indicated by the question mark ?, allow us to indicate that a character in a regex is optional, or can appear either 0 times or 1 time.
- The Kleene star, denoted with the asterisk *, is a quantifier that matches the preceding character 0 or more times.
- The Kleene plus, denoted by the plus +, matches the preceding character 1 or more times.
- The anchor symbols hat ^ and dollar sign $ are used to match text at the start and end of a string, respectively.
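Python's built-in re module lets you try these patterns out. A short sketch (the pattern and text are arbitrary examples):
import re

text = "My cat and my dog were born in 2019 and 2021."

# Alternation and grouping: match either "cat" or "dog"
print(re.findall(r"(cat|dog)", text))

# Shorthand character class and a fixed quantifier: four-digit years
print(re.findall(r"\d{4}", text))

# Anchor: does the string start with "My"?
print(bool(re.match(r"^My", text)))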
Resource
RegExr Regular Expression Builder
In this resource, you will learn how to build regular expressions. This is helpful if you wish to experiment with regular expressions before using them professionally.
Details
When we receive raw data, we have to do a number of things before we’re ready to analyze it, possibly including:
- diagnosing the “tidiness” of the data: how much data cleaning we will have to do
- reshaping the data: getting the right rows and columns for effective analysis
- combining multiple files
- changing the types of values: how we fix a column where numerical values are stored as strings, for example
- dropping or filling missing values: how we deal with data that is incomplete or missing
- manipulating strings to represent the data better
We will go through the techniques data scientists use to accomplish these goals by looking at some “unclean” datasets and trying to get them into a good, clean state.
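A small sketch of a few of these cleaning steps in pandas (the DataFrame is invented and intentionally messy):
import pandas as pd

df = pd.DataFrame({
  "name": ["Ada", "Ada", "Grace", None],
  "price": ["$12", "$12", "$30", "$25"],
})

# Drop exact duplicate rows
df = df.drop_duplicates()

# String parsing: strip the "$" and convert the column to a numeric type
df["price"] = df["price"].str.replace("$", "", regex=False).astype(int)

# Deal with missing values by dropping incomplete rows
df = df.dropna(subset=["name"])
print(df)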
Cleaning Data:
- Intro
- Diagnose the Data
- Dealing with Multiple Files
- Reshaping your Data
- Dealing with Duplicates
- Splitting by Index
- Splitting by Character
- Looking at Types
- String Parsing
- More String Parsing
- Missing Values
Documentation
In this resource you will learn how to work with null or missing values in pandas. This is helpful if you wish to learn the library’s vast array of such tools beyond NaN.
Details
Generally, variables will come in two varieties: categorical and quantitative. Categorical variables group observations into separate categories that can be ordered or unordered. Quantitative variables, on the other hand, are variables expressed numerically, whether as a count or a measurement.
Let’s dive a bit deeper into the different variable types to understand how to identify them in a dataset.
- Quantitative Variables
We can think of quantitative variables as any information about an observation that can only be described with numbers. Quantitative variables are generally counts or measurements of something (e.g., number of points earned in a game or height). They are well suited for mathematical operations and quantitative analysis, and are helpful for answering questions like “How many/much?”, “What is the average?”, or “How often?”. There are two types of quantitative variables, discrete and continuous, and they both help to serve different functions in a dataset.
a. Discrete Variables
Discrete quantitative variables are numeric values that represent counts and can only take on integer values. They represent whole units that can not be broken down into smaller pieces, and as such cannot be meaningfully expressed with decimals or fractions. Examples of discrete variables are the number of children in a person’s family or the number of coin flips a person makes. Unless we are working with quantum mechanics, we can not meaningfully have flipped a coin 3.5 times, or have 4.75 sisters.
b. Continuous Variables
Continuous quantitative variables are numeric measurements that can be expressed with decimal precision. Theoretically, continuous variables can take on infinitely many values within a given range. Examples of continuous variables are length, weight, and age which can all be described with decimal values.
- Categorical Variables
Categorical variables differ from quantitative variables in that they focus on the different ways data can be grouped rather than counted or measured. With categorical variables, we want to understand how the observations in our dataset can be grouped and separated from one another based on their attributes. When the groupings have a specific order or ranking, the variable is an ordinal categorical variable. If there is no apparent order or ranking to the categories, we refer to the variable as a nominal categorical variable.
a. Ordinal Variables
Do you remember working with a column in a dataset where the values of the column were groups that were greater or lesser than each other in some intrinsic way? Suppose there was a variable containing responses to the question “Rate your agreement with the statement: The minimum age to drive should be lowered.” The response options are “strongly disagree”, “disagree”, “neutral”, “agree”, and “strongly agree”. Because we can see an order where “strongly disagree” < “disagree” < “neutral” < “agree” < “strongly agree” in relation to agreement, we consider the variable to be ordinal.
b. Nominal Variables
Nominal categorical variables are those variables with two or more categories that do not have any relational order. Examples of nominal categories could be states in the U.S., brands of computers, or ethnicities. Notice how for each of these variables, there is no intrinsic ordering that distinguishes a category as greater than or less than another category.
c. Binary Variables
Binary or dichotomous variables are a special kind of nominal variable that have only two categories. Because there are only two possible values for binary variables, they are mutually exclusive to one another. We can imagine a variable that describes if a picture contains a cat or a dog as a binary variable. In this case, if the picture is not a dog, it must be a cat, and vice versa. Binary variables can also be described with numbers, similar to bits, with 0 or 1 values. Likewise you may find binary variables containing boolean values of True or False.
Variable Types:
- Assessing Variable Types
- Categorical Variables
- Quantitative Variables
- Matching Quantitative Variables
- Matching Categorical Variables
- Altering the Data Types of Pandas Dataframes
- The Pandas Category Data Type
- Label Encoding with .cat.codes
- One-Hot Encoding
- Review
Review
In this lesson, you have:
- Discovered the different types of variables you will encounter when working with data and their corresponding data types in Python.
- Explored datasets with .head().
- Assessed categories within variables with the .unique() method.
- Practiced ways to check the data type of variables, like the .dtypes attribute and .info() method.
- Altered data with the .replace() method.
- Learned how to change the data types of variables using the .astype() method.
- Investigated the pandas category data type.
- Developed your Label Encoding and One-Hot Encoding skills with the .cat.codes accessor and pd.get_dummies() method.
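A compact sketch of the encoding steps from this lesson (the data is made up):
import pandas as pd

df = pd.DataFrame({"shirt_size": ["S", "M", "L", "M"]})

# Convert to the pandas category data type, with an explicit order
df["shirt_size"] = pd.Categorical(df["shirt_size"],
                                  categories=["S", "M", "L"], ordered=True)

# Label encoding: each category becomes an integer code
df["size_code"] = df["shirt_size"].cat.codes

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["shirt_size"])
print(df, one_hot, sep="\n")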
Details
1) Mean
The mean, often referred to as the average, is a way to measure the center of a dataset.
The average of a set is calculated using a two-step process:
- Add all of the observations in your dataset.
- Divide the total sum from step one by the number of points in your dataset.
Mean:
2) Median
The formal definition for the median of a dataset is:
The value that, assuming the dataset is ordered from smallest to largest, falls in the middle. If there are an even number of values in a dataset, you either report both of the middle two values or their average.
There are always two steps to finding the median of a dataset:
- Order the values in the dataset from smallest to largest
- Identify the number(s) that fall(s) in the middle
Median:
3) Mode
The formal definition for the mode of a dataset is:
The most frequently occurring observation in the dataset. A dataset can have multiple modes if there is more than one value with the same maximum frequency.
While you may be able to find the mode of a small dataset by simply looking through it, if you have trouble, we recommend you follow these two steps:
- Find the frequency of every unique number in the dataset
- Determine which number has the highest frequency
Mode:
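A quick sketch that computes all three with NumPy and the built-in statistics module (the dataset is arbitrary):
import numpy as np
from statistics import multimode

data = [2, 4, 4, 7, 9, 4, 1]

print(np.mean(data))      # sum of observations divided by their count
print(np.median(data))    # middle value once the data is sorted
print(multimode(data))    # most frequent value(s); here [4]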
Documentation
In this documentation, you will learn about the different functions for statistical analysis within the Python library NumPy. This is helpful if you would like to understand ways to analyze data using NumPy.
Details
1) Variance
Variance is a descriptive statistic that describes how spread out the points in a data set are.
Variance:
2) Standard Deviation
Standard deviation is computed by taking the square root of the variance.
Standard Deviation:
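A quick sketch with NumPy (the dataset is arbitrary):
import numpy as np

data = [61, 70, 67, 64]

variance = np.var(data)            # average squared distance from the mean
standard_deviation = np.std(data)  # square root of the variance

print(variance, standard_deviation)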
Details
1) Histogram
There’s no better tool to visualize the uncertainty and chaos in data than a histogram. A histogram displays the distribution of your underlying data.
Histograms reveal interpretable trends in your data. They don’t provide a yes or no answer, but they are often used as a starting point for discussion and for informing answers to questions about your data.
Histogram:
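A minimal sketch of plotting a histogram with Matplotlib (the data here is randomly generated just for illustration):
import numpy as np
import matplotlib.pyplot as plt

# 1,000 values drawn from a normal distribution
data = np.random.normal(loc=100, scale=15, size=1000)

plt.hist(data, bins=20)
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()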
Details
1) Quartiles
A common way to communicate a high-level overview of a dataset is to find the values that split the data into four groups of equal size. By doing this, we can then say whether a new datapoint falls in the first, second, third, or fourth quarter of the data.
The values that split the data into fourths are the quartiles. Those values are called the first quartile (Q1), the second quartile (Q2), and the third quartile (Q3).
In the example from the lesson, Q1 is 10, Q2 is 13, and Q3 is 22. Those three values split the data into four groups that each contain five datapoints.
Quartiles:
2) Quantiles
Quantiles are points that split a dataset into groups of equal size. For example, let’s say you just took a test and wanted to know whether you’re in the top 10% of the class. One way to determine this would be to split the data into ten groups with an equal number of datapoints in each group and see which group you fall into.
For a class of 30 test scores, there are nine values that split the dataset into ten groups of equal size, with 3 test scores in each group.
Those nine values that split the data are quantiles! Specifically, they are the 10-quantiles, or deciles.
Quantiles:
3) Interquartile Range
The interquartile range (IQR) is a descriptive statistic that tries to solve this problem. The IQR ignores the tails of the dataset, so you know the range around which your data is centered.
In the example from the lesson, most of the data is between 0 and 15. However, there is one large negative outlier (-20) and one large positive outlier (40). This makes the range of the dataset 60 (the difference between 40 and -20). That’s not very representative of the spread of the majority of the data!
Interquartile Range:
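A minimal NumPy/SciPy sketch of quartiles, deciles, and the IQR (the outlier-heavy data below is invented to mirror the example):

```python
import numpy as np
from scipy import stats

data = np.array([-20, 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 40])

q1, q2, q3 = np.quantile(data, [0.25, 0.5, 0.75])      # quartiles
deciles = np.quantile(data, np.arange(0.1, 1.0, 0.1))  # the nine 10-quantiles
iqr = q3 - q1                                          # interquartile range
print(q1, q2, q3, iqr)
print(stats.iqr(data))                                 # same value via SciPy
```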
4) Boxplot
Boxplots are one of the most common ways to visualize a dataset. Like histograms, boxplots give you a sense of the central tendency and spread of the data.
Take a look at the boxplot on this page. This boxplot is visualizing a dataset containing the lengths of 9,975 songs. We’ll highlight some of the features of the boxplot and will dig into them in more detail in this lesson:
- The line in the center of the box is the median.
- The edges of the box are the first and third quartiles. This makes the length of the box the interquartile range — the middle 50% of your data.
- The whiskers of the boxplot extend to include most of the data. There are many different ways to calculate the length of the whiskers.
- Outliers are points that fall beyond the whiskers. Those points are represented with dots. In the boxplot we’re showing, there are many outliers.
Review
Here are some of the major takeaways from boxplots:
- The box of a boxplot visualizes the median, first quartile, and third quartile of a dataset.
- The length of the box in a boxplot visualizes the interquartile range.
- The whiskers extend from the box 1.5 times the size of the interquartile range.
- Outliers are points that fall outside of the whiskers. They’re represented by dots.
- Boxplots are especially useful for comparing the spread of multiple datasets.
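A minimal Matplotlib sketch comparing the spread of two made-up samples side by side:

```python
import numpy as np
import matplotlib.pyplot as plt

group_a = np.random.normal(240, 30, size=500)   # e.g. song lengths in seconds
group_b = np.random.normal(260, 60, size=500)

plt.boxplot([group_a, group_b])
plt.xticks([1, 2], ["Group A", "Group B"])
plt.ylabel("Length (seconds)")
plt.show()
```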
Details
When exploring data, we’re often interested in summarizing a large amount of information with a single number or visualization.
Depending on what we’re trying to understand from our data, we may need to rely on different statistics. For quantitative data, we can summarize central tendency using mean, median or mode and we can summarize spread using standard deviation, variance, or percentiles. However, when working with categorical data, we may not be able to use all the same summary statistics.
Details
1) Bar Chart
Once you have your distinct categories, a bar chart is best used to display the different value counts of each category. We can also compare means, but we would recommend using a side-by-side box plot for that instead because they give a full picture of the five-number summary.
Bar Charts vs. Histograms
Finally, we have one last thing to go over before we jump into coding our own charts. If you have ever run into histograms, you may notice that bar charts and histograms look almost identical. However, these are the key differences between them:
- Bar charts are used for categorical variables, while histograms are used for quantitative data.
- Histograms must always be arranged in a specific order because they represent a range of numerical values. For a bar chart, each bar represents the frequency of a particular category of a categorical variable. Unless the variable is ordinal, the bars could be arranged in any order.
Bar Chart:
2) Pie Chart
Pie charts are made up of slices that are combined to make a full circle. Each slice represents a proportion, and the sum of each proportion (slice) adds up to 1 (or 100%).
These slices are meant to give viewers a relative size of the data, similar to how bars in a bar chart act. The arc length (size of the slice’s angle) of each slice is proportional to the quantity it represents. This also means that the area of each slice is proportional to this quantity as well.
Pie Chart:
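A minimal Matplotlib sketch of both chart types, using invented category counts:

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]
counts = [35, 50, 15]

plt.bar(categories, counts)   # value counts per category
plt.ylabel("Count")
plt.show()

plt.pie(counts, labels=categories, autopct="%1.0f%%")  # slices as proportions of the whole
plt.axis("equal")
plt.show()
```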
Details
1) Quantitative and Categorical Variables
Examining the relationship between variables can give us key insight into our data. In this lesson, we will cover ways of assessing the association between a quantitative variable and a categorical variable.
- Intro
- Mean and Median Differences
- Side-by-Side Box Plots
- Inspecting Overlapping Histograms
- Exploring Non-Binary Categorical Variables
- Review
2) Two Quantitative Variables
When associations exist between variables, it means that information about the value of one variable gives us information about the value of the other variable. In this lesson, we will cover ways of examining an association between two quantitative variables.
3) Two Categorical Variables
Details
Details
1) Central Limit Theorem (CLT)
The Central Limit Theorem (CLT) is a powerful statistical tool that is useful in quantifying uncertainty around sample mean estimates. It is also the basis for common hypothesis tests, such as Z- and t-tests. A formal proof of the CLT requires some complex math, but this article will demonstrate it using a simulation!
It’s now time to formally define the CLT, which tells us that the sampling distribution of the mean:
- is normally distributed (for large enough sample size)
- is centered at the population mean
- has standard deviation equal to the population standard deviation divided by the square root of the sample size. This is called Standard Error.
In real life, the data scientist is still stuck with their one sample mean as a best guess for the population mean. However, they can leverage the CLT to estimate the standard error — the amount of variation in imagined, repeated samples!
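A minimal simulation sketch of the CLT using NumPy (the skewed population below is made up):

```python
import numpy as np

population = np.random.exponential(scale=10, size=100_000)  # deliberately non-normal

sample_size = 50
sample_means = [np.mean(np.random.choice(population, size=sample_size))
                for _ in range(10_000)]

print(np.mean(sample_means))                      # close to the population mean
print(np.std(sample_means))                       # close to the standard error:
print(np.std(population) / np.sqrt(sample_size))  # population std / sqrt(n)
```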
2) Introduction to Hypothesis Testing
It is important to remember the following:
- A p-value is a probability, usually reported as a decimal between zero and one.
- A small p-value means that an observed sample statistic (or something more extreme) would be unlikely to occur if the null hypothesis is true.
- A significance threshold can be used to translate a p-value into a “significant” or “non-significant” result.
- In practice, the alternative hypothesis and significance threshold should be chosen prior to data collection.
3) One Sample t Test in SciPy
One-sample t-tests are used for comparing a sample average to a hypothetical population average. For example, a one-sample t-test might be used to address questions such as:
- Is the average amount of time that visitors spend on a website different from 5 minutes?
- Is the average amount of money that customers spend on a purchase more than 10 USD?
t Test:
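A minimal SciPy sketch of the first question, with invented visit durations and a hypothesized mean of 5 minutes:

```python
import numpy as np
from scipy.stats import ttest_1samp

times = np.array([4.2, 5.1, 6.3, 3.8, 5.9, 4.7, 5.5, 6.1, 4.9, 5.3])

tstat, pval = ttest_1samp(times, popmean=5)   # H0: the true mean is 5 minutes
print(tstat, pval)
```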
4) Simulating a Binomial Test
Binomial tests are similar to one-sample t-tests in that they test a sample statistic against some population-level expectation. The difference is that:
- binomial tests are used for binary categorical data to compare a sample frequency to an expected population-level probability
- one-sample t-tests are used for quantitative data to compare a sample mean to an expected population mean.
Binomial Test:
- Intro
- Summarizing the Sample
- Simulating Randomness
- Simulating the Null Distribution I
- Simulating the Null Distribution II
- Inspecting the Null Distribution
- Confidence Intervals
- Calculating a One-Sided P-Value
- Calculating a Two-Sided P-Value
- Writing a Binomial Test Function
- Binomial Testing with SciPy
- Review
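As a minimal sketch of the SciPy version mentioned in the topic list above (the counts are invented; `binomtest` requires SciPy 1.7+, while older releases used `binom_test`):

```python
from scipy.stats import binomtest

# e.g. 41 purchases out of 500 visitors, against an expected 10% purchase rate
result = binomtest(k=41, n=500, p=0.10, alternative="two-sided")
print(result.pvalue)
```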
5) Significance Thresholds
- Intro
- Interpreting a P-Value based on a Significance Threshold
- Error Types
- Setting the Type I Error Rate
- Problems with Multiple Hypothesis Tests
Documentation
Scipy Documentation - Statistical Functions
In this documentation, you will find information about SciPy’s functions for statistical analysis. This is helpful if you would like to perform statistical analyses and/or mathematical computations on data in Python.
Details
In this lesson, we’ll use hypothesis tests to make inference about population-level associations between two variables.
We will cover four different hypothesis tests:
- Two Sample T-Tests (for an association between a quantitative variable and a binary categorical variable)
- ANOVA and Tukey Tests (for an association between a quantitative variable and a non-binary categorical variable)
- Chi-Square Tests (for an association between two categorical variables)
Association:
Assumption:
Before we use a two sample t-test, ANOVA, or Tukey’s range test, we need to be sure that the following things are true:
- The observations should be independently randomly sampled from the population
- The standard deviations of the groups should be equal
- The data should be normally distributed…ish
- The groups created by the categorical variable must be independent
Before we use a Chi-Square test, we need to be sure that the following things are true:
- The observations should be independently randomly sampled from the population
- The categories of both variables must be mutually exclusive
- The groups should be independent
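A minimal SciPy sketch of these tests on simulated groups and an invented contingency table (Tukey's range test is usually run through statsmodels and is not shown here):

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind, f_oneway, chi2_contingency

group_a = np.random.normal(10, 2, 100)
group_b = np.random.normal(11, 2, 100)
group_c = np.random.normal(12, 2, 100)

print(ttest_ind(group_a, group_b).pvalue)          # quantitative vs. binary categorical
print(f_oneway(group_a, group_b, group_c).pvalue)  # quantitative vs. non-binary categorical

table = pd.crosstab(pd.Series(["yes", "no", "yes", "yes", "no", "no"], name="clicked"),
                    pd.Series(["A", "A", "B", "B", "A", "B"], name="version"))
print(chi2_contingency(table))                     # two categorical variables
```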
Details
There are many different types of hypothesis tests that should be used in different situations. In this article, we are going to briefly review some of them and when they are appropriate.
Here is a summary table of some of the hypothesis tests we’ll cover:
1) A/B Testing - Sample Size Calculators
An A/B Test is a scientific method of choosing between two options (Option A and Option B). Some examples of A/B tests include:
- What number of sale items on a website makes customers most likely to purchase something: 25 or 50?
- What color button are customers more likely to click on: blue or green?
- Do people spend more time on a website if the background is green or orange?
For A/B tests where the outcome of interest (e.g., whether or not a customer makes a purchase) is categorical, an A/B test is conducted using a Chi-Square hypothesis test. In order to determine the sample size necessary for this kind of test, a sample size calculator requires three numbers:
- Baseline conversion rate
- Minimum detectable effect (also called the minimum desired lift)
- Statistical significance threshold
A/B Testing:
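One way to approximate what an online sample size calculator does is with statsmodels (not part of the path materials above; the baseline rate, lift, and thresholds below are invented):

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.10, 0.12)   # baseline 10%, minimum detectable lift to 12%
n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                           power=0.8, alternative="two-sided")
print(n_per_group)   # required sample size per group
```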
2) Sample Size Determination with Simulation
Details
Details
1) Introduction
Data visualization is an important component of Exploratory Data Analysis (EDA) because it allows a data analyst to “look at” their data and get to know the variables and relationships between them. In order to choose and design a data visualization, it is important to consider two things:
- The question you want to answer (and how many variables that question involves)
- The data that is available (is it quantitative or categorical?)
Univariate Analysis
Univariate analysis focuses on a single variable at a time. Univariate data visualizations can help us answer questions like:
- What is the typical price of a rental in New York City?
- What proportion of NYC rentals have a gym?
-
Quantitative Variables
Box plots (or violin plots) and histograms are common choices for visually summarizing a quantitative variable. These plots are useful because they simultaneously communicate information about minimum and maximum values, central location, and spread. Histograms can additionally illuminate patterns that can impact an analysis (e.g., skew or multimodality).
-
Categorical Variables
For categorical variables, we can use a bar plot (instead of a histogram) to quickly visualize the frequency (or proportion) of values in each category.
Bivariate Analysis
In many cases, a data analyst is interested in the relationship between two variables in a dataset. For example:
- Do apartments in different boroughs tend to cost different amounts?
- What is the relationship between the area of an apartment and how much it costs?
-
One Quantitative Variable and One Categorical Variable
Two good options for investigating the relationship between a quantitative variable and a categorical variable are side-by-side box plots and overlapping histograms.
-
Two Quantitative Variables
A scatter plot is a great option for investigating the relationship between two quantitative variables.
-
Two Categorical Variables
Side by side (or stacked) bar plots are useful for visualizing the relationship between two categorical variables.
Multivariate Analysis
Sometimes, a data analyst is interested in simultaneously exploring the relationship between three or more variables in a single visualization. Many of the visualization methods presented up to this point can include additional variables by using visual cues such as colors, shapes, and patterns.
2) A Gentle Introduction to Exploratory Data Analysis
In this article, you will learn the basic steps of exploratory data analysis and walk through an example. This is helpful if you want to ensure that all aspects of the data are being correctly interpreted in an analysis.
3) A Simple Tutorial on Exploratory Data Analysis
→ Kaggle
In this article, you will learn the different methods of exploratory data analysis (EDA), key concepts for an EDA, as well as a thorough example through an EDA. This is helpful if you want many practical examples of visual and numeric methods to explore data.
Details
1) Line Graphs in Matplotlib
Matplotlib is a Python library used to create charts and graphs. In this first lesson, you will get an overview of the basic commands necessary to build and label a line graph.
- Intro
- Basic Line Plot
- Basic Line Plot II
- Linestyles
- Axis and Labels
- Labeling the Axes
- Subplots
- Subplots Part II
- Legends
- Modify Ticks
- Figures
- Review
Documentation
In this documentation, you will learn about plotting data using Python’s Matplotlib library. This is helpful if you would like to graphically visualize data in python.
2) Different Plot Types
- Simple Bar Chart
- Simple Bar Chart II
- Side-By-Side Bars
- Stacked Bars
- Error Bars
- Fill Between
- Pie Chart
- Pie Chart Labeling
- Histogram
- Multiple Histograms
3) How to Select a Meaningful Visualization
The three steps in the data visualization process are preparing, visualizing, and styling data. When faced with a blank canvas, the second step of the process, visualizing the data, can be overwhelming. To help, we’ve created a diagram to guide the selection of a chart based on what you want to explore in your data.
When planning out a visualization, you’ll usually have an idea of what questions you’ll want to explore. However, you may initially wonder exactly which chart to use. This moment is one of the most exciting parts of the process!
During your brainstorming phase, you should consider two things:
- The focusing question you want to answer with your chart
- The type of data that you want to visualize
Depending on the focusing questions you’re trying to answer, the type of chart you select should be different and intentional in its difference. In the diagram below, we have assigned Matplotlib visualizations to different categories. These categories explore common focusing questions and types of data you may want to display in a visualization.
Chart categories
-
Composition charts
- Focusing Question:
- What are the parts of some whole?
- What is the data made of?
- Datasets that work well: Data pertaining to probabilities, proportions, and percentages can be visualized with the graphs in this composition category. Charts in this category illustrate the different data components and their percentages as part of a whole.
-
Distribution Charts
- Datasets that work well: Data in large quantities and/or with an array of attributes works well for these types of charts. Visualizations in this category will allow you to see patterns, re-occurrences, and a clustering of data points.
- Note: In statistics, a commonly seen distribution is a bell curve, also known as a normal distribution. A bell curve is a bell-shaped distribution where most of the values in the dataset crowd around the average (also known as the mean), therefore causing the curve to form. If you want to see how values in the data are “distributed” across variables, the best way to do that would be with the visualizations in this category.
-
Relationship Charts
- Focusing Question:
- How do variables relate to each other?
- Datasets that work well: Data with two or more variables can be displayed in these charts. These charts typically illustrate a correlation between two or more variables. You can communicate this relationship by mapping multiple variables in the same chart. Correlation measures the strength of a relationship between variables.
-
Comparison Charts
- Focusing Question:
- How do variables compare to each other?
- Datasets that work well: Data must have multiple variables, and the visualizations in this category allow readers to compare those items against the others. For example, a line graph that has multiple lines, each belonging to a different variable. Multi-colored bar charts are also a great way to compare items in data.
4) Recreate Graphs using Matplotlib
- Bar Chart with Error
- Side By Side Bars
- Stacked Bars
- Two Histograms on a Plot
- Labeled Pie Chart
- Line with Shaded Error
Resource
The Data Visualization Catalogue
In this resource, you will learn about different data visualization methods including how different methods are related, how to interpret them, and the limitations of each method. This is helpful if you want to expertly visualize data and can be a powerful reference to use when trying to decide which visualization method matches what you want to show.
Details
1) Introduction to Seaborn
Seaborn is a Python data visualization library that provides simple code to create elegant visualizations for statistical exploration and insight. Seaborn is based on Matplotlib, but improves on Matplotlib in several ways:
- Seaborn provides a more visually appealing plotting style and concise syntax.
- Seaborn natively understands Pandas DataFrames, making it easier to plot data directly from CSVs.
- Seaborn can easily summarize Pandas DataFrames with many rows of data into aggregated charts.
If you’re unfamiliar with Pandas, just know that Pandas is a data analysis library for Python that provides easy-to-use data structures and allows you to organize and manipulate datasets so they can be visualized. To fully leverage the power of Seaborn, it is best to prepare your data using Pandas.
Seaborn:
- Intro
- Using Pandas For Seaborn
- Plotting Bars with Seaborn
- Understanding Aggregates
- Plotting Aggregates
- Modifying Error Bars
- Calculating Different Aggregates
- Aggregating by Multiple Columns
- Review
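A minimal sketch of the Pandas-plus-Seaborn workflow described above (the DataFrame is made up; `sns.barplot()` aggregates the mean per category and adds error bars):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"store": ["A", "A", "B", "B", "C", "C"],
                   "sales": [10, 14, 20, 18, 7, 9]})

sns.barplot(data=df, x="store", y="sales")
plt.show()
```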
2) Distribution using Seaborn
In this lesson, we will explore how to use Seaborn to graph multiple statistical distributions, including box plots and violin plots.
Seaborn is optimized to work with large datasets — from its ability to natively interact with Pandas DataFrames, to automatically calculating and plotting aggregates. One of the most powerful aspects of Seaborn is its ability to visualize and compare distributions. Distributions provide us with more information about our data — how spread out it is, its range, etc.
Calculating and graphing distributions is integral to analyzing massive amounts of data. We’ll look at how Seaborn allows us to move beyond the traditional distribution graphs to plots that enable us to communicate important statistical information.
Distribution:
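A minimal sketch of the two distribution plots covered in this lesson, using simulated song lengths:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "genre": np.repeat(["rock", "jazz"], 500),
    "length": np.concatenate([np.random.normal(240, 30, 500),
                              np.random.normal(300, 45, 500)]),
})

sns.boxplot(data=df, x="genre", y="length")     # five-number summary per genre
plt.show()
sns.violinplot(data=df, x="genre", y="length")  # adds the shape of each distribution
plt.show()
```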
Documentation
In this documentation, you will learn the many elements of the Seaborn API. This is helpful if you wish to use Seaborn to visualize data in plots and grids.
Details
Details
Look at the technologies around us:
- Spellcheck and autocorrect
- Auto-generated video captions
- Virtual assistants like Amazon’s Alexa
- Autocomplete
- Your news site’s suggested articles
What do they have in common?
All of these handy technologies exist because of natural language processing! Also known as NLP, the field is at the intersection of linguistics, artificial intelligence, and computer science. The goal? Enabling computers to interpret, analyze, and approximate the generation of human languages (like English or Spanish).
- Text Preprocessing
- Parsing Text
- Language Models: Bag-of-Words
- Language Models: N-Gram and NLM
- Topic Models
- Text Similarity
- Language Prediction & Text Generation
- Advanced NLP Topics
- Challenges and Considerations
- Review
Concepts!
- Natural language processing combines computer science, linguistics, and artificial intelligence to enable computers to process human languages.
- NLTK is a Python library used for NLP.
- Text preprocessing is a stage of NLP focused on cleaning and preparing text for other NLP tasks.
- Parsing is an NLP technique concerned with breaking up text based on syntax.
- Language models are probabilistic machine models of language used for NLP comprehension tasks. Common models include bag-of-words, n-gram models, and neural language modeling.
- Topic modeling is the NLP process by which hidden topics are identified given a body of text.
- Text similarity is a facet of NLP concerned with semblance between instances of language.
- Language prediction is an application of NLP concerned with predicting language given preceding language.
Details
Text preprocessing is an approach for cleaning and preparing text data for use in a specific context. Developers use it in almost all natural language processing (NLP) pipelines, including voice recognition software, search engine lookup, and machine learning model training. It is an essential step because text data can vary widely: from its format (website, text message, voice recognition) to the people who create the text (language, dialect), there are plenty of things that can introduce noise into your data.
- Noise Removal
- Tokenization
- Normalization
- Stopword Removal
- Stemming
- Lemmatization
- Part-of-Speech Tagging
- Review
Concepts!
- Text preprocessing is all about cleaning and prepping text data so that it’s ready for other NLP tasks.
- Noise removal is a text preprocessing step concerned with removing unnecessary formatting from our text.
- Tokenization is a text preprocessing step devoted to breaking up text into smaller units (usually words or discrete terms).
- Normalization is the name we give most other text preprocessing tasks, including stemming, lemmatization, upper and lowercasing, and stopword removal.
- Stemming is the normalization preprocessing task focused on removing word affixes.
- Lemmatization is the normalization preprocessing task that more carefully brings words down to their root forms.
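A minimal NLTK sketch tying several of these steps together (the sentence is made up, and the exact resource names passed to `nltk.download()` can vary between NLTK versions):

```python
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt")      # one-time downloads of the resources used below
nltk.download("stopwords")
nltk.download("wordnet")

text = "<p>The 5 fantastic fish were swimming farther away!</p>"

cleaned = re.sub(r"<.+?>|\d", "", text).lower()                        # noise removal
tokens = word_tokenize(cleaned)                                        # tokenization
no_stops = [t for t in tokens if t not in stopwords.words("english")]  # stopword removal
print([PorterStemmer().stem(t) for t in no_stops])                     # stemming
print([WordNetLemmatizer().lemmatize(t) for t in no_stops])            # lemmatization
```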
Details
The bag-of-words language model is a simple-yet-powerful tool to have up your sleeve when working on natural language processing (NLP). The model has many, many use cases including:
- determining topics in a song
- filtering spam from your inbox
- finding out if a tweet has positive or negative sentiment
- creating word clouds
Bag-of-words (BoW) is a statistical language model based on word count. Say what?
Let’s start with that first part: a statistical language model is a way for computers to make sense of language based on probability. For example, let’s say we have the text:
"Five fantastic fish flew off to find faraway functions. Maybe find another five fantastic fish?"
A statistical language model focused on the starting letter for words might take this text and predict that words are most likely to start with the letter “f” because 11 out of 15 words begin that way. A different statistical model that pays attention to word order might tell us that the word “fish” tends to follow the word “fantastic.”
- BoW Dictionaries
- Building a Features Dictionary
- Building a BoW Vector
- It's All in the Bag
- Spam A Lot No More
- BoW Wow
- BoW Ow
- Review
Documentation
Working with Text Data | scikit-learn
In this documentation, you will learn about the scikit-learn tools for generating language models like bag-of-words using text data. This is helpful if you would like to analyze a collection of text documents.
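A minimal scikit-learn sketch of a bag-of-words model built with `CountVectorizer` (the two documents are made up; `get_feature_names_out()` assumes scikit-learn 1.0+):

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["Five fantastic fish flew off to find faraway functions",
             "Maybe find another five fantastic fish"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(documents)    # sparse matrix of word counts per document

print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(bow.toarray())                         # one row of counts per document
```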
Details
You might not even realize it, but you are building a language model in your head similar to term frequency-inverse document frequency, commonly known as tf-idf. Tf-idf is another powerful tool in your NLP toolkit that has a variety of use cases, including:
- ranking results in a search engine
- text summarization
- building smarter chatbots
Term frequency-inverse document frequency is a numerical statistic used to indicate how important a word is to each document in a collection of documents, or a corpus.
Tf-idf relies on two different metrics in order to come up with an overall score:
- Term frequency, or how often a word appears in a document. This is the same as bag-of-words’ word count.
- Inverse document frequency, which is a measure of how often a word appears in the overall corpus. By penalizing the score of words that appear throughout a corpus, tf-idf can give better insight into how important a word is to a particular document of a corpus.
- What is Tf-idf?
- Breaking It Down Part I: Term Frequency
- Breaking It Down Part II: Inverse Document Frequency
- Putting It All Together: Tf-idf
- Converting Bag-of-Words to Tf-idf
- Review
Documentation
Working with Text Data | scikit-learn | From Occurrences to Frequencies
In this documentation, you will learn how to use scikit-learn to conduct tf-idf. This is helpful if you are trying to determine topics or themes within text data.
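A minimal scikit-learn sketch using `TfidfVectorizer` on an invented corpus (a `TfidfTransformer` can likewise convert an existing bag-of-words matrix):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the quick brown fox",
          "the lazy dog",
          "the quick dog jumps over the lazy fox"]

vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(corpus)   # tf-idf score for every word in every document

print(vectorizer.get_feature_names_out())
print(scores.toarray().round(2))            # common words like "the" score relatively low
```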
Details
A word embedding is a representation of a word as a numeric vector, enabling us to compare and contrast how words are used and identify words that occur in similar contexts.
The applications of word embeddings include:
- entity recognition in chatbots
- sentiment analysis
- syntax parsing
Word Embeddings:
- Intro
- Vectors
- What is a Word Embedding?
- Distance
- Word Embeddings Are All About Distance
- Word2vec
- Gensim
Concepts!
- Vectors are containers of information, and they can have anywhere from 1-dimension to hundreds or thousands of dimensions
- Word embeddings are vector representations of a word, where words with similar contexts are represented with vectors that are closer together
- spaCy is a package that enables us to view and use pre-trained word embedding models
- The distance between vectors can be calculated in many ways, and the best way for measuring the distance between higher dimensional vectors is cosine distance
- Word2Vec is a shallow neural network model that can build word embeddings using either continuous bag-of-words or continuous skip-grams
- Gensim is a package that allows us to create and train word embedding models using any corpus of text
Documentation
spaCy Documentation - Token.similarity
In this documentation, you will learn about the spaCy library for Natural Language Processing in Python. This is helpful if you would like to process and derive insights from unstructured textual data.
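A minimal spaCy sketch of comparing embeddings with `Token.similarity()`. It assumes a model that ships with word vectors (here `en_core_web_md`, installed separately with `python -m spacy download en_core_web_md`); the sentence is made up:

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("The cat sat quietly while the kitten played outside")

cat, quietly, kitten = doc[1], doc[3], doc[6]
print(cat.vector.shape)          # the word embedding behind each token
print(cat.similarity(kitten))    # relatively high: similar contexts
print(cat.similarity(quietly))   # lower: less related
```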
Details
Details
While at IBM, Arthur Samuel developed a program that learned how to play checkers (1959). He coined the term machine learning, describing it as:
"The field of study that gives computers the ability to learn without being explicitly programmed."
What does this mean?
As programmers, we often approach problems in a methodical, logic-based way. We try to determine what our desired outputs should be, and then create the proper rules that will transform our inputs into those outputs.
Machine learning flips the script. We want the program itself to learn the rules that describe our data the best, by finding patterns in what we know and applying those patterns to what we don’t know.
These algorithms are able to learn. Their performance gets better and better with each iteration, as it uncovers more hidden trends in the data.
Machine learning can be branched out into the following categories:
-
Supervised Learning
Supervised Learning is where the data is labeled and the program learns to predict the output from the input data. For instance, a supervised learning algorithm for credit card fraud detection would take as input a set of recorded transactions. For each transaction, the program would predict if it is fraudulent or not.
Supervised learning problems can be further grouped into regression and classification problems.
Regression:
In regression problems, we are trying to predict a continuous-valued output. Examples are:
- What is the housing price in New York?
- What is the value of cryptocurrencies?
Classification:
In classification problems, we are trying to predict a discrete number of values. Examples are:
- Is this a picture of a human or a picture of an AI?
- Is this email spam?
-
Unsupervised Learning
Unsupervised Learning is a type of machine learning where the program learns the inherent structure of the data based on unlabeled examples.
Clustering is a common unsupervised machine learning approach that finds patterns and structures in unlabeled data by grouping them into clusters.
Some examples:
- Social networks clustering topics in their news feed
- Consumer sites clustering users for recommendations
- Search engines to group similar objects in one cluster
Supervised Learning
Data is labeled and the program learns to predict the output from the input data
Unsupervised Learning
Data is unlabeled and the program learns to recognize the inherent structure in the input data
Scikit-learn is a library in Python that provides many unsupervised and supervised learning algorithms. It’s built upon some of the technology you might already be familiar with, like NumPy, pandas, and Matplotlib!
As you build robust Machine Learning programs, it’s helpful to have all the sklearn commands in one place in case you forget.
-
Import and create the model:
from sklearn.linear_model import LinearRegression
your_model = LinearRegression()
Fit:
your_model.fit(x_training_data, y_training_data)
- `.coef_`: contains the coefficients
- `.intercept_`: contains the intercept
Predict:
predictions = your_model.predict(your_x_data)
- `.score()`: returns the coefficient of determination R²
-
Import and create the model:
from sklearn.naive_bayes import MultinomialNB
your_model = MultinomialNB()
Fit:
your_model.fit(x_training_data, y_training_data)
Predict:
# Returns a list of predicted classes - one prediction for every data point
predictions = your_model.predict(your_x_data)
# For every data point, returns a list of probabilities of each class
probabilities = your_model.predict_proba(your_x_data)
-
Import and create the model:
from sklearn.neighbors import KNeighborsClassifier
your_model = KNeighborsClassifier()
Fit:
your_model.fit(x_training_data, y_training_data)
Predict:
# Returns a list of predicted classes - one prediction for every data point
predictions = your_model.predict(your_x_data)
# For every data point, returns a list of probabilities of each class
probabilities = your_model.predict_proba(your_x_data)
-
Import and create the model:
from sklearn.cluster import KMeans
your_model = KMeans(n_clusters=4, init='random')
- `n_clusters`: number of clusters to form and number of centroids to generate
- `init`: method for initialization
  - `k-means++`: K-Means++ [default]
  - `random`: K-Means
- `random_state`: the seed used by the random number generator [optional]
Fit:
your_model.fit(x_training_data)
Predict:
predictions = your_model.predict(your_x_data)
-
Import and print accuracy, recall, precision, and F1 score:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
print(accuracy_score(true_labels, guesses))
print(recall_score(true_labels, guesses))
print(precision_score(true_labels, guesses))
print(f1_score(true_labels, guesses))
Import and print the confusion matrix:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(true_labels, guesses))
-
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2)
- `train_size`: the proportion of the dataset to include in the train split
- `test_size`: the proportion of the dataset to include in the test split
- `random_state`: the seed used by the random number generator [optional]
Details
Supervised learning algorithms use labeled data as input while unsupervised learning algorithms use unlabeled data. However, we can further distinguish machine learning algorithms by the output they produce. In terms of output, two main types of machine learning models exist: those for regression and those for classification.
-
Regression
Regression is used to predict outputs that are continuous. The outputs are quantities that can be flexibly determined based on the inputs of the model rather than being confined to a set of possible labels.
For example:
- Predict the height of a potted plant from the amount of rainfall
- Predict salary based on someone’s age and availability of high-speed internet
- Predict a car’s MPG (miles per gallon) based on size and model year
-
Classification
Classification is used to predict a discrete label. The outputs fall under a finite set of possible outcomes. Many situations have only two possible outcomes. This is called binary classification (True/False, 0 or 1, Hotdog / not Hotdog).
For example:
- Predict whether an email is spam or not
- Predict whether it will rain or not
- Predict whether a user is a power user or a casual user
There are also two other common types of classification: multi-class classification and multi-label classification.
Multi-class classification has the same idea behind binary classification, except instead of two possible outcomes, there are three or more.
For example:
- Predict whether a photo contains a pear, apple, or peach
- Predict what letter of the alphabet a handwritten character is
- Predict whether a piece of fruit is small, medium, or large
In this lesson, you will learn three different ways to define the distance between two points:
-
Euclidean Distance
Euclidean Distance is the most commonly used distance formula. To find the Euclidean distance between two points, we first calculate the squared distance between each dimension. If we add up all of these squared differences and take the square root, we’ve computed the Euclidean distance.
-
Manhattan Distance
Manhattan Distance is extremely similar to Euclidean distance. Rather than summing the squared difference between each dimension, we instead sum the absolute value of the difference between each dimension. It’s called Manhattan distance because it’s similar to how you might navigate when walking city blocks. If you’ve ever wondered “how many blocks will it take me to get from point A to point B”, you’ve computed the Manhattan distance.
-
Hamming Distance
Hamming Distance is another slightly different variation on the distance formula. Instead of finding the difference of each dimension, Hamming distance only cares about whether the dimensions are exactly equal. When finding the Hamming distance between two points, add one for every dimension that has different values.
-
SciPy Distances
Now that you’ve written these three distance formulas yourself, let’s look at how to compute them using Python’s SciPy library:
- Euclidean Distance: `.euclidean()`
- Manhattan Distance: `.cityblock()`
- Hamming Distance: `.hamming()`
Distance:
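A minimal sketch of those three SciPy helpers on a pair of made-up points (note that SciPy's `.hamming()` returns the proportion of differing dimensions rather than a raw count):

```python
from scipy.spatial import distance

pt_a = [1, 2, 3]
pt_b = [4, 6, 3]

print(distance.euclidean(pt_a, pt_b))   # sqrt(3**2 + 4**2 + 0**2) = 5.0
print(distance.cityblock(pt_a, pt_b))   # Manhattan distance: 3 + 4 + 0 = 7
print(distance.hamming(pt_a, pt_b))     # 2 of 3 dimensions differ -> 0.666...
```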
The purpose of machine learning is often to create a model that explains some real-world data, so that we can predict what may happen next, with different inputs.
The simplest model that we can fit to data is a line. When we are trying to find a line that fits a set of data best, we are performing Linear Regression.
We often want to find lines to fit data, so that we can predict unknowns. For example:
- The market price of a house vs. the square footage of a house. Can we predict how much a house will sell for, given its size?
- The tax rate of a country vs. its GDP. Can we predict taxation based on a country’s GDP?
- The amount of chips left in the bag vs. number of chips taken. Can we predict how much longer this bag of chips will last, given how much people at this party have been eating?
Linear Regression:
- Introduction to Linear Regression
- Points and Lines
- Loss
- Gradient Descent for Intercept
- Gradient Descent for Slope
- Put it Together
- Convergence
- Learning Rate
- Put it Together II
- Use Your Functions on Real Data
- Scikit-Learn
- Review
Documentation
Scikit-learn's Linear Regression Documentation
In this documentation, you will learn about Sklearn’s linear regression function. This is helpful if you would like to perform data prediction based on the relationship between two numeric variables.
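A minimal scikit-learn sketch of fitting a line and predicting an unknown (the months/revenue numbers are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
revenue = np.array([100, 125, 155, 180, 210, 230, 260, 290])

model = LinearRegression()
model.fit(months, revenue)

print(model.coef_, model.intercept_)   # learned slope and intercept
print(model.predict([[12]]))           # prediction for a new input
```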
Details
Linear regression is useful when we want to predict the values of a variable from its relationship with other variables. There are two different types of linear regression models (simple linear regression and multiple linear regression).
In predicting the price of a home, one factor to consider is the size of the home. The relationship between those two variables, price and size, is important, but there are other variables that factor in to pricing a home: location, air quality, demographics, parking, and more. When making predictions for price, our dependent variable, we’ll want to use multiple independent variables. To do this, we’ll use Multiple Linear Regression.
Multiple Linear Regression uses two or more independent variables to predict the values of the dependent variable.
- Intro
- StreetEasy Dataset
- Training Set vs. Test Set
- Multiple Linear Regression: Scikit-Learn
- Visualizing Results with Matplotlib
- Multiple Linear Regression Equation
- Correlations
- Evaluating the Model's Accuracy
- Rebuild the Model
Documentation
Sklearn's Train_test_split documentation
In this documentation, you will learn about the Sklearn’s train_test_split() function. This is helpful if you would like to split a dataset into training and testing datasets.
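A minimal sketch combining `train_test_split()` with a multiple linear regression (the column names and values below are invented, not the actual StreetEasy data):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "size_sqft": [500, 650, 800, 950, 1100, 1300, 1500, 1700],
    "building_age_yrs": [30, 12, 45, 5, 20, 8, 15, 3],
    "rent": [2000, 2600, 2700, 3600, 3900, 4800, 5200, 6100],
})

X = df[["size_sqft", "building_age_yrs"]]
y = df["rent"]
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)

mlr = LinearRegression().fit(x_train, y_train)
print(mlr.score(x_test, y_test))   # R² on the held-out test set
```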
Details
K-Nearest Neighbors (KNN) is a classification algorithm. The central idea is that data points with similar attributes tend to fall into similar categories.
- Introduction
- Distance Between Points - 2D
- Distance Between Points - 3D
- Data with Different Scales: Normalization
- Finding the Nearest Neighbors
- Count Neighbors
- Classify Your Favorite Movie
- Training and Validation Sets
- Choosing K
- Using sklearn
The K-Nearest Neighbors algorithm is a powerful supervised machine learning algorithm typically used for classification. However, it can also perform regression.
Documentation
Sklearn KNeighborsClassifier documentation
In this documentation, you will learn about the KNeighborsClassifier in Scikit-Learn. This is helpful if you would like to classify data into discrete classes using a k-nearest neighbors algorithm.
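A minimal scikit-learn sketch of the classifier (the movie features and labels are invented, and the features would normally be normalized first):

```python
from sklearn.neighbors import KNeighborsClassifier

# [duration_minutes, budget_millions] -> liked (1) or not (0)
features = [[100, 10], [150, 200], [90, 5], [140, 150], [95, 8], [160, 250]]
labels = [0, 1, 0, 1, 0, 1]

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(features, labels)

print(classifier.predict([[120, 120]]))   # majority vote of the 3 nearest neighbors
```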
Details
Normalizing your data is an essential part of machine learning. You might have an amazing dataset with many great features, but if you forget to normalize, one of those features might completely dominate the others. It’s like you’re throwing away almost all of your information! Normalizing solves this problem. In this article, you learned the following techniques to normalize:
- Min-max normalization: Guarantees all features will have the exact same scale but does not handle outliers well.
- Z-score normalization: Handles outliers, but does not produce normalized data with the exact same scale.
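A minimal NumPy sketch of both techniques on a made-up column (scikit-learn's `MinMaxScaler` and `StandardScaler` do the same job on whole feature matrices):

```python
import numpy as np

ages = np.array([15, 22, 35, 48, 60, 73])

min_max = (ages - ages.min()) / (ages.max() - ages.min())   # squeezed into [0, 1]
z_scores = (ages - ages.mean()) / ages.std()                # centered on 0, scaled by std

print(min_max)
print(z_scores)
```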
Sometimes your dataset is so small that splitting it 80/20 will still result in a large amount of variance. One solution to this is to perform N-Fold Cross-Validation. The central idea here is that we’re going to do this entire process N times and average the accuracy.
For example, in 10-fold cross-validation, we’ll make the validation set the first 10% of the data and calculate accuracy, precision, recall and F1 score. We’ll then make the validation set the second 10% of the data and calculate these statistics again. We can do this process 10 times, and every time the validation set will be a different chunk of the data. If we then average all of the accuracies, we will have a better sense of how our model does on average.
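A minimal scikit-learn sketch of 10-fold cross-validation, using the built-in iris dataset as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=10)
print(scores)          # one accuracy score per fold
print(scores.mean())   # the averaged estimate of model performance
```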
Insights:
- Classifying a single point can result in a true positive (truth = 1, guess = 1), a true negative (truth = 0, guess = 0), a false positive (truth = 0, guess = 1), or a false negative (truth = 1, guess = 0).
- Accuracy measures how many classifications your algorithm got correct out of every classification it made.
- Recall measures the percentage of the relevant items your classifier was able to successfully find.
- Precision measures the percentage of items your classifier found that were actually relevant. Precision and recall are tied to each other. As one goes up, the other will go down.
- F1 score is a combination of precision and recall.
- F1 score will be low if either precision or recall is low.
The decision to use precision, recall, or F1 score ultimately comes down to the context of your classification. Maybe you don’t care if your classifier has a lot of false positives. If that’s the case, precision doesn’t matter as much.
Documentation
In this documentation, you will learn about the regularization technique in Machine Learning. This is helpful to improve accuracy in machine learning models by reducing overfitting.
Details
Logistic Regression is a supervised machine learning algorithm that uses regression to predict the continuous probability, ranging from 0 to 1, of a data sample belonging to a specific category, or class. Then, based on that probability, the sample is classified as belonging to the more probable class, ultimately making Logistic Regression a classification algorithm.
- Introduction
- Linear Regression Approach
- Logistic Regression
- Log-Odds
- Sigmoid Function
- Log Loss
- Classification Thresholding
- Scikit-Learn
- Feature Importance
Review
- Logistic Regression is used to perform binary classification, predicting whether a data sample belongs to a positive (present) class, labeled 1, or the negative (absent) class, labeled 0.
- The Sigmoid Function bounds the product of feature values and their coefficients, known as the log-odds, to the range [0,1], providing the probability of a sample belonging to the positive class.
- A loss function measures how well a machine learning model makes predictions. The loss function of Logistic Regression is log-loss.
- A Classification Threshold is used to determine the probabilistic cutoff for where a data sample is classified as belonging to a positive or negative class. The standard cutoff for Logistic Regression is 0.5, but the threshold can be higher or lower depending on the nature of the data and the situation.
- Scikit-learn has a Logistic Regression implementation that allows you to fit a model to your data, find the feature coefficients, and make predictions on new data samples.
- The coefficients determined by a Logistic Regression model can be used to interpret the relative importance of each feature in predicting the class of a data sample.
Documentation
Sklearn documentation | Linear_model.LogisticRegression
In this documentation, you will learn about the sklearn class “Linear_model.LogisticRegression”. This is helpful if you would like to build a machine learning model to predict the probability of an event occurring based on data.
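A minimal scikit-learn sketch (the hours-studied data is invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

print(model.coef_, model.intercept_)   # log-odds coefficients
print(model.predict_proba([[2.0]]))    # probability of each class
print(model.predict([[2.0]]))          # class using the default 0.5 threshold
```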
Bayes’ Theorem is the basis of a branch of statistics called Bayesian Statistics, where we take prior knowledge into account before calculating new probabilities.
A Naive Bayes classifier is a supervised machine learning algorithm that leverages Bayes’ Theorem to make predictions and classifications.
- Investigate the Data
- Bayes Theorem I
- Bayes Theorem II
- Smoothing
- Classify
- Formatting the Data for scikit-learn
- Using scikit-learn
- Review
Documentation
Sklearn Documentation | feature_extraction.text.CountVectorizer
In this documentation, you will learn about the CountVectorizer class of the Sklearn.feature_extraction module. This is helpful if you would like to analyze text documents or strings using machine learning algorithms.
Documentation
Scikit-learn documentation for Multinomial Naive Bayes
In this documentation, you will learn Scikit-Learn’s methods for building a multinomial naive Bayes model. This is helpful if you wish to build a naive Bayes’ classifier for discrete data.
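A minimal sketch combining `CountVectorizer` with `MultinomialNB` on an invented spam example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["win money now", "limited offer win prize",
               "see you at lunch", "meeting moved to noon"]
train_labels = [1, 1, 0, 0]   # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_texts)

classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

test_counts = vectorizer.transform(["win a free prize"])
print(classifier.predict(test_counts))
print(classifier.predict_proba(test_counts))
```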
Details
A Support Vector Machine (SVM) is a powerful supervised machine learning model used for classification. An SVM makes classifications by defining a decision boundary and then seeing what side of the boundary an unclassified point falls on. In the next few exercises, we’ll learn how these decision boundaries get defined, but for now, know that they’re defined by using a training set of classified points. That’s why SVMs are supervised machine learning models.
- Support Vector Machines
- Optimal Decision Boundaries
- Support Vectors and Margins
- scikit-learn
- Outliers
- Kernels
- Polynomial Kernel
- Radial Basis Function Kernel
Concepts:
- SVMs are supervised machine learning models used for classification.
- An SVM uses support vectors to define a decision boundary. Classifications are made by comparing unlabeled points to that decision boundary.
- Support vectors are the points of each class closest to the decision boundary. The distance between the support vectors and the decision boundary is called the margin.
- SVMs attempt to create the largest margin possible while staying within an acceptable amount of error.
- The `C` parameter controls how much error is allowed. A large `C` allows for little error and creates a hard margin. A small `C` allows for more error and creates a soft margin.
- SVMs use kernels to classify points that aren’t linearly separable.
- Kernels transform points into higher dimensional space. A polynomial kernel transforms points into three dimensions while an rbf kernel transforms points into infinite dimensions.
- An rbf kernel has a `gamma` parameter. If `gamma` is large, the training data is more relevant, and as a result overfitting can occur.
Documentation
Scikit-Learn Documentation on SVC
In this documentation, you will learn how to create a Support Vector Machine classifier using the Scikit learn library. This is helpful if you would like to experiment with different mathematical models for classifying data.
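A minimal scikit-learn sketch of `SVC` on two made-up clusters of points:

```python
from sklearn.svm import SVC

points = [[1, 1], [1, 2], [2, 1], [7, 7], [8, 8], [7, 9]]
labels = [0, 0, 0, 1, 1, 1]

classifier = SVC(kernel="rbf", C=1.0, gamma="scale")
classifier.fit(points, labels)

print(classifier.predict([[2, 2], [8, 7]]))
print(classifier.support_vectors_)   # the points that define the decision boundary
```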
Decision trees are machine learning models that try to find patterns in the features of data points. Take a look at the tree on this page. This tree tries to predict whether a student will get an A on their next test.
- Cars
- Gini Impurity
- Information Gain
- Weighted Information Gain
- Recursive Tree Building
- Classifying New Data
- Decision Trees in scikit-learn
- Decision Tree Limitations
Review:
- Good decision trees have pure leaves. A leaf is pure if all of the data points in that leaf have the same label.
- Decision trees are created using a greedy algorithm that prioritizes finding the feature that results in the largest information gain when splitting the data using that feature.
- Creating an optimal decision tree is difficult. The greedy algorithm doesn’t always find the globally optimal tree.
- Decision trees often suffer from overfitting. Making the tree small by pruning helps to generalize the tree so it is more accurate on data in the real world.
Documentation
In this documentation, you will learn how to use a decision tree to classify data with Scikit-Learn. This is helpful if you would like to build a decision tree classifier in Python.
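A minimal scikit-learn sketch, using the built-in iris dataset and a depth limit as a simple stand-in for pruning:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

tree = DecisionTreeClassifier(max_depth=3)   # limiting depth helps avoid overfitting
tree.fit(x_train, y_train)

print(tree.score(x_test, y_test))            # accuracy on unseen data
```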
A random forest is an ensemble machine learning technique — a random forest contains many decision trees that all work together to classify new points. When a random forest is asked to classify a new point, the random forest gives that point to each of the decision trees. Each of those trees reports their classification and the random forest returns the most popular classification. It’s like every tree gets a vote, and the most popular classification wins.
Review:
- A random forest is an ensemble machine learning model. It makes a classification by aggregating the classifications of many decision trees.
- Random forests are used to avoid overfitting. By aggregating the classification of multiple trees, having overfitted trees in a random forest is less impactful.
- Every decision tree in a random forest is created by using a different subset of data points from the training set. Those data points are chosen at random with replacement, which means a single data point can be chosen more than once. This process is known as bagging.
- When creating a tree in a random forest, a randomly selected subset of features are considered as candidates for the best splitting feature. If your dataset has `n` features, it is common practice to randomly select the square root of `n` features.
Documentation
In this documentation, you will learn about the parameters, attributes, and methods of the Random Forest Classifier within Scikit-Learn. This is helpful if you would like to use multiple decision trees to classify and analyze data.
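A minimal scikit-learn sketch of a 100-tree forest on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

forest = RandomForestClassifier(n_estimators=100)   # 100 bagged decision trees vote together
forest.fit(x_train, y_train)

print(forest.score(x_test, y_test))
print(forest.feature_importances_)   # how much each feature contributed to the splits
```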
Details
Clustering
Clustering is the most well-known unsupervised learning technique. It finds structure in unlabeled data by identifying similar groups, or clusters. Examples of clustering applications are:
- Recommendation engines: group products to personalize the user experience
- Search engines: group news topics and search results
- Market segmentation: group customers based on geography, demography, and behaviors
- Image segmentation: medical imaging or road scene segmentation on self-driving cars
K-Means Clustering
K-Means is the most popular and well-known clustering algorithm, and its name captures the two ideas behind it:
- The “K” refers to the number of clusters (groups) we expect to find in a dataset.
- The “Means” refers to the average distance of data to each cluster center, also known as the centroid, which we are trying to minimize.
It is an iterative approach:
- Place `k` random centroids for the initial clusters.
- Assign data samples to the nearest centroid.
- Update centroids based on the above-assigned data samples.
Repeat Steps 2 and 3 until convergence (when points don’t move between clusters and centroids stabilize). Once we are happy with our clusters, we can take a new unlabeled datapoint and quickly assign it to the appropriate cluster.
- Iris Dataset
- Visualize Before K-Means
- Implementing K-Means: Step 1
- Implementing K-Means: Step 2
- Implementing K-Means: Step 3
- Implementing K-Means: Step 4
- Implementing K-Means: Scikit-Learn
- New Data?
- Visualize After K-Means
- Evaluation
- The Number of Clusters
K-Means++ Clustering
To recap, Step 1 of the K-Means algorithm is “Place `k` random centroids for the initial clusters”.
The K-Means++ algorithm replaces Step 1 of the K-Means algorithm and adds the following:
- The first cluster centroid is randomly picked from the data points.
- For each remaining data point, the distance from the point to its nearest cluster centroid is calculated.
- The next cluster centroid is picked according to a probability proportional to the distance of each point to its nearest cluster centroid. This makes it likely for the next cluster centroid to be far away from the already initialized centroids.
Repeat steps 2 and 3 until `k` centroids are chosen.
K-Means++ improves K-Means by placing initial centroids more strategically. As a result, it can result in more optimal clusterings than K-Means.
It can also outperform K-Means in speed. If you get very unlucky initial centroids using K-Means, the algorithm can take a long time to converge. K-Means++ will often converge quicker!
Documentation
→ Scikit-Learn Documentation on Kmeans Clustering
In this documentation, you will learn scikit-Learn’s methods for k-means clustering. This is helpful if you wish to partition observations into clusters using the nearest mean.
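A minimal scikit-learn sketch on the iris measurements used in the lesson (`n_init` is passed explicitly since its default has changed across scikit-learn versions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

samples = load_iris().data

model = KMeans(n_clusters=3, init="k-means++", n_init=10)
model.fit(samples)

print(model.labels_[:10])           # cluster assigned to each sample
print(model.cluster_centers_)       # the final centroids
print(model.predict(samples[:2]))   # assign new points to the nearest centroid
```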
What is Hierarchical Clustering
→ What is Hierarchical Clustering
In this article, you will learn about hierarchical cluster analysis. This is helpful if you want to divide the data into clusters, or groups of similar items.
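A minimal SciPy sketch of agglomerative (bottom-up) hierarchical clustering, again using the iris measurements as a stand-in dataset:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import load_iris

samples = load_iris().data

merged = linkage(samples, method="ward")               # iteratively merge the closest clusters
labels = fcluster(merged, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels[:10])

dendrogram(merged)                                     # visualize the merge hierarchy
plt.show()
```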
Details
What Is Deep Learning?
First, let’s focus on learning. If you have come across machine learning before, you might be familiar with the concept of a learning model. Learning describes the process by which models analyze data and find patterns. Our machine learning algorithm learns from these patterns to find the best representation of this data, which it then uses to make predictions about new data that it has never seen before.
Deep learning is a subfield of machine learning, and the concept of learning is pretty much the same.
- We create our model carefully
- Throw relevant data at it
- Train it on this data
- Have it make predictions for data it has never seen
Deep learning models are used with many different types of data, such as text, images, audio, and more, making them applicable to many different domains.
The deep part of deep learning refers to the numerous “layers” that transform data. This architecture mimics the structure of the brain, where each successive layer attempts to learn progressively complex patterns from the data fed into the model. This may seem a bit abstract, so let’s look at a concrete example, such as facial recognition. With facial recognition, a deep learning model takes in a photo as an input, and numerous layers perform specific steps to identify whose face is in the picture. The steps taken by each layer might be the following:
- Find the face within the image using edge detection.
- Analyze the facial features (eyes, nose, mouth, etc.).
- Compare against faces within a repository.
- Output a prediction!
Notice that without large amounts of data, deep learning models are no more powerful (and maybe even less accurate) than less complex learning models. However, with large amounts of data, deep learning models can improve performance to the point that they outperform humans in tasks such as classifying objects or faces in images or driving. Deep learning is fundamentally a feature learning system (also known as representation learning). It learns from raw data without human intervention. Hence, given massive amounts of data, a deep learning system will perform better than traditional machine learning systems that rely on feature extractions from developers.
What are Neural Networks?
A neural network is a programming model inspired by the human brain. Let’s explore how it came into existence.
Computers have been designed to excel at number-crunching tasks, something that most humans find terrifying. On the other hand, humans are naturally wired to effortlessly recognize objects and patterns, something that computers find difficult.
This juxtaposition brought up two important questions in the 1950s:
- “How can computers be better at solving problems that humans find effortless?”
- “How can computers solve such problems in the way a human brain does?”
In 1957, Frank Rosenblatt explored the second question and invented the Perceptron algorithm that allowed an artificial neuron to simulate a biological neuron! The artificial neuron could take in an input, process it based on some rules, and fire a result. But computers had been doing this for years — what was so remarkable?
The Perceptron Algorithm used multiple artificial neurons, or perceptrons, for image recognition tasks and opened up a whole new way to solve computational problems. However, as it turns out, this wasn’t enough to solve a wide range of problems, and interest in the Perceptron Algorithm along with Neural Networks waned for many years.
But many years later, the neurons fired back.
It was found that creating multiple layers of neurons, with one layer feeding its output to the next layer as input, could process a wide range of inputs, make complex decisions, and still produce meaningful results. With some tweaks, the algorithm became known as the Multilayer Perceptron, which led to the rise of Feedforward Neural Networks.
Deep Learning Math
-
Scalars, Vectors, and Matrices
To start, let us go over a couple of topics that will be integral to understanding the mathematical operations that are present in deep learning, including how data is represented:
-
Scalars:
A scalar is a single quantity that you can think of as a number. In machine learning models, we can use scalar quantities to manipulate data, and we often modify them to improve our model’s accuracy. We can also represent data as scalar values depending on what dataset we are working with. Code example:
x = 5
-
Vectors:
Vectors are arrays of numbers. In Python, we often denote vectors as NumPy arrays. Each value in the array can be identified by its index (location within the array). Code example:
x = np.array([1,2,3])
-
Matrices:
Matrices are grids of information with rows and columns. We can index a matrix just like an array; however, when indexing on a matrix, we need two arguments: one for the row and one for the column. Code example:
x = np.array([[1,2,3],[4,5,6],[7,8,9]])
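Putting these three together, here is a minimal sketch that assumes NumPy is imported as np (the snippets above leave the import implicit):

```python
import numpy as np

# Scalar: a single quantity
x_scalar = 5

# Vector: a 1-D array; each value is identified by a single index
x_vector = np.array([1, 2, 3])
print(x_vector[0])     # 1

# Matrix: a 2-D grid; indexing needs a row argument and a column argument
x_matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(x_matrix[1, 2])  # 6 (row 1, column 2)
```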
-
-
Tensors
Scalars, vectors, and matrices are foundational objects in linear algebra. Understanding the different ways they interact with each other and can be manipulated through matrix algebra is integral before diving into deep learning. This is because the data structure we use in deep learning is called a tensor, which is a generalized form of a vector and matrix: a multidimensional array.
A tensor allows for more flexibility with the type of data you are using and how you can manipulate that data.
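As a small illustration (not from the lesson itself), a rank-3 tensor can be represented in NumPy as a 3-dimensional array:

```python
import numpy as np

# A rank-3 tensor: 2 stacked matrices, each with 2 rows and 3 columns
tensor = np.array([
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
])
print(tensor.shape)     # (2, 2, 3)
print(tensor[1, 0, 2])  # 9 (second matrix, first row, third column)
```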
-
Matrix Algebra
The following gifs walk through matrix addition, scalar multiplication, matrix multiplication, and the transpose. You can perform element-wise operations on tensors using matrix algebra as well, which you can read more about here. A short NumPy sketch of these operations follows the list below.
-
Matrix Addition:
-
Scalar Multiplication:
-
Matrix Multiplication:
This is the most complicated, so spend time understanding how it is done in the applet in the Learning Environment. Feel free to pause the animation if you would like to spend more time making sense of each slide!
-
Transpose:
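Since the gifs cannot be reproduced here, the sketch below shows the same four operations with NumPy; the matrices are made up for illustration:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix addition: corresponding entries are added together
print(A + B)   # [[ 6  8] [10 12]]

# Scalar multiplication: every entry is multiplied by the scalar
print(3 * A)   # [[ 3  6] [ 9 12]]

# Matrix multiplication: dot products of the rows of A with the columns of B
print(A @ B)   # [[19 22] [43 50]]

# Transpose: rows become columns
print(A.T)     # [[1 3] [2 4]]
```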
-
-
Neural Networks Concept Overview
Let’s take a look at the journey our inputs take inside of a neural network! By an input, we mean a data point from our dataset. Our input can have many different features, so in our input layer, each node represents a different input feature. For example, if we were working with a dataset of different types of food, some of our features might be size, shape, nutrition, etc., where the value for each of these features would be held in an input node.
Besides an input layer, our neural network has two other different types of layers:
- Hidden layers are layers that come between the input layer and the output layer. They introduce complexity into our neural network and help with the learning process. You can have as many hidden layers as you want in a neural network (including zero of them).
- The output layer is the final layer in our neural network. It produces the final result, so every neural network must have only one output layer.
Each layer in a neural network contains nodes. Nodes between each layer are connected by weights. These are the learning parameters of our neural network, determining the strength of the connection between each linked node.
-
Forward Propagation
The process we have been going through is known as forward propagation. Inputs are moved forward from the input layer through the hidden layer(s) until they reach the output layer.
Check the Apps here.
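As a rough sketch of the idea (the weights, layer sizes, and sigmoid activation below are invented for illustration and are not the course's applet):

```python
import numpy as np

def sigmoid(z):
    # Activation function: squashes a weighted sum into the range (0, 1)
    return 1 / (1 + np.exp(-z))

# One input with 3 features (e.g. size, shape, nutrition)
x = np.array([0.5, 0.1, 0.9])

# Hypothetical weights: input layer (3 nodes) -> hidden layer (4 nodes) -> output layer (1 node)
W_hidden = np.random.rand(4, 3)
W_output = np.random.rand(1, 4)

# Forward propagation: weighted sums followed by activations, layer by layer
hidden_activations = sigmoid(W_hidden @ x)
output = sigmoid(W_output @ hidden_activations)
print(output)  # the network's prediction for this input (bias terms omitted for brevity)
```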
-
Loss Functions
We have seen how we get to an output! Now, what do we do with it? When a value is outputted, we calculate its error using a loss function. Our predicted values are compared with the actual values within the training data. There are two commonly used loss calculation formulas:
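The exact formulas shown in the lesson are not reproduced here; assuming the two losses in question are the usual ones, mean squared error (for regression) and binary cross-entropy (for classification), a minimal sketch looks like this:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred):
    # Penalizes confident wrong predictions for 0/1 labels
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.6])
print(mean_squared_error(y_true, y_pred))    # ~0.075
print(binary_cross_entropy(y_true, y_pred))  # ~0.30
```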
-
Backpropagation
This all seems fine and dandy so far. However, what if our output values are inaccurate? Do we cry? Try harder next time? Well, we can do that, but the good news is that there is more to our deep learning models.
This is where backpropagation and gradient descent come into play. Forward propagation deals with feeding the input values through the hidden layers to the final output layer. Backpropagation refers to computing the gradients of the loss function with respect to the weights, and gradient descent is the algorithm that uses those gradients to continuously update and refine the weights between neurons to minimize our loss function.
By gradient, we mean the rate of change with respect to the parameters of our loss function. From this, backpropagation determines how much each weight is contributing to the error in our loss function, and gradient descent will update our weight values accordingly to decrease this error.
Check the Apps here.
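As a toy illustration of the chain rule at work (a single linear neuron with a squared-error loss; the numbers are made up), the gradient tells us how much the weight and bias contribute to the error:

```python
x, y_true = 2.0, 1.0      # a single data point and its label
w, b = 0.3, 0.0           # current weight and bias
learning_rate = 0.1

# Forward propagation
y_pred = w * x + b                     # 0.6
loss = (y_pred - y_true) ** 2          # 0.16

# Backpropagation: chain rule gives the gradient of the loss w.r.t. each parameter
dloss_dpred = 2 * (y_pred - y_true)    # -0.8
dloss_dw = dloss_dpred * x             # d(pred)/dw = x
dloss_db = dloss_dpred * 1             # d(pred)/db = 1

# Gradient descent step: move each parameter against its gradient to reduce the loss
w -= learning_rate * dloss_dw
b -= learning_rate * dloss_db
print(w, b)  # 0.46 0.08 -- the updated parameters give a smaller loss
```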
-
Gradient Descent
We have the overall process of backpropagation down! Now, let’s zoom in on what is happening during gradient descent.
If we think about the concept graphically, we want to look for the minimum point of our loss function because this will yield us the highest accuracy. If we start at a random point on our loss function, gradient descent will take “steps” in the “downhill direction” towards the negative gradient. The size of the “step” taken is dependent on our learning rate. Choosing the optimal learning rate is important because it affects both the efficiency and accuracy of our results.
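A minimal sketch of those downhill steps on a one-parameter loss (the loss function, starting point, and learning rate are all invented for illustration):

```python
# Minimize the loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3)
def gradient(w):
    return 2 * (w - 3)

w = 0.0                # start at a random point on the loss curve
learning_rate = 0.1    # the size of each "step" downhill

for step in range(100):
    w -= learning_rate * gradient(w)   # step in the direction of the negative gradient

print(w)  # approaches 3, the minimum of the loss
```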
-
Stochastic Gradient Descent
This leads us to the final point about gradient descent. In deep learning models, we are often dealing with extremely large datasets. Because of this, performing backpropagation and gradient descent calculations on all of our data may be inefficient and computationally expensive no matter what learning rate we choose.
To solve this problem, a variation of gradient descent known as Stochastic Gradient Descent (SGD) was developed. Let’s say we have 100,000 data points and 5 parameters. If we did 1,000 iterations (also known as epochs in Deep Learning), we would end up with 100,000 ⋅ 5 ⋅ 1,000 = 500,000,000 computations. We do not want our computer to do that many computations on top of the rest of the learning model; it will take forever.
This is where SGD comes to play. Instead of performing gradient descent on our entire dataset, we pick out a random data point to use at each iteration. This cuts back on computation time immensely while still yielding accurate results.
-
More Variants of Gradient Descent
Just when you thought SGD solved all our problems, even more options come into the picture!
There are also other variants of gradient descent, such as the Adam optimization algorithm and mini-batch gradient descent. Adam is an adaptive learning algorithm that finds individual learning rates for each parameter. Mini-batch gradient descent is similar to SGD except that instead of iterating on one data point at a time, we iterate on small batches of fixed size.
The Adam optimizer’s adaptive learning rate has made it a popular variant of gradient descent that is commonly used in deep learning models. Mini-batch gradient descent was developed as a trade-off between GD and SGD. Since mini-batch does not depend on just one training sample, its loss curve is much smoother and it is less affected by outliers and noisy data, making it a more stable choice than SGD.
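The practical difference between these variants is mostly how many data points feed each update. Here is a sketch on made-up data (a one-parameter linear regression) comparing SGD (batch size 1) with mini-batch gradient descent:

```python
import numpy as np

# Hypothetical dataset: 100,000 points generated from y = 2x plus noise
rng = np.random.default_rng(0)
X = rng.random(100_000)
y = 2 * X + rng.normal(scale=0.1, size=100_000)

def stochastic_descent(batch_size, steps=2000, learning_rate=0.1):
    w = 0.0
    for _ in range(steps):
        # Use a random batch instead of the whole dataset at each update
        idx = rng.integers(0, len(X), size=batch_size)
        x_b, y_b = X[idx], y[idx]
        grad = np.mean(2 * (w * x_b - y_b) * x_b)  # gradient of MSE w.r.t. w
        w -= learning_rate * grad
    return w

print(stochastic_descent(batch_size=1))   # SGD: noisy updates, one data point at a time
print(stochastic_descent(batch_size=32))  # mini-batch: smoother updates from small batches
```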
Review
This overview completes the necessary mathematical intuition you need to move forward and begin coding your own learning models! To recap all the things we have learned (so many things!!!):
- Scalars, vectors, matrices, and tensors
- A scalar is a singular quantity like a number.
- A vector is an array of numbers (scalar values).
- A matrix is a grid of information with rows and columns.
- A tensor is a multidimensional array and is a generalized version of a vector and matrix.
- Matrix Algebra
- In scalar multiplication, every entry of the matrix is multiplied by a scalar value.
- In matrix addition, corresponding matrix entries are added together.
- In matrix multiplication, the dot product between the corresponding rows of the first matrix and columns of the second matrix is calculated.
- A matrix transpose turns the rows of a matrix into columns.
- In forward propagation, data is sent through a neural network to get initial outputs and error values.
- Weights are the learning parameters of a deep learning model that determine the strength of the connection between two nodes.
- A bias node shifts the activation function either left or right to create the best fit for the given data in a deep learning model.
- Activation Functions are used in each layer of a neural network and determine whether neurons should be “fired” or not based on output from a weighted sum.
- Loss functions are used to calculate the error between the predicted values and actual values of our training set in a learning model.
- In backpropagation, the gradient of the loss function is calculated with respect to the weight parameters within a neural network.
- Gradient descent updates our weight parameters by iteratively minimizing our loss function to increase our model’s accuracy.
- Stochastic gradient descent is a variant of gradient descent, where instead of using all data points to update parameters, a random data point is selected.
- Adam optimization is a variant of SGD that allows for adaptive learning rates.
- Mini-batch gradient descent is a variant of GD that uses random batches of data to update parameters instead of a random datapoint.
Details
We’ve learned how to write queries to retrieve information from the database. Now, we are going to learn how to perform calculations using SQL.
Calculations performed on multiple rows of a table are called aggregates.
Here is a quick preview of some important aggregates that we will cover in the next exercises:
- COUNT(): count the number of rows
- SUM(): the sum of the values in a column
- MAX(): the largest value
- MIN(): the smallest value
- AVG(): the average of the values in a column
- ROUND(): round the values in the column
Aggregate functions combine multiple rows together to form a single value of more meaningful information.
- GROUP BY is a clause used with aggregate functions to combine data from one or more columns.
- HAVING limits the results of a query based on an aggregate property.
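A quick sketch of how these read together, using Python's built-in sqlite3 module and a made-up orders table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('Ada', 20.0), ('Ada', 35.5), ('Grace', 12.0), ('Grace', 80.0), ('Linus', 5.0);
""")

# GROUP BY combines the rows for each customer; HAVING filters on an aggregate property
query = """
    SELECT customer, COUNT(*) AS num_orders, ROUND(AVG(amount), 2) AS avg_amount
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 30;
"""
for row in conn.execute(query):
    print(row)   # ('Ada', 2, 27.75) and ('Grace', 2, 46.0); Linus is filtered out
```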
Aggregate
Challenge
Details
Concepts
- JOIN will combine rows from different tables if the join condition is true.
- LEFT JOIN will return every row in the left table, and if the join condition is not met, NULL values are used to fill in the columns from the right table.
- A primary key is a column that serves as a unique identifier for the rows in the table.
- A foreign key is a column that contains the primary key of another table.
- CROSS JOIN lets us combine all rows of one table with all rows of another table.
- UNION stacks one dataset on top of another.
- WITH allows us to define one or more temporary tables that can be used in the final query.
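A minimal sketch that ties several of these together with sqlite3 (the customers and orders tables are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 35.5), (12, 2, 12.0);
""")

# LEFT JOIN keeps every customer, filling NULLs where there is no matching order;
# WITH defines a temporary table (customer_totals) used by the final query
query = """
    WITH customer_totals AS (
        SELECT customers.name, SUM(orders.amount) AS total
        FROM customers
        LEFT JOIN orders ON customers.customer_id = orders.customer_id
        GROUP BY customers.name
    )
    SELECT * FROM customer_totals ORDER BY name;
"""
for row in conn.execute(query):
    print(row)   # ('Ada', 55.5), ('Grace', 12.0), ('Linus', None)
```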
Multiple Tables
- Introduction
- Combining Tables with SQL
- Inner Joins
- Left Joins
- Primary Key vs Foreign Key
- Cross Join
- Union
- With
Challenge
Details
So far we’ve learned how to retrieve information from tables and perform calculations on them. But we’ve seen that GROUP BY and SUM reduce the number of rows in your query results because they are combining or grouping rows.
Window functions, on the other hand, allow you to maintain the values of your original table while displaying grouped or summative information alongside in another column. This is why many Data Scientists and Data Engineers love to use window functions for complex data analysis.
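As a small sketch of the idea (the sales table is made up; this assumes a SQLite version with window-function support, 3.25 or later):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (month INTEGER, region TEXT, amount REAL);
    INSERT INTO sales VALUES
        (1, 'East', 100), (2, 'East', 150), (1, 'West', 80), (2, 'West', 120);
""")

# The window function adds a grouped running total while keeping every original row
query = """
    SELECT month, region, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total
    FROM sales;
"""
for row in conn.execute(query):
    print(row)   # each original row plus its region's running total
```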
- Introduction
- Window Function Syntax
- PARTITION BY
- FIRST_VALUE and LAST_VALUE
- LAG
- LEAD
- ROW_NUMBER
- RANK
- NTILE
Here’s a quick preview of the concepts that will be covered in these next six exercises:
Math operators:
- + : addition
- - : subtraction
- / : division
- * : multiplication
- % : modulo (returns the remainder)
Math functions:
- ABS(): returns the absolute value of the input expression
- CAST(): converts an expression into another data type
Date and time functions:
- DATETIME(): returns the date and time of a time string
- DATE(): returns the date portion of a time string
- TIME(): returns the time portion of a time string
- STRFTIME(): returns a formatted date
Date function modifiers:
-
Modifiers to shift the date backwards to a specified part:
'start of year'
'start of month'
'start of day'
-
Modifiers to add a specified amount to the date and time of a time string:
'+-N years'
'+-N months'
'+-N days'
'+-N hours'
'+-N minutes'
'+-N seconds'
-
Format string substitutions:
- %Y: returns the year (YYYY)
- %m: returns the month (01-12)
- %d: returns the day of month (01-31)
- %H: returns the hour (00-23)
- %M: returns the minute (00-59)
- %S: returns the second (00-59)
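A short sketch of these functions and modifiers in action (the time string is made up), again run through sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
query = """
    SELECT DATETIME('2023-06-15 14:30:45')                   AS full_datetime,
           DATE('2023-06-15 14:30:45')                       AS date_only,
           TIME('2023-06-15 14:30:45')                       AS time_only,
           DATE('2023-06-15', 'start of month', '+1 months') AS next_month_start,
           STRFTIME('%Y-%m-%d %H:%M', '2023-06-15 14:30:45') AS formatted;
"""
print(conn.execute(query).fetchone())
# ('2023-06-15 14:30:45', '2023-06-15', '14:30:45', '2023-07-01', '2023-06-15 14:30')
```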
Let’s get started!
Details
- Code Challenge 1
- Code Challenge 2
- Code Challenge 3
- Code Challenge 4
- Code Challenge 5
- Code Challenge 6
- Code Challenge 7
- Code Challenge 8
- Code Challenge 9
- Code Challenge 10
- Code Challenge 11
- Code Challenge 12
- Code Challenge 13
- Code Challenge 14
-
What is SQL?
SQL, which stands for Structured Query Language, is a programming language used to communicate with data stored in a relational database management system.
Its syntax is similar to English, making it easy to write, read, and interpret. It allows you to write queries which define the subset of data you are seeking. You can save these queries, refine them, share them, and run them on different databases.
-
What is a database?
A database is a set of data stored in a computer, where the data is structured in a way that makes it easily accessible.
-
What is a relational database?
A relational database is a type of database. It uses a structure that allows us to identify and access data in relation to other pieces of data in the database.
Data in a relational database is often organized as tables.
-
What is a RDBMS?
A relational database management system (RDBMS) is a program that allows you to create, update, and administer a relational database.
Most relational database management systems use SQL language to access the databases.
The most popular RDBMS is MySQL. Others include PostgreSQL, Oracle DB, SQL Server, and SQLite.
-
What is a table?
A table is a collection of data organized into rows and columns. Tables are sometimes referred to as “relations”.
They can have hundreds, thousands, and sometimes even millions of rows of data.
-
What is a row and column in a table?
A row is a single record of data in a table.
A column is a set of data values of a particular type.
-
What is a data type?
A data type is an attribute that specifies the type of data a column holds.
Every column of a database has a data type.
Some common data types are INTEGER, TEXT, DATE, and REAL.
-
What is a primary key and foreign key?
A primary key is a column that uniquely identifies each row of a table.
Primary keys must satisfy the following requirements: no value can be NULL, each value must be unique, and a table cannot have more than one primary key column.
For example, in a customers table, the primary key would be customer_id.
A foreign key is the primary key of one table that appears in a different table. For example, if there was also an orders table, each order could store the customer it belongs to, such that the customer_id column would be the foreign key.
-
What is the difference between ALTER and UPDATE?
The ALTER statement is used to add a new column to a table. It changes the table structure.
The UPDATE statement is used to edit a row in the table. It changes existing records in the table.
-
What is a query?
A query is a SQL statement used to retrieve information stored in a database.
They allow us to communicate with the database by asking questions and returning a result set with data relevant to the question.
-
What is a subquery?
A subquery is an internal query nested inside of an external query.
They can be nested inside of SELECT, INSERT, UPDATE, or DELETE statements.
When a subquery is present, it will be executed before the external statement is run.
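For example (the employees table and values are invented), a subquery computing the average salary runs before the outer SELECT:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary REAL);
    INSERT INTO employees VALUES ('Ada', 95000), ('Grace', 105000), ('Linus', 70000);
""")

# The inner query is evaluated first, and its result feeds the outer WHERE clause
query = """
    SELECT name, salary
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees);
"""
print(conn.execute(query).fetchall())  # [('Ada', 95000.0), ('Grace', 105000.0)]
```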
-
What are constraints?
Constraints are a set of rules used to tell the database to limit the type of data that can be stored for the columns. They tell the database to reject inserted data that does not adhere to the restriction.
They add information about how a column can be used, and are invoked after the data type for a column.
Some examples of constraints are:
- PRIMARY KEY, which uniquely identifies each row and requires each value to be unique.
- UNIQUE, which requires every value in the column to be different.
- NOT NULL, which requires columns to have a value.
- DEFAULT, which takes an additional argument which will be the assumed value for an inserted row if the new row does not specify a value for that column.
-
What is a statement?
A statement is text that the database recognizes as a valid command.
They can be used to perform tasks like altering the structure of a table, updating data, or retrieving data from a database.
The structure of a statement can vary, but each must end with a semicolon (;). The number of lines used for a statement does not matter, and it can be written as a single line or split into multiple lines to make it easier to read.
-
How do you check if a field does not have a value or has a value?
When a field has no value, it is indicated with a NULL value.
To check if a field does not have a value, you can use IS NULL as the condition: WHERE col IS NULL
To check if a field has a value, you can use IS NOT NULL as the condition: WHERE col IS NOT NULL
-
What is the difference between DISTINCT and UNIQUE?
DISTINCT is a keyword used when we want to return unique values in the output. It filters out all duplicate values in the specified column.
UNIQUE is a constraint used to ensure that all values of a column are different. It is similar to PRIMARY KEY, except that a table can have many different UNIQUE columns.
-
What are aggregate functions used for?
Aggregate functions are used to perform a calculation on one or more values and return a single value of more meaningful information.
Some aggregate functions are COUNT(), SUM(), MAX(), MIN(), AVG(), and ROUND().
-
What is a join?
A join is a way to combine rows from two or more tables, based on a related column between them.
-
What is the difference between an INNER JOIN and LEFT JOIN?
An INNER JOIN is used to include combined rows from two tables that match the ON condition. The final result does not include rows with no match for the ON condition.
A LEFT JOIN is used to keep all rows from the first table, regardless of whether there is a matching row in the second table for the ON condition.
-
What is the purpose of window functions?
Window functions are used when you want to maintain the values of your original table while displaying grouped or summative information alongside them. They are similar to aggregate functions, but do not reduce the number of rows in the result by combining or grouping them into fewer rows.
-
What are indexes and why are they needed?
Indexes are a powerful tool used in the background of a database to speed up querying, by acting as a lookup table for data.
They are needed to efficiently store data for quicker retrieval, which can be paramount to the success of large tech companies that need to process data on the scale of petabytes each day.