Also called: Wrangling (munging | transformation | manipulation) | Cleaning | Pre-processing | Feature Engineering
Data preparation is the most important part of a machine learning project, yet it is the least discussed and the most time consuming
- machine learning algorithms have expectations regarding
- data types
- scale
- probability distribution and
- relationships between input variables; data must be changed to meet these expectations.
- The challenge of data preparation is that each dataset is unique and different:
- Datasets differ in number of variables (tens, hundreds, thousands, or more),
- Types of variables (numeric, nominal, ordinal, boolean),
- Scale of variables,
- Drift in values over time etc...
Projects can differ, but the steps on the path to a good or even the best result are generally the same from project to project
- Sometimes referred to as the applied machine learning process or the data science process
Step 1: Define Problem
1.1. Gather data from problem domain
1.2. Discuss project with subject matter experts
1.3. Select those variables to be used as inputs and outputs for a predictive model
1.4. Review data that has been collected
1.5. Summarize collected data using statistical methods
1.6. Visualize collected data using plots and charts
Step 2: Data Preparation [Tasks]
Transform the collected raw data to make it more suitable for modeling
2.1. Data Cleaning
2.2. Feature Selection
2.3. Data Transformation
2.4. Feature Engineering
2.5. Dimensionality Reduction
Step 3: Evaluate Models
3.1. Select a performance metric for model evaluation
3.2. Select model evaluation procedure
3.3. Select algorithms to Evaluate
3.4. Establish a baseline to compare other models against
3.5. Use a resampling technique to split the data
* k-fold is often used
3.6. Get the most out of well-performing models by tuning
* Hyperparameters
* Combine predictive models into ensembles
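As a sketch of this step, the following compares a most-frequent-class baseline against a candidate model under 10-fold cross-validation. It assumes scikit-learn (the notes do not name a library) and uses a made-up synthetic dataset:

```python
# Minimal sketch of Step 3 with scikit-learn (assumed library): a
# DummyClassifier baseline vs. a candidate model under k-fold CV.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data stands in for a real problem-domain dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
cv = KFold(n_splits=10, shuffle=True, random_state=1)

# 3.4: baseline that predicts the most frequent class
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"),
                           X, y, cv=cv, scoring="accuracy")
# 3.3/3.5: candidate model evaluated with the same k-fold split
model = cross_val_score(LogisticRegression(max_iter=1000),
                        X, y, cv=cv, scoring="accuracy")
print("baseline %.3f vs model %.3f" % (baseline.mean(), model.mean()))
```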
Step 4: Finalize Model
- Select the best performing model
- Prepare it for use in production
2.1. Data Cleaning:
Identifying and correcting mistakes or errors in data | Data can be mistyped, corrupted, or duplicated | Messy, noisy, corrupt, or erroneous values must be addressed
- Might involve
- Removing a row or a column
- Replacing observations with new values
- Using statistics to
- Define Normal data and
- Identify Outliers
- Identifying & Removing columns which have
- Same value or
- No variance
- Identifying and Removing
- Duplicate rows of data
- Marking empty values as missing
- Imputing missing values using
- Statistics or
- A learned model
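A hedged sketch of these cleaning steps with pandas and scikit-learn; the tiny DataFrame and its column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, np.nan],   # contains a missing value
    "b": [5.0, 5.0, 5.0, 5.0],      # single value -> no variance
    "c": [0.1, 0.2, 0.2, 0.9],
})

df = df.drop_duplicates()                     # remove duplicate rows
df = df.loc[:, df.nunique(dropna=False) > 1]  # drop no-variance columns

# Impute remaining missing values with a column statistic (the mean)
imputer = SimpleImputer(strategy="mean")
clean = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(clean)
```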
2.2. Feature Selection:
Identifying those input variables that are most relevant to the target variable
Feature selection techniques are generally grouped into Supervised (use the target) and Unsupervised (do not use the target)
Supervised techniques are further divided into:
Intrinsic:
Automatically select features as part of model fitting [Trees]
Wrapper Model:
Explicitly choose the features that result in the best performing model [Recursive Feature Elimination]
Filter Model:
Score each input feature and allow a subset to be selected [Feature Importance, Statistics]
- Statistical methods such as correlation are popular for scoring input features
Which statistical measure to use depends on the variable data types:
- Categorical inputs for a classification target variable
- Numerical inputs for a classification target variable
- Numerical inputs for a regression target variable
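A sketch of a filter-based method using scikit-learn's SelectKBest, with the scoring function matched to the variable types listed above; the synthetic data stands in for a real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=1)

# Numerical inputs + classification target -> ANOVA F-statistic (f_classif).
# Categorical inputs + classification target would use chi2;
# numerical inputs + regression target would use f_regression.
fs = SelectKBest(score_func=f_classif, k=4)
X_selected = fs.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)  # (500, 10) -> (500, 4)
```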
2.3. Data Transformation:
Changing the scale, type, or distribution of variables
Numeric Data Type: Number values
- Integer: Integers with no fractional part
- Float: Floating point values
Categorical Data Type: Label values
- Ordinal: Labels with a rank ordering
- Nominal: Labels with no rank ordering
- Boolean: Values True and False
NOTE: An important consideration with data transforms is that the operations are generally performed separately on each variable
Discretization Transform:
Encode (convert) a numeric variable as an ordinal variable
Ordinal Transform:
Encode a categorical variable as an integer variable
One Hot Transform:
Encode a categorical variable as binary (boolean) variables; required for most classification tasks
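A sketch of these three encoding transforms with scikit-learn; the toy numeric and label arrays are made up:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, OrdinalEncoder

numeric = np.array([[1.2], [3.4], [5.6], [7.8]])
labels = np.array([["red"], ["green"], ["blue"], ["green"]])

# Discretization: numeric variable -> ordinal bins
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(disc.fit_transform(numeric).ravel())             # bin index per value

# Ordinal: category -> integer
print(OrdinalEncoder().fit_transform(labels).ravel())  # [2. 1. 0. 1.]

# One hot: category -> one binary column per category
print(OneHotEncoder().fit_transform(labels).toarray())
```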
If the data has a Gaussian probability distribution, it may be more useful to shift the data to a standard Gaussian with a mean of zero and a standard deviation of one
Normalization Transform:
Scale a variable to the range 0 to 1
Standardization Transform:
Scale a variable to a standard Gaussian
Power Transform:
Changes the distribution of a variable; a variable that is nearly Gaussian but skewed or shifted can be made more Gaussian
Quantile Transform:
Force a probability distribution, such as uniform or Gaussian, onto a variable with an unusual natural distribution
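A sketch of these scale and distribution transforms in scikit-learn, applied to a made-up skewed (exponential) variable:

```python
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, PowerTransformer,
                                   QuantileTransformer, StandardScaler)

rng = np.random.default_rng(1)
X = rng.exponential(scale=2.0, size=(1000, 1))  # skewed input

X_norm = MinMaxScaler().fit_transform(X)    # normalization: range [0, 1]
X_std = StandardScaler().fit_transform(X)   # standardization: mean 0, std 1
X_pow = PowerTransformer(method="yeo-johnson").fit_transform(X)  # more Gaussian
X_q = QuantileTransformer(output_distribution="normal",
                          n_quantiles=100).fit_transform(X)      # forced Gaussian

print(X_norm.min(), X_norm.max())                     # 0.0 1.0
print(round(X_std.mean(), 3), round(X_std.std(), 3))  # 0.0 1.0
```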
2.4. Feature Engineering:
Deriving new variables from available data
Techniques to reuse:
- Adding a boolean flag variable for some state
- Adding a group or global summary statistic, such as a mean
- Adding new variables for each component of a compound variable, such as a date-time
Polynomial Transform:
Create copies of numerical input variables that are raised to a power
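A sketch of these feature engineering techniques with pandas and scikit-learn; the "timestamp"/"amount" columns and the threshold of 100 are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-05 08:00", "2023-06-20 17:30"]),
    "amount": [12.0, 250.0],
})

df["is_large"] = df["amount"] > 100      # boolean flag for some state
df["amount_mean"] = df["amount"].mean()  # global summary statistic
df["year"] = df["timestamp"].dt.year     # components of a compound
df["month"] = df["timestamp"].dt.month   # date-time variable
df["hour"] = df["timestamp"].dt.hour

# Polynomial transform: copies of a numeric input raised to a power
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(df[["amount"]]))  # columns: amount, amount^2
```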
2.5. Dimensionality Reduction:
Creating a compact projection of the data into a lower-dimensional space that still preserves the most important properties of the original data
A common approach to dimensionality reduction is to use a matrix factorization technique:
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
Model-based methods:
- Linear Discriminant Analysis (LDA)
- Autoencoders
These techniques remove linear dependencies b/w input variables
Sometimes manifold learning algorithms can also be used:
- Self-Organizing Maps (SOM)
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
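A sketch of PCA (a matrix factorization technique) projecting 20 inputs onto 5 components; the synthetic dataset is made up:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=500, n_features=20, random_state=1)

# Project 20 input variables onto 5 principal components
pca = PCA(n_components=5)
X_low = pca.fit_transform(X)
print(X.shape, "->", X_low.shape)  # (500, 20) -> (500, 5)
print("variance retained: %.2f" % pca.explained_variance_ratio_.sum())
```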
A naive approach to preparing data is to apply the transform to the entire dataset before evaluating the performance of the model
This results in a problem referred to as data leakage, where knowledge of the hold-out test set leaks into the dataset used to train the model
Careful application of data preprocessing is required depending on the model evaluation scheme used, such as
- train-test split
- k-fold cross-validation
- Data preparation must be fit on the training set only in order to avoid data leakage
Problem with Naive Data Processing
This can happen when test data leaks into the training set, or when data from the future leaks into the past
For example
- Consider the case where we want to normalize data, that is, scale input variables to the range 0-1
- Normalizing the input variables requires first calculating the minimum and maximum values of each variable, then using these values to scale the variables
- The dataset is then split into train and test datasets, but the examples in the training dataset know something about the data in the test dataset; they have been scaled by the global minimum and maximum values, so they know more about the global distribution of the variables than they should
Data preprocessing with train_test_split: preparation must be fit on the training dataset only (see the sketch below)
1. Split Data
2. Fit Data Preparation on Training Dataset
3. Apply Data Preparation to Train and Test Datasets
4. Evaluate Models
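A minimal sketch of these four steps with scikit-learn; the scaler is fit on the training split only, so no test-set knowledge leaks in:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# 1. Split data first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# 2. Fit data preparation on the training dataset only
scaler = MinMaxScaler().fit(X_train)

# 3. Apply data preparation to both train and test datasets
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 4. Evaluate models on the transformed splits (model fitting omitted)
```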
Data preprocessing with k-Fold Cross Validation
Define a sequence (list) of data preparation steps to apply before fitting and evaluating the model
Each step in the list is a tuple having 2 elements:
1st element: name of the step (a string)
2nd element: configured object of the step, such as a Transform or Model
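A sketch wrapping such (name, object) steps in a scikit-learn Pipeline, so that k-fold cross-validation re-fits the preparation on the training folds only and avoids leakage; the dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Each step is a tuple: (name string, configured transform or model)
pipeline = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print("accuracy: %.3f (%.3f)" % (scores.mean(), scores.std()))
```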
- Technique to prepare data so that it avoids data leakage, which leads to incorrect model evaluation
- Technique to identify and handle problems with messy data, such as outliers and missing values
- Technique to identify and remove irrelevant and redundant input variables with feature selection methods
- Technique to know which feature selection method to choose based on the data types of the variables
- Technique to scale the range of input variables using normalization and standardization techniques
- Technique to encode categorical variables as numbers and numeric variables as categories
- Technique to transform the probability distribution of input variables
- Technique to transform a dataset with different variable types, and how to transform target variables
- Technique to project variables into a lower-dimensional space that captures the salient data relationships
Part 1: Basics
- Importance of data preparation and its techniques; best practices to use in order to avoid data leakage
Part 2: Data Cleaning
- Transform messy data into clean data by identifying outliers and handling missing values with statistical and modeling techniques
Part 3: Feature Selection
- Statistical and modeling techniques for feature selection and feature importance, with how to choose the technique to use for different variable types
Part 4: Data Transforms
- Transform variable types and variable probability distributions with a suite of standard data transform algorithms
Part 5: Advanced Transforms
- Handle some of the trickier aspects of data transforms, such as handling multiple variable types at once, transforming targets, and saving transforms after choosing a final model
Part 6: Dimensionality Reduction
- Remove input variables by projecting the data into a lower-dimensional space with dimensionality reduction algorithms