Data

A page dedicated to exploratory data analysis, preparation, cleaning, pre-processing / wrangling, generation, feature engineering, and other related topics.

The question to ask ourselves: Do we know our data...?

Ethics / altruistic motives
Data Science
Datasets and sources of raw data
Data Collection
Hypothesis
Data Exploratory Analysis
Data preparation
- Data Cleaning
- Data preprocessing / Data Wrangling
Data Generation
Feature Extraction
Feature Importance
Feature Engineering
Feature Selection
Hyperparameter tuning
Post model-creation analysis, ML interpretation/explainability
Model deployment
Statistics
Visualisation
Common mistakes when training models (data related)
Presentations
Cheatsheets
Courses / books
Best practices / rules / an unordered list of high level or low level guidelines
Framework(s) / checklist(s)
Notebooks
Programs and Tools
Databases
References
Credits
Contributing

Ethics / altruistic motives

See Ethics / altruistic motives

Data Science

The Data Science Process
JustCause package/framework - framework to foster good scientific practice in the research of causality methods | PyPI | GitHub
“Metaflow is a human-friendly Python library” LinkedIn Post
5 free books for learning Python for DS
7 advanced tricks in pandas for data science
The Ultimate NumPy Tutorial for Data Science Beginners
Top 10 Data science podcast must follow for learn new things
Top 20 Youtube Channels for Data Science
Advanced Data Science from IBM
12 Steps to Production-Quality Data Science Code
Top 10 Popular GitHub Repositories to learn about Data Science
The difference between Statistics and Data Science: Big Data and Inferential Statistics
DataScience resources (in the form of a book) from Eric
Data Exploration and API First Design: Deep Learning Hands-On Series with Eric Schles
Augmented Analytics Engine
Putting an end to Unreliable Analytics by David Yaffe
The Fundamentals of end-to-end Data Strategy: video | slides | Resources | Feedback

Datasets and sources of raw data

See Datasets

Data Collection

👉 Effective Data Collection 👈
The Ultimate Guide to Effective Data Collection

Hypothesis

Correlation, causation, multicollinearity, confounding features or variables
How to approach Hypothesis Testing
Does Your Hypothesis Development Canvas Tell a Story?
A Complete Guide to Hypothesis Testing
An introduction to Statistical Inference and Hypothesis testing
A set of descriptive statistics and hypothesis tests across different types of data
The statistical analysis t-test explained for beginners and experts
"The Book of Why" by Judea Pearl: a great overview that presents many relevant techniques
Craig's presentation: Visualizing the Why - Strategy and Roadmaps in Context: [PDF](https://www.dropbox.com/s/knagl9f7u9hxvr7/Strategy%20Maps%20Agile%20Evangelists.pdf?dl=0) | [video](https://www.youtube.com/watch?v=LOjsuYzzOkA)
Correlation & Causation: The Couple That Wasn’t

Data Exploratory Analysis

See Data Exploratory Analysis

Data preparation

Data preparation
See Data preparation

Data Generation

See Data Generation

Feature Extraction

Hierarchical Feature Extraction for Compact Representation and Classification of Datasets
Guide to Feature Extraction Approaches for Text Data

Feature Importance

Example: Feature Importance implementation (python)
How to Calculate Feature Importance With Python
RFPimp:
- RF Importance
- Explaining Feature Importance by example of a Random Forest
Catboost model and W&B
LightGBM model and W&B
The 4 types of additive Feature Importances
The Math of Random Forests and Feature Importance in Scikit-learn and Spark
Path Explain - toolkit for feature attributions: GitHub | PyPI | Path Explain on MWML
Open Machine Learning Course: Feature Importance

Feature engineering

See Feature engineering

Feature Selection

See Feature Selection

Hyperparameter tuning

Ray Tune Sweeps
W&B Sweeps
Automated Machine Learning Hyperparameter Tuning in Python
Bayesian hyperparameter optimisation by Akinkunle: Original Notebook | Saved Notebook | Slides
Hyperparameter optimization for Neural Networks
Tune Hyperparameters Easily with W&B Sweeps

Post model-creation analysis, ML interpretation/explainability

Pruning: DL models
[Pruning models](https://app.wandb.ai/authors/pruning/reports/Plunging-into-Model-Pruning-in-Deep-Learning--VmlldzoxMzcyMDg)
Poor Man’s BERT: Exploring Pruning as an Alternative to Knowledge Distillation
See Post model-creation analysis, ML interpretation/explainability

Model deployment

Model Deployment Methods and Techniques - Part 1
Model Deployment Methods and Techniques - Part 2
Model Deployment Methods and Techniques - Part 3
Model Deployment Methods and Techniques - Part 4
Model Deployment Methods and Techniques - Part 5

Statistics

Mode of a Log-Normal distribution by Sahil Gupta
See Statistics.md

Visualisation

Data Visualization
See Visualisation

Common mistakes when training models (data related)

Having a lot more training examples of one type of object than the other types
Accidentally testing the neural network using images that were in the training set
Training the neural network on data that is easier to recognize or more consistent than the real-world data it will be used to classify later on
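
The first two mistakes above can be checked before training starts. The following is a minimal sketch, assuming labels are available as a plain list and each example carries a unique identifier such as a file name or content hash. The function names and the sample data are hypothetical and only illustrate the idea. The third mistake (training data that is cleaner than real-world data) needs a sample of production data to compare against, so it is not covered here.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the labels, most frequent class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.most_common()}


def imbalance_ratio(labels):
    """Return the size of the largest class divided by the size of the smallest."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def leaked_examples(train_ids, test_ids):
    """Return the identifiers that appear in both the training and the test set."""
    return set(train_ids) & set(test_ids)


# Hypothetical data: 90 "cat" examples and 10 "dog" examples.
train_labels = ["cat"] * 90 + ["dog"] * 10
print(class_balance(train_labels))
print(imbalance_ratio(train_labels))

# Hypothetical identifiers: "img_003" sits in both splits.
train_ids = ["img_001", "img_002", "img_003"]
test_ids = ["img_003", "img_004"]
print(leaked_examples(train_ids, test_ids))
```

A high imbalance ratio does not by itself mean the dataset is unusable, but it is a prompt to consider resampling, class weights, or a stratified split. Any non-empty result from `leaked_examples` means the test score is likely to be optimistic.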

Presentations

A Rubric for ML Production Readiness - by Jiameng Gao from Applied Deep Learning Meetup in Feb 2019 (Paper: https://ai.google/research/pubs/pub46555)
Introduction to Data Analysis and Cleaning presentation by Mark Bell
Do we know our data, as good as we know our tools by Jeremie Charlet and Mani Sarkar

Cheatsheets

See under Cheatsheets

Courses / books

See Courses / books

Best practices / rules / an unordered list of high level or low level guidelines

12 Best Practices for Modern Data Ingestion
- PDF
A Rubric for ML Production Readiness - by Jiameng Gao from Applied Deep Learning Meetup in Feb 2019 (Paper: https://ai.google/research/pubs/pub46555)
Rules of Machine Learning: Best Practices for ML Engineering

Framework(s) / checklist(s)

See Framework(s) / checklist(s)

Notebooks

See Notebooks

Programs and Tools

See Programs and Tools

Databases

See Databases

References

How to build a data science project from scratch
Common mistakes when carrying out machine learning and data science
A Rubric for ML Production Readiness - Breck et al. 2017 by Jiameng Gao (28 rules to follow, suggested by Google) | Original Paper by Google
Understanding Data Science Problems - template of questions to ask
eBook: How to Succeed in Data Science [deadlink]
Data Fallacies by Nabih Bawazir

Credits

Big thanks to Jeremie Charlet for his contributions to many of the resources on this page, and to everyone else who has helped build them.

Contributing

Contributions are very welcome. Please share back with the wider community (and get credited for it)!

Please have a look at the CONTRIBUTING guidelines, and also read about our licensing policy.

Back to main page (table of contents)
