To train new data scientists up to a bare-minimum level of competency
- Minimal mathematical foundation knowledge
- Some programming background
- Ability and willingness to learn independently with limited guidance
- Install Anaconda Python 3
- Jupyter Notebook Extensions (optional)
S/N | Domain | Estimated Duration |
---|---|---|
1 | Foundation | 2 weeks |
2 | Machine Learning | 2 months |
3 | Natural Language Processing | 1 month |
4 | Automatic Speech Recognition | 1 month |
When you want to take a break from lectures 😊
- asking the right questions
- see also the data literacy project scoping guide
- look into elicitation of requirements or ideas from your users
- find ways to convert business information needs into data questions
- data acquisition
- maybe 'just get some labelled training data' doesn't seem like it even warrants a mention
- trust me, it does (unless it's an open source dataset)
- ETL
- (part of 'wrangling' or 'munging')
- transforming data between formats (csv, json, html/xml, pipe-delimited, ...)
- fixing broken formats (such as unquoted csv ಠ_ಠ)
- minimal data edits for now, at least until you understand the data
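As a sketch of the format-transformation step, the standard library alone goes a long way. The column names and sample rows below are made up for illustration:

```python
import csv
import io
import json

# A pipe-delimited export, as it might arrive from an upstream system
# (hypothetical sample data)
raw = "name|score\nalice|3\nbob|7\n"

# csv.DictReader accepts any single-character delimiter, not just commas
rows = list(csv.DictReader(io.StringIO(raw), delimiter="|"))

# Re-serialize as JSON; values stay strings - defer type conversion
# until you actually understand the data
as_json = json.dumps(rows)
print(as_json)  # [{"name": "alice", "score": "3"}, {"name": "bob", "score": "7"}]
```

Note the values are left as strings on purpose, in line with "minimal data edits for now".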
- exploration
- data provenance, background research, and literature review
- this often happens concurrently with EDA
- understand where the data came from
- what the labels mean, how accurate they are, and whether there's class imbalance
- any pre-processing that was done that can't be undone
- find out what has been tried before that did or didn't work
- make a list of things you think might work for your dataset
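Checking for class imbalance during exploration can be as simple as counting labels. The labels below are a made-up example:

```python
from collections import Counter

# Hypothetical label column from a binary classification dataset
labels = ["spam", "ham", "ham", "ham", "ham", "spam", "ham", "ham", "ham", "ham"]

counts = Counter(labels)
majority_share = max(counts.values()) / len(labels)
print(counts)          # Counter({'ham': 8, 'spam': 2})
print(majority_share)  # 0.8 - a 'predict the majority class' baseline already gets 80%
```

If the majority share is high, plain accuracy will look deceptively good later, so pick your metric accordingly.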
- cleaning
- (part of 'wrangling' or 'munging')
- outliers / anomalies (eg huge spike in data)
- impute missing values
- remove noise
- handle Unicode: `ftfy.fix_text`, `bs4.UnicodeDammit.detwingle` (you can also use bs4 to decode html entities (possibly recursively)), `grapheme` or the `\X` pattern in `regex`
- data version control, to track how the data was cleaned
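A stdlib-only sketch of imputation plus outlier flagging (the sensor readings are invented, and the 1.5×IQR fence is just one common convention, not the only option):

```python
import statistics

# Hypothetical sensor readings; None marks missing values, 990.0 is a spike
readings = [12.1, 11.8, None, 12.4, 990.0, 11.9, None, 12.0]

observed = [x for x in readings if x is not None]
median = statistics.median(observed)

# Impute missing values with the median (robust to the spike, unlike the mean)
imputed = [median if x is None else x for x in readings]

# Flag outliers with a simple 1.5*IQR fence
q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in imputed if not lo <= x <= hi]
print(outliers)  # [990.0]
```

Whether to drop, cap, or keep a flagged value is a judgment call that depends on what you learned during exploration.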
- baseline / POC
- train/test split (+ optional cross-validation)
- linear least squares / logistic regression
- xgboost
- if you're getting abysmal performance, maybe the data is still borked - garbage in, garbage out
- or maybe it's impossible, and you should just give up
- if you're getting suspiciously good performance (especially if you get 100%) something is probably wrong
- somehow leaking your label, maybe via an extremely correlated variable
- e.g. you've randomly split a timeseries into train/test sets that overlap in time (as opposed to training and test sets that are respectively before and after some date)
- or maybe your problem is too easy, and you just need some rules or heuristics
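The timeseries leakage point above can be made concrete with a few lines: split on time, not at random. The records here are synthetic:

```python
# Hypothetical daily records, already sorted by date
records = [{"day": d, "y": d % 2} for d in range(10)]

# For time series, split by time rather than randomly: everything before the
# cutoff is train, everything after is test - no overlap, no peeking ahead
cutoff = int(len(records) * 0.8)
train, test = records[:cutoff], records[cutoff:]

assert max(r["day"] for r in train) < min(r["day"] for r in test)
print(len(train), len(test))  # 8 2
```

A random `shuffle` before this split would let the model "see the future", which is exactly the leak described above.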
- featurization (NLP usually happens here)
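A minimal example of NLP featurization, a bag-of-words counter (toy tokenizer; real pipelines would also strip punctuation, handle stopwords, etc.):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Map one document to token counts; lowercasing so 'The' and 'the' merge
    return Counter(text.lower().split())

features = bag_of_words("The cat sat on the mat")
print(features)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```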
- training (ML usually happens here)
- what is the value in your data? -> ML must be either actionable or informative (or both)
- predictive models (regressions)
- descriptive models (classifications)
- prescriptive models (recommendations)
- associative models (clustering)
- (this list is not comprehensive - e.g. there are also generative models)
- feature selection
- duplicate features: high correlation / covariance
- useless features: low / no variance
- null features: mostly missing values, with only a small percent of real data
- xgboost feature importance
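To spot duplicate features via correlation, a hand-rolled Pearson coefficient is enough for small data (the two columns below are fabricated, with `y` deliberately a rescaled copy of `x`):

```python
import statistics

# Two hypothetical feature columns; y = 2*x + 0.1, i.e. a duplicate feature
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.1, 6.1, 8.1, 10.1]

def pearson(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a ** 0.5 * var_b ** 0.5)

r = pearson(x, y)
print(round(r, 6))  # 1.0 - perfectly correlated, so one of the two can be dropped
```

In practice you'd compute the full correlation matrix (e.g. `pandas.DataFrame.corr`) rather than one pair at a time.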
- model selection
- hyperparameter optimization
- dimension reduction
- stacking/bagging/boosting
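Hyperparameter optimization in its simplest form is a grid search. This sketch uses a made-up `score` function standing in for "train a model and evaluate it on a validation set":

```python
import itertools

# A toy grid; real grids would hold your model's actual hyperparameters
grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}

def score(max_depth, learning_rate):
    # Hypothetical validation score; pretend depth 4 with lr 0.1 is best
    return -abs(max_depth - 4) - abs(learning_rate - 0.1)

# Try every combination, keep the best-scoring one
best = max(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=lambda params: score(**params),
)
print(best)  # {'max_depth': 4, 'learning_rate': 0.1}
```

Grid search is exhaustive and expensive; random search or Bayesian optimization (e.g. `optuna`) usually scales better.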
- testing (measuring performance)
- as close to real data as possible
- debugging
- try to find edge cases / figure out where your model / algorithm breaks down
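Measuring performance beyond plain accuracy is mostly bookkeeping over the confusion matrix. The labels and predictions below are invented:

```python
# Hypothetical true labels and model predictions for a binary classifier
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)  # of the things we flagged, how many were right
recall = tp / (tp + fn)     # of the things we should have flagged, how many did we catch
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```

`sklearn.metrics` does all of this for you, but knowing what the numbers mean helps when debugging edge cases.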
- visualization (of results)
- also known as 'storytelling'
- `seaborn` or even the most basic `matplotlib.pyplot`
- inference and explanations
- if you've gotten this far, congrats
- try something like SHAP
- deploying / sharing your model
- developing a clean API
- best practices
- documentation
- linting and type-checking your code (e.g. `flake8` and maybe `mypy`)
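A sketch of what "a clean API" can mean at the function level: typed, documented, and validating its inputs (the function and its parameters are hypothetical; type hints like `list[float]` need Python 3.9+):

```python
import math

def predict_proba(features: list[float], weights: list[float], bias: float = 0.0) -> float:
    """Return the positive-class probability for one example.

    A clean API is typed, documented, and validates its inputs so that
    callers find out about mistakes immediately.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # logistic / sigmoid

print(predict_proba([1.0, 2.0], [0.0, 0.0]))  # 0.5
```

The same habits carry over when you wrap this in a web framework for deployment.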
- building a UI
- UX is more of an art than a science, many books have been written, none of them cover everything you need to know
- but here's a TL;DR for UI: let the user get their thing done as fast as possible, with the fewest possible ways to get it wrong or misunderstand what happened, and with minimal interaction per transaction; ideally they shouldn't need to read any instructions or even think about the process. Also, don't make it ugly or irritating to use a thousand times in a row (because they probably will have to)
- monitoring: collecting usage stats / telemetry
- inevitably, management will ask how many people are using your thing
- you can't really answer "no clue" and still expect to get your bonus
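The simplest possible answer to "how many people use this?" is a counter per endpoint per day. This in-process `Counter` is a hypothetical stand-in; in production you'd write to a log, database, or metrics service:

```python
from collections import Counter
from datetime import date

usage = Counter()

def track(endpoint: str) -> None:
    # Tally one hit per (day, endpoint); call this from your request handler
    usage[(date.today().isoformat(), endpoint)] += 1

track("/predict")
track("/predict")
track("/health")

today = date.today().isoformat()
print(usage[(today, "/predict")])  # 2
```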
- making it faster with better algorithms (do this last, don't prematurely optimize unless it's really too slow)
- think about time / space complexity
- an inverted index for search
- using `collections.deque` (or a circular buffer) instead of `list.pop(0)`
- binary search in a sorted list (e.g. the built-in `bisect`)
- dynamic programming, memoization (e.g. the built-in `functools.lru_cache`), tail call elimination, loop unrolling
- approx nearest-neighbor lookup (`annoy`, `faiss`, Milvus)
- parallelism (`multiprocessing`), async, locks, atomicity
- A* or Dijkstra (as opposed to BFS / DFS)
- cython / numba / pypy
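Two of the cheapest wins above in one sketch: `deque.popleft()` is O(1) where `list.pop(0)` shifts every element, and `lru_cache` memoization turns an exponential recursion into a linear one:

```python
from collections import deque
from functools import lru_cache

# O(1) pop from the front, vs O(n) for list.pop(0)
queue = deque([1, 2, 3])
first = queue.popleft()

# Memoization: cache results of a pure function so each fib(n) is computed once
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(first, fib(30))  # 1 832040
```

Without the cache, `fib(30)` makes over a million recursive calls; with it, 31.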
- ml ops
- CI/CD (e.g. to push to prod)
- version control of models and data
- quality metrics, detecting drift, auto retraining
- TODO: fill up this bit
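As a placeholder for the drift-detection bullet above, here's a deliberately naive sketch: compare a live feature's mean against the training distribution, in units of training standard deviations. All numbers and the threshold are invented; real setups use proper statistical tests (e.g. KS test, population stability index):

```python
import statistics

# Hypothetical training-time and live values for one feature
train_values = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
live_values = [14.9, 15.2, 15.1, 14.8]

mu = statistics.fmean(train_values)
sigma = statistics.stdev(train_values)

# How many training std deviations has the live mean shifted?
drift_score = abs(statistics.fmean(live_values) - mu) / sigma

DRIFT_THRESHOLD = 3.0  # arbitrary cutoff for this sketch
print(drift_score > DRIFT_THRESHOLD)  # True - time to investigate (or retrain)
```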