To train new data scientists up to a bare-minimum level of competency
- Minimal mathematical foundation knowledge
- Some programming background
- Ability and willingness to learn independently with limited guidance
- Install Anaconda Python 3
- Jupyter Notebook Extensions (optional)
S/N | Domain | Estimated Duration |
---|---|---|
1 | Foundation | 2 weeks |
2 | Machine Learning | 2 months |
3 | Natural Language Processing | 1 month |
4 | Automatic Speech Recognition | 1 month |
When you want to take a break from lectures 😊
- asking the right questions
- see also the data literacy project scoping guide
- look into elicitation of requirements or ideas from your users
- find ways to convert business information needs into data questions
- data acquisition
- maybe 'just get some labelled training data' doesn't seem like it even warrants a mention
- trust me, it does (unless it's an open source dataset)
- ETL
- (part of 'wrangling' or 'munging')
- transforming data between formats (csv, json, html/xml, pipe-delimited, ...)
- fixing broken formats (such as unquoted csv ಠ_ಠ)
- minimal data edits for now, at least until you understand the data
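As a sketch of the format-transformation step, the standard library alone goes a long way. The column names and sample rows below are made up for illustration:

```python
import csv
import io
import json

# A pipe-delimited export, as it might arrive from an upstream system
# (hypothetical sample data)
raw = "name|score\nalice|3\nbob|7\n"

# csv.DictReader accepts any single-character delimiter, not just commas
rows = list(csv.DictReader(io.StringIO(raw), delimiter="|"))

# Re-serialize as JSON; values stay strings - defer type conversion
# until you actually understand the data
as_json = json.dumps(rows)
print(as_json)  # [{"name": "alice", "score": "3"}, {"name": "bob", "score": "7"}]
```

Note the values are left as strings on purpose, in line with "minimal data edits for now".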
- exploration
- data provenance, background research, and literature review
- this often happens concurrently with EDA
- understand where the data came from
- what the labels mean, how accurate they are, and whether there's class imbalance
- any pre-processing that was done that can't be undone
- find out what has been tried before that did or didn't work
- make a list of things you think might work for your dataset
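Checking for class imbalance during exploration can be as simple as counting labels. The labels below are a made-up example:

```python
from collections import Counter

# Hypothetical label column from a binary classification dataset
labels = ["spam", "ham", "ham", "ham", "ham", "spam", "ham", "ham", "ham", "ham"]

counts = Counter(labels)
majority_share = max(counts.values()) / len(labels)
print(counts)          # Counter({'ham': 8, 'spam': 2})
print(majority_share)  # 0.8 - a 'predict the majority class' baseline already gets 80%
```

If the majority share is high, plain accuracy will look deceptively good later, so pick your metric accordingly.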
- cleaning
- (part of 'wrangling' or 'munging')
- outliers / anomalies (eg huge spike in data)
- impute missing values
- remove noise
- handle Unicode: `ftfy.fix_text`, `bs4.UnicodeDammit.detwingle` (you can also use bs4 to decode html entities (possibly recursively)), `grapheme` or the `\X` pattern in `regex`
- data version control, to track how the data was cleaned
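A stdlib-only sketch of imputation plus outlier flagging (the sensor readings are invented, and the 1.5×IQR fence is just one common convention, not the only option):

```python
import statistics

# Hypothetical sensor readings; None marks missing values, 990.0 is a spike
readings = [12.1, 11.8, None, 12.4, 990.0, 11.9, None, 12.0]

observed = [x for x in readings if x is not None]
median = statistics.median(observed)

# Impute missing values with the median (robust to the spike, unlike the mean)
imputed = [median if x is None else x for x in readings]

# Flag outliers with a simple 1.5*IQR fence
q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in imputed if not lo <= x <= hi]
print(outliers)  # [990.0]
```

Whether to drop, cap, or keep a flagged value is a judgment call that depends on what you learned during exploration.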
- baseline / POC
- train/test split (+ optional cross-validation)
- linear least squares / logistic regression
- xgboost
- if you're getting abysmal performance, maybe the data is still borked - garbage in, garbage out
- or maybe it's impossible, and you should just give up
- if you're getting suspiciously good performance (especially if you get 100%) something is probably wrong
- somehow leaking your label, maybe via an extremely correlated variable
- e.g. you've randomly split a timeseries into train/test sets that overlap in time (as opposed to training and test sets that are respectively before and after some date)
- or maybe your problem is too easy, and you just need some rules or heuristics
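The timeseries leakage point above can be made concrete with a few lines: split on time, not at random. The records here are synthetic:

```python
# Hypothetical daily records, already sorted by date
records = [{"day": d, "y": d % 2} for d in range(10)]

# For time series, split by time rather than randomly: everything before the
# cutoff is train, everything after is test - no overlap, no peeking ahead
cutoff = int(len(records) * 0.8)
train, test = records[:cutoff], records[cutoff:]

assert max(r["day"] for r in train) < min(r["day"] for r in test)
print(len(train), len(test))  # 8 2
```

A random `shuffle` before this split would let the model "see the future", which is exactly the leak described above.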
- featurization (NLP usually happens here)
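A minimal example of NLP featurization, a bag-of-words counter (toy tokenizer; real pipelines would also strip punctuation, handle stopwords, etc.):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Map one document to token counts; lowercasing so 'The' and 'the' merge
    return Counter(text.lower().split())

features = bag_of_words("The cat sat on the mat")
print(features)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```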
- training (ML usually happens here)
- what is the value in your data? -> ML must be either actionable or informative (or both)
- predictive models (regressions)
- descriptive models (classifications)
- prescriptive models (recommendations)
- associative models (clustering)
- (this list is not comprehensive - e.g. there are also generative models)
- feature selection
- duplicate features: high correlation / covariance
- useless features: low / no variance
- null features: mostly missing values, with only a small percent of real data
- xgboost feature importance
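To spot duplicate features via correlation, a hand-rolled Pearson coefficient is enough for small data (the two columns below are fabricated, with `y` deliberately a rescaled copy of `x`):

```python
import statistics

# Two hypothetical feature columns; y = 2*x + 0.1, i.e. a duplicate feature
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.1, 6.1, 8.1, 10.1]

def pearson(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a ** 0.5 * var_b ** 0.5)

r = pearson(x, y)
print(round(r, 6))  # 1.0 - perfectly correlated, so one of the two can be dropped
```

In practice you'd compute the full correlation matrix (e.g. `pandas.DataFrame.corr`) rather than one pair at a time.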
- model selection
- hyperparameter optimization
- dimension reduction
- stacking/bagging/boosting
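Hyperparameter optimization in its simplest form is a grid search. This sketch uses a made-up `score` function standing in for "train a model and evaluate it on a validation set":

```python
import itertools

# A toy grid; real grids would hold your model's actual hyperparameters
grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}

def score(max_depth, learning_rate):
    # Hypothetical validation score; pretend depth 4 with lr 0.1 is best
    return -abs(max_depth - 4) - abs(learning_rate - 0.1)

# Try every combination, keep the best-scoring one
best = max(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=lambda params: score(**params),
)
print(best)  # {'max_depth': 4, 'learning_rate': 0.1}
```

Grid search is exhaustive and expensive; random search or Bayesian optimization (e.g. `optuna`) usually scales better.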
- testing (measuring performance)
- as close to real data as possible
- debugging
- try to find edge cases / figure out where your model / algorithm breaks down
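Measuring performance beyond plain accuracy is mostly bookkeeping over the confusion matrix. The labels and predictions below are invented:

```python
# Hypothetical true labels and model predictions for a binary classifier
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)  # of the things we flagged, how many were right
recall = tp / (tp + fn)     # of the things we should have flagged, how many did we catch
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```

`sklearn.metrics` does all of this for you, but knowing what the numbers mean helps when debugging edge cases.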
- visualization (of results)
- also known as 'storytelling'
- `seaborn` or even the most basic `matplotlib.pyplot`
- inference and explanations
- if you've gotten this far, congrats
- try something like SHAP
- deploying / sharing your model
- developing a clean API
- best practices
- documentation
- linting and type-checking your code (e.g. `flake8` and maybe `mypy`)
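A sketch of what "a clean API" can mean at the function level: typed, documented, and validating its inputs (the function and its parameters are hypothetical; type hints like `list[float]` need Python 3.9+):

```python
import math

def predict_proba(features: list[float], weights: list[float], bias: float = 0.0) -> float:
    """Return the positive-class probability for one example.

    A clean API is typed, documented, and validates its inputs so that
    callers find out about mistakes immediately.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # logistic / sigmoid

print(predict_proba([1.0, 2.0], [0.0, 0.0]))  # 0.5
```

The same habits carry over when you wrap this in a web framework for deployment.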
- building a UI
- UX is more of an art than a science, many books have been written, none of them cover everything you need to know
- but here's a TL;DR for UI: let the user get their thing done as fast as possible, with the fewest possible ways to get it wrong or misunderstand what happened, and with minimal interaction per transaction; ideally they shouldn't need to read any instructions or even think about the process. Also, don't make it ugly or irritating to use a thousand times in a row (because they probably will have to)
- monitoring: collecting usage stats / telemetry
- inevitably, management will ask how many people are using your thing
- you can't really answer "no clue" and still expect to get your bonus
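The simplest possible answer to "how many people use this?" is a counter per endpoint per day. This in-process `Counter` is a hypothetical stand-in; in production you'd write to a log, database, or metrics service:

```python
from collections import Counter
from datetime import date

usage = Counter()

def track(endpoint: str) -> None:
    # Tally one hit per (day, endpoint); call this from your request handler
    usage[(date.today().isoformat(), endpoint)] += 1

track("/predict")
track("/predict")
track("/health")

today = date.today().isoformat()
print(usage[(today, "/predict")])  # 2
```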
- making it faster with better algorithms (do this last, don't prematurely optimize unless it's really too slow)
- think about time / space complexity
- an inverted index for search
- using `collections.deque` (or a circular buffer) instead of `list.pop(0)`
- binary search in a sorted list (e.g. the built-in `bisect`)
- dynamic programming, memoization (e.g. the built-in `functools.lru_cache`), tail call elimination, loop unrolling
- approx nearest-neighbor lookup (`annoy`, `faiss`, Milvus)
- parallelism (`multiprocessing`), async, locks, atomicity
- A* or Dijkstra (as opposed to BFS / DFS)
- cython / numba / pypy
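Two of the cheapest wins above in one sketch: `deque.popleft()` is O(1) where `list.pop(0)` shifts every element, and `lru_cache` memoization turns an exponential recursion into a linear one:

```python
from collections import deque
from functools import lru_cache

# O(1) pop from the front, vs O(n) for list.pop(0)
queue = deque([1, 2, 3])
first = queue.popleft()

# Memoization: cache results of a pure function so each fib(n) is computed once
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(first, fib(30))  # 1 832040
```

Without the cache, `fib(30)` makes over a million recursive calls; with it, 31.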
- ml ops
- CI/CD (e.g. to push to prod)
- version control of models and data
- quality metrics, detecting drift, auto retraining
- TODO: fill up this bit
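As a placeholder for the drift-detection bullet above, here's a deliberately naive sketch: compare a live feature's mean against the training distribution, in units of training standard deviations. All numbers and the threshold are invented; real setups use proper statistical tests (e.g. KS test, population stability index):

```python
import statistics

# Hypothetical training-time and live values for one feature
train_values = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
live_values = [14.9, 15.2, 15.1, 14.8]

mu = statistics.fmean(train_values)
sigma = statistics.stdev(train_values)

# How many training std deviations has the live mean shifted?
drift_score = abs(statistics.fmean(live_values) - mu) / sigma

DRIFT_THRESHOLD = 3.0  # arbitrary cutoff for this sketch
print(drift_score > DRIFT_THRESHOLD)  # True - time to investigate (or retrain)
```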