50% of the bare minimum, in 5% of the time
- "Appreciate the existence of" means:
- Know that this thing exists and roughly what it does
- But it's not necessary to know how or why it works
- ~~Strikethrough~~ means:
  - This topic is not being covered on purpose, to minimize the scope of the syllabus
Foundation Module (1 week)
- Mathematics
- Basic linear algebra (vectors and matrices)
- Basic probability and statistics (including distributions and Bayes’ Theorem)
- Programming
- Markdown
- Python 3 and basic regex
- Google’s Python Class
- Official Python Tutorial 1 to 8, 9.8 to 9.10 (if you have no Python knowledge)
- Do we want to give an exemplar regex cheat sheet?
- Git
- Unix tools and scripting
- minimal SQL (a short `sqlite3` sketch follows this list)
  - `SELECT <named column> AS <renamed column>`
  - `COUNT(*)` and `COUNT(DISTINCT <column_name>)`
  - `UPDATE`
  - only 3 joins: `(INNER) JOIN`, `LEFT JOIN`, `LEFT OUTER JOIN`
  - `UNION`
  - `GROUP BY`
  - `WHERE`
  - `DATE('YYYY-mm-dd')` (or whatever is native to your SQL dialect)
  - `START TRANSACTION`, `ROLLBACK`, `COMMIT`
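A minimal sketch of the SQL constructs above, run from Python against an in-memory SQLite database; the tables and columns are made up for illustration, and note that SQLite spells `START TRANSACTION` as `BEGIN`:

```python
import sqlite3

# In-memory SQLite database; table and column names are made up for illustration.
# isolation_level=None puts the connection in autocommit mode so we can issue
# BEGIN / ROLLBACK / COMMIT ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, order_date TEXT)")
cur.execute("CREATE TABLE customers (name TEXT, country TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, "alice", 10.0, "2024-01-01"),
    (2, "alice", 25.0, "2024-01-02"),
    (3, "bob", 5.0, "2024-01-02"),
])
cur.executemany("INSERT INTO customers VALUES (?, ?)", [("alice", "SG"), ("bob", "MY")])

# SELECT ... AS, COUNT(*), COUNT(DISTINCT ...), LEFT JOIN, WHERE, GROUP BY, DATE()
for row in cur.execute("""
    SELECT o.customer                   AS buyer,
           c.country                    AS country,
           COUNT(*)                     AS n_orders,
           COUNT(DISTINCT o.order_date) AS n_active_days
    FROM orders o
    LEFT JOIN customers c ON c.name = o.customer
    WHERE DATE(o.order_date) >= DATE('2024-01-01')
    GROUP BY o.customer, c.country
"""):
    print(row)

# Transactions: SQLite's equivalent of START TRANSACTION is BEGIN.
cur.execute("BEGIN")
cur.execute("UPDATE orders SET amount = amount * 2 WHERE customer = 'bob'")
cur.execute("ROLLBACK")   # discard the change; COMMIT would keep it
conn.close()
```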
- Avery’s comments on code style:
- Pls provide type hints for core algorithmic functions (and a description, if the algo is not obvious)
- Pls give important functions and variables appropriately descriptive names
- Pls break your code into logical chunks (and at a higher level, into reusable functions / classes)
- Pls add comments to describe what each logical chunk accomplishes, if it's not immediately obvious
- Pls use a code formatter (any formatter) and static analyzer
- if you don't know what static analyzers are or what they do, just use an IDE like PyCharm and keep your code squiggle-free
- Distance Measures
- Equal-length vectors / sequences
- Manhattan / Euclidean distances
- Hamming distance
- Cosine similarity (normalized dot product)
- Pearson's correlation coefficient (its square is the r² of simple linear regression), covariance
- Variable-length strings
- Edit / Levenshtein distance (closely related to longest common subsequence)
- BLEU score (n-grams)
- Variable-size sets
- Jaccard similarity (intersection over union)
- Appreciate the existence of:
- Edit distance backtracking
- Smoothed BLEU score
- METEOR
- ROUGE
- Damerau-Levenshtein distance
- haversine distance
- Overlap coefficient
- Spearman's coefficient (monotonicity)
- Gini coefficient (for categorical sequences)
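A quick sketch of several of the distance measures above on toy data, using only `numpy` and the standard library (the Levenshtein function is a plain dynamic-programming version, not an optimized library):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 5.0])

manhattan = np.abs(a - b).sum()                                # L1 distance
euclidean = np.sqrt(((a - b) ** 2).sum())                      # L2 distance
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # normalized dot product
pearson_r = np.corrcoef(a, b)[0, 1]                            # Pearson's correlation

hamming = sum(x != y for x, y in zip("karolin", "kathrin"))    # equal-length strings only

def jaccard(s1: set, s2: set) -> float:
    """Intersection over union for variable-size sets."""
    return len(s1 & s2) / len(s1 | s2)

def levenshtein(s: str, t: str) -> int:
    """Edit distance by dynamic programming (insert / delete / substitute, all cost 1)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

print(manhattan, euclidean, cosine_sim, pearson_r, hamming)
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))   # 0.5
print(levenshtein("kitten", "sitting"))            # 3
```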
- ~~Splunk~~
- ~~Advanced regex~~
  - ~~Backreferences~~
  - ~~Positive / negative look-ahead~~
  - ~~Positive / negative look-behind~~
  - ~~Non-capturing groups~~
  - ~~Efficiency~~
  - ~~Character classes~~
- ~~How to install and set up Linux~~
- ~~How to install GPU drivers~~
- ~~How to set up your network config~~
- ~~Less-common algorithms~~
  - ~~Strings~~
    - ~~Longest common substring~~
    - ~~Sequence alignment / warping~~
    - ~~Multiple string search (e.g. Aho-Corasick)~~
    - ~~Fuzzy string matching~~
  - ~~Non-strings~~
    - ~~Recursion~~
    - ~~Graph algorithms (DFS/BFS, A-star, Bellman-Ford, MST, flows, cliques, etc.)~~
    - ~~Unusual data structures (linked lists, circular buffers, tries, etc.)~~
    - ~~Satisfiability (CSP, linear programming, etc.)~~
    - ~~Sorting algorithms~~
    - ~~Searching (BST, heaps, hash tables, Bloom filters)~~
    - ~~FFT~~
- ~~How to use Oracle's XML tables~~
Machine Learning (1 month)
- Splitting data into a training/validation set and a "sacred" test set, avoiding leakage
- (WIP) Featurization
- Handling data types
- Categorical features - usually encoded
- Unique labels / categories with insanely high cardinality are usually dropped
- One-hot encoding (for categories of reasonably low cardinality - or use -1 and 1, and 0 for missing values)
- Ordinal encoding for things like Likert scales (but be warned that how these are interpreted sometimes varies from person to person, so different people's survey responses may not be comparable - only the differences between one person's responses to different questions may be valid)
- Numerical features - sometimes quantized or normalized
- Discretization/quantization/binning (equal-frequency, equal-width, clustering, handmade lookup table, etc.)
- Appreciate the existence of: z-score standardization, l2 normalization, min-max scaling
- Log-linearizing
- Constructed features - when the base features can be made sense of in more ways
- E.g. "is this passenger a noble" or "is this passenger a domestic helper" for Titanic
- Appreciate the existence of: using proxy variables when "true" values are unavailable
- Imputation of null values
- Feature importance and selection (SHAP, permutation)
- gotchas to avoid
- negative effects of having strongly-correlated features (in some models)
- accidentally introducing time (e.g. via sequential ids, or shadows in labeled images of tanks)
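A minimal featurization sketch with `pandas`, covering imputation, one-hot encoding, ordinal encoding, and equal-frequency binning; the dataframe and its column names are made up for illustration:

```python
import pandas as pd

# Made-up raw data: one categorical column, one ordinal survey answer, one numeric column.
df = pd.DataFrame({
    "title":  ["Mr", "Mrs", "Countess", None, "Mr"],
    "likert": ["disagree", "neutral", "agree", "agree", None],
    "fare":   [7.25, 71.28, None, 8.05, 512.33],
})

# Imputation of null values (median for numeric, a sentinel category for categoricals).
df["fare"] = df["fare"].fillna(df["fare"].median())
df["title"] = df["title"].fillna("unknown")

# One-hot encoding for a low-cardinality categorical feature.
df = pd.get_dummies(df, columns=["title"], prefix="title")

# Ordinal encoding for the Likert-style answer (order chosen by hand).
likert_order = {"disagree": 1, "neutral": 2, "agree": 3}
df["likert_ord"] = df["likert"].map(likert_order)

# Equal-frequency binning (quartiles) of a numeric feature.
df["fare_bin"] = pd.qcut(df["fare"], q=4, labels=False, duplicates="drop")

print(df)
```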
- Supervised Learning
- Regression
- Linear, ridge, lasso
- Logistic
- Classification
- KNN
- Support vector machine (just a linear SVM for binary classification)
- Decision trees
- Ensemble
- Bagging (e.g. random forests)
- Boosting (e.g. using `XGBoost` or `LightGBM`)
- Stacking
- Metrics
- Confusion matrix
- Recall, precision, F1 score
- AUC & ROC curve
- Micro vs macro averaging
- How to choose a threshold
- (Convex) Optimization
- Loss function
- Bias / variance trade-off
- What is over-fitting and under-fitting
- Curse of dimensionality
- Cross-validation, hyperparameter tuning
- Appreciate the existence of:
- Regularization
- Bayesian optimization, e.g. the `hyperopt` library for Python
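A compact supervised-learning sketch, assuming `scikit-learn` is available: a train/test split on one of its bundled toy datasets, a bagging ensemble, and the metrics listed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Toy binary classification dataset; the test set is held out and touched only once.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)  # bagging ensemble
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]   # scores, used for the ROC curve / AUC
pred = (proba >= 0.5).astype(int)         # 0.5 is just the default threshold choice

print(confusion_matrix(y_test, pred))
print("precision", precision_score(y_test, pred))
print("recall   ", recall_score(y_test, pred))
print("f1       ", f1_score(y_test, pred))
print("roc auc  ", roc_auc_score(y_test, proba))
```

Note that the 0.5 threshold is only a starting point; in practice you would choose it from the ROC or precision-recall trade-off on validation data, not on the test set.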
- Unsupervised Learning
- Clustering
- K-means
- Gaussian mixture models (fitted with EM)
- Density-based
- Hierarchical
- Types of anomaly detection
- Parametric (e.g. standard deviations away from mean)
- Non-parametric (e.g. quartiles, box-and-whisker plot)
- Types of topic modelling
- LSA (or NMF; these are kind of based on factorization)
- LDA (based on generative models)
- Hierarchical / agglomerative (based on similarity measures)
- Frequent itemset mining
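A small clustering sketch with `scikit-learn`'s k-means on synthetic blobs, plus a one-line parametric anomaly check; all data here is generated, not real:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # one centroid per cluster
print(km.labels_[:10])       # cluster assignment of the first few points

# Parametric anomaly detection: flag points more than 3 standard deviations from the mean.
values = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=1000)
z = (values - values.mean()) / values.std()
print("anomalies at indices:", np.where(np.abs(z) > 3)[0])
```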
- Appreciate the existence of:
- Reinforcement Learning
- Genetic/evolutionary Algorithms
- Why they are different from supervised / unsupervised
- Neural Networks (this segment is still a work in progress)
- Backpropagation
- Deep neural networks
- Feedforward
- CNN, pooling, cross-entropy, softmax
- RNN, LSTM
- Dropout
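Since this segment is still in flux, here is only a bare-bones illustration of backpropagation: a tiny feedforward network trained on XOR in plain `numpy` (purely illustrative; real work uses an autodiff framework).

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR is not linearly separable, so one hidden layer is needed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=1.0, size=(2, 8))
b1 = np.zeros((1, 8))
W2 = rng.normal(scale=1.0, size=(8, 1))
b2 = np.zeros((1, 1))
lr = 1.0

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for step in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: gradients of the squared error, layer by layer (chain rule).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent update.
    W2 -= lr * (h.T @ d_out) / len(X)
    b2 -= lr * d_out.mean(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h) / len(X)
    b1 -= lr * d_h.mean(axis=0, keepdims=True)

print(out.round(3))   # should end up close to [[0], [1], [1], [0]]
```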
- Tutorials / Capstone Project
- use Jupyter and Python
- learn and use `pandas` (and maybe how it relates to SQL?), `scipy`, `numpy`, `sklearn`
- Visualization (charts & graphs with `seaborn`)
- Titanic (with mandatory walkthroughs)
- if you've done this before: you can choose another kaggle, or just speedrun this project
- Data exploration and data cleaning
- Dealing with missing values
- Finding and fixing outliers
- Featurization, feature importance, feature selection
- Categorical to one-hot encoding
- Binning (e.g. quartiles)
- Removing highly-correlated features to reduce dimensionality
- PCA is not a measure of feature importance
- Presentation to group covering process, outcome, and learning points
"Advanced" regressionsIsotonicHeteroskedasticKernelQuadratic / polynomialRobust regressions (e.g. repeated median regression)
Non-linear and non-binary SVMsKernelsRankingOrdinalMulticlass
Dimension reductionVC dimensionAuto-encodersPCAUMAP/PaCMAP
Other ML topicsSemi-supervised learning (e.g. LLDA)Self-organising maps (clustering)Association rulesMLE, maxentGANsExpectation maximisationWorking with unbalanced dataEmbeddingsRecommendation systems aside from itemset mining
KL / JS divergenceTime-series dataARIMAFacebook Prophet(update: use statsforecast or neuralprophet instead)
Non-convex optimization
Natural Language Processing (1 month)
- Unicode
- Joel Spolsky's article on Unicode
- see also https://tonsky.me/blog/unicode/
- `regex`'s grapheme-match pattern `\X`
- Fixing mojibake
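A small illustration of how mojibake arises (UTF-8 bytes decoded as Latin-1) and how to undo it; `ftfy` is one commonly used third-party library for this, if it is installed:

```python
# Mojibake: UTF-8 bytes read back with the wrong codec.
garbled = "café".encode("utf-8").decode("latin-1")
print(garbled)                                        # 'cafÃ©'

# Reversing the mistake by hand, when you know which two codecs were involved:
print(garbled.encode("latin-1").decode("utf-8"))      # 'café'

# ftfy guesses the repair automatically (third-party library; pip install ftfy).
import ftfy
print(ftfy.fix_text(garbled))                         # 'café'
```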
- Text Preprocessing (`NLTK`; a short sketch follows this list)
  - Tokenization and segmentation (and what is an n-gram)
- Stemming and lemmatization
- POS tagging
- Bag of words / bag of n-grams
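A short `NLTK` sketch of the preprocessing steps above; it assumes the relevant NLTK data packages have been downloaded (exact package names vary slightly across NLTK versions):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

# One-off downloads of the models/corpora this sketch needs (names differ by NLTK version).
for pkg in ["punkt", "wordnet", "omw-1.4", "averaged_perceptron_tagger"]:
    nltk.download(pkg, quiet=True)

text = "The cats were running quickly across the gardens."

tokens = nltk.word_tokenize(text)
print(tokens)

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])                     # e.g. running -> run

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])    # verb lemmas

print(nltk.pos_tag(tokens))                                  # (token, POS tag) pairs

print(list(ngrams(tokens, 2)))                               # bag of bigrams
```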
- Word Embeddings
- Word Mover’s Distance
- Please appreciate the existence of:
- Sentence / document embeddings (e.g. doc2vec, Google’s USE)
- Cross-lingual embeddings (e.g. RCSLS, LASER)
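A tiny word-embedding sketch assuming `gensim` 4.x; the toy corpus is far too small to give meaningful vectors and only shows the shape of the API:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training needs far more text).
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=200, seed=0)

print(model.wv["cat"].shape)                 # a 50-dimensional vector
print(model.wv.most_similar("cat", topn=3))  # nearest neighbours in the toy space
print(model.wv.similarity("cat", "dog"))     # cosine similarity between the two vectors
```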
- Language Modelling (Stanford NLP)
- HMMs
- N-grams
- OOV and simple smoothing methods
- Interpolation and back-off
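A minimal bigram language model with add-one (Laplace) smoothing in plain Python, to make the OOV and smoothing ideas concrete:

```python
from collections import Counter

corpus = [["<s>", "i", "like", "cats", "</s>"],
          ["<s>", "i", "like", "dogs", "</s>"],
          ["<s>", "cats", "like", "fish", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))
V = len(unigrams) + 1          # +1 reserves mass for a catch-all <unk> token

def p_bigram(w_prev: str, w: str) -> float:
    """P(w | w_prev) with add-one smoothing; unseen pairs get a small non-zero probability."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_bigram("i", "like"))        # seen bigram: relatively high
print(p_bigram("i", "fish"))        # unseen bigram: small but non-zero
print(p_bigram("i", "aardvark"))    # OOV word: same small non-zero mass
```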
- Text Classification and Sentiment Analysis (Stanford NLP)
- Naïve Bayes classifier
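A bag-of-words Naïve Bayes sentiment classifier sketch with `scikit-learn`, trained on a made-up handful of labelled sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: 1 = positive, 0 = negative.
texts = ["great movie, loved it", "terrible plot and bad acting",
         "loved the acting", "bad movie, waste of time"]
labels = [1, 0, 1, 0]

# Bag of words / bag of n-grams -> multinomial Naive Bayes.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["loved the plot", "what a waste"]))   # expected: [1, 0]
print(clf.predict_proba(["loved the plot"]))             # class probabilities
```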
- Named Entity Recognition (Stanford NLP)
- As a sequence labelling problem
- Greedy
- Viterbi
- Beam
- Appreciate the existence of:
- CRFs, which are like super-powered HMMs for sequence labelling
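A compact Viterbi decoder for an HMM-style tagger in plain Python; the tag set, transition, and emission probabilities are made up for illustration:

```python
import math

states = ["O", "PER"]                       # made-up tag set
start = {"O": 0.8, "PER": 0.2}
trans = {"O": {"O": 0.8, "PER": 0.2}, "PER": {"O": 0.6, "PER": 0.4}}
emit = {"O":   {"mr": 0.1, "smith": 0.1, "works": 0.4, "here": 0.4},
        "PER": {"mr": 0.5, "smith": 0.4, "works": 0.05, "here": 0.05}}

def viterbi(words):
    """Return the most probable tag sequence under the HMM (log-space to avoid underflow)."""
    V = [{s: (math.log(start[s]) + math.log(emit[s][words[0]]), [s]) for s in states}]
    for w in words[1:]:
        step = {}
        for s in states:
            # Best previous state, keeping the whole best path so far (backtracking).
            best_prev = max(states, key=lambda p: V[-1][p][0] + math.log(trans[p][s]))
            score = V[-1][best_prev][0] + math.log(trans[best_prev][s]) + math.log(emit[s][w])
            step[s] = (score, V[-1][best_prev][1] + [s])
        V.append(step)
    return max(V[-1].values(), key=lambda x: x[0])[1]

print(viterbi(["mr", "smith", "works", "here"]))   # e.g. ['PER', 'PER', 'O', 'O']
```

Greedy decoding would instead pick the single best tag at each position; Viterbi searches over whole sequences, and beam search sits in between.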
- Information Retrieval and Ranked IR (Stanford NLP)
- Inverted index
- TF-IDF
- Appreciate the existence of:
- Locality sensitive hashing (e.g. minhash vectors)
- Variants of TF-IDF (SMART IR system, Okapi BM25)
- approximate nearest neighbors, e.g. `ANNOY` or `FAISS` (for indexing vectors)
- geohashing (only for 2D space)
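A ranked-retrieval sketch with `scikit-learn`: TF-IDF vectors plus cosine similarity over a made-up four-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "dogs and cats make good pets",
        "stock prices fell sharply today",
        "the dog chased the cat"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)          # sparse TF-IDF matrix, one row per doc

query_vector = vectorizer.transform(["cat and dog"])
scores = cosine_similarity(query_vector, doc_vectors)[0]

# Rank documents by similarity to the query, best first.
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```

Variants such as BM25 mainly change how term frequency and document length are weighted; the retrieve-and-rank structure stays the same.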
- Neural Networks for NLP (optional)
- Transformers & Muppet models
- Capstone Project
- Kaggle Avito (or new project?)
- MUST use word vectors
- MUST also use TF-IDF or edit / Levenshtein distance
- Images not necessary
- Presentation to group as usual
- ~~Linguistics~~
  - ~~Grammar and syntax parsing (CFGs, CYK)~~
  - ~~Semantics~~
  - ~~Polysemy / metonymy (word sense disambiguation)~~
  - ~~Semiotics (e.g. emoji)~~
  - ~~Morphology~~
  - ~~Ontology~~
- ~~Machine translation~~
  - ~~IBM models~~
  - ~~Back-translation~~
  - ~~Syntax-based SMT~~
  - ~~Corpus alignment / crawling / creation~~
  - ~~Word vector alignment (e.g. Procrustes)~~
- ~~Hyperbolic / elliptic embeddings~~
- ~~Information extraction / relation extraction~~
- ~~Question answering~~
- ~~Summarization / simplification~~
- ~~How do CRFs work (used for NER tagging)~~
- ~~Byte pair encoding and subword tokenization (e.g. sentencepiece)~~
- ~~Spelling / grammar correction~~
Automatic Speech Recognition (1 month)
- Fundamentals of Speech Recognition (Stanford CS224S Lecture 1)
- Language Modelling (Edinburgh)
- Lexicon and language model
- Acoustic Modelling and Decoding
- HMM, forward-backward, and Viterbi (Stanford CS224S Lecture 3)
- Word error rate, training, and advanced decoding (Stanford CS224S Lecture 4)
- How to weight the AM and LM
- Viterbi beam decoding
- Multi-pass decoding e.g. N-best lists, lattices, word graphs, meshes / confusion networks
- Finite state methods
- GMM acoustic modelling and feature extraction (Stanford CS224S Lecture 5)
- Neural network acoustic models in speech recognition (Stanford CS224S Lecture 7)
- End-to-End Neural Network Speech Recognition (Stanford CS224S Lecture 8) (optional)
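Word error rate is just word-level edit distance divided by the number of reference words; a plain-Python sketch (real evaluations typically use a toolkit such as Kaldi's scoring scripts):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard edit-distance DP, over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat mat"))   # 2 deletions / 6 words
```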
- Kaldi Tutorial
- SAGE Tutorial (this will be imported into Confluence because it can’t be put on GitHub)