- Ensemble of Random Forest, XGBoost, CatBoost, LightGBM, and KNN models
- Stratified K-Fold with the same seed used for all the models (see the sketch after this list)
- Adding the UserId and Unnamed: 0 columns increased accuracy and F1 score
- Creations == 0 perfectly identifies class 1
- Samples with non-zero Creations had all the classes roughly equally distributed
- Trained XGBoost on the subset with non-zero Creations and hardcoded the class to 1 whenever Creations is 0
- Trained XGBoost on the entire dataset with a new binary feature, "zero creations"
- Decomposed the feature columns into 3 components using PCA
- Trained XGBoost on the entire dataset with the 3 new decomposed columns added
- Feature selection using the models' feature importances and Recursive Feature Elimination with cross-validation (RFECV)
- Out-of-fold (OOF) predictions were used, instead of the average across all the folds, to decide the weights of the ensemble. Source
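A minimal sketch of the shared cross-validation setup, assuming `X` and `y` are NumPy arrays and `model` is any scikit-learn-style classifier; the seed value is an assumption. Reusing one seed keeps the folds identical across models, so the OOF probabilities they produce can be blended later.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

SEED = 42  # assumed value; the key point is that every model reuses the same seed

def get_oof_probabilities(model, X, y, n_splits=5):
    """Collect out-of-fold class probabilities with a fixed-seed StratifiedKFold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
    oof = np.zeros((len(X), len(np.unique(y))))
    for train_idx, valid_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx] = model.predict_proba(X[valid_idx])
    return oof
```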
- RANDOM FOREST - 1
- Important features were selected using feature importances.
- Two Random Forests were combined and used together for better predictions.
- The soft probabilities from both forests were averaged to get the final probabilities (sketched below).
- Notebook
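A minimal sketch of the double random forest, assuming prepared arrays `X_train`, `y_train`, `X_test`; the hyperparameters below are illustrative placeholders, not the notebook's tuned values.

```python
from sklearn.ensemble import RandomForestClassifier

def double_rf_predict(X_train, y_train, X_test):
    """Average the soft probabilities of two differently configured forests."""
    rf1 = RandomForestClassifier(n_estimators=500, max_depth=12, random_state=1)
    rf2 = RandomForestClassifier(n_estimators=800, min_samples_leaf=5, random_state=2)
    rf1.fit(X_train, y_train)
    rf2.fit(X_train, y_train)
    # Soft-probability average, then pick the most likely class
    proba = (rf1.predict_proba(X_test) + rf2.predict_proba(X_test)) / 2
    return proba.argmax(axis=1)
```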
- RANDOM FOREST - 2
- Important features were selected using Recursive Feature Elimination (see the sketch below).
- Recursive Feature Elimination Notebook
- The User Id and Unnamed: 0 features were used for better age_group prediction.
- Notebook
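A minimal sketch of the recursive feature elimination step, assuming `X` is a pandas DataFrame of features; the estimator, fold count, and scoring metric are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

def select_features(X, y, seed=42):
    """Keep only the columns retained by cross-validated recursive elimination."""
    rfecv = RFECV(
        estimator=RandomForestClassifier(n_estimators=200, random_state=seed),
        step=1,
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=seed),
        scoring="f1_macro",  # assumed metric
    )
    rfecv.fit(X, y)
    return X.columns[rfecv.support_]
```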
- LIGHT GBM
- The User Id and Unnamed: 0 features were used for better age_group prediction.
- Tuned the number of leaves together with regularization to increase the score (see the sketch below).
- Explored a double-tree setup (class provided in the notebook), but the score decreased.
- Notebook
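A minimal sketch of the LightGBM configuration; the `num_leaves` and regularization values below are placeholders for the tuned values in the notebook.

```python
from lightgbm import LGBMClassifier

def make_lgbm(seed=42):
    """LightGBM classifier with num_leaves tuned alongside L1/L2 regularization."""
    return LGBMClassifier(
        n_estimators=1000,
        num_leaves=63,      # tuned together with the regularization terms
        reg_alpha=1.0,      # L1 regularization
        reg_lambda=1.0,     # L2 regularization
        learning_rate=0.05,
        random_state=seed,
    )
```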
- CATBOOST
- The User Id and Unnamed: 0 features were used for better age_group prediction.
- The parameters were tuned using an sklearn-based optimizer; the tuning code is provided in the notebook (a hedged sketch follows below).
- A triple-tree setup was explored, but the score decreased with the addition of User Id.
- Notebook
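A hedged sketch of how an sklearn-based search over the CatBoost parameters could look; the actual tuning code lives in the notebook, so the search space, trial count, and metric below are assumptions.

```python
from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

def tune_catboost(X, y, seed=42):
    """Search CatBoost hyperparameters with a scikit-learn randomized search."""
    param_distributions = {            # assumed search space
        "depth": [4, 6, 8, 10],
        "learning_rate": [0.01, 0.03, 0.1],
        "l2_leaf_reg": [1, 3, 5, 9],
    }
    search = RandomizedSearchCV(
        CatBoostClassifier(iterations=500, verbose=0, random_seed=seed),
        param_distributions,
        n_iter=20,
        scoring="f1_macro",
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=seed),
        random_state=seed,
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```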
- KNN CLASSIFIER
- Trained a KNN classifier on GPU using the RAPIDS library.
- Found the best number of neighbors using the elbow method (sketched below).
- Notebook
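A minimal sketch of the elbow search; cuML's `KNeighborsClassifier` mirrors the scikit-learn API, and the hold-out split and candidate range of `k` are assumptions. The chosen `k` is the point where the validation error curve flattens out.

```python
import numpy as np
from cuml.neighbors import KNeighborsClassifier  # RAPIDS cuML, GPU-backed

def elbow_search(X_train, y_train, X_valid, y_valid, k_values=range(1, 51)):
    """Record validation error for each k; the 'elbow' of the curve picks k."""
    errors = []
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        errors.append(np.mean(knn.predict(X_valid) != y_valid))
    return list(k_values), errors
```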
- DOUBLE XGBOOST - WITH MANUAL TUNING
- Removed features using feature importances and Recursive Feature Elimination.
- Used two XGBoost models together for better predictions (sketched below).
- The parameters were tuned manually.
- Notebook
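A minimal sketch of the double-XGBoost blend, analogous to the double random forest above; the two parameter sets stand in for the manually tuned ones.

```python
from xgboost import XGBClassifier

def double_xgb_predict(X_train, y_train, X_test):
    """Average the soft probabilities of two manually tuned XGBoost models."""
    xgb1 = XGBClassifier(n_estimators=600, max_depth=6, learning_rate=0.05, random_state=1)
    xgb2 = XGBClassifier(n_estimators=400, max_depth=8, learning_rate=0.03, random_state=2)
    xgb1.fit(X_train, y_train)
    xgb2.fit(X_train, y_train)
    proba = (xgb1.predict_proba(X_test) + xgb2.predict_proba(X_test)) / 2
    return proba.argmax(axis=1)
```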
- XGBOOST - BASELINE
- Removed features using Recursive Feature Elimination.
- Trained a baseline XGBoost with no parameter tuning.
- Notebook
- XGBOOST - UNNAMED
- Removed features using feature importances.
- Used the Unnamed: 0 and User Id features for better classification.
- Tuned the number of trees and the regularization parameters by hand; the results are documented as comments in the notebook.
- Notebook
- XGBOOST - UNNAMED AND NON ZERO TRAINING
- Removed features using feature importances.
- Trained only on samples with a non-zero Creations value, which resulted in faster training and better results.
- Samples with Creations equal to zero were hardcoded to age group 1 (sketched below).
- This method was discovered during EDA of the training data.
- Notebook
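A minimal sketch of the two-step scheme, assuming pandas DataFrames with a `Creations` column and a target that is already label-encoded as XGBoost expects; the function and variable names are assumptions.

```python
import numpy as np
from xgboost import XGBClassifier

ZERO_CREATIONS_CLASS = 1  # the age group observed for all zero-Creations rows in EDA

def fit_predict_nonzero(train_df, test_df, feature_cols, target_col="age_group"):
    """Train only on rows with non-zero Creations; hardcode the rest to class 1."""
    nonzero = train_df[train_df["Creations"] != 0]
    model = XGBClassifier(random_state=42)
    model.fit(nonzero[feature_cols], nonzero[target_col])

    preds = np.full(len(test_df), ZERO_CREATIONS_CLASS)
    mask = (test_df["Creations"] != 0).to_numpy()
    preds[mask] = model.predict(test_df.loc[mask, feature_cols])
    return preds
```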
- XGBOOST - UNNAMED AND BINARY FEATURES
- Removed features using feature importances and Recursive Feature Elimination.
- Created a new binary feature from the Creations column, giving the model information about the sparse nature of that column in the training data (sketched below).
- Notebook
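A minimal sketch of the binary flag; the new column name `zero_creations` is an assumption.

```python
def add_zero_creations_flag(df):
    """Flag rows where Creations is zero so the model sees the column's sparsity."""
    df = df.copy()
    df["zero_creations"] = (df["Creations"] == 0).astype(int)
    return df
```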
- XGBOOST - PCA
- Used Principal Component Analysis (PCA) to decompose the data and create new features for the model.
- The PCA features were used in addition to the original data so as not to lose information from the original columns (sketched below).
- Notebook
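A minimal sketch of appending the three PCA components to the original columns, assuming pandas DataFrames and that PCA is fit on the training frame only.

```python
import pandas as pd
from sklearn.decomposition import PCA

def add_pca_features(train_X, test_X, n_components=3, seed=42):
    """Fit PCA on train and append the components to both frames, keeping all original columns."""
    pca = PCA(n_components=n_components, random_state=seed)
    cols = [f"pca_{i}" for i in range(n_components)]
    train_pca = pd.DataFrame(pca.fit_transform(train_X), columns=cols, index=train_X.index)
    test_pca = pd.DataFrame(pca.transform(test_X), columns=cols, index=test_X.index)
    return pd.concat([train_X, train_pca], axis=1), pd.concat([test_X, test_pca], axis=1)
```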
- ENSEMBLING
- All the above models were used to generate OOF files, which were then blended using a scipy optimizer.
- Appropriate weights were found from the OOF files and used to combine the predictions (sketched below).
- Blending increased the F1 score by as much as 1.2.
- Notebook
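A minimal sketch of the blending step, assuming each OOF file is an `(n_samples, n_classes)` probability array and `y_true` is label-encoded to `0..n_classes-1`; the optimizer method and the macro F1 averaging are assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import f1_score

def find_blend_weights(oof_probas, y_true):
    """Find ensemble weights that maximize F1 on the out-of-fold predictions."""
    def neg_f1(weights):
        weights = np.abs(weights) / np.abs(weights).sum()   # positive weights summing to 1
        blended = sum(w * p for w, p in zip(weights, oof_probas))
        return -f1_score(y_true, blended.argmax(axis=1), average="macro")

    start = np.ones(len(oof_probas)) / len(oof_probas)
    result = minimize(neg_f1, start, method="Nelder-Mead")
    return np.abs(result.x) / np.abs(result.x).sum()
```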
- Neural networks couldn't cross 70. We tried both a plain neural network and a skip-connection (ResNet-style) model. (Approach)
- TabNet couldn't cross 67 and fluctuated a lot.
- Training on GPU resulted in a lower F1 score (0.3 less than on CPU). (RESOURCE)
- t-SNE decomposition was very slow (9 hours on GPU were not enough).
- Tried using KernelPCA but faced a memory limit error.
- SVM was very slow (9 hours on CPU were not enough to complete even a single fold).
- Kaggle results were not reproducible on Colab.
- The fast.ai tabular learner gave a high training loss.
- Better hyperparameter search using Optuna (our approach); a sketch follows below.
- Better feature engineering can be done.
- More deep learning approaches can be explored.
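This was not implemented; a minimal sketch of what an Optuna search over one of the XGBoost models could look like, with an assumed search space, trial count, and macro-F1 objective.

```python
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def tune_with_optuna(X, y, n_trials=50, seed=42):
    """Search XGBoost hyperparameters by maximizing cross-validated macro F1."""
    def objective(trial):
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 200, 1500),
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
        }
        model = XGBClassifier(random_state=seed, **params)
        return cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```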