- Overview
- Installation
- Usage
- Analysis Pipeline
- Program Outputs
- Limitations
- Currently Implemented Program Features and Analyses
`df-analyze` is a command-line tool for performing AutoML on small to
medium-sized tabular datasets (less than about 200 000 samples, and less than
about 50 to 100 features). `df-analyze` attempts to automate:
- feature type inference
- feature description (e.g. univariate associations and stats)
- data cleaning (e.g. NaN handling and imputation)
- training, validation, and test splitting
- feature selection
- hyperparameter tuning
- model selection and validation
and saves all key tables and outputs from this process.
**UPDATE - September 30 2024** `df-analyze` now supports zero-shot embedding of
image and text data via the `df-embed.py` script. This allows correctly
formatted image and text datasets to be converted into tabular data that can be
handled by the standard `df-analyze` tabular prediction tools.
Significant effort has been made to make `df-analyze` robust to a wide variety
of tabular datasets. However, some significant limitations remain.
If you have any questions about:

- how to get `df-analyze` installed and working on your machine
- understanding `df-analyze` outputs
- making good choices about `df-analyze` options for your dataset
- any problems you might run into using `df-analyze`
- general issues with the approach and/or code of `df-analyze`
- contributing to `df-analyze`
- anything else that you think is relevant to the `df-analyze` software specifically (and not course-related issues or complaints, if encountering `df-analyze` as part of a university/college course)

don't be shy about asking for help in the Discussions!
For students encountering `df-analyze` through a course, see the student README
[WIP!] in this repo. The student README contains some descriptions and tips
that are helpful for those just starting to learn about the CLI, containers,
and AutoML tools, as well as some explanations and tips for running
`df-analyze` on SLURM High-Performance Computing (HPC) clusters, particularly
the Digital Research Alliance of Canada (formerly Compute Canada) clusters.
Currently, `df-analyze` is distributed as Python scripts dependent on the
contents of this repository. So, to run `df-analyze`, you will generally have
to clone this repository and either install a compatible virtual environment or
build a container. I.e., you must choose between a (hopefully
platform-agnostic) local install and a container build for a Linux-based
system.
After cloning the repo, the `local_install.sh` script can be used to install
the dependencies for `df-analyze`. You will first need to install `pyenv` (or
`pyenv-win` on Windows) for the install script to work; the script will then
compile Python 3.12.5 and create a virtual environment with the necessary
dependencies when you run:

bash local_install.sh

Then, you can activate the installed virtual environment by running
`source .venv/bin/activate` in your shell (or by running `.venv/scripts/activate`
on Windows - see also here if you run into permission issues when doing this).
You can check that the install worked by running:

python df-analyze.py --help
This install procedure should work on macOS (including Apple Silicon, e.g. M-series Macs), on most major, up-to-date Linux distributions, and on Windows under the Windows Subsystem for Linux (WSL). However, Windows users wishing to avoid the WSL should adapt the install script to their needs.
Alternatively, build the Singularity / Apptainer container and use this for
running any code that uses `df-analyze`. This should work on any Linux system
(including HPC systems / clusters like Compute Canada / DRAC).
At the moment, there is no real capacity to test `df-analyze` on Windows
machines. Nowadays, the Windows Subsystem for Linux (WSL) generally works very
well, so the local install scripts should just work there, as I have tried to
avoid most platform-specific code. If for some reason you can't use the WSL,
then there are experimental manual Windows installation instructions here, but
don't be surprised if things break when setting things up this way.
For full documentation of `df-analyze`, run:

python df-analyze.py --help

which will provide a complete description of all options. Alternatively, you
can see what the `--help` option outputs here, but keep in mind the actual
outputs of the `--help` command are less likely to be out of date.

For documentation of the embedding functionality, run:

python df-embed.py --help
Run a classification analysis on the data in `small_classifier_data.json`:
python df-analyze.py \
--df=data/small_classifier_data.json \
--outdir=./demo_results \
--mode=classify \
--target target \
--classifiers knn lgbm rf lr sgd mlp dummy \
--embed-select none linear lgbm \
--feat-select wrap filter embed
This command should work and run quite quickly on the tiny toy dataset included in the repo. It will produce a lot of terminal output.
Run a classification analysis on the data in the file `spreadsheet.xlsx`, with
configuration options and columns specifically formatted for `df-analyze`:

python df-analyze.py --spreadsheet spreadsheet.xlsx

Example of an Excel spreadsheet formatted for `df-analyze`:

Another valid Excel spreadsheet:

Example of a `.csv` spreadsheet formatted for `df-analyze`:
--outdir ./results
--target y
--mode classify
--categoricals s,x0
--classifiers knn lgbm dummy
--nan median
--norm robust
--feat-select wrap embed
s,x0,x1,x2,x3,y
male,0,0.739547,0.312496,1.129941,0
female,0,0.094421,0.817089,1.246469,1
unspecified,1,0.323189,0.008068,0.472934,0
male,2,0.570184,0.289003,1.176338,1
...
If you have not been introduced to command-line interfaces (CLIs) before, this
convention might seem a bit odd, but `df-analyze` primarily functions as a CLI
program. The logic is that CLI options (e.g. `--mode`) and their parameters or
values (e.g. the `classify` in `--mode classify`) are specified one per line in
the top section / header of the file, with spaces separating parameters (e.g.
the `knn lgbm dummy` parameters passed to the `--classifiers` option), and with
at least one empty line separating these options and parameters from the actual
tabular data.
Thus, the following is an INVALIDLY FORMATTED spreadsheet:
--outdir ./results
--target y
--mode classify
--categoricals s,x0
--classifiers knn dummy
--nan median
--norm minmax
--feat-select wrap filter none
s,x0,x1,x2,x3,y
male,0,0.739547,0.312496,1.129941,0
female,0,0.094421,0.817089,1.246469,1
unspecified,1,0.323189,0.008068,0.472934,0
male,2,0.570184,0.289003,1.176338,1
...
because no newlines (empty lines) separate the options from the data.
When spreadsheet and CLI options conflict, `df-analyze` will prefer the CLI
args. This allows a base spreadsheet to be set up, and minor analysis variants
to be performed, without requiring copies of the formatted data file. So, for
example:
python df-analyze.py --spreadsheet sheet.xlsx --outdir ./results --nan mean
python df-analyze.py --spreadsheet sheet.xlsx --outdir ./results --nan median
python df-analyze.py --spreadsheet sheet.xlsx --outdir ./results --nan impute
would run three analyses with the options in `spreadsheet.xlsx` (or default
values) but with the handling of NaN values differing for each run, regardless
of what is set for `--nan` in `spreadsheet.xlsx`. Note that the same output
directory can be specified each time, as `df-analyze` will ensure that all
results are saved to a separate subfolder (with a unique hash reflecting the
unique combination of options passed to `df-analyze`). This means data should
be overwritten only if the exact same arguments are passed twice (e.g. perhaps
if manually cleaning your data and re-running).
`df-analyze` now supports the pre-processing of image and text classification
datasets through the `df-embed.py` Python script. The CLI help can be accessed
locally by running:

python df-embed.py --help

Note that before any embedding is possible, you will need to download the
underlying embedding models once. This can be done with either of the commands:
python df-embed.py --download --modality nlp
python df-embed.py --download --modality vision
NOTE: Because these models run inference on CPU only, the memory requirements may be too high for you to efficiently embed a dataset on your local machine. While the embedding code works and is tested on modern (e.g. M-series) MacBooks (Air or Pro), it may make use of swap memory, which could be unacceptably slow for your dataset(s), depending on your machine.
However, on a Linux-based cluster (e.g. CentOS or RedHat, as on Compute
Canada), inference on CPU on a node with 128GB of RAM is quite efficient
(datasets of 200k to 300k samples should still embed in a few hours, and
smaller datasets in just a few minutes). To do this, you will need to build the
container and then make use of the `run_python_with_home.sh` script included in
this repo, paying attention to the advice to use `readlink` or `realpath` for
all references to files.
Internally, `df-analyze` uses two open-source HuggingFace zero-shot models:
SigLIP for image data, and the large variant of the multilingual E5 text
embedding models. More specifically, the models are
`intfloat/multilingual-e5-large` and `google/siglip-so400m-patch14-384`, which
produce embedding vectors of size 1024 for each input text and 1152 for each
input image, respectively. SigLIP is a significant improvement on CLIP,
especially for zero-shot classification (the main task in `df-analyze`). E5
uses an XLM-RoBERTa backbone, but is trained with a focus on producing quality
zero-shot embeddings.
Currently, `df-analyze` supports only small to medium-sized datasets
(generally, under 200 features and under 200 000 or so samples), and strongly
aims to keep compute times under 24 hours (on a typical node of the Niagara
cluster) for key operations (embedding, predictive analysis). This means any
dataset to be embedded should also generally be under about 200 000 samples.
For embedding, `df-embed.py` makes use of CPU implementations only and, to keep
data loading simple, currently requires a dataset to fit in memory, loaded from
a single, correctly-formatted `.parquet` file.
For image classification data (`python df-embed.py --modality vision`), the
file must be a two-column table with the columns named "image" and "label".
The order of the columns is not important, but the "label" column must contain
integers in {0, 1, ..., c - 1}, where `c` is the number of class labels for
your data. The exact data type is not important; however, if the table is
loaded into a Pandas DataFrame `df`, then running `df["label"].astype(np.int64)`
(assuming you have imported NumPy as `np`, as is conventional) should not alter
the meaning of the data.
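For example, a minimal sketch of pre-flight checks you might run on such a file before embedding (the file name and the exact checks here are illustrative, not part of `df-analyze`):

```python
import numpy as np
import pandas as pd

# Hypothetical sanity checks for a vision classification .parquet file
df = pd.read_parquet("images.parquet")  # illustrative file name

assert set(df.columns) == {"image", "label"}, "expected exactly 'image' and 'label' columns"

# Labels should be (or safely convert to) integers in {0, 1, ..., c - 1}
labels = df["label"].astype(np.int64)
assert (labels == df["label"]).all(), "casting to int64 changed the label values"
assert labels.min() == 0 and set(labels.unique()) == set(range(labels.max() + 1))

# The "image" column should hold raw bytes readable by PIL (see below)
assert df["image"].map(lambda raw: isinstance(raw, bytes)).all()
```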
For image regression data (very rare), the file must be a two-column table with
the columns named "image" and "target". The order of the columns is not
important, but the "target" column must contain floating point values. The
exact floating point data type is not important; however, if the table is
loaded into a Pandas DataFrame `df`, then running `df["target"].astype(float)`
should not raise any exceptions.
The "image" column must be of bytes
dtype, and must be readable by PIL
Image.open
. Internally, all we do, again assuming that the data is loaded
into a Pandas DataFrame df
, is run:
from io import BytesIO
from PIL import Image
df["image"].apply(lambda raw: Image.open(BytesIO(raw)).convert("RGB"))
to convert images to the necessary format. This means that if you load your
images using PIL Image.open
, and you have a list of image paths (and a way
to infer the target from that path, e.g. get_target(path: Path)
, then you
can convert your images to bytes through the use of io
BytesIO
objects,
and build your parquet file with just a few lines of Python:
from io import BytesIO

from pandas import DataFrame
from PIL import Image

converted = []
targets = []
for path in my_image_paths:  # your list of image file paths
    img = Image.open(path)  # PIL Image
    buf = BytesIO()
    img.save(buf, format="JPEG")
    byts = buf.getvalue()
    converted.append(byts)
    targets.append(get_target(path))  # your function for inferring the target from the path
df = DataFrame({"image": converted, "target": targets})
df.to_parquet("images.parquet")
For text classification data (`python df-embed.py --modality nlp`), the file
must be a two-column table with the columns named "text" and "label". The order
of the columns is not important, but the "label" column must contain integers
in {0, 1, ..., c - 1}, where `c` is the number of class labels for your data.
The exact data type is not important; however, if the table is loaded into a
Pandas DataFrame `df`, then running `df["label"].astype(np.int64)` (assuming
you have imported NumPy as `np`, as is conventional) should not alter the
meaning of the data.
For text regression data (e.g. sentiment analysis, rating prediction), the file
must be a two-column table with the columns named "text" and "target". The
order of the columns is not important, but the "target" column must contain
floating point values. The exact floating point data type is not important;
however, if the table is loaded into a Pandas DataFrame `df`, then running
`df["target"].astype(float)` should not raise any exceptions.
The "text" column will have "object" ("O") dtype. Assuming you have loaded
your text data into a Pandas DataFrame df
, then you can check that the
data has the correct type by running:
assert df.text.apply(lambda s: isinstance(s, str)).all()
which will raise an AssertionError if a row has an incorrect type.
In order to keep compute times reasonable, it is best for text samples to be at most a paragraph or two. I.e. the underlying model is not really intended for efficient or effective document embedding. However, this ultimately depends on the text language and it is hard to make general recommendations here.
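As with images, a correctly formatted text classification `.parquet` file can be built with a few lines of Python. This is only a sketch: the `texts` and `labels` variables stand in for your own data, and the output file name is illustrative:

```python
import pandas as pd

# Assumed inputs: a list of strings and a parallel list of integer class labels
texts: list[str] = ["a short sample", "another short sample"]
labels: list[int] = [0, 1]  # integers in {0, 1, ..., c - 1}

df = pd.DataFrame({"text": texts, "label": labels})
assert df["text"].map(lambda s: isinstance(s, str)).all()
df.to_parquet("texts.parquet")  # illustrative file name
```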
It is EXTREMELY important that you only clone `df-analyze` into `$SCRATCH`, and
do all processing there. You have a very limited amount of space, and a limited
absolute number of files, in your `$HOME` directory, and your login node will
become nearly unusable if you clone `df-analyze` there or build the container
in `$HOME`. So just immediately `cd` to `$SCRATCH` before doing any of the
below.
The container should be built on a cluster that enables the `--fakeroot`
option, or on a Linux machine where you have `sudo` privileges and the same
architecture as the cluster (most likely x86_64).
First, clone the repository to `$SCRATCH` and build the container:
cd $SCRATCH
git clone https://github.com/stfxecutables/df-analyze.git
cd df-analyze
cd $SCRATCH/df-analyze/containers
./build_container_cc.sh
This will spam a lot of text to the terminal, but what you want to see at the end is a message very similar to:
==================================================================
Container built successfully. Built container located at:
/scratch/df-analyze/df_analyze.sif
==================================================================
If you don't see this, or if somehow you see this message but there is no
`df_analyze.sif` in the project root, then the complete container build log
will be located in `df-analyze/containers/build.txt`. This `build.txt` file
should be included with any bug reports or with any issues encountered when
building the container.
You can perform a final additional sanity test of the container build by then running the commands:
cd $SCRATCH/df-analyze/containers
./check_install.sh
You should see some output like:
Running script from: /scratch/[...]/df-analyze
Using Python 3.12.5
df-analyze 3.3.0
though of course the final version number will depend on which release you have installed. Otherwise, there will be an error message and other information.
If the Singularity container `df_analyze.sif` is available in the project root,
then it can be used to run arbitrary Python scripts with the helper script
included in the repo. E.g.:
cd $SCRATCH/df-analyze
./run_python_with_home.sh "$(realpath my_script.py)"
HOWEVER, this will frequently cause errors about files not being found. This
has to do with aliasing and the complex file systems on Compute Canada, and how
these interact with path-mounting in Apptainer, but the solution is to ALWAYS
WRAP PATHS WITH THE `realpath` COMMAND. E.g.:
./run_python_with_home.sh df-embed.py \
--modality vision \
--data "$(realpath my_images.parquet)" \
--out "$(realpath embedded.parquet)"
This should be done whether running a command on a login node or writing a job script to submit to the SLURM scheduler.
The overall data preparation and analysis process comprises seven steps (some optional):
- Feature Type and Cardinality Inference (Data Inspection)
- Data Preparation and Preprocessing
- Data Splitting
- Univariate Feature Analyses
- Feature Selection (optional)
- Hyperparameter tuning
- Final validation and analyses
In pseudocode (which closely approximates the code in the `main()` function of
`df-analyze.py`):
options = get_options()
df = options.load_df()
inspection = inspect_data(df, options)
prepared = prepare_data(inspection)
train, test = prepared.split()
associations = target_associations(train)
predictions = univariate_predictions(train)
embed_selected = embed_select_features(train, options)
wrap_selected = wrap_select_features(train, options)
filter_selected = filter_select_features(train, associations, predictions, options)
selected = (embed_selected, wrap_selected, filter_selected)
tuned = tune_models(train, selected, options)
results = eval_tuned(test, tuned, selected, options)
Features are first checked, in order of priority, for those that cannot be used
by `df-analyze`. Unusable features are those which are:
- Constant (all values identical or identical except NaNs)
- Sequential (autocorrelated) datetime data
- Identifiers (all values unique and not continuous / floats)
Then, features are identified as one of:
- Binary
- Ordinal
- Continuous
- Categorical
based on a number of heuristics relating to the unique values and counts of these values, and the string representations of the features. These heuristics are made explicit in code here.
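As a rough illustration only (df-analyze's actual rules are more involved and live in the linked code), a cardinality-inference heuristic might look something like this:

```python
import pandas as pd

def infer_kind(s: pd.Series, ordinal_threshold: int = 20) -> str:
    """Toy heuristic for guessing a feature's cardinality. Illustrative only."""
    nonan = s.dropna()
    n_unique = nonan.nunique()
    if n_unique <= 1:
        return "constant"
    if n_unique == 2:
        return "binary"
    if nonan.map(lambda x: isinstance(x, str)).all():
        return "categorical"
    values = pd.to_numeric(nonan, errors="coerce")
    if values.isna().any():  # mixed strings and numbers
        return "categorical"
    if (values == values.round()).all() and n_unique <= ordinal_threshold:
        return "ordinal"  # a small set of integer-like values
    return "continuous"

print(infer_kind(pd.Series([0.1, 2.3, 4.5, None, 6.7])))  # "continuous"
```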
Input data is transformed so that it can be accepted by most generic ML
algorithms and/or Python data science libraries (particularly NumPy, Pandas,
scikit-learn, PyTorch, and LightGBM). This means the raw input data (the `--df`
or `--spreadsheet` argument) is represented as `(X, y)`, where `X` is a Pandas
`DataFrame` of features and the target variable is represented in `y`, a Pandas
`Series`.
- Data Loading
- Type Conversions
- NaN unification (detecting less common NaN representations)
- Data Cleaning
  - Remove samples with NaN in the target variable
  - Remove junk features (constant, timeseries, identifiers)
  - NaNs: remove, or add indicators and interpolate
  - Categorical deflation (replace undersampled classes / levels with NaN)
- Feature Encoding
  - Binary categorical encoding
    - represented as a single [0, 1] feature if no NaNs
    - a single NaN indicator feature is added if the feature is binary plus NaNs
  - One-hot encoding of categoricals (NaN = one additional class / level)
  - Ordinals treated as continuous
  - Robust normalization of continuous features (see the sketch after this list)
- Target Encoding
  - Categorical targets are deflated and label encoded to values in $[0, n]$
  - Continuous targets are robustly min-max normalized (to the middle 95% of values)
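As a rough illustration of the robust normalization mentioned in the list above (not `df-analyze`'s exact implementation), a percentile-based min-max normalization might look like:

```python
import numpy as np
import pandas as pd

def robust_minmax(x: pd.Series, lo: float = 2.5, hi: float = 97.5) -> pd.Series:
    """Scale x so the middle 95% of values lands in [0, 1]. Illustrative only."""
    p_lo, p_hi = np.nanpercentile(x, [lo, hi])
    return (x - p_lo) / (p_hi - p_lo)

x = pd.Series(np.random.default_rng(0).exponential(size=1000))
print(robust_minmax(x).describe())
```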
Categorical variables will frequently contain a large number of classes that have only a very small number of samples.
For example, a small, geographically representative survey of households (e.g. approximately 5000 samples) might contain region / municipality information. Regions or municipalities corresponding to large cities might each have over 100 samples, but small rural regions will likely be sampled less than 10 or so times each, i.e., they are undersampled. Attempting to generalize from any patterns observed in these undersampled classes is generally unwise (undersampled levels in a categorical variable are sort of the categorical equivalent of statistical noise).
In addition, leaving these undersampled levels in the data will usually significantly increase compute costs (each class of a categorical, in most encoding schemes, will increase the number of features by one), but encourage overfitting or learning of spurious (ungeneralizable) patterns. It is thus wise, usually, for both computational and generalization reasons, to exclude these classes from the categorical variable (e.g. replace with NaN, or a single "other" class).
In `df-analyze`, we automatically deflate categorical variables based on a
threshold of 20 samples, i.e. classes with less than 20 samples are remapped to
the "NaN" class. This is probably not aggressive enough for most datasets, and,
for some features and smaller datasets, perhaps overly aggressive. However, if
later feature selection is used, this selection is done on the one-hot encoded
data, and so useless classes will be excluded in a more principled way there.
The choice of 20 is thus (hopefully) somewhat conservative in the sense of not
prematurely eliminating information, most of the time.
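For illustration (not the exact `df-analyze` implementation), deflating a categorical column with a threshold of 20 samples might look like:

```python
import numpy as np
import pandas as pd

def deflate(s: pd.Series, threshold: int = 20) -> pd.Series:
    """Replace classes with fewer than `threshold` samples with NaN. Illustrative only."""
    counts = s.value_counts()
    rare = counts[counts < threshold].index
    return s.where(~s.isin(rare), other=np.nan)

rng = np.random.default_rng(0)
s = pd.Series(rng.choice(["city_a", "city_b", "tiny_village"], p=[0.49, 0.49, 0.02], size=500))
print(deflate(s).value_counts(dropna=False))
```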
As above, target categorical variables are deflated, except when a target class has less than 30 samples. This deflation arguably should be much more aggressive: when doing e.g. 5-fold analyses on a dataset with such a target variable, each test fold would be expected to be 20% of the samples, so about 6 representatives of this class. This is highly unlikely to result in reliable performance estimates for this class, and so only introduces noise to final performance metrics.
The data is split into a training set $\mathcal{D}_\text{train}$ and a holdout
test set $\mathcal{D}_\text{test}$. On $\mathcal{D}_\text{train}$, `df-analyze`
then performs:
- Univariate associations
- Univariate predictive performances
  - classification task / categorical target
    - tune an SGDClassifier (an approximation of a linear SVM and/or logistic regression)
    - report 5-fold mean accuracy, AUROC, sensitivity, and specificity
  - regression task / continuous target
    - tune an SGDRegressor (an approximation of regularized linear regression)
    - report 5-fold mean MAE, MSqE, $R^2$, percent explained variance, and median absolute error
- Use filter methods
  - Remove features with minimal univariate relation to the target
  - Keep features with the largest filter metrics
- Wrapper (stepwise) selection
- Filter selection
- Bayesian (Optuna) hyperparameter optimization with internal 5-fold validation (see the sketch after this list)
- Final k-fold of each model tuned and trained on selected features from $\mathcal{D}_\text{train}$
- Final evaluation of each trained model on $\mathcal{D}_\text{test}$
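As a rough illustration of the tuning step (not `df-analyze`'s actual objective or search spaces, which are defined in its source), Bayesian hyperparameter optimization with Optuna and internal 5-fold validation could be sketched as:

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Toy stand-in for the (already feature-selected) training split
X_train, y_train = make_classification(n_samples=500, n_features=20, random_state=0)

def objective(trial: optuna.Trial) -> float:
    # Illustrative search space only; df-analyze defines its own per-model spaces
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 8, 256, log=True),
    }
    model = LGBMClassifier(**params, verbose=-1)
    # Internal 5-fold validation on the training split only
    return float(cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean())

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```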
The output directory structure is as follows:
📁 ./my_output_directory
└── 📁 fe57fcf2445a2909e688bff847585546/
    ├── 📁 features/
    │   ├── 📁 associations/
    │   └── 📁 predictions/
    ├── 📁 inspection/
    ├── 📁 prepared/
    ├── 📁 results/
    ├── 📁 selection/
    │   ├── 📁 embed/
    │   ├── 📁 filter/
    │   └── 📁 wrapper/
    └── 📁 tuning/
The directory `./my_output_directory` is the directory specified by the
`--outdir` argument.
There are 4 main types of files:

- Markdown reports (`*.md`)
- Plaintext / CSV table files (`*.csv`)
- Compressed Parquet tables (`*.parquet`)
- Python object representations / serializations (`*.json`)
Markdown reports (`*_report.md`) should be considered the main outputs: they
include text describing certain analysis outputs, and inlined tables of key
numerical results for that portion of the analysis. The inline tables in each
Markdown report are always saved in the same directory as the report as
plaintext CSV (`*.csv`) files, and occasionally additionally as a Parquet file
(`*.parquet`). This is because CSV is inherently lossy and, to be blunt,
basically a trash format for representing tabular data. However, it is
human-readable and easy to import into common spreadsheet tools (Excel, Google
Sheets, LibreOffice Calc, etc.).
The `.json` files are largely for internal use and in general should not need
to be inspected by the end-user. However, `.json` was chosen over, e.g.,
`.pickle`, since `.json` is at least fairly human-readable, and in particular,
the `options.json` file allows for expanding main output tables with additional
columns reflecting the program options across multiple runs.
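For example, a minimal sketch of collecting results across several runs into a single table (the option keys pulled from `options.json` here are hypothetical; use whichever keys your runs actually contain):

```python
import json
from pathlib import Path

import pandas as pd

# Aggregate over all hash-named run directories inside one --outdir
outdir = Path("./my_output_directory")
tables = []
for run in outdir.iterdir():
    results = run / "results" / "final_performances.csv"
    options = run / "options.json"
    if not (results.exists() and options.exists()):
        continue
    perf = pd.read_csv(results)
    opts = json.loads(options.read_text())
    for key in ("nan", "norm", "mode"):  # hypothetical option keys
        perf[key] = opts.get(key)
    tables.append(perf)

combined = pd.concat(tables, ignore_index=True)
print(combined)
```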
The subdirectories should generally be read / inspected in the following order:
📁 inspection/
📁 prepared/
📁 features/
📁 selection/
📁 tuning/
📁 results/
📁 fe57fcf2445a2909e688bff847585546/
├── 📁 features/
│   ...
├── features_renamings.md
└── options.json
This directory is named after a unique hash of all the options used for a
particular invocation / execution of the `df-analyze` command.

This directory contains two files. The `feature_renamings.md` file indicates
which features have been renamed due to problematic characters or duplicate
feature names.
The single file `options.json` is a `.json` representation of the specific
invocation or spreadsheet options. This is to allow multiple sets of outputs
from different options to be placed automatically in the same `--outdir`
top-level directory, e.g. as mentioned above.
So, for example, running multiple option combinations with the same output directory will produce something like:
📁 ./my_output_directory
├── 📁 ecc2d425d285807275c0c6ae498a1799/
├── 📁 fe57fcf2445a2909e688bff847585546/
└── 📁 7c0797c3e6a6eebf784f33850ed96988/
📁 inspection/
├── inferred_types.csv
└── short_inspection_report.md
This contains the inferred cardinalities (e.g. continuous, ordinal, or categorical) of each feature, as well as the decision rule used for each inference. Features with ambiguous cardinalities are also coerced to some cardinality (usually ordinal, since categorical variables are often low-information and increase compute costs), and this is detailed here.
Nuisance features (timeseries features or non-categorical datetime data, unique
identifiers, constant features) are automatically removed by `df-analyze`, and
those destructive data changes are documented here. Deflated categorical
variables are documented here as well.
📁 prepared/
├── info.json
├── labels.parquet
├── preparation_report.md
├── X.parquet
├── X_cat.parquet
├── X_cont.parquet
└── y.parquet
- `preparation_report.md` - shows compute times for processing steps, and documents changes to the data shape following encoding, deflation, and dropping of target NaN values
- `labels.parquet` - a Pandas Series linking the target label encoding (integer) to the name of the encoded class
- `X.parquet` - the final encoded complete data (categoricals and numeric)
- `X_cat.parquet` - the original (unencoded) categoricals
- `X_cont.parquet` - the continuous features (normalized and NaN-imputed)
- `y.parquet` - the final encoded target variable
- `info.json` - serialization of the internal `InspectionResults` object
📁 features/
├── 📁 associations/
└── 📁 predictions/
Data for univariate analyses of all features.
📁 associations/
├── associations_report.md
├── categorical_features.csv
├── categorical_features.parquet
├── continuous_features.csv
└── continuous_features.parquet
- `associations_report.md` - tables of feature-target associations
- `categorical_features.csv` - plaintext table of categorical feature-target associations
- `categorical_features.parquet` - Parquet table of categorical feature-target associations
- `continuous_features.csv` - plaintext table of continuous feature-target associations
- `continuous_features.parquet` - Parquet table of continuous feature-target associations
📁 predictions/
├── categorical_features.csv
├── categorical_features.parquet
├── continuous_features.csv
├── continuous_features.parquet
└── predictions_report.md
- `categorical_features.csv` - plaintext table of 5-fold predictive performances of each categorical feature
- `categorical_features.parquet` - Parquet table of 5-fold predictive performances of each categorical feature
- `continuous_features.csv` - plaintext table of 5-fold predictive performances of each continuous feature
- `continuous_features.parquet` - Parquet table of 5-fold predictive performances of each continuous feature
- `predictions_report.md` - summary tables of all feature predictive performances
Note: for "large" datasets (currently, greater than 1500 samples) these predictions are made using a small (1500 samples) subsample of the full data, for compute time reasons.
For continuous targets (i.e. regression), the subsample is made representative
by taking a stratified subsample, where stratification is based on discretizing
the continuous target variable into 5 bins (via scikit-learn's
`KBinsDiscretizer` and `StratifiedShuffleSplit`, respectively).
For categorical targets (i.e. classification), the subsample is a "viable
subsample" (see `viable_subsample` in `prepare.py`) that first ensures all
target classes have the minimum number of samples required to avoid deflation
and/or problems with 5-fold splits eliminating a target class.
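As a rough illustration of the regression case (not the exact `df-analyze` implementation), a stratified subsample of a continuous target could be drawn like this:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))                        # toy features
y = X @ rng.normal(size=8) + rng.normal(size=10_000)    # toy continuous target

# Discretize the continuous target into 5 bins, then stratify on those bins
bins = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
y_binned = bins.fit_transform(y.reshape(-1, 1)).ravel()

splitter = StratifiedShuffleSplit(n_splits=1, train_size=1500, random_state=0)
idx, _ = next(splitter.split(X, y_binned))  # indices of a representative 1500-sample subsample
X_sub, y_sub = X[idx], y[idx]
print(X_sub.shape, y_sub.shape)
```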
📁 selection/
├── 📁 embed/
├── 📁 filter/
└── 📁 wrap/
Data describing the features selected by each feature selection method.
📁 embed/
├── linear_embed_selection_data.json
├── lgbm_embed_selection_data.json
└── embedded_selection_report.md
- `embedded_selection_report.md` - summary of features selected by (each) embedded model
- `[model]_embed_selection_data.json` - feature names and importance scores for `[model]`
📁 filter/
├── association_selection_report.md
└── prediction_selection_report.md
- `association_selection_report.md` - summary of features selected by univariate associations with the target
  - also includes which measure of association was used for selection
- `prediction_selection_report.md` - summary of features selected by univariate predictive performance
  - also includes which predictive performance metric was used for selection

Note: Feature importances are not included here, as these are already available
in the `features` directory.
📁 wrap/
├── wrapper_selection_data.json
└── wrapper_selection_report.md
- `wrapper_selection_data.json` - feature names and predictive performance of each feature upon selection
- `wrapper_selection_report.md` - summary of features selected by the wrapper (stepwise) selection method
📁 tuning/
└── tuned_models.csv
- `tuned_models.csv` - table of all tuned models for each feature selection method, including final performance and final selected hyperparameters (as a JSON field)
📁 results/
├── eval_htune_results.json
├── final_performances.csv
├── performance_long_table.csv
├── results_report.md
├── X_test.csv
├── X_train.csv
├── y_test.csv
└── y_train.csv
- `final_performances.csv` and `performance_long_table.csv` - final summary tables of all performances for all models and feature selection methods [TODO: make one of these a wide table]
- `results_report.md` - readable report (with wide-form tables of performances) of the above information
- `X_test.csv` - predictors used for final holdout and k-fold evaluations
- `X_train.csv` - predictors used for training and tuning
- `y_test.csv` - target samples used for final holdout and k-fold evaluations
- `y_train.csv` - target samples used for training and tuning
- `eval_htune_results.json` - serialization of the final results object (not human-readable; for internal use)
The full tree-structure of outputs is as follows:
📁 fe57fcf2445a2909e688bff847585546/
├── 📁 features/
│   ├── 📁 associations/
│   │   ├── associations_report.md
│   │   ├── categorical_features.csv
│   │   ├── categorical_features.parquet
│   │   ├── continuous_features.csv
│   │   └── continuous_features.parquet
│   └── 📁 predictions/
│       ├── categorical_features.csv
│       ├── categorical_features.parquet
│       ├── continuous_features.csv
│       ├── continuous_features.parquet
│       └── predictions_report.md
├── 📁 inspection/
│   ├── inferred_types.csv
│   └── short_inspection_report.md
├── 📁 prepared/
│   ├── info.json
│   ├── labels.parquet
│   ├── preparation_report.md
│   ├── X.parquet
│   ├── X_cat.parquet
│   ├── X_cont.parquet
│   └── y.parquet
├── 📁 results/
│   ├── eval_htune_results.json
│   ├── final_performances.csv
│   ├── performance_long_table.csv
│   ├── results_report.md
│   ├── X_test.csv
│   ├── X_train.csv
│   ├── y_test.csv
│   └── y_train.csv
├── 📁 selection/
│   ├── 📁 embed/
│   │   ├── embed_selection_data.json
│   │   └── embedded_selection_report.md
│   ├── 📁 filter/
│   │   ├── association_selection_report.md
│   │   └── prediction_selection_report.md
│   └── 📁 wrapper/
├── 📁 tuning/
│   └── tuned_models.csv
└── options.json
- there can be only one target variable per program invocation / run
- malformed data (e.g. quoting, feature names with spaces or commas, malformed `.csv` files, etc.)
- inappropriate data (e.g. timeseries or sequence data, NLP data)
- inappropriate tasks (e.g. unsupervised learning tasks)
- dataset size: the expected max runtime should be well under 24 hours (see below for how to estimate your expected runtime on the Compute Canada / DRAC Niagara cluster)
- wrapper selection is extremely expensive, and the number of selected features (or eliminated features, in the case of step-down selection) should not exceed:
  - step-up: 20
  - step-down: 10
Features and targets must be treated fundamentally differently by all aspects of analysis. For example:
- normalization of targets in regression must be different than normalization of continuous features
- samples with NaNs in the target must be dropped (resulting in a different base dataframe), but samples with NaN features can be imputed
- data splitting must be stratified in classification to avoid errors, but stratification must be based on the target (e.g. choosing a different target will generally result in different splits)
In addition, feature selection is expensive, and must be done for each target variable. Runtimes are often surprisingly sensitive to the distribution of the target variable.
Let
Based on some experiments with about 70 datasets from the OpenML platform, run
on the Niagara compute cluster with wrapper-based selection limited to step-up
selection of 10 features, a simple rule for predicting the maximum expected
runtime of df-analyze with all options and models is:
for
The expected runtime on your machine will be quite different. If
Also, it is extremely challenging to predict the runtimes of AutoML approaches
like `df-analyze`: besides machine characteristics, the structure of the data
(beyond just the number of samples and features) and hyperparameter values
(e.g. for support vector machines) also have a significant impact on fit times.
Datasets significantly larger than the limits described above will be
problematic for `df-analyze`. They are unlikely to complete in under 24 hours,
and may in fact cause out-of-memory errors (the average Niagara node has only
about 190GB of RAM, and to use all 40 cores, the dataset must be copied 40
times due to Python's inability to properly share memory).
If you have a dataset where the expected runtime is getting close to 24 hours,
then you should strongly consider limiting the `df-analyze` options such that:

- only 2-3 models (NOT including the `mlp`) are used, AND
- wrapper-based feature selection is not used
Datasets with any kind of strong spatio-temporal clustering, or spatio-temporal
autocorrelation, can technically be handled by `df-analyze` (i.e. it will
produce results for them), but the reported results will be deeply invalid and
misleading. This includes:

- time-series or sequence data, especially where the task is forecasting
  - This means data where the target variable is either a categorical or continuous variable that represents some subsequent or future state of a sequence of samples in the training data, e.g. predicting weather, stock prices, media popularity, etc., but where a correct predictive model necessarily must know recent target values
  - naive k-fold splitting is completely invalid in these kinds of cases, and k-fold is the basis of most of the main analyses in `df-analyze`
- spatially autocorrelated data, or autocorrelated data in general
  - A generalization of the case above, with the same problem: k-fold is invalid when the similarity of samples that are close in space is not accounted for in splitting
- unencoded text data / natural language processing (NLP) data
  - i.e. any data where a sample feature is a word or collection of words
- image data (e.g. computer vision prediction tasks)
  - These datasets will almost always be too expensive for the ML algorithms in `df-analyze` to process, and years of research and experience have now shown that classic ML models (which are all that `df-analyze` fits) simply are not capable here
Anything beyond simple prediction, e.g. unsupervised tasks like clustering,
representation learning, or dimension reduction, or even semi-supervised tasks,
is simply beyond the scope of `df-analyze`.
- **NaN Removal and Handling**
  - samples with a NaN target are dropped (see Target Handling below)
  - for continuous features, NaNs can be either dropped, or mean, median, or multiply imputed (default: mean imputation)
  - categorical features encode NaNs as an additional class / level
- **Bad Feature Detection and Removal**
  - features containing unusable datetime data (e.g. timeseries data) are automatically detected and removed, with warnings to the user
  - features containing identifiers (e.g. features that are integer or string and where each sample has a unique value) are automatically detected and removed, with user warnings
  - extremely large categorical features (more categories than about 1/5 of samples, which pose a problem for 5-fold splitting) are automatically removed, with user warnings
  - "suspicious" integer features (e.g. features with less than 5 unique values) are detected, and the user is warned to check whether they are categorical or ordinal
- **Categorical Feature Handling**
  - user can
    - specify categorical feature names explicitly (preferred)
    - specify a threshold (count) on the number of unique values required for a feature to count as categorical
  - string features (even if not user-specified) are one-hot encoded
  - NaN values are automatically treated as an additional class level (no dropping of NaN samples required)
- **Continuous Feature Handling**
  - continuous features are MinMax normalized to be in [0, 1]
    - this means `df-analyze` is currently sensitive to extreme values
    - TODO: make robust (percentile, or even quantile) normalization options available, and auto-detect such cases and warn the user
- **Target Handling**
  - all samples with NaN targets are dropped (categorical or continuous)
    - it rarely makes sense to count correct NaN predictions toward classification performance
    - imputing NaNs in a regression target (e.g. with the mean or median) biases estimates of regression performance
  - categorical targets containing a class / level with 20 or fewer samples have the samples corresponding to that level dropped, and the user is warned (these cause problems in nested stratified k-fold, and any estimate of any metric or performance on such a small class is essentially meaningless)
  - continuous or ordinal regression targets are robustly normalized using the 2.5th and 97.5th percentile values
    - with this normalization, 95% of the target values are in [0, 1]
    - thus an MAE of, say, 0.5 means that the error is about half of the target's (robust) range
    - this also aids in the convergence and fitting of scale-sensitive models
    - this also makes prediction metrics (e.g. MAE) more comparable across different targets
- **Continuous and ordinal features** (see the sketch after these lists):
  - Non-robust:
    - min, mean, max, standard deviation (SD)
  - Robust:
    - 5th and 95th percentiles, median, interquartile range (IQR)
  - Moments / Other:
    - skew, kurtosis, and p-values testing whether skew/kurtosis differ from Gaussian
    - entropy (e.g. differential / continuous entropy)
    - NaN counts and frequency
- **Categorical features:**
  - number of classes / levels
  - min, max, and median of class frequencies
  - heterogeneity (Chi-squared test of equal class sizes) and associated p-value
  - NaN counts and frequency (treated as another class label)
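As a rough sketch (not `df-analyze`'s exact computations), the continuous-feature statistics above could be computed with pandas and scipy along these lines:

```python
import numpy as np
import pandas as pd
from scipy import stats

def describe_continuous(x: pd.Series) -> dict:
    """Illustrative descriptive statistics for a single continuous feature."""
    clean = x.dropna().astype(float)
    p5, p95 = np.percentile(clean, [5, 95])
    q1, q3 = np.percentile(clean, [25, 75])
    return {
        "min": clean.min(), "mean": clean.mean(), "max": clean.max(), "sd": clean.std(),
        "p5": p5, "median": clean.median(), "p95": p95, "iqr": q3 - q1,
        "skew": stats.skew(clean), "skew_p": stats.skewtest(clean).pvalue,
        "kurtosis": stats.kurtosis(clean), "kurtosis_p": stats.kurtosistest(clean).pvalue,
        "nan_n": int(x.isna().sum()), "nan_freq": float(x.isna().mean()),
    }

x = pd.Series(np.random.default_rng(0).normal(size=500))
print(describe_continuous(x))
```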
- **Continuous/Ordinal Feature -> Categorical Target:**
  - Statistical: t-test, Mann-Whitney U, Brunner-Munzel W, Pearson r, and associated p-values
  - Other: Cohen's d, AUROC, mutual information
- for
- **Continuous/Ordinal Feature -> Continuous Target:**
  - Pearson's and Spearman's r and p-values
  - F-test and p-value
  - mutual information
- **Categorical Feature -> Continuous Target:**
  - Kruskal-Wallis H and p-value
  - mutual information
  - NOTE: There are relatively few measures of association for categorical-continuous variable pairs. Kruskal-Wallis H has few statistical assumptions, and essentially checks the extent to which the medians of each level of the categorical variable differ on the continuous target.
- **Categorical Feature -> Categorical Target:**
  - Cramer's V (see the sketch below)
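For reference, a minimal sketch of computing Cramer's V for two categorical columns (a simple, uncorrected form; `df-analyze`'s exact implementation may differ):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramer's V for two categorical variables (simple, uncorrected form)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

x = pd.Series(["a", "a", "b", "b", "c", "c", "a", "b"])
y = pd.Series([0, 0, 1, 1, 1, 0, 0, 1])
print(cramers_v(x, y))
```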
- simple linear predictive models (sklearn SGDClassifier, SGDRegressor) are hyperparameter tuned (using 5-fold validation) over a small grid for each feature
- a dummy regressor or classifier (e.g. predict the target mean, predict the largest class) is also always fit
- reported metrics are the best-tuned model's mean performance across the 5 folds:
  - Continuous/Ordinal Target (i.e. regression):
    - Models: DummyRegressor, ElasticNet, Linear regression, SVM with radial basis
    - Metrics: mean abs. error, mean sq. error, median abs. error, mean abs. percentage error, R2, percent variance explained
  - Categorical Target (i.e. classification):
    - Models: DummyClassifier, Logistic regression, SVM with radial basis
    - Metrics: accuracy, AUROC (except for SVM), sensitivity, specificity