We are building a Python library, PyOE, for data stream machine learning in a few lines of code. Researchers are welcome to use it and give feedback!
This is the code for our paper OEBench: Investigating Open Environment Challenges in Real-World Relational Data Streams.
Relational datasets are widespread in real-world scenarios and are usually delivered in a streaming fashion. This type of data stream can present unique challenges, such as distribution drifts, outliers, emerging classes, and changing features, which have recently been described as open environment challenges for machine learning.
We develop an Open Environment Benchmark named OEBench to evaluate open environment challenges in relational data streams. Specifically, we investigate 55 real-world streaming datasets and establish that open environment scenarios are indeed widespread in real-world datasets, which presents significant challenges for stream learning algorithms.
This data processing pipeline is specifically designed for open environment learning, providing a comprehensive analysis of datasets, including missing values statistics, anomaly detection, multi-dimensional and one-dimensional drift detection, and concept drift detection. The pipeline is designed to process multiple datasets and provide a detailed report on various metrics.
The full collection of datasets can be downloaded from https://drive.google.com/file/d/1m7eKbycaEh38OxB7gJibUZ2kNqzVzYMf/view?usp=sharing.
This project requires the following Python packages:
- numpy
- pandas
- scikit-learn
- scikit-multiflow
- scipy
- pyod
- Keras
- tensorflow-gpu
- torch
- rtdl
- delu
- lightgbm
- xgboost
- catboost
- copulas
- menelaus (requires Python >= 3.9)
- pytorch-tabnet
If `import keras` reports an error in ADBench, please replace it with `import tensorflow.keras`.
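Before running anything, it can help to check which of the packages above are importable. The sketch below uses only the standard library; the import names (for example `sklearn` for scikit-learn, `skmultiflow` for scikit-multiflow, `pytorch_tabnet` for pytorch-tabnet) and the helper `find_missing` are assumptions for illustration, not part of this repository.

```python
import importlib.util
import sys

# Import names assumed to correspond to the packages listed above.
REQUIRED_MODULES = [
    "numpy", "pandas", "sklearn", "skmultiflow", "scipy", "pyod",
    "keras", "tensorflow", "torch", "rtdl", "delu", "lightgbm",
    "xgboost", "catboost", "copulas", "menelaus", "pytorch_tabnet",
]


def find_missing(modules=REQUIRED_MODULES):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # menelaus is listed above as needing Python >= 3.9.
    if sys.version_info < (3, 9):
        print("menelaus needs Python >= 3.9; found", sys.version.split()[0])
    missing = find_missing()
    print("Missing modules:", ", ".join(missing) if missing else "none")
```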
- Prepare `info.json` and `schema.json` for your datasets and place them in a folder named `dataset_experiment_info` in the same directory as this script. For each dataset, create a subfolder with the dataset's name.
- If statistics are wanted only for selected datasets, update the `dataset_prefix_list` variable in the script to include the names of the desired dataset subfolders from the `dataset_experiment_info` folder. If statistics for all datasets are wanted, the current code can remain unchanged, since all dataset subfolders under the `dataset_experiment_info` folder will be iterated.
- Run the script. The pipeline will process each dataset in the specified list, generating various statistics and saving the results in separate CSV files within each dataset's subfolder. An `overall_stats.csv` file will also be generated, containing aggregated statistics for all datasets.
python pipeline.py
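The choice between "all datasets" and "selected datasets" described above can be pictured as follows. This is only a sketch: `resolve_dataset_prefixes` is a hypothetical helper, and the actual logic in `pipeline.py` may differ.

```python
from pathlib import Path


def resolve_dataset_prefixes(root="dataset_experiment_info", selected=None):
    """Return dataset subfolder paths under `root`, optionally filtered by name.

    Hypothetical helper: with `selected=None` every subfolder is returned,
    mirroring the default behaviour described above.
    """
    root_path = Path(root)
    subfolders = sorted(p for p in root_path.iterdir() if p.is_dir())
    if selected is not None:
        wanted = set(selected)
        subfolders = [p for p in subfolders if p.name in wanted]
    return [p.as_posix() for p in subfolders]
```

For instance, `resolve_dataset_prefixes(selected=["airlines"])` would return only the path of the `airlines` subfolder, provided it exists.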
To add a new dataset to the pipeline, follow these steps:
- Create a new subfolder within the `dataset_experiment_info` folder, named after the dataset.
- Place the dataset file (e.g., CSV or Excel) in the `dataset` folder.
- Create a schema file `schema.json` and a dataset information file `info.json` for the dataset and place them in the same subfolder.
- If needed, add the dataset subfolder's name to the `dataset_prefix_list` variable in the script.
For example, to add a dataset called `my_new_dataset`, you should:
- Create a subfolder named `my_new_dataset` inside the `dataset_experiment_info` folder.
- Place the `my_new_dataset.csv` file (or any other supported format) inside the `dataset` subfolder.
- Create a schema file `schema.json` and an information file `info.json` and place them inside the `my_new_dataset` subfolder.
- If needed, manually add `'my_new_dataset'` to the `dataset_prefix_list` variable in the script.
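The steps above could be scripted roughly as follows. This is a sketch under stated assumptions: `scaffold_dataset` is a hypothetical helper, and it assumes the `dataset` folder lives inside the dataset's own subfolder (as the relative `data` path in `info.json` suggests). The generated files are placeholders that still need to be filled in by hand.

```python
import json
from pathlib import Path

# Key names follow the schema.json and info.json templates in this README;
# the function name and the placement of dataset/ inside the dataset
# subfolder are assumptions made for illustration.
EMPTY_SCHEMA = {
    "numerical": [],
    "categorical": [],
    "target": [],
    "timestamp": [],
    "replace_with_null": [],
    "window size": 0,
    "unnecessary": [],
}


def scaffold_dataset(name, root="dataset_experiment_info", task="classification"):
    """Create <root>/<name>/ with a dataset/ folder and placeholder JSON files."""
    folder = Path(root) / name
    (folder / "dataset").mkdir(parents=True, exist_ok=True)
    info = {"schema": "schema.json", "data": f"dataset/{name}.csv", "task": task}
    for filename, content in (("schema.json", EMPTY_SCHEMA), ("info.json", info)):
        target = folder / filename
        if not target.exists():  # never overwrite files that were edited by hand
            target.write_text(json.dumps(content, indent=2), encoding="utf-8")
    return folder
```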
The template of `schema.json` for a dataset is as follows:
{
"numerical": ["num1", "num2"],
"categorical": ["cat1", "cat2"],
"target": ["target"],
"timestamp": ["date", "time"],
"replace_with_null": ["column_to_be_replaced_by_null"],
"window size": 0,
"unnecessary": ["unnecessary1", "unnecessary2"]
}
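A schema written from this template can be sanity-checked before running the pipeline. The sketch below only relies on the key names shown in the template; `missing_schema_keys` and `feature_columns` are hypothetical helpers, and treating numerical plus categorical columns as the feature set is an assumption.

```python
import json

# Keys taken from the schema.json template above.
SCHEMA_KEYS = (
    "numerical", "categorical", "target", "timestamp",
    "replace_with_null", "window size", "unnecessary",
)


def load_schema(path):
    """Read a schema.json file into a dictionary."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def missing_schema_keys(schema):
    """Return the template keys that are absent from `schema`."""
    return [key for key in SCHEMA_KEYS if key not in schema]


def feature_columns(schema):
    """Return numerical then categorical column names (assumed to be the features)."""
    return list(schema.get("numerical", [])) + list(schema.get("categorical", []))
```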
The template of `info.json` for a dataset is as follows:
{
"schema": "schema.json",
"data": "dataset/my_new_.csv",
"task": "classification"
}
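Because `schema` and `data` are written as relative paths, a reader of `info.json` has to resolve them against some base folder. The sketch below assumes that base is the dataset's own subfolder; `load_info` is a hypothetical helper, not a function from this repository.

```python
import json
from pathlib import Path


def load_info(dataset_dir):
    """Read info.json and resolve its relative paths against `dataset_dir`.

    Assumption: the "schema" and "data" entries are relative to the
    dataset subfolder that contains info.json.
    """
    dataset_dir = Path(dataset_dir)
    with open(dataset_dir / "info.json", encoding="utf-8") as fh:
        info = json.load(fh)
    return {
        "schema_path": dataset_dir / info["schema"],
        "data_path": dataset_dir / info["data"],
        "task": info["task"],
    }
```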
- `dataset_prefix_list`: A list of dataset path prefixes to process.
- `done`: A list of already processed datasets.
The `run_pipeline` function iterates through each dataset path prefix in the `dataset_prefix_list` and processes the dataset. For each dataset, the function performs the following steps:
- Pre-processes the dataset and extracts its schema.
- Processes missing values and calculates various missing value statistics.
- Detects outliers using the IForest and ECOD methods.
- Detects multi-dimensional data drift using HDDDM, kdqTree, and KS statistics.
- Detects one-dimensional data drift using the KS statistics, HDDDM, kdqTree, CDBD, and PCA-CD methods.
- Detects concept drift using the PERM, ADWIN, DDM, and EDDM methods.
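To make the one-dimensional drift step more concrete, here is a minimal pure-Python sketch of a KS-based comparison between consecutive windows of a single column. It is not the pipeline's implementation: the windowing scheme, the function names, and the fixed `threshold` (used here instead of a p-value) are all assumptions for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for x in a + b:  # the largest gap is attained at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def drifting_windows(values, window_size, threshold=0.5):
    """Return indices of windows that differ from the previous window.

    Hypothetical rule: a window is flagged when its KS statistic against
    the preceding window exceeds `threshold`.
    """
    windows = [values[i:i + window_size]
               for i in range(0, len(values) - window_size + 1, window_size)]
    return [i for i in range(1, len(windows))
            if ks_statistic(windows[i - 1], windows[i]) > threshold]
```

A production run would more likely rely on a library routine with a significance level rather than a hand-picked threshold.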
After processing each dataset, the function saves the calculated statistics in separate CSV files within each dataset's subfolder. Additionally, the `overall_stats.csv` file is generated, containing aggregated statistics for all datasets.
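The output layout described above (one set of CSV files per dataset plus one aggregated file) might look like this in code. This is a sketch only: `save_stats`, the per-dataset file name `stats.csv`, and the location of `overall_stats.csv` in the root folder are assumptions, not the pipeline's actual names.

```python
import csv
from pathlib import Path


def save_stats(root, per_dataset_stats):
    """Write one CSV per dataset subfolder plus an aggregated overall_stats.csv.

    `per_dataset_stats` maps a dataset name to a dict of statistic values.
    The file name stats.csv is a placeholder chosen for this sketch.
    """
    root = Path(root)
    columns = sorted({key for stats in per_dataset_stats.values() for key in stats})
    overall_rows = []
    for name, stats in per_dataset_stats.items():
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        with open(folder / "stats.csv", "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=columns)
            writer.writeheader()
            writer.writerow(stats)
        overall_rows.append({"dataset": name, **stats})
    with open(root / "overall_stats.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["dataset"] + columns)
        writer.writeheader()
        writer.writerows(overall_rows)
```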
`cluster.py` visualizes the clusters of datasets according to our calculated statistics for three open environment problems (missing values, drifts, outliers). The purpose is to select representative datasets for further experiments on 10 stream learning algorithms.
Please refer to `run.sh` as an example.
| Parameter | Description |
|---|---|
| `model` | The model architecture. Options: `mlp`, `tree`. Default = `mlp`. |
| `gbdt` | Whether to use GBDT for the tree model. Options: `0`, `1`. Default = `0`. |
| `dataset` | Dataset to use. Options: `selected` or others from `pipeline.py` (like `dataset_experiment_info/airlines`, etc.). Default = `selected`. |
| `alg` | The training algorithm. Options: `naive`, `ewc`, `lwf`, `icarl`, `sea`, `arf`. Default = `naive`. |
| `lr` | Learning rate for MLP models, default = `0.01`. |
| `batch-size` | Batch size for MLP models, default = `64`. |
| `epochs` | Number of training epochs in each local window for MLP models, default = `10`. |
| `layers` | The number of layers in MLP models, default = `3`. |
| `reg` | The regularization factor, default = `1`. |
| `buffer` | The number of exemplars allowed to be stored, default = `100`. |
| `ensemble` | The ensemble size for GBDT and SEA, default = `1`. |
| `window-factor` | The factor by which the default window size is multiplied, default = `1`. |
| `missing-fill` | The method used to fill missing values. Options: `knn_` (where `_` is the value of K in KNN), `regression`, `avg`, `zero`. Default = `knn2`. |
| `logdir` | The path to store the logs, default = `./logs/`. |
| `device` | The device on which to run the program, default = `cpu`. |
| `init_seed` | The initial seed, default = `0`. |
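The parameters above map naturally onto a command-line parser. The following `argparse` sketch reproduces the names, options, and defaults from the table; the `--` prefix, the argument types, and the helper `parse_missing_fill` are assumptions, and the experiment script in this repository may define its arguments differently.

```python
import argparse
import re


def build_parser():
    """Build a parser whose options and defaults mirror the parameter table."""
    parser = argparse.ArgumentParser(description="Stream learning experiment options (sketch)")
    parser.add_argument("--model", choices=["mlp", "tree"], default="mlp")
    parser.add_argument("--gbdt", type=int, choices=[0, 1], default=0)
    parser.add_argument("--dataset", default="selected")
    parser.add_argument("--alg", default="naive",
                        choices=["naive", "ewc", "lwf", "icarl", "sea", "arf"])
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--layers", type=int, default=3)
    parser.add_argument("--reg", type=float, default=1)
    parser.add_argument("--buffer", type=int, default=100)
    parser.add_argument("--ensemble", type=int, default=1)
    parser.add_argument("--window-factor", type=float, default=1)
    parser.add_argument("--missing-fill", default="knn2")
    parser.add_argument("--logdir", default="./logs/")
    parser.add_argument("--device", default="cpu")
    parser.add_argument("--init_seed", type=int, default=0)
    return parser


def parse_missing_fill(value):
    """Split a missing-fill option into (method, k); k is None unless the method is KNN."""
    match = re.fullmatch(r"knn(\d+)", value)
    if match:
        return "knn", int(match.group(1))
    if value in ("regression", "avg", "zero"):
        return value, None
    raise ValueError(f"unknown missing-fill option: {value}")
```

With this sketch, `build_parser().parse_args(["--model", "tree", "--gbdt", "1"])` selects the tree model with GBDT enabled and leaves every other option at its default.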
- https://github.com/Minqi824/ADBench
- https://github.com/messaoudia/AdaptiveRandomForest
- https://github.com/moskomule/ewc.pytorch
If you find this repository useful, please cite our paper:
@article{diao2024oebench,
title={OEBench: Investigating Open Environment Challenges in Real-World Relational Data Streams},
author={Diao, Yiqun and Yang, Yutong and Li, Qinbin and He, Bingsheng and Lu, Mian},
journal={Proceedings of the VLDB Endowment},
volume={17},
number={6},
pages={1283--1296},
year={2024},
publisher={VLDB Endowment}
}