Energy consumption of code small language models serving with runtime engines and execution providers
- Actionable guidelines for practitioners
- Measuring the impact of deep learning serving configurations on energy and performance
- An analysis of deep learning serving configurations
A duplet of a runtime engine and an execution provider: <[Runtime engine], [Execution provider]>
- Runtime engines
- Default Torch (TORCH)
- ONNX Runtime engine (ONNX)
- OpenVINO runtime (OV)
- Torch JIT (JIT)
- Execution providers
- CPU Execution Provider (CPU)
- CUDA Execution Provider (CUDA)
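To make the duplet notation concrete, the following is a minimal sketch that enumerates every pairing of the runtime engines and execution providers listed above. The names `ServingConfiguration` and `all_configurations` are hypothetical and do not come from this repository, and the sketch assumes a plain Cartesian product: not every pair is necessarily supported by the engines or evaluated in the experiments.

```python
from dataclasses import dataclass
from itertools import product

# Abbreviations taken from the lists above.
RUNTIME_ENGINES = ("TORCH", "ONNX", "OV", "JIT")
EXECUTION_PROVIDERS = ("CPU", "CUDA")


@dataclass(frozen=True)
class ServingConfiguration:
    """Hypothetical container for one <runtime engine, execution provider> duplet."""

    runtime_engine: str
    execution_provider: str

    def __str__(self) -> str:
        return f"<{self.runtime_engine}, {self.execution_provider}>"


def all_configurations():
    """Return every engine/provider pairing (assumption: plain Cartesian product)."""
    return [
        ServingConfiguration(engine, provider)
        for engine, provider in product(RUNTIME_ENGINES, EXECUTION_PROVIDERS)
    ]


if __name__ == "__main__":
    for configuration in all_configurations():
        print(configuration)
```

A frozen dataclass keeps each duplet hashable, so it could serve as a dictionary key when grouping measurements per configuration.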
The repository is structured as follows:
- app | API, schemas
- dataset | Input dataset generation
- experiments | Notebooks and scripts to process the profilers' datasets
- manuals | Self-contained manuals related to the serving infrastructure
- model_selection | Model selection scripts and metadata
- scripts | Environment scripts and bash scripts for automated experiments
- testing | Scripts to send requests to the server
- requirements.txt: Dependencies of our implementation
- runall_update.sh: Bash script to start the server and run the experiments
- code_slm_selection.csv: Selection of the code SLMs used
- Needed: HumanEval dataset
- Output: New input dataset
- files: dataset/*
- Needed: Selection criteria
- Output: Selected models
- files: code_slm_selection.csv
- Needed: Development of serving infrastructure, selected models
- Output: Serving infrastructure
- files: app/
- Needed: Deployed serving infrastructure
- Output: results (profilers datasets)
- files: testing/
- Edit experiment parameters (time, files, ...):
  - server settings
    - app/models_code_load.py: model classes, MAX_LENGTH tokens
  - experiment settings
    - testing/utils.py: experiment settings (Python script)
      - input dataset
    - repeat.sh: repeat n experiment runs, or just execute runall
    - runall_update.sh: experiment settings (bash script)
      - run server
      - run experiments for each runtime engine
      - server settings
- Run server and experiments: runall.sh
nohup ./repeat.sh > repeat.out 2>&1 &
Or:
nohup ./runall.sh > results/runall.out 2>&1 &
- Obtain results/*
- Needed: Profilers datasets
- Output: Research output, data analysis, and support files to answer the RQs
- files: experiments/
- figures
- tables
- statistical results
Files in experiments/:
- visualize_{profiler}: Visualization of raw data obtained from profilers.
- 01_get_info_{profiler}: Preprocessing of raw data obtained from profilers (script).
- 02_get_time_marks: Get time marks of inferences done during the experiment (script).
- 03_analysis_{execution_provider}: Process data for analysis (notebook).
- 04_aggregation: Aggregated data (notebook).
- 05_aggregated_plots: Box plots (notebook).
- 06_tests: Obtain the results of the statistical tests used (script).
- 07_tests_merge: Merge test results, organized by dependent variable (notebook).
- 08_analysis: Analyze results (notebook).
- 09_result_tables: Creation of the paper tables (notebook).
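As an illustration of why the time marks matter, the sketch below shows one way profiler samples could be cut to an inference window and turned into an energy value. Everything here is an assumption made for illustration: the `(timestamp, power)` sample layout, the function names, and the trapezoidal integration are not taken from the scripts in experiments/, which may process the data differently. The values in the usage block are placeholders, not measurements.

```python
from typing import List, Tuple

# Hypothetical sample layout: (timestamp in seconds, power in watts).
Sample = Tuple[float, float]


def samples_in_window(samples: List[Sample], start: float, end: float) -> List[Sample]:
    """Keep only the samples recorded between two time marks (inclusive)."""
    return [(t, p) for t, p in samples if start <= t <= end]


def energy_joules(samples: List[Sample]) -> float:
    """Integrate power over time with the trapezoidal rule (assumed method)."""
    ordered = sorted(samples)
    total = 0.0
    for (t0, p0), (t1, p1) in zip(ordered, ordered[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


if __name__ == "__main__":
    # Placeholder readings, not real profiler output.
    readings = [(0.0, 10.0), (1.0, 10.0), (2.0, 20.0), (3.0, 20.0), (4.0, 10.0)]
    window = samples_in_window(readings, start=1.0, end=3.0)
    print(energy_joules(window))
```

Restricting the integration to the marked window keeps idle power drawn before and after an inference out of the reported energy.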
- codeparrot-small
- tiny_starcoder
- pythia-410m
- bloomz-560m
- starcoderbase-1b
- bloomz-1b1
- tinyllama
- pythia-1.4b
- codegemma-2b
- phi2
- stablecode-3b
- stablecode-3b-completion
Dataset: testing/inputs.txt
Run server:
uvicorn app.api_code:app --host 0.0.0.0 --port 8000 --reload --reload-dir app
Make inferences:
python3 testing/main.py -i torch -r 5 | tee -a results/out_torch.log
python3 testing/main.py -i onnx -r 5 | tee -a results/out_onnx.log
python3 testing/main.py -i ov -r 5 | tee -a results/out_ov.log
python3 testing/main.py -i torchscript -r 5 | tee -a results/out_torchscript.log
Results are saved in results/
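The four inference commands above differ only in the engine identifier and the log file name, so they can be generated rather than typed by hand. The following is a small sketch of that idea; the helper names (`build_command`, `log_path`, `shell_line`) are hypothetical and not part of this repository, and only the `-i` and `-r` flags shown above are used.

```python
import shlex

# Engine identifiers as they appear in the commands above.
ENGINE_IDS = ("torch", "onnx", "ov", "torchscript")


def build_command(engine_id: str, repetitions: int = 5) -> list:
    """Build the argument list for one run of testing/main.py."""
    if engine_id not in ENGINE_IDS:
        raise ValueError(f"unknown engine identifier: {engine_id}")
    return ["python3", "testing/main.py", "-i", engine_id, "-r", str(repetitions)]


def log_path(engine_id: str) -> str:
    """Log file name following the results/out_<engine>.log pattern above."""
    return f"results/out_{engine_id}.log"


def shell_line(engine_id: str, repetitions: int = 5) -> str:
    """Render the full shell line, including the tee redirection."""
    command = shlex.join(build_command(engine_id, repetitions))
    return f"{command} | tee -a {log_path(engine_id)}"


if __name__ == "__main__":
    for engine_id in ENGINE_IDS:
        print(shell_line(engine_id))
```

Printing the lines instead of executing them keeps the sketch free of side effects; the output can be pasted into a terminal or a bash script once the server is running.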
- API creation. Guide to create an API to deploy ML models.
- Add pretrained model. Guide to add pretrained ML models (from HuggingFace, hdf5 format, pickle format) to do inferences through an API.
- Deploy ML models in a cloud provider (General). Guide to deploy ML models using an API in a cloud provider.
- See more
- https://madewithml.com, API
- https://github.com/se4ai2122-cs-uniba/SE4AI2021Course_FastAPI-demo, API
- https://github.com/MLOps-essi-upc
Please use the following BibTeX entry:
@article{duran2024serving,
title={Identifying architectural design decisions for achieving green ML serving},
author={Dur{\'a}n, Francisco and Martinez, Matias and Lago, Patricia and Mart{\'\i}nez-Fern{\'a}ndez, Silverio},
journal={arXiv preprint arXiv:},
year={2024}
}