Optimize. Quantize. Perfect the Efficiency.
Llama-gguf-optimize is the result of work and research in creating high-quality quantizations for multilingual models, specifically the salamandra series. With a focus on preserving language diversity, the project leverages llama.cpp’s importance matrix approach to minimize quantization loss across distinct language domains. Existing importance matrices often lack even basic multilingual support, so this toolkit includes tools and scripts to generate custom importance matrices and refined quantizations, tailored for datasets beyond common sources like WikiText. While initially addressing quantization needs for 2B models, llama-gguf-optimize has grown into a broader resource, providing insights and tools for other researchers facing similar challenges.
It currently contains new scripts to help analyze dataset performance in i-matrix based quantization; the notes in on_kl-divergence-optimization.md, which summarize various llama.cpp discussions regarding imatrices; and an Iterative Quantization and Comparison usage guide, which will eventually detail the main flows envisioned for this project and currently details the primary one. It also contains the original notes that informed the search to create these datasets, along with the Jupyter notebooks used to generate the datasets and the quantized models. These notes are in some cases incorrect, but will eventually be updated with later insights. The notebooks will be generalized into scripts to help others through the process, regardless of the datasets they are using.
The imatrix_dataset.py script provides dataset sampling and importance-matrix generation for specific domains within (Hugging Face) datasets, for example languages. It can be customized via plugins in the src/imatrix_dataset directory, enabling flexible integration with various data sources. This script complements the quantization process by generating importance matrices that optimize models for specific languages and contexts.
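For illustration, a plugin might look roughly like the sketch below; the class shape and method names here are assumptions for this example, so consult the plugins shipped in src/imatrix_dataset for the actual interface.

```python
# Hypothetical plugin sketch -- the class shape and method names are
# illustrative assumptions, not the actual src/imatrix_dataset interface.
from datasets import load_dataset


class WikipediaPlugin:
    """Streams text samples for one language from a Hugging Face dataset."""

    def __init__(self, lang: str):
        self.lang = lang

    def samples(self, num_samples: int, skip_samples: int = 0):
        # Streaming avoids downloading the full corpus; only the requested
        # window of documents is fetched.
        ds = load_dataset(
            "wikimedia/wikipedia", f"20231101.{self.lang}",
            split="train", streaming=True,
        )
        for i, row in enumerate(ds.skip(skip_samples)):
            if i >= num_samples:
                break
            yield row["text"]
```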
The quantize.py script facilitates the quantization of Large Language Models (LLMs) using llama.cpp's quantization tools. It provides a streamlined way to quantize models with various options, including specifying the quantization type, output directory, and base model. It also allows for perplexity measurement and summarization to assess the impact of quantization on model performance. This script is particularly useful for researchers and developers aiming to optimize LLMs for deployment in resource-constrained environments.
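At its core, this kind of workflow drives llama.cpp's quantization binary. Below is a minimal stand-alone sketch, not quantize.py's actual implementation; it assumes a llama.cpp build with `llama-quantize` on the `PATH`.

```python
# Minimal sketch of invoking llama.cpp's quantizer directly; quantize.py
# wraps this kind of call with configuration, logging, and perplexity checks.
import subprocess

subprocess.run(
    [
        "llama-quantize",
        "--imatrix", "imatrix.dat",   # importance matrix guiding quantization
        "model-f16.gguf",             # full-precision input model
        "model-Q4_K_M.gguf",          # quantized output model
        "Q4_K_M",                     # quantization type
    ],
    check=True,
)
```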
The best_bub.py script is a performance-optimization tool for fine-tuning the batch (`--batch`) and ubatch (`--ubatch`) parameters of logit-generation processes in llama.cpp, such as llama-perplexity or llama-imatrix. Using a fully Bayesian approach, it explores runtime configurations tailored to your model's context size and available memory resources, achieving notable time savings over default configurations (a 33% improvement observed in the author's case). best_bub.py employs Optuna's TPESampler for intelligent sampling and incorporates Bayesian metrics and a MedianPruner to refine trial evaluations based on runtime performance trends, selecting strong parameters while adapting to real-time memory constraints and model-specific behavior.
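Conceptually, the search resembles the minimal Optuna sketch below; the synthetic cost function stands in for best_bub.py's real objective, which times llama.cpp logit generation under memory constraints.

```python
# Conceptual sketch of the TPESampler + MedianPruner search; the cost model
# below is synthetic, standing in for timed llama.cpp runs.
import optuna


def synthetic_chunk_time(batch: int, ubatch: int) -> float:
    # Purely illustrative cost model, not a real benchmark.
    return 1.0 / batch + 0.002 * (batch / ubatch)


def objective(trial: optuna.Trial) -> float:
    batch = trial.suggest_int("batch", 128, 4096, log=True)
    ubatch = trial.suggest_int("ubatch", 32, 1024, log=True)
    total = 0.0
    for step in range(8):  # the real script reports per-chunk runtimes
        total += synthetic_chunk_time(batch, ubatch)
        trial.report(total, step)
        if trial.should_prune():  # MedianPruner stops clearly slow trials early
            raise optuna.TrialPruned()
    return total  # minimize total runtime


study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(seed=0),
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=2),
)
study.optimize(objective, n_trials=25)
print("best --batch/--ubatch:", study.best_params)
```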
The kl_d_bench.py script coordinates the generation and comparison of logits across models, running through the dataset one chunk at a time. By handling each chunk sequentially, it keeps storage needs low, requiring only enough space for the current and previous chunks, while ensuring consistency and smooth progress. kl_d_bench.py can easily pause and pick up where it left off. Though currently optimized for single-chunk processing, future updates could allow multi-chunk handling.
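The chunk-at-a-time flow is roughly the following runnable schematic; the synthetic stand-ins replace the real script's calls into llama.cpp and the generate/compare scripts.

```python
# Runnable schematic of the chunk-at-a-time orchestration with synthetic
# stand-ins; the real script drives llama.cpp via generate/compare scripts.
import json
import os

import numpy as np


def log_softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def fake_logits(model_seed: int, chunk_idx: int) -> np.ndarray:
    # Stand-in for running one model over one dataset chunk.
    rng = np.random.default_rng(model_seed * 1000 + chunk_idx)
    return rng.normal(size=(8, 16))  # (tokens, vocab)


def run(n_chunks: int = 4, out: str = "chunk_stats.json") -> None:
    stats = json.load(open(out)) if os.path.exists(out) else {}
    for idx in range(n_chunks):
        if str(idx) in stats:
            continue  # resume: this chunk's result is already recorded
        # Only the current chunk's logits are held at any one time.
        lp = log_softmax(fake_logits(1, idx))
        lq = log_softmax(fake_logits(2, idx))
        stats[str(idx)] = float((np.exp(lp) * (lp - lq)).sum(-1).mean())
        with open(out, "w") as fh:
            json.dump(stats, fh)  # persist after every chunk


run()
```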
The generate_logits.py script is a specialized tool for generating and storing logits efficiently from a llama.cpp model over large datasets. Built to overcome the storage and memory limitations of previous tools, it uses the HDF5 format for compact storage and a reuse list to manage processed chunks, making it resumable and able to operate within limited disk space. These optimizations make generate_logits.py particularly useful for quantization analysis and similar downstream tasks that require consistent, efficient logit generation for models with large vocabulary sizes or extensive datasets.
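The storage pattern looks roughly like the h5py sketch below; the dataset names, shapes, and dtypes are assumptions for illustration, not generate_logits.py's actual file schema.

```python
# Sketch of chunked, resumable HDF5 logit storage with h5py; dataset names,
# shapes, and dtypes are illustrative, not generate_logits.py's real schema.
import h5py
import numpy as np

n_chunks, tokens_per_chunk, vocab = 4, 128, 32_000

with h5py.File("logits.h5", "a") as f:
    logits = f.require_dataset(
        "logits",
        shape=(n_chunks, tokens_per_chunk, vocab),
        dtype="f2",                           # fp16 keeps large vocabs compact
        chunks=(1, tokens_per_chunk, vocab),  # one HDF5 chunk per data chunk
        compression="gzip",
    )
    processed = f.require_dataset("processed", shape=(n_chunks,), dtype=bool)
    for idx in range(n_chunks):
        if processed[idx]:
            continue  # resume: skip chunks already written
        logits[idx] = np.random.randn(tokens_per_chunk, vocab).astype("f2")
        processed[idx] = True
```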
The compare_logits.py script is a specialized tool for comparing sets of logits on a chunk-by-chunk basis, providing detailed KL-divergence metrics essential for quantization analysis. By calculating statistics such as median, standard deviation, and specific percentiles (e.g., 90th, 95th, 99th) for each chunk, it highlights outliers where quantization diverges most from baseline. These metrics, stored in an HDF5 format for efficient storage and resumability, can support the evaluation and calibration of quantization quality, particularly for fine-tuning dataset importance.
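Per chunk, the reported statistics amount to something like the following computation (a sketch of the math rather than the script's exact code):

```python
# Sketch of per-chunk KL-divergence statistics: the math the script reports,
# not its exact implementation.
import numpy as np
from scipy.special import log_softmax


def chunk_kl_stats(baseline_logits: np.ndarray, target_logits: np.ndarray):
    # KL(P_baseline || Q_target) per token, computed in log space for stability.
    log_p = log_softmax(baseline_logits, axis=-1)
    log_q = log_softmax(target_logits, axis=-1)
    kl = np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)
    return {
        "median": float(np.median(kl)),
        "std": float(np.std(kl)),
        **{f"p{q}": float(np.percentile(kl, q)) for q in (90, 95, 99)},
    }


# Example on random stand-in logits for a 64-token chunk:
rng = np.random.default_rng(0)
base = rng.normal(size=(64, 32_000))
print(chunk_kl_stats(base, base + rng.normal(scale=0.1, size=base.shape)))
```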
A video walkthrough of the usage guide is included in the repository as usage-guide.notebooklm.webm.
|    | Feature       | Description |
|----|---------------|-------------|
| ⚙️ | Architecture  | The project utilizes a modular `src` directory with various scripts for quantization, model optimization, and logging. It adheres to Python best practices and leverages external libraries for machine learning tasks. |
| 🔩 | Code Quality  | (Continually improving towards) high-quality code, maintained through static type checking (`py.typed`), documentation within files such as `src/quantize.ipynb`, and consistent use of tools like Optuna and PyTorch for optimization and model execution. |
| 📄 | Documentation | Comprehensive documentation is available, along with configuration details in `pyproject.toml` and `requirements.txt`. Additional markdown files provide insight into the repository's goals and methodologies. |
| 🔌 | Integrations  | Key integrations with machine learning libraries (PyTorch, NumPy), optimization tools (Optuna), and data handling (HDF5). External dependencies are well managed and specified in `requirements.txt`. |
| 🧩 | Modularity    | The codebase is highly modular, with functionality split across scripts in the `src` directory. Core functions for quantization and logging live in dedicated files (`library.py`, `gguf_optimize_logging.py`), enhancing reuse. |
| 🧪 | Testing       | Tests are run with `unittest` (see `src/tests`); scripts such as `compare_logits.py` are validated through their KL-divergence calculations. |
| ⚡️ | Performance   | Optimized for performance and memory usage, with logging configuration that adjusts dynamically to runtime needs (`gguf_optimize_logging.py`). |
| 🛡️ | Security      | No explicit security measures are in place, but versioning and static type checking enhance maintainability and reliability, indirectly supporting secure coding practices. |
| 🔗 | Dependencies  | Managed through `pyproject.toml` and listed in `requirements.txt`: PyTorch for deep learning, NumPy for numerical computation, HDF5 for dataset handling, and Optuna for optimization. |
[root]
| File | Summary |
|------|---------|
| `requirements.txt` | Lists the external libraries that ensure a consistent development environment across setups, highlighting the dependencies crucial for model optimization and data processing. |
| `pyproject.toml` | Defines project metadata and dependencies for llama-gguf-optimize, pins a compatible Python version, and supports building and version management with Hatch. |
| `imatrix.ipynb` | The Jupyter notebook for sampling data for i-matrix generation, originally at `src/imatrix.ipynb` but now located in the root directory. Its workflow has been generalized into the `imatrix_dataset.py` script. |
| `quantize.ipynb` | The original notebook used to produce the quantized models; its workflow is now covered by the `quantize.py` script. |
src
| File | Summary |
|------|---------|
| `version.py` | Defines the package version for package management. |
| `library.py` | Defines the library tag used across the package's modules, facilitating identification and tracking of dependencies within the repository's ecosystem. |
| `compare_logits.py` | Compares KL-divergence between two HDF5 files containing softmax logits from machine learning models. Calculates per-chunk statistics, updates cumulative overall stats, and saves results to an output file. Supports processing specific chunks and logging verbosity levels. |
| `gguf_optimize_model_fns.py` | Estimates model parameters and precision for optimization. Uses model metadata for parameter estimation and calculates bits per weight to assess model efficiency, logging critical information for debugging and verification. |
| `best_bub.py` | Automates the search for the best `--batch` and `--ubatch` configuration (dubbed "BUB") to maximize inference speed. Uses Optuna for hyperparameter optimization, integrates with `llama_cpp` to evaluate each configuration's performance, and distributes tuning across multiple processes for efficiency. |
| `quantize.py` | Quantizes models using llama.cpp's quantization tools, with options for quantization type, output directory, and base model; also supports perplexity measurement and summarization to assess the impact of quantization. |
| `generate_logits.py` | Generates logits from a model over a dataset for further analysis, supporting the repository's aim of improving model performance through optimization techniques such as quantization and KL-divergence analysis. |
| `gguf_optimize_logging.py` | Configures logging for library operations, setting message formats and output levels to standard out, with versioning information included in debug-mode output. |
| `imatrix_dataset.py` | Generalizes and automates the data-sampling process for specific domains. Integrates plugin support (in `src/imatrix_dataset`) for handling different data sources, allowing flexible dataset sampling and importance-matrix generation. |
| `kl_d_bench.py` | Orchestrates dataset processing to generate and compare model logits: manages dataset input, validates parameters and the presence of required arguments, prevents conflicting options, sets the logging level, and executes the main function. |
- Python: version 3.12 or a compatible version (3.6 and above may also work)
- The libraries listed in `requirements.txt`
- Clone the repository:

  ```sh
  ❯ git clone <repository_url>
  ```

- Navigate to the project directory:

  ```sh
  ❯ cd <project_directory>
  ```

- Install the required dependencies (optional if using `uv`):

  ```sh
  ❯ pip install -r requirements.txt
  ```
Note: there is a usage guide!
Each script in llama-gguf-optimize can be run independently, offering a range of model optimization, logit generation, and comparison capabilities:
- Data Sampling for Importance Matrix with `imatrix_dataset.py`

  The `imatrix_dataset.py` script generalizes the data sampling process, enabling the generation of importance matrices for specific languages. It supports custom data sources through plugins in `src/imatrix_dataset`.

  ```sh
  ❯ uv run src/imatrix_dataset.py --langs <languages> --num-samples <samples_count> --skip-samples <skip_count> --output <output_file>
  ```

  Additional options:

  ```sh
  ❯ uv run src/imatrix_dataset.py --help
  ```
- Model quantization and perplexity analysis with `quantize.py`

  ```sh
  ❯ uv run src/quantize.py quantize --model-name <model_name> --base-model <path_to_model> --config <config_file>
  ```

  To measure perplexity:

  ```sh
  ❯ uv run src/quantize.py perplexity --base-model-name <model_name> --config <config_file> --dataset <ppl_test_data>
  ```

  See all options with:

  ```sh
  ❯ uv run src/quantize.py --help
  ```
- Optimize Batch Sizes with `best_bub.py`

  Run the `best_bub.py` script to optimize batch (`--batch`) and ubatch (`--ubatch`) parameters:

  ```sh
  ❯ uv run src/best_bub.py --model <path_to_model> --context-size <size> [optional model parameters]...
  ```

  For a full list of options, see:

  ```sh
  ❯ uv run src/best_bub.py --help
  ```
- Generate Logits with `generate_logits.py`

  Use `generate_logits.py` to generate and save logits to an HDF5 file:

  ```sh
  ❯ uv run src/generate_logits.py --model <path_to_model> --dataset <path_to_dataset> --output <output_file>
  ```

  For additional options:

  ```sh
  ❯ uv run src/generate_logits.py --help
  ```
- Compare Logits with `compare_logits.py`

  The `compare_logits.py` script calculates KL-divergence between two HDF5 logit files:

  ```sh
  ❯ uv run src/compare_logits.py <baseline_file> <target_file> --output_file <output_file>
  ```

  Access more options with:

  ```sh
  ❯ uv run src/compare_logits.py --help
  ```
- Orchestrate Logit Generation and Comparison with `kl_d_bench.py`

  Run `kl_d_bench.py` to manage logit generation and comparison in a synchronized workflow:

  ```sh
  ❯ uv run src/kl_d_bench.py --baseline-model <baseline_model> --target-model <target_model> --dataset <path_to_dataset> --output-file <output_file>
  ```

  For further options:

  ```sh
  ❯ uv run src/kl_d_bench.py --help
  ```
Execute the full test suite using:
```sh
❯ PYTHONPATH=src uv run --module unittest discover -s src/tests
```
- `v0.1`: best_bub.py script.
- `v0.3`: generation and comparison scripts.
- `v0.5`: kl-divergence comparison script.
- `v0.5.n`: usage guides.
- `v0.6`: convert Jupyter notebooks to general scripts.
- `v1.0`: PyPI submission, GitHub Actions, changelog.
Contributions are welcome! Here are several ways you can contribute:
- Report Issues: Submit bugs found or log feature requests for the llama-gguf-optimize project.
- Submit Pull Requests: Review open PRs, and submit your own PRs.
- Join the Discussions: Share your insights, provide feedback, or ask questions.
Contributing Guidelines
- Fork the Repository: Start by forking the project repository to your own account.
- Clone Locally: Clone the forked repository to your local machine using a git client.

  ```sh
  git clone <your_fork_url>
  ```

- Create a New Branch: Always work on a new branch, giving it a descriptive name.

  ```sh
  git checkout -b new-feature-x
  ```

- Make Your Changes: Develop and test your changes locally.
- Commit Your Changes: Commit with a clear message describing your updates.

  ```sh
  git commit -m 'Implemented new feature x.'
  ```

- Push to Your Fork: Push the changes to your forked repository.

  ```sh
  git push origin new-feature-x
  ```

- Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
- Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!
This project is licensed under the GNU Lesser General Public License. For more details, refer to the LICENSE file.