Bonjour 👋

Website | Getting Started | Documentation | Discord | Contributing

Bonjour 👋

At ZML, we are creating exciting AI products on top of our high-performance AI inference stack. Our stack is built for production, using the amazing Zig language, MLIR, and the power of Bazel.

Take me straight to getting started or give me a taste 🥐!

We're happy to share!

We're very happy to share our inference stack with the World and hope it allows you, too, to build cool and exciting AI projects.

To give you a glimpse of what you can do with ZML, here is an early demo:

It shows a prototype running a LLaMA2 model sharded on 1 NVIDIA RTX 4090, 1 AMD 6800XT, and 1 Google Cloud TPU v2. All accelerators were hosted in different locations, with activations being passed over a VPN.

All processes used the same model code, cross-compiled on a Mac, and copied onto the servers.

For more inspiration, see also the examples below or check out the examples folder.

Getting started

Prerequisites

We use bazel to build ZML and its dependencies. The only prerequisite is bazel, which we recommend to download through bazelisk, a version manager for bazel.

Please note: If you do not wish to install bazel system-wide, we provide examples/bazel.sh which downloads it to your home folder and runs it.

Install Bazel (recommended):

macOS

brew install bazelisk

Linux

curl -L -o /usr/local/bin/bazel 'https://github.com/bazelbuild/bazelisk/releases/download/v1.20.0/bazelisk-linux-amd64'
chmod +x /usr/local/bin/bazel

Run a pre-packaged model

We have implemented a variety of example models in ZML. See our reference implementations in the examples folder.

MNIST

The classic handwritten digits recognition task. The model is tasked to recognize a handwritten digit, which has been converted to a 28x28 pixel monochrome image. Bazel will download a pre-trained model, and the test dataset. The program will load the model, compile it, and classify a randomly picked example from the test dataset.

On the command line:

cd examples
bazel run -c opt //mnist

# or
./bazel.sh run -c opt //mnist

TinyLlama, Stories 15M

Our LLM examples start with a small model trained specifically on children's history books. This model has been trained by Andrej Karpathy; you can read more about it on his GitHub.

cd examples
bazel run -c opt //llama:TinyLlama-Stories-15M
bazel run -c opt //llama:TinyLlama-Stories-15M -- --prompt="Once upon a time, there was a cute little dragon"

OpenLLama 3B

cd examples
bazel run -c opt //llama:OpenLLaMA-3B
bazel run -c opt //llama:OpenLLaMA-3B -- --prompt="Once upon a time,"

Meta Llama 3.1 8B

This model has restrictions, see here. It requires approval from Meta on Huggingface, which can take a few hours to get granted.

While waiting, you can already generate an access token to log into HuggingFace from bazel; see here.

Once you've been granted access, you're ready to download a gated model like Meta-Llama-3.1-8B-Instruct!

# requires token in $HOME/.cache/huggingface/token, as created by the
# `huggingface-cli login` command, or the `HUGGINGFACE_TOKEN` environment variable.
cd examples
bazel run -c opt //llama:Llama-3.1-8B-Instruct
bazel run -c opt //llama:Llama-3.1-8B-Instruct -- --prompt="Once upon a time,"

You can also try Llama-3.1-70B-Instruct if you have enough memory.

Meta Llama 3.2 1B

Like the 8B model above, this model also requires approval. See here for access requirements.

cd examples
bazel run -c opt //llama:Llama-3.2-1B-Instruct
bazel run -c opt //llama:Llama-3.2-1B-Instruct -- --prompt="Once upon a time,"

For a larger 3.2 model, you can also try Llama-3.2-3B-Instruct.

Running Models on GPU / TPU

You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling / running a model:

NVIDIA CUDA: --@zml//runtimes:cuda=true
AMD RoCM: --@zml//runtimes:rocm=true
Google TPU: --@zml//runtimes:tpu=true
AWS Trainium/Inferentia 2: --@zml//runtimes:neuron=true
AVOID CPU: --@zml//runtimes:cpu=false

The latter, avoiding compilation for CPU, cuts down compilation time.

So, to run the OpenLLama model from above on your host sporting an NVIDIA GPU, run the following:

cd examples
bazel run -c opt //llama:OpenLLaMA-3B        \
          --@zml//runtimes:cuda=true         \
          -- --prompt="Once upon a time,"

Run Tests

bazel test //zml:test

A taste of ZML

MNIST

const std = @import("std");
const zml = @import("zml");

/// Model definition
const Mnist = struct {
    fc1: Layer,
    fc2: Layer,

    const Layer = struct {
        weight: zml.Tensor,
        bias: zml.Tensor,

        pub fn forward(self: Layer, input: zml.Tensor) zml.Tensor {
            return self.weight.matmul(input).add(self.bias).relu();
        }
    };

    /// just two linear layers + relu activation
    pub fn forward(self: Mnist, input: zml.Tensor) zml.Tensor {
        std.log.info("Compiling for target: {s}", .{@tagName(input.getContext().target())});
        var x = input.flattenAll().convert(.f32);
        const layers: []const Layer = &.{ self.fc1, self.fc2 };
        for (layers) |layer| {
            x = zml.call(layer, .forward, .{x});
        }
        return x.argMax(0, .u8).indices;
    }
};

Tagged Tensors

const Sdpa = struct {
    pub fn forward(_: Sdpa, ctx: *zml.Context, q_: zml.Tensor, k_: zml.Tensor, v_: zml.Tensor) zml.Tensor {
        const q = q_.withTags(.{ .b, .h, .q, .hd });
        const k = k_.withTags(.{ .b, .h, .k, .hd });
        const v = v_.withTags(.{ .b, .h, .k, .hd });
        const attn_mask = zml.nn.causalAttnMask(ctx, .{ .q = q.dim(.q), .k = k.dim(.k) }, q.dtype(), null);
        return zml.nn.sdpa(ctx, q, k, v, .{ .attn_mask = attn_mask });
    }
};

Where to go next:

You might want to check out more examples, read through the documentation directly on GitHub, or, for the full rendering experience, browse the online documentation with included API reference.

Contributing

See here.

License

ZML is licensed under the Apache 2.0 license.

Thanks to our contributors

Name		Name	Last commit message	Last commit date
Latest commit History 115 Commits
.github		.github
async		async
bazel		bazel
docs		docs
examples		examples
mlir		mlir
pjrt		pjrt
platforms		platforms
runtimes		runtimes
stdx		stdx
third_party		third_party
tools		tools
zml		zml
.bazelignore		.bazelignore
.bazelrc		.bazelrc
.bazelrc-examples		.bazelrc-examples
.bazelrc-zml		.bazelrc-zml
.bazelversion		.bazelversion
.gitignore		.gitignore
.nvim.lua		.nvim.lua
.vscode		.vscode
BUILD.bazel		BUILD.bazel
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
MODULE.bazel		MODULE.bazel
MODULE.bazel.lock		MODULE.bazel.lock
README.md		README.md
build.zig		build.zig
platform_mappings		platform_mappings

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bonjour 👋

We're happy to share!

Getting started

Prerequisites

macOS

Linux

Run a pre-packaged model

MNIST

TinyLlama, Stories 15M

OpenLLama 3B

Meta Llama 3.1 8B

Meta Llama 3.2 1B

Running Models on GPU / TPU

Run Tests

A taste of ZML

MNIST

Tagged Tensors

Where to go next:

Contributing

License

Thanks to our contributors

About

Releases

Contributors 14

Languages

License

zml/zml

Folders and files

Latest commit

History

Repository files navigation

Bonjour 👋

We're happy to share!

Getting started

Prerequisites

macOS

Linux

Run a pre-packaged model

MNIST

TinyLlama, Stories 15M

OpenLLama 3B

Meta Llama 3.1 8B

Meta Llama 3.2 1B

Running Models on GPU / TPU

Run Tests

A taste of ZML

MNIST

Tagged Tensors

Where to go next:

Contributing

License

Thanks to our contributors

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Contributors 14

Languages