A User-Space Block Device (USBD) Framework (written in Go)
Many people are familiar with user-space filesystems; on Linux these are popularly implemented using FUSE. This library is an attempt to provide a similar facility for user-space block devices, specifically those written in Go. USBD takes advantage of the seldom-used NBD interface provided by an in-tree Linux kernel module to allow a daemon running in user-space to export a block device. This package can be used to write a software block device (in Go) and export it via /dev/nbdX, where X is the next available device on the system. After doing so, such a device can be formatted with the filesystem of the user's choice (ex. ext4, btrfs, xfs, zfs, etc.) and mounted in the usual ways (ex. mount, /etc/fstab, etc.).
A daemon is created using the USBD framework that acts as an NBD server. Overhead between the user-space daemon and the kernel is minimized by using the AF_UNIX (also known as AF_LOCAL) socket type to communicate efficiently between local processes. The USBD framework allows a Go programmer to define their own block devices by implementing a type that conforms to the following simple interface:
type Device interface {
	Size() int64
	BlockSize() int64
	ReadAt(buf []byte, pos int64) (count int, err error)
	WriteAt(buf []byte, pos int64) (count int, err error)
	Trim(pos int64, count int) error
	Flush() error
	Close() error
}
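For illustration, the following is a minimal sketch of a memory-backed implementation of this interface (bounds checks and error handling are omitted for brevity, and the names are illustrative; the bundled ramdisk device is the canonical example):

// ramDevice is a minimal, illustrative memory-backed Device.
type ramDevice struct {
	data []byte // the entire device contents
}

func (d *ramDevice) Size() int64      { return int64(len(d.data)) }
func (d *ramDevice) BlockSize() int64 { return 4096 }

func (d *ramDevice) ReadAt(buf []byte, pos int64) (int, error) {
	return copy(buf, d.data[pos:]), nil
}

func (d *ramDevice) WriteAt(buf []byte, pos int64) (int, error) {
	return copy(d.data[pos:], buf), nil
}

func (d *ramDevice) Trim(pos int64, count int) error {
	for i := int64(0); i < int64(count); i++ {
		d.data[pos+i] = 0 // trimmed blocks read back as zeros
	}
	return nil
}

func (d *ramDevice) Flush() error { return nil } // nothing is buffered
func (d *ramDevice) Close() error { return nil }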
Typically a daemon will call NewNbdHandler to instantiate a new usbdlib.NbdStream, which is configured to communicate with the kernel NBD module. Calling NbdStream.ProcessRequests then hands the stream to a usbdlib.ReqProcessor, which acts as a proxy between the kernel (which handles NBD requests) and the user's usbdlib.Device implementation. A command-line server, usbdsrvd, is included to allow easy testing of the USBD engine and reference devices; this initialization procedure can be observed in its implementation. This server will make a chosen device available as /dev/nbdX, after which time it can be formatted and mounted or otherwise used as any other block device.
cmd
├── usbdsrvd
│ ├── An example daemon that uses usbdlib to serve this repository's user-space
│ │ device implementations.
│ └── conf
│     └── usbdsrvd configuration definitions and logic.
pkg
├── devices
│ ├── filedisk
│ │ └── An example USBD implementation that is backed by a simple file (filesystem).
│ ├── dedupdisk
│ │ └── An example, but potentially useful, USBD implementation that uses files to
│ │ back a logical device, with the added use of hashing and a Pebble database
│ │ for block-level deduplication. This is an older prototype; filedisk and
│ │ ramdisk are better examples for those looking to implement their own devices.
│ ├── ramdisk
│ │ └── An example USBD implementation that is memory backed (ramdrive).
│ └── objstore
│ └── A USBD implementation which is practically useful, as it creates a logical
│ device which is cached locally but actually backed by a remote objectstore
│ of the user's choice (ex. S3/minio, Swift, BackBlaze, Azure/Google Object Storage).
│ However, due to its complexity it is not a great example for learning how to
│ build a user-space block device.
├── usbdlib
│ └── The core USBD interface definitions and engine to export/serve implementations
└── util
└── Utilities for implementors of user-space block devices to reuse
Note the objstore and dedupdisk implementations are non-trivial compared to the others; they use the stow ObjectStorage abstraction library and Pebble DB respectively.
As of writing, this package is actively maintained. Issue reports are welcome. Pull requests are welcome but will require consent to a contributor agreement. This project is currently beta quality. There are no known defects or risks to usage, but it has not yet been widely used. Experience reports are helpful and encouraged!
There are ramdisk and filedisk device implementations that are excellent for understanding how to implement a simple USBD device. Starting usbdsrvd with defaults (no arguments) will result in a 1 GiB ramdisk-backed device being exposed as the next available NBD device, typically /dev/nbd0. Unlike these straightforward examples, there are also some advanced device implementations which may be of more practical value.
This project is completely Linux-centric (since it uses NBD) and requires the NBD kernel headers to be installed and available. To build, the Go toolchain must be installed; after that, building is a simple matter of running go build in the usbdsrvd directory to generate an executable.
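For example:

git clone https://github.com/tarndt/usbd.git
cd usbd/cmd/usbdsrvd
go build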
usbd has both automated unit testing and manual testing approaches.
A suite of basic unit tests is provided for implementors of devices, and all of the included devices use it:
$ cd usbd
$ go test ./...
ok github.com/tarndt/usbd/pkg/devices/dedupdisk/dedupdisk_test 6.166s
ok github.com/tarndt/usbd/pkg/devices/filedisk/filedisk_test 8.011s
ok github.com/tarndt/usbd/pkg/devices/objstore 65.617s
ok github.com/tarndt/usbd/pkg/devices/ramdisk 2.351s
ok github.com/tarndt/usbd/pkg/usbdlib (cached)
ok github.com/tarndt/usbd/pkg/usbdlib/usbdlib_test 0.197s
ok github.com/tarndt/usbd/pkg/util (cached)
Executing these tests is just a matter of running go test. As seen above, you can run all unit tests in the project repository by running go test ./... in the root directory. Some tests that interact with the Linux kernel require super-user privileges (aka root). Should you run tests with verbosity (go test -v), you will see some tests are skipped during normal execution. Additionally, while running tests with the race detector (-race) has proven fruitful for discovering data races, please be warned that in this mode tests often take an order of magnitude longer to run and may require you to increase the test timeout (ex. -timeout=10m). The provided test suites do support a short mode as well (go test -short). Sometimes it's useful to compile a package's unit tests to their own executable, and this can be done in typical Go fashion with go test -c (or go test -c -race to also enable the data race detector).
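In summary, the common invocations described above (all standard Go tooling) are:

$ go test ./...                     # run every unit test from the repository root
$ sudo go test -v ./...             # verbose; include tests that require root
$ go test -short ./...              # short mode
$ go test -race -timeout=10m ./...  # race detector; allow a longer timeout
$ go test -c -race                  # compile a package's tests to an executable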
Manual testing has been performed using standard block device tools (ex. hdparm) and by exporting an instance of the USBD reference implementations, creating a VirtualBox VM using the exported NBD device, and installing Windows XP/10 and Ubuntu on it.
In addition to the basic reference devices, as mentioned in the project structure above, there are two notable non-trivial device implementations.
The ObjectStore (objstore) driver creates a logical device which is cached locally on disk but actually backed by a remote object store of the user's choice (ex. S3, Swift, BackBlaze B2, and Azure's/Google's Object Storage). Besides the original implementors and providers of each object storage technology/service, there are other projects and organizations that provide compatible alternatives; for example, minio allows easy self-hosted/Kubernetes-hosted S3, Ceph emulates both S3 and Swift with its RADOS gateway, and many other cloud providers offer S3-compatible SaaS such as DigitalOcean's Spaces and Wasabi.
The above OS install procedure has been used to manually verify correct operation of the objectstore implementation. Key to this implementation performing well is the use of the included S2 compression and fast (enough) network access to the backing objectstore server. For security reasons it's highly recommended to use the provided AES encryption functionality.
Due to fixes that have not yet made it upstream, the ObjectStore implementation uses a fork of the stow ObjectStorage abstraction library, which enables this project to easily support many popular object stores. The command-line argument -objstore-cfg=<yourJSON> allows ObjectStore-type-specific configuration parameters to be passed. To see examples of these parameters, refer to the stow ObjectStore configuration definitions, for example, the S3 config or Azure config options.
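For example, the default local-minio configuration (see the flag reference below) could be passed explicitly as:

./usbdsrvd -dev-type=objstore -objstore-cfg='{"access_key_id":"minioadmin","endpoint":"http://127.0.0.1:9000","secret_key":"minioadmin"}'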
An easy way to test this is to set up a local minio (S3-compatible) objectstore server:
cd /tmp
wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod u+x ./minio
mkdir ./minio_data
./minio server ./minio_data
The minio server instance will use the default minio credentials, which are in turn the defaults used by usbdsrvd when objectstore config is not provided.
The Deduplication (dedupdisk) device implementation uses files to back a logical device, with the added use of content-hashing and a Pebble database for block-level deduplication. Work is in progress to also allow the stored blocks to be compressed.
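Conceptually, the write path of such a device looks like the following sketch; this is a simplification in which an in-memory map stands in for the Pebble database, and the names are illustrative rather than the actual dedupdisk API (see pkg/devices/dedupdisk for the real code):

import (
	"crypto/sha256"
	"os"
)

type dedupSketch struct {
	blockFile *os.File                    // stores one copy of each unique block
	hashToOff map[[sha256.Size]byte]int64 // content hash -> offset of stored copy
	nextOff   int64                       // next free offset in blockFile
}

// writeBlock stores one fixed-size logical block, reusing the physical copy
// when identical content was stored before, and returns that copy's offset.
// A real implementation would also persist a logical-block -> offset index.
func (d *dedupSketch) writeBlock(block []byte) (int64, error) {
	sum := sha256.Sum256(block)
	if off, seen := d.hashToOff[sum]; seen {
		return off, nil // duplicate content: nothing new is written
	}
	off := d.nextOff
	if _, err := d.blockFile.WriteAt(block, off); err != nil {
		return 0, err
	}
	d.hashToOff[sum] = off
	d.nextOff += int64(len(block))
	return off, nil
}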
To verify the dedup implementation, after initial installation testing, the disk was "quick" re-formatted (no zeroing) and another fresh install was performed; file analysis verified deduplication was taking place by checking that the disk files did not grow meaningfully. Read performance up to 3.2 GB/s and write performance (while writing duplicate data) as high as 2.1 GB/s has been achieved with SSDs hosting the backing PebbleDB database and block-file.
While this library is intended to be used by other daemons, the included usbdsrvd (main.go) will host instances of the sample device implementations and may be useful in its own right. Starting usbdsrvd with defaults (no arguments) will result in a 1 GiB ramdisk-backed device being exposed as the next available NBD device, typically /dev/nbd0. If the NBD kernel module is not loaded, usbdsrvd will attempt to load it. The maximum number of NBD devices a system can have is determined at kernel module load time, so if the default is too few devices you may need to increase it with -nbd-max-devs if using the usbdsrvd daemon, or via an argument to NewNbdHandler if interfacing programmatically.
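For example, when loading the module by hand, the in-tree NBD module's nbds_max parameter sets this limit:

sudo modprobe nbd nbds_max=64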
Usage: ./usbdsrvd [optional: options see below...] [optional: NBD device to use ex. /dev/nbd0, if absent the first free device is used.]
Arguments starting with <driver-name>- are only applicable if -dev-type=<driver-name> is set.
Examples:
1 GiB device backed by memory: ./usbdsrvd
8 GiB device backed by a file and exported specifically on /dev/nbd5: ./usbdsrvd -dev-type=file -store-dir=/tmp -store-name=testfilevol -store-size=8GiB /dev/nbd5
12 GiB device backed by files deduplicated using PebbleDB: ./usbdsrvd -dev-type=dedup -store-dir=/tmp -store-name=testdedupvol -store-size=12GiB
20 GiB device backed by a locally running S3/minio objectstore: ./usbdsrvd -dev-type=objstore -store-dir=/tmp -store-name=testobjvol -store-size=20GiB
-help
Display help and exit
-dev-type string
Type of device to back block device with: 'mem', 'file', 'dedup', 'objstore'. (default "mem")
-store-dir string
Location to create new backing disk files in (default "./")
-store-name string
File base name to use for new backing disk files (default "test-lun")
-store-size value
Amount of storage capacity to use for new backing files (ex. 100 MiB, 20 GiB) (default 1.0 GiB)
-nbd-max-devs uint
If the NBD kernel module is loaded by this daemon how many NBD devices should it create (default 32)
-dedup-memcache string
Amount of memory to use for the dedup store ID cache (ex. 100 MiB, 20 GiB) (default "512 MiB")
-objstore-kind string
Type of remote objectstore: 's3', 'b2', 'local', 'azure', 'swift', 'google', 'oracle' or 'sftp' (default "s3")
-objstore-cfg string
JSON configuration (default assumes local minio [kind "s3"] with default settings) (default "{\"access_key_id\":\"minioadmin\",\"endpoint\":\"http://127.0.0.1:9000\",\"secret_key\":\"minioadmin\"}")
-objstore-objsize value
Size of remote objects (ex. 32 MiB, 1 GiB) (default 64 MiB)
-objstore-diskcache value
Amount of disk to use for caching remote objects (0 implies full local backing/caching; ex. 100 MiB, 20 GiB)
-objstore-persistcache
Should the local cache be persistent or deleted on device shutdown
-objstore-aeskey string
If AES is enabled; AES key to use to encrypt remote objects (if absent a key is generated and saved to ./key.aes, otherwise use: key:<value>, file:<path>, env:<varname>)
-objstore-aesmode string
AES encryption mode to use to encrypt remote objects: "aes-cfb", "aes-ctr", "aes-ofb" or "identity" for no encryption. "aes-ctr" is recommended. (default "aes-ctr")
-objstore-compress string
Compression algorithm to use for remote objects: "s2", "gzip" or "identity" for no compression (default "s2")
-objstore-concurflush uint
Maximum number of dirty local objects to concurrently upload to the remote objectstore (0 implies use heuristic)
-objstore-flushevery duration
Frequency in which dirty local objects are uploaded to the remote objectstore (0 disables autoflush) (default 10s)
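For instance, to supply the AES key from a file using the file:<path> form described above (the path here is illustrative):

./usbdsrvd -dev-type=objstore -objstore-aeskey=file:/etc/usbd/key.aes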
Let's create a 10 GiB volume using a local minio (S3-compatible) server. Please note that many of the operations below require super-user privileges (aka root).
Terminal Session 1: Setup minio server
cd /tmp
wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod u+x ./minio
mkdir ./minio_data
./minio server ./minio_data
Note the data in /tmp/minio_data would normally live on a remote service.
You should see minio is running:
WARNING: Detected default credentials 'minioadmin:minioadmin', we recommend that you change these values with 'MINIO_ROOT_USER' and 'MINIO_ROOT_PASSWORD' environment variables
......
WARNING: Console endpoint is listening on a dynamic port (44599), please use --console-address ":PORT" to choose a static port.
Terminal Session 2: Start a usbd server
cd /tmp
mkdir ./usbdsrv_cache
git clone https://github.com/tarndt/usbd.git
cd usbd/cmd/usbdsrvd
go build
sudo ./usbdsrvd -dev-type=objstore -store-dir=/tmp/usbdsrv_cache -store-name=tutorial-vol -store-size=10GiB
You should see the block device is being exported:
Generated AES-256 key and stored it in "/tmp/usbdsrv_cache/key.aes"
USBD Server (./usbdsrvd) started.
USBD Server (./usbdsrvd) using config: Exporting 10 GiB volume "tutorial-vol" as next available NBD device with local storage at "/tmp/usbdsrv_cache" using driver locally-cached objectstore using a s3 remote object store (secret_key=<REDACTED>, access_key_id="minioadmin", endpoint="http://127.0.0.1:9000"), with 64 MiB objects, using up to as much as total device size of persistent local storage for cache, s2 compression, aes-ctr encryption, flushing to remote store every 10s using 16 workers
USBD Server (./usbdsrvd) is processing requests for "/dev/nbd0"
Take note of the device the volume is being exported on; the following instructions assume "/dev/nbd0" as shown above. Later, you can shut down this service using Ctrl+C or kill (but make sure you umount the device first!).
Terminal Session 3: Using it! You can now format the device and start using it:
sudo mkfs.ext4 /dev/nbd0
cd /tmp
mkdir ./tutorial-vol
sudo mount /dev/nbd0 /tmp/tutorial-vol
sudo chmod a+r,a+w /tmp/tutorial-vol/
The device should show as mounted, and files written to /tmp/tutorial-vol are now backed by the (in this case local minio) objectstore.
$ mount | grep nbd
/dev/nbd0 on /tmp/tutorial-vol type ext4 (rw,relatime)
$ cd /tmp/tutorial-vol/
$ echo "test file" > test.txt
$ cat test.txt
test file
Feel free to inspect usbdsrvd's cache and/or minio's data directories to observe the presence of data files.
To clean up, ensure you are not in /tmp/tutorial-vol (or you will get a busy error) and umount:
cd /tmp
sudo umount /tmp/tutorial-vol
You can now return to session 2 and Ctrl+C usbdsrvd, and then finally return to session 1 and Ctrl+C minio. To avoid data loss, it is key that the correct shutdown order is followed:
- Cease usage of the mount point
- Unmount the block device
- Shut down the daemon
- Shut down the objectstore
Work is in progress on a patch to the Linux kernel NBD driver to allow dynamic creation of NBD devices, in addition to the pre-allocation at module load provided for today. This will allow facilities such as udev and the mknod system call or command to provision more NBD devices than initially created.
Work is in progress on a CSI storage driver that uses the objectstore implementation provided in this repository. This will allow Kubernetes to use the provided object-storage-backed device to provide persistent volumes. If you are interested in contributing, please contact the author!