htsget-config

Configuration for htsget-rs and relevant crates.

Overview

This crate is used to configure htsget-rs by using a config file or reading environment variables.

Usage

For running htsget-rs as an application

To configure htsget-rs, a TOML config file can be used. It also supports reading config from environment variables. Any config options set by environment variables override values in the config file. For some of the more deeply nested config options, it may be more ergonomic to use a config file rather than environment variables.

The configuration consists of multiple parts: config for the ticket server, config for the data server, service-info config, and config for the resolvers.
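As a rough orientation, the sketch below shows how the four parts might sit together in one config file. It only uses options described in the sections that follow, and the values are placeholders. Note that in TOML, top-level keys must appear before the first table header, so the flattened server and service-info options are placed above the resolvers.

```toml
# Ticket server, data server and service-info options are top-level keys.
ticket_server_addr = '127.0.0.1:8080'
data_server_addr = '127.0.0.1:8081'
data_server_local_path = './'
id = 'id'
name = 'name'

# Resolvers are an array of tables and come after all top-level keys.
[[resolvers]]
regex = '.*'
substitution_string = '$0'

[resolvers.storage]
backend = 'Local'
```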

Ticket server config

The ticket server responds to htsget requests by returning a set of URL tickets that the client must fetch and concatenate. To configure the ticket server, set the following options:

| Option | Description | Type | Default |
| --- | --- | --- | --- |
| ticket_server_addr | The address for the ticket server. | Socket address | '127.0.0.1:8080' |
| ticket_server_tls | Enable TLS for the ticket server. See TLS for more details. | TOML table | Not enabled |
| ticket_server_cors_allow_credentials | Controls the CORS Access-Control-Allow-Credentials for the ticket server. | Boolean | false |
| ticket_server_cors_allow_origins | Set the CORS Access-Control-Allow-Origin returned by the ticket server. This can be set to All to send a wildcard, Mirror to echo back the origin sent by the client, or a specific array of origins. | 'All', 'Mirror' or an array of origins | ['http://localhost:8080'] |
| ticket_server_cors_allow_headers | Set the CORS Access-Control-Allow-Headers returned by the ticket server. This can be set to All to allow all headers, or a specific array of headers. | 'All', or an array of headers | 'All' |
| ticket_server_cors_allow_methods | Set the CORS Access-Control-Allow-Methods returned by the ticket server. This can be set to All to allow all methods, or a specific array of methods. | 'All', or an array of methods | 'All' |
| ticket_server_cors_max_age | Set the CORS Access-Control-Max-Age for the ticket server, which controls how long a preflight request can be cached for. | Seconds | 86400 |
| ticket_server_cors_expose_headers | Set the CORS Access-Control-Expose-Headers returned by the ticket server. This can be set to All to expose all headers, or a specific array of headers. | 'All', or an array of headers | [] |

TLS is supported by setting the key and cert options of the ticket_server_tls table (see TLS). An example of config for the ticket server:

ticket_server_addr = '127.0.0.1:8080'
ticket_server_cors_allow_credentials = false
ticket_server_cors_allow_origins = 'Mirror'
ticket_server_cors_allow_headers = ['Content-Type']
ticket_server_cors_allow_methods = ['GET', 'POST']
ticket_server_cors_max_age = 86400
ticket_server_cors_expose_headers = []
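As a variation on the example above, the following sketch restricts CORS to an explicit list of origins instead of using 'Mirror'. The origins are placeholders. Browsers do not accept a wildcard Access-Control-Allow-Origin together with credentials, so an explicit list is the usual choice when ticket_server_cors_allow_credentials is true.

```toml
ticket_server_addr = '127.0.0.1:8080'
# Placeholder origins; replace with the sites that should be allowed.
ticket_server_cors_allow_origins = ['https://example.com', 'https://app.example.com']
ticket_server_cors_allow_credentials = true
ticket_server_cors_allow_headers = 'All'
ticket_server_cors_allow_methods = ['GET', 'POST']
```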

Local data server config

The local data server responds to tickets produced by the ticket server by serving local filesystem data. To configure the data server, set the following options:

| Option | Description | Type | Default |
| --- | --- | --- | --- |
| data_server_addr | The address for the data server. | Socket address | '127.0.0.1:8081' |
| data_server_local_path | The local path which the data server can access to serve files. | Filesystem path | './' |
| data_server_serve_at | The path which the data server will prefix to all response URLs for tickets. | URL path | '' |
| data_server_tls | Enable TLS for the data server. See TLS for more details. | TOML table | Not enabled |
| data_server_cors_allow_credentials | Controls the CORS Access-Control-Allow-Credentials for the data server. | Boolean | false |
| data_server_cors_allow_origins | Set the CORS Access-Control-Allow-Origin returned by the data server. This can be set to All to send a wildcard, Mirror to echo back the origin sent by the client, or a specific array of origins. | 'All', 'Mirror' or an array of origins | ['http://localhost:8080'] |
| data_server_cors_allow_headers | Set the CORS Access-Control-Allow-Headers returned by the data server. This can be set to All to allow all headers, or a specific array of headers. | 'All', or an array of headers | 'All' |
| data_server_cors_allow_methods | Set the CORS Access-Control-Allow-Methods returned by the data server. This can be set to All to allow all methods, or a specific array of methods. | 'All', or an array of methods | 'All' |
| data_server_cors_max_age | Set the CORS Access-Control-Max-Age for the data server, which controls how long a preflight request can be cached for. | Seconds | 86400 |
| data_server_cors_expose_headers | Set the CORS Access-Control-Expose-Headers returned by the data server. This can be set to All to expose all headers, or a specific array of headers. | 'All', or an array of headers | [] |

TLS is supported by setting the key and cert options of the data_server_tls table (see TLS). An example of config for the data server:

data_server_addr = '127.0.0.1:8081'
data_server_local_path = './'
data_server_serve_at = ''
data_server_tls.key = 'key.pem'
data_server_tls.cert = 'cert.pem'
data_server_cors_allow_credentials = false
data_server_cors_allow_origins = 'Mirror'
data_server_cors_allow_headers = ['Content-Type']
data_server_cors_allow_methods = ['GET', 'POST']
data_server_cors_max_age = 86400
data_server_cors_expose_headers = []

Sometimes it is useful to disable the data server because all tickets returned by the ticket server are handled elsewhere, such as by AWS S3.

To disable the data server, set the following option:

data_server_enabled = false
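For instance, a config that serves everything from S3 might disable the local data server entirely. This is a sketch that combines data_server_enabled with the S3 resolver options described under Resolvers; the bucket name is a placeholder.

```toml
# No local data server is started.
data_server_enabled = false

# All query IDs resolve to objects in a single S3 bucket.
[[resolvers]]
regex = '.*'
substitution_string = '$0'

[resolvers.storage]
backend = 'S3'
bucket = 'bucket'
```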

Service info config

The service info config controls what is returned when the service-info path is queried.
To configure the service-info, set the following options:

| Option | Description | Type | Default |
| --- | --- | --- | --- |
| id | Service ID. | String | Not set |
| name | Service name. | String | Not set |
| version | Service version. | String | Not set |
| organization_name | Organization name. | String | Not set |
| organization_url | Organization URL. | String | Not set |
| contact_url | Service contact URL. | String | Not set |
| documentation_url | Service documentation URL. | String | Not set |
| created_at | When the service was created. | String | Not set |
| updated_at | When the service was last updated. | String | Not set |
| environment | The environment the service is running in. | String | Not set |

An example of config for the service info:

id = 'id'
name = 'name'
version = '0.1'
organization_name = 'name'
organization_url = 'https://example.com/'
contact_url = 'mailto:nobody@example.com'
documentation_url = 'https://example.com/'
created_at = '2022-01-01T12:00:00Z'
updated_at = '2022-01-01T12:00:00Z'
environment = 'dev'

Resolvers

The resolvers component of htsget-rs is used to map query IDs to the location of the resource. Each query that htsget-rs receives is 'resolved' to a location, which a data server can respond with. A query ID is matched with a regex, and is then mapped with a substitution string that has access to the regex capture groups. Resolvers are configured in an array, and the first matching resolver is the one used to map the ID.

To create a resolver, add a [[resolvers]] array of tables, and set the following options:

| Option | Description | Type | Default |
| --- | --- | --- | --- |
| regex | A regular expression which can match a query ID. | Regex | '.*' |
| substitution_string | The replacement expression used to map the matched query ID. This has access to the match groups in the regex option. | String with access to capture groups | '$0' |

For example, below is a regex option which matches a / between two groups, and a substitution_string which inserts an additional data path segment between the groups.

[[resolvers]]
regex = '(?P<group1>.*?)/(?P<group2>.*)'
substitution_string = '$group1/data/$group2'

For more information about regex options see the regex crate.

Each resolver also maps to a certain storage backend. This storage backend can be used to set query IDs which are served from local storage, from S3-style bucket storage, or from HTTP URLs. To set the storage backend for a resolver, add a [resolvers.storage] table. Some storage backends require feature flags to be set when compiling htsget-rs.

To use LocalStorage, set backend = 'Local' under [resolvers.storage], and specify any additional options from below:

| Option | Description | Type | Default |
| --- | --- | --- | --- |
| scheme | The scheme present on URL tickets. | Either 'Http' or 'Https' | 'Http' |
| authority | The authority present on URL tickets. This should likely match the data_server_addr. | URL authority | '127.0.0.1:8081' |
| local_path | The local filesystem path which the data server uses to respond to tickets. This should likely match the data_server_local_path. | Filesystem path | './' |
| path_prefix | The path prefix which the URL tickets will have. This should likely match the data_server_serve_at path. | URL path | '' |
| use_data_server_config | Whether to use the data server config to fill in the above values. This overrides any other options specified from this table. | Boolean | false |
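Since scheme, authority, local_path and path_prefix usually mirror the data server settings, use_data_server_config can avoid repeating them. A minimal sketch, assuming the data server options are set elsewhere in the same config:

```toml
[[resolvers]]
regex = '.*'
substitution_string = '$0'

[resolvers.storage]
backend = 'Local'
# Take scheme, authority, local_path and path_prefix from the data server config.
use_data_server_config = true
```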

To use S3Storage, build htsget-rs with the s3-storage feature enabled, set backend = 'S3' under [resolvers.storage], and specify any additional options from below:

| Option | Description | Type | Default |
| --- | --- | --- | --- |
| bucket | The AWS S3 bucket where resources can be retrieved from. | String | Derived from the resolvers regex property if empty. This uses the first capture group in the regex as the bucket. |
| endpoint | A custom endpoint to override the default S3 service address. This is useful for using S3 locally or with storage backends such as MinIO. See MinIO. | String | Not set, uses regular AWS S3 services. |
| path_style | The S3 path style to request from the storage backend. If true, "path style" is used, e.g. host.com/bucket/object.bam, otherwise bucket.host.com/object style is used. | Boolean | false |

UrlStorage is another storage backend which can be used to serve data from a remote HTTP URL. When using this storage backend, htsget-rs will fetch data from a url which is set in the config. It will also forward any headers received with the initial query, which is useful for authentication. To use UrlStorage, build htsget-rs with the url-storage feature enabled, set backend = 'Url' under [resolvers.storage], and specify any additional options from below:

| Option | Description | Type | Default |
| --- | --- | --- | --- |
| url | The URL to fetch data from. | HTTP URL | "https://127.0.0.1:8081/" |
| response_url | The URL to return to the client for fetching tickets. | HTTP URL | "https://127.0.0.1:8081/" |
| forward_headers | When constructing the URL tickets, copy HTTP headers received in the initial query. | Boolean | true |
| header_blacklist | List of headers that should not be forwarded. | Array of headers | [] |
| tls | Additionally enables client authentication, or sets non-native root certificates for TLS. See TLS for more details. | TOML table | TLS is always allowed, however the default performs no client authentication and uses native root certificates. |

When using UrlStorage, the following requests will be made to the url.

  • GET request to fetch only the headers of the data file (e.g. GET /data.bam, with Range: bytes=0-<end_of_bam_header>).
  • GET request to fetch the entire index file (e.g. GET /data.bam.bai).
  • HEAD request on the data file to get its length (e.g. HEAD /data.bam).

By default, all headers received in the initial query will be included when making these requests. To exclude certain headers from being forwarded, set the header_blacklist option. Note that the blacklisted headers are removed from the requests made to url and from the URL tickets as well.

For example, a resolvers value of:

[[resolvers]]
regex = '^(example_bucket)/(?P<key>.*)$'
substitution_string = '$key'

[resolvers.storage]
backend = 'S3'
# Uses the first capture group in the regex as the bucket.

This uses "example_bucket" as the S3 bucket if the resolver matches, because it is the first capture group in the regex. Note that to use this feature, at least one capture group must be defined in the regex.

Note that all the values for S3Storage or LocalStorage can also be set manually by adding a [resolvers.storage] table. For example, to manually set the config for LocalStorage:

[[resolvers]]
regex = '.*'
substitution_string = '$0'

[resolvers.storage]
backend = 'Local'
scheme = 'Http'
authority = '127.0.0.1:8081'
local_path = './'
path_prefix = ''

or, to manually set the config for S3Storage:

[[resolvers]]
regex = '.*'
substitution_string = '$0'

[resolvers.storage]
backend = 'S3'
bucket = 'bucket'

UrlStorage can only be specified manually. Example of a resolver with UrlStorage:

[[resolvers]]
regex = ".*"
substitution_string = "$0"

[resolvers.storage]
backend = 'Url'
url = "http://localhost:8080"
response_url = "https://example.com"
forward_headers = true
header_blacklist = ["Host"]
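Because resolvers are tried in order and the first match wins, more specific patterns should generally be listed before catch-all ones. The sketch below illustrates this with two resolvers; the 'local/' ID prefix and the bucket name are hypothetical.

```toml
# Checked first: IDs such as 'local/sample1' are served from local storage
# as 'sample1'.
[[resolvers]]
regex = '^local/(?P<key>.*)$'
substitution_string = '$key'

[resolvers.storage]
backend = 'Local'

# Checked second: every other ID falls through to S3 unchanged.
[[resolvers]]
regex = '.*'
substitution_string = '$0'

[resolvers.storage]
backend = 'S3'
bucket = 'bucket'
```

If the two entries were swapped, the catch-all resolver would match every ID and the 'local/' resolver would never be used.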

There are additional examples of config files located under examples/config-files.

Note

By default, when htsget-rs is compiled with the s3-storage feature flag, backend = 'S3' is used when no storage options are specified. Otherwise, backend = 'Local' is used when no storage options are specified. Compilation includes the s3-storage feature flag by default, so in order to have backend = 'Local' as the default, --no-default-features can be passed to cargo.

Allow guard

Additionally, the resolver component can resolve IDs based on the other fields present in a query. This is useful because it allows a resolver to match an ID only if a particular set of query parameters is also present. For example, a resolver can be set to only resolve IDs if the format is also BAM.

This component can be configured by setting the [resolvers.allow_guard] table. The following options are available to restrict which queries are resolved by a resolver:

| Option | Description | Type | Default |
| --- | --- | --- | --- |
| allow_reference_names | Resolve the query ID if the query also contains the reference names set by this option. | Array of reference names or 'All' | 'All' |
| allow_fields | Resolve the query ID if the query also contains the fields set by this option. | Array of fields or 'All' | 'All' |
| allow_tags | Resolve the query ID if the query also contains the tags set by this option. | Array of tags or 'All' | 'All' |
| allow_formats | Resolve the query ID if the query is one of the formats specified by this option. | An array of formats containing 'BAM', 'CRAM', 'VCF', or 'BCF' | ['BAM', 'CRAM', 'VCF', 'BCF'] |
| allow_classes | Resolve the query ID if the query is one of the classes specified by this option. | An array of classes containing either 'body' or 'header' | ['body', 'header'] |
| allow_interval_start | Resolve the query ID if the query reference start position is at least this option. | Unsigned 32-bit integer start position, 0-based, inclusive | Not set, allows all start positions |
| allow_interval_end | Resolve the query ID if the query reference end position is at most this option. | Unsigned 32-bit integer end position, 0-based, exclusive | Not set, allows all end positions |

An example of a fully configured resolver:

[[resolvers]]
regex = '.*'
substitution_string = '$0'

[resolvers.storage]
backend = 'S3'
bucket = 'bucket'

[resolvers.allow_guard]
allow_reference_names = ['chr1']
allow_fields = ['QNAME']
allow_tags = ['RG']
allow_formats = ['BAM']
allow_classes = ['body']
allow_interval_start = 100
allow_interval_end = 1000

In this example, the resolver will only match the query ID if the query is for chr1 with positions between 100 and 1000, and the remaining allow_guard options (fields, tags, format and class) are also satisfied.
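The allow guard options do not all need to be set. As a smaller sketch of the BAM-only case mentioned earlier, the resolver below leaves every option at its default except allow_formats:

```toml
[[resolvers]]
regex = '.*'
substitution_string = '$0'

[resolvers.storage]
backend = 'Local'

# Only resolve IDs for BAM queries; all other guard options keep their defaults.
[resolvers.allow_guard]
allow_formats = ['BAM']
```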

TLS

TLS can be configured for the ticket server, data server, or the url storage client. These options read private keys and certificates from PEM-formatted files. Certificates must be in X.509 format and private keys can be RSA, PKCS8, or SEC1 (EC) encoded. The following options are available:

| Option | Description | Type | Default |
| --- | --- | --- | --- |
| key | The path to the PEM formatted RSA, PKCS8, or SEC1 encoded EC private key. Specifies TLS for servers or client authentication for clients. | Filesystem path | Not Set |
| cert | The path to the PEM formatted X.509 certificate. Specifies TLS for servers or client authentication for clients. | Filesystem path | Not Set |
| root_store | The path to the PEM formatted root certificate store. Only used to specify non-native root certificates for the HTTP client in UrlStorage. | Filesystem path | Not Set |

When used by the ticket and data servers, key and cert enable TLS, and when used with the url storage client, they enable client authentication. The root store is only used by the url storage client. Note, the url storage client always allows TLS, however the default configuration performs no client authentication and uses the native root certificate store.

For example, TLS for the ticket server can be enabled by specifying the key and cert options:

ticket_server_tls.cert = "cert.pem"
ticket_server_tls.key = "key.pem"

This project uses rustls for all TLS logic, and it does not depend on OpenSSL. The rustls library can be stricter than OpenSSL when accepting certificates and keys. For example, it does not accept self-signed certificates that have a CA used as an end-entity. If generating certificates for root_store using OpenSSL, the correct extensions, such as subjectAltName, should be included.

An example of generating a custom root CA and certificates for a UrlStorage backend:

# Create a root CA
openssl req -x509 -noenc -subj '/CN=localhost' -newkey rsa -keyout root.key -out root.crt

# Create a certificate signing request
openssl req -noenc -newkey rsa -keyout server.key -out server.csr -subj '/CN=localhost' -addext subjectAltName=DNS:localhost

# Create the `UrlStorage` server's certificate
openssl x509 -req -in server.csr -CA root.crt -CAkey root.key -days 365 -out server.crt -copy_extensions copy

# An additional client certificate signing request and certificate can be created in the same way as the server
# certificate if using client authentication.

The root.crt can then be used in htsget-rs to allow authenticating to a UrlStorage backend using server.crt:

# Trust the root CA that signed the server's certificate.
tls.root_store = "root.crt"
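To show where the tls options sit, here is a sketch of a complete UrlStorage resolver that trusts the custom root CA and also presents a client certificate. The URLs are placeholders, and client.key and client.crt are hypothetical files created in the same way as the server certificate above.

```toml
[[resolvers]]
regex = '.*'
substitution_string = '$0'

[resolvers.storage]
backend = 'Url'
url = 'https://localhost:8081'
response_url = 'https://localhost:8081'
# Trust the root CA that signed the server's certificate.
tls.root_store = 'root.crt'
# Optional: client authentication with a hypothetical client key and certificate.
tls.key = 'client.key'
tls.cert = 'client.crt'
```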

Alternatively, projects such as mkcert can be used to simplify this process.

Further TLS examples are available under examples/config-files.

Config file location

The htsget-rs binaries (htsget-axum, htsget-actix and htsget-lambda) support some command line options. The config file location can be specified by setting the --config option:

cargo run -p htsget-axum -- --config "config.toml"

The config file location can also be set using an environment variable:

export HTSGET_CONFIG="config.toml"

If no config file is specified, the default configuration is used. Further, the default configuration file can be printed to stdout by passing the --print-default-config flag:

cargo run -p htsget-axum -- --print-default-config

Use the --help flag to see more details on command line options.

Log formatting

The Tracing crate is used extensively by htsget-rs for logging functionality. The RUST_LOG variable is read to configure the level at which trace logs are emitted.

For example, the following indicates trace level for all htsget crates, and info level for all other crates:

export RUST_LOG='info,htsget_lambda=trace,htsget_config=trace,htsget_http=trace,htsget_search=trace,htsget_test=trace'

See here for more information on setting this variable.

The style of formatting can be configured by setting the following option:

| Option | Description | Type | Default |
| --- | --- | --- | --- |
| formatting_style | The style of log formatting to use. | One of 'Full', 'Compact', 'Pretty', or 'Json' | 'Full' |

See here for more information on how these values look.
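As a small illustration, the option is set like any other top-level key in the config file. The choice of 'Json' here is only an example, for instance when logs are consumed by another tool:

```toml
# One of 'Full', 'Compact', 'Pretty', or 'Json'.
formatting_style = 'Json'
```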

Configuring htsget-rs with environment variables

All the htsget-rs config options can be set by environment variables, which is convenient for runtimes such as AWS Lambda. The ticket server, data server and service info options are flattened and can be set directly using environment variables. Setting the resolvers using environment variables is not recommended, but it can be done with a single environment variable containing a list of structures, where key-value pairs set the nested options.

Environment variables will override options set in the config file. Note, arrays are delimited with [ and ] in environment variables, and items are separated by commas.

The following environment variables - corresponding to the TOML config - are available:

| Variable | Description |
| --- | --- |
| HTSGET_TICKET_SERVER_ADDR | See ticket_server_addr |
| HTSGET_TICKET_SERVER_TLS_KEY | See TLS |
| HTSGET_TICKET_SERVER_TLS_CERT | See TLS |
| HTSGET_TICKET_SERVER_CORS_ALLOW_CREDENTIALS | See ticket_server_cors_allow_credentials |
| HTSGET_TICKET_SERVER_CORS_ALLOW_ORIGINS | See ticket_server_cors_allow_origins |
| HTSGET_TICKET_SERVER_CORS_ALLOW_HEADERS | See ticket_server_cors_allow_headers |
| HTSGET_TICKET_SERVER_CORS_ALLOW_METHODS | See ticket_server_cors_allow_methods |
| HTSGET_TICKET_SERVER_CORS_MAX_AGE | See ticket_server_cors_max_age |
| HTSGET_TICKET_SERVER_CORS_EXPOSE_HEADERS | See ticket_server_cors_expose_headers |
| HTSGET_DATA_SERVER_ADDR | See data_server_addr |
| HTSGET_DATA_SERVER_LOCAL_PATH | See data_server_local_path |
| HTSGET_DATA_SERVER_SERVE_AT | See data_server_serve_at |
| HTSGET_DATA_SERVER_TLS_KEY | See TLS |
| HTSGET_DATA_SERVER_TLS_CERT | See TLS |
| HTSGET_DATA_SERVER_CORS_ALLOW_CREDENTIALS | See data_server_cors_allow_credentials |
| HTSGET_DATA_SERVER_CORS_ALLOW_ORIGINS | See data_server_cors_allow_origins |
| HTSGET_DATA_SERVER_CORS_ALLOW_HEADERS | See data_server_cors_allow_headers |
| HTSGET_DATA_SERVER_CORS_ALLOW_METHODS | See data_server_cors_allow_methods |
| HTSGET_DATA_SERVER_CORS_MAX_AGE | See data_server_cors_max_age |
| HTSGET_DATA_SERVER_CORS_EXPOSE_HEADERS | See data_server_cors_expose_headers |
| HTSGET_ID | See id |
| HTSGET_NAME | See name |
| HTSGET_VERSION | See version |
| HTSGET_ORGANIZATION_NAME | See organization_name |
| HTSGET_ORGANIZATION_URL | See organization_url |
| HTSGET_CONTACT_URL | See contact_url |
| HTSGET_DOCUMENTATION_URL | See documentation_url |
| HTSGET_CREATED_AT | See created_at |
| HTSGET_UPDATED_AT | See updated_at |
| HTSGET_ENVIRONMENT | See environment |
| HTSGET_RESOLVERS | See resolvers |
| HTSGET_FORMATTING_STYLE | See formatting_style |

In order to use HTSGET_RESOLVERS, the entire resolver config array must be set. The nested array of resolver structures can be set using key-value pairs, for example:

export HTSGET_RESOLVERS="[{
    regex=regex,
    substitution_string=substitution_string,
    storage={
        backend=S3,
        bucket=bucket
    },
    allow_guard={
        allow_reference_names=[chr1],
        allow_fields=[QNAME],
        allow_tags=[RG],
        allow_formats=[BAM],
        allow_classes=[body],
        allow_interval_start=100,
        allow_interval_end=1000
    }  
}]"

Similar to the data_server_enabled option, the data server can be disabled by setting the equivalent environment variable:

export HTSGET_DATA_SERVER_ENABLED=false

MinIO

A local object store such as MinIO can be used by setting the endpoint option, as shown below:

[[resolvers]]
regex = '.*'
substitution_string = '$0'

[resolvers.storage]
backend = 'S3'
bucket = 'bucket'
endpoint = 'http://127.0.0.1:9000'
path_style = true

Care must be taken to ensure that the correct AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set to allow the AWS SDK to reach the endpoint. Additional configuration of the MinIO server is required to use virtual-hosted style addressing by setting the MINIO_DOMAIN environment variable. Path style addressing can be forced using path_style = true.

See the MinIO deployment example for more information on how to configure htsget-rs and MinIO.

Crypt4GH

There is experimental support for serving Crypt4GH encrypted files. This can be enabled by compiling with the experimental feature flag.

This allows htsget-rs to read Crypt4GH files and serve them encrypted, directly to the client. In the process of serving the data, htsget-rs will decrypt the headers of the Crypt4GH files and reencrypt them so that the client can read them. When the client receives byte ranges from htsget-rs and concatenates them, the output bytes will be Crypt4GH encrypted, and will need to be decrypted before they can be read. All file formats (BAM, CRAM, VCF, and BCF) are supported using Crypt4GH.

To use this feature, set location = 'Local' under resolvers.storage.keys to specify the private and public keys:

| Option | Description | Type | Default |
| --- | --- | --- | --- |
| private_key | The path to the PEM formatted private key which htsget-rs uses to decrypt Crypt4GH data. | Filesystem path | Not Set |
| recipient_public_key | The path to the PEM formatted public key which the recipient of the data will use. This is what the client will use to decrypt the returned data, using the corresponding private key. | Filesystem path | Not Set |

For example:

[[resolvers]]
regex = '.*'
substitution_string = '$0'

[resolvers.storage.keys]
location = 'Local'
private_key = 'data/c4gh/keys/bob.sec' # pragma: allowlist secret
recipient_public_key = 'data/c4gh/keys/alice.pub'

Keys can also be retrieved from AWS Secrets Manager. Compile with the s3-storage feature flag and specify location = 'SecretsManager' under resolvers.storage.keys to fetch keys from Secrets Manager. When using Secrets Manager, the private_key and recipient_public_key correspond to ARNs or secret names in Secrets Manager storing PEM formatted keys.

For example:

[[resolvers]]
regex = '.*'
substitution_string = '$0'

[resolvers.storage.keys]
location = 'SecretsManager'
private_key = 'private_key_secret_name' # pragma: allowlist secret
recipient_public_key = 'public_key_secret_name'

The htsget-rs server expects the Crypt4GH file name to end with .c4gh, and the index file to be unencrypted. See data/c4gh for examples of the file structure. Any of the storage types are supported, i.e. Local, S3, or Url.
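Since the examples above omit the storage backend, the following sketch shows how a backend might be combined with the keys table. The layout is an assumption based on the option names in this document (backend under resolvers.storage, keys under resolvers.storage.keys) and should be checked against the example config files; the bucket name is a placeholder.

```toml
[[resolvers]]
regex = '.*'
substitution_string = '$0'

# Assumed layout: the backend is selected as usual, with keys in a sub-table.
[resolvers.storage]
backend = 'S3'
bucket = 'bucket'

[resolvers.storage.keys]
location = 'Local'
private_key = 'data/c4gh/keys/bob.sec' # pragma: allowlist secret
recipient_public_key = 'data/c4gh/keys/alice.pub'
```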

As a library

This crate reads config files and environment variables using figment, and accepts command-line arguments using clap. The main function for this is from_config, which is used to obtain the Config struct. The crate also contains the regex_resolver abstraction, which is used for matching a query ID with regex, and changing it by using a substitution string.

Feature flags

This crate has the following features:

  • s3-storage: used to enable S3Storage functionality.
  • url-storage: used to enable UrlStorage functionality.
  • experimental: used to enable experimental features that aren't necessarily part of the htsget spec, such as Crypt4GH support through C4GHStorage.

License

This project is licensed under the MIT license.