An extension package to netCDF4-python that enables reading and writing netCDF files and CFA-netCDF files from and to object stores and public cloud storage with an S3 HTTP interface, to disk or to OPeNDAP.
- Requirements
- Installation
- Configuration
- Aliases
- Caching
- Backends
- Resource Usage
- Writing files
- Reading files
- List of examples
S3-netCDF4 requires Python 3.7 or later.
It also requires the following packages:
- numpy==1.19.4
- Cython==0.29.21
- netCDF4==1.5.5.1
- botocore==1.19.20
- aiobotocore==1.1.2
- psutil==5.7.3
(These are fulfilled by a pip installation, so it is not necessary to install them if you are installing the package via pip, as below.)
S3netCDF4 is designed to be installed in user space, without the user having
root or sudo privileges. System-wide installation is also supported. It
is recommended to install S3netCDF4 into a virtual environment, rather than
using the system Python. S3netCDF4 does not rely on any external servers
besides the storage systems; it runs entirely on the host machine.
S3netCDF4 can be installed either from PyPI or directly from the GitHub repository.
- Create a Python 3 virtual environment:
  python3 -m venv /path/to/venv
- Activate the virtual environment:
  source /path/to/venv/bin/activate
- Installing S3netCDF4 requires a version of pip > 10.0. To install the latest version of pip into the virtual environment use the command:
  pip install --upgrade pip
- Install from PyPI:
  pip install S3netCDF4
- Copy the configuration template file from config/.s3nc.json.template to ~/.s3nc.json and fill in the values for the variables. See the section Configuration.
- Run a test to ensure the package has installed correctly:
  python test/test_s3Dataset.py
- Users on the STFC/NERC JASMIN system will have to activate Python 3.7 by using the command:
  module load jaspy
- Create a Python 3 virtual environment:
  python3 -m venv /path/to/venv
- Activate the virtual environment:
  source /path/to/venv/bin/activate
- Installing S3netCDF4 requires a version of pip > 10.0. To install the latest version of pip into the virtual environment use the command:
  pip install --upgrade pip
- Install the S3netCDF4 library, directly from the GitHub repository:
  pip install -e git+https://github.com/cedadev/S3-netcdf-python.git#egg=S3netCDF4
- Copy the configuration template file from config/.s3nc.json.template to ~/.s3nc.json and fill in the values for the variables. See the section Configuration.
- Run a test to ensure the package has installed correctly:
  python test/test_s3Dataset.py
- Users on the STFC/NERC JASMIN system will have to repeat the module load jaspy step every time they wish to use S3netCDF4 via the virtual environment.
S3netCDF4 relies on a configuration file to resolve endpoints for the S3 services, and to control various aspects of the way the package operates. This config file is a JSON file and is located in the user's home directory:
~/.s3nc.json
In the git repository a templatised example of this configuration file is provided:
config/.s3nc.json.template
This can be copied to the user's home directory, and the template renamed to ~/.s3nc.json.
Alternatively, an environment variable S3_NC_CONFIG can be set to define the
location and name of the configuration file. This can also be set in code,
before the import of the S3netCDF4 module:
import os
os.environ["S3_NC_CONFIG"] = "/Users/neil/.s3nc_different_config.json"
from S3netCDF4._s3netCDF4 import s3Dataset
Once the config file has been copied, the variables in the template should then
be filled in. This file is a Jinja2 template of a JSON file, and so can be used
within an Ansible deployment.
Each entry in the file is a key:value pair. An example of the file is given
below:
{
    "version": "9",
    "hosts": {
        "s3://tenancy-0": {
            "alias": "tenancy-0",
            "url": "http://tenancy-0.jc.rl.ac.uk",
            "credentials": {
                "accessKey": "blank",
                "secretKey": "blank"
            },
            "backend": "s3aioFileObject",
            "api": "S3v4"
        }
    },
    "backends": {
        "s3aioFileObject": {
            "maximum_part_size": "50MB",
            "maximum_parts": 8,
            "enable_multipart_download": true,
            "enable_multipart_upload": true,
            "connect_timeout": 30.0,
            "read_timeout": 30.0
        },
        "s3FileObject": {
            "maximum_part_size": "50MB",
            "maximum_parts": 4,
            "enable_multipart_download": false,
            "enable_multipart_upload": false,
            "connect_timeout": 30.0,
            "read_timeout": 30.0
        }
    },
    "cache_location": "/cache_location/.cache",
    "resource_allocation": {
        "memory": "1GB",
        "filehandles": 20
    }
}
- version: indicates which version of the configuration file this is.
- hosts: contains a list of named hosts and their respective configuration details.
  - s3://tenancy-0: contains the definition of a single host called tenancy-0. For each host a number of configuration details need to be supplied:
    - alias: the alias for the S3 server. See the Aliases section.
    - url: the DNS resolvable URL for the S3 server, with optional port number.
    - credentials: contains two keys:
      - accessKey: the user's access key for the S3 endpoint.
      - secretKey: the user's secret key / password for the S3 endpoint.
    - backend: which backend to use to write the files to the S3 server. See the Backends section.
    - api: the API version used to access the S3 endpoint.
- backends: contains localised configuration information for each of the backends which may be used (if included in a host definition) to write the files to the S3 server. See the Backends section for more details on backends.
  - enable_multipart_download: allow the backend to split files fetched from S3 into multiple parts when downloading.
  - enable_multipart_upload: allow the backend to split files when uploading. The advantage of splitting the files into parts is that they can be uploaded or downloaded asynchronously, when the backend supports asynchronous transfers.
  - maximum_part_size: the maximum size each part of the file can reach before it is uploaded, or the size of each part when downloading a file.
  - maximum_parts: the maximum number of file parts that are held in memory before they are uploaded, or the number of file parts that are downloaded at once, for asynchronous backends.
  - connect_timeout: the number of seconds for which a connection attempt will be made before timing out.
  - read_timeout: the number of seconds for which a read attempt will be made before timing out.
- cache_location: S3netCDF4 can read and write very large arrays that are split into sub-arrays. To enable very large arrays to be read, S3netCDF4 uses Numpy memory mapped arrays. cache_location contains the location of these memory mapped array files. See the Caching section below.
- resource_allocation: contains localised information about how many resources each instance of S3netCDF4 should use on the host machine. See the Resource Usage section below. It contains two keys:
  - memory: the amount of RAM to dedicate to this instance of S3netCDF4.
  - filehandles: the number of file handles to dedicate to this instance of S3netCDF4.
Note that sizes can be expressed in units other than bytes by suffixing the
size with a magnitude identifier: kilobytes (kB), megabytes (MB),
gigabytes (GB), terabytes (TB), exabytes (EB), zettabytes (ZB) or
yottabytes (YB).
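As a rough illustration of the suffixes listed above, the sketch below converts a size string such as "50MB" into a number of bytes. The function name parse_size is hypothetical, and the use of decimal multipliers (1 kB = 1000 bytes) is an assumption made for this example; the parsing inside S3netCDF4 may differ.

```python
import re

# Suffixes listed in the text above.
# ASSUMPTION: decimal multipliers (1 kB = 1000 bytes); the package may differ.
_MULTIPLIERS = {
    "kB": 1000 ** 1,
    "MB": 1000 ** 2,
    "GB": 1000 ** 3,
    "TB": 1000 ** 4,
    "EB": 1000 ** 6,
    "ZB": 1000 ** 7,
    "YB": 1000 ** 8,
}


def parse_size(text):
    """Convert a size such as '50MB' or '1024' into an integer number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*", str(text))
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    number, suffix = match.groups()
    if suffix == "":
        # No suffix: the value is already in bytes.
        return int(float(number))
    if suffix not in _MULTIPLIERS:
        raise ValueError(f"unknown size suffix: {suffix!r}")
    return int(float(number) * _MULTIPLIERS[suffix])


part_size = parse_size("50MB")
memory_limit = parse_size("1GB")
```

Under the decimal assumption, part_size is 50000000 and memory_limit is 1000000000.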
To enable S3netCDF4 to write to disk, OPeNDAP and S3 object stores, aliases are
used to identify S3 servers. They provide an easy to remember (and type)
shorthand for the user so that they don't have to use the DNS resolved URL and
port number for each S3 object access. When creating a netCDF4 s3Dataset
object, either to read or write, the user supplies a filename. To indicate
that the file should be written to or read from an S3 server, the string
must start with s3://. This must be followed by the aliased server name, as
defined in the config file above, and then by a bucket name. For example, to
read a netCDF file called test2.nc from the test bucket on the s3://tenancy-0
server, the user would use this code:
Example 1: open a netCDF file from S3 storage using the alias "tenancy-0"
from S3netCDF4._s3netCDF4 import s3Dataset as Dataset
test_dataset = Dataset("s3://tenancy-0/test/test2.nc", "r")
On creation of the s3Dataset object, the S3netCDF4 package reads the
filename, determines that the filename starts with s3://, reads the next
part of the string up to the next / (which equates to tenancy-0 in this
case) and searches through the aliases defined in the ~/.s3nc.json file to
find a matching alias. If one is not found it will return an error message.
If it is found then it will establish a connection to that S3 server, using the
url, accessKey and secretKey defined for that server. It is over this
connection that all the data transfers for this s3Dataset take place.
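The lookup described above can be sketched as follows. This is a minimal illustration, not the package's own code: the helper names split_s3_uri and find_host are hypothetical, and the configuration is a cut-down dictionary in the same shape as the example ~/.s3nc.json file.

```python
def split_s3_uri(uri):
    """Split 's3://alias/bucket/path/to/object' into (alias, bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Everything up to the next "/" is the alias, then the bucket, then the key.
    alias, _, remainder = uri[len(prefix):].partition("/")
    bucket, _, key = remainder.partition("/")
    return alias, bucket, key


def find_host(config, alias):
    """Return the host entry of a loaded configuration whose alias matches."""
    for host in config["hosts"].values():
        if host["alias"] == alias:
            return host
    raise KeyError(f"alias {alias!r} not found in configuration")


# Cut-down configuration with the same layout as the example file.
example_config = {
    "hosts": {
        "s3://tenancy-0": {
            "alias": "tenancy-0",
            "url": "http://tenancy-0.jc.rl.ac.uk",
        }
    }
}

alias, bucket, key = split_s3_uri("s3://tenancy-0/test/test2.nc")
host = find_host(example_config, alias)
print(alias, bucket, key, host["url"])
```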
If the user requests to read a variable, or a slice of a variable, that is
larger than either the host machine's physical memory or the
resource_allocation: memory setting in ~/.s3nc.json, then S3netCDF4 will
use two strategies to enable reading very large arrays:
- A Numpy memory mapped array is used as the "target array", which will contain the data requested by the user. This is stored in a locally cached file, in the cache_location root directory. These files are deleted in the destructor of S3netCDF4, i.e. when the program exits or the S3netCDF4 object goes out of scope. However, during processing, this directory has the potential to grow quite large, so adequate provision should be made on disk for it.
- If the file being read is a CFA-netCDF file, referencing sub-array files, then the sub-array files are streamed into memory (for files on S3 storage) or read from disk. If the amount of memory used exceeds the resource_allocation: memory config setting, or the number of open files exceeds the resource_allocation: filehandles config setting, then the last accessed sub-array file is closed. This means it will be removed from memory, or the file handle will be freed, allowing another sub-array file to be read.
See the Resource Usage section below for more information on this "memory and file shuffling" behaviour.
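To give a feel for the first strategy, the sketch below creates a disk-backed "target array" with numpy.memmap. It is only an illustration of the technique: the temporary directory stands in for the configured cache_location, and the file name and array shape are made up for the example.

```python
import os
import tempfile

import numpy as np

# ASSUMPTION: a temporary directory stands in for the configured cache_location.
cache_dir = tempfile.mkdtemp()
cache_file = os.path.join(cache_dir, "target_array.dat")

# Create a disk-backed array; the data lives in the cache file rather than
# having to fit in RAM all at once.
target = np.memmap(cache_file, dtype="f4", mode="w+", shape=(100, 200))

# Writing a slice behaves like writing to an ordinary NumPy array.
target[10:20, :] = 1.5
target.flush()  # push pending changes to the backing file

total = float(target.sum())

# Release the mapping and delete the cache file when it is no longer needed.
del target
os.remove(cache_file)
os.rmdir(cache_dir)
```

The explicit clean-up at the end mirrors the text's point that the cache files are removed once the owning object is destroyed.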
In S3-netCDF4, a backend refers to a set of routines that handles the
interface to a storage system. The interface includes read and write, but
also gathering file information and file listings. S3-netCDF4 has a pluggable
backend architecture, and so can interact with new storage systems by writing
a new backend plugin. The backend plugins are extensions of the
io.BufferedIOBase Python class and implement Python file object methods, such
as tell, seek, read and write. This enables interaction with the
backends as though they were POSIX disks.
These backends have to be configured on a host by host basis by setting the
host: backend value in the ~/.s3nc.json config file. Currently there are
two backends:
- _s3aioFileObject: This backend enables asynchronous transfers to an S3 compatible storage system. It is the fastest backend for S3 and should be used in preference to _s3FileObject.
- _s3FileObject: This is a simpler, synchronous interface to S3 storage systems. It can be used if there is a problem using _s3aioFileObject.
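The idea of a backend as a file-like object can be illustrated with a toy class. The sketch below is not one of the S3netCDF4 backends: MemoryFileObject is a hypothetical name, and it simply keeps the bytes in memory while exposing tell, seek, read and write through io.BufferedIOBase.

```python
import io


class MemoryFileObject(io.BufferedIOBase):
    """Hypothetical backend that stores an object in a bytearray."""

    def __init__(self):
        super().__init__()
        self._data = bytearray()
        self._pos = 0

    def readable(self):
        return True

    def writable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = len(self._data) + offset
        else:
            raise ValueError(f"invalid whence: {whence!r}")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def read(self, size=-1):
        if size is None or size < 0:
            size = max(0, len(self._data) - self._pos)
        chunk = bytes(self._data[self._pos:self._pos + size])
        self._pos += len(chunk)
        return chunk

    def write(self, data):
        data = bytes(data)
        if self._pos > len(self._data):
            # Pad with zeros when writing beyond the current end.
            self._data.extend(b"\x00" * (self._pos - len(self._data)))
        end = self._pos + len(data)
        self._data[self._pos:end] = data
        self._pos = end
        return len(data)


backend = MemoryFileObject()
backend.write(b"netCDF bytes")
backend.seek(0)
header = backend.read(6)
```

A real backend would replace the bytearray with calls to the storage system, but code using the object would still only see the standard file methods.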
S3netCDF4 has the ability to read and write very large files, much larger than
the available, or allocated, memory on a machine. It also has the ability to
read and write many files to and from disk, which means the number of open
files may exceed the limit set by the file system, or the settings in ulimit.
Files are accessed when a Dataset is opened, and when a slice operator
([x,y,z]) is used on a CFA-netCDF file.
To enable very large and very many files to be read and written to, S3netCDF4
employs a strategy where files are "shuffled" out of memory (to free up memory)
or closed (to free up disk handles). The triggers for this shuffling are
configured in the "resource_allocation" section of the .s3nc.json config
file:
- resource_allocation: memory: the amount of memory that S3netCDF4 is allowed to use before a shuffle is triggered. This applies when reading or writing files from / to remote storage, such as an S3 object store. S3netCDF4 will stream the entire netCDF file, or an entire sub-array file, into memory when reading. When writing, it will create an entire netCDF file or sub-array file in memory, writing the file to the remote storage upon closing the file.
- resource_allocation: filehandles: the number of files on disk that S3netCDF4 is allowed to have open at any one time. This applies when reading or writing files to disk. S3netCDF4 uses the underlying netCDF4 library to read and write files to disk, but it keeps track of the number of open files.
Note that S3netCDF4 allows full flexibility over the location of the
master-array and sub-array files of CFA-netCDF files. It allows both to be
stored on disk or S3 storage. For example, the master-array file could be
stored on disk for performance reasons, and the sub-array files stored on S3.
Or the first timestep of the sub-array files could also be stored on disk to
enable users to quickly perform test analyses.
The file shuffling procedure is carried out by an internal FileManager, which keeps notes about the files that are open at any time, or have been opened in the past, and the last time they were accessed. The user does not see any of this interaction; they merely interact with the S3Dataset, S3Group, S3Variable and S3Dimension objects.
1. When a file is initially opened, a note is made of the mode and whether the file is on disk or remote storage. Files are marked as "OPEN_NEW" and then as "OPEN_EXISTS" when they have been opened successfully.
   - For reading from remote storage, the file is streamed into memory and then a netCDF Dataset is created from the read in data.
   - For writing to remote storage, the netCDF Dataset is created in memory.
   - For reading from disk, the file is opened using the underlying netCDF4 library, and the netCDF Dataset is returned.
   - For writing to disk, the file is created using the netCDF4 library and the Dataset is returned.
2. If the file is accessed again (e.g. via the slicing operator), then the netCDF Dataset is returned. The FileManager knows these files are already open or present in memory as they are marked as "OPEN_EXISTS".
3. Steps 1 and 2 continue until either the amount of memory used exceeds resource_allocation: memory or the number of open files exceeds resource_allocation: filehandles.
4. If the amount of memory used exceeds resource_allocation: memory:
   - The size of the next file is determined (read) or calculated (write). Files are closed, and the memory they occupy is freed using the Python garbage collector, until there is enough memory free to read in or create the next file.
   - Files that were opened in "write" mode are closed, marked as "KNOWN_EXISTS" and written to either the remote storage (S3) or disk.
   - Files that were open in "read" mode are simply closed and their entry is removed from the FileManager.
   - The priority for closing files is that the last accessed file is closed first. The FileManager keeps a note of when each file was accessed last.
   - If a file is accessed again in "write" mode, and it is marked as "KNOWN_EXISTS" in the FileManager, then it is opened in "append" mode. In this way, a file can be created, be shuffled in and out of memory, and still be written to so that the end result is the same as if it had been in memory throughout the operation.
5. If the number of open files exceeds resource_allocation: filehandles:
   - The procedure for point 4 is followed, except rather than closing files until there is enough memory available, files are closed until there are free file handles.
   - Files are marked as "KNOWN_EXISTS" as in point 4.
This file shuffling procedure is fundamental to the performance of S3netCDF4, as it minimises the number of times a file has to be streamed from remote storage, or opened from disk. There are also optimisations in the FileManager. For example, if a file has been written to and then read, it will use the copy in memory for all operations, rather than holding two copies, or streaming to and from remote storage repeatedly.
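The memory-triggered part of the shuffling can be modelled with a toy class. The sketch below is not the real FileManager: the class and method names are hypothetical, sizes are plain integers, and it assumes that "last accessed file is closed first" means the file whose most recent access is oldest. No data is actually read, written or uploaded.

```python
from collections import OrderedDict


class SketchFileManager:
    """Toy model of memory-triggered file shuffling (illustration only)."""

    def __init__(self, memory_limit):
        self.memory_limit = memory_limit
        # name -> (size, mode), ordered from oldest access to newest access
        self.open_files = OrderedDict()
        # names of written files that have been shuffled out ("KNOWN_EXISTS")
        self.known = set()

    def used(self):
        return sum(size for size, _ in self.open_files.values())

    def request(self, name, size, mode):
        """Open or re-use a file, shuffling other files out if memory is short."""
        if name in self.open_files:
            self.open_files.move_to_end(name)  # record the access
            return "OPEN_EXISTS"
        if size > self.memory_limit:
            raise MemoryError(f"{name} cannot fit in the memory allocation")
        # ASSUMPTION: evict the file with the oldest access first.
        while self.open_files and self.used() + size > self.memory_limit:
            old_name, (_, old_mode) = self.open_files.popitem(last=False)
            if old_mode in ("w", "a"):
                # A real manager would write or upload the file here.
                self.known.add(old_name)
        # A previously written file is reopened in append mode.
        if mode == "w" and name in self.known:
            mode = "a"
        self.open_files[name] = (size, mode)
        return "OPEN_NEW"


fm = SketchFileManager(memory_limit=100)
fm.request("a.nc", 60, "w")
fm.request("b.nc", 30, "r")
fm.request("a.nc", 60, "w")   # already open, just refreshes the access order
fm.request("c.nc", 30, "r")   # b.nc (read mode) is dropped to make room
fm.request("d.nc", 50, "w")   # a.nc (write mode) is shuffled out
state = fm.request("a.nc", 60, "w")  # a.nc comes back in append mode
```

In this run the final request evicts both c.nc and d.nc, and a.nc is reopened with mode "a" because it was written before being shuffled out.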
S3netCDF4 has the ability to write netCDF3, netCDF4, CFA-netCDF3
and CFA-netCDF4 files to a POSIX filesystem, Amazon S3 object storage (or
public cloud) or OPeNDAP. Files are created in the same way as the standard
netCDF4-python package, by creating an s3Dataset object. However, the
parameters to the s3Dataset constructor can vary in three ways:
- The filename can be an S3 endpoint, i.e. it starts with s3://
- The format keyword can also, in addition to the formats permitted by netCDF4-python, be CFA3, to create a CFA-netCDF3 dataset, or CFA4, to create a CFA-netCDF4 dataset.
- If creating a CFA3 or CFA4 dataset, then an optional keyword parameter can be set: cfa_version. This can be either "0.4" or "0.5". See the CFA-netCDF files section below.
Example 2: Create a netCDF4 file in the filesystem
from S3netCDF4._s3netCDF4 import s3Dataset as Dataset
test_dataset = Dataset("/Users/neil/test_dataset_nc4.nc", 'w',
format='NETCDF4')
Example 3: Create a CFA-netCDF4 file in the filesystem with CFA version 0.5 (the default)
from S3netCDF4._s3netCDF4 import s3Dataset as Dataset
cfa_dataset = Dataset("/Users/neil/test_dataset_cfa4.nc", 'w',format='CFA4')
Example 4: Create a CFA-netCDF3 file on S3 storage with CFA version 0.4
from S3netCDF4._s3netCDF4 import s3Dataset as Dataset
cfa_dataset = Dataset("s3://tenancy-0/test_bucket/test_dataset_s3_cfa3.nc",
'w', format='CFA3', cfa_version="0.4")
Choosing format="CFA3" or format="CFA4" when creating a file creates a
CFA-compliant netCDF file. This consists of a master-array file and a
number of sub-array files.
The version of CFA to use can also be specified, either cfa_version="0.4" or
cfa_version="0.5". "0.4" follows the CFA conventions, where the
sub-array metadata is written into the attributes of the netCDF variables.
"0.5" refactors the sub-array metadata into extra groups and variables
in the master-array file. "0.5" is the preferred format as it is more
memory efficient, relying on netCDF slicing and partial reading of files, and
is faster as it does not require parsing when the master-array file is
first read. As it uses features of netCDF4, cfa_version="0.5" is only
compatible with format="CFA4".
Note that cfa_version="0.5" and format="CFA3" are incompatible, as NETCDF3
does not enable groups to be used.
The master-array file contains:
- the dimension definitions
- dimension variables
- scalar variable definitions: variable definitions without reference to the domain it spans
- variable metadata
- global metadata
- It does not contain any field data, but it does contain data for the dimension variables, and therefore the domain of each variable.
- The master-array file may contain a single field variable or multiple field variables.
The sub-array files contain a subdomain of a single variable in the master-array. They contain:
- the dimension definitions for the subdomain
- the dimension variables for the subdomain
- a single variable definition, complete with reference to the dimensions
- metadata for the variable
Therefore, each sub-array file is a self-describing netCDF file. If the master-array file is lost, it can be reconstructed from the sub-array files.
In CFA v0.4, the variable metadata (netCDF attributes) in each variable in the master-array file contains a partition matrix. The partition matrix contains information on how to reconstruct the master-array variables from the associated sub-arrays and, therefore, also contains the necessary information to read or write slices of the master-array variables.
In CFA v0.5, the partition matrix is stored in a group. This group has the
same name as the variable, but prefixed with cfa_. The group contains
dimensions and variables to store the information for the partition matrix
and the partitions.
Full documentation for CFA v0.5 will be forthcoming.
The partition matrix contains:
- The dimensions in the netCDF file that the partition matrix acts over (e.g. ["time", "latitude", "longitude"])
- The shape of the partition matrix (e.g. [4,2,2])
- A list of partitions
Each partition in the partition matrix contains:
- An index for the partition into the partition matrix: a list the length of the number of dimensions for the variable (e.g. [3, 1, 0])
- The location of the partition in the master-array: a list (the length of the number of dimensions) of pairs, each pair giving the range of indices in the master-array for that dimension (e.g. [[0, 10], [20, 40], [0, 45]])
- A definition of the sub-array which contains:
  - The path or URI of the file containing the sub-array. This may be on the filesystem, an OPeNDAP file or an S3 URI.
  - The name of the netCDF variable in the sub-array file
  - The format of the file (always netCDF for S3netCDF4)
  - The shape of the variable, i.e. the length of the subdomain in each dimension
For more information see the [CFA conventions 0.4 website](http://www.met.reading.ac.uk/~david/cfa/0.4/).
There is also a useful synopsis in the header of the _CFAClasses.pyx file in
the S3netCDF4 source code. Documentation for the "0.5" version of CFA will
follow.
Note that indices in the partition matrix are indexed from zero, but the
indices are inclusive for the location of the partition in the master-array.
This is different from Python, where the end index is non-inclusive. The
conversion between the two indexing methods is handled in the implementation
of _CFAnetCDFParser, so the user does not have to worry about converting
indices.
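To make the difference between the two indexing conventions concrete, the sketch below converts an inclusive partition location into Python slice objects. The function name location_to_slices is hypothetical and this is not the code used inside _CFAnetCDFParser; it only shows the off-by-one adjustment on the end index.

```python
def location_to_slices(location):
    """Convert inclusive [start, end] index pairs into Python slice objects."""
    # Python slices exclude the stop index, so one is added to each end index.
    return tuple(slice(start, end + 1) for start, end in location)


location = [[0, 10], [20, 40], [0, 45]]
slices = location_to_slices(location)

# Length of the sub-domain covered in each dimension.
shape = tuple(s.stop - s.start for s in slices)
```

For the example location used in the text, the inclusive range [0, 10] covers eleven indices, so the resulting shape is (11, 21, 46).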
Creating dimensions and variables in the netCDF or CFA-netCDF4 dataset follows the same method as creating variables in the standard netCDF4-python library:
Example 5: creating dimensions and variables
from S3netCDF4._s3netCDF4 import s3Dataset as Dataset
cfa_dataset = Dataset("s3://minio/test_bucket/test_dataset_s3_cfa3.nc", 'w',
format='CFA3')
timed = cfa_dataset.createDimension("time", None)
times = cfa_dataset.createVariable("time", "f4", ("time",))
When creating variables, a number of different workflows for writing the files
occur. Which workflow is taken depends on the combination of the filename
path (S3, filesystem or OPeNDAP) and format (CFA3 and CFA4 or NETCDF4
and NETCDF3_CLASSIC). These workflows can be summarised by:
- format=NETCDF4 or format=NETCDF3_CLASSIC. These two options will create a standard netCDF file.
  - If the filename is on a remote system (i.e. it contains s3://) then the netCDF file will be created in memory and uploaded (PUT) to the S3 filesystem when s3Dataset.close() is called or the file is "shuffled" out of memory (see Resource Usage for more details).
  - If the filename does not contain s3:// then the netCDF file will be written out to the filesystem or OPeNDAP, with the behaviour following the standard netCDF4-python library.
- format=CFA3 or format=CFA4. These two options will create a CFA-netCDF file.
  - At first only the master-array file is created and written to. The sub-array files are created and written to when data is written to the master-array variable.
  - When the variable is created, the dimensions are supplied and this enables the partition matrix metadata to be generated:
    - The file splitting algorithm determines how to split the variable into the sub-arrays, or the user can supply the shape of the sub-arrays.
    - From this information the partition matrix shape and partition matrix list of dimensions are created. The partition matrix is represented internally by a netCDF dataset, and this is also created.
  - Only when a variable is written to, via a slice operation on a variable, is each individual partition written into the partition matrix.
    - The sub-array file is created, either in memory for remote filesystems (S3), or to disk for local filesystems (POSIX).
    - The filename for the sub-array is determined programmatically.
    - The location in the master-array for each sub-array (and its shape) is determined by the slice and the sub-array shape determined by either the file splitting algorithm, or supplied by the user.
    - This single partition information is written into the partition matrix.
    - The field data is written into the sub-array file.
    - On subsequent slices into the same sub-array, the partition information is used, rather than rewritten.
  - When the master-array file is closed (by the user calling s3Dataset.close()):
    - The partition matrix metadata is written to the master-array.
    - If the files are located on a remote filesystem (S3), then they only currently exist in memory (unless they have been "shuffled" to storage). They are now closed (in memory) and then uploaded to the remote storage. Any appended files are also uploaded to remote storage.
    - If the files are not on a remote filesystem, then they are closed, the sub-array files in turn, and then the master-array file last.
As noted above, CFA files actually consist of a single master-array file and many sub-array files. These sub-array files are referred to by their filepath or URI in the partition matrix. To easily associate the sub-array files with the master-array file, a naming convention and file structure is used:
- The CFA conventions dictate that the file extension for a CFA-netCDF file should be .nca
- A directory is created in the same directory / same root URI as the master-array file. This directory has the same name as the master-array file, without the .nca extension.
- In this directory all of the sub-array files are contained. These sub-array files follow the naming convention:
  <master-array-file-name>.<variable-name>.[<location in the partition matrix>].nc
Example for the master-array file a7tzga.pdl4feb.nca:
├── a7tzga.pdl4feb.nca
├── a7tzga.pdl4feb
│ ├── a7tzga.pdl4feb.field16.0.nc
│ ├── a7tzga.pdl4feb.field16.1.nc
│ ├── a7tzga.pdl4feb.field186.0.nc
│ ├── a7tzga.pdl4feb.field186.1.nc
│ ├── a7tzga.pdl4feb.field1.0.0.nc
│ ├── a7tzga.pdl4feb.field1.0.1.nc
│ ├── a7tzga.pdl4feb.field1.1.0.nc
│ ├── a7tzga.pdl4feb.field1.1.1.nc
On an S3 storage system, the master-array directory will form part of the prefix for the sub-array objects, as directories do not exist, in a literal sense, on S3 storage systems, only prefixes.
Note that the metadata in the master-array file informs S3netCDF4 where the sub-array files are located. The above file structure defines the default behaviour, but the specification of S3netCDF4 allows sub-array files to be located anywhere, be that on S3, POSIX disk or OPeNDAP.
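The default naming convention can be expressed as a small helper. The sketch below is an illustration only: subarray_path is a hypothetical function, not part of the S3netCDF4 API, and it uses posixpath so that the same logic applies to filesystem paths and to S3 URIs.

```python
import posixpath


def subarray_path(master_path, variable_name, partition_index):
    """Build the default sub-array path for one partition of one variable."""
    directory, filename = posixpath.split(master_path)
    # Strip the .nca extension to get the directory and file name stem.
    base, _ = posixpath.splitext(filename)
    # The partition index is joined with dots, e.g. (0, 1) -> "0.1".
    index = ".".join(str(i) for i in partition_index)
    subarray_name = f"{base}.{variable_name}.{index}.nc"
    return posixpath.join(directory, base, subarray_name)


local_path = subarray_path("/data/a7tzga.pdl4feb.nca", "field1", (0, 1))
s3_path = subarray_path("s3://tenancy-0/test/a7tzga.pdl4feb.nca", "field16", (1,))
```

With these inputs, local_path is /data/a7tzga.pdl4feb/a7tzga.pdl4feb.field1.0.1.nc, which matches the layout in the directory listing, and on S3 the directory part simply becomes a prefix of the object key.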
Metadata can be written to the variables and the Dataset (global metadata) in the same way as the standard netCDF4 library, by creating a member variable on the Variable or Dataset object:
Example 6: creating variables with metadata
from S3netCDF4._s3netCDF4 import s3Dataset as Dataset
with Dataset("/Users/neil/test_dataset_cfa3.nca", mode='w', diskless=True,
             format="CFA3") as s3_data:
    # create the dimensions
    latd = s3_data.createDimension("lat", 196)
    lond = s3_data.createDimension("lon", 256)
    # create the dimension variables
    latitudes = s3_data.createVariable("lat", "f4", ("lat",))
    longitudes = s3_data.createVariable("lon", "f4", ("lon",))
    # create the field variable
    temp = s3_data.createVariable("tmp", "f4", ("lat", "lon"))
    # add some attributes - variable metadata
    temp.units = "degrees C"
    latitudes.units = "degrees north"
    longitudes.units = "degrees east"
    # add some global metadata
    s3_data.source = "s3netCDF4 python module tutorial"
    s3_data.author = "Neil Massey"
For netCDF files with format=NETCDF3_CLASSIC or format=NETCDF4, the
variable is created and field data is written to the file (as missing values)
when createVariable is called on the s3Dataset object. Calls to the []
operator (i.e. slicing the array) will write data to the variable and to the
file when the operator is called. This is the same behaviour as
netCDF4-python. If an S3 URI is specified (filepath starts with s3://) then
the file is first created in memory and then streamed to S3 on closing the file.
For netCDF files with format=CFA3 or format=CFA4 specified in the
s3Dataset constructor, only the master-array file is written to when
createDimension, createVariable etc. are called on the s3Dataset
object. When createVariable is called, a scalar field variable (i.e. with
no dimensions) is created, and the partition matrix is calculated (see File
splitting algorithm) and written to the scalar field variable. The sub-array
files are only created when the [] operator is called on the Variable
object returned from the s3Dataset.createVariable method. This operator is
implemented in S3netCDF4 as the __setitem__ member function of the
s3Variable class, and corresponds to slicing the array.
Writing a slice of field data to the master-array file, via __setitem__,
consists of six operations:
- Determining which of the sub-arrays overlap with the slice. This is currently done via a hypercube overlapping method, i.e. the location of the sub-array can be determined by dividing the dimension index by the length of the dimension in the partition matrix. This assumes that the sub-arrays are uniform (per dimension) in size.
- If the size of the sub-array file will cause the currently used amount of memory to exceed the resource_allocation: memory setting in ~/.s3nc.json then some files may be shuffled out of memory. See the Resource Usage section above. This may result in some files being written to the remote storage, meaning they will be opened in append mode the next time they are written to. If, even after the file shuffling has occurred, the size of the sub-array cannot be contained in memory then a memory error will occur.
- Open or create the file for the sub-array according to the filepath or URI in the partition information. If an S3 URI is specified (filepath starts with s3://) then the file is opened or created in memory, and will be uploaded when .close() is called on the s3Dataset. The file will be opened in create mode (w).
- The dimensions and variable are created for the sub-array file, and the metadata is also written.
- Calculate the source and target slices. This calculates the mapping between the indices in the master-array and each sub-array. This is complicated by allowing the user to choose any slice for the master-array and so this must be correctly translated to the sub-array indices.
- Copy the data from the source slice to the target slice.
For those files that have an S3 URI, uploading to S3 object storage is
performed when .close() is called on the s3Dataset.
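The overlap test in the first operation can be sketched as follows, under the same assumption of uniformly sized sub-arrays. The function name overlapping_partitions is hypothetical, slices are given as (start, stop) pairs with an exclusive stop as in Python, and the real implementation may differ in detail.

```python
from itertools import product


def overlapping_partitions(slices, subarray_shape):
    """Return partition-matrix indices of every sub-array touched by a slice.

    slices holds one (start, stop) pair per dimension, with stop exclusive;
    subarray_shape is the uniform sub-array length in each dimension.
    """
    ranges = []
    for (start, stop), length in zip(slices, subarray_shape):
        first = start // length        # partition holding the first index
        last = (stop - 1) // length    # partition holding the last index
        ranges.append(range(first, last + 1))
    # Every combination of per-dimension partitions is overlapped.
    return list(product(*ranges))


# A 2D master-array split into 10 x 10 sub-arrays; the slice [0:12, 5:10]
# spills over into the second sub-array along the first dimension only.
touched = overlapping_partitions([(0, 12), (5, 10)], (10, 10))
```

Here touched is [(0, 0), (1, 0)], so two sub-array files would be opened or created for this write.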
The partition information is only written into the partition-matrix when the s3Dataset is in "write" mode and the user slices into the part of the master-array that is covered by that partition. Consequently, the sub-array file is only created when the partition is written into the partition-matrix.
This leads to the situation that a large part of the partition-matrix may
have undefined data, and a large number of sub-array files may not exist.
This makes s3netCDF4 excellent for sparse data, as the sub-array size can
be optimised so that the sparse data occupies minimal space.
If, in "read" mode, the user specifies a slice that contains a sub-array
that is not defined, then the missing value (_FillValue) is returned for
the sub-domain of the master-array which the sub-array occupies.
To split the master-array into its constituent sub-arrays, a method for splitting a large netCDF file into smaller netCDF files is used. The high-level algorithm is:
- Split the field variables so that there is one field variable per file. netCDF allows multiple field variables in a single file, so this is an obvious and easy way of partitioning the file. Note that this only splits the field variables up; the dimension variables all remain in the master-array file.
- For each field variable file, split along the time, level, latitude or longitude dimensions. Note that, in netCDF files, the order of the dimensions is arbitrary, e.g. the order could be [time, level, latitude, longitude] or [longitude, latitude, level, time] or even [latitude, time, longitude, level]. S3netCDF4 uses the metadata and name for each dimension variable to determine the order of the dimensions so that it can split them correctly. Note that any other dimension (ensemble or experiment) will always have a length of 1 in each sub-array, i.e. the dimension will be split into a number of fields equal to its length.
The maximum size of an object (a sub-array file) can be given as a keyword
argument to s3Dataset.createVariable
or s3Group.createVariable
:
max_subarray_size=
. If no max_subarray_size
keyword is supplied, then it
defaults to 50MB.
To determine the optimal number of splits for the time
, latitude
or
longitude
dimensions, while still staying under this maximum size
constraint, two use cases are considered:
- The user wishes to read all the timesteps for a single latitude-longitude point of data.
- The user wishes to read all latitude-longitude points of the data for a single timestep.
For case 1, the optimal solution would be to split the master-array into
sub-arrays that have length 1 for the longitude
and latitude
dimension
and a length equal to the number of timesteps for the time
dimension. For
case 2, the optimal solution would be to not split the longitude
and
latitude
dimensions but split each timestep so that the length of the time
dimension is 1. However, each of these solutions is the worst case scenario for
the other use case.
Balancing the number of operations needed to perform both of these use cases,
while still staying under the max_subarray_size
leads to an optimisation
problem where the following two quantities must be balanced:
- use case 1 = nT / dT
- use case 2 = nlat / dlat X nlon / dlon
where nT is the length of the time
dimension and dT is
the number of splits along the time
dimension. nlat is the
length of the latitude
dimension and dlat the number of splits
along the latitude
dimension. nlon is the length of the
longitude
dimension and dlon the number of splits along the
longitude dimension
.
The following algorithm is used:
- Calculate the current object size Os = nT / dT X nlat / dlat X nlon / dlon
- while Os > max_subarray_size, split a dimension:
  - if dlat X dlon <= dT:
    - if dlat <= dlon: split the latitude dimension again: dlat += 1
    - else: split the longitude dimension again: dlon += 1
  - else: split the time dimension again: dT += 1
Using this simple iterative splitting algorithm ensures the max_subarray_size
constraint is met and that the two use cases require a roughly equal number of operations.
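The splitting loop described above can be sketched in a few lines of Python. This is an illustration only, not the code used inside S3netCDF4: the function name `choose_splits` is hypothetical, and the size limit is assumed to be expressed in array elements rather than bytes to keep the arithmetic easy to follow.

```python
def choose_splits(n_t, n_lat, n_lon, max_elements):
    """Return (d_t, d_lat, d_lon): the number of splits along each dimension.

    n_t, n_lat, n_lon are the dimension lengths. max_elements is the
    largest permitted sub-array size, assumed here to be in elements.
    """
    if max_elements <= 0:
        raise ValueError("max_elements must be positive")
    d_t = d_lat = d_lon = 1
    # Object size Os = nT / dT x nlat / dlat x nlon / dlon
    while (n_t / d_t) * (n_lat / d_lat) * (n_lon / d_lon) > max_elements:
        if d_lat * d_lon <= d_t:
            # The spatial dimensions are split less than time: split one of them.
            if d_lat <= d_lon:
                d_lat += 1
            else:
                d_lon += 1
        else:
            # Time is split less than the spatial dimensions: split time.
            d_t += 1
    return d_t, d_lat, d_lon


print(choose_splits(100, 10, 10, 1000))
# → (3, 2, 2)
```

In this example the sub-arrays hold roughly 33 x 5 x 5 = 833 elements each, which is under the limit of 1000.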
Note that in v2.0 of S3netCDF4, the user can specify the sub-array shape in the s3Dataset.createVariable method. This circumvents the file-splitting algorithm and uses just the sub-array shape specified by the user.
S3netCDF4 can read normal netCDF4 and netCDF3 files, and CFA-netCDF4 and
CFA-netCDF3 files, from a POSIX filesystem, an Amazon S3 object store or
OPeNDAP.
For files on remote storage, before reading the file, S3netCDF4 will query
the file size and determine whether it is greater than the
resource_allocation: memory
setting in the ~/.s3nc.json
configuration or
greater than the current available memory. If it is, then some files will be
"shuffled" out of memory until there is enough allocated memory available. See
Resource Usage for more details. If it is less than the
resource_allocation: memory
setting then it will stream the file
directly into memory. Files on local disk (POSIX) are opened in the same way
as the standard netCDF4 library, i.e. the header, variable and dimension
information and metadata are read in, but no field data is read.
From a user perspective, files are read in the same way as the standard
netCDF4-python package, by creating a s3Dataset
object. As with writing
files, the parameters to the s3Dataset
constructor can vary in a number of
ways:
- The filename can be an S3 endpoint, i.e. it starts with s3://, or a file on the disk, or an OPeNDAP URL.
- The format can be CFA3 or CFA4 to read in a CFA-netCDF3 or CFA-netCDF4 dataset. However, it is not necessary to specify this keyword if the user wishes to read in a CFA file, as S3netCDF4 will determine, from the metadata, whether a netCDF file is a regular netCDF file or a CFA-netCDF file.
  S3netCDF4 will also determine, from the file header, whether a netCDF file is a netCDF4 or netCDF3 file. If the file resides on an S3 storage system, then only the first 6 bytes of the file are read initially, to determine whether the file is a netCDF4 or netCDF3 file or an invalid file. As a CFA-netCDF file is just a netCDF file, determining whether the netCDF file is a CFA-netCDF file is left until the file is read in, i.e. after the interpretation of the header.
- Files that are on remote storage are streamed into memory. As files are read in, other files may be "shuffled" out of memory if the currently used memory exceeds the resource_allocation: memory setting in the ~/.s3nc.json config file. See Resource Usage.
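Identifying the file type from its first few bytes can be sketched as follows. This is a hedged illustration, not the S3netCDF4 code: the function name `sniff_netcdf_format` and the returned labels are hypothetical. It relies on the published file signatures: netCDF3 files begin with the characters CDF followed by a version byte (1, 2 or 5), and netCDF4 files are HDF5 files, whose signature begins with the byte 0x89 followed by HDF and a carriage return and line feed. It assumes the HDF5 signature sits at the very start of the file.

```python
def sniff_netcdf_format(header: bytes) -> str:
    """Classify a file from its first 6 bytes (labels are illustrative)."""
    # netCDF3 (classic, 64-bit offset or CDF-5): "CDF" plus a version byte.
    if len(header) >= 4 and header[:3] == b"CDF" and header[3] in (1, 2, 5):
        return "NETCDF3"
    # netCDF4 is stored as HDF5: first 6 bytes of the 8-byte HDF5 signature.
    if header[:6] == b"\x89HDF\r\n":
        return "NETCDF4"
    return "INVALID"


print(sniff_netcdf_format(b"CDF\x01\x00\x00"))   # → NETCDF3
print(sniff_netcdf_format(b"\x89HDF\r\n"))       # → NETCDF4
print(sniff_netcdf_format(b"hello!"))            # → INVALID
```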
Example 7: Read a netCDF file from disk
from S3netCDF4._s3netCDF4 import s3Dataset as Dataset
with Dataset("/Users/neil/test_dataset_nc4.nc", 'r') as nc_data:
    print(nc_data.variables)
Example 8: Read a CFA-netCDF file from S3 storage
from S3netCDF4._s3netCDF4 import s3Dataset as Dataset
with Dataset("s3://tenancy-0/test_bucket/test_dataset_s3_cfa3.nc", 'r') as nc_data:
    print(nc_data.variables)
Upon reading a CFA-netCDF file, the master-array file is interpreted to
transform the metadata in the file (for CFA "v0.4"), or the information in
the CFA group for the variable (for CFA "v0.5"), into the partition
matrix. See CFA-netCDF files for more information.
Part of this transformation involves creating an instance of the s3Variable
class for each variable in the CFA-netCDF file. The s3Variable
class
contains _nc_var
: the instance of the standard netCDF4.Variable
object;
_cfa_var
: an instance of CFAVariable
, containing information about the CFA
sub-array associated with this variable; and _cfa
: an instance of
CFADataset
, containing information about the CFA master-array file that
contains this variable.
The metadata, or CFA group, in the master-array file is parsed to generate
these two objects. These two objects will be used when a user calls a slice
operation on a s3Variable object.
In v2.0.x, the s3netCDF4 API now matches the standard netCDF4 python API in
reading variable names and variables. Previously, two extra functions were used
(variables(), and getVariable()). During the rework, a way was found to
provide 100% compatibility with the netCDF4 python API. This is reflected in
the method of handling variables:
- s3Dataset.variables, or s3Group.variables: returns the variables in the Dataset, keyed by variable name.
- s3Dataset.variables[<variable_name>], or s3Group.variables[<variable_name>]: returns the s3netCDF4.s3Variable instance for <variable_name> if the variable is a master-array in a CFA-netCDF file, or a netCDF4.Variable instance if it is a dimension variable, or a variable in a standard netCDF file.
Example 9: Read a netCDF file from disk and get the "field8" variable
from S3netCDF4 import s3Dataset as Dataset
with Dataset("/Users/neil/test_dataset_nc4.nc") as src_file:
    print(src_file.variables)
    src_var = src_file.variables["field8"]
    print(type(src_var))
Reading metadata from the Variables or Dataset (global metadata) is done in
exactly the same way as in the standard netCDF4 python package, by querying the
member variable of either a Variable or Dataset. The ncattrs
and getncattr
member functions of the Dataset
and Variable
classes are also supported.
Example 10: Read a netCDF file, a variable and its metadata
from S3netCDF4 import s3Dataset as Dataset
with Dataset("/Users/neil/test_dataset_nc4.nc") as src_file:
    print(src_file.ncattrs())
    src_var = src_file.variables["field8"]
    print(src_var.ncattrs())
    print(src_var.units)
    print(src_var.getncattr("units"))
    print(src_file.author)
    print(src_file.getncattr("author"))
Reading field data in S3netCDF4 follows the same principles as writing the data:
- If the file is determined to have format=NETCDF3_CLASSIC or format=NETCDF4 then it is read in and the field data is made available in the same manner as the standard netCDF4-python package. If the file resides on S3 storage, then the entire file will be streamed to memory. If it is larger than the resource_allocation: memory setting in ~/.s3nc.json, or larger than the available memory, then a memory error will be returned.
- If the file is determined to have format=CFA3 or format=CFA4 then just the master-array file is read in and any field data will only be read when the [] operator (__getitem__) is called on a s3Variable instance.
Upon opening the master-array file:
- if the file is "v0.4" of the CFA conventions, the CFA metadata is taken from the variable metadata. The partition-matrix is constructed (see File splitting algorithm) internally as a netCDF group with dimensions and variables containing the partition information.
- if the file is CFA "v0.5", then the partition-matrix is read in directly from the Groups, Dimensions and Variables in the file, without any parsing having to take place.
- the _cfa, _cfa_grp, _cfa_dim and _cfa_var objects are created as member variables of the s3Dataset, s3Group, s3Dimension and s3Variable objects respectively. These are instances of CFADataset, CFAGroup, CFADimension and CFAVariable respectively. The partition-matrix is contained within a netCDF4.Group within the _cfa_var instance of CFAVariable.
Internally, the partition-matrix consists of a netCDF group, which itself contains the dimensions of the partition-matrix, and variables containing the partition information. Within the s3Dataset, s3Variable and s3Group objects, there are objects that contain higher level CFA data, and the methods to operate on that data. This information is used when a user slices the field data to determine which sub-array files are read and which portion of the sub-array files are included in the slice:
- A CFADataset as the top level container:
  - A number of CFAGroups: information about groups in the file. There is always at least one group: the root group is explicit in its representation in the CFADataset. Within the CFAGroup there are:
    - A number of CFADimensions: information about the dimensions in the Dataset
    - A number of CFAVariables: information about the variables in the Dataset, which contains:
      - The partition-matrix which consists of a netCDF group containing:
        - the scalar dimensions, with no units or associated dimension variable
        - the variables containing the partition information:
          - pmshape: the shape of the partition-matrix
          - pmdimensions: the dimensions in the master-array file which the partition matrix acts over.
          - index: the index in the partition-matrix. This is implied by the location in the partition-matrix but it is retained to detect erroneous lookups by the slicing algorithm.
          - location: the location in the master-array file
          - ncvar: the name of the variable in the sub-array file.
          - file: the URL or path of the sub-array file.
          - format: the format of the sub-array file.
          - shape: the shape of the sub-array file.
      - Methods to act upon the variable and its partition-matrix, including:
        - __getitem__: returns the necessary information to read and write sub-array files.
        - getPartition: return a user-readable version of a partition (a single element in the partition-matrix) as a Python named tuple, rather than a netCDF Group or Variable.
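To make the per-partition information more concrete, the sketch below models a single partition as a Python named tuple. It is only an illustration of the idea, not the structure returned by getPartition: the type name `Partition`, the choice of fields (taken from the per-partition variables listed above), the representation of location as (start, stop) pairs and all of the example values are assumptions.

```python
from collections import namedtuple

# Hypothetical record type; field names follow the partition variables above.
Partition = namedtuple(
    "Partition", ["index", "location", "ncvar", "file", "format", "shape"]
)

example = Partition(
    index=(0, 0, 0),                        # position in the partition-matrix
    location=((0, 12), (0, 90), (0, 180)),  # assumed (start, stop) per dimension
    ncvar="field8",                         # variable name in the sub-array file
    file="/made/up/path/field8.0.nc",       # made-up path, for illustration only
    format="NETCDF4",
    shape=(12, 90, 180),
)

print(example.ncvar)   # → field8
print(example.shape)   # → (12, 90, 180)
```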
Reading a slice of field data from a variable in the master-array file, via __getitem__, consists of six operations:
- If the total size of the requested slice is greater than resource_allocation: memory (or the available memory) then a Numpy memory mapped array is created in the location indicated by the cache_location: setting in the ~/.s3nc.json config file.
- Determine which of the sub-arrays overlap with the slice, by querying the partition-matrix. This is currently done by a simple arithmetic operation that relies on the partitions all being the same size.
- Calculate the source and target slices, the source being the sub-array and the target (the master-array) a memory-mapped Numpy array with a shape equal to the user supplied slice. The location of this sub-array in the master-array is given by the partition containing the sub-array, which gives the slice into the master-array. However, both the slice of the sub-array and master-array may need to be altered if the user supplied slice does not encapsulate the whole sub-array, for example if a range of timesteps are taken.
- For each of the sub-arrays the file specified by the file variable in the partition information is opened. If the file is on disk, it is simply opened in the same way as a standard netCDF4 python file. If it is on a remote file system, such as S3, then it is streamed into memory. If the size of the sub-array file will cause the currently used amount of memory to exceed the resource_allocation: memory setting in ~/.s3nc.json then some files may be shuffled out of memory. See the Resource Usage section above. If, even after the file shuffling has occurred, the size of the sub-array cannot be contained in memory then a memory error will occur.
- A netCDF4-python Dataset object is opened from the downloaded file or streamed memory.
- The values in the sub-array are copied to the master-array (the memory-mapped Numpy array) using the source (sub-array) slice and the target (master-array) slice.
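The overlap and slice calculations in the steps above can be sketched as below. This is a simplified illustration under stated assumptions, not the S3netCDF4 slicing code: the function name `overlapping_partitions` is hypothetical, the user slice is given as (start, stop) pairs with a step of 1, and all partitions are assumed to have the same shape (the last one along a dimension may be shorter).

```python
from itertools import product


def overlapping_partitions(user_slice, subarray_shape):
    """Yield (partition_index, source_slices, target_slices).

    user_slice: one (start, stop) pair per dimension, in master-array indices.
    subarray_shape: the common shape of the sub-arrays.
    """
    per_dim = []
    for (start, stop), size in zip(user_slice, subarray_shape):
        entries = []
        # Equal-sized partitions: overlap is found by integer division.
        for p in range(start // size, (stop - 1) // size + 1):
            p_start = p * size
            lo = max(start, p_start)
            hi = min(stop, p_start + size)
            source = slice(lo - p_start, hi - p_start)  # into the sub-array
            target = slice(lo - start, hi - start)      # into the result array
            entries.append((p, source, target))
        per_dim.append(entries)
    for combo in product(*per_dim):
        yield (tuple(c[0] for c in combo),
               tuple(c[1] for c in combo),
               tuple(c[2] for c in combo))


# One-dimensional usage: a master-array of length 10 in sub-arrays of length 4.
master = list(range(10))
subs = {(p,): master[p * 4:(p + 1) * 4] for p in range(3)}
result = [None] * 7
for index, src, tgt in overlapping_partitions(((2, 9),), (4,)):
    result[tgt[0]] = subs[index][src[0]]
print(result)
# → [2, 3, 4, 5, 6, 7, 8]
```

In the real package the copy step would write into a Numpy (possibly memory-mapped) array rather than a Python list; lists are used here only to keep the sketch free of dependencies.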
Currently the reading of data is performed asynchronously, using aiobotocore. S3netCDF4 allows parallel workflows using multi-processing or Dask, by using the CFA information stored in the CFADataset, CFAGroup, CFADimension and CFAVariable classes. Examples of this will follow.
- Example 1: open a netCDF file from a S3 storage using the alias "tenancy-0"
- Example 2: Create a netCDF4 file in the filesystem
- Example 3: Create a CFA-netCDF4 file in the filesystem with CFA version 0.5
- Example 4: Create a CFA-netCDF4 file in the filesystem
- Example 5: Create a CFA-netCDF3 file on S3 storage
- Example 6: creating dimensions and variables
- Example 7: creating variables with metadata
- Example 8: Read a netCDF file from disk
- Example 9: Read a netCDF file from disk and get the "field8" variable
- Example 10: Read a netCDF file, a variable and its metadata