Handling BLOBs in Spine tools (`parameter_value`) #2682

suvayu · 2024-03-27T11:50:13Z

suvayu
Mar 27, 2024
Collaborator

Querying larger/structured data takes a long time right now, since
it is stored as compact JSON BLOBs which cannot be read partially.
To remedy this, we can either

- change to a queryable JSON schema, or
- store large data in a more suitable format, like Parquet, HDF5, etc.

Below we discuss the details of the bottleneck and propose options
to solve it.

Current status

- larger/structured data (parameter_value) is stored as JSON blobs
  (as binary strings) in the database; this keeps the schema simple
- example of larger data: a time series

Consequences

- blobs are opaque, we don't know the type or size in advance
- have to wait for the data to materialise in memory ($t_{wait}$)
  - this is the sum of two steps, deserialisation, and memory allocation
    $t_{deserialise} + t_{alloc} = t_{wait}$
  - probably $t_{alloc}$ is more costly
- this leads to waits/slowdowns in unexpected places

Commentary on BLOBs in databases

- strength of a database is to use it as a queryable datastore
- binary blobs are impossible/difficult to query
  - depends on the kind of binary data (see below)
- not being able to query large blobs means we pay the materialisation
  cost for the whole blob for every operation, even for the cases where
  the value is discarded
- an alternate strategy is to store a reference to a location outside
  of the DB where binary data can be found, and benefit from whatever
  performance enhancements the external binary format offers

About JSON in DBs

- JSON stored as TEXT, as a binary string (i.e. same as an undecoded
  file), or in a database native jsonb binary format can be queried
- in general, any other binary formats cannot be queried
  - using something like Apache Arrow as the binary blob does not
    solve the problem
- for non-queryable blobs, a common strategy is to store metadata
  fields alongside the BLOB
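
A minimal sketch of the metadata strategy, using Python's built-in `sqlite3`. The table `value_store` and the columns `length` and `first_key` are hypothetical names chosen for illustration, not the Spine schema. The point is only that a summary query never has to read or parse the payload.

```python
import json
import sqlite3

# Hypothetical table: the BLOB plus a few metadata columns describing it.
con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE value_store (
        id INTEGER PRIMARY KEY,
        type TEXT NOT NULL,      -- e.g. 'time_series'
        length INTEGER NOT NULL, -- number of elements in the blob
        first_key TEXT,          -- first index value, e.g. the start time
        value BLOB NOT NULL      -- opaque payload, not touched by summary queries
    )
    """
)


def insert_value(con, type_, data):
    """Serialise `data` and record its metadata in the same row."""
    keys = list(data)
    blob = json.dumps(data).encode("utf-8")
    con.execute(
        "INSERT INTO value_store (type, length, first_key, value) VALUES (?, ?, ?, ?)",
        (type_, len(keys), keys[0] if keys else None, blob),
    )


insert_value(con, "time_series", {"2000-01-01T00:00:00": 90.0, "2000-01-01T01:00:00": 91.0})
insert_value(con, "time_series", {"2000-01-02T00:00:00": 1.0})

# Answer "what is in there?" from the metadata columns alone.
summary = con.execute(
    "SELECT id, type, length, first_key FROM value_store ORDER BY id"
).fetchall()
print(summary)
```

The cost of this approach is that the metadata has to be kept in sync with the blob on every write.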

Dev-experience of querying JSON blobs (SQLite)

- parsers: json / jsonb
- producers/setters: json_array / json_insert / json_replace, etc
- path accessors/operators: json_extract / json -> path / json ->> path
- iterators: json_each

SELECT pv.id,
       pv.alternative_id,
       e.name AS entity,
       pd.name AS param,
       ts.key AS time,
       ts.value
FROM parameter_value pv, json_each(json(pv.value), '$.data') ts
     LEFT JOIN entity e ON pv.entity_id = e.id
     LEFT JOIN parameter_definition pd ON pv.parameter_definition_id = pd.id
WHERE pv.parameter_definition_id = 1;

The above query can flatten a time series in parameter_value.

"id"	"alternative_id"	"entity"	"param"	"time"	"value"
1	2	report_[...]power_plant_a__electricity_node[...]	unit_flow	"2000-01-01T00:00:00.0"	100.0
5	3	report_[...]power_plant_a__electricity_node[...]	unit_flow	"2000-01-01T00:00:00.0"	90.0
5	3	report_[...]power_plant_a__electricity_node[...]	unit_flow	"2000-01-01T01:00:00.0"	91.0
5	3	report_[...]power_plant_a__electricity_node[...]	unit_flow	"2000-01-01T02:00:00.0"	93.0
5	3	report_[...]power_plant_a__electricity_node[...]	unit_flow	"2000-01-01T03:00:00.0"	95.0
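
For anyone who wants to try this without a Spine database at hand, here is a self-contained sketch with Python's built-in `sqlite3`. The three tables are cut down to the columns the query touches, the entity name is shortened, and the value is stored as TEXT rather than as a binary string (so the `json()` wrapper is dropped); all of that is an assumption made for the example, not the real schema.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE parameter_definition (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE parameter_value (
        id INTEGER PRIMARY KEY,
        entity_id INTEGER,
        parameter_definition_id INTEGER,
        alternative_id INTEGER,
        value TEXT
    );
    INSERT INTO entity VALUES (1, 'power_plant_a');
    INSERT INTO parameter_definition VALUES (1, 'unit_flow');
    """
)
series = {"data": {"2000-01-01T00:00:00.0": 90.0, "2000-01-01T01:00:00.0": 91.0}}
con.execute(
    "INSERT INTO parameter_value VALUES (5, 1, 1, 3, ?)", (json.dumps(series),)
)

# Same shape as the query above: one output row per time step.
rows = con.execute(
    """
    SELECT pv.id, pv.alternative_id, e.name, pd.name, ts.key, ts.value
    FROM parameter_value pv, json_each(pv.value, '$.data') ts
         LEFT JOIN entity e ON pv.entity_id = e.id
         LEFT JOIN parameter_definition pd ON pv.parameter_definition_id = pd.id
    WHERE pv.parameter_definition_id = 1
    ORDER BY ts.key
    """
).fetchall()
for row in rows:
    print(row)
```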

It is also possible to create indices on keys inside the JSON, but
this requires us to follow a schema.
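
As a rough illustration, SQLite allows an index on an expression, so a key that every blob is guaranteed to carry can be indexed with `json_extract`. The sketch below assumes the value is stored as TEXT and that `$.index.repeat` is always present; the index name `idx_pv_repeat` is made up for the example.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE parameter_value (id INTEGER PRIMARY KEY, value TEXT)")
for i, repeat in enumerate([False, True, False], start=1):
    doc = {"data": {}, "index": {"ignore_year": False, "repeat": repeat}}
    con.execute("INSERT INTO parameter_value VALUES (?, ?)", (i, json.dumps(doc)))

# Index on a key that every blob is expected to carry (the "schema" part).
con.execute(
    "CREATE INDEX idx_pv_repeat "
    "ON parameter_value (json_extract(value, '$.index.repeat'))"
)

# The WHERE clause must use the same expression as the index definition.
query = (
    "SELECT id FROM parameter_value "
    "WHERE json_extract(value, '$.index.repeat') = 1"
)
repeating_ids = [row[0] for row in con.execute(query)]
print(repeating_ids)

# The query plan shows whether SQLite chose the expression index.
for step in con.execute("EXPLAIN QUERY PLAN " + query):
    print(step[-1])
```

If the key is missing from some blobs, `json_extract` returns NULL for those rows, which is why an agreed schema matters here.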

JSON functions:

SQLite
PostgreSQL
MySQL
also see for comments on schema design

Follow JSON schema to enable queries

The current JSON blob does not fit a queryable schema.

{
  "data": {
    "2000-01-01T00:00:00.0": 90.0,
    "2000-01-01T01:00:00.0": 91.0,
    "2000-01-01T02:00:00.0": 93.0
  },
  "index": {
    "ignore_year": false,
    "repeat": false
  }
}

The data field is opaque because the keys can change as they are
actually values. To the JSON parser it looks like an arbitrary JSON
object instead of an array of records. So we can’t create an index,
or calculate the length of the time series. It is possible to filter,
but not trivially.

{
  "data": [
      {"time": "2000-01-01T00:00:00.0", "value": 90.0},
      {"time": "2000-01-01T01:00:00.0", "value": 91.0},
      {"time": "2000-01-01T02:00:00.0", "value": 93.0}
  ],
  "index": {
    "ignore_year": false,
    "repeat": false
  }
}

If we change to a schema as above, the data field becomes an array of
records, and is easier to query. I presume the current format was
motivated by space saving, but it also reduced queryability.
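
To make the difference concrete, here is a small sketch that rewrites the current layout into an array of records and then asks SQLite for the length and a filtered subset. The helper `to_records` is hypothetical, and the value is stored as TEXT for simplicity.

```python
import json
import sqlite3


def to_records(blob: str) -> str:
    """Rewrite {"data": {time: value}} as {"data": [{"time": ..., "value": ...}]}."""
    doc = json.loads(blob)
    doc["data"] = [{"time": t, "value": v} for t, v in doc["data"].items()]
    return json.dumps(doc)


current = json.dumps(
    {
        "data": {
            "2000-01-01T00:00:00.0": 90.0,
            "2000-01-01T01:00:00.0": 91.0,
            "2000-01-01T02:00:00.0": 93.0,
        },
        "index": {"ignore_year": False, "repeat": False},
    }
)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE parameter_value (id INTEGER PRIMARY KEY, value TEXT)")
con.execute("INSERT INTO parameter_value VALUES (1, ?)", (to_records(current),))

# Length of the series, computed inside SQLite.
(n_steps,) = con.execute(
    "SELECT json_array_length(value, '$.data') FROM parameter_value WHERE id = 1"
).fetchone()

# Filter on a named field of each record.
late = con.execute(
    """
    SELECT json_extract(rec.value, '$.time'), json_extract(rec.value, '$.value')
    FROM parameter_value pv, json_each(pv.value, '$.data') rec
    WHERE pv.id = 1
      AND json_extract(rec.value, '$.time') >= '2000-01-01T01:00:00.0'
    ORDER BY 1
    """
).fetchall()
print(n_steps, late)
```

The string comparison on `time` only works because ISO 8601 timestamps of the same precision sort lexicographically.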

Storing binary data outside the DB

If we store binary data outside the DB, then the choice of the data
format determines the ability to query. Most standard options like
Parquet, HDF5, or NetCDF have some ability to query. The primary
overhead for this option is I/O. It becomes more attractive for larger
datasets. Note that we still need some metadata in the DB.

The downside of managing files on the filesystem is that the user can
easily relocate them, and our reference in the DB would become stale
without any audit trail. On the other hand, in an industry environment
we can easily support BLOB storage on the cloud (S3, etc).

How to decide?

We need to benchmark the different overheads for the different options
for different kinds of datasets.

- wait time for materialisation:
  - measure time to deserialise & allocate separately
- I/O overhead when reading a binary file from disk
  - at what size does this become a viable option?
- measure time saving when filtering/batching is possible
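
A possible starting point for the last item, sketched with synthetic data and Python's built-in `sqlite3`. It compares fetching and parsing the whole blob against letting SQLite filter inside the JSON. The series generator, sizes, and function names are placeholders; it does not separate deserialisation from allocation, and real measurements should use actual result databases.

```python
import json
import sqlite3
import time


def make_series(n):
    """Synthetic stand-in for real data: n [time, value] pairs."""
    data = [[f"t{i:08d}", float(i)] for i in range(n)]
    return json.dumps({"type": "time_series", "data": data})


def timed(fn):
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def benchmark(n, tail=10):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE parameter_value (id INTEGER PRIMARY KEY, value TEXT)")
    con.execute("INSERT INTO parameter_value VALUES (1, ?)", (make_series(n),))
    cutoff = f"t{n - tail:08d}"

    # Option 1: fetch the whole blob, parse it in Python, then filter.
    def full_parse():
        (blob,) = con.execute("SELECT value FROM parameter_value WHERE id = 1").fetchone()
        return [row for row in json.loads(blob)["data"] if row[0] >= cutoff]

    # Option 2: let SQLite walk the JSON and return only the matching rows.
    def in_db_filter():
        return con.execute(
            """
            SELECT json_extract(ts.value, '$[0]'), json_extract(ts.value, '$[1]')
            FROM parameter_value pv, json_each(pv.value, '$.data') ts
            WHERE pv.id = 1 AND json_extract(ts.value, '$[0]') >= ?
            """,
            (cutoff,),
        ).fetchall()

    parsed, t_parse = timed(full_parse)
    queried, t_query = timed(in_db_filter)
    if [tuple(r) for r in parsed] != queried:
        raise RuntimeError("the two options disagree")
    return {"n": n, "t_full_parse": t_parse, "t_in_db_filter": t_query, "rows": len(queried)}


results = [benchmark(n) for n in (1_000, 10_000)]
for r in results:
    print(r)
```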

We need a few different datasets to compare and benchmark against,
e.g. the results DB after running SpineOpt for different problem types
& sizes.

We also need to identify common access patterns because different
access patterns will be impacted differently.

Looking forward to your comments @jkiviluo @soininen @PekkaSavolainen @manuelma

jkiviluo · 2024-03-27T14:17:19Z

jkiviluo
Mar 27, 2024
Maintainer

Isn't there also I/O overhead when reading a DB JSON field (even if querying)? I mean if e.g. Parquet knows how to not read everything, it should actually be faster in I/O, because it can be a much more compact data format.

2 replies

suvayu Mar 27, 2024
Collaborator Author

Yes, indeed you are right. Essentially the performance optimisation for Parquet is that it does things in columns, and batches (parts of a column). On top of that it supports applying a filter consisting of some rudimentary expressions; see the docs for the filter keyword, and the class documentation for Expression. So the question is at what size is this faster than reading a BLOB from the DB.

Managing files is a messy problem, primarily because the user can easily relocate them, and our reference in the DB would become stale without any audit trail. So I would suggest we experiment with a hybrid model, and see how that works. Datasets beyond a certain size get moved to the filesystem, but for smaller cases we use the DB. But we would need some benchmarks to make these decisions, and the benchmarks should be diverse enough to cover most use cases.

suvayu Mar 27, 2024
Collaborator Author

I hadn't considered one upside of supporting BLOBs outside the DB - we can support cloud storage like S3 trivially. I added that comment to the original post.

soininen · 2024-03-28T12:13:14Z

soininen
Mar 28, 2024
Maintainer

Just a few comments on our current implementation:

blobs are opaque, we don't know the type or size in advance

We actually store the type in a separate database column. This was implemented so that we do not need to parse the JSON if we need just the type. Size (or more generally, dimensions) are currently not stored in the database.

this is the sum of two steps, deserialisation, and memory allocation

In my view, there is a third step as we convert the parsed structure (dict in Python) into a spinedb_api.parameter_value.ParameterValue object (see from_database() in spinedb_api). I expect parsing and conversion to take much more time than memory allocation.

The current JSON blob does not fit a queryable schema.

We have two JSON formats for variable resolution time series data. The second one is:

{
  "type": "time_series",
  "data": [
    ["2019-01-01T00:00", 1],
    ["2019-01-01T00:30", 2],
    ["2019-01-01T02:00", 8]
  ]
}

See parameter value format documentation. Note that we have two types of time series (variable and fixed resolution) as well as other containers that could be 'big' (array, map, time pattern).

4 replies

suvayu Mar 28, 2024
Collaborator Author

Thanks for the corrections and more context!

We actually store the type in a separate database column

I knew this, but only subconsciously 😶. IIUC, this is a common pattern in Entity-Attribute-Value database models. Often people also separate the tables by type.

The second one [time series format] is:

This one is query-able.

we have two types of time series (variable and fixed resolution) as well as other containers that could be 'big'

Is it possible to get substantial examples of each of these different types? And maybe a little context on when they are useful, that will give me an idea on the kind of queries/metadata required to retrieve "only as much as needed".

soininen Mar 28, 2024
Maintainer

Is it possible to get substantial examples of each of these different types? And maybe a little context on when they are useful, that will give me an idea on the kind of queries/metadata required to retrieve "only as much as needed".

I cannot speak for models like SpineOpt, but in Toolbox Database editor we need to be able to deal with all data structures regardless of size. Some examples off the top of my head:

- The tables show just the type of data unless it is something simple like a string or DateTime. The separate 'type' column in the database helps here.
- Tool tips currently show the start time and number of steps for time series. Currently, we need to parse the entire blob to get this data.
- Value editors show the data in tables. Since only a single value is shown at a time, I don't think this will be a bottleneck.
- Pivot table (in index mode) may show a lot of data at the same time. It might be beneficial if we could load only the visible slice of the data.
- When updating a value in the database, we need to compare the new value with the old one to see if something has really changed. This might be slow if both values are big and the difference is in a single element. Maybe calculating and comparing hashes would be more efficient?
- Importer has the option to 'merge' data, e.g. append time series to each other. Not directly querying, but would be nice if this could be done without parsing the entire target value first.
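
For the tool tip and pivot table cases, a sketch of what pushing the work into SQLite could look like for the array-of-pairs time series format. The functions `tooltip_summary` and `visible_slice` are hypothetical, the value is stored as TEXT, and note that SQLite still has to scan the JSON text internally, so this only avoids the Python-side parsing and object conversion.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE parameter_value (id INTEGER PRIMARY KEY, type TEXT, value TEXT)")
series = {
    "type": "time_series",
    "data": [["2019-01-01T00:00", 1], ["2019-01-01T00:30", 2], ["2019-01-01T02:00", 8]],
}
con.execute(
    "INSERT INTO parameter_value VALUES (1, 'time_series', ?)", (json.dumps(series),)
)


def tooltip_summary(con, value_id):
    """Start time and step count, extracted by SQLite rather than parsed in Python."""
    return con.execute(
        """
        SELECT json_extract(value, '$.data[0][0]'), json_array_length(value, '$.data')
        FROM parameter_value WHERE id = ?
        """,
        (value_id,),
    ).fetchone()


def visible_slice(con, value_id, offset, limit):
    """Only the rows a view would currently display, e.g. in a pivot table."""
    return con.execute(
        """
        SELECT json_extract(ts.value, '$[0]'), json_extract(ts.value, '$[1]')
        FROM parameter_value pv, json_each(pv.value, '$.data') ts
        WHERE pv.id = ?
        ORDER BY ts.key
        LIMIT ? OFFSET ?
        """,
        (value_id, limit, offset),
    ).fetchall()


print(tooltip_summary(con, 1))
print(visible_slice(con, 1, offset=1, limit=2))
```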

suvayu Mar 29, 2024
Collaborator Author

When updating a value in the database, we need to compare the new value with the old one to see if something has really changed. This might be slow if both values are big and the difference is in a single element. Maybe calculating and comparing hashes would be more efficient?

This is a bit difficult. I don't think we can do a comparison with hashes here, because to get the hash the file needs to be written first. AFAIU this kind of hash based comparison is only possible for hash tree data structures (e.g. used by some filesystems like Btrfs, version control systems like Git, or Mercurial, distributed databases, etc). For in-memory data, my favoured way of doing this would be to have a read-only original, and a sequence of deltas for every edit. Calculating after the fact isn't really a performant option.

Also note, if altering the data is a frequent occurrence, then Parquet is out of the picture. It is an archival format; it cannot be amended. You can add to the "dataset" as new files, but changing values means rewriting one of the existing files. HDF5 & NetCDF offer edits, but they also require "maintenance" after you have done a lot of edits (something similar to defragmenting a drive on Windows).

Importer has the option to 'merge' data, e.g. append time series to each other. Not directly querying, but would be nice if this could be done without parsing the entire target value first.

This is quite possible since most of the file formats allow you to define a dataset out of multiple files (Parquet), or out of multiple blocks (HDF5). I think with any binary format, even if you have to write a new file, as long as there are no edits, it should be possible to "parse" faster by jumping over whole blocks. Something like this:

- read header in file 1
- blindly copy the data block to new file (or in batches)
- keep the new file open
- read header in file 2
- blindly continue copying the data block to the new file
- finalise the new file (write footer, and other metadata, etc)

Of course we would have to ensure their structures match ahead of time, otherwise it will crash.
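
The steps can be sketched with a toy binary layout. To be clear, the format below (a fixed header followed by fixed-size records) is invented for illustration and is much simpler than Parquet or HDF5; the names `write_series`, `merge`, and `read_series` are hypothetical. It only shows that the data blocks can be appended without decoding individual records, with the header check done up front.

```python
import shutil
import struct

# Toy layout (an assumption for illustration, not Parquet or HDF5):
#   header: magic (4 bytes), record size (uint32), record count (uint64)
#   body:   `count` fixed-size records of (int64 timestamp, float64 value)
HEADER = struct.Struct("<4sIQ")
RECORD = struct.Struct("<qd")
MAGIC = b"TSB1"


def write_series(path, rows):
    with open(path, "wb") as f:
        f.write(HEADER.pack(MAGIC, RECORD.size, len(rows)))
        for ts, value in rows:
            f.write(RECORD.pack(ts, value))


def read_header(f):
    """Return the record count, or raise if the structure does not match."""
    magic, record_size, count = HEADER.unpack(f.read(HEADER.size))
    if magic != MAGIC or record_size != RECORD.size:
        raise ValueError("structure mismatch, refusing to merge")
    return count


def merge(paths, out_path):
    """Append the data blocks of `paths` into `out_path` without decoding records."""
    # Check every header ahead of time so a mismatch cannot leave a half-written file.
    counts = []
    for path in paths:
        with open(path, "rb") as src:
            counts.append(read_header(src))
    with open(out_path, "wb") as out:
        out.write(HEADER.pack(MAGIC, RECORD.size, sum(counts)))
        for path in paths:
            with open(path, "rb") as src:
                src.seek(HEADER.size)         # jump over the header
                shutil.copyfileobj(src, out)  # blind copy of the data block
    return sum(counts)


def read_series(path):
    with open(path, "rb") as f:
        count = read_header(f)
        return [RECORD.unpack(f.read(RECORD.size)) for _ in range(count)]
```

In this toy version the record count lives in the header, so it is written first; a format with a footer would write the totals last instead.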

soininen Apr 2, 2024
Maintainer

I don't think we can do a comparison with hashes here, because to get the hash the file needs to be written first.

I should have been clearer here: I was thinking of the in-memory values, not something we have already written to disk. We compare/merge/update values in memory, then archive the result on disk or in the database if needed. I was a bit off-topic, I guess.
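
To illustrate what I mean, a rough sketch of digest-based comparison for in-memory values. It assumes a digest of the old value has been kept somewhere (a hypothetical extra column, for instance), so only the new value needs to be walked. The function names are made up, and whether this actually beats a direct comparison would have to be measured.

```python
import hashlib
import json


def value_digest(value) -> str:
    """SHA-256 of a canonical JSON serialisation (sorted keys, no extra whitespace)."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changed(stored_digest: str, new_value) -> bool:
    """Compare a new in-memory value against the digest kept for the old one."""
    return value_digest(new_value) != stored_digest


old = {"type": "time_series", "data": [["2019-01-01T00:00", 1], ["2019-01-01T00:30", 2]]}
stored = value_digest(old)

unchanged = {"data": [["2019-01-01T00:00", 1], ["2019-01-01T00:30", 2]], "type": "time_series"}
edited = {"type": "time_series", "data": [["2019-01-01T00:00", 1], ["2019-01-01T00:30", 3]]}
print(has_changed(stored, unchanged), has_changed(stored, edited))
```

Sorting keys makes the digest independent of dictionary ordering, while list order (the time steps) still matters.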

soininen · 2024-03-28T12:37:20Z

soininen
Mar 28, 2024
Maintainer

The downside of managing files on the filesystem is that the user can easily relocate them, and our reference in the DB would become stale without any audit trail.

I wonder if we could get away from this by not storing any file paths in the database. Instead, we could store file hashes that identify the files. The actual file locations could be managed by spinedb_api so a client could just pass a new data path to DatabaseMapping if the files are relocated.

1 reply

suvayu Mar 29, 2024
Collaborator Author

Instead, we could store file hashes that identify the files.

Interesting idea, it could work. That's pretty much how most file caches work. E.g. see this from my pip cache.

$ ls .cache/pip/wheels/
03  11  22  2b  35  47  4e  59  64  6a  7d  8a  91  9f  ab  b2  ba  c8  d3  e0  ec  f8
04  13  23  2e  37  48  52  5a  65  6f  80  8b  96  a1  ad  b3  bd  ca  d4  e1  f0  f9
05  18  24  2f  39  49  53  5b  66  70  82  8c  97  a2  ae  b4  bf  cb  d6  e6  f2  fa
06  1a  26  31  40  4a  56  5f  68  71  86  8f  9a  a4  af  b5  c2  ce  da  ea  f4  fb
08  1e  27  34  46  4b  57  61  69  76  88  90  9c  a7  b1  b8  c4  d1  dd  eb  f6
$ tree -d .cache/pip/wheels/03/
.cache/pip/wheels/03/
├── 06
│   └── 94
│       └── 4ca89e9f09b1529e8d5fb38e87c1641d37e00b02cd4860cdfb
└── d4
    └── f1
        └── 8e52eb7d954f88043b275b415a1e3273514975ec680e6ef71f

7 directories

Git also does something similar.
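
A minimal sketch of such a content-addressed layout, in the spirit of the directory tree above. The functions `store` and `resolve` and the `data_root` argument are hypothetical and not part of spinedb_api; the idea is that the DB keeps only the digest, and the client supplies whichever root directory currently holds the files.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def _blob_path(data_root, digest) -> Path:
    # Two levels of fan-out keep individual directories small.
    return Path(data_root) / digest[:2] / digest[2:4] / digest[4:]


def store(data_root, source) -> str:
    """Copy `source` under `data_root`, named by its digest; return the digest."""
    digest = file_digest(source)
    target = _blob_path(data_root, digest)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        shutil.copyfile(source, target)
    return digest  # this is what would be saved in the DB instead of a path


def resolve(data_root, digest) -> Path:
    """Map a digest from the DB to a file under whichever root the client passes."""
    path = _blob_path(data_root, digest)
    if not path.is_file():
        raise FileNotFoundError(f"no blob {digest} under {data_root}")
    return path
```

If the whole data directory is moved, the digests in the DB stay valid and only the root passed to `resolve` changes.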
