Handling BLOBs in Spine tools (parameter_value
)
#2682
Replies: 3 comments 7 replies
-
Isn't there also I/O overhead when reading a DB JSON field (even if querying)? I mean if e.g. Parquet knows how to not read everything, it should actually be faster in I/O, because it can be a much more compact data format. |
Beta Was this translation helpful? Give feedback.
-
Just a few comments on our current implementation:
We actually store the type in a separate database column. This was implemented so that we do not need to parse the JSON if we need just the type. Size (or more generally, dimensions) are currently not stored in the database.
In my view, there is a third step as we convert the parsed structure (
We have two JSON formats for variable resolution time series data. The second one is: {
"type": "time_series",
"data": [
["2019-01-01T00:00", 1],
["2019-01-01T00:30", 2],
["2019-01-01T02:00", 8]
]
} See parameter value format documentation. Note, that we have two types of time series (variable and fixed resolution) as well as other containers that could be 'big' (array, map, time pattern). |
Beta Was this translation helpful? Give feedback.
-
I wonder if we could get away from this by not storing any file paths in the database. Instead, we could store file hashes that identify the files. The actual file locations could be managed by |
Beta Was this translation helpful? Give feedback.
-
Querying larger/structured data takes a long time right now, since
it is stored as compact JSON BLOBs which can not be read partially.
To remedy this, we can either
Below we discuss the details of the bottleneck and propose options
to solve it.
Current status
parameter_value
) is stored as JSON blobs(as binary strings) in the database, this keeps schema simple
Consequences
Commentary on BLOBs in databases
cost for the whole blob for every operation, even for the cases where
the value is discarded
of the DB where binary data can be found, and have some kind of
performance enhancements available from the external binary format
About JSON in DBs
database native
jsonb
binary formats can be queriedsolve the problem
fields alongside the BLOB
Dev-experience of querying JSON blobs (SQLite)
json
/jsonb
json_array
/json_insert
/json_replace
, etcjson_extract
/json -> path
/json ->> path
json_each
The above query can flatten a time series in
parameter_value
.It is also possible to create indices on keys inside the JSON, but
requires us to follow a schema.
JSON functions:
Follow JSON schema to enable queries
The current JSON blob does not fit a queryable schema.
The
data
field is opaque because the keys can change as they areactually values. To the JSON parser it looks like an arbitrary JSON
object instead of an array of records. So we can’t create an index,
or calculate the length of the time series. It is possible to filter,
but not trivially.
If we change to a schema as above, now the
data
field is an array ofrecords, and is easier to query. I presume this choice was motivated
by space saving, but it also led to reduced queryability.
Storing binary data outside the DB
If we store binary data outside the DB, then the choice of the data
format determines the ability to query. Most standard options like
Parquet, HDF5, or NetCDF, has some ability to query. The primary
overhead for this option is I/O. It becomes more attractive for larger
datasets. Note that we still need some metadata in the DB.
The downside of managing files on the filesystem is that the user can
easily relocate them, and our reference in the DB would become stale
without any audit trail. On the other hand, in an industry environment
we can easily support BLOB storage on the cloud (S3, etc).
How to decide?
We need to benchmark the different overheads for the different options
for different kinds of datasets.
We need a few different datasets to compare and benchmark against,
e.g. the results DB after running SpineOpt for different problem types
& sizes.
We also need to identify common access patterns because different
access patterns will be impacted differently.
Looking forward to your comments @jkiviluo @soininen @PekkaSavolainen @manuelma
Beta Was this translation helpful? Give feedback.
All reactions