
Database v3 #517

Open
3 tasks
lars-t-hansen opened this issue Jun 15, 2024 · 1 comment
Labels
component:sonalyze sonalyze/* component:sonarlog sonalyze/sonarlog/* pri:low task:enhancement New feature or request

Comments

@lars-t-hansen (Collaborator)

This is the successor to #379. Some things to consider:

  • We could build indices for some filtering criteria to narrow queries more quickly. Obvious cases are from/to dates, host (node), user, and maybe command name. From/to and host already act as indices due to the structure of the database. We could similarly present other major indices to the stream selector, probably rolling all of these into a single structure with a little selection logic around it.
  • The current structure of the database is one file per day per node. With 2000 nodes, this yields 2000 new files (inodes) per day. Suppose there is a per-user inode limit of 1e6: at that rate we would hit the limit in 500 days. Could this be a problem?
  • In general, "older" data should be archived automatically. Whenever those data are required, which is almost never, the archives can be opened and the files read from them. The sensible thing would be to create one archive per month for data that are multiple months old. The archive should use some standard (probably compressed) format that can be opened both with standard Go APIs and with standard command line tools. There would be a single archive for all files under that month: node csv data as well as sysinfo data. Archived folders must be considered read-only, which slightly complicates the internal structure of the program.
@lars-t-hansen lars-t-hansen added task:enhancement New feature or request pri:low component:sonarlog sonalyze/sonarlog/* component:sonalyze sonalyze/* labels Jun 15, 2024
@lars-t-hansen (Collaborator, Author)

A completely different take on this is that we should jettison the database component of Jobanalyzer and build a new one around a standard data warehouse engine, TBD. This would give us a lot more resilience and probably (on balance) reduce complexity.

There are issues with this move. Currently the analysis logic is based on stream-of-samples processing. There can be a very large volume of samples in a given time window, and I/O doesn't stop being a problem just because we move to a database system; quite the contrary. To make good use of a database system we would probably want to preprocess data as they come in, partitioning them into jobs and nodes so that they can be accessed more easily for the tasks we need. At the same time, there may be value in keeping the original data streams (or something like them), since we don't yet know all the uses for them. We would therefore be storing more data, but more of it would be in a directly useful form, and hopefully the net performance gain would be significant. Some experimentation and discussion are warranted. For example, combining the current database with an RDBMS for aggregated data (a job-centric view keyed by job, user, etc.) might also be a sensible solution.
