
Database v3 #517

Open
3 tasks
lars-t-hansen opened this issue Jun 15, 2024 · 1 comment
Labels
component:sonalyze sonalyze/* component:sonarlog sonalyze/sonarlog/* pri:low task:enhancement New feature or request

Comments

@lars-t-hansen (Collaborator)

This is the successor to #379. Some things to consider:

  • We could build indices for some filtering criteria to narrow queries more quickly. Obvious cases are from/to dates, host (node), user, and maybe command name. From/to and host already act as indices due to the structure of the database. We could similarly present other major indices to the stream selector, probably rolling all of these into a single structure with a little selection logic around it.
  • The current structure of the database is one file per day per node. With 2000 nodes, this yields 2000 new files (inodes) per day. Suppose there is a per-user inode limit of 1e6: at that rate we would hit the limit in 500 days. Could this be a problem?
  • In general, "older" data should be archived automatically. Whenever those data are required, which is almost never, the archives can be opened and the files read from them. The sensible thing would be to create one archive per month for data that are multiple months old. The archive should use some standard (probably compressed) format that can be opened both with standard Go APIs and with standard command line tools. There would be a single archive for all files under that month: node csv data as well as sysinfo data. Archived folders must be considered read-only, which slightly complicates the internal structure of the program.
@lars-t-hansen lars-t-hansen added task:enhancement New feature or request pri:low component:sonarlog sonalyze/sonarlog/* component:sonalyze sonalyze/* labels Jun 15, 2024
@lars-t-hansen (Collaborator, Author)

A completely different take on this is that we should jettison the database component of Jobanalyzer and build a new one around a standard data warehouse engine, TBD. This would give us a lot more resilience and probably (on balance) reduce complexity.

There are issues with this move. Currently the analysis logic is based on stream-of-samples processing. There can be a very large volume of samples in a given time window, and I/O doesn't stop being a problem just because we move to a database system; quite the contrary. To make good use of a database system we would probably want to preprocess data as they come in, partitioning them into jobs and nodes so that they can be accessed more easily for the tasks we need. At the same time, there may be value in keeping the original data streams (or something like them), since we don't yet know all the uses for them. We would therefore be storing more data, but more of it would be in a directly useful form, and hopefully the net performance gain would be significant. Some experimentation and discussion are warranted. For example, combining the current database with an RDBMS for aggregated data (a job-centric view keyed by job, user, etc.) might also be a sensible solution.
