
Database Issues during import (1.5GB JTL file) #305

Open
kierangirvan opened this issue Dec 10, 2024 · 11 comments

@kierangirvan

Describe the bug
We have noticed that our most recent tests are stuck in an "in progress" state. This is visible in the UI via the yellow widget showing the number of test runs being processed.

We have reviewed the backend (BE) and database (DB) logs, and it's clear the database is having issues. The import gets part of the way through before the database throws a segmentation fault and attempts recovery; in the meantime, the BE shows that it has aborted the upload.

The backend logs show the following:
[screenshot: backend log output]

Notice the "Connection terminated unexpectedly".

Then viewing the DB logs:
[screenshot: database log output]

Note that a server process was terminated by signal 11 (segmentation fault); presumably the other messages are a consequence of the DB being terminated.
The database then appears to recover: we can view historical test runs and upload smaller tests, but the 1.4GB JTL file keeps failing. The DB is now running in its own ECS task with 2 vCPU and 8GB of RAM;
all other containers run in another task with 2 vCPU and 4GB.

Neither of the containers appeared to be resource-constrained.

@kierangirvan (Author)

Oddly, the scheduler is not housekeeping these stale tests (assuming they are stale).

This scheduler activity has run several times since the test failed to upload:
[screenshot: scheduler activity log]

@ludeknovy (Owner)

Hi @kierangirvan
This looks like a TimescaleDB issue rather than a problem with the application itself.
Can you double-check the memory allocation? Does the running container actually have access to all 8 GB of RAM?

Another thing to consider is the shared_buffers and effective_cache_size settings of TimescaleDB/Postgres.
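
For reference, a minimal way to inspect and adjust those settings from psql (the values below are only an illustrative starting point for an 8 GB container, not something confirmed in this thread):

  -- check the current values
  SHOW shared_buffers;
  SHOW effective_cache_size;

  -- a common rule of thumb is roughly 25% of RAM for shared_buffers
  -- and 50-75% for effective_cache_size (illustrative, not prescriptive)
  ALTER SYSTEM SET shared_buffers = '2GB';
  ALTER SYSTEM SET effective_cache_size = '6GB';
  SELECT pg_reload_conf();  -- effective_cache_size picks this up; shared_buffers needs a restart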

@kierangirvan (Author)

@ludeknovy my point above about the scheduler not clearing stale test runs can be ignored; I noticed overnight that the scheduler did indeed clear up those stale test runs.

@kierangirvan (Author)

@ludeknovy but regarding the upload issue: we have historically managed to upload far larger files (4GB), so it doesn't appear to be a buffer/cache sizing issue. We will attempt another upload now that the stale test runs have been housekept (perhaps clearing those stale runs will have helped?).

@ludeknovy (Owner)

@kierangirvan the scheduler has a configured period after which it cleans up stale test reports.

@ludeknovy (Owner) commented Dec 11, 2024

"we have historically managed to upload far larger files (4GB), so it doesn't appear to be a buffer/cache sizing issue."

But maybe you had adjusted the DB settings back then? Have you tried adjusting those values on the current DB?

@ludeknovy (Owner)

Anyway, here are a few more things to try:

  • TimescaleDB sometimes has issues with parallel query plans, especially on hypertables.
    Temporarily disable parallel execution for troubleshooting: SET max_parallel_workers_per_gather = 0; (see the note after this list on making the setting apply beyond a single psql session)

  • Reindex the hypertable:
    REINDEX TABLE jtl.samples;

  • Check for corrupted chunks:
    SELECT chunk_name, table_size_pretty
    FROM timescaledb_information.chunks
    WHERE hypertable_name = 'jtl.samples';
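
A caveat on the parallel-execution suggestion: SET max_parallel_workers_per_gather = 0; only affects the psql session it is run in, so for it to influence the backend's own import connections it would need to be applied at the database or cluster level, for example:

  -- applies to all new connections to this database (jtl_report, as shown later in the thread)
  ALTER DATABASE jtl_report SET max_parallel_workers_per_gather = 0;

  -- or cluster-wide, followed by a configuration reload
  ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
  SELECT pg_reload_conf();

  -- revert once troubleshooting is done
  ALTER DATABASE jtl_report RESET max_parallel_workers_per_gather;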

@kierangirvan (Author)

The configuration of the DB has not changed in months. The issue looks to have started over the weekend when the DB was stopped (presumably AWS swapping us over to new hardware); because of that we briefly had two tasks, each running the DB, and I suspect this is where the corruption has come about.

We are running those queries now and will post the output to help diagnose the issue.

@kierangirvan (Author)

To check for corrupted chunks: the query didn't work exactly as you had put it, but we removed the missing column and confirmed there are no corrupted chunks:

jtl_report=# SELECT chunk_name, table_size_pretty FROM timescaledb_information.chunks WHERE hypertable_name = 'jtl.samples';
ERROR:  column "table_size_pretty" does not exist
LINE 1: SELECT chunk_name, table_size_pretty FROM timescaledb_inform...
                           ^
jtl_report=# \d timescaledb_information.chunks
                       View "timescaledb_information.chunks"
         Column         |           Type           | Collation | Nullable | Default
------------------------+--------------------------+-----------+----------+---------
 hypertable_schema      | name                     |           |          |
 hypertable_name        | name                     |           |          |
 chunk_schema           | name                     |           |          |
 chunk_name             | name                     |           |          |
 primary_dimension      | name                     |           |          |
 primary_dimension_type | regtype                  |           |          |
 range_start            | timestamp with time zone |           |          |
 range_end              | timestamp with time zone |           |          |
 range_start_integer    | bigint                   |           |          |
 range_end_integer      | bigint                   |           |          |
 is_compressed          | boolean                  |           |          |
 chunk_tablespace       | name                     |           |          |
 chunk_creation_time    | timestamp with time zone |           |          |

jtl_report=# SELECT chunk_name FROM timescaledb_information.chunks WHERE hypertable_name = 'jtl.samples';
 chunk_name
------------
(0 rows)
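
One thing worth noting about that result (assuming the hypertable is the samples table in the jtl schema): in timescaledb_information.chunks the schema and table name are separate columns, so filtering hypertable_name with 'jtl.samples' matches nothing and returns 0 rows even if chunks exist. A query along these lines should list them:

  SELECT chunk_schema, chunk_name, range_start, range_end, is_compressed
  FROM timescaledb_information.chunks
  WHERE hypertable_schema = 'jtl'
    AND hypertable_name   = 'samples';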

@kierangirvan (Author)

We have now executed the other two suggestions (disabling parallel execution and reindexing the hypertable).
The upload failed once again, but the error is different: we are no longer seeing the segmentation fault, but instead an exit code of 129.
[screenshot: backend error showing exit code 129]

@ludeknovy (Owner)

@kierangirvan this is outside my expertise; you'll need to google each of those errors and see if that takes you somewhere.
But if "presumably AWS swapping us over to new hardware" is true, it could somehow have corrupted the DB or broken something.
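
If corruption from the brief period with two DB tasks is suspected, one way to sanity-check index integrity is PostgreSQL's amcheck extension (a sketch, assuming the extension is available in the TimescaleDB image and that chunk indexes live under the default _timescaledb_internal schema):

  CREATE EXTENSION IF NOT EXISTS amcheck;

  -- run bt_index_check over every B-tree index in the application and chunk schemas;
  -- an error here points at a corrupted index
  SELECT c.relname, bt_index_check(index => c.oid)
  FROM pg_index i
  JOIN pg_class c ON c.oid = i.indexrelid
  JOIN pg_namespace n ON n.oid = c.relnamespace
  JOIN pg_am am ON am.oid = c.relam
  WHERE am.amname = 'btree'
    AND n.nspname IN ('jtl', '_timescaledb_internal');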
