
Database Issues during import (1.5GB JTL file) #305

Open
kierangirvan opened this issue Dec 10, 2024 · 11 comments

@kierangirvan

Describe the bug
We have noticed that our most recent tests are stuck in an "in progress" state. This is visible in the UI via the yellow widget showing the number of test runs being processed.

We have reviewed the backend (BE) and database (DB) logs, and it's clear the database is having issues. The import gets part of the way through before the database throws a segmentation fault and attempts recovery; in the meantime, the BE shows that it has aborted the upload.

The backend logs show the following:
[screenshot: backend log output]

Notice the "Connection terminated unexpectedly".

Then viewing the DB logs:
[screenshot: database log output]

Note that a server process was terminated by signal 11 (segmentation fault); presumably the other messages are a consequence of the DB being terminated.
The database then appears to recover: we can view historical test runs and upload smaller tests, but the 1.4GB JTL file keeps failing. The DB is now running in its own ECS task with 2 vCPU and 8GB of RAM;
all other containers run in another task with 2 vCPU and 4GB.

Neither of the containers appeared to be resource-constrained.

@kierangirvan (Author)

Oddly, the scheduler is not housekeeping these stale tests (assuming they are stale).

This scheduler activity has run several times since the test failed to upload:
[screenshot: scheduler activity log]

@ludeknovy (Owner)

Hi @kierangirvan
This looks like a TimescaleDB issue rather than a problem with the application itself.
Can you double-check the memory allocation? Does the running container actually have access to all 8 GB of RAM?

Another thing to consider is the shared_buffers and effective_cache_size settings of TimescaleDB/Postgres.
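
For reference, a minimal way to inspect and adjust those settings from psql (the values below are only an illustrative starting point for an 8 GB container, not something confirmed in this thread):

  -- check the current values
  SHOW shared_buffers;
  SHOW effective_cache_size;

  -- a common rule of thumb is roughly 25% of RAM for shared_buffers
  -- and 50-75% for effective_cache_size (illustrative, not prescriptive)
  ALTER SYSTEM SET shared_buffers = '2GB';
  ALTER SYSTEM SET effective_cache_size = '6GB';
  SELECT pg_reload_conf();  -- effective_cache_size picks this up; shared_buffers needs a restart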

@kierangirvan (Author)

@ludeknovy my point above about the scheduler not clearing stale test runs can be ignored; I noticed overnight that the scheduler did indeed clear up those stale test runs.

@kierangirvan (Author)

@ludeknovy but regarding the upload issue: we have historically managed to upload far larger files (4GB), so it doesn't appear to be a buffer/cache sizing issue. We will attempt another upload now that the stale test runs have been housekept (perhaps clearing those stale runs will have helped?).

@ludeknovy (Owner)

@kierangirvan the scheduler has a configured period after which it cleans up stale test reports.

@ludeknovy (Owner) commented Dec 11, 2024

"we have historically managed to upload far larger files (4GB), so it doesn't appear to be a buffer/cache sizing issue."

But maybe you had adjusted the DB settings back then? Have you tried adjusting those values on the current DB?

@ludeknovy (Owner)

Anyway, here are a few more things to try:

  • TimescaleDB sometimes has issues with parallel query plans, especially on hypertables.
    Temporarily disable parallel execution for troubleshooting: SET max_parallel_workers_per_gather = 0; (see the note after this list on making the setting apply beyond a single psql session)

  • Reindex the hypertable:
    REINDEX TABLE jtl.samples;

  • Check for corrupted chunks:
    SELECT chunk_name, table_size_pretty
    FROM timescaledb_information.chunks
    WHERE hypertable_name = 'jtl.samples';
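
A caveat on the parallel-execution suggestion: SET max_parallel_workers_per_gather = 0; only affects the psql session it is run in, so for it to influence the backend's own import connections it would need to be applied at the database or cluster level, for example:

  -- applies to all new connections to this database (jtl_report, as shown later in the thread)
  ALTER DATABASE jtl_report SET max_parallel_workers_per_gather = 0;

  -- or cluster-wide, followed by a configuration reload
  ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
  SELECT pg_reload_conf();

  -- revert once troubleshooting is done
  ALTER DATABASE jtl_report RESET max_parallel_workers_per_gather;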

@kierangirvan (Author)

The configuration of the DB has not changed in months. The issue looks to have started over the weekend when the DB was stopped (presumably AWS swapping us over to new hardware); because of that we briefly had two tasks, each running the DB, and I suspect this is where the corruption has come about.

We are running those queries now and will post the output to help diagnose the issue.

@kierangirvan (Author)

To check for corrupted chunks: the query didn't work exactly as you had put it, but we removed the missing column and confirmed there are no corrupted chunks:

jtl_report=# SELECT chunk_name, table_size_pretty FROM timescaledb_information.chunks WHERE hypertable_name = 'jtl.samples';
ERROR:  column "table_size_pretty" does not exist
LINE 1: SELECT chunk_name, table_size_pretty FROM timescaledb_inform...
                           ^
jtl_report=# \d timescaledb_information.chunks
                       View "timescaledb_information.chunks"
         Column         |           Type           | Collation | Nullable | Default
------------------------+--------------------------+-----------+----------+---------
 hypertable_schema      | name                     |           |          |
 hypertable_name        | name                     |           |          |
 chunk_schema           | name                     |           |          |
 chunk_name             | name                     |           |          |
 primary_dimension      | name                     |           |          |
 primary_dimension_type | regtype                  |           |          |
 range_start            | timestamp with time zone |           |          |
 range_end              | timestamp with time zone |           |          |
 range_start_integer    | bigint                   |           |          |
 range_end_integer      | bigint                   |           |          |
 is_compressed          | boolean                  |           |          |
 chunk_tablespace       | name                     |           |          |
 chunk_creation_time    | timestamp with time zone |           |          |

jtl_report=# SELECT chunk_name FROM timescaledb_information.chunks WHERE hypertable_name = 'jtl.samples';
 chunk_name
------------
(0 rows)
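
One thing worth noting about that result (assuming the hypertable is the samples table in the jtl schema): in timescaledb_information.chunks the schema and table name are separate columns, so filtering hypertable_name with 'jtl.samples' matches nothing and returns 0 rows even if chunks exist. A query along these lines should list them:

  SELECT chunk_schema, chunk_name, range_start, range_end, is_compressed
  FROM timescaledb_information.chunks
  WHERE hypertable_schema = 'jtl'
    AND hypertable_name   = 'samples';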

@kierangirvan (Author)

We have now executed the other two suggestions (disabling parallel execution and reindexing the hypertable).
The upload failed once again, but the error is different: we are no longer seeing the segmentation fault, but instead an exit code of 129.
[screenshot: backend error showing exit code 129]

@ludeknovy (Owner)

@kierangirvan this is outside my expertise; you'll need to google each of those errors and see if that takes you somewhere.
But if "presumably AWS swapping us over to new hardware" is true, it could somehow have corrupted the DB or broken something.
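
If corruption from the brief period with two DB tasks is suspected, one way to sanity-check index integrity is PostgreSQL's amcheck extension (a sketch, assuming the extension is available in the TimescaleDB image and that chunk indexes live under the default _timescaledb_internal schema):

  CREATE EXTENSION IF NOT EXISTS amcheck;

  -- run bt_index_check over every B-tree index in the application and chunk schemas;
  -- an error here points at a corrupted index
  SELECT c.relname, bt_index_check(index => c.oid)
  FROM pg_index i
  JOIN pg_class c ON c.oid = i.indexrelid
  JOIN pg_namespace n ON n.oid = c.relnamespace
  JOIN pg_am am ON am.oid = c.relam
  WHERE am.amname = 'btree'
    AND n.nspname IN ('jtl', '_timescaledb_internal');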
