
[Bug]: Compression Policy Stalling After Upgrade to TimescaleDB 2.16.2 #7502

Open

rdmolony opened this issue Nov 29, 2024 · 6 comments

rdmolony commented Nov 29, 2024

What type of bug is this?

Crash

What subsystems and features are affected?

Compression

What happened?

TL;DR:

My database's compression policy job (CALL _timescaledb_functions.policy_compression()) has not run successfully since I upgraded from version 2.15.3 to 2.16.2, approximately four months ago. It appears to get stuck on specific chunks during compression.

The issue has been resolved by running a SQL query to manually decompress/compress all hypertable chunks - see the SQL query at the bottom for more info.


Recently, I noticed that the periodic job CALL _timescaledb_functions.policy_compression() was running indefinitely.

So I looked at this job's stats via ...

select * from timescaledb_information.job_stats;

job_id:            1014
application_name:  Compression Policy [1014]
schedule_interval: 3 days 12:00:00
max_runtime:       00:00:00
max_retries:       -1
retry_period:      01:00:00
proc_schema:       _timescaledb_functions
proc_name:         policy_compression
owner:             postgres
scheduled:         true
fixed_schedule:    false
config:            {"hypertable_id": 61, "compress_after": "7 days"}
next_start:
initial_start:     -infinity
hypertable_schema: public
hypertable_name:   sensor_reading_generic
check_schema:      _timescaledb_functions
check_name:        policy_compression_check

... and saw that the last time it ran successfully was 4 months ago.
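
For reference, the last-run timestamps can be pulled from the same view; a minimal sketch, assuming job id 1014 as above:

select job_id,
       last_run_started_at,
       last_successful_finish,
       last_run_status,
       total_runs,
       total_failures
  from timescaledb_information.job_stats
 where job_id = 1014;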

This timing corresponds with an upgrade of TimescaleDB from version 2.15.3 to 2.16.2.
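
The currently installed extension version can be confirmed directly with:

select extversion from pg_extension where extname = 'timescaledb';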

I noticed that the compression policy query CALL _timescaledb_functions.policy_compression() was getting stuck on one particular hypertable chunk by running ...

with query as (
  select pid
    from pg_stat_activity
   where query = 'CALL _timescaledb_functions.policy_compression()'
)
select
    l.relation::regclass as locked_relation,
    (c.chunk_schema || '.' || c.chunk_name)::regclass as locked_chunk,
    c.range_start,
    l.mode,
    l.locktype,
    a.query,
    a.state,
    a.wait_event_type,
    a.wait_event
from
    pg_stat_activity a
    join pg_locks l on l.pid = a.pid
    join query q on q.pid = l.pid
    left join timescaledb_information.chunks c
        on (c.chunk_schema || '.' || c.chunk_name)::regclass = l.relation::regclass;

... while it was executing.

Note - the compression policy holds a lock on each hypertable it is compressing. So long as it is stuck on a specific hypertable, backfilling to this hypertable (i.e. importing historic readings) doesn't work either, since INSERTs are blocked by this lock (the blocked sessions can be seen with the query sketched below).

This hasn't been a major issue since backfilled historic data is mostly non-critical.
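
For what it's worth, blocked sessions can be listed directly via pg_blocking_pids(); a minimal sketch (plain PostgreSQL, not specific to TimescaleDB):

select pid,
       pg_blocking_pids(pid) as blocked_by,
       wait_event_type,
       state,
       query
  from pg_stat_activity
 where cardinality(pg_blocking_pids(pid)) > 0;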

I found a similar issue (thanks to ChatGPT 4o) that suggested manually compressing the chunk ...

Timescale Community Forum: In a forum post, a user described that all their compression policy jobs had been failing for approximately six weeks. The failures started after upgrading the Timescale extension. Manually running the jobs succeeded, but scheduled executions failed. The user sought advice on obtaining more detailed error information to diagnose the problem. 

So I tried this via ...

select compress_chunk('_timescaledb_internal.<chunk_name>')

... and found that the query ran indefinitely.

So (dumbly) I tried decompressing the chunk manually via ...

select decompress_chunk('_timescaledb_internal.<chunk_name>')

... and it eventually decompressed after 40 minutes! Then I recompressed it, which this time took less than a minute!

I don't understand quite what went wrong or how this fixed it. ChatGPT 4o believed that this decompress/compress resolved metadata inconsistencies & bloat - but my understanding is not good enough to dig into that.

So I re-ran the compression policy, and this time it got stuck on another chunk.

So now I'm running decompress/compress on all chunks via ...

DO $$
DECLARE
    chunk RECORD;
BEGIN
    FOR chunk IN
        SELECT chunk_schema, chunk_name
        FROM timescaledb_information.chunks
        WHERE is_compressed = true
    LOOP
        -- Step 1: Decompress the chunk
        RAISE NOTICE 'Decompressing chunk: %.%', chunk.chunk_schema, chunk.chunk_name;
        EXECUTE format('SELECT decompress_chunk(''%I.%I'')', chunk.chunk_schema, chunk.chunk_name);

        -- Step 2: Compress the chunk
        RAISE NOTICE 'Compressing chunk: %.%', chunk.chunk_schema, chunk.chunk_name;
        EXECUTE format('SELECT compress_chunk(''%I.%I'')', chunk.chunk_schema, chunk.chunk_name);
    END LOOP;
END $$;

Fingers crossed this resolves it

TimescaleDB version affected

2.16.2

PostgreSQL version used

PostgreSQL 14.13, compiled by Visual C++ build 1940, 64-bit

What operating system did you use?

Windows Server 2019 Datacenter 17763.6532

What installation method did you use?

Other

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

This might be hard to reproduce other than by upgrading from version 2.15.3 to 2.16.2 on a Windows server (if I've understood the issue).

@rdmolony rdmolony added the bug label Nov 29, 2024

rdmolony (Author) commented Nov 29, 2024

Also note that (if I'm not mistaken) I should be able to set a timeout on the compression policy like ...

SELECT add_compression_policy('<hypertable-name>', INTERVAL '1 week');
WITH job AS (
    SELECT job_id
    FROM timescaledb_information.jobs
    WHERE hypertable_name = '<hypertable-name>'
        AND proc_name = 'policy_compression'
)
SELECT alter_job(
    job_id => job.job_id,
    max_runtime => INTERVAL '1 hour',
    max_retries => 3,
    retry_period => INTERVAL '10 minutes'
)
FROM job;

rdmolony (Author) commented Dec 2, 2024

I can confirm that, after running the decompress/compress SQL query above, compression jobs are now running again without any issues.

erimatnor (Contributor) commented:

@rdmolony Thank you for reporting this issue. Do you think this issue is specific to Windows?

rdmolony (Author) commented Dec 3, 2024

Thanks @erimatnor, it could well be; I can't be sure.

normalnormie commented:

I was also stuck with "CALL _timescaledb_functions.policy_compression()".
I installed TimescaleDB some days ago and had no issues until today.
TimescaleDB version affected:
timescaledb-2-postgresql-14 (2.17.2~ubuntu22.04)
PostgreSQL version used:
(PostgreSQL) 14.12 (Ubuntu 14.12-0ubuntu0.22.04.1)
How can we reproduce the bug?
I followed https://docs.timescale.com/self-hosted/latest/install/installation-linux/ and inserted data for some days. The specific chunk that freezes policy_compression is from the biggest table and contains 57 million rows. I also ran out of disk space in the past few days, which was resolved before policy_compression started.

I used @rdmolony's script but it got stuck on "SELECT decompress_chunk('_timescaledb_internal._hyper_25_27_chunk')" with a "Lock: relation" wait.
I then followed https://docs.timescale.com/use-timescale/latest/compression/compression-policy/ and paused the compression policy first by finding its job_id and disabling it; the script then finished, and I re-enabled the policy (a sketch of this is below).
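
A minimal sketch of that pause/re-enable step, assuming the hypertable is named '<hypertable-name>' (substitute your own):

-- pause the compression policy job
SELECT alter_job(job_id, scheduled => false)
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_compression'
  AND hypertable_name = '<hypertable-name>';

-- ... run the manual decompress/compress script here ...

-- re-enable the policy afterwards
SELECT alter_job(job_id, scheduled => true)
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_compression'
  AND hypertable_name = '<hypertable-name>';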

normalnormie commented:

An update: I hit the same problem again, but this time I decided to wait some days to see if it resolved on its own, and it did. My app was stopped for a couple of hours, so it was probably new inserts blocking the compression job.
