Random Core Dump #746

Closed
StephenAtty opened this issue Sep 1, 2024 · 4 comments · Fixed by #761

@StephenAtty

I've got scheduled jobs running to create several sets of maps (one at a time).

Occasionally I get segfaults and core dumps. It's not always on a specific data set, and if I re-run the process it completes OK.

The output from the process looks like this:

Store size 74G | 1/6 Block 51929/51930 (44388 ms)
Store size 87G | 2/6 Block 51928/51930 (248660 ms)
Store size 99G | 3/6 Block 51928/51930 (156323 ms)
Store size 111G | 4/6 Block 51929/51930 (233455 ms)
Store size 131G | 5/6 Block 19581/51930 Segmentation fault (core dumped)
Sun 01 Sep 2024 06:00:55 UTC Completed EU processing

Any pointers on what I can do to investigate these would be great.

@cldellow
Contributor

If you build tilemaker with debug symbols (add -g to CXXFLAGS and CFLAGS in the Makefile), then you might be able to use gdb to inspect the core dump and get something useful.

Or, you could run tilemaker on your data with gdb attached until you get a repro - the debugger will (hopefully!) stop on an assert.

Or, if it's possible to share repro instructions, I could see if I can repro it. I do see that you say it doesn't happen reliably, but I'm hoping that a "soak test" of just constantly re-building the map might show it. Could you share all the necessary inputs? e.g. what version of tilemaker are you using, what is your config.json/process.lua file, and what is your input PBF file?

@StephenAtty
Author

I'm on version 3.0.0.

I'm running regular rebuilds (once a week) of the UK and Ireland PBFs from

http://download.geofabrik.de/europe/united-kingdom-latest.osm.pbf
and
http://download.geofabrik.de/europe/ireland-and-northern-ireland-latest.osm.pbf

I logged this when the Ireland build failed.

Last weekend the UK build failed with a core dump.

This morning both completed OK.

@cldellow
Contributor

Thanks! I've been able to reproduce intermittent segfaults on the current master (#760 (comment)). I'll see if I can sort out what's going on.

I suspect passing --threads 1 might let you avoid the issue, but at the risk of slowing things down quite a lot. Up to you whether that's a reasonable tradeoff, or whether it's preferable to just retry when tilemaker crashes.

@StephenAtty
Author

As it currently takes about 12 minutes to do both, including downloading the files, I might try it. But for the other tilesets I suspect that might be problematic.

Sat 21 Sep 2024 00:30:01 UTC Getting UK Data Set
2024-09-21 00:35:51 URL:http://download.geofabrik.de/europe/united-kingdom-latest.osm.pbf [1871466474/1871466474] -> "../data/united-kingdom-latest.osm.pbf" [1]
Sat 21 Sep 00:35:51 UTC 2024
Sat 21 Sep 2024 00:35:51 UTC Starting UK processing
Sat 21 Sep 2024 00:40:49 UTC Completed UK processing
Sat 21 Sep 2024 00:40:51 UTC Getting Ireland Data Set
2024-09-21 00:41:54 URL:http://download.geofabrik.de/europe/ireland-and-northern-ireland-latest.osm.pbf [331667055/331667055] -> "../data/ireland-and-northern-ireland-latest.osm.pbf" [1]
Sat 21 Sep 2024 00:41:54 UTC Starting Ireland processing
Sat 21 Sep 2024 00:42:42 UTC Completed Ireland processing

cldellow added a commit to cldellow/tilemaker that referenced this issue Sep 21, 2024
bug 1: PooledString resizes `vector` without locks

`tables` is a shared pool of `char*` pointers, where each pointer
points to a 64KB memory chunk.

Some `PooledString`s identify their content by an index into this pool.

However, `tables` can grow. We correctly guard against concurrent
mutation (for example, here:
https://github.com/systemed/tilemaker/blob/7f0343045687ab2125910c81eed598c58fc2ff2d/src/pooled_string.cpp#L33-L39)

But readers expect to be able to read it without a lock, for example
here, where the result of a read will be used to do a write: https://github.com/systemed/tilemaker/blob/7f0343045687ab2125910c81eed598c58fc2ff2d/src/pooled_string.cpp#L54

This pattern isn't safe with `vector`, since when the `vector` grows,
it invalidates all existing pointers. It is safe with `deque`, so the fix
is to switch to a `deque`.
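
For illustration only (this is not tilemaker's code): a minimal C++ sketch of the property that bug 1 relies on. Appending to a `std::vector` can reallocate its storage and move every element, while `push_back` on a `std::deque` leaves references to existing elements valid. The helper name `firstElementAddressStable` is hypothetical. The sketch is single-threaded and only shows address stability; it makes no claim about which concurrent accesses are otherwise race-free.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical helper: reports whether the first element of a container
// stays at the same address while `extra` more elements are appended.
template <typename Container>
bool firstElementAddressStable(std::size_t extra) {
    Container pool;
    pool.push_back(nullptr);

    // Store the address as an integer so we never compare a dangling pointer.
    const std::uintptr_t before =
        reinterpret_cast<std::uintptr_t>(&pool.front());

    for (std::size_t i = 0; i < extra; ++i) {
        pool.push_back(nullptr);
        // Check after every append: a reallocation shows up immediately,
        // while the old and new buffers are still distinct.
        if (reinterpret_cast<std::uintptr_t>(&pool.front()) != before) {
            return false;
        }
    }
    return true;
}
```

Instantiated with `std::deque<char*>`, the helper returns true for any number of appends; with `std::vector<char*>` it returns false as soon as the vector outgrows its capacity.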

bug 2: vector layer metadata `map` isn't guarded

`layers` is a shared object common to all OsmLuaProcessing threads.

`layers.layers` is a `vector` that gets initialized and populated fully on
the main thread before the Lua threads start, so accessing it without
locks is fine.

`layers.layers[layer].attributeMap` is just a vanilla `map`, though,
so mutating it from multiple threads without locks is dangerous.

I just added a coarse lock for now. On my 16-core machine, it didn't
seem to introduce contention, so I didn't bother to do anything fancier
to minimize locking overhead.
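
For illustration only (this is not the actual patch): a minimal C++ sketch of a coarse lock around a shared `std::map`, assuming a single `std::mutex` taken on every access. The names `GuardedAttributeMap` and `fillFromThreads`, and the use of `int` as the value type, are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical wrapper: one mutex guards every read and write of the map.
class GuardedAttributeMap {
public:
    void set(const std::string& key, int type) {
        std::lock_guard<std::mutex> lock(mutex_);
        attributes_[key] = type;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return attributes_.size();
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, int> attributes_;
};

// Spawns `threadCount` writers that each insert `keysPerThread` distinct
// keys, then returns the number of entries in the map.
std::size_t fillFromThreads(unsigned threadCount, unsigned keysPerThread) {
    GuardedAttributeMap attributes;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        workers.emplace_back([&attributes, t, keysPerThread] {
            for (unsigned k = 0; k < keysPerThread; ++k) {
                attributes.set(
                    "t" + std::to_string(t) + "_k" + std::to_string(k), 0);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return attributes.size();
}
```

A single mutex serializes all writers, which is simple to reason about. Whether it causes noticeable contention depends on how often the map is touched relative to the rest of the work.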

I will optimistically say that this fixes systemed#746.