Here is one of the most profoundly bewildering bugs I have ever seen. The Kalman fitter tests in telescope geometries crash with a segmentation fault when run through SYCL with oneAPI 2024.2.
Reproduction
To reproduce the bug, perform the following set of actions:
# Can also be done without the Docker container, but this is easier
$ docker run -it ghcr.io/acts-project/ubuntu2404_oneapi:55
$ git clone https://github.com/acts-project/traccc.git
$ (cd traccc; git checkout f7d9df8)
$ source /opt/intel/oneapi/setvars.sh --include-intel-llvm
# Building for the spir64_x86_64 target causes the compiler to crash, which is a whole different issue
$ export SYCLFLAGS="-fsycl -fsycl-targets=spir64"
$ cmake -S traccc -B build -DCMAKE_BUILD_TYPE=Debug -DTRACCC_BUILD_TESTING=ON -DTRACCC_BUILD_SYCL=ON
$ cmake --build build -- -j $(nproc) traccc_test_sycl
$ build/bin/traccc_test_sycl --gtest_filter="SYCLKalmanFitTelescopeValidation/KalmanFittingTelescopeTests.Run/*"
This will produce the following error:
Running main() from /build/_deps/googletest-src/googletest/src/gtest_main.cc
Note: Google Test filter = SYCLKalmanFitTelescopeValidation/KalmanFittingTelescopeTests.Run/*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from SYCLKalmanFitTelescopeValidation/KalmanFittingTelescopeTests
[ RUN ] SYCLKalmanFitTelescopeValidation/KalmanFittingTelescopeTests.Run/0
Running Seeding on device: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics
WARNING: No entries in volume finder
Detector check: OK
*** Break *** segmentation violation
So, we have a segmentation fault in this executable.
Diagnostics
At this point you may, just like I did, naively assume that this is some memory error in our code. Wouldn't that be nice and easy to fix? But nothing could be further from the truth, as gdb shows us:
$ apt install -y gdb
$ gdb -ex run --args build/bin/traccc_test_sycl --gtest_filter="SYCLKalmanFitTelescopeValidation/KalmanFittingTelescopeTests.Run/*"
Thread 1 "traccc_test_syc" received signal SIGSEGV, Segmentation fault.
0x00007f51d97e14f8 in llvm::vpo::VPlanTTICostModel::getLoadStoreIndexSize(llvm::vpo::VPLoadStoreInst const*) const () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
(gdb) bt
...
#0 0x00007f51d97e14f8 in llvm::vpo::VPlanTTICostModel::getLoadStoreIndexSize(llvm::vpo::VPLoadStoreInst const*) const () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#1 0x00007f51d8d8f7e7 in llvm::vpo::VPlanTTICostModel::getLoadStoreCost(llvm::vpo::VPLoadStoreInst const*, llvm::Align, unsigned int, bool) const () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#2 0x00007f51d8fdd45d in llvm::vpo::VPlanTTICostModel::getTTICostForVF(llvm::vpo::VPInstruction const*, unsigned int) () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#3 0x00007f51d8fdd2bc in llvm::vpo::VPlanTTICostModel::getTTICost(llvm::vpo::VPInstruction const*) () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#4 0x00007f51da8ee56f in llvm::vpo::VPlanCostModelWithHeuristics<llvm::vpo::HeuristicsList<llvm::vpo::VPInstruction const>, llvm::vpo::HeuristicsList<llvm::vpo::VPBasicBlock const>, llvm::vpo::HeuristicsList<llvm::vpo::VPlanVector const, llvm::vpo::VPlanCostModelHeuristics::HeuristicSpillFill, llvm::vpo::VPlanCostModelHeuristics::HeuristicUnroll> >::getCostImpl(llvm::vpo::VPInstruction const*, llvm::raw_ostream*) ()
from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#5 0x00007f51da8ee427 in llvm::vpo::VPlanCostModelWithHeuristics<llvm::vpo::HeuristicsList<llvm::vpo::VPInstruction const>, llvm::vpo::HeuristicsList<llvm::vpo::VPBasicBlock const>, llvm::vpo::HeuristicsList<llvm::vpo::VPlanVector const, llvm::vpo::VPlanCostModelHeuristics::HeuristicSpillFill, llvm::vpo::VPlanCostModelHeuristics::HeuristicUnroll> >::getCostImpl(llvm::vpo::VPBasicBlock const*, llvm::raw_ostream*) ()
from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
...
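So the issue is not really on our end per se: every frame in the backtrace lives in libintelocl.so, so the crash is happening inside Intel's SPIR compiler (in its VPlan vectorizer cost model, judging by the symbol names) rather than in our own code. Aight.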
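If you want to check the "which library is crashing" claim without eyeballing a long backtrace, something like the following rough sketch can help. The function name frames_by_library is made up for illustration, and it assumes you have saved or pasted the gdb backtrace as text; it simply counts how many frames come from each shared object (paths containing spaces are not handled).

```shell
# Hypothetical helper (not part of traccc or gdb): read a gdb backtrace on
# stdin and print "<frame count> <shared object>" per library, most frequent first.
frames_by_library() {
    grep -o 'from /[^ ]*' |      # keep only the "from /path/to/lib.so" parts
        sed 's/^from //' |       # strip the "from " prefix
        sort | uniq -c |         # count frames per library
        sort -rn |               # most frequent library first
        awk '{print $1, $2}'     # normalise whitespace
}

# Example using frames #0 and #3 from the backtrace above
frames_by_library <<'EOF'
#0 0x00007f51d97e14f8 in llvm::vpo::VPlanTTICostModel::getLoadStoreIndexSize(llvm::vpo::VPLoadStoreInst const*) const () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
#3 0x00007f51d8fdd2bc in llvm::vpo::VPlanTTICostModel::getTTICost(llvm::vpo::VPInstruction const*) () from /opt/intel/oneapi/compiler/2024.2/lib/libintelocl.so
EOF
```

If the summary lists only libintelocl.so near the top of the stack, that supports the reading that the fault is raised inside Intel's library, although it does not rule out our code feeding it something it cannot handle.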
Workarounds
This is where it gets truly spicy. I've been able to identify two different ways to avoid the segmentation fault. Of course, both of them break the actual test, but they do make it run:
In simulation/include/traccc/simulation/simulator.hpp, comment out line 97 (p.propagate(propagation, actor_states);).
In core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp, comment out lines 185 (propagator.propagate(propagation, fitter_state());) and 188 (smooth(fitter_state);).
These functions are completely independent, and one of them runs on the host, the other runs on the device. Lmao.
Conclusion
I don't even know at this point, but most likely there is something very funky happening in oneAPI right now. It could also be some subtle bug in our code, but I haven't been able to find it.
As shown in acts-project#655, this is causing a lot of headaches. I am looking for a
fix, but in the meantime this is holding up acts-project#628, so I want to
temporarily disable these tests.