Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clang14.0.7 from Android NDK crashes when compiling bf16-gemm-1x4c8-minmax-neonbf16-bfdot.c #4775

Open
georgthegreat opened this issue May 4, 2023 · 14 comments

Comments

@georgthegreat
Copy link
Contributor

Hi.

I am trying to update xnnpack to 2023-04-18 / 71f70de in order to update tensorflow-lite to 2.12.0.

Unfortunately clang for android segfaults when compiling src/bf16-gemm/gen/bf16-gemm-1x4c8-minmax-neonbf16-bfdot.c for android-armv7a.

I am able to reproduce this:

${NDK}/llvm-toolchain/bin/clang --target=armv7a-linux-androideabi21 -isystem ${NDK}/sources/cxx-stl/llvm-libc++abi/include -fsigned-char -march=armv7-a -mfloat-abi=softfp -c -o ${SRC}/contrib/restricted/google/xnnpack/src/bf16-gemm/gen/bf16-gemm-6x8c2-minmax-neonbf16-bfdot-lane-ld128.c.o -I${SRC} -I${SRC}/contrib/libs/fxdiv -I${SRC}/contrib/libs/pthreadpool/include -I${SRC}/contrib/restricted/google/xnnpack/include -I${SRC}/contrib/libs/fxdiv/include -I${SRC}/contrib/restricted/cpuinfo/include -I${SRC}/contrib/restricted/google/xnnpack/src -I${SRC}/contrib/libs/cxxsupp/libcxx/include -I${SRC}/contrib/libs/fp16/include -I${SRC}/contrib/libs/zlib/include -I${SRC}/contrib/libs/double-conversion -I${SRC}/contrib/libs/libc_compat/include/ifaddrs -fdebug-prefix-map=${SRC}=/-B -Xclang -fdebug-compilation-dir -Xclang /tmp -pipe -O3 -g -fexceptions -fno-common -fcolor-diagnostics -faligned-allocation -fdebug-default-version=4 -ffunction-sections -fdata-sections -w  -DXNN_ENABLE_ARM_BF16=1 -DXNN_ENABLE_ARM_DOTPROD=1 -DXNN_ENABLE_ARM_FP16_SCALAR=1 -DXNN_ENABLE_ARM_FP16_VECTOR=1 -DXNN_ENABLE_ASSEMBLY=1 -DXNN_ENABLE_DWCONV_MULTIPASS=0 -DXNN_ENABLE_GEMM_M_SPECIALIZATION=1 -DXNN_ENABLE_JIT=0 -DXNN_ENABLE_MEMOPT=1 -DXNN_ENABLE_RISCV_VECTOR=1 -DXNN_ENABLE_SPARSE=1 -DXNN_LOG_LEVEL=0 -Werror-implicit-function-declaration -nostdinc++ -fPIE -fPIE -march=armv8.2-a+bf16 -mfpu=neon-fp-armv8 ${SRC}/contrib/restricted/google/xnnpack/src/bf16-gemm/gen/bf16-gemm-6x8c2-minmax-neonbf16-bfdot-lane-ld128.c

I have tested NDK's r25b and r25c, but the segfault still persists and there is no way to fix this.

fatal error: error in backend: Cannot select: 0x90a21a8: v4bf16 = extract_subvector 0x909e178, Constant:i32<0>, /home/thegeorg/arcadia/contrib/restricted/google/xnnpack/src/bf16-gemm/gen/bf
16-gemm-6x8c2-minmax-neonbf16-bfdot-lane-ld128.c:173:33
  0x909e178: v8bf16,i32,ch = ARMISD::VLD1_UPD<(load (s128) from %ir.362, align 2)> 0x8f06068, 0x8cafc48, 0x8cb05a0, Constant:i32<1>, /home/thegeorg/arcadia/contrib/restricted/google/xnnpack
/src/bf16-gemm/gen/bf16-gemm-6x8c2-minmax-neonbf16-bfdot-lane-ld128.c:163:32
    0x8cafc48: i32,ch = CopyFromReg 0x8f06068, Register:i32 %84, /home/thegeorg/arcadia/contrib/restricted/google/xnnpack/src/bf16-gemm/gen/bf16-gemm-6x8c2-minmax-neonbf16-bfdot-lane-ld128.
c:163:32
      0x8cb0260: i32 = Register %84
    0x8cb05a0: i32,ch = CopyFromReg 0x8f06068, Register:i32 %70, /home/thegeorg/arcadia/contrib/restricted/google/xnnpack/src/bf16-gemm/gen/bf16-gemm-6x8c2-minmax-neonbf16-bfdot-lane-ld128.
c:158:89
      0x8cafe50: i32 = Register %70
    0x909e380: i32 = Constant<1>
  0x90b3a08: i32 = Constant<0>
In function: xnn_bf16_gemm_minmax_ukernel_6x8c2__neonbf16_bfdot_lane_ld128
PLEASE submit a bug report to https://github.com/android-ndk/ndk/issues and include the crash backtrace, preprocessed source, and associated run script.

@Maratyszcza, could you, please, help me to find a workaround for this?

@leleliu008
Copy link

step1:

sed -i 's|-march=armv6 -mfpu=vfp|-march=armv7-a -mfpu=neon|' CMakeLists.txt

step2:
add cmake option: -DXNNPACK_ENABLE_ARM_BF16=OFF

successfully built now, but don't know if the right way.

@ngzhian
Copy link
Collaborator

ngzhian commented May 10, 2023

Manage to get the repro steps down to:

android-ndk-r25c/toolchains/llvm/prebuilt/linux-x86_64/bin/clang --target=armv7a-linux-androideabi21 -Iinclude -Isrc -O1 -march=armv8.2-a+bf16 -mfpu=neon-fp-armv8 src/bf16-gemm/gen/bf16-gemm-6x8c2-minmax-neonbf16-bfdot-lane-ld128.c

Note that -O1, -O2, -O3 fails, but -O0 is fine.

taking a closer look

@ngzhian
Copy link
Collaborator

ngzhian commented May 10, 2023

add cmake option: -DXNNPACK_ENABLE_ARM_BF16=OFF is the easiest way, it doesn't include the problematic c file.
@georgthegreat if you are getting blocked, that's probably the easiest way.

@ngzhian
Copy link
Collaborator

ngzhian commented May 10, 2023

https://godbolt.org/z/656PqT6o1
I think it is a clang bug, 16.0.0 fixes it.

@ngzhian
Copy link
Collaborator

ngzhian commented May 10, 2023

I found that I can workaround the bug with changing:

const uint32x4_t va0c01 = vdupq_lane_u32(vreinterpret_u32_bf16(vget_low_bf16(va0)), 0);

to

const uint32x4_t va0c01 = vdupq_lane_u32(vget_low_u32(vreinterpretq_u32_bf16(va0)), 0);

just the ordering of interpret and get_low.

@ngzhian
Copy link
Collaborator

ngzhian commented May 10, 2023

vout0x0123 = vreinterpret_bf16_u16(vext_u16(vreinterpret_u16_bf16(vout0x0123), vreinterpret_u16_bf16(vout0x0123), 2)); is still a problem though

I think i can change it to vout0x0123 = vdup_lane_bf16(vout0x0123, 2);

@ngzhian
Copy link
Collaborator

ngzhian commented May 10, 2023

Hm, looks like we don't test building of these microkernels: https://github.com/google/XNNPACK/blob/master/scripts/build-android-armv7.sh#L59

We might also need to update the ndk version we are testing with: https://github.com/google/XNNPACK/blob/master/.github/workflows/build.yml#LL172C28-L172C28

@georgthegreat
Copy link
Contributor Author

georgthegreat commented May 10, 2023

Hm, looks like we don't test building of these microkernels:

There is even a comment about this specific issue:

# BF16 instructions cause ICE in Android NDK compiler

@ngzhian
Copy link
Collaborator

ngzhian commented May 10, 2023

I think the easiest fix is for you to add this compile flag when building XNNPACK -DXNNPACK_ENABLE_ARM_BF16=OFF, until the ndk ships a newer version of clang.

@georgthegreat
Copy link
Contributor Author

Yep, this is what I did for now.

@georgthegreat
Copy link
Contributor Author

NDK maintainers claim that canary builds with clang16 can compile this code.
I think there is nothing to be done from xnnpack side, so this may be closed if there is no workaround / nobody wants to search for it.

@RobertFlatt
Copy link

As I read the posts above there are two possible workarounds, either could be implemented in XNNPACK.

Just waiting for some future NDK where some future Clang will be fixed, is I think an insufficient response.

@Maratyszcza
Copy link
Contributor

The recommended work-around is to disable ARM BF16 extensions using either -DXNNPACK_ENABLE_ARM_BF16=OFF option (for CMake) or --define xnn_enable_arm_bf16=false option (for Bazel).

@RobertFlatt
Copy link

RobertFlatt commented Jun 11, 2023

Using tensorflow 2.13.0-rc1 and -DXNNPACK_ENABLE_ARM_BF16=OFF

cmake  -DCMAKE_TOOLCHAIN_FILE=../android-ndk-r25c/build/cmake/android.toolchain.cmake -DANDROID_ABI=armeabi-v7a -DTFLITE_ENABLE_XNNPACK=ON -DXNNPACK_ENABLE_ARM_BF16=OFF ../tensorflow_src/tensorflow/lite  
cmake --build . -j

Does not workaround the issue.

And we should not expect it to because this is defined as the default behavior https://github.com/google/XNNPACK/blob/master/scripts/build-android-armv7.sh#L59 and we know the default fails to build.
Not clear why it was suggested as a workaround.

I've been trying to get some attention paid to this for 3.5 months #4348
Please implement a fix.
It seems reasonable to suggest that: the relationship between the code and the build tools is everywhere and always the responsibility of the programmer responsible for the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants