This patch extend the @llvm.vector.reduce.fadd to take another integer
argument to indicate the order to apply.
|1,350 ms||x64 debian > Clang.CodeGen/X86::avx512-reduceIntrin.c|
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/14.0.0/include -nostdsysteminc -ffreestanding /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/X86/avx512-reduceIntrin.c -O0 -triple=x86_64-apple-darwin -target-cpu skylake-avx512 -emit-llvm -o - -Wall -Werror | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/X86/avx512-reduceIntrin.c
|1,680 ms||x64 debian > Clang.CodeGen/X86::avx512fp16-builtins.c|
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/14.0.0/include -nostdsysteminc -ffreestanding -flax-vector-conversions=none /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/X86/avx512fp16-builtins.c -triple=x86_64-unknown-unknown -target-feature +avx512fp16 -emit-llvm -o - -Wall -Werror | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/X86/avx512fp16-builtins.c
|1,640 ms||x64 debian > Clang.CodeGen/X86::avx512vlfp16-builtins.c|
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/14.0.0/include -nostdsysteminc -flax-vector-conversions=none -ffreestanding /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/X86/avx512vlfp16-builtins.c -triple=x86_64-unknown-unknown -target-feature +avx512vl -target-feature +avx512fp16 -emit-llvm -o - -Wall -Werror | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/X86/avx512vlfp16-builtins.c
|1,340 ms||x64 debian > Clang.utils/update_cc_test_checks::check-globals.test|
Script: -- : 'RUN: at line 1'; rm -rf /var/lib/buildkite-agent/builds/llvm-project/build/tools/clang/test/utils/update_cc_test_checks/Output/check-globals.test.tmp && mkdir /var/lib/buildkite-agent/builds/llvm-project/build/tools/clang/test/utils/update_cc_test_checks/Output/check-globals.test.tmp
|600 ms||x64 debian > Clang.utils/update_cc_test_checks::global-hex-value-regex.test|
Script: -- : 'RUN: at line 1'; rm -rf /var/lib/buildkite-agent/builds/llvm-project/build/tools/clang/test/utils/update_cc_test_checks/Output/global-hex-value-regex.test.tmp && mkdir /var/lib/buildkite-agent/builds/llvm-project/build/tools/clang/test/utils/update_cc_test_checks/Output/global-hex-value-regex.test.tmp
|View Full Test Results (516 Failed)|
Please note that this patch is WIP.
After we add a new argument to the intrinsic, how can we update the previous lit tests automatically? I haven't tried to update all the tests by hand, cuz it looks like takes a huge amount of work and I believe there should be a way to do things more easily and I don't know.
Thanks to have a look at this!
The relevant context is here: https://reviews.llvm.org/D116736
I don't get this change. We already use reassoc FMF to allow a non-ordered reduction, what's the purpose of the new flag?
Well, maybe I didn't get the reviewer's ideas right? Any suggestions are appreciated!
Are we following someone else's standard here, or can we specify the behavior? If a specific reduction order is required, I would very much expect that to be "ordered reduction".
I'm not a fan of introducing a third order here if we can avoid it. The value of having something other than ordered/unordered is not clear to me.
I assume ordered here means sequential. The clang builtin specifies a tree-wise reduction order. The main motivation for that order is to enable faster execution and consistent results on different HW architectures. At the moment the existing intrinsic is either slow & consistent or fast & inconsistent.
I'm not convinced about all this implied complexity.
I think clang will just have to emit expanded form of the reduction in that case.
Sure it could, but it seems unfortunate that a common reduction pattern cannot be expressed with the reduction intrinsic. The main benefit of using the reduction intrinsic here is to make instruction selection substantially easier. AFAICT the main complexity will be in the backends, which will now have to be able to deal with the new order. But given that at the moment only AArch64 and RISCV, it should't be too bad?
It's worth noting that "sequential" is only desirable for matching scalar code (which is important, but isn't everything). Explicit vector code essentially always prefers a tree reduction (either even + odd or low + high, depending on arch/uarch, but either is better than sequential). Either normal tree reduction delivers faster results with smaller average errors, and they are still reproducible. Having a defined tree reduction is a great tool for explicit vectorizers.