[LV] Combine vector reductions parts in tree instead of serially.
Needs Review · Public

Authored by fhahn on Jan 17 2022, 9:50 AM.

Details

Summary

At the moment, LV chains together the reduction values for all parts
serially. This results in larger than necessary dependency chains.

This patch updates LV to repeatedly combine adjacent pairs of parts,
for arithmetic opcodes, until a single value remains.
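
The difference between the two combining orders can be sketched outside of LV. The following is a minimal standalone illustration, not the patch's code: `combineSerially`, `combineInTree`, `serialDepth` and `treeDepth` are hypothetical names, the "parts" are plain scalars rather than vector values, and both combiners assume a non-empty input.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Serial combine: ((p0 op p1) op p2) op p3 ...
// Every step depends on the previous one. Assumes Parts is non-empty.
template <typename T, typename Op>
T combineSerially(const std::vector<T> &Parts, Op Combine) {
  T Result = Parts.front();
  for (std::size_t I = 1; I < Parts.size(); ++I)
    Result = Combine(Result, Parts[I]);
  return Result;
}

// Tree combine: repeatedly merge adjacent pairs until one value remains.
// Pairs within one round are independent of each other.
// Assumes Parts is non-empty.
template <typename T, typename Op>
T combineInTree(std::vector<T> Parts, Op Combine) {
  while (Parts.size() > 1) {
    std::vector<T> Next;
    for (std::size_t I = 0; I + 1 < Parts.size(); I += 2)
      Next.push_back(Combine(Parts[I], Parts[I + 1]));
    if (Parts.size() % 2 != 0)
      Next.push_back(Parts.back()); // odd leftover carries over unchanged
    Parts = std::move(Next);
  }
  return Parts.front();
}

// Number of dependent combine steps on the critical path.
// Both helpers assume NumParts >= 1.
inline unsigned serialDepth(unsigned NumParts) { return NumParts - 1; }

inline unsigned treeDepth(unsigned NumParts) {
  unsigned Depth = 0;
  while (NumParts > 1) {
    NumParts = (NumParts + 1) / 2; // one round halves the count, rounding up
    ++Depth;
  }
  return Depth;
}
```

For example, with 8 parts the serial chain has 7 dependent steps while the tree has 3; with 2 parts both have a single step, so any gain depends on the interleave count.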

Diff Detail

Unit Tests: Failed

Time  Test
3,540 ms  x64 debian > Clang.utils/update_cc_test_checks::check-globals.test
Script: -- : 'RUN: at line 1'; rm -rf /var/lib/buildkite-agent/builds/llvm-project/build/tools/clang/test/utils/update_cc_test_checks/Output/check-globals.test.tmp && mkdir /var/lib/buildkite-agent/builds/llvm-project/build/tools/clang/test/utils/update_cc_test_checks/Output/check-globals.test.tmp
2,120 ms  x64 debian > Clang.utils/update_cc_test_checks::global-hex-value-regex.test
Script: -- : 'RUN: at line 1'; rm -rf /var/lib/buildkite-agent/builds/llvm-project/build/tools/clang/test/utils/update_cc_test_checks/Output/global-hex-value-regex.test.tmp && mkdir /var/lib/buildkite-agent/builds/llvm-project/build/tools/clang/test/utils/update_cc_test_checks/Output/global-hex-value-regex.test.tmp
2,090 ms  x64 debian > Clang.utils/update_cc_test_checks::global-value-regex.test
Script: -- : 'RUN: at line 1'; rm -rf /var/lib/buildkite-agent/builds/llvm-project/build/tools/clang/test/utils/update_cc_test_checks/Output/global-value-regex.test.tmp && mkdir /var/lib/buildkite-agent/builds/llvm-project/build/tools/clang/test/utils/update_cc_test_checks/Output/global-value-regex.test.tmp
500 ms  x64 debian > HWAddressSanitizer-x86_64.TestCases/Linux::decorate-proc-maps.c
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -m64 -gline-tables-only -fsanitize=hwaddress -fuse-ld=lld -fsanitize-hwaddress-experimental-aliasing -mllvm -hwasan-generate-tags-with-calls=1 -mllvm -hwasan-globals -mllvm -hwasan-use-short-granules -mllvm -hwasan-instrument-landing-pads=0 -mllvm -hwasan-instrument-personality-functions -mllvm -hwasan-globals=0 -g /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/hwasan/TestCases/Linux/decorate-proc-maps.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/hwasan/X86_64/TestCases/Linux/Output/decorate-proc-maps.c.tmp
120 ms  x64 debian > LLVM.Transforms/LoopVectorize/AMDGPU::packed-math.ll
Script: -- : 'RUN: at line 2'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/Transforms/LoopVectorize/AMDGPU/packed-math.ll -loop-vectorize -dce -instcombine -S | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck -check-prefix=GFX9 /var/lib/buildkite-agent/builds/llvm-project/llvm/test/Transforms/LoopVectorize/AMDGPU/packed-math.ll
View Full Test Results (417 Failed)

Event Timeline

fhahn created this revision. · Jan 17 2022, 9:50 AM
fhahn requested review of this revision. · Jan 17 2022, 9:50 AM
Herald added a project: Restricted Project. · Jan 17 2022, 9:50 AM

Does this alter much? Or do we end up redistributing them anyway? https://godbolt.org/z/z4nf5hPna

It won't have a massive impact in general, but it shaves off a few cycles, depending on the interleave count.

AFAICT the redistributions in https://godbolt.org/z/z4nf5hPna are done by ReassociatePass, which apparently likes to turn parallel reduction trees into serial ones (but that's a separate issue, I think), as for @float2, which looks like it got serialized. I don't think any passes that run after the vectorizer try to shorten reduction chains: https://godbolt.org/z/v4K4aK3a1

Do we think this is something that should be done in general? This looks like it will allow FP instructions to be reordered under -hints-allow-reordering=true without fast-math flags, which would not otherwise be reassociable. But the other cases could always be done by the backend if it considered it profitable.
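
To make the FP concern concrete, here is a small standalone sketch of why combining order matters for floating point. It is an illustration only, not LV output: `sumSerial` and `sumTree4` are hypothetical helpers, and the result assumes IEEE 754 doubles evaluated at double precision with no fast-math options.

```cpp
#include <cstddef>
#include <vector>

// Serial order: ((a + b) + c) + d. Assumes V is non-empty.
inline double sumSerial(const std::vector<double> &V) {
  double Acc = V[0];
  for (std::size_t I = 1; I < V.size(); ++I)
    Acc += V[I];
  return Acc;
}

// Tree order for exactly four values: (a + b) + (c + d).
// Assumes V has at least four elements.
inline double sumTree4(const std::vector<double> &V) {
  double Lo = V[0] + V[1];
  double Hi = V[2] + V[3];
  return Lo + Hi;
}

// Sample input: each 1.0 is below the rounding granularity of 1e17,
// so whether it survives depends on what it is added to.
inline std::vector<double> sampleValues() {
  return {1e17, 1.0, -1e17, 1.0};
}
```

Under those assumptions, `sumSerial(sampleValues())` yields 1.0 while `sumTree4(sampleValues())` yields 0.0, since in the tree order both 1.0 terms are absorbed by the large operands. Integer add, by contrast, gives the same result in either order.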