This is an archive of the discontinued LLVM Phabricator instance.

[X86] combineConcatVectorOps - remove FADD/FSUB/FMUL handling (2-1)
ClosedPublic

Authored by xiangzhangllvm on Apr 11 2023, 12:35 AM.

Details

Summary

Because VADD/VSUB/VMUL can execute on more ports than VINSERT, we want to remove the FADD/FSUB/FMUL handling from combineConcatVectorOps.
For more details, please refer to fb91f0
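For intuition, here is a small AVX-intrinsics sketch of the trade-off (illustrative only: the function names are invented and this is not code from the patch; it assumes the 128-bit halves are already in registers — when the operands come straight from memory the trade-off changes, which is what shows up in the affected tests):

#include <immintrin.h>  // AVX intrinsics; build with -mavx

// Shape produced by the combineConcatVectorOps fold being removed:
// concatenate the operands first, then do one wide add.
// Cost: 2 x VINSERTF128 + 1 x VADDPS.
__m256 concat_then_add(__m128 a_lo, __m128 a_hi, __m128 b_lo, __m128 b_hi) {
  __m256 a = _mm256_insertf128_ps(_mm256_castps128_ps256(a_lo), a_hi, 1);
  __m256 b = _mm256_insertf128_ps(_mm256_castps128_ps256(b_lo), b_hi, 1);
  return _mm256_add_ps(a, b);
}

// Shape this patch prefers to keep: do the two narrow adds, then
// concatenate the results.
// Cost: 2 x VADDPS + 1 x VINSERTF128, i.e. fewer of the port-limited inserts.
__m256 add_then_concat(__m128 a_lo, __m128 a_hi, __m128 b_lo, __m128 b_hi) {
  __m128 lo = _mm_add_ps(a_lo, b_lo);
  __m128 hi = _mm_add_ps(a_hi, b_hi);
  return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}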

Problems

1. This patch causes the affected tests to generate worse code; we need to fix that.
PS: after fb91f0 there was another commit, 649b14, that optimized these tests.

2. We may also need to root-cause why the middle-end vector optimizations didn't handle these cases.
For example, in llvm/test/CodeGen/X86/widen_fadd.ll the function widen_fadd_v2f32_v8f32
is not optimized (in AVX mode) from
from

define void @widen_fadd_v2f32_v8f32(ptr %a0, ptr %b0, ptr %c0) {
  %a2 = getelementptr inbounds i8, ptr %a0, i64 8
  %b2 = getelementptr inbounds i8, ptr %b0, i64 8
  %c2 = getelementptr inbounds i8, ptr %c0, i64 8
  %a4 = getelementptr inbounds i8, ptr %a0, i64 16
  %b4 = getelementptr inbounds i8, ptr %b0, i64 16
  %c4 = getelementptr inbounds i8, ptr %c0, i64 16
  %a6 = getelementptr inbounds i8, ptr %a0, i64 24
  %b6 = getelementptr inbounds i8, ptr %b0, i64 24
  %c6 = getelementptr inbounds i8, ptr %c0, i64 24
  %va0 = load <2 x float>, ptr %a0, align 4
  %vb0 = load <2 x float>, ptr %b0, align 4
  %va2 = load <2 x float>, ptr %a2, align 4
  %vb2 = load <2 x float>, ptr %b2, align 4
  %va4 = load <2 x float>, ptr %a4, align 4
  %vb4 = load <2 x float>, ptr %b4, align 4
  %va6 = load <2 x float>, ptr %a6, align 4
  %vb6 = load <2 x float>, ptr %b6, align 4
  %vc0 = fadd <2 x float> %va0, %vb0
  %vc2 = fadd <2 x float> %va2, %vb2
  %vc4 = fadd <2 x float> %va4, %vb4
  %vc6 = fadd <2 x float> %va6, %vb6
  store <2 x float> %vc0, ptr %c0, align 4
  store <2 x float> %vc2, ptr %c2, align 4
  store <2 x float> %vc4, ptr %c4, align 4
  store <2 x float> %vc6, ptr %c6, align 4
  ret void
}

to

define void @widen_fadd_v2f32_v8f32(ptr %a0, ptr %b0, ptr %c0) {
  %va0 = load <8 x float>, ptr %a0, align 4
  %vb0 = load <8 x float>, ptr %b0, align 4
  %vc0 = fadd <8 x float> %va0, %vb0
  store <8 x float> %vc0, ptr %c0, align 4
  ret void
}

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald Transcript · Apr 11 2023, 12:35 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
xiangzhangllvm requested review of this revision. · Apr 11 2023, 12:35 AM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 11 2023, 12:35 AM
xiangzhangllvm edited the summary of this revision. · Apr 11 2023, 12:46 AM
xiangzhangllvm edited the summary of this revision.
xiangzhangllvm added inline comments. · Apr 11 2023, 1:16 AM
llvm/test/CodeGen/X86/widen_fadd.ll:132

Not sure why the middle end does not vectorize them to

define void @widen_fadd_v2f32_v8f32(ptr %a0, ptr %b0, ptr %c0) {
  %va0 = load <8 x float>, ptr %a0, align 4
  %vb0 = load <8 x float>, ptr %b0, align 4
  %vc0 = fadd <8 x float> %va0, %vb0
  store <8 x float> %vc0, ptr %c0, align 4
  ret void
}

(even if I add the "fast" flag to the fadd instructions)

Have you got example code snippets that more clearly show the regression?

Have you got example code snippets that more clearly show the regression?

I noticed the regression in our big performance test (coremark-pro); it is hard to reduce it to a simple snippet here.

LuoYuanke added a subscriber: jsji. · Apr 16 2023, 6:01 PM
RKSimon accepted this revision. · Apr 17 2023, 2:50 AM

LGTM if it's causing regressions, but I'd appreciate any time you can spend on PR60441

This revision is now accepted and ready to land. · Apr 17 2023, 2:50 AM

LGTM if it's causing regressions, but I'd appreciate any time you can spend on PR60441

It seems there is a ScalarizerPass that can scalarize vectors, but it is not enabled. We could scalarize the small vectors before vectorization and let the vectorizer re-vectorize them. I did a rough experiment with the patch below, and it seems the code at https://godbolt.org/z/sojxs9EGK can be vectorized with the command clang -g0 -O3 -march=x86-64-v4 -ffast-math -mllvm -scalarize-load-store t.cpp -S

diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index 4b759693fec2..a74a77872eb7 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -113,6 +113,7 @@
 #include "llvm/Transforms/Scalar/Reassociate.h"
 #include "llvm/Transforms/Scalar/SCCP.h"
 #include "llvm/Transforms/Scalar/SROA.h"
+#include "llvm/Transforms/Scalar/Scalarizer.h"
 #include "llvm/Transforms/Scalar/SimpleLoopUnswitch.h"
 #include "llvm/Transforms/Scalar/SimplifyCFG.h"
 #include "llvm/Transforms/Scalar/SpeculativeExecution.h"
@@ -968,6 +969,7 @@ PassBuilder::buildModuleSimplificationPipeline(OptimizationLevel Level,
   EarlyFPM.addPass(LowerExpectIntrinsicPass());
   EarlyFPM.addPass(SimplifyCFGPass());
   EarlyFPM.addPass(SROAPass(SROAOptions::ModifyCFG));
+  EarlyFPM.addPass(ScalarizerPass());
   EarlyFPM.addPass(EarlyCSEPass());
   if (Level == OptimizationLevel::O3)
     EarlyFPM.addPass(CallSiteSplittingPass());

LGTM if it's causing regressions, but I'd appreciate any time you can spend on PR60441

OK, many thanks! Let me double-confirm: PR60441 is https://reviews.llvm.org/D60441, right?

LuoYuanke added a comment (edited). · Apr 17 2023, 5:40 PM

LGTM if it's causing regressions, but I'd appreciate any time you can spend on PR60441

OK, many thanks! Let me double-confirm: PR60441 is https://reviews.llvm.org/D60441, right?

I think PR60441 is https://github.com/llvm/llvm-project/issues/60441.

LGTM if it's causing regressions, but I'd appreciate any time you can spend on PR60441

It seems there is a ScalarizerPass that can scalarize vectors, but it is not enabled. We could scalarize the small vectors before vectorization and let the vectorizer re-vectorize them. I did a rough experiment with the patch below, and it seems the code at https://godbolt.org/z/sojxs9EGK can be vectorized with the command clang -g0 -O3 -march=x86-64-v4 -ffast-math -mllvm -scalarize-load-store t.cpp -S
...

That is a very good experiment for re-vectorization. I think we should run more tests on other cases; it could be a big improvement for the vectorization optimization.

LuoYuanke added inline comments. · Apr 17 2023, 6:16 PM
llvm/lib/Target/X86/X86ISelLowering.cpp:56348

Add a comment noting that the shuffle instructions' throughput is lower than that of the fadd/fsub/fmul instructions?
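For illustration, a code comment along these lines could capture that rationale (suggested wording only, not necessarily the text that was committed):

// Don't fold concat(FADD/FSUB/FMUL, ...) into a single wide op: that trades
// two narrow arithmetic ops plus one result insert for one wide op plus two
// operand inserts, and VADDPS/VSUBPS/VMULPS can issue on more ports than the
// VINSERT*/shuffle instructions, so the folded form tends to have worse
// throughput.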

LGTM if it's causing regressions, but I'd appreciate any time you can spend on PR60441

OK, many thanks! Let me double-confirm: PR60441 is https://reviews.llvm.org/D60441, right?

I think PR60441 is https://github.com/llvm/llvm-project/issues/60441.

For https://godbolt.org/z/19564zseG in PR60441, I found that GCC vectorizes it at a very early stage (at an early middle-end stage we can already see the vectorization difference).

LGTM if it's causing regressions, but I'd appreciate any time you can spend on PR60441

Thanks, but frankly speaking, I haven't yet figured out the 2-2 patch to fix the lit tests that became worse.
Do you think we should fix these lit tests in the back end again,
or create an issue for the middle-end folks?

xiangzhangllvm added inline comments. · Apr 17 2023, 6:52 PM
llvm/lib/Target/X86/X86ISelLowering.cpp:56348

No problem

Let me commit it if nobody objects.
For the affected tests (including the cases in PR60441), I think they are better optimized in the middle end (re-vectorization).

PS: the failed test ./clang/test/SemaCXX/warn-unsafe-buffer-usage-fixits-pre-increment.cpp is unrelated to this change.

This revision was landed with ongoing or failed builds. · Apr 19 2023, 7:38 PM
This revision was automatically updated to reflect the committed changes.