This is an archive of the discontinued LLVM Phabricator instance.

[x86] limit vector increment fold to allow load folding
ClosedPublic

Authored by spatel on Oct 25 2021, 9:36 AM.

Details

Summary

The tests are based on the example from:
https://llvm.org/PR52032

I suspect that it looks worse than it actually is. :)
That is, llvm-mca says there's no uop/timing difference with the load folding and pcmpeq vs. broadcast on Haswell (and probably other targets).
The load-folding definitely makes the code smaller, so it's good for that at least. So this requires carving a narrow hole in the transform to get just this case without changing others that look good as-is (in other words, the transform still seems good for most examples).

Diff Detail

Event Timeline

spatel created this revision.Oct 25 2021, 9:36 AM
spatel requested review of this revision.Oct 25 2021, 9:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 25 2021, 9:36 AM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
900

What about multi-use load, we usually don't load-fold those?

spatel marked an inline comment as done.Oct 25 2021, 11:53 AM
spatel added inline comments.
llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
900

Yes, good point. I really want to use the same logic as in X86ISelLowering's MayFoldLoad, but it's a static function. Let me update that.

spatel updated this revision to Diff 382078.Oct 25 2021, 11:59 AM
spatel marked an inline comment as done.

Patch updated:

  1. Make mayFoldLoad visible to X86ISelDAGToDAG and use it in the predicate.
  2. Added a test with extra uses of loads, so we can see the result in that case (unchanged / negative test for this patch).

I can make the helper visibility/formatting changes an NFC pre-commit if that looks ok. We have functions like that scattered around, so it's not clear to me if there's a better way.

craig.topper added inline comments.Oct 25 2021, 12:12 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
5053

While you're in here, can you use cast here since the dyn_cast isn't checked for null.

SGTM in general.

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
901

I'm stuck on this one-use check. Is it really reasonable?
We don't really know that the constant is used nearby, do we?
There doesn't seem to be a test for it.
https://godbolt.org/z/Ea9sad3Ya

spatel added inline comments.Oct 25 2021, 12:21 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
5053

Yes - I can do that as a 1-liner. Beyond that, I wasn't sure what cleanup we want to do.
Could update this group of 4 "MayFold*" functions together (pull the declarations into the header and X86 namespace even though there's no current external user of the other 3)?

spatel added inline comments.Oct 25 2021, 1:18 PM
llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
901

Right, we don't know exactly where the other use of that constant is or even if it's part of another increment op with this check, so it's a (weak) heuristic.

I'll add a test like you've shown. Without that check, we would alter codegen on it as shown below. It's a close call, but I think it's a regression to increase the load uops on that -- even if it is one less macro instruction.

diff --git a/llvm/test/CodeGen/X86/combine-sub.ll b/llvm/test/CodeGen/X86/combine-sub.ll
index a399c5175dd6..5090895c0ab8 100644
--- a/llvm/test/CodeGen/X86/combine-sub.ll
+++ b/llvm/test/CodeGen/X86/combine-sub.ll
@@ -290,9 +290,8 @@ define void @PR52032_oneuse_constant(<8 x i32>* %p) {
 ;
 ; AVX-LABEL: PR52032_oneuse_constant:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vmovdqu (%rdi), %ymm0
-; AVX-NEXT:    vpcmpeqd %ymm1, %ymm1, %ymm1
-; AVX-NEXT:    vpsubd %ymm1, %ymm0, %ymm0
+; AVX-NEXT:    vpbroadcastd {{.*#+}} ymm0 = [1,1,1,1,1,1,1,1]
+; AVX-NEXT:    vpaddd (%rdi), %ymm0, %ymm0
 ; AVX-NEXT:    vmovdqu %ymm0, (%rdi)
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
spatel updated this revision to Diff 382106.Oct 25 2021, 1:29 PM

Patch updated:
Added one-use-of-constant "1" test (currently negative for this patch) with explanatory comment.

craig.topper added inline comments.Oct 25 2021, 1:49 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
5042

Not related to this patch, but this function is incorrect for SSE which requires alignment or a subtarget feature that disables the alignment check. Are all existing uses scalar or AVX only?

craig.topper added inline comments.Oct 25 2021, 1:51 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
5042

We also don't fold non-temporal loads if we have an instruction for it. See useNonTemporalLoad

RKSimon added inline comments.Oct 26 2021, 3:20 AM
llvm/lib/Target/X86/X86ISelLowering.h
915

(minor) Add description now that its exposed

spatel added inline comments.Oct 26 2021, 8:12 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
5042

Let's try to deal with this first (and fold in the formatting changes to that patch):
D112545

spatel updated this revision to Diff 382643.Oct 27 2021, 6:27 AM

Patch updated:
Rebased after 6c0a2c2804c0 and added function definition code comment.

spatel marked an inline comment as done.Oct 27 2021, 6:28 AM
spatel marked 4 inline comments as done.

LGTM - although might be better to pre-commit the change to X86::mayFoldLoad separately from the actual PR52032 fix

Any other comments?

Seems fine to me.

RKSimon accepted this revision.Oct 28 2021, 4:35 AM
This revision is now accepted and ready to land.Oct 28 2021, 4:35 AM
This revision was landed with ongoing or failed builds.Oct 29 2021, 12:53 PM
This revision was automatically updated to reflect the committed changes.