
[x86] [DAGCombine] Prefer shifts of constant widths.
Needs ReviewPublic

Authored by jlebar on Feb 6 2020, 1:56 PM.

Details

Summary

Commute shift and select in the following pattern:

shift lhs, (select cond, constant1, constant2) -->
select cond, (shift lhs, constant1), (shift lhs, constant2)

This is beneficial on x86, where shifting by an immediate is faster than
shifting by a register.

Canonical example:

return x << (cond ? 3 : 6);

before this patch

mov     eax, edi
xor     ecx, ecx
test    esi, esi
sete    cl
lea     ecx, [rcx + 2*rcx]
add     ecx, 3
shl     eax, cl
ret

after this patch

lea     eax, [8*rdi]
shl     edi, 6
test    esi, esi
cmove   eax, edi
ret
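At the C level, the before/after asm above corresponds to the following pair of equivalent functions (a minimal sketch; the constants 3 and 6 and the function names are taken from or invented for this illustration):

```c
#include <stdint.h>

/* Before the fold: materialize the shift amount, then one shift-by-register. */
static uint32_t shl_select_amount(uint32_t x, int cond) {
    uint32_t amt = cond ? 3u : 6u; /* select cond, c1, c2 */
    return x << amt;               /* shl eax, cl */
}

/* After the fold: two shifts-by-immediate, then select the result (cmov). */
static uint32_t shl_select_result(uint32_t x, int cond) {
    uint32_t a = x << 3; /* lea eax, [8*rdi] */
    uint32_t b = x << 6; /* shl edi, 6 */
    return cond ? a : b; /* cmove eax, edi */
}
```

The fold trades one speculated shift for the chance to use immediate-operand shifts on both paths.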

I enabled this folding only on x86. By my reading of the ARM Cortex-A75
optimization guide, this is not beneficial there. (I didn't check other ARM
processors.) I was unable to find a PPC optimization guide that listed
instruction latencies, so I didn't enable it there.

Diff Detail

Event Timeline

jlebar created this revision. Feb 6 2020, 1:56 PM
RKSimon added a subscriber: RKSimon.

Is there any canonicalization happening in InstCombine for this?

Vector tests? SSE shifts by uniform constants are a lot better than the alternatives.

Would it be better to make this more generic from the start? Some mechanism that allows targets to push selects up/down the DAG.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
501

"but not FHL or FSHR"? Why shouldn't funnels shifts be included?

7395

isConstantIntBuildVectorOrConstantInt ?

llvm/test/CodeGen/X86/select.ll
1123 ↗(On Diff #242998)

Regression

jlebar marked an inline comment as done. Feb 7 2020, 9:03 AM

Hi, thank you for looking at this!

Is there any canonicalization happening in InstCombine for this?

Yes, it canonicalizes to shift x, (select ...), https://gcc.godbolt.org/z/AtSRrW.

Vector tests? SSE shifts by uniform constants are a lot better than the alternatives.

Vector shift by a uniform value is better than shift by a vector of values, but I don't think that's what this transformation allows. (I recall seeing something in the x86 lowering code for what I think you're describing.) Do you see a CPU on which vector shift by an *immediate* is so much faster than vector shift by a register that we'd want to do two shifts by immediates instead of one by a register? I'm not seeing that as I look through the Agner table for Skylake, but I may be missing something.

TODO: Once we resolve this, I need to handle and test vector shifts somehow, if only by excluding them from this transformation.

Would it be better to make this more generic from the start? Some mechanism that allows targets to push selects up/down the DAG.

Eh, YAGNI? I looked for other opportunities to apply this optimization (i.e. where it's worth "speculating" an instruction so that one of the operands gets to be an immediate) and didn't really find any. Do you?

"but not FHL or FSHR"? Why shouldn't funnels shifts be included?

Is there a target on which we would do this transformation? If not I'd rather not write code that is never exercised (and it's more code because funnel shift takes three rather than two args).
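For context on the three-argument point: a funnel shift conceptually concatenates two operands and shifts the pair. A scalar C model of a 32-bit funnel shift left (a sketch; the helper name is invented, and the wrap-around handling follows the usual modulo-width definition):

```c
#include <stdint.h>

/* Model of a 32-bit funnel shift left: conceptually form the 64-bit
   value a:b, shift left by amt mod 32, and keep the high 32 bits.
   The extra operand is what makes FSHL/FSHR more work to handle than
   a plain shift. */
static uint32_t fshl32(uint32_t a, uint32_t b, uint32_t amt) {
    amt &= 31;
    if (amt == 0)
        return a; /* avoid the out-of-range b >> 32 */
    return (a << amt) | (b >> (32 - amt));
}
```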

llvm/test/CodeGen/X86/select.ll
1123 ↗(On Diff #242998)

Wow. What...is...this...architecture.

I guess the answer is, we don't do this optimization if the target doesn't have cmov?

I mean, I'll do it, but are we sure the complexity is worth it for this target? I can't even find an Agner optimization table for Intel MCU (Quark?).

craig.topper added inline comments.
llvm/test/CodeGen/X86/select.ll
1123 ↗(On Diff #242998)

MCU is basically a 486 on a much more modern silicon process.

RKSimon added inline comments. Feb 7 2020, 9:58 AM
llvm/test/CodeGen/X86/select.ll
1123 ↗(On Diff #242998)

It's not just the MCU case - all of these targets' codegen looks worse tbh

jlebar marked an inline comment as done. Feb 7 2020, 10:39 AM
jlebar added inline comments.
llvm/test/CodeGen/X86/select.ll
1123 ↗(On Diff #242998)

It does look worse. I wanted to dig in more...

I somewhat arbitrarily tried Core2, Westmere, and Haswell with the x86-32 and x86-64 code.

  • *x86-32* llc < testcases/shift-regression.ll -mtriple=i386-apple-darwin10 --x86-asm-syntax=intel -mcpu=athlon | llvm-mca -mcpu=<arch>
  • *x86-64* llc < testcases/shift-regression.ll -mtriple=amd64-apple-darwin10 --x86-asm-syntax=intel | llvm-mca -mcpu=<arch>

I'm reporting "total cycles".

  • Core2 x86-32: HEAD 211 / patched 304
  • Core2 x86-64: HEAD 207 / patched 204
  • Westmere x86-32: HEAD 211 / patched 304
  • Westmere x86-64: HEAD 207 / patched 204
  • Haswell x86-32: HEAD 310 / patched 309
  • Haswell x86-64: HEAD 210 / patched 210

IOW, for these CPUs it's a regression only in the x86-32 code and only on the older CPUs.

One thing we could do is limit this transformation to x86-64? Or we could do more investigation, in which case I'm curious to know what kind of evidence you'd need to be comfortable with it.

jlebar updated this revision to Diff 243634. Feb 10 2020, 10:55 AM
jlebar marked an inline comment as done and an inline comment as not done.

Fix regressions and add vector test.

jlebar added inline comments. Feb 10 2020, 10:55 AM
llvm/test/CodeGen/X86/select.ll
1123 ↗(On Diff #242998)

After thinking about this more...

The reason this testcase is sort of special is that the two shift amounts are N and N+1, so it's very easy to convert from the boolean into the shift amount.

In a sense, what's happening here is not shift x, (select ...), but rather shift x, (add ...). The bug is that I'm transforming the latter, when I only want to touch the former.
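A sketch of why adjacent shift amounts are the odd case (the amounts 4 and 5 are hypothetical, chosen only to illustrate the N/N+1 shape): the select of two constants differing by 1 collapses into an add of the zero-extended condition bit, so by the time the combine runs, the DAG can see an add rather than a select:

```c
#include <stdint.h>

/* Select form: shift amount chosen between N+1 and N. */
static uint32_t shl_select(uint32_t x, int cond) {
    return x << (cond ? 5 : 4);
}

/* Equivalent add form: the condition bit is zero-extended and added to N.
   This is the shape the transformation should leave alone. */
static uint32_t shl_add(uint32_t x, int cond) {
    return x << (4u + (uint32_t)(cond != 0));
}
```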

I fixed the regression by restricting this optimization to run only later in the SelectionDAG pipeline, by which point the select has already been folded into an add. Surprisingly, my change now doesn't affect *any* existing x86 codegen tests other than the ones I added. That makes me a little suspicious that I've done something wrong, so I'd appreciate a careful review.

I've also now excluded vector types from this transformation, taking care of the TODO from earlier, and added tests.

Have you found any other targets that would benefit from this? Otherwise we might consider putting it inside X86ISelLowering initially?

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
501

OK, but maybe change this to a TODO for FSHL/FSHR ?

7382

(style) Don't use auto.

7515

Move these lower down to after SimplifyDemandedBits, we tend to have the constant folding / cleanup combines first.

llvm/lib/Target/X86/X86ISelLowering.cpp
46546

(style) Don't use auto.

llvm/test/CodeGen/X86/dagcombine-shifts.ll
225

Please commit these tests with current codegen, then rebase the patch to show the codegen diff.

arsenm added a subscriber: arsenm. Feb 18 2020, 12:49 PM
arsenm added inline comments.
llvm/include/llvm/CodeGen/TargetLowering.h
3464–3468

I think we generally have too many overly specific queries like this. Is there any real reason NOT to do this on any target?

Thank you for the comments!

Have you found any other targets that would benefit from this? Otherwise we might consider put it inside X86ISelLowering initially?

Good question. @arsenm says on Discord he'd want this on AMDGPU for 64-bit shifts:

It's indirectly better for AMDGPU since if it's a 64-bit shift we can reduce it to 32-bit shifts
i.e. https://reviews.llvm.org/rG85508595350e2de0218c15f1c0088cd9f6236894
I would probably want that on, maybe not at -Oz
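The 64-bit reduction arsenm mentions can be sketched in scalar C: once the shift amount is a known constant of at least 32, the wide shift depends only on the low word, and one 32-bit shift suffices (function name and the [32, 64) precondition are illustrative):

```c
#include <stdint.h>

/* For amt in [32, 64), x << amt depends only on the low 32 bits of x:
   the result is (lo32(x) << (amt - 32)) placed in the high word. */
static uint64_t shl64_amt_ge32(uint64_t x, unsigned amt) {
    uint32_t lo = (uint32_t)x;
    uint32_t hi = lo << (amt - 32); /* the single 32-bit shift */
    return (uint64_t)hi << 32;
}
```

Commuting the select past the shift exposes constant amounts like this, which is why the fold is indirectly profitable there.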

jlebar marked an inline comment as done. Feb 18 2020, 1:04 PM
jlebar added inline comments.
llvm/include/llvm/CodeGen/TargetLowering.h
3464–3468

I am also not thrilled about adding another query like this that pretends to be general but really means "should I do this particular peephole optimization?"

I gated it on the architecture because ISTM that this transformation could be a pessimization on platforms where shift-by-register is as fast as shift-by-immediate. We're replacing 1 x shift-by-register + 1 x select with 2 x shift-by-immediate + 1 x select. AFAICT e.g. on Cortex-A75 this would not be an improvement.

arsenm added inline comments. Feb 18 2020, 1:18 PM
llvm/include/llvm/CodeGen/TargetLowering.h
3464–3468

GlobalISel avoids this problem by having every combine be an explicit opt-in, if I can encourage you to also do this there.

jlebar marked an inline comment as done. Feb 18 2020, 1:26 PM
jlebar added inline comments.
llvm/include/llvm/CodeGen/TargetLowering.h
3464–3468

Happy to consider it, but I'm not sure what the suggestion is exactly. (Or at least, this seems like we're making the combine be an explicit opt-in, so I don't see how it's different from your suggestion.) Can you point me to the prior art as used in non-globalisel?

arsenm added inline comments. Feb 18 2020, 7:48 PM
llvm/include/llvm/CodeGen/TargetLowering.h
3464–3468

I mean, SelectionDAG is bad, and I'm just generally disappointed that there isn't more work going on for GlobalISel optimizations; everyone seems to still be working on SelectionDAG.

lebedev.ri added inline comments.
llvm/include/llvm/CodeGen/TargetLowering.h
3464–3468

From personal experience: I only look at SelectionDAG because that is what is on the current (codegen) path.
Doing stuff for GISel would currently result in no benefit to codegen on X86, which kinda misses the point.
Is there an actual plan for the X86 target to migrate to GISel?

spatel added inline comments. Feb 24 2020, 8:41 AM
llvm/include/llvm/CodeGen/TargetLowering.h
3464–3468

<crickets chirping...>

Speaking only for myself here: migrating to GISel is a daunting task for x86 because there's so much code that has to be adapted, and I have not seen a list or estimate of the potential benefits to incentivize the move.

There are over 100 tests in llvm/test/CodeGen/X86/GlobalISel, so some work has been done. But whether that is planned to continue or not, I have no idea.

jlebar updated this revision to Diff 247514. Sun, Mar 1, 12:38 PM
jlebar marked 9 inline comments as done.

Update per comments.

@RKSimon and others, thank you for the review and comments. Sorry for my delay here; this has changed from being my day job to my weekend hobby, and that made a bigger difference in my responsiveness than I'd like or than I expected.

I believe I've addressed all of the comments. I haven't pushed the tests sans this change just on the principle that I don't want to add tests that might end up being useless (if this patch doesn't land for some reason), but I've rebased this patch atop a commit which adds those tests. So when I commit this it will be as two patches, one to add the tests as-is, and then another to update the test with this patch.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7515

Done. This also allowed me to remove the Level >= AfterLegalizeTypes check, which I never liked.