This is an archive of the discontinued LLVM Phabricator instance.

[X86] When have BMI2, prefer shifts to clear low/high bits, rather than variable mask.
ClosedPublic

Authored by lebedev.ri on Jun 29 2018, 5:15 AM.

Details

Summary

This adds a reverse transform for the instcombine canonicalizations
that were added in D47980, D47981.

As discussed afterwards, that canonicalization turned out to be worse, at least
for code size, and potentially for performance too.

https://rise4fun.com/Alive/Zmpl

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.Jun 29 2018, 5:15 AM
lebedev.ri edited the summary of this revision. (Show Details)Jun 29 2018, 5:33 AM

This is always a win on instruction count.
This is good for Ryzen: https://godbolt.org/g/bd6xLh
But e.g. for Haswell, IPC halves, and RThroughput slightly decreases: https://godbolt.org/g/2gUzfP
So I'm not sure; maybe some more guards are needed?

Is it possible to reverse this as a DAG combine? I'd rather not produce 2 instructions from one pattern if we can avoid it.

lebedev.ri planned changes to this revision.Jun 29 2018, 1:43 PM

Is it possible to reverse this as a DAG combine? I'd rather not produce 2 instructions from one pattern if we can avoid it.

Probably? I will take a look.
It will still need to produce two instructions (which is safe, since there is a one-use check on the mask).

If you do the one use check in the DAG combine to qualify the reverse, then we can use existing isel patterns to select each shift individually right?

If you do the one use check in the DAG combine to qualify the reverse, then we can use existing isel patterns to select each shift individually right?

I'm not sure I understand the question.
I would expect that the DAG combine reverse would fully replace these
X86InstrCompiler.td changes, while keeping the same test output.

That's what I meant. I think I didn't understand this statement: "It will still need to produce two instructions (which is safe, since there is a one-use check on the mask)."
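For context, a minimal sketch of what such a reverse DAG combine could look like for the high-bits case, against LLVM's SelectionDAG API (the helper name is hypothetical, and the committed code also handles the (shl -1, y) low-bits mask and commuted operands):

  static SDValue unfoldExtremeBitClearMask(SDNode *N, SelectionDAG &DAG) {
    assert(N->getOpcode() == ISD::AND && "expected: and");
    SDValue Mask = N->getOperand(0), X = N->getOperand(1);
    // Only unfold a mask of the form (srl -1, y), and only if this 'and'
    // is its sole user; otherwise the mask stays live anyway and the two
    // shifts would be a net instruction increase.
    if (Mask.getOpcode() != ISD::SRL || !Mask.hasOneUse() ||
        !isAllOnesConstant(Mask.getOperand(0)))
      return SDValue();
    SDValue Y = Mask.getOperand(1);
    SDLoc DL(N);
    EVT VT = N->getValueType(0);
    // x & (-1 >> y)  ->  (x << y) >> y
    SDValue Shl = DAG.getNode(ISD::SHL, DL, VT, X, Y);
    return DAG.getNode(ISD::SRL, DL, VT, Shl, Y);
  }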

  • Rebased on top of more tests.
  • Make it a DAGCombine transform, instead of X86InstrCompiler.td patterns.
  • Add a preferShiftsToClearExtremeBits() target lowering hook, defaulting to false (see the sketch below).
  • X86: preferShiftsToClearExtremeBits() - true if BMI2 and 32/64-bit width.
  • This exposed one more missing BZHI pattern match (@f64_bzhi in llvm/test/CodeGen/X86/replace-load-and-with-bzhi.ll started failing).
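For reference, the hook described above would look roughly like this in include/llvm/CodeGen/TargetLowering.h (a sketch; the exact signature and wording in the diff may differ):

  /// Return true if it is preferable to clear extreme (low/high) bits of
  /// a value with 2 shifts, rather than with an 'and' against a variable
  /// all-ones mask. Defaults to false, keeping the single 'and'.
  virtual bool preferShiftsToClearExtremeBits(SDValue X) const {
    return false;
  }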
craig.topper added inline comments.Jun 30 2018, 10:32 AM
lib/Target/X86/X86ISelLowering.cpp
4797 ↗(On Diff #153620)

You may not want to do this for i64 on 32-bit targets. Would that improve the results in clear_highbits64_c0 for example?

lib/Target/X86/X86InstrInfo.td
2566 ↗(On Diff #153620)

The fact that you needed these patterns is actually a bug in the code that turns load+and into the bzhi pattern. Normally the trunc would have narrowed the subtract to 32 bits, but the trunc, shift, and subtract were created after legalize ops, and the shift amount was created as i64. It should have been created directly as an i8. The re-legalize we run as part of the last DAG combine caught it and fixed it, but by then it was too late to get the subtract fixed.

I suppose this pattern might be useful if the subtract had other users besides the truncate, which would prevent the normal combine from kicking in. But by that logic we would need to replicate all patterns that look for truncated subtracts, and it wouldn't be limited to just the BZHI64 patterns: you could have 32-bit data, but a shift amount that starts out as 64 bits.

craig.topper added inline comments.Jun 30 2018, 11:03 AM
lib/Target/X86/X86InstrInfo.td
2566 ↗(On Diff #153620)

I've fixed the bad shift creation in r336051
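The shape of that fix, as described in the comment above, is to create the shift amount directly in the shift-amount type instead of as i64 followed by a late trunc. A sketch of the idea with hypothetical variable names, not the actual r336051 diff:

  // Before: the (bitwidth - length) subtract was built as i64 and only
  // truncated to the shift-amount type afterwards, too late for isel.
  // After: build it directly as i8, x86's shift-amount type.
  SDValue BitWidthC = DAG.getConstant(64, DL, MVT::i8);
  SDValue Len8 = DAG.getZExtOrTrunc(Len, DL, MVT::i8);
  SDValue ShAmt = DAG.getNode(ISD::SUB, DL, MVT::i8, BitWidthC, Len8);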

lebedev.ri marked an inline comment as done.

As suggested, restrict the transform of the 64-bit case to 64-bit targets.

lebedev.ri marked 2 inline comments as done.

Respond to rL336051.

Any further comments?

spatel added a comment.Jul 3 2018, 7:37 AM

I think the original motivating question was independent of any BMI extensions? I.e., which of these is better for generic x86(-64):

0000000000000000	movl	%esi, %ecx
0000000000000002	shll	%cl, %edi
0000000000000004	movl	%edi, %eax
0000000000000006	shrl	%cl, %eax
0000000000000008	retq

0000000000000000	movl	$0xffffffff, %eax
0000000000000005	movl	%esi, %ecx
0000000000000007	shrl	%cl, %eax
0000000000000009	andl	%edi, %eax
000000000000000b	retq
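In source terms, both sequences compute the same value; roughly, in C++ (a reconstruction of the kind of function under discussion, not the actual test case):

  // First sequence: clear the high y bits with two shifts.
  unsigned clear_highbits32_shifts(unsigned x, unsigned y) {
    return (x << y) >> y;
  }

  // Second sequence: clear them with a variable all-ones mask.
  unsigned clear_highbits32_mask(unsigned x, unsigned y) {
    return x & (0xffffffffu >> y);
  }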

The 'and' version is bigger and might be slower, so that's not a good default for x86 codegen?

Either way, I think it would be better to change the check-prefixes in the test files so they map to the enabled/disabled target attributes. It's not clear to me that we want to tie the transform to any particular set of those attributes.

  • Rebased on top of more tests.
  • Always transform for x86, unless the type is a vector, or is 64 bits wide on a non-64-bit target (see the sketch below).
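A sketch of what the X86 override looks like after this update (the hook name is from the patch; the body is reconstructed from the bullet above):

  bool X86TargetLowering::preferShiftsToClearExtremeBits(SDValue Y) const {
    EVT VT = Y.getValueType();
    // Keep the variable-mask form for vectors.
    if (VT.isVector())
      return false;
    // ...and for i64 when 64-bit shifts aren't natively available.
    if (VT == MVT::i64 && !Subtarget.is64Bit())
      return false;
    return true;
  }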
spatel added a comment.Jul 4 2018, 7:47 AM

This implements what was suggested. The BMI2 case looks like a win because it can save an instruction. I'm still not sure about the base case. @craig.topper or @RKSimon probably have a better sense for how that compares on various uarch. We could argue that this patch is just restoring the previous behavior, so that's the least risky and preferred way to go.

reames added a subscriber: reames.Jul 4 2018, 11:07 AM

I may be missing something but isn't this a pessimization for the case where the mask is an immediate that can be directly encoded into the and instruction? Why should we prefer dual shifts over something like an AND RAX, 0xfff?

This specifically looks for (and (shl -1, X), Y) or (and (shr -1, X), Y). If Y in either of those is a constant, I think we would have rewritten the shift on the LHS already and removed the 'and' entirely.

I may be missing something but isn't this a pessimization for the case where the mask is an immediate that can be directly encoded into the and instruction? Why should we prefer dual shifts over something like an AND RAX, 0xfff?

Correct, I'm pretty sure we do not want to pessimize constant masks.

This is looking for a pattern like: x & (-1 << y)
That is, an 'and' with one side being a shift.
If y is constant, the pattern is x & (-1 << C),
and I'm somewhat expecting that (-1 << C) will be constant-folded into a constant.
So we shouldn't be able to match that case.
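To make that concrete, here is a sketch of the kind of guard involved (hypothetical helper name, mirroring the match in the earlier sketch):

  static bool isVariableExtremeBitMask(SDValue Mask) {
    if (Mask.getOpcode() != ISD::SHL && Mask.getOpcode() != ISD::SRL)
      return false;
    // A shift of all-ones by a non-constant amount. If y were constant,
    // (-1 << y) would already have been constant-folded, leaving no
    // shift node to match, so the unfold simply never sees that case.
    return isAllOnesConstant(Mask.getOperand(0)) &&
           !isa<ConstantSDNode>(Mask.getOperand(1));
  }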

This specifically looks for (and (shl -1, X), Y) or (and (shr -1, X), Y). If Y in either of those is a constant, I think we would have rewritten the shift on the LHS already and removed the 'and' entirely.

That was indeed what I was missing. Thanks for the clarification.

spatel accepted this revision.Jul 9 2018, 6:06 AM

This looks like the right default for x86 based on size of the code and opportunities for other combines (and as noted, it's what we would have produced anyway until very recently), so LGTM.

include/llvm/CodeGen/TargetLowering.h
515 ↗(On Diff #153948)

I don't think this line ("Different targets...") adds anything, so I'd remove it.

516 ↗(On Diff #153948)

Clearer to say: "Return true if the variant with 2 shifts is preferred."

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4195–4204 ↗(On Diff #153948)

It's a matter of taste, but I find it more readable/less code to use if+else or ?: for 2-case logic.

This revision is now accepted and ready to land.Jul 9 2018, 6:06 AM

This looks like the right default for x86 based on size of the code and opportunities for other combines (and as noted, it's what we would have produced anyway until very recently), so LGTM.

Thank you for the review!

This revision was automatically updated to reflect the committed changes.
lebedev.ri marked 2 inline comments as done.Jul 9 2018, 12:12 PM
lebedev.ri marked an inline comment as done.
lebedev.ri added inline comments.Sep 21 2018, 4:26 AM
llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4193–4194

Hmm, this effectively killed test coverage for this pattern for BZHI/BEXTR.
It wouldn't be too much of a problem to just match the expanded pattern there,
but we only unfold if the mask is one-use...