Diff Detail
Repository: rL LLVM
Event Timeline
This is always good re instruction count.
This is good for Ryzen: https://godbolt.org/g/bd6xLh
But e.g. for Haswell, IPC halves, and RThroughput slightly decreases: https://godbolt.org/g/2gUzfP
So I'm not sure; maybe some more guards are needed?
Is it possible to reverse this as a DAG combine? I'd rather not produce 2 instructions from one pattern if we can avoid it.
Probably? I will take a look.
It will still need to produce two instructions (which is safe, since there is a one-use check on the mask).
If you do the one-use check in the DAG combine to qualify the reversal, then we can use existing isel patterns to select each shift individually, right?
I'm not sure I understand the question.
I would expect that the DAG combine reverse would fully replace these
X86InstrCompiler.td changes, while keeping the same test output.
That's what I meant; I think I didn't understand this statement: "It will still need to produce two instructions (which is safe, since there is a one-use check on the mask)."
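For reference, a minimal C++ sketch of the identity this rewrite relies on (my own illustration, not code from the patch; it assumes the shift amount y is smaller than the bit width). Each *_mask form is what the DAG starts with, and the corresponding *_shifts form is the two-instruction result, which is why the combine still produces two instructions:

  // Sketch only: verifies that masking with a variable-shifted all-ones mask
  // is equivalent to a pair of logical shifts, for shift amounts 0..31.
  #include <cassert>
  #include <cstdint>

  uint32_t clear_lowbits_mask(uint32_t x, unsigned y)    { return x & (~0u << y); }
  uint32_t clear_lowbits_shifts(uint32_t x, unsigned y)  { return (x >> y) << y; }

  uint32_t clear_highbits_mask(uint32_t x, unsigned y)   { return x & (~0u >> y); }
  uint32_t clear_highbits_shifts(uint32_t x, unsigned y) { return (x << y) >> y; }

  int main() {
    for (unsigned y = 0; y < 32; ++y) {
      assert(clear_lowbits_mask(0xDEADBEEFu, y) == clear_lowbits_shifts(0xDEADBEEFu, y));
      assert(clear_highbits_mask(0xDEADBEEFu, y) == clear_highbits_shifts(0xDEADBEEFu, y));
    }
    return 0;
  }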
- Rebased on top of more tests.
- Make it a DAGCombine transform instead of X86InstrCompiler.td patterns.
- Add a preferShiftsToClearExtremeBits() TargetLowering hook; it defaults to false.
- X86: preferShiftsToClearExtremeBits() returns true with BMI2 and a 32/64-bit width.
- This exposed one more missing BZHI pattern match (@f64_bzhi in llvm/test/CodeGen/X86/replace-load-and-with-bzhi.ll started failing).
lib/Target/X86/X86ISelLowering.cpp:4797
You may not want to do this for i64 on 32-bit targets. Would that improve the results in clear_highbits64_c0 for example?
lib/Target/X86/X86InstrInfo.td:2566 (On Diff #153620)
The fact that you needed this pattern is actually a bug in the code that turns load+and into the BZHI pattern. Normally the trunc would have narrowed the subtract to 32 bits, but the trunc, shift, and subtract were created after legalize ops, and the shift amount was created as i64. It should have been created directly as an i8. The re-legalize we run as part of the last DAG combine caught it and fixed it, but it was too late to get the subtract fixed. I suppose this pattern might be useful if the subtract had other users besides the truncate, which would prevent the normal combine from kicking in. But by that logic we would need to replicate all patterns that look for truncated subtracts, and it wouldn't be limited to just the BZHI64 patterns. You could have 32-bit data but a shift amount that is initially 64 bits.
lib/Target/X86/X86InstrInfo.td:2566 (On Diff #153620)
I've fixed the bad shift creation in r336051.
I think the original motivating question was independent of any BMI extensions? I.e., which of these is better for a generic x86(-64):
0000000000000000 movl %esi, %ecx
0000000000000002 shll %cl, %edi
0000000000000004 movl %edi, %eax
0000000000000006 shrl %cl, %eax
0000000000000008 retq

0000000000000000 movl $0xffffffff, %eax
0000000000000005 movl %esi, %ecx
0000000000000007 shrl %cl, %eax
0000000000000009 andl %edi, %eax
000000000000000b retq
The 'and' version is bigger (12 bytes vs. 9, going by the offsets above) and might be slower, so that's not a good default for x86 codegen?
Either way, I think it would be better to change the check-prefixes in the test files so they map to the enabled/disabled target attributes. It's not clear to me that we want to tie the transform to any particular set of those attributes.
- Rebased on top of more tests.
- Always transform for x86, unless the type is a vector or it is 64 bits wide on a non-64-bit target (see the sketch below).
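A rough standalone restatement of the revised gating condition (a hypothetical helper with made-up parameter names, not the actual TargetLowering/X86TargetLowering code):

  // Sketch: transform scalars, but skip vectors and skip i64 when the target
  // is not 64-bit (where an i64 shift would be split into 32-bit pieces).
  bool preferShiftsToClearExtremeBitsX86(unsigned BitWidth, bool IsVector,
                                         bool Is64BitTarget) {
    if (IsVector)
      return false;                        // scalar integer shifts only
    if (BitWidth == 64 && !Is64BitTarget)
      return false;                        // avoid i64 on 32-bit targets
    return true;
  }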
This implements what was suggested. The BMI2 case looks like a win because it can save an instruction. I'm still not sure about the base case. @craig.topper or @RKSimon probably have a better sense for how that compares on various uarch. We could argue that this patch is just restoring the previous behavior, so that's the least risky and preferred way to go.
I may be missing something but isn't this a pessimization for the case where the mask is an immediate that can be directly encoded into the and instruction? Why should we prefer dual shifts over something like an AND RAX, 0xfff?
This specifically looks for (and (shl -1, X), Y) or (and (shr -1, X), Y). If Y in either of those is a constant, I think we would have rewritten the LHS side of the shift already and removed the 'and' entirely.
Correct, I'm pretty sure we do not want to pessimize constant masks.
This is looking for a pattern like: x & (-1 << y)
That is, an 'and' with one side being a shift.
If y is constant, the pattern is x & (-1 << C), and I'm somewhat expecting that (-1 << C) will be constant-folded into a single constant, so we shouldn't be able to match it.
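To illustrate that point with my own example (not taken from the test suite):

  #include <cstdint>

  // With a variable amount the DAG keeps an (and X, (shl -1, Y)) node for the
  // unfold to match:
  uint32_t clear_lowbits_var(uint32_t x, unsigned y) { return x & (~0u << y); }

  // With a constant amount, (~0u << 8) constant-folds to 0xFFFFFF00 before this
  // combine runs, leaving a plain 'and x, imm' and no shift to match:
  uint32_t clear_lowbits_const(uint32_t x) { return x & (~0u << 8); }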
This looks like the right default for x86 based on the size of the code and opportunities for other combines (and as noted, it's what we would have produced anyway until very recently), so LGTM.
include/llvm/CodeGen/TargetLowering.h:515
I don't think this line ("Different targets...") adds anything, so I'd remove it.

include/llvm/CodeGen/TargetLowering.h:516
Clearer to say: "Return true if the variant with 2 shifts is preferred."

lib/CodeGen/SelectionDAG/DAGCombiner.cpp:4197–4206
It's a matter of taste, but I find it more readable/less code to use if+else or ?: for 2-case logic.
llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:4193–4194 (On Diff #154673)
Hmm, this effectively killed test coverage for this pattern for BZHI/BEXTR...