This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] narrow truncated binops
Closed, Public

Authored by spatel on Nov 16 2018, 11:25 AM.

Details

Summary

The motivating case for this is shown in:
https://bugs.llvm.org/show_bug.cgi?id=32023
and the corresponding rot16.ll regression tests.

Because x86 scalar shift amounts are i8 values, we can end up with trunc-binop-trunc sequences that don't get folded in IR.

As the TODO comments suggest, there will be regressions if we extend this (for x86, we mostly seem to be missing LEA opportunities, but there are likely vector folds missing too). I think those should be considered existing bugs because this is the same transform that we do as an IR canonicalization in instcombine. We just need more tests to make those visible independent of this patch.
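
To make the shape of the transform concrete, here is a rough sketch in DAGCombiner style. This is not the committed code; the helper name and the exact opcode list are illustrative, and only the structure (pre-legalization, single use, one constant operand) follows the description above.

```cpp
// Rough sketch only -- not the committed code. Shape of the narrowing:
//   trunc (binop X, C) --> binop (trunc X), (trunc C)
// with the conservative limits described above: pre-legalization, the
// binop has a single use, and one operand is a constant.
SDValue narrowTruncatedBinop(SDNode *Trunc, SelectionDAG &DAG,
                             bool LegalTypes) {
  SDValue Src = Trunc->getOperand(0);    // the wide binop being truncated
  EVT NarrowVT = Trunc->getValueType(0); // narrow destination type
  SDLoc DL(Trunc);

  if (LegalTypes || !Src.hasOneUse())
    return SDValue();

  unsigned Opc = Src.getOpcode();
  if (Opc != ISD::AND && Opc != ISD::OR && Opc != ISD::XOR &&
      Opc != ISD::ADD && Opc != ISD::SUB && Opc != ISD::MUL)
    return SDValue();

  SDValue X = Src.getOperand(0), C = Src.getOperand(1);
  if (!DAG.isConstantIntBuildVectorOrConstantInt(C))
    return SDValue();

  // Perform the binop in the narrow type; truncating the constant is free.
  SDValue NarrowX = DAG.getNode(ISD::TRUNCATE, DL, NarrowVT, X);
  SDValue NarrowC = DAG.getNode(ISD::TRUNCATE, DL, NarrowVT, C);
  return DAG.getNode(Opc, DL, NarrowVT, NarrowX, NarrowC);
}
```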

test-shrink.ll has what appears to be a regression even with this limited patch. I'm not sure about the AMDGPU diffs.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Nov 16 2018, 11:25 AM
arsenm added inline comments.Nov 16 2018, 11:52 AM
test/CodeGen/AMDGPU/cgp-bitfield-extract.ll
128–132 ↗(On Diff #174411)

This should be fine

nhaehnle removed a subscriber: nhaehnle.Nov 19 2018, 3:47 AM

X86's combineTruncatedArithmetic does something similar - can we merge them at all?

> X86's combineTruncatedArithmetic does something similar - can we merge them at all?

Yes, that should be possible. This patch has a TODO about supporting vectors, and that code has a TODO about supporting scalars (it is exclusively working on vectors). So they are completely independent for the moment, but I'll take that as a follow-up.

spatel updated this revision to Diff 175281.Nov 26 2018, 9:03 AM

Patch updated:
Rebased - no code changes, but the recently added funnel-shift tests have diffs that look neutral to me.

Note that we should be limiting LEA-based regressions with D54803, but those are still possible, so I think it's still better to make this change in multiple steps. I have another patch drafted to help with LEA, but it depends on fixing the crash seen in D54710.

LGTM.

At the beginning, I was a bit concerned by the increase in the number of partial register writes.
So I started experimenting with your patch for a bit. In practice, I couldn't come up with an obvious case where your patch introduces extra partial register stalls. Also, none of the tests modified by your patch really suffers from partial register stalls.
The checks performed by your transform are quite conservative (the new combine rule is limited to pre-legalization, and it only affects binary opcodes where one of the operands is a constant). So, I am happy with it.

test/CodeGen/X86/clear-lowbits.ll
898–900 ↗(On Diff #175281)

Not related to this patch: do you know why we need a movzwl here?

Only the lower 16 bits of EAX are 'defined' on exit from this function. Users of this function would not rely on the upper bits of EAX. So, unless I am missing something, it is redundant to zero-extend %val.

andreadb accepted this revision.Nov 29 2018, 10:08 AM
This revision is now accepted and ready to land.Nov 29 2018, 10:08 AM
craig.topper added inline comments.Nov 29 2018, 10:13 AM
test/CodeGen/X86/clear-lowbits.ll
898–900 ↗(On Diff #175281)

I think it needs to be zero extended because we're using 32-bit shifts and you need zeros in the upper bits to guarantee that shifting right will shift 0s into the lower 16 bits.

craig.topper added inline comments.Nov 29 2018, 10:24 AM
test/CodeGen/X86/clear-lowbits.ll
898–900 ↗(On Diff #175281)

Though we also shift back left by the same amount. So it is redundant, but SimplifyDemandedBits stopped when it saw that the shift amount on the shl was non-constant. Maybe for the non-constant case, SimplifyDemandedBits should see if the input is a SHR with the same amount and then propagate the demanded bits to the shr input.
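
To make the redundancy argument concrete, here is a small standalone check (illustrative only, not part of the test suite): the low 16 bits of (x >> c) << c never depend on the upper 16 bits of x, so when only an i16 result is demanded the zero-extension (the movzwl) adds nothing.

```cpp
// Illustrative standalone check: the low 16 bits of (x >> c) << c depend
// only on the low 16 bits of x, so the movzwl is redundant when only i16
// is demanded afterwards.
#include <cassert>
#include <cstdint>

static uint16_t low16_after_shr_shl(uint32_t x, unsigned c) {
  // c in [0, 32); c == 32 (i.e. y == 0 in the test pattern) is UB anyway.
  return static_cast<uint16_t>((x >> c) << c);
}

int main() {
  for (unsigned c = 0; c < 32; ++c)
    for (uint32_t low = 0; low <= 0xFFFF; low += 0x0101)
      for (uint32_t hi : {0x00000000u, 0xDEAD0000u, 0xFFFF0000u})
        assert(low16_after_shr_shl(low | hi, c) ==
               low16_after_shr_shl(low, c));
  return 0;
}
```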

andreadb added inline comments.Nov 29 2018, 10:38 AM
test/CodeGen/X86/clear-lowbits.ll
898–900 ↗(On Diff #175281)

Hmm... but isn't the second shift (the shlxl) using the same shift count?
Even if the shift right introduces non-zero bits, the second shift would get rid of those extra bits anyway (unless I am reading the code completely wrong... which is probably the case, since I am tired).

lebedev.ri added inline comments.
test/CodeGen/X86/clear-lowbits.ll
898–900 ↗(On Diff #175281)

As can be seen from the comment at the beginning of the file, this test is for ic) x & (-1 << (32 - y)).
But due to other transforms it becomes id) x >> (32 - y) << (32 - y).
In other words, shift x right by (32 - y) (i.e. UB if y is 0), and then shift it left, back to the bits' original positions,
thus effectively setting the low (32 - y) bits to zero.
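
For reference, the two forms written as plain C++ (the actual test is LLVM IR; these functions are illustrative only):

```cpp
// Illustrative C++ versions of the two equivalent patterns. Both clear the
// low (32 - y) bits of x; both are undefined when y == 0 because the shift
// amount becomes 32.
#include <cstdint>

uint32_t clear_lowbits_ic(uint32_t x, uint32_t y) {
  return x & (~0u << (32 - y));       // ic) x & (-1 << (32 - y))
}

uint32_t clear_lowbits_id(uint32_t x, uint32_t y) {
  return (x >> (32 - y)) << (32 - y); // id) x >> (32 - y) << (32 - y)
}
```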

andreadb added inline comments.Nov 29 2018, 11:01 AM
test/CodeGen/X86/clear-lowbits.ll
898–900 ↗(On Diff #175281)

Ah sorry. I completely missed your comment here. So basically we are on the same page.

Thanks everyone for the reviews and comments!
I haven't thought about partial reg stalls in a while, but IIRC, AMD doesn't care in cases like these because it doesn't rename 8/16-bit regs, and Intel should be protected because we have a machine pass (-x86-fixup-bw-insts) to prevent the problem. Note that we promote ops to 32-bit (eg, D54803) for speed/size, so that reduces the chances for partial reg problems too. This was really just to fix that annoying rotate vs. shld/shrd bug. :)

I filed the 'movzwl' issue here:
https://bugs.llvm.org/show_bug.cgi?id=39840

> Thanks everyone for the reviews and comments!
> I haven't thought about partial reg stalls in a while, but IIRC, AMD doesn't care in cases like these because it doesn't rename 8/16-bit regs, and Intel should be protected because we have a machine pass (-x86-fixup-bw-insts) to prevent the problem.

AMD processors don't normally rename portions of GPR registers. So, a partial write has a false dependency on the previous definition of the written register. Intel processors are able to rename portions of a GPR, at the cost of a merge operation when there is a read of the full register later on.
So, it is important to care about those partial writes. However, your patch limits the change to the case where there is a single use, and one of the operands is a constant. So, in practice, your change is unlikely to ever cause problems...

> Note that we promote ops to 32-bit (eg, D54803) for speed/size, so that reduces the chances for partial reg problems too. This was really just to fix that annoying rotate vs. shld/shrd bug. :)

In this particular case it is probably not needed. That being said, partial writes are best avoided in general (in this case it doesn't matter).

> I filed the 'movzwl' issue here:
> https://bugs.llvm.org/show_bug.cgi?id=39840

This revision was automatically updated to reflect the committed changes.