This is an archive of the discontinued LLVM Phabricator instance.

[X86] X86DAGToDAGISel::matchBitExtract(): extract 'lshr' from `X`
ClosedPublic

Authored by lebedev.ri on Nov 5 2018, 2:28 AM.

Details

Summary

As discussed in the previous review, and as noted in the FIXME: if X is actually an lshr Y, Z (logical shift!),
we can fold Z into 'control' and let BEXTR do that shift too.
I *think* INSERT_SUBREG with SUB_8BIT is the correct (or equivalent) choice here, rather than zext+or: https://godbolt.org/z/VdV5E9

We can only do this for lshr, not ashr, because we do not know that the mask covers only the bits of Y,
and none of the sign-extended bits.
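
For illustration, here is a minimal scalar sketch of the rewrite in plain C++ (this is not the ISel code; the function and parameter names are invented for this example, and shift amounts are assumed to be in range):

  #include <cstdint>

  // Before: the logical shift right is a separate operation feeding the
  // low-bits mask (assumes z < 64 and 0 < numlowbits < 64, so both shifts are defined).
  uint64_t extract_before(uint64_t y, uint8_t z, uint8_t numlowbits) {
    uint64_t x = y >> z;                              // lshr Y, Z
    return x & ((UINT64_C(1) << numlowbits) - 1);
  }

  // After: Z is folded into the BEXTR control word, so the separate shift
  // disappears. Control layout: bits 7:0 = start (Z), bits 15:8 = length.
  uint32_t make_bextr_control(uint8_t z, uint8_t numlowbits) {
    return (uint32_t(numlowbits) << 8) | z;
  }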

The obvious question is: is this actually legal to do?
I believe it is. Relevant quotes from the Intel® 64 and IA-32 Architectures Software Developer's Manual, BEXTR — Bit Field Extract:

  • Bit 7:0 of the second source operand specifies the starting bit position of bit extraction.
  • A START value exceeding the operand size will not extract any bits from the second source operand.
  • Only bit positions up to (OperandSize -1) of the first source operand are extracted.
  • All higher order bits in the destination operand (starting at bit position LENGTH) are zeroed.
  • The destination register is cleared if no bits are extracted.
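
A small C++ model of those rules (my paraphrase of the manual, purely illustrative) shows why feeding an arbitrary, possibly oversized, start value into the control is still safe:

  #include <cstdint>

  // Scalar model of 64-bit BEXTR following the quoted rules.
  uint64_t bextr64_model(uint64_t src, uint32_t control) {
    uint32_t start = control & 0xff;           // bits 7:0 of the control
    uint32_t len   = (control >> 8) & 0xff;    // bits 15:8 of the control
    if (start >= 64)
      return 0;                                // START past the operand size: no bits extracted
    uint64_t shifted = src >> start;           // only bits up to OperandSize-1 are read
    if (len >= 64)
      return shifted;                          // LENGTH covers everything that is left
    return shifted & ((UINT64_C(1) << len) - 1); // bits at positions >= LENGTH are zeroed
  }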

FIXME: if we can do this, I wonder if we should prefer BEXTR over BZHI in such cases.
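
For context, a scalar comparison of the two (again just an illustrative model, not the authoritative semantics): BZHI only zeroes the high bits starting at an index, so a skipped-bits extract still needs a separate shift, while BEXTR takes both a start and a length.

  #include <cstdint>

  // BZHI: zero all bits at positions >= index (index taken from the low 8 bits
  // of the second operand; if it is >= the operand size, the source passes through).
  uint64_t bzhi64_model(uint64_t src, uint64_t index) {
    uint32_t n = index & 0xff;
    return n >= 64 ? src : src & ((UINT64_C(1) << n) - 1);
  }

  // Extracting numlowbits bits of val starting at numskipbits:
  //   with BZHI:  still needs the shift     -> bzhi64_model(val >> numskipbits, numlowbits)
  //   with BEXTR: shift folded into control -> a single instruction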

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.Nov 5 2018, 2:28 AM

In general, I tend to prefer a sequence of MOVZ + OR over a partial register write introduced by a byte move.
That being said, I don't have a strong opinion on this.

AMD processors don't tend to rename portions of a 32/64-bit GPR. So, the partial write introduced by your byte move is not renamed, and there is a false dependency on the input register operand. On the other hand, if you use a MOVZ, you end up with one extra instruction to dispatch/execute. However, in practice the latency of that extra MOVZ can be hidden, as you can see from the throughput numbers generated by llvm-mca for your examples.

Partial register moves may not be a problem on modern Intel CPUs.
Intel processors know how to rename portions of a GPR, so false dependencies can be removed that way. According to Agner, since Haswell the processor solves this problem without any visible performance penalties; perhaps it does dual bookkeeping of both the partial register and the full register (cit. "Haswell and Broadwell pipeline" and "Skylake pipeline").

There is still a small penalty for SandyBridge/Ivybridge. On those processors, a partial write can be renamed. However, when the full register is read later on, a merge opcode is issued to join the data from the different writes. That being said, the latency of that merge opcode should be relatively small.

Note however that, according to Agner (see chapter "Sandybridge and Ivybridge pipeline"), a zero-extending move from an 8-bit register to a 32-bit or 64-bit register can also be eliminated at the register renaming stage. That seems to be true only for Sandybridge and Ivybridge. As far as I am aware, no AMD processor allows zero-extending move elimination.
That being said, a solution that uses a MOVZ + OR may be better (or equivalent) on Sandybridge/Ivybridge.

As a side note: if we can "prove" that the upper bits of the register feeding the MOVZ are known to be zero, then the MOVZ becomes redundant and can be removed. That is another point in favor of MOVZ + OR. But - as I wrote - there is no clear winner.
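
To make the trade-off concrete, here is a rough C++ sketch of the two ways of assembling the control word that are being compared (the names are made up; the real output is of course x86 machine code, not C++):

  #include <cstdint>

  // Variant 1: byte-move style. The shift amount is written into the low byte
  // of a register that already holds (len << 8) - on x86 a MOV8rr, i.e. a
  // partial-register write that AMD cores do not rename.
  uint32_t control_byte_move(uint8_t skip, uint8_t len) {
    uint32_t control = uint32_t(len) << 8;
    control = (control & ~uint32_t(0xff)) | skip; // "movb skip, low-byte(control)"
    return control;
  }

  // Variant 2: MOVZ + OR style. Zero-extend the shift amount to the full
  // register first (movzbl), then OR it in - one extra instruction, but no
  // partial write and no false dependency on the old register contents.
  uint32_t control_movz_or(uint8_t skip, uint8_t len) {
    uint32_t zext_skip = skip;                    // movzbl
    return (uint32_t(len) << 8) | zext_skip;      // orl
  }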

Regardless of the solution that you decide to implement in the end, it is nice to see that we no longer have the regalloc constraint on physical register %CL introduced by the shift right.

-Andrea

andreadb accepted this revision.Nov 13 2018, 7:10 AM

In general, I tend to prefer a sequence of MOVZ + OR over a partial register write introduced by a byte move.
That being said, I don't have a strong opinion on this.

AMD processors don't tend to rename portions of a 32/64-bit GPR. So, the partial write introduced by your byte move is not renamed, and there is a false dependency on the input register operand. On the other hand, if you use a MOVZ, you end up with one extra instruction to dispatch/execute. However, in practice the latency of that extra MOVZ can be hidden, as you can see from the throughput numbers generated by llvm-mca for your examples.

Partial register moves may not be a problem on modern Intel CPUs.
Intel processors know how to rename portions of a GPR, so false dependencies can be removed that way. According to Agner, since Haswell the processor solves this problem without any visible performance penalties; perhaps it does dual bookkeeping of both the partial register and the full register (cit. "Haswell and Broadwell pipeline" and "Skylake pipeline").

There is still a small penalty for SandyBridge/Ivybridge. On those processors, a partial write can be renamed. However, when the full register is read later on, a merge opcode is issued to join the data from the different writes. That being said, the latency of that merge opcode should be relatively small.

Note however that, according to Agner (see chapter "Sandybridge and Ivybridge pipeline"), a zero-extending move from an 8-bit register to a 32-bit or 64-bit register can also be eliminated at the register renaming stage. That seems to be true only for Sandybridge and Ivybridge. As far as I am aware, no AMD processor allows zero-extending move elimination.
That being said, a solution that uses a MOVZ + OR may be better (or equivalent) on Sandybridge/Ivybridge.

Actually, BMI1/BMI2 have only been available since Haswell (thanks Simon for pointing it out). So the information about Sandybridge/Ivybridge is not relevant to this patch.

In conclusion:
I think your patch is fine, and it removes a not-so-nice dependency on physreg %CL.
An alternative approach that uses MOVZ+OR instead of a MOV8rr is not bad. In particular, if the compiler knows how to optimize out the MOVZ by using the "known-zero-bits", then we may want to produce an OR instead of a partial register move.

-Andrea

Regardless of the solution that you decide to implement in the end, it is nice to see that we no longer have the regalloc constraint on physical register %CL introduced by the shift right.

-Andrea

This revision is now accepted and ready to land.Nov 13 2018, 7:10 AM

...
An alternative approach that uses MOVZ+OR instead of a MOV8rr is not bad. In particular, if the compiler knows how to optimize out the MOVZ by using the "known-zero-bits", then we may want to produce an OR instead of a partial register move.

Yeah, I think I will go with MOVZ+OR.

RKSimon requested changes to this revision.Nov 14 2018, 1:34 AM

Reopening until the MOVZ+OR changes are in

This revision now requires changes to proceed.Nov 14 2018, 1:34 AM

Use MOVZ+OR.

On these tests, I would even say this new MOVZ+OR variant is better.

RKSimon added inline comments.Nov 14 2018, 6:32 AM
test/CodeGen/X86/extract-bits.ll
54 (On Diff #174003)

This doesn't look like it's zero-extending properly - the upper bits of RAX/RCX will both contain garbage, as movb doesn't zero-extend.

On these tests, I would even say this new MOVZ+OR variant is better.

It looks like the MOVZ is missing in various tests.
However, it shouldn't have been optimized out.

See for example your @bextr64_d5_skipextrauses(i64 %val, i64 %numskipbits, i64 %numlowbits).
Here, the value of numskipbits (our %RSI) is unknown. We cannot safely assume that the upper 56 bits of numskipbits are cleared.
So, the OR instruction has to be preceded by a zero-extend of %RSI.

If instead %numskipbits were declared as zeroext i8 instead of i64, then the compiler would know for sure that the upper bits are all zero, and we could just generate an OR.
I think the problem is that you are constructing an 8-bit quantity by inserting into an implicit definition instead of a zero.
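
A toy illustration of why the missing zero-extend is unsafe (plain C++, hypothetical values): if the upper 56 bits of %RSI hold garbage, the 64-bit OR that assembles the control word corrupts the LENGTH byte.

  #include <cstdint>

  // numskipbits arrives in a 64-bit register whose upper 56 bits are unspecified.
  const uint64_t skip_reg    = 0xdeadbeefdeadbe05ull; // low byte = 5, the rest is garbage
  const uint64_t len_shifted = uint64_t(32) << 8;     // LENGTH = 32 in bits 15:8

  const uint64_t control_buggy = len_shifted | skip_reg;          // garbage ORed into the LENGTH byte
  const uint64_t control_fixed = len_shifted | (skip_reg & 0xff); // zero-extend first: correct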

lebedev.ri planned changes to this revision.Nov 14 2018, 6:39 AM
lebedev.ri marked an inline comment as done.

...
I think the problem is that you are constructing a 8-bit quantity by inserting on an implicit definition, instead of a zero.

Yes, sorry about that.
I have now resorted to truncating to i8, zero-extending to i16, and inserting into the proper [undef] register.
On these tests it has the same effect as directly zero-extending to the proper register size, so I'm not sure - maybe I'm again over-engineering this?

lebedev.ri added inline comments.Nov 14 2018, 7:21 AM
test/CodeGen/X86/extract-bits.ll
54 (On Diff #174003)

Oh indeed.

andreadb accepted this revision.Nov 14 2018, 11:15 AM

LGTM.

Thanks.

P.s.: I wonder if the MOVZ would go away if the input operand were declared as "zeroext i8". We should check that (maybe not in this patch).

craig.topper added inline comments.Nov 14 2018, 11:24 AM
lib/Target/X86/X86ISelDAGToDAG.cpp
2874 (On Diff #174038)

When does this ever happen? Type and op legalization should have canonicalized all shifts to use an i8 amount.

2879 (On Diff #174038)

Can we just directly zero extend to NVT and avoid the insert_subreg?

lebedev.ri added inline comments.Nov 14 2018, 12:14 PM
lib/Target/X86/X86ISelDAGToDAG.cpp
2874 (On Diff #174038)

Not all. See the "if (NBits.getValueType() != NVT) {" check above.
It catches (used to catch?) some cases in patterns c/d.

I don't have a testcase for *this* branch; I'm simply being cautious based on that branch.

2879 (On Diff #174038)

I answered that in https://reviews.llvm.org/D54095#1298483:

On these tests it has the same effect as directly zero-extending to the proper register size, so I'm not sure - maybe I'm again over-engineering this?

I can't test right now, but I'm basically thinking about an i16 %shamt,
which would (should) be truncated to i8, as you have said; but since it was i16,
the ZERO_EXTEND to i16 should be omittable? (and we do not care about the other bits)

craig.topper added inline comments.Nov 14 2018, 2:11 PM
lib/Target/X86/X86ISelDAGToDAG.cpp
2874 (On Diff #174038)

Since you know for sure the shift amount came from an SRL node, I think you can just get by with an assert.

2879 (On Diff #174038)

So my concern is that a zero_extend to i16 is 1 byte longer than a zero_extend to i32/i64. It also has a partial register dependency, to preserve bits 31:16. We know that, so the isel pattern for zero_extend to i16 emits a zero-extend to i32 and an extract_subreg to avoid it. But your code also emits an insert_subreg, so a later pass has to merge and remove the extract_subreg+insert_subreg pair. It feels like we should just emit the right thing here.

If the shift amount was i16, we are way too late in the process for anything to realize the upper bits are 0. You would have to have an entirely different path for that.

lebedev.ri marked 6 inline comments as done.

Simply do ZERO_EXTEND.

LGTM.

Thank you for the review!

Reopening until the MOVZ+OR changes are in

Done.

RKSimon accepted this revision.Nov 15 2018, 10:43 AM

LGTM cheers

This revision is now accepted and ready to land.Nov 15 2018, 10:43 AM
This revision was automatically updated to reflect the committed changes.