This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE2] Combine add+lsr to rshrnb for stores
ClosedPublic

Authored by MattDevereau on Jul 14 2023, 8:09 AM.

Details

Summary

[AArch64][SVE2] Combine add+lsr to rshrnb for stores

The example sequence

add z0.h, z0.h, #32
lsr z0.h, #6
st1b z0.h, x1

can be replaced with

rshrnb z0.b, #6
st1b z0.h, x1

As the top half of the destination elements are truncated.

In similar fashion,

add z0.s, z0.s, #32
lsr z1.s, z1.s, #6
add z1.s, z1.s, #32
lsr z0.s, z0.s, #6
uzp1 z0.h, z0.h, z1.h

Can be replaced with

rshrnb z1.h, z1.s, #6
rshrnb z0.h, z0.s, #6
uzp1 z0.h, z0.h, z1.h

Diff Detail

Event Timeline

MattDevereau created this revision.Jul 14 2023, 8:09 AM
MattDevereau requested review of this revision.Jul 14 2023, 8:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 14 2023, 8:09 AM

This is a nice optimisation @MattDevereau, thanks! I found there is also another case we could support with loops like this where the store doesn't come straight afterwards:

void foo(unsigned short *dest, unsigned short *src, long n) {
  for (long i = 0; i < n; i++)
    dest[i] += ((src[i] + 32) >> 6);
}

In this case the IR sequence is add, lshr, trunc since the truncate doesn't get absorbed into the store. Maybe it's worth seeing if you can reuse your code in tryCombineStoredNarrowShift for this case too?

MattDevereau edited the summary of this revision. (Show Details)
kmclaughlin added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20099

Hi @MattDevereau,
You might need to check the value returned by DAG.getSplatValue here as I think if the operand is not a splat as expected, the dyn_cast will fail.
Please can you add a negative test for this scenario as well?

@kmclaughlin Thank you. I've added two tests @neg_trunc_lsr_add_op1_not_splat and @neg_trunc_lsr_op1_not_splat to bail out of emitting rshrnb when the RHS operands are not splat values.

I've also added a test @neg_add_has_two_uses and a check that the add does not have more than one use. Without this it is possible to generate this regression which costs 2 extra cycles

neg_add_two_use:
  ptrue p0.h
  ld1h { z0.h }, p0/z, [x0]
  rshrnb z1.b, z0.h, #6
  add z0.h, z0.h, #32 // =0x20
  add z0.h, z0.h, z0.h
  st1h { z0.h }, p0, [x2, x3, lsl #1]
  st1b { z1.h }, p0, [x1, x3]
  ret

Thank you for adding the new tests @MattDevereau, I just have a couple of small suggestions in the trySimplifySrlAddToRshrnb function.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20092

I think if you change this to dyn_cast_or_null you can remove the additional isSplatValue check at the beginning of the function.

20101

Similarly here, I think you can remove the isSplatValue above by using dyn_cast_or_null

Replaced explicit checks for splat values with dyn_cast_or_null

MattDevereau marked 3 inline comments as done.Jul 26 2023, 6:48 AM
kmclaughlin accepted this revision.Jul 27 2023, 6:14 AM

Thanks @MattDevereau, there's one nit but otherwise LGTM!

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20151–20157

nit: The braces here aren't necessary

20164–20170

As above :)

This revision is now accepted and ready to land.Jul 27 2023, 6:14 AM

Hello. When adding these for NEON we were able to the transform via tablegen patterns at selection time, which has some benefits about leaving the nodes generic for as long as possible. In this case because of the differences in the instructions I would expect that it need demanded bits, so doing it as a combine probably makes sense.

It's best not to generate a MachineNode directly though if we can.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20111

We should avoid creating machine nodes in SelectionDAG combines. It's like a layering violation. Can you change this to either generate an intrinsic, or preferably add a new AArch64ISD node for it?

hassnaa-arm added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20106

Hi Mat,
I think that check doesn't consider the case when the AddValue has enabled bits other than the most significant bit.
ex: the ShiftValue is 6, and the AddValue is 33 (100001)b
Is that correct ?

20106

Do you handle the case when the destination of the AddOperation has an enabled bit in the same index of the enabled bit in the add Value ?
In that case the Add operation will has side affect, and by combining that case into rshrnb, you cancel the side effect of the AddOperation.
Do you agree on that ? or I'm missing something ??

MattDevereau added inline comments.Aug 1 2023, 6:10 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20106

I think that check doesn't consider the case when the AddValue has enabled bits other than the most significant bit.
ex: the ShiftValue is 6, and the AddValue is 33 (100001)b

This check would correctly reject that example, which is not a case for this combine. As per the documentation for rshrnb the operation is

integer res = (UInt(element) + (1 << (shift-1))) >> shift;

As 33 is not possible to get from left shifting 1, the combine will correctly bail. The lower bits in this case would be kept as the combine wouldn't fire.

20106

What side effect do you mean exactly?

Do you handle the case when the destination of the AddOperation has an enabled bit in the same index of the enabled bit in the add Value ?

I don't think the bits in the destination register of the add matter, register allocation should handle that fine and give us a non conflicting destination register.

20111

Thanks for pointing that out. I'm working on emitting a new AArch64ISD node now, however it adds a bit more complexity to the patch.

Matt added a subscriber: Matt.Aug 1 2023, 2:28 PM
hassnaa-arm added inline comments.Aug 2 2023, 2:31 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20106

Sorry I didn't express it clearly.
For example,
The add operation is src + (100000)b
I mean that what if the src has an enabled bit at the index corresponding to the '1' in the addedVal, index 5
so that means there will be 1+1 and then index 6 will be affected.
How that case is handled ?

MattDevereau added inline comments.Aug 2 2023, 3:18 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20106

Sorry but I don't understand what you mean by case. I think what you are referring to is the rounding behaviour of rshrnb which is the intended calculation and not a side effect.

Emit an AArch64ISD node instead of the machine node directly.
Include D->S addressing mode of RSHRNB

Thanks. This looks good, but it might need to be using 1ULL for the shift value and there are few extra suggestions below. Otherwise LGTM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20085

Can you either add a check that the VT is one that is valid for RSHRNB, or an assert that we only handle those types?

20105

This may need to be 1ULL << (ShiftValue - 1), as the number could be a 64bit value.

20109–20111
SDValue Rshrnb = DAG.getNode(
      AArch64ISD::RSHRNB_I, DL, VT,
      Add->getOperand(0), DAG.getTargetConstant(ShiftValue, DL, MVT::i32));

(Or just return it directly)

20150

trySimplifySrlAddToRshrnb could take a SDValue, instead of needing the cast.

20776

I think RshrnbVT is just ValueVT?

llvm/lib/Target/AArch64/AArch64ISelLowering.h
217

This might deserve to be in it's own little section. There may be a number of T/B nodes we end up adding, if we treat them in the same way.

llvm/lib/Target/AArch64/SVEInstrFormats.td
4309

It might be better to generate a nxv16i8->nxv8i16 RSHRN with a bitcast/nvcast back to nxv8i16.

Allen added a subscriber: Allen.Aug 4 2023, 7:05 PM
MattDevereau marked 6 inline comments as done.
MattDevereau added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20085

Since were matching the Rshrnb ISD node emitted with the patterns in tablegen that match nvx8i16 -> nxv16i8, nxv4i32 -> nxv8i16, nxv2i64 -> nxv4i32 I've added an explicit block for this now:

EVT ResVT;
if (VT == MVT::nxv8i16)
  ResVT = MVT::nxv16i8;
else if (VT == MVT::nxv4i32)
  ResVT = MVT::nxv8i16;
else if (VT == MVT::nxv2i64)
  ResVT = MVT::nxv4i32;
else
  return SDValue();

We can use ResVT for the SDNode creation of Rshrnb and use VT for the bitcast back to the type that fits the DAG correctly.

20105

Done, and changed int64_t AddValue to uint64_t AddValue to line up with the change.

20109–20111

This is directly returning a bitcast SDNode back to the original VT now.

20150

I get the feeling this is a side-grade since I need to pass in SDLoc as an extra parameter this way, but I can see the cast looks a bit ugly. I'm not fussy on this so I'll go with your suggestion.

dmgreen accepted this revision.Aug 8 2023, 7:04 AM

Thanks. LGTM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20083

SDValue is usually passed by value, not as a pointer. It might also be able to generate the DL from SDLoc DL(Srl);, depending on where it is best for the debug loc to come from.

This revision was automatically updated to reflect the committed changes.