This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] handle more splat loads
ClosedPublic

Authored by shchenz on Jul 22 2021, 8:15 AM.

Details

Summary

This is based on the patch in https://reviews.llvm.org/D106353 provided by @nemanjai

Add more splat load types handling and more tests.

Diff Detail

Event Timeline

shchenz created this revision.Jul 22 2021, 8:15 AM
shchenz requested review of this revision.Jul 22 2021, 8:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 22 2021, 8:15 AM
nemanjai added inline comments.Jul 26 2021, 4:13 AM
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
9065

Why take Op as a non-const reference? I don't see it being modified so it is not an output operand AFAICT.

llvm/lib/Target/PowerPC/PPCInstrVSX.td
2844

As far as I can tell, this is in a block that only sets let Predicates = [VSX]. It is not safe to use direct moves on subtargets that don't have direct moves.

llvm/test/CodeGen/PowerPC/load-and-splat.ll
223

Definitely bad. P7 doesn't have direct moves.

shchenz updated this revision to Diff 361984.Jul 27 2021, 5:08 AM
shchenz marked an inline comment as done.

1: Lint warning fix
2: LD_SPLAT for v8i16/v16i8 on pwr7

shchenz marked 2 inline comments as done.Jul 27 2021, 5:11 AM
shchenz added inline comments.
llvm/test/CodeGen/PowerPC/load-and-splat.ll
223

For this case, now we use one more instruction than the left ones. But I think it should still be a win as we don't use the stack which is always good for some opts, like leaf calls related optimizations. And now it uses fewer memory operations 2 vs 3.

shchenz marked an inline comment as done.Aug 8 2021, 7:29 PM

ping

gentle ping

@nemanjai Hi Nemanja, could you please help to have another look at this issue? Thanks.

gentle ping

gentle ping...

jsji added inline comments.Oct 26 2021, 8:01 AM
llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
5840

VSPLTB?

5862

clang-tidy has warneed, don't use 'else' after 'return'.

5870

The code patterns are very similar for these two cases, can we common them?
eg: something like:

SDNode *Mask = CurDAG->getMachineNode(
    Subtarget->isLittleEndian() ? PPC::LVSR : PPC::LVSL, dl, N->getValueType(0) ,
    CurDAG->getRegister(ZeroReg, MVT::i64), N->getOperand(1));

SDNode *Load =
    CurDAG->getMachineNode(PPC::LVX, dl, N->getValueType(0), MVT::Other,
                           {CurDAG->getRegister(ZeroReg, MVT::i64),
                            N->getOperand(1), N->getOperand(0)});
     opcode = ...
  splatconst=...

     if (N->getValueType(0) == MVT::v8i16){
         SDNode *LoadHigh = CurDAG->getMachineNode(
                        PPC::LVX, dl, MVT::v16i8, MVT::Other,
                        {SDValue(CurDAG->getMachineNode(
                       LIOpcode, dl, MVT::i32,
                       CurDAG->getTargetConstant(1, dl, MVT::i8)),
                       0),
                       N->getOperand(1), SDValue(Load, 1)});
   
     Load  = LoadHigh;
 opcode=...
spltconst =...

 }

CurDAG->ReplaceAllUsesOfValueWith(SDValue(N, 1), SDValue(Load, 1));
transferMemOperands(N, Load);

CurDAG->SelectNodeTo(
    N, opcode, N->getValueType(0),
    spliconst,
    SDValue(Perm, 0));
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
9200

Can we split this into multiple early returns, with comments to corresponding ones.

//case 1 ...
if(...) 
   return SDValue();

//case 2 ...
if(...) 
   return SDValue();
...
llvm/lib/Target/PowerPC/PPCInstrVSX.td
2839

Do we need some conditional checking here? As LFIWZX is changing the type to int now, are there any precision change?

3560

Do we need to predicate 64 bit before using MTVSRD?

4113

Do we need alignment check here for LXSIHZX?

llvm/test/CodeGen/PowerPC/canonical-merge-shuffles.ll
646

f0 here looks weird regarding register classes, xxspltd does need vs0 to get the element 0 of doubleword. Maybe we should follow up to see why we are printing f0 here in ASMPrinter.

llvm/test/CodeGen/PowerPC/scalar_vector_test_3.ll
18

Unrelated changes?

shchenz updated this revision to Diff 382973.Oct 28 2021, 3:08 AM
shchenz marked 8 inline comments as done.

address @jsji comments

llvm/lib/Target/PowerPC/PPCInstrVSX.td
2839

I think we don't need a conditional check. LFIWZX will not change the raw_data in the registers. For example, if the data stored in the memory is float type 1.5(0x000000003fc00000), after LFIWZX, the raw data stored in result FPR is still 0x000000003fc00000, not integer 1?

3560

Thanks, this is a very good question. Yes, normally, we need to predicate 64 bit if MTVSRD emits any value in the high 32bits of the first double world.

However, for this case, seems we don't need the predicate, because the useful bits for this case are the lowest 16/8 bits, so even on the 32 bit arch, we still can ensure the lowest 16/8 bits are having the right value.

I add a comment to indicate why we don't need to predicate 64 bit.

4113

I think non-alignment load may introduce perf issue, but it will not have functionality issue? Seems load like LDX also not checks the alignment. Do you see any issues here?

llvm/test/CodeGen/PowerPC/canonical-merge-shuffles.ll
646

Sure, I will check this later. vs0 seems a more reasonable input.

llvm/test/CodeGen/PowerPC/scalar_vector_test_3.ll
18

It is also for the weird change: xxspltd v2, f0

jsji accepted this revision as: jsji.Nov 1 2021, 8:45 AM

LGTM. Thanks.

llvm/lib/Target/PowerPC/PPCInstrVSX.td
4113

Is it possible that we call PPCldsplat with unaligned address? If so, then we may load from wrong address.

llvm/test/CodeGen/PowerPC/scalar_vector_test_3.ll
18

I meant the ;, anyway, I have committed the changes in bd932f7499ff7ab958f5bc2f55dcf4b06cd87950, you should be able to rebase.

This revision is now accepted and ready to land.Nov 1 2021, 8:45 AM
This revision was landed with ongoing or failed builds.Nov 2 2021, 10:33 PM
This revision was automatically updated to reflect the committed changes.
nemanjai added inline comments.Nov 4 2021, 5:08 AM
llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
5831

Sorry, I had this comment but didn't submit because I hand't finished the review. Please flip the condition and turn it into an early exit.

5837

Seems like all the code is also inside this block. Please flip the condition and early exit. You should really look for opportunities to do this and reduce the level of nesting.

I think it would also be useful to provide a much simpler code path for loads that are aligned at 16 bytes. In that case, all you need is LVX + SPLAT (where the splat index depends on endianness.

Example:

typedef signed short __attribute__((aligned(16))) AlignedShort;
vector signed short test(AlignedShort *Ptr) {
  return (vector signed short)*(Ptr + 3);
}
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
9080

This is of course a minor nit, but please be mindful of the convention to make comments proper sentences with capitalization and punctuation.

9165

This is not trivially obvious and therefore requires a comment. Something like:

// If the input load is an extending load, it will be an i32 -> i64
// extending load and isValidSplatLoad() will update NewOpcode.

Generally, I think this pattern is somewhat dangerous. It makes an assumption that a different function has specific behaviour that isn't clearly documented. There is little assurance that someone won't update isValidSplatLoad() to accept extending loads from i16 to i64, etc. So I think you should add an assert here to ensure the types are what you expect them to be.

9176

s/Execlude/Exclude

9200

I agree with separating out the LFIWAX/LFIWZX case because the condition is very different. But for the LXVRHX/LXVRBX, the conditions are essentially the same. Those should be combined in an obvious way:

// case 2 - lxvr[bh]x
// 2.1: load result is at most 16 bits;
// 2.2: build a vector with above loaded value;
// 2.3: the vector has only one value at index 0, others are all undef;
// 2.4: on LE target, so that lxvr[bh]x does not need any permute.
if (NumUsesOfInputLD == 1 && Subtarget.isLittleEndian() &&
    Subtarget.isISA3_1() && Op->getValueType(0).getSizeInBits() <= 16)
llvm/lib/Target/PowerPC/PPCInstrVSX.td
3560

I understand that we don't really care about the undefined bits because they'll be overwritten when we splat, but why not just use MTVSRWZ and not even need the comment?

shchenz marked 7 inline comments as done.Nov 4 2021, 10:32 PM

Thanks for your comments @nemanjai

I addressed most of the comment in commit 96950270669acd3c342a266562ff3a41464cc0a0

One comment related to the code gen improvement for the aligned load will be addressed in the follow-up Phabricator patch.

llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
5837

I'd like to follow up on the 16-bytes aligned splat load in another patch.

llvm/lib/Target/PowerPC/PPCInstrVSX.td
3560

Thanks for the good suggestion. This seems also resolve one crash related to INSERT_SUBREG(x, x, sub_32) on 32-bit AIX. I addressed this comment in D113236