This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Fix loads and stores for predicate vectors
ClosedPublic

Authored by dmgreen on Sep 2 2019, 12:32 PM.

Details

Summary

These predicate vectors can usually be loaded and stored with a single instruction, a VSTR_P0. However this instruction will store the entire P0 predicate, 16 bits, zero-extended to 32 bits, with each lane of the v4i1/v8i1 represented by 4/2 bits.

As far as I understand, when llvm says "store this v4i1", it really does need to store 4 bits (or rather 8, that being the size of a byte, with the bottom 4 as the interesting bits). For example a bitcast from a v8i1 to an i8 is defined as a store followed by a load, which is how the code is expanded.
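
A minimal IR sketch of that bitcast (the function name here is made up, not one of the patch's tests):

  ; Bitcasting <8 x i1> to i8 behaves as if the predicate were stored to
  ; memory and reloaded as a byte, so exactly 8 bits are observable.
  define i8 @cast_v8i1_to_i8(<8 x i1> %v) {
  entry:
    %b = bitcast <8 x i1> %v to i8
    ret i8 %b
  }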

So this instead lowers the v4i1/v8i1 load/store through some shuffles to get the bits into the correct positions. This, as you might imagine, is not as efficient as a single instruction, but I believe it is needed for correctness. v16i1 equally should not load/store 32 bits, only the 16 bits of data. Stack loads/stores still use the VSTR_P0 (as can be seen by the test not changing). This is fine as they are self-consistent; it is only "externally observable loads/stores" (from our point of view) that need to be corrected.
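
To illustrate, a hypothetical reduced case of the kind of store that now gets the shuffle lowering; the four i1 lanes must end up in the low 4 bits of a single stored byte:

  ; Storing <4 x i1> writes one byte (low 4 bits are the lanes), not the
  ; whole 32-bit zero-extended P0 register.
  define void @store_v4i1(<4 x i1> %p, <4 x i1>* %dst) {
  entry:
    store <4 x i1> %p, <4 x i1>* %dst, align 1
    ret void
  }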

The test changes here are in pred-bitcast (which is no longer incorrect), pred-ldst (which is obviously a lot larger, but which I don't believe will be generated often), and masked ld/st. The masked ld/st tests we should be able to optimise better with a few folds; we should not be generating masked loads/stores only to expand them like this.
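
For reference, the masked loads referred to take the predicate vector as their mask operand; a sketch (function name hypothetical, not the exact test) using the standard intrinsic:

  declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32, <4 x i1>, <4 x i32>)

  ; The <4 x i1> mask here is the predicate vector whose loads/stores
  ; this patch fixes.
  define <4 x i32> @masked_load_v4i32(<4 x i32>* %src, <4 x i1> %mask) {
  entry:
    %l = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %src, i32 4, <4 x i1> %mask, <4 x i32> zeroinitializer)
    ret <4 x i32> %l
  }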

Diff Detail

Event Timeline

dmgreen created this revision. Sep 2 2019, 12:32 PM
SjoerdMeijer added a comment. Edited Sep 9 2019, 3:03 AM

Hi Dave, I agree with your analysis. The codegen for this looks horrible, but it is what it is.

One question though. Loading and storing data... do we need to worry about LE and BE here?

llvm/lib/Target/ARM/ARMISelLowering.cpp
8745

nit, typo: itto

samparker added inline comments. Sep 9 2019, 3:52 AM
llvm/test/CodeGen/Thumb2/mve-pred-bitcast.ll
23

I'm missing something here... from my understanding:

  • Select 16 bytes, taken from q2 (0xff) and q1 (0x0), building a vector predicate mask in q1.
  • Then we take the bottom 4 bytes from q1, the mask, putting each into a 32-bit lane of q2.
  • Then we compare the 32-bit lanes of q2 against zero.
  • Then we select bytes from q0 (%a) and q1 (zero).

It's the second point that I don't understand... why do we only access the lower lanes of q1?

dmgreen marked an inline comment as done. Sep 9 2019, 5:38 AM

One question though. Loading and storing data... do we need to worry about LE and BE here?

I believe that this is OK. At least for these two we are only storing a single byte, so it shouldn't be an issue.

It does bring up the issue of whether the v16i1 is correct. It looks like the instruction is loading/storing 32 bits, so it might not really be right to use if we should only be storing 16 bits, even in LE. And for BE it would be putting the bits in the wrong place. Let me try and adjust that one too.
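
Sketching the v16i1 case in IR terms (a hypothetical example, not one of the tests): the store below should write exactly 2 bytes, so a 32-bit VSTR_P0 would be wrong for the reasons above:

  ; Storing <16 x i1> writes exactly 16 bits (2 bytes), one bit per lane.
  define void @store_v16i1(<16 x i1> %p, <16 x i1>* %dst) {
  entry:
    store <16 x i1> %p, <16 x i1>* %dst, align 2
    ret void
  }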

llvm/test/CodeGen/Thumb2/mve-pred-bitcast.ll
23

This is converting an i4, as in the bottom 4 bits of r0, into a v4i1, as in the 16 bits of P0. It needs to be sort of "shuffled" or "signext_inreg'd" to get those bits into the correct places.

So the "step 0" in you list would be copy the bits from r0 to p0 using the msr, and the 4 bits of interest are in bottom 4 bits of p0. The rest is the awkward sign extend.

The alternative is to do this in integer instructions. That may be better, depending on the circumstances. I don't think (and hope) this will come up a lot though. We just need to not get it wrong.
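
Roughly, the case in question looks like this in IR (a sketch; the exact test body in mve-pred-bitcast.ll may differ):

  ; The low 4 bits of the i4 become the lanes of a <4 x i1> predicate,
  ; which then selects between %a and zero.
  define <4 x i32> @bitcast_to_v4i1(i4 %b, <4 x i32> %a) {
  entry:
    %c = bitcast i4 %b to <4 x i1>
    %s = select <4 x i1> %c, <4 x i32> %a, <4 x i32> zeroinitializer
    ret <4 x i32> %s
  }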

samparker added inline comments. Sep 9 2019, 5:58 AM
llvm/test/CodeGen/Thumb2/mve-pred-bitcast.ll
23

Ah, yes, I hadn't considered the i4 properly. Cheers.

dmgreen updated this revision to Diff 219342. Sep 9 2019, 7:07 AM
dmgreen retitled this revision from [ARM] Fix loads and stores for v4i1 and v8i1 to [ARM] Fix loads and stores for predicate vectors.
dmgreen edited the summary of this revision. (Show Details)

Now with v16i1, which doesn't need the extracts/buildvector, just going through the vmrs into a vstrh store.

This has the nice effect for bitcasts of eliding the load and store, giving us the bitcast lowering to vmsr for free (unfortunately the stack is still realigned; that should be fixable with a better preferred vector alignment, though).

SjoerdMeijer accepted this revision. Sep 9 2019, 7:16 AM

I am happy with this if Sam is happy too.

This revision is now accepted and ready to land. Sep 9 2019, 7:16 AM
This revision was automatically updated to reflect the committed changes.