This is an archive of the discontinued LLVM Phabricator instance.

[Target][ARM] Tune getCastInstrCost for extending masked loads and truncating masked stores
ClosedPublic

Authored by dmgreen on Apr 30 2020, 2:52 AM.

Details

Summary

This patch uses the feature added in D79162 to fix the cost of a sext/zext of a masked load, or a trunc for a masked store.
Previously, those were considered cheap or even free, but it's absolutely not the case if the cast's result type doesn't fit in a 128 bits register. They're expensive!

Examples:

The cast fits in a 128 bits register:

// LLVM
define dso_local arm_aapcs_vfpcc <8 x i16> @square(<8 x i8>*, <8 x i8>, <8 x i8>) #0 {
    %mask = trunc <8 x i8> %1 to <8 x i1>
    %res = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8 (<8 x i8>* %0, i32 4, <8 x i1> %mask, <8 x i8> %2)
    %ext = sext <8 x i8> %res to <8 x i16>
    ret <8 x i16> %ext
}
// ASM
        vpt.i32 ne, q0, zr
        vldrbt.s16      q0, [r0]
        vmovlb.s8       q1, q1
        vpsel   q0, q0, q1

The cast doesn't fit in a 128 bits register:

// LLVM
define dso_local arm_aapcs_vfpcc <8 x i32> @square(<8 x i8>*, <8 x i8>, <8 x i8>) #0 {
    %mask = trunc <8 x i8> %1 to <8 x i1>
    %res = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8 (<8 x i8>* %0, i32 4, <8 x i1> %mask, <8 x i8> %2)
    %ext = sext <8 x i8> %res to <8 x i32>
    ret <8 x i32> %ext
}
// ASM
        vpt.i32 ne, q0, zr
        vldrbt.u16      q0, [r0]
        vpsel   q1, q0, q1
        vmov.u16        r0, q1[0]
        vmov.32 q0[0], r0
        vmov.u16        r0, q1[1]
        vmov.32 q0[1], r0
        vmov.u16        r0, q1[2]
        vmov.32 q0[2], r0
        vmov.u16        r0, q1[3]
        vmov.32 q0[3], r0
        vmov.u16        r0, q1[4]
        vmov.32 q2[0], r0
        vmov.u16        r0, q1[5]
        vmov.32 q2[1], r0
        vmov.u16        r0, q1[6]
        vmov.32 q2[2], r0
        vmov.u16        r0, q1[7]
        vmov.32 q2[3], r0
        vmovlb.s8       q0, q0
        vmovlb.s8       q1, q2
        vmovlb.s16      q0, q0
        vmovlb.s16      q1, q1

I've updated the costs to better reflect reality, and added a test for it in test/Analysis/CostModel/ARM/cast.ll.

I've also added a vectorizer test that showcases the improvement: in some cases, the vectorizer will now choose a smaller VF when tail-predication is enabled, which results in better codegen. (Because if it were to use a higher VF in those cases, the code we see above would be generated, and the vmovs would block tail-predication later in the process, resulting in very poor codegen overall)

Please note that the contents of this patches are subject to changes depending on the outcome of the review of D79162, but the cost calculation logic shouldn't change too much (just the way this case is detected).

Diff Detail

Event Timeline

Pierre-vh created this revision.Apr 30 2020, 2:52 AM
dmgreen added inline comments.Apr 30 2020, 1:53 PM
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
330–344

Technically it's not that we can't split them, it's that we choose not to. Essentially we will have to end up paying the cost somewhere, and from a cost-modelling perspective, this is a good place to say that splitting them will be expensive, especially in a tail predicated loop where we won't be able to split a vctp sensibly.

Also do we end up scalarize the masked load? That might change where we put the cost (we could just implement getMemoryOpCost and put the cost their instead).

Pierre-vh updated this revision to Diff 263136.May 11 2020, 3:24 AM

Updating the patch following the changes to D79162

Pierre-vh marked an inline comment as done.May 11 2020, 3:24 AM
dmgreen added inline comments.May 12 2020, 10:25 AM
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
339

This should now be (CCH == Normal || CCh == Masked) ?
The costs wouldn't really be right for other types I don't think.
We may need to fix those up later, but that can be done in a different patch.

llvm/test/Analysis/CostModel/ARM/cast.ll
2157 ↗(On Diff #263136)

Can you make sure there are tests for the legal types: <4xi8>-><4xi32>, <4xi16>-><4xi32> and <8xi8>-><8xi16>

2187 ↗(On Diff #263136)

Again you add stores for trunc of legal types, <4xi32>-><4xi8>, <4xi32>-><4xi16> and <8xi16>-><8xi8>

llvm/test/Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll
116

Can you try and cleanup the test.

Pierre-vh updated this revision to Diff 263645.May 13 2020, 1:50 AM
Pierre-vh marked 4 inline comments as done.
  • Cleaning up vectorizer tests (removed all extra attributes)
  • Refactored & Improved the cast.ll test
  • Removed isLoadOrMaskedLoad, now using the CCH instead.
dmgreen accepted this revision.May 21 2020, 7:10 AM

Thanks. LGTM, so long as we can get the first patch in too.

This revision is now accepted and ready to land.May 21 2020, 7:10 AM
dmgreen commandeered this revision.Jun 18 2020, 5:58 AM
dmgreen edited reviewers, added: Pierre-vh; removed: dmgreen.
This revision now requires review to proceed.Jun 18 2020, 5:58 AM
dmgreen updated this revision to Diff 271690.Jun 18 2020, 5:59 AM

I've tried to rebase this onto current trunk and some of the FP16 / Trunc cost changes.

This revision was not accepted when it landed; it landed in state Needs Review.Jul 29 2020, 5:41 AM
This revision was automatically updated to reflect the committed changes.