Page MenuHomePhabricator

[AArch64] Cost-model vector splat LD1Rs to avoid unprofitable SLP vectorisation

Authored by SjoerdMeijer on Wed, Mar 8, 4:55 AM.



This slightly increases the costs of InsertElement instructions that are part of a vector splat sequence, i.e. a load, InsertElement and a shuffle. The resulting LD1R is a high latency instruction, and this slight increase in costs avoids SLP vectorisation for a couple of cases where this isn't profitable.

SPEC 2017 FP and INT performance results with this change are completely neutral so only seem to affect cases like the changed regression tests.


Diff Detail

Event Timeline

SjoerdMeijer created this revision.Wed, Mar 8, 4:55 AM
SjoerdMeijer requested review of this revision.Wed, Mar 8, 4:55 AM
Herald added a project: Restricted Project. · View Herald TranscriptWed, Mar 8, 4:55 AM
SjoerdMeijer added inline comments.Wed, Mar 8, 5:00 AM

Ah, only after uploading this diff I noticed that the function names indicate that this should be profitable... I had missed that.

Hmmm.... I guess that then needs looking into.

SjoerdMeijer added inline comments.Wed, Mar 8, 5:02 AM

Eyeballing this, my first reaction is that I slightly doubt that SLP will be profitable, but I guess that's what I need to find out.

ABataev added a subscriber: ABataev.Wed, Mar 8, 5:35 AM
ABataev added inline comments.

Working on fma vectorization support in SLP, hope to spend more time on this later this month.

SjoerdMeijer added inline comments.Wed, Mar 8, 7:16 AM

It's difficult to see how the SLP variant is ever going to be faster for this example with just a handful of scalar instructions (ignoring the loads/stores) that mostly have overlap in dispatch and execution:

[0,3]     D======eeeER   ..   fmul      s4, s1, s0
[0,4]     D======eeeER   ..   fmul      s1, s2, s1
[0,5]     D=======eeeER  ..   fmul      s5, s3, s2
[0,6]     D=======eeeER  ..   fmul      s0, s3, s0
[0,7]     D==========eeER..   fsub      s2, s4, s5
[0,8]     D==========eeER..   fadd      s0, s0, s1

especially if we have to do things like shuffles:

[0,2]     D==eeeeeeeeER  .    .    .    ..   ld1r.2s    { v1 }, [x8], #4
[0,3]     D==========eeeeeeER .    .    ..   ldr        d2, [x8]
[0,4]     D================eeeER   .    ..   fmul.2s    v1, v1, v2
[0,5]     D================eeeER   .    ..   fmul.2s    v0, v2, v0[0]
[0,6]     D===================eeER .    ..   rev64.2s   v1, v1
[0,7]     D=====================eeER    ..   fsub.2s    v3, v0, v1
[0,8]     .D====================eeER    ..   fadd.2s    v0, v0, v1
[0,9]     .D======================eeER  ..   mov.s      v3[1], v0[1]
[0,10]    .D========================eeER..   str        d3, [x0]
[0,11]    .D========================eeeeER   st1.s      { v2 }[1], [x1]

Here I am showing some loads/stores, but that's just to show they are not simple loads/stores anymore but more high-latency instructions, and perhaps more importantly we have got the REV and extract, so with FMAs things might look a bit better but it's difficult to beat the scalar variant. The SLP timeline is a little bit skewed of the post-inc and the result being available a lot earlier, but the bottom line is that there is very little parallelism here as we are working on 2 floats and there's the overhead of the vectorisation.

I have run some micro-benchmarks, and I've measured that the SLP variant is indeed slower.

@fhahn , @dmgreen : I think it makes to also not SLP vectorise this function (and there other 2 below). Do you agree?

Hello. I had to remind myself where this came from. It looks like it was introduced in D123638, and there were some comments already about the performance not always being ideal. It apparently helped for some <2 x double> vectorization. I'm not sure if there it a perfect answer, but an effective cost of 2 for the throughput of the ld1r would seem to match the hardware better. This doesn't alter isLegalBroadcastLoad and the tests added in D123638 don't seem to change.

The cost of a broadcast is already 1 so the code here doesn't seem like it would do much any more. It could be checking the cost type and returning 0 for costsize and 1 for throughput. Otherwise this bit of code could probably be removed.


I think we will fold the dup into the fmul as opposed to the load now, which seems a little cheaper.
I'm not sure if that makes it cheaper overall though. I agree that the rev and the mov make the timing tight. From the look of the test it looks like "profitable" here just means that the non-fast version would be slightly more instructions, not that it was known to be profitable. i.e the test is for testing fma combining, this was just a negative test for that issue.

SjoerdMeijer updated this revision to Diff 504113.EditedFri, Mar 10, 5:52 AM

I like it when I can delete things and achieve the same, so I have just done that. This was my understanding of your comments. Thanks for the suggestion and for looking into this.

So with this new revision, the SLP vectorisation tests won't be vectorised, which is what we want.
For the cost-model test, there are 2 changes compared to the previous revision: there is cost of 21 and 42 for two shuffles with half types. Probably need looking into, and fixed in a separate/companion patch with this.

The CostKind can be TCK_RecipThroughput (the default and the one we usually care most about), TCK_Latency, TCK_CodeSize or TCK_SizeAndLatency. I think if we have the code we might as well get TCK_CodeSize correct and return 0 in that case, so the load+dup have a combined cost of 1. TCK_Latency and TCK_SizeAndLatency I'm less sure about, perhaps leave them with the same costs as TCK_RecipThroughput?

So it might be a little better to change the code to this, with a comment explaining that the other costs are expected to be higher even with ld1r:

// Check for broadcast loads.
if (CostKind == TCK_CodeSize && Kind == TTI::SK_Broadcast) {
  bool IsLoad = !Args.empty() && isa<LoadInst>(Args[0]);
  if (IsLoad && LT.second.isVector() &&
    return 0; // broadcast is handled by ld1r

Thanks, I have restored that piece of logic and added the code-size check to it (and added a code size check to the test).
That means that we now get add an additional cost for TCK_RecipThroughput, as well as for TCK_Latency and TCK_SizeAndLatency.

dmgreen accepted this revision.Mon, Mar 13, 7:02 AM

Brilliant, thanks. LGTM.

This revision is now accepted and ready to land.Mon, Mar 13, 7:02 AM
This revision was landed with ongoing or failed builds.Mon, Mar 13, 8:14 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a comment.Fri, Mar 17, 3:40 AM

Just a heads up with are seeing a 10% regression caused by this change in a very SLP sensitive workload (the original source for the slp-fma-loss.ll tests). I still have to double check where the slowdown is coming from exactly.

Just a heads up with are seeing a 10% regression caused by this change in a very SLP sensitive workload (the original source for the slp-fma-loss.ll tests). I still have to double check where the slowdown is coming from exactly.

I thought we were helping your case, not regress it. :(
But thanks for the heads up, am happy to look at it.