[ARM] Parallel DSP IR Pass

Authored by SjoerdMeijer on Jun 13 2018, 7:46 AM.



Armv6 introduced instructions to perform 32-bit SIMD operations. The purpose of
this pass is to do some straightforward IR pattern matching to create ACLE DSP
intrinsics, which map onto these 32-bit SIMD operations. Because this is based
on simple pattern matching and it is Arm specific, we thought a separate IR
pass that runs late would be a good place to do this, rather than e.g. the SLP
vectorizer.

Currently, only the SMLAD instruction gets recognised. This performs two
multiplications with 16-bit operands and adds both products to an accumulator.
Support for more of these DSP instructions will be added later. The pass
triggers on a matrix-multiply kernel in a popular benchmark.
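To make the target pattern concrete, here is a small plain-C++ sketch (not the pass's LLVM code; the function names are mine): a scalar model of SMLAD's semantics, and the shape of 16-bit dot-product reduction loop the pattern matching should recognise:

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of SMLAD: multiply the two signed 16-bit lanes of each
// operand pairwise and add both products to the accumulator.
int32_t smlad_model(uint32_t x, uint32_t y, int32_t acc) {
  int16_t xlo = (int16_t)(x & 0xffffu), xhi = (int16_t)(x >> 16);
  int16_t ylo = (int16_t)(y & 0xffffu), yhi = (int16_t)(y >> 16);
  return acc + (int32_t)xlo * ylo + (int32_t)xhi * yhi;
}

// The kind of reduction loop the IR pattern matching looks for:
// sign-extended i16 loads feeding multiplies that feed an accumulator.
int32_t dot16(const int16_t *a, const int16_t *b, int n) {
  int32_t acc = 0;
  for (int i = 0; i < n; ++i)
    acc += (int32_t)a[i] * (int32_t)b[i];
  return acc;
}
```

Two consecutive iterations of dot16 correspond to one smlad_model call on the packed lane pairs, which is exactly the 2x parallelisation the pass performs.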

Diff Detail

SjoerdMeijer created this revision. Jun 13 2018, 7:46 AM

Hi Sjoerd,

Just a few comments before I go home, I will continue to look tomorrow. I was going to say that maybe this pass should live in the ARM backend, but I also see other passes in the same directory handling target specific intrinsics.


42 ↗(On Diff #151165)

SmallVector instead?

132 ↗(On Diff #151165)

This should only return true if something has changed.

250 ↗(On Diff #151165)

Why do we consider these sequential?

257 ↗(On Diff #151165)

Would it be wiser to explicitly check before calling this?

357 ↗(On Diff #151165)

This interface could be simplified by passing HasFnNoNanAttr instead of the function. Similarly, I don't see why it's necessary to pass the loop and its header.

419 ↗(On Diff #151165)

nit: probably a nicer way to comment these patterns...

513 ↗(On Diff #151165)

Cores will need to support unaligned accesses to do this. It should also be double checked that the load store optimiser doesn't generate any unaligned LDRD later too. I think there is already an option in that pass to check for that.
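For illustration (a plain-C++ model, not code from the patch): the transformation turns two adjacent i16 loads into one 32-bit load, which in general is only 2-byte aligned, hence the need for unaligned-access support:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Two adjacent 16-bit loads replaced by a single widened 32-bit load.
// memcpy keeps this well-defined in C++ at any alignment; the hardware
// LDR it models only has 2-byte alignment guaranteed, so the core must
// tolerate unaligned word accesses. Assumes little-endian lane order,
// as on Arm.
void load_pair(const int16_t *p, int16_t &lo, int16_t &hi) {
  uint32_t w;
  std::memcpy(&w, p, sizeof w);  // the single widened load
  lo = (int16_t)(w & 0xffffu);   // p[0]
  hi = (int16_t)(w >> 16);       // p[1]
}
```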

On targets with NEON, we should prefer NEON vectorization (vmlal etc.) over DSP instructions in almost all cases. Is this intended specifically for targets which don't have NEON?

On targets with NEON, we should prefer NEON vectorization (vmlal etc.) over DSP instructions in almost all cases.

Yes, definitely!

Is this intended specifically for targets which don't have NEON?

Oh yes, I forgot to mention that. We are looking at M-cores. However, I can also see this being useful on A-cores when (NEON) vectorisation has not kicked in, but I have not looked into that yet. So yes, targets without NEON are my first priority for now.

samparker added inline comments. Jun 14 2018, 2:15 AM
142 ↗(On Diff #151165)

How are constants handled when generating the parallel instructions?

159 ↗(On Diff #151165)

Would you mind commenting on what you're allowing here? I'm a bit confused as to why you don't have to handle Mul nodes.

163 ↗(On Diff #151165)

What happens for a series of adds that aren't going to be a part of the SMLAD?

321 ↗(On Diff #151165)

So you know that the loads are sequential, but how do you know they can be loaded in parallel? I don't see any use of alias analysis.

364 ↗(On Diff #151165)

It's probably worth checking that the integer is a type we support too, so we can continue early and avoid the unnecessary, and more expensive, call below.

523 ↗(On Diff #151165)

We also need to check that the DSP instructions are supported. There also should be a check, or an assert, that this pass is being run on an Arm subtarget.

Hi Sam, thanks for the reviews! This is a tidy-up addressing the nits and minor comments in the 1st review. Not addressed yet is the check for unaligned-access support, which I will look into now.

SjoerdMeijer marked 6 inline comments as done. Jun 14 2018, 3:11 AM
SjoerdMeijer added inline comments.
357 ↗(On Diff #151165)

The Header is used to iterate over its Phis, and the loop to get its latch blocks. HasFnNoNanAttr is used only here, and hoisting it out and passing it looks a bit clumsy to me. Thus I think the interface, passing the function, loop, and header, covers it quite well.

SjoerdMeijer marked an inline comment as done.

This is addressing:

We also need to check that the DSP instructions are supported. There also should be a check, or an assert, that this pass is being run on an Arm subtarget.

I've moved the pass to the ARM backend to do this, which is of course the only right place for this pass (it runs pre-isel).

I will now start looking into the other feedback; thanks for that.

john.brawn added inline comments. Jun 15 2018, 9:08 AM
131–133 ↗(On Diff #151365)

This looks odd. Why aren't you marking TargetPassConfig as required and getting it using getAnalysis?

237–241 ↗(On Diff #151365)

I think this can be simplified to

if (!(match(V0, m_APInt(C0)) && match(V1, m_APInt(C1)) && C0 == C1))
  return false;
306 ↗(On Diff #151365)

Why return false here, when you have continues for failing checks above?

330–332 ↗(On Diff #151365)

Given that we return true after finding a pair, doesn't that mean that we only ever have one entry in PMACPairs? If so we don't need it, and this function can just return the pair, or maybe we should do the call to CreateSMLADCall here.

386–388 ↗(On Diff #151365)

This should be outside the loop, and cause the function to return false if there's no loop latch.

392 ↗(On Diff #151365)

Why return false here instead of continuing?

407 ↗(On Diff #151365)

This function only manipulates the candidates list, so it should take that as an argument not Reduction.

495 ↗(On Diff #151365)

This is unused.

497–500 ↗(On Diff #151365)

It would be easier to understand what's going on if you were more explicit here about what the inputs and outputs of each function are, rather than hiding everything inside the Reductions list. If you restructured it to be more like:

Changed = false;
Reductions = MatchReductions(F, L, Header);
for (auto &R : Reductions) {
  Candidates = MatchParallelMACs(R);
  Changed = CreateParallelMACs(Candidates) || Changed;
}

then it would be easier to understand.

Hi John, thanks for the review! I have made the kernel that drives the transformations explicit regarding the inputs and outputs of the different functions; that indeed reads much nicer now.

I will now start addressing Sam's 2nd review.

SjoerdMeijer marked 9 inline comments as done. Jun 18 2018, 3:15 AM

Hi Sjoerd,

In D48270, Eli explained the real solution to creating safe ldr/str. Could you please update the load instructions with alignment, and also test that ldrd/strd instructions aren't generated where they shouldn't be.


Addressed Sam's comments:

  • set the alignment on the load instructions.
  • added alias analysis checks.
  • check that the accumulator is the right integer type.
  • bail out early when the operand chains of the muls are integer constants or more complicated patterns we don't yet support.

Tests have been modified and added for all these cases.

Hi Sjoerd,

Could you please add an llc test to ensure that any unsafe ldrds don't get generated?


samparker added inline comments. Jun 21 2018, 3:35 AM
222 ↗(On Diff #152242)

Is it worth using Inst0->isSameOperationAs(Inst1) here? I'm wondering whether we need to consider varying flags.

393 ↗(On Diff #152242)

You can combine these two type checks by using Ty->isIntegerTy(32).

543 ↗(On Diff #152242)

Isn't this supposed to match the alignment of the underlying memory type? So 2 for i16.

Yes, oops, fixed the alignment, and addressed the nits as well.


Could you please add an llc test to ensure that any unsafe ldrds don't get generated?

I was struggling to create a test case. The load/store optimiser runs after regalloc, and for it to trigger, consecutive registers need to be loaded, which wasn't happening in my case. But I think this would be a fragile and indirect test. And perhaps more importantly, the alignment is now correctly set, to 2 in this case, and I think we can rely on the load/store optimiser to respect this and do the right thing. That is, I see in the load/store optimiser pass that it is not capable of operating on memory operations with getAlignment() < 4.

Ok, fair enough. On a quick search, I couldn't see any alignment tests, so I've added some in rL335241.

133 ↗(On Diff #152292)

You still also need to check that the subtarget supports unaligned memory ops.

SjoerdMeijer added inline comments. Jun 21 2018, 8:27 AM
133 ↗(On Diff #152292)

Ah, yes, I was distracted, and created this D48437.
Will add this.

efriedma added inline comments. Jun 21 2018, 11:09 AM
259 ↗(On Diff #152292)

Do you need to check that the memory isn't modified between the two loads?

540 ↗(On Diff #152292)


  • added unalignment check and test,
  • use CreateAlignedLoad,
  • added new test smlad11.ll: this is a bigger example than the others as it has a chain of 4 muls, and thus should generate 2 smlad calls. But at the moment, they are not inserted correctly, and I am now fixing this.

SjoerdMeijer added inline comments. Jun 22 2018, 6:57 AM
259 ↗(On Diff #152292)

Yes, we check this a bit earlier with AreNoAliases() in CreateParallelMACPairs().
The store and alias checks are tested in tests smlad6.ll and smlad7.ll, but please let me know if I've missed something here.

efriedma added inline comments. Jun 22 2018, 11:09 AM
254 ↗(On Diff #152469)

Please check isSimple() instead.

259 ↗(On Diff #152292)

The AreNoAliases check only looks for "store" instructions; other instructions can write to memory. Maybe you want AliasAnalysis::getModRefInfo()?

Reimplemented the alias checks.

samparker added inline comments. Jun 26 2018, 3:13 AM
126 ↗(On Diff #152707)

There are later checks that a header exists, so they are either unnecessary or the check should be performed here. You also check for the latch later, when that could be done up front too.

370 ↗(On Diff #152707)

Why is this performed inside the loop? Should this function even be called if this is false?

441 ↗(On Diff #152707)

This looks clumsy; shouldn't it be a method of ParallelMAC? You could also use SetVector instead of SmallVector to avoid the searching loop. I know that the vectors are likely to be small, but it would improve readability.

458 ↗(On Diff #152707)

Why are you adding all the instructions? You should only need to check the ones that mayReadOrWriteMemory.

474 ↗(On Diff #152707)

Why process two MemLocs at a time?

553 ↗(On Diff #152707)

Hmmm, now I'm wondering what happens if the header is not also the latch? I think the assumption is that it is.

Comments addressed.
The only thing I am still thinking about is the assumption Latch == Header.

SjoerdMeijer marked 6 inline comments as done. Jun 26 2018, 9:24 AM

Added the loop latch check and test.

samparker accepted this revision. Jun 28 2018, 2:55 AM

LGTM, commit the beauty :)

This revision is now accepted and ready to land. Jun 28 2018, 2:55 AM

Thank you all for your reviews!

Closed by commit rL335850: [ARM] Parallel DSP Pass (authored by SjoerdMeijer). Jun 28 2018, 6:00 AM
This revision was automatically updated to reflect the committed changes.

I think there's a missing correctness check in this patch. Specifically, ARMParallelDSP::CreateParallelMACPairs doesn't check that the transformed loads are i16 loads, so a pair of i8 loads gets transformed to an i32 load. This is causing a miscompile on at least one of our internal tests.

I guess the transform is actually still viable for a pair of i8 loads with slightly different code (we can generate ldrh+sxtb16).
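A possible shape of that fix, as a scalar C++ model (my sketch, not the actual patch): sign-extend the two consecutive bytes into 16-bit lanes first — the sxtb16-style step — so the 16x16 lane multiplies stay correct for i8 data:

```cpp
#include <cassert>
#include <cstdint>

// Widen two consecutive signed bytes into the two 16-bit lanes of a
// 32-bit word (the sign extension an sxtb16-style step would provide),
// so a following SMLAD-style 16x16 lane multiply is correct for i8 data.
uint32_t widen_i8_pair(const int8_t *p) {
  uint16_t lo = (uint16_t)(int16_t)p[0];  // sign-extend byte 0
  uint16_t hi = (uint16_t)(int16_t)p[1];  // sign-extend byte 1
  return (uint32_t)lo | ((uint32_t)hi << 16);
}
```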

Thanks! Will look into this ASAP and prepare a fix.