[ARM] Parallel DSP IR Pass
Needs ReviewPublic

Authored by SjoerdMeijer on Wed, Jun 13, 7:46 AM.

Details

Summary

Armv6 introduced instructions to perform 32-bit SIMD operations. The purpose of
this pass is to do some straightforward IR pattern matching to create ACLE DSP
intrinsics, which map on these 32-bit SIMD operations. Because this is based on
simple pattern matching and it is Arm specific, we thought a separate IR pass
that runs late would be a good place to do this rather than e.g. the SLP
vectoriser.

Currently, only the SMLAD instruction gets recognised. This does two
multiplications with 16-bit operands, and stores the result in an accumulator.
Support for more of these DSP instructions will be added later. This triggers
on a matrix-multiply kernel in a popular benchmark.

Diff Detail

SjoerdMeijer created this revision.Wed, Jun 13, 7:46 AM

Hi Sjoerd,

Just a few comments before I go home, I will continue to look tomorrow. I was going to say that maybe this pass should live in the ARM backend, but I also see other passes in the same directory handling target specific intrinsics.

cheers,
sam

lib/Transforms/Scalar/ParallelDSP.cpp
42 ↗(On Diff #151165)

SmallVector instead?

132 ↗(On Diff #151165)

This should only return true if something has changed.

250 ↗(On Diff #151165)

Why do we consider these sequential?

257 ↗(On Diff #151165)

Would it be wiser to explicitly check before calling this?

357 ↗(On Diff #151165)

This interface could be simplified by passing HasFnNoNanAttr instead of the function. Similarly, I don't see why its necessary to pass the loop and its header.

419 ↗(On Diff #151165)

nit: probably a nicer way to comment these patterns...

513 ↗(On Diff #151165)

Cores will need to support unaligned accesses to do this. It should also be double checked that the load store optimiser doesn't generate any unaligned LDRD later too. I think there is already an option in that pass to check for that.

On targets with NEON, we should prefer NEON vectorization (vmlal etc.) over DSP instructions in almost all cases. Is this intended specifically for targets which don't have NEON?

On targets with NEON, we should prefer NEON vectorization (vmlal etc.) over DSP instructions in almost all cases.

Yes, definitely!

Is this intended specifically for targets which don't have NEON?

Oh yes, forgot to mention that probably. We are looking at M-cores. However, I can also see this being useful on A-cores when (NEON) vectorisation has not kicked in, but I have not looked into that yet. So yes, targets without NEON is my first priority for now.

samparker added inline comments.Thu, Jun 14, 2:15 AM
lib/Transforms/Scalar/ParallelDSP.cpp
142 ↗(On Diff #151165)

How are constants handled when generating the parallel instructions?

159 ↗(On Diff #151165)

Would you mind commenting on what you're allowing here? I'm a bit confused as why you don't have to handle Mul nodes.

163 ↗(On Diff #151165)

What happens for a series of adds that aren't going to be apart of the SMLAD?

321 ↗(On Diff #151165)

So you know that the loads are sequential, but how do you know they can be loaded in parallel? I don't use any use of alias analysis.

364 ↗(On Diff #151165)

It's probably worth checking that the integer is the type we support too so we can continue early and avoid the unnecessary, and more expensive, call below.

523 ↗(On Diff #151165)

We also need to check that the DSP instructions are supported. There also should be a check, or an assert, that this pass is being run on an Arm subtarget.

Hi Sam, thanks for the reviews! This is a tidy-up addressing the nits and minor comments in the 1st review. Not addressed yet is the check for unaligned accesses support, which I will look first into now.

SjoerdMeijer marked 6 inline comments as done.Thu, Jun 14, 3:11 AM
SjoerdMeijer added inline comments.
lib/Transforms/Scalar/ParallelDSP.cpp
357 ↗(On Diff #151165)

The Header is used to iterator over its Phi's, and the loop to get its latch blocks. HasFnNoNanAttr is used only here, and hoisting it out and passing it looks a bit clumsy to me. Thus I think the interface and passing the function, loop, and header, is covering it quite well.

SjoerdMeijer marked an inline comment as done.

This is addressing:

We also need to check that the DSP instructions are supported. There also should be a check, or an assert, that this pass is being run on an Arm subtarget.

I've moved the pass to the ARM backend to do this, which is of course the only right place for this pass (it runs pre-isel).

I will now start looking in the other feedback; thanks for that.

john.brawn added inline comments.Fri, Jun 15, 9:08 AM
lib/Target/ARM/ARMParallelDSP.cpp
132–134

This looks odd. Why aren't you marking TargetPassConfig as required and getting it using getAnalysis?

238–242

I think this can be simplified to

if (!(match(V0, m_APInt(C0)) && match(V1, m_APInt(C1)) && C0 == C1))
  return false;
307

Why return false here, when you have continues for failing checks above?

331–333

Given that we return true after finding a pair, doesn't that mean that we only ever have one entry in PMACPairs? If so we don't need it, and this function can just return the pair, or maybe we should do the call to CreateSMLADCall here.

387–389

This should be outside the loop, and cause the function to return false if there's no loop latch.

393

Why return false here instead of continuing?

408

This function only manipulates the candidates list, so it should take that as an argument not Reduction.

496

This is unused.

498–501

It would be more understandable what's going on if you were more explicit here about what the inputs and outputs to each function are rather than hiding everything inside the Reductions list. If you restructured it to be more like:

Changed = false;
Reductions = MatchReductions(F, L, Header);
for (auto &R : Reductions) {
  Candidates = MatchParallelMACs(R);
  Changed = CreateParallelMACs(Candidates) || Changed;
}

then it would be easier to understand.

Hi John, thanks for the review! I have made the kernel that drives the transformations explicit regarding the inputs and outputs of the different functions; that indeed reads much nicer now.

I will now start addressing Sam's 2nd review.

SjoerdMeijer marked 9 inline comments as done.Mon, Jun 18, 3:15 AM