Page MenuHomePhabricator

[ARM][ParallelDSP] Change smlad insertion order
ClosedPublic

Authored by samparker on Sep 10 2019, 3:48 AM.

Details

Summary

Instead of inserting everything after the 'root' of the reduction, insert all instructions as close to their operands as possible. This can help reduce register pressure.

Note: I have no idea why git has decided that I've made a change to an MC test.

Diff Detail

Event Timeline

samparker created this revision.Sep 10 2019, 3:48 AM

This can help reduce register pressure.

Is misched really so weak that this helps significantly? Or is it not enabled for the targets in question?

lib/Target/ARM/ARMParallelDSP.cpp
696 ↗(On Diff #219511)

dominates() is linear in the length of the basic block; might want to use OrderedBasicBlock. (This is probably not the only place where this pass is quadratic in the size of the basic block, but just happened to spot it.)

Note: I have no idea why git has decided that I've made a change to an MC test.

Like the name suggests, preserve-comments-crlf.s has Windows line endings; maybe your text editor accidentally "fixed" them?

samparker updated this revision to Diff 219671.Sep 11 2019, 1:54 AM

Thanks Eli, I had no idea that OrderedBasicBlock existed! I'm now using this in a couple of places. I've also updated a couple of tests to highlight the register pressure changes. The scheduler does a good job, but it appears to struggle with some of the large blocks that we chuck at it and every cycle counts for these DSP kernels.

efriedma added inline comments.Sep 12 2019, 1:34 PM
lib/Target/ARM/ARMParallelDSP.cpp
654 ↗(On Diff #219671)

You decided not to modify this dominates() call?

samparker marked an inline comment as done.Sep 13 2019, 12:38 AM
samparker added inline comments.
lib/Target/ARM/ARMParallelDSP.cpp
654 ↗(On Diff #219671)

I assumed it wouldn't be worth it... My understanding is that I'd have to create a new ordered block each time, because the block will change after each call.

efriedma added inline comments.Sep 13 2019, 12:08 PM
lib/Target/ARM/ARMParallelDSP.cpp
654 ↗(On Diff #219671)

My general concern here would be that the pass is O(N^2) in the number of transformations in a given BB. (If you unroll a loop containing a transformable operation N times, for example.) This contributes to that, constructing the OrderedBB later in InsertParallelMACs contributes to that. But there are a bunch of other places with similar issues... for example, RecordMemoryOps has a loop that's O(N^2) in the number of loads.

Actually, thinking about it a bit more, I have another concern; you might not be picking a legal insertion point. The instruction that produces the accumulator could be anything: for example, a phi, or an invoke, or an exception-handling instruction.

samparker marked an inline comment as done.Sep 16 2019, 7:19 AM
samparker added inline comments.
lib/Target/ARM/ARMParallelDSP.cpp
654 ↗(On Diff #219671)

What about if I delayed the RecordMemoryOps logic until after we've discovered the reduction? Then, at least, we won't be paying the cost for the vast majority of cases.

The accumulator could be a phi, but I don't think that's an issue here as it would dominate all other instructions. Only the original accumulator input value can be a phi or a non-instruction and, as we don't compare same values, I don't see how a phi could be in the insertion point.

And please pardon my ignorance, but could you elaborate why an invoke would be an issue? And why exception handling needs to be considered?

I also hadn't though much about complexity, but indeed, function RecordMemoryOps, for example, is a bit of an expensive hobby.
Looking at it again, the bookkeeping looks essential, I don't see an easy way to reduce complexity. Delaying it may help a bit, but fundamentally that won't change much I think.
The usual way to deal with expensive hobbies is to introduce a threshold, and bail if it exceeds that.

The current implementation of comparing loads is quadratic, yes, but you could use a different algorithm, like splitting loads into a base pointer plus an offset, and constructing a map from base pointers to load offsets. Maybe not worthwhile, though; a threshold might be good enough.

lib/Target/ARM/ARMParallelDSP.cpp
654 ↗(On Diff #219671)

Oh, if the accumulator isn't an instruction in the same block as the multiply, we always choose the multiply as the insertion point? That makes sense... but it probably deserves a comment noting that invariant.

The current implementation of comparing loads is quadratic, yes, but you could use a different algorithm, like splitting loads into a base pointer plus an offset, and constructing a map from base pointers to load offsets.

Ah, yes, nice one!

samparker updated this revision to Diff 221770.Sep 25 2019, 7:44 AM

Thanks both. I've added a threshold to the number of loads that we can inspect, which causes us to bail before examining any loads. I've also added an early exit into the troublesome loop in RecordMemoryOps.

SjoerdMeijer added inline comments.Oct 15 2019, 6:29 AM
lib/Target/ARM/ARMParallelDSP.cpp
379 ↗(On Diff #221770)

Just curious, why did you change the iteration order, Loads/Writes vs. Writes/Loads?

test/CodeGen/ARM/ParallelDSP/blocks.ll
139 ↗(On Diff #221770)

Double checking I understand this test: this test has 16 loads, so is within the limit of 16.
So why don't we generate the 4th SMLAD?

samparker marked 2 inline comments as done.Oct 15 2019, 7:21 AM
samparker added inline comments.
lib/Target/ARM/ARMParallelDSP.cpp
379 ↗(On Diff #221770)

We know that Loads isn't empty, but Writes maybe so we can skip the loop if it is.

test/CodeGen/ARM/ParallelDSP/blocks.ll
139 ↗(On Diff #221770)

We're within the limit, but it looks like there's some limitation with the search algorithm... the adds are just ordered differently to 'usual'.

Thanks for clarifying.

I am happy with this if @efriedma is happy too.

efriedma accepted this revision.Oct 15 2019, 11:08 AM

LGTM, assuming you fix your tree so test/MC/AsmParser/preserve-comments-crlf.s isn't modified.

This revision is now accepted and ready to land.Oct 15 2019, 11:08 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptOct 16 2019, 2:43 AM
Herald added a subscriber: hiraditya. · View Herald Transcript