This is an archive of the discontinued LLVM Phabricator instance.

[X86] Let MachineCombiner reassociate adds for ILP
AbandonedPublic

Authored by reames on Jul 26 2019, 3:49 PM.

Details

Summary

It turns out that we don't even try to reassociate add chains into trees in MachineCombiner. Unless we've happened to convert them to ORs or have vectorized them that is. I can't find any principled reason for this omission. Is there something I'm missing?

(Tests needs cleaned up. This is just an auto-gen run to give a flavour of some of the changes which result. There's also a huge number of spurious changes which need removed due to tests not originally being auto-gened at all. I'll iterate on that if the direction is judged reasonable.)

Diff Detail

Event Timeline

reames created this revision.Jul 26 2019, 3:49 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 26 2019, 3:49 PM

Should we do the _DB versions too?

IIRC, MachineCombiner has the potential to cause spills (it runs without real knowledge of register pressure)...and if we spill, then we have likely killed all hope for a perf improvement due to better throughput of math/logic. This is true for any reassociable opcode, but x86 scalar 'add' has more potential chance for trouble due to prevalence and so few available registers ( (as seen by the number of diffs here?).

So it's correct that there was no principled reason to leave 'add' out, but the practical concern was spilling. I have not looked at the test diffs here, but if there's evidence of extra instructions, then we're probably going to see perf regressions rather than improvements?

IIRC, MachineCombiner has the potential to cause spills (it runs without real knowledge of register pressure)...and if we spill, then we have likely killed all hope for a perf improvement due to better throughput of math/logic. This is true for any reassociable opcode, but x86 scalar 'add' has more potential chance for trouble due to prevalence and so few available registers ( (as seen by the number of diffs here?).

So it's correct that there was no principled reason to leave 'add' out, but the practical concern was spilling. I have not looked at the test diffs here, but if there's evidence of extra instructions, then we're probably going to see perf regressions rather than improvements?

I hear the concern here, but not reassociating on adds also means we miss some semi major performance opportunities. Any suggestions on how to break the impasse with a tractable amount of work? As in, any suggestions on how to implement a reasonable register pressure heuristic in MachineCombiner?

IIRC, MachineCombiner has the potential to cause spills (it runs without real knowledge of register pressure)...and if we spill, then we have likely killed all hope for a perf improvement due to better throughput of math/logic. This is true for any reassociable opcode, but x86 scalar 'add' has more potential chance for trouble due to prevalence and so few available registers ( (as seen by the number of diffs here?).

So it's correct that there was no principled reason to leave 'add' out, but the practical concern was spilling. I have not looked at the test diffs here, but if there's evidence of extra instructions, then we're probably going to see perf regressions rather than improvements?

I hear the concern here, but not reassociating on adds also means we miss some semi major performance opportunities. Any suggestions on how to break the impasse with a tractable amount of work? As in, any suggestions on how to implement a reasonable register pressure heuristic in MachineCombiner?

I have no expertise at this level of codegen, so it's probably better to raise this question on llvm-dev for a more informed answer. I see some facility for estimating pressure used in MachineLICM before register allocation:

// Estimate register pressure during pre-regalloc pass.

...but I have no idea how much work it would take to wire that up within MachineCombiner.

Before taking on that effort, I think we should ask: what kind of perf gains could we expect if MachineCombiner was working optimally on these patterns? Maybe we can hack a limited solution and get some stats?

I could be completely wrong, but it seems unlikely to me that any recent OOO x86 would benefit much from reassociating scalar adds before it ran out of GPRs.

IIRC, MachineCombiner has the potential to cause spills (it runs without real knowledge of register pressure)...and if we spill, then we have likely killed all hope for a perf improvement due to better throughput of math/logic. This is true for any reassociable opcode, but x86 scalar 'add' has more potential chance for trouble due to prevalence and so few available registers ( (as seen by the number of diffs here?).

So it's correct that there was no principled reason to leave 'add' out, but the practical concern was spilling. I have not looked at the test diffs here, but if there's evidence of extra instructions, then we're probably going to see perf regressions rather than improvements?

I hear the concern here, but not reassociating on adds also means we miss some semi major performance opportunities. Any suggestions on how to break the impasse with a tractable amount of work? As in, any suggestions on how to implement a reasonable register pressure heuristic in MachineCombiner?

I have no expertise at this level of codegen, so it's probably better to raise this question on llvm-dev for a more informed answer. I see some facility for estimating pressure used in MachineLICM before register allocation:

// Estimate register pressure during pre-regalloc pass.

...but I have no idea how much work it would take to wire that up within MachineCombiner.

Before taking on that effort, I think we should ask: what kind of perf gains could we expect if MachineCombiner was working optimally on these patterns? Maybe we can hack a limited solution and get some stats?

I could be completely wrong, but it seems unlikely to me that any recent OOO x86 would benefit much from reassociating scalar adds before it ran out of GPRs.

But we're missing out on building trees to avoid serial dependency chains. With trees we can do adds in parallel.

There's RegPressureTracker (https://llvm.org/docs/doxygen/classllvm_1_1RegPressureTracker.html) which can be used to track register pressure through a region of MBBs. The MachineCombiner use MachineTraceMetrics to guide which combines to apply and MTM does so by choosing a preferred trace. I think we could track pressure through that trace and compare the pressure difference of the removed/added instructions per combine. At least from a high-level perspective, adding register pressure info to MTM makes sense to me.

Gerolf, do you think that would be good extension to MachineCombiner?

reames added a comment.Aug 1 2019, 3:33 PM

I've posted an alternate fix for my actual problem over in https://reviews.llvm.org/D65614. It turns out the linear schedule which MachineCombiner would be undoing here is entirely self inflicted.

reames added a comment.Aug 1 2019, 4:17 PM

Skimming through the RegisterPressure class and the approach MachineLICM uses, the mechanics of adding regrester pressure tracking to MachineCombiner don't seem too bad. I'm fairly confident we can do so without too much work, though I haven't fully worked it through much less written the code, so that might be a gotcha. Thanks everyone for the pointers.

The question left is what would we use the register pressure *for*? As in, what heuristics would make sense to impose the machine combiner?

Should we only do transforms which don't increase register pressure? Only increase it by an amount under the register class limit? The later is tempting, but then we might inhibit other transforms which also want to increase register pressure. What's the right tradeoff? I think register pressure preserving is probably too restrictive, but I'm not sure where the sweat spot is.

Additionally, if we have many combine opportunities within a BB, and running all of them would exceed our reg pressure limit, how do we prioritize which ones to perform? Is a simple greedy algorithm from the front of the block "good enough"?

This is outside my practical experience, so I'd welcome input anyone might have.

Skimming through the RegisterPressure class and the approach MachineLICM uses, the mechanics of adding regrester pressure tracking to MachineCombiner don't seem too bad. I'm fairly confident we can do so without too much work, though I haven't fully worked it through much less written the code, so that might be a gotcha. Thanks everyone for the pointers.

The question left is what would we use the register pressure *for*? As in, what heuristics would make sense to impose the machine combiner?

Should we only do transforms which don't increase register pressure? Only increase it by an amount under the register class limit? The later is tempting, but then we might inhibit other transforms which also want to increase register pressure. What's the right tradeoff? I think register pressure preserving is probably too restrictive, but I'm not sure where the sweat spot is.

I agree that preserving register pressure is probably too restrictive. Using register classes to limit register pressure is a good idea. Not sure if the cost heuristic should also account for callee saved registers though. Ideally, the machine scheduler that runs before regalloc could try hard(er) to reorder instructions for register pressure. That would potentially give to the machine combiner a bit more slack.

We could also use information from the scheduling models (when available) to further limit reassociation.
Let say that we have a chain of vector adds. If we know that the target processor has only two vector ALU pipes, then we can reorganize the expression tree so that there are always at most two parallel adds. More parallelism would not be exploited in practice, and it would only lead to a potential increase in register pressure for no good reason.

Additionally, if we have many combine opportunities within a BB, and running all of them would exceed our reg pressure limit, how do we prioritize which ones to perform? Is a simple greedy algorithm from the front of the block "good enough"?

Good question. Unfortunately I don't have a good answer...
I suggest to start experimenting with a greedy algorithm first.

This is outside my practical experience, so I'd welcome input anyone might have.

Skimming through the RegisterPressure class and the approach MachineLICM uses, the mechanics of adding regrester pressure tracking to MachineCombiner don't seem too bad. I'm fairly confident we can do so without too much work, though I haven't fully worked it through much less written the code, so that might be a gotcha. Thanks everyone for the pointers.

The question left is what would we use the register pressure *for*? As in, what heuristics would make sense to impose the machine combiner?

Should we only do transforms which don't increase register pressure? Only increase it by an amount under the register class limit? The later is tempting, but then we might inhibit other transforms which also want to increase register pressure. What's the right tradeoff? I think register pressure preserving is probably too restrictive, but I'm not sure where the sweat spot is.

I agree that preserving register pressure is probably too restrictive. Using register classes to limit register pressure is a good idea. Not sure if the cost heuristic should also account for callee saved registers though. Ideally, the machine scheduler that runs before regalloc could try hard(er) to reorder instructions for register pressure. That would potentially give to the machine combiner a bit more slack.

The machine scheduler already schedules quite aggressively for register pressure.

I think we start out with restricting the combiner to substitutions that do not increase register pressure. We can then loosen it and tune the thresholds.

We could also use information from the scheduling models (when available) to further limit reassociation.
Let say that we have a chain of vector adds. If we know that the target processor has only two vector ALU pipes, then we can reorganize the expression tree so that there are always at most two parallel adds. More parallelism would not be exploited in practice, and it would only lead to a potential increase in register pressure for no good reason.

Additionally, if we have many combine opportunities within a BB, and running all of them would exceed our reg pressure limit, how do we prioritize which ones to perform? Is a simple greedy algorithm from the front of the block "good enough"?

Good question. Unfortunately I don't have a good answer...
I suggest to start experimenting with a greedy algorithm first.

Initially a greedy approach should be fine: choose applicable candidates, until we exceed a threshold. If we start with limiting us to substitutions that do not increase pressure first, this should not be an issue at all. Doing anything more will require some additional re-structuring of the code to compute the set of candidates and rank them first, instead of applying them directly.

I think the case we want to try hard to avoid is doing combines that increase register pressure outside a loop, which in turn cause spills in a loop body. The machine scheduler cannot do much about such cases unfortunately.

reames planned changes to this revision.Aug 16 2019, 9:44 AM

Marking as "Plan Changes" to cleanup phab views.

reames abandoned this revision.Oct 15 2021, 11:59 AM

Abandoning an old review I'm not going to return to any time soon.

test/CodeGen/X86/mul-constant-i8.ll