
[Reassociate] Stop linearizing all associative expression trees w/o profitability
Changes PlannedPublic

Authored by reames on Aug 1 2019, 3:32 PM.

Details

Summary

The Reassociate pass currently does two things. First, it factors out common sub-expressions and otherwise *optimizes* the expression tree. Second, it blindly linearizes the result, regardless of whether any *optimizations* were performed.

This second part is problematic, as we don't have a robust scheduler anywhere else in the pipeline, and a linear execution order is distinctly non-optimal on modern CPUs. Consider the following (toy) example:
for (int i = 0; i < N; i++) {
  int tmp = a[i];
  tmp += b;
  acquire_fence();
  sum += tmp;
}

In this case, we end up with a nicely unrolled loop, but due to the linearization of the add expressions, we emit a long chain of additions into a single target register. This form bottlenecks in the scheduler on modern X86 chips, causing a 25% performance slowdown over the original form.

Note that to do this, I had to remove a stale piece of code which tried to aggressively re-try expression formation after removing uses. Given that code has been disabled since 2012, I'm not too worried about it.

An alternate approach to this problem would be to invest in building a scheduler for associative expressions which can properly balance ILP and register pressure. I'm hoping not to have to solve that problem. :)

Diff Detail

Event Timeline

reames created this revision.Aug 1 2019, 3:32 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 1 2019, 3:32 PM

I'm seeing many regressions in diffs.

Honestly I'd say this is the wrong fix/direction;
I don't recall any such thing as register pressure
for middle-end IR: there are no registers here, only values.

reames updated this revision to Diff 212927.Aug 1 2019, 4:32 PM
reames edited the summary of this revision. (Show Details)

Rebase remainder of test changes after auto-generating them in a separate commit.

reames added a comment.Aug 1 2019, 5:03 PM

> I'm seeing many regressions in diffs.

Can you point to a specific example of what you mean? I haven't studied every single one closely, but on a skim I'm not seeing regressions, so I'm curious what you're referring to.

> Honestly I'd say this is the wrong fix/direction; I don't recall any such thing as register pressure for middle-end IR: there are no registers here, only values.

Right, but for the same reason, why should we be perturbing the input without reason? We don't have any information about profitability to justify the transform.

>> I'm seeing many regressions in diffs.

> Can you point to a specific example of what you mean? I haven't studied every single one closely, but on a skim I'm not seeing regressions, so I'm curious what you're referring to.

Hm? There are clearly a lot of cases where the instruction count increases.

>> Honestly I'd say this is the wrong fix/direction; I don't recall any such thing as register pressure for middle-end IR: there are no registers here, only values.

> Right, but for the same reason, why should we be perturbing the input without reason? We don't have any information about profitability to justify the transform.

This is a canonicalization pass. The same reasoning could be applied to InstCombine.

Inline comments:
test/Transforms/Reassociate/basictest.ll: lines 39–40, 64–65, 89–90, 109–110, 287
test/Transforms/Reassociate/fast-basictest.ll: lines 89–90, 114–115, 139–140
test/Transforms/Reassociate/no-op.ll: lines 31–36

Hm, interesting.
I guess this is the first improvement I'm seeing.

fhahn added a reviewer: escha.Aug 2 2019, 7:55 AM
spatel added a comment.Aug 2 2019, 8:35 AM

I like the idea of limiting this pass. It annoys me that -reassociate can pseudo-randomly change the order of associative+commutative operands, and those transforms don't play by instcombine's operand complexity canonicalization rules (see getComplexity() in InstCombine). In other words, this pass can (hopefully only benignly) fight with instcombine.

But I agree with Roman's analysis of the test diffs - we need to preserve the cases where reassociation/factorization eliminates instructions. InstCombine doesn't handle that generally, and we don't want to shift that burden to InstCombine because that pass is already expensive.

> I like the idea of limiting this pass. It annoys me that -reassociate can pseudo-randomly change the order of associative+commutative operands, and those transforms don't play by instcombine's operand complexity canonicalization rules (see getComplexity() in InstCombine). In other words, this pass can (hopefully only benignly) fight with instcombine.

I'm indeed not opposed to preventing this pass from doing no-op work, but still,
I don't think the reasoning/patch description should say anything about linearization/register pressure.
If that is a problem, then you will have the exact same problem if you start with such bad IR before this transform.

In other words, while you can put a band-aid on it by "arbitrarily crippling" passes,
that may only stop degradation of the IR; nothing will be improving the already-bad IR.
So I'd guess this really should be solved more head-on.

> But I agree with Roman's analysis of the test diffs - we need to preserve the cases where reassociation/factorization eliminates instructions. InstCombine doesn't handle that generally, and we don't want to shift that burden to InstCombine because that pass is already expensive.

>>> I'm seeing many regressions in diffs.

>> Can you point to a specific example of what you mean? I haven't studied every single one closely, but on a skim I'm not seeing regressions, so I'm curious what you're referring to.

> Hm? There are clearly a lot of cases where the instruction count increases.

Gah, you're right, my sample was bogus. That's what I get for skimming from the bottom of a large diff instead of the top.

The issue appears to be that GVN is relying on a canonical order of operands to be able to do CSE. I'll think about that one a bit, see if I can come up with a good robust solution.

reames updated this revision to Diff 213086.Aug 2 2019, 10:47 AM

POC fix for the CSE pointed out in review. This isn't a "real" patch yet, more of a hint as to a possible direction. I'm fairly sure we *could* make this sufficiently fast if we want to move in this direction by memoizing expression trees.

I'm more interested in feedback on the approach. I have to admit I hadn't realized the CSE impact when first proposing this. Even with a GVN fix, this will still harm local CSE (in various passes) and EarlyCSE. Is that too high an impact?

reames updated this revision to Diff 213087.Aug 2 2019, 10:49 AM

Missed a condition in the POC.

reames planned changes to this revision.Fri, Aug 16, 9:44 AM

Marking as "Plan Changes" to clean up Phab views.

mgrang added inline comments.Fri, Aug 16, 9:55 AM
Inline comment: lib/Transforms/Scalar/GVN.cpp, line 293
bjope added a subscriber: bjope.Fri, Aug 16, 10:52 AM