This is an archive of the discontinued LLVM Phabricator instance.

[MachineCombiner] make slack optional in critical path cost calculation (PR25016)
ClosedPublic

Authored by spatel on Oct 3 2015, 2:21 PM.

Details

Summary

As noted on the dev list:
http://lists.llvm.org/pipermail/llvm-dev/2015-October/090998.html
and PR25016:
https://llvm.org/bugs/show_bug.cgi?id=25016

The MachineCombiner is doing reassociations that don't improve the critical path, or that even worsen it. This is caused by inclusion of the "slack" factor when calculating the critical path of the original code sequence. If we don't add that, then we have a more conservative cost comparison of the old code sequence vs. the new sequence. The more liberal calculation must be preserved, however, for the AArch64 MULADD patterns, because benchmark regressions were observed without it.

All existing correct reassociation tests are preserved after this change, and the two failing cases now have identical asm that does what we want:
a + b + c + d ---> (a + b) + (c + d)
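
Schematically, the change makes the slack term optional in the cost comparison. Here is a minimal sketch with illustrative names (the real logic lives in MachineCombiner::improvesCriticalPathLen(); treat the parameter name SlackIsAccurate as an assumption of this sketch):

  // Sketch only: cycle cost of the new sequence vs. the old one.
  unsigned NewCycleCount = NewRootDepth + NewRootLatency;
  // The liberal form (kept for the AArch64 MULADD patterns) includes
  // slack; the conservative form (used for reassociation) drops it.
  unsigned OldCycleCount =
      RootDepth + RootLatency + (SlackIsAccurate ? RootSlack : 0);
  bool Improved = NewCycleCount <= OldCycleCount;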

Diff Detail

Event Timeline

spatel updated this revision to Diff 36447.Oct 3 2015, 2:21 PM
spatel retitled this revision from to [MachineCombiner] remove "slack" from critical path cost calculation (PR25016).
spatel updated this object.
spatel added reviewers: Gerolf, haicheng, hfinkel.
spatel added a subscriber: llvm-commits.
Gerolf edited edge metadata.Oct 5 2015, 9:44 AM
Gerolf added a subscriber: Gerolf.

Taking the slack into account was necessary for the madd combiner to avoid regressions on a number of benchmarks. The new instruction sequence may use up the slack w/o extending the critical path. This change makes the heuristic too conservative. We should not change a heuristic for a fast-math optimization that impacts everything else.

-Gerolf

Taking the slack into account was necessary for the madd combiner to avoid regressions on a number of benchmarks. The new instruction sequence may use up the slack w/o extending the critical path. This change makes the heuristic too conservative. We should not change a heuristic for a fast-math optimization that impacts everything else.

Thanks for the explanation, Gerolf. I'm still trying to come up with an example where the slack makes a positive difference. Do you have a test case that we can add to the regression suite?

The reassociation transform is not fast-math-only at this point. We're using it for integer logic (and/or/xor) and math (imul) on x86. I have delayed enabling it for integer add in the hope of shaking out bugs like this, but my hope is to allow that too eventually.

haicheng edited edge metadata.Oct 5 2015, 11:26 AM

I added another test case in https://llvm.org/bugs/show_bug.cgi?id=25016 which reassociates i32 mul and ends up spilling registers. This test case basically calculates a0*b0*c0*d0 + a1*b1*c1*d1, but reassociates it into (((b0*c0)*d0)*a0) + (((b1*c1)*d1)*a1).

spatel updated this revision to Diff 36552.Oct 5 2015, 2:28 PM
spatel retitled this revision from [MachineCombiner] remove "slack" from critical path cost calculation (PR25016) to [MachineCombiner] make slack optional in critical path cost calculation (PR25016).
spatel updated this object.
spatel edited edge metadata.

Patch updated:
Make slack an optional part of the critical path calculation. Slack is still used with the AArch64 MULADD patterns, but not with the reassociation patterns where it may cause harm.

We should avoid scenarios where two types of machine combines would suddenly require a different heuristic within the same block, e.g., a madd combine and a reassoc in the same block. Also, on a minor note, I don't think I truly understand your underlying performance issue yet.

Gerolf

We should avoid scenarios where two types of machine combines would suddenly require a different heuristic within the same block, e.g., a madd combine and a reassoc in the same block.

Do you have a suggestion for how to avoid that? I'm still mentally blocked on the need for the more liberal MADD cost calc because I haven't seen an example where that provides a win.

Also, on a minor note, I don't think I truly understand your underlying performance issue yet.

Let's try to fix that then - without that understanding, this must look like a useless exercise. :)
First, ignore any information about spilling - that's not what we're concerned about in this patch.

The 2nd test case in this patch shows how things go wrong. It's similar to the example I posted on the dev list:

define double @foo4_reassociated() {
  %a = call double @bar()
  %b = call double @bar()
  %c = call double @bar()
  %d = call double @bar()
  %t0 = fadd double %a, %b   <--- 1st independent add
  %t1 = fadd double %c, %d   <--- 2nd independent add
  %t2 = fadd double %t0, %t1 <--- 3rd add depends on previous adds, so critical path is 2 fadds
  ret double %t2
}

The MachineCombiner is *increasing* the critical path by reassociating the operands:

callq     bar
vmovsd    %xmm0, 16(%rsp)    
callq     bar
vmovsd    %xmm0, 8(%rsp)       
callq     bar
vmovsd    %xmm0, (%rsp)         
callq     bar
vaddsd    (%rsp), %xmm0, %xmm0    <--- independent add
vaddsd    8(%rsp), %xmm0, %xmm0   <--- dependent add
vaddsd    16(%rsp), %xmm0, %xmm0  <--- another dependent add, so critical path is 3 fadds

(a + b) + (c + d) --> ((d + c) + b) + a

This happens because the slack on the last add in the original trace is computed as 10 cycles (!):

DEPENDENCE DATA FOR 0x7feecc800dc0
  NewRootDepth: 7 NewRootLatency: 3
  RootDepth: 4 RootLatency: 3 RootSlack: 10
  NewRootDepth + NewRootLatency = 10
  RootDepth + RootLatency + RootSlack = 17
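
That is, the combiner computes 7 + 3 = 10 cycles for the new sequence vs. 4 + 3 + 10 = 17 for the original, sees 10 <= 17, and fires. Without the bogus 10-cycle slack, the comparison would be 10 vs. 7, and the reassociation would be correctly rejected.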

On 2nd thought, that slack looks really wrong, so maybe the right fix for this problem actually is in MachineTraceMetrics?

On 2nd thought, that slack looks really wrong, so maybe the right fix for this problem actually is in MachineTraceMetrics?

On 3rd thought, that looks fine. I misunderstood the meanings of 'height' and 'depth' in MTM.
So I've returned to the conclusion that the bug is in MachineCombiner's handling of slack. If you examine how slack is used in EarlyIfConversion:
http://llvm.org/docs/doxygen/html/EarlyIfConversion_8cpp_source.html#l00729

I think that shows the intended usage: slack is limited both by a dynamic calculation and a heuristic. Without those limits, we're almost certain to increase the critical path and lose performance (as shown in the examples here).
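
For reference, here is a rough paraphrase of that EarlyIfConversion check (a sketch based on the linked revision, not the literal code; names follow that file):

  // Slack is used, but it is bounded both dynamically and heuristically.
  unsigned CritLimit = SchedModel.MispredictPenalty / 2;  // heuristic cap
  unsigned Slack = TBBTrace.getInstrSlack(PI.PHI);        // dynamic slack
  unsigned MaxDepth = Slack + TBBTrace.getInstrCycles(PI.PHI).Depth;
  if (CondDepth > MaxDepth && CondDepth - MaxDepth > CritLimit)
    return false;  // the extension exceeds the budget; don't if-convert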

I don't understand the concern with using different cost heuristics per pattern. We recalculate all of the trace metrics after performing a transform, so there should be no interference between transforms. Making reassociation have a more conservative cost equation than AArch64's MADD transform seems natural to me because the transforms have different characteristics.

For example, I see a further compile-time optimization / differentiation based on patterns that we can use here (in a follow-on patch of course): by definition, reassociation will produce the same number and types of instructions, so we don't need to calculate MTM 'heights' (resource usage must be identical). Calculating the critical path via 'depth' (data dependency chains) should be enough to decide if a reassociation transform will help.

stoklund edited edge metadata.Oct 7 2015, 8:44 AM

I think that shows the intended usage: slack is limited both by a dynamic calculation and a heuristic. Without those limits, we're almost certain to increase the critical path and lose performance (as shown in the examples here).

The slack is simply computed relative to the critical path such that depth+slack+height = criticalpath.

However, figuring out latencies of values returned by different function calls mostly amounts to guessing. Without any insight into the called function, there is not much to infer.

spatel added a comment.Oct 7 2015, 9:13 AM

I think that shows the intended usage: slack is limited both by a dynamic calculation and a heuristic. Without those limits, we're almost certain to increase the critical path and lose performance (as shown in the examples here).

The slack is simply computed relative to the critical path such that depth+slack+height = criticalpath.

However, figuring out latencies of values returned by different function calls mostly amounts to guessing. Without any insight into the called function, there is not much to infer.

Agreed. Now that I understand MTM a bit more, I think the calls in this example could be replaced by other instructions, and we could show the same problem.
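
(Sanity check of that identity against the debug data in this thread: the final vaddsd has depth 4 and height 3, and the block's critical path is 17 cycles, so slack = 17 - 4 - 3 = 10, exactly the RootSlack printed earlier.)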

I think the key insight is to see that:

critical path = max(depth, height)

But for the purposes of a *reassociation* transform, we are only concerned with reducing the depth, so we should not consider height (and therefore slack) in our cost equation for deciding if a reassociation transform is useful. This won't be true in general.
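
(A concrete instance shows up in the second test case below: the data-dependency depth of the final fadd is only 6 cycles, but resource constraints push the height-based critical path to 10, and that gap surfaces as the RootSlack that misleads the combiner.)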

Sorry, this has been a while. Please send me the instruction sequence you get w/ and w/o reassociation. Even w/o reassoc I see three dependent fadds (although there are four rather than three).

In the case of mul+add -> madd, the combine could increase depth, but that is fine as long as it does not increase the critical path. The combiner's estimate looks fine to me. As long as the estimate holds, it should not hurt performance even when reassoc increases depth - unless the slack is wrong or very conservative. If that is the case and it can't be fixed, then I think we have to pursue your idea and add the capability of associating a local (per-pattern) estimator.

-Gerolf

Sorry, this has been a while. Please send me the instruction sequence you get w/ and w/o reassociation. Even w/o reassoc I see three dependent fadds (although there are four rather than three).

I'm struggling to explain this any better than what is shown in the "already_reassociated()" test case included in this patch. Please let me know if there's some other format or some debug output that I can include for reference.

With current (broken) reassociation, the expected sequence for (a + b) + (c + d) is being changed from:

vaddsd  16(%rsp), %xmm1, %xmm1     <--- c + d
vaddsd  (%rsp), %xmm0, %xmm0       <--- a + b              [independent]
vaddsd  %xmm0, %xmm1, %xmm0        <--- (a + b) + (c + d)

To:

vaddsd (%rsp), %xmm0, %xmm0        <--- a + b
vaddsd 8(%rsp), %xmm0, %xmm0       <--- (a + b) + c        [dependent]
vaddsd 16(%rsp), %xmm0, %xmm0      <--- ((a + b) + c) + d  [dependent]

Perhaps there is another change in your tree. We are looking at different code, and/or I don't know how you got the reference sequence for your test case. With fast-math and reassociation turned off I'm getting:

vmovsd  8(%rsp), %xmm1          # 8-byte Reload
                                # xmm1 = mem[0],zero
vaddsd  16(%rsp), %xmm1, %xmm1  # 8-byte Folded Reload
vaddsd  (%rsp), %xmm0, %xmm0    # 8-byte Folded Reload
vaddsd  %xmm0, %xmm1, %xmm0

Which happens to be the same code I get w/o fast-math.

That should not be faster than fast-math with reassociation.

I don't understand why the equation *holds* although you measure a performance loss. I very much prefer a single equation that holds for all patterns.

-Gerolf

PS:
reassoc.ll

declare double @bar()
define double @foo4_reassociated() {

%a = call double @bar()
%b = call double @bar()
%c = call double @bar()
%d = call double @bar()
%t0 = fadd double %a, %b 
%t1 = fadd double %c, %d 
%t2 = fadd double %t0, %t1 
ret double %t2

}

Perhaps there is another change in your tree. We are looking at different code, and/or I don't know how you got the reference sequence for your test case. With fast-math and reassociation turned off I'm getting:

If we have differing codegen/behavior, that would certainly explain why we're not reaching the same conclusion.

Let's make sure that we are using identical tests: I have confirmed on 2 machines with pristine checkouts of trunk (r250899) that I'm seeing the bad reassociation transform. The reference regression test case that I am referring to is here:
test/CodeGen/X86/machine-combiner.ll, line 650:
define double @already_reassociated() {

And this is the invocation for the checked output:
; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=avx -enable-unsafe-fp-math < %s | FileCheck %s --check-prefix=AVX

And these are the critical check lines:
; AVX-NEXT: vaddsd (%rsp), %xmm0, %xmm0
; AVX-NEXT: vaddsd 8(%rsp), %xmm0, %xmm0
; AVX-NEXT: vaddsd 16(%rsp), %xmm0, %xmm0

Does this reproduce in your environment?

That is still the same test case and the outcome should be independent of compiler revisions. W/o reassoc the fill after the last call is not removed, but you get more ILP out of the 3 vadds. W/ reassoc there is no fill, but less ILP. You can't get both higher ILP *and* a correct program, since in your proposed faster sequence xmm1 would not be initialized:

vaddsd 16(%rsp), %xmm1, %xmm1 <--- c + d
vaddsd (%rsp), %xmm0, %xmm0 <--- a + b [independent]
vaddsd %xmm0, %xmm1, %xmm0 <--- (a + b) + (c + d)

What value do you expect in xmm1 in the first vaddsd?

-Gerolf

That is still the same test case and the outcome should be independent of compiler revisions.

OK - at least we are certain that we are analyzing the same example now. :)

W/o reassoc the fill after the last call is not removed, but you get more ILP out of the 3 vadds. W/ reassoc there is no fill, but less ILP. You can't get both higher ILP *and* a correct program, since in your proposed faster sequence xmm1 would not be initialized:

vaddsd 16(%rsp), %xmm1, %xmm1 <--- c + d
vaddsd (%rsp), %xmm0, %xmm0 <--- a + b [independent]
vaddsd %xmm0, %xmm1, %xmm0 <--- (a + b) + (c + d)

What value do you expect in xmm1 in the first vaddsd?

As we can see in the test's CHECK line, xmm1 is loaded just ahead of the 'add' in the case where the reassociation is firing:
; AVX-NEXT: vmovsd 8(%rsp), %xmm1

I don't see how register initialization or load folding is relevant in this discussion. There are 3 loads and 3 adds regardless of reassociation, and one operand is already in a register (xmm0) as the result of the final 'call' instruction. But as you noted earlier, the machine combiner doesn't know about loads/spills, so all operands are in virtual registers as far as it knows.

Perhaps it makes more sense to look at the debug output for the machine combiner pass rather than the final asm?

This is the machine trace metric depth info before reassociation occurs:

Depths for BB#0:
    0 Instructions
0       ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
0       CALL64pcrel32 <ga:@bar>, <regmask ..., %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def>
1       ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
1       %vreg0<def> = COPY %XMM0; FR64:%vreg0
0       ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
0       CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def>
1       ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
1       %vreg1<def> = COPY %XMM0; FR64:%vreg1
0       ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
0       CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def>
1       ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
1       %vreg2<def> = COPY %XMM0; FR64:%vreg2
0       ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
0       CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def>
1       ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
1       %vreg3<def> = COPY %XMM0; FR64:%vreg3
1       %vreg4<def> = VADDSDrr %vreg0, %vreg1; FR64:%vreg4,%vreg0,%vreg1
1       %vreg5<def> = VADDSDrr %vreg2, %vreg3; FR64:%vreg5,%vreg2,%vreg3
4       %vreg6<def> = VADDSDrr %vreg4<kill>, %vreg5<kill>; FR64:%vreg6,%vreg4,%vreg5
7       %XMM0<def> = COPY %vreg6; FR64:%vreg6
7       RETQ %XMM0

And this is after:

Depths for BB#0:
    0 Instructions
0	ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
0	CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def>
1	ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
1	%vreg0<def> = COPY %XMM0; FR64:%vreg0
0	ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
0	CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def>
1	ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
1	%vreg1<def> = COPY %XMM0; FR64:%vreg1
0	ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
0	CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def>
1	ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
1	%vreg2<def> = COPY %XMM0; FR64:%vreg2
0	ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
0	CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def>
1	ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
1	%vreg3<def> = COPY %XMM0; FR64:%vreg3
1	%vreg5<def> = VADDSDrr %vreg2, %vreg3; FR64:%vreg5,%vreg2,%vreg3
4	%vreg7<def> = VADDSDrr %vreg1, %vreg5<kill>; FR64:%vreg7,%vreg1,%vreg5
7	%vreg6<def> = VADDSDrr %vreg0, %vreg7<kill>; FR64:%vreg6,%vreg0,%vreg7
10	%XMM0<def> = COPY %vreg6; FR64:%vreg6
10	RETQ %XMM0

Do you agree that this 2nd sequence has been de-optimized?

As noted earlier, the reason this happens is because the MTM *Height* information for the original sequence looks like this:

Heights for BB#0:
   16 Instructions
   3c @ SBPort1 (3 ops x12)
   5c @ SBPort5 (5 ops x12)
   3c @ SBPort05 (5 ops x6)
   4c @ SBPort15 (8 ops x6)
   1c @ SBPort23 (1 ops x6)
   3c @ SBPort015 (8 ops x4)
   2c @ SBPortAny (9 ops x2)
7       0       RETQ %XMM0
7       0       %XMM0<def> = COPY %vreg6; FR64:%vreg6
7       3       %vreg6<def> = VADDSDrr %vreg4<kill>, %vreg5<kill>; FR64:%vreg6,%vreg4,%vreg5
7       6       %vreg5<def> = VADDSDrr %vreg2, %vreg3; FR64:%vreg5,%vreg2,%vreg3
7       6       %vreg4<def> = VADDSDrr %vreg0, %vreg1; FR64:%vreg4,%vreg0,%vreg1
7       6       %vreg3<def> = COPY %XMM0; FR64:%vreg3
7       0       ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
7       7       CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def>
8       8       ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
8       6       %vreg2<def> = COPY %XMM0; FR64:%vreg2
10      9       ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
10      10      CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def>
11      11      ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
11      6       %vreg1<def> = COPY %XMM0; FR64:%vreg1
13      12      ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
13      13      CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def>
14      14      ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
14      6       %vreg0<def> = COPY %XMM0; FR64:%vreg0
16      15      ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
16      16      CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def>
17      17      ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
BB#0 Live-ins: SPL@17
Critical path: 17

[Gerolf's reply to my previous comment didn't show up here in Phab, so I'm copying it here for the sake of continuity and future reference.]

It would be great to have an example that actually shows a better assembly sequence. Specifically I would like to see a similar example when there is no call in the block.

Certainly - it takes a little creativity to make this happen because x86 isel is aggressive (perhaps too aggressive) about folding loads into logic/math ops and our current reassociation does not consider instructions with memory ops as candidates for reassociation.

So we need to make sure the example has associative math ops with register operands and then create a trace with a critical path that is extended by heights (resource constraints) rather than limited by depths (data dependencies). We can do that by including a bunch of ops that use a different functional unit from our reassociable ops. Here's an example:

define double @already_reassociated(double %a0, double %a1, double %a2, double %a3, i32* %b) {
  %b0 = getelementptr i32, i32* %b, i64 0
  %b1 = getelementptr i32, i32* %b, i64 2
  %b2 = getelementptr i32, i32* %b, i64 4
  %b3 = getelementptr i32, i32* %b, i64 6

  %i0 = load i32, i32* %b0
  %i1 = load i32, i32* %b1
  %i2 = load i32, i32* %b2
  %i3 = load i32, i32* %b3

  %x0 = add i32 %i0, %i1  ; The integer ops are here simply to lengthen the critical path and expose the bug.
  %x1 = add i32 %i2, %i3
  %x2 = add i32 %x0, %x1

  %x3 = add i32 %x0, %x1
  %x4 = add i32 %x2, %i3
  %x5 = add i32 %x3, %x4

  %x6 = add i32 %x3, %x4
  %x7 = add i32 %x5, %x6
  %x8 = add i32 %x6, %x7
  store i32 %x8, i32* %b3

  %t0 = fadd double %a0, %a1    ; This is the same as before: these fadds should NOT be reassociated.
  %t1 = fadd double %a2, %a3
  %t2 = fadd double %t0, %t1
  ret double %t2
}

And here's the MTM debug output showing why reassociation is happening: in the original sequence, the depth of the last fadd is '6' as expected from 2 dependent ops, but the height of this sequence increases the critical path to '10'. Machine combiner currently uses that info to trigger a reassociation, but this increases the depth of the fadds to '9'.

Depths for BB#0:
    0 Instructions
0	%vreg4<def> = COPY %RDI; GR64:%vreg4
0	%vreg3<def> = COPY %XMM3; FR64:%vreg3
0	%vreg2<def> = COPY %XMM2; FR64:%vreg2
0	%vreg1<def> = COPY %XMM1; FR64:%vreg1
0	%vreg0<def> = COPY %XMM0; FR64:%vreg0
0	%vreg5<def> = MOV32rm %vreg4, 1, %noreg, 0, %noreg; mem:LD4[%b01] GR32:%vreg5 GR64:%vreg4
0	%vreg6<def> = MOV32rm %vreg4, 1, %noreg, 24, %noreg; mem:LD4[%b3] GR32:%vreg6 GR64:%vreg4
0	%vreg7<def,tied1> = ADD32rm %vreg5<tied0>, %vreg4, 1, %noreg, 8, %noreg, %EFLAGS<imp-def,dead>; mem:LD4[%b1] GR32:%vreg7,%vreg5 GR64:%vreg4
0	%vreg8<def,tied1> = ADD32rm %vreg6<tied0>, %vreg4, 1, %noreg, 16, %noreg, %EFLAGS<imp-def,dead>; mem:LD4[%b2] GR32:%vreg8,%vreg6 GR64:%vreg4
5	%vreg9<def,tied1> = ADD32rr %vreg7<tied0>, %vreg8<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg9,%vreg7,%vreg8
6	%vreg10<def,tied1> = ADD32rr %vreg9<tied0>, %vreg6, %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg9,%vreg6
7	%vreg11<def,tied1> = ADD32rr %vreg9<tied0>, %vreg10<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg11,%vreg9,%vreg10
8	%vreg12<def,tied1> = ADD32rr %vreg11<tied0>, %vreg11, %EFLAGS<imp-def,dead>; GR32:%vreg12,%vreg11,%vreg11
9	%vreg13<def,tied1> = ADD32rr %vreg11<tied0>, %vreg12<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg13,%vreg11,%vreg12
10	MOV32mr %vreg4, 1, %noreg, 24, %noreg, %vreg13<kill>; mem:ST4[%b3] GR64:%vreg4 GR32:%vreg13
0	%vreg14<def> = VADDSDrr %vreg0, %vreg1; FR64:%vreg14,%vreg0,%vreg1
0	%vreg15<def> = VADDSDrr %vreg2, %vreg3; FR64:%vreg15,%vreg2,%vreg3
3	%vreg16<def> = VADDSDrr %vreg14<kill>, %vreg15<kill>; FR64:%vreg16,%vreg14,%vreg15
6	%XMM0<def> = COPY %vreg16; FR64:%vreg16
6	RETQ %XMM0
Heights for BB#0:
   14 Instructions
   3c @ SBPort1 (3 ops x12)
   1c @ SBPort4 (1 ops x12)
   1c @ SBPort5 (1 ops x12)
   1c @ SBPort05 (1 ops x6)
   2c @ SBPort15 (4 ops x6)
   3c @ SBPort23 (6 ops x6)
   4c @ SBPort015 (11 ops x4)
   3c @ SBPortAny (18 ops x2)
6	0	RETQ %XMM0
6	0	%XMM0<def> = COPY %vreg16; FR64:%vreg16
6	3	%vreg16<def> = VADDSDrr %vreg14<kill>, %vreg15<kill>; FR64:%vreg16,%vreg14,%vreg15
6	6	%vreg15<def> = VADDSDrr %vreg2, %vreg3; FR64:%vreg15,%vreg2,%vreg3
6	6	%vreg14<def> = VADDSDrr %vreg0, %vreg1; FR64:%vreg14,%vreg0,%vreg1
10	0	MOV32mr %vreg4, 1, %noreg, 24, %noreg, %vreg13<kill>; mem:ST4[%b3] GR64:%vreg4 GR32:%vreg13
10	1	%vreg13<def,tied1> = ADD32rr %vreg11<tied0>, %vreg12<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg13,%vreg11,%vreg12
10	2	%vreg12<def,tied1> = ADD32rr %vreg11<tied0>, %vreg11, %EFLAGS<imp-def,dead>; GR32:%vreg12,%vreg11,%vreg11
10	3	%vreg11<def,tied1> = ADD32rr %vreg9<tied0>, %vreg10<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg11,%vreg9,%vreg10
10	4	%vreg10<def,tied1> = ADD32rr %vreg9<tied0>, %vreg6, %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg9,%vreg6
10	5	%vreg9<def,tied1> = ADD32rr %vreg7<tied0>, %vreg8<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg9,%vreg7,%vreg8
10	10	%vreg8<def,tied1> = ADD32rm %vreg6<tied0>, %vreg4, 1, %noreg, 16, %noreg, %EFLAGS<imp-def,dead>; mem:LD4[%b2] GR32:%vreg8,%vreg6 GR64:%vreg4
10	10	%vreg7<def,tied1> = ADD32rm %vreg5<tied0>, %vreg4, 1, %noreg, 8, %noreg, %EFLAGS<imp-def,dead>; mem:LD4[%b1] GR32:%vreg7,%vreg5 GR64:%vreg4
10	10	%vreg6<def> = MOV32rm %vreg4, 1, %noreg, 24, %noreg; mem:LD4[%b3] GR32:%vreg6 GR64:%vreg4
10	10	%vreg5<def> = MOV32rm %vreg4, 1, %noreg, 0, %noreg; mem:LD4[%b01] GR32:%vreg5 GR64:%vreg4
10	6	%vreg0<def> = COPY %XMM0; FR64:%vreg0
10	6	%vreg1<def> = COPY %XMM1; FR64:%vreg1
10	6	%vreg2<def> = COPY %XMM2; FR64:%vreg2
10	6	%vreg3<def> = COPY %XMM3; FR64:%vreg3
10	10	%vreg4<def> = COPY %RDI; GR64:%vreg4
BB#0 Live-ins: XMM0@6 XMM1@6 XMM2@6 XMM3@6 DIL@10
Critical path: 10
NEW INSTR   %vreg17<def> = UNKNOWN %vreg1, %vreg15<kill>

NEW INSTR   %vreg16<def> = UNKNOWN %vreg0, %vreg17<kill>

DEPENDENCE DATA FOR 0x10884d880
 NewRootDepth: 6 NewRootLatency: 3
 RootDepth: 3 RootLatency: 3 RootSlack: 4
 NewRootDepth + NewRootLatency = 9
 RootDepth + RootLatency + RootSlack = 10
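
The arithmetic: 6 + 3 = 9 for the new root vs. 3 + 3 + 4 = 10 for the old, so with slack included the combiner sees 9 <= 10 and fires; without the 4-cycle slack the comparison would be 9 vs. 6, and the reassociation would be rejected.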

Depths after reassociation:

Depths for BB#0:
    0 Instructions
0	%vreg4<def> = COPY %RDI; GR64:%vreg4
0	%vreg3<def> = COPY %XMM3; FR64:%vreg3
0	%vreg2<def> = COPY %XMM2; FR64:%vreg2
0	%vreg1<def> = COPY %XMM1; FR64:%vreg1
0	%vreg0<def> = COPY %XMM0; FR64:%vreg0
0	%vreg5<def> = MOV32rm %vreg4, 1, %noreg, 0, %noreg; mem:LD4[%b01] GR32:%vreg5 GR64:%vreg4
0	%vreg6<def> = MOV32rm %vreg4, 1, %noreg, 24, %noreg; mem:LD4[%b3] GR32:%vreg6 GR64:%vreg4
0	%vreg7<def,tied1> = ADD32rm %vreg5<tied0>, %vreg4, 1, %noreg, 8, %noreg, %EFLAGS<imp-def,dead>; mem:LD4[%b1] GR32:%vreg7,%vreg5 GR64:%vreg4
0	%vreg8<def,tied1> = ADD32rm %vreg6<tied0>, %vreg4, 1, %noreg, 16, %noreg, %EFLAGS<imp-def,dead>; mem:LD4[%b2] GR32:%vreg8,%vreg6 GR64:%vreg4
5	%vreg9<def,tied1> = ADD32rr %vreg7<tied0>, %vreg8<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg9,%vreg7,%vreg8
6	%vreg10<def,tied1> = ADD32rr %vreg9<tied0>, %vreg6, %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg9,%vreg6
7	%vreg11<def,tied1> = ADD32rr %vreg9<tied0>, %vreg10<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg11,%vreg9,%vreg10
8	%vreg12<def,tied1> = ADD32rr %vreg11<tied0>, %vreg11, %EFLAGS<imp-def,dead>; GR32:%vreg12,%vreg11,%vreg11
9	%vreg13<def,tied1> = ADD32rr %vreg11<tied0>, %vreg12<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg13,%vreg11,%vreg12
10	MOV32mr %vreg4, 1, %noreg, 24, %noreg, %vreg13<kill>; mem:ST4[%b3] GR64:%vreg4 GR32:%vreg13
0	%vreg15<def> = VADDSDrr %vreg2, %vreg3; FR64:%vreg15,%vreg2,%vreg3
3	%vreg17<def> = VADDSDrr %vreg1, %vreg15<kill>; FR64:%vreg17,%vreg1,%vreg15
6	%vreg16<def> = VADDSDrr %vreg0, %vreg17<kill>; FR64:%vreg16,%vreg0,%vreg17
9	%XMM0<def> = COPY %vreg16; FR64:%vreg16
9	RETQ %XMM0

The crux is that the trace metrics model assumes all calls can execute in parallel. That looks like an unrealistic assumption. This suggests there is a deeper modeling problem/question we need to understand first before jumping into solution space.

That was my "2nd thought" comment in this thread on Oct 6. And I agree that we could try to improve machine trace metrics, but my "3rd thought" comments from Oct 7 explain my current opinion about the problem: for a reassociation transform, we should be more conservative in our usage of MTM. It will save compile time and makes us immune to possible bugs in MTM or whatever CPU models it is relying on. I don't see any downside to doing this.

Gerolf added a comment.Nov 3 2015, 7:39 PM

I think we need your patch. It would also be nice to have a test case that shows an actual increase in critical path length and a measurable performance difference. In the test cases so far the vadd sequences take longer w/o actually increasing the critical path. So just by looking at a reassociation case in isolation I wouldn't worry about modifying the heuristic. However, in a scenario with, e.g., reassoc first and then a mul+add -> madd, the bigger depth of the re-associated instruction could result in not performing the mul+add combine later. It could be that the extra depth may have eaten the slack the madd needs to fit within the critical path.

Also, since reassoc attempts to reduce depth, shouldn't the equation ignore latencies and the code finally look something like: if (MustReduceDepth) return NewDepth < OldDepth; else if (MustReduceCPL) return NewCycleCount < OldCycleCount; else return NewCycleCount <= OldCycleCount?

Thanks for effort and being persistent.

Cheers
Gerolf

spatel added a comment.Nov 4 2015, 9:58 AM

I think we need your patch. It would also be nice to have a test case that shows an actual increase
in critical path length and a measurable performance difference. In the test cases so far the vadd
sequences take longer w/o actually increasing the critical path.

Hmm...I'm not seeing a way to make that happen. Any suggestions on how to expose this?

We should never increase the calculated critical path (which includes slack). We know from the example that uses 'call' instructions that our CPU model may be incorrect, and this may lead to a real-world increase in CPL. That is the scenario I'm hoping to avoid because we know that a CPU model in the compiler will never be a 100% accurate simulator for any sufficiently complex CPU.

So just by looking at a reassociation case in isolation I wouldn't worry about modifying the heuristic.
However, in a scenario with, e.g., reassoc first and then a mul+add -> madd, the bigger depth of the
re-associated instruction could result in not performing the mul+add combine later. It could be that
the extra depth may have eaten the slack the madd needs to fit within the critical path.

I'm sorry, but I'm still not following this argument. Let's say that we don't even have a reassoc transform here, but the instructions themselves are already reassociated optimally. In that case, if a muladd transform that would be beneficial does not fire, then it must be a deficiency in the muladd logic or MTM, right? I.e., the cost calc for doing that transform should be independent of whether the code was transformed here or earlier by some other mechanism. We guarantee this by invalidating the MTM data between transforms here in MachineCombiner.

Also, since reassoc attempts to reduce depth, shouldn't the equation ignore latencies and
the code finally look something like: if (MustReduceDepth) return NewDepth < OldDepth;
else if (MustReduceCPL) return NewCycleCount < OldCycleCount; else return
NewCycleCount <= OldCycleCount?

Yes, this looks correct. I think this is almost identical to the logic implemented in this patch, but we can eliminate the extra calculation of 'RootLatency' and 'NewRootLatency' because we know they are always identical for reassociation. Let me fix this and update the patch.
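
A minimal sketch of where I expect improvesCriticalPathLen() to land (shape only; the committed code is authoritative):

  // Sketch only; parameter plumbing omitted.
  if (MustReduceDepth)
    // Reassociation: Root and NewRoot latencies are identical by
    // construction, so comparing depths alone is sufficient and cheaper.
    return NewRootDepth < RootDepth;
  // Default (e.g. the AArch64 MULADD patterns): full comparison, slack
  // included; the new sequence must not lengthen the critical path.
  unsigned NewCycleCount = NewRootDepth + NewRootLatency;
  unsigned OldCycleCount = RootDepth + RootLatency + RootSlack;
  return NewCycleCount <= OldCycleCount;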

Thanks for effort and being persistent.

Thank you! I think I have a much better understanding of MTM now. :)

spatel updated this revision to Diff 39272.Nov 4 2015, 3:38 PM
spatel edited edge metadata.

Patch updated:

  1. Fix names of parameters to improvesCriticalPathLen() -> MustReduceCriticalPath, MustReduceDepth
  2. For the MustReduceDepth case, we can early exit (don't bother calculating latency for Root and NewRoot).
  3. Improve code comments to reflect what we have discussed in this review.
spatel updated this revision to Diff 39400.Nov 5 2015, 12:35 PM

Patch updated:
If I'm understanding correctly, the logic in the last version of the patch is now acceptable, but we can improve the code/comments to make it clearer how the cost metric may differ per pattern.

Ideally, this would be contained in the enum class itself, but C++ doesn't allow that, so I've added a helper function to MachineCombiner instead.

I cleaned up the MachineCombinerPattern file in:
http://reviews.llvm.org/rL252196

Now that you check for schedule depth for reassociations, do you still need to check for shorter length? If so, for which pattern? I think we should be able to get away with checks for depth (MustReduceDepth) and just leave the default (don't prolong the critical path) for the muladd etc. patterns.

lib/CodeGen/MachineCombiner.cpp
322

Could we have that function return, e.g., an enum CombinerObjective with values MustReduceDepth or Default? Then improvesCriticalPathLength would only need one parameter of that type.

spatel added a comment.Nov 9 2015, 4:25 PM

Now that you check for schedule depth for reassociations, do you still need to check for shorter length? If so, for which pattern? I think we should be able to get away with checks for depth (MustReduceDepth) and just leave the default (don't prolong the critical path) for the muladd etc. patterns.

Good catch: we were only distinguishing between < and <= to prevent reassociations from firing inadvertently. We can make that always <= now for the default (muladd) patterns. Note that there are no muladd regression tests that actually check for that difference though.

lib/CodeGen/MachineCombiner.cpp
322

Sure - I think we can pass the pattern to improvesCriticalPathLength, and then it can unwrap whatever properties it needs internally. This makes the C++ enum class deficiency even more obvious. We could wrap the enums in a class, but that should be a separate change IMO.
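
Something like this sketch (hypothetical shape; the pattern names are abbreviated and the actual diff is authoritative):

  enum class CombinerObjective {
    MustReduceDepth, // pattern is profitable only if it reduces depth
    Default          // pattern must not lengthen the critical path
  };

  // A C++ enum class can't carry member functions, so the mapping lives
  // in a static helper instead.
  static CombinerObjective getCombinerObjective(MachineCombinerPattern P) {
    switch (P) {
    case MachineCombinerPattern::REASSOC_AX_BY:
    case MachineCombinerPattern::REASSOC_AX_YB:
    case MachineCombinerPattern::REASSOC_XA_BY:
    case MachineCombinerPattern::REASSOC_XA_YB:
      return CombinerObjective::MustReduceDepth;
    default:
      return CombinerObjective::Default;
    }
  }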

spatel updated this revision to Diff 39767.Nov 9 2015, 4:30 PM

Patch updated:

  1. Add enum to describe the combiner's objective; this is only 2 items for now, but enables further changes to progress more naturally.
  2. Remove check for reduction of instructions; this isn't necessary for the reassociation patterns now that we limit the cost metric for those to depths.
Gerolf accepted this revision.Nov 9 2015, 5:20 PM
Gerolf edited edge metadata.

LGTM - very nice!

Thanks
Gerolf

This revision is now accepted and ready to land.Nov 9 2015, 5:20 PM
This revision was automatically updated to reflect the committed changes.