This is an archive of the discontinued LLVM Phabricator instance.

Update loop branch_weight metadata after loop rotation.
Needs ReviewPublic

Authored by trentxintong on Jan 11 2017, 7:24 PM.

Details

Summary

Update loop branch_weight metadata after loop rotation.

In case we have branch_weight in the unrotated loop header, we update
it after rotation, more specifically we update the branch in the guard
block and the branch in the rotated latch block.

Event Timeline

trentxintong retitled this revision from to Update loop branch_weight metadata after loop rotation..
trentxintong updated this object.
trentxintong added a subscriber: llvm-commits.
twoh added a subscriber: twoh.Jan 11 2017, 10:26 PM
danielcdh added inline comments.Jan 12 2017, 9:04 AM
test/Transforms/LoopRotate/loop-rotate-pgo.ll
61

I'm a little confused the logic here. For the following code:

L1:
if (cond) { // taken probability 7/35

stmt;
goto L1;

}

The loop rotate originally transforms it to:

if (cond) { // taken probability 7/35
L1:

stmt;
if (cond) { // taken probability 7/35
  goto L1;
}

}

Your patch change it to:

if (cond) { // taken probability 7/28
L1:

stmt
if (cond) { // taken probability 1/7
  goto L1;
}

}

Assuming "cond" has fixed taken probability, the original probability seems more correct to me? Or am I missing something?

@danielcdh Its the true the cond is not changed, but the inputs to the condition has changed, before you were testing with index variable in the loop header, now u have 2 conditions and the SSA values you use in the conditions have changed, so the probability needs to be adjusted. I will explain more when i get to office.

trentxintong added a comment.EditedJan 12 2017, 10:23 AM

After we rotate the loop, we duplicate the comparison the old header into the guard block and we move the header to the end of the loop. So basically we have 2 branches that carry the branch_weight metadata they need to be adjusted. With this patch, we only do the adjustment for loop with only 1 exitting block, if the loop has early exits, its harder to adjust the branch weight properly. e.g. if we have an early exit in the loop, we will not be able to tell whether the latch block will ever be reached after rotation, not to mention to adjust its branch weight, (we need to look at the branch weights of the early exits to do this properly).

There are 2 conditions that are constant after rotation, (1) # of times the loop body is executed and (2) the # of times the exit block is executed. (these are the values we can get by extracting the prof metadata from the header branch before rotation). The backedge weight should simply be the # of times the loop body is executed - # of times the loop exit is executed. With the loop exit count, loop body weight and loop backedge weight, we can compute the branch weight for the guard block.

This is true iff the loop executes at least once every time. If the loop body execute seldom, they we can not do it (and we will get a negative backedge weight num with # of times the loop body is executed - # of times the loop exit is executed.) In this case, we cap the backedge # to 1 (should be 0). And with this information we can compute loop exit weight and all the other weights.

Thats roughly what the patch is doing and how the numbers in the test case is computed.

mkuper added inline comments.Jan 12 2017, 12:59 PM
lib/Transforms/Scalar/LoopRotation.cpp
527

Also, regardless of the rest of the discussion - I don't think we should drop the metadata on the floor if we fail.

I don't think "No data is better than imprecise data" is right in the general case, but that's arguable. Specifically here, though, we're imprecise even if updateLoopEstimatedBranchWeight() succeeds, because of the assumptions we make on the distribution.

trentxintong added inline comments.Jan 12 2017, 1:29 PM
lib/Transforms/Scalar/LoopRotation.cpp
527

I am fine changing this, i.e. keeping the metadata when we cant update them properly. I think they would still be useful (not counter-productive) in this case, even though imprecise.

Address suggestions.

Hopefully remove some unrelated changes to some unrelated files.

I only want the head commit to be included in this patch.

I want to merge https://reviews.llvm.org/D28460, but I do not know how to do
so on Phabricator. So I merged it manually =).

danielcdh edited edge metadata.Jan 13 2017, 6:59 AM

Thanks a lot for the example, I'll look into it.

Dehao

Looks like this patch will make the "always call" worse:

Without this patch:

pushq   %rbx
movq    %rdi, %rbx
cmpl    $0, (%rbx)
jne     .LBB1_3

.LBB1_1: # =>This Inner Loop Header: Depth=1

movq    %rbx, %rdi
callq   call_me
cmpl    $0, (%rbx)
je      .LBB1_1

.LBB1_3:

popq    %rbx
retq

With this patch:

pushq   %rbx
movq    %rdi, %rbx
cmpl    $0, (%rbx)
je      .LBB1_1

.LBB1_3:

popq    %rbx
retq

.LBB1_1: # =>This Inner Loop Header: Depth=1

movq    %rbx, %rdi
callq   call_me
cmpl    $0, (%rbx)
jne     .LBB1_3
jmp     .LBB1_1

As the trip count of this loop is always 1, the first code will have no taken branches, while with this patch, it will have 2 taken branches.

trentxintong edited edge metadata.

Guard block branches to preheader instead of header.

Talked with Xin offline:

  1. The comparison I made is (clean client v.s. entire patch), while Xin is comparing (D28460 v.s. entire patch)
  2. We both agree that the patch works fine for cases when avg trip count >= 0.5
  3. For cases when avg trip count <0.5 3.1 both have good code layout, the set probability for the guard branch are both reasonable 3.2 this patch will peel the loop by 1, while no_patch will not peel the loop, so the patch has code size overhead 3.3 the peeled loop may have better performance on the cold path when the loop trip count is unevenly distributed (but I started to doubt this now)
  4. If 3.3 is true, then the problem becomes whether we want to trade code size for performance at cold path. Xin is going to run some benchmarks with and without the test to measure code size and performance impact of this patch and report back.

I am still in middle of getting a machine which i can do performance runs on. The machine I have is not very stable, i.e. specrun # fluctuates from run to run.

davidxl added inline comments.Feb 1 2017, 2:47 PM
lib/Transforms/Scalar/LoopRotation.cpp
88

name the parameters.

525

Remove the comment about OrigHeader BR. After Block merging, it will become the latch block branch instruction.

Is it better to call this after MergeBlockInfoPredecessor when loop L is a more consistent state?

550

exitting --> exiting

575

you need a test case to cover LoopBodyWeight > LoopExitWeight case --- not the one with known tripcount. For the constant trip count case, the guard BR gets eliminated.

614

Explain this condition? Perhaps a test case?

617

This is probably not quite correct.

For the case when LoopBodyWeight > LoopExitWeight, especially when the runtime trip count is large, the guard BR should have very biased against exiting. I think it is reasonable to use an arbitrary biased exit probability such as 1/100.

For cases when LoopBodyWeight and LoopExitWeight are close (e.g., trip count < 5), it is reasonable to keep GuardBR has the same exit branch probability as the new LatchBR. This applies to LoopBodyWeight < LoopExitWeight case as well.

test/Transforms/LoopRotate/loop-rotate-pgo.ll
10

Perhaps name it 'loop_with_known_trip_count'

17

This loop has statically known trip count. The profile data does not match it. Use a different meta data.

33

add comment the loop body is by-passed most of the time.

37

explicitly check br instruction as well so that two branch targets are listed. !prof is meaningless without looking at the branch

51

Same here -- also check branch instruction

I am very sorry I have not put up the #s for speccpu 2006. I am stuck in middle of a few things and I will put it up as soon as I have them. I will also address the comments too.

Address davidxl's comments.

I reworked how !prof metadata is computed after loop rotation.

There is one test case in peel-loop-pgo.ll. thats because I corrected how
estimated loop trip count is computed. I think its a good idea to fix loop rotation
metadada and getEstimatedTripCount together as getEstimatedTripCount uses
!prof metadata which is computed in a new way. However, the drawback is that
i need to modify a test case unrelated to loop rotation.

I ran C/C++ benchmarks in CPU2006 with the current state of the patch (baseline is without the metadata adjustment, negative percent means the benchmark runs slower or code size becomes smaller after the patch. It seems the the regressions in 429 and 444 are real. And we also have a code size reduction of -2.41% in 401. Overall, we have more performance regressions after we adjust the metadata this way.

Benchmark	Perf		CodeSize
400		-0.22%		0.00%
401		0.76%		-2.41%
403		-0.51%		0.07%
429		-2.09%		0.32%
445		0.64%		-0.08%
456		-0.29%		0.01%
458		0.00%		-0.04%
462		-0.15%		0.00%
464		0.16%		0.00%
471		-0.38%		-0.02%
473		0.00%		-0.21%
483		-0.40%		-0.04%
433		0.73%		0.03%
444		-2.15%		-0.26%
447		-0.17%		-0.06%
450		0.00%		-0.05%
453		0.38%		-0.01%
470		0.12%		0.00%
482		0.18%		0.14%
davidxl added inline comments.Feb 21 2017, 2:33 PM
lib/Transforms/Scalar/LoopRotation.cpp
591

What if the average trip count is > 20? In that case, 0.05 is larger than the latch br exit probability which is not right. The 'min' needs to be taken.

It is also better to use fixed point operation (with BranchProbablilty and BlockFrequency classes)

BranchProbability GuardBP(5, 100);
uint64_t NotTakenOnceWeight = (BlockFrequency(LoopExitWeight) * GuardBP).getFrequency();
uint64_t LoopPHWeight = LoopExitWeight - NotTakenOnceWeight;

GuardBR->setMetadata(MD_prof, ...);

'5' also needs to be a parameter. Do not use hard coded number.

623

same here, 0.95 may be smaller than the original Header BR's exit branch probability. The max should be taken.

The value 0.5 is not necessary -- why not just directly use original header BR's exit BP?

636

This can be done by computing the right notTakenOnceRatio above.

649

The weight computation code above should be refactored and shared across different cases. The difference only lies in the way NotTakenOnceRatio is computed.

Sorry for the really long hiatus. I am getting back to this one.

Rework the reworked metadata update in loop rotation. I still have some hardcoded numbers
i do not know whether to put them as options.

At least, we should try to agree on the mechanism itself and start collecting some numbers.

Add asserts and comments.

Incorrectly used max in place of min.

davidxl added inline comments.May 3 2017, 12:22 PM
lib/Transforms/Scalar/LoopRotation.cpp
561

maximum--> minimum

578

Avoid using floating point operation here. Just do:

if (LoopBodyWeight > 20 * LoopExitWeight) ..

Branch Weight Data is actually 32bit, so there is no risk of overflow here.

584

When ACT < 0.2, it seems simpler just do

NotTakenOnceProb = (ExitWeight - BodyWeight)/ExitWeight.

599

This adjustment will create weird results:

For ACT between 1 and 5, the computed NonTakenOnce Prob is kept which 50%. However, when ACT is between 0.5 and 1, the resulting not taken once Prob is smaller than 50% which contradicts to the trend -- the smaller ACT is, the more likely it is not executed once.

To fix this, do the following:

When ACT is between 0.5 and 5, use 50%
When ACT is less than 0.5, use (ExitWeight - BodyWeight)/ExitWeight

This makes the weight 'continuous and the special handling using 'min' can be removed.