This is an archive of the discontinued LLVM Phabricator instance.

getLoopEstimatedTripCount should really be called getLoopEstimatedBackedgeTakeCount.
AbandonedPublic

Authored by trentxintong on Jan 8 2017, 12:24 PM.

Details

Summary

getLoopEstimatedTripCount should really be called getLoopEstimatedBackedgeTakeCount.

Event Timeline

trentxintong retitled this revision from to getLoopEstimatedTripCount should really be called getLoopEstimatedBackedgeTakeCount..
trentxintong updated this object.
trentxintong added reviewers: mkuper, danielcdh, davidxl.
trentxintong added a subscriber: llvm-commits.

@danielcdh I can change the FlatLoopTripCountThreshold as this does change the behavior of unroller right now.

danielcdh edited edge metadata.Jan 9 2017, 10:16 AM

What's the motivation of this change?

Thanks,
Dehao

How do we represent a trip-count-zero estimate?

include/llvm/Transforms/Utils/LoopUtils.h
464

BE taken count -> backedge-taken count

What's the motivation of this change?

Thanks,
Dehao

The getLoopEstimatedTripCount function is really returning the backedge taken count. Trip count == backedge taken count + 1. So I rename it in this patch. The place I am not sure is whether you were looking for backedge count or trip count in LoopUnrollPass.cpp. It seems that you were assigning a profiled backedge count to ProfileTripCount variable.

mkuper edited edge metadata.Jan 9 2017, 10:49 AM

I don't think this is the change we want to make here, for a few reasons:

  1. On the API level, I think it makes sense to have the estimated loop trip-count. So, if we want to add 1 to the BE count, that should probably just happen inside the function, instead of renaming the function but keeping functionality as is.
  2. I'm not sure there's an actual off-by-1 here. Or, rather, there's definitely an off-by-1, but only in some cases - I guess it depends on whether the loop is bottom-tested or not?
  3. I've actually verified that this gives the right results for instrumentation-based FDO for loops with a known trip count. It's possible that I simply made a mistake - or perhaps there's an error somewhere else that this cancels out with. If you want to change this, could you please check that again?

cat b.cc

int bar(int);

void foo(int a) {

while (a) {
  a = bar(a);
}

}

#cat c.cc
void foo(int);

int bar(int x) {

return x - 1;

}

int main() {
foo(5);
return 0;
}

#clang -O2 -fprofile-generate=prof.raw b.cc c.cc -o instr_bin
#llvm-profdata merge prof.raw -o prof
#clang b.cc -c -O2 -fprofile-use=prof -emit-llvm
#opt -S b.bc |grep branch_weights
!29 = !{!"branch_weights", i32 1, i32 5}

Looks like (5 + 1/2) / 1 = 5 is the correct trip count?

I don't think this is the change we want to make here, for a few reasons:

  1. On the API level, I think it makes sense to have the estimated loop trip-count. So, if we want to add 1 to the BE count, that should probably just happen inside the function, instead of renaming the function but keeping functionality as is.

I fully agree.

  1. I'm not sure there's an actual off-by-1 here. Or, rather, there's definitely an off-by-1, but only in some cases - I guess it depends on whether the loop is bottom-tested or not?

In case the loop is bottom tested (rotated), there would be an off-by-1, as the loop body is executed backedge-count + 1 times.
If its not bottom tested, we would not have this profile data on the latch, right ? I would assume the latch simply branches back to the header and the test is in the header.

  1. I've actually verified that this gives the right results for instrumentation-based FDO for loops with a known trip count. It's possible that I simply made a mistake - or perhaps there's an error somewhere else that this cancels out with. If you want to change this, could you please check that again?

Yes, that's basically how I verified it as well. :-)

cat b.cc

int bar(int);

void foo(int a) {

while (a) {
  a = bar(a);
}

}

#cat c.cc
void foo(int);

int bar(int x) {

return x - 1;

}

int main() {
foo(5);
return 0;
}

#clang -O2 -fprofile-generate=prof.raw b.cc c.cc -o instr_bin
#llvm-profdata merge prof.raw -o prof
#clang b.cc -c -O2 -fprofile-use=prof -emit-llvm
#opt -S b.bc |grep branch_weights
!29 = !{!"branch_weights", i32 1, i32 5}

Looks like (5 + 1/2) / 1 = 5 is the correct trip count?

Thanks for the example, I will use it to take a look.

In case the loop is bottom tested (rotated), there would be an off-by-1, as the loop body is executed backedge-count + 1 times.
If its not bottom tested, we would not have this profile data on the latch, right ? I would assume the latch simply branches back to the header and the test is in the header.

Hm, yes, that sounds right.

I'm still not sure why we get seemingly correct results with the current code, though. :-)
One option is that we don't. But if we do - maybe we don't update profile metadata correctly in loop rotation?

In case the loop is bottom tested (rotated), there would be an off-by-1, as the loop body is executed backedge-count + 1 times.
If its not bottom tested, we would not have this profile data on the latch, right ? I would assume the latch simply branches back to the header and the test is in the header.

Hm, yes, that sounds right.

I'm still not sure why we get seemingly correct results with the current code, though. :-)
One option is that we don't. But if we do - maybe we don't update profile metadata correctly in loop rotation?

I am pretty sure its a problem with loop rotation, basically when we duplicate and move the terminator in the header, we did not update the prof metadata correctly. With the example @danielcdh gives me, right after rotation I am getting this. The !prof in guard block is definitely incorrect, the one in the latch block is not correct either I think. I am working on fixing this.

define void @_Z3fooi(i32 %a) local_unnamed_addr #0 !prof !29 {
entry:

%tobool2 = icmp eq i32 %a, 0
br i1 %tobool2, label %while.end, label %while.body, !prof !30

while.body: ; preds = %entry, %while.body

%a.addr.03 = phi i32 [ %call, %while.body ], [ %a, %entry ]
%call = tail call i32 @_Z3bari(i32 %a.addr.03)
%tobool = icmp eq i32 %call, 0
br i1 %tobool, label %while.end, label %while.body, !prof !30

while.end: ; preds = %while.body, %entry

ret void

}

!30 = !{!"branch_weights", i32 2, i32 10}

trentxintong edited edge metadata.

Address suggestions by @mkuper

I think this ought to be merged into D28593 - committing (or reverting) them separately would mean we get the wrong trip counts, so they should go in atomically.

Yes, I was about to do that.