Page MenuHomePhabricator

[MBP][X86] Include static prof data when collecting loop BBs

Authored by void on Feb 18 2020, 5:18 PM.



If the programmer adds static profile data to a branch---i.e. uses
"builtin_expect()" or similar---then we should honor it. Otherwise,
builtin_expect()" is ignored in crucial situations. So we trust that
the programmer knows what they're doing until proven wrong.

Diff Detail

Event Timeline

void created this revision.Feb 18 2020, 5:18 PM

I'm in no way qualified to be reviewing changes to something with as important implications as machine block placement, but from the test modifications, it doesn't seem too invasive a change. Posting mostly notes for other reviewers to check.


I really wish this was a method on MachineBlockPlacement or MachineLoop rather than a lambda in an if statement.


LGTM; b5 is only branched to from b4, but b4's has higher branch weight to b2. Though it would be nice to just list all the blocks in order; otherwise I'm just guessing where the rest land.


probably should specify where if.end.i.i gets placed, too, since it also has profile metadata.


LGTM; backedge is very likely to go to header. middle is unlikely to go to slow


LGTM; %latch has a 50-50 chance of branching either to %exit or %header. !3 seems unused.


this change is curious; I'd have expected je's to flip to jne's or vice versa, but not unconditional jumps.

void marked 2 inline comments as done.Feb 18 2020, 9:10 PM
void added inline comments.

This change is very interesting actually. The two mem-move blocks are placed at the end of the function. They then branch back up into what was a tail duplicated block. So the testl/je statements are actually at the original place, and the jmp here is just a simple branch to them.

lebedev.ri retitled this revision from Include static prof data when collecting loop BBs to [MBP][X86] Include static prof data when collecting loop BBs.Feb 19 2020, 1:27 AM

I think if we add more checks as I've commented above and see if there's a way to make the lamba-in-if nicer, this is ready to land.

void updated this revision to Diff 245459.Feb 19 2020, 10:34 AM
void marked 4 inline comments as done.

Move lambda logic into a method. Improve some tests with additional checks.


Is it that it's cleaner to do it that way? The logic is going to be basically the same.

nickdesaulniers accepted this revision.Feb 19 2020, 10:59 AM
nickdesaulniers marked 2 inline comments as done.

LGTM; thanks Bill!


Yes, it's much nicer IMO to have short concise well named functions. It helps make this code more readable. Thank you.


Looks like the intent of the test is that "if.then" was not predicated. not necessarily anything to do with the block placement.

This revision is now accepted and ready to land.Feb 19 2020, 10:59 AM
This revision was automatically updated to reflect the committed changes.
leonardchan added a subscriber: leonardchan.EditedMar 18 2020, 12:27 PM

Is it expected that this patch could increase binary size? I'm seeing about a 9kB total increase in fuchsia ZBIs between clang with this patch and the one before it.

davidxl added inline comments.

The check here seems too weak. One branch in the loop has user annotation does not mean other branch probablity data can be trusted. More sophisticated analysis is needed.


Note that with PGO is on, the branches are also annotated with MD_prof meta data. In other words, you will also need to check fhat the function entry does not have profile count.

void marked an inline comment as done.Mar 20 2020, 3:40 PM
void added inline comments.

Do you mean that a builtin_expect shouldn't be considered when using profile data?

davidxl added inline comments.Mar 20 2020, 5:12 PM

No, that is what I meant.

When profile data is available, builtin_expect will be ignored -- the prof meta data is overridden by profile data reader.

What I meant is that builtin_expect has very vague meaning -- some user uses it to indicate branch bias which can be weak, or strong. LLVM's implementation treat it as a very strong bias (2000:1 ratio) -- so it will likely to exclude too many blocks from the block set when used in this context -- leading to code size increase or performance regression.

In other words, PGO is the way to go :)