This is an archive of the discontinued LLVM Phabricator instance.

[MBP] Use profile count to compute tail dup cost if it is available
ClosedPublic

Authored by Carrot on Jul 6 2020, 4:45 PM.

Details

Summary

Current tail duplication in machine block placement pass uses block frequency information in cost model. But frequency number has only relative meaning compared to other basic blocks in the same function. A large frequency number doesn't mean it is hot and a small frequency number doesn't mean it is cold.

To overcome this problem, this patch uses profile count in cost model if it's available. So we can tail duplicate real hot basic blocks.

When tested with spec2006int, the performance doesn't change, the number of tail duplicated blocks was reduced from 2376 to 1746.

In our internal testing, search1 was not impacted, search2 was improved by 0.1%, another 0.1% can be achieved with larger threshold parameter.

Diff Detail

Event Timeline

Carrot created this revision.Jul 6 2020, 4:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 6 2020, 4:45 PM

A few things:

  1. can this be extended to the TailDup pass too?
  2. what is the impact on text size?
  3. what is the performance with AFDO and XFDO?
  1. Can the threshold be tuned higher (with more performnace)?
llvm/lib/CodeGen/MachineBlockPlacement.cpp
419

This should probbaly return Optional<..>

425

Return None here. Also when None is returned, I think the caller needs to handle it conservatively -- perhaps resort to Freq based method

3164

Is this change related?

Carrot marked 3 inline comments as done.Jul 13 2020, 6:27 PM

A few things:

  1. can this be extended to the TailDup pass too?

This patch only changes the cost function, it doesn't change tail dup algorithms, especially the partial duplication is not implemented in TailDup pass.

  1. what is the impact on text size?

The total text size of spec2006int is reduced by 7804 bytes.

  1. what is the performance with AFDO and XFDO?

The performance benefit depends on the precision of profile count. So I expect the performance gain with AFDO is similar to FDO. But XFDO has less accurate profile count, so it should has less performance gain.

  1. Can the threshold be tuned higher (with more performnace)?

The performance impact of tail duplication is mainly a function of two interfering factors, increased fall through (reduced taken branch) and increased code size (less effective usage of instruction cache).
For small binaries, the instruction cache is not a problem, so they prefer more tail duplications, thus smaller TailDupProfileCountThreshold.
For large binaries, it's not a good idea to trade a small number of increased fall through for worse instruction cache behavior. So we need to increase the bar for tail duplication, thus larger TailDupProfileCountThreshold.
The current default value of TailDupProfileCountThreshold is tuned to keep spec2006int not impacted, while still has 0.1% perf win for Goolge search benchmark. I can get another 0.1% improvement for the search benchmark if I increase TailDupProfileCountThreshold to 500.

llvm/lib/CodeGen/MachineBlockPlacement.cpp
419

See following reply.

425

Look at the implementation of getBlockProfileCount and getBlockFreq.

BlockFrequency BlockFrequencyInfo::getBlockFreq(const BasicBlock *BB) const {

return BFI ? BFI->getBlockFreq(BB) : 0;

}

Optional<uint64_t>
BlockFrequencyInfo::getBlockProfileCount(const BasicBlock *BB,

                                       bool AllowSynthetic) const {
if (!BFI)
  return None;

return BFI->getBlockProfileCount(*getFunction(), BB, AllowSynthetic);

}

In the same condition (!BFI), neither function returns meaningful result, but one returuns None, another returns 0.

Another situation None may be returned is the function has no profile count, so the result is actually a meaningful 0.

3164

It is not profile count cost model related.
It is a different tail dup improvement.
Do you want me to send another patch for it?

Carrot updated this revision to Diff 277630.Jul 13 2020, 6:30 PM
davidxl added inline comments.Jul 15 2020, 8:22 PM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3164

Better separate it. Is the contributing to the performance improvement mentioned?

Carrot updated this revision to Diff 278918.Jul 17 2020, 3:42 PM
Carrot marked an inline comment as done.
Carrot added inline comments.
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3164

It doesn't cause visible performance impact.

davidxl accepted this revision.Jul 20 2020, 9:38 AM

lgtm

llvm/lib/CodeGen/MachineBlockPlacement.cpp
181

Nit: TailDupProfilePercentThreshold

This revision is now accepted and ready to land.Jul 20 2020, 9:38 AM
Carrot updated this revision to Diff 279358.Jul 20 2020, 3:33 PM
Carrot marked an inline comment as done.

Will commit this version.

This revision was automatically updated to reflect the committed changes.