The tail duplication currently embedded in MBP duplicates a BB into either all or none of its predecessors, with very little cost analysis. As a result, a BB is sometimes duplicated into cold predecessors, and in other cases duplication into hot predecessors is missed.
This patch improves tail duplication in three ways:
- A successor can be duplicated into a subset of its predecessors.
- The benefit analysis is more fine-grained. Combined with the first item, a successor is now duplicated into hot predecessors only.
- If a successor can't be duplicated into one predecessor, that no longer prevents duplication into the other predecessors.
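
To make the three points above concrete, here is a minimal standalone sketch of per-predecessor selection. It is not the patch's code and uses no LLVM types: `PredEdge`, `selectDuplicationPreds`, and `MinHotPercent` are hypothetical names, and the percentage threshold is an assumed stand-in for the real benefit analysis.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an edge from a predecessor into the tail block.
struct PredEdge {
  std::string Name;   // label of the predecessor block
  uint64_t Freq;      // profile frequency of the edge pred -> tail block
  bool CanDuplicate;  // false if the tail cannot be duplicated into this pred
};

// Returns the names of the predecessors that should receive a copy of the
// tail block. MinHotPercent is an assumed tuning knob: a predecessor counts
// as hot when its edge carries at least that share of the incoming frequency.
std::vector<std::string>
selectDuplicationPreds(const std::vector<PredEdge> &Preds,
                       unsigned MinHotPercent) {
  assert(MinHotPercent <= 100 && "threshold is a percentage");

  uint64_t Total = 0;
  for (const PredEdge &P : Preds)
    Total += P.Freq;

  std::vector<std::string> Selected;
  if (Total == 0)
    return Selected; // no profile weight, so nothing is considered hot

  for (const PredEdge &P : Preds) {
    // A predecessor that can't take the copy is skipped on its own; it does
    // not veto duplication into the remaining predecessors.
    if (!P.CanDuplicate)
      continue;
    // Duplicate only into hot predecessors. Integer comparison avoids
    // floating point; the sketch assumes Freq * 100 does not overflow.
    if (P.Freq * 100 >= Total * MinHotPercent)
      Selected.push_back(P.Name);
  }
  return Selected;
}
```

With edges `{"hot", 900, true}`, `{"cold", 10, true}`, and `{"blocked", 90, false}` and a 20% threshold, only `hot` is selected: the cold predecessor is filtered out by the threshold, and the blocked one is skipped without affecting the others.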
As with many other code layout changes, it doesn't affect the performance of spec2006int, because spec doesn't have a large instruction working set.
Performance tests with two of Google's internal search benchmarks show a clear improvement on large binaries: 0.30% and 0.18%, respectively.
Need to update the comments for this function.
Nit: maybe move CandidatePtr before DuplicatePreds?