This enables loop peeling by default when profile information is available.
Performance on SPEC looks flat, modulo noise. However, I am seeing improvements on internal benchmarks with instrumentation-based PGO.
The previous improvement in SPEC povray disappeared. Either it was a fluke after all, or it was "swallowed" by r287186 (that is, the important thing for povray was disabling runtime unrolling, not enabling peeling).
Two caveats:
- The mean size increase for SPEC is 1.4%, and the total .text size increase is ~0.5%. Most size increases are very modest, and we even have some size decreases (I guess due to interaction with inlining), but there are a couple of outliers: bzip2 grows by ~10.5%, and sphinx3 by ~4.5%. For the set of internal benchmarks, the size impact is much smaller.
- This may cause regressions if the profile information is very inaccurate (no surprise). The unfortunate part is that this currently happens for sampling PGO, because we can get weight(preheader) = weight(header), causing us to peel one iteration. The case I'm seeing should be fixed once D26256 lands, but there may be more.
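
To illustrate the sampling-PGO failure mode from the second caveat, here is a minimal sketch of how equal preheader and header weights can turn into a peel count of 1. This is a simplified model for discussion, not the code in this patch or LLVM's actual API: the helper names `estimateTripCount` and `choosePeelCount`, and the `maxPeelCount` parameter, are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: estimate the average number of iterations per loop
// entry as (header executions) / (loop entries), rounded to nearest.
// Returns 0 when there is no usable profile (the loop was never entered).
std::uint64_t estimateTripCount(std::uint64_t preheaderWeight,
                                std::uint64_t headerWeight) {
  if (preheaderWeight == 0)
    return 0;
  return (headerWeight + preheaderWeight / 2) / preheaderWeight;
}

// Hypothetical policy: peel exactly the estimated number of iterations,
// but only if that fits in an assumed budget. Returns 0 for "do not peel".
unsigned choosePeelCount(std::uint64_t estimatedTripCount,
                         unsigned maxPeelCount) {
  if (estimatedTripCount == 0 || estimatedTripCount > maxPeelCount)
    return 0;
  return static_cast<unsigned>(estimatedTripCount);
}
```

Under this model, weight(preheader) = weight(header) yields an estimate of one iteration per entry, so one iteration gets peeled even if the real loop runs many times and the sampled weights are simply inaccurate.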
Anyone interested in testing this for their PGO workloads?