This implements profile-based loop peeling.
The basic idea is that when PGO data shows the average dynamic trip count of a loop to be low, we can expect a performance win from peeling off the first several iterations of that loop. Unlike unrolling based on a known trip count, or a trip count multiple, this doesn't save us the conditional check and branch on each iteration. However, it does allow us to simplify the resulting straight-line code (constant folding, etc.), which matters because we expect to usually execute only this peeled code and not the actual loop.
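To make the transformation concrete, here is a source-level sketch of what peeling two iterations looks like. It is illustrative only: the pass itself works on IR, and the function names, the peel count of 2, and the `scale` recurrence are all made up for this example rather than taken from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original loop: `scale` is a loop-carried value that starts at 1.
int sum_original(const std::vector<int>& a) {
    int acc = 0;
    int scale = 1;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc += a[i] * scale;
        scale += 2;
    }
    return acc;
}

// Hand-peeled version (hypothetical peel count of 2).
// The exit check is still performed before every peeled iteration,
// but `scale` is a known constant in each copy, so it folds away.
int sum_peeled(const std::vector<int>& a) {
    const std::size_t n = a.size();
    int acc = 0;

    // Peeled iteration 0: scale == 1, acc == 0.
    if (n == 0) return acc;
    acc = a[0];

    // Peeled iteration 1: scale == 3.
    if (n == 1) return acc;
    acc += a[1] * 3;

    // Remaining iterations: rarely reached if the profiled trip count is low.
    int scale = 5;
    for (std::size_t i = 2; i < n; ++i) {
        acc += a[i] * scale;
        scale += 2;
    }
    return acc;
}

// Small helper to confirm both versions agree on a given input.
void check_equivalent(const std::vector<int>& a) {
    assert(sum_original(a) == sum_peeled(a));
}
```

Note that each peeled copy keeps its own exit test, which is the cost mentioned above; the gain is that the multiplication by `scale` disappears in the first copy and becomes a multiply by a constant in the second.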
The code is somewhat similar to (and is based on the original version of) the runtime unrolling code, but I think they're sufficiently different that trying to share the implementation isn't a good idea. Since the current runtime unrolling implementation already has two different prolog/epilog cases, making it do peeling as well would make it rather unreadable.
I'm planning on committing this as disabled-by-default, until I have a bit more confidence in the performance - some more tuning may be required.
@mkuper is there any particular reason why the backedge weight has to be 1 instead of 0?