Page MenuHomePhabricator

[LoopUnroll] Peel off iterations if it makes conditions true/false.
ClosedPublic

Authored by fhahn on Feb 28 2018, 9:41 AM.

Details

Summary

If the loop body contains conditions of the form IndVar < #constant, we
can remove the checks by peeling off #constant iterations. This patch
initially starts out with supporting very simple conditions. It can be
made more powerful in follow-up commits.

This improves codegen for PR34364.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
efriedma added inline comments.Feb 28 2018, 11:56 AM
lib/Transforms/Utils/LoopUnrollPeel.cpp
154 ↗(On Diff #136315)

Please don't use getCanonicalInductionVariable; indvars doesn't generate canonical induction variables anymore, so this will only handle limited cases.

Can you use SCEV here instead?

177 ↗(On Diff #136315)

This doesn't check for the possibility that the induction variable could wrap... not a correctness problem, of course, but it's not clear the transform is profitable in that case.

mkazantsev added inline comments.Feb 28 2018, 9:18 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
154 ↗(On Diff #136315)

I would also advice using SCEV here. It can help to cover more cases than just comparison against constants for no extra price. See ScalarEvolution::isKnownPredicate or ScalarEvolution::isLoopEntryGuardedByCond.

175 ↗(On Diff #136315)

Is it OK if you have negative UpperBound? If you interpret it as unsigned value, it is a very big value, and you are going to peel max possible amount of iterations in this case.

test/Transforms/LoopUnroll/peel-loop-conditions.ll
10 ↗(On Diff #136315)

I don't think it's profitable unless we have proved that %k is positive.

mkazantsev requested changes to this revision.Feb 28 2018, 9:20 PM
This revision now requires changes to proceed.Feb 28 2018, 9:20 PM
lebedev.ri added inline comments.Mar 1 2018, 2:33 AM
test/Transforms/LoopUnroll/peel-loop-conditions.ll
75 ↗(On Diff #136315)

What about peeling from the end (i guess will still work), and from the beginning and end?

fhahn updated this revision to Diff 136540.Mar 1 2018, 9:00 AM

Update to use SCEV. It now uses a rather simple approach: evaluate AddRec at the DesiredPeelCount and then try to add the recurrence step until the predicate is not known any longer. We could probably compute the number of iterations directly, but I could not find any existing function in SCEV that would handle all predicates. But we only loop for MaxPeelCount iterations at max, and MaxPeelCount should be low.

fhahn added inline comments.Mar 1 2018, 9:08 AM
test/Transforms/LoopUnroll/peel-loop-conditions.ll
75 ↗(On Diff #136315)

We could do that as well. It would require changes to peelLoop though and should be done as a follow up patch to this one.

I believe this has lots of potential (not just for constant). What other follow-up commits are you planning on ?

fhahn added a comment.Mar 1 2018, 3:45 PM

I believe this has lots of potential (not just for constant). What other follow-up commits are you planning on ?

I am open for suggestions. I think peeling off iterations at the end as Roman suggested might be worthwhile too. And opportunistically peeling off a few iterations, if the distance between start and end are low, but unrolling does not apply.

I am open for suggestions. I think peeling off iterations at the end as Roman suggested might be worthwhile too. And opportunistically peeling off a few iterations, if the distance between start and end are low, but unrolling does not apply.

Peeling off iterations at the end might hit the case I'm trying to optimize. I will be happy to support it.

lib/Transforms/Utils/LoopUnrollPeel.cpp
169 ↗(On Diff #136540)

Isn't it okay to remove this nullcheck for Condition ?

192 ↗(On Diff #136540)

Do we really need this loop. Isn't it possible to find the count using getMinusSCEV(RightSCEV, LeftSCEV->getStart()) as long as getMinusSCEV(RightSCEV, LeftSCEV->getStart()) is constant ?

fhahn added a comment.Mar 7 2018, 2:12 PM

Peeling off iterations at the end might hit the case I'm trying to optimize. I will be happy to support it.

Great! Is there any code/examples you could share that would benefit?

lib/Transforms/Utils/LoopUnrollPeel.cpp
169 ↗(On Diff #136540)

Ah yes, we are not dyn_cast'ing any more.

192 ↗(On Diff #136540)

Yes we could, but we would need some logic dealing with different predicates I think. That should not be too much work and I am happy to do it, unless there is existing infrastructure I might have missed.

175 ↗(On Diff #136315)

With using SCEV, this problem went away I think.

177 ↗(On Diff #136315)

I think this still needs wrapping checks. I'll have to look into how to best do that what SCEV.

Great! Is there any code/examples you could share that would benefit?

Hopefully, below C code should show the case I'm pursuing. In this loop, we know that the condition is false only in the last one iteration.

void test (int* p, int* p2, int* p3, int v, int L) {

unsigned int M = *p;
int k;
for (k = 1; k <= M; k++) {
  p2[k] = v;
  if (k < M) {
      p3[k] = v;
  }
}

}

samparker added a comment.EditedMar 8 2018, 1:07 AM

Just FYI, I am also currently working on a pass to handle slightly more generic cases within the loop body, such like:

for (unsigned i = 0; i; i < 1000; ++i) {

if (i < M)
... something
else
... something else

}

My transformation operates on the unrolled version and also attempts to remove selects if the loop body has been simplified into one block. Hopefully I will be able to get an initial version up for review and discussion next week.

fhahn added a comment.Mar 8 2018, 1:19 AM

Just FYI, I am also currently working on a pass to handle slightly more generic cases within the loop body, such like:

for (unsigned i = 0; i; i < 1000; ++i) {

if (i < M)
... something
else
... something else

}

My transformation operates on the unrolled version and also attempts to remove selects if the loop body has been simplified into one block. Hopefully I will be able to get an initial version up for review and discussion next week.

Interesting. What transformation is this pass doing? Loop peeling should handle cases like that for constant bounds. Non-constant bounds still won't be peeled, so it should not interfere with your transformation.

Yes, this is a really nice approach for lower constant bounds. My transform selects a region of conditionally executed blocks and then attempts to hoist conditional statements into the head block of that region. Currently I can find loop unrolled induction variable idioms as well as finding a connected path of conditional values. From there I can produce a fast path free of compares and branches, as well as leaving the original code as a fallback. I also split and remove selects because performing conditional moves is still expensive on our small cores as well as the increased register pressure they create.

My transform selects a region of conditionally executed blocks and then attempts to hoist conditional statements into the head block of that region.

Will your pass merge some blocks under same condition into a group in the loop? I'm curious if the conditional branch is removed in the loop?

Just FYI, I am also currently working on a pass to handle slightly more generic cases within the loop body, such like:

for (unsigned i = 0; i; i < 1000; ++i) {
  if (i < M)
    ... something
  else
    ... something else
}

My transformation operates on the unrolled version and also attempts to remove selects if the loop body has been simplified into one block. Hopefully I will be able to get an initial version up for review and discussion next week.

I'm not sure if I fully understand what your pass will do. It would be great if you can show how this code would be transformed after your pass ?

To remove the conditional branch in general, I think we can split the loop like :

Min = min(1000, M);
unsigned i;
for (i = 0; i; i < Min; ++i) {
    ... something
}

for (; i; i < 1000; ++i) {
    ... something else
}

@samparker, will your pass perform something like this or something else ?

I believe peeling is a special case of such loop splitting, but peeling itself is worthwhile to do for a low constant bound.

fhahn added inline comments.Mar 8 2018, 2:05 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
192 ↗(On Diff #136540)

I had a closer look today. By using isKnownPredicate, we can also handle cases like the one below, where we compare 2 SCEVAddRecExprs, easily. Given that MaxPeelCount should be quite small and we only iterate at most MaxPeelCount times for each compare instruction, I am not sure if it's worth making things more complicated. It might be worth to add a "fast-path" for the case where we comparing a constant bound with a AddRecExpr with a known integer constant. What do you think?

for.body.lr.ph:
  br label %for.body

for.body:
  %i.05 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.inc ]
  %j = phi i32 [ 2, %for.body.lr.ph ], [ %j.inc, %for.inc ]
  %cmp1 = icmp ult i32 %i.05, %j
  br i1 %cmp1, label %if.then, label %for.inc

if.then:
  call void @f1()
  br label %for.inc

for.inc:
  %inc = add nsw i32 %i.05, 2
  %j.inc = add nsw i32 %j, 1
  %cmp = icmp slt i32 %inc, %k
  br i1 %cmp, label %for.body, label %for.end

for.end:
  ret void
junbuml added inline comments.Mar 9 2018, 10:44 AM
lib/Transforms/Utils/LoopUnrollPeel.cpp
192 ↗(On Diff #136540)

Based on that MaxPeelCount is a small constant, I don't see any big burden on it. fast-path sounds good to me.

mkazantsev requested changes to this revision.Mar 11 2018, 10:12 PM
mkazantsev added inline comments.
lib/Transforms/Utils/LoopUnrollPeel.cpp
173 ↗(On Diff #136540)

What was the point of moving this if? Could we not just update DesiredPeelCount before this line?
We only need MaxPeelCount under this condition, there is no point in calculating it before it.

184 ↗(On Diff #136540)

After all this logic, could you please create a variable like LeftAR = cast<SCEVAddRecExpr>(LeftSCEV) to make it explicitly clear what you expect to see next? I also suggest bailing out if LeftAR->isAffine() is false because it will lead you to huge SCEV computations in the loop below.

187 ↗(On Diff #136540)

Any good reason why it is 16 bit?

189 ↗(On Diff #136540)

You could have used SE.getConstant(LeftSCEV->getType(), DesiredPeelCount) and no need to declare C for this.

190 ↗(On Diff #136540)

I totally don't get this logic. Are we going to prove _another_ predicate if SCEV was unable to prove something for this one? Could you please add a comment on what is going on here?

194 ↗(On Diff #136540)

Step calculation can be hoisted out of this loop. I would also suggest bailing early if AR is not affine because adding a step of non-affine AddRec many times can produce really big and ugly SCEVs.

This revision now requires changes to proceed.Mar 11 2018, 10:12 PM
fhahn updated this revision to Diff 138061.Mar 12 2018, 11:25 AM
fhahn marked 4 inline comments as done.

@mkazantsev thank you very much for having a look. I hope I addressed your feedback. Are you concerned about having the loop that evaluates the predicate at the current iteration?

fhahn added inline comments.Mar 12 2018, 11:29 AM
lib/Transforms/Utils/LoopUnrollPeel.cpp
173 ↗(On Diff #136540)

MaxPeelCount is passed to countToEliminateCompares, to limit the maximum iterations we do.

194 ↗(On Diff #136540)

I've added a comment that makes it clearer I hope. The idea is to handle cases like below, where the condition is known to be false initially. Initially i > 2 is not known, but the inverse i <= 2 is known.

if (i > 2) {
  // do something
} else {
  // do something else
}
wcxf added a subscriber: wcxf.Mar 12 2018, 8:00 PM
mkazantsev added inline comments.Mar 12 2018, 10:01 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
173 ↗(On Diff #136540)

My bad, I didn't notice it. :)

194 ↗(On Diff #136540)

I see the point now, but you are using wrong function. If you look into getSwappedPredicate, it returns < for >, and what you need is called getInversePredicate.

mkazantsev requested changes to this revision.Mar 12 2018, 10:03 PM

Please fix getSwappedPredicate with getInversePredicate, the rest looks fine now.

This revision now requires changes to proceed.Mar 12 2018, 10:03 PM
fhahn updated this revision to Diff 138144.Mar 13 2018, 2:46 AM

Thanks. Updated to use getInversePredicate

lib/Transforms/Utils/LoopUnrollPeel.cpp
194 ↗(On Diff #136540)

Right, my bad, thank you very much :)

mkazantsev added inline comments.Mar 13 2018, 9:34 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
182 ↗(On Diff #138144)

Why do we bail if both left and right are AddRecs? What is the problem?

If there are no conceptual troubles with handling this case, I'd rather handle it. And if so, please add a corresponding test.

mkazantsev added inline comments.Mar 13 2018, 9:55 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
173 ↗(On Diff #138144)

I think you should also check isKnownPredicate for LeftSCEV, RightSCEV at this point and bail if it is trivially known true or false. In this case it makes more sense to eliminate this comparison than try to peel something out.

mkazantsev requested changes to this revision.Mar 13 2018, 10:31 PM

I think it is a bug to be fixed.

lib/Transforms/Utils/LoopUnrollPeel.cpp
185 ↗(On Diff #138144)

Bail if LeftAR->getLoop() != L.

This revision now requires changes to proceed.Mar 13 2018, 10:31 PM
fhahn updated this revision to Diff 138308.Mar 14 2018, 2:31 AM
fhahn marked 2 inline comments as done.

Thanks again! I've added a test case to make sure we do not peel the inner loop if the condition depends ARs of the outer loop

fhahn added inline comments.Mar 14 2018, 12:57 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
182 ↗(On Diff #138144)

I don't think there are conceptual problems, although it will make things slightly more complicated; I think we would need some more checks and also evaluate the second AR, if we have one. I have test case, but I would prefer to add that in a follow up patch, if that is ok with you?

mkazantsev accepted this revision.Mar 14 2018, 9:30 PM

LGTM. I would rather prefer removing the restriction on RightSCEV being AR in this patch and then making a follow-up if you need some extra logic for it, but it's up to you.

lib/Transforms/Utils/LoopUnrollPeel.cpp
182 ↗(On Diff #138144)

We could just remove this "continue" check without adding extra logic for right AR. I mean, we only care that LeftSCEV is an AR and we don't care what RightSCEV actually is. For example, in a loop:

int i = 1, j = 1;
while (true) {
  i++;
  j = (j /s (i + 1));
}

SCEV of i's Phi will be a SCEVAddRec and SCEV of j's Phi will be SCEVUnknown, and for some reason you allow your optimization when RightSCEV = j and explicitly prohibit it when RightSCEV = i. I wonder how i is more complicated than j? :)

I'm OK if you remove this check in a follow-up patch or right in this one, whatever you like more.

This revision is now accepted and ready to land.Mar 14 2018, 9:30 PM
fhahn added a comment.Mar 15 2018, 2:44 AM

Thanks again for having a look. I'll drop the bail out and add a test before I commit the change.

lib/Transforms/Utils/LoopUnrollPeel.cpp
182 ↗(On Diff #138144)

Ah yes. I removed the check now, as it won't anything wrong. What I meant is that I think we could do better than just dropping the bail out. That's what I plan to do in a follow up patch.

fhahn updated this revision to Diff 138511.Mar 15 2018, 2:46 AM

Drop bailout

fhahn updated this revision to Diff 138547.Mar 15 2018, 7:28 AM

Rebased. I had to add -unroll-peel-max-count=0 to the new test case added in D43931, as loop peeling now covers that case too

This revision was automatically updated to reflect the committed changes.
mcrosier added a comment.EditedMar 22 2018, 12:01 PM

Hi Florian,
We identified a 2.15% regression in SPEC2006/h264ref due to this commit. After this change, the FullPelBlockMotionBiPred() function is no longer inlined into the hottest function, BlockMotionSearch(). Previous to this change, the function was inlined because there was a single callsite in the entire program (known only when compiling in LTO) and the original definition could be removed after inlining. However, after loop peeling the callsite of FullPelBlockMotionBiPred() is replicated, which prevents inlining.

I was wondering if we could avoid peeling in this case until we have some type of cost model that can determine if peeling would prevent inlining. Also, after looking at the code (which I can't share here) you might also notice that the amount of code being peeled in this case is fairly large relative to the amount of code being removed from the loop. It might also make sense to have a heuristic that takes code size into consideration when peeling, if that hasn't already been done.

Thoughts?

Chad

fhahn added a comment.Mar 22 2018, 3:02 PM

Hi Chad,

Hi Florian,
We identified a 2.15% regression in SPEC2006/h264ref due to this commit. After this change, the FullPelBlockMotionBiPred() function is no longer inlined into the hottest function, BlockMotionSearch(). Previous to this change, the function was inlined because there was a single callsite in the entire program (known only when compiling in LTO) and the original definition could be removed after inlining. However, after loop peeling the callsite of FullPelBlockMotionBiPred() is replicated, which prevents inlining.

I was wondering if we could avoid peeling in this case until we have some type of cost model that can determine if peeling would prevent inlining. Also, after looking at the code (which I can't share here) you might also notice that the amount of code being peeled in this case is fairly large relative to the amount of code being removed from the loop. It might also make sense to have a heuristic that takes code size into consideration when peeling, if that hasn't already been done.

Thoughts?

Thanks for making me aware of this, I originally thought considering MaxPeelCount should help us avoid those cases.

I will follow this up with patches in the following days. I think there are a couple of things that can be done to make the peeling more conservative for now. First, only peel if we can proof that the condition is known to be true in the peeled part and false in the loop (or vice versa). Otherwise we cannot simplify the loop body and peeling is likely not very beneficial. Second, have a simple cost function, that takes the size of the loop body vs the eliminated instructions into account.

Also, D43878, which enables induction variable simplification after peeling is not committed yet, so currently the loop body may not be simplified after peeling, even if it could be.

Cheers,
Florian

Hi Chad,

Hi Florian,
We identified a 2.15% regression in SPEC2006/h264ref due to this commit. After this change, the FullPelBlockMotionBiPred() function is no longer inlined into the hottest function, BlockMotionSearch(). Previous to this change, the function was inlined because there was a single callsite in the entire program (known only when compiling in LTO) and the original definition could be removed after inlining. However, after loop peeling the callsite of FullPelBlockMotionBiPred() is replicated, which prevents inlining.

I was wondering if we could avoid peeling in this case until we have some type of cost model that can determine if peeling would prevent inlining. Also, after looking at the code (which I can't share here) you might also notice that the amount of code being peeled in this case is fairly large relative to the amount of code being removed from the loop. It might also make sense to have a heuristic that takes code size into consideration when peeling, if that hasn't already been done.

Thoughts?

Thanks for making me aware of this, I originally thought considering MaxPeelCount should help us avoid those cases.

I will follow this up with patches in the following days. I think there are a couple of things that can be done to make the peeling more conservative for now. First, only peel if we can proof that the condition is known to be true in the peeled part and false in the loop (or vice versa). Otherwise we cannot simplify the loop body and peeling is likely not very beneficial. Second, have a simple cost function, that takes the size of the loop body vs the eliminated instructions into account.

Also, D43878, which enables induction variable simplification after peeling is not committed yet, so currently the loop body may not be simplified after peeling, even if it could be.

Cheers,
Florian

You're welcome, Florian. And thanks for following up on this (and all the great work you're doing here)!

fhahn added a comment.Mar 28 2018, 9:30 AM

@mcrosier I've submitted D44983 for review. It prevents peeling, if we cannot simplify the loop body after peeling. Peeling if we cannot simplify the loop body afterwards is likely not beneficial. It would be great if you could check if that helps in your case. If it's not easy for you to check, I can try and test it myself.

Coming up with some additional heuristics, e.g. based on the number of instructions peeled and not eliminated, should be possible, but without knowing the inlining situation we would probably have to choose a rather arbitrary threshold.

@mcrosier I've submitted D44983 for review. It prevents peeling, if we cannot simplify the loop body after peeling. Peeling if we cannot simplify the loop body afterwards is likely not beneficial. It would be great if you could check if that helps in your case. If it's not easy for you to check, I can try and test it myself.

Coming up with some additional heuristics, e.g. based on the number of instructions peeled and not eliminated, should be possible, but without knowing the inlining situation we would probably have to choose a rather arbitrary threshold.

Sure, I'll take a look now and let you know! Sorry I didn't see this sooner.

@mcrosier I've submitted D44983 for review. It prevents peeling, if we cannot simplify the loop body after peeling. Peeling if we cannot simplify the loop body afterwards is likely not beneficial. It would be great if you could check if that helps in your case. If it's not easy for you to check, I can try and test it myself.

Coming up with some additional heuristics, e.g. based on the number of instructions peeled and not eliminated, should be possible, but without knowing the inlining situation we would probably have to choose a rather arbitrary threshold.

Sure, I'll take a look now and let you know! Sorry I didn't see this sooner.

Unfortunately, D44983 does not fix this case. I'm going to dig into this now. I'll update you once I have some additional findings.

fhahn added a comment.Apr 4 2018, 2:33 PM

Thanks for letting me know, so it looks like we need better size based heuristics as well. I'll take a closer look at what happens in FullPelBlockMotionBiPred soon.

fhahn added a comment.Apr 6 2018, 8:41 AM

@mcrosier I've submitted D44983 for review. It prevents peeling, if we cannot simplify the loop body after peeling. Peeling if we cannot simplify the loop body afterwards is likely not beneficial. It would be great if you could check if that helps in your case. If it's not easy for you to check, I can try and test it myself.

Coming up with some additional heuristics, e.g. based on the number of instructions peeled and not eliminated, should be possible, but without knowing the inlining situation we would probably have to choose a rather arbitrary threshold.

Sure, I'll take a look now and let you know! Sorry I didn't see this sooner.

Unfortunately, D44983 does not fix this case. I'm going to dig into this now. I'll update you once I have some additional findings.

I had a closer look at FullPelBlockMotionBiPred and we peeled off an iteration because we have something like

if (i % 2) {
  ...
  if (i != 0) {...}
 ...
}

Peeling based on those nested conditional is likely to increase the code size too much compared to the benefit. D45374 only considers conditions in blocks that are executed on every iteration.

@mcrosier I've submitted D44983 for review. It prevents peeling, if we cannot simplify the loop body after peeling. Peeling if we cannot simplify the loop body afterwards is likely not beneficial. It would be great if you could check if that helps in your case. If it's not easy for you to check, I can try and test it myself.

Coming up with some additional heuristics, e.g. based on the number of instructions peeled and not eliminated, should be possible, but without knowing the inlining situation we would probably have to choose a rather arbitrary threshold.

Sure, I'll take a look now and let you know! Sorry I didn't see this sooner.

Unfortunately, D44983 does not fix this case. I'm going to dig into this now. I'll update you once I have some additional findings.

I had a closer look at FullPelBlockMotionBiPred and we peeled off an iteration because we have something like

if (i % 2) {
  ...
  if (i != 0) {...}
 ...
}

Peeling based on those nested conditional is likely to increase the code size too much compared to the benefit. D45374 only considers conditions in blocks that are executed on every iteration.

IIUC, D45374 should fix the regression in h264ref we're seeing (but I haven't tested it yet). However, I wanted to share with you my findings. Currently, loop peeling will not peel a loop if it includes a function call that is likely to be inlined (i.e., is not marked with a noinline attribute, has internal linkage and has a single use). This is exactly the case we're dealing with here except FullPelBlockMotionBiPred isn't marked as internal until the LTO phase of compilation. Thus, one possible approach would be to defer peeling until the LTO phase. After r329392, this can be accomplished with a small change to the pass manager:

diff --git a/lib/Transforms/IPO/PassManagerBuilder.cpp b/lib/Transforms/IPO/PassManagerBuilder.cpp
index 1def2e8..b06ca2a 100644
--- a/lib/Transforms/IPO/PassManagerBuilder.cpp
+++ b/lib/Transforms/IPO/PassManagerBuilder.cpp
@@ -651,7 +651,10 @@ void PassManagerBuilder::populateModulePassManager(
   addInstructionCombiningPass(MPM);
 
   if (!DisableUnrollLoops) {
-    MPM.add(createLoopUnrollPass(OptLevel));    // Unroll small loops
+    if (PrepareForLTO)
+      MPM.add(createSimpleLoopUnrollPass(OptLevel));    // Unroll small loops
+    else
+      MPM.add(createLoopUnrollPass(OptLevel));    // Unroll small loops
 
     // LoopUnroll may generate some redundency to cleanup.
     addInstructionCombiningPass(MPM);

However, I haven't run any performance tests (but I do know if fixes the h264ref regression) nor do I know if the community would be interested in such an approach.

A more targeted approach might be to only prevent peeling of the loop (rather than disable peeling entirely) during the first phase of LTO if the loop has any calls.

This is exactly the case we're dealing with here except FullPelBlockMotionBiPred isn't marked as internal until the LTO phase of compilation. Thus, one possible approach would be to defer peeling until the LTO phase. After r329392, this can be accomplished with a small change to the pass manager:

ThinLTO doesn't run vectorization or unrolling before link-time; among other reasons, it avoids problems like this. You might want to consider making non-thin LTO work the same way.

This is exactly the case we're dealing with here except FullPelBlockMotionBiPred isn't marked as internal until the LTO phase of compilation. Thus, one possible approach would be to defer peeling until the LTO phase. After r329392, this can be accomplished with a small change to the pass manager:

ThinLTO doesn't run vectorization or unrolling before link-time; among other reasons, it avoids problems like this. You might want to consider making non-thin LTO work the same way.

Thanks for the suggestion, Eli. Given this is already the behavior of ThinLTO, I'm going to investigate this further.