The triangle tail duplication heuristic can improve performance for small chains
of triangles, but it can also make poorly generated code perform worse. For
example, one of the MIPS shift lowerings generates code that looks like this:

  if x
    do something that doesn't modify x
  if !x
    do something

Requiring 3 triangles in a row avoids this case. Allowing 2 in a row is likely
overly aggressive for O2, at least for now.
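
To make the threshold concrete, here is a minimal sketch of counting a chain of
triangles and comparing it against a minimum length. It is not the code in the
block placement pass: the `Block` struct and the names `isTriangleHead`,
`countTriangleChain` and `chainQualifies` are hypothetical, and the CFG is
reduced to successor indices, ignoring probabilities and predecessor checks.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal CFG: each block lists its successors by index.
struct Block {
  std::vector<std::size_t> succs;
};

// Returns true and sets `join` if `head` is the top of a triangle: it branches
// to a side block and a join block, and the side block's only successor is
// that join block.
static bool isTriangleHead(const std::vector<Block> &cfg, std::size_t head,
                           std::size_t &join) {
  const std::vector<std::size_t> &s = cfg[head].succs;
  if (s.size() != 2)
    return false;
  for (int i = 0; i < 2; ++i) {
    std::size_t side = s[i], other = s[1 - i];
    if (cfg[side].succs.size() == 1 && cfg[side].succs[0] == other) {
      join = other;
      return true;
    }
  }
  return false;
}

// Counts consecutive triangles, where each triangle's join block is the head
// of the next one. The bound on `count` guards against cyclic CFGs.
static unsigned countTriangleChain(const std::vector<Block> &cfg,
                                   std::size_t start) {
  unsigned count = 0;
  std::size_t cur = start, join = 0;
  while (count < cfg.size() && isTriangleHead(cfg, cur, join)) {
    ++count;
    cur = join;
  }
  return count;
}

// Assumed policy check: only chains of at least `minTriangles` qualify.
static bool chainQualifies(const std::vector<Block> &cfg, std::size_t start,
                           unsigned minTriangles) {
  return countTriangleChain(cfg, start) >= minTriangles;
}
```

The "if x ... if !x ..." shape above is a chain of exactly 2 triangles, so in
this sketch it qualifies with a minimum of 2 but not with a minimum of 3.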
I have benchmark data showing this is profitable in the cases where it applies:
No significant performance changes on the LLVM test-suite. Size increases are tiny: 0.027% on the 5 affected benchmarks.
This improves an important internal Google benchmark (protocol buffer serialization) by 2%, with no significant effect on other internal benchmarks.