I am currently investigating a regression exposed by some of the changes
to the intrinsics cost modeling related to ctlz on X86.
The problem with CTLZ on X86 is that it gets lowered to either LZCNT or
BSR. On most Intel CPUs, e.g. Haswell and Skylake, those instructions
to go through a single port. Speculating them in loops can cause
substantial slow-downs (for example a 2-3x regression in some of the
Swift string search functions), especially if the branch to the ctlz is
never or rarely taken.
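To make the pattern concrete, here is a hypothetical C reduction (the function names and data are made up, not taken from the Swift code): a loop where the ctlz is guarded by a rarely-taken branch, next to the select-based form that speculation produces, where the lzcnt/bsr executes on every iteration.

```c
/* Guarded form: clz only runs when the (rare) condition holds.
   The branch guarantees x is nonzero, so __builtin_clz is defined. */
static int count_guarded(const unsigned *vals, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) {
        unsigned x = vals[i];
        if (x & 1u)                        /* rarely true on hot data */
            total += __builtin_clz(x);     /* x is odd, hence nonzero */
    }
    return total;
}

/* Speculated form: clz computed unconditionally, result chosen by a
   select. (x | 1u) keeps the input nonzero; it does not change the
   result when the condition is true, since x is already odd then. */
static int count_speculated(const unsigned *vals, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) {
        unsigned x = vals[i];
        int lz = __builtin_clz(x | 1u);    /* runs every iteration */
        total += (x & 1u) ? lz : 0;
    }
    return total;
}
```

Both compute the same result, but in the second one the single-port lzcnt/bsr is on the critical path of every iteration even when the condition almost never fires.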
Unfortunately I am not sure what the best solution to this problem is.
Outside of loops, speculating ctlz can probably still be beneficial in
some cases. In this patch, I tried to reduce the budget for speculation
if we can determine that we are in a loop. But this is quite fragile and
might be too conservative for some instructions.
Any ideas/suggestions would be greatly appreciated.
Do you have a better example, closer to the loops you're seeing
performance issues in? This one looks a bit silly since %x is
loop-invariant.