Hi All,
I would like to propose a change to turn zext+seteq+cmp into shr+lzcnt.
This optimisation is beneficial only on the Jaguar architecture, where lzcnt has good reciprocal throughput.
Other architectures such as Intel's Haswell/Broadwell or AMD's Bulldozer/Piledriver do not benefit from it.
For this reason, the change also adds a "HasFastLZCNT" feature, which is enabled for Jaguar.
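
To illustrate why the two instruction sequences can be interchanged, here is a minimal C++ sketch of the equivalence for a compare against zero. It assumes a 32-bit operand, so the shift amount is 5. The helper names (lzcnt32, is_zero_via_setcc, is_zero_via_lzcnt, self_check) are hypothetical and do not come from the patch, and lzcnt32 is a portable software stand-in for the instruction rather than an intrinsic.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for the x86 LZCNT instruction on a 32-bit operand
// (hypothetical helper). Like LZCNT, it is defined for 0 and returns 32.
static inline uint32_t lzcnt32(uint32_t x) {
    uint32_t n = 0;
    for (uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
        ++n;
    return n;
}

// Original pattern: compare against zero, set-on-equal, zero-extend.
static inline uint32_t is_zero_via_setcc(uint32_t x) {
    return static_cast<uint32_t>(x == 0);
}

// Rewritten pattern: lzcnt32 returns 32 only when x == 0, and 32 is the
// only value in [0, 32] with bit 5 set, so shifting right by 5 gives 1 or 0.
static inline uint32_t is_zero_via_lzcnt(uint32_t x) {
    return lzcnt32(x) >> 5;
}

// Spot-check that both forms agree on a few boundary values.
static inline void self_check() {
    const uint32_t samples[] = {0u, 1u, 2u, 0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu};
    for (uint32_t x : samples)
        assert(is_zero_via_setcc(x) == is_zero_via_lzcnt(x));
}
```

The rewritten form avoids a flag-setting compare followed by a set-on-condition, which is only a win when lzcnt itself is cheap; that is the motivation for gating the transform on a per-CPU feature.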
This isn't a processor/cpuid feature; please move it further down, closer to the other fast/slow characteristic features.