PR43381 notes that while we are good at matching `(X >> C1) & C2` as BEXTR/BEXTRI,
we only do that if we either have BEXTRI (TBM),
or if BEXTR is marked as being fast (`-mattr=+fast-bextr`).
In all other cases we don't match.
But BEXTR being fast is mostly true only of AMD CPUs.
However, for all the CPUs for which we have sched models,
BZHI is always fast (or the sched models are all bad).
So if we decide that it's unprofitable to emit BEXTR/BEXTRI,
we should fall back to BZHI if it is available,
and follow up with the shift.
Admittedly, BZHI does not have an immediate form, so the bit index has to be materialized in a register,
but I think this is pretty much identical to the BMI1 BEXTR situation.
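To make the fallback concrete, here is a rough sketch in plain C of why BZHI followed by a shift computes the same value as `(X >> C1) & C2`. It assumes `C2` is a low-bit mask of `width` ones, and it reads "follow up with the shift" as BZHI first with index `C1 + width`, then a right shift by `C1`. All names (`bzhi32_model`, `extract_shift_and`, `extract_bzhi_shift`, `self_check`) are made up for illustration; this is a software model of the instruction, not the actual lowering code.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of 32-bit BZHI (illustrative, not an intrinsic):
   clears every bit at position >= idx. Only the low 8 bits of idx are
   used, and idx >= 32 leaves src unchanged. */
static uint32_t bzhi32_model(uint32_t src, uint32_t idx) {
    idx &= 0xFFu;
    if (idx >= 32u)
        return src;
    return src & ((UINT32_C(1) << idx) - 1u);
}

/* The original pattern: (x >> shift) & mask, where mask has `width`
   low bits set. Assumes shift < 32 and 1 <= width <= 32. */
static uint32_t extract_shift_and(uint32_t x, unsigned shift, unsigned width) {
    uint32_t mask = (width >= 32u) ? UINT32_MAX
                                   : ((UINT32_C(1) << width) - 1u);
    return (x >> shift) & mask;
}

/* The fallback: zero everything above bit (shift + width), then shift. */
static uint32_t extract_bzhi_shift(uint32_t x, unsigned shift, unsigned width) {
    return bzhi32_model(x, shift + width) >> shift;
}

/* Checks that both forms agree for every shift/width pair on a few inputs. */
static void self_check(void) {
    const uint32_t samples[] = {0u, 1u, 0xDEADBEEFu, 0x80000000u, UINT32_MAX};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        for (unsigned shift = 0; shift < 32; ++shift)
            for (unsigned width = 1; width <= 32; ++width)
                assert(extract_shift_and(samples[i], shift, width) ==
                       extract_bzhi_shift(samples[i], shift, width));
}
```

For example, with `x = 0xDEADBEEF`, `shift = 4`, `width = 8`, both forms give `0xEE`. Because the BZHI comes first in this ordering, its source operand is the original value rather than a shifted temporary.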
(Careful: I don't know much about Intel CPUs, so my choice of `-mcpu` may be bad here.)
Thus we manage to fold a load:
And if we don't, I think we still win some cycles: