The BMI instruction tzcnt performs better than bsf on newer
processors. Its encoding is the bsf encoding preceded by a mandatory
0xf3 (rep) prefix. Processors without BMI ignore that prefix and
execute plain bsf, so if we always emit the rep prefix for bsf, the
same code gets the faster tzcnt when it runs on newer processors.
GCC already does this: https://c.godbolt.org/z/6xere6fs1
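
For reference, a minimal sketch of the encoding relationship this relies on. It is not the code in this patch: the helper names encodeBsf32 and encodeTzcnt32 are hypothetical, and it only covers the 32-bit register-to-register form with no REX-extended registers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative only: register-to-register `bsf r32, r32` is 0F BC /r.
// dst and src are 3-bit register numbers (eax = 0, ecx = 1, ...).
std::vector<std::uint8_t> encodeBsf32(unsigned dst, unsigned src) {
  assert(dst < 8 && src < 8 && "sketch ignores REX-extended registers");
  // ModRM with mod = 11 (register direct), reg = dst, rm = src.
  std::uint8_t modrm = static_cast<std::uint8_t>(0xC0 | (dst << 3) | src);
  return {0x0F, 0xBC, modrm};
}

// `tzcnt r32, r32` is F3 0F BC /r: the same bytes behind an F3 (rep) prefix.
std::vector<std::uint8_t> encodeTzcnt32(unsigned dst, unsigned src) {
  std::vector<std::uint8_t> bytes = encodeBsf32(dst, src);
  bytes.insert(bytes.begin(), 0xF3);
  return bytes;
}
```

With these helpers, `bsf eax, ecx` comes out as 0F BC C1 and `tzcnt eax, ecx` as F3 0F BC C1. Note that the two instructions only agree on the result for a nonzero source: bsf leaves the destination undefined for zero, while tzcnt returns the operand size, and they set flags differently.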
Fixes #34191
Add a comment to explain why we do this?