The BMI instruction tzcnt performs better than bsf on newer processors.
Its encoding is the same as bsf's except for a mandatory 0xf3 (rep)
prefix. Processors without BMI ignore that prefix and execute plain bsf,
so if we always emit the rep prefix for bsf, the same code is decoded as
tzcnt and runs faster on newer processors.
GCC already does this: https://c.godbolt.org/z/6xere6fs1
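
To illustrate the encoding relationship described above, here is a minimal
sketch in C. It is not code from this patch: the helper name add_rep_prefix
and the two byte arrays are hypothetical, chosen only to show that prefixing
an encoded "bsf eax, ecx" with 0xf3 yields the bytes of "tzcnt eax, ecx".

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* bsf eax, ecx: opcode 0F BC, ModRM C1 (mod=11, reg=eax, rm=ecx). */
static const uint8_t BSF_EAX_ECX[] = {0x0F, 0xBC, 0xC1};

/* tzcnt eax, ecx: the same bytes with the mandatory F3 prefix in front. */
static const uint8_t TZCNT_EAX_ECX[] = {0xF3, 0x0F, 0xBC, 0xC1};

/*
 * Hypothetical helper (an assumption for illustration only): write the
 * rep prefix followed by the n bytes of an already encoded instruction.
 * `out` must have room for n + 1 bytes. Returns the new length.
 */
static size_t add_rep_prefix(const uint8_t *in, size_t n, uint8_t *out)
{
    out[0] = 0xF3;           /* rep prefix */
    memcpy(out + 1, in, n);  /* original instruction bytes, unchanged */
    return n + 1;
}
```

Note that this only covers the encoding. The two instructions differ when the
source operand is zero (tzcnt returns the operand width, bsf leaves the
destination undefined), so the prefix is only a safe substitute where the
result for a zero input is not relied upon.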