The trip count for a memcpy/memset will be n/16 rounded up to the nearest integer, so (n+15)>>4. The old code also included a BIC to clear one of the bits, which does not seem correct. This removes the extra BIC.
Note that ideally this would never actually be generated, as when creating a tail-predicated loop we will DCE that setup code, letting the WLSTP perform the trip count calculation. So this doesn't usually come up in testing (and apparently the ARMLowOverheadLoops pass does not do any sort of validation on the trip count). Only if the generation of the WLSTP fails will it use the incorrect BIC instructions.
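As a sanity check of the arithmetic, here is a small standalone sketch (not code from the patch) comparing (n+15)>>4 against a division-based round-up, and modelling what an extra BIC between the add and the shift would do. The function names are hypothetical, and the mask value of 16 is only an assumption for illustration, since the summary does not say which bit was cleared.

```cpp
#include <cassert>
#include <cstdint>

// Trip count for an n-byte memcpy/memset handled 16 bytes per iteration:
// n/16 rounded up, computed as (n + 15) >> 4.
// Assumes n + 15 does not overflow 32 bits.
constexpr uint32_t tripCount(uint32_t n) { return (n + 15u) >> 4; }

// Hypothetical model of an ADD, BIC, LSR sequence. The mask is an
// assumption for illustration; it is not taken from the patch.
constexpr uint32_t tripCountWithBic(uint32_t n, uint32_t mask) {
  return ((n + 15u) & ~mask) >> 4;
}

static_assert(tripCount(0) == 0, "no bytes, no iterations");
static_assert(tripCount(1) == 1, "a partial vector still needs one iteration");
static_assert(tripCount(16) == 1, "exactly one full vector");
static_assert(tripCount(17) == 2, "one full vector plus a tail");

// With an assumed mask of 16 (bit 4), a 16-byte copy would get zero iterations.
static_assert(tripCountWithBic(16, 16u) == 0, "BIC drops the only iteration");

// Check the shift form against the division-based definition for 0..limit.
inline void checkTripCount(uint32_t limit) {
  for (uint32_t n = 0; n <= limit; ++n)
    assert(tripCount(n) == n / 16 + (n % 16 != 0));
}
```

Calling checkTripCount(4096) exercises every size up to 4 KiB. Any bit cleared at position 4 or above changes the shifted result for some n, which is consistent with the BIC being wrong under this model.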
I think this might need to be &ARM::GPRlrRegClass ...
I recall ARMLowOverheadLoops::IterationCountDCE() not working for me correctly
when I had used rGPR.
But if the tests work correctly then I guess it's fine leaving it as rGPR.