We have some long-standing missing shuffle optimizations that could use this transform via VectorCombine now:
https://bugs.llvm.org/show_bug.cgi?id=35454
(and we still don't get that case in the backend either)
This function is apparently templated because existing IR code treats mask values as unsigned, while backend code treats mask values as signed?
The mask values are not endian-dependent IIUC.
Retaining a fast copy path for Scale == 1 would make sense, since the mask needs no rewriting in that case.
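To make the suggestion concrete, here is a rough standalone sketch of what I have in mind. It is not the code in this patch: `scaleMaskSketch` is a hypothetical name, it uses `std::vector` instead of LLVM's ADT types, and it assumes each mask element M expands to the Scale narrower elements Scale*M+0 .. Scale*M+Scale-1, with negative (undef/sentinel) elements repeated unchanged.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the function under review; not the LLVM API.
// Templated on the element type only to mirror the signed/unsigned split
// mentioned above. The sentinel check can only fire for signed T.
template <typename T>
std::vector<T> scaleMaskSketch(int Scale, const std::vector<T> &Mask) {
  assert(Scale > 0 && "Unexpected scaling factor");

  // Fast path: with Scale == 1 there is nothing to rewrite, so just copy.
  if (Scale == 1)
    return Mask;

  std::vector<T> Scaled;
  Scaled.reserve(Mask.size() * Scale);
  for (T Elt : Mask) {
    // Emit Scale consecutive narrow indices per wide index. The order is
    // the same regardless of target endianness (assumption, per the note
    // above that mask values are not endian-dependent).
    for (int I = 0; I != Scale; ++I)
      Scaled.push_back(Elt < T(0) ? Elt : static_cast<T>(Scale * Elt + I));
  }
  return Scaled;
}
```

With this sketch, a mask of {1, -1, 0} scaled by 2 becomes {2, 3, -1, -1, 0, 1}, and scaling by 1 returns the input unchanged without entering the loop.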