These expansions were rather inefficient and used more code than
necessary. This change optimizes them to produce expansions closer to
the ones GCC emits. The code size is the same (when optimizing for code
size), but for some reason LLVM reorders the blocks in a non-optimal
way. Still, this should be an improvement, with a reduction in code size
of around 0.12% (when building compiler-rt).
I made this patch to get more familiar with these inline expansions, in the hope that I can also do the other expansions inline (such as 32-bit shifts).
Note: this change doesn't decrease binary size, because LLVM duplicates the loop check (dec and br). When optimizing for size, the expansion is one instruction shorter, as in shift_i8_i8_size in shift.ll.
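
To illustrate what a duplicated loop check looks like, here is a rough C model of a loop-based 8-bit shift. It is only a sketch, not the actual AVR expansion: the function names are hypothetical, and the two loop shapes are my assumption about how a single check and a duplicated check differ. Both functions compute the same result; the second one simply contains the check twice.

```c
#include <stdint.h>

/* Hypothetical model: the loop check appears once, at the top of the loop. */
uint8_t shl8_single_check(uint8_t value, uint8_t amount) {
    while (amount != 0) {              /* the only copy of the check */
        value = (uint8_t)(value << 1); /* shift by one bit per iteration */
        amount--;
    }
    return value;
}

/* Hypothetical model: the same loop with the check duplicated, once as a
   guard before the loop and once at the bottom of the loop body. */
uint8_t shl8_duplicated_check(uint8_t value, uint8_t amount) {
    if (amount == 0)                   /* first copy of the check (guard) */
        return value;
    do {
        value = (uint8_t)(value << 1);
        amount--;
    } while (amount != 0);             /* second copy of the check */
    return value;
}
```

The duplicated form needs no jump back to a check at the top of the loop, but the extra copy of the check costs code size, which is the trade-off the note describes.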