With this patch LLVM starts scheduling instructions based on their latency.
For top-down scheduling, prefer scheduling instructions at a lower depth:
those instructions will be ready to execute before others at a higher depth.
For bottom-up scheduling, prefer scheduling instructions at a lower height.
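The depth/height preference above can be illustrated with a small standalone sketch. This is not the code in the patch: `Node`, `computeDepths`, `computeHeights`, `scheduleTopDown` and `exampleDag` are hypothetical names, and the latencies (4 cycles for a load, 1 for a store) are assumed values chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a scheduling-DAG node; not LLVM's SUnit.
struct Node {
  unsigned latency;                // cycles until the result is available
  std::vector<std::size_t> preds;  // indices of nodes this one depends on
};

// Depth: longest latency-weighted path from any DAG root to the node.
// Assumes nodes are listed in topological order (predecessors first).
std::vector<unsigned> computeDepths(const std::vector<Node> &dag) {
  std::vector<unsigned> depth(dag.size(), 0);
  for (std::size_t i = 0; i < dag.size(); ++i)
    for (std::size_t p : dag[i].preds) {
      assert(p < i && "nodes must be in topological order");
      depth[i] = std::max(depth[i], depth[p] + dag[p].latency);
    }
  return depth;
}

// Height: longest latency-weighted path from the node to any DAG leaf.
std::vector<unsigned> computeHeights(const std::vector<Node> &dag) {
  std::vector<unsigned> height(dag.size(), 0);
  for (std::size_t i = dag.size(); i-- > 0;)
    for (std::size_t p : dag[i].preds) {
      assert(p < i && "nodes must be in topological order");
      height[p] = std::max(height[p], height[i] + dag[p].latency);
    }
  return height;
}

// Toy top-down list scheduler: among the ready nodes, pick the one with
// the lowest depth; ties go to the node that comes first in program order.
std::vector<std::size_t> scheduleTopDown(const std::vector<Node> &dag) {
  const std::vector<unsigned> depth = computeDepths(dag);
  std::vector<bool> done(dag.size(), false);
  std::vector<std::size_t> order;
  while (order.size() < dag.size()) {
    std::size_t best = dag.size();  // sentinel: no candidate yet
    for (std::size_t i = 0; i < dag.size(); ++i) {
      if (done[i])
        continue;
      const bool ready =
          std::all_of(dag[i].preds.begin(), dag[i].preds.end(),
                      [&](std::size_t p) -> bool { return done[p]; });
      if (ready && (best == dag.size() || depth[i] < depth[best]))
        best = i;
    }
    assert(best != dag.size() && "a topologically ordered DAG always has a ready node");
    done[best] = true;
    order.push_back(best);
  }
  return order;
}

// Four independent load/store pairs in program order: ld0, st0, ld1, st1, ...
// Latencies are assumptions for illustration, not taken from a real model.
std::vector<Node> exampleDag() {
  std::vector<Node> dag;
  for (int i = 0; i < 4; ++i) {
    const std::size_t load = dag.size();
    dag.push_back(Node{4, {}});      // load: assumed 4-cycle latency
    dag.push_back(Node{1, {load}});  // store: depends on the load above
  }
  return dag;
}
```

With these assumed latencies every load has depth 0 and every store has depth 4, so the toy top-down scheduler issues all four loads before any store.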
For the following testcase:

    void fun(char *restrict in, char *restrict out) {
      *in++ = *out++;
      *in++ = *out++;
      *in++ = *out++;
      *in++ = *out++;
    }

on aarch64 we used to produce this code:

    ldrb w8, [x1]
    strb w8, [x0]
    ldrb w8, [x1, #1]
    strb w8, [x0, #1]
    ldrb w8, [x1, #2]
    strb w8, [x0, #2]
    ldrb w8, [x1, #3]
    strb w8, [x0, #3]

with the patch we now produce:

    ldrb w8, [x1]
    ldrb w9, [x1, #1]
    ldrb w10, [x1, #2]
    ldrb w11, [x1, #3]
    strb w8, [x0]
    strb w9, [x0, #1]
    strb w10, [x0, #2]
    strb w11, [x0, #3]

About 600 tests are modified by this patch, mostly on the x86 side, with a few on aarch64.