The motivation here is to reduce the stall cycles spent while a load waits for its address to become available. I am using InstCombine to fold a GEP into a select when it seems profitable. This can enable speculation of loads, which InstCombine already does. At the moment, the DAGCombiner tries to revert the folding of a load into a select.
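A rough source-level sketch of the rewrite described above may help. It is an illustration only, not the patch itself: the function names and the `p`, `a`, `b` parameters are hypothetical, and the second form is only valid when both elements are known to be dereferenceable.

```c
#include <stdint.h>

/* Before: the address is selected first, so the single load
   depends on the result of the select. */
int32_t load_after_select(const int32_t *p, int32_t a, int32_t b) {
    const int32_t *addr = (a > b) ? p + 2 : p + 1;
    return *addr;
}

/* After: both loads are issued up front and the select picks a value.
   Assumption: p[1] and p[2] are both safe to read unconditionally. */
int32_t select_after_loads(const int32_t *p, int32_t a, int32_t b) {
    int32_t hi = p[2];
    int32_t lo = p[1];
    return (a > b) ? hi : lo;
}
```

In the second form the two loads no longer depend on the comparison, which is what lets the backend pair them.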
I am planning to break this patch down into smaller pieces, but first I wanted to show what I am trying to achieve and hopefully get some feedback. Another place this optimization might fit is CodeGenPrepare. I don't think implementing it in the backend is an option, as by then it's too late to know whether we can safely speculate the loads.
The codegen test below shows the motivating example. After the change, the load doesn't have to wait for the select.
Before:
  add x8, x20, #8
  add x9, x20, #4
  cmp w0, w19
  csel x8, x8, x9, gt
  ldr w0, [x8]
After:
  ldp w8, w9, [x20, #4]
  cmp w0, w19
  csel w0, w9, w8, gt
This change improves an internal benchmark by approximately 4%.
I'm pretty sure this isn't safe, in general. The "inbounds" marker only guarantees that the arithmetic is in bounds; it doesn't make any guarantees about the type of the pointer.
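A minimal sketch of the kind of case this concern points at, under assumptions of my own: `pick` and `caller` are hypothetical names, and the two-element object is chosen so that `p + 2` is a valid one-past-the-end address that must never be loaded from.

```c
#include <stdint.h>

/* Loads only the selected element. With use_second == 0 this
   never reads p[2]. */
int32_t pick(const int32_t *p, int use_second) {
    const int32_t *addr = use_second ? p + 2 : p + 1;
    return *addr;
}

/* Hypothetical caller: the object has two elements, so p + 2 is
   one past the end. Computing that address is fine; loading from
   it is not. */
int32_t caller(void) {
    int32_t pair[2] = {7, 9};
    return pick(pair, 0); /* well defined: reads pair[1] only */
}
```

As written, the program only ever reads `pair[1]`. If both loads were speculated ahead of the select, `pair[2]` would be read as well, even though the address arithmetic itself stays in bounds.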