The issue
Consider the following two LLVM IR functions, both of which simply store a value at a fixed offset from a pointer. The only difference between them is the order of the GEP and the addrspacecast:
target triple = "nvptx64-nvidia-cuda"

define void @bad(i64) {
  %ptr = inttoptr i64 %0 to i16*
  %gep = getelementptr i16, i16* %ptr, i64 16
  %asc = addrspacecast i16* %gep to i16 addrspace(1)*
  store i16 0, i16 addrspace(1)* %asc, align 16
  ret void
}

define void @good(i64) {
  %ptr = inttoptr i64 %0 to i16*
  %asc = addrspacecast i16* %ptr to i16 addrspace(1)*
  %gep = getelementptr i16, i16 addrspace(1)* %asc, i64 16
  store i16 0, i16 addrspace(1)* %gep, align 16
  ret void
}
This gets compiled to the following PTX by the NVPTX backend:
//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 64

        // .globl       bad             // -- Begin function bad
                                        // @bad
.visible .func bad(
        .param .b64 bad_param_0
)
{
        .reg .b16       %rs<2>;
        .reg .b64       %rd<4>;

// %bb.0:
        ld.param.u64    %rd1, [bad_param_0];
        add.s64         %rd2, %rd1, 32;
        cvta.to.global.u64      %rd3, %rd2;
        mov.u16         %rs1, 0;
        st.global.u16   [%rd3], %rs1;
        ret;
                                        // -- End function
}
        // .globl       good            // -- Begin function good
.visible .func good(
        .param .b64 good_param_0
)                                       // @good
{
        .reg .b16       %rs<2>;
        .reg .b64       %rd<3>;

// %bb.0:
        ld.param.u64    %rd1, [good_param_0];
        cvta.to.global.u64      %rd2, %rd1;
        mov.u16         %rs1, 0;
        st.global.u16   [%rd2+32], %rs1;
        ret;
                                        // -- End function
}
When the GEP precedes the addrspacecast (the @bad function), the backend emits an explicit add.s64 instruction rather than folding the addition into the addressing mode of the store.
This happens because instruction selection does not look through addrspacecasts when matching addressing modes: the ADD is hidden behind the addrspacecast, so the base+offset pattern never matches.
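To make that concrete, here is a deliberately simplified sketch of a SelectionDAG base+offset matcher. The function name matchBasePlusOffset and its exact shape are assumptions for illustration, not the real NVPTX selector; the point is that it only fires when the address node itself is an ISD::ADD with a constant operand, so an intervening ISD::ADDRSPACECAST blocks the match:

// A minimal, hypothetical base+offset matcher (not the actual NVPTX code):
// it only succeeds when the address is literally add(x, constant).
#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"

using namespace llvm;

static bool matchBasePlusOffset(SelectionDAG &DAG, SDValue Addr,
                                SDValue &Base, SDValue &Offset) {
  if (Addr.getOpcode() == ISD::ADD) {
    if (auto *C = dyn_cast<ConstantSDNode>(Addr.getOperand(1).getNode())) {
      // add(x, imm): x becomes the base register, imm the immediate offset.
      Base = Addr.getOperand(0);
      Offset = DAG.getTargetConstant(C->getSExtValue(), SDLoc(Addr),
                                     Addr.getValueType());
      return true;
    }
  }
  // Anything else -- in particular addrspacecast(add(x, imm)) -- does not
  // match, so the add is emitted as a separate instruction.
  return false;
}

In @bad the store's address is addrspacecast(add(x, 32)), so a matcher of this shape never sees the ADD, and the constant 32 cannot be folded into the store.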
Proposed fix
To fix this, I check whether the address is an AddrSpaceCastSDNode and, if so, whether the node behind it is an ADD.
In that case, I transform addrspacecast(add(x, y)) into add(addrspacecast(x), y), allowing the ADD to be folded into the addressing mode of the memory operation.
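A minimal sketch of that rewrite is below, assuming it runs on the address right before addressing-mode matching. The helper name hoistAddrSpaceCastAboveAdd is hypothetical and the code is illustrative rather than the actual patch; AddrSpaceCastSDNode and SelectionDAG::getAddrSpaceCast are existing LLVM APIs:

#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"

using namespace llvm;

// Hypothetical helper: rewrite addrspacecast(add(x, c)) into
// add(addrspacecast(x), c) so the constant can later be folded into the
// addressing mode of the load/store. Returns the (possibly rewritten) address.
static SDValue hoistAddrSpaceCastAboveAdd(SelectionDAG &DAG, SDValue Addr) {
  auto *ASC = dyn_cast<AddrSpaceCastSDNode>(Addr.getNode());
  if (!ASC)
    return Addr;

  SDValue Src = ASC->getOperand(0);
  // Only handle a constant offset here: that is the case that can be folded
  // into a [reg+imm] addressing mode. Also require matching pointer widths so
  // the ADD can simply be re-created in the destination address space.
  if (Src.getOpcode() != ISD::ADD ||
      !isa<ConstantSDNode>(Src.getOperand(1).getNode()) ||
      Src.getValueType() != Addr.getValueType())
    return Addr;

  SDLoc DL(Addr);
  // Cast the base pointer first...
  SDValue NewBase =
      DAG.getAddrSpaceCast(DL, Addr.getValueType(), Src.getOperand(0),
                           ASC->getSrcAddressSpace(),
                           ASC->getDestAddressSpace());
  // ...then re-apply the constant offset on top of the casted base.
  return DAG.getNode(ISD::ADD, DL, Addr.getValueType(), NewBase,
                     Src.getOperand(1));
}

With the ADD now feeding the store directly, the base+offset matcher can fold the constant 32 into the st.global.u16 addressing mode, producing the same code for @bad as for @good.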