This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Fix instruction selection for addresses in case of addrspacecasts
Abandoned · Public

Authored by thomasfaingnaert on Mar 7 2020, 1:33 PM.

Details

Reviewers
jholewinski
tra
Summary

The issue

Consider the following two LLVM IR functions, both of which simply store a value at an offset from a pointer. The only difference is the order of the GEP and the addrspacecast:

target triple = "nvptx64-nvidia-cuda"

define void @bad(i64) {
  %ptr = inttoptr i64 %0 to i16*

  %gep = getelementptr i16, i16* %ptr, i64 16
  %asc = addrspacecast i16* %gep to i16 addrspace(1)*
  store i16 0, i16 addrspace(1)* %asc, align 16

  ret void
}

define void @good(i64) {
  %ptr = inttoptr i64 %0 to i16*

  %asc = addrspacecast i16* %ptr to i16 addrspace(1)*
  %gep = getelementptr i16, i16 addrspace(1)* %asc, i64 16
  store i16 0, i16 addrspace(1)* %gep, align 16

  ret void
}

This gets compiled to the following PTX by the NVPTX backend:

//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 64

	// .globl	bad             // -- Begin function bad
                                        // @bad
.visible .func bad(
	.param .b64 bad_param_0
)
{
	.reg .b16 	%rs<2>;
	.reg .b64 	%rd<4>;

// %bb.0:
	ld.param.u64 	%rd1, [bad_param_0];
	add.s64 	%rd2, %rd1, 32;
	cvta.to.global.u64 	%rd3, %rd2;
	mov.u16 	%rs1, 0;
	st.global.u16 	[%rd3], %rs1;
	ret;
                                        // -- End function
}
	// .globl	good            // -- Begin function good
.visible .func good(
	.param .b64 good_param_0
)                                       // @good
{
	.reg .b16 	%rs<2>;
	.reg .b64 	%rd<3>;

// %bb.0:
	ld.param.u64 	%rd1, [good_param_0];
	cvta.to.global.u64 	%rd2, %rd1;
	mov.u16 	%rs1, 0;
	st.global.u16 	[%rd2+32], %rs1;
	ret;
                                        // -- End function
}

In the case where the GEP precedes the addrspacecast (the bad function), the backend emits an explicit add instruction rather than folding the addition into the addressing mode of the store.
This is because instruction selection doesn't take possible addrspacecasts into account when matching addresses.

Proposed fix

To fix this, I check whether the address is an AddrSpaceCastSDNode and, if so, whether there is an ADD node behind it.
In that case, I transform addrspacecast(add(x, y)) into add(addrspacecast(x), y), allowing the ADD to be folded into the addressing mode of the memory operation.
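
As a rough illustration of the idea, here is a minimal sketch at the SelectionDAG level (the helper name is hypothetical and this is simplified from the actual diff; in particular, it does not check that the offset is a constant that fits the addressing mode, and it assumes source and destination pointers have the same width, as on nvptx64):

#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"

using namespace llvm;

// Sketch: peel an ADD out from under an ADDRSPACECAST so that address
// selection can later fold the offset into the memory operand.
static SDValue rewriteASCOfAdd(SelectionDAG &DAG, SDValue Addr) {
  if (Addr.getOpcode() != ISD::ADDRSPACECAST)
    return Addr;
  auto *ASC = cast<AddrSpaceCastSDNode>(Addr.getNode());

  SDValue Src = ASC->getOperand(0);
  if (Src.getOpcode() != ISD::ADD)
    return Addr;

  SDLoc DL(Addr);
  EVT VT = Addr.getValueType();
  // addrspacecast(add(x, y)) -> add(addrspacecast(x), y): cast only the
  // base pointer and keep the offset outside the cast.
  SDValue NewBase = DAG.getAddrSpaceCast(DL, VT, Src.getOperand(0),
                                         ASC->getSrcAddressSpace(),
                                         ASC->getDestAddressSpace());
  return DAG.getNode(ISD::ADD, DL, VT, NewBase, Src.getOperand(1));
}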

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · Mar 7 2020, 1:33 PM
Herald added a subscriber: hiraditya.

Addressed clang-tidy warnings.

tra requested changes to this revision. Mar 9 2020, 10:21 AM
tra added a subscriber: tra.

While such reordering may be beneficial, in general GEP(ASC(x)) is not always equivalent to ASC(GEP(x)), so we can't just blindly do it. I believe this has been discussed a few times in the past, though I can't find the relevant emails now.

According to the PTX spec for the cvta instruction, which we use for ASC, the resulting address is undefined in cases where the generic address does not fall within the address window of the specified state space.
So if x points to the last element of the target address space and the GEP adds a non-zero offset, the behavior of GEP(ASC(x)) will be defined, but that of ASC(GEP(x)) will not.

There may also be other interesting issues on the boundaries of address spaces. I.e., if AS10 and AS11 are adjacent in the generic address space and x, again, points at AS10's last element in the generic AS, then GEP(ASC(x)) will result in an out-of-bounds access in AS10, but ASC(GEP(x)) will point to a valid address in AS11. AFAICT, PTX does not give any guarantees regarding the generic address space layout, so we can't rule out this case, as we have no way to check the bounds.
Similar issues affect specific-to-generic ASCs, too. E.g., GEP(ASC(x) from AS10 to generic) will give us a valid generic pointer, while ASC(GEP(x)) will be undefined.

Another point is that if this optimization is worth doing (assuming we can figure out how to do it safely), then doing it in IR, instead of pattern-matching during lowering, may be a better way. We already have https://llvm.org/doxygen/SeparateConstOffsetFromGEP_8cpp_source.html, which does a similar optimization, though it does not handle addrspacecast at the moment.
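
For illustration only, a minimal sketch of what such an IR-level rewrite could look like (the helper name is hypothetical, it is not part of SeparateConstOffsetFromGEP, and it deliberately sidesteps the legality concerns above by leaving that check to the caller):

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Sketch: rewrite addrspacecast(gep(p, idx)) into gep(addrspacecast(p), idx)
// so the GEP's offset stays visible to address-mode folding. Handles only
// the simple case where the cast does not change the pointee type, and
// assumes the caller has verified the swap is legal for these address spaces.
static bool hoistCastAboveGEP(AddrSpaceCastInst &ASC) {
  auto *GEP = dyn_cast<GetElementPtrInst>(ASC.getPointerOperand());
  if (!GEP || !GEP->hasOneUse())
    return false;

  unsigned DestAS = ASC.getDestAddressSpace();
  // Bail out if the rebuilt GEP would not have the original result type.
  if (ASC.getType() != PointerType::get(GEP->getResultElementType(), DestAS))
    return false;

  IRBuilder<> B(&ASC);
  // Cast only the GEP's base pointer into the destination address space...
  Value *NewBase = B.CreateAddrSpaceCast(
      GEP->getPointerOperand(),
      PointerType::get(GEP->getSourceElementType(), DestAS));

  // ...and redo the GEP on the casted base with the original indices.
  SmallVector<Value *, 4> Indices(GEP->idx_begin(), GEP->idx_end());
  Value *NewGEP = B.CreateGEP(GEP->getSourceElementType(), NewBase, Indices);

  ASC.replaceAllUsesWith(NewGEP);
  ASC.eraseFromParent();
  if (GEP->use_empty())
    GEP->eraseFromParent();
  return true;
}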

This revision now requires changes to proceed. Mar 9 2020, 10:21 AM
In D75817#1912891, @tra wrote:

While such reordering may be beneficial, in general GEP(ASC(x)) is not always equivalent to ASC(GEP(x)), so we can't just blindly do it. I believe this has been discussed a few times in the past, though I can't find the relevant emails now.

I actually stumbled upon this issue because opt reordered the GEP and ASC, i.e. the input above gets converted to the following using opt -O3:

; ModuleID = 'tmp.ll'
source_filename = "tmp.ll"
target triple = "nvptx64-nvidia-cuda"

; Function Attrs: nofree norecurse nounwind writeonly
define void @bad(i64) local_unnamed_addr #0 {
  %ptr = inttoptr i64 %0 to i16*
  %gep = getelementptr i16, i16* %ptr, i64 16
  %asc = addrspacecast i16* %gep to i16 addrspace(1)*
  store i16 0, i16 addrspace(1)* %asc, align 16
  ret void
}

; Function Attrs: nofree norecurse nounwind writeonly
define void @good(i64) local_unnamed_addr #0 {
  %ptr = inttoptr i64 %0 to i16*
  %gep1 = getelementptr i16, i16* %ptr, i64 16
  %gep = addrspacecast i16* %gep1 to i16 addrspace(1)*
  store i16 0, i16 addrspace(1)* %gep, align 16
  ret void
}

attributes #0 = { nofree norecurse nounwind writeonly }

As you can see, the good function was changed so that the GEP precedes the ASC.

I assumed this was because ASC after GEP is the canonical form, and opt reordered these to make life easier for the backends.
But I suppose that's not the case, and my real issue is that opt shouldn't have reordered the GEP and ASC in the first place?

I did some more testing, and it appears that the InstCombine pass (opt -instcombine) is responsible for this.
Maybe it would be better to ensure InstCombine doesn't reorder these, instead of fixing this in instruction selection as I do now?
I'm not sure if that would influence other backends that rely on this reordering, though.

tra added a comment. Mar 10 2020, 8:50 AM

I did some more testing, and it appears that the InstCombine pass (opt -instcombine) is responsible for this.
Maybe it would be better to ensure InstCombine doesn't reorder these, instead of fixing this in instruction selection as I do now?
I'm not sure if that would influence other backends that rely on this reordering, though.

It is possible that reordering is OK for some combinations of address spaces on some platforms, but I doubt it's universal, which suggests that InstCombine should not be doing that.
I may be missing something here. It may be worth asking on llvm-dev@ whether InstCombine does the right thing regarding such GEP/ASC reordering.

thomasfaingnaert abandoned this revision. Mar 24 2020, 5:55 AM

I thought about this some more, and I don't think this is the best solution.
Besides, in most cases (such as the simple example posted above), the CUDA driver seems to do a pretty good job of folding away these extra additions anyway.