This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Target/NVPTX/
-
Target/
-
NVPTX/
-
NVPTXISelDAGToDAG.cpp
-
test/CodeGen/NVPTX/
-
CodeGen/
-
NVPTX/
-
addrspace-offsets.ll

Differential D75817

[NVPTX] Fix instruction selection for addresses in case of addrspacecasts
AbandonedPublic

Authored by thomasfaingnaert on Mar 7 2020, 1:33 PM.

Download Raw Diff

Details

Reviewers

jholewinski
tra

Summary

The issue

Consider the two following LLVM IR functions, both of which just store a value at an offset from a pointer. The only difference is the order of the GEP and addrspacecast:

target triple = "nvptx64-nvidia-cuda"

define void @bad(i64) {
  %ptr = inttoptr i64 %0 to i16*

  %gep = getelementptr i16, i16* %ptr, i64 16
  %asc = addrspacecast i16* %gep to i16 addrspace(1)*
  store i16 0, i16 addrspace(1)* %asc, align 16

  ret void
}

define void @good(i64) {
  %ptr = inttoptr i64 %0 to i16*

  %asc = addrspacecast i16* %ptr to i16 addrspace(1)*
  %gep = getelementptr i16, i16 addrspace(1)* %asc, i64 16
  store i16 0, i16 addrspace(1)* %gep, align 16

  ret void
}

This gets compiled to the following PTX by the NVPTX backend:

//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 64

	// .globl	bad             // -- Begin function bad
                                        // @bad
.visible .func bad(
	.param .b64 bad_param_0
)
{
	.reg .b16 	%rs<2>;
	.reg .b64 	%rd<4>;

// %bb.0:
	ld.param.u64 	%rd1, [bad_param_0];
	add.s64 	%rd2, %rd1, 32;
	cvta.to.global.u64 	%rd3, %rd2;
	mov.u16 	%rs1, 0;
	st.global.u16 	[%rd3], %rs1;
	ret;
                                        // -- End function
}
	// .globl	good            // -- Begin function good
.visible .func good(
	.param .b64 good_param_0
)                                       // @good
{
	.reg .b16 	%rs<2>;
	.reg .b64 	%rd<3>;

// %bb.0:
	ld.param.u64 	%rd1, [good_param_0];
	cvta.to.global.u64 	%rd2, %rd1;
	mov.u16 	%rs1, 0;
	st.global.u16 	[%rd2+32], %rs1;
	ret;
                                        // -- End function
}

In the case where the GEP precedes the addrspacecast (the bad function), the backend emits an explicit ADD instruction, rather than moving the addition to the addressing mode of the store.
This is because instruction selection doesn't take into account possible addrspacecasts.

Proposed fix

To fix this, I check if the address is an AddrSpaceCastSDNode, and if so, if there's an ADD node behind it.
In that case, I transform addrspacecast(add(x, y)) to add(addrspacecast(x), y), allowing the ADD to be fused in the memory operation.

Diff Detail

Event Timeline

thomasfaingnaert created this revision.Mar 7 2020, 1:33 PM

Herald added a project: Restricted Project. · View Herald TranscriptMar 7 2020, 1:33 PM

Herald added a subscriber: hiraditya. · View Herald Transcript

Harbormaster failed remote builds in B48479: Diff 248964!Mar 7 2020, 2:05 PM

Addressed clang-tidy warnings.

While such reordering may be beneficial, in general, GEP(ASC(x)) is not always equivalent of ASC(GEP(x)), so we can't just blindly do it. I believe this has been discussed few times in the past, though I can't find relevant emails now.

According to PTX spec for cvta instruction which we use for ASC, resulting address is undefined in cases where the generic address does not fall within the address window of the specified state space.
So, if x points to the last element of the target address space, and GEP adds non-zero offset, then behavior of GEP(ASC(x)) will be defined, but ASC(GEP(x)) will be not.

There may also be other interesting issues on the boundaries of address spaces. I.e. if AS10 and AS11 are adjacent in generic address space and x, again, points at AS10's last element in generic AS, then GET(ASC(x)) will result in out-of-bounds access in AS10, but ASC(GEP(x)) will point to a valid address in AS11. AFAICT, PTX does not give any guarantees regarding generic address space layout, so we can't rule out this case as we don't have a way to check the bounds.
Similar issues will affect specific-to-generic ASC, too. E.g. GEP(ASC(x) from AS10 to generic) will give us a valid generic pointer, while ASC(GEP(x)) will be undefined.

Another point is that if this optimization is worth doing (assuming we can figure out how to do it safely), then doing it in IR, instead of pattern-matching it during lowering may be a better way. We already have https://llvm.org/doxygen/SeparateConstOffsetFromGEP_8cpp_source.html which does similar optimization, though it does not handle addrspacecast at the moment.

This revision now requires changes to proceed.Mar 9 2020, 10:21 AM

In D75817#1912891, @tra wrote:

While such reordering may be beneficial, in general, GEP(ASC(x)) is not always equivalent of ASC(GEP(x)), so we can't just blindly do it. I believe this has been discussed few times in the past, though I can't find relevant emails now.

I actually stumbled upon this issue because opt reordered the GEP and ASC, i.e. the input above gets converted to the following using opt -O3:

; ModuleID = 'tmp.ll'
source_filename = "tmp.ll"
target triple = "nvptx64-nvidia-cuda"

; Function Attrs: nofree norecurse nounwind writeonly
define void @bad(i64) local_unnamed_addr #0 {
  %ptr = inttoptr i64 %0 to i16*
  %gep = getelementptr i16, i16* %ptr, i64 16
  %asc = addrspacecast i16* %gep to i16 addrspace(1)*
  store i16 0, i16 addrspace(1)* %asc, align 16
  ret void
}

; Function Attrs: nofree norecurse nounwind writeonly
define void @good(i64) local_unnamed_addr #0 {
  %ptr = inttoptr i64 %0 to i16*
  %gep1 = getelementptr i16, i16* %ptr, i64 16
  %gep = addrspacecast i16* %gep1 to i16 addrspace(1)*
  store i16 0, i16 addrspace(1)* %gep, align 16
  ret void
}

attributes #0 = { nofree norecurse nounwind writeonly }

As you can see, the good function was changed so that the GEP precedes the ASC.

I assumed this to be because ASC after GEP is the canonical form, and opt reordered these to make the life of the backends easier.
But I suppose that's not the case, and my real issue is that opt shouldn't have reordered GEP and ASC in the first place?

I did some more testing, and it appears that that the InstCombine pass (opt -instcombine) is responsible for this.
Maybe it would be better to ensure InstCombine doesn't reorder instead of fixing this in instruction selection as I do now?
I'm not sure if that would influence other backends that rely on this reordering, though.

In D75817#1913351, @thomasfaingnaert wrote:

I did some more testing, and it appears that that the InstCombine pass (opt -instcombine) is responsible for this.
Maybe it would be better to ensure InstCombine doesn't reorder instead of fixing this in instruction selection as I do now?
I'm not sure if that would influence other backends that rely on this reordering, though.

It is possible that reordering is OK for some combination of address spaces on some platforms, but I doubt it's universal, which suggests that InstCombine should not be doing that.
I may be missing something here. It may be worth asking on llvm-dev@ whether InstCombine does the right thing regarding such GEP/ASC reordering.

I thought some more on this, and I don't think this is the best solution.
Besides, in most cases (such as the simple example posted above), the CUDA driver seems to do a pretty good job of folding away these extra additions anyway.

Revision Contents

Path

Size

llvm/

lib/

Target/

NVPTX/

NVPTXISelDAGToDAG.cpp

31 lines

test/

CodeGen/

NVPTX/

addrspace-offsets.ll

29 lines

Diff 248965

llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp

Show First 20 Lines • Show All 3,651 Lines • ▼ Show 20 Lines	if (ConstantSDNode *CN = dyn_cast<ConstantSDNode>(Addr.getOperand(1))) {
Base = CurDAG->getTargetFrameIndex(FIN->getIndex(), mvt);		Base = CurDAG->getTargetFrameIndex(FIN->getIndex(), mvt);
else		else
Base = Addr.getOperand(0);		Base = Addr.getOperand(0);
Offset = CurDAG->getTargetConstant(CN->getZExtValue(), SDLoc(OpNode),		Offset = CurDAG->getTargetConstant(CN->getZExtValue(), SDLoc(OpNode),
mvt);		mvt);
return true;		return true;
}		}
}		}

		// Handle addrspacecast nodes
		if (AddrSpaceCastSDNode *ASCN = dyn_cast<AddrSpaceCastSDNode>(Addr)) {
		// Check if there's an add behind the addrspacecast
		SDNode *ADDN = dyn_cast<SDNode>(ASCN->getOperand(0));

		if (ADDN && (ADDN->getOpcode() == ISD::ADD)) {
		// Yes, so create new addrspacecast and add nodes.
		// We transform addrspacecast(add(x, y)) to add(addrspacecast(x), y).
		SDValue NewASCN = CurDAG->getAddrSpaceCast(
		SDLoc(ASCN), ASCN->getValueType(0), ADDN->getOperand(0),
		ASCN->getSrcAddressSpace(), ASCN->getDestAddressSpace());
		SDValue NewADDN =
		CurDAG->getNode(ISD::ADD, SDLoc(ADDN), ADDN->getValueType(0),
		NewASCN, ADDN->getOperand(1));

		// Replace the old addrspacecast by the new add, effectively swapping the
		// order of the addrspacecast and add.
		ReplaceUses(Addr, NewADDN);

		// Instruction selection is not called for this new addrspacecast node, so
		// call it manually.
		SelectAddrSpaceCast(cast<SDNode>(NewASCN));

		// Finally, run instruction selection for the address on these new nodes.
		// This will trigger the add case above, and move the addition to the
		// addressing mode for the memory operation.
		return SelectADDRri_imp(OpNode, NewADDN, Base, Offset, mvt);
		}
		}

return false;		return false;
}		}

// register+offset		// register+offset
bool NVPTXDAGToDAGISel::SelectADDRri(SDNode *OpNode, SDValue Addr,		bool NVPTXDAGToDAGISel::SelectADDRri(SDNode *OpNode, SDValue Addr,
SDValue &Base, SDValue &Offset) {		SDValue &Base, SDValue &Offset) {
return SelectADDRri_imp(OpNode, Addr, Base, Offset, MVT::i32);		return SelectADDRri_imp(OpNode, Addr, Base, Offset, MVT::i32);
}		}
▲ Show 20 Lines • Show All 99 Lines • Show Last 20 Lines

llvm/test/CodeGen/NVPTX/addrspace-offsets.ll

This file was added.

				; RUN: llc < %s \| FileCheck %s

				target triple = "nvptx64-nvidia-cuda"

				define void @addrspacecast_offset(i64) {
				%ptr = inttoptr i64 %0 to i16*

				; CHECK: st.global.u16 [%rd{{[0-9]+}}], %rs{{[0-9]+}}
				%gep1 = getelementptr i16, i16* %ptr, i64 0
				%asc1 = addrspacecast i16* %gep1 to i16 addrspace(1)*
				store i16 0, i16 addrspace(1)* %asc1, align 16

				; CHECK: st.global.u16 [%rd{{[0-9]+}}+32], %rs{{[0-9]+}}
				%gep2 = getelementptr i16, i16* %ptr, i64 16
				%asc2 = addrspacecast i16* %gep2 to i16 addrspace(1)*
				store i16 0, i16 addrspace(1)* %asc2, align 16

				; CHECK: st.global.u16 [%rd{{[0-9]+}}+64], %rs{{[0-9]+}}
				%gep3 = getelementptr i16, i16* %ptr, i64 32
				%asc3 = addrspacecast i16* %gep3 to i16 addrspace(1)*
				store i16 0, i16 addrspace(1)* %asc3, align 16

				; CHECK: st.global.u16 [%rd{{[0-9]+}}+96], %rs{{[0-9]+}}
				%gep4 = getelementptr i16, i16* %ptr, i64 48
				%asc4 = addrspacecast i16* %gep4 to i16 addrspace(1)*
				store i16 0, i16 addrspace(1)* %asc4, align 16

				ret void
				}