This is an archive of the discontinued LLVM Phabricator instance.

[Statepoint Lowering] Allow dead gc pointer from deopt section to be on register.
Abandoned · Public

Authored by skatkov on Feb 19 2021, 9:08 PM.

Details

Reviewers

reames
dantrushin

Summary

Currently, if a gc value used in the deopt section is not mentioned in the gc section,
its lowering requires a spill.

The patch introduces an option to allow lowering such a value in a register.
The option is off by default, so current behavior is unchanged.
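
To make the summary concrete, here is a minimal, self-contained sketch of the spill decision for deopt operands. It is a toy model, not the LLVM code: `ToyLowering`, `deoptValuesNeedingSpill`, and the string-named values are hypothetical, and the real lowering also consults `LiveInDeopt` and `canPassGCPtrOnVReg`, which are left out here.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for the lowering state; these are not LLVM types.
struct ToyLowering {
  std::set<std::string> LowerAsVReg; // gc-section pointers already given vregs
  std::set<std::string> GCPointers;  // values whose type is a GC-managed pointer
};

// Returns the deopt operands that would still need a stack slot.
// UseRegsForDeadGCDeopt models the option this patch adds;
// UseRegsForDeopt models the existing option for non-pointer deopt values.
std::vector<std::string>
deoptValuesNeedingSpill(const ToyLowering &L,
                        const std::vector<std::string> &DeoptState,
                        bool UseRegsForDeadGCDeopt, bool UseRegsForDeopt) {
  // Start from the pointers the gc section already placed in registers.
  std::set<std::string> DeoptInVReg = L.LowerAsVReg;
  std::vector<std::string> NeedSpill;
  for (const std::string &V : DeoptState) {
    const bool IsGCValue = L.GCPointers.count(V) != 0;
    // With the option on, a gc pointer seen only in the deopt list may
    // also live in a register (the real patch applies further checks).
    if (UseRegsForDeadGCDeopt && IsGCValue)
      DeoptInVReg.insert(V);
    const bool Spill =
        IsGCValue ? DeoptInVReg.count(V) == 0 : !UseRegsForDeopt;
    if (Spill)
      NeedSpill.push_back(V);
  }
  assert(NeedSpill.size() <= DeoptState.size());
  return NeedSpill;
}
```

With the option off, a gc pointer that appears only in the deopt list ends up in the returned spill list; with it on, only non-pointer deopt values remain there (unless the non-pointer option is also enabled).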

Event Timeline

skatkov created this revision. · Feb 19 2021, 9:08 PM

Herald added subscribers: pengfei, hiraditya. · Feb 19 2021, 9:08 PM

skatkov requested review of this revision. · Feb 19 2021, 9:08 PM

Herald added a project: Restricted Project. · Feb 19 2021, 9:08 PM

llvm/lib/CodeGen/SelectionDAG/StatepointLowering.cpp
614	It seems you don't need a map; a set would suffice here, which would greatly simplify the code.
618	You don't need this. Deopt pointers are not relocated, so they need not be limited.

OK, if we do not need to limit deopt gc values, then I have decided to go back to almost what you reverted.

Fix default value.

Fixed silly bug.

I don't believe this is safe to land as is, at least without some careful framing and documenting of assumptions.

The problem I see is that gc values are expected to have base pointers. A gc pointer which appears in the deopt list, but not in the gc-value list doesn't have an associated base. It might happen to be the base itself, but it also might not. Now, it may be that in some particular use case, we know any deopt value must be a base, but we need a way to encode that fact if so.

Taking a step back, how does this case arise? If this is truly dead code, we might be better off lowering as a poison constant (or simply dropping it).

This revision now requires changes to proceed. · Feb 22 2021, 11:15 AM

In reply to D97108#2579479 (@reames, quoted in full above):

Hello Philip, thank you for your comment.
This corresponds to the case where a gc pointer is live in the deopt bundle while the callee runs and dead after the return.
The typical case is the @llvm.experimental.deoptimize intrinsic: all gc values in its deopt bundle are dead after the call.
RS4GC will also list any gc value mentioned in the deopt bundle in the gc bundle. Moreover, it will generate a gc.relocate instruction for it.
However, that gc.relocate instruction will be removed by any DCE because it has no uses (all code after the deopt is dead) and is trivially dead (it is readonly).
After all gc.relocates are eliminated, InstCombine is able to remove this gc pointer from the gc section.

As a result, we arrive at a deopt gc value that is not listed in the gc section.
I could fix InstCombine to preserve the value in the gc section, but that would not help in terms of base/derived, because the only place where base/derived info is stored is gc.relocate.
If we really want to keep this information, we should preserve gc.relocates for values listed in the deopt bundle or find another place to track the property.
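
The sequence described above can be sketched as a toy model. Everything here is hypothetical (`ToyStatepoint`, `ToyRelocate`, and the two helper passes are stand-ins, not LLVM APIs); the sketch only mimics the order of events: dead relocates are dropped first, then gc-live entries that no remaining relocate refers to are pruned.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins; these are not LLVM classes.
struct ToyRelocate {
  std::string Ptr; // the gc-live value this relocate refers to
  int NumUses;     // uses of the relocated value after the call
};

struct ToyStatepoint {
  std::vector<std::string> Deopt;     // values in the deopt bundle
  std::vector<std::string> GCLive;    // values in the gc section
  std::vector<ToyRelocate> Relocates; // relocates hanging off the statepoint
};

// Step 1 (DCE-like): a relocate with no uses is trivially dead.
void removeDeadRelocates(ToyStatepoint &SP) {
  auto &R = SP.Relocates;
  R.erase(std::remove_if(R.begin(), R.end(),
                         [](const ToyRelocate &X) { return X.NumUses == 0; }),
          R.end());
}

// Step 2 (InstCombine-like): drop gc-live entries no relocate refers to.
void pruneUnrelocatedGCLive(ToyStatepoint &SP) {
  auto &G = SP.GCLive;
  G.erase(std::remove_if(G.begin(), G.end(),
                         [&SP](const std::string &V) {
                           return std::none_of(
                               SP.Relocates.begin(), SP.Relocates.end(),
                               [&V](const ToyRelocate &X) { return X.Ptr == V; });
                         }),
          G.end());
}

// Deopt values that are no longer mentioned in the gc section.
std::vector<std::string> deoptOnlyValues(const ToyStatepoint &SP) {
  std::vector<std::string> Out;
  for (const std::string &V : SP.Deopt)
    if (std::find(SP.GCLive.begin(), SP.GCLive.end(), V) == SP.GCLive.end())
      Out.push_back(V);
  assert(Out.size() <= SP.Deopt.size());
  return Out;
}
```

Running both steps on a statepoint whose deopt value has an unused relocate leaves that value in the deopt list but not in the gc section, which is the situation this revision tries to lower.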

Serguei,

From your comment, it sounds like we're already assuming values in the deopt list are base pointers fairly widely. (e.g. even the current stack lowering makes this assumption) Your explanation matches my vague memory, so I don't have any reason to question that.

With that in mind, I think it's okay to continue with that assumption, but I'd like to make that assumption as explicit as we can manage. (And document it explicitly!)

Glancing at the code, I want to suggest an alternate approach. I won't require this, but I'm curious to hear what you think.

At the MARKed location, what if we inserted something along the lines of the following:

// If we find a deopt value which isn't explicitly added, we need to
// ensure it gets lowered such that gc cycles occurring before the
// deoptimization event during the lifetime of the call don't invalidate
// the pointer we're deopting with.  Note that we assume that all
// pointers passed to deopt are base pointers; relaxing that assumption
// would require relatively large changes to how we represent relocations.
for (Value *V : I.deopt_operands()) {
  if (!isGCValue(V)) continue;
  if (Seen.insert(V).second) {
    SI.Bases.push_back(V);
    SI.Ptrs.push_back(V);
  }
}

I think this has the same effect, but is much more explicit about what is going on.

Your take?
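
For readers without the LLVM tree at hand, the suggestion above can be tried out as a standalone toy. This is only a sketch under assumptions: `ToyStatepointInfo`, `addRelocate`, and `addDeoptGCValues` are hypothetical names, values are plain strings, and the "every deopt pointer is its own base" assumption is baked in exactly as the comment in the suggested snippet states.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy version of the Bases/Ptrs lists; not the LLVM StatepointLoweringInfo.
struct ToyStatepointInfo {
  std::vector<std::string> Bases;
  std::vector<std::string> Ptrs;
};

// Records a (base, derived) pair from a relocate, deduplicated on the
// derived pointer.
void addRelocate(ToyStatepointInfo &SI, std::set<std::string> &Seen,
                 const std::string &Base, const std::string &Derived) {
  if (Seen.insert(Derived).second) {
    SI.Bases.push_back(Base);
    SI.Ptrs.push_back(Derived);
  }
}

// Adds gc pointers that appear only in the deopt list. Assumption: each
// such pointer is a base pointer, so it is recorded as its own base.
void addDeoptGCValues(ToyStatepointInfo &SI, std::set<std::string> &Seen,
                      const std::vector<std::string> &DeoptOperands,
                      const std::set<std::string> &GCPointers) {
  for (const std::string &V : DeoptOperands) {
    if (GCPointers.count(V) == 0)
      continue; // not a gc value, nothing to relocate
    if (Seen.insert(V).second) {
      SI.Bases.push_back(V);
      SI.Ptrs.push_back(V);
    }
  }
  assert(SI.Bases.size() == SI.Ptrs.size());
}
```

A value that was already added through a relocate is skipped, so the deopt list never introduces duplicates into the two lists.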

llvm/lib/CodeGen/SelectionDAG/StatepointLowering.cpp
1067	MARK (see overall comment)

skatkov mentioned this in D97554: [Statepoint Lowering] Consider dead deopt gc values together with other gc values. · Feb 26 2021, 6:36 AM

Philip, I've published an alternative patch, https://reviews.llvm.org/D97554, with your idea to show the impact.
From my point of view it looks better in that we explicitly process deopt gc pointers like other gc values.
The disadvantage is that we will now have a lot of dead tied-def registers, and the deopt value will be listed twice in the uses.

So, generally, the approach you proposed is a bit cleaner to understand but costs a bit more in terms of compile time.

In D97108#2590224, @skatkov wrote:

From my point of view it looks better in that we explicitly process deopt gc pointers like other gc values.
The disadvantage is that we will now have a lot of dead tied-def registers, and the deopt value will be listed twice in the uses.

This comment confused me greatly, and caused me to take a closer look at the current patch. After doing so, I'm glad I did! This version is unsound.

For the gc pointer to remain valid until deoptimization, we must update it during any GC cycles during the lifetime of the call and before the deopt point. To do that, we need to model the call as modifying the pointer. Otherwise, the register allocator is free to place it in a read only location (such as an argument slot) or combine the updated and non-updated copies of value into a single register. (Consider the case where a null check on the original unrelocated value was reordered after the safepoint.) The fact this one doesn't cause a tied def is a symptom of a bug, not a code quality improvement.
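
The soundness argument can be illustrated with a deliberately simplified model of a moving collector. This is a sketch under stated assumptions, not how LLVM or any real runtime represents things: `ToyHeap` and its methods are hypothetical, addresses are plain integers, and "modeling the call as modifying the pointer" is represented by passing the pointer's slot to the collector as a root.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy moving heap; not a real GC.
struct ToyHeap {
  std::map<int, int> Objects; // address -> payload

  bool isValid(int Addr) const { return Objects.count(Addr) != 0; }

  // Simulates a collection that happens during the call: every object is
  // shifted by Delta, and only the root slots the collector was told about
  // are rewritten to the new addresses.
  void collect(int Delta, const std::vector<int *> &Roots) {
    assert(Delta != 0 && "a moving collection must actually move objects");
    std::map<int, int> Moved;
    for (const auto &KV : Objects)
      Moved[KV.first + Delta] = KV.second;
    Objects.swap(Moved);
    for (int *Root : Roots)
      *Root += Delta;
  }
};
```

A pointer whose slot is reported to `collect` still refers to the object afterwards; a copy that was not reported keeps the old address and is stale. That stale copy corresponds to a deopt value the call is not modeled as modifying.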

In reply to D97108#2590570 (@reames, quoted in full above):

I agree that the alternative solution is better in terms of probability, but here we need to fix up caller-saved registers anyway.
In the alternative approach we end up with something like

tied-def %1 = statepoint (deopt %1) (gc tied-use %1)

and in theory nothing prevents the RA from allocating different registers for the use in gc and the use in deopt. It is unlikely, but possible.
This is just a comment.

in favor of https://reviews.llvm.org/D97554

Revision Contents

Path                                                      Size
llvm/lib/CodeGen/SelectionDAG/StatepointLowering.cpp      26 lines
llvm/test/CodeGen/X86/statepoint-vreg-dead-gc-deopt.ll    40 lines
llvm/test/CodeGen/X86/statepoint-vreg-details.ll          4 lines

Diff 325149

llvm/lib/CodeGen/SelectionDAG/StatepointLowering.cpp

(58 lines hidden)
 STATISTIC(NumOfStatepoints, "Number of statepoint nodes encountered");
 STATISTIC(StatepointMaxSlotsRequired,
           "Maximum number of stack slots required for a singe statepoint");

 cl::opt<bool> UseRegistersForDeoptValues(
     "use-registers-for-deopt-values", cl::Hidden, cl::init(false),
     cl::desc("Allow using registers for non pointer deopt args"));

+cl::opt<bool> UseRegistersForDeadGCDeoptValues(
+    "use-registers-for-dead-gc-deopt-values", cl::Hidden, cl::init(true),
+    cl::desc("Use registers for pointer deopt args missing in gc section"));
+
 cl::opt<bool> UseRegistersForGCPointersInLandingPad(
     "use-registers-for-gc-values-in-landing-pad", cl::Hidden, cl::init(false),
     cl::desc("Allow using registers for gc pointer in landing pad"));

 cl::opt<unsigned> MaxRegistersForGCPointers(
     "max-registers-for-gc-values", cl::Hidden, cl::init(0),
     cl::desc("Max number of VRegs allowed to pass GC pointer meta args in"));

(527 lines hidden; context: #endif)
   // Process derived pointers first to give them more chance to go on VReg.
   for (const Value *V : SI.Ptrs)
     processGCPtr(V);
   for (const Value *V : SI.Bases)
     processGCPtr(V);

   LLVM_DEBUG(dbgs() << LowerAsVReg.size() << " pointers will go in vregs\n");

+  DenseMap<SDValue, int> DeoptLowerAsVReg(LowerAsVReg);
Inline comment from dantrushin: It seems you don't need a map; a set would suffice here, which would greatly simplify the code.
+  auto processDeoptGCPtr = [&](const Value *V) {
+    SDValue PtrSD = Builder.getValue(V);
+    if (DeoptLowerAsVReg.size() == MaxVRegPtrs)
+      return;
Inline comment from dantrushin: You don't need this. Deopt pointers are not relocated, so they need not be limited.
+    // Avoid duplicates.
+    if (DeoptLowerAsVReg.count(PtrSD))
+      return;
+    assert(V->getType()->isVectorTy() == PtrSD.getValueType().isVector() &&
+           "IR and SD types disagree");
+    if (!canPassGCPtrOnVReg(PtrSD)) {
+      LLVM_DEBUG(dbgs() << "direct/spill "; PtrSD.dump(&Builder.DAG));
+      return;
+    }
+    LLVM_DEBUG(dbgs() << "vreg "; PtrSD.dump(&Builder.DAG));
+    DeoptLowerAsVReg[PtrSD] = CurNumVRegs++;
+  };
+
   auto isGCValue = [&](const Value *V) {
     auto *Ty = V->getType();
     if (!Ty->isPtrOrPtrVectorTy())
       return false;
     if (auto *GFI = Builder.GFI)
       if (auto IsManaged = GFI->getStrategy().isGCManagedPointer(Ty))
         return *IsManaged;
     return true; // conservative
   };

   auto requireSpillSlot = [&](const Value *V) {
     if (isGCValue(V))
-      return !LowerAsVReg.count(Builder.getValue(V));
+      return !DeoptLowerAsVReg.count(Builder.getValue(V));
     return !(LiveInDeopt || UseRegistersForDeoptValues);
   };

   // Before we actually start lowering (and allocating spill slots for values),
   // reserve any stack slots which we judge to be profitable to reuse for a
   // particular value. This is purely an optimization over the code below and
   // doesn't change semantics at all. It is important for performance that we
   // reserve slots for both deopt and gc values before lowering either.
   for (const Value *V : SI.DeoptState) {
+    if (UseRegistersForDeadGCDeoptValues && isGCValue(V))
+      processDeoptGCPtr(V);
     if (requireSpillSlot(V))
       reservePreviousStackSlotForValue(V, Builder);
   }

   for (const Value *V : SI.Ptrs) {
     SDValue SDV = Builder.getValue(V);
     if (!LowerAsVReg.count(SDV))
       reservePreviousStackSlotForValue(V, Builder);
(395 lines hidden; context: #endif)
   for (const GCRelocateInst *Relocate : I.getGCRelocates()) {
     SI.GCRelocates.push_back(Relocate);

     SDValue DerivedSD = getValue(Relocate->getDerivedPtr());
     if (Seen.insert(DerivedSD).second) {
       SI.Bases.push_back(Relocate->getBasePtr());
       SI.Ptrs.push_back(Relocate->getDerivedPtr());
     }
   }
Inline comment from reames: MARK (see overall comment)

   SI.GCArgs = ArrayRef<const Use>(I.gc_args_begin(), I.gc_args_end());
   SI.StatepointInstr = &I;
   SI.ID = I.getID();

   SI.DeoptState = ArrayRef<const Use>(I.deopt_begin(), I.deopt_end());
   SI.GCTransitionArgs = ArrayRef<const Use>(I.gc_transition_args_begin(),
                                             I.gc_transition_args_end());
(212 lines hidden)

llvm/test/CodeGen/X86/statepoint-vreg-dead-gc-deopt.ll

This file was added.

; This file contains some of the same basic tests as statepoint-vreg.ll, but
; focuses on examining the intermediate representation. It's separate so that
; the main file is easy to update with update_llc_test_checks.py

; This run is to demonstrate what MIR SSA looks like.
; RUN: llc -use-registers-for-dead-gc-deopt-values=true -max-registers-for-gc-values=4 -stop-after finalize-isel < %s | FileCheck --check-prefix=CHECK-VREG %s
; This run is to demonstrate register allocator work.
; RUN: llc -use-registers-for-dead-gc-deopt-values=true -max-registers-for-gc-values=4 -stop-after virtregrewriter < %s | FileCheck --check-prefix=CHECK-PREG %s

target datalayout = "e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

declare dso_local void @func()
declare dso_local void @consume(i32 addrspace(1)*)

; deopt GC pointer not present in GC args can use register.
define void @test_deopt_gcpointer(i32 addrspace(1)* %a, i32 addrspace(1)* %b) gc "statepoint-example" {
; CHECK-VREG-LABEL: name: test_deopt_gcpointer
; CHECK-VREG: %1:gr64 = COPY $rsi
; CHECK-VREG: %0:gr64 = COPY $rdi
; CHECK-VREG: %2:gr64 = STATEPOINT 0, 0, 0, @func, 2, 0, 2, 0, 2, 1, %0, 2, 1, %1(tied-def 0), 2, 0, 2, 1, 0, 0, csr_64, implicit-def $rsp, implicit-def $ssp
; CHECK-VREG: $rdi = COPY %2
; CHECK-VREG: CALL64pcrel32 @consume, csr_64, implicit $rsp, implicit $ssp, implicit $rdi, implicit-def $rsp, implicit-def $ssp
; CHECK-VREG: RET 0

; CHECK-PREG-LABEL: name: test_deopt_gcpointer
; CHECK-PREG: renamable $rbx = COPY $rsi
; CHECK-PREG: renamable $rbx = STATEPOINT 0, 0, 0, @func, 2, 0, 2, 0, 2, 1, killed renamable $rdi, 2, 1, killed renamable $rbx(tied-def 0), 2, 0, 2, 1, 0, 0, csr_64, implicit-def $rsp, implicit-def $ssp
; CHECK-PREG: $rdi = COPY killed renamable $rbx
; CHECK-PREG: CALL64pcrel32 @consume, csr_64, implicit $rsp, implicit $ssp, implicit $rdi, implicit-def $rsp, implicit-def $ssp

  %safepoint_token = tail call token (i64, i32, void ()*, i32, i32, ...) @llvm.experimental.gc.statepoint.p0f_isVoidf(i64 0, i32 0, void ()* @func, i32 0, i32 0, i32 0, i32 0) ["deopt" (i32 addrspace(1)* %a), "gc-live" (i32 addrspace(1)* %b)]
  %rel = call i32 addrspace(1)* @llvm.experimental.gc.relocate.p1i32(token %safepoint_token, i32 0, i32 0)
  call void @consume(i32 addrspace(1)* %rel)
  ret void
}

declare token @llvm.experimental.gc.statepoint.p0f_isVoidf(i64, i32, void ()*, i32, i32, ...)
declare dso_local i32 addrspace(1)* @llvm.experimental.gc.relocate.p1i32(token, i32, i32)

llvm/test/CodeGen/X86/statepoint-vreg-details.ll

 ; This file contains some of the same basic tests as statepoint-vreg.ll, but
 ; focuses on examining the intermediate representation. It's separate so that
 ; the main file is easy to update with update_llc_test_checks.py

 ; This run is to demonstrate what MIR SSA looks like.
-; RUN: llc -max-registers-for-gc-values=4 -stop-after finalize-isel < %s | FileCheck --check-prefix=CHECK-VREG %s
+; RUN: llc -use-registers-for-dead-gc-deopt-values=false -max-registers-for-gc-values=4 -stop-after finalize-isel < %s | FileCheck --check-prefix=CHECK-VREG %s
 ; This run is to demonstrate register allocator work.
-; RUN: llc -max-registers-for-gc-values=4 -stop-after virtregrewriter < %s | FileCheck --check-prefix=CHECK-PREG %s
+; RUN: llc -use-registers-for-dead-gc-deopt-values=false -max-registers-for-gc-values=4 -stop-after virtregrewriter < %s | FileCheck --check-prefix=CHECK-PREG %s

 target datalayout = "e-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-pc-linux-gnu"

 declare dso_local i1 @return_i1()
 declare dso_local void @func()
 declare dso_local void @consume(i32 addrspace(1)*)
 declare dso_local void @consume2(i32 addrspace(1)*, i32 addrspace(1)*)
(337 lines hidden)