This is an archive of the discontinued LLVM Phabricator instance.

Extra processing for BitCast + PHI in InstCombine
ClosedPublic

Authored by itsimbal on Jan 22 2019, 6:23 AM.

Details

Reviewers

Commits

rG53980b24b7d0: Extra processing for BitCast + PHI in InstCombine
rL353595: Extra processing for BitCast + PHI in InstCombine

Summary

In some cases involving an A->B->A bitcast chain with intervening PHI nodes, the InstCombiner::optimizeBitCastFromPhi transformation creates extra PHI nodes that are copies of PHI nodes it has already created; in other words, they are redundant. These extra PHI nodes can lead to extra move instructions being generated after the DeSSA transformation. This happens when several conditions are met:

SROA kicks in and creates a new alloca;
there is a simple assignment L = R, which falls under the 'canonicalize loads' transformation done by combineLoadToOperationType (this transformation is enabled by default) and is what generates the bitcasts;
the alloca is then used in an A->B->A + PHI chain;
loop unrolling is applied.

As a result, optimizeBitCastFromPhi creates as many PHI nodes for each new SROA alloca as the loop unrolling factor. All but one of these new PHI nodes are redundant and should not be created. Moreover, the purpose of optimizeBitCastFromPhi is to get rid of the cast (when possible), but that does not happen under these conditions.

The proposed fix is to do the cast replacement for the whole accumulated PHI closure, not only for the single cast that is passed as an argument to optimizeBitCastFromPhi. This accomplishes several things: 1) it avoids generating extra PHI nodes, because all casts that could trigger the optimizeBitCastFromPhi transformation are replaced, 2) the bitcasts are replaced, and 3) it creates more opportunities to remove the dead code that appears after the replacement.

A new test case shows that it's possible to get rid of all bitcasts completely and get quite good code reduction.
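The difference between replacing only the triggering cast and replacing every cast over the PHI closure can be illustrated with a small toy model. This is only a sketch: `Ty`, `ToyCast`, `replaceTriggerOnly`, `replaceWholeClosure` and `remainingCasts` are hypothetical names invented for illustration and are not part of LLVM or of this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for IR types and casts (hypothetical, not LLVM classes).
enum class Ty { A, B };

struct ToyCast {
  std::size_t phi; // index of the PHI node this cast reads from
  Ty from;         // source type of the cast
  Ty to;           // destination type of the cast
  bool replaced;   // true once the cast's users were rewired to a new PHI
};

// Behaviour before the fix, as described in the summary: only the cast
// that triggered the transformation is replaced.
inline std::size_t replaceTriggerOnly(std::vector<ToyCast> &casts,
                                      std::size_t trigger) {
  casts[trigger].replaced = true;
  return 1;
}

// Behaviour proposed by the fix: every B->A cast whose operand is a PHI
// in the accumulated closure is replaced in the same invocation.
inline std::size_t
replaceWholeClosure(std::vector<ToyCast> &casts,
                    const std::vector<std::size_t> &closure) {
  std::size_t count = 0;
  for (ToyCast &c : casts) {
    bool inClosure = false;
    for (std::size_t p : closure)
      inClosure = inClosure || (p == c.phi);
    // Only casts of the right shape (B->A) are touched.
    if (inClosure && c.from == Ty::B && c.to == Ty::A && !c.replaced) {
      c.replaced = true;
      ++count;
    }
  }
  return count;
}

// Number of casts that are still present after a replacement step.
inline std::size_t remainingCasts(const std::vector<ToyCast> &casts) {
  std::size_t n = 0;
  for (const ToyCast &c : casts)
    if (!c.replaced)
      ++n;
  return n;
}
```

In this model, a closure with two B->A casts keeps one of them alive under the old scheme, so a later visit of that cast would build another set of PHI nodes; under the closure-wide scheme both are replaced at once.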

Event Timeline

itsimbal created this revision. Jan 22 2019, 6:23 AM

Herald added a subscriber: dmgreen. Jan 22 2019, 6:23 AM

A friendly reminder for the review.

I couldn't understand the problem after reading the description.

Can I describe it as: A value is used as two different types in two (or more) basic blocks. Although the bitcast of %11 (I guess optimizeBitCastFromPhi starts from here) can be removed with a new set of PHIs, the old PHIs can't be removed because they are still used with type B in bb6. So we need to keep two sets of values (registers) in many places even after DeSSA.

Herald added a project: Restricted Project. Feb 6 2019, 2:53 PM

There are two issues:

Extra redundant phi nodes are created. If you look at the test output (without the proposed changes), you will see

  %4 = phi float [ %21, %.bb12 ], [ %conv.i, %.bb2 ]
  %5 = phi float [ %22, %.bb12 ], [ %conv.i, %.bb2 ]
  %rA.sroa.8.0 = phi i32 [ %rA.sroa.8.2, %.bb12 ], [ %1, %.bb2 ]
  %6 = phi float [ %23, %.bb12 ], [ %conv.i, %.bb2 ]
  %7 = phi float [ %24, %.bb12 ], [ %conv.i, %.bb2 ]
  %rA.sroa.0.0 = phi i32 [ %rA.sroa.0.2, %.bb12 ], [ %1, %.bb2 ]

and

  %13 = phi float [ %add33.1, %.bb4 ], [ %4, %.bb3 ]
  %14 = phi float [ %add33.1, %.bb4 ], [ %5, %.bb3 ]
  %rA.sroa.8.1 = phi i32 [ %11, %.bb4 ], [ %rA.sroa.8.0, %.bb3 ]
  %15 = phi float [ %add33.2, %.bb4 ], [ %6, %.bb3 ]
  %16 = phi float [ %add33.2, %.bb4 ], [ %7, %.bb3 ]
  %rA.sroa.0.1 = phi i32 [ %12, %.bb4 ], [ %rA.sroa.0.0, %.bb3 ]

Here %5, %7, %14, %16 are redundant. These instructions are created because not all bitcast instructions are removed when we first hit the needed pattern. Depending on the loop unroll factor there will be (factor-1) redundant phi nodes for each original phi node. In the attached test case the unroll factor is 2 for simplicity. In my real test the unroll factor is 16, and I got 15 redundant phi nodes for each original phi node.

Not all bitcast instructions are removed even though they could be. optimizeBitCastFromPhi removes only the one bitcast that triggers it. Because of this, the old phi nodes are not removed either. The fix tries to remove all bitcast instructions that are reachable from the found phi closure. The test output (with the fix applied) has no bitcast instructions and no old phi nodes.
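The (factor-1) figure from the first issue can be written down as a one-line calculation. The helper name `redundantPhis` is hypothetical and exists only for this illustration; it assumes every original phi node is affected in the same way.

```cpp
#include <cassert>

// Hypothetical helper: number of redundant phi nodes left behind when only
// the triggering bitcast is replaced, following the (factor-1) figure
// quoted in the comment above.
constexpr unsigned redundantPhis(unsigned unrollFactor,
                                 unsigned originalPhis) {
  return unrollFactor == 0 ? 0 : (unrollFactor - 1) * originalPhis;
}
```

With an unroll factor of 2 and the four original phi nodes shown in the listings (%rA.sroa.8.0, %rA.sroa.0.0, %rA.sroa.8.1, %rA.sroa.0.1), this gives the four redundant nodes %5, %7, %14 and %16; with a factor of 16 it gives 15 per original phi node.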

Thanks for the explanation.
LGTM

This revision is now accepted and ready to land. Feb 8 2019, 8:12 AM

Closed by commit rL353595: Extra processing for BitCast + PHI in InstCombine (authored by GBuella). Feb 8 2019, 5:46 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                               Size
lib/Transforms/InstCombine/InstCombineCasts.cpp    45 lines
test/Transforms/InstCombine/cast_phi.ll            135 lines

Diff 182902

lib/Transforms/InstCombine/InstCombineCasts.cpp

[2,161 lines not shown; context: Instruction *InstCombiner::optimizeBitCastFromPhi(CastInst &CI, PHINode *PN) {]
   Value *Src = CI.getOperand(0);
   Type *SrcTy = Src->getType(); // Type B
   Type *DestTy = CI.getType(); // Type A

   SmallVector<PHINode *, 4> PhiWorklist;
   SmallSetVector<PHINode *, 4> OldPhiNodes;

   // Find all of the A->B casts and PHI nodes.
-  // We need to inpect all related PHI nodes, but PHIs can be cyclic, so
+  // We need to inspect all related PHI nodes, but PHIs can be cyclic, so
   // OldPhiNodes is used to track all known PHI nodes, before adding a new
   // PHI to PhiWorklist, it is checked against and added to OldPhiNodes first.
   PhiWorklist.push_back(PN);
   OldPhiNodes.insert(PN);
   while (!PhiWorklist.empty()) {
     auto *OldPN = PhiWorklist.pop_back_val();
     for (Value *IncValue : OldPN->incoming_values()) {
       if (isa<Constant>(IncValue))
[58 lines not shown; context: for (unsigned j = 0, e = OldPN->getNumOperands(); j != e; ++j) {]
       } else if (auto *PrevPN = dyn_cast<PHINode>(V)) {
         NewV = NewPNodes[PrevPN];
       }
       assert(NewV);
       NewPN->addIncoming(NewV, OldPN->getIncomingBlock(j));
     }
   }

+  // Traverse all accumulated PHI nodes and process its users,
+  // which are Stores and BitcCasts. Without this processing
+  // NewPHI nodes could be replicated and could lead to extra
+  // moves generated after DeSSA.
   // If there is a store with type B, change it to type A.
-  for (User *U : PN->users()) {
-    auto *SI = dyn_cast<StoreInst>(U);
-    if (SI && SI->isSimple() && SI->getOperand(0) == PN) {
-      Builder.SetInsertPoint(SI);
-      auto *NewBC =
-        cast<BitCastInst>(Builder.CreateBitCast(NewPNodes[PN], SrcTy));
-      SI->setOperand(0, NewBC);
-      Worklist.Add(SI);
-      assert(hasStoreUsersOnly(*NewBC));
-    }
-  }
+  // Replace users of BitCast B->A with NewPHI. These will help
+  // later to get rid off a closure formed by OldPHI nodes.
+  Instruction *RetVal = nullptr;
+  for (auto *OldPN : OldPhiNodes) {
+    PHINode *NewPN = NewPNodes[OldPN];
+    for (User *V : OldPN->users()) {
+      if (auto *SI = dyn_cast<StoreInst>(V)) {
+        if (SI->isSimple() && SI->getOperand(0) == OldPN) {
+          Builder.SetInsertPoint(SI);
+          auto *NewBC =
+            cast<BitCastInst>(Builder.CreateBitCast(NewPN, SrcTy));
+          SI->setOperand(0, NewBC);
+          Worklist.Add(SI);
+          assert(hasStoreUsersOnly(*NewBC));
+        }
+      }
+      else if (auto *BCI = dyn_cast<BitCastInst>(V)) {
+        // Verify it's a B->A cast.
+        Type *TyB = BCI->getOperand(0)->getType();
+        Type *TyA = BCI->getType();
+        if (TyA == DestTy && TyB == SrcTy) {
+          Instruction *I = replaceInstUsesWith(*BCI, NewPN);
+          if (BCI == &CI)
+            RetVal = I;
+        }
+      }
+    }
+  }

-  return replaceInstUsesWith(CI, NewPNodes[PN]);
+  return RetVal;
 }

 Instruction *InstCombiner::visitBitCast(BitCastInst &CI) {
   // If the operands are integer typed then apply the integer transforms,
   // otherwise just apply the common ones.
   Value *Src = CI.getOperand(0);
   Type *SrcTy = Src->getType();
   Type *DestTy = CI.getType();
[164 lines not shown]
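The comment in the diff about OldPhiNodes describes a worklist traversal that tolerates cyclic PHI nodes by recording each node in a set before queueing it. A minimal standalone sketch of that idea follows; it uses plain indices and standard containers instead of LLVM's PHINode and SmallSetVector, and the function name `collectPhiClosure` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy graph: phiInputs[i] lists the PHI nodes that feed PHI node i.
// Returns every PHI reachable from `start`, in the order first seen.
inline std::vector<std::size_t>
collectPhiClosure(const std::vector<std::vector<std::size_t>> &phiInputs,
                  std::size_t start) {
  std::vector<std::size_t> worklist{start};
  std::set<std::size_t> seen{start}; // plays the role of OldPhiNodes
  std::vector<std::size_t> order{start};
  while (!worklist.empty()) {
    std::size_t phi = worklist.back();
    worklist.pop_back();
    for (std::size_t in : phiInputs[phi]) {
      // Insert into the seen-set first; only newly seen PHIs are queued,
      // so a cycle cannot enqueue the same node twice.
      if (seen.insert(in).second) {
        worklist.push_back(in);
        order.push_back(in);
      }
    }
  }
  return order;
}
```

For a three-node cycle such as the one formed by %rA.sroa.8.0, %rA.sroa.8.1 and %rA.sroa.8.2 in the test, the traversal visits each node once and terminates.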

test/Transforms/InstCombine/cast_phi.ll

This file was added.

; RUN: opt < %s -instcombine -S | FileCheck %s
; RUN: opt < %s -passes=instcombine -S | FileCheck %s

				define void @MainKernel(i32 %iNumSteps, i32 %tid, i32 %base) {
				; CHECK-NOT: bitcast

				%callA = alloca [258 x float], align 4
				%callB = alloca [258 x float], align 4
				%conv.i = uitofp i32 %iNumSteps to float
				%1 = bitcast float %conv.i to i32
				%conv.i12 = zext i32 %tid to i64
				%arrayidx3 = getelementptr inbounds [258 x float], [258 x float]* %callA, i64 0, i64 %conv.i12
				%2 = bitcast float* %arrayidx3 to i32*
				store i32 %1, i32* %2, align 4
				%arrayidx6 = getelementptr inbounds [258 x float], [258 x float]* %callB, i64 0, i64 %conv.i12
				%3 = bitcast float* %arrayidx6 to i32*
				store i32 %1, i32* %3, align 4
				%cmp7 = icmp eq i32 %tid, 0
				br i1 %cmp7, label %.bb1, label %.bb2

				.bb1:
				%arrayidx10 = getelementptr inbounds [258 x float], [258 x float]* %callA, i64 0, i64 256
				store float %conv.i, float* %arrayidx10, align 4
				%arrayidx11 = getelementptr inbounds [258 x float], [258 x float]* %callB, i64 0, i64 256
				store float 0.000000e+00, float* %arrayidx11, align 4
				br label %.bb2

				.bb2:
				%cmp135 = icmp sgt i32 %iNumSteps, 0
				br i1 %cmp135, label %.bb3, label %.bb8

				; CHECK-LABEL: .bb3
				; CHECK: phi float
				; CHECK: phi float
				; CHECK: phi i32 {{.*}} [ %iNumSteps
				; CHECK-NOT: rA.sroa.[0-9].[0-9] = phi i32
				; CHECK-NOT: phi float
				; CHECK-NOT: phi i32
				; CHECK-LABEL: .bb4

				.bb3:
				%rA.sroa.8.0 = phi i32 [ %rA.sroa.8.2, %.bb12 ], [ %1, %.bb2 ]
				%rA.sroa.0.0 = phi i32 [ %rA.sroa.0.2, %.bb12 ], [ %1, %.bb2 ]
				%i12.06 = phi i32 [ %sub, %.bb12 ], [ %iNumSteps, %.bb2 ]
				%4 = icmp ugt i32 %i12.06, %base
				%add = add i32 %i12.06, 1
				%conv.i9 = sext i32 %add to i64
				%arrayidx20 = getelementptr inbounds [258 x float], [258 x float]* %callA, i64 0, i64 %conv.i9
				%5 = bitcast float* %arrayidx20 to i32*
				%arrayidx24 = getelementptr inbounds [258 x float], [258 x float]* %callB, i64 0, i64 %conv.i9
				%6 = bitcast float* %arrayidx24 to i32*
				%cmp40 = icmp ult i32 %i12.06, %base
				br i1 %4, label %.bb4, label %.bb5

				.bb4:
				%7 = load i32, i32* %5, align 4
				%8 = load i32, i32* %6, align 4
				%9 = bitcast i32 %8 to float
				%10 = bitcast i32 %7 to float
				%add33 = fadd float %9, %10
				%11 = bitcast i32 %rA.sroa.8.0 to float
				%add33.1 = fadd float %add33, %11
				%12 = bitcast float %add33.1 to i32
				%13 = bitcast i32 %rA.sroa.0.0 to float
				%add33.2 = fadd float %add33.1, %13
				%14 = bitcast float %add33.2 to i32
				br label %.bb5

				; CHECK-LABEL: .bb5
				; CHECK: phi float
				; CHECK: phi float
				; CHECK-NOT: rA.sroa.[0-9].[0-9] = phi i32
				; CHECK-NOT: phi float
				; CHECK-NOT: phi i32
				; CHECK-LABEL: .bb6

				.bb5:
				%rA.sroa.8.1 = phi i32 [ %12, %.bb4 ], [ %rA.sroa.8.0, %.bb3 ]
				%rA.sroa.0.1 = phi i32 [ %14, %.bb4 ], [ %rA.sroa.0.0, %.bb3 ]
				br i1 %cmp40, label %.bb6, label %.bb7

				.bb6:
				store i32 %rA.sroa.0.1, i32* %2, align 4
				store i32 %rA.sroa.8.1, i32* %3, align 4
				br label %.bb7

				.bb7:
				br i1 %4, label %.bb9, label %.bb10

				.bb8:
				ret void

				.bb9:
				%15 = load i32, i32* %5, align 4
				%16 = load i32, i32* %6, align 4
				%17 = bitcast i32 %16 to float
				%18 = bitcast i32 %15 to float
				%add33.112 = fadd float %17, %18
				%19 = bitcast i32 %rA.sroa.8.1 to float
				%add33.1.1 = fadd float %add33.112, %19
				%20 = bitcast float %add33.1.1 to i32
				%21 = bitcast i32 %rA.sroa.0.1 to float
				%add33.2.1 = fadd float %add33.1.1, %21
				%22 = bitcast float %add33.2.1 to i32
				br label %.bb10

				; CHECK-LABEL: .bb10
				; CHECK: phi float
				; CHECK: phi float
				; CHECK-NOT: rA.sroa.[0-9].[0-9] = phi i32
				; CHECK-NOT: phi float
				; CHECK-NOT: phi i32
				; CHECK-LABEL: .bb11

				.bb10:
				%rA.sroa.8.2 = phi i32 [ %20, %.bb9 ], [ %rA.sroa.8.1, %.bb7 ]
				%rA.sroa.0.2 = phi i32 [ %22, %.bb9 ], [ %rA.sroa.0.1, %.bb7 ]
				br i1 %cmp40, label %.bb11, label %.bb12

				; CHECK-LABEL: .bb11
				; CHECK: store float
				; CHECK: store float
				; CHECK-NOT: store i32 %rA.sroa.[0-9].[0-9]
				; CHECK-LABEL: .bb12

				.bb11:
				store i32 %rA.sroa.0.2, i32* %2, align 4
				store i32 %rA.sroa.8.2, i32* %3, align 4
				br label %.bb12

				.bb12:
				%sub = add i32 %i12.06, -4
				%cmp13 = icmp sgt i32 %sub, 0
				br i1 %cmp13, label %.bb3, label %.bb8
				}