This is an archive of the discontinued LLVM Phabricator instance.

Can you please explain what the JumpThreading interaction is in more detail? This looks plausible to me, but I want to make sure no change is needed in JumpThreading (which is supposed to be able to thread over selects as well, not just branches).

llvm/lib/Passes/PassBuilderPipelines.cpp, line 1017:
This bit definitely makes sense to keep branches for IPSCCP -- maybe after this, we can even swap the SimplifyCFG and SROA order, as the current one only exists so that SimplifyCFG doesn't do too much.

Currently, SimplifyCFG will optimize zot to the following:

define internal fastcc i32 @zot(ptr %arg, ptr %arg1) unnamed_addr {
bb:
  br label %bb2

bb2:                                              ; preds = %bb11, %bb
  %phi = phi ptr [ %arg, %bb ], [ %spec.select, %bb11 ]
  %phi3 = phi i32 [ 0, %bb ], [ %add, %bb11 ]
  %icmp4 = icmp eq ptr %phi, %arg1
  %getelementptr = getelementptr inbounds i32, ptr %phi, i64 1
  %spec.select = select i1 %icmp4, ptr %phi, ptr %getelementptr
  %spec.select1 = select i1 %icmp4, ptr null, ptr %phi
  %icmp10 = icmp eq ptr %spec.select1, null
  br i1 %icmp10, label %bb12, label %bb11

bb11:                                             ; preds = %bb2
  %load = load i32, ptr %spec.select1, align 4
  %add = add i32 %load, %phi3
  br label %bb2

bb12:                                             ; preds = %bb2
  ret i32 %phi3
}

I'm not sure how jump-threading would work for this IR.

SimplifyCFG's speculation prevents some instructions from being hoisted by LICM, and JumpThreading fixes those cases.
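For orientation, the zot IR above behaves like an iterator-style summation loop. The following is only a hypothetical source-level reconstruction, not code from this revision: the helper name nextElem and the buffer name buf are assumptions, and the real input may look different. The end-of-range comparison plays the role of %icmp4 and the null check in the loop condition plays the role of %icmp10.

```cpp
#include <cassert>

// Hypothetical helper (name assumed): returns the current element and
// advances the cursor, or returns nullptr once the range is exhausted.
const int *nextElem(const int *&cur, const int *end) {
  if (cur == end) // corresponds to %icmp4 in the IR above
    return nullptr;
  return cur++;
}

// Hypothetical source for @zot: sums [begin, end) via nextElem().
int zot(const int *begin, const int *end) {
  assert(begin && end && "toy version expects a valid range");
  int sum = 0;
  // The null check on the returned pointer corresponds to %icmp10.
  while (const int *p = nextElem(begin, end))
    sum += *p;
  return sum;
}

// Mirrors @f from the test: three stored values, then a call to zot.
int f() {
  int buf[3] = {1, 2, 3};
  return zot(buf, buf + 3);
}
```

In this sketch f() sums 1, 2 and 3, which matches the `ret i32 6` that the updated CHECK lines in simplifycfg-jump-threading.ll expect once the loop is fully optimized.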

I've looked into this, and concluded that jump threading isn't performing any essential optimization here.

The real problem is that we end up losing an inbounds on the GEP due to our questionable "gep 0 is not always inbounds" semantics. But in this case, we can still recover from that by making SCEV smarter. I've made a few changes (https://github.com/llvm/llvm-project/commit/7cf567d46121f4aa8f659554b5e8584cd0fac056, https://github.com/llvm/llvm-project/commit/406e9c93726ed929ca09ffed2fd3a60cfd633f4b, https://reviews.llvm.org/D153624) that will allow us to optimize your example with the current pipeline.

Not to say that this change isn't also reasonable, but I think it needs a different motivating example.

Thanks for those patches. They do fix this case, but they don't fix the case with malloc. For example, the following loop doesn't get unrolled until the very last loop-unroll pass:

define i32 @f() local_unnamed_addr {
bb:
  %alloc = tail call dereferenceable_or_null(64) ptr @malloc(i64 64)
  store i32 1, ptr %alloc, align 4
  %getelementptr = getelementptr i32, ptr %alloc, i64 1
  store i32 2, ptr %getelementptr, align 4
  %getelementptr1 = getelementptr i32, ptr %alloc, i64 2
  store i32 3, ptr %getelementptr1, align 4
  %getelementptr2 = getelementptr i32, ptr %alloc, i64 3
  br label %bb11.i

bb11.i:                                           ; preds = %bb, %bb11.i
  %phi37.i = phi i32 [ %add.i, %bb11.i ], [ 0, %bb ]
  %phi6.i = phi ptr [ %spec.select.i, %bb11.i ], [ %alloc, %bb ]
  %spec.select.i = getelementptr i32, ptr %phi6.i, i64 1
  %load.i = load i32, ptr %phi6.i, align 4
  %add.i = add i32 %load.i, %phi37.i
  %icmp4.i = icmp ne ptr %spec.select.i, %getelementptr2
  %icmp102.i = icmp ne ptr %spec.select.i, null
  %icmp10.not.i = and i1 %icmp102.i, %icmp4.i
  br i1 %icmp10.not.i, label %bb11.i, label %zot.exit

zot.exit:                                         ; preds = %bb11.i
  tail call void @free(ptr %alloc)
  ret i32 %add.i
}

; Function Attrs: mustprogress nofree nounwind willreturn allockind("alloc,uninitialized") allocsize(0) memory(inaccessiblemem: readwrite)
declare noalias noundef ptr @malloc(i64 noundef) local_unnamed_addr #0

; Function Attrs: mustprogress nounwind willreturn allockind("free") memory(argmem: readwrite, inaccessiblemem: readwrite)
declare void @free(ptr allocptr nocapture noundef) local_unnamed_addr #1

attributes #0 = { mustprogress nofree nounwind willreturn allockind("alloc,uninitialized") allocsize(0) memory(inaccessiblemem: readwrite) "alloc-family"="malloc" "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
attributes #1 = { mustprogress nounwind willreturn allockind("free") memory(argmem: readwrite, inaccessiblemem: readwrite) "alloc-family"="malloc" "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }

nikic mentioned this in rGb24b6c4a32e1: [PhaseOrdering] Add test with gep null compare in loop (NFC). Jun 29 2023, 1:19 AM

Good point. I think the malloc case is good motivation to change gep inbounds 0 semantics, so I'll look into that.

For the record, I've committed the test with alloca and malloc variants with https://github.com/llvm/llvm-project/commit/b24b6c4a32e159ece79583b3aaf5887ea44c4ebf.

The patch stack starting at https://reviews.llvm.org/D154051 will fix the malloc case.

I've also found that

diff --git a/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp b/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
index 40475d9563b2..dee3a1308405 100644
--- a/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
+++ b/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
@@ -2064,6 +2064,8 @@ PreservedAnalyses IndVarSimplifyPass::run(Loop &L, LoopAnalysisManager &AM,
   if (!IVS.run(&L))
     return PreservedAnalyses::all();
 
+  AR.SE.forgetLoop(&L);
+
   auto PA = getLoopPassPreservedAnalyses();
   PA.preserveSet<CFGAnalyses>();
   if (AR.MSSA)

fixes the issue and allows loop-unroll-full to see the proper trip count. I'm not familiar with SCEV caching; is this a stale analysis issue?
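To make the question concrete, here is a toy model of what a stale cached trip count would look like in general. It is a sketch under loud assumptions: ToyLoop, ToyTripCounts and tripCountAfterTransform are invented for illustration, none of this is the ScalarEvolution implementation, and it does not establish whether staleness is actually what happens in IndVarSimplify.

```cpp
#include <cassert>
#include <unordered_map>

// Toy stand-in for a counted loop running from Begin up to End.
struct ToyLoop {
  int Id;
  int Begin;
  int End;
};

// Hypothetical caching analysis; this is not ScalarEvolution.
class ToyTripCounts {
  std::unordered_map<int, int> Cache; // loop id -> memoized trip count

public:
  int getTripCount(const ToyLoop &L) {
    assert(L.End >= L.Begin && "toy loops only count upward");
    auto It = Cache.find(L.Id);
    if (It != Cache.end())
      return It->second; // served from the cache, even if L has changed
    int Count = L.End - L.Begin;
    Cache.emplace(L.Id, Count);
    return Count;
  }

  // Drops the memoized result so the next query recomputes it.
  void forgetLoop(const ToyLoop &L) { Cache.erase(L.Id); }
};

// Queries the trip count, rewrites the loop, then queries again,
// optionally invalidating the cached entry in between.
int tripCountAfterTransform(bool Forget) {
  ToyTripCounts Analysis;
  ToyLoop L{0, 0, 8};
  Analysis.getTripCount(L); // memoizes 8
  L.End = 3;                // a "transform" rewrites the exit bound
  if (Forget)
    Analysis.forgetLoop(L);
  return Analysis.getTripCount(L);
}
```

In this toy, tripCountAfterTransform(false) still reports the old count of 8, while tripCountAfterTransform(true) recomputes and reports 3. A transform that changes a loop without invalidating the cached result leaves later consumers with outdated information, which is the kind of effect the forgetLoop call in the diff above would paper over if that is indeed the cause.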

nikic mentioned this in D155997: [Phase Ordering] Don't speculate in SimplifyCFG before PGO annotation. Jul 28 2023, 10:43 AM

Revision Contents

Path                                                                  Size
llvm/lib/Passes/PassBuilderPipelines.cpp                              4 lines
llvm/test/Transforms/PhaseOrdering/X86/vector-reductions-logical.ll   2 lines
llvm/test/Transforms/PhaseOrdering/simplifycfg-jump-threading.ll      21 lines

Diff 533097

llvm/lib/Passes/PassBuilderPipelines.cpp

[1,008 lines not shown; context: if (Phase != ThinOrFullLTOPhase::ThinLTOPostLink) {]
     MPM.addPass(InferFunctionAttrsPass());
     MPM.addPass(CoroEarlyPass());

     FunctionPassManager EarlyFPM;
     // Lower llvm.expect to metadata before attempting transforms.
     // Compare/branch metadata may alter the behavior of passes like
     // SimplifyCFG.
     EarlyFPM.addPass(LowerExpectIntrinsicPass());
-    EarlyFPM.addPass(SimplifyCFGPass());
+    EarlyFPM.addPass(SimplifyCFGPass(SimplifyCFGOptions().speculateBlocks(false)));
     EarlyFPM.addPass(SROAPass(SROAOptions::ModifyCFG));
     EarlyFPM.addPass(EarlyCSEPass());
     if (Level == OptimizationLevel::O3)
       EarlyFPM.addPass(CallSiteSplittingPass());
     MPM.addPass(createModuleToFunctionPassAdaptor(
         std::move(EarlyFPM), PTO.EagerlyInvalidateAnalyses));
   }

[51 lines not shown; context: PassBuilder::buildModuleSimplificationPipeline(OptimizationLevel Level,]
   // Create a small function pass pipeline to cleanup after all the global
   // optimizations.
   FunctionPassManager GlobalCleanupPM;
   // FIXME: Should this instead by a run of SROA?
   GlobalCleanupPM.addPass(PromotePass());
   GlobalCleanupPM.addPass(InstCombinePass());
   invokePeepholeEPCallbacks(GlobalCleanupPM, Level);
   GlobalCleanupPM.addPass(
-      SimplifyCFGPass(SimplifyCFGOptions().convertSwitchRangeToICmp(true)));
+      SimplifyCFGPass(SimplifyCFGOptions().convertSwitchRangeToICmp(true).speculateBlocks(false)));
   MPM.addPass(createModuleToFunctionPassAdaptor(std::move(GlobalCleanupPM),
                                                 PTO.EagerlyInvalidateAnalyses));

   // Add all the requested passes for instrumentation PGO, if requested.
   if (PGOOpt && Phase != ThinOrFullLTOPhase::ThinLTOPostLink &&
       (PGOOpt->Action == PGOOptions::IRInstr ||
        PGOOpt->Action == PGOOptions::IRUse)) {
     addPGOInstrPasses(MPM, Level,
[937 lines not shown]

llvm/test/Transforms/PhaseOrdering/X86/vector-reductions-logical.ll

[403 lines not shown]
 ; CHECK-NEXT: [[TMP0:%.*]] = shufflevector <4 x i32> [[T:%.*]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
 ; CHECK-NEXT: [[TMP1:%.*]] = icmp sgt <8 x i32> [[TMP0]], <i32 255, i32 255, i32 255, i32 255, i32 1, i32 1, i32 1, i32 1>
 ; CHECK-NEXT: [[TMP2:%.*]] = icmp slt <8 x i32> [[TMP0]], <i32 255, i32 255, i32 255, i32 255, i32 1, i32 1, i32 1, i32 1>
 ; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <8 x i1> [[TMP1]], <8 x i1> [[TMP2]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>
 ; CHECK-NEXT: [[TMP4:%.*]] = freeze <8 x i1> [[TMP3]]
 ; CHECK-NEXT: [[TMP5:%.*]] = bitcast <8 x i1> [[TMP4]] to i8
 ; CHECK-NEXT: [[DOTNOT:%.*]] = icmp eq i8 [[TMP5]], 0
 ; CHECK-NEXT: [[SHIFT:%.*]] = shufflevector <4 x i32> [[T]], <4 x i32> poison, <4 x i32> <i32 1, i32 poison, i32 poison, i32 poison>
-; CHECK-NEXT: [[TMP6:%.*]] = add nsw <4 x i32> [[SHIFT]], [[T]]
+; CHECK-NEXT: [[TMP6:%.*]] = add nuw nsw <4 x i32> [[SHIFT]], [[T]]
 ; CHECK-NEXT: [[ADD:%.*]] = extractelement <4 x i32> [[TMP6]], i64 0
 ; CHECK-NEXT: [[CONV:%.*]] = sitofp i32 [[ADD]] to float
 ; CHECK-NEXT: [[RETVAL_0:%.*]] = select i1 [[DOTNOT]], float [[CONV]], float 0.000000e+00
 ; CHECK-NEXT: ret float [[RETVAL_0]]
 ;
 entry:
   %vecext = extractelement <4 x i32> %t, i32 0
   %cmp = icmp slt i32 %vecext, 1
[207 lines not shown]

llvm/test/Transforms/PhaseOrdering/simplifycfg-jump-threading.ll

[30 lines not shown]
 bb12:
   ret i32 %phi3
 }

 define i32 @f() {
 ; CHECK-LABEL: define i32 @f
 ; CHECK-SAME: () local_unnamed_addr #[[ATTR0:[0-9]+]] {
 ; CHECK-NEXT: bb:
-; CHECK-NEXT: [[ALLOCA:%.*]] = alloca [3 x i32], align 4
-; CHECK-NEXT: store i32 1, ptr [[ALLOCA]], align 4
-; CHECK-NEXT: [[GETELEMENTPTR:%.*]] = getelementptr inbounds i32, ptr [[ALLOCA]], i64 1
-; CHECK-NEXT: store i32 2, ptr [[GETELEMENTPTR]], align 4
-; CHECK-NEXT: [[GETELEMENTPTR1:%.*]] = getelementptr inbounds i32, ptr [[ALLOCA]], i64 2
-; CHECK-NEXT: store i32 3, ptr [[GETELEMENTPTR1]], align 4
-; CHECK-NEXT: [[GETELEMENTPTR2:%.*]] = getelementptr inbounds i32, ptr [[ALLOCA]], i64 3
-; CHECK-NEXT: br label [[BB11_I:%.*]]
-; CHECK: bb11.i:
-; CHECK-NEXT: [[PHI37_I:%.*]] = phi i32 [ [[ADD_I:%.*]], [[BB11_I]] ], [ 0, [[BB:%.*]] ]
-; CHECK-NEXT: [[PHI6_I:%.*]] = phi ptr [ [[SPEC_SELECT_I:%.*]], [[BB11_I]] ], [ [[ALLOCA]], [[BB]] ]
-; CHECK-NEXT: [[SPEC_SELECT_I]] = getelementptr i32, ptr [[PHI6_I]], i64 1
-; CHECK-NEXT: [[LOAD_I:%.*]] = load i32, ptr [[PHI6_I]], align 4
-; CHECK-NEXT: [[ADD_I]] = add i32 [[LOAD_I]], [[PHI37_I]]
-; CHECK-NEXT: [[ICMP4_I:%.*]] = icmp ne ptr [[SPEC_SELECT_I]], [[GETELEMENTPTR2]]
-; CHECK-NEXT: [[ICMP102_I:%.*]] = icmp ne ptr [[SPEC_SELECT_I]], null
-; CHECK-NEXT: [[ICMP10_NOT_I:%.*]] = and i1 [[ICMP102_I]], [[ICMP4_I]]
-; CHECK-NEXT: br i1 [[ICMP10_NOT_I]], label [[BB11_I]], label [[ZOT_EXIT:%.*]]
-; CHECK: zot.exit:
-; CHECK-NEXT: ret i32 [[ADD_I]]
+; CHECK-NEXT: ret i32 6
 ;
 bb:
   %alloca = alloca [3 x i32], align 4
   store i32 1, ptr %alloca, align 4
   %getelementptr = getelementptr i32, ptr %alloca, i32 1
   store i32 2, ptr %getelementptr, align 4
   %getelementptr1 = getelementptr i32, ptr %alloca, i32 2
   store i32 3, ptr %getelementptr1, align 4
   %getelementptr2 = getelementptr i32, ptr %alloca, i32 3
   %call = call i32 @zot(ptr %alloca, ptr %getelementptr2)
   ret i32 %call
 }


[PhaseOrdering] Don't speculate blocks in simplifycfg before jump-threading
Needs Review, Public
