Speculating blocks can prevent jump threading from triggering.
Fixes some phase ordering issues affecting Rust uncovered by D145265.
Differential D153392
[PhaseOrdering] Don't speculate blocks in simplifycfg before jump-threading aeubanks on Jun 20 2023, 6:24 PM. Authored by
Details
Diff Detail
Event TimelineComment Actions Can you please explain what the JumpThreading interaction is in more detail? This looks plausible to me, but I want to make sure no change is needed in JumpThreading (which is supposed to be able to thread over selects as well, not just branches).
Comment Actions Currently simplifycfg will optimize zot to the following define internal fastcc i32 @zot(ptr %arg, ptr %arg1) unnamed_addr { bb: br label %bb2 bb2: ; preds = %bb11, %bb %phi = phi ptr [ %arg, %bb ], [ %spec.select, %bb11 ] %phi3 = phi i32 [ 0, %bb ], [ %add, %bb11 ] %icmp4 = icmp eq ptr %phi, %arg1 %getelementptr = getelementptr inbounds i32, ptr %phi, i64 1 %spec.select = select i1 %icmp4, ptr %phi, ptr %getelementptr %spec.select1 = select i1 %icmp4, ptr null, ptr %phi %icmp10 = icmp eq ptr %spec.select1, null br i1 %icmp10, label %bb12, label %bb11 bb11: ; preds = %bb2 %load = load i32, ptr %spec.select1, align 4 %add = add i32 %load, %phi3 br label %bb2 bb12: ; preds = %bb2 ret i32 %phi3 } I'm not sure how jump-threading would work for this IR. Simplifycfg's speculation prevents some instructions from getting licm'd and jump-threading fixes those cases. Comment Actions I've looked into this, and concluded that jump threading isn't performing any essential optimization here. The real problem is that we end up losing an inbounds on the GEP due to our questionable "gep 0 is not always inbounds" semantics. But in this case, we can still recover from that by making SCEV smarter. I've made a few changes (https://github.com/llvm/llvm-project/commit/7cf567d46121f4aa8f659554b5e8584cd0fac056, https://github.com/llvm/llvm-project/commit/406e9c93726ed929ca09ffed2fd3a60cfd633f4b, https://reviews.llvm.org/D153624) that will allow us to optimize your example with the current pipeline. Not to say that this change isn't also reasonable, but I think it needs a different motivating example. Comment Actions thanks for those patches, they do fix this case, but they don't fix the case with malloc. e.g. the following loop doesn't get unrolled until the very last loop-unroll pass define i32 @f() local_unnamed_addr { bb: %alloc = tail call dereferenceable_or_null(64) ptr @malloc(i64 64) store i32 1, ptr %alloc, align 4 %getelementptr = getelementptr i32, ptr %alloc, i64 1 store i32 2, ptr %getelementptr, align 4 %getelementptr1 = getelementptr i32, ptr %alloc, i64 2 store i32 3, ptr %getelementptr1, align 4 %getelementptr2 = getelementptr i32, ptr %alloc, i64 3 br label %bb11.i bb11.i: ; preds = %bb, %bb11.i %phi37.i = phi i32 [ %add.i, %bb11.i ], [ 0, %bb ] %phi6.i = phi ptr [ %spec.select.i, %bb11.i ], [ %alloc, %bb ] %spec.select.i = getelementptr i32, ptr %phi6.i, i64 1 %load.i = load i32, ptr %phi6.i, align 4 %add.i = add i32 %load.i, %phi37.i %icmp4.i = icmp ne ptr %spec.select.i, %getelementptr2 %icmp102.i = icmp ne ptr %spec.select.i, null %icmp10.not.i = and i1 %icmp102.i, %icmp4.i br i1 %icmp10.not.i, label %bb11.i, label %zot.exit zot.exit: ; preds = %bb11.i tail call void @free(ptr %alloc) ret i32 %add.i } ; Function Attrs: mustprogress nofree nounwind willreturn allockind("alloc,uninitialized") allocsize(0) memory(inaccessiblemem: readwrite) declare noalias noundef ptr @malloc(i64 noundef) local_unnamed_addr #0 ; Function Attrs: mustprogress nounwind willreturn allockind("free") memory(argmem: readwrite, inaccessiblemem: readwrite) declare void @free(ptr allocptr nocapture noundef) local_unnamed_addr #1 attributes #0 = { mustprogress nofree nounwind willreturn allockind("alloc,uninitialized") allocsize(0) memory(inaccessiblemem: readwrite) "alloc-family"="malloc" "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" } attributes #1 = { mustprogress nounwind willreturn allockind("free") memory(argmem: readwrite, inaccessiblemem: readwrite) "alloc-family"="malloc" "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" } Comment Actions Good point. I think the malloc case is good motivation to change gep inbounds 0 semantics, so I'll look into that. For the record, I've committed the test with alloca and malloc variants with https://github.com/llvm/llvm-project/commit/b24b6c4a32e159ece79583b3aaf5887ea44c4ebf. Comment Actions The patch stack starting at https://reviews.llvm.org/D154051 will fix the malloc case. Comment Actions I've also found that diff --git a/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp b/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp index 40475d9563b2..dee3a1308405 100644 --- a/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp +++ b/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp @@ -2064,6 +2064,8 @@ PreservedAnalyses IndVarSimplifyPass::run(Loop &L, LoopAnalysisManager &AM, if (!IVS.run(&L)) return PreservedAnalyses::all(); + AR.SE.forgetLoop(&L); + auto PA = getLoopPassPreservedAnalyses(); PA.preserveSet<CFGAnalyses>(); if (AR.MSSA) fixes the issue and allows loop-unroll-full to see the proper trip count. I'm not familiar with SCEV caching, is this a stale analysis issue? |
This bit definitely makes sense to keep branches for IPSCCP -- maybe after this, we can even swap the SimplifyCFG and SROA order, as the current one only exists so that SimplifyCFG doesn't do too much.