This is an archive of the discontinued LLVM Phabricator instance.

[PhaseOrdering] Don't speculate blocks in simplifycfg before jump-threading
Needs ReviewPublic

Authored by aeubanks on Jun 20 2023, 6:24 PM.

Details

Reviewers
nikic
asbirlea
Summary

Speculating blocks can prevent jump threading from triggering.

Fixes some phase ordering issues affecting Rust uncovered by D145265.

Diff Detail

Event Timeline

aeubanks created this revision.Jun 20 2023, 6:24 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 20 2023, 6:24 PM
aeubanks requested review of this revision.Jun 20 2023, 6:24 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 20 2023, 6:24 PM

fixes both regressions reported in D145265

nikic added a comment.Jun 21 2023, 1:38 AM

Can you please explain what the JumpThreading interaction is in more detail? This looks plausible to me, but I want to make sure no change is needed in JumpThreading (which is supposed to be able to thread over selects as well, not just branches).

llvm/lib/Passes/PassBuilderPipelines.cpp
1017

This bit definitely makes sense to keep branches for IPSCCP -- maybe after this, we can even swap the SimplifyCFG and SROA order, as the current one only exists so that SimplifyCFG doesn't do too much.

Currently simplifycfg will optimize zot to the following

define internal fastcc i32 @zot(ptr %arg, ptr %arg1) unnamed_addr {
bb:
  br label %bb2

bb2:                                              ; preds = %bb11, %bb
  %phi = phi ptr [ %arg, %bb ], [ %spec.select, %bb11 ]
  %phi3 = phi i32 [ 0, %bb ], [ %add, %bb11 ]
  %icmp4 = icmp eq ptr %phi, %arg1
  %getelementptr = getelementptr inbounds i32, ptr %phi, i64 1
  %spec.select = select i1 %icmp4, ptr %phi, ptr %getelementptr
  %spec.select1 = select i1 %icmp4, ptr null, ptr %phi
  %icmp10 = icmp eq ptr %spec.select1, null
  br i1 %icmp10, label %bb12, label %bb11

bb11:                                             ; preds = %bb2
  %load = load i32, ptr %spec.select1, align 4
  %add = add i32 %load, %phi3
  br label %bb2

bb12:                                             ; preds = %bb2
  ret i32 %phi3
}

I'm not sure how jump-threading would work for this IR.

Simplifycfg's speculation prevents some instructions from getting licm'd and jump-threading fixes those cases.

nikic added a comment.Jun 23 2023, 5:18 AM

I've looked into this, and concluded that jump threading isn't performing any essential optimization here.

The real problem is that we end up losing an inbounds on the GEP due to our questionable "gep 0 is not always inbounds" semantics. But in this case, we can still recover from that by making SCEV smarter. I've made a few changes (https://github.com/llvm/llvm-project/commit/7cf567d46121f4aa8f659554b5e8584cd0fac056, https://github.com/llvm/llvm-project/commit/406e9c93726ed929ca09ffed2fd3a60cfd633f4b, https://reviews.llvm.org/D153624) that will allow us to optimize your example with the current pipeline.

Not to say that this change isn't also reasonable, but I think it needs a different motivating example.

thanks for those patches, they do fix this case, but they don't fix the case with malloc. e.g. the following loop doesn't get unrolled until the very last loop-unroll pass

define i32 @f() local_unnamed_addr {
bb:
  %alloc = tail call dereferenceable_or_null(64) ptr @malloc(i64 64)
  store i32 1, ptr %alloc, align 4
  %getelementptr = getelementptr i32, ptr %alloc, i64 1
  store i32 2, ptr %getelementptr, align 4
  %getelementptr1 = getelementptr i32, ptr %alloc, i64 2
  store i32 3, ptr %getelementptr1, align 4
  %getelementptr2 = getelementptr i32, ptr %alloc, i64 3
  br label %bb11.i

bb11.i:                                           ; preds = %bb, %bb11.i
  %phi37.i = phi i32 [ %add.i, %bb11.i ], [ 0, %bb ]
  %phi6.i = phi ptr [ %spec.select.i, %bb11.i ], [ %alloc, %bb ]
  %spec.select.i = getelementptr i32, ptr %phi6.i, i64 1
  %load.i = load i32, ptr %phi6.i, align 4
  %add.i = add i32 %load.i, %phi37.i
  %icmp4.i = icmp ne ptr %spec.select.i, %getelementptr2
  %icmp102.i = icmp ne ptr %spec.select.i, null
  %icmp10.not.i = and i1 %icmp102.i, %icmp4.i
  br i1 %icmp10.not.i, label %bb11.i, label %zot.exit

zot.exit:                                         ; preds = %bb11.i
  tail call void @free(ptr %alloc)
  ret i32 %add.i
}

; Function Attrs: mustprogress nofree nounwind willreturn allockind("alloc,uninitialized") allocsize(0) memory(inaccessiblemem: readwrite)
declare noalias noundef ptr @malloc(i64 noundef) local_unnamed_addr #0

; Function Attrs: mustprogress nounwind willreturn allockind("free") memory(argmem: readwrite, inaccessiblemem: readwrite)
declare void @free(ptr allocptr nocapture noundef) local_unnamed_addr #1

attributes #0 = { mustprogress nofree nounwind willreturn allockind("alloc,uninitialized") allocsize(0) memory(inaccessiblemem: readwrite) "alloc-family"="malloc" "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
attributes #1 = { mustprogress nounwind willreturn allockind("free") memory(argmem: readwrite, inaccessiblemem: readwrite) "alloc-family"="malloc" "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
nikic added a comment.Jun 29 2023, 1:22 AM

Good point. I think the malloc case is good motivation to change gep inbounds 0 semantics, so I'll look into that.

For the record, I've committed the test with alloca and malloc variants with https://github.com/llvm/llvm-project/commit/b24b6c4a32e159ece79583b3aaf5887ea44c4ebf.

nikic added a comment.Jun 29 2023, 2:17 AM

The patch stack starting at https://reviews.llvm.org/D154051 will fix the malloc case.

I've also found that

diff --git a/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp b/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
index 40475d9563b2..dee3a1308405 100644
--- a/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
+++ b/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
@@ -2064,6 +2064,8 @@ PreservedAnalyses IndVarSimplifyPass::run(Loop &L, LoopAnalysisManager &AM,
   if (!IVS.run(&L))
     return PreservedAnalyses::all();
 
+  AR.SE.forgetLoop(&L);
+
   auto PA = getLoopPassPreservedAnalyses();
   PA.preserveSet<CFGAnalyses>();
   if (AR.MSSA)

fixes the issue and allows loop-unroll-full to see the proper trip count. I'm not familiar with SCEV caching, is this a stale analysis issue?