By following the single predecessors of the predecessors of the call
site, we do not need to restrict the control flow.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
I believe we need to check the code size increase with this because it will be applied widely and clone call-sites which could be inlined.
lib/Transforms/Scalar/CallSiteSplitting.cpp | ||
---|---|---|
335 ↗ | (On Diff #125142) | We should bail out when Preds[0] == Preds[1]. For below IR , we should not split CS. define i32 @test(i32* %a, i32 %v, i32 %p) { TBB: %cmp = icmp eq i32* %a, null br i1 %cmp, label %Tail, label %Tail Tail: %r = call i32 @callee(i32* %a, i32 %v, i32 %p) ret i32 %r } |
test/Transforms/CallSiteSplitting/callsite-split-or-phi.ll | ||
169 ↗ | (On Diff #125142) | We need more test cases because this change will hit even very simple case like : define i32 @test(i32* %a, i32 %v, i32 %p) { Header: br i1 undef, label %Tail, label %End TBB: %cmp = icmp eq i32* %a, null br i1 %cmp, label %Tail, label %End Tail: %r = call i32 @callee(i32* %a, i32 %v, i32 %p) ret i32 %r End: ret i32 %v } |
Added testcases as suggested. I've also moved the or structure related tests to a separate file and the test case that should not be split to another file to avoid having test files that are too big. I think it would also make sense to move tests we do not expect to be split to that file.
I've run the LNT test suite, SPEC2006 and SPEC2000. There were no noticeable changes in code size (all changes < 0.5% on AArch64)
@junbuml I'd appreciate another look, please let me know if you are happy with the test changes.
I've run the LNT test suite, SPEC2006 and SPEC2000. There were no noticeable changes in code size (all changes < 0.5% on AArch64)
Thanks Florian for the update. Can you please also share if you see any + or - in performance? I will run performance tests in my side as well. It seems that the comment in the top of CallSiteSplitting.cpp should be updated accordingly.
test/Transforms/CallSiteSplitting/callsite-no-or-structure.ll | ||
---|---|---|
136 ↗ | (On Diff #127494) | It will be good to add a test case where Tail form a loop like : define i32 @test_loop(i32* %a, i32 %v, i32 %p) { TBB: %cmp = icmp eq i32* %a, null br i1 %cmp, label %Tail, label %End Tail: %r = call i32 @callee(i32* %a, i32 %v, i32 %p) br i1 undef, label %Tail, label %End End: ret i32 %v } |
I didn't see any significant performance changes in my spec2000/2006/2017 runs on AArch64 .
I don't see any signification changes either. It's more a step towards potentially making the pass more general, as @davidxl suggested a while ago in a comment.
For Cortex-A57, I get -0.23% on the geomean of SPEC2006 exec times, for SPEC2000 + test suite it's -0.38%
I've committed this after updating the top comment, as well as dropped to Or from tryToSplitOnOrPredicatedArgument and updated the comment there.
I hope the comments are clear.