The partially invariant unswitch has been implemented on LoopUnswitch. https://reviews.llvm.org/D93764 We need to port the feature to SimpleLoopUnswitch for new pass manager. It is related to https://bugs.llvm.org/show_bug.cgi?id=49128
Details
Diff Detail
Event Timeline
llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp | ||
---|---|---|
2027–2029 | Sounds good, looking forward to a follow-up! | |
2053 | message in assert needs updating? | |
2735–2749 | enough to capture [this]? | |
2744 | Why TinyPtrVector here? In most case we will have at least 2 instructions here anyways? Probably better to use a SmallVector? | |
2747 | emplace_back? |
Yep, let me try to add the tests.
llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp | ||
---|---|---|
210 | Yep, I will update it. | |
2027–2029 | Yep | |
2053 | Yep, I will update it. | |
2735–2749 | This code is inside static function rather than class's member function so we can not use [this]. I will update it with [&L]. | |
2744 | The SimpleUnswitchPass uses TinyPtrVector for UnswitchCandidates. In order to follow the interface, TinyPtrVector is being used here. | |
2747 | TinyPtrVector does not have emplace_back. | |
2947–2948 | Yep, I will undo it. |
LGTM, thanks! A few small comments that can be addressed before committing without further review I think.
llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp | ||
---|---|---|
3151 | nit: perhaps unswitched using a partially invariant condition, ...? | |
llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch.ll | ||
1104 | Could you add a brief comment and a more descriptive name for the test? | |
1143 | It would be great if you could use descriptive labels in the test both for basic blocks and values, to make it easier to read in case people need to take a look in the future. | |
1150 | please avoid using null as pointer, as this UB |
Thanks @fhahn! After updating code following your comments, I will push this change.
llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp | ||
---|---|---|
3151 | Yep, I will update it. | |
llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch.ll | ||
1104 | Yep, I will add them. | |
1143 | Yep, I will update it. | |
1150 | Yep, I will update it. |
It looks like this patch causes SimpleLoopUnswitch to not terminate on some inputs, see https://bugs.llvm.org/show_bug.cgi?id=50302 for an example
Thanks. If it's not straight-forward to resolve, it would be best to revert the patch for now.
@fhahn Thanks for kind suggestion. I have figured out what causes the endless compiling. It looks we need to move partially invariant instructions rather than duplicating them. Once I resolve this issue, I need to check the score of benchmarks again. Therefore, I would post a patch for review tomorrow or the day after tomorrow.
Great, thanks! It sounds like it would make sense to revert until this is resolved than, so there's no need to rush :)
Yep, I will revert it now. Additionally, I have seen SimpleLoopUnswitch with this patch does not affect performance number of omnetpp in SPEC2017. I need to figure out it too. Once I fixed these issues, I will let you know.
The below bugs are fixed.
https://bugs.llvm.org/show_bug.cgi?id=50279
https://bugs.llvm.org/show_bug.cgi?id=50302
Even if this patch does not add the processed loop to loop pass manager again, the later loop transformations create loop with the header, which has partially invariant instructions, and add it to loop pass manager again. The loop pass manager runs SimpleLoopUnswitchPass with the loop again and it sometimes causes endless unswitch with some CFGs.
In order to avoid the endless unswitch, the new change removes the partially invariant instruction from header as below.
// +--------------------+ // | preheader | // | %a = load %ptr | // +--------------------+ // | /----------------\ // +--------v----v------+ | // | header |---\ | // | %c = phi %a, %b | | | // | %cond = cmp %c, .. | | | // | br %cond | | | // +--------------------+ | | // | | | // +--------v-----------+ | | // | store %ptr | | | // +--------------------+ | | // | | | // +--------v-----------<---/ | // | latch >----------/ // | %b = load %ptr | // +--------------------+
@fhahn I have updated the diff which fixes the bugs. I have added phi node to header in order to remove the partially invariant instructions from header. If you feel something wrong from it, please let me know.
For the performance impact on the omnetpp of spec2017, it looks the different inlining is being happened on the benchmark between legacy and new pass manager and it causes fewer chances for unswitching loop. Maybe, I could try to figure out what it causes the different inlining between legacy and new pass manager later.
@fhahn Can we push this patch again please? I have checked the benchmarks and the scores are fine.
Hm, this seems quite fragile, as another pass may decide to move the load to the header again. In the legacy pass manager, the loop is annotated with metadata to avoid partially unswithcing the same loop multiple times (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/LoopUnswitch.cpp#L849) Can we do the same here?
llvm/test/Transforms/SimpleLoopUnswitch/endless-unswitch.ll | ||
---|---|---|
1 ↗ | (On Diff #345815) | We should avoid adding tests using -O3 here. Can we reproduce the issue by running simple-loop-unswtich twice? Also it would be good to clean up the test a bit. |
um... I thought the load would not be moved to header later because it is loop invariant and we usually try to move it to outside loop.
In the legacy pass manager, the loop is annotated with metadata to avoid partially unswithcing the same loop multiple times (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/LoopUnswitch.cpp#L849) Can we do the same here?
Yep, I have seen the metadata. I was just not sure whether people will accept it or not. Let me try to add it.
llvm/test/Transforms/SimpleLoopUnswitch/endless-unswitch.ll | ||
---|---|---|
1 ↗ | (On Diff #345815) | um... in order to reproduce the endless unswitch, it needs other passes. Let me try to reduce the test. |
Following comments of @fhahn, updated code.
- Added metadata to avoid endless unswitch
- Reduced test case
@fhahn I have followed the LoopUnSwitch pass's metadata and it looks ok with benchmark scores. If you feel something wrong from the change, please let me know.
Thanks for the update! looks good, as this in line with the legacy loop-unswitch implementation. I think the last piece of work remaining is to update the tests to check that the metadata is emitted? Also, could you add a test with an partially unswitchable loop, that already has llvm.loop.unswitch.partial.disable and ensure it is not unswitched?
Following comment of @fhahn, added tests for metadata with llvm.loop.unswitch.partial.disable.
llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch-check-metadata.ll | ||
---|---|---|
9 ↗ | (On Diff #348957) | Can this test just be part of llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch.ll? |
llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch-generate-metadata.ll | ||
8 ↗ | (On Diff #348957) | can you also include checks for the generated IR, ensuring that the whole metadata chain is correct (!llvm.loop attached to the right unswitched loop and contains the disable metadata? Also, can this just be part of llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch.ll? I think we can just extend the checks there, rather than adding a new file + function. |
Thanks @fhahn! After updating tests, I will push it.
llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch-check-metadata.ll | ||
---|---|---|
9 ↗ | (On Diff #348957) | Yep, I will add the metadata check in llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch.ll. |
llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch-generate-metadata.ll | ||
8 ↗ | (On Diff #348957) | Yep, I will add the metadata check in llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch.ll. |
Great to see progress on this :-)
I don't see the speedup on omnetpp that we saw before on SystemZ. Do I need to pass some option or adjust some target threshold value somewhere?
@jonpa Thanks for checking the performance number on SystemZ.
I was able to see the speed up on omnetpp of SPEC2006 but there was no speedup on omentpp of SPEC2017. I guess you are looking SPEC2017 one.
It seemed there is different inlining between new pass manager and legacy pass manager and it causes to reduce partial loop unswitch opportunities on omnetpp of SPEC2017.
I am trying to get the speedup with the partial loop unswitch on omnetpp of SPEC2017 again. Once I fix it, I will let you know.
yes
It seemed there is different inlining between new pass manager and legacy pass manager and it causes to reduce partial loop unswitch opportunities on omnetpp of SPEC2017.
I am trying to get the speedup with the partial loop unswitch on omnetpp of SPEC2017 again. Once I fix it, I will let you know.
awesome!
Thanks for reminding me! @fhahn
Yep, I have checked those bug are fixed. I will close them.
@jaykang10 it looks like this patch regressed codegen in some cases: https://bugs.llvm.org/show_bug.cgi?id=51139. It would be great if you could take a look.
As I mentioned on https://bugs.llvm.org/show_bug.cgi?id=51141, I am not sure the unswitch pass has to detect the cases which the loop load pre, sccp or other passes can optimize and the pass do not unswitch the loop...
I am creating a patch to fix https://bugs.llvm.org/show_bug.cgi?id=51141 first. The patch does not fix https://bugs.llvm.org/show_bug.cgi?id=51139. Once I create it, let me add you as reviewer.
For https://bugs.llvm.org/show_bug.cgi?id=51139, I have added below comment on it.
In this case, the inliner pass fails to inline the function g after unswitch because of the cost as below debug message. NOT Inlining (cost=250, threshold=250), Call: call void @g(i32 %2) #3 Originally, after inlining function g, the JumpThreading pass made the block with call @foo dead and SimplifyCFG pass deleted it. If you add `always_inline` attribute to the function g's prototype as below, you can see the callfoo is gone. void g(int h) __attribute__((always_inline));
In this case, as I mentioned previously, I am not sure the unswitch pass has to check the inline cost or something like that... If someone has idea about it, please let me know.
The SimpleUnswitchPass checks the cost of duplicating the loop so we could adjust it. Let me try to adjust the cost calculation for partially invariant unswitch.
@jaykang10, can you share the key loops that contribute to the speedup you observed from partial unswitch in spec2k6 omnetpp by any chance? Thanks.