This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] reorder LSR and PPCCTRLoops pass
AbandonedPublic

Authored by shchenz on Jun 4 2019, 2:29 AM.

Details

Summary

Currently pass LSR runs before PPCCTRLoops, we meet some problem with this order because of the icmp inside loop which compares iteration indexed and loop trip count. In pass LSR, we always treat that icmp as a valid ICmpZero type LSRUse. But this is not true because in later pass PPCCTRLoops, we may replace this icmp with ctrloop instruction bdnz. So we may get suboptimal code based on this order.

Reordering LSR and PPCCTRLoops makes LSR know precisely whether the icmp replaced or kept in PPCCTRLoops

Get improvement for most benchmarks of SPEC CPU2017 on Power9. Biggest gain is xz, about 3.5%. No degradation found.

Diff Detail

Event Timeline

shchenz created this revision.Jun 4 2019, 2:29 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 4 2019, 2:29 AM

I'm really uneasy about this change. PPCCTRLoops inserts special loop intrinsics which the rest of the loop-optimization pipeline doesn't understand, and so logically, it should come last. It's pretty clear that LSR is interfering with PPCCTRLoops, is this because it transforms the loops into some form that SCEV doesn't understand? Could we adjust LSR to make it more friendly to hardware-assisted loops? In theory, if LSR had the right cost model, it could actually make *more* loops that PPCCTRLoops could handle, and not fewer?

Also, please make sure that you look at D62604.

shchenz added a comment.EditedJun 5 2019, 2:17 AM

I'm really uneasy about this change. PPCCTRLoops inserts special loop intrinsics which the rest of the loop-optimization pipeline doesn't understand, and so logically, it should come last. It's pretty clear that LSR is interfering with PPCCTRLoops, is this because it transforms the loops into some form that SCEV doesn't understand? Could we adjust LSR to make it more friendly to hardware-assisted loops? In theory, if LSR had the right cost model, it could actually make *more* loops that PPCCTRLoops could handle, and not fewer?

Thanks for your comment Hal @hfinkel . I create this patch mostly because of the suboptimal loop instructions in some testcases. Making more hardware loops is not this patch's original intention. Seems reorder these two passes can generate more hardware loop. One suboptimal loop instructions example is in file test/CodeGen/PowerPC/lsr-ctrloop.ll, we see that there is one redundant instruction add 5, 3, 4 inside the loop. After investigate this issue, we found this is caused by ICmpZero LSRUse. After LSR gathers all formulaes for all LSRUse, for the case, all stores inside the loop is a user for both addrec reg({0,+,384} and reg({(192 + %0),+,384}), but ICmpZero LSRUse only has use for addrec reg({0,+,384}), so when LSR try to narrow down the search space, it selects addrec reg({0,+,384}) as a winner since its usage number is bigger than reg({(192 + %0),+,384}). But the issue is: by using reg({(192 + %0),+,384}), we can get better code in LSR. You can see the difference in the testcase.

Yes, I tried to add a target hook function in PPCTargetTransformationInfo class, let's say canConvertToHardwareLoop. This function should be called in LSR through TTI and in PPCCTRLoops to check if a loop can be converted to hardware loop. And it should come from current condition check code in PPCCTRLoops. But there is some disadvantage about this solution:
1: The condition check code in PPCCTRLoops pass uses many Analysis Pass result, like LoopInfo, SCEV, DominatorTree, AssumptionCache, DataLayout, TargetLibraryInfo, TargetTransformInfo and so on, we need to pass all these infos from LSR/PPCCTRLoops to the target hook funciton in PPCTargetTransformationInfo. When I implemented like this way, I am a little worried this maybe a little 'heavy' for a target hook compared with other exist functions in PPCTargetTransformationInfo? Seems now https://reviews.llvm.org/D62604 implements like this, but at that time when I did this, it is not so straightforward for me.
2: Most important reason is, all above Analysis Pass results may be different at the point LSR calling canConvertToHardwareLoop and the point PPCCTRLoops calling canConvertToHardwareLoop , so it is possible that canConvertToHardwareLoop returns different value to LSR and PPCCTRLoops, then the ICmpZero is still not accurate and LSR still generates suboptimal instructions.

For the loop intrinsics generated before LSR, as I can see, LSR can still recognize it as a loop and I saw PPCLoopPreIncPrep can also transform the loop with intrinsics to a pre-inc loop when I tested it. So I guess LoopInfoWrapperPass can still recognize it as a Loop? Maybe you mean the loop trip count can not be got by the loop since it is now in loop preheader by calling the intrinsics? I am not sure about the impact since some impartant loop passes depending on loop trip count like loop unroll, loop vectorize are all before LSR and PPCCTRLoops. It would be great if you can give more detail about the impact?

shchenz abandoned this revision.Jun 18 2019, 1:43 AM

Abandon this patch.

Change to another solution in https://reviews.llvm.org/D63477