This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Limit building time for CodeGenPrepare
ClosedPublic

Authored by xiangzhangllvm on Jul 8 2022, 1:00 AM.

Details

Summary

Currently, CodeGenPrepare is very time-consuming when handling big functions.

Old Algorithm:
It iterates over each BB in the function and then handles every instruction in the BB.
Because some instruction optimizations may affect the BBs' dominator tree,
the old logic will re-iterate and try to optimize every BB again.

Suppose we have a big function with 20000 BBs. If handling the last BB
fine-tunes the dominator tree, we need to completely re-iterate and re-optimize
all 20000 BBs from the beginning.

The complexity is nearly N!

And we have really encountered some big tests (> 20000 BBs) that cost more than 30 minutes in this pass.
(A debug-build compiler costs 2 hours here.)

What does this patch do for huge functions?
It mainly changes the iteration scheme of the optimization.
1. First, we do optimizeBlock for each BB (the same as the old way).
In the meantime, if a BB is changed/updated during the optimization, it is put into FreshBBs (to try optimizeBlock again).
BBs newly created in the previous iteration are also put into FreshBBs.

2. BBs that were not updated in the previous iteration are skipped directly.

Strictly speaking, this may miss some optimization opportunities, but I think the probability is very small.
(From my local performance tests on SPEC and some other perf tests, there is no harm to performance.)

3. For the instructions in a single BB, we do optimizeInst for each instruction.
If optimizeInst changes the instruction dominance within this BB, rather than breaking out and going back to re-optimize from the first BB (the old way),
we directly iterate over the instructions of this updated BB again (the new way).

What does this patch do for small/normal (not huge) functions?
It behaves the same as the old algorithm. (NFC)
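The huge-function flow described above can be sketched outside LLVM. In this model, `Block`, `optimizeBlock`, and `runHugeFuncCGP` are hypothetical stand-ins (a block just counts how many rounds of optimization it still benefits from); only the FreshBBs skip logic mirrors the patch.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-in for an LLVM BasicBlock: each "block" counts how
// many optimization rounds would still change it.
struct Block {
  int pendingOpts = 0; // > 0 means optimizeBlock() would still change it
};

// Returns true if the block was changed (mirrors optimizeBlock's contract).
static bool optimizeBlock(Block &BB) {
  if (BB.pendingOpts == 0)
    return false;
  --BB.pendingOpts;
  return true;
}

// Sketch of the huge-function iteration: visit every block once, then only
// re-visit blocks that were changed in the previous round (FreshBBs).
// Returns the total number of optimizeBlock calls made.
static int runHugeFuncCGP(std::vector<Block> &F) {
  std::set<Block *> FreshBBs;
  int Calls = 0;
  bool FuncIterated = false;
  bool MadeChange = true;
  while (MadeChange) {
    MadeChange = false;
    std::set<Block *> NextFresh;
    for (Block &BB : F) {
      // After the first full pass, skip blocks that were not updated.
      if (FuncIterated && !FreshBBs.count(&BB))
        continue;
      ++Calls;
      if (optimizeBlock(BB)) {
        NextFresh.insert(&BB);
        MadeChange = true;
      }
    }
    FuncIterated = true;
    FreshBBs = NextFresh;
  }
  return Calls;
}
```

With three blocks where only the middle one needs two rounds of work, this visits 5 blocks in total, whereas the old restart-from-scratch scheme would visit all blocks in every round.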

Diff Detail

Event Timeline

xiangzhangllvm created this revision.Jul 8 2022, 1:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 8 2022, 1:00 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
xiangzhangllvm requested review of this revision.Jul 8 2022, 1:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 8 2022, 1:00 AM
yubing added a subscriber: yubing.EditedJul 11 2022, 12:49 AM

optimizeGatherScatterInst is a mandatory procedure in which CodeGenPrepare converts GEP (splat base ptr) into GEP (scalar base ptr), since SelectionDAGBuilder only handles GEP (scalar base ptr). Please take a look at optimizeGatherScatterInst's comments.

Thanks a lot! But this has already been converted the first time around. We just skip BBs that have already been optimized.

nikic added a subscriber: nikic.Jul 11 2022, 2:16 AM
nikic added inline comments.
llvm/lib/CodeGen/CodeGenPrepare.cpp
583–584

Before introducing limits, can we try to remove this case? That is, continue simplifying the rest of the blocks before we start a new iteration? I expect that the current implementation looks like this to avoid some iterator invalidation issues, but there's probably a way to avoid that with a bit more effort.

yubing added a comment.EditedJul 11 2022, 6:41 AM

If the GEP and the gather/scatter are not in the same BB, we don't optimize the GEP in the first round. When, in the second round, the GEP and the gather/scatter end up in the same BB, your code will prevent the GEP's optimization from happening.

// If the GEP and the gather/scatter aren't in the same BB, don't optimize.
// FIXME: We should support this by sinking the GEP.
if (MemoryInst->getParent() != GEP->getParent())
  return false;

Let me try to find a test case. (I enable BuildTimeLimit by default locally and haven't met any compile-failure test case yet.)
I think we should go further and find out whether "the GEP's optimization is mandatory" makes sense or not.

llvm/lib/CodeGen/CodeGenPrepare.cpp
583–584

Thanks a lot for reviewing this patch! But if the CFG changed, we should restart the iteration, or the iteration may have problems.

The GEP optimization isn't mandatory; skipping it will only generate worse code. CodeGenPrepare doesn't run at -O0, so nothing in it can be mandatory.

xiangzhangllvm added a subscriber: david-arm.EditedJul 11 2022, 8:43 PM

I didn't encounter gather/scatter build failures.

I did a local experiment:
even when I completely disable all of optimizeBlock, I hit only one build failure in ISel, for test/CodeGen/Thumb2/mve-pred-vselect.ll
(affected by optimizeSelectInst).

-  bool MadeChange = true;
+  bool MadeChange = false;     // let it do no optimization at all
  while (MadeChange) {
    MadeChange = false;
    DT.reset();
    for (BasicBlock &BB : llvm::make_early_inc_range(F)) {
      bool ModifiedDTOnIteration = false;
      MadeChange |= optimizeBlock(BB, ModifiedDTOnIteration);

I am not familiar with Thumb2's ISel, but I don't think it makes sense for its correctness to depend on such an optimization.
Pinging the test contributor @david-arm.
(Anyway, this patch doesn't affect it.)

xiangzhangllvm added a comment.EditedJul 11 2022, 10:12 PM

The GEP optimization isn't mandatory. It will only generate worse code. CodeGenPrepare doesn't run at -O0 so nothing in it can be mandatory.

Yes, and with this patch, I think it may not be worse: only a few instruction optimizations affect (just fine-tune) the CFG, and once a BB has been optimized, it has little chance for further optimization. I tested SPEC performance; there is almost no effect.
If we don't consider limiting the build time, it badly hurts JIT compilers. (In fact, it is our DPC++ users who complained about this compile time.)

xiangzhangllvm edited the summary of this revision. (Show Details)Jul 11 2022, 10:16 PM
nikic added inline comments.Jul 12 2022, 1:23 AM
llvm/lib/CodeGen/CodeGenPrepare.cpp
583–584

This is what we currently do, but the current implementation has a lot of room for improvement. For example, combineToUAddWithOverflow() will set ModifiedDT even though it does not modify control flow at all! It only does it to avoid iterating over a dead instruction in the same block. This can be solved by either only restarting the iteration in the BB (rather than the whole function), or by passing in the iterator to optimizeInst() so it can be moved in such case, or similar.

For cases that actually do modify the CFG, we could use DTU instead of full DT recomputation, and we could maybe switch to a BB worklist, so new blocks can be queued, and deleted blocks removed from the worklist.

Nobody has bothered to do this yet, because it never came up as a problem. Now that it did, things should get fixed properly.

xiangzhangllvm added inline comments.Jul 12 2022, 1:46 AM
llvm/lib/CodeGen/CodeGenPrepare.cpp
583–584

and we could maybe switch to a BB worklist, so new blocks can be queued, and deleted blocks removed from the worklist.

Oh, that is a very good suggestion!
Let's first break the N! algorithmic complexity.

For cases that actually do modify the CFG, we could use DTU instead of full DT recomputation,

According to my observation, the DT recomputation doesn't consume too much time. Anyway, we can refine it if we encounter such a case one day.

Thank you so much!

I am implementing the refinement, but it's more complicated than we thought.

The biggest trouble is that
optimizeInst can affect other BBs even when it does not change the dominator tree.

So, in addition to collecting newly created BBs and erasing deleted BBs (for the worklist),
we still need to track each optimization to check whether it affects other BBs, and collect those BBs (into the worklist) to re-optimize them.

xiangzhangllvm updated this revision to Diff 446014.EditedJul 19 2022, 7:02 PM

Refine:
1. Add a SetVector FreshBBs to track the BBs that need to be optimized.
2. Add enum ModifyDT { NotModifyDT, ModifyBBDT, ModifyInstDT } to split out the iteration over the instructions in a single BB.
3. Modify some optimizations to re-optimize updated BBs.

(All lit tests pass; my local big case still works well: 30 mins --> 5 mins)
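The ModifyDT split can be sketched as follows. The enum names come from the patch summary (with the doc comments requested later in review); `optimizeInst`, `optimizeBlockInsts`, and their counting behavior are hypothetical stand-ins showing only the control flow: an instruction-level DT change restarts the scan of the current block instead of the whole function.

```cpp
#include <cassert>
#include <vector>

// Enum from the patch summary, with illustrative doc comments.
enum class ModifyDT {
  NotModifyDT, // The optimization changed neither the DT nor the CFG.
  ModifyBBDT,  // The optimization changed the CFG (block-level DT update).
  ModifyInstDT // The optimization only changed dominance inside one block.
};

// Hypothetical per-instruction optimizer: consumes one pending change per call.
static ModifyDT optimizeInst(int &PendingInstChanges) {
  if (PendingInstChanges > 0) {
    --PendingInstChanges;
    return ModifyDT::ModifyInstDT;
  }
  return ModifyDT::NotModifyDT;
}

// Sketch of the per-block loop: an instruction-level DT change restarts the
// scan of this block only, instead of restarting the whole function.
// Returns how many scans of the block were needed.
static int optimizeBlockInsts(std::vector<int> Insts) {
  int Scans = 0;
  bool Restart = true;
  while (Restart) {
    Restart = false;
    ++Scans;
    for (int &I : Insts) {
      if (optimizeInst(I) == ModifyDT::ModifyInstDT) {
        Restart = true; // re-iterate this block, not the whole function
        break;
      }
    }
  }
  return Scans;
}
```

A block whose instructions trigger two changes is rescanned locally (three scans of one block), while under the old scheme each change would have restarted iteration from the first block of the function.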

RKSimon added inline comments.Jul 22 2022, 3:01 AM
llvm/lib/CodeGen/CodeGenPrepare.cpp
281

Please can you add doc comments to describe the enum values?

341

comment

Address Simon's comments. Thank you for reviewing!

xiangzhangllvm marked 2 inline comments as done.Jul 24 2022, 8:19 PM
xiangzhangllvm edited the summary of this revision. (Show Details)Jul 25 2022, 6:00 PM
xiangzhangllvm added inline comments.Jul 26 2022, 6:09 PM
llvm/lib/CodeGen/CodeGenPrepare.cpp
1033

This hurts performance, because it can't guarantee the BBs' optimization order.
Let me update this.

xiangzhangllvm edited the summary of this revision. (Show Details)

The order of inserting updated BBs may be chaotic, which hurts performance.
So this update keeps the BB iteration in order.
(In my local testing, it still works well in reducing build time for big/huge test cases.)

Hello @nikic, your suggestions are good; I have updated the patch.
Could you help review it? Many thanks!

In the latest update, I removed the worklist, because I found it hard to control the order of updated BBs when pushing them into the worklist,
and this caused a performance regression locally. So I updated it to check the set FreshBBs instead.
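The ordering concern maps onto LLVM's SetVector, which the patch uses for FreshBBs: insertion is deduplicated but iteration order stays deterministic. A minimal stand-in (MiniSetVector is a hypothetical name; the real llvm::SetVector is templated and more capable) shows the two properties that matter here:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Minimal stand-in for llvm::SetVector<int>: deduplicated insertion that
// still preserves a deterministic (insertion) order.
struct MiniSetVector {
  std::vector<int> Order; // deterministic iteration order
  std::set<int> Seen;     // fast membership test

  // Returns true if the value was newly inserted, false on a double push.
  bool insert(int V) {
    if (!Seen.insert(V).second)
      return false; // already present: no double push
    Order.push_back(V);
    return true;
  }
};
```

Deduplication answers the "avoid double pushing BB" review concern, and the stable order is what keeps BB processing deterministic (and avoids the performance regression seen with an unordered worklist).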

xiangzhangllvm added a comment.EditedAug 21 2022, 6:04 PM

Hi reviewers, I think you have concerns about this new logic.
Compared with the old logic, it really may, in theory, miss some optimization opportunities.
But build time is still a big problem now.
So I plan to make this new logic apply only to very large cases.
If you still have any concerns or good ideas, please let me know, thanks!

xiangzhangllvm edited the summary of this revision. (Show Details)

Split the logic into small/normal functions (NFC) and huge functions (considering build time).

xiangzhangllvm edited the summary of this revision. (Show Details)
LuoYuanke added inline comments.Aug 28 2022, 6:18 PM
llvm/lib/CodeGen/CodeGenPrepare.cpp
599

The BB sunk into is also changed, and there may be several such BBs. Do we record all the sunk-into BBs in FreshBBs?

603

indent.

xiangzhangllvm added inline comments.Aug 29 2022, 5:52 PM
llvm/lib/CodeGen/CodeGenPrepare.cpp
599

Yes, all updated BBs will be put into FreshBBs. Most of them are inserted in the updated replaceAllUsesWith.
BBs with no updates are optimized only once.

603

Good catch!

Will do a clang-format. (I added an independent commit to clang-format this whole file at a808ac2e42a94f4e440ebd21fd8b759dae0b6a05.)

xiangzhangllvm marked an inline comment as done.Aug 30 2022, 2:01 AM
LuoYuanke added inline comments.Aug 30 2022, 2:38 AM
llvm/lib/CodeGen/CodeGenPrepare.cpp
263

Please add test cases for the patch with HugeFuncThresholdInCGPP specified as a small number.

580–581

As @nikic suggested, is it better to have a worklist for the unvisited or changed BBs? But it seems we may need to avoid pushing a BB twice.

1029

OldI (old instruction)?

1057

How can we remind developers that FreshBBs should be updated by their new code?

2156–2158

Ditto.

2434

Why is this change needed? Isn't it covered by line 2431?

7105

The instruction is cloned. Why does the def of its operand need to be sunk?

7135

Should we insert the BB into FreshBBs?

8219

It could be buggy if the following instructions in the BB depend on the dominator tree.

xiangzhangllvm added inline comments.Aug 30 2022, 6:38 PM
llvm/lib/CodeGen/CodeGenPrepare.cpp
580–581

FreshBBs is that worklist.
For a huge function, FreshBBs collects all the changed BBs.
(All unvisited BBs are visited in the first iteration; then FuncIterated is set to true.)

1029

let me refine it, thanks.

1057

Let me add the requirement to FreshBBs' comment.

2434

The BB passed in may have no terminator, so BB->getTerminator() may be nullptr,
and dyn_cast cannot handle nullptr.

7105

When we copy an instruction (to a new place) for optimization, it means we have updated an instruction. The defs of this instruction's operands may then have a new opportunity to sink to the optimized new instruction.

7135

Strictly speaking, we should.
But this instruction has been erased, so all opportunities related to this "updated" instruction have disappeared.

8219

The break will re-iterate over such a BB.
Do you mean the BB may be erased?
I don't think optimizeInst can erase the BB.
If it splits the BB, it splits it into the BB + a new BB, and the iteration over such a BB still works.

LuoYuanke added inline comments.Aug 30 2022, 8:39 PM
llvm/lib/CodeGen/CodeGenPrepare.cpp
8219

I mean the dominator tree has changed but has not been updated yet. We can't invoke DT.dominates(...) before updating the dominator tree.

LuoYuanke added inline comments.Aug 30 2022, 8:41 PM
llvm/lib/CodeGen/CodeGenPrepare.cpp
7105

Here we just copy the instruction. We know it is sunk when it is erased at line 7135.

llvm/lib/CodeGen/CodeGenPrepare.cpp
263

Sure

7105

Let me try to give an example soon.

8219

Makes sense; let me reset the DT here, thanks!

Do we need the IsHugeFunc flag? Can't we use the worklist approach unconditionally? This is what we usually do. Limiting it to IsHugeFunc means that it received essentially zero testing.

I think IsHugeFunc can help avoid runtime performance regressions on real applications. We can specify IsHugeFunc as 0 in test cases and then shrink the default IsHugeFunc size after this patch. Finally, we can remove IsHugeFunc.

Is there any reason to expect this to materially affect runtime performance?

I think it is more conservative, and we can move forward by tuning the IsHugeFunc size step by step. Otherwise we may have to revert the whole patch if someone reports a regression long after the patch lands. Ideally, I agree with removing IsHugeFunc.

IMHO, if it causes issues we want to find out early and revert the whole patch, rather than just exposing these issues in very rare circumstances.

xiangzhangllvm added inline comments.Aug 31 2022, 2:39 AM
llvm/lib/CodeGen/CodeGenPrepare.cpp
7105

I found an ARM case about it; it's not about sinking, it's about copying twice:

1st time: copy insertelement from entry to vector.body

entry:
  ...
1  %l0 = trunc i16 %x to i8
2  %l1 = insertelement <8 x i8> undef, i8 %l0, i32 0
  ...
3  br label %vector.body

vector.body:                                      ; preds = %vector.body, %entry
  ...
4  %0 = insertelement <8 x i8> undef, i8 %l0, i32 0       // copy from line 2
5  %1 = shufflevector <8 x i8> %0, <8 x i8> undef, <8 x i32> zeroinitializer
6  %l9 = mul <8 x i8> %1, %l8

2nd time, copy insertelement's operand trunc

  ...
1  %l0 = trunc i16 %x to i8 
2  %l1 = insertelement <8 x i8> undef, i8 %l0, i32 0
  ...
3  br label %vector.body

vector.body:                                      ; preds = %vector.body, %entry
  ...
  %0 = trunc i16 %x to i8                           // 2nd copy optimization

  %1 = insertelement <8 x i8> undef, i8 %0, i32 0    // use 2nd copy
  %2 = shufflevector <8 x i8> %1, <8 x i8> undef, <8 x i32> zeroinitializer
  %l9 = mul <8 x i8> %2, %l8

Do we need the IsHugeFunc flag? Can't we use the worklist approach unconditionally? This is what we usually do. Limiting it to IsHugeFunc means that it received essentially zero testing.

For the first step, I tend to add this IsHugeFunc flag, because the new logic may, in theory, miss some opportunities compared with the old logic.
To save build time, the new logic only re-optimizes updated BBs. In fact, a non-updated BB may still have opportunities to sink/combine into updated BBs.

LuoYuanke added inline comments.Aug 31 2022, 2:48 AM
llvm/lib/CodeGen/CodeGenPrepare.cpp
7105

Is the insertelement in the entry BB erased the first time?

xiangzhangllvm added inline comments.Aug 31 2022, 2:51 AM
llvm/lib/CodeGen/CodeGenPrepare.cpp
7105

It is not erased, because it still has a user in the entry BB.
But it would be possible to erase it.

And in fact, the performance of the new logic is very close to the old logic now.
Even with the IsHugeFunc flag removed, no lit tests change.
I also did SPEC testing (x86); there is no drop locally.

add option -cgpp-huge-func to tests

xiangzhangllvm marked 6 inline comments as done.Aug 31 2022, 10:08 PM

Address Yuanke's comments

The failing test msan_debug_info.ll has no relation to this pass; it must be a mistake, because it only runs MSan-related passes (it does not run the CodeGenPrepare pass at all):

It can be reproduced with (on Windows):

$ "c:\work\tests\llvm-premerge-tests\llvm-project\build\bin\opt.exe" "-passes=module(msan-module),function(msan)" "-msan-instrumentation-with-call-threshold=0" "-msan-track-origins=1" "-S" "./test/Instrumentation/MemorySanitizer/msan_debug_info.ll" -o  t.s --print-after-all &> pp

I'll set the cgpp-huge-func default value back to 10000 for the first stage.

LuoYuanke added inline comments.Sep 5 2022, 1:34 AM
llvm/lib/CodeGen/CodeGenPrepare.cpp
8220

It seems unnecessary to call getDT immediately after DT.reset(). The lazy getDT should help compile time.
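The lazy-recomputation point can be illustrated outside LLVM. LazyDomTree is a hypothetical stand-in (the real pass keeps a DominatorTree and a lazy getDT() helper); here the "expensive rebuild" is just a counter, showing why resets are cheap until someone actually queries.

```cpp
#include <cassert>

// Sketch of the lazy-DT pattern under discussion: reset() only invalidates,
// and the (expensive) recomputation is deferred until getDT() is called.
struct LazyDomTree {
  bool Valid = false;
  int Recomputations = 0; // stands in for the cost of rebuilding the DT

  void reset() { Valid = false; } // invalidate; costs nothing by itself

  // Recompute on demand; callers that never query after a reset pay nothing.
  int &getDT() {
    if (!Valid) {
      ++Recomputations; // the expensive rebuild happens only here
      Valid = true;
    }
    return Recomputations;
  }
};
```

Calling getDT() eagerly right after every reset() defeats this laziness, which is the review concern; conversely, code that queries the DT without going through getDT() after a reset would see a stale tree, which is why the patch adds the getDT() call where the DT is actually used.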

xiangzhangllvm added inline comments.Sep 6 2022, 12:14 AM
llvm/lib/CodeGen/CodeGenPrepare.cpp
8220

If optimizeInst uses the DT directly, it will fail. I hit some test failures here before, so I added the getDT here.

This revision is now accepted and ready to land.Sep 6 2022, 12:37 AM
This revision was landed with ongoing or failed builds.Sep 6 2022, 7:10 PM
This revision was automatically updated to reflect the committed changes.