This is an archive of the discontinued LLVM Phabricator instance.

[StructurizeCFG] Improve basic block ordering
ClosedPublic

Authored by bcahoon on Apr 6 2022, 9:08 AM.

Details

Summary

StructurizeCFG linearizes the successors of branching basic block
by adding Flow blocks to record the true/false path for branches
and back edges. This patch reduces the number of Phi values needed
to capture the control flow path by improving the basic block
ordering.

Previously, StructurizeCFG adds loop exit blocks outside of the
loop. StructurizeCFG sets a boolean value to indicate the path
taken, and all exit block live values extend to after the loop.
For loops with a large number of exits blocks, this creates a
huge number of values that are maintained, which increases
compilation time and register pressure. This is problem
especially with ASAN, which adds early exits to basic blocks with
unreachable instructions for each instrumented check in the loop.

In specific cases, this patch reduces the number of values needed
after loop by moving the exit block to the loop. This is done
for blocks that have a single predecessor and single successor by
moving the block to appear just after the predecessor.

Diff Detail

Event Timeline

bcahoon created this revision.Apr 6 2022, 9:08 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 6 2022, 9:08 AM
bcahoon requested review of this revision.Apr 6 2022, 9:08 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 6 2022, 9:08 AM

Needs more reviewers ... I would suggest authors of previous non-trivial changes to the structurizer, and their reviewers.

llvm/lib/Transforms/Scalar/StructurizeCFG.cpp
454–456

Lots of questions because the structurizer itself is so under-specified:

  1. Seems like a block can be moved multiple times until it "floats" to its final place. Is this movement monotonic?
  2. Can you please add a comment explaining the criterion in terms of the "Visited" state of the Structurizer itself?
  3. Are there invariants about MoveTo and Moved that we could assert in each of the two for loops?
  4. What happens when MoveTo[X] == MoveTo[Y]?
foad added a comment.Apr 12 2022, 2:57 AM

I'm being picky about the description because I'd really like to understand it clearly before reading the code.

This patch reduces the of Phi values needed

"the number of Phi values"

This is problem
especially with ASAN, which adds early exits to unreachable
blocks for each instrumented check in the loop.

I don't understand what "unreachable" refers to here.

In specific cases, this patch reduces the number of values needed
after loop by moving the exit block to the loop.

I think "into the loop" would be clearer if that's what you mean.

In specific cases, this patch reduces the number of values needed
after loop by moving the exit block to the loop.

I think "into the loop" would be clearer if that's what you mean.

Good points, all of them. Additionally, note that loops are different before and after structurization. So moving an exit block into a loop is not as crazy as it seems at first glance!

which increases compilation time and register pressure.

Have you looked at which part is responsible for the compilation time increase? Is is possible that we hit inefficiency in certain pass?
The "register pressure" here specifically means SGPR usage, right?

I don't think it is a good idea to move exist blocks into the loop especially when threads in a wave exit the loop in different iterations. Executing the exit block after loop should benefit for runtime performance.

which increases compilation time and register pressure.

Have you looked at which part is responsible for the compilation time increase? Is is possible that we hit inefficiency in certain pass?
The "register pressure" here specifically means SGPR usage, right?

I don't think it is a good idea to move exist blocks into the loop especially when threads in a wave exit the loop in different iterations. Executing the exit block after loop should benefit for runtime performance.

We do know exactly why we need this patch, which is for the address sanitizer. Performance is not an issue when asan is ON. Then one safe way forward could be to hide this patch behind a new command-line option, which is turned on by default when asan is enabled. The new option will also allow us to separately investigate impact on important workloads before we make any call on enabling it by default with or without asan.

bcahoon updated this revision to Diff 422850.Apr 14 2022, 7:19 AM
bcahoon edited the summary of this revision. (Show Details)
bcahoon added a reviewer: cfang.

Made several changes based upon suggestions and to fix a correctness issue:

  • The new function, reorderNodes, must appear before collectInfos because collectInfos collects branch predicate information based upon the order of the nodes, so the wrong branch information was used after nodes were reordered.
  • Changed name of the function, from improveNodeOrder to reorderNodes because changing the order is an improvement under certain conditions.
  • Only reorder the nodes for large regions. The size of the region can be changed with a command-line option. The issues fixed by this patch are seen only with very large regions, so this flag limits the impact of this patch. I'm open to other ways to do this.
  • Since reorderNodes appears before collectInfos, it no longer can use Visited. Instead, reorderNodes keeps track of the basic blocks in Order and uses that as part of the criteria to reorder. We only want to reorder a node that has a predecessor in Order.

which increases compilation time and register pressure.

Have you looked at which part is responsible for the compilation time increase? Is is possible that we hit inefficiency in certain pass?
The "register pressure" here specifically means SGPR usage, right?

Yes, the register pressure is SGPR usage. In large regions with a large number of loop exit blocks, a common case when ASAN is enabled, StrucuturizeCFG creates an ordering that creates a very large number of values that are live across the loop, e.g. to capture the control flow path.

I don't think it is a good idea to move exist blocks into the loop especially when threads in a wave exit the loop in different iterations. Executing the exit block after loop should benefit for runtime performance.

We do know exactly why we need this patch, which is for the address sanitizer. Performance is not an issue when asan is ON. Then one safe way forward could be to hide this patch behind a new command-line option, which is turned on by default when asan is enabled. The new option will also allow us to separately investigate impact on important workloads before we make any call on enabling it by default with or without asan.

I added an option to limit when the nodes are reordered. I plan to do some testing with the nodes are always reordered just to see the impact. Open to other ways to implement this, such as your suggestion.

llvm/lib/Transforms/Scalar/StructurizeCFG.cpp
454–456

A block is moved only once. It will move to just after its predecessor. However, the predecessor may have multiple successors. Only one of the successors will move, if there are multiple successors that meet the critertia (a single predecessor in the Order list, and a single successor that is the region exit block).

Visited is no longer used in reorderNodes. But, we still need to seen track of the blocks in the region that are in Order. We only want to move a block if it's predecessor is in Order. Otherwise, it's not obvious that moving it will be valid.

Still working on a simple invariant to check w.r.t MoveTo and Moved. MoveTo[X] != MoveTo[Y] because an basic block that is in MoveTo has only a single predecessor, either X or Y.

bcahoon updated this revision to Diff 436386.Jun 13 2022, 7:41 AM

No changes with this patch. It's been a while since this patch was added. I was investigating an internal test failure, which turns out to be unrelated to this patch. So, I'd like to move forward with reviewing this again.

The change itself looks okay to me, to the extent that it does not seem to break the structurizer. Just by putting nodes in the right order, it is still able to generate predicates that preserve the original control flow. I think that's good enough to boldly make this change. I do have some comments/questions about the tests.

llvm/test/Transforms/StructurizeCFG/improve-order.ll
6

What happens when an edge exits a loop nest? Do we need a test for that?

11

Would be very useful to have comments that indicate which blocks to look out for in each test, and where do they end up. For example, in this first function, B and D are exit blocks from A and C respectively. Flow and Flow1 form the diamond pattern around them. Without this change, this whole pattern would be outside the loop, but now it is inside. The structurizer automatically sets up predicates so that B and D are not actually executed as part of the loop, but only when it's time to exit. Is that correct?

65

To maintain convention, this should be PredC?

193

Or in particular, the successor is not outside the region, right?

272

I am not sure what this means by "multiple common successors". Was unable to spot that in the input LLVM IR. Does this refer to a structurized output without the modifications in this change?

429

Would it be wrong to say that this case is too simple because there is no loop involved? Is there a use-case where reordering is beneficial without a loop?

I had tried a different approach to avoid inserting excessive number of boolean values during the loop-exit-unify in D127831. I just did some testing of that change against the LLVM IR Brendon shared with me. It shows the change could help reducing the number of registers as well as compile time. But it is sad that I still hit the error: "unhandled SGPR spill to memory" from SGPRSpillBuilder in SIRegisterInfo.cpp. Can the limitation be fixed? I did some register pressure comparison, seems the way I proposed would use much less VGPR than (D123230 + D123231), but use more SGPR. I haven't looked further why there is such behavior difference. I think we need more investigation to know why. But looks like D127831 might help us generate better code because we can use one VGPR as the backup storage for spilling of 64/32 SGPRs. And the idea used there is much easy to follow.

bcahoon updated this revision to Diff 438223.Jun 19 2022, 4:12 PM

Updates to the comments in the improve-order.ll test, based up reviewer suggestions.

bcahoon marked 6 inline comments as done.Jun 19 2022, 4:27 PM
bcahoon added inline comments.
llvm/test/Transforms/StructurizeCFG/improve-order.ll
6

Yes, it's the next test called reorder_inner_loop.

11

I've added comments that provide more information, based upon your suggestions. You are correct that, without this change exit blocks appear after the loop because they are dependent on the Flow block that contains the back edge for the loop (and maybe a series of dependent Flow edges). And, yes, the structurizer sets up predicates so that B and D are not actually executed, unless the exit occurs.

65

I changed this and other similar occurrences.

193

That's correct. I added that to the comment below.

272

This test contains more exit blocks. Also, blocks J and K are created that branch to a common block. Basically, a level of indirection that doesn't occur in the other tests.

429

I doubt the reordering is beneficial without a loop. There isn't an easy way to check for a loop in StructurizeCFG, so reorderNodes may change the order even if there isn't a loop. The patch just looks for a simple pattern, when a block has a single predecessor and a single successor. But, that causes the blocks to be reordered in this test. To stop that, the check to reorder also looks for a successor that exits the region.

bcahoon marked 6 inline comments as done.Jun 19 2022, 4:32 PM

I had tried a different approach to avoid inserting excessive number of boolean values during the loop-exit-unify in D127831. I just did some testing of that change against the LLVM IR Brendon shared with me. It shows the change could help reducing the number of registers as well as compile time. But it is sad that I still hit the error: "unhandled SGPR spill to memory" from SGPRSpillBuilder in SIRegisterInfo.cpp. Can the limitation be fixed? I did some register pressure comparison, seems the way I proposed would use much less VGPR than (D123230 + D123231), but use more SGPR. I haven't looked further why there is such behavior difference. I think we need more investigation to know why. But looks like D127831 might help us generate better code because we can use one VGPR as the backup storage for spilling of 64/32 SGPRs. And the idea used there is much easy to follow.

I tried your approach and it does improve compile-time, but I also see the register allocation error in another test. I'm not sure when that register allocation error will be fixed. But, it's surprising that it occurs because I do see register pressions improvements with your patch.

sameerds accepted this revision.Jun 21 2022, 10:50 PM

I had tried a different approach to avoid inserting excessive number of boolean values during the loop-exit-unify in D127831. I just did some testing of that change against the LLVM IR Brendon shared with me. It shows the change could help reducing the number of registers as well as compile time. But it is sad that I still hit the error: "unhandled SGPR spill to memory" from SGPRSpillBuilder in SIRegisterInfo.cpp. Can the limitation be fixed? I did some register pressure comparison, seems the way I proposed would use much less VGPR than (D123230 + D123231), but use more SGPR. I haven't looked further why there is such behavior difference. I think we need more investigation to know why. But looks like D127831 might help us generate better code because we can use one VGPR as the backup storage for spilling of 64/32 SGPRs. And the idea used there is much easy to follow.

Is there any dependency before submitting the current patch?

This revision is now accepted and ready to land.Jun 21 2022, 10:50 PM
This revision was landed with ongoing or failed builds.Jun 22 2022, 2:13 PM
This revision was automatically updated to reflect the committed changes.

The issue @critson reported in the related change reminds to take a careful look of this change. I think there is one critical correctness issue need to be fixed. When we are moving the exit block into the loop, we need to make sure there is no convergent operation in the exit block, otherwise we may change the threads that will participate the convergent operation if the threads exit the loop non-uniformly. So we have to scan all the instructions in the exit block before making the reorder decision. Another thing I just noticed is the existing implementation does not actually check it is moving loop/SCC exits. It also tries to reorder the blocks for an ordinary if-else region, which is really unnecessary. Now that we need to scan all the instructions to check against convergent operations, I think we need to make sure we are only scanning the loop-exits. To achieve this, I think you can remember the blocks of each SCC in StructurizeCFG::orderNodes(), and when you try to check whether it needs reorder, please make sure its single predecessor is inside an SCC but the block itself is not. I hope the threshold value check applies to both number of blocks in SCC as well as the number of exits. Apply the threshold value to number of blocks in SCC just helps us filter out loops that we don't need to care early. Checking the number of region nodes sounds not strong enough, even we have a large loop, we still don't want to move the loop exits into the loop unless it has lots of exits.

The issue @critson reported in the related change reminds to take a careful look of this change. I think there is one critical correctness issue need to be fixed. When we are moving the exit block into the loop, we need to make sure there is no convergent operation in the exit block, otherwise we may change the threads that will participate the convergent operation if the threads exit the loop non-uniformly. So we have to scan all the instructions in the exit block before making the reorder decision.

Hi @ruiling can you expand upon this a little bit since I'm not sure I understand. The exit block isn't part of the loop after the transformation, meaning that it's not executed as part of the loop. The block appears in the control flow after the exiting block. It's only executed if the loop exits. There shouldn't be any change in blocks executed with this patch.

Another thing I just noticed is the existing implementation does not actually check it is moving loop/SCC exits. It also tries to reorder the blocks for an ordinary if-else region, which is really unnecessary. Now that we need to scan all the instructions to check against convergent operations, I think we need to make sure we are only scanning the loop-exits. To achieve this, I think you can remember the blocks of each SCC in StructurizeCFG::orderNodes(), and when you try to check whether it needs reorder, please make sure its single predecessor is inside an SCC but the block itself is not.

You're correct that the code doesn't check for a loop and will try to reorder the blocks for ordinary if-then-else sequences. My thinking was that it wouldn't be worth keeping track of which blocks are in each SCC since the situation of reordering the if-then-else regions seemed to occur infrequently given the other constraints for reordering. But, I can add that functionality.

I hope the threshold value check applies to both number of blocks in SCC as well as the number of exits. Apply the threshold value to number of blocks in SCC just helps us filter out loops that we don't need to care early. Checking the number of region nodes sounds not strong enough, even we have a large loop, we still don't want to move the loop exits into the loop unless it has lots of exits.

Sure, I think that's reasonable to check for loops with a large number of exits. I can create a patch that keeps track of the region exits and uses that metric instead.

The issue @critson reported in the related change reminds to take a careful look of this change. I think there is one critical correctness issue need to be fixed. When we are moving the exit block into the loop, we need to make sure there is no convergent operation in the exit block, otherwise we may change the threads that will participate the convergent operation if the threads exit the loop non-uniformly. So we have to scan all the instructions in the exit block before making the reorder decision.

Hi @ruiling can you expand upon this a little bit since I'm not sure I understand. The exit block isn't part of the loop after the transformation, meaning that it's not executed as part of the loop. The block appears in the control flow after the exiting block. It's only executed if the loop exits. There shouldn't be any change in blocks executed with this patch.

I think by placing the loop-exit after loop exiting, the structurizer will wire the loop-exit inside the loop. I have add some note based on one lit test to explain the divergent loop exits case, hope it would help.

llvm/test/Transforms/StructurizeCFG/improve-order.ll
10

For this example, the block B and D are loop exits. Before this change, they are placed outside of the loop after structurization, so they will only start executing after all the threads in a wave exit the loop. But with this change, they appear inside the loop after structurization. By putting the exits inside the loop, if part of the threads in a wave exit the loop into B, then these threads will start executing B while other threads continue executing the blocks that have been inside the thread-level loop before structurization. This basically describe the wave execution when loop exit condition is divergent. Before this change, the loop-exits will only be executed once after all the threads exit the loop.