This is an archive of the discontinued LLVM Phabricator instance.

BasicBlockUtils: Add a new way for CreateControlFlowHub()
AcceptedPublic

Authored by bcahoon on Jun 15 2022, 12:17 AM.

Details

Reviewers
sameerds
ruiling
Summary

The existing way of creating the predicate in the guard blocks is using
a boolean value per outgoing block. This would be increase the number of
live booleans as we are having more outgoing blocks. The new way added in
this change is to store one integer to represent the outgoing block we
want to branch to, then at each guard block, an integer equality check
will be performed to decide which a specific outgoing block will be
taken.

Diff Detail

Event Timeline

ruiling created this revision.Jun 15 2022, 12:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 15 2022, 12:17 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
ruiling requested review of this revision.Jun 15 2022, 12:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 15 2022, 12:17 AM
foad added a subscriber: foad.Aug 1 2022, 9:28 AM
bcahoon commandeered this revision.Oct 10 2022, 2:00 PM
bcahoon edited reviewers, added: ruiling; removed: bcahoon.
bcahoon updated this revision to Diff 466611.Oct 10 2022, 2:22 PM

I'd like to try to move this patch by Ruiling forward. Using integer values instead of
boolean values for regions with many outgoing blocks is beneficial by reducing
register pressure. Using boolean vlaues requires a live value for each outgoing block,
but using an integer requires one only. In loops with lot of outgoing blocks, we see
a significant reduction in register pressure and faster compile-time.

Added a test case when using integers. Existing test changes are needed because
Phis are added in a different order now. Another change is that a comand-line
option may be used to specify when to use boolean or integer values.

sameerds accepted this revision.Oct 26 2022, 11:53 PM

LGTM, with a minor issue in the option name and description.

llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
60–64

The option name and its description does not really say which utility it's meant for in this file. I would suggest a name such as "MaxBooleansInControlFlowHub" along with a mention of the control flow hub in the description.

This revision is now accepted and ready to land.Oct 26 2022, 11:53 PM
foad added a comment.Oct 27 2022, 1:13 AM

Why do we still need the "old way"? Does the "new way" generate worse code in some cases?

Why do we still need the "old way"? Does the "new way" generate worse code in some cases?

Good question. The new way does require an additional compare instruction to test where control continues after the loop. One for each output edge. But, yes, there is a tradeoff here. I think it's worth assessing if the "new way" can replace the "old way" at some point. My thinking here is to be conservative and keep the "old way" for now as I'm a little hesitant to replace it without more performance data.

Why do we still need the "old way"? Does the "new way" generate worse code in some cases?

The "old way" uses one boolean value to record whether an outgoing block would be taken, we need N boolean values when we have N outgoing blocks. But the "new way" just needs one 32bits variable to record the outgoing target, meanwhile it needs some additional compare instructions to generate the block prediction. So, this is a tradeoff between the register usage and number of instructions.

arsenm added a subscriber: arsenm.Oct 27 2022, 9:17 PM
arsenm added inline comments.
llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
60–64

Given this is a utility function I would expect this to be a parameter instead

foad added a comment.Oct 28 2022, 6:47 AM

Why do we still need the "old way"? Does the "new way" generate worse code in some cases?

Good question. The new way does require an additional compare instruction to test where control continues after the loop. One for each output edge. But, yes, there is a tradeoff here. I think it's worth assessing if the "new way" can replace the "old way" at some point. My thinking here is to be conservative and keep the "old way" for now as I'm a little hesitant to replace it without more performance data.

Right, I understand that using an i32 instead of an i1 means that you need extra icmp instructions in the IR, but does it actually cause more instructions in the final generated code (for AMDGPU)? If the i1 was stored in a general purpose register then you would need a cmp anyway to use it to control a conditional branch.

bcahoon updated this revision to Diff 471673.Oct 28 2022, 3:30 PM

Based upon reviewer feedback, this update changes the name of the command-line option, and passes the command-line value as a parameter to CreateControlFlowHub

bcahoon marked 2 inline comments as done.Oct 28 2022, 3:45 PM

Why do we still need the "old way"? Does the "new way" generate worse code in some cases?

Good question. The new way does require an additional compare instruction to test where control continues after the loop. One for each output edge. But, yes, there is a tradeoff here. I think it's worth assessing if the "new way" can replace the "old way" at some point. My thinking here is to be conservative and keep the "old way" for now as I'm a little hesitant to replace it without more performance data.

Right, I understand that using an i32 instead of an i1 means that you need extra icmp instructions in the IR, but does it actually cause more instructions in the final generated code (for AMDGPU)? If the i1 was stored in a general purpose register then you would need a cmp anyway to use it to control a conditional branch.

Only a handful of code gen LIT tests fail if I enable the new way, with an integer value for all cases. I was surprised it was so few. The code differences in the tests are not significant. When boolean values are used, the compiler generates and instructions, and when the integer value is used the compiler generate cmp instructions instead. Some other minor differences in one of the tests, e.g., moving a constant to an sgpr. I can try to find and run some other tests and check the generated code.

Right, I understand that using an i32 instead of an i1 means that you need extra icmp instructions in the IR, but does it actually cause more instructions in the final generated code (for AMDGPU)? If the i1 was stored in a general purpose register then you would need a cmp anyway to use it to control a conditional branch.

I would expect that the i1 values get packed into an sreg, and then used in cbranch_vccz? That wouldn't need a cmp to actually compare the i1 value.

sameerds added inline comments.Oct 28 2022, 10:33 PM
llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
1807

This is more curiosity than comment ... what is the advantage of using an Optional here? How does it compare to just using ~0 as the (maximum) default value?

foad added a comment.Oct 29 2022, 2:52 AM

Right, I understand that using an i32 instead of an i1 means that you need extra icmp instructions in the IR, but does it actually cause more instructions in the final generated code (for AMDGPU)? If the i1 was stored in a general purpose register then you would need a cmp anyway to use it to control a conditional branch.

I would expect that the i1 values get packed into an sreg, and then used in cbranch_vccz? That wouldn't need a cmp to actually compare the i1 value.

OK, if they are divergent and get packed into an SGPR with one bit per lane then that does sound much better than using one VGPR per value. And the threshold of 32 makes some sense, since it's roughly equal to the wavefront size.

bcahoon added inline comments.Oct 30 2022, 9:24 AM
llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
1807

My first instinct was to use ~0U since I'm used to that. But, then I ended up using Optional since it makes it explicit that the value may not be set and the code doesn't rely on a sentinel value.