The existing way of creating the predicate in the guard blocks is using
a boolean value per outgoing block. This would be increase the number of
live booleans as we are having more outgoing blocks. The new way added in
this change is to store one integer to represent the outgoing block we
want to branch to, then at each guard block, an integer equality check
will be performed to decide which a specific outgoing block will be
taken.
Details
Diff Detail
Unit Tests
Time | Test | |
---|---|---|
40 ms | x64 debian > LLVM.CodeGen/MIR/AArch64::mirnamer.mir |
Event Timeline
I'd like to try to move this patch by Ruiling forward. Using integer values instead of
boolean values for regions with many outgoing blocks is beneficial by reducing
register pressure. Using boolean vlaues requires a live value for each outgoing block,
but using an integer requires one only. In loops with lot of outgoing blocks, we see
a significant reduction in register pressure and faster compile-time.
Added a test case when using integers. Existing test changes are needed because
Phis are added in a different order now. Another change is that a comand-line
option may be used to specify when to use boolean or integer values.
LGTM, with a minor issue in the option name and description.
llvm/lib/Transforms/Utils/BasicBlockUtils.cpp | ||
---|---|---|
60–64 | The option name and its description does not really say which utility it's meant for in this file. I would suggest a name such as "MaxBooleansInControlFlowHub" along with a mention of the control flow hub in the description. |
Why do we still need the "old way"? Does the "new way" generate worse code in some cases?
Good question. The new way does require an additional compare instruction to test where control continues after the loop. One for each output edge. But, yes, there is a tradeoff here. I think it's worth assessing if the "new way" can replace the "old way" at some point. My thinking here is to be conservative and keep the "old way" for now as I'm a little hesitant to replace it without more performance data.
The "old way" uses one boolean value to record whether an outgoing block would be taken, we need N boolean values when we have N outgoing blocks. But the "new way" just needs one 32bits variable to record the outgoing target, meanwhile it needs some additional compare instructions to generate the block prediction. So, this is a tradeoff between the register usage and number of instructions.
llvm/lib/Transforms/Utils/BasicBlockUtils.cpp | ||
---|---|---|
60–64 | Given this is a utility function I would expect this to be a parameter instead |
Right, I understand that using an i32 instead of an i1 means that you need extra icmp instructions in the IR, but does it actually cause more instructions in the final generated code (for AMDGPU)? If the i1 was stored in a general purpose register then you would need a cmp anyway to use it to control a conditional branch.
Based upon reviewer feedback, this update changes the name of the command-line option, and passes the command-line value as a parameter to CreateControlFlowHub
Only a handful of code gen LIT tests fail if I enable the new way, with an integer value for all cases. I was surprised it was so few. The code differences in the tests are not significant. When boolean values are used, the compiler generates and instructions, and when the integer value is used the compiler generate cmp instructions instead. Some other minor differences in one of the tests, e.g., moving a constant to an sgpr. I can try to find and run some other tests and check the generated code.
I would expect that the i1 values get packed into an sreg, and then used in cbranch_vccz? That wouldn't need a cmp to actually compare the i1 value.
llvm/lib/Transforms/Utils/BasicBlockUtils.cpp | ||
---|---|---|
1807 | This is more curiosity than comment ... what is the advantage of using an Optional here? How does it compare to just using ~0 as the (maximum) default value? |
OK, if they are divergent and get packed into an SGPR with one bit per lane then that does sound much better than using one VGPR per value. And the threshold of 32 makes some sense, since it's roughly equal to the wavefront size.
llvm/lib/Transforms/Utils/BasicBlockUtils.cpp | ||
---|---|---|
1807 | My first instinct was to use ~0U since I'm used to that. But, then I ended up using Optional since it makes it explicit that the value may not be set and the code doesn't rely on a sentinel value. |
The option name and its description does not really say which utility it's meant for in this file. I would suggest a name such as "MaxBooleansInControlFlowHub" along with a mention of the control flow hub in the description.