This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Combine BFI instructions.
Abandoned · Public

Authored by tsymalla on Oct 21 2022, 1:50 AM.

Details

Reviewers
nhaehnle
Summary

When translating multiple bitfieldInserts where one
bitfieldInsert is the base for another one, the current
mechanism of creating XOR, AND, XOR sequences in the
frontend and having them lowered in ISel is not sufficient,
as the information about the bitmasks is lost during InstCombine.
This leads to only one v_bfi instruction being generated.

When creating the canonical bitfieldInsert pattern directly in the
frontend, the constants will still be partially merged by
SimplifyDemandedBits, leading to a simplified pattern. However, the
original sequence can be reconstructed and the nested bitfieldInserts
can be generated.

In general, this approach tries to match sequences such as

(X1 & C1) | ((X2 & C2) | (X3 & (~C1 | ~C3)))

and checks if this pattern is derived from

(X1 & C1) | (~C1 & ((X2 & C2) | (X3 & ~C3)))

by looking at the constants and checking whether they are disjoint and
partition -1.

In such cases, it will try to generate the appropriate v_bfi instructions.
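
(For illustration: the constant check described above boils down to something like the following standalone sketch. The helper is made up for this summary and is not the patch's code; it assumes one 32-bit mask per clause.)

#include <cstdint>
#include <initializer_list>

// The per-clause masks must be pairwise disjoint and together cover all
// 32 bits (i.e., partition -1); otherwise the expression is not a
// nested bitfieldInsert.
static bool masksPartitionAllOnes(std::initializer_list<uint32_t> Masks) {
  uint32_t Union = 0;
  for (uint32_t M : Masks) {
    if (Union & M)
      return false; // overlap: masks are not disjoint
    Union |= M;
  }
  return Union == ~0u; // together the masks must cover every bit
}

Only if a check of this kind succeeds are the v_bfi instructions generated.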

Diff Detail

Event Timeline

tsymalla created this revision. Oct 21 2022, 1:50 AM
Herald added a project: Restricted Project. · View Herald Transcript · Oct 21 2022, 1:50 AM
tsymalla requested review of this revision. Oct 21 2022, 1:50 AM

One inline comment on a comment :).

Slightly related: How do we handle (X1 & Y) | (X2 & ~Y) where Y is a variable?

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2164

It seems you support arbitrarily nested expressions, maybe mention here that the example with three clauses is just an example?

tsymalla added a comment (edited). Oct 21 2022, 2:20 AM

One inline comment on a comment :).

Slightly related: How do we handle (X1 & Y) | (X2 & ~Y) where Y is a variable?

Thanks!
As such a case cannot be combined into a BFI instruction, I guess it's relatively straightforwardly translated into something like this:

v_and_b32 v0, x1, y ; v0 = X1 & Y
v_not_b32 v1, y ; v1 = ~Y
v_and_or_b32 v1, v1, x2, v0 ; v1 = (~Y & X2) | (X1 & Y)

As such a case cannot be combined into a BFI instruction, I guess it's relatively straightforwardly translated into something like this:

Why can't we? Isn't this essentially the definition of BFI?

As such a case cannot be combined into a BFI instruction, I guess it's relatively straightforwardly translated into something like this:

Why can't we? Isn't this essentially the definition of BFI?

I think I misunderstood your question. If you are asking how generic BFIs are generated:
Cases where we can prove that Y is used on both sides of the OR expression are currently transformed to the canonical form by InstCombineAndOrXor, after the non-canonical form is created in the frontend. The sequence is then directly picked up during ISel and transformed into a single V_BFI. There's a pattern match for it.
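
(For reference, the single-BFI semantics being matched here, as a minimal standalone sketch. The mask-first operand convention mirrors V_BFI_B32 computing dst = (S0 & S1) | (~S0 & S2), which is an assumption of this sketch.)

#include <cstdint>

// One bitfield insert: take bits of A where the mask is set, bits of B
// where it is clear. The canonical IR form (A & Mask) | (B & ~Mask) is
// selected to a single V_BFI_B32.
static uint32_t bfi(uint32_t Mask, uint32_t A, uint32_t B) {
  return (A & Mask) | (B & ~Mask);
}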

I see, thanks!

So Select_BFI just implements special case handling of BFI matching with more complicated expressions that we can still work with (because selectors are constants), and the default case is handled elsewhere.
Maybe we should rename Select_BFI to reflect that?

I see, thanks!

So Select_BFI just implements special case handling of BFI matching with more complicated expressions that we can still work with (because selectors are constants), and the default case is handled elsewhere.
Maybe we should rename Select_BFI to reflect that?

Maybe Select_NestedBFI would be more accurate.

nhaehnle added inline comments. Oct 21 2022, 4:45 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2164

Seconded. The comment as-is is bad because it is too stuck in the original motivation for the code. Comments need to start at a high level.

We need to take a step back and think about what the generalized and high-level thing is we're doing here. And that is:

Match a 32-bit expression of the form

  (X1 & C1) | (X2 & C2) | ... | (Xk & Ck)

where the Ci are constants and select it to a sequence of (k-1) V_BFI_B32 instructions if possible.

That needs to be the headline of the comment here. (The original motivation can be mentioned later as a further detail.)

What I glossed over is the bracketing. The code as-is only matches one specific possibility for introducing brackets, which is the one you happened to see in the motivation you started with. But that's the kind of thing where we should do better by default.

What if recursion is on the left instead of on the right? What if it is on both sides?

2202–2203

Can this case happen?

2220–2237

This whole code can be simplified by starting with NewNode = BFIOps.back().X->getOperand(0) and iterating over reverse(makeArrayRef(BFIOps).drop_back())
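
(A sketch of what that suggested loop could look like. The field names BFIOps[i].X / BFIOps[i].Mask, the opcode name, and the operand order are assumptions for illustration, not the patch's actual code; here X denotes the clause's value operand, which is why the suggestion above extracts getOperand(0) from the AND node, and DL is assumed to be an SDLoc in scope.)

// Seed the chain with the last clause's value; because the masks
// partition -1, the enclosing BFIs mask it down to exactly its bits.
// Then fold the remaining clauses in reverse.
SDValue NewNode = BFIOps.back().X;
for (const BFIOp &Op : reverse(makeArrayRef(BFIOps).drop_back())) {
  SDValue MaskVal = CurDAG->getTargetConstant(Op.Mask, DL, MVT::i32);
  // V_BFI_B32 computes (S0 & S1) | (~S0 & S2).
  NewNode = SDValue(CurDAG->getMachineNode(AMDGPU::V_BFI_B32_e64, DL,
                                           MVT::i32, MaskVal, Op.X, NewNode),
                    0);
}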

2239

Is this check necessary?

One more thought on longer expressions: Nicolai already mentioned that we currently only match a particular expression structure.
I wondered whether the expression structure has any performance impact?

For example, for four clauses, we could either generate BFI(BFI(X1, X2, C1), BFI(X3, X4, C3), C1 | C2) (i.e., a balanced binary tree) or BFI(BFI(BFI(X1, X2, C1), X3, C1 | C2), X4, C1 | C2 | C3) (i.e., a path).

The number of instructions is the same, but the balanced version has no data dependency between the two inner BFIs. Would that matter on our hardware?
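
(Both shapes compute the same value whenever the masks partition -1. A standalone sanity check, reading BFI(a, b, m) as (a & m) | (b & ~m) to match the notation above:)

#include <cassert>
#include <cstdint>

static uint32_t bfi(uint32_t A, uint32_t B, uint32_t M) {
  return (A & M) | (B & ~M);
}

int main() {
  // Disjoint byte masks; the fourth clause covers the remaining bits.
  const uint32_t C1 = 0x000000FF, C2 = 0x0000FF00, C3 = 0x00FF0000;
  const uint32_t X1 = 0xAAAAAAAA, X2 = 0xBBBBBBBB, X3 = 0xCCCCCCCC,
                 X4 = 0xDDDDDDDD;
  const uint32_t Ref =
      (X1 & C1) | (X2 & C2) | (X3 & C3) | (X4 & ~(C1 | C2 | C3));
  // Balanced binary tree: the two inner BFIs are independent.
  const uint32_t Tree = bfi(bfi(X1, X2, C1), bfi(X3, X4, C3), C1 | C2);
  // Path: each BFI depends on the previous one.
  const uint32_t Path =
      bfi(bfi(bfi(X1, X2, C1), X3, C1 | C2), X4, C1 | C2 | C3);
  assert(Ref == Tree && Ref == Path);
  return 0;
}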

foad added inline comments. Oct 21 2022, 5:23 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2160

Too many parens in this expression. The RHS should be just ((X2 & C2) | (X3 & (~C1 | ~C3)))

2168

uint32_t probably makes more sense since you are matching _B32 operations throughout.

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
265

"SelectBFI" (no underscore) or "SelectV_BFI" for consistency.

I wondered whether the expression structure has any performance impact?

For example, for four clauses, we could either generate BFI(BFI(X1, X2, C1), BFI(X3, X4, C3), C1 | C2) (i.e., a balanced binary tree) or BFI(BFI(BFI(X1, X2, C1), X3, C1 | C2), X4, C1 | C2 | C3) (i.e., a path).

The number of instructions is the same, but the balanced version has no data dependency between the two inner BFIs. Would that matter on our hardware?

It can have an impact because I don't think there's a fast forwarding path in any of our HW, but:

  • the impact is minor, and
  • making the best decision also requires looking at the schedule of the instructions producing the input values, which we can't do here

Hmm... creating a plain binary tree is simple enough. Maybe we *should* do that as long as we're here...

foad added a comment. Oct 21 2022, 7:09 AM

Also it would be good to do the same thing for GlobalISel (in a separate patch).

tsymalla marked 4 inline comments as done. Oct 21 2022, 7:47 AM
tsymalla marked 3 inline comments as done. Oct 21 2022, 7:57 AM
foad added a comment. Oct 21 2022, 8:10 AM

Hmm... creating a plain binary tree is simple enough. Maybe we *should* do that as long as we're here...

I would suggest not doing that. It is not clearly better - it is a tradeoff between latency and register pressure. And I don't think SelectionDAG tries to balance trees in simpler cases, like just nested ANDs or nested ORs.

tsymalla updated this revision to Diff 469896. Oct 22 2022, 7:26 AM
tsymalla marked an inline comment as done.

Addressed review comments.
Simplified code and fixed comments.

Hmm... creating a plain binary tree is simple enough. Maybe we *should* do that as long as we're here...

I would suggest not doing that. It is not clearly better - it is a tradeoff between latency and register pressure. And I don't think SelectionDAG tries to balance trees in simpler cases, like just nested ANDs or nested ORs.

The situation is _slightly_ different because the BFI generation doesn't allow you to consider subtrees in isolation. If we want to generate optimal code in terms of number of instructions for ((x1 & c1) | (x2 & c2)) | ((x3 & c3) | (x4 & c4)), we can't select the subtrees separately (since they would fail the checks on the constants).

That said, it's a good point that we can't make an optimal decision here anyway and so it may be best to just go with the simplest thing, which is a chain of BFIs.

Thanks for the insights Nicolai and Jay!

I'm used to thinking about latency all the time, less so about other objectives like register pressure.

foad added inline comments. Oct 24 2022, 7:17 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2164

When you match this expression tree I think you need to check hasOneUse on every OR node - and possibly on the AND nodes too, but I'm not so sure about that.

tsymalla updated this revision to Diff 470201. Oct 24 2022, 10:12 AM

Add use checks.

tsymalla marked an inline comment as done. Oct 24 2022, 10:12 AM
tsymalla added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2164

Thanks.

2220–2237

Thanks. Good point.

foad added inline comments. Oct 25 2022, 1:20 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2194

The top level OR can have multiple uses.

2215

Why? Is there any disadvantage to letting this function select a single BFI?

tsymalla marked an inline comment as done. Oct 25 2022, 1:46 AM
tsymalla added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2194

Yes. You are right. Thanks.

2215

Such a case is handled by the TableGen patterns, so I don't want to handle the non-specialized case here (it would be possible, but I don't regard that as an advantage in terms of code complexity).

tsymalla updated this revision to Diff 470414. Oct 25 2022, 2:01 AM

Only require single uses on nested ORs

foad added inline comments. Oct 25 2022, 3:08 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2194

This is not the right place to put the hasOneUse check anyway. What would happen if the check failed? You would break out of the loop, and then generate wrong code.

Also how can the check for CurrentOr ever fail?

tsymalla added inline comments. Oct 25 2022, 3:17 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2194

The right way to do so might be to just skip the whole function (return false) if the OR is used multiple times. Previously it was possible for the CurrentOr check to fail; this is just a remnant. It currently works because the terminating condition for the loop is inside the non-OR handling branches of the main if inside the loop: either it continues looking for appropriate ORs, or it terminates once it has found the final OR, AND, AND sequence. I'll simplify the whole thing and put another diff up.

tsymalla updated this revision to Diff 470447. Oct 25 2022, 4:58 AM

Change the structure of the main fetching loop.

tsymalla updated this revision to Diff 470828. Oct 26 2022, 9:00 AM

Fix the lit tests.
The restructuring of the main loop caused issues, because it then became
possible for the top-level OR (the one passed to selectV_BFI) itself to be
matched, which is not intended here.

nhaehnle added inline comments. Oct 27 2022, 1:12 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203

There is still a weird asymmetry here. Do you have a reason for treating the two operands differently? If not, you should treat them the same.

tsymalla added inline comments. Oct 27 2022, 2:54 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203

The reason is that no transformation will happen if one of the operands doesn't fit the scheme. This is the case if

a) on the top level, an OR(AND, AND) pattern occurs (this is handled by ISel)
b) the second operand is not an AND in either case (handled by the top-level O1.getOpcode() check)
c) a nested OR operand is used more than once - in that case, it cannot be optimized away
d) any of the AND operands is not a constant value, i.e., it doesn't contribute to the mask
e) any of the ANDs is used more than once - in that case, it cannot be optimized away
f) none of the assumptions about the structure hold - then we just return false

It simplifies things if the operands are handled differently. Even if, for instance, InsertBFIOp(O1) could theoretically be hoisted out, this complicates the BFI generation part later, because the base and the ordering at the tail of the vector would then be swapped, making it impossible to just assign the back (base) to NewNode and drop it. So this loop basically relies on the state updated from the previous iteration and uses CurrentOr as the controlling element.

nhaehnle added inline comments. Oct 28 2022, 7:00 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203

b) the second operand is not an AND in either case (handled by the top-level O1.getOpcode() check)

What's the justification for this rule? Couldn't you have N = (x1 & c1) | ((x2 & c2) | (x3 & c3))?

tsymalla added inline comments. Oct 28 2022, 7:48 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203

In general, you can, but looking at existing tests: is it a) likely that this will happen, and b) likely that all the other assumptions about the constants hold, so that the compiler will actually create some BFI instructions from the sequence?
I can change the implementation to treat the operands equally, but isn't it likely that, for now, this only adds extra checks and memory usage without ever contributing to the generated code?

nhaehnle added inline comments. Oct 28 2022, 8:02 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203

You're asking the question the wrong way around.

The code that we are inspecting manually is a microscopically tiny fraction of the total amount of code that our compilers ever see. If there is an obvious generalization like this one that is easy to do, we should always take it. And when we notice it, we should add lit tests to ensure the case is covered.

Yes, there are additional checks for the generalization. What additional memory usage is there?

tsymalla added inline comments. Oct 28 2022, 8:07 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203

You're asking the question the wrong way around.

The code that we are inspecting manually is a microscopically tiny fraction of the total amount of code that our compilers ever see. If there is an obvious generalization like this one that is easy to do, we should always take it. And when we notice it, we should add lit tests to ensure the case is covered.

Yes, there are additional checks for the generalization. What additional memory usage is there?

Storing the bytes for the SDNode * and the mask multiple times inside the small vector, for each OR that has an AND as operand, even though it's likely that the memory will be released in a lot of cases.
Sure, the overhead is negligible, but I at least wanted to bring it up.

I'll just add the generalization in the next diff.

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203

The generalization would be to accept arbitrary OR-trees of AND clauses of the form (x_i & c_i) or (c_i & x_i)?

tsymalla added inline comments. Oct 28 2022, 9:55 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203

Yes. The generalization would be to allow the constant operands to appear in arbitrary order for a given AND, and also to allow the nested ORs to appear at any position for a given OR node.
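
(A sketch of the commutative clause matching; the helper is hypothetical and only illustrates the idea, the patch's actual code differs:)

// Match (x & c) or (c & x): the constant mask may be either operand of
// the AND, so try both positions.
static bool matchAndClause(SDValue V, SDValue &X, uint32_t &Mask) {
  if (V.getOpcode() != ISD::AND)
    return false;
  for (unsigned I = 0; I < 2; ++I) {
    if (auto *C = dyn_cast<ConstantSDNode>(V.getOperand(I))) {
      Mask = static_cast<uint32_t>(C->getZExtValue());
      X = V.getOperand(1 - I); // the non-constant value operand
      return true;
    }
  }
  return false;
}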

tsymalla updated this revision to Diff 471910. Oct 31 2022, 12:33 AM

Re-structure the loop once again.
Operands can now be at arbitrary positions.

jsilvanus added inline comments. Oct 31 2022, 1:06 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203

You still artificially require at least one operand of each OR to be an AND. However, we can have an arbitrary OR-tree that you would need to traverse using a stack to collect its leaves -- the AND nodes.

For example, currently you reject N = ((x1 & c1) | (x2 & c2)) | ((x3 & c3) | (x4 & c4)) which can be mapped to three BFIs.
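
(A sketch of such a stack-based traversal; the names are made up and checks such as divergence are omitted, so this is illustrative only:)

// Walk an arbitrary OR-tree with an explicit stack and collect its
// non-OR leaves, i.e. the candidate AND clauses. Inner ORs must be
// single-use, otherwise they cannot be folded away.
static bool collectOrTreeLeaves(SDNode *Root,
                                SmallVectorImpl<SDValue> &Leaves) {
  SmallVector<SDValue, 8> Worklist = {SDValue(Root, 0)};
  while (!Worklist.empty()) {
    SDValue V = Worklist.pop_back_val();
    if (V.getOpcode() == ISD::OR) {
      if (V.getNode() != Root && !V.hasOneUse())
        return false;
      Worklist.push_back(V.getOperand(0));
      Worklist.push_back(V.getOperand(1));
    } else {
      Leaves.push_back(V); // expected to be an (x & c) clause
    }
  }
  return true;
}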

tsymalla added inline comments. Nov 2 2022, 5:42 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203

I am working on that, but it's not trivial to achieve that functionality.
To have all possible BFIs in that tree matched, everything has to be processed in a single pass. If any of the disjointness checks fails, another, more generic matcher might kick in and generate a V_OR, V_OR3 or V_AND node.

tsymalla updated this revision to Diff 474222. Nov 9 2022, 4:12 AM

This change handles most of the tree structures, but not
all of them, because this would add way more complexity.

This uses two separate stacks, one as a "working list"
to traverse the ORs, and one as a stack to collect all
possible ORs. This is done so the top-level OR does not
need to be processed separately.

This also adds divergence checks as they appear in the
standard BFI pattern matching, because we don't have a
scalar BFI instruction and don't want to utilize the VALU
for several scalar instructions (such patterns currently
appear in unrelated tests).

This also handles the mask and base appearing at
arbitrary positions of the AND instructions.

It also utilizes the SDLoc of the original OR instructions
for inserting the new DAG nodes, so the ordering might be
a bit better at the end.

The boolean return values of CreateBFIOpFromAnd were
replaced by a binary-value enum class, which
contains more descriptive entries. In some cases, we just
want to skip the current entry instead of returning false.

This also adds some more tests for changed operand order and for
matching an unbalanced tree structure.

tsymalla updated this revision to Diff 474536. Nov 10 2022, 6:45 AM

Fixed error on Windows build and whitespace issues.

tsymalla updated this revision to Diff 478514. Nov 29 2022, 4:10 AM

Added test to show the isDivergence behavior (copied from another test).
Changed run line to not test a specific architecture. This adds a few
s_movs to the lit tests, but shows the behavior of the isDivergence test
(the GFX10 behavior would be different).

Ping.
If this is getting too complicated, we could still think about moving to intrinsics.

Sorry, this dropped off my radar for a while.

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2165–2168

Despite us having repeatedly told you, this is still missing a type check that N is a 32-bit integer. I suspect you may be getting lucky because other types are being legalized away? But you shouldn't rely on it.

2172

This can just be a uint32_t and doesn't need to be an Optional: if the MaskOperand is not a constant, this struct shouldn't really be constructed in the first place. (It would be cleaner to inline ExtractMask into CreateBFIOpFromAnd, since ExtractMask isn't a particularly load-bearing abstraction here.)

2175–2178

You don't really need the Or in here. (It's only used to get the value type, but why -- you know that this only works with MVT::i32)

2229–2230

You can use OrStack.pop_back_val().

2235–2262

I think the logic here is needlessly complicated and hiding what is really happening in the cases where it's correct, and it is subtly incorrect in some cases.

For example, consider:

N = (x & C) | ((y & ~C) | (z |* w))   -- The |* node has more than one use

In the first loop, the two outermost OR nodes are recorded in OrCandidates, the innermost one isn't. Then, the loop over OrCandidates will collect the BFIOps (x, C) and (y, ~C) and proceed to replace the whole expression by a single BFI, which is incorrect.

It would be cleaner logic (and therefore easier to get correct!) to just merge the two loops into one, by replacing the core part of the top loop with something like:

if (O0->getOpcode() == ISD::OR) {
  if (!O0->isDivergent() || !O0->hasOneUse())
    return false;
  OrStack.push_back(O0.getNode());
} else if (!CreateBFIOpFromAnd(O0)) {
  return false;
}

if (O1->getOpcode() == ISD::OR) {
  if (!O1->isDivergent() || !O1->hasOneUse())
    return false;
  OrStack.push_back(O1.getNode());
} else if (!CreateBFIOpFromAnd(O1)) {
  return false;
}

(This implies that CreateBFIOpFromAnd is adjusted to return false if it wasn't able to recognize the given node as a BFI operand.)

tsymalla added inline comments. Dec 12 2022, 1:15 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2175–2178

That is not entirely correct. It is also used to retrieve the original SDLoc of the (nested) Or. So I think it can stay.

tsymalla abandoned this revision. Mar 31 2023, 12:12 AM