This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Target/AMDGPU/
-
Target/
-
AMDGPU/
1/1
AMDGPUISelDAGToDAG.h
19/32
AMDGPUISelDAGToDAG.cpp
-
test/CodeGen/AMDGPU/
-
CodeGen/
-
AMDGPU/
-
bfi_nested.ll

Differential D136432

[AMDGPU] Combine BFI instructions.
AbandonedPublic

Authored by tsymalla on Oct 21 2022, 1:50 AM.

Download Raw Diff

Details

Reviewers

nhaehnle

Summary

When translating multiple bitfieldInserts where one
bitfieldInsert is the base for another one, the current
mechanism of creating XOR, AND, XOR sequences in the
frontend and having them lowered in ISel is not sufficient,
as the information about the bitmasks is lost during InstCombine.
This leads to only one v_bfi instruction being generated.

When creating the canonical bitfieldInsert pattern directly in the
frontend, the constants will still be partially merged by
SimplifyDemandedBits, leading to a simplified pattern, however, the
original sequence can be reconstructed and the nested bitfieldInserts
can be generated.

In general, this approach tries to match sequences such as

(X1 & C1) | (((X2 & C2) | (X3 & (~C1 | ~C3))))

and checks if this pattern is derived from

(X1 & C1) | (~C1 & ((X2 & C2) | (X3 & ~C3)))

by looking at the constants and checking if they are disjoint and
partitioning -1.

In such cases, it will try to generate the appropriate v_bfi instructions.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit TestsFailed

	Time	Test
	350 ms	x64 debian > LLVM.Examples/OrcV2Examples::lljit-with-thinlto-summaries.test

Event Timeline

tsymalla created this revision.Oct 21 2022, 1:50 AM

Herald added a project: Restricted Project. · View Herald TranscriptOct 21 2022, 1:50 AM

Herald added subscribers: kosarev, foad, kerbowa and 8 others. · View Herald Transcript

tsymalla requested review of this revision.Oct 21 2022, 1:50 AM

Herald added a project: Restricted Project. · View Herald TranscriptOct 21 2022, 1:50 AM

Herald added subscribers: llvm-commits, wdng. · View Herald Transcript

One inline comment on a comment :).

Slightly related: How do we handle (X1 & Y) | (X2 & ~Y) where Y is a variable?

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2164	It seems you support arbitrarily nested expressions, maybe mention here that the example with three clauses is just an example?

In D136432#3873859, @jsilvanus wrote:

One inline comment on a comment :).

Slightly related: How do we handle (X1 & Y) | (X2 & ~Y) where Y is a variable?

Thanks!
As such case cannot be combined into a BFI instruction, I guess it's relatively straight-forwardly translated into something like this:

v_and_b32 v0, x1, y ; v0 = X1 & Y
v_not_b32 v1, y ; v1 = ~Y
v_and_or_b32 v1, v1, x2, v0 ; v1 = (~Y & X2) | (X1 & Y)

As such case cannot be combined into a BFI instruction, I guess it's relatively straight-forwardly translated into something like this:

Why can't we, isn't this essentially the definition of BFI?

In D136432#3873879, @jsilvanus wrote:

As such case cannot be combined into a BFI instruction, I guess it's relatively straight-forwardly translated into something like this:

Why can't we, isn't this essentially the definition of BFI?

I think I misunderstood your question. If you are asking how generic BFIs are generated:
Such cases where we can prove that Y is used on both sides of the OR expression, will be currently transformed by InstCombineAndOrXor to the canonical form after the non-canonical form is created in the frontend. The sequence is then directly picked up and transformed during ISel into a single V_BFI. There's a pattern match for it.

I see, thanks!

So Select_BFI just implements special case handling of BFI matching with more complicated expressions that we can still work with (because selectors are constants), and the default case is handled elsewhere.
Maybe we should rename Select_BFI to reflect that?

In D136432#3873908, @jsilvanus wrote:

I see, thanks!

So Select_BFI just implements special case handling of BFI matching with more complicated expressions that we can still work with (because selectors are constants), and the default case is handled elsewhere.
Maybe we should rename Select_BFI to reflect that?

Maybe Selected_NestedBFI might be more accurate

Harbormaster completed remote builds in B193456: Diff 469509.Oct 21 2022, 3:11 AM

nhaehnle added inline comments.Oct 21 2022, 4:45 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2164	Seconded. The comment as-is is bad because it is too stuck in the original motivation for the code. Comments need to start at a high-level. We need to take a step back and think about what the generalized and high-level thing is we're doing here. And that is: Match a 32-bit expression of the form (X1 & C1) \| (X2 & C2) \| ... \| (Xk & Ck) where the Ci are constants and select it to a sequence of (k-1) V_BFI_B32 instructions if possible. That needs to be the headline of the comment here. (The original motivation can be mentioned later as a further detail.) What I glossed over is the bracketing. The code as-is only matches one specific possibility for introducing brackets, which is the one you happened to see in the motivation yous tarted with. But that's the kind of thing where we should do better by default. What if recursion is on the left instead of on the right? What if it is on both sides?
2202–2203	Can this case happen?
2220–2237	This whole code can be simplified by starting with `NewNode = BFIOps.back().X->getOperand(0)` and iterating over `reverse(makeArrayRef(BFIOps).drop_back())`
2239	Is this check necessary?

One more thought on longer expressions: Nicolai already mentioned that we currently only match a particular expression structure.
I wondered whether the expression structure has any performance impact?

For example, for four clauses, we could either generate BFI(BFI(X1, X2, C1), BFI(X3, X4, C3), C1 | C2) (i.e., a balanced binary tree) or BFI(BFI(BFI(X1, X2, C1), X3, C1 | C2), X4, C1 | C2 | C3) (i.e., a path).

The number of instructions is the same, but the balanced version has no data dependency between the two inner BFIs. Would that matter on our hardware?

foad added inline comments.Oct 21 2022, 5:23 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2160	Too many parens in this expression. The RHS should be just `((X2 & C2) \| (X3 & (~C1 \| ~C3)))`
2168	uint32_t probably makes more sense since you are matching _B32 operations throughout.
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
265	"SelectBFI" (no underscore) or "SelectV_BFI" for consistency.

In D136432#3874213, @jsilvanus wrote:

I wondered whether the expression structure has any performance impact?

For example, for four clauses, we could either generate BFI(BFI(X1, X2, C1), BFI(X3, X4, C3), C1 | C2) (i.e., a balanced binary tree) or BFI(BFI(BFI(X1, X2, C1), X3, C1 | C2), X4, C1 | C2 | C3) (i.e., a path).

The number of instructions is the same, but the balanced version has no data dependency between the two inner BFIs. Would that matter on our hardware?

It can have an impact because I don't think there's a fast forwarding path in any of our HW, but:

the impact is minor, and
making the best decision also requires looking at the schedule of the instructions producing the input values, which we can't do here

Hmm... creating a plain binary tree is simple enough. Maybe we *should* do that as long as we're here...

Also it would be good to do the same thing for GlobalISel (in a separate patch).

tsymalla marked 4 inline comments as done.Oct 21 2022, 7:47 AM

tsymalla marked 3 inline comments as done.Oct 21 2022, 7:57 AM

Hmm... creating a plain binary tree is simple enough. Maybe we *should* do that as long as we're here...

I would suggest not doing that. It is not clearly better - it is a tradeoff between latency and register pressure. And I don't think SelectionDAG tries to balance trees in simpler cases, like just nested ANDs or nested ORs.

Addressed review comments.
Simplified code and fixed comments.

Harbormaster completed remote builds in B193741: Diff 469896.Oct 22 2022, 8:14 AM

In D136432#3874692, @foad wrote:

Hmm... creating a plain binary tree is simple enough. Maybe we *should* do that as long as we're here...

I would suggest not doing that. It is not clearly better - it is a tradeoff between latency and register pressure. And I don't think SelectionDAG tries to balance trees in simpler cases, like just nested ANDs or nested ORs.

The situation is _slightly_ different because the BFI generation doesn't allow you to consider subtrees in isolation. If we want to generate optimal code in terms of number of instructions for ((x1 & c1) | (x2 & c2)) | ((x3 & c3) | (x4 & c4)), we can't select the subtrees separately (since they would fail the checks on the constants).

That said, it's a good point that we can't make an optimal decision here anyway and so it may be best to just go with the simplest thing, which is a chain of BFIs.

Thanks for the insights Nicolai and Jay!

I'm used to challenge latency all the time, less so other objectives like register pressure.

foad added inline comments.Oct 24 2022, 7:17 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2164	When you match this expression tree I think you need to check hasOneUse on every OR node - and possibly on the AND nodes too, but I'm not so sure about that.

Add use checks.

tsymalla marked an inline comment as done.Oct 24 2022, 10:12 AM

tsymalla added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2164	Thanks.
2220–2237	Thanks. Good pooint

Harbormaster completed remote builds in B193973: Diff 470201.Oct 24 2022, 11:55 AM

foad added inline comments.Oct 25 2022, 1:20 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2194	The top level OR can have multiple uses.
2215	Why? Is there any disadvantage to letting this function select a single BFI?

tsymalla marked an inline comment as done.Oct 25 2022, 1:46 AM

tsymalla added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2194	Yes. You are right. Thanks.
2215	Such case is handled by the TableGen patterns, so I don't want to handle the non-specialized case here (it would be possible, but I don't regard that as advantage in terms of code complexity).

tsymalla mentioned this in rG1f23cf4e5038: [NFC][AMDGPU] Pre-commit test for D136432.Oct 25 2022, 1:52 AM

Only require single uses on nested ORs

foad added inline comments.Oct 25 2022, 3:08 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2194	This is not the right place to put the hasOneUse check anyway. What would happen if the check failed? You would break out of the loop, and then generate wrong code. Also how can the check for `CurrentOr` ever fail?

tsymalla added inline comments.Oct 25 2022, 3:17 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2194	The right way to do so might be to just skip the whole function (to return false) if the OR gets used multiple times. Previously, it was possible for the CurrentOr check to fail, this is just a remnant. It currently works because the terminating condition for the loop is inside the non-OR handling branches of the main if which resides inside the loop. So either it continues looking for appropriate ORs or terminates once it found the final OR, AND, AND sequence. I'll simplify the whole thing and put another diff up.

Harbormaster completed remote builds in B194123: Diff 470414.Oct 25 2022, 4:35 AM

Change the structure of the main fetching loop.

Harbormaster completed remote builds in B194147: Diff 470447.Oct 25 2022, 8:33 AM

Fix the lit tests.
The restructuring of the main loop caused issues, because now it was
also possible that the the top-level OR (the one passed to selectV_BFI)
is matched, which is not intended here.

Harbormaster completed remote builds in B194422: Diff 470828.Oct 26 2022, 11:34 AM

nhaehnle added inline comments.Oct 27 2022, 1:12 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203	There is still a weird asymmetry here. Do you have a reason for treating the two operands differently? If not, you should treat them the same.

tsymalla added inline comments.Oct 27 2022, 2:54 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203	The reason is that no transformation will happen if one of the operands doesn't fit the scheme. This is the case if a) On the top level, a OR(AND, AND) pattern occurs (this is handled by ISel) b) the second operand is no AND in either case (handled by the top-level O1.getOpcode() check) c) A nested OR operand is used more than once - in that case, it cannot be optimized away d) Any of the AND operands is not a constant value, e. g. doesn't contribute to the mask. e) Any of the ANDs is being used more than once - in that case, it cannot be optimized away f) If none of the assumptions of the structure can be made, just return false. It simplifies things if the operands are handled differently. Even if, for instance, the InsertBFIOp(O1) could theoretically be hoisted out, this complicates things at the BFI generation part later, because now the base and the order at the tail of the vector are swapped, making it impossible to just assign the back (base) to `NewNode` and drop it. So, this loop basically relies on the state updated from the previous iteration and uses CurrentOr as controlling element.

nhaehnle added inline comments.Oct 28 2022, 7:00 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203	b) the second operand is no AND in either case (handled by the top-level O1.getOpcode() check) What's the justification for this rule? Couldn't you have `N = (x1 & c1) \| ((x2 & c2) \| (x3 & c3))`?

tsymalla added inline comments.Oct 28 2022, 7:48 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203	In general, you can, but looking at existing tests, is it a) likely this will happen and b) all the other assumptions about the constants are true - so the compiler will actually create some BFI instructions from the sequence? I can change the implementation to treat the operands equally, but for now, is it likely that this only includes additional checks and memory usage, but will never contribute to the generated code at this point?

nhaehnle added inline comments.Oct 28 2022, 8:02 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203	You're asking the question the wrong way around. The code that we are inspecting manually is a microscopically tiny fraction of the total amount of code that our compilers ever see. If there is an obvious generalization like this one that is easy to do, we should always take it. And when we notice it, we should add lit tests to ensure the case is covered. Yes, there are additional checks for the generalization. What additional memory usage is there?

tsymalla added inline comments.Oct 28 2022, 8:07 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203	You're asking the question the wrong way around. The code that we are inspecting manually is a microscopically tiny fraction of the total amount of code that our compilers ever see. If there is an obvious generalization like this one that is easy to do, we should always take it. And when we notice it, we should add lit tests to ensure the case is covered. Yes, there are additional checks for the generalization. What additional memory usage is there? Storing multiple times the bytes for the `SDNode *` and the mask inside the small vector for each OR having an AND as operand even if its likely that the memory will get released in a lot of cases. Sure, the overhead is neglectable, but I at least wanted to talk about that. I'll just add the generalization in the next diff.

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203	The generalization would be to accept arbitrary OR-trees of AND clauses of the form `(x_i & c_i)` or `(c_i & x_i)`?

tsymalla added inline comments.Oct 28 2022, 9:55 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203	Yes. The generalization would be to have the constant operands be in arbitrary order for a given AND, and also to have the nested ORs be at any position for a given OR node.

Re-structure loop once again.
Operands can now be on arbitrary positions.

jsilvanus added inline comments.Oct 31 2022, 1:06 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203	You still artificially require at least one operand of each OR to be an AND. However, we can have an arbitrary OR-tree that you would need to traverse using a stack to collect it's leaves -- the AND nodes. For example, currently you reject `N = ((x1 & c1) \| (x2 & c2)) \| ((x3 & c3) \| (x4 & c4))` which can be mapped to three BFIs.

Harbormaster completed remote builds in B195201: Diff 471910.Oct 31 2022, 3:07 AM

tsymalla added inline comments.Nov 2 2022, 5:42 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2196–2203	I am working on that, but it's not trivial to achieve that functionality. To have all possible BFIs in that tree being match, everything has to be processed in a single pass. If any of the disjunction test fails, another more generic matcher might kick in and generate a V_OR, V_OR3 or V_AND node.

tsymalla mentioned this in rGd0133bf86dbc: [NFC][AMDGPU] Pre-commit tests for D136432..Nov 9 2022, 2:24 AM

This change handles most of the tree structures, but not
all of them, because this would add way more complexity.

This uses two separate stacks, one as "working list"
to traverse the ORs, and one as stack to collect all
possible ORs. This is done so the top-level OR is not
required to be processed separately.

This also adds divergence checks as they appear in the
standard BFI pattern matching, because we don't have a
scalar BFI instruction and don't want to utilize the VALU
for several scalar instructions (such patterns currently
appear in unrelated tests).

This also handles the mask and base appearing at
arbitrary positions of the AND instructions.

It also utilizes the SDLoc of the original OR instructions
for inserting the new DAG nodes, so the ordering might be
a bit better at the end.

The boolean return values of CreateBFIOpFromAnd were
replaced by a binary-value enum class, which
contains more descriptive entries. In some cases, we just
want to skip the current entry instead of returning false.

This also adds some more tests for changed operand order or
matching a unbalanced tree structure.

Harbormaster completed remote builds in B196870: Diff 474222.Nov 9 2022, 4:43 AM

Fixed error on Windows build and whitespace issues.

Harbormaster completed remote builds in B197075: Diff 474536.Nov 10 2022, 8:07 AM

tsymalla mentioned this in rG5f77ee4011e6: [NFC][AMDGPU] Pre-commit tests for D136432.Nov 29 2022, 3:58 AM

Added test to show the isDivergence behavior (copied from another test).
Changed run line to not test a specific architecture. This adds a few
s_movs to the lit tests, but shows the behavior of the isDivergence test
(the GFX10 behavior would be different).

Harbormaster completed remote builds in B199979: Diff 478514.Nov 29 2022, 5:04 AM

Ping.
If this is getting too complicated, we could still think about moving to intrinsics.

Sorry, this dropped off my radar for a while.

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2165–2168	Despite us having repeatedly told you so, this is still missing a type check that N is a 32-bit integer. I suspect you may be getting lucky because other types are being legalized away? But you shouldn't really on it.
2172	This can just be an uint32_t and doesn't need to be an Optional: if the MaskOperand is not a constant, this struct shouldn't really be constructed in the first place. (It would be cleaner to inline ExtractMask into CreateBFIOpFromAnd, since ExtractMask isn't a particularly load-bearing abstraction here.)
2175–2178	You don't really need the `Or` in here. (It's only used to get the value type, but why -- you know that this only works with MVT::i32)
2229–2230	You can use `OrStack.pop_back_val()`.
2235–2262	I think the logic here is needlessly complicated and hiding what is really happening in the cases where it's correct, and it is subtly incorrect in some cases. For example, consider: N = (x & C) \| ((y & ~C) \| (z \|* w)) -- The \|* node has more than one use In the first loop, the two outermost OR nodes are recorded in OrCandidates, the innermost one isn't. Then, the loop over OrCandidates will collect the BFIOps (x, C) and (y, ~C) and proceed to replace the whole expression by a single BFI, which is incorrect. It would be cleaner logic (and therefore easier to get correct!) to just merge the two loops into one, by replacing the core part of the top loop with something like: if (O0->getOpcode() == ISD::OR) { if (!O0->isDivergent() \|\| !O0->hasOneUse()) return false; OrStack.push_back(O0.getNode()); } else if (!CreateBFIOpFromAnd(O0)) { return false; } if (O1->getOpcode() == ISD::OR) { if (!O1->isDivergent() \|\| !O1->hasOneUse()) return false; OrStack.push_back(O1.getNode()); } else if (!CreateBFIOpFromAnd(O1)) { return false; } (This implies that CreateBFIOpFromAnd is adjusted to return false if it wasn't able to recognize the given node as a BFI operand.)

tsymalla added inline comments.Dec 12 2022, 1:15 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2175–2178	That is not entirely correct. It is also used to retrieve the original SDLoc of the (nested) Or. So I think it can stay.

tsymalla abandoned this revision.Mar 31 2023, 12:12 AM

tsymalla mentioned this in D154422: [AMDGPU] Add new BFI intrinsic.Jul 4 2023, 2:58 AM

Revision Contents

Path

Size

llvm/

lib/

Target/

AMDGPU/

AMDGPUISelDAGToDAG.h

1 line

AMDGPUISelDAGToDAG.cpp

135 lines

test/

CodeGen/

AMDGPU/

bfi_nested.ll

44 lines

Diff 478514

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h

Show First 20 Lines • Show All 256 Lines • ▼ Show 20 Lines	private:
void SelectUADDO_USUBO(SDNode *N);		void SelectUADDO_USUBO(SDNode *N);
void SelectDIV_SCALE(SDNode *N);		void SelectDIV_SCALE(SDNode *N);
void SelectMAD_64_32(SDNode *N);		void SelectMAD_64_32(SDNode *N);
void SelectMUL_LOHI(SDNode *N);		void SelectMUL_LOHI(SDNode *N);
void SelectFMA_W_CHAIN(SDNode *N);		void SelectFMA_W_CHAIN(SDNode *N);
void SelectFMUL_W_CHAIN(SDNode *N);		void SelectFMUL_W_CHAIN(SDNode *N);
SDNode *getBFE32(bool IsSigned, const SDLoc &DL, SDValue Val, uint32_t Offset,		SDNode *getBFE32(bool IsSigned, const SDLoc &DL, SDValue Val, uint32_t Offset,
uint32_t Width);		uint32_t Width);
		bool SelectV_BFI(SDNode *N);
		foadUnsubmitted Done Reply Inline Actions "SelectBFI" (no underscore) or "SelectV_BFI" for consistency. foad: "SelectBFI" (no underscore) or "SelectV_BFI" for consistency.
void SelectS_BFEFromShifts(SDNode *N);		void SelectS_BFEFromShifts(SDNode *N);
void SelectS_BFE(SDNode *N);		void SelectS_BFE(SDNode *N);
bool isCBranchSCC(const SDNode *N) const;		bool isCBranchSCC(const SDNode *N) const;
void SelectBRCOND(SDNode *N);		void SelectBRCOND(SDNode *N);
void SelectFMAD_FMA(SDNode *N);		void SelectFMAD_FMA(SDNode *N);
void SelectDSAppendConsume(SDNode *N, unsigned IntrID);		void SelectDSAppendConsume(SDNode *N, unsigned IntrID);
void SelectDSBvhStackIntrinsic(SDNode *N);		void SelectDSBvhStackIntrinsic(SDNode *N);
void SelectDS_GWS(SDNode *N, unsigned IntrID);		void SelectDS_GWS(SDNode *N, unsigned IntrID);
Show All 11 Lines

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp

Show First 20 Lines • Show All 668 Lines • ▼ Show 20 Lines	void AMDGPUDAGToDAGISel::Select(SDNode *N) {
case ISD::SRL:		case ISD::SRL:
case ISD::SRA:		case ISD::SRA:
case ISD::SIGN_EXTEND_INREG:		case ISD::SIGN_EXTEND_INREG:
if (N->getValueType(0) != MVT::i32)		if (N->getValueType(0) != MVT::i32)
break;		break;

SelectS_BFE(N);		SelectS_BFE(N);
return;		return;
		case ISD::OR:
		if (SelectV_BFI(N))
		return;

		break;
case ISD::BRCOND:		case ISD::BRCOND:
SelectBRCOND(N);		SelectBRCOND(N);
return;		return;
case ISD::FMAD:		case ISD::FMAD:
case ISD::FMA:		case ISD::FMA:
SelectFMAD_FMA(N);		SelectFMAD_FMA(N);
return;		return;
case AMDGPUISD::CVT_PKRTZ_F16_F32:		case AMDGPUISD::CVT_PKRTZ_F16_F32:
▲ Show 20 Lines • Show All 1,456 Lines • ▼ Show 20 Lines	if (0 < BVal && BVal <= CVal && CVal < 32) {
ReplaceNode(N, getBFE32(Signed, SDLoc(N), Shl.getOperand(0), CVal - BVal,		ReplaceNode(N, getBFE32(Signed, SDLoc(N), Shl.getOperand(0), CVal - BVal,
32 - CVal));		32 - CVal));
return;		return;
}		}
}		}
SelectCode(N);		SelectCode(N);
}		}

		// Try matching a 32-bit expression
		// (X_1 & C_1) \| (X_2 & C_2) \| ... \| (X_n & C_n)
		// with the constants C_i. If matching succeeds, convert
		// it to a sequence of nested V_BFI instructions, if possible.
		// Creating the canonical bitfieldInsert form in the frontend
		// leads to simplification of the used bitmasks through SimplifyDemandedBits,
		// where one bitfieldInsert is the base for another bitfieldInsert.
		foadUnsubmitted Done Reply Inline Actions Too many parens in this expression. The RHS should be just `((X2 & C2) \| (X3 & (~C1 \| ~C3)))` foad: Too many parens in this expression. The RHS should be just `((X2 & C2) \| (X3 & (~C1 \| ~C3)))`
		// Originally, the intended expression looks like:
		// (X1 & C1) \| (~C1 & ((X2 & C2) \| (X3 & ~C3)))
		// However, after SimplifyDemandedBits ISel, the expression will look like:
		// (X1 & C1) \| ((X2 & C2) \| (X3 & (~C1 \| ~C3)))
		jsilvanusUnsubmitted Done Reply Inline Actions It seems you support arbitrarily nested expressions, maybe mention here that the example with three clauses is just an example? jsilvanus: It seems you support arbitrarily nested expressions, maybe mention here that the example with…
		nhaehnleUnsubmitted Done Reply Inline Actions Seconded. The comment as-is is bad because it is too stuck in the original motivation for the code. Comments need to start at a high-level. We need to take a step back and think about what the generalized and high-level thing is we're doing here. And that is: Match a 32-bit expression of the form (X1 & C1) \| (X2 & C2) \| ... \| (Xk & Ck) where the Ci are constants and select it to a sequence of (k-1) V_BFI_B32 instructions if possible. That needs to be the headline of the comment here. (The original motivation can be mentioned later as a further detail.) What I glossed over is the bracketing. The code as-is only matches one specific possibility for introducing brackets, which is the one you happened to see in the motivation yous tarted with. But that's the kind of thing where we should do better by default. What if recursion is on the left instead of on the right? What if it is on both sides? nhaehnle: Seconded. The comment as-is is bad because it is too stuck in the original motivation for the…
		foadUnsubmitted Done Reply Inline Actions When you match this expression tree I think you need to check hasOneUse on every OR node - and possibly on the AND nodes too, but I'm not so sure about that. foad: When you match this expression tree I think you need to check hasOneUse on every OR node - and…
		tsymallaAuthorUnsubmitted Done Reply Inline Actions Thanks. tsymalla: Thanks.
		bool AMDGPUDAGToDAGISel::SelectV_BFI(SDNode *N) {
		if (!N->isDivergent())
		return false;

		foadUnsubmitted Done Reply Inline Actions uint32_t probably makes more sense since you are matching _B32 operations throughout. foad: uint32_t probably makes more sense since you are matching _B32 operations throughout.
		nhaehnleUnsubmitted Not Done Reply Inline Actions Despite us having repeatedly told you so, this is still missing a type check that N is a 32-bit integer. I suspect you may be getting lucky because other types are being legalized away? But you shouldn't really on it. nhaehnle: Despite us having repeatedly told you so, this is still missing a type check that N is…
		struct MaskData {
		SDNode *BaseOperand;
		SDNode *MaskOperand;
		llvm::Optional<uint64_t> Mask;
		nhaehnleUnsubmitted Not Done Reply Inline Actions This can just be an uint32_t and doesn't need to be an Optional: if the MaskOperand is not a constant, this struct shouldn't really be constructed in the first place. (It would be cleaner to inline ExtractMask into CreateBFIOpFromAnd, since ExtractMask isn't a particularly load-bearing abstraction here.) nhaehnle: This can just be an uint32_t and doesn't need to be an Optional: if the MaskOperand is not a…
		};

		struct BFIOperandData {
		SDNode *Or;
		MaskData Mask;
		};
		nhaehnleUnsubmitted Not Done Reply Inline Actions You don't really need the `Or` in here. (It's only used to get the value type, but why -- you know that this only works with MVT::i32) nhaehnle: You don't really need the `Or` in here. (It's only used to get the value type, but why -- you…
		tsymallaAuthorUnsubmitted Done Reply Inline Actions That is not entirely correct. It is also used to retrieve the original SDLoc of the (nested) Or. So I think it can stay. tsymalla: That is not entirely correct. It is also used to retrieve the original SDLoc of the (nested) Or.

		SmallVector<BFIOperandData, 3> BFIOps;
		uint64_t Disjunction = 0;

		auto ExtractMask = [](SDNode *And) -> MaskData {
		ConstantSDNode *MaskOperand = dyn_cast<ConstantSDNode>(And->getOperand(0));
		SDNode *BaseOperand = And->getOperand(1).getNode();
		if (!MaskOperand) {
		BaseOperand = And->getOperand(0).getNode();
		MaskOperand = dyn_cast<ConstantSDNode>(And->getOperand(1));
		}
		if (!MaskOperand)
		return {};

		return {BaseOperand, MaskOperand, MaskOperand->getZExtValue()};
		};
		foadUnsubmitted Not Done Reply Inline Actions The top level OR can have multiple uses. foad: The top level OR can have multiple uses.
		tsymallaAuthorUnsubmitted Done Reply Inline Actions Yes. You are right. Thanks. tsymalla: Yes. You are right. Thanks.
		foadUnsubmitted Not Done Reply Inline Actions This is not the right place to put the hasOneUse check anyway. What would happen if the check failed? You would break out of the loop, and then generate wrong code. Also how can the check for `CurrentOr` ever fail? foad: This is not the right place to put the hasOneUse check anyway. What would happen if the check…
		tsymallaAuthorUnsubmitted Done Reply Inline Actions The right way to do so might be to just skip the whole function (to return false) if the OR gets used multiple times. Previously, it was possible for the CurrentOr check to fail, this is just a remnant. It currently works because the terminating condition for the loop is inside the non-OR handling branches of the main if which resides inside the loop. So either it continues looking for appropriate ORs or terminates once it found the final OR, AND, AND sequence. I'll simplify the whole thing and put another diff up. tsymalla: The right way to do so might be to just skip the whole function (to return false) if the OR…

		// Check if this is an AND with a constant operand.
		// Check if the mask from the constant operand fulfills several
		// conditions and update the total mask as well as saving the operand
		// nodes.
		enum class BFICheckStatus : uint8_t { Proceed, Abort };
		auto CreateBFIOpFromAnd = [&](SDNode Or, SDNode And) -> BFICheckStatus {
		if (And->getOpcode() != ISD::AND)
		return BFICheckStatus::Proceed;
		nhaehnleUnsubmitted Done Reply Inline Actions Can this case happen? nhaehnle: Can this case happen?
		nhaehnleUnsubmitted Not Done Reply Inline Actions There is still a weird asymmetry here. Do you have a reason for treating the two operands differently? If not, you should treat them the same. nhaehnle: There is still a weird asymmetry here. Do you have a reason for treating the two operands…
		tsymallaAuthorUnsubmitted Done Reply Inline Actions The reason is that no transformation will happen if one of the operands doesn't fit the scheme. This is the case if a) On the top level, a OR(AND, AND) pattern occurs (this is handled by ISel) b) the second operand is no AND in either case (handled by the top-level O1.getOpcode() check) c) A nested OR operand is used more than once - in that case, it cannot be optimized away d) Any of the AND operands is not a constant value, e. g. doesn't contribute to the mask. e) Any of the ANDs is being used more than once - in that case, it cannot be optimized away f) If none of the assumptions of the structure can be made, just return false. It simplifies things if the operands are handled differently. Even if, for instance, the InsertBFIOp(O1) could theoretically be hoisted out, this complicates things at the BFI generation part later, because now the base and the order at the tail of the vector are swapped, making it impossible to just assign the back (base) to `NewNode` and drop it. So, this loop basically relies on the state updated from the previous iteration and uses CurrentOr as controlling element. tsymalla: The reason is that no transformation will happen if one of the operands doesn't fit the scheme.
		nhaehnleUnsubmitted Not Done Reply Inline Actions b) the second operand is no AND in either case (handled by the top-level O1.getOpcode() check) What's the justification for this rule? Couldn't you have `N = (x1 & c1) \| ((x2 & c2) \| (x3 & c3))`? nhaehnle: > b) the second operand is no AND in either case (handled by the top-level O1.getOpcode()…
		tsymallaAuthorUnsubmitted Done Reply Inline Actions In general, you can, but looking at existing tests, is it a) likely this will happen and b) all the other assumptions about the constants are true - so the compiler will actually create some BFI instructions from the sequence? I can change the implementation to treat the operands equally, but for now, is it likely that this only includes additional checks and memory usage, but will never contribute to the generated code at this point? tsymalla: In general, you can, but looking at existing tests, is it a) likely this will happen and b) all…
		nhaehnleUnsubmitted Not Done Reply Inline Actions You're asking the question the wrong way around. The code that we are inspecting manually is a microscopically tiny fraction of the total amount of code that our compilers ever see. If there is an obvious generalization like this one that is easy to do, we should always take it. And when we notice it, we should add lit tests to ensure the case is covered. Yes, there are additional checks for the generalization. What additional memory usage is there? nhaehnle: You're asking the question the wrong way around. The code that we are inspecting manually is a…
		tsymallaAuthorUnsubmitted Done Reply Inline Actions You're asking the question the wrong way around. The code that we are inspecting manually is a microscopically tiny fraction of the total amount of code that our compilers ever see. If there is an obvious generalization like this one that is easy to do, we should always take it. And when we notice it, we should add lit tests to ensure the case is covered. Yes, there are additional checks for the generalization. What additional memory usage is there? Storing multiple times the bytes for the `SDNode ` and the mask inside the small vector for each OR having an AND as operand even if its likely that the memory will get released in a lot of cases. Sure, the overhead is neglectable, but I at least wanted to talk about that. I'll just add the generalization in the next diff. tsymalla:* > You're asking the question the wrong way around. > > The code that we are inspecting…
		jsilvanusUnsubmitted Not Done Reply Inline Actions The generalization would be to accept arbitrary OR-trees of AND clauses of the form `(x_i & c_i)` or `(c_i & x_i)`? jsilvanus: The generalization would be to accept arbitrary OR-trees of AND clauses of the form `(x_i &…
		tsymallaAuthorUnsubmitted Done Reply Inline Actions Yes. The generalization would be to have the constant operands be in arbitrary order for a given AND, and also to have the nested ORs be at any position for a given OR node. tsymalla: Yes. The generalization would be to have the constant operands be in arbitrary order for a…
		jsilvanusUnsubmitted Not Done Reply Inline Actions You still artificially require at least one operand of each OR to be an AND. However, we can have an arbitrary OR-tree that you would need to traverse using a stack to collect it's leaves -- the AND nodes. For example, currently you reject `N = ((x1 & c1) \| (x2 & c2)) \| ((x3 & c3) \| (x4 & c4))` which can be mapped to three BFIs. jsilvanus: You still artificially require at least one operand of each OR to be an AND. However, we can…
		tsymallaAuthorUnsubmitted Done Reply Inline Actions I am working on that, but it's not trivial to achieve that functionality. To have all possible BFIs in that tree being match, everything has to be processed in a single pass. If any of the disjunction test fails, another more generic matcher might kick in and generate a V_OR, V_OR3 or V_AND node. tsymalla: I am working on that, but it's not trivial to achieve that functionality. To have all possible…
		const auto MaskData = ExtractMask(And);
		if (!MaskData.Mask.has_value())
		return BFICheckStatus::Proceed;

		// The masks need to be disjoint
		if ((Disjunction & MaskData.Mask.value()) != 0)
		return BFICheckStatus::Abort;

		Disjunction \|= MaskData.Mask.value();
		BFIOps.push_back({Or, MaskData});
		return BFICheckStatus::Proceed;
		};
		foadUnsubmitted Not Done Reply Inline Actions Why? Is there any disadvantage to letting this function select a single BFI? foad: Why? Is there any disadvantage to letting this function select a single BFI?
		tsymallaAuthorUnsubmitted Done Reply Inline Actions Such case is handled by the TableGen patterns, so I don't want to handle the non-specialized case here (it would be possible, but I don't regard that as advantage in terms of code complexity). tsymalla: Such case is handled by the TableGen patterns, so I don't want to handle the non-specialized…

		// Collect all OR instructions, so the constants masks from the AND
		// instructions can be fetched later.
		SmallVector<SDNode *, 2> OrStack;
		SmallVector<SDNode *, 2> OrCandidates;
		OrStack.push_back(N);
		OrCandidates.push_back(N);
		auto IsCandidate = [](SDValue Or) -> bool {
		return Or->isDivergent() && Or->getOpcode() == ISD::OR && Or->hasOneUse();
		};

		// Push all possible OR candidates including the function argument on a stack
		while (!OrStack.empty()) {
		SDNode *Current = OrStack.back();
		OrStack.pop_back();
		nhaehnleUnsubmitted Not Done Reply Inline Actions You can use `OrStack.pop_back_val()`. nhaehnle: You can use `OrStack.pop_back_val()`.

		SDValue O0 = Current->getOperand(0);
		SDValue O1 = Current->getOperand(1);

		if (IsCandidate(O0)) {
		OrStack.push_back(O0.getNode());
		OrCandidates.push_back(O0.getNode());
		nhaehnleUnsubmitted Done Reply Inline Actions This whole code can be simplified by starting with `NewNode = BFIOps.back().X->getOperand(0)` and iterating over `reverse(makeArrayRef(BFIOps).drop_back())` nhaehnle: This whole code can be simplified by starting with `NewNode = BFIOps.back().X->getOperand(0)`…
		tsymallaAuthorUnsubmitted Done Reply Inline Actions Thanks. Good pooint tsymalla: Thanks. Good pooint
		}

		nhaehnleUnsubmitted Done Reply Inline Actions Is this check necessary? nhaehnle: Is this check necessary?
		if (IsCandidate(O1)) {
		OrStack.push_back(O1.getNode());
		OrCandidates.push_back(O1.getNode());
		}

		if (O0.getOpcode() != ISD::OR && O1.getOpcode() != ISD::OR)
		break;
		}

		// Recording the AND operands and bitmasks and return false,
		// if there was a failure (inappropriate bitmask).
		// This happens as a separate step so the top-level node does not
		// need to be treated separately.
		while (!OrCandidates.empty()) {
		SDNode *Current = OrCandidates.back();
		OrCandidates.pop_back();
		SDValue LHS = Current->getOperand(0);
		SDValue RHS = Current->getOperand(1);

		if (CreateBFIOpFromAnd(Current, RHS.getNode()) == BFICheckStatus::Abort \|\|
		CreateBFIOpFromAnd(Current, LHS.getNode()) == BFICheckStatus::Abort)
		return false;
		}
		nhaehnleUnsubmitted Not Done Reply Inline Actions I think the logic here is needlessly complicated and hiding what is really happening in the cases where it's correct, and it is subtly incorrect in some cases. For example, consider: N = (x & C) \| ((y & ~C) \| (z \|* w)) -- The \|* node has more than one use In the first loop, the two outermost OR nodes are recorded in OrCandidates, the innermost one isn't. Then, the loop over OrCandidates will collect the BFIOps (x, C) and (y, ~C) and proceed to replace the whole expression by a single BFI, which is incorrect. It would be cleaner logic (and therefore easier to get correct!) to just merge the two loops into one, by replacing the core part of the top loop with something like: if (O0->getOpcode() == ISD::OR) { if (!O0->isDivergent() \|\| !O0->hasOneUse()) return false; OrStack.push_back(O0.getNode()); } else if (!CreateBFIOpFromAnd(O0)) { return false; } if (O1->getOpcode() == ISD::OR) { if (!O1->isDivergent() \|\| !O1->hasOneUse()) return false; OrStack.push_back(O1.getNode()); } else if (!CreateBFIOpFromAnd(O1)) { return false; } (This implies that CreateBFIOpFromAnd is adjusted to return false if it wasn't able to recognize the given node as a BFI operand.) nhaehnle: I think the logic here is needlessly complicated and hiding what is really happening in the…

		// The constants are required to be a partition of -1.
		if (Disjunction != 0xFFFFFFFF)
		return false;

		// Transform the operands node by node and replace the original node.
		// Assume the last node is the base for BFI insertion.
		SDNode *BaseNode = BFIOps.front().Mask.BaseOperand;
		assert(BaseNode && "Cannot use non-existing node as base!");
		for (auto &BFITuple : makeArrayRef(BFIOps).drop_front()) {
		BaseNode = CurDAG->getMachineNode(
		AMDGPU::V_BFI_B32_e64, SDLoc(BFITuple.Or), BFITuple.Or->getValueType(0),
		{SDValue(BFITuple.Mask.MaskOperand, 0),
		SDValue(BFITuple.Mask.BaseOperand, 0), SDValue(BaseNode, 0)});
		}

		ReplaceNode(N, BaseNode);

		return true;
		}

void AMDGPUDAGToDAGISel::SelectS_BFE(SDNode *N) {		void AMDGPUDAGToDAGISel::SelectS_BFE(SDNode *N) {
switch (N->getOpcode()) {		switch (N->getOpcode()) {
case ISD::AND:		case ISD::AND:
if (N->getOperand(0).getOpcode() == ISD::SRL) {		if (N->getOperand(0).getOpcode() == ISD::SRL) {
// "(a srl b) & mask" ---> "BFE_U32 a, b, popcount(mask)"		// "(a srl b) & mask" ---> "BFE_U32 a, b, popcount(mask)"
// Predicate: isMask(mask)		// Predicate: isMask(mask)
const SDValue &Srl = N->getOperand(0);		const SDValue &Srl = N->getOperand(0);
ConstantSDNode *Shift = dyn_cast<ConstantSDNode>(Srl.getOperand(1));		ConstantSDNode *Shift = dyn_cast<ConstantSDNode>(Srl.getOperand(1));
▲ Show 20 Lines • Show All 864 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/bfi_nested.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefix=GCN %s			; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefix=GCN %s

	define float @v_bfi_single_nesting_level(float %x, float %y, float %z) {			define float @v_bfi_single_nesting_level(float %x, float %y, float %z) {
	; GCN-LABEL: v_bfi_single_nesting_level:			; GCN-LABEL: v_bfi_single_nesting_level:
	; GCN: ; %bb.0: ; %.entry			; GCN: ; %bb.0: ; %.entry
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: v_mul_f32_e32 v2, 0x447fc000, v2			; GCN-NEXT: v_mul_f32_e32 v2, 0x447fc000, v2
	; GCN-NEXT: v_cvt_u32_f32_e32 v1, v1			; GCN-NEXT: v_cvt_u32_f32_e32 v1, v1
				; GCN-NEXT: s_mov_b32 s4, 0xffc00
	; GCN-NEXT: v_mul_f32_e32 v0, 0x447fc000, v0			; GCN-NEXT: v_mul_f32_e32 v0, 0x447fc000, v0
	; GCN-NEXT: v_cvt_u32_f32_e32 v2, v2			; GCN-NEXT: v_cvt_u32_f32_e32 v2, v2
	; GCN-NEXT: v_lshlrev_b32_e32 v1, 10, v1			; GCN-NEXT: v_lshlrev_b32_e32 v1, 10, v1
	; GCN-NEXT: v_cvt_u32_f32_e32 v0, v0			; GCN-NEXT: v_cvt_u32_f32_e32 v0, v0
	; GCN-NEXT: v_and_b32_e32 v1, 0xffc00, v1			; GCN-NEXT: v_bfi_b32 v1, s4, v1, v2
	; GCN-NEXT: v_and_b32_e32 v2, 0xc00003ff, v2
	; GCN-NEXT: v_lshlrev_b32_e32 v0, 20, v0			; GCN-NEXT: v_lshlrev_b32_e32 v0, 20, v0
	; GCN-NEXT: v_or_b32_e32 v1, v1, v2			; GCN-NEXT: s_mov_b32 s4, 0x3ff00000
	; GCN-NEXT: v_and_b32_e32 v0, 0x3ff00000, v0			; GCN-NEXT: v_bfi_b32 v0, s4, v0, v1
	; GCN-NEXT: v_or_b32_e32 v0, v1, v0
	; GCN-NEXT: s_setpc_b64 s[30:31]			; GCN-NEXT: s_setpc_b64 s[30:31]
	.entry:			.entry:
	%mul.base = fmul reassoc nnan nsz arcp contract afn float %z, 1.023000e+03			%mul.base = fmul reassoc nnan nsz arcp contract afn float %z, 1.023000e+03
	%mul.base.i32 = fptoui float %mul.base to i32			%mul.base.i32 = fptoui float %mul.base to i32
	%y.i32 = fptoui float %y to i32			%y.i32 = fptoui float %y to i32
	%shl.inner.insert = shl i32 %y.i32, 10			%shl.inner.insert = shl i32 %y.i32, 10
	%bfi1.and = and i32 %shl.inner.insert, 1047552			%bfi1.and = and i32 %shl.inner.insert, 1047552
	%bfi1.andnot = and i32 %mul.base.i32, -1073740801			%bfi1.andnot = and i32 %mul.base.i32, -1073740801
	%bfi1.or = or i32 %bfi1.and, %bfi1.andnot			%bfi1.or = or i32 %bfi1.and, %bfi1.andnot
	%mul.outer.insert = fmul reassoc nnan nsz arcp contract afn float %x, 1.023000e+03			%mul.outer.insert = fmul reassoc nnan nsz arcp contract afn float %x, 1.023000e+03
	%mul.outer.insert.i32 = fptoui float %mul.outer.insert to i32			%mul.outer.insert.i32 = fptoui float %mul.outer.insert to i32
	%shl.outer.insert = shl i32 %mul.outer.insert.i32, 20			%shl.outer.insert = shl i32 %mul.outer.insert.i32, 20
	%and.outer = and i32 %shl.outer.insert, 1072693248			%and.outer = and i32 %shl.outer.insert, 1072693248
	%or.outer = or i32 %bfi1.or, %and.outer			%or.outer = or i32 %bfi1.or, %and.outer
	%result = bitcast i32 %or.outer to float			%result = bitcast i32 %or.outer to float
	ret float %result			ret float %result
	}			}

	define float @v_bfi_single_nesting_level_swapped_operands(float %x, float %y, float %z) {			define float @v_bfi_single_nesting_level_swapped_operands(float %x, float %y, float %z) {
	; GCN-LABEL: v_bfi_single_nesting_level_swapped_operands:			; GCN-LABEL: v_bfi_single_nesting_level_swapped_operands:
	; GCN: ; %bb.0: ; %.entry			; GCN: ; %bb.0: ; %.entry
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: v_mul_f32_e32 v2, 0x447fc000, v2			; GCN-NEXT: v_mul_f32_e32 v2, 0x447fc000, v2
	; GCN-NEXT: v_cvt_u32_f32_e32 v1, v1			; GCN-NEXT: v_cvt_u32_f32_e32 v1, v1
				; GCN-NEXT: s_mov_b32 s4, 0xffc00
	; GCN-NEXT: v_mul_f32_e32 v0, 0x447fc000, v0			; GCN-NEXT: v_mul_f32_e32 v0, 0x447fc000, v0
	; GCN-NEXT: v_cvt_u32_f32_e32 v2, v2			; GCN-NEXT: v_cvt_u32_f32_e32 v2, v2
	; GCN-NEXT: v_lshlrev_b32_e32 v1, 10, v1			; GCN-NEXT: v_lshlrev_b32_e32 v1, 10, v1
	; GCN-NEXT: v_cvt_u32_f32_e32 v0, v0			; GCN-NEXT: v_cvt_u32_f32_e32 v0, v0
	; GCN-NEXT: v_and_b32_e32 v1, 0xffc00, v1			; GCN-NEXT: v_bfi_b32 v1, s4, v1, v2
	; GCN-NEXT: v_and_b32_e32 v2, 0xc00003ff, v2
	; GCN-NEXT: v_lshlrev_b32_e32 v0, 20, v0			; GCN-NEXT: v_lshlrev_b32_e32 v0, 20, v0
	; GCN-NEXT: v_or_b32_e32 v1, v1, v2			; GCN-NEXT: s_mov_b32 s4, 0x3ff00000
	; GCN-NEXT: v_and_b32_e32 v0, 0x3ff00000, v0			; GCN-NEXT: v_bfi_b32 v0, s4, v0, v1
	; GCN-NEXT: v_or_b32_e32 v0, v0, v1
	; GCN-NEXT: s_setpc_b64 s[30:31]			; GCN-NEXT: s_setpc_b64 s[30:31]
	.entry:			.entry:
	%mul.base = fmul reassoc nnan nsz arcp contract afn float %z, 1.023000e+03			%mul.base = fmul reassoc nnan nsz arcp contract afn float %z, 1.023000e+03
	%mul.base.i32 = fptoui float %mul.base to i32			%mul.base.i32 = fptoui float %mul.base to i32
	%y.i32 = fptoui float %y to i32			%y.i32 = fptoui float %y to i32
	%shl.inner.insert = shl i32 %y.i32, 10			%shl.inner.insert = shl i32 %y.i32, 10
	%bfi1.and = and i32 1047552, %shl.inner.insert			%bfi1.and = and i32 1047552, %shl.inner.insert
	%bfi1.andnot = and i32 -1073740801, %mul.base.i32			%bfi1.andnot = and i32 -1073740801, %mul.base.i32
	%bfi1.or = or i32 %bfi1.and, %bfi1.andnot			%bfi1.or = or i32 %bfi1.and, %bfi1.andnot
	%mul.outer.insert = fmul reassoc nnan nsz arcp contract afn float %x, 1.023000e+03			%mul.outer.insert = fmul reassoc nnan nsz arcp contract afn float %x, 1.023000e+03
	%mul.outer.insert.i32 = fptoui float %mul.outer.insert to i32			%mul.outer.insert.i32 = fptoui float %mul.outer.insert to i32
	%shl.outer.insert = shl i32 %mul.outer.insert.i32, 20			%shl.outer.insert = shl i32 %mul.outer.insert.i32, 20
	%and.outer = and i32 %shl.outer.insert, 1072693248			%and.outer = and i32 %shl.outer.insert, 1072693248
	%or.outer = or i32 %and.outer, %bfi1.or			%or.outer = or i32 %and.outer, %bfi1.or
	%result = bitcast i32 %or.outer to float			%result = bitcast i32 %or.outer to float
	ret float %result			ret float %result
	}			}

	define float @v_bfi_single_nesting_level_unbalanced_subtree(float %x, float %y, float %z) {			define float @v_bfi_single_nesting_level_unbalanced_subtree(float %x, float %y, float %z) {
	; GCN-LABEL: v_bfi_single_nesting_level_unbalanced_subtree:			; GCN-LABEL: v_bfi_single_nesting_level_unbalanced_subtree:
	; GCN: ; %bb.0: ; %.entry			; GCN: ; %bb.0: ; %.entry
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: v_mul_f32_e32 v2, 0x447fc000, v2			; GCN-NEXT: v_mul_f32_e32 v2, 0x447fc000, v2
	; GCN-NEXT: v_cvt_u32_f32_e32 v1, v1			; GCN-NEXT: v_cvt_u32_f32_e32 v1, v1
				; GCN-NEXT: s_mov_b32 s4, 0xffc00
	; GCN-NEXT: v_mul_f32_e32 v0, 0x447fc000, v0			; GCN-NEXT: v_mul_f32_e32 v0, 0x447fc000, v0
				; GCN-NEXT: s_mov_b32 s5, 0xc000001f
	; GCN-NEXT: v_cvt_u32_f32_e32 v2, v2			; GCN-NEXT: v_cvt_u32_f32_e32 v2, v2
	; GCN-NEXT: v_lshlrev_b32_e32 v1, 10, v1			; GCN-NEXT: v_lshlrev_b32_e32 v1, 10, v1
	; GCN-NEXT: v_cvt_u32_f32_e32 v0, v0			; GCN-NEXT: v_cvt_u32_f32_e32 v0, v0
	; GCN-NEXT: v_and_b32_e32 v1, 0xffc00, v1			; GCN-NEXT: v_bfi_b32 v1, s4, v1, v2
	; GCN-NEXT: v_and_b32_e32 v3, 0x3e0, v2
	; GCN-NEXT: v_lshlrev_b32_e32 v0, 20, v0			; GCN-NEXT: v_lshlrev_b32_e32 v0, 20, v0
	; GCN-NEXT: v_or_b32_e32 v1, v1, v3			; GCN-NEXT: v_bfi_b32 v1, s5, v2, v1
	; GCN-NEXT: v_and_b32_e32 v0, 0x3ff00000, v0			; GCN-NEXT: s_mov_b32 s4, 0x3ff00000
	; GCN-NEXT: v_and_b32_e32 v2, 0xc000001f, v2			; GCN-NEXT: v_bfi_b32 v0, s4, v0, v1
	; GCN-NEXT: v_or_b32_e32 v1, v2, v1
	; GCN-NEXT: v_or_b32_e32 v0, v0, v1
	; GCN-NEXT: s_setpc_b64 s[30:31]			; GCN-NEXT: s_setpc_b64 s[30:31]
	.entry:			.entry:
	%mul.base = fmul reassoc nnan nsz arcp contract afn float %z, 1.023000e+03			%mul.base = fmul reassoc nnan nsz arcp contract afn float %z, 1.023000e+03
	%mul.base.i32 = fptoui float %mul.base to i32			%mul.base.i32 = fptoui float %mul.base to i32
	%y.i32 = fptoui float %y to i32			%y.i32 = fptoui float %y to i32
	%shl.inner.2.insert = shl i32 %y.i32, 10			%shl.inner.2.insert = shl i32 %y.i32, 10
	%bfi.inner.2.and.1 = and i32 %shl.inner.2.insert, 1047552			%bfi.inner.2.and.1 = and i32 %shl.inner.2.insert, 1047552
	%bfi.inner.2.and.2 = and i32 %mul.base.i32, 992			%bfi.inner.2.and.2 = and i32 %mul.base.i32, 992
	▲ Show 20 Lines • Show All 75 Lines • ▼ Show 20 Lines
	}			}

	define float @v_bfi_two_levels(float %x, float %y, float %z) {			define float @v_bfi_two_levels(float %x, float %y, float %z) {
	; GCN-LABEL: v_bfi_two_levels:			; GCN-LABEL: v_bfi_two_levels:
	; GCN: ; %bb.0: ; %.entry			; GCN: ; %bb.0: ; %.entry
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: v_cvt_u32_f32_e32 v1, v1			; GCN-NEXT: v_cvt_u32_f32_e32 v1, v1
	; GCN-NEXT: v_cvt_u32_f32_e32 v2, v2			; GCN-NEXT: v_cvt_u32_f32_e32 v2, v2
				; GCN-NEXT: s_movk_i32 s4, 0x3e0
				; GCN-NEXT: s_mov_b32 s5, 0xffc00
	; GCN-NEXT: v_mul_f32_e32 v0, 0x447fc000, v0			; GCN-NEXT: v_mul_f32_e32 v0, 0x447fc000, v0
	; GCN-NEXT: v_lshlrev_b32_e32 v3, 5, v1			; GCN-NEXT: v_lshlrev_b32_e32 v3, 5, v1
	; GCN-NEXT: v_and_b32_e32 v2, 0xc000001f, v2
	; GCN-NEXT: v_lshlrev_b32_e32 v1, 10, v1			; GCN-NEXT: v_lshlrev_b32_e32 v1, 10, v1
	; GCN-NEXT: v_cvt_u32_f32_e32 v0, v0			; GCN-NEXT: v_cvt_u32_f32_e32 v0, v0
	; GCN-NEXT: v_and_b32_e32 v3, 0x3e0, v3			; GCN-NEXT: v_bfi_b32 v2, s4, v3, v2
	; GCN-NEXT: v_and_b32_e32 v1, 0xffc00, v1
	; GCN-NEXT: v_lshlrev_b32_e32 v0, 20, v0			; GCN-NEXT: v_lshlrev_b32_e32 v0, 20, v0
	; GCN-NEXT: v_or_b32_e32 v2, v3, v2			; GCN-NEXT: v_bfi_b32 v1, s5, v1, v2
	; GCN-NEXT: v_or_b32_e32 v1, v2, v1			; GCN-NEXT: s_mov_b32 s4, 0x3ff00000
	; GCN-NEXT: v_and_b32_e32 v0, 0x3ff00000, v0			; GCN-NEXT: v_bfi_b32 v0, s4, v0, v1
	; GCN-NEXT: v_or_b32_e32 v0, v1, v0
	; GCN-NEXT: s_setpc_b64 s[30:31]			; GCN-NEXT: s_setpc_b64 s[30:31]
	.entry:			.entry:
	%y.i32 = fptoui float %y to i32			%y.i32 = fptoui float %y to i32
	%shl.insert.inner = shl i32 %y.i32, 5			%shl.insert.inner = shl i32 %y.i32, 5
	%and.insert.inner = and i32 %shl.insert.inner, 992			%and.insert.inner = and i32 %shl.insert.inner, 992
	%z.i32 = fptoui float %z to i32			%z.i32 = fptoui float %z to i32
	%base.inner = and i32 %z.i32, -1073741793			%base.inner = and i32 %z.i32, -1073741793
	%or.inner = or i32 %and.insert.inner , %base.inner			%or.inner = or i32 %and.insert.inner , %base.inner
	▲ Show 20 Lines • Show All 64 Lines • Show Last 20 Lines