Details

Reviewers

RKSimon
craig.topper
davezarzycki
spatel

Commits

rG7c3d6f5a1bf6: [X86] X86DAGToDAGISel::matchBEXTRFromAndImm(): if can't use BEXTR, fallback to…
rL372532: [X86] X86DAGToDAGISel::matchBEXTRFromAndImm(): if can't use BEXTR, fallback to…

Summary

PR43381 notes that while we are good at matching (X >> C1) & C2 as BEXTR/BEXTRI,
we only do that if we either have BEXTRI (TBM),
or if BEXTR is marked as being fast (-mattr=+fast-bextr).
In all other cases we don't match.

But BEXTR is fast mainly only on AMD CPUs.
However, for all the CPUs for which we have sched models,
BZHI is always fast (or the sched models are all bad).

So if we decide that it's unprofitable to emit BEXTR/BEXTRI,
we should fall back to BZHI if it is available,
and follow up with the shift.

While it's really tempting to do something because it's cool,
it is wise to first think about whether it actually makes sense to do.
We shouldn't just use BZHI because we can, but only if it is beneficial.
In particular, it isn't really worth it if the input is a register,
the mask is small, or all we gain is a folded load.
But it is worth it if the mask does not fit into 32 bits.
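
As a sanity check of the rewrite described above, here is a minimal C++ sketch (not part of the patch) that models BZHI in software and compares the original shift-then-mask form with the mask-then-shift fallback. The helper names `bzhi64`, `shiftThenMask` and `bzhiThenShift` are made up for illustration, and the model assumes the index stays within [0, 64].

```cpp
#include <cassert>
#include <cstdint>

// Software model of BZHI: keep the low `index` bits of `x`, zero the rest.
// Assumption of this sketch: `index` is a plain integer in [0, 64].
inline uint64_t bzhi64(uint64_t x, unsigned index) {
  assert(index <= 64 && "index out of range for this sketch");
  if (index >= 64)
    return x; // nothing to clear
  return x & ((uint64_t{1} << index) - 1);
}

// The original pattern: (x >> shift) & mask, where the mask consists of
// `maskSize` contiguous low bits.
inline uint64_t shiftThenMask(uint64_t x, unsigned shift, unsigned maskSize) {
  assert(shift < 64 && maskSize >= 1 && shift + maskSize <= 64);
  uint64_t mask =
      maskSize == 64 ? ~uint64_t{0} : (uint64_t{1} << maskSize) - 1;
  return (x >> shift) & mask;
}

// The fallback: mask first with a widened BZHI index, then apply the shift.
inline uint64_t bzhiThenShift(uint64_t x, unsigned shift, unsigned maskSize) {
  assert(shift < 64 && maskSize >= 1 && shift + maskSize <= 64);
  return bzhi64(x, shift + maskSize) >> shift;
}
```

For the PR43381 case (shift of 2, 33-bit mask), `bzhiThenShift(x, 2, 33)` keeps the low 35 bits of `x` and then shifts right by 2, which should agree with `shiftThenMask(x, 2, 33)` for any `x`.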

(Careful: I don't know much about Intel CPUs, so my choice of -mcpu may be bad here.)
Thus we manage to fold a load:
https://godbolt.org/z/Er0OQz
Or if we'd end up using BZHI anyway because the mask is large:
https://godbolt.org/z/dBJ_5h
But this isn't actually profitable in the general case,
e.g. here we'd increase the micro-op count
(register renaming is free; it seems mca does not model that):
https://godbolt.org/z/k6wFoz
Likewise, it is not worth it if all we get is load folding:
https://godbolt.org/z/1M1deG

https://bugs.llvm.org/show_bug.cgi?id=43381

Diff Detail

Repository: rL LLVM

Event Timeline

lebedev.ri created this revision. Sep 21 2019, 4:46 AM

Herald added a subscriber: hiraditya. Sep 21 2019, 4:46 AM

lebedev.ri edited the summary of this revision. Sep 21 2019, 4:46 AM

Fixup comment, NFC.

Neat! If you have the time, the BZHI bits-to-preserve operand only needs MOVB for initialization. That being said, MOVL probably avoids partial register update stalls, so maybe that’s why you’re seeing a performance gain.

In D67875#1677901, @davezarzycki wrote:

If you have the time, the BZHI bits-to-preserve operand only needs MOVB for initialization. That being said, MOVL probably avoids partial register update stalls, so maybe that’s why you’re seeing a performance gain.

Yes, it is intentional to use 32-bit writes to avoid partial register access.

Thanks for doing this work! I'm not an expert in this source code, so please wait for somebody else to approve it.

This revision is now accepted and ready to land. Sep 21 2019, 11:04 AM

craig.topper added inline comments. Sep 22 2019, 10:59 AM

llvm/test/CodeGen/X86/bmi-x86_64.ll
28 ↗	(On Diff #221170)	This doesn't look like an obvious improvement. The movq in the original code is basically free. So it was really 2 uops. The new code is 3 uops.

lebedev.ri added inline comments. Sep 22 2019, 11:17 AM

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
3441–3443 ↗	(On Diff #221170)	So, should we then fall back to BZHI only if we manage to fold the load?
llvm/test/CodeGen/X86/bmi-x86_64.ll
28 ↗	(On Diff #221170)	That is not what mca says; I guess it's not modelled in those sched models? Is there any particularly good Intel CPU sched model in LLVM that I should use as a reference? But yes, that is true.

craig.topper added a subscriber: andreadb. Sep 22 2019, 11:49 AM

craig.topper added inline comments.

llvm/test/CodeGen/X86/bmi-x86_64.ll
28 ↗	(On Diff #221170)	I don't think move elimination is modeled in any of the scheduler models. @andreadb is that right?

davezarzycki added inline comments. Sep 22 2019, 12:02 PM

llvm/test/CodeGen/X86/bmi-x86_64.ll
28 ↗	(On Diff #221170)	Hi @craig.topper – I think the original bug report about a load not being folded into BZHI is being lost in the noise of this change proposal.

craig.topper added inline comments. Sep 22 2019, 12:20 PM

llvm/test/CodeGen/X86/bmi-x86_64.ll
28 ↗	(On Diff #221170)	I think this proposed change is too general. The original case is only using BZHI to avoid a MOVABSQ to load the AND mask. So I think it's largely an i64 specific issue and we should probably handle it as such.

Diffusion mentioned this in rL372524: [NFC][X86] Add BEXTR test with load and 33-bit mask (PR43381 / D67875). Sep 22 2019, 12:38 PM

lebedev.ri mentioned this in rG24159592cac9: [NFC][X86] Add BEXTR test with load and 33-bit mask (PR43381 / D67875). Sep 22 2019, 12:38 PM

Scaling down the scope of the patch: if not using BEXTR, only use BZHI if either the mask is larger than 32 bits, or load folding will happen.

I'm not sure if it's okay to leave newly-created nodes like that if we decide the folding is not worthwhile?

lebedev.ri marked 5 inline comments as done. Sep 22 2019, 12:54 PM

What if we just do the larger than 32-bit mask? It's not clear that making BZHI just to fold a load is an improvement. You have to materialize an immediate instead, so the total uops increase.

In D67875#1678334, @craig.topper wrote:

What if we just do the larger than 32-bit mask? It's not clear that making BZHI just to fold a load is an improvement. You have to materialize an immediate instead, so the total uops increase.

Actually, I think it's clearly still not an improvement: https://godbolt.org/z/1M1deG

lebedev.ri updated this revision to Diff 221235. Sep 22 2019, 1:26 PM

lebedev.ri edited the summary of this revision.

andreadb added inline comments. Sep 22 2019, 1:35 PM

llvm/test/CodeGen/X86/bmi-x86_64.ll
28 ↗	(On Diff #221170)	Sorry for the late reply. I only saw the message now. Move elimination is currently only modelled for BtVer2 (Jaguar allows a limited form of move elimination for cases where the source operand is known to be zero). That being said, BtVer2 does not feature BMI2. Move elimination is only enabled for models that provide a definition of tablegen class RegisterFile and a definition of IsOptimizableRegisterMove.

In D67875#1678335, @lebedev.ri wrote:

In D67875#1678334, @craig.topper wrote:

What if we just do the larger than 32-bit mask? It's not clear that making BZHI just to fold a load is an improvement. You have to materialize an immediate instead, so the total uops increase.

Actually, I think it's clearly still not an improvement: https://godbolt.org/z/1M1deG

@craig.topper adjusted in last update

lebedev.ri requested review of this revision. Sep 22 2019, 2:39 PM

LGTM with those comment fixes.

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
3485 ↗	(On Diff #221235)	available*
3487 ↗	(On Diff #221235)	Stray period at the beginning of the comment.

This revision is now accepted and ready to land. Sep 22 2019, 2:54 PM

Fixup comments.

Closed by commit rL372532: [X86] X86DAGToDAGISel::matchBEXTRFromAndImm(): if can't use BEXTR, fallback to… (authored by lebedevri). Sep 22 2019, 3:05 PM

This revision was automatically updated to reflect the committed changes.

Diff 221242

llvm/trunk/lib/Target/X86/X86ISelDAGToDAG.cpp

[...] MachineSDNode *X86DAGToDAGISel::matchBEXTRFromAndImm(SDNode *Node) {
SDValue N1 = Node->getOperand(1);		SDValue N1 = Node->getOperand(1);

// If we have TBM we can use an immediate for the control. If we have BMI		// If we have TBM we can use an immediate for the control. If we have BMI
// we should only do this if the BEXTR instruction is implemented well.		// we should only do this if the BEXTR instruction is implemented well.
// Otherwise moving the control into a register makes this more costly.		// Otherwise moving the control into a register makes this more costly.
// TODO: Maybe load folding, greater than 32-bit masks, or a guarantee of LICM		// TODO: Maybe load folding, greater than 32-bit masks, or a guarantee of LICM
// hoisting the move immediate would make it worthwhile with a less optimal		// hoisting the move immediate would make it worthwhile with a less optimal
// BEXTR?		// BEXTR?
if (!Subtarget->hasTBM() &&		bool PreferBEXTR =
!(Subtarget->hasBMI() && Subtarget->hasFastBEXTR()))		Subtarget->hasTBM() || (Subtarget->hasBMI() && Subtarget->hasFastBEXTR());
		if (!PreferBEXTR && !Subtarget->hasBMI2())
return nullptr;		return nullptr;

// Must have a shift right.		// Must have a shift right.
if (N0->getOpcode() != ISD::SRL && N0->getOpcode() != ISD::SRA)		if (N0->getOpcode() != ISD::SRL && N0->getOpcode() != ISD::SRA)
return nullptr;		return nullptr;

// Shift can't have additional users.		// Shift can't have additional users.
if (!N0->hasOneUse())		if (!N0->hasOneUse())
[...] MachineSDNode *X86DAGToDAGISel::matchBEXTRFromAndImm(SDNode *Node) {
if (Shift == 8 && MaskSize == 8)		if (Shift == 8 && MaskSize == 8)
return nullptr;		return nullptr;

// Make sure we are only using bits that were in the original value, not		// Make sure we are only using bits that were in the original value, not
// shifted in.		// shifted in.
if (Shift + MaskSize > NVT.getSizeInBits())		if (Shift + MaskSize > NVT.getSizeInBits())
return nullptr;		return nullptr;

SDValue New = CurDAG->getTargetConstant(Shift | (MaskSize << 8), dl, NVT);		// BZHI, if available, is always fast, unlike BEXTR. But even if we decide
unsigned ROpc = NVT == MVT::i64 ? X86::BEXTRI64ri : X86::BEXTRI32ri;		// that we can't use BEXTR, it is only worthwhile using BZHI if the mask
unsigned MOpc = NVT == MVT::i64 ? X86::BEXTRI64mi : X86::BEXTRI32mi;		// does not fit into 32 bits. Load folding is not a sufficient reason.
		if (!PreferBEXTR && MaskSize <= 32)
		return nullptr;

		SDValue Control;
		unsigned ROpc, MOpc;

		if (!PreferBEXTR) {
		assert(Subtarget->hasBMI2() && "We must have BMI2's BZHI then.");
		// If we can't make use of BEXTR then we can't fuse shift+mask stages.
		// Let's perform the mask first, and apply shift later. Note that we need to
		// widen the mask to account for the fact that we'll apply shift afterwards!
		Control = CurDAG->getTargetConstant(Shift + MaskSize, dl, NVT);
		ROpc = NVT == MVT::i64 ? X86::BZHI64rr : X86::BZHI32rr;
		MOpc = NVT == MVT::i64 ? X86::BZHI64rm : X86::BZHI32rm;
		unsigned NewOpc = NVT == MVT::i64 ? X86::MOV32ri64 : X86::MOV32ri;
		Control = SDValue(CurDAG->getMachineNode(NewOpc, dl, NVT, Control), 0);
		} else {
		// The 'control' of BEXTR has the pattern of:
		// [15...8 bit][ 7...0 bit] location
		// [ bit count][ shift] name
		// I.e. 0b000000011'00000001 means (x >> 0b1) & 0b11
		Control = CurDAG->getTargetConstant(Shift | (MaskSize << 8), dl, NVT);
		if (Subtarget->hasTBM()) {
		ROpc = NVT == MVT::i64 ? X86::BEXTRI64ri : X86::BEXTRI32ri;
		MOpc = NVT == MVT::i64 ? X86::BEXTRI64mi : X86::BEXTRI32mi;
		} else {
		assert(Subtarget->hasBMI() && "We must have BMI1's BEXTR then.");
// BMI requires the immediate to placed in a register.		// BMI requires the immediate to placed in a register.
if (!Subtarget->hasTBM()) {
ROpc = NVT == MVT::i64 ? X86::BEXTR64rr : X86::BEXTR32rr;		ROpc = NVT == MVT::i64 ? X86::BEXTR64rr : X86::BEXTR32rr;
MOpc = NVT == MVT::i64 ? X86::BEXTR64rm : X86::BEXTR32rm;		MOpc = NVT == MVT::i64 ? X86::BEXTR64rm : X86::BEXTR32rm;
unsigned NewOpc = NVT == MVT::i64 ? X86::MOV32ri64 : X86::MOV32ri;		unsigned NewOpc = NVT == MVT::i64 ? X86::MOV32ri64 : X86::MOV32ri;
New = SDValue(CurDAG->getMachineNode(NewOpc, dl, NVT, New), 0);		Control = SDValue(CurDAG->getMachineNode(NewOpc, dl, NVT, Control), 0);
		}
}		}

MachineSDNode *NewNode;		MachineSDNode *NewNode;
SDValue Input = N0->getOperand(0);		SDValue Input = N0->getOperand(0);
SDValue Tmp0, Tmp1, Tmp2, Tmp3, Tmp4;		SDValue Tmp0, Tmp1, Tmp2, Tmp3, Tmp4;
if (tryFoldLoad(Node, N0.getNode(), Input, Tmp0, Tmp1, Tmp2, Tmp3, Tmp4)) {		if (tryFoldLoad(Node, N0.getNode(), Input, Tmp0, Tmp1, Tmp2, Tmp3, Tmp4)) {
SDValue Ops[] = { Tmp0, Tmp1, Tmp2, Tmp3, Tmp4, New, Input.getOperand(0) };		SDValue Ops[] = {
		Tmp0, Tmp1, Tmp2, Tmp3, Tmp4, Control, Input.getOperand(0)};
SDVTList VTs = CurDAG->getVTList(NVT, MVT::i32, MVT::Other);		SDVTList VTs = CurDAG->getVTList(NVT, MVT::i32, MVT::Other);
NewNode = CurDAG->getMachineNode(MOpc, dl, VTs, Ops);		NewNode = CurDAG->getMachineNode(MOpc, dl, VTs, Ops);
// Update the chain.		// Update the chain.
ReplaceUses(Input.getValue(1), SDValue(NewNode, 2));		ReplaceUses(Input.getValue(1), SDValue(NewNode, 2));
// Record the mem-refs		// Record the mem-refs
CurDAG->setNodeMemRefs(NewNode, {cast<LoadSDNode>(Input)->getMemOperand()});		CurDAG->setNodeMemRefs(NewNode, {cast<LoadSDNode>(Input)->getMemOperand()});
} else {		} else {
NewNode = CurDAG->getMachineNode(ROpc, dl, NVT, MVT::i32, Input, New);		NewNode = CurDAG->getMachineNode(ROpc, dl, NVT, MVT::i32, Input, Control);
		}

		if (!PreferBEXTR) {
		// We still need to apply the shift.
		SDValue ShAmt = CurDAG->getTargetConstant(Shift, dl, NVT);
		unsigned NewOpc = NVT == MVT::i64 ? X86::SHR64ri : X86::SHR32ri;
		NewNode =
		CurDAG->getMachineNode(NewOpc, dl, NVT, SDValue(NewNode, 0), ShAmt);
}		}

return NewNode;		return NewNode;
}		}

// Emit a PCMISTR(I/M) instruction.		// Emit a PCMISTR(I/M) instruction.
MachineSDNode *X86DAGToDAGISel::emitPCMPISTR(unsigned ROpc, unsigned MOpc,		MachineSDNode *X86DAGToDAGISel::emitPCMPISTR(unsigned ROpc, unsigned MOpc,
bool MayFoldLoad, const SDLoc &dl,		bool MayFoldLoad, const SDLoc &dl,
[...]
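
The order of the feature checks in the committed code can be restated as a small standalone function. This is only a sketch under assumptions: `Features`, `Strategy` and `pickStrategy` are hypothetical names that do not exist in LLVM, and the other early exits of `matchBEXTRFromAndImm()` (shift opcode, single use, the `Shift == 8 && MaskSize == 8` case, bits shifted in) are deliberately left out.

```cpp
#include <cassert>

// Hypothetical stand-in for the subtarget feature queries used by the patch.
struct Features {
  bool hasTBM = false;
  bool hasBMI = false;
  bool hasFastBEXTR = false;
  bool hasBMI2 = false;
};

enum class Strategy { None, BEXTRI, BEXTR, BZHIThenShift };

// Decide how (x >> Shift) & Mask would be selected for a mask made of
// `maskSize` contiguous low bits, following the order of checks in the diff.
inline Strategy pickStrategy(const Features &f, unsigned maskSize) {
  assert(maskSize >= 1 && maskSize <= 64);
  bool preferBEXTR = f.hasTBM || (f.hasBMI && f.hasFastBEXTR);
  if (!preferBEXTR && !f.hasBMI2)
    return Strategy::None; // neither a good BEXTR nor BZHI is available
  if (!preferBEXTR && maskSize <= 32)
    return Strategy::None; // BZHI only pays off for masks wider than 32 bits
  if (!preferBEXTR)
    return Strategy::BZHIThenShift;
  // TBM takes the control as an immediate; BMI1 needs it in a register.
  return f.hasTBM ? Strategy::BEXTRI : Strategy::BEXTR;
}
```

With BMI and BMI2 but no fast BEXTR, a 33-bit mask yields `Strategy::BZHIThenShift`, while an 8-bit mask yields `Strategy::None`, i.e. the existing lowering is left alone.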

llvm/trunk/test/CodeGen/X86/bmi-x86_64.ll

[...]
	; BMI1-SLOW: # %bb.0: # %entry			; BMI1-SLOW: # %bb.0: # %entry
	; BMI1-SLOW-NEXT: shrq $2, %rdi			; BMI1-SLOW-NEXT: shrq $2, %rdi
	; BMI1-SLOW-NEXT: movl $8448, %eax # imm = 0x2100			; BMI1-SLOW-NEXT: movl $8448, %eax # imm = 0x2100
	; BMI1-SLOW-NEXT: bextrq %rax, %rdi, %rax			; BMI1-SLOW-NEXT: bextrq %rax, %rdi, %rax
	; BMI1-SLOW-NEXT: retq			; BMI1-SLOW-NEXT: retq
	;			;
	; BMI2-SLOW-LABEL: bextr64d:			; BMI2-SLOW-LABEL: bextr64d:
	; BMI2-SLOW: # %bb.0: # %entry			; BMI2-SLOW: # %bb.0: # %entry
	; BMI2-SLOW-NEXT: shrq $2, %rdi			; BMI2-SLOW-NEXT: movl $35, %eax
	; BMI2-SLOW-NEXT: movb $33, %al
	; BMI2-SLOW-NEXT: bzhiq %rax, %rdi, %rax			; BMI2-SLOW-NEXT: bzhiq %rax, %rdi, %rax
				; BMI2-SLOW-NEXT: shrq $2, %rax
	; BMI2-SLOW-NEXT: retq			; BMI2-SLOW-NEXT: retq
	;			;
	; BEXTR-FAST-LABEL: bextr64d:			; BEXTR-FAST-LABEL: bextr64d:
	; BEXTR-FAST: # %bb.0: # %entry			; BEXTR-FAST: # %bb.0: # %entry
	; BEXTR-FAST-NEXT: movl $8450, %eax # imm = 0x2102			; BEXTR-FAST-NEXT: movl $8450, %eax # imm = 0x2102
	; BEXTR-FAST-NEXT: bextrq %rax, %rdi, %rax			; BEXTR-FAST-NEXT: bextrq %rax, %rdi, %rax
	; BEXTR-FAST-NEXT: retq			; BEXTR-FAST-NEXT: retq
	entry:			entry:
	%shr = lshr i64 %a, 2			%shr = lshr i64 %a, 2
	%and = and i64 %shr, 8589934591			%and = and i64 %shr, 8589934591
	ret i64 %and			ret i64 %and
	}			}

	define i64 @bextr64d_load(i64* %aptr) {			define i64 @bextr64d_load(i64* %aptr) {
	; BMI1-SLOW-LABEL: bextr64d_load:			; BMI1-SLOW-LABEL: bextr64d_load:
	; BMI1-SLOW: # %bb.0: # %entry			; BMI1-SLOW: # %bb.0: # %entry
	; BMI1-SLOW-NEXT: movq (%rdi), %rax			; BMI1-SLOW-NEXT: movq (%rdi), %rax
	; BMI1-SLOW-NEXT: shrq $2, %rax			; BMI1-SLOW-NEXT: shrq $2, %rax
	; BMI1-SLOW-NEXT: movl $8448, %ecx # imm = 0x2100			; BMI1-SLOW-NEXT: movl $8448, %ecx # imm = 0x2100
	; BMI1-SLOW-NEXT: bextrq %rcx, %rax, %rax			; BMI1-SLOW-NEXT: bextrq %rcx, %rax, %rax
	; BMI1-SLOW-NEXT: retq			; BMI1-SLOW-NEXT: retq
	;			;
	; BMI2-SLOW-LABEL: bextr64d_load:			; BMI2-SLOW-LABEL: bextr64d_load:
	; BMI2-SLOW: # %bb.0: # %entry			; BMI2-SLOW: # %bb.0: # %entry
	; BMI2-SLOW-NEXT: movq (%rdi), %rax			; BMI2-SLOW-NEXT: movl $35, %eax
				; BMI2-SLOW-NEXT: bzhiq %rax, (%rdi), %rax
	; BMI2-SLOW-NEXT: shrq $2, %rax			; BMI2-SLOW-NEXT: shrq $2, %rax
	; BMI2-SLOW-NEXT: movb $33, %cl
	; BMI2-SLOW-NEXT: bzhiq %rcx, %rax, %rax
	; BMI2-SLOW-NEXT: retq			; BMI2-SLOW-NEXT: retq
	;			;
	; BEXTR-FAST-LABEL: bextr64d_load:			; BEXTR-FAST-LABEL: bextr64d_load:
	; BEXTR-FAST: # %bb.0: # %entry			; BEXTR-FAST: # %bb.0: # %entry
	; BEXTR-FAST-NEXT: movl $8450, %eax # imm = 0x2102			; BEXTR-FAST-NEXT: movl $8450, %eax # imm = 0x2102
	; BEXTR-FAST-NEXT: bextrq %rax, (%rdi), %rax			; BEXTR-FAST-NEXT: bextrq %rax, (%rdi), %rax
	; BEXTR-FAST-NEXT: retq			; BEXTR-FAST-NEXT: retq
	entry:			entry:
[...]
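
The immediates in the CHECK lines above can be cross-checked against the IR constants with a few compile-time assertions. This is an illustrative sketch, not code from the patch: `trailingOnes` and `fitsInSExtImm32` are hypothetical helpers, and the 32-bit check assumes the usual x86-64 rule that a 64-bit AND takes a sign-extended 32-bit immediate.

```cpp
#include <cstdint>

// Number of contiguous set bits starting at bit 0.
constexpr unsigned trailingOnes(uint64_t v) {
  unsigned n = 0;
  while (v & 1) {
    ++n;
    v >>= 1;
  }
  return n;
}

// True if v can be encoded as a sign-extended 32-bit immediate.
constexpr bool fitsInSExtImm32(int64_t v) {
  return v >= INT32_MIN && v <= INT32_MAX;
}

constexpr uint64_t kMask = 8589934591; // from `and i64 %shr, 8589934591`
constexpr unsigned kShift = 2;         // from `lshr i64 %a, 2`
constexpr unsigned kMaskSize = trailingOnes(kMask);

static_assert(kMask == (uint64_t{1} << 33) - 1, "mask is the low 33 bits");
static_assert(kMaskSize == 33u, "mask size derived from the immediate");
static_assert(!fitsInSExtImm32(static_cast<int64_t>(kMask)),
              "a plain AND would need the mask materialized separately");
static_assert((kShift | (kMaskSize << 8)) == 8450u,
              "BEXTR-FAST control, 0x2102");
static_assert((0u | (kMaskSize << 8)) == 8448u,
              "BMI1-SLOW control after a separate shrq, 0x2100");
static_assert(kShift + kMaskSize == 35u,
              "BZHI index in the new BMI2-SLOW output");
```

The last assertion corresponds to the widened index the patch passes to BZHI (`Shift + MaskSize`), which is why the new output loads 35 rather than 33.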

This is an archive of the discontinued LLVM Phabricator instance.

[X86] X86DAGToDAGISel::matchBEXTRFromAndImm(): if can't use BEXTR, fallback to BZHI (PR43381)
ClosedPublic
