This is an archive of the discontinued LLVM Phabricator instance.

Differential D121147

[x86] try harder to use shift instead of test if it can save some immediate bytes
ClosedPublic

Authored by spatel on Mar 7 2022, 12:23 PM.

Details

Reviewers

pengfei
RKSimon
craig.topper
MatzeB

Commits

rG67e91510963a: [x86] try harder to use shift instead of test if it can save some immediate…

Summary

We favor 'and' and later 'test' in earlier phases, and that's usually the better option, but we can save a few instruction bytes by converting a mask constant to a shift here.
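As a rough illustration of why the conversion is safe when only the zero flag is consumed, the following standalone sketch models the zero-flag outcome of `test`, `shr`, and `shl`. It is not LLVM code: the helper names are hypothetical, and the two masks are the ones used by the cmp.ll tests in this revision. It assumes a nonzero shift amount below 64.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not part of LLVM). Each one models only the zero
// flag that the corresponding x86 instruction would produce.

// `test x, mask` sets ZF when (x & mask) == 0.
constexpr bool zfAfterTest(uint64_t X, uint64_t Mask) {
  return (X & Mask) == 0;
}

// `shr x, amt` with 0 < amt < 64 sets ZF when the shifted result is 0.
constexpr bool zfAfterShr(uint64_t X, unsigned Amt) { return (X >> Amt) == 0; }

// `shl x, amt` with 0 < amt < 64 sets ZF when the shifted result is 0.
constexpr bool zfAfterShl(uint64_t X, unsigned Amt) { return (X << Amt) == 0; }

// Masks from the cmp.ll tests: -1048576 and 1048575 as 64-bit values.
constexpr uint64_t HighMask = 0xFFFFFFFFFFF00000ULL; // 20 trailing zeros
constexpr uint64_t LowMask = 0x00000000000FFFFFULL;  // 44 leading zeros

// A mask that reaches the top bit: shifting right by the trailing-zero
// count discards exactly the bits the mask ignores.
static_assert(zfAfterTest(0xFFFFF, HighMask) == zfAfterShr(0xFFFFF, 20), "");
static_assert(zfAfterTest(0x100000, HighMask) == zfAfterShr(0x100000, 20), "");

// A mask that reaches the bottom bit: shifting left by the leading-zero
// count discards exactly the bits the mask ignores.
static_assert(zfAfterTest(0x100000, LowMask) == zfAfterShl(0x100000, 44), "");
static_assert(zfAfterTest(1, LowMask) == zfAfterShl(1, 44), "");
```

The other flags differ between `test` and a shift, which is presumably why the patch guards the transform with `onlyUsesZeroFlag`.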

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Mar 7 2022, 12:23 PM

Herald added a project: Restricted Project. Mar 7 2022, 12:23 PM

Herald added subscribers: hiraditya, mcrosier.

spatel requested review of this revision. Mar 7 2022, 12:23 PM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B152996: Diff 413587. Mar 7 2022, 1:11 PM

I like this, but I don't know if any Intel archs have slow shift-by-imm ops (slow eflags updates) like they do for non-immediate shift amounts?

In D121147#3365414, @RKSimon wrote:

I like this, but I don't know if any Intel archs have slow shift-by-imm ops (slow eflags updates) like they do for non-immediate shift amounts?

I think it should be ok. The decoders can see the shift amount and do some tricks to make the flags efficient.

craig.topper mentioned this in D121320: X86ISelDAGToDAG: Transform TEST + MOV64ri to SHR + TEST. Mar 9 2022, 11:45 AM

RKSimon added a reviewer: MatzeB. Mar 9 2022, 2:00 PM

RKSimon added inline comments.

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
5673	If we added shifted mask detection, would we hit any of the folds that D121320 is targeting?

MatzeB added inline comments. Mar 9 2022, 3:19 PM

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
5673	Most likely yes. Let me try to put my transformation into this part of the code...

I was thinking about the same thing, though looking around https://www.uops.info/table.html it seems that many CPUs can only schedule shifts on 1 or 2 of their ports, while `and` can typically be scheduled on all of them. I'm not really sure whether that means we should only perform this transformation when optimizing for code size, or whether we are unlikely to hit port constraints anyway and should always do it...

MatzeB added inline comments. Mar 9 2022, 4:06 PM

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
5673	I changed D121320 now to update this part of the code. However I think the cases are different enough that we can discuss this diff and D121320 separately.

spatel mentioned this in D121319: Tests for D121320. Mar 10 2022, 5:18 AM

MatzeB added inline comments. Mar 10 2022, 10:40 AM

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
5616–5617	Guess this is no longer matching just IMM64
5624	I think the `hasOneUse()` check makes sense for the cases where you are turning a `testl xx, IMM32` into a shift. However, for the IMM64 case we save a whole `movabsq` instruction, so even if we end up with an extra COPY it's a good deal.

spatel added inline comments. Mar 10 2022, 1:53 PM

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
5624	OK, we should add a test for that if it's worth doing. I think we should finalize D121320 first since that's the larger patch. Then we can add this as a small enhancement with the appropriate restrictions in place to avoid regressions. This patch is only saving 3 bytes. :)

Patch updated:
Rebased now that D121320 has landed. The checks are adjusted to fit the new code structure. We have negative tests in place to verify that we don't enable any size-increasing transforms.

Harbormaster completed remote builds in B154642: Diff 415899. Mar 16 2022, 10:37 AM

spatel marked 2 inline comments as done. Mar 16 2022, 10:38 AM

I have no hard numbers or experience on how to weigh smaller code size against port constraints. Given that there are no other opinions, my intuition is that code size is likely more valuable here.

LGTM

This revision is now accepted and ready to land. Mar 16 2022, 1:46 PM

This revision was landed with ongoing or failed builds. Mar 17 2022, 6:14 AM

Closed by commit rG67e91510963a: [x86] try harder to use shift instead of test if it can save some immediate… (authored by spatel).

This revision was automatically updated to reflect the committed changes.

spatel added a commit: rG67e91510963a: [x86] try harder to use shift instead of test if it can save some immediate….

Revision Contents

Path	Size
llvm/lib/Target/X86/X86ISelDAGToDAG.cpp	23 lines
llvm/test/CodeGen/X86/cmp.ll	6 lines

Diff 416159

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp

[5,607 preceding lines not shown. Enclosing context: `if (N0.getOpcode() == ISD::AND && N0.getNode()->hasOneUse() &&`]

       auto *MaskC = dyn_cast<ConstantSDNode>(N0.getOperand(1));
       if (!MaskC)
         break;

       // We may have looked through a truncate so mask off any bits that
       // shouldn't be part of the compare.
       uint64_t Mask = MaskC->getZExtValue();
       Mask &= maskTrailingOnes<uint64_t>(CmpVT.getScalarSizeInBits());

-      // Check if we can replace AND+IMM64 with a shift. This is possible for
-      // masks like 0xFF000000 or 0x00FFFFFF and if we care only about the zero
-      // flag.
-      if (CmpVT == MVT::i64 && !isInt<32>(Mask) && isShiftedMask_64(Mask) &&
+      // Check if we can replace AND+IMM{32,64} with a shift. This is possible
+      // for masks like 0xFF000000 or 0x00FFFFFF and if we care only about the
+      // zero flag.
+      if (CmpVT == MVT::i64 && !isInt<8>(Mask) && isShiftedMask_64(Mask) &&
           onlyUsesZeroFlag(SDValue(Node, 0))) {
         unsigned ShiftOpcode = ISD::DELETED_NODE;
         unsigned ShiftAmt;
         unsigned SubRegIdx;
         MVT SubRegVT;
         unsigned TestOpcode;
         unsigned LeadingZeros = countLeadingZeros(Mask);
         unsigned TrailingZeros = countTrailingZeros(Mask);
-        if (LeadingZeros == 0) {
+
+        // With leading/trailing zeros, the transform is profitable if we can
+        // eliminate a movabsq or shrink a 32-bit immediate to 8-bit without
+        // incurring any extra register moves.
+        bool SavesBytes = !isInt<32>(Mask) || N0.getOperand(0).hasOneUse();
+        if (LeadingZeros == 0 && SavesBytes) {
           // If the mask covers the most significant bit, then we can replace
           // TEST+AND with a SHR and check eflags.
           // This emits a redundant TEST which is subsequently eliminated.
           ShiftOpcode = X86::SHR64ri;
           ShiftAmt = TrailingZeros;
           SubRegIdx = 0;
           TestOpcode = X86::TEST64rr;
-        } else if (TrailingZeros == 0) {
+        } else if (TrailingZeros == 0 && SavesBytes) {
           // If the mask covers the least significant bit, then we can replace
           // TEST+AND with a SHL and check eflags.
           // This emits a redundant TEST which is subsequently eliminated.
           ShiftOpcode = X86::SHL64ri;
           ShiftAmt = LeadingZeros;
           SubRegIdx = 0;
           TestOpcode = X86::TEST64rr;
-        } else if (MaskC->hasOneUse()) {
-          // If the mask is 8/16 or 32bits wide, then we can replace it with
-          // a SHR and a TEST8rr/TEST16rr/TEST32rr.
+        } else if (MaskC->hasOneUse() && !isInt<32>(Mask)) {
+          // If the shifted mask extends into the high half and is 8/16/32 bits
+          // wide, then replace it with a SHR and a TEST8rr/TEST16rr/TEST32rr.
           unsigned PopCount = 64 - LeadingZeros - TrailingZeros;
           if (PopCount == 8) {
             ShiftOpcode = X86::SHR64ri;
             ShiftAmt = TrailingZeros;
             SubRegIdx = X86::sub_8bit;
             SubRegVT = MVT::i8;
             TestOpcode = X86::TEST8rr;
           } else if (PopCount == 16) {
             ShiftOpcode = X86::SHR64ri;
             ShiftAmt = TrailingZeros;
             SubRegIdx = X86::sub_16bit;
             SubRegVT = MVT::i16;
             TestOpcode = X86::TEST16rr;
           } else if (PopCount == 32) {
             ShiftOpcode = X86::SHR64ri;
             ShiftAmt = TrailingZeros;
             SubRegIdx = X86::sub_32bit;
             SubRegVT = MVT::i32;
             TestOpcode = X86::TEST32rr;
           }
         }
         if (ShiftOpcode != ISD::DELETED_NODE) {
           SDValue ShiftC = CurDAG->getTargetConstant(ShiftAmt, dl, MVT::i64);
           SDValue Shift = SDValue(
               CurDAG->getMachineNode(ShiftOpcode, dl, MVT::i64, MVT::i32,
                                      N0.getOperand(0), ShiftC),
               0);
           if (SubRegIdx != 0) {
             Shift =

[514 following lines not shown]
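The updated conditions can be easier to follow outside the SelectionDAG context. The sketch below restates the post-patch decision as a standalone function, under several assumptions: the compare type is i64, only the zero flag is used, and the two `hasOneUse()` queries are passed in as plain booleans. All names (`Choice`, `chooseShift`, `fitsSigned`, and so on) are hypothetical stand-ins, not LLVM APIs; the bit helpers are simplified reimplementations of what `isInt<N>`, `isShiftedMask_64`, `countLeadingZeros`, and `countTrailingZeros` are used for in the diff.

```cpp
#include <cassert>
#include <cstdint>

enum class Shift { None, Shr, Shl };

struct Choice {
  Shift Op;          // which shift replaces the AND, if any
  unsigned Amt;      // shift amount
  unsigned TestBits; // width of the register TEST that follows the shift
};

// True if V, read as a signed 64-bit value, fits a Bits-wide signed immediate.
constexpr bool fitsSigned(uint64_t V, unsigned Bits) {
  return static_cast<int64_t>(V) >= -(int64_t(1) << (Bits - 1)) &&
         static_cast<int64_t>(V) < (int64_t(1) << (Bits - 1));
}

// True if V is one contiguous run of set bits (for example 0x0FF0).
constexpr bool isShiftedMask(uint64_t V) {
  return V != 0 && ((((V - 1) | V) + 1) & ((V - 1) | V)) == 0;
}

constexpr unsigned leadingZeros(uint64_t V) {
  unsigned N = 0;
  for (uint64_t Bit = uint64_t(1) << 63; Bit != 0 && (V & Bit) == 0; Bit >>= 1)
    ++N;
  return N;
}

constexpr unsigned trailingZeros(uint64_t V) {
  unsigned N = 0;
  for (uint64_t Bit = 1; Bit != 0 && (V & Bit) == 0; Bit <<= 1)
    ++N;
  return N;
}

// Assumption: OperandHasOneUse stands in for N0.getOperand(0).hasOneUse()
// and MaskHasOneUse stands in for MaskC->hasOneUse().
constexpr Choice chooseShift(uint64_t Mask, bool OperandHasOneUse,
                             bool MaskHasOneUse) {
  if (fitsSigned(Mask, 8) || !isShiftedMask(Mask))
    return {Shift::None, 0, 0};
  unsigned LZ = leadingZeros(Mask);
  unsigned TZ = trailingZeros(Mask);
  // Profitable if a movabsq disappears, or if an imm32 shrinks to an imm8
  // without an extra register copy.
  bool SavesBytes = !fitsSigned(Mask, 32) || OperandHasOneUse;
  if (LZ == 0 && SavesBytes)
    return {Shift::Shr, TZ, 64};
  if (TZ == 0 && SavesBytes)
    return {Shift::Shl, LZ, 64};
  if (MaskHasOneUse && !fitsSigned(Mask, 32)) {
    unsigned PopCount = 64 - LZ - TZ;
    if (PopCount == 8 || PopCount == 16 || PopCount == 32)
      return {Shift::Shr, TZ, PopCount};
  }
  return {Shift::None, 0, 0};
}

// The two masks exercised by the cmp.ll changes in this revision.
static_assert(chooseShift(0xFFFFFFFFFFF00000ULL, true, true).Op == Shift::Shr, "");
static_assert(chooseShift(0xFFFFFFFFFFF00000ULL, true, true).Amt == 20, "");
static_assert(chooseShift(0x00000000000FFFFFULL, true, true).Op == Shift::Shl, "");
static_assert(chooseShift(0x00000000000FFFFFULL, true, true).Amt == 44, "");
```

In this model, a 32-bit-encodable mask whose operand has other uses is left alone, which corresponds to the negative tests mentioned in the patch update.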

llvm/test/CodeGen/X86/cmp.ll

[424 preceding lines not shown. Context: `; CHECK-NEXT: retq # encoding: [0xc3]`]

   %ret = mul i64 %z, %val
   ret i64 %ret
 }

 define i32 @highmask_i64_mask32(i64 %val) {
 ; CHECK-LABEL: highmask_i64_mask32:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: xorl %eax, %eax # encoding: [0x31,0xc0]
-; CHECK-NEXT: testq $-1048576, %rdi # encoding: [0x48,0xf7,0xc7,0x00,0x00,0xf0,0xff]
-; CHECK-NEXT: # imm = 0xFFF00000
+; CHECK-NEXT: shrq $20, %rdi # encoding: [0x48,0xc1,0xef,0x14]
 ; CHECK-NEXT: sete %al # encoding: [0x0f,0x94,0xc0]
 ; CHECK-NEXT: retq # encoding: [0xc3]
   %and = and i64 %val, -1048576
   %cmp = icmp eq i64 %and, 0
   %ret = zext i1 %cmp to i32
   ret i32 %ret
 }

[54 lines not shown. Context: `; CHECK-NEXT: retq # encoding: [0xc3]`]

   %ret = mul i64 %z, %val
   ret i64 %ret
 }

 define i32 @lowmask_i64_mask32(i64 %val) {
 ; CHECK-LABEL: lowmask_i64_mask32:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: xorl %eax, %eax # encoding: [0x31,0xc0]
-; CHECK-NEXT: testl $1048575, %edi # encoding: [0xf7,0xc7,0xff,0xff,0x0f,0x00]
-; CHECK-NEXT: # imm = 0xFFFFF
+; CHECK-NEXT: shlq $44, %rdi # encoding: [0x48,0xc1,0xe7,0x2c]
 ; CHECK-NEXT: setne %al # encoding: [0x0f,0x95,0xc0]
 ; CHECK-NEXT: retq # encoding: [0xc3]
   %and = and i64 %val, 1048575
   %cmp = icmp ne i64 %and, 0
   %ret = zext i1 %cmp to i32
   ret i32 %ret
 }

[248 following lines not shown]
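The size effect of the two test changes can be read directly off the `# encoding:` comments. As a small worked check, the sketch below copies those byte sequences from the CHECK lines of this revision and compares their lengths; the array and constant names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte sequences copied from the `# encoding:` comments in cmp.ll.
constexpr uint8_t TestqImm32[] = {0x48, 0xf7, 0xc7, 0x00, 0x00, 0xf0, 0xff};
constexpr uint8_t ShrqImm8[] = {0x48, 0xc1, 0xef, 0x14};
constexpr uint8_t TestlImm32[] = {0xf7, 0xc7, 0xff, 0xff, 0x0f, 0x00};
constexpr uint8_t ShlqImm8[] = {0x48, 0xc1, 0xe7, 0x2c};

// Bytes saved by each replacement.
constexpr std::size_t HighMaskSaved = sizeof(TestqImm32) - sizeof(ShrqImm8);
constexpr std::size_t LowMaskSaved = sizeof(TestlImm32) - sizeof(ShlqImm8);

static_assert(HighMaskSaved == 3, "testq imm32 -> shrq imm8 saves 3 bytes");
static_assert(LowMaskSaved == 2, "testl imm32 -> shlq imm8 saves 2 bytes");

// The final byte of each shift encoding is the 8-bit shift amount.
static_assert(ShrqImm8[3] == 20, "shrq $20");
static_assert(ShlqImm8[3] == 44, "shlq $44");
```

The 3-byte figure for the high-mask case matches the "only saving 3 bytes" remark in the review thread; the low-mask case saves one byte less because the original `testl` needs no REX prefix.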