This is an archive of the discontinued LLVM Phabricator instance.

Differential D80466

[X86] Improve i8 + 'slow' i16 funnel shift codegen
Closed · Public

Authored by RKSimon on May 23 2020, 3:09 AM.

Details

Reviewers

craig.topper
spatel
lebedev.ri

Commits

rGcc65a7a5ea81: [X86] Improve i8 + 'slow' i16 funnel shift codegen

Summary

This is a preliminary patch before I deal with the xor+and issue raised in D77301.

We get much better code for i8/i16 funnel shifts by concatenating the operands and performing the shift in a double-width type; this avoids repeated use of the shift amount and of partial registers.

fshl(x,y,z) -> (((zext(x) << bw) | zext(y)) << (z & (bw-1))) >> bw.
fshr(x,y,z) -> ((zext(x) << bw) | zext(y)) >> (z & (bw-1)).

Alive2: http://volta.cs.utah.edu:8080/z/CZx7Cn
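
The widening identity can be checked exhaustively for i8 outside of LLVM. The following is a minimal sketch, not the patch itself: the helper names (`refFshl8`, `wideFshl8`, and so on) are hypothetical, and the reference functions encode the usual "shift amount modulo bit width" reading of the funnel-shift intrinsics. For fshr the result is simply the low bw bits of the right-shifted concatenation, so the sketch truncates directly instead of shifting by bw again.

```cpp
#include <cassert>
#include <cstdint>

// Reference i8 semantics: treat x:y as a 16-bit value, shift by z modulo 8,
// and keep the high byte (fshl) or the low byte (fshr).
uint8_t refFshl8(uint8_t x, uint8_t y, uint8_t z) {
  unsigned s = z % 8;
  return s == 0 ? x : static_cast<uint8_t>((x << s) | (y >> (8 - s)));
}

uint8_t refFshr8(uint8_t x, uint8_t y, uint8_t z) {
  unsigned s = z % 8;
  return s == 0 ? y : static_cast<uint8_t>((x << (8 - s)) | (y >> s));
}

// Widened form: build the concatenation once in 32 bits and shift it once.
uint8_t wideFshl8(uint8_t x, uint8_t y, uint8_t z) {
  uint32_t concat = (static_cast<uint32_t>(x) << 8) | y;
  return static_cast<uint8_t>((concat << (z & 7)) >> 8);
}

uint8_t wideFshr8(uint8_t x, uint8_t y, uint8_t z) {
  uint32_t concat = (static_cast<uint32_t>(x) << 8) | y;
  return static_cast<uint8_t>(concat >> (z & 7));  // low byte, no final >> 8
}

// Exhaustive comparison over every (x, y, z) triple of i8 values.
void checkWidenedMatchesReference() {
  for (unsigned x = 0; x < 256; ++x)
    for (unsigned y = 0; y < 256; ++y)
      for (unsigned z = 0; z < 256; ++z) {
        uint8_t a = x, b = y, c = z;
        assert(wideFshl8(a, b, c) == refFshl8(a, b, c));
        assert(wideFshr8(a, b, c) == refFshr8(a, b, c));
      }
}
```

Calling `checkWidenedMatchesReference()` from a test driver walks all 2^24 input triples. Note that the widened form needs only one variable shift and no special case for a zero amount.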

This doesn't do as well for i32 cases on x86_64 (the xor+and followup patch is much better) so I haven't bothered with that.

Cases with constant amounts are more dubious as well, so I haven't bothered with those for now - it's these kinds of 'edge' cases that put me off trying to put this in TargetLowering::expandFunnelShift.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

RKSimon created this revision. · May 23 2020, 3:09 AM

Herald added a project: Restricted Project. · May 23 2020, 3:09 AM

Herald added a subscriber: hiraditya.

lebedev.ri added inline comments. · May 23 2020, 4:10 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
19094	This can be anyext

Harbormaster failed remote builds in B57707: Diff 265845! · May 23 2020, 5:15 AM

Use anyextend and always extend to i32 straight away (as I said, i32 funnel shifts as i64 didn't make much sense, so I've dropped that generalization).
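
One way to see why an any-extend is enough for the high operand: after the `<< bw`, any undefined bits sit at bit 2*bw or above, and neither result window (bits bw..2*bw-1 after the left shift, bits 0..bw-1 after the right shift) can reach them for an amount below bw. The low operand still needs a zero-extend, because its upper bits would be ORed straight into the high operand. The sketch below is an illustration only; `wideFunnel8` and the junk patterns are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Widened i8 funnel shift on operands already extended to 32 bits.
// xExt may carry arbitrary bits above bit 7 (any-extend); yExt must not.
uint8_t wideFunnel8(bool isFshr, uint32_t xExt, uint32_t yExt, unsigned z) {
  uint32_t concat = (xExt << 8) | yExt;
  unsigned amt = z & 7;
  return isFshr ? static_cast<uint8_t>(concat >> amt)
                : static_cast<uint8_t>((concat << amt) >> 8);
}

// Junk in the upper 24 bits of the extended x never reaches the result.
void checkHighJunkIsHarmless() {
  const uint32_t junkPatterns[] = {0xFFFFFFu, 0xA5A5A5u, 0x000001u};
  for (uint32_t junk : junkPatterns)
    for (uint32_t x = 0; x < 256; ++x)
      for (uint32_t y = 0; y < 256; ++y)
        for (unsigned z = 0; z < 8; ++z)
          for (int f = 0; f < 2; ++f) {
            bool isFshr = (f != 0);
            uint32_t dirty = (junk << 8) | x;
            assert(wideFunnel8(isFshr, dirty, y, z) ==
                   wideFunnel8(isFshr, x, y, z));
          }
}
```

By contrast, a stray bit above bit 7 in the extended low operand does change the answer: `wideFunnel8(false, 0, 0x100, 0)` yields 1 where a clean input yields 0.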

LGTM

For fshl case, we could introduce some more ILP: http://volta.cs.utah.edu:8080/z/UJ6viM
https://godbolt.org/z/xsJgPb https://godbolt.org/z/5W26NV
Not sure it would be an improvement?
As a sidenote, we clearly don't fold to either variant in DAGCombiner.

Harbormaster failed remote builds in B57714: Diff 265858! · May 23 2020, 8:28 AM

In D80466#2052153, @lebedev.ri wrote:

LGTM

For fshl case, we could introduce some more ILP: http://volta.cs.utah.edu:8080/z/UJ6viM
https://godbolt.org/z/xsJgPb https://godbolt.org/z/5W26NV
Not sure it would be an improvement?
As a sidenote, we clearly don't fold to either variant in DAGCombiner.

Looking at these cases in llvm-mca with 'slow shld' targets (btver2/bdver2/znver*), the naive cases all seem to give better throughput.

lebedev.ri accepted this revision. · May 23 2020, 11:50 AM

This revision is now accepted and ready to land. · May 23 2020, 11:50 AM

Closed by commit rGcc65a7a5ea81: [X86] Improve i8 + 'slow' i16 funnel shift codegen (authored by RKSimon). · May 24 2020, 12:29 AM

This revision was automatically updated to reflect the committed changes.

RKSimon mentioned this in D80489: [TargetLowering] Improve expandFunnelShift shift amount masking. · May 24 2020, 2:40 AM

RKSimon mentioned this in rG16031067252d: [TargetLowering] Improve expandFunnelShift shift amount masking. · May 24 2020, 3:44 AM

foad added inline comments. · May 26 2020, 1:36 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
19087	The final `>> bw` is wrong for fshr.
llvm/test/CodeGen/X86/fshl.ll
22–23	Would it be worth trying to generate just `movb %al, %dh` instead of zext+shll+orl?
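
The remark on line 19087 can be illustrated with a single counterexample. This is a sketch with hypothetical names: `fshr8Lowered` mirrors what the patch's code path does for fshr (one right shift, then truncate), while `fshr8AsCommented` follows the code comment literally, including the trailing `>> bw`.

```cpp
#include <cassert>
#include <cstdint>

// fshr as the lowering computes it: shift the concatenation right, truncate.
uint8_t fshr8Lowered(uint8_t x, uint8_t y, uint8_t z) {
  uint32_t concat = (static_cast<uint32_t>(x) << 8) | y;
  return static_cast<uint8_t>(concat >> (z & 7));
}

// fshr as the code comment reads, with an extra shift by the bit width.
uint8_t fshr8AsCommented(uint8_t x, uint8_t y, uint8_t z) {
  uint32_t concat = (static_cast<uint32_t>(x) << 8) | y;
  return static_cast<uint8_t>((concat >> (z & 7)) >> 8);
}

// With a zero amount fshr must return y; the commented form returns x.
void checkFshrCommentCounterexample() {
  assert(fshr8Lowered(0x12, 0x34, 0) == 0x34);
  assert(fshr8AsCommented(0x12, 0x34, 0) == 0x12);
}
```

So the comment was misleading while the code itself was already doing the right thing, which matches the later "NFC" follow-up commit.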

RKSimon marked an inline comment as done. · May 26 2020, 3:08 AM

RKSimon added inline comments.

llvm/test/CodeGen/X86/fshl.ll
22–23	Yes, that might be useful, but it should probably be done generally. I don't know much about the hi-byte move logic; @craig.topper might be able to advise?

RKSimon mentioned this in rG6f802ec4333c: [X86] Fix fshr comment copy+paste typo. NFC. · May 26 2020, 3:12 AM

craig.topper added inline comments. · May 26 2020, 12:51 PM

llvm/test/CodeGen/X86/fshl.ll
22–23	I think you'd have to jump through some hoops to get the register allocator to do it. You'd need an INSERT_SUBREG to force the join. Possibly even a pseudo instruction on 64-bit to force NOREX on the other register to avoid an encoding issue. I'm not sure it makes sense to write an h register on modern Intel CPUs. It guarantees a merge uop needs to be generated when bits 15:8 and 7:0 are both read by the consuming instruction.

efriedma added a subscriber: efriedma. · May 26 2020, 2:25 PM

efriedma added inline comments.

llvm/test/CodeGen/X86/fshl.ll
22–23	On processors that don't have special rename machinery for 8-bit registers, it should simply save an instruction, if it's legal. On big Intel cores, even if it doesn't save a uop, it should still be smaller. That said, even if it's profitable in this exact case, the register allocation constraints to make it work are really tight; it's probably only worthwhile if the values are already in ABCD registers.
26	We should probably prefer shrl over movb.

Revision Contents

Path	Size
llvm/lib/Target/X86/X86ISelLowering.cpp	32 lines
llvm/test/CodeGen/X86/fshl.ll	61 lines
llvm/test/CodeGen/X86/fshr.ll	63 lines
llvm/test/CodeGen/X86/rotate-extract.ll	28 lines

Diff 265858

llvm/lib/Target/X86/X86ISelLowering.cpp

[195 lines not shown]
X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
}		}
setOperationAction(ISD::ABS , MVT::i64 , Custom);		setOperationAction(ISD::ABS , MVT::i64 , Custom);

// Funnel shifts.		// Funnel shifts.
for (auto ShiftOp : {ISD::FSHL, ISD::FSHR}) {		for (auto ShiftOp : {ISD::FSHL, ISD::FSHR}) {
// For slow shld targets we only lower for code size.		// For slow shld targets we only lower for code size.
LegalizeAction ShiftDoubleAction = Subtarget.isSHLDSlow() ? Custom : Legal;		LegalizeAction ShiftDoubleAction = Subtarget.isSHLDSlow() ? Custom : Legal;

		setOperationAction(ShiftOp , MVT::i8 , Custom);
setOperationAction(ShiftOp , MVT::i16 , Custom);		setOperationAction(ShiftOp , MVT::i16 , Custom);
setOperationAction(ShiftOp , MVT::i32 , ShiftDoubleAction);		setOperationAction(ShiftOp , MVT::i32 , ShiftDoubleAction);
if (Subtarget.is64Bit())		if (Subtarget.is64Bit())
setOperationAction(ShiftOp , MVT::i64 , ShiftDoubleAction);		setOperationAction(ShiftOp , MVT::i64 , ShiftDoubleAction);
}		}

if (!Subtarget.useSoftFloat()) {		if (!Subtarget.useSoftFloat()) {
// Promote all UINT_TO_FP to larger SINT_TO_FP's, as X86 doesn't have this		// Promote all UINT_TO_FP to larger SINT_TO_FP's, as X86 doesn't have this
[18,857 lines not shown]
if (X86::isConstantSplat(Amt, APIntShiftAmt)) {
uint64_t ShiftAmt = APIntShiftAmt.urem(VT.getScalarSizeInBits());		uint64_t ShiftAmt = APIntShiftAmt.urem(VT.getScalarSizeInBits());
return DAG.getNode(IsFSHR ? X86ISD::VSHRD : X86ISD::VSHLD, DL, VT, Op0,		return DAG.getNode(IsFSHR ? X86ISD::VSHRD : X86ISD::VSHLD, DL, VT, Op0,
Op1, DAG.getTargetConstant(ShiftAmt, DL, MVT::i8));		Op1, DAG.getTargetConstant(ShiftAmt, DL, MVT::i8));
}		}

return DAG.getNode(IsFSHR ? X86ISD::VSHRDV : X86ISD::VSHLDV, DL, VT,		return DAG.getNode(IsFSHR ? X86ISD::VSHRDV : X86ISD::VSHLDV, DL, VT,
Op0, Op1, Amt);		Op0, Op1, Amt);
}		}
		assert(
assert((VT == MVT::i16 || VT == MVT::i32 || VT == MVT::i64) &&		(VT == MVT::i8 || VT == MVT::i16 || VT == MVT::i32 || VT == MVT::i64) &&
"Unexpected funnel shift type!");		"Unexpected funnel shift type!");

// Expand slow SHLD/SHRD cases if we are not optimizing for size.		// Expand slow SHLD/SHRD cases if we are not optimizing for size.
bool OptForSize = DAG.shouldOptForSize();		bool OptForSize = DAG.shouldOptForSize();
if (!OptForSize && Subtarget.isSHLDSlow())		bool ExpandFunnel = !OptForSize && Subtarget.isSHLDSlow();

		// fshl(x,y,z) -> (((aext(x) << bw) | zext(y)) << (z & (bw-1))) >> bw.
		// fshr(x,y,z) -> (((aext(x) << bw) | zext(y)) >> (z & (bw-1))) >> bw.
		if ((VT == MVT::i8 || (ExpandFunnel && VT == MVT::i16)) &&
		!isa<ConstantSDNode>(Amt)) {
		unsigned EltSizeInBits = VT.getScalarSizeInBits();
		SDValue Mask = DAG.getConstant(EltSizeInBits - 1, DL, Amt.getValueType());
		SDValue HiShift = DAG.getConstant(EltSizeInBits, DL, Amt.getValueType());
		Op0 = DAG.getAnyExtOrTrunc(Op0, DL, MVT::i32);
		Op1 = DAG.getZExtOrTrunc(Op1, DL, MVT::i32);
		Amt = DAG.getNode(ISD::AND, DL, Amt.getValueType(), Amt, Mask);
		SDValue Res = DAG.getNode(ISD::SHL, DL, MVT::i32, Op0, HiShift);
		Res = DAG.getNode(ISD::OR, DL, MVT::i32, Res, Op1);
		if (IsFSHR) {
		Res = DAG.getNode(ISD::SRL, DL, MVT::i32, Res, Amt);
		} else {
		Res = DAG.getNode(ISD::SHL, DL, MVT::i32, Res, Amt);
		Res = DAG.getNode(ISD::SRL, DL, MVT::i32, Res, HiShift);
		}
		return DAG.getZExtOrTrunc(Res, DL, VT);
		}

		if (VT == MVT::i8 || ExpandFunnel)
return SDValue();		return SDValue();

// i16 needs to modulo the shift amount, but i32/i64 have implicit modulo.		// i16 needs to modulo the shift amount, but i32/i64 have implicit modulo.
if (VT == MVT::i16) {		if (VT == MVT::i16) {
Amt = DAG.getNode(ISD::AND, DL, Amt.getValueType(), Amt,		Amt = DAG.getNode(ISD::AND, DL, Amt.getValueType(), Amt,
DAG.getConstant(15, DL, Amt.getValueType()));		DAG.getConstant(15, DL, Amt.getValueType()));
unsigned FSHOp = (IsFSHR ? X86ISD::FSHR : X86ISD::FSHL);		unsigned FSHOp = (IsFSHR ? X86ISD::FSHR : X86ISD::FSHL);
return DAG.getNode(FSHOp, DL, VT, Op0, Op1, Amt);		return DAG.getNode(FSHOp, DL, VT, Op0, Op1, Amt);
[9,991 lines not shown]
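
The control flow of the scalar path in the hunk above can be summarised as a small decision function. This is a sketch under stated assumptions, not LLVM code: the enum, its member names, and `classifyFunnelShift` are hypothetical; `return SDValue()` is read here as deferring to the generic expansion; and the handling of i32/i64 past the visible hunk is not shown in this diff, so it is left as an explicit "not shown" case.

```cpp
#include <cassert>

// Hypothetical labels for the outcomes visible in the hunk.
enum class FunnelLowering {
  WidenToI32,        // concatenate in i32 and use a single shift
  GenericExpansion,  // `return SDValue()`: assumed to mean default expansion
  X86FunnelNode,     // i16: mask the amount with 15, emit X86ISD::FSHL/FSHR
  NotShownInDiff     // code after the visible hunk (i32/i64 path)
};

FunnelLowering classifyFunnelShift(unsigned bits, bool isSHLDSlow,
                                   bool optForSize, bool amtIsConstant) {
  // Mirrors: bool ExpandFunnel = !OptForSize && Subtarget.isSHLDSlow();
  bool expandFunnel = !optForSize && isSHLDSlow;

  // i8 always, and i16 on slow-SHLD targets, unless the amount is constant.
  if ((bits == 8 || (expandFunnel && bits == 16)) && !amtIsConstant)
    return FunnelLowering::WidenToI32;

  if (bits == 8 || expandFunnel)
    return FunnelLowering::GenericExpansion;

  // i16 needs an explicit modulo of the shift amount.
  if (bits == 16)
    return FunnelLowering::X86FunnelNode;

  return FunnelLowering::NotShownInDiff;
}

void checkClassification() {
  using FL = FunnelLowering;
  assert(classifyFunnelShift(8, false, false, false) == FL::WidenToI32);
  assert(classifyFunnelShift(8, false, false, true) == FL::GenericExpansion);
  assert(classifyFunnelShift(16, true, false, false) == FL::WidenToI32);
  assert(classifyFunnelShift(16, true, true, false) == FL::X86FunnelNode);
  assert(classifyFunnelShift(32, true, false, false) == FL::GenericExpansion);
}
```

The constant-amount check explains why the summary says constant shift amounts are left alone: they skip the widening and fall through to the earlier behaviour.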

llvm/test/CodeGen/X86/fshl.ll

[10 lines not shown]

	;			;
	; Variable Funnel Shift			; Variable Funnel Shift
	;			;

	define i8 @var_shift_i8(i8 %x, i8 %y, i8 %z) nounwind {			define i8 @var_shift_i8(i8 %x, i8 %y, i8 %z) nounwind {
	; X86-LABEL: var_shift_i8:			; X86-LABEL: var_shift_i8:
	; X86: # %bb.0:			; X86: # %bb.0:
	; X86-NEXT: movb {{[0-9]+}}(%esp), %ah			; X86-NEXT: movb {{[0-9]+}}(%esp), %cl
	; X86-NEXT: movb {{[0-9]+}}(%esp), %al			; X86-NEXT: movzbl {{[0-9]+}}(%esp), %edx
	; X86-NEXT: movb {{[0-9]+}}(%esp), %dl			; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X86-NEXT: andb $7, %dl			; X86-NEXT: shll $8, %eax
	; X86-NEXT: movb %al, %ch			; X86-NEXT: orl %edx, %eax
	; X86-NEXT: movb %dl, %cl			; X86-NEXT: andb $7, %cl
	; X86-NEXT: shlb %cl, %ch			; X86-NEXT: shll %cl, %eax
	; X86-NEXT: movb $8, %cl			; X86-NEXT: movb %ah, %al
	; X86-NEXT: subb %dl, %cl
	; X86-NEXT: shrb %cl, %ah
	; X86-NEXT: testb %dl, %dl
	; X86-NEXT: je .LBB0_2
	; X86-NEXT: # %bb.1:
	; X86-NEXT: orb %ah, %ch
	; X86-NEXT: movb %ch, %al
	; X86-NEXT: .LBB0_2:
	; X86-NEXT: retl			; X86-NEXT: retl
	;			;
	; X64-LABEL: var_shift_i8:			; X64-LABEL: var_shift_i8:
	; X64: # %bb.0:			; X64: # %bb.0:
	; X64-NEXT: andb $7, %dl
	; X64-NEXT: movl %edi, %eax
	; X64-NEXT: movl %edx, %ecx			; X64-NEXT: movl %edx, %ecx
	; X64-NEXT: shlb %cl, %al			; X64-NEXT: shll $8, %edi
	; X64-NEXT: movb $8, %cl
	; X64-NEXT: subb %dl, %cl
	; X64-NEXT: shrb %cl, %sil
	; X64-NEXT: orb %al, %sil
	; X64-NEXT: movzbl %sil, %eax			; X64-NEXT: movzbl %sil, %eax
	; X64-NEXT: testb %dl, %dl			; X64-NEXT: orl %edi, %eax
	; X64-NEXT: cmovel %edi, %eax			; X64-NEXT: andb $7, %cl
				; X64-NEXT: # kill: def $cl killed $cl killed $ecx
				; X64-NEXT: shll %cl, %eax
				; X64-NEXT: shrl $8, %eax
	; X64-NEXT: # kill: def $al killed $al killed $eax			; X64-NEXT: # kill: def $al killed $al killed $eax
	; X64-NEXT: retq			; X64-NEXT: retq
	%tmp = tail call i8 @llvm.fshl.i8(i8 %x, i8 %y, i8 %z)			%tmp = tail call i8 @llvm.fshl.i8(i8 %x, i8 %y, i8 %z)
	ret i8 %tmp			ret i8 %tmp
	}			}

	define i16 @var_shift_i16(i16 %x, i16 %y, i16 %z) nounwind {			define i16 @var_shift_i16(i16 %x, i16 %y, i16 %z) nounwind {
	; X86-FAST-LABEL: var_shift_i16:			; X86-FAST-LABEL: var_shift_i16:
	; X86-FAST: # %bb.0:			; X86-FAST: # %bb.0:
	; X86-FAST-NEXT: movzwl {{[0-9]+}}(%esp), %edx			; X86-FAST-NEXT: movzwl {{[0-9]+}}(%esp), %edx
	; X86-FAST-NEXT: movzwl {{[0-9]+}}(%esp), %eax			; X86-FAST-NEXT: movzwl {{[0-9]+}}(%esp), %eax
	; X86-FAST-NEXT: movb {{[0-9]+}}(%esp), %cl			; X86-FAST-NEXT: movb {{[0-9]+}}(%esp), %cl
	; X86-FAST-NEXT: andb $15, %cl			; X86-FAST-NEXT: andb $15, %cl
	; X86-FAST-NEXT: shldw %cl, %dx, %ax			; X86-FAST-NEXT: shldw %cl, %dx, %ax
	; X86-FAST-NEXT: retl			; X86-FAST-NEXT: retl
	;			;
	; X86-SLOW-LABEL: var_shift_i16:			; X86-SLOW-LABEL: var_shift_i16:
	; X86-SLOW: # %bb.0:			; X86-SLOW: # %bb.0:
	; X86-SLOW-NEXT: movzwl {{[0-9]+}}(%esp), %eax
	; X86-SLOW-NEXT: movl {{[0-9]+}}(%esp), %edx
	; X86-SLOW-NEXT: movb {{[0-9]+}}(%esp), %cl			; X86-SLOW-NEXT: movb {{[0-9]+}}(%esp), %cl
	; X86-SLOW-NEXT: andb $15, %cl			; X86-SLOW-NEXT: movzwl {{[0-9]+}}(%esp), %edx
	; X86-SLOW-NEXT: shll %cl, %edx			; X86-SLOW-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X86-SLOW-NEXT: shrl %eax			; X86-SLOW-NEXT: shll $16, %eax
	; X86-SLOW-NEXT: xorb $15, %cl
	; X86-SLOW-NEXT: shrl %cl, %eax
	; X86-SLOW-NEXT: orl %edx, %eax			; X86-SLOW-NEXT: orl %edx, %eax
				; X86-SLOW-NEXT: andb $15, %cl
				; X86-SLOW-NEXT: shll %cl, %eax
				; X86-SLOW-NEXT: shrl $16, %eax
	; X86-SLOW-NEXT: # kill: def $ax killed $ax killed $eax			; X86-SLOW-NEXT: # kill: def $ax killed $ax killed $eax
	; X86-SLOW-NEXT: retl			; X86-SLOW-NEXT: retl
	;			;
	; X64-FAST-LABEL: var_shift_i16:			; X64-FAST-LABEL: var_shift_i16:
	; X64-FAST: # %bb.0:			; X64-FAST: # %bb.0:
	; X64-FAST-NEXT: movl %edx, %ecx			; X64-FAST-NEXT: movl %edx, %ecx
	; X64-FAST-NEXT: movl %edi, %eax			; X64-FAST-NEXT: movl %edi, %eax
	; X64-FAST-NEXT: andb $15, %cl			; X64-FAST-NEXT: andb $15, %cl
	; X64-FAST-NEXT: # kill: def $cl killed $cl killed $ecx			; X64-FAST-NEXT: # kill: def $cl killed $cl killed $ecx
	; X64-FAST-NEXT: shldw %cl, %si, %ax			; X64-FAST-NEXT: shldw %cl, %si, %ax
	; X64-FAST-NEXT: # kill: def $ax killed $ax killed $eax			; X64-FAST-NEXT: # kill: def $ax killed $ax killed $eax
	; X64-FAST-NEXT: retq			; X64-FAST-NEXT: retq
	;			;
	; X64-SLOW-LABEL: var_shift_i16:			; X64-SLOW-LABEL: var_shift_i16:
	; X64-SLOW: # %bb.0:			; X64-SLOW: # %bb.0:
	; X64-SLOW-NEXT: movl %edx, %ecx			; X64-SLOW-NEXT: movl %edx, %ecx
				; X64-SLOW-NEXT: shll $16, %edi
	; X64-SLOW-NEXT: movzwl %si, %eax			; X64-SLOW-NEXT: movzwl %si, %eax
				; X64-SLOW-NEXT: orl %edi, %eax
	; X64-SLOW-NEXT: andb $15, %cl			; X64-SLOW-NEXT: andb $15, %cl
	; X64-SLOW-NEXT: shll %cl, %edi
	; X64-SLOW-NEXT: xorb $15, %cl
	; X64-SLOW-NEXT: shrl %eax
	; X64-SLOW-NEXT: # kill: def $cl killed $cl killed $ecx			; X64-SLOW-NEXT: # kill: def $cl killed $cl killed $ecx
	; X64-SLOW-NEXT: shrl %cl, %eax			; X64-SLOW-NEXT: shll %cl, %eax
	; X64-SLOW-NEXT: orl %edi, %eax			; X64-SLOW-NEXT: shrl $16, %eax
	; X64-SLOW-NEXT: # kill: def $ax killed $ax killed $eax			; X64-SLOW-NEXT: # kill: def $ax killed $ax killed $eax
	; X64-SLOW-NEXT: retq			; X64-SLOW-NEXT: retq
	%tmp = tail call i16 @llvm.fshl.i16(i16 %x, i16 %y, i16 %z)			%tmp = tail call i16 @llvm.fshl.i16(i16 %x, i16 %y, i16 %z)
	ret i16 %tmp			ret i16 %tmp
	}			}

	define i32 @var_shift_i32(i32 %x, i32 %y, i32 %z) nounwind {			define i32 @var_shift_i32(i32 %x, i32 %y, i32 %z) nounwind {
	; X86-FAST-LABEL: var_shift_i32:			; X86-FAST-LABEL: var_shift_i32:
[469 lines not shown]
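
To make the before/after X64 check lines for `var_shift_i8` easier to compare, here is a C++ model of both instruction sequences. It is one reading of the assembly rather than anything generated by LLVM, and the function names are hypothetical. The old sequence uses the amount twice (once directly, once as `8 - amt`) and needs a select for the zero-amount case; the new one shifts the concatenated value once.

```cpp
#include <cassert>
#include <cstdint>

// Model of the removed X64 sequence for fshl i8.
uint8_t oldFshl8(uint8_t x, uint8_t y, uint8_t z) {
  uint8_t amt = z & 7;                              // andb $7, %dl
  uint8_t hi = static_cast<uint8_t>(x << amt);      // shlb %cl, %al
  uint8_t inv = static_cast<uint8_t>(8 - amt);      // movb $8, %cl ; subb %dl, %cl
  uint8_t lo = static_cast<uint8_t>(y >> inv);      // shrb %cl, %sil
  uint8_t merged = static_cast<uint8_t>(hi | lo);   // orb %al, %sil
  return amt == 0 ? x : merged;                     // testb %dl, %dl ; cmovel
}

// Model of the added X64 sequence for fshl i8.
uint8_t newFshl8(uint8_t x, uint8_t y, uint8_t z) {
  uint32_t v = static_cast<uint32_t>(x) << 8;       // shll $8, %edi
  v |= y;                                           // movzbl %sil, %eax ; orl
  v <<= (z & 7);                                    // andb $7, %cl ; shll %cl, %eax
  return static_cast<uint8_t>(v >> 8);              // shrl $8, %eax
}

// Both sequences agree on every i8 input triple.
void checkSequencesAgree() {
  for (unsigned x = 0; x < 256; ++x)
    for (unsigned y = 0; y < 256; ++y)
      for (unsigned z = 0; z < 256; ++z) {
        uint8_t a = x, b = y, c = z;
        assert(oldFshl8(a, b, c) == newFshl8(a, b, c));
      }
}
```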

llvm/test/CodeGen/X86/fshr.ll

[10 lines not shown]

	;			;
	; Variable Funnel Shift			; Variable Funnel Shift
	;			;

	define i8 @var_shift_i8(i8 %x, i8 %y, i8 %z) nounwind {			define i8 @var_shift_i8(i8 %x, i8 %y, i8 %z) nounwind {
	; X86-LABEL: var_shift_i8:			; X86-LABEL: var_shift_i8:
	; X86: # %bb.0:			; X86: # %bb.0:
	; X86-NEXT: movb {{[0-9]+}}(%esp), %ah			; X86-NEXT: movb {{[0-9]+}}(%esp), %cl
	; X86-NEXT: movb {{[0-9]+}}(%esp), %al			; X86-NEXT: movzbl {{[0-9]+}}(%esp), %edx
	; X86-NEXT: movb {{[0-9]+}}(%esp), %dl			; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X86-NEXT: andb $7, %dl			; X86-NEXT: shll $8, %eax
	; X86-NEXT: movb %al, %ch			; X86-NEXT: orl %edx, %eax
	; X86-NEXT: movb %dl, %cl			; X86-NEXT: andb $7, %cl
	; X86-NEXT: shrb %cl, %ch			; X86-NEXT: shrl %cl, %eax
	; X86-NEXT: movb $8, %cl			; X86-NEXT: # kill: def $al killed $al killed $eax
	; X86-NEXT: subb %dl, %cl
	; X86-NEXT: shlb %cl, %ah
	; X86-NEXT: testb %dl, %dl
	; X86-NEXT: je .LBB0_2
	; X86-NEXT: # %bb.1:
	; X86-NEXT: orb %ch, %ah
	; X86-NEXT: movb %ah, %al
	; X86-NEXT: .LBB0_2:
	; X86-NEXT: retl			; X86-NEXT: retl
	;			;
	; X64-LABEL: var_shift_i8:			; X64-LABEL: var_shift_i8:
	; X64: # %bb.0:			; X64: # %bb.0:
	; X64-NEXT: andb $7, %dl
	; X64-NEXT: movl %esi, %eax
	; X64-NEXT: movl %edx, %ecx			; X64-NEXT: movl %edx, %ecx
	; X64-NEXT: shrb %cl, %al			; X64-NEXT: shll $8, %edi
	; X64-NEXT: movb $8, %cl			; X64-NEXT: movzbl %sil, %eax
	; X64-NEXT: subb %dl, %cl			; X64-NEXT: orl %edi, %eax
	; X64-NEXT: shlb %cl, %dil			; X64-NEXT: andb $7, %cl
	; X64-NEXT: orb %al, %dil			; X64-NEXT: # kill: def $cl killed $cl killed $ecx
	; X64-NEXT: movzbl %dil, %eax			; X64-NEXT: shrl %cl, %eax
	; X64-NEXT: testb %dl, %dl
	; X64-NEXT: cmovel %esi, %eax
	; X64-NEXT: # kill: def $al killed $al killed $eax			; X64-NEXT: # kill: def $al killed $al killed $eax
	; X64-NEXT: retq			; X64-NEXT: retq
	%tmp = tail call i8 @llvm.fshr.i8(i8 %x, i8 %y, i8 %z)			%tmp = tail call i8 @llvm.fshr.i8(i8 %x, i8 %y, i8 %z)
	ret i8 %tmp			ret i8 %tmp
	}			}

	define i16 @var_shift_i16(i16 %x, i16 %y, i16 %z) nounwind {			define i16 @var_shift_i16(i16 %x, i16 %y, i16 %z) nounwind {
	; X86-FAST-LABEL: var_shift_i16:			; X86-FAST-LABEL: var_shift_i16:
	; X86-FAST: # %bb.0:			; X86-FAST: # %bb.0:
	; X86-FAST-NEXT: movzwl {{[0-9]+}}(%esp), %edx			; X86-FAST-NEXT: movzwl {{[0-9]+}}(%esp), %edx
	; X86-FAST-NEXT: movzwl {{[0-9]+}}(%esp), %eax			; X86-FAST-NEXT: movzwl {{[0-9]+}}(%esp), %eax
	; X86-FAST-NEXT: movb {{[0-9]+}}(%esp), %cl			; X86-FAST-NEXT: movb {{[0-9]+}}(%esp), %cl
	; X86-FAST-NEXT: andb $15, %cl			; X86-FAST-NEXT: andb $15, %cl
	; X86-FAST-NEXT: shrdw %cl, %dx, %ax			; X86-FAST-NEXT: shrdw %cl, %dx, %ax
	; X86-FAST-NEXT: retl			; X86-FAST-NEXT: retl
	;			;
	; X86-SLOW-LABEL: var_shift_i16:			; X86-SLOW-LABEL: var_shift_i16:
	; X86-SLOW: # %bb.0:			; X86-SLOW: # %bb.0:
	; X86-SLOW-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X86-SLOW-NEXT: movzwl {{[0-9]+}}(%esp), %edx
	; X86-SLOW-NEXT: movb {{[0-9]+}}(%esp), %cl			; X86-SLOW-NEXT: movb {{[0-9]+}}(%esp), %cl
	; X86-SLOW-NEXT: andb $15, %cl			; X86-SLOW-NEXT: movzwl {{[0-9]+}}(%esp), %edx
	; X86-SLOW-NEXT: shrl %cl, %edx			; X86-SLOW-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X86-SLOW-NEXT: addl %eax, %eax			; X86-SLOW-NEXT: shll $16, %eax
	; X86-SLOW-NEXT: xorb $15, %cl
	; X86-SLOW-NEXT: shll %cl, %eax
	; X86-SLOW-NEXT: orl %edx, %eax			; X86-SLOW-NEXT: orl %edx, %eax
				; X86-SLOW-NEXT: andb $15, %cl
				; X86-SLOW-NEXT: shrl %cl, %eax
	; X86-SLOW-NEXT: # kill: def $ax killed $ax killed $eax			; X86-SLOW-NEXT: # kill: def $ax killed $ax killed $eax
	; X86-SLOW-NEXT: retl			; X86-SLOW-NEXT: retl
	;			;
	; X64-FAST-LABEL: var_shift_i16:			; X64-FAST-LABEL: var_shift_i16:
	; X64-FAST: # %bb.0:			; X64-FAST: # %bb.0:
	; X64-FAST-NEXT: movl %edx, %ecx			; X64-FAST-NEXT: movl %edx, %ecx
	; X64-FAST-NEXT: movl %esi, %eax			; X64-FAST-NEXT: movl %esi, %eax
	; X64-FAST-NEXT: andb $15, %cl			; X64-FAST-NEXT: andb $15, %cl
	; X64-FAST-NEXT: # kill: def $cl killed $cl killed $ecx			; X64-FAST-NEXT: # kill: def $cl killed $cl killed $ecx
	; X64-FAST-NEXT: shrdw %cl, %di, %ax			; X64-FAST-NEXT: shrdw %cl, %di, %ax
	; X64-FAST-NEXT: # kill: def $ax killed $ax killed $eax			; X64-FAST-NEXT: # kill: def $ax killed $ax killed $eax
	; X64-FAST-NEXT: retq			; X64-FAST-NEXT: retq
	;			;
	; X64-SLOW-LABEL: var_shift_i16:			; X64-SLOW-LABEL: var_shift_i16:
	; X64-SLOW: # %bb.0:			; X64-SLOW: # %bb.0:
	; X64-SLOW-NEXT: movl %edx, %ecx			; X64-SLOW-NEXT: movl %edx, %ecx
	; X64-SLOW-NEXT: # kill: def $edi killed $edi def $rdi			; X64-SLOW-NEXT: shll $16, %edi
	; X64-SLOW-NEXT: movzwl %si, %edx			; X64-SLOW-NEXT: movzwl %si, %eax
				; X64-SLOW-NEXT: orl %edi, %eax
	; X64-SLOW-NEXT: andb $15, %cl			; X64-SLOW-NEXT: andb $15, %cl
	; X64-SLOW-NEXT: shrl %cl, %edx
	; X64-SLOW-NEXT: leal (%rdi,%rdi), %eax
	; X64-SLOW-NEXT: xorb $15, %cl
	; X64-SLOW-NEXT: # kill: def $cl killed $cl killed $ecx			; X64-SLOW-NEXT: # kill: def $cl killed $cl killed $ecx
	; X64-SLOW-NEXT: shll %cl, %eax			; X64-SLOW-NEXT: shrl %cl, %eax
	; X64-SLOW-NEXT: orl %edx, %eax
	; X64-SLOW-NEXT: # kill: def $ax killed $ax killed $eax			; X64-SLOW-NEXT: # kill: def $ax killed $ax killed $eax
	; X64-SLOW-NEXT: retq			; X64-SLOW-NEXT: retq
	%tmp = tail call i16 @llvm.fshr.i16(i16 %x, i16 %y, i16 %z)			%tmp = tail call i16 @llvm.fshr.i16(i16 %x, i16 %y, i16 %z)
	ret i16 %tmp			ret i16 %tmp
	}			}

	define i32 @var_shift_i32(i32 %x, i32 %y, i32 %z) nounwind {			define i32 @var_shift_i32(i32 %x, i32 %y, i32 %z) nounwind {
	; X86-FAST-LABEL: var_shift_i32:			; X86-FAST-LABEL: var_shift_i32:
[466 lines not shown]

llvm/test/CodeGen/X86/rotate-extract.ll

[226 lines not shown]
	}			}

	; Can't evenly factor 16 from 49			; Can't evenly factor 16 from 49
	define i8 @no_extract_udiv(i8 %i) nounwind {			define i8 @no_extract_udiv(i8 %i) nounwind {
	; X86-LABEL: no_extract_udiv:			; X86-LABEL: no_extract_udiv:
	; X86: # %bb.0:			; X86: # %bb.0:
	; X86-NEXT: movzbl {{[0-9]+}}(%esp), %eax			; X86-NEXT: movzbl {{[0-9]+}}(%esp), %eax
	; X86-NEXT: imull $171, %eax, %ecx			; X86-NEXT: imull $171, %eax, %ecx
	; X86-NEXT: shlb $3, %ch
	; X86-NEXT: andb $-16, %ch
	; X86-NEXT: imull $79, %eax, %edx			; X86-NEXT: imull $79, %eax, %edx
	; X86-NEXT: subb %dh, %al			; X86-NEXT: subb %dh, %al
	; X86-NEXT: shrb %al			; X86-NEXT: shrb %al
	; X86-NEXT: addb %dh, %al			; X86-NEXT: addb %dh, %al
	; X86-NEXT: shrb $5, %al			; X86-NEXT: shrb $5, %al
	; X86-NEXT: orb %ch, %al			; X86-NEXT: shlb $3, %ch
	; X86-NEXT: # kill: def $al killed $al killed $eax			; X86-NEXT: orb %al, %ch
				; X86-NEXT: andb $-9, %ch
				; X86-NEXT: movb %ch, %al
	; X86-NEXT: retl			; X86-NEXT: retl
	;			;
	; X64-LABEL: no_extract_udiv:			; X64-LABEL: no_extract_udiv:
	; X64: # %bb.0:			; X64: # %bb.0:
	; X64-NEXT: movzbl %dil, %eax			; X64-NEXT: movzbl %dil, %ecx
	; X64-NEXT: imull $171, %eax, %ecx			; X64-NEXT: imull $171, %ecx, %eax
	; X64-NEXT: shrl $8, %ecx			; X64-NEXT: shrl $8, %eax
	; X64-NEXT: shlb $3, %cl			; X64-NEXT: imull $79, %ecx, %edx
	; X64-NEXT: andb $-16, %cl
	; X64-NEXT: imull $79, %eax, %edx
	; X64-NEXT: shrl $8, %edx			; X64-NEXT: shrl $8, %edx
	; X64-NEXT: subb %dl, %al			; X64-NEXT: subb %dl, %cl
	; X64-NEXT: shrb %al			; X64-NEXT: shrb %cl
	; X64-NEXT: addb %dl, %al			; X64-NEXT: addb %dl, %cl
	; X64-NEXT: shrb $5, %al			; X64-NEXT: shrb $5, %cl
				; X64-NEXT: shlb $3, %al
	; X64-NEXT: orb %cl, %al			; X64-NEXT: orb %cl, %al
				; X64-NEXT: andb $-9, %al
	; X64-NEXT: # kill: def $al killed $al killed $eax			; X64-NEXT: # kill: def $al killed $al killed $eax
	; X64-NEXT: retq			; X64-NEXT: retq
	%lhs_div = udiv i8 %i, 3			%lhs_div = udiv i8 %i, 3
	%rhs_div = udiv i8 %i, 49			%rhs_div = udiv i8 %i, 49
	%lhs_shift = shl i8 %lhs_div,4			%lhs_shift = shl i8 %lhs_div,4
	%out = or i8 %lhs_shift, %rhs_div			%out = or i8 %lhs_shift, %rhs_div
	ret i8 %out			ret i8 %out
	}			}
[59 lines not shown]
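
The `no_extract_udiv` check lines change shape but should still compute `(i / 3) << 4 | (i / 49)`. The sketch below models the old and new X86 sequences in C++ and compares both against that expression for every i8 input. It is an interpretation of the check lines with hypothetical function names; the multipliers 171 and 79 and the masks -16 (0xF0) and -9 (0xF7) are taken from the assembly.

```cpp
#include <cassert>
#include <cstdint>

// What the IR computes, with i8 wrap-around on the shift.
uint8_t reference(uint8_t i) {
  return static_cast<uint8_t>(((i / 3) << 4) | (i / 49));
}

// udiv by 49 as both sequences compute it, using the multiplier 79.
uint8_t div49(uint8_t i) {
  uint8_t t = static_cast<uint8_t>((i * 79u) >> 8);  // imull $79 ; bits 15:8
  uint8_t q = static_cast<uint8_t>(i - t);           // subb
  q = static_cast<uint8_t>(q >> 1);                  // shrb
  q = static_cast<uint8_t>(q + t);                   // addb
  return static_cast<uint8_t>(q >> 5);               // shrb $5
}

uint8_t oldSequence(uint8_t i) {
  uint8_t hi = static_cast<uint8_t>((i * 171u) >> 8);  // imull $171 ; bits 15:8
  hi = static_cast<uint8_t>(hi << 3);                  // shlb $3
  hi &= 0xF0;                                          // andb $-16
  return static_cast<uint8_t>(hi | div49(i));          // orb
}

uint8_t newSequence(uint8_t i) {
  uint8_t hi = static_cast<uint8_t>((i * 171u) >> 8);  // imull $171 ; bits 15:8
  hi = static_cast<uint8_t>(hi << 3);                  // shlb $3
  hi = static_cast<uint8_t>(hi | div49(i));            // orb
  hi &= 0xF7;                                          // andb $-9
  return hi;
}

void checkUdivSequences() {
  for (unsigned v = 0; v < 256; ++v) {
    uint8_t i = v;
    assert(oldSequence(i) == reference(i));
    assert(newSequence(i) == reference(i));
  }
}
```

The new form can mask after the OR because `i / 49` is at most 5 and so only occupies bits 0 to 2; clearing bit 3 afterwards removes the one unwanted bit left over from shifting by 3 instead of dividing by 2 and shifting by 4.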