This is an archive of the discontinued LLVM Phabricator instance.

Differential D83602

[DAGCombiner] Scalarize splats with just one demanded lane
Abandoned, Public

Authored by tlively on Jul 10 2020, 6:16 PM.


Details

Reviewers

aheejin
dschuff
arsenm
spatel
RKSimon
craig.topper

Summary

This patch implements a combine to scalarize subtrees of the selection
DAG that produce splat values for which only a single lane is
demanded. The scalarization only happens when the target supports
scalar versions of each operation in the subtree to avoid introducing
any new transitions between vector and scalar registers and to avoid
potentially-expensive expansions of scalarized operations.
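
To make the idea concrete, here is a minimal standalone sketch of the combine on a toy expression tree. It is not the patch code: `ToyNode`, `toyScalarizeSplat`, and the other names are hypothetical and are not LLVM types or APIs. The toy also evaluates concrete values instead of building new scalar nodes, and it leaves out the legality and one-use checks that the summary describes.

```cpp
#include <cstdint>
#include <memory>
#include <utility>

// Toy stand-ins for selection-DAG nodes. These are NOT LLVM types.
enum class ToyOp { Splat, Opaque, Add, Sub, And };

struct ToyNode {
  ToyOp Opcode;
  uint32_t Scalar;                    // splatted value, used by Splat only
  std::shared_ptr<ToyNode> LHS, RHS;  // operands, used by binops only
  ToyNode(ToyOp O, uint32_t S, std::shared_ptr<ToyNode> L,
          std::shared_ptr<ToyNode> R)
      : Opcode(O), Scalar(S), LHS(std::move(L)), RHS(std::move(R)) {}
};
using ToyNodePtr = std::shared_ptr<ToyNode>;

// A vector with the same value in every lane.
ToyNodePtr toySplat(uint32_t V) {
  return std::make_shared<ToyNode>(ToyOp::Splat, V, nullptr, nullptr);
}

// A vector whose lanes are unknown, e.g. a loaded value.
ToyNodePtr toyOpaque() {
  return std::make_shared<ToyNode>(ToyOp::Opaque, 0, nullptr, nullptr);
}

ToyNodePtr toyBinop(ToyOp O, ToyNodePtr L, ToyNodePtr R) {
  return std::make_shared<ToyNode>(O, 0, std::move(L), std::move(R));
}

// Returns true and sets Out when every lane of N is provably the same value,
// i.e. N is a splat or an element-wise Add/Sub/And of such subtrees.
bool toyScalarizeSplat(const ToyNodePtr &N, uint32_t &Out) {
  switch (N->Opcode) {
  case ToyOp::Splat:
    Out = N->Scalar;
    return true;
  case ToyOp::Add:
  case ToyOp::Sub:
  case ToyOp::And: {
    uint32_t L = 0, R = 0;
    if (!toyScalarizeSplat(N->LHS, L) || !toyScalarizeSplat(N->RHS, R))
      return false; // some operand needs the full vector, so bail out
    if (N->Opcode == ToyOp::Add)
      Out = L + R;
    else if (N->Opcode == ToyOp::Sub)
      Out = L - R;
    else
      Out = L & R;
    return true;
  }
  case ToyOp::Opaque:
    return false;
  }
  return false;
}
```

Calling `toyScalarizeSplat` on `toyBinop(ToyOp::Add, toySplat(3), toySplat(4))` succeeds with the scalar sum, while any subtree that contains a `toyOpaque()` operand is rejected, mirroring the all-or-nothing behaviour described in the summary.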

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	530 ms	linux > SanitizerCommon-asan-x86_64-Linux.Linux::Unknown Unit Message ("")
	330 ms	linux > SanitizerCommon-lsan-x86_64-Linux.Linux::Unknown Unit Message ("")
	430 ms	linux > SanitizerCommon-msan-x86_64-Linux.Linux::Unknown Unit Message ("")
	530 ms	linux > SanitizerCommon-tsan-x86_64-Linux.Linux::Unknown Unit Message ("")
	390 ms	linux > SanitizerCommon-ubsan-x86_64-Linux.Linux::Unknown Unit Message ("")
		View Full Test Results (6 Failed)

Event Timeline

tlively created this revision. Jul 10 2020, 6:16 PM

Herald added a project: Restricted Project. Jul 10 2020, 6:16 PM

Herald added subscribers: llvm-commits, ecnelises, hiraditya and 3 others.

tlively added a child revision: D83605: [SelectionDAG][WebAssembly] Recognize splat value ABS operations. Jul 10 2020, 6:46 PM

Harbormaster failed remote builds in B63833: Diff 277186! Jul 10 2020, 6:50 PM

Is this supposed to fix some lowering-produced code?
If not, shouldn't this be best done in the middle-end?

In D83602#2145785, @lebedev.ri wrote:

Is this supposed to fix some lowering-produced code?
If not, shouldn't this be best done in the middle-end?

Yes, this fixes lowering-produced code. In particular, WebAssembly's vector shift instructions take a scalar shift amount, but in LLVM IR vector shifts take vector shift amounts. WebAssembly's lowering then needs to scalarize the shift entirely except when the shift amount is a splat value, in which case it can just take one lane as the scalar shift amount. This sequence of patches improves codegen in that case.
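
The equivalence this reply relies on can be checked with a small standalone sketch. It models a 16-lane byte vector with `std::array`; `Lanes16`, `shlPerLane`, and `shlByScalar` are illustrative names, not LLVM or WebAssembly APIs, and the shift amount is masked to the lane width only to keep the C++ well-defined.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Lanes16 = std::array<uint8_t, 16>;

// Broadcast one byte to all 16 lanes.
Lanes16 splat16(uint8_t X) {
  Lanes16 V{};
  V.fill(X);
  return V;
}

// Element-wise add with wrap-around, like an i8 vector add.
Lanes16 addLanes(const Lanes16 &A, const Lanes16 &B) {
  Lanes16 R{};
  for (std::size_t I = 0; I < R.size(); ++I)
    R[I] = static_cast<uint8_t>(A[I] + B[I]);
  return R;
}

// LLVM-IR-style shift: every lane carries its own shift amount.
Lanes16 shlPerLane(const Lanes16 &V, const Lanes16 &Amt) {
  Lanes16 R{};
  for (std::size_t I = 0; I < R.size(); ++I)
    R[I] = static_cast<uint8_t>(V[I] << (Amt[I] & 7));
  return R;
}

// WebAssembly-style shift: one scalar amount applied to all lanes.
Lanes16 shlByScalar(const Lanes16 &V, uint8_t Amt) {
  Lanes16 R{};
  for (std::size_t I = 0; I < R.size(); ++I)
    R[I] = static_cast<uint8_t>(V[I] << (Amt & 7));
  return R;
}
```

When the amount vector is `addLanes(splat16(A), splat16(B))`, every lane holds the same value, so `shlPerLane` gives the same result as `shlByScalar` with the scalar `A + B` truncated to a byte. That is the case where extracting a single lane, or better, never building the vector at all, is sufficient.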

I think this is a nice idea! But I'd like people working on other targets to check this too.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17658	I can imagine `ADD` is legal for `MVT::i32`, but when `Opc` is `BUILD_VECTOR` or `SPLAT_VECTOR`, are they legal operations for scalar types too, such as `MVT::i32`? If so, why? Aren't they for vector types?
17676	Can't we use this code for `BUILD_VECTOR` here for all splat-type vectors, including `SPLAT_VECTOR`, `BUILD_VECTOR`, and `SHUFFLE_VECTOR`? `getSplatSourceVector` seems to handle all these.

Would it make sense to generalize or add a TLI hook similar to the one used in scalarizeExtractedBinop()?

/// Try to convert an extract element of a vector binary operation into an
/// extract element followed by a scalar operation.
virtual bool shouldScalarizeBinop(SDValue VecOp) const;
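
As a rough illustration of what a hook-based design could look like, the sketch below uses a toy class hierarchy. Everything in it is hypothetical: `ToyTargetLowering`, `shouldScalarizeSplatOp`, and `ToyOpcode` are invented for this example and do not exist in LLVM. The only point is that the target-independent combine asks the target instead of hard-coding an opcode list.

```cpp
// Hypothetical opcodes and hook; none of these names exist in LLVM.
enum class ToyOpcode { Add, Sub, And, Mul, SDiv };

struct ToyTargetLowering {
  virtual ~ToyTargetLowering() = default;
  // Default: be conservative and never scalarize.
  virtual bool shouldScalarizeSplatOp(ToyOpcode) const { return false; }
};

// A target with cheap scalar forms of the listed operations opts in.
struct ToyScalarFriendlyTarget : ToyTargetLowering {
  bool shouldScalarizeSplatOp(ToyOpcode Opc) const override {
    return Opc == ToyOpcode::Add || Opc == ToyOpcode::Sub ||
           Opc == ToyOpcode::And;
  }
};

// The target-independent combine queries the hook before transforming.
bool combineMayScalarize(const ToyTargetLowering &TLI, ToyOpcode Opc) {
  return TLI.shouldScalarizeSplatOp(Opc);
}
```

With this shape, targets that do not override the hook see no change in behaviour, which would limit the test churn a new combine causes on unrelated backends.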

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17677–17679	Why only these 3 binops? Could this be TLI.isBinOp(Opc) instead?

srj added a subscriber: srj. Jul 14 2020, 2:50 PM

aheejin added inline comments. Jul 14 2020, 9:18 PM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17677–17679	I was wondering about the same thing. I suspect the author might have tried to match opcodes listed here. But those opcodes in `SelectionDAG::isSplatValue` look like they were selected somewhat arbitrarily in the first place (it says they are "common patterns").

We've been making progress performing similar ops in the vector-combiner pass based on cost metrics - have you looked at performing it there?

In D83602#2146156, @tlively wrote:

Yes, this fixes lowering-produced code. In particular, WebAssembly's vector shift instructions take a scalar shift amount, but in LLVM IR vector shifts take vector shift amounts. WebAssembly's lowering then needs to scalarize the shift entirely except when the shift amount is a splat value, in which case it can just take one lane as the scalar shift amount. This sequence of patches improves codegen in that case.

X86/SSE uses a similar 'vector shift by scalar' approach - and SimplifyDemandedVectorElts etc. manages to remove similar issues - is it possible that WebAssembly is just missing a combine from its shift ops to try and simplify the operands?
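
The demanded-elements reasoning mentioned here can be illustrated with a toy helper. This is not LLVM's implementation: `demandedSourceLanes` is a hypothetical function that, for a single-input shuffle, maps a bitmask of demanded result lanes back to the source lanes that are actually read.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// For a single-input shuffle, return the bitmask of source lanes needed to
// produce the demanded result lanes. Negative mask entries mean "undef".
// The toy supports at most 32 lanes.
uint32_t demandedSourceLanes(const std::vector<int> &ShuffleMask,
                             uint32_t DemandedResult) {
  uint32_t DemandedSrc = 0;
  for (std::size_t I = 0; I < ShuffleMask.size() && I < 32; ++I) {
    bool LaneDemanded = ((DemandedResult >> I) & 1u) != 0;
    if (LaneDemanded && ShuffleMask[I] >= 0 && ShuffleMask[I] < 32)
      DemandedSrc |= uint32_t(1) << ShuffleMask[I];
  }
  return DemandedSrc;
}
```

For a splat shuffle, where all 16 mask entries are 0, only source lane 0 is demanded regardless of which result lanes are requested. If a shift node then demands only lane 0 of its amount operand, everything feeding the other lanes becomes dead, which is the kind of simplification the reviewer suggests might already cover the WebAssembly case.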

Thank you all for the comments! I agree that the opcodes in SelectionDAG::isSplatValue are rather arbitrary, so a more principled approach using a TLI hook might be better. I will take a look at what X86 is doing to see if a simpler solution would work for WebAssembly, too.

I'm taking this patch sequence out of the review queue for now, pending an investigation of the alternatives suggested above.

@tlively Is this still necessary? My guess is that SimplifyDemandedVectorElts target node handling should do everything you need?

Herald added subscribers: wingo, pengfei. Jul 10 2021, 3:20 AM

I'm not sure, although I recently started working in this area again. I'll close this revision and open a new one if any changes do turn out to be necessary.

Revision Contents

Path	Size
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp	50 lines
llvm/test/CodeGen/AArch64/arm64-nvcast.ll	13 lines
llvm/test/CodeGen/SystemZ/vec-trunc-to-i1.ll	7 lines
llvm/test/CodeGen/WebAssembly/simd-shift-complex-splats.ll	9 lines
llvm/test/CodeGen/X86/avx512-calling-conv.ll	190 lines

Diff 277186

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp


(17,634 lines not shown)	if (isAnyConstantBuildVector(Op0, true) ||
SDValue Ext0 = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT, Op0, Index);		SDValue Ext0 = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT, Op0, Index);
SDValue Ext1 = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT, Op1, Index);		SDValue Ext1 = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT, Op1, Index);
return DAG.getNode(Vec.getOpcode(), DL, VT, Ext0, Ext1);		return DAG.getNode(Vec.getOpcode(), DL, VT, Ext0, Ext1);
}		}

return SDValue();		return SDValue();
}		}

		/// Transform a vector operation on a splatted vector into a scalar operation on
		/// the splat value.
		static SDValue scalarizeSplatValue(SDValue Vec, EVT ResVT, SelectionDAG &DAG) {
		  // Don't scalarize if we need the full vector anyway.
		  // TODO: Handle the case where there are multiple extract_elt users
		  if (!Vec.hasOneUse() || Vec.getNode()->getNumValues() != 1)
		    return SDValue();

		  SDLoc DL(Vec);
		  const TargetLowering &TLI = DAG.getTargetLoweringInfo();
		  EVT ScalarVT = Vec.getValueType().getScalarType();
		  unsigned Opc = Vec.getOpcode();

		  // Don't scalarize if the target does not support the scalar operation
		  if (!TLI.isOperationLegalOrCustomOrPromote(Opc, ResVT))
		    return SDValue();
		aheejin: I can imagine `ADD` is legal for `MVT::i32`, but when `Opc` is `BUILD_VECTOR` or `SPLAT_VECTOR`, are they legal operations for scalar types too, such as `MVT::i32`? If so, why? Aren't they for vector types?

		  // TODO: Check for all the structures recognized by SelectionDAG::isSplatValue
		  switch (Opc) {
		  default:
		    break;
		  case ISD::SPLAT_VECTOR: {
		    SDValue SplatVal = Vec.getOperand(0);
		    return ResVT == ScalarVT ? SplatVal
		                             : DAG.getAnyExtOrTrunc(SplatVal, DL, ResVT);
		  }
		  case ISD::BUILD_VECTOR: {
		    int SplatIdx;
		    if (!DAG.getSplatSourceVector(Vec, SplatIdx))
		      break;
		    SDValue SplatVal = Vec.getOperand(SplatIdx);
		    return ResVT == ScalarVT ? SplatVal
		                             : DAG.getAnyExtOrTrunc(SplatVal, DL, ResVT);
		  }
		aheejin: Can't we use this code for `BUILD_VECTOR` here for all splat-type vectors, including `SPLAT_VECTOR`, `BUILD_VECTOR`, and `SHUFFLE_VECTOR`? `getSplatSourceVector` seems to handle all these.
		  case ISD::ADD:
		  case ISD::SUB:
		  case ISD::AND: {
		spatel: Why only these 3 binops? Could this be TLI.isBinOp(Opc) instead?
		aheejin: I was wondering about the same thing. I suspect the author might have tried to match opcodes listed here. But those opcodes in `SelectionDAG::isSplatValue` look like they were selected somewhat arbitrarily in the first place (it says they are "common patterns").
		    SDValue LHS, RHS;
		    if ((LHS = scalarizeSplatValue(Vec.getOperand(0), ResVT, DAG)) &&
		        (RHS = scalarizeSplatValue(Vec.getOperand(1), ResVT, DAG)))
		      return DAG.getNode(Opc, DL, ResVT, LHS, RHS);
		    break;
		  }
		  }
		  return SDValue();
		}

SDValue DAGCombiner::visitEXTRACT_VECTOR_ELT(SDNode *N) {		SDValue DAGCombiner::visitEXTRACT_VECTOR_ELT(SDNode *N) {
SDValue VecOp = N->getOperand(0);		SDValue VecOp = N->getOperand(0);
SDValue Index = N->getOperand(1);		SDValue Index = N->getOperand(1);
EVT ScalarVT = N->getValueType(0);		EVT ScalarVT = N->getValueType(0);
EVT VecVT = VecOp.getValueType();		EVT VecVT = VecOp.getValueType();
if (VecOp.isUndef())		if (VecOp.isUndef())
return DAG.getUNDEF(ScalarVT);		return DAG.getUNDEF(ScalarVT);

(95 lines not shown)	if (LegalTypes && BCSrc.getValueType().isInteger() &&
return DAG.getAnyExtOrTrunc(X, DL, ScalarVT);		return DAG.getAnyExtOrTrunc(X, DL, ScalarVT);
}		}
}		}
}		}

if (SDValue BO = scalarizeExtractedBinop(N, DAG, LegalOperations))		if (SDValue BO = scalarizeExtractedBinop(N, DAG, LegalOperations))
return BO;		return BO;

		if (SDValue SO = scalarizeSplatValue(VecOp, ScalarVT, DAG))
		return SO;

// Transform: (EXTRACT_VECTOR_ELT( VECTOR_SHUFFLE )) -> EXTRACT_VECTOR_ELT.		// Transform: (EXTRACT_VECTOR_ELT( VECTOR_SHUFFLE )) -> EXTRACT_VECTOR_ELT.
// We only perform this optimization before the op legalization phase because		// We only perform this optimization before the op legalization phase because
// we may introduce new vector instructions which are not backed by TD		// we may introduce new vector instructions which are not backed by TD
// patterns. For example on AVX, extracting elements from a wide vector		// patterns. For example on AVX, extracting elements from a wide vector
// without using extract_subvector. However, if we can find an underlying		// without using extract_subvector. However, if we can find an underlying
// scalar value, then we can always use that.		// scalar value, then we can always use that.
if (IndexC && VecOp.getOpcode() == ISD::VECTOR_SHUFFLE) {		if (IndexC && VecOp.getOpcode() == ISD::VECTOR_SHUFFLE) {
auto *Shuf = cast<ShuffleVectorSDNode>(VecOp);		auto *Shuf = cast<ShuffleVectorSDNode>(VecOp);
(4,345 lines not shown)

llvm/test/CodeGen/AArch64/arm64-nvcast.ll

(18 lines not shown)	entry:
%v2 = extractelement <3 x float> <float 0.000000e+00, float 2.000000e+00, float 0.000000e+00>, i32 %v1		%v2 = extractelement <3 x float> <float 0.000000e+00, float 2.000000e+00, float 0.000000e+00>, i32 %v1
store float %v2, float* %p1, align 4		store float %v2, float* %p1, align 4
ret void		ret void
}		}

define void @test2(float * %p1, i32 %v1) {		define void @test2(float * %p1, i32 %v1) {
; CHECK-LABEL: test2:		; CHECK-LABEL: test2:
; CHECK: ; %bb.0: ; %entry		; CHECK: ; %bb.0: ; %entry
; CHECK-NEXT: sub sp, sp, #16 ; =16		; CHECK-NEXT: mov w8, #1061109567
; CHECK-NEXT: .cfi_def_cfa_offset 16		; CHECK-NEXT: str w8, [x0]
; CHECK-NEXT: ; kill: def $w1 killed $w1 def $x1
; CHECK-NEXT: movi.16b v0, #63
; CHECK-NEXT: and x8, x1, #0x3
; CHECK-NEXT: mov x9, sp
; CHECK-NEXT: str q0, [sp]
; CHECK-NEXT: bfi x9, x8, #2, #2
; CHECK-NEXT: ldr s0, [x9]
; CHECK-NEXT: str s0, [x0]
; CHECK-NEXT: add sp, sp, #16 ; =16
; CHECK-NEXT: ret		; CHECK-NEXT: ret
entry:		entry:
%v2 = extractelement <3 x float> <float 0.7470588088035583, float 0.7470588088035583, float 0.7470588088035583>, i32 %v1		%v2 = extractelement <3 x float> <float 0.7470588088035583, float 0.7470588088035583, float 0.7470588088035583>, i32 %v1
store float %v2, float* %p1, align 4		store float %v2, float* %p1, align 4
ret void		ret void
}		}
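
The new `test2` expectation stores the immediate `1061109567` directly, where the old output materialized the vector with `movi.16b v0, #63`. The following standalone check, which assumes IEEE-754 binary32 floats, sketches why the two agree: the splatted float constant has the bit pattern `0x3F3F3F3F`, which is also the byte 63 repeated four times. `floatBits` and `repeatByte` are helper names made up for this example.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float's storage as a 32-bit integer.
// Assumes float is IEEE-754 binary32.
uint32_t floatBits(float F) {
  static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");
  uint32_t Bits = 0;
  std::memcpy(&Bits, &F, sizeof(Bits));
  return Bits;
}

// Build the 32-bit word formed by repeating one byte four times, i.e. the
// contents of each 32-bit lane after a byte splat.
uint32_t repeatByte(uint8_t B) { return 0x01010101u * B; }
```

Both `floatBits(0.7470588088035583f)` and `repeatByte(63)` evaluate to `1061109567`, so once the variable-index extract from the splat is reduced to the splat value, the result is a plain 32-bit constant store.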


(26 lines not shown)

llvm/test/CodeGen/SystemZ/vec-trunc-to-i1.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z13 | FileCheck %s			; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z13 | FileCheck %s
	;			;
	; Check that a widening truncate to a vector of i1 elements can be handled.			; Check that a widening truncate to a vector of i1 elements can be handled.

	define void @pr32275(<4 x i8> %B15) {			define void @pr32275(<4 x i8> %B15) {
	; CHECK-LABEL: pr32275:			; CHECK-LABEL: pr32275:
	; CHECK: # %bb.0: # %BB			; CHECK: # %bb.0: # %BB
	; CHECK-NEXT: vlgvb %r0, %v24, 3			; CHECK-NEXT: vlgvb %r0, %v24, 3
	; CHECK-NEXT: vlvgp %v0, %r0, %r0
	; CHECK-NEXT: vrepif %v1, 1
	; CHECK-NEXT: vn %v0, %v0, %v1
	; CHECK-NEXT: vlgvf %r0, %v0, 3
	; CHECK-NEXT: .LBB0_1: # %CF34			; CHECK-NEXT: .LBB0_1: # %CF34
	; CHECK-NEXT: # =>This Inner Loop Header: Depth=1			; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: cijlh %r0, 0, .LBB0_1			; CHECK-NEXT: tmll %r0, 1
				; CHECK-NEXT: jne .LBB0_1
	; CHECK-NEXT: # %bb.2: # %CF36			; CHECK-NEXT: # %bb.2: # %CF36
	; CHECK-NEXT: br %r14			; CHECK-NEXT: br %r14
	BB:			BB:
	br label %CF34			br label %CF34

	CF34:			CF34:
	%Tr24 = trunc <4 x i8> %B15 to <4 x i1>			%Tr24 = trunc <4 x i8> %B15 to <4 x i1>
	%E28 = extractelement <4 x i1> %Tr24, i32 3			%E28 = extractelement <4 x i1> %Tr24, i32 3
	br i1 %E28, label %CF34, label %CF36			br i1 %E28, label %CF34, label %CF36

	CF36:			CF36:
	ret void			ret void
	}			}

llvm/test/CodeGen/WebAssembly/simd-shift-complex-splats.ll

	; RUN: llc < %s -asm-verbose=false -verify-machineinstrs -disable-wasm-fallthrough-return-opt -wasm-disable-explicit-locals -wasm-keep-registers -mattr=+simd128 | FileCheck %s			; RUN: llc < %s -asm-verbose=false -verify-machineinstrs -disable-wasm-fallthrough-return-opt -wasm-disable-explicit-locals -wasm-keep-registers -mattr=+simd128 | FileCheck %s

	; Test that SIMD shifts can be lowered correctly even with shift			; Test that SIMD shifts can be lowered correctly even with shift
	; values that are more complex than plain splats.			; values that are more complex than plain splats.

	target datalayout = "e-m:e-p:32:32-i64:64-n32:64-S128"			target datalayout = "e-m:e-p:32:32-i64:64-n32:64-S128"
	target triple = "wasm32-unknown-unknown"			target triple = "wasm32-unknown-unknown"

	;; TODO: Optimize this further by scalarizing the add			;; TODO: Optimize this further by scalarizing the add

	; CHECK-LABEL: shl_add:			; CHECK-LABEL: shl_add:
	; CHECK-NEXT: .functype shl_add (v128, i32, i32) -> (v128)			; CHECK-NEXT: .functype shl_add (v128, i32, i32) -> (v128)
	; CHECK-NEXT: i8x16.splat $push1=, $1			; CHECK-NEXT: i32.add $push0=, $1, $2
	; CHECK-NEXT: i8x16.splat $push0=, $2			; CHECK-NEXT: i8x16.shl $push1=, $0, $pop0
	; CHECK-NEXT: i8x16.add $push2=, $pop1, $pop0			; CHECK-NEXT: return $pop1
	; CHECK-NEXT: i8x16.extract_lane_u $push3=, $pop2, 0
	; CHECK-NEXT: i8x16.shl $push4=, $0, $pop3
	; CHECK-NEXT: return $pop4
	define <16 x i8> @shl_add(<16 x i8> %v, i8 %a, i8 %b) {			define <16 x i8> @shl_add(<16 x i8> %v, i8 %a, i8 %b) {
	%t1 = insertelement <16 x i8> undef, i8 %a, i32 0			%t1 = insertelement <16 x i8> undef, i8 %a, i32 0
	%va = shufflevector <16 x i8> %t1, <16 x i8> undef, <16 x i32> zeroinitializer			%va = shufflevector <16 x i8> %t1, <16 x i8> undef, <16 x i32> zeroinitializer
	%t2 = insertelement <16 x i8> undef, i8 %b, i32 0			%t2 = insertelement <16 x i8> undef, i8 %b, i32 0
	%vb = shufflevector <16 x i8> %t2, <16 x i8> undef, <16 x i32> zeroinitializer			%vb = shufflevector <16 x i8> %t2, <16 x i8> undef, <16 x i32> zeroinitializer
	%shift = add <16 x i8> %va, %vb			%shift = add <16 x i8> %va, %vb
	%r = shl <16 x i8> %v, %shift			%r = shl <16 x i8> %v, %shift
	ret <16 x i8> %r			ret <16 x i8> %r
	(77 lines not shown)

llvm/test/CodeGen/X86/avx512-calling-conv.ll

	(779 lines not shown)
	; KNL-NEXT: korw %k1, %k0, %k0			; KNL-NEXT: korw %k1, %k0, %k0
	; KNL-NEXT: movw $-4097, %di ## imm = 0xEFFF			; KNL-NEXT: movw $-4097, %di ## imm = 0xEFFF
	; KNL-NEXT: kmovw %edi, %k2			; KNL-NEXT: kmovw %edi, %k2
	; KNL-NEXT: kandw %k2, %k0, %k0			; KNL-NEXT: kandw %k2, %k0, %k0
	; KNL-NEXT: movb {{[0-9]+}}(%rsp), %dil			; KNL-NEXT: movb {{[0-9]+}}(%rsp), %dil
	; KNL-NEXT: kmovw %edi, %k1			; KNL-NEXT: kmovw %edi, %k1
	; KNL-NEXT: kshiftlw $15, %k1, %k1			; KNL-NEXT: kshiftlw $15, %k1, %k1
	; KNL-NEXT: kshiftrw $3, %k1, %k1			; KNL-NEXT: kshiftrw $3, %k1, %k1
	; KNL-NEXT: korw %k1, %k0, %k1			; KNL-NEXT: korw %k1, %k0, %k0
	; KNL-NEXT: movw $-8193, %di ## imm = 0xDFFF			; KNL-NEXT: movw $-8193, %di ## imm = 0xDFFF
	; KNL-NEXT: kmovw %edi, %k0			; KNL-NEXT: kmovw %edi, %k1
	; KNL-NEXT: kandw %k0, %k1, %k1			; KNL-NEXT: kandw %k1, %k0, %k0
	; KNL-NEXT: movb {{[0-9]+}}(%rsp), %dil			; KNL-NEXT: movb {{[0-9]+}}(%rsp), %dil
	; KNL-NEXT: kmovw %edi, %k6			; KNL-NEXT: kmovw %edi, %k6
	; KNL-NEXT: kshiftlw $15, %k6, %k6			; KNL-NEXT: kshiftlw $15, %k6, %k6
	; KNL-NEXT: kshiftrw $2, %k6, %k6			; KNL-NEXT: kshiftrw $2, %k6, %k6
	; KNL-NEXT: korw %k6, %k1, %k6			; KNL-NEXT: korw %k6, %k0, %k6
	; KNL-NEXT: movw $-16385, %di ## imm = 0xBFFF			; KNL-NEXT: movw $-16385, %di ## imm = 0xBFFF
	; KNL-NEXT: kmovw %edi, %k1			; KNL-NEXT: kmovw %edi, %k0
	; KNL-NEXT: kandw %k1, %k6, %k6			; KNL-NEXT: kandw %k0, %k6, %k6
	; KNL-NEXT: movb {{[0-9]+}}(%rsp), %dil			; KNL-NEXT: movb {{[0-9]+}}(%rsp), %dil
	; KNL-NEXT: kmovw %edi, %k7			; KNL-NEXT: kmovw %edi, %k7
	; KNL-NEXT: kshiftlw $14, %k7, %k7			; KNL-NEXT: kshiftlw $14, %k7, %k7
	; KNL-NEXT: korw %k7, %k6, %k6			; KNL-NEXT: korw %k7, %k6, %k6
	; KNL-NEXT: kshiftlw $1, %k6, %k6			; KNL-NEXT: kshiftlw $1, %k6, %k6
	; KNL-NEXT: kshiftrw $1, %k6, %k6			; KNL-NEXT: kshiftrw $1, %k6, %k6
	; KNL-NEXT: movb {{[0-9]+}}(%rsp), %dil			; KNL-NEXT: movb {{[0-9]+}}(%rsp), %dil
	; KNL-NEXT: kmovw %edi, %k7			; KNL-NEXT: kmovw %edi, %k7
	(72 lines not shown)
	; KNL-NEXT: kshiftrw $4, %k4, %k4			; KNL-NEXT: kshiftrw $4, %k4, %k4
	; KNL-NEXT: korw %k4, %k3, %k3			; KNL-NEXT: korw %k4, %k3, %k3
	; KNL-NEXT: kandw %k2, %k3, %k2			; KNL-NEXT: kandw %k2, %k3, %k2
	; KNL-NEXT: movb {{[0-9]+}}(%rsp), %cl			; KNL-NEXT: movb {{[0-9]+}}(%rsp), %cl
	; KNL-NEXT: kmovw %ecx, %k3			; KNL-NEXT: kmovw %ecx, %k3
	; KNL-NEXT: kshiftlw $15, %k3, %k3			; KNL-NEXT: kshiftlw $15, %k3, %k3
	; KNL-NEXT: kshiftrw $3, %k3, %k3			; KNL-NEXT: kshiftrw $3, %k3, %k3
	; KNL-NEXT: korw %k3, %k2, %k2			; KNL-NEXT: korw %k3, %k2, %k2
	; KNL-NEXT: kandw %k0, %k2, %k0			; KNL-NEXT: kandw %k1, %k2, %k1
	; KNL-NEXT: movb {{[0-9]+}}(%rsp), %cl			; KNL-NEXT: movb {{[0-9]+}}(%rsp), %cl
	; KNL-NEXT: kmovw %ecx, %k2			; KNL-NEXT: kmovw %ecx, %k2
	; KNL-NEXT: kshiftlw $15, %k2, %k2			; KNL-NEXT: kshiftlw $15, %k2, %k2
	; KNL-NEXT: kshiftrw $2, %k2, %k2			; KNL-NEXT: kshiftrw $2, %k2, %k2
	; KNL-NEXT: korw %k2, %k0, %k0			; KNL-NEXT: korw %k2, %k1, %k1
	; KNL-NEXT: xorl %ecx, %ecx			; KNL-NEXT: kandw %k0, %k1, %k0
	; KNL-NEXT: testb $1, {{[0-9]+}}(%rsp)			; KNL-NEXT: movb {{[0-9]+}}(%rsp), %cl
	; KNL-NEXT: movl $65535, %edx ## imm = 0xFFFF			; KNL-NEXT: kmovw %ecx, %k1
	; KNL-NEXT: movl $0, %esi
	; KNL-NEXT: cmovnel %edx, %esi
	; KNL-NEXT: testb $1, {{[0-9]+}}(%rsp)
	; KNL-NEXT: cmovnel %edx, %ecx
	; KNL-NEXT: kandw %k1, %k0, %k0
	; KNL-NEXT: movb {{[0-9]+}}(%rsp), %dl
	; KNL-NEXT: kmovw %edx, %k1
	; KNL-NEXT: kshiftlw $14, %k1, %k1			; KNL-NEXT: kshiftlw $14, %k1, %k1
	; KNL-NEXT: korw %k1, %k0, %k0			; KNL-NEXT: korw %k1, %k0, %k0
	; KNL-NEXT: kshiftlw $1, %k0, %k0			; KNL-NEXT: kshiftlw $1, %k0, %k0
	; KNL-NEXT: kshiftrw $1, %k0, %k0			; KNL-NEXT: kshiftrw $1, %k0, %k0
	; KNL-NEXT: movb {{[0-9]+}}(%rsp), %dl			; KNL-NEXT: movb {{[0-9]+}}(%rsp), %cl
	; KNL-NEXT: kmovw %edx, %k1			; KNL-NEXT: kmovw %ecx, %k1
	; KNL-NEXT: kshiftlw $15, %k1, %k1			; KNL-NEXT: kshiftlw $15, %k1, %k1
	; KNL-NEXT: korw %k1, %k0, %k0			; KNL-NEXT: korw %k1, %k0, %k0
	; KNL-NEXT: kmovw %esi, %k1			; KNL-NEXT: kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k1 ## 2-byte Reload
	; KNL-NEXT: kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 ## 2-byte Reload			; KNL-NEXT: kandw %k1, %k0, %k0
	; KNL-NEXT: kandw %k2, %k0, %k0			; KNL-NEXT: movb {{[0-9]+}}(%rsp), %cl
	; KNL-NEXT: kmovw %ecx, %k2
	; KNL-NEXT: kandw %k1, %k2, %k1
	; KNL-NEXT: kmovw %k1, %r8d
	; KNL-NEXT: kshiftrw $1, %k0, %k1			; KNL-NEXT: kshiftrw $1, %k0, %k1
	; KNL-NEXT: kmovw %k1, %r9d			; KNL-NEXT: kmovw %k1, %r9d
	; KNL-NEXT: kshiftrw $2, %k0, %k1			; KNL-NEXT: kshiftrw $2, %k0, %k1
	; KNL-NEXT: kmovw %k1, %r10d			; KNL-NEXT: kmovw %k1, %r10d
	; KNL-NEXT: kshiftrw $3, %k0, %k1			; KNL-NEXT: kshiftrw $3, %k0, %k1
	; KNL-NEXT: kmovw %k1, %r11d			; KNL-NEXT: kmovw %k1, %r11d
	; KNL-NEXT: kshiftrw $4, %k0, %k1			; KNL-NEXT: kshiftrw $4, %k0, %k1
	; KNL-NEXT: kmovw %k1, %r12d			; KNL-NEXT: kmovw %k1, %r12d
	; KNL-NEXT: kshiftrw $5, %k0, %k1			; KNL-NEXT: kshiftrw $5, %k0, %k1
	; KNL-NEXT: kmovw %k1, %r15d			; KNL-NEXT: kmovw %k1, %r15d
	; KNL-NEXT: kshiftrw $6, %k0, %k1			; KNL-NEXT: kshiftrw $6, %k0, %k1
	; KNL-NEXT: kmovw %k1, %r14d			; KNL-NEXT: kmovw %k1, %r14d
	; KNL-NEXT: kshiftrw $7, %k0, %k1			; KNL-NEXT: kshiftrw $7, %k0, %k1
	; KNL-NEXT: kmovw %k1, %r13d			; KNL-NEXT: kmovw %k1, %r13d
	; KNL-NEXT: kshiftrw $8, %k0, %k1			; KNL-NEXT: kshiftrw $8, %k0, %k1
	; KNL-NEXT: kmovw %k1, %ebx			; KNL-NEXT: kmovw %k1, %ebx
	; KNL-NEXT: kshiftrw $9, %k0, %k1			; KNL-NEXT: kshiftrw $9, %k0, %k1
	; KNL-NEXT: kmovw %k1, %esi			; KNL-NEXT: kmovw %k1, %esi
	; KNL-NEXT: kshiftrw $10, %k0, %k1			; KNL-NEXT: kshiftrw $10, %k0, %k1
	; KNL-NEXT: kmovw %k1, %ebp			; KNL-NEXT: kmovw %k1, %ebp
	; KNL-NEXT: kshiftrw $11, %k0, %k1			; KNL-NEXT: kshiftrw $11, %k0, %k1
	; KNL-NEXT: kmovw %k1, %ecx			; KNL-NEXT: kmovw %k1, %r8d
	; KNL-NEXT: kshiftrw $12, %k0, %k1			; KNL-NEXT: kshiftrw $12, %k0, %k1
	; KNL-NEXT: kmovw %k1, %edx			; KNL-NEXT: kmovw %k1, %edx
	; KNL-NEXT: kshiftrw $13, %k0, %k1			; KNL-NEXT: kshiftrw $13, %k0, %k1
	; KNL-NEXT: kmovw %k1, %edi			; KNL-NEXT: kmovw %k1, %edi
	; KNL-NEXT: kshiftrw $14, %k0, %k1			; KNL-NEXT: kshiftrw $14, %k0, %k1
	; KNL-NEXT: andl $1, %r8d			; KNL-NEXT: andb {{[0-9]+}}(%rsp), %cl
	; KNL-NEXT: movb %r8b, 2(%rax)			; KNL-NEXT: movzbl %cl, %ecx
	; KNL-NEXT: kmovw %k0, %r8d			; KNL-NEXT: andl $1, %ecx
	; KNL-NEXT: andl $1, %r8d			; KNL-NEXT: movb %cl, 2(%rax)
				; KNL-NEXT: kmovw %k0, %ecx
				; KNL-NEXT: andl $1, %ecx
	; KNL-NEXT: andl $1, %r9d			; KNL-NEXT: andl $1, %r9d
	; KNL-NEXT: leal (%r8,%r9,2), %r8d			; KNL-NEXT: leal (%rcx,%r9,2), %r9d
	; KNL-NEXT: kmovw %k1, %r9d			; KNL-NEXT: kmovw %k1, %ecx
	; KNL-NEXT: kshiftrw $15, %k0, %k0			; KNL-NEXT: kshiftrw $15, %k0, %k0
	; KNL-NEXT: andl $1, %r10d			; KNL-NEXT: andl $1, %r10d
	; KNL-NEXT: leal (%r8,%r10,4), %r8d			; KNL-NEXT: leal (%r9,%r10,4), %r9d
	; KNL-NEXT: kmovw %k0, %r10d			; KNL-NEXT: kmovw %k0, %r10d
	; KNL-NEXT: andl $1, %r11d			; KNL-NEXT: andl $1, %r11d
	; KNL-NEXT: leal (%r8,%r11,8), %r8d			; KNL-NEXT: leal (%r9,%r11,8), %r9d
	; KNL-NEXT: andl $1, %r12d			; KNL-NEXT: andl $1, %r12d
	; KNL-NEXT: shll $4, %r12d			; KNL-NEXT: shll $4, %r12d
	; KNL-NEXT: orl %r8d, %r12d			; KNL-NEXT: orl %r9d, %r12d
	; KNL-NEXT: andl $1, %r15d			; KNL-NEXT: andl $1, %r15d
	; KNL-NEXT: shll $5, %r15d			; KNL-NEXT: shll $5, %r15d
	; KNL-NEXT: orl %r12d, %r15d			; KNL-NEXT: orl %r12d, %r15d
	; KNL-NEXT: andl $1, %r14d			; KNL-NEXT: andl $1, %r14d
	; KNL-NEXT: shll $6, %r14d			; KNL-NEXT: shll $6, %r14d
	; KNL-NEXT: andl $1, %r13d			; KNL-NEXT: andl $1, %r13d
	; KNL-NEXT: shll $7, %r13d			; KNL-NEXT: shll $7, %r13d
	; KNL-NEXT: orl %r14d, %r13d			; KNL-NEXT: orl %r14d, %r13d
	; KNL-NEXT: andl $1, %ebx			; KNL-NEXT: andl $1, %ebx
	; KNL-NEXT: shll $8, %ebx			; KNL-NEXT: shll $8, %ebx
	; KNL-NEXT: orl %r13d, %ebx			; KNL-NEXT: orl %r13d, %ebx
	; KNL-NEXT: andl $1, %esi			; KNL-NEXT: andl $1, %esi
	; KNL-NEXT: shll $9, %esi			; KNL-NEXT: shll $9, %esi
	; KNL-NEXT: orl %ebx, %esi			; KNL-NEXT: orl %ebx, %esi
	; KNL-NEXT: andl $1, %ebp			; KNL-NEXT: andl $1, %ebp
	; KNL-NEXT: shll $10, %ebp			; KNL-NEXT: shll $10, %ebp
	; KNL-NEXT: orl %esi, %ebp			; KNL-NEXT: orl %esi, %ebp
	; KNL-NEXT: orl %r15d, %ebp			; KNL-NEXT: orl %r15d, %ebp
	; KNL-NEXT: andl $1, %ecx			; KNL-NEXT: andl $1, %r8d
	; KNL-NEXT: shll $11, %ecx			; KNL-NEXT: shll $11, %r8d
	; KNL-NEXT: andl $1, %edx			; KNL-NEXT: andl $1, %edx
	; KNL-NEXT: shll $12, %edx			; KNL-NEXT: shll $12, %edx
	; KNL-NEXT: orl %ecx, %edx			; KNL-NEXT: orl %r8d, %edx
	; KNL-NEXT: andl $1, %edi			; KNL-NEXT: andl $1, %edi
	; KNL-NEXT: shll $13, %edi			; KNL-NEXT: shll $13, %edi
	; KNL-NEXT: orl %edx, %edi			; KNL-NEXT: orl %edx, %edi
	; KNL-NEXT: andl $1, %r9d			; KNL-NEXT: andl $1, %ecx
	; KNL-NEXT: shll $14, %r9d			; KNL-NEXT: shll $14, %ecx
	; KNL-NEXT: orl %edi, %r9d			; KNL-NEXT: orl %edi, %ecx
	; KNL-NEXT: andl $1, %r10d			; KNL-NEXT: andl $1, %r10d
	; KNL-NEXT: shll $15, %r10d			; KNL-NEXT: shll $15, %r10d
	; KNL-NEXT: orl %r9d, %r10d			; KNL-NEXT: orl %ecx, %r10d
	; KNL-NEXT: orl %ebp, %r10d			; KNL-NEXT: orl %ebp, %r10d
	; KNL-NEXT: movw %r10w, (%rax)			; KNL-NEXT: movw %r10w, (%rax)
	; KNL-NEXT: popq %rbx			; KNL-NEXT: popq %rbx
	; KNL-NEXT: popq %r12			; KNL-NEXT: popq %r12
	; KNL-NEXT: popq %r13			; KNL-NEXT: popq %r13
	; KNL-NEXT: popq %r14			; KNL-NEXT: popq %r14
	; KNL-NEXT: popq %r15			; KNL-NEXT: popq %r15
	; KNL-NEXT: popq %rbp			; KNL-NEXT: popq %rbp
	(532 lines not shown)
	; KNL_X32-NEXT: kshiftrw $3, %k3, %k3			; KNL_X32-NEXT: kshiftrw $3, %k3, %k3
	; KNL_X32-NEXT: korw %k3, %k2, %k2			; KNL_X32-NEXT: korw %k3, %k2, %k2
	; KNL_X32-NEXT: kandw %k1, %k2, %k1			; KNL_X32-NEXT: kandw %k1, %k2, %k1
	; KNL_X32-NEXT: movb {{[0-9]+}}(%esp), %al			; KNL_X32-NEXT: movb {{[0-9]+}}(%esp), %al
	; KNL_X32-NEXT: kmovw %eax, %k2			; KNL_X32-NEXT: kmovw %eax, %k2
	; KNL_X32-NEXT: kshiftlw $15, %k2, %k2			; KNL_X32-NEXT: kshiftlw $15, %k2, %k2
	; KNL_X32-NEXT: kshiftrw $2, %k2, %k2			; KNL_X32-NEXT: kshiftrw $2, %k2, %k2
	; KNL_X32-NEXT: korw %k2, %k1, %k1			; KNL_X32-NEXT: korw %k2, %k1, %k1
	; KNL_X32-NEXT: xorl %eax, %eax
	; KNL_X32-NEXT: testb $1, {{[0-9]+}}(%esp)
	; KNL_X32-NEXT: movl $65535, %ecx ## imm = 0xFFFF
	; KNL_X32-NEXT: movl $0, %edx
	; KNL_X32-NEXT: cmovnel %ecx, %edx
	; KNL_X32-NEXT: kandw %k0, %k1, %k0			; KNL_X32-NEXT: kandw %k0, %k1, %k0
	; KNL_X32-NEXT: movb {{[0-9]+}}(%esp), %bl			; KNL_X32-NEXT: movb {{[0-9]+}}(%esp), %al
	; KNL_X32-NEXT: kmovw %ebx, %k1			; KNL_X32-NEXT: kmovw %eax, %k1
	; KNL_X32-NEXT: kshiftlw $14, %k1, %k1			; KNL_X32-NEXT: kshiftlw $14, %k1, %k1
	; KNL_X32-NEXT: korw %k1, %k0, %k0			; KNL_X32-NEXT: korw %k1, %k0, %k0
	; KNL_X32-NEXT: kshiftlw $1, %k0, %k0			; KNL_X32-NEXT: kshiftlw $1, %k0, %k0
	; KNL_X32-NEXT: kshiftrw $1, %k0, %k0			; KNL_X32-NEXT: kshiftrw $1, %k0, %k0
	; KNL_X32-NEXT: movb {{[0-9]+}}(%esp), %bl			; KNL_X32-NEXT: movb {{[0-9]+}}(%esp), %al
	; KNL_X32-NEXT: kmovw %ebx, %k1			; KNL_X32-NEXT: kmovw %eax, %k1
	; KNL_X32-NEXT: kshiftlw $15, %k1, %k1			; KNL_X32-NEXT: kshiftlw $15, %k1, %k1
	; KNL_X32-NEXT: korw %k1, %k0, %k0			; KNL_X32-NEXT: korw %k1, %k0, %k0
	; KNL_X32-NEXT: kmovw %edx, %k1			; KNL_X32-NEXT: kmovw {{[-0-9]+}}(%e{{[sb]}}p), %k1 ## 2-byte Reload
	; KNL_X32-NEXT: testb $1, {{[0-9]+}}(%esp)			; KNL_X32-NEXT: kandw %k1, %k0, %k0
	; KNL_X32-NEXT: cmovnel %ecx, %eax
	; KNL_X32-NEXT: kmovw {{[-0-9]+}}(%e{{[sb]}}p), %k2 ## 2-byte Reload
	; KNL_X32-NEXT: kandw %k2, %k0, %k0
	; KNL_X32-NEXT: kmovw %eax, %k2
	; KNL_X32-NEXT: kandw %k1, %k2, %k1
	; KNL_X32-NEXT: movl {{[0-9]+}}(%esp), %eax			; KNL_X32-NEXT: movl {{[0-9]+}}(%esp), %eax
	; KNL_X32-NEXT: kmovw %k1, %ebx			; KNL_X32-NEXT: movb {{[0-9]+}}(%esp), %dl
	; KNL_X32-NEXT: kshiftrw $1, %k0, %k1			; KNL_X32-NEXT: kshiftrw $1, %k0, %k1
	; KNL_X32-NEXT: kmovw %k1, %esi
	; KNL_X32-NEXT: kshiftrw $2, %k0, %k1
	; KNL_X32-NEXT: kmovw %k1, %edi			; KNL_X32-NEXT: kmovw %k1, %edi
				; KNL_X32-NEXT: kshiftrw $2, %k0, %k1
				; KNL_X32-NEXT: kmovw %k1, %ebx
	; KNL_X32-NEXT: kshiftrw $3, %k0, %k1			; KNL_X32-NEXT: kshiftrw $3, %k0, %k1
	; KNL_X32-NEXT: kmovw %k1, %ebp			; KNL_X32-NEXT: kmovw %k1, %ebp
	; KNL_X32-NEXT: kshiftrw $4, %k0, %k1			; KNL_X32-NEXT: kshiftrw $4, %k0, %k1
	; KNL_X32-NEXT: kmovw %k1, %edx			; KNL_X32-NEXT: kmovw %k1, %esi
	; KNL_X32-NEXT: kshiftrw $5, %k0, %k1			; KNL_X32-NEXT: kshiftrw $5, %k0, %k1
	; KNL_X32-NEXT: kmovw %k1, %ecx			; KNL_X32-NEXT: kmovw %k1, %ecx
	; KNL_X32-NEXT: kshiftrw $6, %k0, %k1			; KNL_X32-NEXT: kshiftrw $6, %k0, %k1
	; KNL_X32-NEXT: andl $1, %ebx			; KNL_X32-NEXT: andb {{[0-9]+}}(%esp), %dl
	; KNL_X32-NEXT: movb %bl, 2(%eax)			; KNL_X32-NEXT: movzbl %dl, %edx
	; KNL_X32-NEXT: kmovw %k0, %ebx			; KNL_X32-NEXT: andl $1, %edx
	; KNL_X32-NEXT: andl $1, %ebx			; KNL_X32-NEXT: movb %dl, 2(%eax)
	; KNL_X32-NEXT: andl $1, %esi			; KNL_X32-NEXT: kmovw %k0, %edx
	; KNL_X32-NEXT: leal (%ebx,%esi,2), %esi			; KNL_X32-NEXT: andl $1, %edx
	; KNL_X32-NEXT: kmovw %k1, %ebx
	; KNL_X32-NEXT: kshiftrw $7, %k0, %k1
	; KNL_X32-NEXT: andl $1, %edi			; KNL_X32-NEXT: andl $1, %edi
	; KNL_X32-NEXT: leal (%esi,%edi,4), %esi			; KNL_X32-NEXT: leal (%edx,%edi,2), %edx
	; KNL_X32-NEXT: kmovw %k1, %edi			; KNL_X32-NEXT: kmovw %k1, %edi
				; KNL_X32-NEXT: kshiftrw $7, %k0, %k1
				; KNL_X32-NEXT: andl $1, %ebx
				; KNL_X32-NEXT: leal (%edx,%ebx,4), %edx
				; KNL_X32-NEXT: kmovw %k1, %ebx
	; KNL_X32-NEXT: kshiftrw $8, %k0, %k1			; KNL_X32-NEXT: kshiftrw $8, %k0, %k1
	; KNL_X32-NEXT: andl $1, %ebp			; KNL_X32-NEXT: andl $1, %ebp
	; KNL_X32-NEXT: leal (%esi,%ebp,8), %esi			; KNL_X32-NEXT: leal (%edx,%ebp,8), %edx
	; KNL_X32-NEXT: kmovw %k1, %ebp			; KNL_X32-NEXT: kmovw %k1, %ebp
	; KNL_X32-NEXT: kshiftrw $9, %k0, %k1			; KNL_X32-NEXT: kshiftrw $9, %k0, %k1
	; KNL_X32-NEXT: andl $1, %edx			; KNL_X32-NEXT: andl $1, %esi
	; KNL_X32-NEXT: shll $4, %edx			; KNL_X32-NEXT: shll $4, %esi
	; KNL_X32-NEXT: orl %esi, %edx			; KNL_X32-NEXT: orl %edx, %esi
	; KNL_X32-NEXT: kmovw %k1, %esi			; KNL_X32-NEXT: kmovw %k1, %edx
	; KNL_X32-NEXT: kshiftrw $10, %k0, %k1			; KNL_X32-NEXT: kshiftrw $10, %k0, %k1
	; KNL_X32-NEXT: andl $1, %ecx			; KNL_X32-NEXT: andl $1, %ecx
	; KNL_X32-NEXT: shll $5, %ecx			; KNL_X32-NEXT: shll $5, %ecx
	; KNL_X32-NEXT: orl %edx, %ecx			; KNL_X32-NEXT: orl %esi, %ecx
	; KNL_X32-NEXT: kmovw %k1, %edx			; KNL_X32-NEXT: kmovw %k1, %esi
	; KNL_X32-NEXT: kshiftrw $11, %k0, %k1			; KNL_X32-NEXT: kshiftrw $11, %k0, %k1
	; KNL_X32-NEXT: andl $1, %ebx
	; KNL_X32-NEXT: shll $6, %ebx
	; KNL_X32-NEXT: andl $1, %edi			; KNL_X32-NEXT: andl $1, %edi
	; KNL_X32-NEXT: shll $7, %edi			; KNL_X32-NEXT: shll $6, %edi
	; KNL_X32-NEXT: orl %ebx, %edi			; KNL_X32-NEXT: andl $1, %ebx
	; KNL_X32-NEXT: kmovw %k1, %ebx			; KNL_X32-NEXT: shll $7, %ebx
				; KNL_X32-NEXT: orl %edi, %ebx
				; KNL_X32-NEXT: kmovw %k1, %edi
	; KNL_X32-NEXT: kshiftrw $12, %k0, %k1			; KNL_X32-NEXT: kshiftrw $12, %k0, %k1
	; KNL_X32-NEXT: andl $1, %ebp			; KNL_X32-NEXT: andl $1, %ebp
	; KNL_X32-NEXT: shll $8, %ebp			; KNL_X32-NEXT: shll $8, %ebp
	; KNL_X32-NEXT: orl %edi, %ebp			; KNL_X32-NEXT: orl %ebx, %ebp
	; KNL_X32-NEXT: kmovw %k1, %edi			; KNL_X32-NEXT: kmovw %k1, %ebx
	; KNL_X32-NEXT: kshiftrw $13, %k0, %k1			; KNL_X32-NEXT: kshiftrw $13, %k0, %k1
	; KNL_X32-NEXT: andl $1, %esi			; KNL_X32-NEXT: andl $1, %edx
	; KNL_X32-NEXT: shll $9, %esi			; KNL_X32-NEXT: shll $9, %edx
	; KNL_X32-NEXT: orl %ebp, %esi			; KNL_X32-NEXT: orl %ebp, %edx
	; KNL_X32-NEXT: kmovw %k1, %ebp			; KNL_X32-NEXT: kmovw %k1, %ebp
	; KNL_X32-NEXT: kshiftrw $14, %k0, %k1			; KNL_X32-NEXT: kshiftrw $14, %k0, %k1
	; KNL_X32-NEXT: andl $1, %edx			; KNL_X32-NEXT: andl $1, %esi
	; KNL_X32-NEXT: shll $10, %edx			; KNL_X32-NEXT: shll $10, %esi
	; KNL_X32-NEXT: orl %esi, %edx			; KNL_X32-NEXT: orl %edx, %esi
	; KNL_X32-NEXT: kmovw %k1, %esi			; KNL_X32-NEXT: kmovw %k1, %edx
	; KNL_X32-NEXT: kshiftrw $15, %k0, %k0			; KNL_X32-NEXT: kshiftrw $15, %k0, %k0
	; KNL_X32-NEXT: orl %ecx, %edx			; KNL_X32-NEXT: orl %ecx, %esi
	; KNL_X32-NEXT: kmovw %k0, %ecx			; KNL_X32-NEXT: kmovw %k0, %ecx
	; KNL_X32-NEXT: andl $1, %ebx
	; KNL_X32-NEXT: shll $11, %ebx
	; KNL_X32-NEXT: andl $1, %edi			; KNL_X32-NEXT: andl $1, %edi
	; KNL_X32-NEXT: shll $12, %edi			; KNL_X32-NEXT: shll $11, %edi
	; KNL_X32-NEXT: orl %ebx, %edi			; KNL_X32-NEXT: andl $1, %ebx
				; KNL_X32-NEXT: shll $12, %ebx
				; KNL_X32-NEXT: orl %edi, %ebx
	; KNL_X32-NEXT: andl $1, %ebp			; KNL_X32-NEXT: andl $1, %ebp
	; KNL_X32-NEXT: shll $13, %ebp			; KNL_X32-NEXT: shll $13, %ebp
	; KNL_X32-NEXT: orl %edi, %ebp			; KNL_X32-NEXT: orl %ebx, %ebp
	; KNL_X32-NEXT: andl $1, %esi			; KNL_X32-NEXT: andl $1, %edx
	; KNL_X32-NEXT: shll $14, %esi			; KNL_X32-NEXT: shll $14, %edx
	; KNL_X32-NEXT: orl %ebp, %esi			; KNL_X32-NEXT: orl %ebp, %edx
	; KNL_X32-NEXT: andl $1, %ecx			; KNL_X32-NEXT: andl $1, %ecx
	; KNL_X32-NEXT: shll $15, %ecx			; KNL_X32-NEXT: shll $15, %ecx
	; KNL_X32-NEXT: orl %esi, %ecx
	; KNL_X32-NEXT: orl %edx, %ecx			; KNL_X32-NEXT: orl %edx, %ecx
				; KNL_X32-NEXT: orl %esi, %ecx
	; KNL_X32-NEXT: movw %cx, (%eax)			; KNL_X32-NEXT: movw %cx, (%eax)
	; KNL_X32-NEXT: addl $20, %esp			; KNL_X32-NEXT: addl $20, %esp
	; KNL_X32-NEXT: popl %esi			; KNL_X32-NEXT: popl %esi
	; KNL_X32-NEXT: popl %edi			; KNL_X32-NEXT: popl %edi
	; KNL_X32-NEXT: popl %ebx			; KNL_X32-NEXT: popl %ebx
	; KNL_X32-NEXT: popl %ebp			; KNL_X32-NEXT: popl %ebp
	; KNL_X32-NEXT: retl $4			; KNL_X32-NEXT: retl $4
	;			;
	(2,552 lines not shown)