This is an archive of the discontinued LLVM Phabricator instance.

AVX512: VMOVDQU8/16/32/64 (load) intrinsic implementation.
Closed, Public

Authored by igorb on Jan 13 2016, 2:16 AM.

Details

Reviewers

AsafBadouh
delena
mbodart
DavidKreitzer

Commits

rG1e5bafbc826a: AVX512: VMOVDQU8/16/32/64 (load) intrinsic implementation.
rL258657: AVX512: VMOVDQU8/16/32/64 (load) intrinsic implementation.

Summary

AVX512: VMOVDQU8/16/32/64 (load) intrinsic implementation.

This patch overrides LowerOperationWrapper() for the X86 target.
The reason is as follows: AVX-512 has load intrinsics with a 64-bit mask (the <64 x i8> case), and i64 requires legalization in 32-bit mode.
All other intrinsics with a 64-bit mask are lowered during type legalization, and that works correctly.
The load intrinsic, however, is legalized and converted to a load SDNode, which has two results: a value and a chain.
That is why we customized LowerOperationWrapper(): so that it can push more than one result.
In some cases, though, the original node has one result and the new node comes with two. This is the case for SINT_TO_FP lowered through a store and load (FILD).
In that case we simply drop the second result (the chain).
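
The result-copying rule described in the summary can be modelled without any LLVM headers. The sketch below is only an illustration: `Node` and `lowerOperationWrapper` are hypothetical stand-ins, not LLVM's classes, and each result is represented by nothing more than its type name. It copies exactly as many results as the original node had, so an extra trailing chain on the lowered node is not forwarded.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an SDNode: just a list of result type names.
struct Node {
  std::vector<std::string> results;
};

// Copy the first N results of the lowered node, where N is the number of
// results of the original node. Any extra trailing result is not copied.
std::vector<std::string> lowerOperationWrapper(const Node &original,
                                               const Node &lowered) {
  assert(original.results.size() <= lowered.results.size() &&
         "Lowering returned the wrong number of results!");
  std::vector<std::string> out;
  for (std::size_t i = 0; i != original.results.size(); ++i)
    out.push_back(lowered.results[i]);
  return out;
}
```

With a one-result original (`f64`) and a two-result replacement (`f64`, `ch`), only `f64` is kept. With a two-result original and a two-result replacement, both results are kept, which is the masked load case.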

Diff Detail

Repository: rL LLVM

Event Timeline

igorb updated this revision to Diff 44719. Jan 13 2016, 2:16 AM

igorb retitled this revision to AVX512: VMOVDQU8/16/32/64 (load) intrinsic implementation.

igorb updated this object.

igorb added reviewers: delena, AsafBadouh.

igorb set the repository for this revision to rL LLVM.

igorb added a subscriber: llvm-commits.

delena added a reviewer: mbodart. Jan 13 2016, 11:42 AM

igorb updated this object. Jan 14 2016, 6:35 AM

igorb added a reviewer: DavidKreitzer.

I'm not sure I understand why this is just now becoming an issue.

Is the need for an X86-specific override of LowerOperationWrapper driven by an existing problem
with SINT_TO_FP lowering, or by a problem that is only exposed when adding the new masked load intrinsics?

Is the chain being dropped for both SINT_TO_FP and masked loads, or just one of them?

What are the safety consequences of dropping the chain, wrt losing an ordering dependence?

And why is I64 write mask legalization fine for most masked intrinsics, but not the masked loads?
Is it because none of the other existing I64-masked intrinsics produce an additional chain result?

A concrete example or two, showing the DAG snippets during legalization, would be helpful.

mitch

lib/Target/X86/X86ISelLowering.h
689	Please add the "override" indicator.
lib/Target/X86/X86InstrAVX512.td
2753–2754	Can you please explain why these patterns are being deleted?
test/CodeGen/X86/avx512-intrinsics.ll
6643–6644	This kind of test is OK for now. But note that the second vmovdqu32 is completely redundant. And the last vmovdqu32's source operand could be replaced by %zmm0. So the test may need maintenance as optimizations improve. This same pattern is used in many test functions in these test files, and of course they will all have the same issue.
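
As a reading aid for the masking behaviour these tests exercise, here is a minimal sketch of a masked load on 16 lanes of 32 bits. It assumes the semantics suggested by the intrinsic signature (pointer, pass-through vector, mask): a lane is read from memory when its mask bit is set and otherwise keeps the pass-through value, so zero masking is the special case of a zero pass-through. `Vec`, `kLanes`, and `maskedLoad` are illustrative names, not part of LLVM.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16;
using Vec = std::array<std::int32_t, kLanes>;

// Model of a masked load: lane i is read from memory when mask bit i is set,
// otherwise it keeps the pass-through value.
Vec maskedLoad(const Vec &mem, const Vec &passthru, std::uint16_t mask) {
  Vec out{};
  for (std::size_t i = 0; i < kLanes; ++i)
    out[i] = ((mask >> i) & 1u) ? mem[i] : passthru[i];
  return out;
}
```

Calling `maskedLoad(mem, zero, 0xFFFF)` models the unmasked load, `maskedLoad(mem2, res, mask)` the merge-masked form, and `maskedLoad(mem, zero, mask)` the zero-masked form used in the test functions.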

Thanks for the review!

In D16137#327385, @mbodart wrote:

I'm not sure I understand why this is just now becoming an issue.

Is the need for an X86-specific override of LowerOperationWrapper driven by an existing problem
with SINT_TO_FP lowering, or by a problem that is only exposed when adding the new masked load intrinsics?

The problem is exposed only when adding the new masked load intrinsics. In the previous implementation only one value was taken (the chain was dropped).

Is the chain being dropped for both SINT_TO_FP and masked loads, or just one of them.

Only for SINT_TO_FP (or any other similar node).

What are the safety consequences of dropping the chain, wrt losing an ordering dependence?

In the SINT_TO_FP case the store->load chain is preserved; as I understand it, the chain result of the LOAD node can be dropped.

And why is I64 write mask legalization fine for most masked intrinsics, but not the masked loads?
Is it because none of the other existing I64-masked intrinsics produce an additional chain result?

Yes, operand type legalization of the masked load intrinsic (packed bytes) is the first case that produces an additional chain result.

A concrete example or two, showing the DAG snippets during legalization, would be helpful.

 SINT_TO_FP DAG snippet

 t6: f64 = sint_to_fp t5
    t5: i64 = build_pair t2, t4
      t2: i32,ch = CopyFromReg t0, Register:i32 %vreg0
        t0: ch = EntryToken
      t4: i32,ch = CopyFromReg t0, Register:i32 %vreg1

Transformed to

  t13: f64,ch = X86ISD::FILD<LD8[FixedStack0]> t11, FrameIndex:i32<0>, ValueType:ch:i64
    t11: ch = store<ST8[FixedStack0](align=4)> t0, t5, FrameIndex:i32<0>, undef:i32
      t0: ch = EntryToken
      t5: i64 = build_pair t2, t4
        t2: i32,ch = CopyFromReg t0, Register:i32 %vreg0
        t4: i32,ch = CopyFromReg t0, Register:i32 %vreg1                 
  --------------------------------------------------------
masked loads snippet

  t12: v64i8,ch = llvm.x86.avx512.mask.loadu.b.512<LD64[%x0](align=1)> t0, TargetConstant:i32<4681>, t3, t5, t17
    t0: ch = EntryToken
    t3: i32,ch = load<LD4[FixedStack-1](align=16)> t0, FrameIndex:i32<-1>, undef:i32
    t5: v64i8,ch = CopyFromReg t0, Register:v64i8 %vreg0
    t17: i64,ch = load<LD8[FixedStack-2](align=4)> t0, FrameIndex:i32<-2>, undef:i32
    
Transformed to
    
  t30: v64i8,ch = masked_load<LD64[%x0](align=1)> t0, t3, t29, t5
    t0: ch = EntryToken
    t3: i32,ch = load<LD4[FixedStack-1](align=16)> t0, FrameIndex:i32<-1>, undef:i32
    t29: v64i1 = concat_vectors t27, t28
      t27: v32i1 = bitcast t24
        t24: i32 = extract_element t17, Constant:i32<0>
          t17: i64,ch = load<LD8[FixedStack-2](align=4)> t0, FrameIndex:i32<-2>, undef:i32
      t28: v32i1 = bitcast t26
        t26: i32 = extract_element t17, Constant:i32<1>
    t5: v64i8,ch = CopyFromReg t0, Register:v64i8 %vreg0
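
The mask handling in the transformed masked load DAG (two `extract_element` nodes feeding two `v32i1` bitcasts and a `concat_vectors`) can be sketched in plain C++. This is an illustration only, assuming element 0 is the low half of the i64 mask, as on little-endian x86; `MaskHalves`, `splitMask`, and `concatMask` are hypothetical names.

```cpp
#include <bitset>
#include <cstdint>

// The two 32-bit halves of a 64-bit write mask, as a 32-bit target would
// hold them (element 0 = low half, element 1 = high half).
struct MaskHalves {
  std::uint32_t lo;
  std::uint32_t hi;
};

// Analogue of the two extract_element nodes (t24 and t26).
MaskHalves splitMask(std::uint64_t mask) {
  return {static_cast<std::uint32_t>(mask),
          static_cast<std::uint32_t>(mask >> 32)};
}

// Rebuild the 64-lane predicate: lanes 0..31 come from the low half and
// lanes 32..63 from the high half, mirroring concat_vectors t27, t28.
std::bitset<64> concatMask(const MaskHalves &h) {
  std::bitset<64> lanes;
  for (int i = 0; i < 32; ++i) {
    lanes[i] = ((h.lo >> i) & 1u) != 0;
    lanes[i + 32] = ((h.hi >> i) & 1u) != 0;
  }
  return lanes;
}
```

Splitting and then concatenating yields the same 64 lane bits as the original i64 mask, which is why the masked load can keep a single `v64i1` predicate even though the scalar mask no longer fits in one register.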

lib/Target/X86/X86InstrAVX512.td
2753–2754	These intrinsics are handled by the DAG legalization pass (X86ISelLowering.cpp, lowerINTRINSIC_W_CHAIN() function).

Hi Asaf,

Thanks for the answers.

Note that SelectionDAGLegalize::LegalizeOp, in the case of custom lowering, already handles
the case of multiple results, albeit with this FIXME comment:

// FIXME: The handling for custom lowering with multiple results is
// a complete mess.

So it is not clear to me why the new masked load intrinsics require additional changes,
especially since we have existing masked load intrinsics with a chain result.
I'm guessing it's because legalization of the I64 mask operand on IA32 does not go through
this SelectionDAGLegalize::LegalizeOp path. Is that correct?

Assuming we do need an X86 override of LowerOperationWrapper, then my next thought
would be to try and fix the FILD issue in X86TargetLowering::FP_TO_INTHelper. But I don't
know how to express there, in Selection Dag representation, that only one result is needed.
If you know how to do that, then that would be a preferable solution.

If we really need LowerOperationWrapper to discard the FILD's Chain operand, then I think
we need to make it clear in that routine that it should only be happening for the FILD case,
as in general it is not safe to simply drop a result. So a source comment there describing the
situation is needed. And when we do need to drop a result, I think there should be
an assertion checking exactly for the FILD case. That is, we should only ever drop
one result, and it should be a Chain, and if reasonable, we should check that the retained
result is an FILD. This will make it clear to developers why this code is needed, and
thus make it easier to modify in the future.
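
A minimal sketch of the stricter check proposed here, written against a hypothetical model rather than LLVM's API: each result is represented only by its type name, with "ch" standing for a chain, and `keepResultsChecked` is an illustrative name. Exceptions are used in place of `assert` only so that the failure paths can be exercised in a test. Checking that the retained result is an FILD is left out, since the model carries no opcode.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical model: each result is described only by its type name,
// with "ch" standing for a chain.
using ResultTypes = std::vector<std::string>;

// Keep the original node's results, but allow at most one extra result on
// the lowered node, and only if that extra result is a chain.
ResultTypes keepResultsChecked(const ResultTypes &original,
                               const ResultTypes &lowered) {
  const std::size_t n = original.size();
  if (lowered.size() < n)
    throw std::logic_error("lowering returned too few results");
  if (lowered.size() > n + 1)
    throw std::logic_error("more than one result would be dropped");
  if (lowered.size() == n + 1 && lowered.back() != "ch")
    throw std::logic_error("only a chain result may be dropped");
  return ResultTypes(lowered.begin(),
                     lowered.begin() + static_cast<std::ptrdiff_t>(n));
}
```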

Thanks for updating the test functions!
One other thing I noticed about the test files is that most of them do not test for a 32-bit target.
That doesn't need to be added for this change set. But as I64 masking has interesting behavior
on IA32, it would be good to beef up testing there.

regards,

mitch

I think you meant "Hi Igor"
:)

Oops! Thanks for pointing that out Asaf.

Yes, thanks Igor

Elena

In D16137#330156, @mbodart wrote:
Hi Asaf,

Thanks for the answers.

Note that SelectionDAGLegalize::LegalizeOp, in the case of custom lowering, already handles
the case of multiple results, albeit with this FIXME comment:
// FIXME: The handling for custom lowering with multiple results is
// a complete mess.
So it is not clear to me why the new masked load intrinsics require additional changes,
especially since we have existing masked load intrinsics with a chain result.
I'm guessing it's because legalization of the I64 mask operand on IA32 does not go through
this SelectionDAGLegalize::LegalizeOp path. Is that correct?

The existing masked load intrinsics do not have an i64 operand (you mean @llvm.masked.load(), right?).
They go through promotion/widening.

Assuming we do need an X86 override of LowerOperationWrapper, then my next thought
would be to try and fix the FILD issue in X86TargetLowering::FP_TO_INTHelper. But I don't
know how to express there, in Selection Dag representation, that only one result is needed.
If you know how to do that, then that would be a preferable solution.

This workaround is not necessary. There are many places in the code where we convert a one-result node to a two-result node.
This happens every time we use memory for a node transformation. In all these cases we drop the second result. We do not even check that the second result is a chain. See the ReplaceAllUsesWith() code:

void SelectionDAG::ReplaceAllUsesWith(SDNode *From, const SDValue *To) {
  if (From->getNumValues() == 1)  // Handle the simple case efficiently.
    return ReplaceAllUsesWith(SDValue(From, 0), To[0]);
  ...
}

Thanks for the explanations (and patience)!

As dropping the chain seems to be common practice, and the source comment in LowerOperationWrapper describes that, I am fine with the changes.

So, LGTM.

mitch

Thanks everybody for the review!

Closed by commit rL258657: AVX512: VMOVDQU8/16/32/64 (load) intrinsic implementation. (authored by ibreger). Jan 24 2016, 12:08 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                              Size
include/llvm/IR/IntrinsicsX86.td                  58 lines
lib/Target/X86/ (four files, names not listed)    8, 22, 8, and 12 lines
test/CodeGen/X86/avx512-intrinsics.ll             39 lines
test/CodeGen/X86/avx512bw-intrinsics.ll           60 lines
test/CodeGen/X86/avx512bwvl-intrinsics.ll         75 lines
test/CodeGen/X86/avx512vl-intrinsics.ll           77 lines

Diff 45104

include/llvm/IR/IntrinsicsX86.td

@@ ... @@ def int_x86_avx2_maskload_q : GCCBuiltin<"__builtin_ia32_maskloadq">,
           Intrinsic<[llvm_v2i64_ty], [llvm_ptr_ty, llvm_v2i64_ty],
                     [IntrReadArgMem]>;
   def int_x86_avx2_maskload_d_256 : GCCBuiltin<"__builtin_ia32_maskloadd256">,
           Intrinsic<[llvm_v8i32_ty], [llvm_ptr_ty, llvm_v8i32_ty],
                     [IntrReadArgMem]>;
   def int_x86_avx2_maskload_q_256 : GCCBuiltin<"__builtin_ia32_maskloadq256">,
           Intrinsic<[llvm_v4i64_ty], [llvm_ptr_ty, llvm_v4i64_ty],
                     [IntrReadArgMem]>;
-  def int_x86_avx512_mask_loadu_d_512 : GCCBuiltin<"__builtin_ia32_loaddqusi512_mask">,
-          Intrinsic<[llvm_v16i32_ty], [llvm_ptr_ty, llvm_v16i32_ty, llvm_i16_ty],
-                    [IntrReadArgMem]>;
-  def int_x86_avx512_mask_loadu_q_512 : GCCBuiltin<"__builtin_ia32_loaddqudi512_mask">,
-          Intrinsic<[llvm_v8i64_ty], [llvm_ptr_ty, llvm_v8i64_ty, llvm_i8_ty],
-                    [IntrReadArgMem]>;
+
+  def int_x86_avx512_mask_loadu_b_128 :
+          GCCBuiltin<"__builtin_ia32_loaddquqi128_mask">,
+          Intrinsic<[llvm_v16i8_ty],
+          [llvm_ptr_ty, llvm_v16i8_ty, llvm_i16_ty], [IntrReadArgMem]>;
+  def int_x86_avx512_mask_loadu_b_256 :
+          GCCBuiltin<"__builtin_ia32_loaddquqi256_mask">,
+          Intrinsic<[llvm_v32i8_ty],
+          [llvm_ptr_ty, llvm_v32i8_ty, llvm_i32_ty], [IntrReadArgMem]>;
+  def int_x86_avx512_mask_loadu_b_512 :
+          GCCBuiltin<"__builtin_ia32_loaddquqi512_mask">,
+          Intrinsic<[llvm_v64i8_ty],
+          [llvm_ptr_ty, llvm_v64i8_ty, llvm_i64_ty], [IntrReadArgMem]>;
+
+  def int_x86_avx512_mask_loadu_w_128 :
+          GCCBuiltin<"__builtin_ia32_loaddquhi128_mask">,
+          Intrinsic<[llvm_v8i16_ty],
+          [llvm_ptr_ty, llvm_v8i16_ty, llvm_i8_ty], [IntrReadArgMem]>;
+  def int_x86_avx512_mask_loadu_w_256 :
+          GCCBuiltin<"__builtin_ia32_loaddquhi256_mask">,
+          Intrinsic<[llvm_v16i16_ty],
+          [llvm_ptr_ty, llvm_v16i16_ty, llvm_i16_ty], [IntrReadArgMem]>;
+  def int_x86_avx512_mask_loadu_w_512 :
+          GCCBuiltin<"__builtin_ia32_loaddquhi512_mask">,
+          Intrinsic<[llvm_v32i16_ty],
+          [llvm_ptr_ty, llvm_v32i16_ty, llvm_i32_ty], [IntrReadArgMem]>;
+
+  def int_x86_avx512_mask_loadu_d_128 :
+          GCCBuiltin<"__builtin_ia32_loaddqusi128_mask">,
+          Intrinsic<[llvm_v4i32_ty],
+          [llvm_ptr_ty, llvm_v4i32_ty, llvm_i8_ty], [IntrReadArgMem]>;
+  def int_x86_avx512_mask_loadu_d_256 :
+          GCCBuiltin<"__builtin_ia32_loaddqusi256_mask">,
+          Intrinsic<[llvm_v8i32_ty],
+          [llvm_ptr_ty, llvm_v8i32_ty, llvm_i8_ty], [IntrReadArgMem]>;
+  def int_x86_avx512_mask_loadu_d_512 :
+          GCCBuiltin<"__builtin_ia32_loaddqusi512_mask">,
+          Intrinsic<[llvm_v16i32_ty],
+          [llvm_ptr_ty, llvm_v16i32_ty, llvm_i16_ty], [IntrReadArgMem]>;
+
+  def int_x86_avx512_mask_loadu_q_128 :
+          GCCBuiltin<"__builtin_ia32_loaddqudi128_mask">,
+          Intrinsic<[llvm_v2i64_ty],
+          [llvm_ptr_ty, llvm_v2i64_ty, llvm_i8_ty], [IntrReadArgMem]>;
+  def int_x86_avx512_mask_loadu_q_256 :
+          GCCBuiltin<"__builtin_ia32_loaddqudi256_mask">,
+          Intrinsic<[llvm_v4i64_ty],
+          [llvm_ptr_ty, llvm_v4i64_ty, llvm_i8_ty], [IntrReadArgMem]>;
+  def int_x86_avx512_mask_loadu_q_512 :
+          GCCBuiltin<"__builtin_ia32_loaddqudi512_mask">,
+          Intrinsic<[llvm_v8i64_ty],
+          [llvm_ptr_ty, llvm_v8i64_ty, llvm_i8_ty], [IntrReadArgMem]>;
 }

 // Conditional store ops
 let TargetPrefix = "x86" in {  // All intrinsics start with "llvm.x86.".
   def int_x86_avx2_maskstore_d : GCCBuiltin<"__builtin_ia32_maskstored">,
           Intrinsic<[], [llvm_ptr_ty, llvm_v4i32_ty, llvm_v4i32_ty],
                     [IntrReadWriteArgMem]>;
   def int_x86_avx2_maskstore_q : GCCBuiltin<"__builtin_ia32_maskstoreq">,

lib/Target/X86/X86ISelLowering.h

@@ ... @@ public:
   /// specified type. Returns whether it is "fast" in the last argument.
   bool allowsMisalignedMemoryAccesses(EVT VT, unsigned AS, unsigned Align,
                                       bool *Fast) const override;

   /// Provide custom lowering hooks for some operations.
   ///
   SDValue LowerOperation(SDValue Op, SelectionDAG &DAG) const override;

+  /// Places new result values for the node in Results (their number
+  /// and types must exactly match those of the original return values of
+  /// the node), or leaves Results empty, which indicates that the node is not
+  /// to be custom lowered after all.
+  virtual void LowerOperationWrapper(SDNode *N,
+                                     SmallVectorImpl<SDValue> &Results,
+                                     SelectionDAG &DAG) const override;
    [mbodart, Done] Please add the "override" indicator.
+
   /// Replace the results of node with an illegal result
   /// type with new values built out of custom code.
   ///
   void ReplaceNodeResults(SDNode *N, SmallVectorImpl<SDValue>&Results,
                           SelectionDAG &DAG) const override;


   SDValue PerformDAGCombine(SDNode *N, DAGCombinerInfo &DCI) const override;

lib/Target/X86/X86ISelLowering.cpp

@@ ... @@ SDValue X86TargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
   case ISD::MGATHER: return LowerMGATHER(Op, Subtarget, DAG);
   case ISD::MSCATTER: return LowerMSCATTER(Op, Subtarget, DAG);
   case ISD::GC_TRANSITION_START:
     return LowerGC_TRANSITION_START(Op, DAG);
   case ISD::GC_TRANSITION_END: return LowerGC_TRANSITION_END(Op, DAG);
   }
 }

+/// Places new result values for the node in Results (their number
+/// and types must exactly match those of the original return values of
+/// the node), or leaves Results empty, which indicates that the node is not
+/// to be custom lowered after all.
+void X86TargetLowering::LowerOperationWrapper(SDNode *N,
+                                              SmallVectorImpl<SDValue> &Results,
+                                              SelectionDAG &DAG) const {
+  SDValue Res = LowerOperation(SDValue(N, 0), DAG);
+
+  if (!Res.getNode())
+    return;
+
+  assert((N->getNumValues() <= Res->getNumValues()) &&
+      "Lowering returned the wrong number of results!");
+
+  // Places new result values base on N result number.
+  // In some cases (LowerSINT_TO_FP for example) Res has more result values
+  // than original node, chain should be dropped(last value).
+  for (unsigned I = 0, E = N->getNumValues(); I != E; ++I)
+    Results.push_back(Res.getValue(I));
+}
+
 /// ReplaceNodeResults - Replace a node with an illegal result type
 /// with a new node built out of custom code.
 void X86TargetLowering::ReplaceNodeResults(SDNode *N,
                                            SmallVectorImpl<SDValue>&Results,
                                            SelectionDAG &DAG) const {
   SDLoc dl(N);
   const TargetLowering &TLI = DAG.getTargetLoweringInfo();
   switch (N->getOpcode()) {

lib/Target/X86/X86InstrAVX512.td

@@ ... @@

 defm VMOVDQU32 : avx512_load_vl<0x6F, "vmovdqu32", avx512vl_i32_info, HasAVX512>,
                  avx512_store_vl<0x7F, "vmovdqu32", avx512vl_i32_info,
                                  HasAVX512>, XS, EVEX_CD8<32, CD8VF>;

 defm VMOVDQU64 : avx512_load_vl<0x6F, "vmovdqu64", avx512vl_i64_info, HasAVX512>,
                  avx512_store_vl<0x7F, "vmovdqu64", avx512vl_i64_info,
                                  HasAVX512>, XS, VEX_W, EVEX_CD8<64, CD8VF>;

-def: Pat<(v16i32 (int_x86_avx512_mask_loadu_d_512 addr:$ptr,
-                 (v16i32 immAllZerosV), GR16:$mask)),
-       (VMOVDQU32Zrmkz (v16i1 (COPY_TO_REGCLASS GR16:$mask, VK16WM)), addr:$ptr)>;
-
-def: Pat<(v8i64 (int_x86_avx512_mask_loadu_q_512 addr:$ptr,
-                (bc_v8i64 (v16i32 immAllZerosV)), GR8:$mask)),
-       (VMOVDQU64Zrmkz (v8i1 (COPY_TO_REGCLASS GR8:$mask, VK8WM)), addr:$ptr)>;
-
    [mbodart, Not Done] Can you please explain why these patterns are being deleted?
    [igorb, Not Done] This intrinsics is handled by DAG Legalization pass (X86ISelLowering.cpp , lowerINTRINSIC_W_CHAIN() function)
 def: Pat<(int_x86_avx512_mask_storeu_d_512 addr:$ptr, (v16i32 VR512:$src),
            GR16:$mask),
          (VMOVDQU32Zmrk addr:$ptr, (v16i1 (COPY_TO_REGCLASS GR16:$mask, VK16WM)),
             VR512:$src)>;
 def: Pat<(int_x86_avx512_mask_storeu_q_512 addr:$ptr, (v8i64 VR512:$src),
            GR8:$mask),
          (VMOVDQU64Zmrk addr:$ptr, (v8i1 (COPY_TO_REGCLASS GR8:$mask, VK8WM)),
             VR512:$src)>;

lib/Target/X86/X86IntrinsicsInfo.h

@@ ... @@ static const IntrinsicData IntrinsicsWithChain[] = {
   X86_INTRINSIC_DATA(avx512_mask_expand_load_q_512,
                      EXPAND_FROM_MEM, X86ISD::EXPAND, 0),
   X86_INTRINSIC_DATA(avx512_mask_load_pd_128, LOADA, ISD::DELETED_NODE, 0),
   X86_INTRINSIC_DATA(avx512_mask_load_pd_256, LOADA, ISD::DELETED_NODE, 0),
   X86_INTRINSIC_DATA(avx512_mask_load_pd_512, LOADA, ISD::DELETED_NODE, 0),
   X86_INTRINSIC_DATA(avx512_mask_load_ps_128, LOADA, ISD::DELETED_NODE, 0),
   X86_INTRINSIC_DATA(avx512_mask_load_ps_256, LOADA, ISD::DELETED_NODE, 0),
   X86_INTRINSIC_DATA(avx512_mask_load_ps_512, LOADA, ISD::DELETED_NODE, 0),
+  X86_INTRINSIC_DATA(avx512_mask_loadu_b_128, LOADU, ISD::DELETED_NODE, 0),
+  X86_INTRINSIC_DATA(avx512_mask_loadu_b_256, LOADU, ISD::DELETED_NODE, 0),
+  X86_INTRINSIC_DATA(avx512_mask_loadu_b_512, LOADU, ISD::DELETED_NODE, 0),
+  X86_INTRINSIC_DATA(avx512_mask_loadu_d_128, LOADU, ISD::DELETED_NODE, 0),
+  X86_INTRINSIC_DATA(avx512_mask_loadu_d_256, LOADU, ISD::DELETED_NODE, 0),
+  X86_INTRINSIC_DATA(avx512_mask_loadu_d_512, LOADU, ISD::DELETED_NODE, 0),
   X86_INTRINSIC_DATA(avx512_mask_loadu_pd_128, LOADU, ISD::DELETED_NODE, 0),
   X86_INTRINSIC_DATA(avx512_mask_loadu_pd_256, LOADU, ISD::DELETED_NODE, 0),
   X86_INTRINSIC_DATA(avx512_mask_loadu_pd_512, LOADU, ISD::DELETED_NODE, 0),
   X86_INTRINSIC_DATA(avx512_mask_loadu_ps_128, LOADU, ISD::DELETED_NODE, 0),
   X86_INTRINSIC_DATA(avx512_mask_loadu_ps_256, LOADU, ISD::DELETED_NODE, 0),
   X86_INTRINSIC_DATA(avx512_mask_loadu_ps_512, LOADU, ISD::DELETED_NODE, 0),
+  X86_INTRINSIC_DATA(avx512_mask_loadu_q_128, LOADU, ISD::DELETED_NODE, 0),
+  X86_INTRINSIC_DATA(avx512_mask_loadu_q_256, LOADU, ISD::DELETED_NODE, 0),
+  X86_INTRINSIC_DATA(avx512_mask_loadu_q_512, LOADU, ISD::DELETED_NODE, 0),
+  X86_INTRINSIC_DATA(avx512_mask_loadu_w_128, LOADU, ISD::DELETED_NODE, 0),
+  X86_INTRINSIC_DATA(avx512_mask_loadu_w_256, LOADU, ISD::DELETED_NODE, 0),
+  X86_INTRINSIC_DATA(avx512_mask_loadu_w_512, LOADU, ISD::DELETED_NODE, 0),
   X86_INTRINSIC_DATA(avx512_mask_pmov_db_mem_128, TRUNCATE_TO_MEM_VI8,
                      X86ISD::VTRUNC, 0),
   X86_INTRINSIC_DATA(avx512_mask_pmov_db_mem_256, TRUNCATE_TO_MEM_VI8,
                      X86ISD::VTRUNC, 0),
   X86_INTRINSIC_DATA(avx512_mask_pmov_db_mem_512, TRUNCATE_TO_MEM_VI8,
                      X86ISD::VTRUNC, 0),
   X86_INTRINSIC_DATA(avx512_mask_pmov_dw_mem_128, TRUNCATE_TO_MEM_VI16,
                      X86ISD::VTRUNC, 0),

test/CodeGen/X86/avx512-intrinsics.ll

@@ ... @@ ; CHECK-NEXT: retq
   %res = call <8 x i64> @llvm.x86.avx512.mask.prorv.q.512(<8 x i64> %x0, <8 x i64> %x1, <8 x i64> %x2, i8 %x3)
   %res1 = call <8 x i64> @llvm.x86.avx512.mask.prorv.q.512(<8 x i64> %x0, <8 x i64> %x1, <8 x i64> zeroinitializer, i8 %x3)
   %res2 = call <8 x i64> @llvm.x86.avx512.mask.prorv.q.512(<8 x i64> %x0, <8 x i64> %x1, <8 x i64> %x2, i8 -1)
   %res3 = add <8 x i64> %res, %res1
   %res4 = add <8 x i64> %res3, %res2
   ret <8 x i64> %res4
 }

+declare <16 x i32> @llvm.x86.avx512.mask.loadu.d.512(i8*, <16 x i32>, i16)
+
+define <16 x i32> @test_mask_load_unaligned_d(i8* %ptr, i8* %ptr2, <16 x i32> %data, i16 %mask) {
+; CHECK-LABEL: test_mask_load_unaligned_d:
+; CHECK: ## BB#0:
+; CHECK-NEXT: kmovw %edx, %k1
+; CHECK-NEXT: vmovdqu32 (%rdi), %zmm0
+; CHECK-NEXT: vmovdqu32 (%rsi), %zmm0 {%k1}
+; CHECK-NEXT: vmovdqu32 (%rdi), %zmm1 {%k1} {z}
    [mbodart, Done] This kind of test is OK for now. But note that the second vmovdqu32 is completely redundant. And the last vmovdqu32's source operand could be replaced by %zmm0. So the test may need maintenance as optimizations improve. This same pattern is used in many test functions in these test files, and of course they will all have the same issue.
+; CHECK-NEXT: vpaddd %zmm0, %zmm1, %zmm0
+; CHECK-NEXT: retq
+  %res = call <16 x i32> @llvm.x86.avx512.mask.loadu.d.512(i8* %ptr, <16 x i32> zeroinitializer, i16 -1)
+  %res1 = call <16 x i32> @llvm.x86.avx512.mask.loadu.d.512(i8* %ptr2, <16 x i32> %res, i16 %mask)
+  %res2 = call <16 x i32> @llvm.x86.avx512.mask.loadu.d.512(i8* %ptr, <16 x i32> zeroinitializer, i16 %mask)
+  %res4 = add <16 x i32> %res2, %res1
+  ret <16 x i32> %res4
+}
+
+declare <8 x i64> @llvm.x86.avx512.mask.loadu.q.512(i8*, <8 x i64>, i8)
+
+define <8 x i64> @test_mask_load_unaligned_q(i8* %ptr, i8* %ptr2, <8 x i64> %data, i8 %mask) {
+; CHECK-LABEL: test_mask_load_unaligned_q:
+; CHECK: ## BB#0:
+; CHECK-NEXT: movzbl %dl, %eax
+; CHECK-NEXT: kmovw %eax, %k1
+; CHECK-NEXT: vmovdqu64 (%rdi), %zmm0
+; CHECK-NEXT: vmovdqu64 (%rsi), %zmm0 {%k1}
+; CHECK-NEXT: vmovdqu64 (%rdi), %zmm1 {%k1} {z}
+; CHECK-NEXT: vpaddq %zmm0, %zmm1, %zmm0
+; CHECK-NEXT: retq
+  %res = call <8 x i64> @llvm.x86.avx512.mask.loadu.q.512(i8* %ptr, <8 x i64> zeroinitializer, i8 -1)
+  %res1 = call <8 x i64> @llvm.x86.avx512.mask.loadu.q.512(i8* %ptr2, <8 x i64> %res, i8 %mask)
+  %res2 = call <8 x i64> @llvm.x86.avx512.mask.loadu.q.512(i8* %ptr, <8 x i64> zeroinitializer, i8 %mask)
+  %res4 = add <8 x i64> %res2, %res1
+  ret <8 x i64> %res4
+}
+
 declare <16 x i32> @llvm.x86.avx512.mask.prol.d.512(<16 x i32>, i8, <16 x i32>, i16)

 define <16 x i32>@test_int_x86_avx512_mask_prol_d_512(<16 x i32> %x0, i8 %x1, <16 x i32> %x2, i16 %x3) {
 ; CHECK-LABEL: test_int_x86_avx512_mask_prol_d_512:
 ; CHECK: ## BB#0:
 ; CHECK-NEXT: kmovw %esi, %k1
 ; CHECK-NEXT: vprold $3, %zmm0, %zmm1 {%k1}
 ; CHECK-NEXT: vprold $3, %zmm0, %zmm2 {%k1} {z}
@@ ... @@
 ; CHECK-NEXT: retq
   %res = call <8 x i64> @llvm.x86.avx512.mask.prol.q.512(<8 x i64> %x0, i8 3, <8 x i64> %x2, i8 %x3)
   %res1 = call <8 x i64> @llvm.x86.avx512.mask.prol.q.512(<8 x i64> %x0, i8 3, <8 x i64> zeroinitializer, i8 %x3)
   %res2 = call <8 x i64> @llvm.x86.avx512.mask.prol.q.512(<8 x i64> %x0, i8 3, <8 x i64> %x2, i8 -1)
   %res3 = add <8 x i64> %res, %res1
   %res4 = add <8 x i64> %res3, %res2
   ret <8 x i64> %res4
 }

test/CodeGen/X86/avx512bw-intrinsics.ll

	Show First 20 Lines • Show All 3,080 Lines • ▼ Show 20 Lines
	; AVX512BW-NEXT: retq
	%res = call <32 x i16> @llvm.x86.avx512.mask.psllv32hi(<32 x i16> %x0, <32 x i16> %x1, <32 x i16> %x2, i32 %x3)
	%res1 = call <32 x i16> @llvm.x86.avx512.mask.psllv32hi(<32 x i16> %x0, <32 x i16> %x1, <32 x i16> zeroinitializer, i32 %x3)
	%res2 = call <32 x i16> @llvm.x86.avx512.mask.psllv32hi(<32 x i16> %x0, <32 x i16> %x1, <32 x i16> %x2, i32 -1)
	%res3 = add <32 x i16> %res, %res1
	%res4 = add <32 x i16> %res3, %res2
	ret <32 x i16> %res4
	}

				declare <32 x i16> @llvm.x86.avx512.mask.loadu.w.512(i8*, <32 x i16>, i32)

				define <32 x i16>@test_int_x86_avx512_mask_loadu_w_512(i8* %ptr, i8* %ptr2, <32 x i16> %x1, i32 %mask) {
				; AVX512BW-LABEL: test_int_x86_avx512_mask_loadu_w_512:
				; AVX512BW: ## BB#0:
				; AVX512BW-NEXT: kmovd %edx, %k1
				; AVX512BW-NEXT: vmovdqu16 (%rdi), %zmm0
				; AVX512BW-NEXT: vmovdqu16 (%rsi), %zmm0 {%k1}
				; AVX512BW-NEXT: vmovdqu16 (%rdi), %zmm1 {%k1} {z}
				; AVX512BW-NEXT: vpaddw %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: retq
				;
				; AVX512F-32-LABEL: test_int_x86_avx512_mask_loadu_w_512:
				; AVX512F-32: # BB#0:
				; AVX512F-32-NEXT: movl {{[0-9]+}}(%esp), %eax
				; AVX512F-32-NEXT: kmovd {{[0-9]+}}(%esp), %k1
				; AVX512F-32-NEXT: movl {{[0-9]+}}(%esp), %ecx
				; AVX512F-32-NEXT: vmovdqu16 (%ecx), %zmm0
				; AVX512F-32-NEXT: vmovdqu16 (%eax), %zmm0 {%k1}
				; AVX512F-32-NEXT: vmovdqu16 (%ecx), %zmm1 {%k1} {z}
				; AVX512F-32-NEXT: vpaddw %zmm1, %zmm0, %zmm0
				; AVX512F-32-NEXT: retl
				%res0 = call <32 x i16> @llvm.x86.avx512.mask.loadu.w.512(i8* %ptr, <32 x i16> %x1, i32 -1)
				%res = call <32 x i16> @llvm.x86.avx512.mask.loadu.w.512(i8* %ptr2, <32 x i16> %res0, i32 %mask)
				%res1 = call <32 x i16> @llvm.x86.avx512.mask.loadu.w.512(i8* %ptr, <32 x i16> zeroinitializer, i32 %mask)
				%res2 = add <32 x i16> %res, %res1
				ret <32 x i16> %res2
				}

				declare <64 x i8> @llvm.x86.avx512.mask.loadu.b.512(i8*, <64 x i8>, i64)

				define <64 x i8>@test_int_x86_avx512_mask_loadu_b_512(i8* %ptr, i8* %ptr2, <64 x i8> %x1, i64 %mask) {
				; AVX512BW-LABEL: test_int_x86_avx512_mask_loadu_b_512:
				; AVX512BW: ## BB#0:
				; AVX512BW-NEXT: kmovq %rdx, %k1
				; AVX512BW-NEXT: vmovdqu8 (%rdi), %zmm0
				; AVX512BW-NEXT: vmovdqu8 (%rsi), %zmm0 {%k1}
				; AVX512BW-NEXT: vmovdqu8 (%rdi), %zmm1 {%k1} {z}
				; AVX512BW-NEXT: vpaddb %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: retq
				;
				; AVX512F-32-LABEL: test_int_x86_avx512_mask_loadu_b_512:
				; AVX512F-32: # BB#0:
				; AVX512F-32-NEXT: movl {{[0-9]+}}(%esp), %eax
				; AVX512F-32-NEXT: movl {{[0-9]+}}(%esp), %ecx
				; AVX512F-32-NEXT: vmovdqu8 (%ecx), %zmm0
				; AVX512F-32-NEXT: kmovd {{[0-9]+}}(%esp), %k0
				; AVX512F-32-NEXT: kmovd {{[0-9]+}}(%esp), %k1
				; AVX512F-32-NEXT: kunpckdq %k0, %k1, %k1
				; AVX512F-32-NEXT: vmovdqu8 (%eax), %zmm0 {%k1}
				; AVX512F-32-NEXT: vmovdqu8 (%ecx), %zmm1 {%k1} {z}
				; AVX512F-32-NEXT: vpaddb %zmm1, %zmm0, %zmm0
				; AVX512F-32-NEXT: retl
				%res0 = call <64 x i8> @llvm.x86.avx512.mask.loadu.b.512(i8* %ptr, <64 x i8> %x1, i64 -1)
				%res = call <64 x i8> @llvm.x86.avx512.mask.loadu.b.512(i8* %ptr2, <64 x i8> %res0, i64 %mask)
				%res1 = call <64 x i8> @llvm.x86.avx512.mask.loadu.b.512(i8* %ptr, <64 x i8> zeroinitializer, i64 %mask)
				%res2 = add <64 x i8> %res, %res1
				ret <64 x i8> %res2
				}

test/CodeGen/X86/avx512bwvl-intrinsics.ll

	; CHECK-NEXT: vpaddw %xmm0, %xmm1, %xmm0
	; CHECK-NEXT: retq
	%res = call <8 x i16> @llvm.x86.avx512.mask.psllv8.hi(<8 x i16> %x0, <8 x i16> %x1, <8 x i16> %x2, i8 %x3)
	%res1 = call <8 x i16> @llvm.x86.avx512.mask.psllv8.hi(<8 x i16> %x0, <8 x i16> %x1, <8 x i16> zeroinitializer, i8 %x3)
	%res2 = call <8 x i16> @llvm.x86.avx512.mask.psllv8.hi(<8 x i16> %x0, <8 x i16> %x1, <8 x i16> %x2, i8 -1)
	%res3 = add <8 x i16> %res, %res1
	%res4 = add <8 x i16> %res3, %res2
	ret <8 x i16> %res4
	}
	No newline at end of file
				declare <8 x i16> @llvm.x86.avx512.mask.loadu.w.128(i8*, <8 x i16>, i8)

				define <8 x i16>@test_int_x86_avx512_mask_loadu_w_128(i8* %ptr, i8* %ptr2, <8 x i16> %x1, i8 %mask) {
				; CHECK-LABEL: test_int_x86_avx512_mask_loadu_w_128:
				; CHECK: ## BB#0:
				; CHECK-NEXT: movzbl %dl, %eax
				; CHECK-NEXT: kmovw %eax, %k1
				; CHECK-NEXT: vmovdqu16 (%rdi), %xmm0
				; CHECK-NEXT: vmovdqu16 (%rsi), %xmm0 {%k1}
				; CHECK-NEXT: vmovdqu16 (%rdi), %xmm1 {%k1} {z}
				; CHECK-NEXT: vpaddw %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: retq
				%res0 = call <8 x i16> @llvm.x86.avx512.mask.loadu.w.128(i8* %ptr, <8 x i16> %x1, i8 -1)
				%res = call <8 x i16> @llvm.x86.avx512.mask.loadu.w.128(i8* %ptr2, <8 x i16> %res0, i8 %mask)
				%res1 = call <8 x i16> @llvm.x86.avx512.mask.loadu.w.128(i8* %ptr, <8 x i16> zeroinitializer, i8 %mask)
				%res2 = add <8 x i16> %res, %res1
				ret <8 x i16> %res2
				}

				declare <16 x i16> @llvm.x86.avx512.mask.loadu.w.256(i8*, <16 x i16>, i16)

				define <16 x i16>@test_int_x86_avx512_mask_loadu_w_256(i8* %ptr, i8* %ptr2, <16 x i16> %x1, i16 %mask) {
				; CHECK-LABEL: test_int_x86_avx512_mask_loadu_w_256:
				; CHECK: ## BB#0:
				; CHECK-NEXT: kmovw %edx, %k1
				; CHECK-NEXT: vmovdqu16 (%rdi), %ymm0
				; CHECK-NEXT: vmovdqu16 (%rsi), %ymm0 {%k1}
				; CHECK-NEXT: vmovdqu16 (%rdi), %ymm1 {%k1} {z}
				; CHECK-NEXT: vpaddw %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: retq
				%res0 = call <16 x i16> @llvm.x86.avx512.mask.loadu.w.256(i8* %ptr, <16 x i16> %x1, i16 -1)
				%res = call <16 x i16> @llvm.x86.avx512.mask.loadu.w.256(i8* %ptr2, <16 x i16> %res0, i16 %mask)
				%res1 = call <16 x i16> @llvm.x86.avx512.mask.loadu.w.256(i8* %ptr, <16 x i16> zeroinitializer, i16 %mask)
				%res2 = add <16 x i16> %res, %res1
				ret <16 x i16> %res2
				}

				declare <16 x i8> @llvm.x86.avx512.mask.loadu.b.128(i8*, <16 x i8>, i16)

				define <16 x i8>@test_int_x86_avx512_mask_loadu_b_128(i8* %ptr, i8* %ptr2, <16 x i8> %x1, i16 %mask) {
				; CHECK-LABEL: test_int_x86_avx512_mask_loadu_b_128:
				; CHECK: ## BB#0:
				; CHECK-NEXT: kmovw %edx, %k1
				; CHECK-NEXT: vmovdqu8 (%rdi), %xmm0
				; CHECK-NEXT: vmovdqu8 (%rsi), %xmm0 {%k1}
				; CHECK-NEXT: vmovdqu8 (%rdi), %xmm1 {%k1} {z}
				; CHECK-NEXT: vpaddb %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: retq
				%res0 = call <16 x i8> @llvm.x86.avx512.mask.loadu.b.128(i8* %ptr, <16 x i8> %x1, i16 -1)
				%res = call <16 x i8> @llvm.x86.avx512.mask.loadu.b.128(i8* %ptr2, <16 x i8> %res0, i16 %mask)
				%res1 = call <16 x i8> @llvm.x86.avx512.mask.loadu.b.128(i8* %ptr, <16 x i8> zeroinitializer, i16 %mask)
				%res2 = add <16 x i8> %res, %res1
				ret <16 x i8> %res2
				}

				declare <32 x i8> @llvm.x86.avx512.mask.loadu.b.256(i8*, <32 x i8>, i32)

				define <32 x i8>@test_int_x86_avx512_mask_loadu_b_256(i8* %ptr, i8* %ptr2, <32 x i8> %x1, i32 %mask) {
				; CHECK-LABEL: test_int_x86_avx512_mask_loadu_b_256:
				; CHECK: ## BB#0:
				; CHECK-NEXT: kmovd %edx, %k1
				; CHECK-NEXT: vmovdqu8 (%rdi), %ymm0
				; CHECK-NEXT: vmovdqu8 (%rsi), %ymm0 {%k1}
				; CHECK-NEXT: vmovdqu8 (%rdi), %ymm1 {%k1} {z}
				; CHECK-NEXT: vpaddb %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: retq
				%res0 = call <32 x i8> @llvm.x86.avx512.mask.loadu.b.256(i8* %ptr, <32 x i8> %x1, i32 -1)
				%res = call <32 x i8> @llvm.x86.avx512.mask.loadu.b.256(i8* %ptr2, <32 x i8> %res0, i32 %mask)
				%res1 = call <32 x i8> @llvm.x86.avx512.mask.loadu.b.256(i8* %ptr, <32 x i8> zeroinitializer, i32 %mask)
				%res2 = add <32 x i8> %res, %res1
				ret <32 x i8> %res2
				}

test/CodeGen/X86/avx512vl-intrinsics.ll

	; CHECK-NEXT: retq
	%res = call <4 x i64> @llvm.x86.avx512.mask.prorv.q.256(<4 x i64> %x0, <4 x i64> %x1, <4 x i64> %x2, i8 %x3)
	%res1 = call <4 x i64> @llvm.x86.avx512.mask.prorv.q.256(<4 x i64> %x0, <4 x i64> %x1, <4 x i64> zeroinitializer, i8 %x3)
	%res2 = call <4 x i64> @llvm.x86.avx512.mask.prorv.q.256(<4 x i64> %x0, <4 x i64> %x1, <4 x i64> %x2, i8 -1)
	%res3 = add <4 x i64> %res, %res1
	%res4 = add <4 x i64> %res3, %res2
	ret <4 x i64> %res4
	}

				declare <4 x i32> @llvm.x86.avx512.mask.loadu.d.128(i8*, <4 x i32>, i8)

				define <4 x i32> @test_mask_load_unaligned_d_128(i8* %ptr, i8* %ptr2, <4 x i32> %data, i8 %mask) {
				; CHECK-LABEL: test_mask_load_unaligned_d_128:
				; CHECK: ## BB#0:
				; CHECK-NEXT: movzbl %dl, %eax
				; CHECK-NEXT: kmovw %eax, %k1
				; CHECK-NEXT: vmovdqu32 (%rdi), %xmm0
				; CHECK-NEXT: vmovdqu32 (%rsi), %xmm0 {%k1}
				; CHECK-NEXT: vmovdqu32 (%rdi), %xmm1 {%k1} {z}
				; CHECK-NEXT: vpaddd %xmm0, %xmm1, %xmm0
				; CHECK-NEXT: retq
				%res = call <4 x i32> @llvm.x86.avx512.mask.loadu.d.128(i8* %ptr, <4 x i32> zeroinitializer, i8 -1)
				%res1 = call <4 x i32> @llvm.x86.avx512.mask.loadu.d.128(i8* %ptr2, <4 x i32> %res, i8 %mask)
				%res2 = call <4 x i32> @llvm.x86.avx512.mask.loadu.d.128(i8* %ptr, <4 x i32> zeroinitializer, i8 %mask)
				%res4 = add <4 x i32> %res2, %res1
				ret <4 x i32> %res4
				}

				declare <8 x i32> @llvm.x86.avx512.mask.loadu.d.256(i8*, <8 x i32>, i8)

				define <8 x i32> @test_mask_load_unaligned_d_256(i8* %ptr, i8* %ptr2, <8 x i32> %data, i8 %mask) {
				; CHECK-LABEL: test_mask_load_unaligned_d_256:
				; CHECK: ## BB#0:
				; CHECK-NEXT: movzbl %dl, %eax
				; CHECK-NEXT: kmovw %eax, %k1
				; CHECK-NEXT: vmovdqu32 (%rdi), %ymm0
				; CHECK-NEXT: vmovdqu32 (%rsi), %ymm0 {%k1}
				; CHECK-NEXT: vmovdqu32 (%rdi), %ymm1 {%k1} {z}
				; CHECK-NEXT: vpaddd %ymm0, %ymm1, %ymm0
				; CHECK-NEXT: retq
				%res = call <8 x i32> @llvm.x86.avx512.mask.loadu.d.256(i8* %ptr, <8 x i32> zeroinitializer, i8 -1)
				%res1 = call <8 x i32> @llvm.x86.avx512.mask.loadu.d.256(i8* %ptr2, <8 x i32> %res, i8 %mask)
				%res2 = call <8 x i32> @llvm.x86.avx512.mask.loadu.d.256(i8* %ptr, <8 x i32> zeroinitializer, i8 %mask)
				%res4 = add <8 x i32> %res2, %res1
				ret <8 x i32> %res4
				}

				declare <2 x i64> @llvm.x86.avx512.mask.loadu.q.128(i8*, <2 x i64>, i8)

				define <2 x i64> @test_mask_load_unaligned_q_128(i8* %ptr, i8* %ptr2, <2 x i64> %data, i8 %mask) {
				; CHECK-LABEL: test_mask_load_unaligned_q_128:
				; CHECK: ## BB#0:
				; CHECK-NEXT: movzbl %dl, %eax
				; CHECK-NEXT: kmovw %eax, %k1
				; CHECK-NEXT: vmovdqu64 (%rdi), %xmm0
				; CHECK-NEXT: vmovdqu64 (%rsi), %xmm0 {%k1}
				; CHECK-NEXT: vmovdqu64 (%rdi), %xmm1 {%k1} {z}
				; CHECK-NEXT: vpaddq %xmm0, %xmm1, %xmm0
				; CHECK-NEXT: retq
				%res = call <2 x i64> @llvm.x86.avx512.mask.loadu.q.128(i8* %ptr, <2 x i64> zeroinitializer, i8 -1)
				%res1 = call <2 x i64> @llvm.x86.avx512.mask.loadu.q.128(i8* %ptr2, <2 x i64> %res, i8 %mask)
				%res2 = call <2 x i64> @llvm.x86.avx512.mask.loadu.q.128(i8* %ptr, <2 x i64> zeroinitializer, i8 %mask)
				%res4 = add <2 x i64> %res2, %res1
				ret <2 x i64> %res4
				}

				declare <4 x i64> @llvm.x86.avx512.mask.loadu.q.256(i8*, <4 x i64>, i8)

				define <4 x i64> @test_mask_load_unaligned_q_256(i8* %ptr, i8* %ptr2, <4 x i64> %data, i8 %mask) {
				; CHECK-LABEL: test_mask_load_unaligned_q_256:
				; CHECK: ## BB#0:
				; CHECK-NEXT: movzbl %dl, %eax
				; CHECK-NEXT: kmovw %eax, %k1
				; CHECK-NEXT: vmovdqu64 (%rdi), %ymm0
				; CHECK-NEXT: vmovdqu64 (%rsi), %ymm0 {%k1}
				; CHECK-NEXT: vmovdqu64 (%rdi), %ymm1 {%k1} {z}
				; CHECK-NEXT: vpaddq %ymm0, %ymm1, %ymm0
				; CHECK-NEXT: retq
				%res = call <4 x i64> @llvm.x86.avx512.mask.loadu.q.256(i8* %ptr, <4 x i64> zeroinitializer, i8 -1)
				%res1 = call <4 x i64> @llvm.x86.avx512.mask.loadu.q.256(i8* %ptr2, <4 x i64> %res, i8 %mask)
				%res2 = call <4 x i64> @llvm.x86.avx512.mask.loadu.q.256(i8* %ptr, <4 x i64> zeroinitializer, i8 %mask)
				%res4 = add <4 x i64> %res2, %res1
				ret <4 x i64> %res4
				}

	declare <4 x i32> @llvm.x86.avx512.mask.prol.d.128(<4 x i32>, i8, <4 x i32>, i8)

	define <4 x i32>@test_int_x86_avx512_mask_prol_d_128(<4 x i32> %x0, i8 %x1, <4 x i32> %x2, i8 %x3) {
	; CHECK-LABEL: test_int_x86_avx512_mask_prol_d_128:
	; CHECK: ## BB#0:
	; CHECK-NEXT: movzbl %sil, %eax
	; CHECK-NEXT: kmovw %eax, %k1
	; CHECK-NEXT: vprold $3, %xmm0, %xmm1 {%k1}