This is an archive of the discontinued LLVM Phabricator instance.

AVX intrinsics were broken on CPU with AVX-512 instruction set
Needs Review · Public

Authored by rob.khasanov on Jul 21 2014, 11:49 AM.

Details

Reviewers

nadav
delena

Summary

This patch fixes AVX intrinsics on AVX-512 targets.
The UseAVX predicate is changed to HasAVX for some instructions. AVX-512 instructions are prioritized over AVX instructions through AddedComplexity.
Added a run of avx-intrinsics-x86.ll with -mcpu=skx.

Event Timeline

rob.khasanov updated this revision to Diff 11715. Jul 21 2014, 11:49 AM

rob.khasanov retitled this revision to "AVX intrinsics were broken on CPU with AVX-512 instruction set".

rob.khasanov updated this object.

rob.khasanov edited the test plan for this revision.

rob.khasanov added reviewers: delena, nadav.

rob.khasanov set the repository for this revision to rL LLVM.

rob.khasanov added a subscriber: Unknown Object (MLST).

Hi Robert,

Is this a correctness or performance fix?

Thanks,
Adam

In D4605#5, @anemet wrote:

Hi Robert,

Is this a correctness or performance fix?

Thanks,
Adam

Hi Adam,

This is a correctness fix.

Thanks,
Robert

Sorry, I forgot to add context to the diff (-U999999).
The bug is that AVX intrinsics are not selected with -mcpu=knl (or any other CPU with the AVX-512 ISA).
Running "llc < test/CodeGen/X86/avx-intrinsics-x86.ll -mtriple=x86_64-apple-darwin -march=x86 -mcpu=knl", you can see the error:
LLVM ERROR: Cannot select: intrinsic %llvm.x86.sse2.add.sd

In this patch I just fix avx-intrinsics-x86.ll.
I found that the cause was incorrect predicates: they used UseAVX instead of HasAVX, so I changed this.
However, after this fix some codegen tests for AVX-512 instructions fail. The reason is that AVX instructions are generated instead of AVX-512 ones in the AVX-512 tests, because the AVX and AVX-512 patterns have equal complexity. To fix this, I added more complexity to the AVX-512 instruction patterns.
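
The predicate problem described above can be sketched with a small Python model. This is not LLVM code: the `Subtarget` class, the `selectable` helper and the candidate lists are hypothetical, and the predicate meanings are an assumption about the X86 backend of that time (UseAVX meaning "has AVX and does not have AVX-512", UseSSE2 meaning "has SSE2 and does not have AVX").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtarget:
    """Hypothetical stand-in for the feature queries of the X86 subtarget."""
    sse2: bool
    avx: bool
    avx512: bool


# Assumed predicate meanings (modelled here, not copied from LLVM).
PREDICATES = {
    "UseSSE2": lambda st: st.sse2 and not st.avx,
    "HasAVX": lambda st: st.avx,
    "UseAVX": lambda st: st.avx and not st.avx512,
    "HasAVX512": lambda st: st.avx512,
}


def selectable(pattern_predicates, st):
    """Return the guarding predicates that hold on the given subtarget."""
    return [p for p in pattern_predicates if PREDICATES[p](st)]


corei7_avx = Subtarget(sse2=True, avx=True, avx512=False)
knl = Subtarget(sse2=True, avx=True, avx512=True)

# Guards of the candidate patterns for an SSE2 scalar intrinsic that has
# no EVEX pattern: one legacy SSE pattern and one VEX pattern.
before_patch = ["UseSSE2", "UseAVX"]
after_patch = ["UseSSE2", "HasAVX"]

print(selectable(before_patch, corei7_avx))  # the VEX pattern is usable
print(selectable(before_patch, knl))         # nothing usable: "Cannot select"
print(selectable(after_patch, knl))          # the VEX pattern is usable again
```

Under these assumptions no pattern is usable on knl before the patch, which matches the "Cannot select" error quoted above.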

In D4605#8, @rob.khasanov wrote:

Sorry, I forgot to add context to the diff (-U999999).
The bug is that AVX intrinsics are not selected with -mcpu=knl (or any other CPU with the AVX-512 ISA).
Running "llc < test/CodeGen/X86/avx-intrinsics-x86.ll -mtriple=x86_64-apple-darwin -march=x86 -mcpu=knl", you can see the error:
LLVM ERROR: Cannot select: intrinsic %llvm.x86.sse2.add.sd

In this patch I just fix avx-intrinsics-x86.ll.

Ah, thanks!

I found that the cause was incorrect predicates: they used UseAVX instead of HasAVX, so I changed this.

Hmm, don't we want an AVX512 version of the scalar operations as well, to allow for the 32 vector registers? We could probably tablegen these from the AVX definitions. Although it looks like for the packed versions we chose to duplicate them, so that's probably what we want to do for scalar as well.

However, after this fix some codegen tests for AVX-512 instructions fail. The reason is that AVX instructions are generated instead of AVX-512 ones in the AVX-512 tests, because the AVX and AVX-512 patterns have equal complexity. To fix this, I added more complexity to the AVX-512 instruction patterns.

Yeah I don't think this is ideal. Just like we didn't have to do this between SSE and AVX, it shouldn't be necessary between AVX512 and AVX.

Adam
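
The complexity tie discussed in this comment can be modelled in a few lines of Python. This is a rough sketch, not the real instruction selector: it assumes that candidate patterns are tried in order of decreasing complexity and that ties fall back to some other ordering, represented here by a made-up `order` field. The base complexity of 3 and the pattern labels are hypothetical; only the AddedComplexity value of 10 comes from the diff.

```python
from typing import List, NamedTuple, Optional, Set


class Pattern(NamedTuple):
    label: str        # what the pattern would emit (illustrative label)
    predicate: str    # predicate guarding the pattern
    complexity: int   # base complexity plus any AddedComplexity
    order: int        # hypothetical tie-breaker, e.g. definition order


def pick(patterns: List[Pattern], true_predicates: Set[str]) -> Optional[str]:
    """Pick the usable pattern with the highest complexity."""
    live = [p for p in patterns if p.predicate in true_predicates]
    live.sort(key=lambda p: (-p.complexity, p.order))
    return live[0].label if live else None


# On an AVX-512 target both predicates hold once VEX patterns use HasAVX.
avx512_target = {"HasAVX", "HasAVX512"}

BASE = 3  # hypothetical base complexity shared by both patterns

equal = [
    Pattern("VEX ucomiss", "HasAVX", BASE, order=0),
    Pattern("EVEX ucomiss", "HasAVX512", BASE, order=1),
]
boosted = [
    Pattern("VEX ucomiss", "HasAVX", BASE, order=0),
    Pattern("EVEX ucomiss", "HasAVX512", BASE + 10, order=1),
]

print(pick(equal, avx512_target))    # tie: the outcome depends on the tie-breaker
print(pick(boosted, avx512_target))  # the EVEX pattern wins on complexity
```

The model also shows why the approach is fragile: every EVEX pattern that shares a match with a VEX pattern would need its own boost, whereas mutually exclusive predicates avoid the tie altogether.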

Adam,

Your point makes sense. I will rewrite my patch by duplicating AVX intrinsics patterns in X86InstrAVX512.td to generate EVEX instructions.

Revision Contents

Path                                      Size
lib/Target/X86/X86InstrAVX512.td          5 lines
lib/Target/X86/X86InstrFormats.td         4 lines
lib/Target/X86/X86InstrSSE.td             10 lines
test/CodeGen/X86/avx-intrinsics-x86.ll    6 lines

Diff 11715

lib/Target/X86/X86InstrAVX512.td

Context not available.
 multiclass avx512_move_scalar <string asm, RegisterClass RC,
 SDNode OpNode, ValueType vt,
 X86MemOperand x86memop, PatFrag mem_pat> {
-let hasSideEffects = 0 in {
+let hasSideEffects = 0, AddedComplexity = 30 in {
 def rr : SI<0x10, MRMSrcReg, (outs VR128X:$dst), (ins VR128X:$src1, RC:$src2),
 !strconcat(asm, " \t{$src2, $src1, $dst|$dst, $src1, $src2}"),
 [(set VR128X:$dst, (vt (OpNode VR128X:$src1,
Context not available.
 (bc_v16f32(v16i32 immAllZerosV)), (i16 -1), (i32 FROUND_CURRENT))),
 (VCVTPH2PSZrr VR256X:$src)>;

-let Defs = [EFLAGS], Predicates = [HasAVX512] in {
+let Defs = [EFLAGS], Predicates = [HasAVX512],
+AddedComplexity = 10 in {
 defm VUCOMISSZ : sse12_ord_cmp<0x2E, FR32X, X86cmp, f32, f32mem, loadf32,
 "ucomiss">, PS, EVEX, VEX_LIG,
 EVEX_CD8<32, CD8VT1>;
Context not available.

lib/Target/X86/X86InstrFormats.td

Context not available.
 list<dag> pattern, InstrItinClass itin = NoItinerary>
 : I<o, F, outs, ins, asm, pattern, itin> {
 let Predicates = !if(!eq(OpEnc.Value, EncEVEX.Value), [HasAVX512],
-!if(!eq(OpEnc.Value, EncVEX.Value), [UseAVX],
+!if(!eq(OpEnc.Value, EncVEX.Value), [HasAVX],
 !if(!eq(OpPrefix.Value, XS.Value), [UseSSE1],
 !if(!eq(OpPrefix.Value, XD.Value), [UseSSE2],
 !if(!eq(OpPrefix.Value, PD.Value), [UseSSE2],
Context not available.
 list<dag> pattern, InstrItinClass itin = NoItinerary>
 : Ii8<o, F, outs, ins, asm, pattern, itin> {
 let Predicates = !if(!eq(OpEnc.Value, EncEVEX.Value), [HasAVX512],
-!if(!eq(OpEnc.Value, EncVEX.Value), [UseAVX],
+!if(!eq(OpEnc.Value, EncVEX.Value), [HasAVX],
 !if(!eq(OpPrefix.Value, XS.Value), [UseSSE1],
 [UseSSE2])));

Context not available.
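
For readers less used to nested TableGen `!if` expressions, the second hunk above (the one whose chain is shown in full) reads roughly like the following Python function. The function name, the `patched` flag and the string arguments are illustrative only; the branch order and the returned predicate names are taken from the hunk.

```python
from typing import List


def default_predicates(op_enc: str, op_prefix: str, patched: bool) -> List[str]:
    """Mirror the nested !if chain: the first matching condition wins."""
    if op_enc == "EncEVEX":
        return ["HasAVX512"]
    if op_enc == "EncVEX":
        # The only branch touched by the patch.
        return ["HasAVX"] if patched else ["UseAVX"]
    if op_prefix == "XS":
        return ["UseSSE1"]
    return ["UseSSE2"]


# "Legacy" and "Other" are placeholder values for any non-VEX, non-EVEX
# encoding and any prefix other than XS; they are not LLVM names.
for enc, prefix in [("EncEVEX", "XS"), ("EncVEX", "XS"),
                    ("Legacy", "XS"), ("Legacy", "Other")]:
    print(enc, prefix,
          default_predicates(enc, prefix, patched=False),
          default_predicates(enc, prefix, patched=True))
```

Only VEX-encoded instructions get a different default predicate, so every instruction that inherits its predicates from these format classes is affected at once, not just the intrinsics that failed to select.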

lib/Target/X86/X86InstrSSE.td

Context not available.
 "vcvtsd2ss\t{$src2, $src1, $dst|$dst, $src1, $src2}",
 [(set VR128:$dst,
 (int_x86_sse2_cvtsd2ss VR128:$src1, VR128:$src2))],
-IIC_SSE_CVT_Scalar_RR>, XD, VEX_4V, Requires<[UseAVX]>,
+IIC_SSE_CVT_Scalar_RR>, XD, VEX_4V, Requires<[HasAVX]>,
 Sched<[WriteCvtF2F]>;
 def Int_VCVTSD2SSrm: I<0x5A, MRMSrcReg,
 (outs VR128:$dst), (ins VR128:$src1, sdmem:$src2),
 "vcvtsd2ss\t{$src2, $src1, $dst|$dst, $src1, $src2}",
 [(set VR128:$dst, (int_x86_sse2_cvtsd2ss
 VR128:$src1, sse_load_f64:$src2))],
-IIC_SSE_CVT_Scalar_RM>, XD, VEX_4V, Requires<[UseAVX]>,
+IIC_SSE_CVT_Scalar_RM>, XD, VEX_4V, Requires<[HasAVX]>,
 Sched<[WriteCvtF2FLd, ReadAfterLd]>;

 let Constraints = "$src1 = $dst" in {
Context not available.
 "vcvtss2sd\t{$src2, $src1, $dst|$dst, $src1, $src2}",
 [(set VR128:$dst,
 (int_x86_sse2_cvtss2sd VR128:$src1, VR128:$src2))],
-IIC_SSE_CVT_Scalar_RR>, XS, VEX_4V, Requires<[UseAVX]>,
+IIC_SSE_CVT_Scalar_RR>, XS, VEX_4V, Requires<[HasAVX]>,
 Sched<[WriteCvtF2F]>;
 def Int_VCVTSS2SDrm: I<0x5A, MRMSrcMem,
 (outs VR128:$dst), (ins VR128:$src1, ssmem:$src2),
 "vcvtss2sd\t{$src2, $src1, $dst|$dst, $src1, $src2}",
 [(set VR128:$dst,
 (int_x86_sse2_cvtss2sd VR128:$src1, sse_load_f32:$src2))],
-IIC_SSE_CVT_Scalar_RM>, XS, VEX_4V, Requires<[UseAVX]>,
+IIC_SSE_CVT_Scalar_RM>, XS, VEX_4V, Requires<[HasAVX]>,
 Sched<[WriteCvtF2FLd, ReadAfterLd]>;
 let Constraints = "$src1 = $dst" in { // SSE2 instructions with XS prefix
 def Int_CVTSS2SDrr: I<0x5A, MRMSrcReg,
Context not available.
 //===---------------------------------------------------------------------===//
 // Store / copy lower 64-bits of a XMM register.
 //
-let Predicates = [UseAVX] in
+let Predicates = [HasAVX] in
 def : Pat<(int_x86_sse2_storel_dq addr:$dst, VR128:$src),
 (VMOVPQI2QImr addr:$dst, VR128:$src)>;
 let Predicates = [UseSSE2] in
Context not available.

test/CodeGen/X86/avx-intrinsics-x86.ll

-; RUN: llc < %s -mtriple=x86_64-apple-darwin -march=x86 -mcpu=corei7-avx | FileCheck %s
+; RUN: llc < %s -mtriple=x86_64-apple-darwin -march=x86 -mcpu=corei7-avx | FileCheck %s --check-prefix CHECK --check-prefix AVX
+; RUN: llc < %s -mtriple=x86_64-apple-darwin -march=x86 -mcpu=skx | FileCheck %s --check-prefix CHECK --check-prefix SKX

 define <2 x i64> @test_x86_aesni_aesdec(<2 x i64> %a0, <2 x i64> %a1) {
 ; CHECK: vaesdec
Context not available.

 define void @test_x86_avx_storeu_dq_256(i8* %a0, <32 x i8> %a1) {
 ; FIXME: unfortunately the execution domain fix pass changes this to vmovups and its hard to force with no 256-bit integer instructions
-; CHECK: vmovups
+; AVX: vmovups
+; SKX: vmovdqu
 ; add operation forces the execution domain.
 %a2 = add <32 x i8> %a1, <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
 call void @llvm.x86.avx.storeu.dq.256(i8* %a0, <32 x i8> %a2)
Context not available.
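
The two RUN lines share the CHECK directives and differ only in the extra AVX or SKX prefix. A heavily simplified Python model of that selection is sketched below. It is not how FileCheck is implemented: it handles only plain `PREFIX:` directives matched as in-order substrings, and the `output` string is a made-up stand-in, not real llc output.

```python
import re
from typing import Iterable, List

# Matches a comment of the form "; PREFIX: text".
DIRECTIVE = re.compile(r";\s*([A-Za-z0-9_-]+):\s*(.*)")


def active_checks(test_lines: Iterable[str], prefixes: List[str]) -> List[str]:
    """Keep only the directives whose prefix was enabled on the command line."""
    checks = []
    for line in test_lines:
        m = DIRECTIVE.match(line.strip())
        if m and m.group(1) in prefixes:
            checks.append(m.group(2).strip())
    return checks


def passes(output: str, checks: List[str]) -> bool:
    """Every check string must appear in the output, in order."""
    pos = 0
    for text in checks:
        found = output.find(text, pos)
        if found < 0:
            return False
        pos = found + len(text)
    return True


test_lines = ["; CHECK: vaesdec", "; AVX: vmovups", "; SKX: vmovdqu"]

avx_checks = active_checks(test_lines, ["CHECK", "AVX"])
skx_checks = active_checks(test_lines, ["CHECK", "SKX"])

output = "vaesdec\nvmovups\n"  # made-up stand-in for assembly text
print(avx_checks, passes(output, avx_checks))
print(skx_checks, passes(output, skx_checks))
```

This is why the store test can expect vmovups on corei7-avx and vmovdqu on skx from the same file.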