This is an archive of the discontinued LLVM Phabricator instance.

New X86 FMA3*_Int opcodes for scalar FMA intrinsics.
ClosedPublic

Authored by v_klochkov on Oct 13 2015, 4:00 PM.

Details

Reviewers

delena
mkuper

Commits

rGe41a8c4182f4: Created new X86 FMA3 opcodes (FMA*_Int) that are used now for lowering of…
rL252060: Created new X86 FMA3 opcodes (FMA*_Int) that are used now for lowering of…

Summary

This change-set is one in the series of change-sets improving X86-FMA3 optimizations.
Please see (D11370) and (D13269) for more details.

This change-set adds new X86-FMA3 opcodes that must be used for SCALAR FMA INTRINSICS.
The new FMA*_Int opcodes are similar to existing ADD*_Int, SUB*_Int, MUL*_Int opcodes.

The key difference between FMA* and FMA*_Int opcodes is that FMA*_Int opcodes are handled
more conservatively. For example, it is illegal to commute the 1st and 2nd operands of FMA*_Int,
because such a commute would change the upper bits of the intrinsic result, which must be taken
from the 1st operand of the FMA intrinsic.
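
To make the constraint concrete, here is a minimal C++ sketch that models a 128-bit register as four floats. The `Vec4` type and the `fmaddScalarModel` and `commute12IsSafe` helpers are hypothetical names introduced only for illustration; they are not part of the patch or of LLVM. The model assumes the semantics described above: only the lowest element is computed, and the upper elements are copied from the 1st operand.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical model of a 128-bit register holding four floats.
using Vec4 = std::array<float, 4>;

// Models the scalar FMA intrinsic semantics described in the summary:
// lane 0 is a[0] * b[0] + c[0]; lanes 1..3 pass through from the 1st operand.
Vec4 fmaddScalarModel(const Vec4 &a, const Vec4 &b, const Vec4 &c) {
  Vec4 result = a;                         // upper lanes come from operand 1
  result[0] = std::fma(a[0], b[0], c[0]);  // only the lowest lane is computed
  return result;
}

// Returns true if swapping operands 1 and 2 leaves the whole result unchanged.
bool commute12IsSafe(const Vec4 &a, const Vec4 &b, const Vec4 &c) {
  const Vec4 original = fmaddScalarModel(a, b, c);
  const Vec4 commuted = fmaddScalarModel(b, a, c);
  // Multiplication commutes, so the lowest lane always matches
  // (assuming finite, non-NaN inputs).
  assert(original[0] == commuted[0]);
  return original == commuted;
}
```

In this model the commute is only harmless when the upper lanes of the two operands happen to be equal, which a compiler cannot assume in general.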

So, this patch fixes an existing problem in the LLVM X86 code generator, which could commute the 1st and 2nd operands of scalar FMAs generated for intrinsics.

The definitions of X86-FMA3 opcodes were simplified a lot.
Unused or unnecessary template parameters were removed.
Now the definitions look quite similar to definitions of ADD/SUB/MUL opcodes.

Temporarily, the FMA*_Int opcodes are defined as non-commutable.
This constraint was added to reduce the size of the current patch, and it will be removed
in follow-up changes very soon.

X86InstrFMA.td:

Simplified the definitions of scalar FMA3 opcodes by removing the template parameters that were unused or not really necessary.
Added definitions for FMA3 opcodes generated for scalar FMA instructions.

X86InstrInfo.cpp:

Added the new FMA*_Int opcodes to MemoryFoldTable3 to enable memory-folding optimization.
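
As a rough illustration of what a register-to-memory fold table provides, the following C++ sketch pairs each register-form opcode with its memory-form counterpart and looks the pair up. The enum values, the `FoldEntryModel` struct, `FoldTable3Model`, and `lookupMemForm` are all hypothetical stand-ins; the real table uses `X86MemoryFoldTableEntry` and carries flags such as `TB_ALIGN_NONE`, which are omitted here.

```cpp
#include <cassert>

// Hypothetical opcode enumeration. The names mirror the style of the opcodes
// in the patch, but the numeric values are purely illustrative.
enum OpcodeModel : unsigned {
  NO_FOLD = 0,  // sentinel: no memory form is known
  VFMADDSSr213r,
  VFMADDSSr213m,
  VFMADDSSr213r_Int,
  VFMADDSSr213m_Int
};

// One table row: a register-form opcode and its memory-form counterpart.
struct FoldEntryModel {
  OpcodeModel RegOp;
  OpcodeModel MemOp;
};

// A tiny stand-in for the fold table; the _Int row is the kind of entry
// the patch adds.
static const FoldEntryModel FoldTable3Model[] = {
    {VFMADDSSr213r, VFMADDSSr213m},
    {VFMADDSSr213r_Int, VFMADDSSr213m_Int},
};

// Returns the memory form for RegOp, or NO_FOLD if the table has no entry.
OpcodeModel lookupMemForm(OpcodeModel RegOp) {
  assert(RegOp != NO_FOLD && "the sentinel is not a real opcode");
  for (const FoldEntryModel &Entry : FoldTable3Model)
    if (Entry.RegOp == RegOp)
      return Entry.MemOp;
  return NO_FOLD;
}
```

Without a row for an `_Int` opcode, the lookup would return the sentinel and the load could not be folded into the instruction.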

fma-intrinsics-phi-213-to-231.ll:

Added tests for scalar FMA intrinsics. PHI-213-to-231 optimization should not be used for scalar intrinsics.
Added tests for FNMADD and FNMSUB intrinsics to make the test more complete.

fma-intrinsics-x86.ll:

Added more test cases to check that 1st and 2nd operands of scalar FMAs generated for intrinsics are not commuted anymore.

Diff Detail

Event Timeline

v_klochkov updated this revision to Diff 37297.Oct 13 2015, 4:00 PM

v_klochkov retitled this revision from to New X86 FMA3*_Int opcodes for scalar FMA intrinsics..

v_klochkov updated this object.

v_klochkov added subscribers: ab, qcolombet, llvm-commits.

qcolombet added inline comments.Oct 13 2015, 4:04 PM

llvm/test/CodeGen/X86/fma-intrinsics-phi-213-to-231.ll
4	Could you add more check lines to check that the registers are set to the right value? With just this check, I guess that the previously buggy code would also match, and we don’t want that.

v_klochkov mentioned this in D13269: Improved X86-FMA3 mem-folding & coalescing.Oct 13 2015, 4:04 PM

v_klochkov added a subscriber: DavidKreitzer.Oct 13 2015, 4:07 PM

Hi Quentin,

Thank you for the comment. I would rather not add additional checks to that test,
but would instead remove the scalar test cases I added to the fma-intrinsics-phi-213-to-231.ll test.
Please see more details in my answer to your inline comment.

Thank you,
Slava

llvm/test/CodeGen/X86/fma-intrinsics-phi-213-to-231.ll
4	The test cases in this LIT test are quite complex. Adding such checks would also be quite complex and could make the test unstable, because pass/fail may depend on many optimizations working on the tested loops. The initial idea of the test fma-intrinsics-phi-213-to-231.ll _probably_ was only to check whether the corresponding PHI-213-to-231 optimization did its work for FMAs with ADD accumulators. The idea of my changes was to check that the optimization DID NOT make changes for SCALAR FMA instructions generated for scalar FMA intrinsics. Your guess that the new scalar test case would pass all checks even without my new patch is correct. I did an additional investigation and discovered that the PHI-213-to-231 optimization never made any changes to scalar FMA instructions generated for intrinsics, because of the additional REG copies inserted in the pattern (see lines 194-196 in the base version of X86InstrFMA.td). After giving it some more thought, I think that the scalar test cases I added to this test are redundant, as the PHI-213-to-231 optimization is just a special instruction inserter which is never called for the new FMA*_Int instructions. The other aspects/transformations, like commuting operands, should be checked by other specialized tests like fma-commutex86.ll added in (D13269). Do you agree with my plan to remove the scalar cases from the test? Also, it may be useful to add test cases for 128-bit packed FMAs to this test; I was surprised to find that it has only 256-bit test cases.

qcolombet added inline comments.Oct 19 2015, 10:11 AM

llvm/test/CodeGen/X86/fma-intrinsics-phi-213-to-231.ll
4	"Do you agree with my plan to remove the scalar cases from the test?" I believe you if you say they are redundant with the ones we will add with D13269, so I am fine with that plan. However, that does not resolve the problem that the check lines for the PHI tests would still have matched the bad code generation. How do you suggest we fix that?

Updated the unit test fma-intrinsics-phi-213-to-231.ll:

cancelled the insertion of new scalar test cases (they were added in the previous version of the patch);
added 128-bit packed test cases;
tightened the checks.

Thank you for the comments. Please review the new/updated test.

Thank you,
Slava

llvm/test/CodeGen/X86/fma-intrinsics-phi-213-to-231.ll
107	I updated the test. The checks look reasonably tight now.

mkuper added a reviewer: mkuper.Oct 28 2015, 5:41 AM

Is it possible to convert the intrinsic to an FMA node in the DAG lowering phase, like we did in X86IntrinsicInfo.h?

llvm/lib/Target/X86/X86InstrInfo.cpp
1737	Do you have a test that checks memory folding of intrinsic?
llvm/test/CodeGen/X86/fma-intrinsics-phi-213-to-231.ll
274	Why do you need such a long test in order to check only one operation? This comment applies to all the tests.

Hi Elena,

Updated the test fma-intrinsics-x86.ll:

added Windows target code-gen;
added checks for the memory-folding optimization of FMAs generated for intrinsics.

Hi Elena,

Thank you for reviewing the patch.
I have updated it to add more checks testing the memory folding optimization of FMAs generated for intrinsics.

Also, I am pretty sure that there is no need to use the scheme from X86IntrinsicInfo.h for SCALAR FMA intrinsics.
That approach is usable for packed FMA intrinsics, but for SCALAR FMA intrinsics it would require adding new SDNode operations; the existing FMADD, FMSUB, FNMSUB, FNMADD will not work for SCALAR intrinsics.
There are only a few exceptions where SCALAR intrinsics are handled in X86IntrinsicInfo.h, and those are really special operations like the VGETMANT intrinsics.

Thank you,
Slava

llvm/lib/Target/X86/X86InstrInfo.cpp
1737	There were no tests checking memory folding of FMA intrinsics. In the updated patch I changed the test fma-intrinsics-x86.ll so that it also tests the Windows target and checks memory folding.
llvm/test/CodeGen/X86/fma-intrinsics-phi-213-to-231.ll
274	I did not create this LIT test, so that would be a good question for its author, but in my opinion the test is reasonably long. (I suppose you mean that each of the test cases is long, right?) A test case here represents a very typical situation in which there is an FMA instruction in a loop and a loop-carried dependency going through the ADD path of the FMA (i.e. through the 3rd operand of the a*b + c operation). Each test case looks like a pretty laconic representation of such a loop.
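
For readers who want to picture the loop shape being discussed, here is a minimal C++ sketch of an accumulation whose loop-carried value flows through the addend of the FMA. It uses scalar `std::fma` rather than the packed intrinsics and LLVM IR of the actual test, and `dotFma` is a hypothetical name; it only illustrates the dependency pattern, not the test itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Accumulates a[i] * b[i] into acc. The value produced by one iteration is
// the addend (the "c" in a*b + c) of the next one, which is the loop-carried
// dependency through the ADD path described in the comment above.
double dotFma(const std::vector<double> &a, const std::vector<double> &b) {
  assert(a.size() == b.size());
  double acc = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i)
    acc = std::fma(a[i], b[i], acc);  // acc is both the addend and the result
  return acc;
}
```

Because the accumulator is both an input and the output, the 231 form (destination = src2 * src3 + destination) lets it stay in the destination register across iterations.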

delena added inline comments.Nov 3 2015, 7:02 AM

llvm/lib/Target/X86/X86InstrInfo.cpp
1815	I don't understand how you can use the 231 form for scalar intrinsic: intr_fmadd_ss( a, b, c) may be translated as VFMADD213SS a, b, c or VFMADD132SS a, c, b but you can't generate VFMADD231SS because "a" should go first, you are taking the upper part from it.
llvm/test/CodeGen/X86/fma-intrinsics-phi-213-to-231.ll
274	The test checks that the FMA intrinsic gives the right form of FMA instruction. I don't understand why you need a loop here. We wrote a lot of FMA intrinsic tests without any loops.
llvm/test/CodeGen/X86/fma-intrinsics-x86.ll
408	You check folding of a vector load into a scalar intrinsic. On AVX-512 we support folding a scalar load into a scalar intrinsic by matching the scalar_to_vector(loadf32) pattern in the td file.

Elena,

Please see the answers to your questions.

Thank you,
Slava

llvm/lib/Target/X86/X86InstrInfo.cpp
1815	Very good question. In X86InstrFMA.td I intentionally added a comment noting that problem. Please see line 215 in that file:
// The FMA 231 form can be get only by commuting the 1st operand of 213 or 231
// forms and is possible only after special analysis of all uses of the initial
// instruction. Such analysis do not exist yet and thus introducing the 231
// form of FMA*_Int instructions is done using an optimistic assumption that
// such analysis will be implemented eventually.
BTW, I noticed a misprint in that comment and I'll fix it: "213 or 231" --> "213 or 132". If ONLY the lowest element of the FMA213 result is used, then it is possible to commute the 1st operand. Such analysis exists and is used in other compilers.
llvm/test/CodeGen/X86/fma-intrinsics-phi-213-to-231.ll
274	The loop is needed to get the right form of FMA instruction, i.e. the 231 form is generated when there is a LOOP DEPENDENCY on the ADD path. The test checks that 231 form is generated for such loops.
llvm/test/CodeGen/X86/fma-intrinsics-x86.ll
408	I agree, the check tests memory folding of a vector load into a scalar intrinsic. Memory folding does not work for test cases like the following (with or without my patch):
__m128d m = _mm_load_sd(mem);
__m128d res = _mm_fmadd_sd(a, b, m);
This should be fixed, and I think I know how to do that easily, but I would rather do it in a separate patch.
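
The point about the 132, 213, and 231 forms raised in this exchange can be sketched with a small C++ model. It assumes the usual operand roles of the scalar FMA3 forms (the destination is also the 1st source, and its upper elements are preserved) and models a register as four floats. `Vec4`, `fma132`, `fma213`, `fma231`, and `demoForms` are hypothetical names used only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical model of a 128-bit register holding four floats.
using Vec4 = std::array<float, 4>;

// In each form the destination is also the 1st source, and only lane 0 of it
// is overwritten; lanes 1..3 of the destination are left unchanged.
Vec4 fma132(Vec4 dst, const Vec4 &s2, const Vec4 &s3) {
  dst[0] = std::fma(dst[0], s3[0], s2[0]);  // op1 * op3 + op2
  return dst;
}
Vec4 fma213(Vec4 dst, const Vec4 &s2, const Vec4 &s3) {
  dst[0] = std::fma(s2[0], dst[0], s3[0]);  // op2 * op1 + op3
  return dst;
}
Vec4 fma231(Vec4 dst, const Vec4 &s2, const Vec4 &s3) {
  dst[0] = std::fma(s2[0], s3[0], dst[0]);  // op2 * op3 + op1
  return dst;
}

// Small usage example for intrinsic(a, b, c) = a*b + c with upper lanes of a.
void demoForms() {
  const Vec4 a{2.f, 20.f, 30.f, 40.f};
  const Vec4 b{3.f, 21.f, 31.f, 41.f};
  const Vec4 c{4.f, 22.f, 32.f, 42.f};
  // 213 with (a, b, c) and 132 with (a, c, b) agree in every lane.
  assert(fma213(a, b, c) == fma132(a, c, b));
  // 231 needs c as the destination: lane 0 matches, the upper lanes do not.
  assert(fma231(c, a, b)[0] == fma213(a, b, c)[0]);
  assert(fma231(c, a, b) != fma213(a, b, c));
}
```

In this model the 231 form is only a valid replacement when nothing reads the upper lanes of the result, which is the use analysis the reply says does not exist yet.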

I don't have more questions. Thank you.
LGTM.

llvm/test/CodeGen/X86/fma-intrinsics-phi-213-to-231.ll
274	ok

This revision is now accepted and ready to land.Nov 3 2015, 12:40 PM

Closed by commit rL252060: Created new X86 FMA3 opcodes (FMA*_Int) that are used now for lowering of… (authored by akaylor).Nov 4 2015, 10:13 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                      Size
llvm/lib/Target/X86/X86InstrFMA.td                        134 lines
llvm/lib/Target/X86/X86InstrInfo.cpp                      24 lines
llvm/test/CodeGen/X86/fma-intrinsics-phi-213-to-231.ll    201 lines
llvm/test/CodeGen/X86/fma-intrinsics-x86.ll               120 lines

Diff 37297

llvm/lib/Target/X86/X86InstrFMA.td

 let ExeDomain = SSEPackedDouble in {
   defm VFNMADDPD : fma3p_forms<0x9C, 0xAC, 0xBC, "vfnmadd", "pd", loadv2f64,
                                loadv4f64, X86Fnmadd, v2f64, v4f64>, VEX_W;
   defm VFNMSUBPD : fma3p_forms<0x9E, 0xAE, 0xBE, "vfnmsub", "pd",
                                loadv2f64, loadv4f64, X86Fnmsub, v2f64,
                                v4f64>, VEX_W;
 }

-let Constraints = "$src1 = $dst" in {
-multiclass fma3s_rm<bits<8> opc, string OpcodeStr, X86MemOperand x86memop,
-                    RegisterClass RC, ValueType OpVT, PatFrag mem_frag,
+// All source register operands of FMA instructions can be commuted.
+// In many cases such commute transformation requres an opcode adjustment,
+// for example, commuting the operands 1 and 2 in FMA*132 form would require
+// an opcode change to FMA*231:
+// FMA132 reg1, reg2, reg3; // reg1 * reg3 + reg2;
+// -->
+// FMA231 reg2, reg1, reg3; // reg1 * reg3 + reg2;
+// Currently, the commute transformation is supported for only few FMA forms.
+// That is the reason why \p IsRVariantCommutable and \p IsMVariantCommutable
+// parameters are used here.
+// The general commute operands optimization working for all forms is going
+// to be implemented soon. (Please, see http://reviews.llvm.org/D13269
+// for details).
+let Constraints = "$src1 = $dst", hasSideEffects = 0 in {
+multiclass fma3s_rm<bits<8> opc, string OpcodeStr,
+                    X86MemOperand x86memop, RegisterClass RC,
                     bit IsRVariantCommutable = 0, bit IsMVariantCommutable = 0,
                     SDPatternOperator OpNode = null_frag> {
   let usesCustomInserter = 1, isCommutable = IsRVariantCommutable in
   def r : FMA3<opc, MRMSrcReg, (outs RC:$dst),
                (ins RC:$src1, RC:$src2, RC:$src3),
                !strconcat(OpcodeStr,
                           "\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
-               [(set RC:$dst,
-                 (OpVT (OpNode RC:$src2, RC:$src1, RC:$src3)))]>;
+               [(set RC:$dst, (OpNode RC:$src2, RC:$src1, RC:$src3))]>;

   let mayLoad = 1, isCommutable = IsMVariantCommutable in
   def m : FMA3<opc, MRMSrcMem, (outs RC:$dst),
                (ins RC:$src1, RC:$src2, x86memop:$src3),
                !strconcat(OpcodeStr,
                           "\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
                [(set RC:$dst,
-                 (OpVT (OpNode RC:$src2, RC:$src1,
-                        (mem_frag addr:$src3))))]>;
+                 (OpNode RC:$src2, RC:$src1, (load addr:$src3)))]>;
 }
-} // Constraints = "$src1 = $dst"
+} // Constraints = "$src1 = $dst", hasSideEffects = 0

-multiclass fma3s_forms<bits<8> opc132, bits<8> opc213, bits<8> opc231,
-                       string OpStr, string PackTy, string PT2, Intrinsic Int,
-                       SDNode OpNode, RegisterClass RC, ValueType OpVT,
-                       X86MemOperand x86memop, Operand memop, PatFrag mem_frag,
-                       ComplexPattern mem_cpat> {
-let hasSideEffects = 0 in {
-  defm r132 : fma3s_rm<opc132, !strconcat(OpStr, "132", PackTy),
-                       x86memop, RC, OpVT, mem_frag>;
-  // See the other defm of r231 for the explanation regarding the
-  // commutable flags.
-  defm r231 : fma3s_rm<opc231, !strconcat(OpStr, "231", PackTy),
-                       x86memop, RC, OpVT, mem_frag,
-                       /* IsRVariantCommutable */ 1,
-                       /* IsMVariantCommutable */ 0>;
+// These FMA*_Int instructions are defined specially for being used when
+// the scalar FMA intrinsics are lowered to machine instructions, and in that
+// sence they are similar to existing ADD_Int, SUB_Int, MUL*_Int, etc.
+// instructions.
+//
+// The FMA*_Int instructions are _TEMPORARILY_ defined as NOT commutable.
+// The upper bits of the result of scalar FMA intrinsics must be copied from
+// the upper bits of the 1st operand. So, commuting the 1st operand would
+// invalidate the upper bits of the intrinsic result.
+// The corresponding optimization which allows commuting 2nd and 3rd operands
+// of FMA*_Int instructions has been developed and is waiting for
+// code-review approval and checkin (Please see http://reviews.llvm.org/D13269).
+let Constraints = "$src1 = $dst", isCommutable = 0, isCodeGenOnly =1,
+    hasSideEffects = 0 in {
+multiclass fma3s_rm_int<bits<8> opc, string OpcodeStr,
+                        Operand memopr, RegisterClass RC> {
+  def r_Int : FMA3<opc, MRMSrcReg, (outs RC:$dst),
+                   (ins RC:$src1, RC:$src2, RC:$src3),
+                   !strconcat(OpcodeStr,
+                              "\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
+                   []>;
+
+  let mayLoad = 1 in
+  def m_Int : FMA3<opc, MRMSrcMem, (outs RC:$dst),
+                   (ins RC:$src1, RC:$src2, memopr:$src3),
+                   !strconcat(OpcodeStr,
+                              "\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
+                   []>;
 }
+} // Constraints = "$src1 = $dst", isCommutable = 0, isCodeGenOnly =1,
+  // hasSideEffects = 0

-  // See the other defm of r213 for the explanation regarding the
-  // commutable flags.
-  defm r213 : fma3s_rm<opc213, !strconcat(OpStr, "213", PackTy),
-                       x86memop, RC, OpVT, mem_frag,
+multiclass fma3s_forms<bits<8> opc132, bits<8> opc213, bits<8> opc231,
+                       string OpStr, string PackTy,
+                       SDNode OpNode, RegisterClass RC,
+                       X86MemOperand x86memop> {
+  defm r132 : fma3s_rm<opc132, !strconcat(OpStr, "132", PackTy), x86memop, RC>;
+  defm r213 : fma3s_rm<opc213, !strconcat(OpStr, "213", PackTy), x86memop, RC,
                        /* IsRVariantCommutable */ 1,
                        /* IsMVariantCommutable */ 1,
                        OpNode>;
+  defm r231 : fma3s_rm<opc231, !strconcat(OpStr, "231", PackTy), x86memop, RC,
+                       /* IsRVariantCommutable */ 1,
+                       /* IsMVariantCommutable */ 0,
+                       null_frag>;
+}
+
+// The FMA 213 form is created for lowering of scalar FMA intrinscis
+// to machine instructions.
+// The FMA 132 form can trivially be get by commuting the 2nd and 3rd operands
+// of FMA 213 form.
+// The FMA 231 form can be get only by commuting the 1st operand of 213 or 231
+// forms and is possible only after special analysis of all uses of the initial
+// instruction. Such analysis do not exist yet and thus introducing the 231
+// form of FMA*_Int instructions is done using an optimistic assumption that
+// such analysis will be implemented eventually.
+multiclass fma3s_int_forms<bits<8> opc132, bits<8> opc213, bits<8> opc231,
+                           string OpStr, string PackTy,
+                           RegisterClass RC, Operand memop> {
+  defm r132 : fma3s_rm_int<opc132, !strconcat(OpStr, "132", PackTy),
+                           memop, RC>;
+  defm r213 : fma3s_rm_int<opc213, !strconcat(OpStr, "213", PackTy),
+                           memop, RC>;
+  defm r231 : fma3s_rm_int<opc231, !strconcat(OpStr, "231", PackTy),
+                           memop, RC>;
 }

 multiclass fma3s<bits<8> opc132, bits<8> opc213, bits<8> opc231,
                  string OpStr, Intrinsic IntF32, Intrinsic IntF64,
                  SDNode OpNode> {
-  defm SS : fma3s_forms<opc132, opc213, opc231, OpStr, "ss", "SS", IntF32, OpNode,
-                        FR32, f32, f32mem, ssmem, loadf32, sse_load_f32>;
-  defm SD : fma3s_forms<opc132, opc213, opc231, OpStr, "sd", "PD", IntF64, OpNode,
-                        FR64, f64, f64mem, sdmem, loadf64, sse_load_f64>, VEX_W;
+  defm SS : fma3s_forms<opc132, opc213, opc231, OpStr, "ss", OpNode,
+                        FR32, f32mem>,
+            fma3s_int_forms<opc132, opc213, opc231, OpStr, "ss", VR128, ssmem>;
+  defm SD : fma3s_forms<opc132, opc213, opc231, OpStr, "sd", OpNode,
+                        FR64, f64mem>,
+            fma3s_int_forms<opc132, opc213, opc231, OpStr, "sd", VR128, sdmem>,
+            VEX_W;

   // These patterns use the 123 ordering, instead of 213, even though
   // they match the intrinsic to the 213 version of the instruction.
   // This is because src1 is tied to dest, and the scalar intrinsics
   // require the pass-through values to come from the first source
   // operand, not the second.
   def : Pat<(IntF32 VR128:$src1, VR128:$src2, VR128:$src3),
             (COPY_TO_REGCLASS
-             (!cast<Instruction>(NAME#"SSr213r")
+             (!cast<Instruction>(NAME#"SSr213r_Int")
               (COPY_TO_REGCLASS $src1, FR32),
               (COPY_TO_REGCLASS $src2, FR32),
               (COPY_TO_REGCLASS $src3, FR32)),
              VR128)>;

   def : Pat<(IntF64 VR128:$src1, VR128:$src2, VR128:$src3),
             (COPY_TO_REGCLASS
-             (!cast<Instruction>(NAME#"SDr213r")
+             (!cast<Instruction>(NAME#"SDr213r_Int")
               (COPY_TO_REGCLASS $src1, FR64),
               (COPY_TO_REGCLASS $src2, FR64),
               (COPY_TO_REGCLASS $src3, FR64)),
              VR128)>;
 }

 defm VFMADD : fma3s<0x99, 0xA9, 0xB9, "vfmadd", int_x86_fma_vfmadd_ss,
                     int_x86_fma_vfmadd_sd, X86Fmadd>, VEX_LIG;

llvm/lib/Target/X86/X86InstrInfo.cpp

Show First 20 Lines • Show All 1,728 Lines • ▼ Show 20 Lines	AddTableEntry(RegOp2MemOpTable2, MemOp2RegOpTable,
Entry.RegOp, Entry.MemOp,		Entry.RegOp, Entry.MemOp,
// Index 2, folded load		// Index 2, folded load
Entry.Flags \| TB_INDEX_2 \| TB_FOLDED_LOAD);		Entry.Flags \| TB_INDEX_2 \| TB_FOLDED_LOAD);
}		}

static const X86MemoryFoldTableEntry MemoryFoldTable3[] = {		static const X86MemoryFoldTableEntry MemoryFoldTable3[] = {
// FMA foldable instructions		// FMA foldable instructions
{ X86::VFMADDSSr231r, X86::VFMADDSSr231m, TB_ALIGN_NONE },		{ X86::VFMADDSSr231r, X86::VFMADDSSr231m, TB_ALIGN_NONE },
		{ X86::VFMADDSSr231r_Int, X86::VFMADDSSr231m_Int, TB_ALIGN_NONE },
		delenaUnsubmitted Done Reply Inline Actions Do you have a test that checks memory folding of intrinsic? delena: Do you have a test that checks memory folding of intrinsic?
		v_klochkovAuthorUnsubmitted Not Done Reply Inline Actions There were no any tests checking memory folding of FMA intrinsics. In the updated patch updated the test fma-intrinsics-x86.ll such a way that it tests Windows target and it also checks Memory Folding. v_klochkov: There were no any tests checking memory folding of FMA intrinsics. In the updated patch updated…
{ X86::VFMADDSDr231r, X86::VFMADDSDr231m, TB_ALIGN_NONE },		{ X86::VFMADDSDr231r, X86::VFMADDSDr231m, TB_ALIGN_NONE },
		{ X86::VFMADDSDr231r_Int, X86::VFMADDSDr231m_Int, TB_ALIGN_NONE },
{ X86::VFMADDSSr132r, X86::VFMADDSSr132m, TB_ALIGN_NONE },		{ X86::VFMADDSSr132r, X86::VFMADDSSr132m, TB_ALIGN_NONE },
		{ X86::VFMADDSSr132r_Int, X86::VFMADDSSr132m_Int, TB_ALIGN_NONE },
{ X86::VFMADDSDr132r, X86::VFMADDSDr132m, TB_ALIGN_NONE },		{ X86::VFMADDSDr132r, X86::VFMADDSDr132m, TB_ALIGN_NONE },
		{ X86::VFMADDSDr132r_Int, X86::VFMADDSDr132m_Int, TB_ALIGN_NONE },
{ X86::VFMADDSSr213r, X86::VFMADDSSr213m, TB_ALIGN_NONE },		{ X86::VFMADDSSr213r, X86::VFMADDSSr213m, TB_ALIGN_NONE },
		{ X86::VFMADDSSr213r_Int, X86::VFMADDSSr213m_Int, TB_ALIGN_NONE },
{ X86::VFMADDSDr213r, X86::VFMADDSDr213m, TB_ALIGN_NONE },		{ X86::VFMADDSDr213r, X86::VFMADDSDr213m, TB_ALIGN_NONE },
		{ X86::VFMADDSDr213r_Int, X86::VFMADDSDr213m_Int, TB_ALIGN_NONE },

{ X86::VFMADDPSr231r, X86::VFMADDPSr231m, TB_ALIGN_NONE },		{ X86::VFMADDPSr231r, X86::VFMADDPSr231m, TB_ALIGN_NONE },
{ X86::VFMADDPDr231r, X86::VFMADDPDr231m, TB_ALIGN_NONE },		{ X86::VFMADDPDr231r, X86::VFMADDPDr231m, TB_ALIGN_NONE },
{ X86::VFMADDPSr132r, X86::VFMADDPSr132m, TB_ALIGN_NONE },		{ X86::VFMADDPSr132r, X86::VFMADDPSr132m, TB_ALIGN_NONE },
{ X86::VFMADDPDr132r, X86::VFMADDPDr132m, TB_ALIGN_NONE },		{ X86::VFMADDPDr132r, X86::VFMADDPDr132m, TB_ALIGN_NONE },
{ X86::VFMADDPSr213r, X86::VFMADDPSr213m, TB_ALIGN_NONE },		{ X86::VFMADDPSr213r, X86::VFMADDPSr213m, TB_ALIGN_NONE },
{ X86::VFMADDPDr213r, X86::VFMADDPDr213m, TB_ALIGN_NONE },		{ X86::VFMADDPDr213r, X86::VFMADDPDr213m, TB_ALIGN_NONE },
{ X86::VFMADDPSr231rY, X86::VFMADDPSr231mY, TB_ALIGN_NONE },		{ X86::VFMADDPSr231rY, X86::VFMADDPSr231mY, TB_ALIGN_NONE },
{ X86::VFMADDPDr231rY, X86::VFMADDPDr231mY, TB_ALIGN_NONE },		{ X86::VFMADDPDr231rY, X86::VFMADDPDr231mY, TB_ALIGN_NONE },
{ X86::VFMADDPSr132rY, X86::VFMADDPSr132mY, TB_ALIGN_NONE },		{ X86::VFMADDPSr132rY, X86::VFMADDPSr132mY, TB_ALIGN_NONE },
{ X86::VFMADDPDr132rY, X86::VFMADDPDr132mY, TB_ALIGN_NONE },		{ X86::VFMADDPDr132rY, X86::VFMADDPDr132mY, TB_ALIGN_NONE },
{ X86::VFMADDPSr213rY, X86::VFMADDPSr213mY, TB_ALIGN_NONE },		{ X86::VFMADDPSr213rY, X86::VFMADDPSr213mY, TB_ALIGN_NONE },
{ X86::VFMADDPDr213rY, X86::VFMADDPDr213mY, TB_ALIGN_NONE },		{ X86::VFMADDPDr213rY, X86::VFMADDPDr213mY, TB_ALIGN_NONE },

{ X86::VFNMADDSSr231r, X86::VFNMADDSSr231m, TB_ALIGN_NONE },		{ X86::VFNMADDSSr231r, X86::VFNMADDSSr231m, TB_ALIGN_NONE },
		{ X86::VFNMADDSSr231r_Int, X86::VFNMADDSSr231m_Int, TB_ALIGN_NONE },
{ X86::VFNMADDSDr231r, X86::VFNMADDSDr231m, TB_ALIGN_NONE },		{ X86::VFNMADDSDr231r, X86::VFNMADDSDr231m, TB_ALIGN_NONE },
		{ X86::VFNMADDSDr231r_Int, X86::VFNMADDSDr231m_Int, TB_ALIGN_NONE },
{ X86::VFNMADDSSr132r, X86::VFNMADDSSr132m, TB_ALIGN_NONE },		{ X86::VFNMADDSSr132r, X86::VFNMADDSSr132m, TB_ALIGN_NONE },
		{ X86::VFNMADDSSr132r_Int, X86::VFNMADDSSr132m_Int, TB_ALIGN_NONE },
{ X86::VFNMADDSDr132r, X86::VFNMADDSDr132m, TB_ALIGN_NONE },		{ X86::VFNMADDSDr132r, X86::VFNMADDSDr132m, TB_ALIGN_NONE },
		{ X86::VFNMADDSDr132r_Int, X86::VFNMADDSDr132m_Int, TB_ALIGN_NONE },
{ X86::VFNMADDSSr213r, X86::VFNMADDSSr213m, TB_ALIGN_NONE },		{ X86::VFNMADDSSr213r, X86::VFNMADDSSr213m, TB_ALIGN_NONE },
		{ X86::VFNMADDSSr213r_Int, X86::VFNMADDSSr213m_Int, TB_ALIGN_NONE },
{ X86::VFNMADDSDr213r, X86::VFNMADDSDr213m, TB_ALIGN_NONE },		{ X86::VFNMADDSDr213r, X86::VFNMADDSDr213m, TB_ALIGN_NONE },
		{ X86::VFNMADDSDr213r_Int, X86::VFNMADDSDr213m_Int, TB_ALIGN_NONE },

{ X86::VFNMADDPSr231r, X86::VFNMADDPSr231m, TB_ALIGN_NONE },		{ X86::VFNMADDPSr231r, X86::VFNMADDPSr231m, TB_ALIGN_NONE },
{ X86::VFNMADDPDr231r, X86::VFNMADDPDr231m, TB_ALIGN_NONE },		{ X86::VFNMADDPDr231r, X86::VFNMADDPDr231m, TB_ALIGN_NONE },
{ X86::VFNMADDPSr132r, X86::VFNMADDPSr132m, TB_ALIGN_NONE },		{ X86::VFNMADDPSr132r, X86::VFNMADDPSr132m, TB_ALIGN_NONE },
{ X86::VFNMADDPDr132r, X86::VFNMADDPDr132m, TB_ALIGN_NONE },		{ X86::VFNMADDPDr132r, X86::VFNMADDPDr132m, TB_ALIGN_NONE },
{ X86::VFNMADDPSr213r, X86::VFNMADDPSr213m, TB_ALIGN_NONE },		{ X86::VFNMADDPSr213r, X86::VFNMADDPSr213m, TB_ALIGN_NONE },
{ X86::VFNMADDPDr213r, X86::VFNMADDPDr213m, TB_ALIGN_NONE },		{ X86::VFNMADDPDr213r, X86::VFNMADDPDr213m, TB_ALIGN_NONE },
{ X86::VFNMADDPSr231rY, X86::VFNMADDPSr231mY, TB_ALIGN_NONE },		{ X86::VFNMADDPSr231rY, X86::VFNMADDPSr231mY, TB_ALIGN_NONE },
{ X86::VFNMADDPDr231rY, X86::VFNMADDPDr231mY, TB_ALIGN_NONE },		{ X86::VFNMADDPDr231rY, X86::VFNMADDPDr231mY, TB_ALIGN_NONE },
{ X86::VFNMADDPSr132rY, X86::VFNMADDPSr132mY, TB_ALIGN_NONE },		{ X86::VFNMADDPSr132rY, X86::VFNMADDPSr132mY, TB_ALIGN_NONE },
{ X86::VFNMADDPDr132rY, X86::VFNMADDPDr132mY, TB_ALIGN_NONE },		{ X86::VFNMADDPDr132rY, X86::VFNMADDPDr132mY, TB_ALIGN_NONE },
{ X86::VFNMADDPSr213rY, X86::VFNMADDPSr213mY, TB_ALIGN_NONE },		{ X86::VFNMADDPSr213rY, X86::VFNMADDPSr213mY, TB_ALIGN_NONE },
{ X86::VFNMADDPDr213rY, X86::VFNMADDPDr213mY, TB_ALIGN_NONE },		{ X86::VFNMADDPDr213rY, X86::VFNMADDPDr213mY, TB_ALIGN_NONE },

{ X86::VFMSUBSSr231r, X86::VFMSUBSSr231m, TB_ALIGN_NONE },		{ X86::VFMSUBSSr231r, X86::VFMSUBSSr231m, TB_ALIGN_NONE },
		{ X86::VFMSUBSSr231r_Int, X86::VFMSUBSSr231m_Int, TB_ALIGN_NONE },
{ X86::VFMSUBSDr231r, X86::VFMSUBSDr231m, TB_ALIGN_NONE },		{ X86::VFMSUBSDr231r, X86::VFMSUBSDr231m, TB_ALIGN_NONE },
		{ X86::VFMSUBSDr231r_Int, X86::VFMSUBSDr231m_Int, TB_ALIGN_NONE },
{ X86::VFMSUBSSr132r, X86::VFMSUBSSr132m, TB_ALIGN_NONE },		{ X86::VFMSUBSSr132r, X86::VFMSUBSSr132m, TB_ALIGN_NONE },
		{ X86::VFMSUBSSr132r_Int, X86::VFMSUBSSr132m_Int, TB_ALIGN_NONE },
{ X86::VFMSUBSDr132r, X86::VFMSUBSDr132m, TB_ALIGN_NONE },		{ X86::VFMSUBSDr132r, X86::VFMSUBSDr132m, TB_ALIGN_NONE },
		{ X86::VFMSUBSDr132r_Int, X86::VFMSUBSDr132m_Int, TB_ALIGN_NONE },
{ X86::VFMSUBSSr213r, X86::VFMSUBSSr213m, TB_ALIGN_NONE },		{ X86::VFMSUBSSr213r, X86::VFMSUBSSr213m, TB_ALIGN_NONE },
		{ X86::VFMSUBSSr213r_Int, X86::VFMSUBSSr213m_Int, TB_ALIGN_NONE },
{ X86::VFMSUBSDr213r, X86::VFMSUBSDr213m, TB_ALIGN_NONE },		{ X86::VFMSUBSDr213r, X86::VFMSUBSDr213m, TB_ALIGN_NONE },
		{ X86::VFMSUBSDr213r_Int, X86::VFMSUBSDr213m_Int, TB_ALIGN_NONE },

{ X86::VFMSUBPSr231r, X86::VFMSUBPSr231m, TB_ALIGN_NONE },		{ X86::VFMSUBPSr231r, X86::VFMSUBPSr231m, TB_ALIGN_NONE },
{ X86::VFMSUBPDr231r, X86::VFMSUBPDr231m, TB_ALIGN_NONE },		{ X86::VFMSUBPDr231r, X86::VFMSUBPDr231m, TB_ALIGN_NONE },
{ X86::VFMSUBPSr132r, X86::VFMSUBPSr132m, TB_ALIGN_NONE },		{ X86::VFMSUBPSr132r, X86::VFMSUBPSr132m, TB_ALIGN_NONE },
{ X86::VFMSUBPDr132r, X86::VFMSUBPDr132m, TB_ALIGN_NONE },		{ X86::VFMSUBPDr132r, X86::VFMSUBPDr132m, TB_ALIGN_NONE },
{ X86::VFMSUBPSr213r, X86::VFMSUBPSr213m, TB_ALIGN_NONE },		{ X86::VFMSUBPSr213r, X86::VFMSUBPSr213m, TB_ALIGN_NONE },
{ X86::VFMSUBPDr213r, X86::VFMSUBPDr213m, TB_ALIGN_NONE },		{ X86::VFMSUBPDr213r, X86::VFMSUBPDr213m, TB_ALIGN_NONE },
{ X86::VFMSUBPSr231rY, X86::VFMSUBPSr231mY, TB_ALIGN_NONE },		{ X86::VFMSUBPSr231rY, X86::VFMSUBPSr231mY, TB_ALIGN_NONE },
{ X86::VFMSUBPDr231rY, X86::VFMSUBPDr231mY, TB_ALIGN_NONE },		{ X86::VFMSUBPDr231rY, X86::VFMSUBPDr231mY, TB_ALIGN_NONE },
{ X86::VFMSUBPSr132rY, X86::VFMSUBPSr132mY, TB_ALIGN_NONE },		{ X86::VFMSUBPSr132rY, X86::VFMSUBPSr132mY, TB_ALIGN_NONE },
{ X86::VFMSUBPDr132rY, X86::VFMSUBPDr132mY, TB_ALIGN_NONE },		{ X86::VFMSUBPDr132rY, X86::VFMSUBPDr132mY, TB_ALIGN_NONE },
{ X86::VFMSUBPSr213rY, X86::VFMSUBPSr213mY, TB_ALIGN_NONE },		{ X86::VFMSUBPSr213rY, X86::VFMSUBPSr213mY, TB_ALIGN_NONE },
{ X86::VFMSUBPDr213rY, X86::VFMSUBPDr213mY, TB_ALIGN_NONE },		{ X86::VFMSUBPDr213rY, X86::VFMSUBPDr213mY, TB_ALIGN_NONE },

{ X86::VFNMSUBSSr231r, X86::VFNMSUBSSr231m, TB_ALIGN_NONE },		{ X86::VFNMSUBSSr231r, X86::VFNMSUBSSr231m, TB_ALIGN_NONE },
		{ X86::VFNMSUBSSr231r_Int, X86::VFNMSUBSSr231m_Int, TB_ALIGN_NONE },
		delenaUnsubmitted Not Done Reply Inline Actions I don't understand how you can use the 231 form for scalar intrinsic: intr_fmadd_ss( a, b, c) may be translated as VFMADD213SS a, b, c or VFMADD132SS a, c, b but you can't generate VFMADD231SS because "a" should go first, you are taking the upper part from it. delena: I don't understand how you can use the 231 form for scalar intrinsic: intr_fmadd_ss( a, b, c)…
		v_klochkovAuthorUnsubmitted Not Done Reply Inline Actions Very good question. In the file X86InstrFMA.td I intentionally added a comment noticing that problem. Please see the line 215 in that file: // The FMA 231 form can be get only by commuting the 1st operand of 213 or 231 // forms and is possible only after special analysis of all uses of the initial // instruction. Such analysis do not exist yet and thus introducing the 231 // form of FMA_Int instructions is done using an optimistic assumption that // such analysis will be implemented eventually. BTW, I noticed a misprint in that comment and I'll fix it: "213 or 231" --> "213 or 132". If ONLY the lowest element of FMA213 result is used then it is possible to commute the 1st operand. Such analysis exist and used in other compilers. v_klochkov:* Very good question. In the file X86InstrFMA.td I intentionally added a comment noticing that…
{ X86::VFNMSUBSDr231r, X86::VFNMSUBSDr231m, TB_ALIGN_NONE },		{ X86::VFNMSUBSDr231r, X86::VFNMSUBSDr231m, TB_ALIGN_NONE },
		{ X86::VFNMSUBSDr231r_Int, X86::VFNMSUBSDr231m_Int, TB_ALIGN_NONE },
{ X86::VFNMSUBSSr132r, X86::VFNMSUBSSr132m, TB_ALIGN_NONE },		{ X86::VFNMSUBSSr132r, X86::VFNMSUBSSr132m, TB_ALIGN_NONE },
		{ X86::VFNMSUBSSr132r_Int, X86::VFNMSUBSSr132m_Int, TB_ALIGN_NONE },
{ X86::VFNMSUBSDr132r, X86::VFNMSUBSDr132m, TB_ALIGN_NONE },		{ X86::VFNMSUBSDr132r, X86::VFNMSUBSDr132m, TB_ALIGN_NONE },
		{ X86::VFNMSUBSDr132r_Int, X86::VFNMSUBSDr132m_Int, TB_ALIGN_NONE },
{ X86::VFNMSUBSSr213r, X86::VFNMSUBSSr213m, TB_ALIGN_NONE },		{ X86::VFNMSUBSSr213r, X86::VFNMSUBSSr213m, TB_ALIGN_NONE },
		{ X86::VFNMSUBSSr213r_Int, X86::VFNMSUBSSr213m_Int, TB_ALIGN_NONE },
{ X86::VFNMSUBSDr213r, X86::VFNMSUBSDr213m, TB_ALIGN_NONE },		{ X86::VFNMSUBSDr213r, X86::VFNMSUBSDr213m, TB_ALIGN_NONE },
		{ X86::VFNMSUBSDr213r_Int, X86::VFNMSUBSDr213m_Int, TB_ALIGN_NONE },

{ X86::VFNMSUBPSr231r, X86::VFNMSUBPSr231m, TB_ALIGN_NONE },		{ X86::VFNMSUBPSr231r, X86::VFNMSUBPSr231m, TB_ALIGN_NONE },
{ X86::VFNMSUBPDr231r, X86::VFNMSUBPDr231m, TB_ALIGN_NONE },		{ X86::VFNMSUBPDr231r, X86::VFNMSUBPDr231m, TB_ALIGN_NONE },
{ X86::VFNMSUBPSr132r, X86::VFNMSUBPSr132m, TB_ALIGN_NONE },		{ X86::VFNMSUBPSr132r, X86::VFNMSUBPSr132m, TB_ALIGN_NONE },
{ X86::VFNMSUBPDr132r, X86::VFNMSUBPDr132m, TB_ALIGN_NONE },		{ X86::VFNMSUBPDr132r, X86::VFNMSUBPDr132m, TB_ALIGN_NONE },
{ X86::VFNMSUBPSr213r, X86::VFNMSUBPSr213m, TB_ALIGN_NONE },		{ X86::VFNMSUBPSr213r, X86::VFNMSUBPSr213m, TB_ALIGN_NONE },
{ X86::VFNMSUBPDr213r, X86::VFNMSUBPDr213m, TB_ALIGN_NONE },		{ X86::VFNMSUBPDr213r, X86::VFNMSUBPDr213m, TB_ALIGN_NONE },
{ X86::VFNMSUBPSr231rY, X86::VFNMSUBPSr231mY, TB_ALIGN_NONE },		{ X86::VFNMSUBPSr231rY, X86::VFNMSUBPSr231mY, TB_ALIGN_NONE },

llvm/test/CodeGen/X86/fma-intrinsics-phi-213-to-231.ll

; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2,+fma \| FileCheck %s		; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2,+fma \| FileCheck %s

		; CHECK-LABEL: fmaddsd_loop:
		; CHECK: vfmadd213sd %xmm{{[0-9]+}}, %xmm{{[0-9]+}}, %xmm{{[0-9]+}}
		qcolombet: Could you add more check lines to verify that the registers are set to the right values? With just this check, I guess that the previously buggy code would also match, and we don't want that.
		v_klochkov (author): The test cases in this LIT test are quite complex. Adding such additional checks would be quite complex too, and it may make the test unstable: pass/fail may depend on many optimizations working on the tested loops.
		The initial idea of the test fma-intrinsics-phi-213-to-231.ll _probably_ was only to check that the corresponding PHI-213-to-231 optimization did its work for FMAs with ADD accumulators. The idea of my changes was to check that the optimization DID NOT make changes for SCALAR FMA instructions generated for scalar FMA intrinsics.
		Your guess that the new scalar test case would pass all checks even without my new patch is correct. I did an additional investigation and discovered that the PHI-213-to-231 optimization never made any changes for scalar FMA instructions generated for intrinsics, because of the additional REG copies inserted in the pattern (see lines 194-196 in the base version of X86InstrFMA.td).
		After giving it some more thought, I think that the scalar test cases I added to this test are redundant, as the PHI-213-to-231 optimization is just a special instruction inserter which is never called for the new FMA_Int instructions. The other aspects/transformations, like commuting operands, should be checked by other specialized tests like fma-commutex86.ll added in (D13269).
		Do you agree with my plan to remove the scalar cases from the test? Also, it may be useful to add test cases for 128-bit packed FMAs to this test; I was surprised to realize that the test has only 256-bit test cases in it.
		qcolombet: > Do you agree with my plan to remove the scalar cases from the test?
		I believe you if you say they are redundant with the ones we will add with D13269, so I am fine with that plan. However, that does not resolve the problem that the check lines for the PHI tests would still have matched the bad code generation. How do you suggest fixing that?
		define <2 x double> @fmaddsd_loop(i32 %iter, <2 x double> %a, <2 x double> %b, <2 x double> %c) {
		entry:
		br label %for.cond

		for.cond:
		%c.addr.0 = phi <2 x double> [ %c, %entry ], [ %0, %for.inc ]
		%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
		%cmp = icmp slt i32 %i.0, %iter
		br i1 %cmp, label %for.body, label %for.end

		for.body:
		br label %for.inc

		for.inc:
		%0 = call <2 x double> @llvm.x86.fma.vfmadd.sd(<2 x double> %a, <2 x double> %b, <2 x double> %c.addr.0)
		%inc = add nsw i32 %i.0, 1
		br label %for.cond

		for.end:
		ret <2 x double> %c.addr.0
		}

		; CHECK-LABEL: fmsubsd_loop:
		; CHECK: vfmsub213sd %xmm{{[0-9]+}}, %xmm{{[0-9]+}}, %xmm{{[0-9]+}}
		define <2 x double> @fmsubsd_loop(i32 %iter, <2 x double> %a, <2 x double> %b, <2 x double> %c) {
		entry:
		br label %for.cond

		for.cond:
		%c.addr.0 = phi <2 x double> [ %c, %entry ], [ %0, %for.inc ]
		%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
		%cmp = icmp slt i32 %i.0, %iter
		br i1 %cmp, label %for.body, label %for.end

		for.body:
		br label %for.inc

		for.inc:
		%0 = call <2 x double> @llvm.x86.fma.vfmsub.sd(<2 x double> %a, <2 x double> %b, <2 x double> %c.addr.0)
		%inc = add nsw i32 %i.0, 1
		br label %for.cond

		for.end:
		ret <2 x double> %c.addr.0
		}

		; CHECK-LABEL: fnmaddsd_loop:
		; CHECK: vfnmadd213sd %xmm{{[0-9]+}}, %xmm{{[0-9]+}}, %xmm{{[0-9]+}}
		define <2 x double> @fnmaddsd_loop(i32 %iter, <2 x double> %a, <2 x double> %b, <2 x double> %c) {
		entry:
		br label %for.cond

		for.cond:
		%c.addr.0 = phi <2 x double> [ %c, %entry ], [ %0, %for.inc ]
		%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
		%cmp = icmp slt i32 %i.0, %iter
		br i1 %cmp, label %for.body, label %for.end

		for.body:
		br label %for.inc

		for.inc:
		%0 = call <2 x double> @llvm.x86.fma.vfnmadd.sd(<2 x double> %a, <2 x double> %b, <2 x double> %c.addr.0)
		%inc = add nsw i32 %i.0, 1
		br label %for.cond

		for.end:
		ret <2 x double> %c.addr.0
		}

		; CHECK-LABEL: fnmsubsd_loop:
		; CHECK: vfnmsub213sd %xmm{{[0-9]+}}, %xmm{{[0-9]+}}, %xmm{{[0-9]+}}
		define <2 x double> @fnmsubsd_loop(i32 %iter, <2 x double> %a, <2 x double> %b, <2 x double> %c) {
		entry:
		br label %for.cond

		for.cond:
		%c.addr.0 = phi <2 x double> [ %c, %entry ], [ %0, %for.inc ]
		%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
		%cmp = icmp slt i32 %i.0, %iter
		br i1 %cmp, label %for.body, label %for.end

		for.body:
		br label %for.inc

		for.inc:
		%0 = call <2 x double> @llvm.x86.fma.vfnmsub.sd(<2 x double> %a, <2 x double> %b, <2 x double> %c.addr.0)
		%inc = add nsw i32 %i.0, 1
		br label %for.cond

		for.end:
		ret <2 x double> %c.addr.0
		}

		declare <2 x double> @llvm.x86.fma.vfmadd.sd(<2 x double>, <2 x double>, <2 x double>)
		declare <2 x double> @llvm.x86.fma.vfmsub.sd(<2 x double>, <2 x double>, <2 x double>)
		declare <2 x double> @llvm.x86.fma.vfnmadd.sd(<2 x double>, <2 x double>, <2 x double>)
		declare <2 x double> @llvm.x86.fma.vfnmsub.sd(<2 x double>, <2 x double>, <2 x double>)

; CHECK-LABEL: fmaddsubpd_loop:		; CHECK-LABEL: fmaddsubpd_loop:
; CHECK: vfmaddsub231pd %ymm{{[0-9]+}}, %ymm{{[0-9]+}}, %ymm{{[0-9]+}}		; CHECK: vfmaddsub231pd %ymm{{[0-9]+}}, %ymm{{[0-9]+}}, %ymm{{[0-9]+}}
define <4 x double> @fmaddsubpd_loop(i32 %iter, <4 x double> %a, <4 x double> %b, <4 x double> %c) {		define <4 x double> @fmaddsubpd_loop(i32 %iter, <4 x double> %a, <4 x double> %b, <4 x double> %c) {
entry:		entry:
		v_klochkov (author): I updated the test. The checks look reasonably tight now.
br label %for.cond		br label %for.cond

for.cond:		for.cond:
%c.addr.0 = phi <4 x double> [ %c, %entry ], [ %0, %for.inc ]		%c.addr.0 = phi <4 x double> [ %c, %entry ], [ %0, %for.inc ]
%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]		%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
%cmp = icmp slt i32 %i.0, %iter		%cmp = icmp slt i32 %i.0, %iter
br i1 %cmp, label %for.body, label %for.end		br i1 %cmp, label %for.body, label %for.end

for.inc:		for.inc:
%0 = call <4 x double> @llvm.x86.fma.vfmsub.pd.256(<4 x double> %a, <4 x double> %b, <4 x double> %c.addr.0)		%0 = call <4 x double> @llvm.x86.fma.vfmsub.pd.256(<4 x double> %a, <4 x double> %b, <4 x double> %c.addr.0)
%inc = add nsw i32 %i.0, 1		%inc = add nsw i32 %i.0, 1
br label %for.cond		br label %for.cond

for.end:		for.end:
ret <4 x double> %c.addr.0		ret <4 x double> %c.addr.0
}		}

		; CHECK-LABEL: fnmaddpd_loop:
		; CHECK: vfnmadd231pd %ymm{{[0-9]+}}, %ymm{{[0-9]+}}, %ymm{{[0-9]+}}
		define <4 x double> @fnmaddpd_loop(i32 %iter, <4 x double> %a, <4 x double> %b, <4 x double> %c) {
		entry:
		br label %for.cond

		for.cond:
		%c.addr.0 = phi <4 x double> [ %c, %entry ], [ %0, %for.inc ]
		%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
		%cmp = icmp slt i32 %i.0, %iter
		br i1 %cmp, label %for.body, label %for.end

		for.body:
		br label %for.inc

		for.inc:
		%0 = call <4 x double> @llvm.x86.fma.vfnmadd.pd.256(<4 x double> %a, <4 x double> %b, <4 x double> %c.addr.0)
		%inc = add nsw i32 %i.0, 1
		br label %for.cond

		for.end:
		ret <4 x double> %c.addr.0
		}

		; CHECK-LABEL: fnmsubpd_loop:
		; CHECK: vfnmsub231pd %ymm{{[0-9]+}}, %ymm{{[0-9]+}}, %ymm{{[0-9]+}}
		define <4 x double> @fnmsubpd_loop(i32 %iter, <4 x double> %a, <4 x double> %b, <4 x double> %c) {
		entry:
		br label %for.cond

		for.cond:
		%c.addr.0 = phi <4 x double> [ %c, %entry ], [ %0, %for.inc ]
		%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
		%cmp = icmp slt i32 %i.0, %iter
		br i1 %cmp, label %for.body, label %for.end

		for.body:
		br label %for.inc

		for.inc:
		%0 = call <4 x double> @llvm.x86.fma.vfnmsub.pd.256(<4 x double> %a, <4 x double> %b, <4 x double> %c.addr.0)
		%inc = add nsw i32 %i.0, 1
		br label %for.cond

		for.end:
		ret <4 x double> %c.addr.0
		}

declare <4 x double> @llvm.x86.fma.vfmaddsub.pd.256(<4 x double>, <4 x double>, <4 x double>)		declare <4 x double> @llvm.x86.fma.vfmaddsub.pd.256(<4 x double>, <4 x double>, <4 x double>)
declare <4 x double> @llvm.x86.fma.vfmsubadd.pd.256(<4 x double>, <4 x double>, <4 x double>)		declare <4 x double> @llvm.x86.fma.vfmsubadd.pd.256(<4 x double>, <4 x double>, <4 x double>)
declare <4 x double> @llvm.x86.fma.vfmadd.pd.256(<4 x double>, <4 x double>, <4 x double>)		declare <4 x double> @llvm.x86.fma.vfmadd.pd.256(<4 x double>, <4 x double>, <4 x double>)
declare <4 x double> @llvm.x86.fma.vfmsub.pd.256(<4 x double>, <4 x double>, <4 x double>)		declare <4 x double> @llvm.x86.fma.vfmsub.pd.256(<4 x double>, <4 x double>, <4 x double>)
		declare <4 x double> @llvm.x86.fma.vfnmadd.pd.256(<4 x double>, <4 x double>, <4 x double>)
		declare <4 x double> @llvm.x86.fma.vfnmsub.pd.256(<4 x double>, <4 x double>, <4 x double>)


; CHECK-LABEL: fmaddsubps_loop:		; CHECK-LABEL: fmaddsubps_loop:
; CHECK: vfmaddsub231ps %ymm{{[0-9]+}}, %ymm{{[0-9]+}}, %ymm{{[0-9]+}}		; CHECK: vfmaddsub231ps %ymm{{[0-9]+}}, %ymm{{[0-9]+}}, %ymm{{[0-9]+}}
define <8 x float> @fmaddsubps_loop(i32 %iter, <8 x float> %a, <8 x float> %b, <8 x float> %c) {		define <8 x float> @fmaddsubps_loop(i32 %iter, <8 x float> %a, <8 x float> %b, <8 x float> %c) {
entry:		entry:
br label %for.cond		br label %for.cond

for.cond:		for.cond:
%c.addr.0 = phi <8 x float> [ %c, %entry ], [ %0, %for.inc ]		%c.addr.0 = phi <8 x float> [ %c, %entry ], [ %0, %for.inc ]
%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]		%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
%cmp = icmp slt i32 %i.0, %iter		%cmp = icmp slt i32 %i.0, %iter
br i1 %cmp, label %for.body, label %for.end		br i1 %cmp, label %for.body, label %for.end

for.body:		for.body:
br label %for.inc		br label %for.inc

for.inc:		for.inc:
%0 = call <8 x float> @llvm.x86.fma.vfmaddsub.ps.256(<8 x float> %a, <8 x float> %b, <8 x float> %c.addr.0)		%0 = call <8 x float> @llvm.x86.fma.vfmaddsub.ps.256(<8 x float> %a, <8 x float> %b, <8 x float> %c.addr.0)
%inc = add nsw i32 %i.0, 1		%inc = add nsw i32 %i.0, 1
br label %for.cond		br label %for.cond
		delena: Why do you need such a long test to check only one operation? This comment applies to all the tests.
		v_klochkov (author): This LIT test was not created by me, so that would be a good question for its author, but in my opinion the test is reasonably long. (I suppose you mean that each of the test cases is long, right?) The test case here represents a very typical situation: there is an FMA instruction in a loop, and there is a loop dependency going through the ADD path of the FMA (i.e. through the 3rd operand of the a*b + c operation). This test case looks like a pretty laconic representation of such a loop.
		delena: The test checks that an FMA intrinsic gives the right form of FMA instruction. I don't understand why you need a loop here. We wrote a lot of FMA intrinsic tests without any loops.
		v_klochkov (author): The loop is needed to get the right form of FMA instruction, i.e. the 231 form is generated when there is a LOOP DEPENDENCY on the ADD path. The test checks that the 231 form is generated for such loops.
		delena: ok
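Why a loop-carried addend favors the 231 form can be sketched with a toy cost model. This is an illustration only, not how LLVM's register allocator works: `LoopResult`, `loop_with_213`, and `loop_with_231` are hypothetical names, and the copy counts reflect only this simplified model, in which the destination of an FMA3 instruction is tied to its first operand.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type: final accumulator value plus the number of
// register copies the simplified model needed.
struct LoopResult {
    double value;
    int copies;
};

// 213 form: dst = op2 * dst + op3. The destination overwrites a
// multiplicand, so this model preserves `a` with a copy and then moves
// the result back into the accumulator for the next iteration.
LoopResult loop_with_213(double a, double b, double c, int iters) {
    int copies = 0;
    for (int i = 0; i < iters; ++i) {
        double tmp = a;             // copy: keep `a` alive
        ++copies;
        tmp = std::fma(b, tmp, c);  // tmp = b * tmp + c
        c = tmp;                    // copy: back into the accumulator
        ++copies;
    }
    return {c, copies};
}

// 231 form: dst = op2 * op3 + dst. The accumulator is the destination,
// so no copy is needed in this model.
LoopResult loop_with_231(double a, double b, double c, int iters) {
    for (int i = 0; i < iters; ++i) {
        c = std::fma(a, b, c);      // c = a * b + c
    }
    return {c, 0};
}

// Illustrative check with arbitrary values.
void check_loops() {
    LoopResult r213 = loop_with_213(2.0, 3.0, 1.0, 4);
    LoopResult r231 = loop_with_231(2.0, 3.0, 1.0, 4);
    assert(r213.value == r231.value);  // same arithmetic result
    assert(r231.copies == 0);
    assert(r213.copies > 0);
}
```

Both variants compute the same value; the difference in this sketch is only the bookkeeping around the accumulator, which is the situation the loop-shaped test cases are meant to exercise.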

for.end:		for.end:
ret <8 x float> %c.addr.0		ret <8 x float> %c.addr.0
}		}

; CHECK-LABEL: fmsubaddps_loop:		; CHECK-LABEL: fmsubaddps_loop:
; CHECK: vfmsubadd231ps %ymm{{[0-9]+}}, %ymm{{[0-9]+}}, %ymm{{[0-9]+}}		; CHECK: vfmsubadd231ps %ymm{{[0-9]+}}, %ymm{{[0-9]+}}, %ymm{{[0-9]+}}
define <8 x float> @fmsubaddps_loop(i32 %iter, <8 x float> %a, <8 x float> %b, <8 x float> %c) {		define <8 x float> @fmsubaddps_loop(i32 %iter, <8 x float> %a, <8 x float> %b, <8 x float> %c) {
for.inc:		for.inc:
%0 = call <8 x float> @llvm.x86.fma.vfmsub.ps.256(<8 x float> %a, <8 x float> %b, <8 x float> %c.addr.0)		%0 = call <8 x float> @llvm.x86.fma.vfmsub.ps.256(<8 x float> %a, <8 x float> %b, <8 x float> %c.addr.0)
%inc = add nsw i32 %i.0, 1		%inc = add nsw i32 %i.0, 1
br label %for.cond		br label %for.cond

for.end:		for.end:
ret <8 x float> %c.addr.0		ret <8 x float> %c.addr.0
}		}

		; CHECK-LABEL: fnmaddps_loop:
		; CHECK: vfnmadd231ps %ymm{{[0-9]+}}, %ymm{{[0-9]+}}, %ymm{{[0-9]+}}
		define <8 x float> @fnmaddps_loop(i32 %iter, <8 x float> %a, <8 x float> %b, <8 x float> %c) {
		entry:
		br label %for.cond

		for.cond:
		%c.addr.0 = phi <8 x float> [ %c, %entry ], [ %0, %for.inc ]
		%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
		%cmp = icmp slt i32 %i.0, %iter
		br i1 %cmp, label %for.body, label %for.end

		for.body:
		br label %for.inc

		for.inc:
		%0 = call <8 x float> @llvm.x86.fma.vfnmadd.ps.256(<8 x float> %a, <8 x float> %b, <8 x float> %c.addr.0)
		%inc = add nsw i32 %i.0, 1
		br label %for.cond

		for.end:
		ret <8 x float> %c.addr.0
		}

		; CHECK-LABEL: fnmsubps_loop:
		; CHECK: vfnmsub231ps %ymm{{[0-9]+}}, %ymm{{[0-9]+}}, %ymm{{[0-9]+}}
		define <8 x float> @fnmsubps_loop(i32 %iter, <8 x float> %a, <8 x float> %b, <8 x float> %c) {
		entry:
		br label %for.cond

		for.cond:
		%c.addr.0 = phi <8 x float> [ %c, %entry ], [ %0, %for.inc ]
		%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
		%cmp = icmp slt i32 %i.0, %iter
		br i1 %cmp, label %for.body, label %for.end

		for.body:
		br label %for.inc

		for.inc:
		%0 = call <8 x float> @llvm.x86.fma.vfnmsub.ps.256(<8 x float> %a, <8 x float> %b, <8 x float> %c.addr.0)
		%inc = add nsw i32 %i.0, 1
		br label %for.cond

		for.end:
		ret <8 x float> %c.addr.0
		}

declare <8 x float> @llvm.x86.fma.vfmaddsub.ps.256(<8 x float>, <8 x float>, <8 x float>)		declare <8 x float> @llvm.x86.fma.vfmaddsub.ps.256(<8 x float>, <8 x float>, <8 x float>)
declare <8 x float> @llvm.x86.fma.vfmsubadd.ps.256(<8 x float>, <8 x float>, <8 x float>)		declare <8 x float> @llvm.x86.fma.vfmsubadd.ps.256(<8 x float>, <8 x float>, <8 x float>)
declare <8 x float> @llvm.x86.fma.vfmadd.ps.256(<8 x float>, <8 x float>, <8 x float>)		declare <8 x float> @llvm.x86.fma.vfmadd.ps.256(<8 x float>, <8 x float>, <8 x float>)
declare <8 x float> @llvm.x86.fma.vfmsub.ps.256(<8 x float>, <8 x float>, <8 x float>)		declare <8 x float> @llvm.x86.fma.vfmsub.ps.256(<8 x float>, <8 x float>, <8 x float>)
		declare <8 x float> @llvm.x86.fma.vfnmadd.ps.256(<8 x float>, <8 x float>, <8 x float>)
		declare <8 x float> @llvm.x86.fma.vfnmsub.ps.256(<8 x float>, <8 x float>, <8 x float>)

llvm/test/CodeGen/X86/fma-intrinsics-x86.ll

	;			;
	; CHECK-FMA4-LABEL: test_x86_fma_vfmadd_ss:			; CHECK-FMA4-LABEL: test_x86_fma_vfmadd_ss:
	; CHECK-FMA4: # BB#0:			; CHECK-FMA4: # BB#0:
	; CHECK-FMA4-NEXT: vfmaddss %xmm2, %xmm1, %xmm0, %xmm0			; CHECK-FMA4-NEXT: vfmaddss %xmm2, %xmm1, %xmm0, %xmm0
	; CHECK-FMA4-NEXT: retq			; CHECK-FMA4-NEXT: retq
	%res = call <4 x float> @llvm.x86.fma.vfmadd.ss(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2)			%res = call <4 x float> @llvm.x86.fma.vfmadd.ss(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2)
	ret <4 x float> %res			ret <4 x float> %res
	}			}

				define <4 x float> @test_x86_fma_vfmadd_bac_ss(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2) #0 {
				; CHECK-FMA-LABEL: test_x86_fma_vfmadd_bac_ss:
				; CHECK-FMA: # BB#0:
				; CHECK-FMA-NEXT: vfmadd213ss %xmm2, %xmm0, %xmm1
				; CHECK-FMA-NEXT: vmovaps %xmm1, %xmm0
				; CHECK-FMA-NEXT: retq
				;
				; CHECK-FMA4-LABEL: test_x86_fma_vfmadd_bac_ss:
				; CHECK-FMA4: # BB#0:
				; CHECK-FMA4-NEXT: vfmaddss %xmm2, %xmm0, %xmm1, %xmm0
				; CHECK-FMA4-NEXT: retq
				%res = call <4 x float> @llvm.x86.fma.vfmadd.ss(<4 x float> %a1, <4 x float> %a0, <4 x float> %a2)
				ret <4 x float> %res
				}
	declare <4 x float> @llvm.x86.fma.vfmadd.ss(<4 x float>, <4 x float>, <4 x float>)			declare <4 x float> @llvm.x86.fma.vfmadd.ss(<4 x float>, <4 x float>, <4 x float>)

	define <2 x double> @test_x86_fma_vfmadd_sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2) #0 {			define <2 x double> @test_x86_fma_vfmadd_sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2) #0 {
	; CHECK-FMA-LABEL: test_x86_fma_vfmadd_sd:			; CHECK-FMA-LABEL: test_x86_fma_vfmadd_sd:
	; CHECK-FMA: # BB#0:			; CHECK-FMA: # BB#0:
	; CHECK-FMA-NEXT: vfmadd213sd %xmm2, %xmm1, %xmm0			; CHECK-FMA-NEXT: vfmadd213sd %xmm2, %xmm1, %xmm0
	; CHECK-FMA-NEXT: retq			; CHECK-FMA-NEXT: retq
	;			;
	; CHECK-FMA4-LABEL: test_x86_fma_vfmadd_sd:			; CHECK-FMA4-LABEL: test_x86_fma_vfmadd_sd:
	; CHECK-FMA4: # BB#0:			; CHECK-FMA4: # BB#0:
	; CHECK-FMA4-NEXT: vfmaddsd %xmm2, %xmm1, %xmm0, %xmm0			; CHECK-FMA4-NEXT: vfmaddsd %xmm2, %xmm1, %xmm0, %xmm0
	; CHECK-FMA4-NEXT: retq			; CHECK-FMA4-NEXT: retq
	%res = call <2 x double> @llvm.x86.fma.vfmadd.sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2)			%res = call <2 x double> @llvm.x86.fma.vfmadd.sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2)
	ret <2 x double> %res			ret <2 x double> %res
	}			}

				define <2 x double> @test_x86_fma_vfmadd_bac_sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2) #0 {
				; CHECK-FMA-LABEL: test_x86_fma_vfmadd_bac_sd:
				; CHECK-FMA: # BB#0:
				; CHECK-FMA-NEXT: vfmadd213sd %xmm2, %xmm0, %xmm1
				; CHECK-FMA-NEXT: vmovaps %xmm1, %xmm0
				; CHECK-FMA-NEXT: retq
				;
				; CHECK-FMA4-LABEL: test_x86_fma_vfmadd_bac_sd:
				; CHECK-FMA4: # BB#0:
				; CHECK-FMA4-NEXT: vfmaddsd %xmm2, %xmm0, %xmm1, %xmm0
				; CHECK-FMA4-NEXT: retq
				%res = call <2 x double> @llvm.x86.fma.vfmadd.sd(<2 x double> %a1, <2 x double> %a0, <2 x double> %a2)
				ret <2 x double> %res
				}
	declare <2 x double> @llvm.x86.fma.vfmadd.sd(<2 x double>, <2 x double>, <2 x double>)			declare <2 x double> @llvm.x86.fma.vfmadd.sd(<2 x double>, <2 x double>, <2 x double>)

	define <4 x float> @test_x86_fma_vfmadd_ps(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2) #0 {			define <4 x float> @test_x86_fma_vfmadd_ps(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2) #0 {
	; CHECK-FMA-LABEL: test_x86_fma_vfmadd_ps:			; CHECK-FMA-LABEL: test_x86_fma_vfmadd_ps:
	; CHECK-FMA: # BB#0:			; CHECK-FMA: # BB#0:
	; CHECK-FMA-NEXT: vfmadd213ps %xmm2, %xmm1, %xmm0			; CHECK-FMA-NEXT: vfmadd213ps %xmm2, %xmm1, %xmm0
	; CHECK-FMA-NEXT: retq			; CHECK-FMA-NEXT: retq
	;			;
	;			;
	; CHECK-FMA4-LABEL: test_x86_fma_vfmsub_ss:			; CHECK-FMA4-LABEL: test_x86_fma_vfmsub_ss:
	; CHECK-FMA4: # BB#0:			; CHECK-FMA4: # BB#0:
	; CHECK-FMA4-NEXT: vfmsubss %xmm2, %xmm1, %xmm0, %xmm0			; CHECK-FMA4-NEXT: vfmsubss %xmm2, %xmm1, %xmm0, %xmm0
	; CHECK-FMA4-NEXT: retq			; CHECK-FMA4-NEXT: retq
	%res = call <4 x float> @llvm.x86.fma.vfmsub.ss(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2)			%res = call <4 x float> @llvm.x86.fma.vfmsub.ss(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2)
	ret <4 x float> %res			ret <4 x float> %res
	}			}

				define <4 x float> @test_x86_fma_vfmsub_bac_ss(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2) #0 {
				; CHECK-FMA-LABEL: test_x86_fma_vfmsub_bac_ss:
				; CHECK-FMA: # BB#0:
				; CHECK-FMA-NEXT: vfmsub213ss %xmm2, %xmm0, %xmm1
				; CHECK-FMA-NEXT: vmovaps %xmm1, %xmm0
				; CHECK-FMA-NEXT: retq
				;
				; CHECK-FMA4-LABEL: test_x86_fma_vfmsub_bac_ss:
				; CHECK-FMA4: # BB#0:
				; CHECK-FMA4-NEXT: vfmsubss %xmm2, %xmm0, %xmm1, %xmm0
				; CHECK-FMA4-NEXT: retq
				%res = call <4 x float> @llvm.x86.fma.vfmsub.ss(<4 x float> %a1, <4 x float> %a0, <4 x float> %a2)
				ret <4 x float> %res
				}
	declare <4 x float> @llvm.x86.fma.vfmsub.ss(<4 x float>, <4 x float>, <4 x float>)			declare <4 x float> @llvm.x86.fma.vfmsub.ss(<4 x float>, <4 x float>, <4 x float>)

	define <2 x double> @test_x86_fma_vfmsub_sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2) #0 {			define <2 x double> @test_x86_fma_vfmsub_sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2) #0 {
	; CHECK-FMA-LABEL: test_x86_fma_vfmsub_sd:			; CHECK-FMA-LABEL: test_x86_fma_vfmsub_sd:
	; CHECK-FMA: # BB#0:			; CHECK-FMA: # BB#0:
	; CHECK-FMA-NEXT: vfmsub213sd %xmm2, %xmm1, %xmm0			; CHECK-FMA-NEXT: vfmsub213sd %xmm2, %xmm1, %xmm0
	; CHECK-FMA-NEXT: retq			; CHECK-FMA-NEXT: retq
	;			;
	; CHECK-FMA4-LABEL: test_x86_fma_vfmsub_sd:			; CHECK-FMA4-LABEL: test_x86_fma_vfmsub_sd:
	; CHECK-FMA4: # BB#0:			; CHECK-FMA4: # BB#0:
	; CHECK-FMA4-NEXT: vfmsubsd %xmm2, %xmm1, %xmm0, %xmm0			; CHECK-FMA4-NEXT: vfmsubsd %xmm2, %xmm1, %xmm0, %xmm0
	; CHECK-FMA4-NEXT: retq			; CHECK-FMA4-NEXT: retq
	%res = call <2 x double> @llvm.x86.fma.vfmsub.sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2)			%res = call <2 x double> @llvm.x86.fma.vfmsub.sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2)
	ret <2 x double> %res			ret <2 x double> %res
	}			}

				define <2 x double> @test_x86_fma_vfmsub_bac_sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2) #0 {
				; CHECK-FMA-LABEL: test_x86_fma_vfmsub_bac_sd:
				; CHECK-FMA: # BB#0:
				; CHECK-FMA-NEXT: vfmsub213sd %xmm2, %xmm0, %xmm1
				; CHECK-FMA-NEXT: vmovaps %xmm1, %xmm0
				; CHECK-FMA-NEXT: retq
				;
				; CHECK-FMA4-LABEL: test_x86_fma_vfmsub_bac_sd:
				; CHECK-FMA4: # BB#0:
				; CHECK-FMA4-NEXT: vfmsubsd %xmm2, %xmm0, %xmm1, %xmm0
				; CHECK-FMA4-NEXT: retq
				%res = call <2 x double> @llvm.x86.fma.vfmsub.sd(<2 x double> %a1, <2 x double> %a0, <2 x double> %a2)
				ret <2 x double> %res
				}
	declare <2 x double> @llvm.x86.fma.vfmsub.sd(<2 x double>, <2 x double>, <2 x double>)			declare <2 x double> @llvm.x86.fma.vfmsub.sd(<2 x double>, <2 x double>, <2 x double>)

	define <4 x float> @test_x86_fma_vfmsub_ps(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2) #0 {			define <4 x float> @test_x86_fma_vfmsub_ps(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2) #0 {
	; CHECK-FMA-LABEL: test_x86_fma_vfmsub_ps:			; CHECK-FMA-LABEL: test_x86_fma_vfmsub_ps:
	; CHECK-FMA: # BB#0:			; CHECK-FMA: # BB#0:
	; CHECK-FMA-NEXT: vfmsub213ps %xmm2, %xmm1, %xmm0			; CHECK-FMA-NEXT: vfmsub213ps %xmm2, %xmm1, %xmm0
	; CHECK-FMA-NEXT: retq			; CHECK-FMA-NEXT: retq
	;			;
	;			;
	; CHECK-FMA4-LABEL: test_x86_fma_vfnmadd_ss:			; CHECK-FMA4-LABEL: test_x86_fma_vfnmadd_ss:
	; CHECK-FMA4: # BB#0:			; CHECK-FMA4: # BB#0:
	; CHECK-FMA4-NEXT: vfnmaddss %xmm2, %xmm1, %xmm0, %xmm0			; CHECK-FMA4-NEXT: vfnmaddss %xmm2, %xmm1, %xmm0, %xmm0
	; CHECK-FMA4-NEXT: retq			; CHECK-FMA4-NEXT: retq
	%res = call <4 x float> @llvm.x86.fma.vfnmadd.ss(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2)			%res = call <4 x float> @llvm.x86.fma.vfnmadd.ss(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2)
	ret <4 x float> %res			ret <4 x float> %res
	}			}

				define <4 x float> @test_x86_fma_vfnmadd_bac_ss(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2) #0 {
				; CHECK-FMA-LABEL: test_x86_fma_vfnmadd_bac_ss:
				; CHECK-FMA: # BB#0:
				; CHECK-FMA-NEXT: vfnmadd213ss %xmm2, %xmm0, %xmm1
				; CHECK-FMA-NEXT: vmovaps %xmm1, %xmm0
				; CHECK-FMA-NEXT: retq
				;
				; CHECK-FMA4-LABEL: test_x86_fma_vfnmadd_bac_ss:
				; CHECK-FMA4: # BB#0:
				; CHECK-FMA4-NEXT: vfnmaddss %xmm2, %xmm0, %xmm1, %xmm0
				; CHECK-FMA4-NEXT: retq
				%res = call <4 x float> @llvm.x86.fma.vfnmadd.ss(<4 x float> %a1, <4 x float> %a0, <4 x float> %a2)
				ret <4 x float> %res
				}
	declare <4 x float> @llvm.x86.fma.vfnmadd.ss(<4 x float>, <4 x float>, <4 x float>)			declare <4 x float> @llvm.x86.fma.vfnmadd.ss(<4 x float>, <4 x float>, <4 x float>)

	define <2 x double> @test_x86_fma_vfnmadd_sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2) #0 {			define <2 x double> @test_x86_fma_vfnmadd_sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2) #0 {
	; CHECK-FMA-LABEL: test_x86_fma_vfnmadd_sd:			; CHECK-FMA-LABEL: test_x86_fma_vfnmadd_sd:
	; CHECK-FMA: # BB#0:			; CHECK-FMA: # BB#0:
	; CHECK-FMA-NEXT: vfnmadd213sd %xmm2, %xmm1, %xmm0			; CHECK-FMA-NEXT: vfnmadd213sd %xmm2, %xmm1, %xmm0
	; CHECK-FMA-NEXT: retq			; CHECK-FMA-NEXT: retq
	;			;
	; CHECK-FMA4-LABEL: test_x86_fma_vfnmadd_sd:			; CHECK-FMA4-LABEL: test_x86_fma_vfnmadd_sd:
	; CHECK-FMA4: # BB#0:			; CHECK-FMA4: # BB#0:
	; CHECK-FMA4-NEXT: vfnmaddsd %xmm2, %xmm1, %xmm0, %xmm0			; CHECK-FMA4-NEXT: vfnmaddsd %xmm2, %xmm1, %xmm0, %xmm0
	; CHECK-FMA4-NEXT: retq			; CHECK-FMA4-NEXT: retq
	%res = call <2 x double> @llvm.x86.fma.vfnmadd.sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2)			%res = call <2 x double> @llvm.x86.fma.vfnmadd.sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2)
	ret <2 x double> %res			ret <2 x double> %res
	}			}

				define <2 x double> @test_x86_fma_vfnmadd_bac_sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2) #0 {
				; CHECK-FMA-LABEL: test_x86_fma_vfnmadd_bac_sd:
				; CHECK-FMA: # BB#0:
				; CHECK-FMA-NEXT: vfnmadd213sd %xmm2, %xmm0, %xmm1
				; CHECK-FMA-NEXT: vmovaps %xmm1, %xmm0
				; CHECK-FMA-NEXT: retq
				;
				; CHECK-FMA4-LABEL: test_x86_fma_vfnmadd_bac_sd:
				; CHECK-FMA4: # BB#0:
				; CHECK-FMA4-NEXT: vfnmaddsd %xmm2, %xmm0, %xmm1, %xmm0
				; CHECK-FMA4-NEXT: retq
				%res = call <2 x double> @llvm.x86.fma.vfnmadd.sd(<2 x double> %a1, <2 x double> %a0, <2 x double> %a2)
				ret <2 x double> %res
				}
	declare <2 x double> @llvm.x86.fma.vfnmadd.sd(<2 x double>, <2 x double>, <2 x double>)			declare <2 x double> @llvm.x86.fma.vfnmadd.sd(<2 x double>, <2 x double>, <2 x double>)

	define <4 x float> @test_x86_fma_vfnmadd_ps(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2) #0 {			define <4 x float> @test_x86_fma_vfnmadd_ps(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2) #0 {
	; CHECK-FMA-LABEL: test_x86_fma_vfnmadd_ps:			; CHECK-FMA-LABEL: test_x86_fma_vfnmadd_ps:
	; CHECK-FMA: # BB#0:			; CHECK-FMA: # BB#0:
	; CHECK-FMA-NEXT: vfnmadd213ps %xmm2, %xmm1, %xmm0			; CHECK-FMA-NEXT: vfnmadd213ps %xmm2, %xmm1, %xmm0
	; CHECK-FMA-NEXT: retq			; CHECK-FMA-NEXT: retq
	;			;
	;			;
	; CHECK-FMA4-LABEL: test_x86_fma_vfnmsub_ss:			; CHECK-FMA4-LABEL: test_x86_fma_vfnmsub_ss:
	; CHECK-FMA4: # BB#0:			; CHECK-FMA4: # BB#0:
	; CHECK-FMA4-NEXT: vfnmsubss %xmm2, %xmm1, %xmm0, %xmm0			; CHECK-FMA4-NEXT: vfnmsubss %xmm2, %xmm1, %xmm0, %xmm0
	; CHECK-FMA4-NEXT: retq			; CHECK-FMA4-NEXT: retq
	%res = call <4 x float> @llvm.x86.fma.vfnmsub.ss(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2)			%res = call <4 x float> @llvm.x86.fma.vfnmsub.ss(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2)
	ret <4 x float> %res			ret <4 x float> %res
	}			}

				define <4 x float> @test_x86_fma_vfnmsub_bac_ss(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2) #0 {
				; CHECK-FMA-LABEL: test_x86_fma_vfnmsub_bac_ss:
				; CHECK-FMA: # BB#0:
				; CHECK-FMA-NEXT: vfnmsub213ss %xmm2, %xmm0, %xmm1
				; CHECK-FMA-NEXT: vmovaps %xmm1, %xmm0
				; CHECK-FMA-NEXT: retq
				;
				; CHECK-FMA4-LABEL: test_x86_fma_vfnmsub_bac_ss:
				; CHECK-FMA4: # BB#0:
				; CHECK-FMA4-NEXT: vfnmsubss %xmm2, %xmm0, %xmm1, %xmm0
				; CHECK-FMA4-NEXT: retq
				%res = call <4 x float> @llvm.x86.fma.vfnmsub.ss(<4 x float> %a1, <4 x float> %a0, <4 x float> %a2)
				ret <4 x float> %res
				}
	declare <4 x float> @llvm.x86.fma.vfnmsub.ss(<4 x float>, <4 x float>, <4 x float>)			declare <4 x float> @llvm.x86.fma.vfnmsub.ss(<4 x float>, <4 x float>, <4 x float>)

	define <2 x double> @test_x86_fma_vfnmsub_sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2) #0 {			define <2 x double> @test_x86_fma_vfnmsub_sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2) #0 {
	; CHECK-FMA-LABEL: test_x86_fma_vfnmsub_sd:			; CHECK-FMA-LABEL: test_x86_fma_vfnmsub_sd:
	; CHECK-FMA: # BB#0:			; CHECK-FMA: # BB#0:
	; CHECK-FMA-NEXT: vfnmsub213sd %xmm2, %xmm1, %xmm0			; CHECK-FMA-NEXT: vfnmsub213sd %xmm2, %xmm1, %xmm0
	; CHECK-FMA-NEXT: retq			; CHECK-FMA-NEXT: retq
	;			;
	; CHECK-FMA4-LABEL: test_x86_fma_vfnmsub_sd:			; CHECK-FMA4-LABEL: test_x86_fma_vfnmsub_sd:
	; CHECK-FMA4: # BB#0:			; CHECK-FMA4: # BB#0:
				delena: You check folding of a vector load into a scalar intrinsic. On AVX-512 we support folding a scalar load into a scalar intrinsic by matching the scalar_to_vector(loadf32) pattern in the td file.
				v_klochkov (author): I agree, the check tests memory folding of a vector load into a scalar intrinsic. Memory folding does not work for test cases like the following (with and without my patch):
				__m128d m = _mm_load_sd(mem);
				__m128d res = _mm_fmadd_sd(a, b, m);
				This should be fixed, and I think I know how to do that easily, but I would rather do it in a separate patch.
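The scalar-load case mentioned above can be modeled in plain C++ to show why folding would preserve the result. This is a sketch under stated assumptions, not code from the patch: `Vec2`, `load_sd`, `fmadd_sd`, `fma_unfolded`, and `fma_folded` are hypothetical names. It assumes that `_mm_load_sd` loads one double into the low element and zeroes the upper one, and that the memory form of a scalar FMA reads a single 64-bit value as its third operand.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical model of a 128-bit register holding two doubles.
using Vec2 = std::array<double, 2>;

// Models _mm_load_sd: low element from memory, upper element zeroed.
Vec2 load_sd(const double* mem) {
    return {*mem, 0.0};
}

// Models _mm_fmadd_sd: low element is a*b + c, upper element comes from a.
Vec2 fmadd_sd(const Vec2& a, const Vec2& b, const Vec2& c) {
    return {std::fma(a[0], b[0], c[0]), a[1]};
}

// Unfolded sequence: a separate scalar load followed by a register FMA.
Vec2 fma_unfolded(const Vec2& a, const Vec2& b, const double* mem) {
    Vec2 m = load_sd(mem);
    return fmadd_sd(a, b, m);
}

// Folded sequence: models a scalar FMA whose third operand is read
// directly from memory (64 bits only).
Vec2 fma_folded(Vec2 a, const Vec2& b, const double* mem) {
    a[0] = std::fma(a[0], b[0], *mem);
    return a;
}

// Illustrative check with arbitrary values.
void check_folding() {
    double mem = 4.0;
    Vec2 a{2.0, 10.0}, b{3.0, 20.0};
    assert(fma_unfolded(a, b, &mem) == fma_folded(a, b, &mem));
}
```

Since the scalar FMA in this model never reads the upper element of its third operand, the zeroed upper half produced by the load does not affect the result, so the two sequences agree.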
	; CHECK-FMA4-NEXT: vfnmsubsd %xmm2, %xmm1, %xmm0, %xmm0			; CHECK-FMA4-NEXT: vfnmsubsd %xmm2, %xmm1, %xmm0, %xmm0
	; CHECK-FMA4-NEXT: retq			; CHECK-FMA4-NEXT: retq
	%res = call <2 x double> @llvm.x86.fma.vfnmsub.sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2)			%res = call <2 x double> @llvm.x86.fma.vfnmsub.sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2)
	ret <2 x double> %res			ret <2 x double> %res
	}			}

				define <2 x double> @test_x86_fma_vfnmsub_bac_sd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2) #0 {
				; CHECK-FMA-LABEL: test_x86_fma_vfnmsub_bac_sd:
				; CHECK-FMA: # BB#0:
				; CHECK-FMA-NEXT: vfnmsub213sd %xmm2, %xmm0, %xmm1
				; CHECK-FMA-NEXT: vmovaps %xmm1, %xmm0
				; CHECK-FMA-NEXT: retq
				;
				; CHECK-FMA4-LABEL: test_x86_fma_vfnmsub_bac_sd:
				; CHECK-FMA4: # BB#0:
				; CHECK-FMA4-NEXT: vfnmsubsd %xmm2, %xmm0, %xmm1, %xmm0
				; CHECK-FMA4-NEXT: retq
				%res = call <2 x double> @llvm.x86.fma.vfnmsub.sd(<2 x double> %a1, <2 x double> %a0, <2 x double> %a2)
				ret <2 x double> %res
				}
	declare <2 x double> @llvm.x86.fma.vfnmsub.sd(<2 x double>, <2 x double>, <2 x double>)			declare <2 x double> @llvm.x86.fma.vfnmsub.sd(<2 x double>, <2 x double>, <2 x double>)

	define <4 x float> @test_x86_fma_vfnmsub_ps(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2) #0 {			define <4 x float> @test_x86_fma_vfnmsub_ps(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2) #0 {
	; CHECK-FMA-LABEL: test_x86_fma_vfnmsub_ps:			; CHECK-FMA-LABEL: test_x86_fma_vfnmsub_ps:
	; CHECK-FMA: # BB#0:			; CHECK-FMA: # BB#0:
	; CHECK-FMA-NEXT: vfnmsub213ps %xmm2, %xmm1, %xmm0			; CHECK-FMA-NEXT: vfnmsub213ps %xmm2, %xmm1, %xmm0
	; CHECK-FMA-NEXT: retq			; CHECK-FMA-NEXT: retq
	;			;

Revision Contents

Diff 37297

llvm/lib/Target/X86/X86InstrFMA.td

llvm/lib/Target/X86/X86InstrInfo.cpp

llvm/test/CodeGen/X86/fma-intrinsics-phi-213-to-231.ll

llvm/test/CodeGen/X86/fma-intrinsics-x86.ll
