This is an archive of the discontinued LLVM Phabricator instance.

X86-FMA3: Memory folding for scalar loads + FMA3
Closed · Public

Authored by v_klochkov on Nov 17 2015, 3:15 PM.

Details

Reviewers

Commits

rGed865dfcc50d: X86-FMA3: Improved/enabled the memory folding optimization for scalar loads…
rL254140: X86-FMA3: Improved/enabled the memory folding optimization for scalar loads

Summary

Hello,

Please review the patch that enables memory folding optimization for
sequences like this:

#include <immintrin.h>
double mem;
__m128d func(__m128d a, __m128d b) {
  __m128d m = _mm_load_sd(&mem);
  return _mm_fmadd_sd(a, b, m);
}

Code without the patch (clang -O3 -S):

func:                                   # @func
        .cfi_startproc
# BB#0:                                 # %entry
        movsd   mem(%rip), %xmm2        # xmm2 = mem[0],zero
        vfmadd213sd     %xmm2, %xmm1, %xmm0
        retq

Code with the patch:

func:                                   # @func
        .cfi_startproc
# BB#0:                                 # %entry
        vfmadd213sd     mem(%rip), %xmm1, %xmm0
        retq

The load can be folded into the 2nd or 3rd operand of an FMA*_Int instruction.
The newly added test fma-scalar-memfold.ll checks memory folding for both operands.
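
As an illustration of why a load in either position can be folded, here is a minimal C++ sketch of the scalar semantics. It assumes the usual operand roles of the 213 and 132 forms (src1 is tied to the destination, and src3 is the only operand that may be a memory location). The names `Vec2d`, `model_fmadd_sd`, `model_vfmadd213sd` and `model_vfmadd132sd` are hypothetical and exist only for this sketch; this is a model, not code from the patch.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Two-lane stand-in for an XMM register holding doubles (illustrative only).
using Vec2d = std::array<double, 2>;

// Reference semantics of _mm_fmadd_sd(a, b, c): lane 0 is a*b+c,
// lane 1 passes through from the first operand.
Vec2d model_fmadd_sd(Vec2d a, Vec2d b, Vec2d c) {
    return {std::fma(a[0], b[0], c[0]), a[1]};
}

// Models of two instruction forms. src1 doubles as the destination, and
// src3 is the only operand that may live in memory, so it is modeled
// here as a plain scalar that was loaded from memory.
Vec2d model_vfmadd213sd(Vec2d src1, Vec2d src2, double mem) {
    return {std::fma(src2[0], src1[0], mem), src1[1]};  // src2*src1 + src3
}
Vec2d model_vfmadd132sd(Vec2d src1, Vec2d src2, double mem) {
    return {std::fma(src1[0], mem, src2[0]), src1[1]};  // src1*src3 + src2
}
```

In this model, a load feeding the 3rd intrinsic operand corresponds to the 213 form and a load feeding the 2nd intrinsic operand corresponds to the 132 form, which is consistent with the `vfmadd213ss`/`vfmadd132ss` lines the new test checks for.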

lib/Target/X86/X86InstrFMA.td:

Removed the redundant register-to-register moves (the COPY_TO_REGCLASS wrappers around the source operands).
Memory folding does not work with those moves.
// TODO: perhaps the register-to-register moves could simply be stripped in some of these cases,
// but that is a separate optimization/change-set.

lib/Target/X86/X86InstrInfo.cpp:

Added the FMA*_Int opcodes to the routine
isNonFoldablePartialRegisterLoad(), so that a scalar load feeding one of them is no longer reported as non-foldable.
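
The decision that routine makes can be sketched in standalone C++ as follows. This is a reduced model, assuming only the structure visible in the diff: the `Opc` enumeration, the `SomePackedUser` placeholder and the function name `isNonFoldablePartialLoadModel` are hypothetical, and the real routine covers many more opcodes.

```cpp
#include <cassert>

// Hypothetical, heavily reduced opcode set for illustration only.
enum class Opc {
    MOVSSrm, MOVSDrm,   // scalar loads (32-bit and 64-bit)
    ADDSSrr_Int,        // scalar-single user handled before the patch
    VFMADDSSr213r_Int,  // scalar-single FMA user added by the patch
    VFMADDSDr213r_Int,  // scalar-double FMA user added by the patch
    SomePackedUser      // placeholder for any non-scalar user
};

// Returns true when folding the load into its user must be rejected
// because the user would read more bytes than the load provides.
bool isNonFoldablePartialLoadModel(Opc loadOpc, Opc userOpc,
                                   unsigned regSizeBytes) {
    if (loadOpc == Opc::MOVSSrm && regSizeBytes > 4) {
        switch (userOpc) {
        case Opc::ADDSSrr_Int:
        case Opc::VFMADDSSr213r_Int:
            return false;  // scalar (SS) user: only the low 32 bits matter
        default:
            return true;
        }
    }
    if (loadOpc == Opc::MOVSDrm && regSizeBytes > 8) {
        switch (userOpc) {
        case Opc::VFMADDSDr213r_Int:
            return false;  // scalar (SD) user: only the low 64 bits matter
        default:
            return true;
        }
    }
    return false;
}
```

For example, `isNonFoldablePartialLoadModel(Opc::MOVSSrm, Opc::VFMADDSSr213r_Int, 16)` is false in this model, so the fold is allowed, while the same load feeding `Opc::SomePackedUser` is rejected.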

test/CodeGen/X86/fma-scalar-memfold.ll:

New test. Checks that the result of _mm_load_s{s,d}() can be folded into the 2nd or 3rd operand of FMA*_Int.

Thank you,
Slava

Diff Detail

Repository: rL LLVM

Event Timeline

v_klochkov updated this revision to Diff 40441. Nov 17 2015, 3:15 PM

v_klochkov retitled this revision to X86-FMA3: Memory folding for scalar loads + FMA3.

v_klochkov updated this object.

v_klochkov added a reviewer: DavidKreitzer.

v_klochkov added subscribers: llvm-commits, qcolombet.

Hi Slava,

Everything looks straightforward to me. I just have a few minor comments.

Thanks,
-Dave

llvm/lib/Target/X86/X86InstrFMA.td
164 ↗	(On Diff #40441)	Just noticed a few typos in these comments while I was reviewing the code. sence --> sense
173 ↗	(On Diff #40441)	implemened --> implemented
245 ↗	(On Diff #40441)	Please add a space after ','
llvm/test/CodeGen/X86/fma-scalar-memfold.ll
1 ↗	(On Diff #40441)	I like the thoroughness of your test! A couple ideas to make the test a little less sensitive to innocuous changes. (1) You could avoid checking for the block labels, e.g. "# BB#0" and just change the subsequent CHECK-NEXT to CHECK. (2) You could avoid explicitly checking for xmm0 and instead use a variable.
5 ↗	(On Diff #40441)	"#3" is not defined.

Fixed the misprints and updated the unit test.

Hi David,

Thank you for the quick code-review. Excuse me for the delay - I am traveling these days.
I fixed the misprints and updated the unit test.

Thank you,
Slava

llvm/test/CodeGen/X86/fma-scalar-memfold.ll
2 ↗	(On Diff #40905)	I replaced xmm0 with a variable as you recommended. Regarding the BB#0 label: for some unknown reason the script update_llc_test_checks.py does not work when I run it, but that script usually generates a "CHECK: # BB#0:" line (I noticed that in other people's change-sets that fix tests with the help of that script). So, to relax the test checks a little, I replaced CHECK-NEXT with CHECK (i.e., it may be OK to have another label between the function entry and # BB#0, which happens on some targets if { nounwind } is not used). Please let me know if it looks good now.
5 ↗	(On Diff #40441)	Fixed.

DavidKreitzer added inline comments. Nov 23 2015, 5:46 AM

llvm/test/CodeGen/X86/fma-scalar-memfold.ll
17 ↗	(On Diff #40905)	This isn't quite what I meant about the block labels. I think you should just delete line 17 here and change line 18 "CHECK-NEXT" --> "CHECK". That way, if # BB#0 changes to something else, it won't affect this test. The %[[XMM]] changes are great, thanks!
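
To illustrate what the %[[XMM]] variable buys, here is a rough C++ analogy. It is not how FileCheck is implemented; it is only a sketch of the define-then-reuse idea, with a hypothetical helper name `matchesFoldedSequence`. The first pattern captures whichever xmm register the load used, and the following two lines must reuse that same register.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Finds the first load from (%rcx) into some xmm register, then requires the
// next two lines to be a folded FMA and a store that use that same register.
bool matchesFoldedSequence(const std::vector<std::string>& lines) {
    // Plays the role of "CHECK: vmovss (%rcx), %[[XMM:xmm[0-9]+]]".
    static const std::regex def(R"re(vmovss \(%rcx\), %(xmm[0-9]+))re");
    for (std::size_t i = 0; i + 2 < lines.size(); ++i) {
        std::smatch m;
        if (!std::regex_search(lines[i], m, def))
            continue;
        const std::string xmm = m[1].str();  // the captured register name
        // These play the role of the two CHECK-NEXT lines using %[[XMM]].
        const std::string fma = "vfmadd213ss (%rdx), %" + xmm + ", %" + xmm;
        const std::string store = "vmovss %" + xmm + ", (%rcx)";
        return lines[i + 1].find(fma) != std::string::npos &&
               lines[i + 2].find(store) != std::string::npos;
    }
    return false;
}
```

Because the register name is captured rather than hard-coded, the check keeps passing if register allocation happens to pick xmm3 instead of xmm0, but it still fails if the load is not folded and an extra instruction appears between the lines.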

RKSimon added a subscriber: RKSimon. Nov 23 2015, 3:27 PM

RKSimon added inline comments.

llvm/lib/Target/X86/X86InstrFMA.td
233 ↗	(On Diff #40905)	Please would it be possible to add ExecutionDomains to these definitions? For some reason only the packed FMA instructions have Single/Double domains set.

Updated the unit test.

Thank you for the review!

I updated the unit test.
The undefined ExeDomain for scalar FMAs is a problem unrelated to the one fixed by this patch (memory folding for FMA*_Int), so it should be fixed in a separate patch. I can prepare such a patch later.

Thank you,
Slava

llvm/lib/Target/X86/X86InstrFMA.td
233 ↗	(On Diff #40905)	Good catch, thank you for the comment! That would be a simple additional fix, but I want to follow the recommendation "1 patch fixes 1 problem", i.e., I can set the ExeDomain for scalar FMAs in a separate patch.
llvm/test/CodeGen/X86/fma-scalar-memfold.ll
17 ↗	(On Diff #40905)	Ok, understood, I'll fix the checks.

Thanks, Slava, looks good! I have no further comments.

Dave

DavidKreitzer accepted this revision. Nov 25 2015, 7:43 AM

DavidKreitzer edited edge metadata.

This revision is now accepted and ready to land. Nov 25 2015, 7:43 AM

Closed by commit rL254140: X86-FMA3: Improved/enabled the memory folding optimization for scalar loads (authored by v_klochkov). Nov 25 2015, 11:48 PM

This revision was automatically updated to reflect the committed changes.

v_klochkov mentioned this in D15317: X86-FMA3: Defined ExeDomain for Scalar FMA3 opcodes. Dec 7 2015, 4:28 PM

Revision Contents

Path                                                 Size
llvm/trunk/lib/Target/X86/X86InstrFMA.td             18 lines
llvm/trunk/lib/Target/X86/X86InstrInfo.cpp           12 lines
llvm/trunk/test/CodeGen/X86/fma-scalar-memfold.ll    383 lines

Diff 41211

llvm/trunk/lib/Target/X86/X86InstrFMA.td

[... 164 lines not shown ...]
 // instructions.
 //
 // All of the FMA*_Int opcodes are defined as commutable here.
 // Commuting the 2nd and 3rd source register operands of FMAs is quite trivial
 // and the corresponding optimizations have been developed.
 // Commuting the 1st operand of FMA*_Int requires some additional analysis,
 // the commute optimization is legal only if all users of FMA*_Int use only
 // the lowest element of the FMA*_Int instruction. Even though such analysis
-// may be not implemened yet we allow the routines doing the actual commute
+// may be not implemented yet we allow the routines doing the actual commute
 // transformation to decide if one or another instruction is commutable or not.
 let Constraints = "$src1 = $dst", isCommutable = 1, isCodeGenOnly = 1,
     hasSideEffects = 0 in
 multiclass fma3s_rm_int<bits<8> opc, string OpcodeStr,
                         Operand memopr, RegisterClass RC> {
   def r_Int : FMA3<opc, MRMSrcReg, (outs RC:$dst),
                    (ins RC:$src1, RC:$src2, RC:$src3),
                    !strconcat(OpcodeStr,
[... 50 lines not shown ...]
   defm SD : fma3s_forms<opc132, opc213, opc231, OpStr, "sd", OpNode,
                         VEX_W;

 // These patterns use the 123 ordering, instead of 213, even though
 // they match the intrinsic to the 213 version of the instruction.
 // This is because src1 is tied to dest, and the scalar intrinsics
 // require the pass-through values to come from the first source
 // operand, not the second.
   def : Pat<(IntF32 VR128:$src1, VR128:$src2, VR128:$src3),
-            (COPY_TO_REGCLASS
-              (!cast<Instruction>(NAME#"SSr213r_Int")
-                (COPY_TO_REGCLASS $src1, FR32),
-                (COPY_TO_REGCLASS $src2, FR32),
-                (COPY_TO_REGCLASS $src3, FR32)),
-              VR128)>;
+            (COPY_TO_REGCLASS(!cast<Instruction>(NAME#"SSr213r_Int")
+              $src1, $src2, $src3), VR128)>;

   def : Pat<(IntF64 VR128:$src1, VR128:$src2, VR128:$src3),
-            (COPY_TO_REGCLASS
-              (!cast<Instruction>(NAME#"SDr213r_Int")
-                (COPY_TO_REGCLASS $src1, FR64),
-                (COPY_TO_REGCLASS $src2, FR64),
-                (COPY_TO_REGCLASS $src3, FR64)),
-              VR128)>;
+            (COPY_TO_REGCLASS(!cast<Instruction>(NAME#"SDr213r_Int")
+              $src1, $src2, $src3), VR128)>;
 }

 defm VFMADD : fma3s<0x99, 0xA9, 0xB9, "vfmadd", int_x86_fma_vfmadd_ss,
                     int_x86_fma_vfmadd_sd, X86Fmadd>, VEX_LIG;
 defm VFMSUB : fma3s<0x9B, 0xAB, 0xBB, "vfmsub", int_x86_fma_vfmsub_ss,
                     int_x86_fma_vfmsub_sd, X86Fmsub>, VEX_LIG;

 defm VFNMADD : fma3s<0x9D, 0xAD, 0xBD, "vfnmadd", int_x86_fma_vfnmadd_ss,
[... 182 lines not shown ...]

llvm/trunk/lib/Target/X86/X86InstrInfo.cpp

[... 5,861 lines not shown ...]
   if ((Opc == X86::MOVSSrm || Opc == X86::VMOVSSrm) && RegSize > 4) {
     // These instructions only load 32 bits, we can't fold them if the
     // destination register is wider than 32 bits (4 bytes), and its user
     // instruction isn't scalar (SS).
     switch (UserOpc) {
     case X86::ADDSSrr_Int: case X86::VADDSSrr_Int:
     case X86::DIVSSrr_Int: case X86::VDIVSSrr_Int:
     case X86::MULSSrr_Int: case X86::VMULSSrr_Int:
     case X86::SUBSSrr_Int: case X86::VSUBSSrr_Int:
+    case X86::VFMADDSSr132r_Int: case X86::VFNMADDSSr132r_Int:
+    case X86::VFMADDSSr213r_Int: case X86::VFNMADDSSr213r_Int:
+    case X86::VFMADDSSr231r_Int: case X86::VFNMADDSSr231r_Int:
+    case X86::VFMSUBSSr132r_Int: case X86::VFNMSUBSSr132r_Int:
+    case X86::VFMSUBSSr213r_Int: case X86::VFNMSUBSSr213r_Int:
+    case X86::VFMSUBSSr231r_Int: case X86::VFNMSUBSSr231r_Int:
       return false;
     default:
       return true;
     }
   }

   if ((Opc == X86::MOVSDrm || Opc == X86::VMOVSDrm) && RegSize > 8) {
     // These instructions only load 64 bits, we can't fold them if the
     // destination register is wider than 64 bits (8 bytes), and its user
     // instruction isn't scalar (SD).
     switch (UserOpc) {
     case X86::ADDSDrr_Int: case X86::VADDSDrr_Int:
     case X86::DIVSDrr_Int: case X86::VDIVSDrr_Int:
     case X86::MULSDrr_Int: case X86::VMULSDrr_Int:
     case X86::SUBSDrr_Int: case X86::VSUBSDrr_Int:
+    case X86::VFMADDSDr132r_Int: case X86::VFNMADDSDr132r_Int:
+    case X86::VFMADDSDr213r_Int: case X86::VFNMADDSDr213r_Int:
+    case X86::VFMADDSDr231r_Int: case X86::VFNMADDSDr231r_Int:
+    case X86::VFMSUBSDr132r_Int: case X86::VFNMSUBSDr132r_Int:
+    case X86::VFMSUBSDr213r_Int: case X86::VFNMSUBSDr213r_Int:
+    case X86::VFMSUBSDr231r_Int: case X86::VFNMSUBSDr231r_Int:
       return false;
     default:
       return true;
     }
   }

   return false;
 }
[... 1,328 lines not shown ...]

llvm/trunk/test/CodeGen/X86/fma-scalar-memfold.ll

				; RUN: llc < %s -mtriple=x86_64-pc-win32 -mcpu=core-avx2 | FileCheck %s

				attributes #0 = { nounwind }

				declare <4 x float> @llvm.x86.fma.vfmadd.ss(<4 x float>, <4 x float>, <4 x float>)
				declare <4 x float> @llvm.x86.fma.vfmsub.ss(<4 x float>, <4 x float>, <4 x float>)
				declare <4 x float> @llvm.x86.fma.vfnmadd.ss(<4 x float>, <4 x float>, <4 x float>)
				declare <4 x float> @llvm.x86.fma.vfnmsub.ss(<4 x float>, <4 x float>, <4 x float>)

				declare <2 x double> @llvm.x86.fma.vfmadd.sd(<2 x double>, <2 x double>, <2 x double>)
				declare <2 x double> @llvm.x86.fma.vfmsub.sd(<2 x double>, <2 x double>, <2 x double>)
				declare <2 x double> @llvm.x86.fma.vfnmadd.sd(<2 x double>, <2 x double>, <2 x double>)
				declare <2 x double> @llvm.x86.fma.vfnmsub.sd(<2 x double>, <2 x double>, <2 x double>)

				define void @fmadd_aab_ss(float* %a, float* %b) #0 {
				; CHECK-LABEL: fmadd_aab_ss:
				; CHECK: vmovss (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfmadd213ss (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovss %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load float, float* %a
				%av0 = insertelement <4 x float> undef, float %a.val, i32 0
				%av1 = insertelement <4 x float> %av0, float 0.000000e+00, i32 1
				%av2 = insertelement <4 x float> %av1, float 0.000000e+00, i32 2
				%av = insertelement <4 x float> %av2, float 0.000000e+00, i32 3

				%b.val = load float, float* %b
				%bv0 = insertelement <4 x float> undef, float %b.val, i32 0
				%bv1 = insertelement <4 x float> %bv0, float 0.000000e+00, i32 1
				%bv2 = insertelement <4 x float> %bv1, float 0.000000e+00, i32 2
				%bv = insertelement <4 x float> %bv2, float 0.000000e+00, i32 3

				%vr = call <4 x float> @llvm.x86.fma.vfmadd.ss(<4 x float> %av, <4 x float> %av, <4 x float> %bv)

				%sr = extractelement <4 x float> %vr, i32 0
				store float %sr, float* %a
				ret void
				}

				define void @fmadd_aba_ss(float* %a, float* %b) #0 {
				; CHECK-LABEL: fmadd_aba_ss:
				; CHECK: vmovss (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfmadd132ss (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovss %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load float, float* %a
				%av0 = insertelement <4 x float> undef, float %a.val, i32 0
				%av1 = insertelement <4 x float> %av0, float 0.000000e+00, i32 1
				%av2 = insertelement <4 x float> %av1, float 0.000000e+00, i32 2
				%av = insertelement <4 x float> %av2, float 0.000000e+00, i32 3

				%b.val = load float, float* %b
				%bv0 = insertelement <4 x float> undef, float %b.val, i32 0
				%bv1 = insertelement <4 x float> %bv0, float 0.000000e+00, i32 1
				%bv2 = insertelement <4 x float> %bv1, float 0.000000e+00, i32 2
				%bv = insertelement <4 x float> %bv2, float 0.000000e+00, i32 3

				%vr = call <4 x float> @llvm.x86.fma.vfmadd.ss(<4 x float> %av, <4 x float> %bv, <4 x float> %av)

				%sr = extractelement <4 x float> %vr, i32 0
				store float %sr, float* %a
				ret void
				}

				define void @fmsub_aab_ss(float* %a, float* %b) #0 {
				; CHECK-LABEL: fmsub_aab_ss:
				; CHECK: vmovss (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfmsub213ss (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovss %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load float, float* %a
				%av0 = insertelement <4 x float> undef, float %a.val, i32 0
				%av1 = insertelement <4 x float> %av0, float 0.000000e+00, i32 1
				%av2 = insertelement <4 x float> %av1, float 0.000000e+00, i32 2
				%av = insertelement <4 x float> %av2, float 0.000000e+00, i32 3

				%b.val = load float, float* %b
				%bv0 = insertelement <4 x float> undef, float %b.val, i32 0
				%bv1 = insertelement <4 x float> %bv0, float 0.000000e+00, i32 1
				%bv2 = insertelement <4 x float> %bv1, float 0.000000e+00, i32 2
				%bv = insertelement <4 x float> %bv2, float 0.000000e+00, i32 3

				%vr = call <4 x float> @llvm.x86.fma.vfmsub.ss(<4 x float> %av, <4 x float> %av, <4 x float> %bv)

				%sr = extractelement <4 x float> %vr, i32 0
				store float %sr, float* %a
				ret void
				}

				define void @fmsub_aba_ss(float* %a, float* %b) #0 {
				; CHECK-LABEL: fmsub_aba_ss:
				; CHECK: vmovss (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfmsub132ss (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovss %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load float, float* %a
				%av0 = insertelement <4 x float> undef, float %a.val, i32 0
				%av1 = insertelement <4 x float> %av0, float 0.000000e+00, i32 1
				%av2 = insertelement <4 x float> %av1, float 0.000000e+00, i32 2
				%av = insertelement <4 x float> %av2, float 0.000000e+00, i32 3

				%b.val = load float, float* %b
				%bv0 = insertelement <4 x float> undef, float %b.val, i32 0
				%bv1 = insertelement <4 x float> %bv0, float 0.000000e+00, i32 1
				%bv2 = insertelement <4 x float> %bv1, float 0.000000e+00, i32 2
				%bv = insertelement <4 x float> %bv2, float 0.000000e+00, i32 3

				%vr = call <4 x float> @llvm.x86.fma.vfmsub.ss(<4 x float> %av, <4 x float> %bv, <4 x float> %av)

				%sr = extractelement <4 x float> %vr, i32 0
				store float %sr, float* %a
				ret void
				}

				define void @fnmadd_aab_ss(float* %a, float* %b) #0 {
				; CHECK-LABEL: fnmadd_aab_ss:
				; CHECK: vmovss (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfnmadd213ss (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovss %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load float, float* %a
				%av0 = insertelement <4 x float> undef, float %a.val, i32 0
				%av1 = insertelement <4 x float> %av0, float 0.000000e+00, i32 1
				%av2 = insertelement <4 x float> %av1, float 0.000000e+00, i32 2
				%av = insertelement <4 x float> %av2, float 0.000000e+00, i32 3

				%b.val = load float, float* %b
				%bv0 = insertelement <4 x float> undef, float %b.val, i32 0
				%bv1 = insertelement <4 x float> %bv0, float 0.000000e+00, i32 1
				%bv2 = insertelement <4 x float> %bv1, float 0.000000e+00, i32 2
				%bv = insertelement <4 x float> %bv2, float 0.000000e+00, i32 3

				%vr = call <4 x float> @llvm.x86.fma.vfnmadd.ss(<4 x float> %av, <4 x float> %av, <4 x float> %bv)

				%sr = extractelement <4 x float> %vr, i32 0
				store float %sr, float* %a
				ret void
				}

				define void @fnmadd_aba_ss(float* %a, float* %b) #0 {
				; CHECK-LABEL: fnmadd_aba_ss:
				; CHECK: vmovss (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfnmadd132ss (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovss %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load float, float* %a
				%av0 = insertelement <4 x float> undef, float %a.val, i32 0
				%av1 = insertelement <4 x float> %av0, float 0.000000e+00, i32 1
				%av2 = insertelement <4 x float> %av1, float 0.000000e+00, i32 2
				%av = insertelement <4 x float> %av2, float 0.000000e+00, i32 3

				%b.val = load float, float* %b
				%bv0 = insertelement <4 x float> undef, float %b.val, i32 0
				%bv1 = insertelement <4 x float> %bv0, float 0.000000e+00, i32 1
				%bv2 = insertelement <4 x float> %bv1, float 0.000000e+00, i32 2
				%bv = insertelement <4 x float> %bv2, float 0.000000e+00, i32 3

				%vr = call <4 x float> @llvm.x86.fma.vfnmadd.ss(<4 x float> %av, <4 x float> %bv, <4 x float> %av)

				%sr = extractelement <4 x float> %vr, i32 0
				store float %sr, float* %a
				ret void
				}

				define void @fnmsub_aab_ss(float* %a, float* %b) #0 {
				; CHECK-LABEL: fnmsub_aab_ss:
				; CHECK: vmovss (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfnmsub213ss (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovss %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load float, float* %a
				%av0 = insertelement <4 x float> undef, float %a.val, i32 0
				%av1 = insertelement <4 x float> %av0, float 0.000000e+00, i32 1
				%av2 = insertelement <4 x float> %av1, float 0.000000e+00, i32 2
				%av = insertelement <4 x float> %av2, float 0.000000e+00, i32 3

				%b.val = load float, float* %b
				%bv0 = insertelement <4 x float> undef, float %b.val, i32 0
				%bv1 = insertelement <4 x float> %bv0, float 0.000000e+00, i32 1
				%bv2 = insertelement <4 x float> %bv1, float 0.000000e+00, i32 2
				%bv = insertelement <4 x float> %bv2, float 0.000000e+00, i32 3

				%vr = call <4 x float> @llvm.x86.fma.vfnmsub.ss(<4 x float> %av, <4 x float> %av, <4 x float> %bv)

				%sr = extractelement <4 x float> %vr, i32 0
				store float %sr, float* %a
				ret void
				}

				define void @fnmsub_aba_ss(float* %a, float* %b) #0 {
				; CHECK-LABEL: fnmsub_aba_ss:
				; CHECK: vmovss (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfnmsub132ss (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovss %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load float, float* %a
				%av0 = insertelement <4 x float> undef, float %a.val, i32 0
				%av1 = insertelement <4 x float> %av0, float 0.000000e+00, i32 1
				%av2 = insertelement <4 x float> %av1, float 0.000000e+00, i32 2
				%av = insertelement <4 x float> %av2, float 0.000000e+00, i32 3

				%b.val = load float, float* %b
				%bv0 = insertelement <4 x float> undef, float %b.val, i32 0
				%bv1 = insertelement <4 x float> %bv0, float 0.000000e+00, i32 1
				%bv2 = insertelement <4 x float> %bv1, float 0.000000e+00, i32 2
				%bv = insertelement <4 x float> %bv2, float 0.000000e+00, i32 3

				%vr = call <4 x float> @llvm.x86.fma.vfnmsub.ss(<4 x float> %av, <4 x float> %bv, <4 x float> %av)

				%sr = extractelement <4 x float> %vr, i32 0
				store float %sr, float* %a
				ret void
				}

				define void @fmadd_aab_sd(double* %a, double* %b) #0 {
				; CHECK-LABEL: fmadd_aab_sd:
				; CHECK: vmovsd (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfmadd213sd (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovlps %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load double, double* %a
				%av0 = insertelement <2 x double> undef, double %a.val, i32 0
				%av = insertelement <2 x double> %av0, double 0.000000e+00, i32 1

				%b.val = load double, double* %b
				%bv0 = insertelement <2 x double> undef, double %b.val, i32 0
				%bv = insertelement <2 x double> %bv0, double 0.000000e+00, i32 1

				%vr = call <2 x double> @llvm.x86.fma.vfmadd.sd(<2 x double> %av, <2 x double> %av, <2 x double> %bv)

				%sr = extractelement <2 x double> %vr, i32 0
				store double %sr, double* %a
				ret void
				}

				define void @fmadd_aba_sd(double* %a, double* %b) #0 {
				; CHECK-LABEL: fmadd_aba_sd:
				; CHECK: vmovsd (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfmadd132sd (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovlps %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load double, double* %a
				%av0 = insertelement <2 x double> undef, double %a.val, i32 0
				%av = insertelement <2 x double> %av0, double 0.000000e+00, i32 1

				%b.val = load double, double* %b
				%bv0 = insertelement <2 x double> undef, double %b.val, i32 0
				%bv = insertelement <2 x double> %bv0, double 0.000000e+00, i32 1

				%vr = call <2 x double> @llvm.x86.fma.vfmadd.sd(<2 x double> %av, <2 x double> %bv, <2 x double> %av)

				%sr = extractelement <2 x double> %vr, i32 0
				store double %sr, double* %a
				ret void
				}

				define void @fmsub_aab_sd(double* %a, double* %b) #0 {
				; CHECK-LABEL: fmsub_aab_sd:
				; CHECK: vmovsd (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfmsub213sd (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovlps %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load double, double* %a
				%av0 = insertelement <2 x double> undef, double %a.val, i32 0
				%av = insertelement <2 x double> %av0, double 0.000000e+00, i32 1

				%b.val = load double, double* %b
				%bv0 = insertelement <2 x double> undef, double %b.val, i32 0
				%bv = insertelement <2 x double> %bv0, double 0.000000e+00, i32 1

				%vr = call <2 x double> @llvm.x86.fma.vfmsub.sd(<2 x double> %av, <2 x double> %av, <2 x double> %bv)

				%sr = extractelement <2 x double> %vr, i32 0
				store double %sr, double* %a
				ret void
				}

				define void @fmsub_aba_sd(double* %a, double* %b) #0 {
				; CHECK-LABEL: fmsub_aba_sd:
				; CHECK: vmovsd (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfmsub132sd (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovlps %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load double, double* %a
				%av0 = insertelement <2 x double> undef, double %a.val, i32 0
				%av = insertelement <2 x double> %av0, double 0.000000e+00, i32 1

				%b.val = load double, double* %b
				%bv0 = insertelement <2 x double> undef, double %b.val, i32 0
				%bv = insertelement <2 x double> %bv0, double 0.000000e+00, i32 1

				%vr = call <2 x double> @llvm.x86.fma.vfmsub.sd(<2 x double> %av, <2 x double> %bv, <2 x double> %av)

				%sr = extractelement <2 x double> %vr, i32 0
				store double %sr, double* %a
				ret void
				}

				define void @fnmadd_aab_sd(double* %a, double* %b) #0 {
				; CHECK-LABEL: fnmadd_aab_sd:
				; CHECK: vmovsd (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfnmadd213sd (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovlps %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load double, double* %a
				%av0 = insertelement <2 x double> undef, double %a.val, i32 0
				%av = insertelement <2 x double> %av0, double 0.000000e+00, i32 1

				%b.val = load double, double* %b
				%bv0 = insertelement <2 x double> undef, double %b.val, i32 0
				%bv = insertelement <2 x double> %bv0, double 0.000000e+00, i32 1

				%vr = call <2 x double> @llvm.x86.fma.vfnmadd.sd(<2 x double> %av, <2 x double> %av, <2 x double> %bv)

				%sr = extractelement <2 x double> %vr, i32 0
				store double %sr, double* %a
				ret void
				}

				define void @fnmadd_aba_sd(double* %a, double* %b) #0 {
				; CHECK-LABEL: fnmadd_aba_sd:
				; CHECK: vmovsd (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfnmadd132sd (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovlps %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load double, double* %a
				%av0 = insertelement <2 x double> undef, double %a.val, i32 0
				%av = insertelement <2 x double> %av0, double 0.000000e+00, i32 1

				%b.val = load double, double* %b
				%bv0 = insertelement <2 x double> undef, double %b.val, i32 0
				%bv = insertelement <2 x double> %bv0, double 0.000000e+00, i32 1

				%vr = call <2 x double> @llvm.x86.fma.vfnmadd.sd(<2 x double> %av, <2 x double> %bv, <2 x double> %av)

				%sr = extractelement <2 x double> %vr, i32 0
				store double %sr, double* %a
				ret void
				}

				define void @fnmsub_aab_sd(double* %a, double* %b) #0 {
				; CHECK-LABEL: fnmsub_aab_sd:
				; CHECK: vmovsd (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfnmsub213sd (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovlps %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load double, double* %a
				%av0 = insertelement <2 x double> undef, double %a.val, i32 0
				%av = insertelement <2 x double> %av0, double 0.000000e+00, i32 1

				%b.val = load double, double* %b
				%bv0 = insertelement <2 x double> undef, double %b.val, i32 0
				%bv = insertelement <2 x double> %bv0, double 0.000000e+00, i32 1

				%vr = call <2 x double> @llvm.x86.fma.vfnmsub.sd(<2 x double> %av, <2 x double> %av, <2 x double> %bv)

				%sr = extractelement <2 x double> %vr, i32 0
				store double %sr, double* %a
				ret void
				}

				define void @fnmsub_aba_sd(double* %a, double* %b) #0 {
				; CHECK-LABEL: fnmsub_aba_sd:
				; CHECK: vmovsd (%rcx), %[[XMM:xmm[0-9]+]]
				; CHECK-NEXT: vfnmsub132sd (%rdx), %[[XMM]], %[[XMM]]
				; CHECK-NEXT: vmovlps %[[XMM]], (%rcx)
				; CHECK-NEXT: ret
				%a.val = load double, double* %a
				%av0 = insertelement <2 x double> undef, double %a.val, i32 0
				%av = insertelement <2 x double> %av0, double 0.000000e+00, i32 1

				%b.val = load double, double* %b
				%bv0 = insertelement <2 x double> undef, double %b.val, i32 0
				%bv = insertelement <2 x double> %bv0, double 0.000000e+00, i32 1

				%vr = call <2 x double> @llvm.x86.fma.vfnmsub.sd(<2 x double> %av, <2 x double> %bv, <2 x double> %av)

				%sr = extractelement <2 x double> %vr, i32 0
				store double %sr, double* %a
				ret void
				}
