This is an archive of the discontinued LLVM Phabricator instance.

Differential D112557

[SVE] Fix VLS FMA generation at CodeGenOpt::Aggressive
ClosedPublic

Authored by cameron.mcinally on Oct 26 2021, 9:41 AM.

Details

Reviewers

paulwalker-arm
david-arm
c-rhodes
efriedma

Commits

rG702fd3d323aa: [SVE] Fix VLS FMA matching for CodeGenOpt::Aggressive.

Summary

@paulwalker-arm Here is a hastily prepared patch of the problem we discussed earlier today. I'm a little out of the VLS loop, so perhaps there is a better way to handle this...

For VLS lowering, the DAGCombiner does not match fixed-width vector FMAs at CodeGenOpt::Aggressive, because at that level fixed-width FMA matching is deferred to the MachineCombiner. However, by the time the MachineCombiner runs, the fixed-width vectors have already been lowered to scalable vectors, so no FMAs are generated at all. This patch corrects that by allowing the DAGCombiner to match FMAs when VLS is in use.
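The decision described above can be modelled outside of LLVM as a small predicate. The following is only a sketch: `OptLevel`, `VecType`, and `useSVEForFixedLength` are hypothetical, simplified stand-ins for `CodeGenOpt::Level`, `EVT`, and `useSVEForFixedLengthVectorVT`, and the size rule used here (fixed-length vectors wider than 128 bits that fit in the minimum SVE register size) is an assumption that ignores the other checks the real function may perform.

```cpp
#include <cassert>

// Hypothetical stand-in for CodeGenOpt::Level.
enum class OptLevel { None, Less, Default, Aggressive };

// Hypothetical stand-in for EVT: only the properties the predicate needs.
struct VecType {
  bool Scalable;      // true for <vscale x N x T> types
  unsigned SizeBits;  // known (minimum) size of the vector in bits
};

// Assumed rule: SVE handles a fixed-length vector when it is wider than a
// 128-bit NEON register and fits in the guaranteed minimum SVE register size.
bool useSVEForFixedLength(const VecType &VT, unsigned MinSVEBits) {
  assert(MinSVEBits % 128 == 0 && "SVE register sizes are multiples of 128");
  return !VT.Scalable && MinSVEBits >= 256 && VT.SizeBits > 128 &&
         VT.SizeBits <= MinSVEBits;
}

// Behaviour before the patch: every non-scalable vector is deferred to the
// MachineCombiner at the aggressive optimisation level.
bool fmasInMachineCombinerBefore(const VecType &VT, OptLevel Opt) {
  return Opt >= OptLevel::Aggressive && !VT.Scalable;
}

// Behaviour after the patch: fixed-length vectors that will be lowered to SVE
// are no longer deferred, so the DAGCombiner is free to form the FMA.
bool fmasInMachineCombinerAfter(const VecType &VT, OptLevel Opt,
                                unsigned MinSVEBits) {
  return Opt >= OptLevel::Aggressive && !VT.Scalable &&
         !useSVEForFixedLength(VT, MinSVEBits);
}
```

Under this model, a 256-bit fixed-length vector with a 2048-bit minimum SVE size is deferred before the patch but not after it, while a 128-bit NEON-sized vector is deferred in both cases.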

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

cameron.mcinally created this revision. Oct 26 2021, 9:41 AM

Herald added a reviewer: efriedma. Oct 26 2021, 9:41 AM

Herald added subscribers: steven.zhang, psnobl, hiraditya and 2 others.

cameron.mcinally requested review of this revision. Oct 26 2021, 9:41 AM

Herald added a project: Restricted Project. Oct 26 2021, 9:41 AM

Herald added a subscriber: llvm-commits.

madhur13490 added a subscriber: madhur13490. Oct 26 2021, 9:59 AM

madhur13490 added inline comments.

llvm/test/CodeGen/AArch64/sve-fixed-length-fp-fma.ll
Line 82: Do we want to have a test where %a and %b are globals?

cameron.mcinally added inline comments. Oct 26 2021, 10:14 AM

llvm/test/CodeGen/AArch64/sve-fixed-length-fp-fma.ll
Line 82: I haven't looked at VLS tests in a while, so that may have changed, but IINM this is the standard format for those tests. See sve-fixed-length*.ll for reference.

Harbormaster completed remote builds in B130746: Diff 382372. Oct 26 2021, 10:34 AM

Fix clang-format warning and add the missing '+' to "+sve".

Harbormaster completed remote builds in B130775: Diff 382409. Oct 26 2021, 12:33 PM

Thanks @cameron.mcinally, I see what you mean now.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
Line 12688: Does `!useSVEForFixedLengthVectorVT(VT)` also work? That way we maintain existing behaviour for NEON sized vectors.
llvm/test/CodeGen/AArch64/sve-fixed-length-fp-fma.ll
Line 17: Given this is a DAG combine, I don't see much value in testing all combinations of `-aarch64-sve-vector-bits-min`, so perhaps just have a single `RUN` line using the maximum value. That way you don't need `VBYTES` and can use `update_llc_test_checks.py` to generate the `CHECK` lines.

Updated Diff for @paulwalker-arm's reviews...

Harbormaster completed remote builds in B131601: Diff 383616. Oct 30 2021, 1:42 PM

Perhaps worth adding matching half/fp16 tests to sve-fixed-length-fp-fma.ll but otherwise looks good.

This revision is now accepted and ready to land. Nov 1 2021, 6:03 AM

cameron.mcinally edited the summary of this revision. Nov 1 2021, 8:47 AM

Closed by commit rG702fd3d323aa: [SVE] Fix VLS FMA matching for CodeGenOpt::Aggressive. (authored by cameron.mcinally). Nov 1 2021, 10:44 AM

This revision was automatically updated to reflect the committed changes.

cameron.mcinally added a commit: rG702fd3d323aa: [SVE] Fix VLS FMA matching for CodeGenOpt::Aggressive..

Perhaps worth adding matching half/fp16 tests to sve-fixed-length-fp-fma.ll but otherwise looks good.

Added and submitted. Will address any post-commit reviews as needed.

Revision Contents

Path                                                    Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp         3 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-fp-fma.ll    309 lines

Diff 383834

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

   case Type::DoubleTyID:
     return true;
   default:
     return false;
   }
 }

 bool AArch64TargetLowering::generateFMAsInMachineCombiner(
     EVT VT, CodeGenOpt::Level OptLevel) const {
-  return (OptLevel >= CodeGenOpt::Aggressive) && !VT.isScalableVector();
+  return (OptLevel >= CodeGenOpt::Aggressive) && !VT.isScalableVector() &&
+         !useSVEForFixedLengthVectorVT(VT);
 }

 const MCPhysReg *
 AArch64TargetLowering::getScratchRegisters(CallingConv::ID) const {
   // LR is a callee-save register, but we must treat it as clobbered by any call
   // site. Hence we include LR in the scratch registers, which are in turn added
   // as implicit-defs for stackmaps and patchpoints.
   static const MCPhysReg ScratchRegs[] = {

llvm/test/CodeGen/AArch64/sve-fixed-length-fp-fma.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -O3 -aarch64-sve-vector-bits-min=2048 < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				;
				; FMA
				;

				; Don't use SVE for 64-bit vectors.
				define <4 x half> @fma_v4f16(<4 x half> %op1, <4 x half> %op2, <4 x half> %op3) #0 {
				; CHECK-LABEL: fma_v4f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: fmla v2.4h, v0.4h, v1.4h
				; CHECK-NEXT: fmov d0, d2
				; CHECK-NEXT: ret
				%mul = fmul contract <4 x half> %op1, %op2
				%res = fadd contract <4 x half> %mul, %op3
				ret <4 x half> %res
				}

				; Don't use SVE for 128-bit vectors.
				define <8 x half> @fma_v8f16(<8 x half> %op1, <8 x half> %op2, <8 x half> %op3) #0 {
				; CHECK-LABEL: fma_v8f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: fmla v2.8h, v0.8h, v1.8h
				; CHECK-NEXT: mov v0.16b, v2.16b
				; CHECK-NEXT: ret
				%mul = fmul contract <8 x half> %op1, %op2
				%res = fadd contract <8 x half> %mul, %op3
				ret <8 x half> %res
				}

				define void @fma_v16f16(<16 x half>* %a, <16 x half>* %b, <16 x half>* %c) #0 {
				; CHECK-LABEL: fma_v16f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h, vl16
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x2]
				; CHECK-NEXT: fmad z0.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x half>, <16 x half>* %a
				%op2 = load <16 x half>, <16 x half>* %b
				%op3 = load <16 x half>, <16 x half>* %c
				%mul = fmul contract <16 x half> %op1, %op2
				%res = fadd contract <16 x half> %mul, %op3
				store <16 x half> %res, <16 x half>* %a
				ret void
				}

				define void @fma_v32f16(<32 x half>* %a, <32 x half>* %b, <32 x half>* %c) #0 {
				; CHECK-LABEL: fma_v32f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h, vl32
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x2]
				; CHECK-NEXT: fmad z0.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x half>, <32 x half>* %a
				%op2 = load <32 x half>, <32 x half>* %b
				%op3 = load <32 x half>, <32 x half>* %c
				%mul = fmul contract <32 x half> %op1, %op2
				%res = fadd contract <32 x half> %mul, %op3
				store <32 x half> %res, <32 x half>* %a
				ret void
				}

				define void @fma_v64f16(<64 x half>* %a, <64 x half>* %b, <64 x half>* %c) #0 {
				; CHECK-LABEL: fma_v64f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h, vl64
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x2]
				; CHECK-NEXT: fmad z0.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <64 x half>, <64 x half>* %a
				%op2 = load <64 x half>, <64 x half>* %b
				%op3 = load <64 x half>, <64 x half>* %c
				%mul = fmul contract <64 x half> %op1, %op2
				%res = fadd contract <64 x half> %mul, %op3
				store <64 x half> %res, <64 x half>* %a
				ret void
				}

				define void @fma_v128f16(<128 x half>* %a, <128 x half>* %b, <128 x half>* %c) #0 {
				; CHECK-LABEL: fma_v128f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h, vl128
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x2]
				; CHECK-NEXT: fmad z0.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <128 x half>, <128 x half>* %a
				%op2 = load <128 x half>, <128 x half>* %b
				%op3 = load <128 x half>, <128 x half>* %c
				%mul = fmul contract <128 x half> %op1, %op2
				%res = fadd contract <128 x half> %mul, %op3
				store <128 x half> %res, <128 x half>* %a
				ret void
				}

				; Don't use SVE for 64-bit vectors.
				define <2 x float> @fma_v2f32(<2 x float> %op1, <2 x float> %op2, <2 x float> %op3) #0 {
				; CHECK-LABEL: fma_v2f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: fmla v2.2s, v0.2s, v1.2s
				; CHECK-NEXT: fmov d0, d2
				; CHECK-NEXT: ret
				%mul = fmul contract <2 x float> %op1, %op2
				%res = fadd contract <2 x float> %mul, %op3
				ret <2 x float> %res
				}

				; Don't use SVE for 128-bit vectors.
				define <4 x float> @fma_v4f32(<4 x float> %op1, <4 x float> %op2, <4 x float> %op3) #0 {
				; CHECK-LABEL: fma_v4f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: fmla v2.4s, v0.4s, v1.4s
				; CHECK-NEXT: mov v0.16b, v2.16b
				; CHECK-NEXT: ret
				%mul = fmul contract <4 x float> %op1, %op2
				%res = fadd contract <4 x float> %mul, %op3
				ret <4 x float> %res
				}

				define void @fma_v8f32(<8 x float>* %a, <8 x float>* %b, <8 x float>* %c) #0 {
				; CHECK-LABEL: fma_v8f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl8
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x2]
				; CHECK-NEXT: fmad z0.s, p0/m, z1.s, z2.s
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x float>, <8 x float>* %a
				%op2 = load <8 x float>, <8 x float>* %b
				%op3 = load <8 x float>, <8 x float>* %c
				%mul = fmul contract <8 x float> %op1, %op2
				%res = fadd contract <8 x float> %mul, %op3
				store <8 x float> %res, <8 x float>* %a
				ret void
				}

				define void @fma_v16f32(<16 x float>* %a, <16 x float>* %b, <16 x float>* %c) #0 {
				; CHECK-LABEL: fma_v16f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl16
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x2]
				; CHECK-NEXT: fmad z0.s, p0/m, z1.s, z2.s
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x float>, <16 x float>* %a
				%op2 = load <16 x float>, <16 x float>* %b
				%op3 = load <16 x float>, <16 x float>* %c
				%mul = fmul contract <16 x float> %op1, %op2
				%res = fadd contract <16 x float> %mul, %op3
				store <16 x float> %res, <16 x float>* %a
				ret void
				}

				define void @fma_v32f32(<32 x float>* %a, <32 x float>* %b, <32 x float>* %c) #0 {
				; CHECK-LABEL: fma_v32f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl32
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x2]
				; CHECK-NEXT: fmad z0.s, p0/m, z1.s, z2.s
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x float>, <32 x float>* %a
				%op2 = load <32 x float>, <32 x float>* %b
				%op3 = load <32 x float>, <32 x float>* %c
				%mul = fmul contract <32 x float> %op1, %op2
				%res = fadd contract <32 x float> %mul, %op3
				store <32 x float> %res, <32 x float>* %a
				ret void
				}

				define void @fma_v64f32(<64 x float>* %a, <64 x float>* %b, <64 x float>* %c) #0 {
				; CHECK-LABEL: fma_v64f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl64
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x2]
				; CHECK-NEXT: fmad z0.s, p0/m, z1.s, z2.s
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <64 x float>, <64 x float>* %a
				%op2 = load <64 x float>, <64 x float>* %b
				%op3 = load <64 x float>, <64 x float>* %c
				%mul = fmul contract <64 x float> %op1, %op2
				%res = fadd contract <64 x float> %mul, %op3
				store <64 x float> %res, <64 x float>* %a
				ret void
				}

				; Don't use SVE for 64-bit vectors.
				define <1 x double> @fma_v1f64(<1 x double> %op1, <1 x double> %op2, <1 x double> %op3) #0 {
				; CHECK-LABEL: fma_v1f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: fmadd d0, d0, d1, d2
				; CHECK-NEXT: ret
				%mul = fmul contract <1 x double> %op1, %op2
				%res = fadd contract <1 x double> %mul, %op3
				ret <1 x double> %res
				}

				; Don't use SVE for 128-bit vectors.
				define <2 x double> @fma_v2f64(<2 x double> %op1, <2 x double> %op2, <2 x double> %op3) #0 {
				; CHECK-LABEL: fma_v2f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: fmla v2.2d, v0.2d, v1.2d
				; CHECK-NEXT: mov v0.16b, v2.16b
				; CHECK-NEXT: ret
				%mul = fmul contract <2 x double> %op1, %op2
				%res = fadd contract <2 x double> %mul, %op3
				ret <2 x double> %res
				}

				define void @fma_v4f64(<4 x double>* %a, <4 x double>* %b, <4 x double>* %c) #0 {
				; CHECK-LABEL: fma_v4f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl4
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x2]
				; CHECK-NEXT: fmad z0.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x double>, <4 x double>* %a
				%op2 = load <4 x double>, <4 x double>* %b
				%op3 = load <4 x double>, <4 x double>* %c
				%mul = fmul contract <4 x double> %op1, %op2
				%res = fadd contract <4 x double> %mul, %op3
				store <4 x double> %res, <4 x double>* %a
				ret void
				}

				define void @fma_v8f64(<8 x double>* %a, <8 x double>* %b, <8 x double>* %c) #0 {
				; CHECK-LABEL: fma_v8f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl8
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x2]
				; CHECK-NEXT: fmad z0.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x double>, <8 x double>* %a
				%op2 = load <8 x double>, <8 x double>* %b
				%op3 = load <8 x double>, <8 x double>* %c
				%mul = fmul contract <8 x double> %op1, %op2
				%res = fadd contract <8 x double> %mul, %op3
				store <8 x double> %res, <8 x double>* %a
				ret void
				}

				define void @fma_v16f64(<16 x double>* %a, <16 x double>* %b, <16 x double>* %c) #0 {
				; CHECK-LABEL: fma_v16f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl16
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x2]
				; CHECK-NEXT: fmad z0.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x double>, <16 x double>* %a
				%op2 = load <16 x double>, <16 x double>* %b
				%op3 = load <16 x double>, <16 x double>* %c
				%mul = fmul contract <16 x double> %op1, %op2
				%res = fadd contract <16 x double> %mul, %op3
				store <16 x double> %res, <16 x double>* %a
				ret void
				}

				define void @fma_v32f64(<32 x double>* %a, <32 x double>* %b, <32 x double>* %c) #0 {
				; CHECK-LABEL: fma_v32f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl32
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x2]
				; CHECK-NEXT: fmad z0.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x double>, <32 x double>* %a
				%op2 = load <32 x double>, <32 x double>* %b
				%op3 = load <32 x double>, <32 x double>* %c
				%mul = fmul contract <32 x double> %op1, %op2
				%res = fadd contract <32 x double> %mul, %op3
				store <32 x double> %res, <32 x double>* %a
				ret void
				}

				attributes #0 = { "target-features"="+sve" }
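
Each test above pairs `fmul contract` with `fadd contract`, which permits the backend to fuse the two operations into a single FMA. As a hedged illustration of why fusion is an observable change and not only an instruction-count saving, the C++ sketch below compares a separately rounded multiply and add against `std::fma`. The function names and the input values are chosen for this example only and do not come from the patch.

```cpp
#include <cassert>
#include <cmath>

// Multiply then add as two separately rounded operations. The volatile
// temporary keeps the compiler from contracting the pair into one FMA.
double mulThenAdd(double a, double b, double c) {
  volatile double product = a * b;  // first rounding happens here
  return product + c;               // second rounding happens here
}

// Fused multiply-add: a * b + c is computed exactly and rounded once.
double fused(double a, double b, double c) { return std::fma(a, b, c); }

// Inputs chosen so that a * b equals 1 - 2^-54 exactly, a value that is not
// representable as a double and rounds to 1.0 when stored.
void example() {
  const double a = 1.0 + std::ldexp(1.0, -27);
  const double b = 1.0 - std::ldexp(1.0, -27);

  // Unfused: the product rounds to 1.0, so subtracting 1.0 gives zero.
  assert(mulThenAdd(a, b, -1.0) == 0.0);

  // Fused: the exact product is kept until the final rounding.
  assert(fused(a, b, -1.0) == -std::ldexp(1.0, -54));
}
```

Because the two forms can differ in the last bit, the IR has to opt in to fusion explicitly, which is what the `contract` flag on both instructions does in the tests.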