Page MenuHomePhabricator

[ARM] Do not fuse VADD and VMUL on the Cortex-M4 and Cortex-M33
ClosedPublic

Authored by SjoerdMeijer on Sep 20 2018, 1:53 AM.

Details

Summary

A sequence of VMUL and VADD instructions always give the same or better
performance than a fused VMLA instruction on the Cortex-M4 and Cortex-M33.
Executing the VMUL and VADD back-to-back requires the same cycles, but
having separate instructions allows scheduling to avoid the hazard between
these 2 instructions.

Diff Detail

Repository
rL LLVM

Event Timeline

SjoerdMeijer created this revision.Sep 20 2018, 1:53 AM

Reshuffled the tests a bit.

Shouldn't we also consider code size here?

Good point. I wanted to worry about that later in a follow up patch, but perhaps that doesn't make sense. I will fix it now.

dmgreen added inline comments.Sep 20 2018, 3:41 AM
test/CodeGen/Thumb2/float-intrinsics-float.ll
192 ↗(On Diff #166250)

This comment can be removed now?

Thanks for the reviews.
Now take code size into account, and removed outdated comment.

LGTM, cheers.

Sam Parker

Compilation Tools Engineer | Arm

. . . . . . . . . . . . . . . . . . . . . . . . . . .

Arm.com


From: Sjoerd Meijer via Phabricator <reviews@reviews.llvm.org>
Sent: 21 September 2018 15:45:01
To: Sjoerd Meijer; Sam Parker; David Green; t.p.northover@gmail.com; John Brawn; Javed Absar
Cc: Kristof Beyls; christian.bruel@st.com; llvm-commits@lists.llvm.org; kanheim@a-bix.com; James Molloy; diana.picus@linaro.org; Florian Hahn
Subject: [PATCH] D52289: [ARM] Do not fuse VADD and VMUL on the Cortex-M4 and Cortex-M33

SjoerdMeijer updated this revision to Diff 166483.
SjoerdMeijer added a comment.

Thanks for the reviews.
Now take code size into account, and removed outdated comment.

https://reviews.llvm.org/D52289

Files:

lib/Target/ARM/ARM.td
lib/Target/ARM/ARMInstrInfo.td
test/CodeGen/ARM/fmacs.ll
test/CodeGen/Thumb2/float-intrinsics-float.ll

Index: test/CodeGen/Thumb2/float-intrinsics-float.ll

  • test/CodeGen/Thumb2/float-intrinsics-float.ll

+++ test/CodeGen/Thumb2/float-intrinsics-float.ll
@@ -1,5 +1,6 @@
; RUN: llc < %s -mtriple=thumbv7-none-eabi -mcpu=cortex-m3 | FileCheck %s -check-prefix=CHECK -check-prefix=SOFT -check-prefix=NONE
-; RUN: llc < %s -mtriple=thumbv7-none-eabihf -mcpu=cortex-m4 | FileCheck %s -check-prefix=CHECK -check-prefix=HARD -check-prefix=SP -check-prefix=VMLA
+; RUN: llc < %s -mtriple=thumbv7-none-eabihf -mcpu=cortex-m4 | FileCheck %s -check-prefix=CHECK -check-prefix=HARD -check-prefix=SP -check-prefix=NO-VMLA
+; RUN: llc < %s -mtriple=thumbv7-none-eabihf -mcpu=cortex-m33 | FileCheck %s -check-prefix=CHECK -check-prefix=HARD -check-prefix=SP -check-prefix=NO-VMLA
; RUN: llc < %s -mtriple=thumbv7-none-eabihf -mcpu=cortex-m7 | FileCheck %s -check-prefix=CHECK -check-prefix=HARD -check-prefix=DP -check-prefix=VFP -check-prefix=FP-ARMv8 -check-prefix=VMLA
; RUN: llc < %s -mtriple=thumbv7-none-eabihf -mcpu=cortex-m7 -mattr=+fp-only-sp | FileCheck %s -check-prefix=CHECK -check-prefix=HARD -check-prefix=SP -check-prefix=FP-ARMv8 -check-prefix=VMLA
; RUN: llc < %s -mtriple=thumbv7-none-eabihf -mcpu=cortex-a7 | FileCheck %s -check-prefix=CHECK -check-prefix=HARD -check-prefix=DP -check-prefix=NEON -check-prefix=VFP4 -check-prefix=NO-VMLA
@@ -188,8 +189,6 @@

ret float %1

}

-; FIXME: why does cortex-m4 use vmla, while cortex-a7 uses vmul+vadd?
-; (these should be equivalent, even the rounding is the same)
declare float @llvm.fmuladd.f32(float %a, float %b, float %c)
define float @fmuladd_f(float %a, float %b, float %c) {
; CHECK-LABEL: fmuladd_f:

Index: test/CodeGen/ARM/fmacs.ll

  • test/CodeGen/ARM/fmacs.ll

+++ test/CodeGen/ARM/fmacs.ll
@@ -3,6 +3,8 @@
; RUN: llc -mtriple=arm-eabi -mcpu=cortex-a8 %s -o - | FileCheck %s -check-prefix=A8
; RUN: llc -mtriple=arm-eabi -mcpu=cortex-a9 %s -o - | FileCheck %s -check-prefix=A9
; RUN: llc -mtriple=arm-linux-gnueabi -mcpu=cortex-a9 -float-abi=hard %s -o - | FileCheck %s -check-prefix=HARD
+; RUN: llc -mtriple=arm-linux-gnueabi -mcpu=cortex-m4 -float-abi=hard %s -o - | FileCheck %s -check-prefix=VMLA
+; RUN: llc -mtriple=arm-linux-gnueabi -mcpu=cortex-m33 -float-abi=hard %s -o - | FileCheck %s -check-prefix=VMLA

define float @t1(float %acc, float %a, float %b) {
entry:
@@ -15,6 +17,21 @@
; A8-LABEL: t1:
; A8: vmul.f32
; A8: vadd.f32
+
+; VMLA-LABEL: t1:
+; VMLA: vmul.f32
+; VMLA-NEXT: vadd.f32
+
+ %0 = fmul float %a, %b
+ %1 = fadd float %acc, %0
+ ret float %1
+}
+
+define float @vlma_minsize(float %acc, float %a, float %b) #0 {
+entry:
+; VMLA-LABEL: vlma_minsize:
+; VLMA: vmla.f32 s0, s1, s2
+

%0 = fmul float %a, %b
%1 = fadd float %acc, %0
      ret float %1

@@ -102,3 +119,5 @@

%3 = fadd float %1, %2
ret float %3

}
+
+attributes #0 = { minsize nounwind optsize }

Index: lib/Target/ARM/ARMInstrInfo.td

  • lib/Target/ARM/ARMInstrInfo.td

+++ lib/Target/ARM/ARMInstrInfo.td
@@ -353,10 +353,10 @@
let RecomputePerFunction = 1 in {

def UseMovt          : Predicate<"Subtarget->useMovt(*MF)">;
def DontUseMovt      : Predicate<"!Subtarget->useMovt(*MF)">;
  • def UseMovtInPic : Predicate<"Subtarget->useMovt(*MF) && Subtarget->allowPositionIndependentMovt()">;
  • def DontUseMovtInPic : Predicate<"!Subtarget->useMovt(*MF) || !Subtarget->allowPositionIndependentMovt()">;

+ def UseMovtInPic : Predicate<"Subtarget->useMovt(*MF) && Subtarget->allowPositionIndependentMovt()">;
+ def DontUseMovtInPic : Predicate<"!Subtarget->useMovt(*MF) || !Subtarget->allowPositionIndependentMovt()">;
+ def UseFPVMLx : Predicate<"Subtarget->useFPVMLx() || MF->getFunction().optForMinSize()">;
}
-def UseFPVMLx : Predicate<"Subtarget->useFPVMLx()">;
def UseMulOps : Predicate<"Subtarget->useMulOps()">;

// Prefer fused MAC for fp mul + add over fp VMLA / VMLS if they are available.

Index: lib/Target/ARM/ARM.td

  • lib/Target/ARM/ARM.td

+++ lib/Target/ARM/ARM.td
@@ -966,6 +966,7 @@

FeatureVFPOnlySP,
FeatureD16,
FeaturePrefLoopAlign32,

+ FeatureHasSlowFPVMLx,

FeatureHasNoBranchPredictor]>;

def : ProcNoItin<"cortex-m7", [ARMv7em,
@@ -981,6 +982,7 @@

FeatureD16,
FeatureVFPOnlySP,
FeaturePrefLoopAlign32,

+ FeatureHasSlowFPVMLx,

FeatureHasNoBranchPredictor]>;

def : ProcNoItin<"cortex-a32", [ARMv8a,

This revision was not accepted when it landed; it landed in state Needs Review.Sep 24 2018, 7:24 AM
This revision was automatically updated to reflect the committed changes.