This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add patterns for scalar FMUL, FMULX
ClosedPublic

Authored by overmighty on Jun 17 2023, 2:47 PM.

Details

Summary

Scalar FMUL, FMULX instructions perform better or the same compared to indexed
FMUL, FMULX.

For example, the Arm Cortex-A55 Software Optimization Guide lists the following
instructions with a throughput of 2 IPC:

  • "FP multiply" FMUL
  • "ASIMD FP multiply" FMULX

whereas it lists the following with a throughput of 1 IPC:

  • "ASIMD FP multiply, by element" FMUL, FMULX

The Arm Cortex-A510 Software Optimization Guide, however, does not separately
list "by element" variants of the "ASIMD FP multiply" instructions, which are
listed with the same throughput as the non-ASIMD ones.

Fixes #60817.

Diff Detail

Event Timeline

overmighty created this revision.Jun 17 2023, 2:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 17 2023, 2:47 PM
overmighty requested review of this revision.Jun 17 2023, 2:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 17 2023, 2:47 PM
overmighty added inline comments.Jun 17 2023, 3:05 PM
llvm/lib/Target/AArch64/AArch64InstrInfo.td
6891

For example, this prevents the following regression:

define float @test_v3f32(<3 x float> %a) nounwind {
; CHECK-LABEL: test_v3f32:
; CHECK:       // %bb.0:
; CHECK-NEXT:    fmul s1, s0, v0.s[1]
; CHECK-NEXT:    fmul s0, s1, v0.s[2]
; CHECK-NEXT:    ret
  %b = call float @llvm.vector.reduce.fmul.f32.v3f32(float 1.0, <3 x float> %a)
  ret float %b
}
test_v3f32:                             // @test_v3f32
// %bb.0:
	mov	s1, v0.s[1]
	fmul	s1, s1, s0
	fmul	s0, s1, v0.s[2]
	ret
llvm/test/CodeGen/AArch64/complex-deinterleaving-f16-mul.ll
21

Ideally this would be fmul h4, h0, h3, but this is prevented by the patterns to avoid using scalar FMUL if it might introduce an extra DUP/mov.

Updated a failing Clang test.

Looks worthwhile to me so far but I think the test changes need some clean-up.

llvm/test/CodeGen/AArch64/fp16_intrinsic_lane.ll
490–500

Some of the test changes, such as this one being moved and renamed, make it hard to see what has changed. Could you try to reduce the number of such differences?

overmighty added inline comments.Jun 19 2023, 10:43 AM
llvm/test/CodeGen/AArch64/fp16_intrinsic_lane.ll
490–500

Sure, I was trying to maintain some kind of consistency. I have also spotted duplicate tests, I don't know if it would be worth it to refactor some of the tests in the future.

samtebbs added inline comments.Jun 20 2023, 1:47 AM
llvm/test/CodeGen/AArch64/fp16_intrinsic_lane.ll
490–500

Yeah refactoring in a future patch with no functional changes sounds good to me.

There is a lot going on in this patch. It might be good to focus on one of the instructions first. And to split the test changes out into a separate patch, just showing the differences here.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
4504

Could these be in ThreeOperandFPData so that they reuse the patterns with the existing TriOpFrags?

6891

Would the same thing happen with other instructions too, if there were tests?

It might be good to focus on one of the instructions first.

Not sure what you mean by this. Should I update this revision to add patterns and tests for only one instruction, have it LGTM'd but not committed, then update it again for the next instruction?

If there was one patch that handled fmul, and another that handled fma then each would be simpler on its own and easier to work through the details.

overmighty retitled this revision from [AArch64] Add patterns for scalar FMUL, FMULX, FMADD, FMSUB to [AArch64] Add patterns for scalar FMUL, FMULX.
overmighty edited the summary of this revision. (Show Details)
overmighty added a reviewer: samtebbs.

Rebased new upstream commits, replaced test changes with minimal ones, fixed extra DUP/mov regression for scalar FMULX and added tests for it, removed new patterns for FMADD and FMSUB as well as the associated new tests.

overmighty added inline comments.Jun 21 2023, 10:37 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.td
6891

Right, I fixed it for FMULX too with this updated diff. See the changes in AArch64InstrFormats.td and the new test_fmulx_horizontal_f* tests.

overmighty added inline comments.Jun 22 2023, 2:55 AM
llvm/test/CodeGen/AArch64/arm64-neon-scalar-by-elem-mul.ll
259

Not sure if test_fmulx_horizontal_f* tests should be put in a separate file or not.

Thanks. That makes it clearer to read. Using the same patterns in SIMDFPIndexed is nice.

Can you make it so that t_vmulh_lane3_f16 is next to t_vmulh_lane_f16, and maybe rename t_vmulh_lane_f16 to t_vmulh_lane0_f16. So that they are still the same tests as before, but with the new additions inplace. The same for all the others.

llvm/lib/Target/AArch64/AArch64InstrFormats.td
8545

I think these should be HasNEON too.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
5251

I don't think this needs NEON according to the reference I was looking at. Just FP.
The patterns for FMUL are very similar to the patterns for FMULX16. Could they share the same multiclass for the patterns?

Rebased new upstream commits, added missing HasNEON predicate to new SIMDFPIndexed v1i32_indexed and v1i64_indexed patterns, replaced new scalar FMUL and FMULX patterns with new multiclass, re-refactored tests for consistency/grouping.

overmighty added inline comments.Jun 25 2023, 5:32 AM
llvm/lib/Target/AArch64/AArch64InstrFormats.td
8545

Right, I had it on the previous patterns for indexed FMUL, I somehow lost/forgot it while porting them to SIMDFPIndexed.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
4432

I wasn't sure where to put this.

5251

Can you confirm that it doesn't need Neon?

I couldn't find some sort of AArch64 equivalent to the Intel 64 manual which lists CPUID feature flags along instructions, but a few things suggest that even scalar FMULX needs Neon:

  • The Arm Cortex/Neoverse Software Optimization Guides that I read listed FMULX under the "ASIMD FP multiply" instruction group (and under SVE "Floating point multiply" too) but not under regular "FP multiply" where FMUL appears.
  • I'm not sure what it's worth but FMULX is defined here with an intrinsic that contains neon in its name as OpNode, the predicate is HasNEONorSME, and SIMDFPThreeScalar defaults to HasNEON.

However, this page lists FMULX as an instruction that has been added as part of floating-point changes in AArch64 and not Neon changes: https://developer.arm.com/documentation/den0024/a/AArch64-Floating-point-and-NEON/New-features-for-NEON-and-Floating-point-in-AArch64?lang=en.

overmighty added inline comments.Jun 25 2023, 6:25 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.td
4432

Also, I assume this multiclass needs a better name that mentions the number of operands for the instructions. Scalar FMUL invokes TwoOperandFPData but scalar FMULX invokes SIMDFPThreeScalar, I'm not sure if something like TwoOperandFPDataFromLane0Patterns would be a better name.

Matt added a subscriber: Matt.Jun 26 2023, 9:39 AM
dmgreen accepted this revision.Jun 28 2023, 5:02 AM

Thanks for the changes. With a few extra Nits and suggestions, this LGTM.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
4432

Perhaps move it next to the defm for FPScalarFromIndexedLane0Patterns?

Would it expect to be used for anything other than mul/mulx? If not, perhaps include Mul in the name instead of Scalar.

5251

The definitive source will be the pseudocode in the Arm Architecture Reference Manual. For mulx it has:

if elements == 1 then
  CheckFPEnabled64();
else
  CheckFPAdvSIMDEnabled64();

In this case, if the instruction definition already uses HasNEONorSME then its fine to keep the same thing for the new patterns. We can fix the two in another patch at some point. It likely doesn't matter a lot at the moment, I don't think there is a way to generate mulx intrinsics without Neon being enabled.

This could be moved down below the other instruction definitions.

This revision is now accepted and ready to land.Jun 28 2023, 5:02 AM

Rebased new upstream commits, renamed new multiclass from FPScalarFromIndexedLane0Patterns to FMULScalarFromIndexedLane0Patterns and moved it next to the first invocation for FMUL, moved second invocation for FMULX below the group of instruction definitions.

Thanks for the suggestions. If you don't have more and this still looks good to you, please commit this as "OverMighty <its.overmighty@gmail.com>". Thank you. :)

llvm/lib/Target/AArch64/AArch64InstrInfo.td
4432

I don't expect it to be used for any instructions other than FMUL and FMULX without some changes. I'm not sure about removing Scalar from the name though, especially if Indexed stays.

5251

By the way, this pseudocode from the Arm Architecture Reference Manual differs from the one found in the following web pages, where only CheckFPAdvSIMDEnabled64 is called:

Thanks. Yeah will do.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
5251

I see. That was a change made when SME was added to the architecture.

This revision was automatically updated to reflect the committed changes.