This is an archive of the discontinued LLVM Phabricator instance.

[X86][FMA] Optimize FNEG(FMUL) Patterns
ClosedPublic

Authored by RKSimon on Nov 22 2015, 11:41 AM.

Details

Summary

On FMA targets, we can avoid having to load a constant to negate a float/double multiply by instead using a FNMSUB (-(X*Y)-0)

Note: As Sanjay mentioned in his bug report, although this is consistently faster (by avoiding the constant load), this does increase register pressure by requiring us to create a zero register. I'm not sure how best to qualify this if people think its a problem. Only running with optsize doesn't really help us - we MAY reduce constantpool size (if no other FNEG are present) but we MAY also increase code size handling extra stack traffic. We do have precedent for this: we use blendps to zero out elements instead of using the slower insertps; I'm sure there are plenty of other examples.

Fix for PR24366

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 40882.Nov 22 2015, 11:41 AM
RKSimon retitled this revision from to [X86][FMA] Optimize FNEG(FMUL) Patterns.
RKSimon updated this object.
RKSimon added reviewers: spatel, delena, qcolombet.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: llvm-commits.
RKSimon updated this revision to Diff 41085.Nov 24 2015, 2:06 PM

Rebased - moved patch to PerformFNEGCombine

spatel added inline comments.
lib/Target/X86/X86ISelLowering.cpp
26157

Do we need unsafe math for this transform?

26159

Aren't the FMA checks enough? Is there some target that assumes that AVX512 includes FMA but does not set the FMA attribute?

26160

Nit: could hoist this and reuse below.

Elena please can you confirm if we need hasAVX512() as well as hasFMA() ?

lib/Target/X86/X86ISelLowering.cpp
26157

IMO no its not necessary, I was just being conservative - subtract from zero at type or internal precision should give the same result.

26159

This is what we're doing in all similar cases - I don't have a CPUID spec that discusses whether FMA is guaranteed for AVX512.

26160

Easily done - we seem to be very inconsistent as to whether we hoist SDLoc (and sometimes not use it) or only generate them if we actually perform the combine.

delena edited edge metadata.Nov 25 2015, 1:49 AM
delena added a subscriber: delena.

AVX512 always contains FMA, you don't need a special check.

  • Elena
RKSimon updated this revision to Diff 41140.Nov 25 2015, 6:27 AM
RKSimon edited edge metadata.

Updated with Sanjay & Elena's comments

RKSimon marked 3 inline comments as done.Nov 25 2015, 6:29 AM
spatel edited edge metadata.Nov 25 2015, 8:59 AM

AVX512 always contains FMA, you don't need a special check.

I see what's going on now. This patch needs the AVX512 check in order to work for 512-bit vectors. There should be at least one more test case included with this patch to check the 512-bit vector case. Alternatively, if we want to make 512-bit optimization a separate patch, the predicate should disallow those types.

So I see 2 bugs here. Test case:

define <16 x float> @test_v4f32_fneg_fmul_16(<16 x float> %x, <16 x float> %y) #0 {
  %m = fmul <16 x float> %x, %y
  %n = fsub <16 x float> <float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0>, %m
  ret <16 x float> %n
}

Bug #1: The AVX-512 attribute in the subtarget should imply hasFMA(). Currently, this isn't true, so if you don't explicitly specify FMA, you won't get any expected FMA transforms:

$ ./llc fma.ll -o - -mattr=avx512f
...
   vmulps	%zmm1, %zmm0, %zmm0
   vpxord	LCPI0_0(%rip), %zmm0, %zmm0
   retq

$ ./llc fma.ll -o - -mattr=avx512f,fma
...
  vpxord	%zmm2, %zmm2, %zmm2
  vfnmadd213ps	%zmm2, %zmm1, %zmm0
  retq

Bug #2: This patch isn't checking for legal types, so it will fire for a 512-bit vector on a target that doesn't support it, and we crash:

$ ./llc fma.ll -o - -mattr=fma
  Do not know how to custom type legalize this operation!
  UNREACHABLE executed at /Users/spatel/myllvm/llvm/lib/Target/X86/X86ISelLowering.cpp:19880!

This would also happen if the test case had an illegal type for any target; eg <17 x float>.

I think the best fix for #2 is to explicitly check for the supported types for a given subtarget rather than waiting for type legalization and then checking the scalar type. This should be a helper function, so we can have the existing code in PerformFMACombine() use it too.

[I added this to Phab over an hour ago, but it hasn't made it to the list
afaict, so repeating with an email.]

I see what's going on now. This patch needs the AVX512 check in order to
work for 512-bit vectors. There should be at least one more test case
included with this patch to check the 512-bit vector case. Alternatively,
if we want to make 512-bit optimization a separate patch, the predicate
should disallow those types.

So I see 2 bugs here. Test case:

define <16 x float> @test_v4f32_fneg_fmul_16(<16 x float> %x, <16 x
float> %y) #0 {

%m = fmul <16 x float> %x, %y
%n = fsub <16 x float> <float -0.0, float -0.0, float -0.0, float

-0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0,
float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float
-0.0, float -0.0>, %m

ret <16 x float> %n

}

Bug #1: The AVX-512 attribute in the subtarget should imply hasFMA().
Currently, this isn't true, so if you don't explicitly specify FMA, you
won't get any expected FMA transforms:

$ ./llc fma.ll -o - -mattr=avx512f... vmulps %zmm1, %zmm0, %zmm0
vpxord LCPI0_0(%rip), %zmm0, %zmm0 retq$ ./llc fma.ll -o -
-mattr=avx512f,fma... vpxord %zmm2, %zmm2, %zmm2 vfnmadd213ps %zmm2,
%zmm1, %zmm0 retq

Bug #2: This patch isn't checking for legal types, so it will fire for a
512-bit vector on a target that doesn't support it, and we crash:

$ ./llc fma.ll -o - -mattr=fma Do not know how to custom type
legalize this operation! UNREACHABLE executed at
/Users/spatel/myllvm/llvm/lib/Target/X86/X86ISelLowering.cpp:19880!

This would also happen if the test case had an illegal type for any target;
eg <17 x float>.

I think the best fix for #2 is to explicitly check for the supported types
for a given subtarget rather than waiting for type legalization and then
checking the scalar type. This should be a helper function, so we can have
the existing code in PerformFMACombine() use it too.

Thanks Sanjay - I'm going to update fma_patterns.ll and fma_patterns_wide.ll to properly test AVX512, and sync them so more of the tests are tested in the wide pattern as well. That should deal with both AVX512 support and the legal types issues.

RKSimon updated this revision to Diff 41335.Nov 28 2015, 10:51 AM
RKSimon edited edge metadata.

Updated patch to correctly support 512-bit float types on AVX512 and pre-AVX512 targets.

Note - it turns out we do need the hasAVX512() test to correctly lower on 128-256 bit vectors on AVX512 targets.

scanon added inline comments.Nov 28 2015, 2:52 PM
lib/Target/X86/X86ISelLowering.cpp
26157

0 - a*b gets the sign of zero wrong if a or b is exactly zero; the result should be -0, but +0 is returned instead in default rounding. So this should be protected by no signed zeros.

escha added a subscriber: escha.Nov 28 2015, 4:03 PM

Could this be resolved by using -0 as the constant instead of 0 in non-fast-math mode?

—escha

scanon edited edge metadata.Nov 28 2015, 5:57 PM

Could this be resolved by using -0 as the constant instead of 0 in non-fast-math mode?

You'd use VFNMSUB instead to avoid needing to conjure -0. However, that still does the wrong thing in non-default rounding modes. Consider -(0*-1) in round-down. This should produce +0. Instead, we get -0 - (-1*0) = -0.

So this could be licensed under no-signed-zeros, or also under assume-default-rounding (proposed in http://reviews.llvm.org/D14067).

RKSimon updated this revision to Diff 41349.Nov 29 2015, 4:33 AM
RKSimon edited edge metadata.

Updated tests to include versions with/without nsz flags.

Note: I had to tweak the AVX512 tests to target AVX512DQ to lower XOR instructions.

As updated, this seems fine to me.

Thanks, Steve.

Can we assume default rounding for now, change the operand to X86::FNMSUB, remove the nsz check, and leave a 'TODO' comment to revisit this after rounding mode support is added? I think it would be better to have this transform be more general if we can.

Simon - is there a reason to prefer waiting for type legalization before doing the transform? I know that's how it's done in PerformFMACombine(), but it seems that the more common pattern is to check for known legal types for a particular opcode. This allows the transform to fire earlier?

Can we assume default rounding for now, change the operand to X86::FNMSUB, remove the nsz check, and leave a 'TODO' comment to revisit this after rounding mode support is added? I think it would be better to have this transform be more general if we can.

I would *really* prefer not to. Not having rounding-mode support means that we should do the conservative thing. Once we have the ability to do the right thing, we can make the default more aggressive.

Simon - is there a reason to prefer waiting for type legalization before doing the transform? I know that's how it's done in PerformFMACombine(), but it seems that the more common pattern is to check for known legal types for a particular opcode. This allows the transform to fire earlier?

No reason other than matching the behaviour in PerformFMACombine() - I tend to prefer not creating X86ISD nodes until we really need to - it gives the DAGCombiner as much chance to constant fold etc. as possible.

Can we assume default rounding for now, change the operand to X86::FNMSUB, remove the nsz check, and leave a 'TODO' comment to revisit this after rounding mode support is added? I think it would be better to have this transform be more general if we can.

I would *really* prefer not to. Not having rounding-mode support means that we should do the conservative thing. Once we have the ability to do the right thing, we can make the default more aggressive.

Ok. We should have a comment here to explain the logic then.
Would it still be better to use FNMSUB now even with the nsz check, so we're minimizing codegen differences after we have rounding-mode support?

Would it still be better to use FNMSUB now even with the nsz check, so we're minimizing codegen differences after we have rounding-mode support?

Yes, also because it's more correct (even though the wrong answer is licensed by NSZ, if we can produce the right answer at no cost, we might as well do so).

RKSimon updated this revision to Diff 41439.Nov 30 2015, 2:45 PM

Updated to use FNMSUB instead of FNMADD.

Added FIXME for once we have rounding control flags

RKSimon updated this object.Nov 30 2015, 2:53 PM
spatel accepted this revision.Dec 1 2015, 8:20 AM
spatel edited edge metadata.

LGTM.

If you don't have a patch in progress for the missing AVX512 commute, add another FIXME comment for that?

This revision is now accepted and ready to land.Dec 1 2015, 8:20 AM
spatel added inline comments.Dec 1 2015, 10:25 AM
lib/Target/X86/X86ISelLowering.cpp
27

We can use "hasAnyFMA()" after:
http://reviews.llvm.org/rL254425

This revision was automatically updated to reflect the committed changes.

LGTM.

If you don't have a patch in progress for the missing AVX512 commute, add another FIXME comment for that?

Thanks Sanjay, I have both the missing AVX512 commutations and the FMA scalar domain issues on my todo list.