This is an archive of the discontinued LLVM Phabricator instance.

X86: FMA intrinsic + FNEG - sequence optimization
ClosedPublic

Authored by delena on Aug 9 2016, 5:30 AM.

Details

Summary

Optimize FMA intrinsic + FNEG sequences, like

-(a*b+c)

and FNEG + FMA, like

a*b-c or (-a)*b+c

Legalization of ISD::FNEG is delayed to give a chance to combine it with X86ISD::FMADD/MSUB/NMADD/NMSUB.

7 out of 8 test cases are optimized. The last one requires additional investigation and probably changes in the clang header.

Fixed https://llvm.org/bugs/show_bug.cgi?id=28892.

Diff Detail

Repository
rL LLVM

Event Timeline

delena updated this revision to Diff 67319.Aug 9 2016, 5:30 AM
delena retitled this revision from to X86: FMA intrinsic + FNEG - sequence optimization.
delena updated this object.
delena added reviewers: igorb, RKSimon, andreadb.
delena set the repository for this revision to rL LLVM.
delena added a subscriber: llvm-commits.
RKSimon added inline comments.Aug 9 2016, 8:45 AM
../test/CodeGen/X86/fma-fneg-combine.ll
72 ↗(On Diff #67319)

Is this case fixed by D23108?

The patch https://reviews.llvm.org/D23108 will remove redundant "vmovaps" after the FMA.

RKSimon edited edge metadata.Aug 10 2016, 6:47 AM

Instead of adding the forceLegalization approach why can't we just expand combineFMA to support the negation of inputs for all X86ISD FMA/AVX512 opcodes? A helper function might be necessary (to detect FNEG or FXOR signbit-flipping) but that should be all.

> Instead of adding the forceLegalization approach why can't we just expand combineFMA to support the negation of inputs for all X86ISD FMA/AVX512 opcodes? A helper function might be necessary (to detect FNEG or FXOR signbit-flipping) but that should be all.

When we receive the X86ISD FMA/AVX512 opcodes, the FNEG no longer exists; it has already been lowered to FXOR, and that is the problem. Analyzing the FXOR requires visiting the load, walking into the constant pool, and checking the value of the integer constant behind the FP load.
That's why I want to delay FNEG legalization.

spatel edited edge metadata.Aug 13 2016, 12:39 PM

>> Instead of adding the forceLegalization approach why can't we just expand combineFMA to support the negation of inputs for all X86ISD FMA/AVX512 opcodes? A helper function might be necessary (to detect FNEG or FXOR signbit-flipping) but that should be all.
>
> When we receive the X86ISD FMA/AVX512 opcodes, the FNEG no longer exists; it has already been lowered to FXOR, and that is the problem. Analyzing the FXOR requires visiting the load, walking into the constant pool, and checking the value of the integer constant behind the FP load.
> That's why I want to delay FNEG legalization.

The premature lowering to a load from the constant pool is the real problem, isn't it? If we didn't do that, then you'd just match the expected X86ISD::FXOR with -1 pattern.

There's a FIXME in LowerFABSorFNEG() about making this better with optsize. I think we should fix this now rather than adding a work-around for something that's broken.

I didn't check whether the failures caused by this change are just load-folding diffs or real problems, but the basic patch is just deleting the bogus code in LowerFABSorFNEG():

Index: lib/Target/X86/X86ISelLowering.cpp
===================================================================
--- lib/Target/X86/X86ISelLowering.cpp (revision 278598)
+++ lib/Target/X86/X86ISelLowering.cpp (working copy)
@@ -14519,7 +14519,7 @@
   // decide if we should generate a 16-byte constant mask when we only need 4 or
   // 8 bytes for the scalar case.
-  MVT LogicVT;
+  EVT LogicVT;
   MVT EltVT;
   unsigned NumElts;
@@ -14543,18 +14543,10 @@
   }
 
   unsigned EltBits = EltVT.getSizeInBits();
-  LLVMContext *Context = DAG.getContext();
   // For FABS, mask is 0x7f...; for FNEG, mask is 0x80...
   APInt MaskElt =
       IsFABS ? APInt::getSignedMaxValue(EltBits) : APInt::getSignBit(EltBits);
-  Constant *C = ConstantInt::get(*Context, MaskElt);
-  C = ConstantVector::getSplat(NumElts, C);
-  const TargetLowering &TLI = DAG.getTargetLoweringInfo();
-  SDValue CPIdx = DAG.getConstantPool(C, TLI.getPointerTy(DAG.getDataLayout()));
-  unsigned Alignment = cast<ConstantPoolSDNode>(CPIdx)->getAlignment();
-  SDValue Mask = DAG.getLoad(
-      LogicVT, dl, DAG.getEntryNode(), CPIdx,
-      MachinePointerInfo::getConstantPool(DAG.getMachineFunction()), Alignment);
+  SDValue Mask =
+      DAG.getConstant(MaskElt, dl, LogicVT.changeVectorElementTypeToInteger());

We're already doing some constant pool extraction with getTargetShuffleMaskIndices/getTargetShuffleMaskConstant - what if we generalized that?

> We're already doing some constant pool extraction with getTargetShuffleMaskIndices/getTargetShuffleMaskConstant - what if we generalized that?

rL279958 fixed the constant loading, so this patch should be updated (and should become simpler now).

Agree, I'll proceed.

delena updated this revision to Diff 69739.Aug 30 2016, 12:09 PM
delena edited edge metadata.

Changed the code according to the review comments. I also added more tests; not all of the test cases are optimized.

spatel added inline comments.Aug 30 2016, 3:53 PM
../lib/Target/X86/X86ISelLowering.cpp
4869 ↗(On Diff #69739)

It's good to generalize this, but you should also change the variable names from Mask* to something generic since it's not about shuffles anymore. Please make this change ahead of the fneg changes to reduce the size of the patch.

30361–30373 ↗(On Diff #69739)

Are the *RND changes independent of the fneg bug fix? If yes, can you separate them from this patch?

30402 ↗(On Diff #69739)

returns -> Returns
Add period to end of sentence.

30411 ↗(On Diff #69739)

Sing -> Sign

delena marked 2 inline comments as done.Aug 31 2016, 12:36 AM
delena added inline comments.
../lib/Target/X86/X86ISelLowering.cpp
30361–30373 ↗(On Diff #69739)

The goal of this patch is to combine FNEG with the FMA intrinsics.
*_RND nodes are generated from intrinsics only.
Here we are combining the following patterns:
FNEG (avx512.mask.fm* a, b, c) and
FNEG (avx512.mask.fm* a, b, c, Rounding)

delena updated this revision to Diff 69800.Aug 31 2016, 12:39 AM

Updated the comments inside the code and the variable names. NFC compared to the previous patch.

Should there be some tests for scalar double and packed double as well?

../lib/Target/X86/X86ISelLowering.cpp
4872 ↗(On Diff #69800)

MaskLoad -> Load ?

4881 ↗(On Diff #69800)

MaskCP -> ConstantPoolNode ?

delena updated this revision to Diff 69981.Sep 1 2016, 4:05 AM
delena marked 2 inline comments as done.

Added more tests for "double" scalar and vector intrinsics.

spatel accepted this revision.Sep 1 2016, 6:30 AM
spatel edited edge metadata.

LGTM.

This revision is now accepted and ready to land.Sep 1 2016, 6:30 AM
This revision was automatically updated to reflect the committed changes.