Page MenuHomePhabricator

Implemented 132/213/231 forms selection for X86-FMA3-AVX512 opcodes.

Authored by v_klochkov on Aug 3 2016, 12:15 AM.




Please review the change-set that implements the commute transformation
(and thus better memory-folding and better register coalescing) for AVX512 FMA opcodes.

This is also a fix for

If this change-set seems too big to reviewers I can split it into 2 parts:

  • 2 new files with the new class Utils/X86InstrFMA3Info.[h,cpp]
  • fix users of those new classes + update of the LIT tests.

Even if that happens, this 'Differential Revision' may be still useful as it shows
both - the new classes and how they are used.

  1. New classes X86InstrFMA3Info and X86InstrFMA3Group.

Currently there are 1584 FMA3 opcodes!
Having a huge switch {case <FMA1>: case <FMA2>: case <FMA1584>: }
in the method X86InstrInfo::isFMA3() seems just wrong.

Also, there are several other places where all those opcodes would have to be listed,
for example, in X86InstrInfo::getFMA3OpcodeToCommuteOperands().
Those opcodes would also have to be in MemoryFoldTable3[] and MemoryFoldTable4[] arrays.

Finally, some other users of FMA3 opcodes are being implemented, they also would need to
list FMA3 opcodes as well, classify and group them (register vs memory forms, check k-masked/k-zero-masked, etc).

The new classes X86InstrFMA3Info and X86InstrFMA3Group:

  • List all existing FMA3 opcodes in one place,
  • Classify them by adding special attributes (IsIntrinsic, IsKMergeMasked, IsKZeroMasked, etc).
  • Collect relative FMA3 opcodes in groups of 6 opcodes: {FMA132r, FMA213r, FMA231r, FMA132m, FMA213m, FMA231m} or in groups of 3 register or 3 memory-form opcodes if group of 6 cannot be formed, for example only memory forms are available for FMAs loading/broadcasting one of operands from memory: {FMADD132PSZmb, FMADD213PSZmb, FMADD231PSZmb}.
  • Provide useful methods like isFMA3(), getFMA3Group(), isIntrinsic(), isKMasked(), isKMergeMasked(), isZeroMasked(), get{Reg,Mem}{132,213,231}Opcode(), etc. It also implements an iterator for walking through all register-form opcodes having memory-form equivalents.

The class X86InstrFMA3Info is a sort of singletons.
Only one object of the class is created per LLVM invocation.
It contains DenseMap<unsigned, X86InstrFMA3Group *> which maps Opcodes to their families,
i.e. groups of 6 opcodes, for example:
FMADD213SSr_Int -> {FMADD132SSr_Int, FMADD213SSr_Int, FMADD231SSr_Int,

                    FMADD132SSr_Int, FMADD213SSr_Int, FMADD231SSr_Int,

The map is initialized once with the first request to the X86InstrFMA3Info class,
then it is possible to get a const reference to groups of FMAs (represented by X86InstrFMA3Group)
using an opcode from such group:

static const X86InstrFMA3Group *getFMA3Group(unsigned Opcode);

I used the method llvm::call_once() to ensure that the method initGroupsOnce() is called only once.

  1. Other changes.

  • Added isCommutable flag to more opcodes. Even K-masked and K-zero-masked FMAs opcodes are commutable now.


  • Removed the long list of FMA3 opcodes from MemoryFoldTable3[], and added reg-to-mem mappings for 3 and 4 operand FMAs using the new iterators provided by X86InstrFMA3Info class.
  • Removed isFMA3() method having a huge switch-statement.
  • Added support for k-masked and k-zero-masked FMAs.
  • Changed the getFMA3OpcodeToCommuteOperands(). This method is again a member of X86InstrInfo and the 1st operand

is MachineInstr &MI (instead of unsigned Opcode). The reason is that
when we start analyzing the users of FMAs we need a MachineInstr and
need the method to be a member of X86InstrInfo.

avx512-fma.ll, avx512-fma-intrinsics.ll, avx512bwvl-intrinsics.ll:

  • Updated 3 LIT tests. The new code has less instructions because of better memory-folding and better register coalescing.

Thank you,
Vyacheslav Klochkov (v_klochkov)

Diff Detail


Event Timeline

v_klochkov updated this revision to Diff 66619.Aug 3 2016, 12:15 AM
v_klochkov retitled this revision from to Implemented 132/213/231 forms selection for X86-FMA3-AVX512 opcodes..
v_klochkov updated this object.
igorb added a subscriber: igorb.Aug 3 2016, 12:21 AM
lsaba added a subscriber: lsaba.Aug 3 2016, 5:41 AM
craig.topper added inline comments.Aug 3 2016, 9:22 PM
264 ↗(On Diff #66619)

These macros create a lot of small static tables and repeated initialization code for the map. Is there some way we can create fewer, larger static tables. Maybe a reg and mem table, a reg only table, and a mem only table with each row containing a group of opcodes and the attributes. Then loop through those larger tables to populate the map? Just a thought. Should reduce number of relocation entries and code size for this file.

Could probably be done as a follow up patch.

5961 ↗(On Diff #66619)

This part isn't really related to the bigger change of introducing the new class and using it. And there are no test cases for this. Can this be split out with tests? Maybe find whatever test tests it for the VEX version and just add an AVX512 command line to it.

5989 ↗(On Diff #66619)

Same as above comment.

v_klochkov updated this revision to Diff 67213.Aug 8 2016, 12:13 PM

Cancelled the changes in 2 places unrelated to the new classes and their uses (both are in isNonFoldablePartialRegisterLoad()).

Hi Craig,

Thank you for reviewing the patch.
I cancelled 2 code insertions accordingly to your recommendation, but did not do yet any changes in the FMA groups initialization.
Please see my comments below.

Thank you,

264 ↗(On Diff #66619)

I don't see any really good ways to improve the initialization code.
If that is doable, then I suggest to have it as a follow up patch, that would only change the private fields/methods of the new classes without changing the public interfaces.

The requirements to the new classes:

  • there should be some way to group 132/213/231 reg opcodes;
  • to group 132/213/231 mem opcodes;
  • to have mapping from reg opcodes to mem opcodes. Some future users also need mapping from mem to reg opcodes.
  • to have quick mapping from reg-opcode to a group of 3 reg opcodes; have the same for mem opcodes.
5961 ↗(On Diff #66619)

I agree and cancel this code insertion. Yes, it deserves a separate change-set.
I also think that the fix in this routine is still incomplete as it does not include support of some k-masked int opcodes.

craig.topper added inline comments.Aug 8 2016, 8:09 PM
16 ↗(On Diff #67213)

What are you getting from this header? The Utils library should not depend on anything from the X86CodeGen library. The Utils library exists today to allow X86InstPrinter to print shuffle decoding in llvm-mc which doesn't have the X86CodeGen library.

So unless this new class is needed by llvm-mc I suggest not including it in the Utils library as it would just be unnecessary bloat on that binary. Just put it in the main X86 directory directory instead.

v_klochkov updated this revision to Diff 67424.Aug 9 2016, 3:33 PM

Moved the 2 new files X86InstrFMA3Info.{cpp,h} from llvm/lib/Target/X86/Utils to llvm/lib/Target/X86

Craig, thank you for the reviewing the changes.
I moved the new files to llvm/lib/Target/X86.

16 ↗(On Diff #67213)

This include helped me to fix LLVM build errors like ''X86:: is undefined".

The new classes are not needed by llvm-mc, so I moved the new files to llvm/lib/Target/X86 folder.
Thank you.

craig.topper accepted this revision.Aug 10 2016, 9:50 PM
craig.topper edited edge metadata.


This revision is now accepted and ready to land.Aug 10 2016, 9:50 PM
This revision was automatically updated to reflect the committed changes.