`Hello,
Please review the change-set that implements the commute transformation
(and thus better memory-folding and better register coalescing) for AVX512 FMA opcodes.
This is also a fix for https://llvm.org/bugs/show_bug.cgi?id=17229
If this change-set seems too big to reviewers I can split it into 2 parts:
- 2 new files with the new class Utils/X86InstrFMA3Info.[h,cpp]
- fix users of those new classes + update of the LIT tests.
Even if that happens, this 'Differential Revision' may be still useful as it shows
both - the new classes and how they are used.
- New classes X86InstrFMA3Info and X86InstrFMA3Group.
Currently there are 1584 FMA3 opcodes!
Having a huge switch {case <FMA1>: case <FMA2>: case <FMA1584>: }
in the method X86InstrInfo::isFMA3() seems just wrong.
Also, there are several other places where all those opcodes would have to be listed,
for example, in X86InstrInfo::getFMA3OpcodeToCommuteOperands().
Those opcodes would also have to be in MemoryFoldTable3[] and MemoryFoldTable4[] arrays.
Finally, some other users of FMA3 opcodes are being implemented, they also would need to
list FMA3 opcodes as well, classify and group them (register vs memory forms, check k-masked/k-zero-masked, etc).
The new classes X86InstrFMA3Info and X86InstrFMA3Group:
- List all existing FMA3 opcodes in one place,
- Classify them by adding special attributes (IsIntrinsic, IsKMergeMasked, IsKZeroMasked, etc).
- Collect relative FMA3 opcodes in groups of 6 opcodes: {FMA132r, FMA213r, FMA231r, FMA132m, FMA213m, FMA231m} or in groups of 3 register or 3 memory-form opcodes if group of 6 cannot be formed, for example only memory forms are available for FMAs loading/broadcasting one of operands from memory: {FMADD132PSZmb, FMADD213PSZmb, FMADD231PSZmb}.
- Provide useful methods like isFMA3(), getFMA3Group(), isIntrinsic(), isKMasked(), isKMergeMasked(), isZeroMasked(), get{Reg,Mem}{132,213,231}Opcode(), etc. It also implements an iterator for walking through all register-form opcodes having memory-form equivalents.
The class X86InstrFMA3Info is a sort of singletons.
Only one object of the class is created per LLVM invocation.
It contains DenseMap<unsigned, X86InstrFMA3Group *> which maps Opcodes to their families,
i.e. groups of 6 opcodes, for example:
FMADD213SSr_Int -> {FMADD132SSr_Int, FMADD213SSr_Int, FMADD231SSr_Int,
FMADD132SSr_Int, FMADD213SSr_Int, FMADD231SSr_Int, IsIntrinsic};
The map is initialized once with the first request to the X86InstrFMA3Info class,
then it is possible to get a const reference to groups of FMAs (represented by X86InstrFMA3Group)
using an opcode from such group:
static const X86InstrFMA3Group *getFMA3Group(unsigned Opcode);
I used the method llvm::call_once() to ensure that the method initGroupsOnce() is called only once.
- Other changes.
X86InstrAVX512.td:
- Added isCommutable flag to more opcodes. Even K-masked and K-zero-masked FMAs opcodes are commutable now.
X86InstrInfo.cpp:
- Removed the long list of FMA3 opcodes from MemoryFoldTable3[], and added reg-to-mem mappings for 3 and 4 operand FMAs using the new iterators provided by X86InstrFMA3Info class.
- Removed isFMA3() method having a huge switch-statement.
- Added support for k-masked and k-zero-masked FMAs.
- Changed the getFMA3OpcodeToCommuteOperands(). This method is again a member of X86InstrInfo and the 1st operand
is MachineInstr &MI (instead of unsigned Opcode). The reason is that
when we start analyzing the users of FMAs we need a MachineInstr and
need the method to be a member of X86InstrInfo.
avx512-fma.ll, avx512-fma-intrinsics.ll, avx512bwvl-intrinsics.ll:
- Updated 3 LIT tests. The new code has less instructions because of better memory-folding and better register coalescing.
Thank you,
Vyacheslav Klochkov (v_klochkov)
`
What are you getting from this header? The Utils library should not depend on anything from the X86CodeGen library. The Utils library exists today to allow X86InstPrinter to print shuffle decoding in llvm-mc which doesn't have the X86CodeGen library.
So unless this new class is needed by llvm-mc I suggest not including it in the Utils library as it would just be unnecessary bloat on that binary. Just put it in the main X86 directory directory instead.