`Hello,
Please review the change-set that implements the commute transformation
(and thus better memory-folding and better register coalescing) for AVX512 FMA opcodes.
This is also a fix for https://llvm.org/bugs/show_bug.cgi?id=17229
If this change-set seems too big to reviewers I can split it into 2 parts:
- 2 new files with the new class Utils/X86InstrFMA3Info.[h,cpp]
- fix users of those new classes + update of the LIT tests.
Even if that happens, this 'Differential Revision' may be still useful as it shows
both - the new classes and how they are used.
- New classes X86InstrFMA3Info and X86InstrFMA3Group.
Currently there are 1584 FMA3 opcodes!
Having a huge switch {case <FMA1>: case <FMA2>: case <FMA1584>: }
in the method X86InstrInfo::isFMA3() seems just wrong.
Also, there are several other places where all those opcodes would have to be listed,
for example, in X86InstrInfo::getFMA3OpcodeToCommuteOperands().
Those opcodes would also have to be in MemoryFoldTable3[] and MemoryFoldTable4[] arrays.
Finally, some other users of FMA3 opcodes are being implemented, they also would need to
list FMA3 opcodes as well, classify and group them (register vs memory forms, check k-masked/k-zero-masked, etc).
The new classes X86InstrFMA3Info and X86InstrFMA3Group:
- List all existing FMA3 opcodes in one place,
- Classify them by adding special attributes (IsIntrinsic, IsKMergeMasked, IsKZeroMasked, etc).
- Collect relative FMA3 opcodes in groups of 6 opcodes: {FMA132r, FMA213r, FMA231r, FMA132m, FMA213m, FMA231m} or in groups of 3 register or 3 memory-form opcodes if group of 6 cannot be formed, for example only memory forms are available for FMAs loading/broadcasting one of operands from memory: {FMADD132PSZmb, FMADD213PSZmb, FMADD231PSZmb}.
- Provide useful methods like isFMA3(), getFMA3Group(), isIntrinsic(), isKMasked(), isKMergeMasked(), isZeroMasked(), get{Reg,Mem}{132,213,231}Opcode(), etc. It also implements an iterator for walking through all register-form opcodes having memory-form equivalents.
The class X86InstrFMA3Info is a sort of singletons.
Only one object of the class is created per LLVM invocation.
It contains DenseMap<unsigned, X86InstrFMA3Group *> which maps Opcodes to their families,
i.e. groups of 6 opcodes, for example:
FMADD213SSr_Int -> {FMADD132SSr_Int, FMADD213SSr_Int, FMADD231SSr_Int,
FMADD132SSr_Int, FMADD213SSr_Int, FMADD231SSr_Int, IsIntrinsic};
The map is initialized once with the first request to the X86InstrFMA3Info class,
then it is possible to get a const reference to groups of FMAs (represented by X86InstrFMA3Group)
using an opcode from such group:
static const X86InstrFMA3Group *getFMA3Group(unsigned Opcode);
I used the method llvm::call_once() to ensure that the method initGroupsOnce() is called only once.
- Other changes.
X86InstrAVX512.td:
- Added isCommutable flag to more opcodes. Even K-masked and K-zero-masked FMAs opcodes are commutable now.
X86InstrInfo.cpp:
- Removed the long list of FMA3 opcodes from MemoryFoldTable3[], and added reg-to-mem mappings for 3 and 4 operand FMAs using the new iterators provided by X86InstrFMA3Info class.
- Removed isFMA3() method having a huge switch-statement.
- Added support for k-masked and k-zero-masked FMAs.
- Changed the getFMA3OpcodeToCommuteOperands(). This method is again a member of X86InstrInfo and the 1st operand
is MachineInstr &MI (instead of unsigned Opcode). The reason is that
when we start analyzing the users of FMAs we need a MachineInstr and
need the method to be a member of X86InstrInfo.
avx512-fma.ll, avx512-fma-intrinsics.ll, avx512bwvl-intrinsics.ll:
- Updated 3 LIT tests. The new code has less instructions because of better memory-folding and better register coalescing.
Thank you,
Vyacheslav Klochkov (v_klochkov)
`
These macros create a lot of small static tables and repeated initialization code for the map. Is there some way we can create fewer, larger static tables. Maybe a reg and mem table, a reg only table, and a mem only table with each row containing a group of opcodes and the attributes. Then loop through those larger tables to populate the map? Just a thought. Should reduce number of relocation entries and code size for this file.
Could probably be done as a follow up patch.