VPTERNLOG is a ternary instruction with an immediate specifying the logical operation to perform. For each bit position in the three source vectors, the bits from the three sources are concatenated into a 3-bit value, which is used to select a bit in the immediate. The selected bit is written to the result vector.
We can commute this by swapping operands and modifying the immediate. To modify the immediate we need to swap two pairs of bits. The pairs correspond to the locations in the immediate where the commuted operands' bits have opposite values and the uncommuted operand's bit has the same value. Bits 0 and 7 will never be swapped since the relevant bits from all sources are the same value.
The first operand supplies the highest of the three bits (bit A) and the third operand supplies the lowest (bit C). So to swap the first and second operands, we need to swap the rows in the following table where A and B have different values and C has the same value. So bit 2 and bit 4 swap, and bit 3 and bit 5 swap.
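
As a rough illustration of the lookup described above, here is a scalar model operating on a single 64-bit lane. The function name `ternlog` is made up for this sketch and is not part of the patch; the model only assumes the bit ordering stated in this description (first source is the high index bit, third source is the low one).

```cpp
#include <cstdint>

// Scalar sketch of the per-bit lookup (hypothetical helper, not patch code).
// A, B and C stand for one 64-bit lane of the first, second and third source.
constexpr uint64_t ternlog(uint64_t A, uint64_t B, uint64_t C, uint8_t Imm) {
  uint64_t Result = 0;
  for (unsigned Bit = 0; Bit < 64; ++Bit) {
    // Concatenate the three source bits into a 3-bit index: A is the high bit.
    unsigned Index = (((A >> Bit) & 1) << 2) | (((B >> Bit) & 1) << 1) |
                     ((C >> Bit) & 1);
    // The indexed bit of the immediate becomes the result bit.
    uint64_t Selected = (Imm >> Index) & 1;
    Result |= Selected << Bit;
  }
  return Result;
}

// Feeding the patterns 0xF0, 0xCC, 0xAA makes bit i see index i, so the low
// byte of the result reproduces the immediate itself.
static_assert(ternlog(0xF0, 0xCC, 0xAA, 0xCA) == 0xCA, "truth-table check");
```

The immediate 0xCA used in the check encodes "select B where A is set, otherwise C", which is a convenient asymmetric function for testing commutation.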
Bit A | Bit B | Bit C | Immediate |
---|---|---|---|
0 | 0 | 0 | Imm[0] |
0 | 0 | 1 | Imm[1] |
0 | 1 | 0 | Imm[2] |
0 | 1 | 1 | Imm[3] |
1 | 0 | 0 | Imm[4] |
1 | 0 | 1 | Imm[5] |
1 | 1 | 0 | Imm[6] |
1 | 1 | 1 | Imm[7] |
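
The same rule applied to the other two operand pairs gives the remaining bit swaps. The sketch below derives all three from the table; the `CommutePair` enum and the helper names are hypothetical and chosen for illustration only, and they deliberately do not claim any correspondence with the case numbers used in the patch.

```cpp
#include <cstdint>

// Hypothetical names for the three operand pairs; not the patch's case numbers.
enum class CommutePair { FirstSecond, FirstThird, SecondThird };

// Exchange bits I and J of an 8-bit immediate.
constexpr uint8_t swapBits(uint8_t Imm, unsigned I, unsigned J) {
  unsigned Diff = ((Imm >> I) ^ (Imm >> J)) & 1; // 1 only if the bits differ
  return static_cast<uint8_t>(Imm ^ ((Diff << I) | (Diff << J)));
}

// Adjust the immediate so the operation is unchanged after swapping a pair
// of source operands. Each case swaps the rows where the two commuted bits
// differ and the remaining bit is equal.
constexpr uint8_t commuteImm(uint8_t Imm, CommutePair Pair) {
  switch (Pair) {
  case CommutePair::FirstSecond: // A <-> B: rows 010/100 and 011/101
    return swapBits(swapBits(Imm, 2, 4), 3, 5);
  case CommutePair::FirstThird:  // A <-> C: rows 001/100 and 011/110
    return swapBits(swapBits(Imm, 1, 4), 3, 6);
  case CommutePair::SecondThird: // B <-> C: rows 001/010 and 101/110
    return swapBits(swapBits(Imm, 1, 2), 5, 6);
  }
  return Imm; // unreachable for valid enum values
}

// 0xCA is "A ? B : C"; swapping the first two operands must give "B ? A : C".
static_assert(commuteImm(0xCA, CommutePair::FirstSecond) == 0xE2, "");
```

Because each case is a pair of independent bit swaps, applying the same commute twice restores the original immediate, and bits 0 and 7 are never touched.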
This patch reuses some of the code from FMA3 commuting since it is also a three source instruction. Most of findFMA3CommutedOpIndices is split out into findThreeSrcCommutedOpIndices to be reused to determine which operands can be commuted. I also changed it to use TSFlags bits for determining masked instructions since the FMA3Group attribute bits would not work for VPTERNLOG, but the TSFlag bits work for both.
For VPTERNLOG we call the new findThreeSrcCommutedOpIndices from findCommutedOpIndices with no additional processing.
The code from getFMA3OpcodeToCommuteOperands that determines which of the 3 possible cases is being requested is split out into a helper function, getThreeSrcCommuteCase, that returns 0, 1, or 2 for the case number if it's a valid case, or -1 if it's not.
X86InstrInfo::commuteInstructionImpl for VPTERNLOG calls getThreeSrcCommuteCase and, if it's a valid case, uses the case number to look up which bits to swap in the immediate and makes the modifications.
There appears to be an issue with the two-address instruction pass where it stops searching for additional commutable operands once it decides to swap the first two operands. It could look harder for a better commute. This deficiency can be seen in test cases where some vmovdqa64 instructions remain that could have been removed if the first and third operands were commuted instead of the first and second. I hope to address this in a future commit.
It may be good to mention here what the returned values 0, 1, 2 mean. I.e. 0 means that it is possible to commute operands SrcOp1 and SrcOp2, etc.