The main goal of the change-set is to improve Memory-operand folding
and Coalescing optimizations performed for X86 FMA instructions
(Described in (1) below).
Unfortunately, that could not be done without interface changes
to the methods findCommutedOpIndices() and commuteInstruction().
(Described in (3) below).
Minor changes in non-X86 target sources (PowerPC, ARM and AMDGPU) were
required because the new commuteInstruction() method with 2 additional
operands is now called from llvm/lib/CodeGen/* classes common to all targets.
The fix is rather big because it combines (1) and (3) in one
change-set. An alternative would be to split the change-set into 2 parts:
- interface changes (described in (3) below)
- improvement of X86 FMA form selection (described in (1) below).
(1) Implemented optimal form selection (132/213/231) for X86 FMA instructions
to improve Memory-Folding/Ciscization and Coalescing optimizations performed
for FMAs. The change-set allows commuting any pair of FMA operands: 1st and 2nd,
1st and 3rd, or 2nd and 3rd.
Previously, only the 1st and 2nd operands could be commuted.
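For reference (a reminder of the ISA semantics, not something introduced by
the change-set): in the FMA3 encoding the result always overwrites the 1st
operand and only the last source operand can come from memory, so the chosen
form determines which value can be folded from memory and which register is
tied to the result:
vfmadd132pd dst, src2, src3/mem  =>  dst = dst  * src3 + src2
vfmadd213pd dst, src2, src3/mem  =>  dst = src2 * dst  + src3
vfmadd231pd dst, src2, src3/mem  =>  dst = src2 * src3 + dst
For example, for v_accu = fma(v_accu, mem, val) only the 132 form both keeps
v_accu tied to the destination and lets the loaded operand be folded, and
that is the form selected in the second ASM listing below.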
Better Memory-folding and Coalescing optimizations help to reduce
register pressure. The improvement from the changes can be shown with
the following example:
for (int i = 0; i < N; i += 1) {
val1 = _mm_and_pd(val1, val5);
val2 = _mm_and_pd(val2, val6);
val3 = _mm_and_pd(val3, val7);
val4 = _mm_and_pd(val4, val8);
val5 = _mm_xor_pd(val1, val5);
val6 = _mm_xor_pd(val2, val6);
val7 = _mm_xor_pd(val3, val7);
val8 = _mm_xor_pd(val4, val8);
v_accu1 = _mm_fmadd_pd(v_accu1, x1_arr[i], val1);
v_accu2 = _mm_fmadd_pd(v_accu2, x2_arr[i], val2);
v_accu3 = _mm_fmadd_pd(v_accu3, x3_arr[i], val3);
v_accu4 = _mm_fmadd_pd(v_accu4, x4_arr[i], val4);
v_accu5 = _mm_fmadd_pd(v_accu5, x5_arr[i], val5);
v_accu6 = _mm_fmadd_pd(v_accu6, x6_arr[i], val6);
v_accu7 = _mm_fmadd_pd(v_accu7, x7_arr[i], val7);
v_accu8 = _mm_fmadd_pd(v_accu8, x8_arr[i], val8);
}
ASM code BEFORE the changes:
.LBB1_2: # %for.body.6
# Parent Loop BB1_1 Depth=1
# => This Inner Loop Header: Depth=2
vmovapd %xmm0, -56(%rsp) # 16-byte Spill
vandpd %xmm7, %xmm3, %xmm7
vandpd %xmm5, %xmm12, %xmm5
vandpd %xmm6, %xmm9, %xmm6
vmovapd -40(%rsp), %xmm10 # 16-byte Reload
vandpd %xmm10, %xmm13, %xmm10
vmovapd %xmm10, -40(%rsp) # 16-byte Spill
vxorpd %xmm7, %xmm3, %xmm3
vxorpd %xmm5, %xmm12, %xmm12
vxorpd %xmm6, %xmm9, %xmm9
vxorpd %xmm10, %xmm13, %xmm13
vmovapd %xmm8, %xmm0
vmovapd x1_arr+8192(%rcx), %xmm8
vmovapd -24(%rsp), %xmm1 # 16-byte Reload
vfmadd213pd %xmm7, %xmm8, %xmm1
vmovapd %xmm1, -24(%rsp) # 16-byte Spill
vmovapd %xmm0, %xmm8
vmovapd x2_arr+8192(%rcx), %xmm1
vfmadd213pd %xmm5, %xmm1, %xmm4
vmovapd x3_arr+8192(%rcx), %xmm1
vfmadd213pd %xmm6, %xmm1, %xmm8
vmovapd x4_arr+8192(%rcx), %xmm1
vfmadd213pd %xmm10, %xmm1, %xmm11
vmovapd -56(%rsp), %xmm0 # 16-byte Reload
vmovapd x5_arr+8192(%rcx), %xmm1
vfmadd213pd %xmm3, %xmm1, %xmm15
vmovapd x6_arr+8192(%rcx), %xmm1
vfmadd213pd %xmm12, %xmm1, %xmm0
vmovapd x7_arr+8192(%rcx), %xmm1
vfmadd213pd %xmm9, %xmm1, %xmm2
vmovapd x8_arr+8192(%rcx), %xmm1
vfmadd213pd %xmm13, %xmm1, %xmm14
addq $16, %rcx
jne .LBB1_2
ASM code WITH the new changes (about 30% faster):
.LBB1_2: # %for.body.6
# Parent Loop BB1_1 Depth=1
# => This Inner Loop Header: Depth=2
vandpd %xmm7, %xmm3, %xmm7
vandpd %xmm5, %xmm2, %xmm5
vandpd %xmm6, %xmm0, %xmm6
vandpd %xmm1, %xmm4, %xmm1
vxorpd %xmm7, %xmm3, %xmm3
vxorpd %xmm5, %xmm2, %xmm2
vxorpd %xmm6, %xmm0, %xmm0
vfmadd132pd x1_arr+8192(%rcx), %xmm7, %xmm15
vfmadd132pd x2_arr+8192(%rcx), %xmm5, %xmm8
vfmadd132pd x3_arr+8192(%rcx), %xmm6, %xmm9
vfmadd132pd x4_arr+8192(%rcx), %xmm1, %xmm10
vfmadd132pd x5_arr+8192(%rcx), %xmm3, %xmm14
vfmadd132pd x6_arr+8192(%rcx), %xmm2, %xmm11
vfmadd132pd x7_arr+8192(%rcx), %xmm0, %xmm12
vxorpd %xmm1, %xmm4, %xmm4
vfmadd132pd x8_arr+8192(%rcx), %xmm4, %xmm13
addq $16, %rcx
jne .LBB1_2

(2) Fixed a correctness problem caused by commuting the 1st and 2nd operands of
scalar FMAs generated for intrinsics. The problem gets fixed
AUTOMATICALLY/for-free by the proposed changes for (1).
For the FMA intrinsic call:
__m128d foo(__m128d a, __m128d b, __m128d c) {
// must return XMM0={b[127:64], a[63:0]*b[63:0]+c[63:0]}
return _mm_fmadd_sd(b, a, c);
}
The Coalescer/TwoAddressInstructionPass swapped the 1st and 2nd operands
of the SCALAR FMA and invalidated the higher bits of the result returned
from foo().
The change-set fixes that and prohibits swapping the 1st and 2nd operands
of scalar FMAs.
Swapping the 1st and 2nd operands of scalar FMAs is possible and legal,
but only after a special analysis of the FMA's users. Such an
optimization/analysis can be implemented separately.
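For illustration only (a hypothetical example, not part of the change-set):
if the only use of the scalar FMA result is its low element, the swap cannot
be observed, because the multiplication is commutative and the pass-through
upper bits are never read:
__m128d bar(__m128d a, __m128d b, __m128d c);
double bar_low(__m128d a, __m128d b, __m128d c) {
  // Only the low 64 bits of the FMA result are consumed by _mm_cvtsd_f64,
  // so the upper bits passed through from the 1st operand are never read
  // and swapping the 1st and 2nd operands would be safe here.
  return _mm_cvtsd_f64(_mm_fmadd_sd(b, a, c));
}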
(3) The changes performed for (1) and (2) could not be implemented without
an interface change in 2 methods of the TargetInstrInfo class and its derived classes:
bool TargetInstrInfo::findCommutedOpIndices(MachineInstr *MI,
unsigned &SrcOpIdx1,
unsigned &SrcOpIdx2) const;
The arguments SrcOpIdx1 and SrcOpIdx2 were previously used only for OUTPUT from
the method.
Now they are INPUT and OUTPUT arguments.
The INPUT values specify the indices of the operands that are wanted to be swapped.
The input value ~0U gives findCommutedOpIndices() the freedom to pick
any commutable operand (i.e. it defines the _old_ behaviour of the method).
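A rough sketch of that contract (a hypothetical helper, not the actual
implementation in the change-set; CommutableOpIdx1/CommutableOpIdx2 are
placeholder names for the pair of operands a target knows to be commutable
for the given instruction):
static bool resolveCommutedOpIndices(unsigned &SrcOpIdx1, unsigned &SrcOpIdx2,
                                     unsigned CommutableOpIdx1,
                                     unsigned CommutableOpIdx2) {
  if (SrcOpIdx1 == ~0U && SrcOpIdx2 == ~0U) {
    // Old behaviour: the method is free to pick any commutable pair.
    SrcOpIdx1 = CommutableOpIdx1;
    SrcOpIdx2 = CommutableOpIdx2;
    return true;
  }
  if (SrcOpIdx1 == ~0U || SrcOpIdx2 == ~0U) {
    // One index is fixed by the caller; the other one is "don't care".
    unsigned &Fixed = (SrcOpIdx1 == ~0U) ? SrcOpIdx2 : SrcOpIdx1;
    unsigned &Free  = (SrcOpIdx1 == ~0U) ? SrcOpIdx1 : SrcOpIdx2;
    if (Fixed == CommutableOpIdx1)
      Free = CommutableOpIdx2;
    else if (Fixed == CommutableOpIdx2)
      Free = CommutableOpIdx1;
    else
      return false; // The requested operand is not commutable.
    return true;
  }
  // Both indices are fixed by the caller; they must match the commutable pair.
  return (SrcOpIdx1 == CommutableOpIdx1 && SrcOpIdx2 == CommutableOpIdx2) ||
         (SrcOpIdx1 == CommutableOpIdx2 && SrcOpIdx2 == CommutableOpIdx1);
}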
MachineInstr *TargetInstrInfo::commuteInstruction(MachineInstr *MI,
bool NewMI,
unsigned Idx1,
unsigned Idx2) const;
Two arguments Idx1 and Idx2 were added to the method; they specify
the operands to be swapped/commuted.
The old commuteInstruction() method did not let you ask to commute the
1st and 3rd operands or the 2nd and 3rd operands.
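With the new interface, that becomes a direct request (a hypothetical caller
snippet, not code from the change-set):
unsigned Idx1 = 2; // want to commute the 2nd operand ...
unsigned Idx2 = 3; // ... with the 3rd operand
if (findCommutedOpIndices(MI, Idx1, Idx2))
  commuteInstruction(MI, false, Idx1, Idx2);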
The changes in TwoAddressInstructionPass.cpp show how the updated methods
can be used to recover optimization opportunities that were previously
missed.
Readability and risky assumptions.
Previously, something similar to this sequence was used in several places:
unsigned Idx1;
unsigned Idx2;
if (findCommutedOpIndices(MI, Idx1, Idx2) &&
(Idx1 == 1 || Idx2 == 1)) {
commuteInstruction(MI, false); //!!! how can we know that Idx1
// and Idx2 are commuted here?
<do something with Idx1 and Idx2 operands here>
}

The updated functions allow writing clearer, safer and more readable code:
unsigned Idx1 = 1 /*want to commute 1st operand*/;
unsigned Idx2 = ~0U /*Don't care, choose any other operand*/;
if (findCommutedOpIndices(MI, Idx1, Idx2)) {
commuteInstruction(MI, false, Idx1, Idx2);
<do something with Idx1 and Idx2 operands here>
}

The old commuteInstruction() method, which does not specify the commuted
operands, was not removed because removing it would require
restructuring/improvements in some other places (similar to what was done
in TwoAddressInstructionPass), which cannot be done in this change-set as
the current version of the change-set is already too big.