The main goal of the change-set is to improve the memory-operand folding
and coalescing optimizations performed for X86 FMA instructions
(described in (1) below).
Unfortunately, that could not be done without interface changes to the
methods findCommutedOpIndices() and commuteInstruction()
(described in (3) below).
The minor changes in the non-X86 target sources (PowerPC, ARM, and AMDGPU)
were required because the new commuteInstruction() method, which takes two
additional operand-index arguments, is now called from target-independent
code in llvm/lib/CodeGen/*.
The fix is rather big because (1) and (3) are combined in one change-set.
The alternative would be to split it into two parts:
- the interface changes (described in (3) below);
- the improvement of X86 FMA form selection (described in (1) below).
(1) Implemented optimal form selection (132/213/231) for X86 FMA instructions
to improve the memory-folding/ciscization and coalescing optimizations
performed for FMAs. The change-set allows commuting any pair of FMA operands:
1st and 2nd, 1st and 3rd, or 2nd and 3rd. Previously, only the 1st and 2nd
operands could be commuted. Better memory-folding and coalescing help to
reduce register pressure. The improvement from the changes can be shown on
an example like this:

    for (int i = 0; i < N; i += 1) {
      val1 = _mm_and_pd(val1, val5);
      val2 = _mm_and_pd(val2, val6);
      val3 = _mm_and_pd(val3, val7);
      val4 = _mm_and_pd(val4, val8);
      val5 = _mm_xor_pd(val1, val5);
      val6 = _mm_xor_pd(val2, val6);
      val7 = _mm_xor_pd(val3, val7);
      val8 = _mm_xor_pd(val4, val8);
      v_accu1 = _mm_fmadd_pd(v_accu1, x1_arr[i], val1);
      v_accu2 = _mm_fmadd_pd(v_accu2, x2_arr[i], val2);
      v_accu3 = _mm_fmadd_pd(v_accu3, x3_arr[i], val3);
      v_accu4 = _mm_fmadd_pd(v_accu4, x4_arr[i], val4);
      v_accu5 = _mm_fmadd_pd(v_accu5, x5_arr[i], val5);
      v_accu6 = _mm_fmadd_pd(v_accu6, x6_arr[i], val6);
      v_accu7 = _mm_fmadd_pd(v_accu7, x7_arr[i], val7);
      v_accu8 = _mm_fmadd_pd(v_accu8, x8_arr[i], val8);
    }

ASM code BEFORE the changes:

    .LBB1_2:                              # %for.body.6
                                          # Parent Loop BB1_1 Depth=1
                                          # => This Inner Loop Header: Depth=2
        vmovapd %xmm0, -56(%rsp)          # 16-byte Spill
        vandpd  %xmm7, %xmm3, %xmm7
        vandpd  %xmm5, %xmm12, %xmm5
        vandpd  %xmm6, %xmm9, %xmm6
        vmovapd -40(%rsp), %xmm10         # 16-byte Reload
        vandpd  %xmm10, %xmm13, %xmm10
        vmovapd %xmm10, -40(%rsp)         # 16-byte Spill
        vxorpd  %xmm7, %xmm3, %xmm3
        vxorpd  %xmm5, %xmm12, %xmm12
        vxorpd  %xmm6, %xmm9, %xmm9
        vxorpd  %xmm10, %xmm13, %xmm13
        vmovapd %xmm8, %xmm0
        vmovapd x1_arr+8192(%rcx), %xmm8
        vmovapd -24(%rsp), %xmm1          # 16-byte Reload
        vfmadd213pd %xmm7, %xmm8, %xmm1
        vmovapd %xmm1, -24(%rsp)          # 16-byte Spill
        vmovapd %xmm0, %xmm8
        vmovapd x2_arr+8192(%rcx), %xmm1
        vfmadd213pd %xmm5, %xmm1, %xmm4
        vmovapd x3_arr+8192(%rcx), %xmm1
        vfmadd213pd %xmm6, %xmm1, %xmm8
        vmovapd x4_arr+8192(%rcx), %xmm1
        vfmadd213pd %xmm10, %xmm1, %xmm11
        vmovapd -56(%rsp), %xmm0          # 16-byte Reload
        vmovapd x5_arr+8192(%rcx), %xmm1
        vfmadd213pd %xmm3, %xmm1, %xmm15
        vmovapd x6_arr+8192(%rcx), %xmm1
        vfmadd213pd %xmm12, %xmm1, %xmm0
        vmovapd x7_arr+8192(%rcx), %xmm1
        vfmadd213pd %xmm9, %xmm1, %xmm2
        vmovapd x8_arr+8192(%rcx), %xmm1
        vfmadd213pd %xmm13, %xmm1, %xmm14
        addq    $16, %rcx
        jne     .LBB1_2

ASM code WITH the new changes (about 30% faster):

    .LBB1_2:                              # %for.body.6
                                          # Parent Loop BB1_1 Depth=1
                                          # => This Inner Loop Header: Depth=2
        vandpd  %xmm7, %xmm3, %xmm7
        vandpd  %xmm5, %xmm2, %xmm5
        vandpd  %xmm6, %xmm0, %xmm6
        vandpd  %xmm1, %xmm4, %xmm1
        vxorpd  %xmm7, %xmm3, %xmm3
        vxorpd  %xmm5, %xmm2, %xmm2
        vxorpd  %xmm6, %xmm0, %xmm0
        vfmadd132pd x1_arr+8192(%rcx), %xmm7, %xmm15
        vfmadd132pd x2_arr+8192(%rcx), %xmm5, %xmm8
        vfmadd132pd x3_arr+8192(%rcx), %xmm6, %xmm9
        vfmadd132pd x4_arr+8192(%rcx), %xmm1, %xmm10
        vfmadd132pd x5_arr+8192(%rcx), %xmm3, %xmm14
        vfmadd132pd x6_arr+8192(%rcx), %xmm2, %xmm11
        vfmadd132pd x7_arr+8192(%rcx), %xmm0, %xmm12
        vxorpd  %xmm1, %xmm4, %xmm4
        vfmadd132pd x8_arr+8192(%rcx), %xmm4, %xmm13
        addq    $16, %rcx
        jne     .LBB1_2
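To make the form selection concrete, here is a rough sketch of how the three
forms map to each other under operand swaps (my own illustration in plain
C++, not the actual X86InstrInfo code; the enum and function names are
hypothetical):

    #include <cassert>

    // The form digits name which source operands are multiplied and which
    // one is added (operand 1 is also the destination):
    //   132: op1 = op1 * op3 + op2
    //   213: op1 = op2 * op1 + op3
    //   231: op1 = op2 * op3 + op1
    // The memory operand is always operand 3, so choosing the right form
    // decides which of the three values can be folded from memory.
    enum FMAForm { Form132, Form213, Form231 };

    // Which form computes the same value after swapping source operands
    // OpIdx1 and OpIdx2 (1-based, OpIdx1 < OpIdx2)? For each pair one form
    // is unchanged, because its two multiplied operands commute; the other
    // two forms map to each other.
    FMAForm commuteFMAForm(FMAForm Form, unsigned OpIdx1, unsigned OpIdx2) {
      assert(OpIdx1 >= 1 && OpIdx1 < OpIdx2 && OpIdx2 <= 3);
      if (OpIdx1 == 1 && OpIdx2 == 2) // 132 <-> 231, 213 stays 213
        return Form == Form132 ? Form231
                               : Form == Form231 ? Form132 : Form213;
      if (OpIdx1 == 1 && OpIdx2 == 3) // 213 <-> 231, 132 stays 132
        return Form == Form213 ? Form231
                               : Form == Form231 ? Form213 : Form132;
      /* OpIdx1 == 2, OpIdx2 == 3 */  // 132 <-> 213, 231 stays 231
      return Form == Form132 ? Form213
                             : Form == Form213 ? Form132 : Form231;
    }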
(2) Fixed a correctness problem caused by commuting the 1st and 2nd operands
of scalar FMAs generated for intrinsics. The problem gets fixed
automatically/for free by the changes proposed for (1).
For the FMA intrinsic call:

    __m128d foo(__m128d a, __m128d b, __m128d c) {
      // must return XMM0 = {b[127:64], a[63:0]*b[63:0] + c[63:0]}
      return _mm_fmadd_sd(b, a, c);
    }

the Coalescer/TwoAddressInstructionPass swapped the 1st and 2nd operands
of SCALAR FMA and invalidated the higher bits of the result returned
from foo().
The change-set fixes that and prohibits swapping 1st and 2nd operands
of scalar FMAs.
Swapping the 1st and 2nd operands of scalar FMAs is possible and legal,
but only after a special analysis of the FMA's users. Such an
optimization/analysis can be implemented separately.
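To see the scalar semantics concretely, here is a small standalone demo (my
own illustration, not part of the change-set; assumes an FMA-capable CPU and
compilation with -mfma):

    // fma_sd_demo.cpp: the upper 64 bits of _mm_fmadd_sd's result come
    // from its FIRST argument, so _mm_fmadd_sd(b, a, c) and
    // _mm_fmadd_sd(a, b, c) agree in the low element but differ in the
    // high one.
    #include <immintrin.h>
    #include <cstdio>

    int main() {
      __m128d a = _mm_set_pd(10.0, 2.0); // {hi=10.0, lo=2.0}
      __m128d b = _mm_set_pd(20.0, 3.0); // {hi=20.0, lo=3.0}
      __m128d c = _mm_set_pd(30.0, 4.0); // {hi=30.0, lo=4.0}

      __m128d r1 = _mm_fmadd_sd(b, a, c); // {20.0, 3.0*2.0 + 4.0}
      __m128d r2 = _mm_fmadd_sd(a, b, c); // {10.0, 2.0*3.0 + 4.0}

      double hi1, hi2;
      _mm_storeh_pd(&hi1, r1);
      _mm_storeh_pd(&hi2, r2);
      // Low elements match (10.0); high elements do not: 20.0 vs 10.0.
      printf("r1.hi = %f, r2.hi = %f\n", hi1, hi2);
      return 0;
    }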
(3) The changes performed for (1) and (2) could not be implemented without
an interface change in two methods of the TargetInstrInfo class and its
child classes:

    bool TargetInstrInfo::findCommutedOpIndices(MachineInstr *MI,
                                                unsigned &SrcOpIdx1,
                                                unsigned &SrcOpIdx2) const;

Previously, the operands SrcOpIdx1 and SrcOpIdx2 were used only for OUTPUT
from the method. Now they are INPUT and OUTPUT arguments. The INPUT values
specify the indices of the operands the caller wants to swap. The input
value ~0U gives findCommutedOpIndices() the freedom to pick any commutable
operand (i.e. it defines the _old_ behaviour of the method).

    MachineInstr *TargetInstrInfo::commuteInstruction(MachineInstr *MI,
                                                      bool NewMI,
                                                      unsigned Idx1,
                                                      unsigned Idx2) const;

The two arguments Idx1 and Idx2 were added to this method; they specify
the operands to be swapped/commuted. The old commuteInstruction() method
did not let you ask to commute the
1st and 3rd operands or 2nd and 3rd operands.
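As a sketch of the resulting INPUT/OUTPUT protocol (simplified for
illustration; the helper name and exact shape are my own, not necessarily
what the change-set contains), an implementation of findCommutedOpIndices()
could reconcile the caller's request with the commutable pair it found like
this:

    // ResultIdx1/ResultIdx2 hold the caller's request on entry (~0U means
    // "any operand is acceptable in this slot") and the resolved indices
    // on a successful exit. CommutableIdx1/CommutableIdx2 are the pair of
    // operands the target determined to be commutable.
    static bool fixCommutedOpIndices(unsigned &ResultIdx1,
                                     unsigned &ResultIdx2,
                                     unsigned CommutableIdx1,
                                     unsigned CommutableIdx2) {
      if (ResultIdx1 == ~0U && ResultIdx2 == ~0U) {
        // The caller does not care: report the pair found by the target.
        ResultIdx1 = CommutableIdx1;
        ResultIdx2 = CommutableIdx2;
      } else if (ResultIdx1 == ~0U) {
        // One index is fixed; it must be one of the commutable operands.
        if (ResultIdx2 == CommutableIdx1)
          ResultIdx1 = CommutableIdx2;
        else if (ResultIdx2 == CommutableIdx2)
          ResultIdx1 = CommutableIdx1;
        else
          return false;
      } else if (ResultIdx2 == ~0U) {
        if (ResultIdx1 == CommutableIdx1)
          ResultIdx2 = CommutableIdx2;
        else if (ResultIdx1 == CommutableIdx2)
          ResultIdx2 = CommutableIdx1;
        else
          return false;
      } else {
        // Both indices are fixed: they must be exactly the commutable pair.
        return (ResultIdx1 == CommutableIdx1 &&
                ResultIdx2 == CommutableIdx2) ||
               (ResultIdx1 == CommutableIdx2 &&
                ResultIdx2 == CommutableIdx1);
      }
      return true;
    }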
The changes in TwoAddressInstructionPass.cpp show how the updated methods
can be used to recover optimization opportunities that were previously
missed.
Readability and risky assumptions.
Previously, something similar to this sequence was used in several places:

    unsigned Idx1;
    unsigned Idx2;
    if (findCommutedOpIndices(MI, Idx1, Idx2) &&
        (Idx1 == 1 || Idx2 == 1)) {
      commuteInstruction(MI, false); //!!! how can we know that Idx1
                                     //    and Idx2 are commuted here?
      <do something with Idx1 and Idx2 operands here>
    }
The updated functions allow writing clearer, safer, and more readable code:

    unsigned Idx1 = 1;   // Want to commute the 1st operand.
    unsigned Idx2 = ~0U; // Don't care, choose any other operand.
    if (findCommutedOpIndices(MI, Idx1, Idx2)) {
      commuteInstruction(MI, false, Idx1, Idx2);
      <do something with Idx1 and Idx2 operands here>
    }
The old commuteInstruction() method that does not specify the commuted
operands was not removed, as removing it would require restructuring and
improvements in some other places (similar to what was done in
TwoAddressInstructionPass), which cannot be done here because the current
version of the change-set is already too big.
Please define a static constant for this magic value.
I do not like magic values wandering around without context.
I do not have a good name for it right now though; maybe UndefinedIndex?
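Something along these lines, perhaps (the name is only the suggestion above,
purely illustrative):

    // In TargetInstrInfo (or wherever the sentinel is introduced):
    // "any commutable operand is acceptable in this INPUT slot".
    static const unsigned UndefinedIndex = ~0U;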