These instruction included a partial update of the output register in their SSE versions.
AVX, and AVX512 versions of the same instructions canceled the partial register update by passing an additional operand, that it's value will be copied to the rest of the output register.
No partial register update => The folding should be performed normally (not only when optimizing for size).
Details
- Reviewers
RKSimon zvi craig.topper igorb
Diff Detail
Event Timeline
lib/Target/X86/X86InstrSSE.td | ||
---|---|---|
3457 | UseAVX and HasAVX are not the same predicates. |
lib/Target/X86/X86InstrAVX512.td | ||
---|---|---|
7189 | this one should not be removed , only remove predicate OptForSize. | |
7209 | this one should not be removed, only remove predicate, please add test case. | |
7214 | this one should not be removed, only remove predicate, please add test case. |
lib/Target/X86/X86InstrInfo.cpp | ||
---|---|---|
1928 | TB_NO_REVERSE is used when the load size of the memory form is smaller than the register's size of the register form. In that case the unfolding is not legal since it will produce a load with the registers size. | |
1932 | Do you mean these changes should be in a separate patch? |
lib/Target/X86/X86InstrInfo.cpp | ||
---|---|---|
1928 | I maybe reading the avx512_fp14_s definitions incorrectly but it appears to be using the scalar f32x_info type not the packed equivalents. @craig.topper / @igorb please can you confirm? | |
1932 | OK - but please add them to the stack-folding-*.ll tests as well. |
lib/Target/X86/X86InstrInfo.cpp | ||
---|---|---|
1928 | That's right, but if you track the definitions more deeply you can see that the "rr" version is defined to have an xmm register operand while the "rm" version is defined to have a 32-bit memory operand. | |
1932 | Already added (in this patch). |
test/CodeGen/X86/stack-folding-fp-avx512.ll | ||
---|---|---|
758 ↗ | (On Diff #84823) | Please keep these in alphabetical order to make it easier to track missing instructions. |
lib/Target/X86/X86InstrInfo.cpp | ||
---|---|---|
1928 | Can leave a TODO here. We really should fix that so the we have a separate VRCP14SSrr_Int that uses the XMM type like we do for similar instructions. | |
lib/Target/X86/X86InstrSSE.td | ||
1789 | This doesn't need a Requires. There's no pattern. | |
1855 | Not sure this even needs a Requires. There's no pattern defined. | |
1866 | Will this fit on the previous line now? | |
test/CodeGen/X86/avx-arith.ll | ||
353 | This now has a register read dependency on xmm0 which I believe is what the OptForSize was originally protecting against. I know work has gone into ExeDepFix for UndefRegClearance. Do we believe that is sufficient to allow this folding now? |
test/CodeGen/X86/avx-arith.ll | ||
---|---|---|
353 | The OptForSize was incorrectly copied from the SSE version multiclass, where the memory form instructions performed a partial register updates (optimization guide states that partial updates come with a penalty and thus should be avoided). |
test/CodeGen/X86/avx-arith.ll | ||
---|---|---|
353 | Yes there was a read of xmm0 on the dart but it was from the movss instruction which wrote all bits. After this patch the xmm0 read is dependent on an unknown instruction. |
test/CodeGen/X86/avx-arith.ll | ||
---|---|---|
353 | Oops. Autocorrect turned sqrt into dart. |
test/CodeGen/X86/avx-arith.ll | ||
---|---|---|
353 | I see what you mean. |
I'm not sure I believe the OptForSize was naively copied from SSE. Someone went to the trouble of using AVX instructions with the correct number of operands in this comment block that your patch removes. It demonstrates exactly the issue I was pointing out.
// We don't want to fold scalar loads into these instructions unless // optimizing for size. This is because the folded instruction will have a // partial register update, while the unfolded sequence will not, e.g. // vmovss mem, %xmm0 // vrcpss %xmm0, %xmm0, %xmm0 // which has a clobber before the rcp, vs. // vrcpss mem, %xmm0, %xmm0 // TODO: In theory, we could fold the load, and avoid the stall caused by // the partial register store, either in ExeDepFix or with smarter RA.
After consulting an architect about the general problem, this is the answer I got:
For the following sequence:
vmovss (%rax), %xmm0 vsqrtss %xmm0, %xmm0, %xmm0
memory folding should be avoided (to avoid generating new read dependency).
But for the following sequence:
vmovss (%rax), %xmm1 vsqrtss %xmm1, %xmm0, %xmm1
the memory folded sequence is better performance wise (read dependency is already there).
This applies to all AVX/AVX512 scalar instruction which accepts an extra input operand and copies it's upper part to the upper part of the output operand.
Adding OptForSize in this case, disables the folding of these specific instructions only, in all cases.
While the ideal way of dealing with this is:
- Distinguishing between the 2 cases (listed above), and then deciding whether to fold or not.
- Apply this on all AVX/AVX512 scalar instructions with this behavior.
Other instructions with the same behavior (like vscalefss and vreducess) do not have any folding patterns and are not included in the folding tables, which means folding is not allowed at all (we can improve that).
So what I suggest is not committing this patch (even though the OptForSize is there from the wrong reason) in order to avoid performance degradation, and open a bug on this issue.
Do you agree with that?
this one should not be removed , only remove predicate OptForSize.
please add test case, i think llvm.sqrt.f32 can be used ( please insure AVX512 instruction selected)