Page MenuHomePhabricator

[X86][AVX] Remove "OptForSize" condition from some memory foldings.

Authored by aymanmus on Jan 15 2017, 6:20 AM.



These instruction included a partial update of the output register in their SSE versions.
AVX, and AVX512 versions of the same instructions canceled the partial register update by passing an additional operand, that it's value will be copied to the rest of the output register.
No partial register update => The folding should be performed normally (not only when optimizing for size).

Diff Detail

Event Timeline

aymanmus updated this revision to Diff 84490.Jan 15 2017, 6:20 AM
aymanmus retitled this revision from to [X86][AVX] Remove "OptForSize" condition from some memory foldings..
aymanmus updated this object.
aymanmus added reviewers: igorb, zvi, craig.topper.
aymanmus added a subscriber: llvm-commits.
zvi added a subscriber: myatsina.Jan 15 2017, 7:20 AM
zvi edited edge metadata.Jan 15 2017, 7:30 AM

LGTM, but please wait for an ok from other reviewers.


consider merging Predicates scope. You can do this as a NFC follow-up commit.


Is this extra line intentional?

aymanmus marked an inline comment as done.Jan 15 2017, 7:35 AM
aymanmus added inline comments.

UseAVX and HasAVX are not the same predicates.

igorb added inline comments.Jan 17 2017, 1:15 AM

this one should not be removed , only remove predicate OptForSize.
please add test case, i think llvm.sqrt.f32 can be used ( please insure AVX512 instruction selected)


this one should not be removed, only remove predicate, please add test case.


this one should not be removed, only remove predicate, please add test case.

RKSimon added inline comments.Jan 17 2017, 4:03 AM

Isn't it only the _Int variants that need TB_NO_REVERSE ?


All these changes to X86InstrInfo.cpp should probably be split off - add them to the stack-folding-*.ll tests

aymanmus marked 3 inline comments as done.Jan 18 2017, 5:22 AM
aymanmus added inline comments.

TB_NO_REVERSE is used when the load size of the memory form is smaller than the register's size of the register form. In that case the unfolding is not legal since it will produce a load with the registers size.


Do you mean these changes should be in a separate patch?
If I understand you correctly, I think the changes in X86InstrInfo.cpp is relevant to the patch since the main change here is to enable the folding of few instructions. Adding these entries does exactly that.

aymanmus updated this revision to Diff 84823.Jan 18 2017, 5:24 AM
  • Add tests
RKSimon added inline comments.Jan 18 2017, 10:21 AM

I maybe reading the avx512_fp14_s definitions incorrectly but it appears to be using the scalar f32x_info type not the packed equivalents. @craig.topper / @igorb please can you confirm?


OK - but please add them to the stack-folding-*.ll tests as well.

aymanmus added inline comments.Jan 19 2017, 12:36 AM

That's right, but if you track the definitions more deeply you can see that the "rr" version is defined to have an xmm register operand while the "rm" version is defined to have a 32-bit memory operand.
Unfolding from the memory form to the register form will generate a 128-bit load (because of the xmm register), which we want to avoid.


Already added (in this patch).

RKSimon added inline comments.Jan 19 2017, 6:54 AM

Please keep these in alphabetical order to make it easier to track missing instructions.

craig.topper added inline comments.Jan 20 2017, 8:54 PM

Can leave a TODO here. We really should fix that so the we have a separate VRCP14SSrr_Int that uses the XMM type like we do for similar instructions.


This doesn't need a Requires. There's no pattern.


Not sure this even needs a Requires. There's no pattern defined.


Will this fit on the previous line now?


This now has a register read dependency on xmm0 which I believe is what the OptForSize was originally protecting against. I know work has gone into ExeDepFix for UndefRegClearance. Do we believe that is sufficient to allow this folding now?

aymanmus marked 4 inline comments as done.Jan 30 2017, 6:30 AM
aymanmus added inline comments.

The OptForSize was incorrectly copied from the SSE version multiclass, where the memory form instructions performed a partial register updates (optimization guide states that partial updates come with a penalty and thus should be avoided).
I'm afraid I didn't understand how the memory folding adds read dependency on xmm0 in this case, the read dependency was already there.

aymanmus updated this revision to Diff 86278.Jan 30 2017, 6:31 AM
craig.topper added inline comments.Jan 30 2017, 7:16 AM

Yes there was a read of xmm0 on the dart but it was from the movss instruction which wrote all bits. After this patch the xmm0 read is dependent on an unknown instruction.

craig.topper added inline comments.Jan 30 2017, 7:29 AM

Oops. Autocorrect turned sqrt into dart.

aymanmus added inline comments.Jan 31 2017, 1:46 AM

I see what you mean.
I think this is a general problem that should not be discussed in this patch.
The reason the folding was disabled before has nothing related to this issue. And the problem is not specific for this instruction.
IMHO this patch should not consider this kind of problems as it should be dealt with as a separate manner.

craig.topper edited edge metadata.Jan 31 2017, 9:09 PM

I'm not sure I believe the OptForSize was naively copied from SSE. Someone went to the trouble of using AVX instructions with the correct number of operands in this comment block that your patch removes. It demonstrates exactly the issue I was pointing out.

// We don't want to fold scalar loads into these instructions unless
// optimizing for size. This is because the folded instruction will have a
// partial register update, while the unfolded sequence will not, e.g.
// vmovss mem, %xmm0
// vrcpss %xmm0, %xmm0, %xmm0
// which has a clobber before the rcp, vs.
// vrcpss mem, %xmm0, %xmm0
// TODO: In theory, we could fold the load, and avoid the stall caused by
// the partial register store, either in ExeDepFix or with smarter RA.

After consulting an architect about the general problem, this is the answer I got:

For the following sequence:

vmovss (%rax), %xmm0
vsqrtss %xmm0, %xmm0, %xmm0

memory folding should be avoided (to avoid generating new read dependency).
But for the following sequence:

vmovss (%rax), %xmm1
vsqrtss %xmm1, %xmm0, %xmm1

the memory folded sequence is better performance wise (read dependency is already there).

This applies to all AVX/AVX512 scalar instruction which accepts an extra input operand and copies it's upper part to the upper part of the output operand.
Adding OptForSize in this case, disables the folding of these specific instructions only, in all cases.
While the ideal way of dealing with this is:

  1. Distinguishing between the 2 cases (listed above), and then deciding whether to fold or not.
  2. Apply this on all AVX/AVX512 scalar instructions with this behavior.

Other instructions with the same behavior (like vscalefss and vreducess) do not have any folding patterns and are not included in the folding tables, which means folding is not allowed at all (we can improve that).
So what I suggest is not committing this patch (even though the OptForSize is there from the wrong reason) in order to avoid performance degradation, and open a bug on this issue.

Do you agree with that?

aymanmus abandoned this revision.Feb 12 2017, 5:47 AM