This is an archive of the discontinued LLVM Phabricator instance.

Improved X86-FMA3 mem-folding & coalescing
ClosedPublic

Authored by v_klochkov on Sep 29 2015, 1:50 PM.

Details

Summary

`
This change-set was initially included in a bigger change-set, http://reviews.llvm.org/D11370,
but the X86-FMA3-specific changes were removed from D11370 to simplify that change-set.

The changes proposed here implement optimal form selection (132/213/231)
for X86 FMA3 instructions and help to improve the memory-operand folding and coalescing
optimizations performed for X86 FMA instructions.

Better memory-folding and coalescing optimizations help to reduce
register pressure. The improvement from the changes can be seen in
an example like this:

    for (int i = 0; i < N; i += 1) {
        val1 = _mm_and_pd(val1, val5);
        val2 = _mm_and_pd(val2, val6);
        val3 = _mm_and_pd(val3, val7);
        val4 = _mm_and_pd(val4, val8);
        val5 = _mm_xor_pd(val1, val5);
        val6 = _mm_xor_pd(val2, val6);
        val7 = _mm_xor_pd(val3, val7);
        val8 = _mm_xor_pd(val4, val8);

        v_accu1 = _mm_fmadd_pd(v_accu1, x1_arr[i], val1);
        v_accu2 = _mm_fmadd_pd(v_accu2, x2_arr[i], val2);
        v_accu3 = _mm_fmadd_pd(v_accu3, x3_arr[i], val3);
        v_accu4 = _mm_fmadd_pd(v_accu4, x4_arr[i], val4);
        v_accu5 = _mm_fmadd_pd(v_accu5, x5_arr[i], val5);
        v_accu6 = _mm_fmadd_pd(v_accu6, x6_arr[i], val6);
        v_accu7 = _mm_fmadd_pd(v_accu7, x7_arr[i], val7);
        v_accu8 = _mm_fmadd_pd(v_accu8, x8_arr[i], val8);
    }


    ASM code BEFORE the changes:
        .LBB1_2:                                # %for.body.6
                                        #   Parent Loop BB1_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vmovapd %xmm0, -56(%rsp)        # 16-byte Spill
        vandpd  %xmm7, %xmm3, %xmm7
        vandpd  %xmm5, %xmm12, %xmm5
        vandpd  %xmm6, %xmm9, %xmm6
        vmovapd -40(%rsp), %xmm10       # 16-byte Reload
        vandpd  %xmm10, %xmm13, %xmm10
        vmovapd %xmm10, -40(%rsp)       # 16-byte Spill
        vxorpd  %xmm7, %xmm3, %xmm3
        vxorpd  %xmm5, %xmm12, %xmm12
        vxorpd  %xmm6, %xmm9, %xmm9
        vxorpd  %xmm10, %xmm13, %xmm13
        vmovapd %xmm8, %xmm0
        vmovapd x1_arr+8192(%rcx), %xmm8
        vmovapd -24(%rsp), %xmm1        # 16-byte Reload
        vfmadd213pd     %xmm7, %xmm8, %xmm1
        vmovapd %xmm1, -24(%rsp)        # 16-byte Spill
        vmovapd %xmm0, %xmm8
        vmovapd x2_arr+8192(%rcx), %xmm1
        vfmadd213pd     %xmm5, %xmm1, %xmm4
        vmovapd x3_arr+8192(%rcx), %xmm1
        vfmadd213pd     %xmm6, %xmm1, %xmm8
        vmovapd x4_arr+8192(%rcx), %xmm1
        vfmadd213pd     %xmm10, %xmm1, %xmm11
        vmovapd -56(%rsp), %xmm0        # 16-byte Reload
        vmovapd x5_arr+8192(%rcx), %xmm1
        vfmadd213pd     %xmm3, %xmm1, %xmm15
        vmovapd x6_arr+8192(%rcx), %xmm1
        vfmadd213pd     %xmm12, %xmm1, %xmm0
        vmovapd x7_arr+8192(%rcx), %xmm1
        vfmadd213pd     %xmm9, %xmm1, %xmm2
        vmovapd x8_arr+8192(%rcx), %xmm1
        vfmadd213pd     %xmm13, %xmm1, %xmm14
        addq    $16, %rcx
        jne     .LBB1_2

    ASM code WITH the new changes (about 30% faster):
        .LBB1_2:                                # %for.body.6
                                        #   Parent Loop BB1_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vandpd  %xmm7, %xmm3, %xmm7
        vandpd  %xmm5, %xmm2, %xmm5
        vandpd  %xmm6, %xmm0, %xmm6
        vandpd  %xmm1, %xmm4, %xmm1
        vxorpd  %xmm7, %xmm3, %xmm3
        vxorpd  %xmm5, %xmm2, %xmm2
        vxorpd  %xmm6, %xmm0, %xmm0
        vfmadd132pd     x1_arr+8192(%rcx), %xmm7, %xmm15
        vfmadd132pd     x2_arr+8192(%rcx), %xmm5, %xmm8
        vfmadd132pd     x3_arr+8192(%rcx), %xmm6, %xmm9
        vfmadd132pd     x4_arr+8192(%rcx), %xmm1, %xmm10
        vfmadd132pd     x5_arr+8192(%rcx), %xmm3, %xmm14
        vfmadd132pd     x6_arr+8192(%rcx), %xmm2, %xmm11
        vfmadd132pd     x7_arr+8192(%rcx), %xmm0, %xmm12
        vxorpd  %xmm1, %xmm4, %xmm4
        vfmadd132pd     x8_arr+8192(%rcx), %xmm4, %xmm13
        addq    $16, %rcx
        jne     .LBB1_2

This change-set also fixes an existing correctness problem caused
by commuting the 1st and 2nd operands of scalar FMAs generated for intrinsics.

   
For the FMA intrinsic call:

    __m128d foo(__m128d a, __m128d b, __m128d c) {
        // must return XMM0={b[127:64], a[63:0]*b[63:0]+c[63:0]}
        // but currently the returned value is XMM0={a[127:64], a[63:0]*b[63:0]+c[63:0]}
        return _mm_fmadd_sd(b, a, c);
    }

The Coalescer/TwoAddressInstructionPass swapped the 1st and 2nd operands
of SCALAR FMA and invalidated the higher bits of the result returned
from foo().
The change-set fixes that and prohibits swapping 1st and 2nd operands
of scalar FMAs.
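To make the upper-bits hazard concrete, here is a small standalone model of the scalar-FMA pass-through semantics. This is an illustration only, not LLVM code; the V2 struct and fmadd_sd helper are invented for this sketch.

```cpp
#include <cassert>  // for the checks below

// Model of a 128-bit XMM register holding two doubles.
struct V2 { double lo, hi; };

// Models _mm_fmadd_sd(x, y, z): the low element is x.lo*y.lo + z.lo,
// and the UPPER element is passed through from the FIRST operand only.
V2 fmadd_sd(V2 x, V2 y, V2 z) {
  return { x.lo * y.lo + z.lo, x.hi };
}
```

Swapping the first two operands leaves the low element unchanged (the multiplication commutes), but it replaces the pass-through upper half, which is exactly the miscompile described above.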

Swapping the 1st and 2nd operands of scalar FMAs may be possible and legal,
but only after a special analysis of the FMA's users. Such an optimization/analysis
can be implemented separately.
Another way is to separate FMA opcodes generated for FP operations
and FMA opcodes generated for FMA intrinsics as it is done now for ADD operations,
e.g. ADDSSrr vs ADDSSrr_Int. *_Int opcodes are handled more conservatively.
Being more conservative about commuting the 1st and 2nd operands of scalar FMAs
right now seems the better choice, as stability/correctness has higher priority.

With regard to performance, these changes are very good for vector/packed FMAs
(all source operands become commutable),
and neutral for scalar FMAs:
a) prohibit/disable commuting the 1st and 2nd operands,
b) enable commuting the 2nd and 3rd operands.
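As a cross-check of the form-selection idea, the three FMA3 forms can be modeled numerically. This is an illustration with invented helper names, not LLVM code; the destination register is tied to the first operand.

```cpp
// dst = dst * src3 + src2
double fma132(double d, double s2, double s3) { return d * s3 + s2; }
// dst = src2 * dst + src3
double fma213(double d, double s2, double s3) { return s2 * d + s3; }
// dst = src2 * src3 + dst
double fma231(double d, double s2, double s3) { return s2 * s3 + d; }
```

For packed FMAs, any pair of operands can be swapped provided the form is adjusted accordingly; e.g. FMA132 A, C, b computes the same value as FMA231 C, A, b, which is what lets the register allocator pick whichever form folds a memory operand best.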
`

Diff Detail

Event Timeline

v_klochkov updated this revision to Diff 36032.Sep 29 2015, 1:50 PM
v_klochkov retitled this revision from to Improved X86-FMA3 mem-folding & coalescing.
v_klochkov updated this object.
v_klochkov added a reviewer: qcolombet.
v_klochkov added a subscriber: llvm-commits.
ab requested changes to this revision.Sep 30 2015, 2:36 PM
ab added a reviewer: ab.
ab added a subscriber: ab.

Another way is to separate FMA opcodes generated for FP operations
and FMA opcodes generated for FMA intrinsics as it is done now for ADD operations,
e.g. ADDSSrr vs ADDSSrr_Int. *_Int opcodes are handled more conservatively.
Being more conservative in commuting 1st and 2nd operands of scalar FMAs
right now seems better choice as stability/correctness has higher priority.

You're right, _Int would work (and is intended for exactly this situation), but I disagree that we can avoid fixing that here. I'm probably the one who hates _Int the most, but currently, the fma scalar intrinsic patterns seem just plain wrong, and working around that here isn't proper, IMHO. You should add the _Int instructions before landing this patch.

As for getting rid of _Int in the long term, we have https://llvm.org/bugs/show_bug.cgi?id=23449 !

This revision now requires changes to proceed.Sep 30 2015, 2:36 PM
qcolombet requested changes to this revision.Sep 30 2015, 2:37 PM
qcolombet edited edge metadata.

Hi Slava,

Thanks for working on this.

Two main things:

  1. Could you explain the structure you used to describe the dependencies between the FMA opcode?

I do not want to reverse engineer it to review the patch!

  2. The bug you fix here for the operand we shouldn’t commute for the intrinsic lowering is, IMO, separate from improving the lowering for the commutable operand.

I.e., please address that in a separate patch. As for direction, you could use the *_Int approach, or better, but more involved, model correctly the intrinsic lowering by adding a subregister for FP32 from the VR128 and use insert_subreg.

Cheers,
-Quentin

llvm/lib/Target/X86/X86InstrInfo.cpp
3084

return false;

3086

llvm_unreachable("Opcode not handled by the switch");

3326

I would rewrite the check to make the CommuteAnyOperandIndex the first discrimination:
if (SrcOpIdx1 != CommuteAnyOperandIndex && (SrcOpIdx1 < 1 || SrcOpIdx1 > RegOpsNum))

return false;

Same for SrcOpIdx2.
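A compilable sketch of the suggested check, with CommuteAnyOperandIndex modeled here as a local sentinel (in LLVM the real constant lives on TargetInstrInfo) and isValidCommuteIdx an invented helper name:

```cpp
// Assumed sentinel standing in for TargetInstrInfo::CommuteAnyOperandIndex.
static const unsigned CommuteAnyOperandIndex = ~0U;

// Returns true if Idx is the "any operand" sentinel or a concrete
// register-operand index in [1, RegOpsNum].
bool isValidCommuteIdx(unsigned Idx, unsigned RegOpsNum) {
  // Discriminate on CommuteAnyOperandIndex first, as suggested above,
  // so the range test only runs for concrete indices.
  if (Idx != CommuteAnyOperandIndex && (Idx < 1 || Idx > RegOpsNum))
    return false;
  return true;
}
```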

3330

Add a comment saying that we look for two register operands; those are the ones that can be commuted regardless of the FMA opcode. We will adjust the opcode later.

3338

the last *register* operand of the instruction.

3365

Add a comment along the line:
// Check if we can adjust the opcode so that the registers we change preserve the semantic.

3375

Please explain the structure you are using here.
In particular, what are those dependencies and how do you represent them.

3479

Get rid of this check.
This is a hack to workaround a bug. The bug should be fixed independently of the improvement for the lowering for commutable operands.

3482

Canonicalize SrcOpIdx1 and SrcOpIdx2 to avoid these duplicated checks.
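The canonicalization suggested here can be as simple as ordering the two indices up front. A sketch; the function name is a placeholder, not the actual LLVM code:

```cpp
#include <algorithm>  // std::swap

// Ensure SrcOpIdx1 <= SrcOpIdx2 so the later form-selection logic only
// has to handle the ordered pairs (1,2), (1,3) and (2,3).
void canonicalizeIndices(unsigned &SrcOpIdx1, unsigned &SrcOpIdx2) {
  if (SrcOpIdx2 < SrcOpIdx1)
    std::swap(SrcOpIdx1, SrcOpIdx2);
}
```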

llvm/lib/Target/X86/X86InstrInfo.h
270

\p SrcOpIdx1 and \p SrcOpIdx2

llvm/test/CodeGen/X86/fma_patterns.ll
177

We shouldn’t regress those.

v_klochkov updated this revision to Diff 36312.Oct 1 2015, 3:23 PM
v_klochkov edited edge metadata.

Ahmed, Quentin,
Thank you for the quick code-review.

I am ok with having the correctness fix for FMAs to be arranged as a separate change-set.
The correctness fix is removed from this change-set.

Also, I did some additional changes + renaming + documenting in
getFMA3OpcodeToCommuteOperands() to make the code look simpler.

I would like to land this fix and then work on the correctness problem
that exists for scalar FMA intrinsics.

The simplest way is to add *_Int opcodes.

I am not sure I understood this idea

("adding a subregister for FP32 from the VR128 and use insert_subreg")

If there is a precedent (i.e. some similar scalar SIMD instruction) where that approach is used,
then I can try using that approach for FMAs.

Thank you,
Slava

I also did additional changes accordingly to reviewers' recommendations.

llvm/lib/Target/X86/X86InstrInfo.cpp
3084

Fixed.

3086

Fixed.

3326

I agree, your version looks a bit more clear. Fixed.

3330

Ok, I added a comment.

3338

It is interesting that I added the word "register" here when making changes for your previous comment,
and only then noticed this comment asking me to do exactly the same change.
Fixed.

3365

Ok, added a comment.

3375

This is just an array now that I have removed the IsScalar property.
I changed the comment to make it clearer.

3479

I removed this check from this change-set and updated the OpcodeAlts struct.
BTW, I would not call this check a hack. It was rather a pessimistic correctness check.

3482

Ok, fixed. SrcOpIdx1 now has the lower index, which simplifies the subsequent checks.

llvm/lib/Target/X86/X86InstrInfo.h
270

Fixed. I did not know about \p. Thank you for letting me know.

llvm/test/CodeGen/X86/fma_patterns.ll
177

Ok, Fixed.

Hi Slava,

Almost there. I think we would benefit from a bit of refactoring.
Tell me if you disagree.

Thanks,
-Quentin

llvm/lib/Target/X86/X86InstrInfo.cpp
3381

Excellent!

3456

Maybe put three and adapt the comparison.

3459

Initialize FormIndex to LastFormIndex.

3460

Use for (GroupIndex = 0; GroupIndex < OpcodeGroupsNum && FormIndex == LastFormIndex; GroupIndex++)

3461

Use < LastFormIndex

3467

Remove this block.

3471

Use ==

3477

No need for {}

3484

I like the comments!
Keep them, but I believe we can refactor the code a bit.

What you basically have is: given the indices, you map each FormIndex to a new FormIndex, and you have three choices each time.
It looks to me like we could save this table statically, like:

static const unsigned Mapping[][3] = {
  // SrcOpIdx1 == 1 && SrcOpIdx2 == 2
  // FMA132 A, C, b; ==> FMA231 C, A, b;
  // FMA213 B, A, c; ==> FMA213 A, B, c;
  // FMA231 C, A, b; ==> FMA132 A, C, b;
  { Form231Index, Form213Index, Form132Index },
  // (SrcOpIdx1 == 1 && SrcOpIdx2 == 3)
  // etc.
  // (SrcOpIdx1 == 2 && SrcOpIdx2 == 3)
};

unsigned Case;
if (SrcOpIdx1 == 1 && SrcOpIdx2 == 2)
  Case = 0;
else if (SrcOpIdx1 == 1 && SrcOpIdx2 == 3)
  Case = 1;
else if (SrcOpIdx1 == 2 && SrcOpIdx2 == 3)
  Case = 2;
else
  assert(false && "Invalid commutable indices for FMA"); // or return 0;

RetOpc = Mapping[Case][FormIndex];
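The suggested table can be completed into a compilable form. Row 0 and its comments come from the review; rows 1 and 2 are derived here from the 132/213/231 form semantics and are this sketch's assumption, as is the commuteForm helper name; the real code lives in X86InstrInfo.cpp.

```cpp
enum { Form132Index, Form213Index, Form231Index, FormsNum };

// Mapping[Case][OldFormIndex] gives the new form after commuting the
// operand pair identified by Case.
static const unsigned Mapping[3][3] = {
  // Case 0: SrcOpIdx1 == 1 && SrcOpIdx2 == 2 (from the review)
  { Form231Index, Form213Index, Form132Index },
  // Case 1: SrcOpIdx1 == 1 && SrcOpIdx2 == 3 (derived)
  { Form132Index, Form231Index, Form213Index },
  // Case 2: SrcOpIdx1 == 2 && SrcOpIdx2 == 3 (derived)
  { Form213Index, Form132Index, Form231Index },
};

// Returns the adjusted form index, or FormsNum for a non-commutable pair.
unsigned commuteForm(unsigned SrcOpIdx1, unsigned SrcOpIdx2,
                     unsigned FormIndex) {
  unsigned Case;
  if (SrcOpIdx1 == 1 && SrcOpIdx2 == 2)
    Case = 0;
  else if (SrcOpIdx1 == 1 && SrcOpIdx2 == 3)
    Case = 1;
  else if (SrcOpIdx1 == 2 && SrcOpIdx2 == 3)
    Case = 2;
  else
    return FormsNum;
  return Mapping[Case][FormIndex];
}
```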

v_klochkov updated this revision to Diff 36570.Oct 5 2015, 5:02 PM
v_klochkov edited edge metadata.
v_klochkov marked 9 inline comments as done.

Additional minor changes + code-restructuring in getFMA3OpcodeToCommuteOperands().

Hi Quentin,

Thank you for the very useful comments.
I updated the change-set. Please check the additional changes.

Thank you,
Slava

llvm/lib/Target/X86/X86InstrInfo.cpp
3456

Ok, but using 3 instead of 2 required renaming LastFormIndex to FormsNum.

3460

Ok. BTW, this change required a fix for GroupIndex after the loop (do GroupIndex-- once).

3484

This is a great idea! I did the corresponding changes.

qcolombet accepted this revision.Oct 6 2015, 9:58 AM
qcolombet edited edge metadata.

Hi Slava,

LGTM.

Regarding fixing the commute bug, forget about the subregister idea I was talking about. There is no precedent, and on second thought I am not sure it is the right thing to do. In other words, stick to the *_Int approach when you get to it.

Thanks,
-Quentin

RKSimon added inline comments.
llvm/test/CodeGen/X86/fma-commute-x86.ll
5

Please can you regenerate with update_llc_test_checks.py to ensure that no extra mov or other instructions are being included.

Hi Quentin,

I do not have permission to commit changes to LLVM trunk.
Andy (akaylor), who committed the changes for me previously, reasonably refused to commit
this change-set as it is now because of Ahmed Bougacha’s comment:

"IMHO, implement FMA*_Int instructions before landing this change-set."

That is probably a good move, as the current fix can potentially produce regressions on tests where operands 1 and 3 of scalar FMA intrinsics should not be commuted but could be commuted by the new changes.
Also, my senior colleagues recommended implementing FMA*_Int instructions first.

I am ready to submit a new code-review tracker for FMA*_Int instructions.
The corresponding changes are already implemented and tested locally.

Adding FMA*_Int instructions will cause conflicts in the X86InstrFMA.td file.

Do you agree with the following plan?

  1. Add the FMA*_Int instructions first;
  2. Update this change-set (D13269), i.e. resolve the conflicts in X86InstrFMA.td, and ask for an additional review of the changes in that file;
  3. In order to minimize the patch for (1), implement the commute optimization for FMA*_Int as a separate change-set.

I am asking you because Ahmed stopped answering in this code-review tracker and also did not answer
e-mails sent to him separately.

Thank you,
Slava

ab added a comment.Oct 12 2015, 5:01 PM

Slava,

That plan sounds good to me; I think that's what both Quentin and myself imagined in our original replies.

Please go ahead and submit the patches for _Int FMA instructions.
As for this review, I'll second Simon's comment regarding the tests: please use the utils/update_llc_test_checks.py script. It helps generate CHECK lines in a way that's both strict and maintainable.

Thanks for working on this, and sorry for the delay!
-Ahmed

Hi Ahmed,

Thank you for the answer.

I have submitted a new patch (FMA*_Int opcodes) for code-review.
http://reviews.llvm.org/D13710
Please see (D13710) for details.

So, this patch (D13269) is now waiting for code-review/approval/check-in of the changes for D13710.

Thank you,
Slava

v_klochkov updated this revision to Diff 39395.Nov 5 2015, 12:25 PM
v_klochkov edited edge metadata.

Resolved the conflicts in X86InstrFMA.td and updated the fma-commute-x86.ll test.

Hi,

I created the FMA*_Int opcodes in the patch for D13710, and it has been committed to LLVM trunk.
That patch conflicted with the changes I made here in X86InstrFMA.td.
Thus, I had to update my local workspace and upload a new, re-based patch this time.

Please review the updated changes in X86InstrFMA.td.
I would like to comment on some additional changes I made to resolve the conflicts and to simplify the opcode definitions:

  • The parameters 'IsRVariantCommutable' and 'IsMVariantCommutable' were removed because otherwise they would always be set to 1.
  • Moved some comments from the fma3{p,s}_forms multiclasses to the fma3{p,s}_rm multiclasses (the fma3{p,s}_forms multiclasses no longer mention the commute features, so having those comments there seemed inappropriate).

Also, the reviewers asked me to use update_llc_test_checks.py for the new test fma-commute-x86.ll.
For some unknown reason that tool did not work for me; it printed an error for any input test.
So, I just added the better checks to the test manually.

This patch does not implement commute transformations for FMA*_Int opcodes.
That can/should be done in a separate patch.

Thank you,
Slava

qcolombet added inline comments.Nov 5 2015, 1:41 PM
llvm/lib/Target/X86/X86InstrFMA.td
37

Why are you changing the hasSideEffects semantics here?
I am not saying it is false; it just seems out of scope for this patch. What am I missing?

57

Although that is not incorrect, why do you get rid of that {?
You did the same later in the patch. Again, this is not wrong, but this kind of clean-up adds noise to the review. I.e., if you want to do such clean-ups, which I am not sure I agree with, that should be a separate patch.

162

Ditto.

194

Ditto.

Please accept my apologies for adding noise to the code-review process.
Ok, no more clean-ups at late code-review phases in the future.

I could revert the clean-up changes I made, but I am not sure that would help the review process now that I have already uploaded them.
If I should remove the clean-ups, no problem, I'll do that. Please let me know.

Regarding the '{' and '}' used for 'let' statements and removed in the Nov05 patch:
Quentin, from your comment I did not understand whether you disagree with removing the braces.
Please let me know if the braces should be restored, and I'll update the patch.

Thank you,
Slava

llvm/lib/Target/X86/X86InstrFMA.td
37

If you compare the Nov05 and Oct05 patches, I just moved the setting of 'hasSideEffects = 0' from line 64 (Oct05) to here. That made the fma3p_forms multiclass more compact. It did not change the semantics compared to the Oct05 patch.

Sorry if it added noise to the code-review process.

57

I am sorry for adding noise to the code-review process by doing clean-ups in patches that fix already uploaded patches, as I did here. I shouldn't have removed 'IsMVariantCommutable' and 'IsRVariantCommutable'.

After I already removed those parameters, and set isCommutable = 1 at the line 18, I had to choose between:

a) to update the line 57 (Oct05 rev):
 "// Constraints = "$src1 = $dst" --> "// Constraints = "$src1 = $dst, isCommutable = 1"

or

b) to just remove line 57 and eliminate the need to update it in the future when/if the 'let' statement above is changed again.

Both variants required minor changes and I chose (b). Do you recommend restoring the '{' and '} // ....' used for 'let' statements?

qcolombet accepted this revision.Nov 5 2015, 6:00 PM
qcolombet edited edge metadata.

Hi Slava,

Thanks for the clarifications.
And don’t worry about the “noise” part, I was just mentioning it for future reference :).

from your comment I did not understand whether you disagree with removing the braces.

Heh, that’s the thing: I had mixed feelings about that, but now I have made up my mind and I am fine with the patch with the additional clean-ups you did.

Thanks for your patience and all your work on that!

Do you have commit access now or do you want me to commit on your behalf?

Cheers,
-Quentin

llvm/lib/Target/X86/X86InstrFMA.td
37

Oh, I was comparing from the initial source to the current update, i.e., unless I am mistaken there is a semantic change:

defm r213 : fma3p_rm<opc213,
                     !strconcat(OpcodeStr, "213", PackTy),
                     MemFrag128, MemFrag256, OpTy128, OpTy256, Op>;

That being said, I believe the change is correct and I am fine with it being embedded in this patch.

57

That’s fine, we can remove those.

This revision was automatically updated to reflect the committed changes.