
Improved the interface of methods commuting operands, improved X86-FMA3 mem-folding&coalescing.
ClosedPublic

Authored by v_klochkov on Jul 20 2015, 2:21 PM.

Details

Summary

The main goal of this change-set is to improve the memory-operand folding
and coalescing optimizations performed for X86 FMA instructions
(described in (1) below).

Unfortunately, that could not be done without the interface changes
to the methods findCommutedOpIndices() and commuteInstruction()
(described in (3) below).
The minor changes in the non-X86 target sources (PowerPC, ARM and AMDGPU)
were required because the new commuteInstruction() method, with 2 additional
operands, is now called from llvm/lib/CodeGen/* classes common to all targets.

The fix is pretty big because (1) and (3) are combined in one
change-set. The alternative would be to split it into 2 parts:

  • the interface changes (described in (3) below)
  • the improvement of X86 FMA form selection (described in (1) below).

(1) Implemented optimal form selection (132/213/231) for X86 FMA instructions

   to improve the memory-folding/ciscization and coalescing optimizations
   performed for FMAs. The change-set allows commuting any pair of FMA
   operands: 1st and 2nd, 1st and 3rd, or 2nd and 3rd.
   Previously, only the 1st and 2nd operands could be commuted.

   Better memory-folding and coalescing optimizations help to reduce
   register pressure. The improvement can be illustrated by this example:

        for (int i = 0; i < N; i += 1) {
            val1 = _mm_and_pd(val1, val5);
            val2 = _mm_and_pd(val2, val6);
            val3 = _mm_and_pd(val3, val7);
            val4 = _mm_and_pd(val4, val8);
            val5 = _mm_xor_pd(val1, val5);
            val6 = _mm_xor_pd(val2, val6);
            val7 = _mm_xor_pd(val3, val7);
            val8 = _mm_xor_pd(val4, val8);

            v_accu1 = _mm_fmadd_pd(v_accu1, x1_arr[i], val1);
            v_accu2 = _mm_fmadd_pd(v_accu2, x2_arr[i], val2);
            v_accu3 = _mm_fmadd_pd(v_accu3, x3_arr[i], val3);
            v_accu4 = _mm_fmadd_pd(v_accu4, x4_arr[i], val4);
            v_accu5 = _mm_fmadd_pd(v_accu5, x5_arr[i], val5);
            v_accu6 = _mm_fmadd_pd(v_accu6, x6_arr[i], val6);
            v_accu7 = _mm_fmadd_pd(v_accu7, x7_arr[i], val7);
            v_accu8 = _mm_fmadd_pd(v_accu8, x8_arr[i], val8);
        }


    ASM code BEFORE the changes:
        .LBB1_2:                                # %for.body.6
                                        #   Parent Loop BB1_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vmovapd %xmm0, -56(%rsp)        # 16-byte Spill
        vandpd  %xmm7, %xmm3, %xmm7
        vandpd  %xmm5, %xmm12, %xmm5
        vandpd  %xmm6, %xmm9, %xmm6
        vmovapd -40(%rsp), %xmm10       # 16-byte Reload
        vandpd  %xmm10, %xmm13, %xmm10
        vmovapd %xmm10, -40(%rsp)       # 16-byte Spill
        vxorpd  %xmm7, %xmm3, %xmm3
        vxorpd  %xmm5, %xmm12, %xmm12
        vxorpd  %xmm6, %xmm9, %xmm9
        vxorpd  %xmm10, %xmm13, %xmm13
        vmovapd %xmm8, %xmm0
        vmovapd x1_arr+8192(%rcx), %xmm8
        vmovapd -24(%rsp), %xmm1        # 16-byte Reload
        vfmadd213pd     %xmm7, %xmm8, %xmm1
        vmovapd %xmm1, -24(%rsp)        # 16-byte Spill
        vmovapd %xmm0, %xmm8
        vmovapd x2_arr+8192(%rcx), %xmm1
        vfmadd213pd     %xmm5, %xmm1, %xmm4
        vmovapd x3_arr+8192(%rcx), %xmm1
        vfmadd213pd     %xmm6, %xmm1, %xmm8
        vmovapd x4_arr+8192(%rcx), %xmm1
        vfmadd213pd     %xmm10, %xmm1, %xmm11
        vmovapd -56(%rsp), %xmm0        # 16-byte Reload
        vmovapd x5_arr+8192(%rcx), %xmm1
        vfmadd213pd     %xmm3, %xmm1, %xmm15
        vmovapd x6_arr+8192(%rcx), %xmm1
        vfmadd213pd     %xmm12, %xmm1, %xmm0
        vmovapd x7_arr+8192(%rcx), %xmm1
        vfmadd213pd     %xmm9, %xmm1, %xmm2
        vmovapd x8_arr+8192(%rcx), %xmm1
        vfmadd213pd     %xmm13, %xmm1, %xmm14
        addq    $16, %rcx
        jne     .LBB1_2

    ASM code WITH the new changes (about 30% faster):
        .LBB1_2:                                # %for.body.6
                                        #   Parent Loop BB1_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vandpd  %xmm7, %xmm3, %xmm7
        vandpd  %xmm5, %xmm2, %xmm5
        vandpd  %xmm6, %xmm0, %xmm6
        vandpd  %xmm1, %xmm4, %xmm1
        vxorpd  %xmm7, %xmm3, %xmm3
        vxorpd  %xmm5, %xmm2, %xmm2
        vxorpd  %xmm6, %xmm0, %xmm0
        vfmadd132pd     x1_arr+8192(%rcx), %xmm7, %xmm15
        vfmadd132pd     x2_arr+8192(%rcx), %xmm5, %xmm8
        vfmadd132pd     x3_arr+8192(%rcx), %xmm6, %xmm9
        vfmadd132pd     x4_arr+8192(%rcx), %xmm1, %xmm10
        vfmadd132pd     x5_arr+8192(%rcx), %xmm3, %xmm14
        vfmadd132pd     x6_arr+8192(%rcx), %xmm2, %xmm11
        vfmadd132pd     x7_arr+8192(%rcx), %xmm0, %xmm12
        vxorpd  %xmm1, %xmm4, %xmm4
        vfmadd132pd     x8_arr+8192(%rcx), %xmm4, %xmm13
        addq    $16, %rcx
        jne     .LBB1_2
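For reference (background information, not part of the change-set): the loop above switches from the 213 form to the 132 form. The three FMA3 forms differ only in which source operands feed the multiply and the add, which is why selecting a form is equivalent to commuting operands. A minimal scalar model, using plain doubles instead of the real packed intrinsics:

```cpp
#include <cassert>

// Scalar model of the three X86 FMA3 forms. Operand 1 is also the
// destination register; the digits name which operands are multiplied
// and which one is added:
//   vfmadd213: op1 = op2 * op1 + op3
//   vfmadd132: op1 = op1 * op3 + op2
//   vfmadd231: op1 = op2 * op3 + op1
double fma213(double op1, double op2, double op3) { return op2 * op1 + op3; }
double fma132(double op1, double op2, double op3) { return op1 * op3 + op2; }
double fma231(double op1, double op2, double op3) { return op2 * op3 + op1; }

// Commuting operands 2 and 3 of the 213 form yields the 132 form:
//   fma213(a, b, c) == fma132(a, c, b)
// Commuting operands 1 and 3 of the 213 form yields the 231 form:
//   fma213(a, b, c) == fma231(c, b, a)
```

Because the memory operand of an FMA3 instruction is always the last source in the encoding, choosing among the forms lets the compiler place a particular value into the position that can be folded from memory, which is what the improvement above exploits.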

(2) Fixed a correctness problem caused by commuting the 1st and 2nd operands of

   scalar FMAs generated for intrinsics. The problem gets fixed
   AUTOMATICALLY/for free by the changes proposed for (1).
   
   For the FMA intrinsic call:

       __m128d foo(__m128d a, __m128d b, __m128d c) {
         // must return XMM0={b[127:64], a[63:0]*b[63:0]+c[63:0]}
         return _mm_fmadd_sd(b, a, c);
       }

    the Coalescer/TwoAddressInstructionPass swapped the 1st and 2nd operands
of the SCALAR FMA and invalidated the upper bits of the result returned
from foo().
The change-set fixes that and prohibits swapping the 1st and 2nd operands
of scalar FMAs.

Swapping the 1st and 2nd operands of scalar FMAs is possible and legal,
but only after a special analysis of the FMA's users. Such an
optimization/analysis can be implemented separately.
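The upper-bits hazard can be illustrated with a toy model (illustrative only, not code from the change-set) of _mm_fmadd_sd on two-element vectors:

```cpp
#include <cassert>

// Two-element "vector": v[0] is the low double, v[1] the high double.
struct V2 { double v[2]; };

// Model of _mm_fmadd_sd(a, b, c):
//   result[0] = a[0]*b[0] + c[0]   (the scalar fused multiply-add)
//   result[1] = a[1]               (upper bits pass through from operand 1)
// Swapping operands 1 and 2 keeps the low element the same (a*b == b*a)
// but changes whose upper bits are passed through.
V2 fmadd_sd(V2 a, V2 b, V2 c) {
  return V2{{a.v[0] * b.v[0] + c.v[0], a.v[1]}};
}
```

With a = {2, 100}, b = {3, 200}, c = {5, 0}: fmadd_sd(a, b, c) and fmadd_sd(b, a, c) agree in the low element (11.0) but carry 100.0 vs. 200.0 in the high element, which is exactly the bug the pass used to introduce.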

(3) The changes performed for (1) and (2) could not be implemented without

   an interface change in 2 methods of the TargetInstrInfo class and its
   child classes:
       
       bool TargetInstrInfo::findCommutedOpIndices(MachineInstr *MI,
                                                   unsigned &SrcOpIdx1,
                                                   unsigned &SrcOpIdx2) const;

       Previously, the operands SrcOpIdx1 and SrcOpIdx2 were used only for
       OUTPUT from the method.
       Now they are INPUT and OUTPUT arguments.
       The INPUT values specify the indices of the operands that the caller
       wants to swap. The input value ~0U gives findCommutedOpIndices() the
       freedom to pick any commutable operand (i.e. it preserves the _old_
       behaviour of the method).
	   
       MachineInstr *TargetInstrInfo::commuteInstruction(MachineInstr *MI,
                                                         bool NewMI,
                                                         unsigned Idx1,
                                                         unsigned Idx2) const;

       Two arguments, Idx1 and Idx2, were added to the method; they specify
       the operands to be swapped/commuted.
	    
    The old commuteInstruction() method did not let you ask to commute
    the 1st and 3rd operands or the 2nd and 3rd operands.

    The changes in TwoAddressInstructionPass.cpp show how the updated methods
    can be used to fix optimization opportunities that were previously missed.

Readability and risky assumptions.
Previously, something similar to this sequence was used in several places:

        unsigned Idx1;
        unsigned Idx2;
        if (findCommutedOpIndices(MI, Idx1, Idx2) &&
            (Idx1 == 1 || Idx2 == 1)) {
          commuteInstruction(MI, false); //!!! how can we know that Idx1
                                         //    and Idx2 are commuted here?
          <do something with Idx1 and Idx2 operands here>
        }

The updated functions allow writing clearer, safer and more readable code:

        unsigned Idx1 = 1;   /* want to commute the 1st operand */
        unsigned Idx2 = ~0U; /* don't care, choose any other operand */
        if (findCommutedOpIndices(MI, Idx1, Idx2)) {
          commuteInstruction(MI, false, Idx1, Idx2);
          <do something with Idx1 and Idx2 operands here>
        }
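The reconciliation between caller-requested indices and the pair the target found commutable can be sketched as a standalone helper (a hypothetical sketch; the discussion below mentions a similar helper, fixCommutedOpIndices, but the exact code here is illustrative):

```cpp
#include <cassert>

const unsigned CommuteAnyOperandIndex = ~0U; // wildcard input value

// Reconcile the caller-requested indices (ResultIdx1/2, possibly wildcards)
// with the operand pair the target found commutable (CommutableOpIdx1/2).
// Returns false if a concrete request does not match the commutable pair.
bool fixCommutedOpIndices(unsigned &ResultIdx1, unsigned &ResultIdx2,
                          unsigned CommutableOpIdx1,
                          unsigned CommutableOpIdx2) {
  if (ResultIdx1 == CommuteAnyOperandIndex &&
      ResultIdx2 == CommuteAnyOperandIndex) {
    // Both wildcards: take whatever the target found.
    ResultIdx1 = CommutableOpIdx1;
    ResultIdx2 = CommutableOpIdx2;
  } else if (ResultIdx1 == CommuteAnyOperandIndex) {
    // One concrete index: it must be in the pair; the wildcard becomes
    // the partner operand.
    if (ResultIdx2 != CommutableOpIdx1 && ResultIdx2 != CommutableOpIdx2)
      return false;
    ResultIdx1 = (ResultIdx2 == CommutableOpIdx1) ? CommutableOpIdx2
                                                  : CommutableOpIdx1;
  } else if (ResultIdx2 == CommuteAnyOperandIndex) {
    if (ResultIdx1 != CommutableOpIdx1 && ResultIdx1 != CommutableOpIdx2)
      return false;
    ResultIdx2 = (ResultIdx1 == CommutableOpIdx1) ? CommutableOpIdx2
                                                  : CommutableOpIdx1;
  } else {
    // Both indices concrete: they must match the pair in some order.
    if ((ResultIdx1 != CommutableOpIdx1 || ResultIdx2 != CommutableOpIdx2) &&
        (ResultIdx1 != CommutableOpIdx2 || ResultIdx2 != CommutableOpIdx1))
      return false;
  }
  return true;
}
```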

The old commuteInstruction() method, which does not specify the commuted
operands, was not removed, as removing it would require restructuring and
improvements in some other places (similar to what was done in
TwoAddressInstructionPass), which cannot be done in this change-set as it
is already too big.

Diff Detail

Repository
rL LLVM

Event Timeline

v_klochkov updated this revision to Diff 30193.Jul 20 2015, 2:21 PM
v_klochkov updated this object.Jul 20 2015, 2:53 PM
v_klochkov edited edge metadata.
v_klochkov added a subscriber: llvm-commits.
v_klochkov updated this object.Jul 20 2015, 2:56 PM
delena added a subscriber: delena.Jul 21 2015, 12:22 AM

Hi Slava,

I added some comments, but I'm definitely not a reviewer. I suggest sending a mail to llvm-dev saying that you want to improve FMA code and asking for people who can review the patch.
Do you have an "OK" for the upstreaming?

llvm/lib/Target/X86/X86InstrInfo.cpp
2931 ↗(On Diff #30193)

I suggest changing "Do not call" to "If you call".

2940 ↗(On Diff #30193)

Please remove one empty line.

3169 ↗(On Diff #30193)

Looks huge. I'm not sure, but maybe these enums are in ABC order and we could compare against the first and last?
Or auto-generate something?

3378 ↗(On Diff #30193)

This method may be static. Right?

3387 ↗(On Diff #30193)

I suggest separating scalar from vector; you handle them separately, right?
I also think you could put all the FMA tables in a separate header file.

3546 ↗(On Diff #30193)

I think this interface is inconvenient. I suggest separating input and output. You can use ~0U as the default value of the input.

Some comments on the proposed interfaces:

llvm/include/llvm/Target/TargetInstrInfo.h
97–109 ↗(On Diff #30193)

This is just a helper used to implement some of the other commuting methods. It should not be public and virtual; it doesn't even access any state and could be static or a helper function elsewhere.

274–277 ↗(On Diff #30193)

I think just adding Idx1/Idx2 to the existing method should be enough. The two-operand commutable case will also work with that, and it is less confusing than having two overloaded variants around.

289–312 ↗(On Diff #30193)

This interface looks complicated: two functions for querying commutability with different "query" styles. I think this should be simplified to a function that simply returns all operands that are commutable. I would imagine something like:

bool findCommutedOpIndices(MachineInstr *MI, SmallVectorImpl<unsigned> &CommutableOperandNums) const;

or

ArrayRef<unsigned> findCommutedOpIndices(MachineInstr *MI) const;

qcolombet edited edge metadata.Jul 29 2015, 10:31 AM

Hi Vyacheslav,

I second Matthias' comments.

Just a few more things, see the inline comments.

Thanks,
-Quentin

llvm/include/llvm/Target/TargetInstrInfo.h
100 ↗(On Diff #30193)

Please define a static constant for this magic value.
I do not like magic values wandering around without context.
I do not have a good name right now, though; maybe UndefinedIndex?

289–312 ↗(On Diff #30193)

I think that for the general case, instead of having a list of unsigned, we would need a list of pairs of unsigned.

Elena, Matthias, Quentin,

Thank you for the code review and comments.
The interface of the functions findCommutedOpIndices() and areOpsCommutable()
raised the biggest questions.

Combining Elena's idea with some of Quentin's comments regarding
the magic consts gives this first alternative:

const unsigned CommuteAnyOperandIndex = ~0U;
bool findCommutedOpIndices(MachineInstr *MI, 
                           unsigned &OpIdx1,
                           unsigned &OpIdx2,
                           unsigned MustCommuteIdx1 = CommuteAnyOperandIndex,
                           unsigned MustCommuteIdx2 = CommuteAnyOperandIndex);

The second alternative is to have a method returning all pairs
of commutable operands:

bool findCommutedOpIndices(MachineInstr *MI, 
                           SmallVectorImpl<_some_type_containing_a_pair_of_unsigned_indices_>);

(I am not sure whether I should define a new class/struct or reuse some existing type for the <pair of indices>,
but that's OK; let this question wait until we decide which alternative seems better.)

I really like Elena's idea; please let me explain why.
Reason (a) below is the most important one.

a) In many cases we only need to know whether some known operand is commutable with others.
   For example, RegisterCoalescer.cpp wants to know if 'UseOpIdx' can be swapped with something else;
   TwoAddressInstructionPass.cpp wants to know if 'BaseOpIdx' is commutable with something else.
   In some other cases we may even know both operands that need to be commuted and do not care
   about the commutativity of the other operands.

   So, the first alternative provides a flexible instrument that tells us just what we need to know,
   while the second alternative makes findCommutedOpIndices() collect information that often is
   not needed later.

   For example, suppose I want to know if the 2nd and 3rd operands of an FMA are commutable.
   The 1st alternative just gives me the answer:
     if (findCommutedOpIndices(MI, Idx1, Idx2, 2, 3)) {}
   The 2nd alternative would probably return 3 pairs: <1,2>, <1,3>, <2,3>.
   I would not only need to find the desired <2,3> among the 3 pairs, but I would
   also be asking findCommutedOpIndices() to do potentially expensive analysis regarding
   the commutativity of the 1st operand (if the FMA is scalar, the 1st operand is commutable
   only if its users use only the lowest element of the XMM).

b) The 1st alternative makes the change-set a bit more compact than it is now,
   while the 2nd would require additional changes that would further complicate the places
   where findCommutedOpIndices() is called now.

c) It would be very simple to remove the method areOpsCommutable() if it seems redundant.
   Its calls could easily be replaced with calls of findCommutedOpIndices().
   For example:
     if (areOpsCommutable(MI, 1, 2)) {}
   ->
     if (findCommutedOpIndices(MI, Idx1, Idx2, 1, 2)) {}

Please let me know if you agree with this reasoning and if you are OK with Elena's idea.

Thanks,
Slava

llvm/lib/Target/X86/X86InstrInfo.cpp
2931 ↗(On Diff #30193)

This "Do not call" comment was moved here from the old version of include/llvm/Target/TargetInstrInfo.h.
TargetInstrInfo::commuteInstruction() has an assert verifying that MI is commutable.
Taking that assert into account, this comment seems quite precise.

2940 ↗(On Diff #30193)

OK, I removed it; the updated version of the change-set will have this fix.

3169 ↗(On Diff #30193)

Unfortunately it is huge, I agree.
Comparing against the first and last, or making assumptions about how and in which order the opcodes
were defined, seems a very risky approach that could cause unexpected effects/errors in the future.
I am pretty sure that we should not go this way.

I considered the idea of having a special bit for FMAs (something similar to the fields defined in
llvm/include/llvm/Target/Target.td: isReturn, isBitcast, etc.). Adding isFMA3 there would be
inappropriate, as FMA3 is meaningful only for X86, while all the other 32 1-bit fields defined there
are quite generic and usable by all targets. Also, adding even 1 bit there would increase the size
of the IR. Unfortunately, I could not find any similar mechanism that is X86-only.

3378 ↗(On Diff #30193)

Yes, this method could be static and be similar to existing methods like "static bool isFrameLoadOpcode(int Opcode)", etc.

The reason why I passed a 'MachineInstr' argument instead of an 'Opcode' to this function, and why
this method is not static now, is that I wanted to reserve the opportunity to handle SCALAR FMAs and
their 1st operand more optimistically later (when additional analysis of scalar FMA users is
implemented); please see the FIXME comment at line 3487.

3387 ↗(On Diff #30193)

There is one loop below handling all vector and scalar FMAs; I did not handle them separately.
The 'IsScalar' field was needed only to handle the 1st operand with extra care, as commuting the
1st operand of a scalar FMA requires some additional analysis.

Regarding moving the FMA tables into a separate header file:

separating them into a header file makes sense only if they are used by something else, i.e. not
only by one method. Otherwise, it is more convenient to have the array definition close to the
function/method using it. Also, in my opinion, the function-local static array OpcodeAlts is written
in the same style that is used in several other places in this file (please see the definitions of
MemoryFoldTable2Addr, MemoryFoldTable0, etc.). Moving all similar static arrays of structures to a
header file deserves a separate change-set.
3546 ↗(On Diff #30193)

A special thank-you for this comment! Separating the INPUT and OUTPUT arguments seems very reasonable; I like this idea.

In my opinion, both approaches have a right to live, though.
Before adding 2 extra arguments to findCommutedOpIndices() and fixing the other places, I will wait for more comments from the reviewers.

So, the first alternative provides a flexible instrument that tells us just what we need to know, while the second alternative makes findCommutedOpIndices() collect information that often is not needed later.

Right, but the first method does not tell the consumer that there are other alternatives for a given operand. The second method is much more general.

That being said, I believe you are right that most users of this method do not care about the alternatives, at least for now, so it is fine to have them iterate over the other indexes to check for alternatives, exactly like you do in the two-address pass.

Now, regarding the API, I would be in favor of something simpler, i.e.:

  • Kill areOpsCommutable.
  • Keep findCommutableOps with its current signature:

— Just make the two unsigned parameters input/output, like you did.
— Do not add two extra unsigned input parameters.

llvm/lib/CodeGen/TargetInstrInfo.cpp
208 ↗(On Diff #30193)

Shouldn’t we just use two default arguments of ~0U, instead of duplicating the prototype?

llvm/lib/Target/X86/X86InstrInfo.cpp
3546 ↗(On Diff #30193)

I do not see why it is better to separate the input and output parameters here.
As long as a parameter can have the value CommuteAnyOperandIndex, we know how to make the distinction.

v_klochkov added a comment.EditedAug 6 2015, 6:30 PM

Thank you for the answer, Quentin.

I agree with your new comments and have started preparing an updated change-set.
Regarding the default ~0U values for the arguments of the commuteInstruction() method, I still need your opinion; please see my answer to your question ('Inline Comment') for details.

Summarizing the planned additional changes:

  • remove areOpsCommutable()
  • fix the ~0U magic const
  • change fixCommutedOpIndices() to a helper method
  • ? remove the old commuteInstruction() method and duplicate the code handling ~0U args 5 times... but ONLY if you recommend doing that.

I would also add that if we eventually need a method that returns ALL pairs of commutable operands,
it could be a NEW method with the word 'All' in its name:
... findAllCommutedOpIndices();
Such a method (if needed) should be added in a separate change-set, though.

Thank you,
Slava

llvm/include/llvm/Target/TargetInstrInfo.h
97–109 ↗(On Diff #30193)

Yes, I agree, good catch. This will be fixed.

100 ↗(On Diff #30193)

I agree, we need a const for this value. How about 'CommuteAnyOperandIndex'
or 'AnyCommutableOperandIndex'?

274–277 ↗(On Diff #30193)

Matthias, could you please explain your idea?
In particular, what do you mean by saying

'...should be enough. The two operand commutable case will also work with that'?

I really want to just remove the old method that does not specify the commuted operands (i.e. to do exactly what you recommended here and just add two operands to the existing method). Unfortunately, I cannot do that without rewriting several places in LLVM.

This change-set replaces some commuteInstruction(MI) calls with commuteInstruction(MI, false, Idx1, Idx2) calls. That made the changes in those places clearer.

There are, though, 7 or 8 places where the old-style method is called, and rewriting those places would make this change-set significantly bigger.
For example, CodeGen/MachineCSE.cpp has 2 calls of the old commuteInstruction() method.
That place obviously has opportunities for improvement, but they should be addressed in a separate change-set, as the current one is already very big.

So, the only reason for having two variants of the commuteInstruction() method is the need to limit the size of the change-set, to limit the effort needed for code review, and to eliminate the risks (correctness, etc.) brought by overly large change-sets.

llvm/lib/CodeGen/TargetInstrInfo.cpp
208 ↗(On Diff #30193)

It could be done that way (i.e. using ~0U as the default value for the indices).
That way is good in that there would be only one commuteInstruction() method.

The one big disadvantage is that I would need to duplicate the code of this method 5 times:

  1. in TargetInstrInfo.cpp (in commuteInstruction())
  2. in the X86-specific implementation of commuteInstruction()
  3. in the PowerPC-specific implementation
  4. in the AMDGPU-specific implementation
  5. in the ARM-specific implementation.

The duplicated code would be something like this:

if (Idx1 == ~0U || Idx2 == ~0U) {
  if (!findCommutedOpIndices(MI, Idx1, Idx2)) {
    assert(...);
    return nullptr;
  }
}

After comparing the advantages and disadvantages, I would prefer to have this code in one place (i.e. as it is implemented now in the 1st change-set), but this question is not very important to me and I will do as you recommend.

v_klochkov updated this revision to Diff 31721.Aug 10 2015, 2:19 PM
v_klochkov edited edge metadata.

Quentin,

I updated the change-set according to the comments and recommendations from the reviewers.

In this change-set I did not remove the old commuteInstruction() method, as I am waiting for your comments on my note that this method can be removed only at the cost of duplicating its code 5 times.

Thank you,
Slava

Hi Slava,

In this change-set I did not remove the old commuteInstruction() method, as I am waiting for your comments on my note that this method can be removed only at the cost of duplicating its code 5 times.

I think it makes sense to expose only one API and have one overridable internal API, then put the boring shared code in the public function.
E.g., something like:

class stuff {
protected:
  virtual ty PublicAPIImpl(ty2 Arg);

public:
  /* not virtual */ ty PublicAPI(ty2 Arg) {
    /* boring code */
    return PublicAPIImpl(Arg);
  }
};
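Applied to this patch, the pattern might look like the following sketch (names and simplified signatures are illustrative; the real methods operate on MachineInstr and live in TargetInstrInfo):

```cpp
#include <cassert>

static const unsigned CommuteAnyOperandIndex = ~0U;

// Sketch of the public/impl split: the non-virtual public method resolves
// wildcard indices once, in one place, then delegates to the
// target-overridable implementation.
class TargetInstrInfoSketch {
protected:
  // Target-specific commute; this toy version just records the chosen pair.
  virtual bool commuteInstructionImpl(unsigned Idx1, unsigned Idx2) {
    LastIdx1 = Idx1;
    LastIdx2 = Idx2;
    return true;
  }

  // Target-specific discovery; this toy target can commute operands 1 and 2.
  virtual bool findCommutedOpIndices(unsigned &Idx1, unsigned &Idx2) {
    Idx1 = 1;
    Idx2 = 2;
    return true;
  }

public:
  virtual ~TargetInstrInfoSketch() = default;
  unsigned LastIdx1 = 0, LastIdx2 = 0;

  // Not virtual: the "boring code" lives here, so targets never duplicate it.
  bool commuteInstruction(unsigned Idx1 = CommuteAnyOperandIndex,
                          unsigned Idx2 = CommuteAnyOperandIndex) {
    if (Idx1 == CommuteAnyOperandIndex || Idx2 == CommuteAnyOperandIndex)
      if (!findCommutedOpIndices(Idx1, Idx2))
        return false;
    return commuteInstructionImpl(Idx1, Idx2);
  }
};
```

The design keeps exactly one public entry point while still letting each target override only the interesting parts.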

Thanks,
-Quentin

llvm/lib/Target/X86/X86InstrInfo.cpp
3499 ↗(On Diff #31721)

My understanding is that you are addressing this point here:

(2) Fixed a correctness problem caused by commuting the 1st and 2nd operands of scalar FMAs generated for intrinsics.

Most of the time, I think, we do not care about the high bits of the value (which is what you are fixing here). Therefore, I wonder if we are not being pessimistic about the commutation opportunities.
I agree we should seek correctness first, but I wonder how often that high-bits behavior is actually expected… We had this bug forever and apparently nobody noticed it.

Anyway, what is your plan to get the performance back?

v_klochkov updated this revision to Diff 32646.EditedAug 19 2015, 6:18 PM

This change-set (3rd revision) follows Quentin's suggestion to have a 'protected virtual commuteInstructionImpl()' method and to make the other method, 'commuteInstruction()', non-virtual. The latter can accept CommuteAnyOperandIndex arguments.

This solution gives us only one public commuteInstruction() method, instead of the 2 methods with different interfaces/prototypes that we had in the 2nd revision.

Hi Quentin,

Thank you for the good idea (commuteInstructionImpl()). Hopefully, I understood it correctly.
I uploaded a new change-set. Would you please review the additional changes?

Also, I answered your question regarding the stability/correctness fix; please see my answer right after your inline comment/question. In my opinion, stability/correctness has priority over performance in this particular question.
There are at least 2 ways to fix the conservative code-gen.

Thank you,
Slava

llvm/lib/Target/X86/X86InstrInfo.cpp
3499 ↗(On Diff #31721)

That correctness problem exists for FMAs and does not exist for ADD/MUL operations.
Also, FMAs are relatively new instructions.

For example, if you compile the test:

#include <immintrin.h>
double func(double y, double x) {
  return y + x;
}
__m128d funcx(__m128d y, __m128d x) {
  return _mm_add_sd(x, y);
}

then you'll see that only 1 instruction is generated for func() and 2 instructions for funcx().
func() just ignores the upper bits of the returned XMM value, while funcx() correctly handles the upper bits of the returned XMM value.

The difference in IR is that the ADDSDrr opcode is used in func(), while the ADDSDrr_Int opcode is used in funcx().

So, one possible solution could be to add *_Int opcodes for the FMA operations, as was done for the ADD and MUL operations, and to be more conservative only for the *FMA*_Int opcodes.

Another solution is mentioned in the FIXME comment above: implement functionality that can tell whether only the lowest element of the result of a scalar FMA is used.

In my opinion, these 2 solutions do not exclude each other; both should be implemented.

Currently, we do not have *FMA*_Int opcodes, which is why it is better to be more conservative and correct. This patch might make the code a little more conservative in some corner cases, but it also improves code-gen in many other cases, for example those where the 1st or 2nd operand can be swapped with the 3rd operand when that helps memory-operand folding.

Hi Slava,

I think we are getting close to the final approval :).
Thanks for working on this.

I haven't looked at the X86-specific part yet, as I would like the patch to be split in two:

  • One patch for the change in API, NFC for all the backends.
  • One patch to extend the commute code to handle the FMA cases.

Also, a few general comments:

  • Some formatting looks strange to me; please run clang-format on the patch.
  • Please do not insert blank comment lines around comments, i.e.:

// <— Remove those
// <Some comment>
// <— Remove those

  • Do not repeat the comments from the headers in the cpp files.

Thanks,
-Quentin

llvm/include/llvm/Target/TargetInstrInfo.h
299 ↗(On Diff #32646)

We can’t override this one anymore, please update the comment.

302 ↗(On Diff #32646)

It would be useful to specify what the behavior is when both commute indices are CommuteAnyOperandIndex.

322 ↗(On Diff #32646)

Same thing: what happens if both indices are set to CommuteAnyOperandIndex?

llvm/lib/CodeGen/TargetInstrInfo.cpp
124 ↗(On Diff #32646)

We usually do not repeat the doxygen comments when they are set in the header.
By doing so, we risk having them get out of sync. Therefore, please keep only the one in the header.
(Ditto for the other functions you’ve modified.)

llvm/lib/CodeGen/TwoAddressInstructionPass.cpp
1167 ↗(On Diff #32646)

Please document the meaning of the other arguments as well, like Dist, BaseOpKilled, etc.

1182 ↗(On Diff #32646)

Invert the condition and use “continue”. (Per LLVM coding standard.)

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
179 ↗(On Diff #32646)

It is not handled at all, right?

Shouldn’t findCommutedOpIndices return only register operands?

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
888 ↗(On Diff #32646)

I am guessing that you shouldn't be the one doing this kind of change. Or, at least, it should be a separate patch.

v_klochkov updated this revision to Diff 33366.EditedAug 27 2015, 3:45 PM

I removed the X86 FMA-specific changes and left here only the interface changes
to the commuteInstruction() and findCommutedOpIndices() methods, per Quentin's request.

Also, I removed the blank comment lines and the duplicated method descriptions/comments.

Hi Quentin,

I appreciate the time you have spent on this code review.

In this change-set I removed the X86 FMA-specific changes; they will be a separate change-set,
as you asked. This entirely excluded the changes to the files:

llvm\test\CodeGen\X86\fma-commute-x86.ll
llvm\test\CodeGen\X86\fma_patterns.ll
llvm\lib\Target\X86\X86InstrFMA.td

I removed the duplicated comments for methods (header vs. cpp),
even though I personally like such duplication.
It is good to have a function's description in the header file, but it is also
very convenient to have it in the .cpp when you're looking at the method's implementation:
you can see what the arguments mean, etc., without needing to look at the *.h file.

AMDGPU changes:

I really did NOT want to make those non-obvious changes for AMDGPU.
Avoiding them would have let me avoid questions and simplify the review/approval process.

Unfortunately, I had to make those changes.
Also, they cannot be separated from the interface changes I made to
commuteInstruction() and findCommutedOpIndices().

The problem I met there can be explained this way:
1) There are some places like this:
     if (MI->isCommutable() && TII->commuteInstruction(MI)) {}
   i.e. commuteInstruction() was called without a preceding call of findCommutedOpIndices().
2) The commuteInstruction() implementation for AMDGPU can commute Reg and Imm operands.

So, the solution is:
a) Allow the AMDGPU implementation of findCommutedOpIndices() to return true
   when the second operand is an Imm.
   Otherwise, none of the calls mentioned in problem (1) above would commute instructions.
b) Fix SIFoldOperands.cpp/tryAddToFoldList() and add an additional check there, because
   an Imm operand is not wanted there.
   I updated the FIXME comment there to make it more informative.

Thank you,
Slava

Hi Slava,

I still do not get the AMDGPU changes.

Unfortunately, I had to make those changes.
Also, they cannot be separated from the interface changes I made to
commuteInstruction() and findCommutedOpIndices().

We didn't change the basic semantics of those APIs, just made them more powerful, right?
So why do we need to change the way we use them for this target?
Indeed, "if (MI->isCommutable() && TII->commuteInstruction(MI)) {}" seems like a reasonable pattern to me.

What am I missing?

Thanks,
-Quentin

llvm/lib/CodeGen/TargetInstrInfo.cpp
212 ↗(On Diff #33366)

The formatting still looks suspicious to me.
I expect the "else" to be on the same line as the '}'.
Have you run clang-format?

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
186 ↗(On Diff #33366)

Since you said the target can commute imm and reg operand, why do we need this change?

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
796 ↗(On Diff #33366)

I still don't get why we need to turn this if into an assert.

The code 'if (MI->IsCommutable() && TII->commuteInstruction(MI))' works differently before and after the changes in commuteInstruction() method .

BEFORE:

If MI has the attribute 'MCID::Commutable' set to true, then 
    1) go to AMDGPU specific implementation of commuteInstruction()
        and commute operands (even if those are Reg and Imm).

AFTER:

If MI has the attribute 'MCID::Commutable' set to true, then
  1) go to TargetInstrInfo::commuteInstruction() and try to commute the operands:
      1a) call the AMDGPU-specific implementation of findCommutedOpIndices();
      1b) if commutable operands could be found, call the AMDGPU-specific implementation of commuteInstruction().

Step 1a) returned false, so no operand commute happened, which caused LIT test failures.
So, I just synchronized the understanding of commutable operands between the AMDGPU methods findCommutedOpIndices() and commuteInstruction().
The updated findCommutedOpIndices() returns true for Reg and Imm operands if they can be commuted by commuteInstruction().

The fix in SIFoldOperands.cpp was needed because the updated findCommutedOpIndices() may return true for commutable Reg and Imm operands.
In such cases the index of the Imm operand is stored into the 'FoldList' std::vector.
Later (downstream in the optimization) the elements of the returned FoldList are treated as if they were all Reg operands, which caused assert violations for Imm operands stored in 'FoldList'.

I'll fix the mentioned formatting issues and will try to run the clang-format tools.

Thanks,
Slava

llvm/include/llvm/Target/TargetInstrInfo.h
287–288 ↗(On Diff #33366)

Fixed.

304–331 ↗(On Diff #33366)

Fixed.

llvm/lib/CodeGen/TargetInstrInfo.cpp
124 ↗(On Diff #33366)

Ok, removed the duplicated comments/descriptions.

llvm/lib/CodeGen/TwoAddressInstructionPass.cpp
1167 ↗(On Diff #33366)

Fixed.

1182 ↗(On Diff #33366)

Fixed.

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
179 ↗(On Diff #33366)

findCommutedOpIndices() returns the commutable operands, which can be Imm operands for the AMDGPU target.
I also updated the FIXME comment to make it more informative.

v_klochkov updated this revision to Diff 33495.Aug 28 2015, 4:48 PM

Fixed coding style violations such as too-long lines, indentation, etc.,
using the recommendations from the 'clang-format-diff.py' tool.

Please see my answer regarding the new assert in AMDGPU version of commuteInstructionImpl() method.
Thanks,
Slava

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
796 ↗(On Diff #33366)

(Src1Idx == -1) is impossible here, as this place is reachable only after the findCommutedOpIndices() call, which filters out such situations.
This assert for Src1Idx is semantically and stylistically equivalent to the assert at line 788 (the assert for Src0Idx).

qcolombet accepted this revision.Sep 4 2015, 4:21 PM
qcolombet edited edge metadata.

Hi Slava,

LGTM.

The code 'if (MI->IsCommutable() && TII->commuteInstruction(MI))' works differently before and after the changes in the commuteInstruction() method.

Thanks for the explanation, now I see the difference in semantics. You’ll want to send an email to give a heads-up to out-of-tree targets about this change.

Also, before landing the patch, ping Tom (thomas.stellard@amd.com, code owner of the AMDGPU target), to see if the changes in AMDGPU also look good to him.

Cheers,
-Quentin

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
178 ↗(On Diff #33495)

The comment is just fine without the FIXME in front of it.
Just remove it, i.e., I do not think there is anything to fix here in the end.

This revision is now accepted and ready to land.Sep 4 2015, 4:21 PM

Mr. Stellard,
Please review and approve the changes in the 3 files owned by the AMDGPU target:

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
llvm/lib/Target/AMDGPU/SIInstrInfo.h
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp

All the changes (14 files, including AMDGPU) have been reviewed by Quentin Colombet.

Also, below I attached the e-mails I sent to your e-mail address at amd.com.
Please see more details in them.

Thank you,
Vyacheslav Klochkov

From: Klochkov, Vyacheslav N  
Sent: Monday, September 14, 2015 4:46 PM
To: '<EDITED>@amd.com'
Cc: Klochkov, Vyacheslav N
Subject: RE: LLVM code-review: http://reviews.llvm.org/D11370
 
Mr. Stellard,
 
In this e-mail I am asking you for approval of the AMDGPU-specific changes
that are a small part of the changes improving the methods that commute operands of machine instructions.
 
Please see the details below.
Also, I will be more than happy to answer your questions if you have any.
 
Thank you,
Vyacheslav Klochkov
------------------------------------------------------------------------------------
From: Klochkov, Vyacheslav N 
Sent: Tuesday, September 8, 2015 4:05 PM
To: <EDITED>@amd.com
Cc: Klochkov, Vyacheslav N
Subject: LLVM code-review: http://reviews.llvm.org/D11370
 
Dear Mr. Stellard,
 
Would you please approve the AMDGPU specific changes in this code-review tracker:
http://reviews.llvm.org/D11370
 
This change-set was reviewed and approved by several people:
- Quentin Colombet (official code-reviewer);
- David Kreitzer (code-review before submitting the changes to the Open Source community);
- Michael Kuperstein (code-review before submitting the changes to the community);
- Matthias Braun and Elena Demikhovsky (not official reviewers, sent some comments).
Quentin recommended asking for your approval for the AMDGPU changes.
 
This change-set includes changes in 14 files, 3 of 14 are AMDGPU specific:
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
llvm/lib/Target/AMDGPU/SIInstrInfo.h
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
 
 
The change-set improves the interface of the findCommutedOpIndices() and commuteInstruction() methods.
It does not change the code generation for AMDGPU; all LIT tests passed.
 
The old commuteInstruction() method did not let callers specify the indices of the operands to be commuted.
The main idea of the change-set is that commuteInstruction() must allow the commuted operands to be specified explicitly.
That is needed when a caller of the commuteInstruction() method knows which operands must be commuted.
 
For example, if the 1st and 2nd operands are commutable and the 1st and 3rd operands are commutable too,
the old commuteInstruction() method did not allow performing the second commute transformation.
The new method fixes that problem.
 
It is still possible not to specify the operands to be commuted, but in that case the operands to be commuted must be found by the findCommutedOpIndices() method
(please see the commuteInstruction() method in llvm/lib/CodeGen/TargetInstrInfo.cpp).
 
This interface change in llvm/lib/CodeGen/TargetInstrInfo.cpp required minor changes in the target-specific implementations of
findCommutedOpIndices() and commuteInstruction().
 
The changes in ARM, X86 and PowerPC went very smoothly. The changes in AMDGPU had to be somewhat bigger.
 
For AMDGPU I tuned findCommutedOpIndices() in such a way that it now returns true for commutable Reg and Imm operands,
as Reg and Imm operands can be commuted by the AMDGPU-specific implementation of the commuteInstruction() method.
So, findCommutedOpIndices() and commuteInstruction() are in sync now, i.e. the 1st returns true when the 2nd can do the commute.
 
The change in findCommutedOpIndices() allowed me to add an assert on one of the operands in the AMDGPU commuteInstructionImpl() method.
Also, I needed to add a simple check to SIFoldOperands.cpp to avoid Imm operands in a transformation where only Reg operands are expected.
 
Thank you,
Vyacheslav Klochkov
arsenm added inline comments.Sep 17 2015, 11:44 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
874 ↗(On Diff #33495)

Typo: immeditate

885–887 ↗(On Diff #33495)

It's not entirely accurate to use isVOP2 / isVOP3 for checking the number of operands. VOP* instructions are always available in a VOP3 encoding, but will still have < 3 operands. Checking if AMDGPU::OpName::src2 is a valid operand is a more reliable check. SALU instructions with an immediate can also be commuted, although there is less reason to do so other than canonicalization. Although it looks like isVOP2/isVOP3 is what commuteInstruction already checks so I guess this is OK for now.

arsenm added inline comments.Sep 17 2015, 11:59 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
798–799 ↗(On Diff #33495)

I would prefer to deMorgan's law this and distribute the !

Thank you for the comments.
I fixed the "immeditate" misprint,
replaced the if statement: if (!(A && B) && !(C && D)) --> if ((!A || !B) && (!C || !D)),
and removed the 'FIXME' word in SIFoldOperands.cpp (per the recommendation from Quentin Colombet).

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
178 ↗(On Diff #33495)

Fixed. I removed the "FIXME" word.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
798–799 ↗(On Diff #33495)

Ok, Fixed.

874 ↗(On Diff #33495)

Fixed: immeditate -> immediate.

885–887 ↗(On Diff #33495)

Thank you for the explanations.
I am quite happy that you are OK with the current version of the changes, as fixing such subtle things should be done by AMDGPU experts. In this change-set I just synchronized the checks in findCommutedOpIndices() and commuteInstructionImpl() (i.e. re-used the checks from commuteInstructionImpl() in findCommutedOpIndices()).

arsenm accepted this revision.Sep 25 2015, 2:29 PM
arsenm added a reviewer: arsenm.

LGTM

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
779–780 ↗(On Diff #33495)

I just removed this check a few days ago, so you probably will have a conflict when you apply this to trunk

800–801 ↗(On Diff #33495)

These asserts can be removed. I decided that there's no point in checking whether src0/src1 are valid operands

This revision was automatically updated to reflect the committed changes.