This is an archive of the discontinued LLVM Phabricator instance.

[aarch64] add def-pats for dot product
ClosedPublic

Authored by sebpop on Sep 16 2019, 7:24 PM.

Details

Summary

This patch adds the patterns to select the dot product instructions.

Tested on aarch64-linux with make check-all.

Diff Detail

Event Timeline

sebpop created this revision.Sep 16 2019, 7:24 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 16 2019, 7:24 PM
sebpop planned changes to this revision.Sep 16 2019, 10:28 PM
sebpop marked an inline comment as done.
sebpop added inline comments.
llvm/lib/Target/AArch64/AArch64InstrInfo.td
7080

I think there is an error in this case: as we duplicate the original value $Vo across the 4 lanes of the dot product, and then in the end we do the ADDV reduction across lanes, we end up with 4 times the original value.

I will prepare an updated patch to fix this.

I've got a cheeky request, and I appreciate that should go in a separate patch, but while you're at at it would you mind repeating this exercise for the ARM backend and AArch32?

sebpop updated this revision to Diff 220501.Sep 17 2019, 7:42 AM

The new patch does not use the first argument of the dot product instruction: we now set it to zero.
Patch tested on x86_64-apple-darwin with make check-all.

I've got a cheeky request, and I appreciate that should go in a separate patch, but while you're at at it would you mind repeating this exercise for the ARM backend and AArch32?

Let's get the arm64 patch in good shape and then we can retro-fit it for arm32 ;-)

Many thanks for addressing this!
This pattern matching looks very reasonable to me, just a round of nits:

  • was just curious about the AddedComplexity = 30
  • perhaps I am wrong, but was wondering if clang-format agrees with some of the indentation,
  • would some more test bring any benefits/additonal value? Like test with a loops, or tests with no aliasing attributes?
  • please ignore if you disagree, but thought that some names like ldop, K1, K2 could be slightly more informative/consistent.
  • was just curious about the AddedComplexity = 30

I chose 30 as it seemed to work: the rest of the file had complexities <= 20.
I updated that to 10 and it still works. I will update the patch with this.

  • perhaps I am wrong, but was wondering if clang-format agrees with some of the indentation,

I'll let you decide if clang-format is the best at tablegen ;-)
here is the output of clang format on the beginning of the patch...

// dot_v4i8                                                                                                                                                                       
class mul_v4i8<SDPatternOperator ldop> : PatFrag<(ops node                                                                                                                        
                                                  : $Rn, node                                                                                                                     
                                                  : $Rm, node                                                                                                                     
                                                  : $offset),                                                                                                                     
                                                 (mul(ldop(add node                                                                                                               
                                                           : $Rn, node                                                                                                            
                                                           : $offset)),                                                                                                           
                                                  (ldop(add node                                                                                                                  
                                                        : $Rm, node                                                                                                               
                                                        : $offset)))>;                                                                                                            
class mulz_v4i8<SDPatternOperator ldop> : PatFrag<(ops node                                                                                                                       
                                                   : $Rn, node                                                                                                                    
                                                   : $Rm),                                                                                                                        
                                                  (mul(ldop node                                                                                                                  
                                                       : $Rn),                                                                                                                    
                                                   (ldop node                                                                                                                     
                                                    : $Rm))>;
  • would some more test bring any benefits/additonal value? Like test with a loops, or tests with no aliasing attributes?

Sure, there may be some patterns that are not yet covered by the def-pats.
We can add more tests and catch more patterns.

  • please ignore if you disagree, but thought that some names like ldop, K1, K2 could be slightly more informative/consistent.

Please suggest better names, I'm fine changing these names.
K stands for constant, ldop is a load operator.

ok, sounds all good.

Good point: clang-format and tablegen :-)

To catch more dot product cases, we need to fix the passes above instruction selection.

I looked at the basic dot product loop:

int dot_product1(char *a, char *b, int sum) {
  for (int i = 0; i < 16 * K; i += 1)
    sum += a[i] * b[i];
  return sum;
}

for different values of K:

  • for K = 1, we do generate a dot instruction
  • for K = 2, K = 3
    • the loop is unrolled
    • SLP vectorizes the straight line code with vector factor 32
    • type legalization kicks in and destroys the pattern
    • we end up generating very poor code
  • K >= 4, no unroll, no SLP, no loop vectorization -> scalar byte loop code.

Looks like if we want to catch more dot product patterns, we'll need to fix the SLP and loop vectorizers.

I am also looking at some code that comes from TVM that is a higher level compiler generating code to LLVM IR.
I have seen that there is a missing pattern in interleaved load pass and a missing instruction in arm64: a ld8.
That is an interleaved load for an 8 by 8 byte matrix.
I think we can generate an i16 ld4 and then generate the low/high byte extracts in each lane.
This will simplify the dag on which we do instruction selection and enable generation of the dot product.

To catch more dot product cases, we need to fix the passes above instruction selection.

Yep, this actually what I was expecting. That is, I don't see a problem with this pattern matching here if it catches a few cases and helps you. But yes, to do the heavy lifting, this is probably a task for the loop vectorizer.

sebpop updated this revision to Diff 220669.Sep 18 2019, 7:56 AM
az added a comment.Sep 18 2019, 10:29 AM

There are few things missing in current work such as indexed dot product or what they call s/udot (vector, by element) in the ARM document (no need to do it now but a comment about that would help). There is also sve dot product. We need to port this code to SVE.

I agree that this work will miss many opportunities and the middle end will optimize the code in such a way that pattern matching does not work. I think that dot product need to have it own pass or subpass in the middle end. I see three places where it can be done: 1) early before the vectorizer in the same way we recognize min/max/etc., or 2) we can do it within the vectorizer as dot product is mainly a vectorization problem, or 3) we can do it post the vectorizer similar to the SIMD interleaved load/store. The third option, while not the best, has more chance of being accepted given that it less disruptive. Any tought on this as we may contribute to this effort in the future?

to do the heavy lifting, this is probably a task for the loop vectorizer.

You are right, it is not the task of instruction selection to vectorize the code:
if the code is vectorized, ISel should know how to select the right instructions.
If in the future the loop/SLP vectorizers produce patterns that are not covered
by the def-pats in this patch, we will add more patterns to generate dot-product.

In D67645#1674197, @az wrote:

There are few things missing in current work such as indexed dot product or what they call s/udot (vector, by element) in the ARM document (no need to do it now but a comment about that would help).

I added a comment in the patch.

There is also sve dot product. We need to port this code to SVE.

Same here, I added a FIXME comment.

I agree that this work will miss many opportunities and the middle end will optimize the code in such a way that pattern matching does not work. I think that dot product need to have it own pass or subpass in the middle end. I see three places where it can be done: 1) early before the vectorizer in the same way we recognize min/max/etc., or 2) we can do it within the vectorizer as dot product is mainly a vectorization problem, or 3) we can do it post the vectorizer similar to the SIMD interleaved load/store. The third option, while not the best, has more chance of being accepted given that it less disruptive. Any tought on this as we may contribute to this effort in the future?

I think I like your solution 2, and I think pre and post vectorizer passes would work as well.

In the SnapDragon compiler we used to generate ARM builtins/intrinsics directly from the vectorizer:
that way we didn't spend time in the back-end to select again instructions and
it provided a mechanism to ensure that the code generated by the vectorizer
looked the same in the back-end. That worked nicely.

Similarly, for the dot product we may want to generate a target independent builtin,
as LLVM does today for @llvm.experimental.vector.reduce.add.v16i32 in
llvm/lib/CodeGen/ExpandReductions.cpp.
If the target has a special instruction to handle the dot product,
it will implement a custom lowering.
So there will be no need to recognize the dot product again
if the vectorizer has already done that work.

sebpop updated this revision to Diff 220719.Sep 18 2019, 12:04 PM

yep, I think we need to generate those reduction intrinsics, which we can then lower/instruction select. I don't think there's anything controversial about that, intrinsics gets generated in a lot of different cases.

Are you planning on doing that now and turn your attention the vectorizer? That would make this work obsolete when it is ready.

I looked at both the SLP and loop vectorizer and I think this is more work than I can do right now.

Are you planning on doing that now and turn your attention the vectorizer?

not a this time.

That would make this work obsolete when it is ready.

Instruction selection and vectorizers are orthogonal.
I don't think that isel work would become obsolete with a vectorizer emitting dot builtins:
there will be need to recognize the dot instructions in code that has not been processed by the vectorizers.

SjoerdMeijer added a comment.EditedSep 19 2019, 8:04 AM

Instruction selection and vectorizers are orthogonal.

Well, okay, sure..... but depending on what tricks the vectoriser does, its output can be different, and the input to instruction selection be different, triggering different instruction selection patterns.

I don't think that isel work would become obsolete with a vectorizer emitting dot builtins:
there will be need to recognize the dot instructions in code that has not been processed by the vectorizers.

So, if the vectoriser for example emits dot product intrinsics, these patterns won't trigger and then it's say dead code if you see what I mean, but please correct me if I am wrong.

But at the same time, as I also said before, if this helps a few cases now, I don't see what's wrong with a nice little bit of pattern matching.

az added a comment.Sep 19 2019, 8:53 AM

What if the code is written with intrinsic but using mul, reduce (say similar to last test in this patch), then this patch will optimize that into dot product instruction. So, for legacy code that was written with old intrinsic, then this patch will remain useful even after dot product is implemented in the vectorizer.

Note that if somebody will be writing new code with intrinsic using mul, reduce instead of dot product, he is probably doing that for a reason and will want it to stay that way (he can then use the right flags to disable dot product).

SjoerdMeijer accepted this revision.Sep 19 2019, 9:04 AM

Yes, looks reasonable to me.
Perhaps wait a day with committing just in case someone has concerns.

This revision is now accepted and ready to land.Sep 19 2019, 9:04 AM
This revision was automatically updated to reflect the committed changes.