This is an archive of the discontinued LLVM Phabricator instance.

[PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation
Needs Revision (Public)

Authored by ashahid on Apr 15 2015, 8:01 AM.

Details

Summary

http://permalink.gmane.org/gmane.comp.compilers.llvm.devel/81724

This is with reference to the X86 PSAD instruction generation discussion that happened at the above link.

This patch contains the CodeGen changes. This change introduces an llvm intrinsic called @llvm.sad() which takes two integer vector inputs and returns an integer result. An ISD::SAD node corresponding to this intrinsic is also added. The loop vectorizer or SLP vectorizer can use this intrinsic only after querying the target.
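As a rough illustration (the name mangling and exact signature here are assumptions for the sketch, not copied from the patch, and the intrinsic does not exist in upstream LLVM), a call to the proposed intrinsic on one 16 x i8 block might look like this in IR:

; Hypothetical spelling of the proposed intrinsic: two byte vectors in,
; a single integer sum-of-absolute-differences out.
declare i32 @llvm.sad.v16i8(<16 x i8>, <16 x i8>)

define i32 @sad_block(<16 x i8> %a, <16 x i8> %b) {
  %r = call i32 @llvm.sad.v16i8(<16 x i8> %a, <16 x i8> %b)
  ret i32 %r
}

As stated above, the vectorizers would only emit such a call after querying the target through the accompanying TTI hook.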

Diff Detail

Event Timeline

ashahid retitled this revision from to [PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation.
ashahid updated this object.
ashahid edited the test plan for this revision. (Show Details)
ashahid set the repository for this revision to rL LLVM.
ab edited edge metadata. Apr 15 2015, 11:56 AM

A few general notes:

  • A LangRef description of the intrinsic would be useful for reviewing (and of course necessary for committing).
  • Also, the TTI parts are independent of the intrinsic/building/legalization parts; perhaps submit those with the patch that introduces their usage? (the vectorizer I guess)
  • I'm a bit confused about the lowering: if you only implement the v16i8 legal case, I don't expect to see anything other than a .td pattern and the SelectionDAGBuilder code.
  • In general, I think this will eventually need much stronger legalization: once something is in IR, people *will* use it, and IMO, if there isn't a good reason, it shouldn't fail to select or crash the compiler.

Thanks for working on this, I'm excited about the vectorizer parts!
-Ahmed

include/llvm/IR/Intrinsics.td
589

The opposite question was asked on the thread about i32, but can we constrain the return type more than this? (open question, I don't think so)

590

No spaces after '[' and before ']'.

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
1291–1295

This basically bypasses the legalizer for SAD, for no good reason. It should just be handled by the default case, and the legalization below.

2741–2760

I'm confused: what does this achieve?

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5096–5100

I would inline ExpandSad here, like bswap right above.

lib/Target/X86/X86ISelDAGToDAG.cpp
2087–2097

Why not a .td pattern? Should be a one-liner (+ the SDNode in TargetSelectionDAG.td).

2089

I think getMachineNode has an overload for two operands you can use.

2091

Two things:

  • SDM says SSE1 supports this on v8i8, which isn't really legal.
  • Why not a single check? For instance, is hasAVX() useful here?
lib/Target/X86/X86ISelLowering.cpp
915

Why not Legal?

16928–16929

Why not a single SDValue?

16930–16936

The original ISD::SAD building returns an integer (matching the intrinsic). Why v2i64 + extract + trunc here, instead of just doing DAG.getNode(ISD::SAD, dl, Op.getValueType(), ...)?

16932–16933

There's a getNode overload you can use for two operands: getNode(..., Op0, Op1), like you do right below.

lib/Target/X86/X86TargetTransformInfo.cpp
1143–1144

Why not "return false"? It doesn't make much sense to call it on any other type, but I think that's the point of the function.

test/CodeGen/X86/sad_intrinsic.ll
1

Why AVX? This is all SSE2, no?

5

Explicitly checking the movd would be useful, I think.

Hi Ahmed,

Thanks for looking into it.

jmolloy requested changes to this revision. Apr 16 2015, 2:29 AM
jmolloy edited edge metadata.

Hi,

Thanks for continuing to work on this!

From a high level, I still stand by the comments I made on your LoopVectorizer review that this is a bit too Intel-specific. ARM and AArch64's SAD instructions do not behave as you model here. You model SAD as a per-element subtract, abs, then a cross-lane (horizontal) summation. The VABD instruction in ARM does not do the last step - that is assumed to be done in a separate instruction.

Now, we can model your intrinsic semantics in ARM using:

val2 = ABD val0, val1
val3 = ADDV val2

However, that isn't very efficient - the sum-across-lanes is expensive. Ideally, in a loop, we'd do something like this:

loop:
  reduction1 = ABD array0[i], array1[i]
  br loop
val = ADDV reduction1

So we'd only do the horizontal step once, not on every iteration. That is why I think having the intrinsic represent just the per-element operations, not the horizontal part, would be the lowest common denominator between our two targets. It is easier to pattern match two nodes than to split one node apart and LICM part of it.
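As a hedged sketch of that shape in IR (written with today's opaque-pointer syntax and llvm.vector.reduce.add, both of which post-date this review; at the time the final sum would have been an explicit shuffle/add tree, and the expanded sub/abs sequence below could equally be a dedicated per-element intrinsic):

define i32 @sad_loop(ptr %a, ptr %b, i32 %n) {
entry:
  br label %loop

loop:
  ; The per-element absolute difference stays inside the loop; only the
  ; wide vector accumulator is carried across iterations.
  %i   = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %acc = phi <16 x i32> [ zeroinitializer, %entry ], [ %acc.next, %loop ]
  %pa  = getelementptr <16 x i8>, ptr %a, i32 %i
  %pb  = getelementptr <16 x i8>, ptr %b, i32 %i
  %va  = load <16 x i8>, ptr %pa
  %vb  = load <16 x i8>, ptr %pb
  %wa  = zext <16 x i8> %va to <16 x i32>
  %wb  = zext <16 x i8> %vb to <16 x i32>
  %d   = sub <16 x i32> %wa, %wb
  %neg = sub <16 x i32> zeroinitializer, %d
  %pos = icmp sgt <16 x i32> %d, zeroinitializer
  %abd = select <16 x i1> %pos, <16 x i32> %d, <16 x i32> %neg
  %acc.next = add <16 x i32> %acc, %abd
  %i.next   = add i32 %i, 1
  %cond = icmp slt i32 %i.next, %n
  br i1 %cond, label %loop, label %exit

exit:
  ; The cross-lane (horizontal) sum happens exactly once, after the loop.
  %sum = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %acc)
  ret i32 %sum
}

declare i32 @llvm.vector.reduce.add.v16i32(<16 x i32>)

The point of this shape is that only %acc is live across iterations; the expensive horizontal step is paid once.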

Similarly, should the loop vectorizer emit the above construction, it is not very good for you: you would never be able to match your PSAD instruction. So my suggestion is this:

  • The intrinsic only models the per-element operations, not the horizontal part.
  • The X86 backend pattern matches the sad intrinsic plus a horizontal reduction to its PSAD node.
  • The Loop Vectorizer has code to emit *either* the per-element-only or the per-element-plus-horizontal SAD as part of its reduction logic.
  • The TargetTransformInfo hook is extended to return an enum SADType, in much the same way that the popcnt type is handled currently.

How does this sound?

Cheers,

James

This revision now requires changes to proceed. Apr 16 2015, 2:29 AM
ab added a comment. Apr 16 2015, 10:49 AM

FWIW I agree with James' concerns: I thought the intrinsic had already been agreed upon, but that doesn't seem to be the case.

This got me thinking, and perhaps that's what you're saying, James: if the SAD intrinsic is too X86-specific, why do we even need an intrinsic? We already express horizontal reductions using the expanded form (if that's not ideal, then that's a separate topic); the SUB is just a "sub <16 x i8>".

The tricky part is the ABS, which AFAIK we express using the expanded form as well (InstCombine seems to agree).

If the only reason for the intrinsic is making it easier on the vectorizer (I haven't really followed that discussion), it sounds like there's something to be done for the more general problem (e.g. MAX, or SAT, which I have not gotten back to yet).

Am I making sense?

-Ahmed

Hi James,

Thanks for your explanation.

With regard to custom lowering, I think you have misinterpreted me. What I was saying was that if the intrinsic is defined as "sum( abs( *a++ - *b++ ) )", non-X86 backends could custom lower it as something like "ABD + ADDV" (absolute difference, sum-of-all-lanes). However, you'd end up with the sum-of-all-lanes unnecessarily being inside the loop! By the time the intrinsic is expanded, it may be difficult to determine that the sum can be moved outside the loop.

If the horizontal sum is an intrinsic, how will LICM happen on it?

Also, you haven't answered my question about signedness, which I've raised several times.

Currently the SAD intrinsic is modeled for unsigned data types. AFAIK, signedness matters only when an arithmetic operation results in a carry or overflow. We have not seen a use case for signed data types yet. Even in the case of ARM, “SABD” (if I am correct) does not set the overflow flag.

Regards,
Shahid

From: James Molloy [mailto:james@jamesmolloy.co.uk]
Sent: Thursday, April 16, 2015 9:42 PM
To: Shahid, Asghar-ahmad; reviews+D9029+public+eee34c83a2c6f996@reviews.llvm.org; hfinkel@anl.gov; spatel@rotateright.com; aschwaighofer@apple.com; ahmed.bougacha@gmail.com
Cc: llvm-commits@cs.uiuc.edu
Subject: Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation

Hi Asghar-ahmad,

Thanks for responding. I'll try and explain in more detail what I mean. I agree that we can custom lower things and that we could implement your intrinsic on our and other architecture. That is not in question. What is in question is whether the definition of the intrinsic and behaviour as-is would allow *efficient* implementation on multiple target architectures.

To reiterate the examples from earlier, there are seemingly two different approaches for lowering a sum-of-absolute-differences loop. Assume 'a' and 'b' are the two input arrays, as some pointer to vector type.

1:
int r = 0;
for (i = ...) {
  r += sum( abs( *a++ - *b++ ) );
}
// r holds the sum-of-absolute-differences

2:
vector int r = {0};
for (i = ...) {
  r += abs( *a++ - *b++ );
}
// r holds partial sums.
int sad = sum(r);
// sad holds the sum-of-absolute-differences

The most efficient form of lowering for X86 may possibly be (1), where a PSAD instruction can be used (although for non-i8 types perhaps not?). For ARM, AArch64 and, according to an appnote I found [0], Altivec (I couldn't find anything about MIPS), (2) is going to be better.

So the goal as I see it is to define these intrinsics and IR idioms such that both forms can be generated depending on the target (and/or datatype - you don't have PSAD for floating point types, so if someone does a non-int SAD loop the most efficient form for you would be (2)).
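For contrast, here is a hedged IR sketch of form (1) using the fused intrinsic proposed in this patch (again, the intrinsic spelling and the opaque-pointer syntax are illustrative assumptions): because the horizontal sum is implied by the intrinsic, it happens on every iteration.

declare i32 @llvm.sad.v16i8(<16 x i8>, <16 x i8>)

define i32 @sad_form1(ptr %a, ptr %b, i32 %n) {
entry:
  br label %loop

loop:
  %i  = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %r  = phi i32 [ 0, %entry ], [ %r.next, %loop ]
  %pa = getelementptr <16 x i8>, ptr %a, i32 %i
  %pb = getelementptr <16 x i8>, ptr %b, i32 %i
  %va = load <16 x i8>, ptr %pa
  %vb = load <16 x i8>, ptr %pb
  ; Absolute difference *and* cross-lane sum in one operation, per iteration.
  %sad = call i32 @llvm.sad.v16i8(<16 x i8> %va, <16 x i8> %vb)
  %r.next = add i32 %r, %sad
  %i.next = add i32 %i, 1
  %cond = icmp slt i32 %i.next, %n
  br i1 %cond, label %loop, label %exit

exit:
  ret i32 %r
}

This is the shape where a single PSAD-style instruction maps naturally; the hoisted form sketched earlier in the thread is the ARM-friendly one.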

With regard to custom lowering, I think you have misinterpreted me. What I was saying was that if the intrinsic is defined as "sum( abs( *a++ - *b++ ) )", non-X86 backends could custom lower it as something like "ABD + ADDV" (absolute difference, sum-of-all-lanes). However, you'd end up with the sum-of-all-lanes unnecessarily being inside the loop! By the time the intrinsic is expanded, it may be difficult to determine that the sum can be moved outside the loop.

Conversely, if we defined the intrinsic as "abs( *a++ - *b++ )", we could still easily generate loop type (1) by adding a sum() around it. As it is easier to match a pattern than to split a pattern apart and move it around (ISel is made for pattern matching!), this is the implementation I am suggesting.

Yes, you're right, this means the name "SAD" for the node may be a misnomer. What I've asked for is the splitting apart of an opaque intrinsic into a smaller opaque intrinsic and generic support IR, which is something we try to do where possible elsewhere in the compiler. I hope I've explained why the node as you've described it may not be useful for any non-X86 target.

Also, you haven't answered my question about signedness, which I've raised several times.

Cheers,

[0] http://www.freescale.com/webapp/sps/download/license.jsp?colCode=AVEC_SAD

Hi James,

From: James Molloy [mailto:james@jamesmolloy.co.uk]
Sent: Monday, April 20, 2015 2:03 PM
To: Shahid, Asghar-ahmad; reviews+D9029+public+eee34c83a2c6f996@reviews.llvm.org; hfinkel@anl.gov; spatel@rotateright.com; aschwaighofer@apple.com; ahmed.bougacha@gmail.com
Cc: llvm-commits@cs.uiuc.edu
Subject: Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation

Hi Shahid,

No matter how horizontal sums are modelled, LICM will always have great difficulty recognizing that it can be hoisted. I think this is beyond LICM's abilities, which is why I suggested we allow the Loop Vectorizer to create several different idioms, one of which has the sum already hoisted. The Loop Vectorizer has all the knowledge to know that this is legal and profitable to do.
So you mean to create different idioms suited to different targets, such as (1) for X86 and (2) for ARM from your example below.

I don't think you're right there with regard to signedness. The conceptual operation of SAD is this:

  1. Extend inputs to the output size (i8 -> i32 in PSAD's case)
  2. Subtract the inputs
  3. abs() the result
  4. (optionally) sum the abs()s.

I think steps (2) and (4) are signedness-independent. I think steps (1) and (3) are not, and step (1) is where ARM's SABD differs from UABD.
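To make the signedness point concrete, here is one hedged per-element spelling of those steps (illustrative only, not code from the patch); in this particular spelling the choice surfaces in how step (1) widens the inputs. For example, a = 0x80 and b = 0x7f give |128 - 127| = 1 in the unsigned flavour but |-128 - 127| = 255 in the signed flavour.

define i32 @sad_elem_unsigned(i8 %a, i8 %b) {
  %wa  = zext i8 %a to i32                   ; step (1), unsigned flavour
  %wb  = zext i8 %b to i32
  %d   = sub i32 %wa, %wb                    ; step (2)
  %neg = sub i32 0, %d
  %pos = icmp sgt i32 %d, 0
  %abs = select i1 %pos, i32 %d, i32 %neg    ; step (3)
  ret i32 %abs                               ; step (4) would add these across lanes
}

define i32 @sad_elem_signed(i8 %a, i8 %b) {
  %wa  = sext i8 %a to i32                   ; step (1), signed flavour
  %wb  = sext i8 %b to i32
  %d   = sub i32 %wa, %wb
  %neg = sub i32 0, %d
  %pos = icmp sgt i32 %d, 0
  %abs = select i1 %pos, i32 %d, i32 %neg
  ret i32 %abs
}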
I had referred to the ARMv8 architecture manual for SABD & UABD, but I could not find any difference in their semantics; both use unsigned input data. Is it the right doc to refer to?

X86's PSAD also extends the inputs from i8 to i32, so I don't think PSAD is signedness independent either.
I don’t think this is correct; PSAD operates on 8 unsigned byte operands of source and destination.

Cheers,

James

Are you sure about that? I thought your psad had this signature:
i32 @psad(i8* %a, i8* %b)

Oops, it was a typo, please read it as “both the operands source1 and source2”. Below is the link for reference.
https://www.yumpu.com/en/document/view/7897009/ia-32-intelr-architecture-software-developers-manual-volume-2/643

Regards,
Shahid

From: James Molloy [mailto:james@jamesmolloy.co.uk]
Sent: Monday, April 20, 2015 10:05 PM
To: Shahid, Asghar-ahmad; reviews+D9029+public+eee34c83a2c6f996@reviews.llvm.org; hfinkel@anl.gov; spatel@rotateright.com; aschwaighofer@apple.com; ahmed.bougacha@gmail.com
Cc: llvm-commits@cs.uiuc.edu
Subject: Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation

Hi,
So you mean to create different idioms suited to different targets, such as (1) for X86 and (2) for ARM from your example below.
Yes; although (2) will apply to more than just ARM - there may be examples where such an idiom is profitable for X86 too, such as when you have i16 datatypes.

I had referred to the ARMv8 architecture manual for SABD & UABD, but I could not find any difference in their semantics; both use unsigned input data. Is it the right doc to refer to?

That is the correct document. There's a knack to reading it: both SABD and UABD have a shared decode and semantics, but if you notice, the "unsigned" pseudo-variable is set from the "U" bit of the instruction, which is 1 for UABD and 0 for SABD. So they do treat their inputs as differently signed.

I don’t think this is correct; PSAD operates on 8 unsigned byte operands of source and destination.

Are you sure about that? I thought your psad had this signature:

i32 @psad(i8* %a, i8* %b)

Are you saying instead it has:

i8 @psad(i8* %a, i8* %b)

Because that contradicts what you've said earlier in these threads and also intuition - what is the point of such an instruction, as it can overflow during the summation step?

Cheers,

James

Hi James,

We agree with your logic of providing two intrinsics, for ‘absolute difference’ and ‘horizontal add’. I will soon send a description of that.
However, for targets supporting specialized SAD instructions, keeping the proposed llvm.sad() intrinsic (which of course needs to be extended to model data types & signedness) still makes sense because:

  1. Cost modeling would be easier and more efficient for SAD idioms.
  2. No DAG combine would be required (absd() + hadd() -> sad()); see the sketch below.
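A hedged sketch of what point 2 refers to: the @llvm.absd and @llvm.hadd names below are made up purely for illustration; only @llvm.sad follows this patch's proposal, and its mangling here is an assumption. The split form is what a backend would have to recombine; the fused form is what the patch proposes.

declare <16 x i8> @llvm.absd.v16i8(<16 x i8>, <16 x i8>)   ; hypothetical per-element abs-diff
declare i32 @llvm.hadd.v16i8(<16 x i8>)                    ; hypothetical widening horizontal add
declare i32 @llvm.sad.v16i8(<16 x i8>, <16 x i8>)          ; fused form as proposed in this patch

define i32 @split_form(<16 x i8> %a, <16 x i8> %b) {
  %abd = call <16 x i8> @llvm.absd.v16i8(<16 x i8> %a, <16 x i8> %b)
  %sum = call i32 @llvm.hadd.v16i8(<16 x i8> %abd)
  ret i32 %sum
}

define i32 @fused_form(<16 x i8> %a, <16 x i8> %b) {
  %sad = call i32 @llvm.sad.v16i8(<16 x i8> %a, <16 x i8> %b)
  ret i32 %sad
}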

Thoughts?

Regards,
Shahid


spatel resigned from this revision. Nov 25 2015, 9:39 AM
spatel removed a reviewer: spatel.

Looks like this is proceeding with a different implementation (for x86 at least):
http://reviews.llvm.org/D14897
http://reviews.llvm.org/D14840