This is an archive of the discontinued LLVM Phabricator instance.

[Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.
AbandonedPublic

Authored by jmolloy on Jul 6 2015, 9:37 AM.

Details

Summary

This adds new intrinsics "hadd_*" for a horizontal (reduction) sum operation, to facilitate efficient code generation for the "sum of absolute differences" operation.
The patch also introduces the corresponding SDNodes and basic legalization support. Sanity of the generated code is tested on X86.

This is the 2nd of three patches. The 1st patch can be found here: http://reviews.llvm.org/D10867

Diff Detail

Event Timeline

ashahid updated this revision to Diff 29096.Jul 6 2015, 9:37 AM
ashahid retitled this revision from to [Codegen] Add intrinsics 'hadd*' and corresponding SDNodes for horizontal sum operation..
ashahid updated this object.
ashahid added reviewers: jmolloy, hfinkel, rengolin.
ashahid set the repository for this revision to rL LLVM.
ashahid added a subscriber: llvm-commits.
jmolloy requested changes to this revision.Jul 7 2015, 2:45 AM
jmolloy edited edge metadata.

Hi,

Generally looks good, but I have some initial comments.

Cheers,

James

docs/LangRef.rst
10918

You need to be very explicit about the behaviour of this intrinsic with floating point arguments. What order, if any, does it perform the adds in? If there is no guaranteed order, it can only be used in fast-math mode.

include/llvm/IR/Intrinsics.td
620

Just having one intrinsic here would be good; there's no need for a separate int and float version.

628

Blank line missing here.

lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
721

Can't you just call ExpandHADD() here? or at least share the unroll and expand code?

This revision now requires changes to proceed.Jul 7 2015, 2:45 AM

Hi James,

Thanks for your comments. I will address them. Please see my responses below.

Regards,
Shahid

docs/LangRef.rst
10918

Ah, I did not think about it. Instead of restricting it to fast-math, I would prefer to define an order such as "add each element of the vector, from element 0 to n-1, to an accumulated sum which is initialized to zero". Does that make sense?

include/llvm/IR/Intrinsics.td
620

In that case what should be the return type of intrinsic?

lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
721

Yes, probably I can share.

Hi James,

Please find my responses below.

Shahid

From: James Molloy [mailto:james@jamesmolloy.co.uk]
Sent: Tuesday, July 07, 2015 7:09 PM
To: reviews+D10964+public+8430e6553a631d12@reviews.llvm.org; Shahid, Asghar-ahmad; hfinkel@anl.gov; renato.golin@linaro.org; james.molloy@arm.com
Cc: llvm-commits@cs.uiuc.edu
Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hadd*' and corresponding SDNodes for horizontal sum operation.

Hi,

Ah, I did not think about it. Instead of restricting it to fast-math, I would prefer to define an order such as "add each element of the vector, from element 0 to n-1, to an accumulated sum which is initialized to zero". Does that make sense?

That would mean you wouldn't be able to lower it using a lg(n)-shuffles algorithm, as that does it in the wrong order. You'd have to use a linear algorithm which would perform quite poorly.
That’s right.

It would also stop horizontal add instructions being used on architectures that support them (I don't know of any that do for FP types - probably for this reason!). I'd probably go with the fast-math version personally.
In that case I would support fast-math version.
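To illustrate the ordering concern, here is a small Python sketch (purely illustrative, not part of the patch) contrasting the order-defined linear sum with the lg(n) shuffle-style tree reduction. For floating point the two association orders can give different results, which is why the fast-math-only definition was preferred:

```python
def linear_hsum(v):
    # Order-defined reduction: element 0 through n-1 into an accumulator.
    acc = 0.0
    for x in v:
        acc += x
    return acc

def tree_hsum(v):
    # lg(n)-step reduction: repeatedly add the top half onto the bottom half,
    # the order a shuffle-based lowering would use (n must be a power of two).
    while len(v) > 1:
        half = len(v) // 2
        v = [v[i] + v[i + half] for i in range(half)]
    return v[0]

# For these values the two association orders give different FP results.
vals = [1e16, 1.0, -1e16, 1.0]
print(linear_hsum(vals))  # 1.0
print(tree_hsum(vals))    # 2.0
```

For integers (or under fast-math reassociation) the two always agree; only strict FP semantics force the linear order.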

In that case what should be the return type of intrinsic?

llvm_any_ty ?
Initially I thought of this, but did not use it because it would also allow other types; hence I provided the float intrinsic separately.
If that is not a concern, I will use llvm_any_ty.

hfinkel added inline comments.Jul 9 2015, 7:35 PM
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2925

Why is ISD::HADD not handled here?

2949

What exactly does this do? Widening normally introduces undefs, but you can't add a bunch of undefs and get anything other than an undef out.

You might need the SDAG node to carry an extra parameter indicating how many of the vector lanes are actually to be added, to properly support widening.

ashahid added inline comments.Jul 10 2015, 4:56 AM
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2925

Probably my tests were not sufficient to require handling ISD::HADD here. I will look into it.

2949

Yes, it does introduce undefs.

Sorry, I could not follow your suggestion regarding the extra SDAG parameter.

bruno added a subscriber: bruno.Jul 13 2015, 8:31 AM

Hi,

Thanks for working on this. Comments below.

docs/LangRef.rst
10945

I don't know if this discussion already happened, but I've been thinking about this and I'm wondering whether we should have a vector result instead of a scalar one: the result in the first element of the vector type and the other elements undef. Then an extractelement follows to get the scalar result.

IMO, this is more natural given the way architectures implement variants of HADD, they usually leave the results on vectors. One advantage of doing this is that we can also use this ISD::HADD while lowering other vector operations (the CTPOP case) and don't have to write a DAGCombine or any other extra logic to recognise the vector back from an extract. I might be biased on one side of the history here though, I'd appreciate hearing the other side :-)

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2876 ↗(On Diff #29096)

Note: If we use vector result instead of scalar, this won't be needed here and everything could be handled in LegalizeVectorOps.

lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1093

For the integer part (ISD::HADD) I believe you could do "vector shifts + vector adds" instead of "extracts + scalar adds"; probably better not to leave the vector domain? In case the current target doesn't support "vector shifts + vector adds" for the element type, your implementation should fall back to "extracts + scalar adds". To check that you can use in UnrollHADD:

if (TLI.getOperationAction(ISD::SHL, VT) == TargetLowering::Expand ||
    TLI.getOperationAction(ISD::ADD, VT) == TargetLowering::Expand)
 ....
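As a rough model of the strategy suggested above (Python used purely for illustration; the real code would build DAG nodes), "vector shifts + vector adds" stays in the vector domain and needs only lg(n) add steps, with "extracts + scalar adds" as the fallback:

```python
def hsum_vector_shifts(v):
    # Stay in the vector domain: repeatedly shift the vector down by half its
    # (power-of-two) length and add it to itself; the total ends up in lane 0.
    # Zeros come in at the top, mirroring a logical vector shift.
    step = len(v) // 2
    while step:
        shifted = v[step:] + [0] * step  # shift right by `step` lanes
        v = [a + b for a, b in zip(v, shifted)]
        step //= 2
    return v[0]

def hsum_scalar_extracts(v):
    # Fallback: extract each lane and do scalar adds.
    acc = 0
    for x in v:
        acc += x
    return acc

print(hsum_vector_shifts([1, 2, 3, 4]))    # 10
print(hsum_scalar_extracts([1, 2, 3, 4]))  # 10
```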
ashahid added inline comments.Jul 15 2015, 1:49 AM
docs/LangRef.rst
10945

No, this discussion did not happen earlier.

IMO, the scalar version is more natural w.r.t. the HADD operation itself and also more canonical. Also, as you mentioned, the vector version will need an extractelement, which may also have some performance impact.

In fact, on X86 we need to DAGCombine *ABSDIFF* and *HADD* to generate the PSAD instruction, which is our main objective for adding the two intrinsics. I feel a vector version of HADD will complicate this DAGCombine.

ashahid edited edge metadata.

Updated the patch with following changes

  1. Overloaded the intrinsic properly.
  2. Updated the WidenVecOp to avoid summing up of undefs.
  3. Restricted the floating point version of intrinsic only for fast-math.
  4. Updated the doc accordingly
bruno added a comment.Jul 23 2015, 7:07 AM

IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also.
In fact on X86, we need to do a DAGCombine of *ABSDIFF* and *HADD* to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.

I don't see how making it return a vector type would affect performance. You will need a DAG combine anyway, and you don't need to deal with the extractelement at all; besides, in the case where you really want to bring it back to a scalar result, this should be up to the front-end to generate. Also, we have precedent for using inserts/extracts with other vector nodes as the canonical way to represent lower-level vector instructions.

My point is that if we make the result available in the vector we have two advantages: (1) if there are additional vector operations on the result, we don't need to re-insert it into another vector to continue vector work and (2) if you want the result on a scalar, you might just use extractelement.

In x86, PSAD returns a vector and if you want to keep the result on a scalar you need to generate the appropriate movs to do the work, I believe the same applies to other archs with similar instructions. IMO, it doesn't seem natural to always return the value on a scalar and then having to insert it back to a vector to proceed with vector work.

Regarding the tests, please split and rename it to vector-hadd-128.ll and vector-hadd-256.ll, no need to split files based on the fact that they are testing the expansion. Once you have custom versions, just drive them by subtarget features (+ssse3, sse4*, etc). Take a look at vector-blend.ll and others for examples.

hfinkel edited edge metadata.Jul 25 2015, 3:29 PM

IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also.
In fact on X86, we need to do a DAGCombine of *ABSDIFF* and *HADD* to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.

I don't see how making it return a vector type would affect performance. You will need a DAG combine anyway here and you don't need to deal with the extractelement at all, besides on the case where you really want to bring it back to a scalar result, this should be up to the front-end to generate. Also, we have precedence in using inserts/extracts with other vector nodes as the canonical way to represent lower level vector instructions.

My point is that if we make the result available in the vector we have two advantages: (1) if there are additional vector operations on the result, we don't need to re-insert it into another vector to continue vector work and (2) if you want the result on a scalar, you might just use extractelement.

In x86, PSAD returns a vector and if you want to keep the result on a scalar you need to generate the appropriate movs to do the work, I believe the same applies to other archs with similar instructions. IMO, it doesn't seem natural to always return the value on a scalar and then having to insert it back to a vector to proceed with vector work.

I understand your point, but I think that the intrinsic should return a scalar for conceptual clarity. It is, fundamentally, computing a scalar quantity. I understand, however, that doing this will require more work on our part to produce code of equivalent quality.

Specifically, we'll need code in CodeGenPrep to push and replicate insertelement(undef, hadd(x), 0) instructions 'up' (closer to the hadd(x)) so that CodeGen will always see the pair together.

Then, for all backends where the underlying hadd returns its result in a vector register, we'll need to pattern match the insertelement away to a noop instead of actually moving the result into the scalar register file. Not all backends will have hadds that work like this, but I believe X86 and AArch64 will, for example.

However, I believe this bit of extra work is worthwhile. The fact that some common ISAs have a horizontal add that happens to return the result of the add in some lane of an output vector is not something that we should expose at the IR level.

ashahid edited edge metadata.

Split the tests by integer and float 128/256-bit data types.

bruno added a comment.Jul 28 2015, 8:46 AM

IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also.
In fact on X86, we need to do a DAGCombine of *ABSDIFF* and *HADD* to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.

I don't see how making it return a vector type would affect performance. You will need a DAG combine anyway here and you don't need to deal with the extractelement at all, besides on the case where you really want to bring it back to a scalar result, this should be up to the front-end to generate. Also, we have precedence in using inserts/extracts with other vector nodes as the canonical way to represent lower level vector instructions.

My point is that if we make the result available in the vector we have two advantages: (1) if there are additional vector operations on the result, we don't need to re-insert it into another vector to continue vector work and (2) if you want the result on a scalar, you might just use extractelement.

In x86, PSAD returns a vector and if you want to keep the result on a scalar you need to generate the appropriate movs to do the work, I believe the same applies to other archs with similar instructions. IMO, it doesn't seem natural to always return the value on a scalar and then having to insert it back to a vector to proceed with vector work.

I understand your point, but I think that the intrinsic should return a scalar for conceptual clarity. It is, fundamentally, computing a scalar quantity. I understand, however, that doing this will require more work on our part to produce code of equivalent quality.

Specifically, we'll need code in CodeGenPrep to push and replicate insertelement(undef, hadd(x), 0) instructions 'up' (closer to the hadd(x)) so that CodeGen will always see the pair together.

I see, agreed that it semantically makes more sense to have this in the IR level.

Then for all backends such that the underlying hadd returns its result in a vector register, will need to pattern match the insertelement away to a noop instead of actually moving the result into the scalar register file. Not all backends will have hadds that work like this, but I believe X86 and AArch64 will, for example.

However, I believe this bit of extra work is worthwhile. The fact that some common ISAs have an horizontal add that happens to return the result of the add in some lane of an output vector is not something that we should expose at the IR level.

What if the node result is a vector only at the ISD level? ISD nodes are supposed to represent lower-level behaviour, and then we can canonicalize it to HADD + extractelement, which I believe should be easier to deal with.

test/CodeGen/X86/vec-hadd-float-128.ll
3

Could you please update your tests to be more target neutral? I mean, use -mtriple=x86_64-unknown-unknown instead.

One question, what code does it emit if one removes -enable-unsafe-fp-math? If it currently makes no difference, you can remove it, otherwise you should be testing both versions.

Hi Bruno,

My response inlined.

Regards,
Shahid

test/CodeGen/X86/vec-hadd-float-128.ll
3

Ok, will make it more target neutral accordingly.

In fact, without fast-math the user is not supposed to use the float version of this intrinsic at all; if it is used, the compiler will 'assert'.

Hi Hal,

Could you pls comment on the updates?

Regards,
Shahid

RKSimon added a subscriber: RKSimon.Aug 2 2015, 9:14 AM

Apologies for joining this discussion so late.

I'm worried that this intrinsic is over-specific to the PSAD (sum-of-absolute-differences) cases - I would have thought a pairwise-style horizontal add would fit in much better with most target hardware and could still make locating PSAD-style patterns pretty straightforward.

Another alternative: instead of a new intrinsic/SDNode, you could focus on providing common infrastructure to detect general horizontal reduction/reassociation patterns - PR23116 and PR21975 would benefit from these.

Failing that, would you consider renaming the opcode ISD::SUM or similar to avoid ambiguity with SSE + NEON HADD instructions?

Apologies for joining this discussion so late.

I'm worried that this intrinsic is over specific to the PSAD (sum_of) cases - I would have thought a pairwise style horizontal add would fit in much better with most target hardware and could still make locating PSAD style patterns pretty straightforward.

IMO, this intrinsic is generic in terms of the semantics of a horizontal sum, and PSAD happens to use this semantic. Also, for power-of-2 element counts the computation is O(log n), which is better than pairwise computation.

Another alternative would be to instead of a new instrinsic/SDNode, you could focus on providing common infrastructure to detect general horizontal reduction/reassociation patterns - PR23116 and PR21975 would benefit from these.

At this point in time, I would like to defer this possibility.

Failing that, would you consider renaming the opcode ISD::SUM or similar to avoid ambiguity with SSE + NEON HADD instructions?

Sure.

IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also.
In fact on X86, we need to do a DAGCombine of *ABSDIFF* and *HADD* to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.

I don't see how making it return a vector type would affect performance. You will need a DAG combine anyway here and you don't need to deal with the extractelement at all, besides on the case where you really want to bring it back to a scalar result, this should be up to the front-end to generate. Also, we have precedence in using inserts/extracts with other vector nodes as the canonical way to represent lower level vector instructions.

My point is that if we make the result available in the vector we have two advantages: (1) if there are additional vector operations on the result, we don't need to re-insert it into another vector to continue vector work and (2) if you want the result on a scalar, you might just use extractelement.

In x86, PSAD returns a vector and if you want to keep the result on a scalar you need to generate the appropriate movs to do the work, I believe the same applies to other archs with similar instructions. IMO, it doesn't seem natural to always return the value on a scalar and then having to insert it back to a vector to proceed with vector work.

I understand your point, but I think that the intrinsic should return a scalar for conceptual clarity. It is, fundamentally, computing a scalar quantity. I understand, however, that doing this will require more work on our part to produce code of equivalent quality.

Specifically, we'll need code in CodeGenPrep to push and replicate insertelement(undef, hadd(x), 0) instructions 'up' (closer to the hadd(x)) so that CodeGen will always see the pair together.

I see, agreed that it semantically makes more sense to have this in the IR level.

Then for all backends such that the underlying hadd returns its result in a vector register, will need to pattern match the insertelement away to a noop instead of actually moving the result into the scalar register file. Not all backends will have hadds that work like this, but I believe X86 and AArch64 will, for example.

However, I believe this bit of extra work is worthwhile. The fact that some common ISAs have an horizontal add that happens to return the result of the add in some lane of an output vector is not something that we should expose at the IR level.

What if the node result is a vector only at the ISD level? ISD nodes are supposed to represent lower-level behaviour, and then we can canonicalize it to HADD + extractelement, which I believe should be easier to deal with.

I am ok with it but I would like to know Hal's opinion before proceeding.

I think this is okay; we should clearly document the motivation. This does not address any problems with scalar-valued PHIs, but should make the pattern matching easier to implement in the common case for backends with legal horizontal adds.

ashahid retitled this revision from [Codegen] Add intrinsics 'hadd*' and corresponding SDNodes for horizontal sum operation. to [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation..
ashahid removed rL LLVM as the repository for this revision.

Updated the patch with

  1. Renamed intrinsic *hadd to *hsum and its related code/docs.
  2. Renamed ISD node *HADD to *HSUM and its related code.
  3. Updated the SDAG builder to transform "llvm.*hsum" into two nodes, *HSUM & EXTRACT_VECTOR_ELT.
  4. Updated the test cases accordingly.
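As a sketch of the resulting shape (hypothetical names and types; the exact intrinsic signature in the patch may differ), the IR keeps a scalar result while the SDAG builder emits the HSUM/extract pair:

```llvm
; Hypothetical IR: the overloaded intrinsic still returns a scalar.
declare i32 @llvm.hsum.i32.v4i32(<4 x i32>)

define i32 @sum_v4i32(<4 x i32> %v) {
  %s = call i32 @llvm.hsum.i32.v4i32(<4 x i32> %v)
  ret i32 %s
}

; The SDAG builder then lowers the call to the node pair:
;   t1: v4i32 = HSUM %v                    ; sum left in lane 0, other lanes undef
;   t2: i32   = extract_vector_elt t1, 0   ; scalar result
```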
bruno added inline comments.Aug 13 2015, 12:40 PM
include/llvm/CodeGen/ISDOpcodes.h
346

Space after the dot.

347

Same here.

lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1067

Using the assert inside this "if" will be a problem when asserts aren't enabled.

test/CodeGen/X86/vec-hadd-float-128.ll
2

Thanks for updating the tests. Could you please use ./utils/update_llc_test_checks.py to rewrite the tests checking? This will make it easier later on to see the improvements from target customization or widening (as you mentioned in a TODO).

RKSimon added inline comments.Aug 14 2015, 5:12 AM
test/CodeGen/X86/vec-hadd-float-256.ll
5

The test file is vec-hadd-float-256 yet you are testing a 512 bit vector? Change to <4 x double>

Updated the patch regarding the comments given by Bruno and Simon.

RKSimon added inline comments.Aug 18 2015, 5:31 AM
test/CodeGen/X86/vec-hadd-int-256.ll
15

This codegen is the same as for the test1_hsum_int_i64 <2 x i64> version in vec-hadd-int-128.ll - something is going wrong. You should probably compare against codegen from an AVX2 target.

ashahid added inline comments.Aug 18 2015, 11:55 PM
test/CodeGen/X86/vec-hadd-int-256.ll
15

With AVX2 the generated code differs as below.

Case V2i64

vpshufd $78, %xmm0, %xmm1       # xmm1 = xmm0[2,3,0,1]
vpaddq  %xmm1, %xmm0, %xmm0
vmovq   %xmm0, %rax
retq

Case V4i64

vextracti128    $1, %ymm0, %xmm1
vpaddq  %ymm1, %ymm0, %ymm0
vpermq  $237, %ymm0, %ymm1      # ymm1 = ymm0[1,3,2,3]
vpaddq  %ymm1, %ymm0, %ymm0
vmovq   %xmm0, %rax
vzeroupper
retq
RKSimon added inline comments.Aug 19 2015, 1:12 AM
test/CodeGen/X86/vec-hadd-int-256.ll
15

So yes, it appears something is wrong with the legalization. When you build for SSE you only get the hsum of the bottom <2 x i64>; when you build for AVX (which legalizes <4 x i64>) you get the hsum of the whole <4 x i64>.

Updated the patch to handle the legalization of vector type split properly.

Hi All,

Please review; this has been pending for quite some time.

Regards,
Shahid

Please upload future patches with full context; see http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface for instructions.

docs/LangRef.rst
10794

This constraint is not acceptable, you'll need to remove it. (plus, we're moving to a model where fast-math semantics are per-instruction flags). However, it is perfectly reasonable to document the fact that the ordering of the additions performed with the intrinsic is undefined.

lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1058

Remove the UnsafeFPMath check.

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2928

Remove the UnsafeFPMath check.

Please upload future patches with full context; see http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface for instructions.

Thanks for looking into it. I will update the patch accordingly and upload with full context.

ashahid updated this revision to Diff 33923.Sep 3 2015, 3:42 AM

Updated the docs/LangRef.rst and related code for Hal's comment.

ashahid updated this revision to Diff 34130.Sep 7 2015, 1:03 AM

Minor update in test case.

Hi Hal & others,

Please review, waiting for your comments / clearance.

Regards,
Shahid

I have no further issues, this is fine from my perspective.

Thanks James.

Looking forward to the other reviewers' responses.

Hi Hal,

Waiting for your response.

Regards,
Shahid

hfinkel added inline comments.Sep 18 2015, 5:12 PM
docs/LangRef.rst
10974

Nothing is being loaded here. You can just say that, "The argument is a vector of any integer or floating-point type."

10987

Signed or unsigned overflow?

10992

Missing space before "They"

10994

No need for a comma after that.

11000

integer or floating point number -> integer or floating-point type.

11009

I think this is unhelpful. Given that the order of additions is undefined, it might not be exactly equivalent to this code. I think describing this in words would be better in this case.

lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1077

Given that the order of additions is undefined, we can add NoSignedWrap or NoUnsignedWrap in the integer case.

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
711

Same comment here as above (we can add signed or unsigned nowrap here).

2963

Same here.

2968

And here.

Hi Hal,

Updated the LangRef.rst and code accordingly. Please review.

Regards,
Shahid

hfinkel added inline comments.Sep 23 2015, 11:57 AM
docs/LangRef.rst
10988

Unsigned overflow is also undefined? Is this really necessary? I doubt that the vectorizer will be able to prove no-unsigned-overflow in most cases, and we don't get it from C's semantics, and thus, would not be able to generate this intrinsic.

lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1071

If we can get rid of the undefined behavior for unsigned overflow; then remove this line.

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
714

Same comment here too.

2959

Same here.

Hi Hal,

Thanks for pointing this.

Unsigned overflow is also undefined? Is this really necessary? I doubt that the vectorizer will be able to prove no-unsigned-overflow in most cases, and we don't get it from C's semantics, and thus, would not be able to generate this intrinsic.

I looked at C's semantics of unsigned overflow and realized that it is not necessary; I will update the patch accordingly.

Regards,
Shahid

Hi Hal,

Updated the patch accordingly.

Regards,
Shahid

I need to take a step back here; why are we doing this again?

I read again the RFC thread (http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html), and it ended with the following (from Renato):

BTW, now my plan is to just add the two intrinsics for 'absolute difference'
and 'horizontal add'.

That's ok, as long as they're impossible to represent in plain IR.

and I think that we were all in agreement on this point. But now I'm not sure you've demonstrated the prerequisite. The underlying operations here (and in D10867) seem like they are representable using IR (as demonstrated by the fact that you provide potentially-equivalent IR sequences in the documentation), except for the ordering freedom here.

And this, I fear, is where we run into trouble. The thing that is not representable in the IR is that the order of operations in the horizontal sum is undefined, and at the same time, signed overflow is undefined. This cannot be represented in the IR because nsw adds don't reassociate, and thus, there's no way to represent the sequence of nsw adds such that they can be reassociated while retaining their nsw property. But the problem is that, because this freedom cannot be represented in the IR, we also can't generate it from IR in a semantics-preserving way; and, thus, it would not be legal to generate it in the vectorizers.
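The nsw point can be seen in a small example (hypothetical i8 values, not from the patch): reassociating a chain of nsw adds can introduce a signed overflow that the original order did not have, so the reassociated chain is not equivalent to the original:

```llvm
; With %a = 127, %b = -2, %c = 1 (i8):
%t1 = add nsw i8 %a, %b    ; 125, no signed overflow
%t2 = add nsw i8 %t1, %c   ; 126, no signed overflow
; Reassociated as (%a + %c) + %b:
%u1 = add nsw i8 %a, %c    ; 127 + 1 signed-overflows i8, so %u1 is poison
%u2 = add nsw i8 %u1, %b   ; poison, unlike %t2
```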

Thus, this change does not seem right, and approving D10867 seems like a mistake as well. We could certainly fix the definition here to make it exactly representable in the IR, but then what's the point of doing so?

In the RFC, you mentioned cost modeling as a major motivation for adding intrinsics, but that seems like an unrelated problem (at least in part). During vectorization, we can use special interfaces to estimate the cost of complex patterns. In fact, we already have an interface for reductions: TTI.getReductionCost. There is a second relevant cost model: that used by the unroller and inliner. Vectorization happens after inlining, so that interaction is not really relevant, but partial unrolling happens after vectorization, and so the cost model there might want to understand that a complex sequence of shuffles, extracts and adds has a disproportionately low cost. The same is true of the inliner if the input IR uses vector types and initially contains these operations, but even in that case, you'd not want to canonicalize on the intrinsics too early in case other optimizations remove the need for most of the operations. Thus, in the end, you need pattern-matching code near the end of the pipeline anyway to account for input IR directly containing relevant operations on vector types.

In short, I don't understand why we're going in this direction. You can match these operations in the backend, and you can match them in the cost-model code. If we can't do the latter, then we should fix that. And we already have special interfaces for vectorization costs for complex operations such as this.

Hi Hal,

Response inlined.

Regards,
Shahid

davidxl added inline comments.
docs/LangRef.rst
10979

For the integer case, having scalar result type (with the same size as the vector element) make this intrinsic less useful -- due to overflow conditions. The vectorizer will have difficulty proving overflow does not happen and won't be able to generate it in many cases.

As Bruno commented, having vector result type may be the way to go. For instance, for the input type of v4i8, if the result type can be v2i16 -- the hsum is split into 2 horizontal adds each one producing a 16 bit result. If the result type is v1i32, the hsum adds four i8 integers and produces a 32bit result. Limiting this to power of 2 number of elements seems reasonable.

test/CodeGen/X86/vec-hadd-float-128.ll
10

Should it be shufps .... xmm1 = xmm1[1, ?, ?, ?]

13

this shufps and addps should not be expected

test/CodeGen/X86/vec-hadd-int-128.ll
8

The result does not look right -- should pshufb be generated instead?

24

should phsufw be generated?

Or more efficient with phaddw?

congh added a subscriber: congh.Oct 28 2015, 4:53 PM
congh added inline comments.Oct 28 2015, 5:18 PM
test/CodeGen/X86/vec-hadd-int-128.ll
8

I think a shift operation is required here, given we only have SSE2 support for x86_64.

24

In SSE2, pshuflw should be generated here. phaddw is introduced in SSSE3.

jmolloy commandeered this revision.Dec 12 2015, 5:27 AM
jmolloy abandoned this revision.
jmolloy edited reviewers, added: ashahid; removed: jmolloy.

This revision has been abandoned; Cong Hou is now taking this forward in a different direction.