This is an archive of the discontinued LLVM Phabricator instance.

[Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.
AbandonedPublic

Authored by jmolloy on Jul 6 2015, 9:37 AM.

Details

Reviewers

rengolin
• ashahid
hfinkel

Summary

This adds new "hadd_*" intrinsics for the horizontal (reduction) sum operation, to facilitate efficient code generation for the "sum of absolute differences" operation.
The patch also introduces the corresponding SDNodes and basic legalization support. The generated code has been sanity-tested on X86.

This is the 2nd of three patches. The 1st patch can be found at http://reviews.llvm.org/D10867

Diff Detail

Repository: rL LLVM

Event Timeline

• ashahid updated this revision to Diff 29096.Jul 6 2015, 9:37 AM

• ashahid retitled this revision from to [Codegen] Add intrinsics 'hadd*' and corresponding SDNodes for horizontal sum operation..

• ashahid updated this object.

• ashahid added reviewers: jmolloy, hfinkel, rengolin.

• ashahid set the repository for this revision to rL LLVM.

• ashahid added a subscriber: llvm-commits.

Hi,

Generally looks good, but I have some initial comments.

Cheers,

James

docs/LangRef.rst
10342	You need to be very explicit about the behaviour of this intrinsic with floating point arguments. What order, if any, does it perform the adds in? If there is no guaranteed order, it can only be used in fast-math mode.
include/llvm/IR/Intrinsics.td
610	Just having one intrinsic here would be good; there's no need for a separate int and float version.
618	Blank line missing here.
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
719	Can't you just call ExpandHADD() here? or at least share the unroll and expand code?

This revision now requires changes to proceed.Jul 7 2015, 2:45 AM

Hi James,

Thanks for your comments. I will make the changes. Please see my responses below.

Regards,
Shahid

docs/LangRef.rst
10342	Ah, I did not think about it. Instead of restricting it to fast-math, I would prefer to define an order such as "add each element of the vector, from element 0 to n-1, to an accumulated sum which is initialized to zero". Does that make sense?
include/llvm/IR/Intrinsics.td
610	In that case what should be the return type of intrinsic?
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
719	Yes, probably I can share.

• ashahid added a subscriber: • ashahid.Jul 7 2015, 8:35 AM

Hi James,

Please find my responses below.

Shahid

From: James Molloy [mailto:james@jamesmolloy.co.uk]
Sent: Tuesday, July 07, 2015 7:09 PM
To: reviews+D10964+public+8430e6553a631d12@reviews.llvm.org; Shahid, Asghar-ahmad; hfinkel@anl.gov; renato.golin@linaro.org; james.molloy@arm.com
Cc: llvm-commits@cs.uiuc.edu
Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hadd*' and corresponding SDNodes for horizontal sum operation.

Hi,

Ah, I did not think about it. Instead of restricting it to fast-math, I would prefer to define an order such as "add each element of the vector, from element 0 to n-1, to an accumulated sum which is initialized to zero". Does that make sense?

That would mean you wouldn't be able to lower it using a lg(n)-shuffles algorithm, as that does it in the wrong order. You'd have to use a linear algorithm which would perform quite poorly.
That’s right.

It would also stop horizontal add instructions being used on architectures that support them (I don't know of any that do for FP types - probably for this reason!). I'd probably go with the fast-math version personally.
In that case I would support fast-math version.
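The ordering concern above can be illustrated with a small model. This is only a sketch: `linear_sum` and `tree_sum` are hypothetical helper names, not part of the patch, and the tree version assumes a non-empty, power-of-two element count. It shows that a strict element-0-to-n-1 order and a log-step (shuffle and add) order can give different floating-point results for the same input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Strict left-to-right order: the accumulator starts at zero and
// elements 0..n-1 are added one at a time.
float linear_sum(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}

// Log-step order: repeatedly add the upper half of the vector onto the
// lower half, the way a shuffle + vector-add lowering would.
// Assumes v is non-empty and v.size() is a power of two.
float tree_sum(std::vector<float> v) {
    for (std::size_t half = v.size() / 2; half >= 1; half /= 2)
        for (std::size_t i = 0; i < half; ++i)
            v[i] += v[i + half];
    return v[0];
}
```

For the input {1e30f, 1.0f, -1e30f, 1.0f}, the linear order loses the first 1.0f when it is absorbed into 1e30f and returns 1.0f, while the log-step order cancels the two large values first and returns 2.0f. That difference is why a fixed order rules out the lg(n) lowering.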

In that case what should be the return type of intrinsic?

llvm_any_ty ?
Initially I thought of this, but did not use it because it would also allow any other type. Hence I provided the float intrinsic separately.
If that is not a concern, I will use llvm_any_ty.

hfinkel added inline comments.Jul 9 2015, 7:35 PM

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2854	Why is ISD::HADD not handled here?
2878	What exactly does this do? Widening normally introduces undefs, but you can't add a bunch of undefs and get anything other than an undef out. You might need the SDAG node to carry an extra parameter indicating how many of the vector lanes are actually to be added, in order to properly support widening.
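A rough model of the widening concern above, under stated assumptions: `hsum_active` and its `active_lanes` parameter are hypothetical names used for illustration, and an arbitrary value stands in for an undef padding lane. The sketch shows a 3-element vector widened to 4 lanes, where the reduction must stop at the original lane count to give a meaningful result.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sums only the first `active_lanes` lanes of a widened 4-lane vector.
// Lanes at or beyond `active_lanes` are padding introduced by widening;
// their contents are unspecified and must not contribute to the sum.
std::int32_t hsum_active(const std::array<std::int32_t, 4>& widened,
                         std::size_t active_lanes) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < active_lanes && i < widened.size(); ++i)
        acc += widened[i];
    return acc;
}
```

With lanes {1, 2, 3, 12345} and three active lanes the result is 6; summing all four lanes would fold the padding value into the result.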

• ashahid added inline comments.Jul 10 2015, 4:56 AM

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2854	My tests were probably not sufficient to exercise the ISD::HADD case. I will look into it.
2878	Yes, it does introduce undefs. Sorry, but I did not understand your suggestion regarding the extra SDAG parameter.

bruno added a subscriber: bruno.Jul 13 2015, 8:31 AM

Hi,

Thanks for working on this. Comments below.

docs/LangRef.rst
10369	I don't know if this discussion has already happened, but I've been thinking about this and I'm wondering whether we should have a vector result instead of a scalar one: the result in the first element of the vector type and the other elements undef. An extractelement then follows to get the scalar result. IMO, this is more natural given the way architectures implement variants of HADD: they usually leave the results in vectors. One advantage of doing this is that we can also use this ISD::HADD while lowering other vector operations (the CTPOP case) and don't have to write a DAGCombine or any other extra logic to recognise the vector back from an extract. I might be biased on one side of the history here though, I'd appreciate hearing the other side :-)
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2896	Note: If we use vector result instead of scalar, this won't be needed here and everything could be handled in LegalizeVectorOps.
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1083	For the integer part (ISD::HADD), I believe you could do "vector shifts + vector adds" instead of "extracts + scalar adds"; it is probably better not to leave the vector domain. If the current target doesn't support "vector shifts + vector adds" for the element type, then your implementation should fall back to "extracts + scalar adds". To check that, you can use in UnrollHADD: if (TLI.getOperationAction(ISD::SHL, VT) == TargetLowering::Expand || TLI.getOperationAction(ISD::ADD, VT) == TargetLowering::Expand) ....

• ashahid added inline comments.Jul 15 2015, 1:49 AM

docs/LangRef.rst
10369	No, this discussion did not happen earlier. IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also. In fact on X86, we need to do a DAGCombine of ABSDIFF and HADD to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.

jmolloy mentioned this in D10273: Add intrinsics and SDNodes for signed/unsigned absolute difference (absdiff) and horizontal add (hadd)..Jul 17 2015, 2:24 AM

Updated the patch with the following changes:

Overloaded the intrinsic properly.
Updated the WidenVecOp to avoid summing up of undefs.
Restricted the floating point version of intrinsic only for fast-math.
Updated the doc accordingly

IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also.
In fact on X86, we need to do a DAGCombine of *ABSDIFF* and *HADD* to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.

I don't see how making it return a vector type would affect performance. You will need a DAG combine here anyway, and you don't need to deal with the extractelement at all except in the case where you really want to bring the result back to a scalar, which should be up to the front-end to generate. Also, we have precedent for using inserts/extracts with other vector nodes as the canonical way to represent lower-level vector instructions.

My point is that if we make the result available in the vector we have two advantages: (1) if there are additional vector operations on the result, we don't need to re-insert it into another vector to continue vector work and (2) if you want the result on a scalar, you might just use extractelement.

In x86, PSAD returns a vector and if you want to keep the result on a scalar you need to generate the appropriate movs to do the work, I believe the same applies to other archs with similar instructions. IMO, it doesn't seem natural to always return the value on a scalar and then having to insert it back to a vector to proceed with vector work.

Regarding the tests, please split and rename it to vector-hadd-128.ll and vector-hadd-256.ll, no need to split files based on the fact that they are testing the expansion. Once you have custom versions, just drive them by subtarget features (+ssse3, sse4*, etc). Take a look at vector-blend.ll and others for examples.

In D10964#210687, @bruno wrote:

IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also.
In fact on X86, we need to do a DAGCombine of *ABSDIFF* and *HADD* to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.

I don't see how making it return a vector type would affect performance. You will need a DAG combine here anyway, and you don't need to deal with the extractelement at all except in the case where you really want to bring the result back to a scalar, which should be up to the front-end to generate. Also, we have precedent for using inserts/extracts with other vector nodes as the canonical way to represent lower-level vector instructions.

My point is that if we make the result available in the vector we have two advantages: (1) if there are additional vector operations on the result, we don't need to re-insert it into another vector to continue vector work and (2) if you want the result on a scalar, you might just use extractelement.

In x86, PSAD returns a vector and if you want to keep the result on a scalar you need to generate the appropriate movs to do the work, I believe the same applies to other archs with similar instructions. IMO, it doesn't seem natural to always return the value on a scalar and then having to insert it back to a vector to proceed with vector work.

I understand your point, but I think that the intrinsic should return a scalar for conceptual clarity. It is, fundamentally, computing a scalar quantity. I understand, however, that doing this will require more work on our part to produce code of equivalent quality.

Specifically, we'll need code in CodeGenPrep to push and replicate insertelement(undef, hadd(x), 0) instructions 'up' (closer to the hadd(x)) so that CodeGen will always see the pair together.

Then, for all backends where the underlying hadd returns its result in a vector register, we will need to pattern match the insertelement away to a no-op instead of actually moving the result into the scalar register file. Not all backends will have hadds that work like this, but I believe X86 and AArch64 will, for example.

However, I believe this bit of extra work is worthwhile. The fact that some common ISAs have a horizontal add that happens to return the result of the add in some lane of an output vector is not something that we should expose at the IR level.

Split the tests based on integer and float 128/256 bit data types.

Ping !!

In D10964#212074, @hfinkel wrote:

In D10964#210687, @bruno wrote:

IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also.
In fact on X86, we need to do a DAGCombine of *ABSDIFF* and *HADD* to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.

I don't see how making it return a vector type would affect performance. You will need a DAG combine here anyway, and you don't need to deal with the extractelement at all except in the case where you really want to bring the result back to a scalar, which should be up to the front-end to generate. Also, we have precedent for using inserts/extracts with other vector nodes as the canonical way to represent lower-level vector instructions.

My point is that if we make the result available in the vector we have two advantages: (1) if there are additional vector operations on the result, we don't need to re-insert it into another vector to continue vector work and (2) if you want the result on a scalar, you might just use extractelement.

In x86, PSAD returns a vector and if you want to keep the result on a scalar you need to generate the appropriate movs to do the work, I believe the same applies to other archs with similar instructions. IMO, it doesn't seem natural to always return the value on a scalar and then having to insert it back to a vector to proceed with vector work.

I understand your point, but I think that the intrinsic should return a scalar for conceptual clarity. It is, fundamentally, computing a scalar quantity. I understand, however, that doing this will require more work on our part to produce code of equivalent quality.

Specifically, we'll need code in CodeGenPrep to push and replicate insertelement(undef, hadd(x), 0) instructions 'up' (closer to the hadd(x)) so that CodeGen will always see the pair together.

I see, agreed that it semantically makes more sense to have this in the IR level.

Then, for all backends where the underlying hadd returns its result in a vector register, we will need to pattern match the insertelement away to a no-op instead of actually moving the result into the scalar register file. Not all backends will have hadds that work like this, but I believe X86 and AArch64 will, for example.

However, I believe this bit of extra work is worthwhile. The fact that some common ISAs have a horizontal add that happens to return the result of the add in some lane of an output vector is not something that we should expose at the IR level.

What if we have the node return its result in a vector only at the ISD level? ISD nodes are supposed to represent lower-level behaviour, and then we can canonicalize it to HADD + extractelement, which I believe should be easier to deal with.

test/CodeGen/X86/vec-hadd-float-128.ll
2	Could you please update your tests to be more target neutral? I mean, use -mtriple=x86_64-unknown-unknown instead. One question, what code does it emit if one removes -enable-unsafe-fp-math? If it currently makes no difference, you can remove it, otherwise you should be testing both versions.

Hi Bruno,

My response inlined.

Regards,
Shahid

test/CodeGen/X86/vec-hadd-float-128.ll
2	Ok, I will make it more target neutral. In fact, the user is not supposed to use the float version of this intrinsic at all; if they do, the compiler will assert.

Hi Hal,

Could you please comment on the updates?

Regards,
Shahid

Apologies for joining this discussion so late.

I'm worried that this intrinsic is over specific to the PSAD (sum_of) cases - I would have thought a pairwise style horizontal add would fit in much better with most target hardware and could still make locating PSAD style patterns pretty straightforward.

Another alternative: instead of adding a new intrinsic/SDNode, you could focus on providing common infrastructure to detect general horizontal reduction/reassociation patterns - PR23116 and PR21975 would benefit from these.

Failing that, would you consider renaming the opcode ISD::SUM or similar to avoid ambiguity with SSE + NEON HADD instructions?

In D10964#216661, @RKSimon wrote:

Apologies for joining this discussion so late.

I'm worried that this intrinsic is over specific to the PSAD (sum_of) cases - I would have thought a pairwise style horizontal add would fit in much better with most target hardware and could still make locating PSAD style patterns pretty straightforward.

IMO, this intrinsic is generic in terms of the semantics of a horizontal sum, and PSAD happens to use these semantics. Also, for power-of-2 operand counts, the computation takes O(log n) steps, which is better than pairwise computation.

Another alternative: instead of adding a new intrinsic/SDNode, you could focus on providing common infrastructure to detect general horizontal reduction/reassociation patterns - PR23116 and PR21975 would benefit from these.

At this point in time, I would like to defer this possibility.

Failing that, would you consider renaming the opcode ISD::SUM or similar to avoid ambiguity with SSE + NEON HADD instructions?

Sure.

In D10964#213526, @bruno wrote:

What if we have the node return its result in a vector only at the ISD level? ISD nodes are supposed to represent lower-level behaviour, and then we can canonicalize it to HADD + extractelement, which I believe should be easier to deal with.

I am ok with it but I would like to know Hal's opinion before proceeding.

In D10964#217742, @ashahid wrote:

In D10964#213526, @bruno wrote:

What if we have the node return its result in a vector only at the ISD level? ISD nodes are supposed to represent lower-level behaviour, and then we can canonicalize it to HADD + extractelement, which I believe should be easier to deal with.

I am ok with it but I would like to know Hal's opinion before proceeding.

I think this is okay; we should clearly document the motivation. This does not address any problems with scalar-valued PHIs, but should make the pattern matching easier to implement in the common case for backends with legal horizontal adds.

Updated the patch with:

Renaming of intrinsic *hadd to *hsum and its related code/doc
Renaming of ISD node *HADD to *HSUM and its related code
Updated the SDAG builder to transform "llvm.*hsum" into two nodes, *HSUM & EXTRACT_VECTOR_ELT.
Test case updated accordingly.

bruno added inline comments.Aug 13 2015, 12:40 PM

include/llvm/CodeGen/ISDOpcodes.h
346	Space after the dot.
347	Same here.
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1057	Using the assert inside this "if" will be a problem when asserts aren't enabled.
test/CodeGen/X86/vec-hadd-float-128.ll
2	Thanks for updating the tests. Could you please use ./utils/update_llc_test_checks.py to rewrite the tests checking? This will make it easier later on to see the improvements from target customization or widening (as you mentioned in a TODO).

RKSimon added inline comments.Aug 14 2015, 5:12 AM

test/CodeGen/X86/vec-hadd-float-256.ll
5	The test file is vec-hadd-float-256 yet you are testing a 512 bit vector? Change to <4 x double>

Updated the patch regarding the comments given by Bruno and Simon.

RKSimon added inline comments.Aug 18 2015, 5:31 AM

test/CodeGen/X86/vec-hadd-int-256.ll
15	This codegen is the same as for the test1_hsum_int_i64 <2 x i64> version in vec-hadd-int-128.ll - something is going wrong. You should probably compare against codegen from an AVX2 target.

• ashahid added inline comments.Aug 18 2015, 11:55 PM

test/CodeGen/X86/vec-hadd-int-256.ll

With AVX2, the generated code differs as shown below.

Case V2i64

vpshufd $78, %xmm0, %xmm1       # xmm1 = xmm0[2,3,0,1]
vpaddq  %xmm1, %xmm0, %xmm0
vmovq   %xmm0, %rax
retq

Case V4i64

vextracti128    $1, %ymm0, %xmm1
vpaddq  %ymm1, %ymm0, %ymm0
vpermq  $237, %ymm0, %ymm1      # ymm1 = ymm0[1,3,2,3]
vpaddq  %ymm1, %ymm0, %ymm0
vmovq   %xmm0, %rax
vzeroupper
retq

RKSimon added inline comments.Aug 19 2015, 1:12 AM

test/CodeGen/X86/vec-hadd-int-256.ll
15	So yes, it appears that something is wrong with the legalization. When you build for SSE you only get the hsum of the bottom <2 x i64>; when you build for AVX (which legalizes <4 x i64>) you get the hsum of the whole <4 x i64>.
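The legalization issue described above can be modelled roughly as follows. This is an illustrative sketch only: the function names are hypothetical and it does not reproduce the patch's actual type-splitting code. It contrasts one plausible correct way to split a <4 x i64> reduction into <2 x i64> pieces with the faulty behaviour that reduces only the bottom half.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V2 = std::array<std::int64_t, 2>;
using V4 = std::array<std::int64_t, 4>;

// Horizontal sum of a legal two-lane vector.
std::int64_t hsum_v2(const V2& v) { return v[0] + v[1]; }

// Splitting done right: add the low and high halves lane by lane first,
// then reduce the two-lane result.
std::int64_t hsum_v4_split(const V4& v) {
    V2 lo{v[0], v[1]};
    V2 hi{v[2], v[3]};
    V2 sum{lo[0] + hi[0], lo[1] + hi[1]};
    return hsum_v2(sum);
}

// The faulty behaviour: the upper half is dropped and only the bottom
// two lanes are reduced.
std::int64_t hsum_v4_low_half_only(const V4& v) {
    return hsum_v2(V2{v[0], v[1]});
}
```

For the input {1, 2, 3, 4}, the correct split yields 10, whereas reducing only the bottom half yields 3, which matches the symptom of the SSE output looking identical to the <2 x i64> case.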

Updated the patch to handle the legalization of vector type split properly.

Ping !

Hi All,

Please review; this has been pending for quite some time.

Regards,
Shahid

Ping !!

Please upload future patches with full context, see: http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface for instructions.

docs/LangRef.rst
10794	This constraint is not acceptable, you'll need to remove it. (plus, we're moving to a model where fast-math semantics are per-instruction flags). However, it is perfectly reasonable to document the fact that the ordering of the additions performed with the intrinsic is undefined.
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1058	Remove the UnsafeFPMath check.
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2928	Remove the UnsafeFPMath check.

In D10964#238370, @hfinkel wrote:

Please upload future patches with full context, see: http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface for instructions.

Thanks for looking into it. I will update the patch accordingly and upload with full context.

Updated the docs/LangRef.rst and related code for Hal's comment.

Minor update in test case.

Hi Hal & others,

Please review, waiting for your comments / clearance.

Regards,
Shahid

I have no further issues, this is fine from my perspective.

Thanks James.

Looking forward to others' responses.

Ping !!

Hi Hal,

Waiting for your response.

Regards,
Shahid

hfinkel added inline comments.Sep 18 2015, 5:12 PM

docs/LangRef.rst
10395	Nothing is being loaded here. You can just say that, "The argument is a vector of any integer or floating-point type."
10408	Signed or unsigned overflow?
10413	Missing space before "They"
10415	No need for a comma after that.
10421	integer or floating point number -> integer or floating-point type.
10430	I think this is unhelpful. Given that the order of additions is undefined, it might not be exactly equivalent to this code. I think describing this in words would be better in this case.
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1067	Given that the order of additions is undefined, we can add NoSignedWrap or NoUnsignedWrap in the integer case.
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
704	Same comment here as above (we can add signed or unsigned nowrap here).
2892	Same here.
2897	And here.

Hi Hal,

Updated the LangRef.rst and code accordingly. Please review.

Regards,
Shahid

hfinkel added inline comments.Sep 23 2015, 11:57 AM

docs/LangRef.rst
10409	Unsigned overflow is also undefined? Is this really necessary? I doubt that the vectorizer will be able to prove no-unsigned-overflow in most cases, and we don't get it from C's semantics, and thus, would not be able to generate this intrinsic.
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1061	If we can get rid of the undefined behavior for unsigned overflow; then remove this line.
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
707	Same comment here too.
2888	Same here.

Hi Hal,

Thanks for pointing this out.

Unsigned overflow is also undefined? Is this really necessary? I doubt that the vectorizer will be able to prove no-unsigned-overflow in most cases, and we don't get it from C's semantics, and thus, would not be able to generate this intrinsic.

I looked at C's semantics for unsigned overflow and realized that it is not necessary; I will update the patch accordingly.

Regards,
Shahid

Hi Hal,

Updated the patch accordingly.

Regards,
Shahid

I need to take a step back here; why are we doing this again?

I read again the RFC thread (http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html), and it ended with the following (from Renato):

BTW, now my plan is to just add the two intrinsics for 'absolute difference'
and 'horizontal add'.

That's ok, as long as they're impossible to represent in plain IR.

and I think that we were all in agreement on this point. But now I'm not sure you've demonstrated the prerequisite. The underlying operations here (and in D10867) seem like they are representable using IR (as demonstrated by the fact that you provide potentially-equivalent IR sequences in the documentation), except for the ordering freedom here.

And this, I fear, is where we run into trouble. The thing that is not representable in the IR is that the order of operations in the horizontal sum is undefined, and at the same time, signed overflow is undefined. This cannot be represented in the IR because nsw adds don't reassociate, and thus, there's no way to represent the sequence of nsw adds such that they can be reassociated while retaining their nsw property. But the problem is that, because this freedom cannot be represented in the IR, we also can't generate it from IR in a semantics-preserving way; and, thus, it would not be legal to generate it in the vectorizers.
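The reassociation problem above can be seen on a three-element example. The following is a sketch under stated assumptions: `add_checked`, `left_to_right_overflows` and `right_grouped_overflows` are hypothetical helpers that model an nsw add by computing in 64 bits and flagging any 32-bit signed overflow. It shows that the same three values overflow under one grouping and not under another, so the no-overflow guarantee of one order does not carry over to a reordered sum.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Adds two i32 values in 64-bit arithmetic. Sets `overflowed` if the
// exact result does not fit in i32 (the case an nsw add must never hit).
std::int32_t add_checked(std::int32_t a, std::int32_t b, bool& overflowed) {
    const std::int64_t wide =
        static_cast<std::int64_t>(a) + static_cast<std::int64_t>(b);
    if (wide > std::numeric_limits<std::int32_t>::max() ||
        wide < std::numeric_limits<std::int32_t>::min())
        overflowed = true;
    // Wrap to 32 bits so the model can continue; the wrapped value is
    // only a placeholder, since a real nsw add would be poison here.
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(wide));
}

// Grouping (a + b) + c.
bool left_to_right_overflows(std::int32_t a, std::int32_t b, std::int32_t c) {
    bool overflowed = false;
    const std::int32_t t = add_checked(a, b, overflowed);
    add_checked(t, c, overflowed);
    return overflowed;
}

// Grouping a + (b + c).
bool right_grouped_overflows(std::int32_t a, std::int32_t b, std::int32_t c) {
    bool overflowed = false;
    const std::int32_t t = add_checked(b, c, overflowed);
    add_checked(a, t, overflowed);
    return overflowed;
}
```

With a = INT32_MAX, b = 1, c = -1, the grouping (a + b) + c overflows at the first add, while a + (b + c) never overflows. A vectorizer that only knows one grouping is overflow-free therefore cannot claim the same for an arbitrary reduction order.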

Thus, this change does not seem right, and approving D10867 seems like a mistake as well. We could certainly fix the definition here to make it exactly representable in the IR, but then what's the point of doing so?

In the RFC, you mentioned cost modeling as a major motivation for adding intrinsics, but that seems like an unrelated problem (at least in part). During vectorization, we can use special interfaces to estimate the cost of complex patterns. In fact, we already have an interface for reductions: TTI.getReductionCost. There is a second relevant cost model: That used by the unroller and inliner. Vectorization happens after inlining, so that interaction is not really relevant, but partial unrolling happens after vectorization, and so the cost model there might want to understand that a complex sequence of shuffles, extracts and adds has a disproportionately-low cost. The same is true of the inliner if the input IR uses vector types and initially contains these operations, but even in that case, you'd not want to canonicalize on the intrinsics too early in case other optimizations remove the need for most of the operations. Thus, in the end, you need pattern-matching code near the end of the pipeline anyway to account for input IR directly containing relevant operations on vector types.

In short, I don't understand why we're going in this direction. You can match these operations in the backend, and you can match them in the cost-model code. If we can't do the latter, then we should fix that. And we already have special interfaces for vectorization costs for complex operations such as this.

Hi Hal,

Response inlined.

Regards,
Shahid

davidxl added a subscriber: davidxl.Oct 27 2015, 8:56 PM

davidxl added inline comments.

docs/LangRef.rst
10400	For the integer case, having a scalar result type (with the same size as the vector element) makes this intrinsic less useful -- due to overflow conditions. The vectorizer will have difficulty proving overflow does not happen and won't be able to generate it in many cases. As Bruno commented, having a vector result type may be the way to go. For instance, for the input type of v4i8, if the result type can be v2i16 -- the hsum is split into 2 horizontal adds, each one producing a 16-bit result. If the result type is v1i32, the hsum adds four i8 integers and produces a 32-bit result. Limiting this to a power-of-2 number of elements seems reasonable.
test/CodeGen/X86/vec-hadd-float-128.ll
11	Should it be shufps .... xmm1 = xmm1[1, ?, ?, ?]
14	this shufps and addps should not be expected
test/CodeGen/X86/vec-hadd-int-128.ll
9	The result does not look right -- should pshufb be generated instead?
25	Should pshufw be generated? Or would phaddw be more efficient?
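The widening result types suggested in the docs/LangRef.rst comment above can be sketched in plain C++. This is only an illustration under stated assumptions: the function names are hypothetical, the lanes are treated as unsigned, and pairing adjacent lanes for the v2i16 form is an assumption, since the comment does not say which lanes are combined.

```cpp
#include <array>
#include <cstdint>

using V4i8 = std::array<uint8_t, 4>;

// Same-width result (i8): the running sum wraps modulo 256.
uint8_t hsumNarrowU8(const V4i8 &v) {
  uint8_t sum = 0;
  for (uint8_t lane : v)
    sum = static_cast<uint8_t>(sum + lane);
  return sum;
}

// v4i8 -> v2i16: two horizontal adds, each producing a 16-bit result.
// Adjacent-lane pairing is an assumption made for this sketch.
std::array<uint16_t, 2> hsumPairsU8(const V4i8 &v) {
  return {static_cast<uint16_t>(uint16_t{v[0]} + v[1]),
          static_cast<uint16_t>(uint16_t{v[2]} + v[3])};
}

// v4i8 -> v1i32: all four lanes are added into one 32-bit result.
uint32_t hsumAllU8(const V4i8 &v) {
  uint32_t sum = 0;
  for (uint8_t lane : v)
    sum += lane;
  return sum;
}
```

For the lanes {200, 100, 50, 250}, the same-width sum wraps, whereas both widened forms hold the exact totals, so no overflow proof is needed before generating them.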

congh added a subscriber: congh. Oct 28 2015, 4:53 PM

congh added inline comments. Oct 28 2015, 5:18 PM

test/CodeGen/X86/vec-hadd-int-128.ll
9	I think a shift operation is required here, given that we only have SSE2 support for x86_64.
25	In SSE2, pshuflw should be generated here. phaddw is introduced in SSSE3.

This revision has been abandoned; Cong Hou is now taking this forward in a different direction.

Revision Contents

Path	Size
docs/LangRef.rst	49 lines
include/llvm/CodeGen/ISDOpcodes.h	5 lines
include/llvm/IR/Intrinsics.td	3 lines
include/llvm/Target/TargetSelectionDAG.td	2 lines
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp	28 lines
lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp	6 lines
lib/CodeGen/SelectionDAG/LegalizeTypes.h	3 lines
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp	34 lines
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp	53 lines
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp	10 lines
lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp	2 lines
lib/CodeGen/TargetLoweringBase.cpp	2 lines
test/CodeGen/X86/vec-hadd-float-128.ll	43 lines
test/CodeGen/X86/vec-hadd-float-256.ll	26 lines
test/CodeGen/X86/vec-hadd-int-128.ll	71 lines
test/CodeGen/X86/vec-hadd-int-256.ll	17 lines

Diff 30691

docs/LangRef.rst


	Syntax:			Syntax:
	"""""""			"""""""
	This is an overloaded intrinsic. The loaded data is a vector of any integer bit width.			This is an overloaded intrinsic. The loaded data is a vector of any integer bit width.

	.. code-block:: llvm			.. code-block:: llvm

	declare <4 x integer> @llvm.uabsdiff.v4i32(<4 x integer> %a, <4 x integer> %b)			declare <4 x integer> @llvm.uabsdiff.v4i32(<4 x integer> %a, <4 x integer> %b)

				jmolloy: You need to be very explicit about the behaviour of this intrinsic with floating point arguments. What order, if any, does it perform the adds in? If there is no guaranteed order, it can only be used in fast-math mode.
				ashahid: Ah, I did not think about it. Instead of restricting it to fast-math, I would prefer to have an order such as "add each element of vector, starting from element 0 to n-1, to an accumulated sum which is initialized to zero". Does it make sense?

	Overview:			Overview:
	"""""""""			"""""""""

	The ``llvm.uabsdiff`` intrinsic returns a vector result of the absolute difference of the two operands,			The ``llvm.uabsdiff`` intrinsic returns a vector result of the absolute difference of the two operands,
	treating them both as unsigned integers.			treating them both as unsigned integers.

	The ``llvm.sabsdiff`` intrinsic returns a vector result of the absolute difference of the two operands,			The ``llvm.sabsdiff`` intrinsic returns a vector result of the absolute difference of the two operands,

	Both intrinsics take two integer of the same bitwidth.			Both intrinsics take two integer of the same bitwidth.

	Semantics:			Semantics:
	""""""""""			""""""""""

	The expression::			The expression::

	call <4 x i32> @llvm.uabsdiff.v4i32(<4 x i32> %a, <4 x i32> %b)			call <4 x i32> @llvm.uabsdiff.v4i32(<4 x i32> %a, <4 x i32> %b)
				bruno: I don't know if this discussion already happened, but I've been thinking about this and I'm wondering whether we should have a vector result instead of a scalar one; the result in the first element of the vector type and the other elements undef. Then an extractelement follows to get the scalar result. IMO, this is more natural given the way architectures implement variants of HADD, they usually leave the results on vectors. One advantage of doing this is that we can also use this ISD::HADD while lowering other vector operations (the CTPOP case) and don't have to write a DAGCombine or any other extra logic to recognise the vector back from an extract. I might be biased on one side of the history here though, I'd appreciate hearing the other side :-)
				ashahid: No, this discussion did not happen earlier. IMO, the scalar version is more natural w.r.t. the HADD operation itself and also more canonical. Also, as you mentioned, the vector version will need an extractelement, which may have some performance impact. In fact, on X86 we need to do a DAGCombine of ABSDIFF and HADD to generate the PSAD instruction, which is our main objective for adding the two intrinsics. I feel a vector version of HADD will complicate this DAGCombine.

	is equivalent to::			is equivalent to::

	%sub = sub <4 x i32> %a, %b			%sub = sub <4 x i32> %a, %b
	%ispos = icmp ugt <4 x i32> %sub, <i32 -1, i32 -1, i32 -1, i32 -1>			%ispos = icmp ugt <4 x i32> %sub, <i32 -1, i32 -1, i32 -1, i32 -1>
	%neg = sub <4 x i32> zeroinitializer, %sub			%neg = sub <4 x i32> zeroinitializer, %sub
	%1 = select <4 x i1> %ispos, <4 x i32> %sub, <4 x i32> %neg			%1 = select <4 x i1> %ispos, <4 x i32> %sub, <4 x i32> %neg

	Similarly the expression::			Similarly the expression::

	call <4 x i32> @llvm.sabsdiff.v4i32(<4 x i32> %a, <4 x i32> %b)			call <4 x i32> @llvm.sabsdiff.v4i32(<4 x i32> %a, <4 x i32> %b)

	is equivalent to::			is equivalent to::

	%sub = sub nsw <4 x i32> %a, %b			%sub = sub nsw <4 x i32> %a, %b
	%ispos = icmp sgt <4 x i32> %sub, <i32 -1, i32 -1, i32 -1, i32 -1>			%ispos = icmp sgt <4 x i32> %sub, <i32 -1, i32 -1, i32 -1, i32 -1>
	%neg = sub nsw <4 x i32> zeroinitializer, %sub			%neg = sub nsw <4 x i32> zeroinitializer, %sub
	%1 = select <4 x i1> %ispos, <4 x i32> %sub, <4 x i32> %neg			%1 = select <4 x i1> %ispos, <4 x i32> %sub, <4 x i32> %neg


				'``llvm.hadd.*``'
				^^^^^^^^^^^^^^^^^

				Syntax:
				"""""""
				This is an overloaded intrinsic. The loaded data is a vector of any integer or floating point data type.
				hfinkel: Nothing is being loaded here. You can just say that, "The argument is a vector of any integer or floating-point type."
				Floating point data type is only supported in fast-math mode.

				.. code-block:: llvm

				declare <integer> @llvm.hadd.i32.v4i32(<4 x integer> %a)
				davidxl: For the integer case, having a scalar result type (with the same size as the vector element) makes this intrinsic less useful -- due to overflow conditions. The vectorizer will have difficulty proving overflow does not happen and won't be able to generate it in many cases. As Bruno commented, having a vector result type may be the way to go. For instance, for the input type of v4i8, if the result type can be v2i16 -- the hsum is split into 2 horizontal adds, each one producing a 16-bit result. If the result type is v1i32, the hsum adds four i8 integers and produces a 32-bit result. Limiting this to a power-of-2 number of elements seems reasonable.
				declare <float> @llvm.hadd.f32.v4f32(<4 x float> %a)


				Overview:
				"""""""""

				The ``llvm.hadd`` intrinsic returns the result of the horizontal or reduction sum of the elements of the
				vector operand, treating it as integers or floats.
				hfinkel: Signed or unsigned overflow?

				hfinkel: Unsigned overflow is also undefined? Is this really necessary? I doubt that the vectorizer will be able to prove no-unsigned-overflow in most cases, and we don't get it from C's semantics, and thus, would not be able to generate this intrinsic.
				.. note::

				These intrinsics are primarily used during the code generation stage of
				compilation.They are generated by the compiler passes such as the Loop and
				hfinkel: Missing space before "They"
				SLP vectorizers.
				The expectation is that, the frontends should not need to generate these
				hfinkel: No need for a comma after that.
				intrinsics themselves.

				Arguments:
				""""""""""

				The argument is vector of integer or floating point number.
				hfinkel: integer or floating point number -> integer or floating-point type.

				Semantics:
				""""""""""

				The expression::

				call i32 @llvm.hadd.i32.v4i32(<4 x i32> %a)

				is equivalent to::
				hfinkel: I think this is unhelpful. Given that the order of additions is undefined, it might not be exactly equivalent to this code. I think describing this in words would be better in this case.

				%1 = shufflevector <4 x i32> %a, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
				%2 = add <4 x i32> %a, %1
				%3 = shufflevector <4 x i32> %2, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
				%4 = add <4 x i32> %2, %3
				%5 = extractelement <4 x i32> %4, i32 0


	Half Precision Floating Point Intrinsics			Half Precision Floating Point Intrinsics
	----------------------------------------			----------------------------------------

	For most target platforms, half precision floating point is a			For most target platforms, half precision floating point is a
	storage-only format. This means that it is a dense encoding (in memory)			storage-only format. This means that it is a dense encoding (in memory)
	but does not support computation in the format.			but does not support computation in the format.

	This means that code must first load the half-precision floating point			This means that code must first load the half-precision floating point
	The '``llvm.masked.gather``' intrinsic is designed for conditional reading of multiple scalar values from arbitrary memory locations in a single IR operation. It is useful for targets that support vector masked gathers and allows vectorizing basic blocks with data and control divergence. Other targets may support this intrinsic differently, for example by lowering it into a sequence of scalar load operations.			The '``llvm.masked.gather``' intrinsic is designed for conditional reading of multiple scalar values from arbitrary memory locations in a single IR operation. It is useful for targets that support vector masked gathers and allows vectorizing basic blocks with data and control divergence. Other targets may support this intrinsic differently, for example by lowering it into a sequence of scalar load operations.
	The semantics of this operation are equivalent to a sequence of conditional scalar loads with subsequent gathering all loaded values into a single vector. The mask restricts memory access to certain lanes and facilitates vectorization of predicated basic blocks.			The semantics of this operation are equivalent to a sequence of conditional scalar loads with subsequent gathering all loaded values into a single vector. The mask restricts memory access to certain lanes and facilitates vectorization of predicated basic blocks.


	::			::

	%res = call <4 x double> @llvm.masked.gather.v4f64 (<4 x double*> %ptrs, i32 8, <4 x i1>%mask, <4 x double> <true, true, true, true>)			%res = call <4 x double> @llvm.masked.gather.v4f64 (<4 x double*> %ptrs, i32 8, <4 x i1>%mask, <4 x double> <true, true, true, true>)

	;; The gather with all-true mask is equivalent to the following instruction sequence			;; The gather with all-true mask is equivalent to the following instruction sequence
				hfinkel: This constraint is not acceptable, you'll need to remove it. (plus, we're moving to a model where fast-math semantics are per-instruction flags). However, it is perfectly reasonable to document the fact that the ordering of the additions performed with the intrinsic is undefined.
	%ptr0 = extractelement <4 x double*> %ptrs, i32 0			%ptr0 = extractelement <4 x double*> %ptrs, i32 0
	%ptr1 = extractelement <4 x double*> %ptrs, i32 1			%ptr1 = extractelement <4 x double*> %ptrs, i32 1
	%ptr2 = extractelement <4 x double*> %ptrs, i32 2			%ptr2 = extractelement <4 x double*> %ptrs, i32 2
	%ptr3 = extractelement <4 x double*> %ptrs, i32 3			%ptr3 = extractelement <4 x double*> %ptrs, i32 3

	%val0 = load double, double* %ptr0, align 8			%val0 = load double, double* %ptr0, align 8
	%val1 = load double, double* %ptr1, align 8			%val1 = load double, double* %ptr1, align 8
	%val2 = load double, double* %ptr2, align 8			%val2 = load double, double* %ptr2, align 8

include/llvm/CodeGen/ISDOpcodes.h

enum NodeType {

/// Byte Swap and Counting operators.		/// Byte Swap and Counting operators.
BSWAP, CTTZ, CTLZ, CTPOP,		BSWAP, CTTZ, CTLZ, CTPOP,

/// [SU]ABSDIFF - Signed/Unsigned absolute difference of two input integer		/// [SU]ABSDIFF - Signed/Unsigned absolute difference of two input integer
/// vector. These nodes are generated from llvm.absdiff intrinsics.		/// vector. These nodes are generated from llvm.absdiff intrinsics.
SABSDIFF, UABSDIFF,		SABSDIFF, UABSDIFF,

		/// HADD/FHADD - Horizontal sum across the elements of sole
		/// integer or float input vector.
		bruno: Space after the dot.
		/// These nodes are generated from llvm.hadd* intrinsics.
		bruno: Same here.
		HADD, FHADD,

/// Bit counting operators with an undefined result for zero inputs.		/// Bit counting operators with an undefined result for zero inputs.
CTTZ_ZERO_UNDEF, CTLZ_ZERO_UNDEF,		CTTZ_ZERO_UNDEF, CTLZ_ZERO_UNDEF,

/// Select(COND, TRUEVAL, FALSEVAL). If the type of the boolean COND is not		/// Select(COND, TRUEVAL, FALSEVAL). If the type of the boolean COND is not
/// i1 then the high bits must conform to getBooleanContents.		/// i1 then the high bits must conform to getBooleanContents.
SELECT,		SELECT,

/// Select with a vector condition (op #0) and two vector operands (ops #1		/// Select with a vector condition (op #0) and two vector operands (ops #1

include/llvm/IR/Intrinsics.td

def int_convertuu : Intrinsic<[llvm_anyint_ty],
[llvm_anyint_ty, llvm_i32_ty, llvm_i32_ty]>;		[llvm_anyint_ty, llvm_i32_ty, llvm_i32_ty]>;

// Clear cache intrinsic, default to ignore (ie. emit nothing)		// Clear cache intrinsic, default to ignore (ie. emit nothing)
// maps to void __clear_cache() on supporting platforms		// maps to void __clear_cache() on supporting platforms
def int_clear_cache : Intrinsic<[], [llvm_ptr_ty, llvm_ptr_ty],		def int_clear_cache : Intrinsic<[], [llvm_ptr_ty, llvm_ptr_ty],
[], "llvm.clear_cache">;		[], "llvm.clear_cache">;

// Calculate the Absolute Differences of the two input vectors.		// Calculate the Absolute Differences of the two input vectors.
def int_sabsdiff : Intrinsic<[llvm_anyvector_ty],		def int_sabsdiff : Intrinsic<[llvm_anyvector_ty],
		jmolloy: Just having one intrinsic here would be good; there's no need for a separate int and float version.
		ashahid: In that case, what should be the return type of the intrinsic?
[ LLVMMatchType<0>, LLVMMatchType<0> ], [IntrNoMem]>;		[ LLVMMatchType<0>, LLVMMatchType<0> ], [IntrNoMem]>;
def int_uabsdiff : Intrinsic<[llvm_anyvector_ty],		def int_uabsdiff : Intrinsic<[llvm_anyvector_ty],
[ LLVMMatchType<0>, LLVMMatchType<0> ], [IntrNoMem]>;		[ LLVMMatchType<0>, LLVMMatchType<0> ], [IntrNoMem]>;

		// Calculate the horizontal/reduction sum across the elements of input vector.
		def int_hadd : Intrinsic<[llvm_any_ty], [llvm_anyvector_ty], [IntrNoMem]>;

//===-------------------------- Masked Intrinsics -------------------------===//		//===-------------------------- Masked Intrinsics -------------------------===//
		jmolloy: Blank line missing here.
//		//
def int_masked_store : Intrinsic<[], [llvm_anyvector_ty, LLVMPointerTo<0>,		def int_masked_store : Intrinsic<[], [llvm_anyvector_ty, LLVMPointerTo<0>,
llvm_i32_ty,		llvm_i32_ty,
LLVMVectorSameWidth<0, llvm_i1_ty>],		LLVMVectorSameWidth<0, llvm_i1_ty>],
[IntrReadWriteArgMem]>;		[IntrReadWriteArgMem]>;

def int_masked_load : Intrinsic<[llvm_anyvector_ty],		def int_masked_load : Intrinsic<[llvm_anyvector_ty],
[LLVMPointerTo<0>, llvm_i32_ty,		[LLVMPointerTo<0>, llvm_i32_ty,

include/llvm/Target/TargetSelectionDAG.td

def sube : SDNode<"ISD::SUBE" , SDTIntBinOp,
[SDNPOutGlue, SDNPInGlue]>;		[SDNPOutGlue, SDNPInGlue]>;
def smin : SDNode<"ISD::SMIN" , SDTIntBinOp>;		def smin : SDNode<"ISD::SMIN" , SDTIntBinOp>;
def smax : SDNode<"ISD::SMAX" , SDTIntBinOp>;		def smax : SDNode<"ISD::SMAX" , SDTIntBinOp>;
def umin : SDNode<"ISD::UMIN" , SDTIntBinOp>;		def umin : SDNode<"ISD::UMIN" , SDTIntBinOp>;
def umax : SDNode<"ISD::UMAX" , SDTIntBinOp>;		def umax : SDNode<"ISD::UMAX" , SDTIntBinOp>;

def sabsdiff : SDNode<"ISD::SABSDIFF" , SDTIntBinOp>;		def sabsdiff : SDNode<"ISD::SABSDIFF" , SDTIntBinOp>;
def uabsdiff : SDNode<"ISD::UABSDIFF" , SDTIntBinOp>;		def uabsdiff : SDNode<"ISD::UABSDIFF" , SDTIntBinOp>;
		def hadd : SDNode<"ISD::HADD" , SDTIntUnaryOp>;
		def fhadd : SDNode<"ISD::FHADD" , SDTIntUnaryOp>;
def sext_inreg : SDNode<"ISD::SIGN_EXTEND_INREG", SDTExtInreg>;		def sext_inreg : SDNode<"ISD::SIGN_EXTEND_INREG", SDTExtInreg>;
def bswap : SDNode<"ISD::BSWAP" , SDTIntUnaryOp>;		def bswap : SDNode<"ISD::BSWAP" , SDTIntUnaryOp>;
def ctlz : SDNode<"ISD::CTLZ" , SDTIntUnaryOp>;		def ctlz : SDNode<"ISD::CTLZ" , SDTIntUnaryOp>;
def cttz : SDNode<"ISD::CTTZ" , SDTIntUnaryOp>;		def cttz : SDNode<"ISD::CTTZ" , SDTIntUnaryOp>;
def ctpop : SDNode<"ISD::CTPOP" , SDTIntUnaryOp>;		def ctpop : SDNode<"ISD::CTPOP" , SDTIntUnaryOp>;
def ctlz_zero_undef : SDNode<"ISD::CTLZ_ZERO_UNDEF", SDTIntUnaryOp>;		def ctlz_zero_undef : SDNode<"ISD::CTLZ_ZERO_UNDEF", SDTIntUnaryOp>;
def cttz_zero_undef : SDNode<"ISD::CTTZ_ZERO_UNDEF", SDTIntUnaryOp>;		def cttz_zero_undef : SDNode<"ISD::CTTZ_ZERO_UNDEF", SDTIntUnaryOp>;
def sext : SDNode<"ISD::SIGN_EXTEND", SDTIntExtendOp>;		def sext : SDNode<"ISD::SIGN_EXTEND", SDTIntExtendOp>;

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp

void SelectionDAGLegalize::ExpandNode(SDNode *Node) {
case ISD::CTTZ:		case ISD::CTTZ:
case ISD::CTTZ_ZERO_UNDEF:		case ISD::CTTZ_ZERO_UNDEF:
Tmp1 = ExpandBitCount(Node->getOpcode(), Node->getOperand(0), dl);		Tmp1 = ExpandBitCount(Node->getOpcode(), Node->getOperand(0), dl);
Results.push_back(Tmp1);		Results.push_back(Tmp1);
break;		break;
case ISD::BSWAP:		case ISD::BSWAP:
Results.push_back(ExpandBSWAP(Node->getOperand(0), dl));		Results.push_back(ExpandBSWAP(Node->getOperand(0), dl));
break;		break;
		case ISD::FHADD:
		case ISD::HADD: {
		if (Node->getOpcode() == ISD::FHADD)
		bruno: Note: If we use vector result instead of scalar, this won't be needed here and everything could be handled in LegalizeVectorOps.
		assert(TM.Options.UnsafeFPMath &&
		"Floating point horizontal sum only supported for fast-math");
		SDValue OpVal = Node->getOperand(0);
		EVT VT = OpVal.getValueType();
		unsigned NumElems = VT.getVectorNumElements();
		EVT EltVT = VT.getVectorElementType();

		SmallVector<int, 16> ShuffleMask(NumElems, -1);
		for (unsigned i = NumElems; i != 1; i >>= 1) {
		// Move the upper half of the vector to the lower half.
		for (unsigned j = 0; j != i / 2; ++j)
		ShuffleMask[j] = (i / 2 + j);
		SDValue Shuffle = DAG.getVectorShuffle(VT, dl, OpVal, DAG.getUNDEF(VT),
		&ShuffleMask[0]);
		OpVal = DAG.getNode(Node->getOpcode() == ISD::HADD ? ISD::ADD : ISD::FADD,
		dl, VT, OpVal, Shuffle);
		}
		Tmp1 = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, EltVT, OpVal,
		DAG.getConstant(0, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
		if (Node->getOpcode() == ISD::HADD) {
		Tmp1 = DAG.getAnyExtOrTrunc(Tmp1, dl, Node->getValueType(0));
		}
		Results.push_back(Tmp1);
		break;
		}
case ISD::FRAMEADDR:		case ISD::FRAMEADDR:
case ISD::RETURNADDR:		case ISD::RETURNADDR:
case ISD::FRAME_TO_ARGS_OFFSET:		case ISD::FRAME_TO_ARGS_OFFSET:
Results.push_back(DAG.getConstant(0, dl, Node->getValueType(0)));		Results.push_back(DAG.getConstant(0, dl, Node->getValueType(0)));
break;		break;
case ISD::FLT_ROUNDS_:		case ISD::FLT_ROUNDS_:
Results.push_back(DAG.getConstant(1, dl, Node->getValueType(0)));		Results.push_back(DAG.getConstant(1, dl, Node->getValueType(0)));
break;		break;
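
The expansion in the HADD/FHADD case above repeatedly adds the upper half of the live lanes onto the lower half, so lane 0 ends up holding a tree-shaped sum rather than an in-order one. The following standalone C++ sketch models only that ordering and is not part of the patch: `halvingSum` and `inOrderSum` are hypothetical names, plain `float` lanes stand in for DAG nodes, and the usual IEEE single-precision evaluation without fast-math flags is assumed. It shows why the two orders can give different floating-point results, which is the concern raised in the review about documenting the order of additions.

```cpp
#include <cstddef>
#include <vector>

// Models the halving loop: on each step the upper half of the live lanes is
// added onto the lower half, until a single lane remains.
// Assumes lanes.size() is a power of two; an empty input yields 0.
float halvingSum(std::vector<float> lanes) {
  if (lanes.empty())
    return 0.0f;
  for (std::size_t width = lanes.size(); width != 1; width >>= 1)
    for (std::size_t j = 0; j != width / 2; ++j)
      lanes[j] = lanes[j] + lanes[width / 2 + j];
  return lanes[0];
}

// In-order alternative: add element 0 through n-1 to an accumulator that
// starts at zero.
float inOrderSum(const std::vector<float> &lanes) {
  float acc = 0.0f;
  for (float x : lanes)
    acc = acc + x;
  return acc;
}
```

For the lanes {1e8f, 1.0f, -1e8f, 1.0f}, the in-order sum loses the first 1.0f when it is added to 1e8f, while the halving order cancels the two large values first and keeps both small ones, so the two functions return different values for the same input.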

lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp

bool DAGTypeLegalizer::PromoteIntegerOperand(SDNode *N, unsigned OpNo) {
case ISD::ZERO_EXTEND: Res = PromoteIntOp_ZERO_EXTEND(N); break;		case ISD::ZERO_EXTEND: Res = PromoteIntOp_ZERO_EXTEND(N); break;
case ISD::EXTRACT_SUBVECTOR: Res = PromoteIntOp_EXTRACT_SUBVECTOR(N); break;		case ISD::EXTRACT_SUBVECTOR: Res = PromoteIntOp_EXTRACT_SUBVECTOR(N); break;

case ISD::SHL:		case ISD::SHL:
case ISD::SRA:		case ISD::SRA:
case ISD::SRL:		case ISD::SRL:
case ISD::ROTL:		case ISD::ROTL:
case ISD::ROTR: Res = PromoteIntOp_Shift(N); break;		case ISD::ROTR: Res = PromoteIntOp_Shift(N); break;
		case ISD::HADD: Res = PromoteIntOp_HADD(N); break;
}		}

// If the result is null, the sub-method took care of registering results etc.		// If the result is null, the sub-method took care of registering results etc.
if (!Res.getNode()) return false;		if (!Res.getNode()) return false;

// If the result is N, the sub-method updated N in place. Tell the legalizer		// If the result is N, the sub-method updated N in place. Tell the legalizer
// core about this.		// core about this.
if (Res.getNode() == N)		if (Res.getNode() == N)
SDValue DAGTypeLegalizer::PromoteIntOp_ZERO_EXTEND(SDNode *N) {		SDValue DAGTypeLegalizer::PromoteIntOp_ZERO_EXTEND(SDNode *N) {
SDLoc dl(N);		SDLoc dl(N);
SDValue Op = GetPromotedInteger(N->getOperand(0));		SDValue Op = GetPromotedInteger(N->getOperand(0));
Op = DAG.getNode(ISD::ANY_EXTEND, dl, N->getValueType(0), Op);		Op = DAG.getNode(ISD::ANY_EXTEND, dl, N->getValueType(0), Op);
return DAG.getZeroExtendInReg(Op, dl,		return DAG.getZeroExtendInReg(Op, dl,
N->getOperand(0).getValueType().getScalarType());		N->getOperand(0).getValueType().getScalarType());
}		}

		SDValue DAGTypeLegalizer::PromoteIntOp_HADD(SDNode *N) {
		SDValue Op = GetPromotedInteger(N->getOperand(0));
		return DAG.getNode(N->getOpcode(), SDLoc(N),
		N->getValueType(0), Op);
		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Integer Result Expansion		// Integer Result Expansion
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

/// ExpandIntegerResult - This method is called when the specified result of the		/// ExpandIntegerResult - This method is called when the specified result of the
/// specified node is found to need expansion. At this point, the node may also		/// specified node is found to need expansion. At this point, the node may also
/// have invalid operands or may have other results that need promotion, we just		/// have invalid operands or may have other results that need promotion, we just

lib/CodeGen/SelectionDAG/LegalizeTypes.h

private:
SDValue PromoteIntOp_SIGN_EXTEND(SDNode *N);		SDValue PromoteIntOp_SIGN_EXTEND(SDNode *N);
SDValue PromoteIntOp_SINT_TO_FP(SDNode *N);		SDValue PromoteIntOp_SINT_TO_FP(SDNode *N);
SDValue PromoteIntOp_STORE(StoreSDNode *N, unsigned OpNo);		SDValue PromoteIntOp_STORE(StoreSDNode *N, unsigned OpNo);
SDValue PromoteIntOp_TRUNCATE(SDNode *N);		SDValue PromoteIntOp_TRUNCATE(SDNode *N);
SDValue PromoteIntOp_UINT_TO_FP(SDNode *N);		SDValue PromoteIntOp_UINT_TO_FP(SDNode *N);
SDValue PromoteIntOp_ZERO_EXTEND(SDNode *N);		SDValue PromoteIntOp_ZERO_EXTEND(SDNode *N);
SDValue PromoteIntOp_MSTORE(MaskedStoreSDNode *N, unsigned OpNo);		SDValue PromoteIntOp_MSTORE(MaskedStoreSDNode *N, unsigned OpNo);
SDValue PromoteIntOp_MLOAD(MaskedLoadSDNode *N, unsigned OpNo);		SDValue PromoteIntOp_MLOAD(MaskedLoadSDNode *N, unsigned OpNo);
		SDValue PromoteIntOp_HADD(SDNode *N);

void PromoteSetCCOperands(SDValue &LHS,SDValue &RHS, ISD::CondCode Code);		void PromoteSetCCOperands(SDValue &LHS,SDValue &RHS, ISD::CondCode Code);

//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
// Integer Expansion Support: LegalizeIntegerTypes.cpp		// Integer Expansion Support: LegalizeIntegerTypes.cpp
//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//

/// GetExpandedInteger - Given a processed operand Op which was expanded into		/// GetExpandedInteger - Given a processed operand Op which was expanded into
private:
void SplitVecRes_VECTOR_SHUFFLE(ShuffleVectorSDNode *N, SDValue &Lo,		void SplitVecRes_VECTOR_SHUFFLE(ShuffleVectorSDNode *N, SDValue &Lo,
SDValue &Hi);		SDValue &Hi);

// Vector Operand Splitting: <128 x ty> -> 2 x <64 x ty>.		// Vector Operand Splitting: <128 x ty> -> 2 x <64 x ty>.
bool SplitVectorOperand(SDNode *N, unsigned OpNo);		bool SplitVectorOperand(SDNode *N, unsigned OpNo);
SDValue SplitVecOp_VSELECT(SDNode *N, unsigned OpNo);		SDValue SplitVecOp_VSELECT(SDNode *N, unsigned OpNo);
SDValue SplitVecOp_UnaryOp(SDNode *N);		SDValue SplitVecOp_UnaryOp(SDNode *N);
SDValue SplitVecOp_TruncateHelper(SDNode *N);		SDValue SplitVecOp_TruncateHelper(SDNode *N);
		SDValue SplitVecOp_HADD(SDNode *N);

SDValue SplitVecOp_BITCAST(SDNode *N);		SDValue SplitVecOp_BITCAST(SDNode *N);
SDValue SplitVecOp_EXTRACT_SUBVECTOR(SDNode *N);		SDValue SplitVecOp_EXTRACT_SUBVECTOR(SDNode *N);
SDValue SplitVecOp_EXTRACT_VECTOR_ELT(SDNode *N);		SDValue SplitVecOp_EXTRACT_VECTOR_ELT(SDNode *N);
SDValue SplitVecOp_STORE(StoreSDNode *N, unsigned OpNo);		SDValue SplitVecOp_STORE(StoreSDNode *N, unsigned OpNo);
SDValue SplitVecOp_MSTORE(MaskedStoreSDNode *N, unsigned OpNo);		SDValue SplitVecOp_MSTORE(MaskedStoreSDNode *N, unsigned OpNo);
SDValue SplitVecOp_MSCATTER(MaskedScatterSDNode *N, unsigned OpNo);		SDValue SplitVecOp_MSCATTER(MaskedScatterSDNode *N, unsigned OpNo);
SDValue SplitVecOp_MGATHER(MaskedGatherSDNode *N, unsigned OpNo);		SDValue SplitVecOp_MGATHER(MaskedGatherSDNode *N, unsigned OpNo);
private:
SDValue WidenVecOp_EXTEND(SDNode *N);		SDValue WidenVecOp_EXTEND(SDNode *N);
SDValue WidenVecOp_EXTRACT_VECTOR_ELT(SDNode *N);		SDValue WidenVecOp_EXTRACT_VECTOR_ELT(SDNode *N);
SDValue WidenVecOp_EXTRACT_SUBVECTOR(SDNode *N);		SDValue WidenVecOp_EXTRACT_SUBVECTOR(SDNode *N);
SDValue WidenVecOp_STORE(SDNode* N);		SDValue WidenVecOp_STORE(SDNode* N);
SDValue WidenVecOp_MSTORE(SDNode* N, unsigned OpNo);		SDValue WidenVecOp_MSTORE(SDNode* N, unsigned OpNo);
SDValue WidenVecOp_SETCC(SDNode* N);		SDValue WidenVecOp_SETCC(SDNode* N);

SDValue WidenVecOp_Convert(SDNode *N);		SDValue WidenVecOp_Convert(SDNode *N);
		SDValue WidenVecOp_HADD(SDNode *N);

//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
// Vector Widening Utilities Support: LegalizeVectorTypes.cpp		// Vector Widening Utilities Support: LegalizeVectorTypes.cpp
//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//

/// Helper GenWidenVectorLoads - Helper function to generate a set of		/// Helper GenWidenVectorLoads - Helper function to generate a set of
/// loads to load a vector with a resulting wider type. It takes		/// loads to load a vector with a resulting wider type. It takes
/// LdChain: list of chains for the load to be generated.		/// LdChain: list of chains for the load to be generated.

lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp

class VectorLegalizer {
SDValue LegalizeOp(SDValue Op);		SDValue LegalizeOp(SDValue Op);

/// \brief Assuming the node is legal, "legalize" the results.		/// \brief Assuming the node is legal, "legalize" the results.
SDValue TranslateLegalizeResults(SDValue Op, SDValue Result);		SDValue TranslateLegalizeResults(SDValue Op, SDValue Result);

/// \brief Implements unrolling a VSETCC.		/// \brief Implements unrolling a VSETCC.
SDValue UnrollVSETCC(SDValue Op);		SDValue UnrollVSETCC(SDValue Op);

		/// \brief Implements unrolling a HADD/FHADD.
		SDValue UnrollHADD(SDValue Op);

/// \brief Implement expand-based legalization of vector operations.		/// \brief Implement expand-based legalization of vector operations.
///		///
/// This is just a high-level routine to dispatch to specific code paths for		/// This is just a high-level routine to dispatch to specific code paths for
/// operations to legalize them.		/// operations to legalize them.
SDValue Expand(SDValue Op);		SDValue Expand(SDValue Op);

/// \brief Implements expansion for FNEG; falls back to UnrollVectorOp if		/// \brief Implements expansion for FNEG; falls back to UnrollVectorOp if
/// FSUB isn't legal.		/// FSUB isn't legal.
SDValue VectorLegalizer::Expand(SDValue Op) {
case ISD::UINT_TO_FP:		case ISD::UINT_TO_FP:
return ExpandUINT_TO_FLOAT(Op);		return ExpandUINT_TO_FLOAT(Op);
case ISD::FNEG:		case ISD::FNEG:
return ExpandFNEG(Op);		return ExpandFNEG(Op);
case ISD::SETCC:		case ISD::SETCC:
return UnrollVSETCC(Op);		return UnrollVSETCC(Op);
case ISD::UABSDIFF:		case ISD::UABSDIFF:
case ISD::SABSDIFF:		case ISD::SABSDIFF:
return ExpandABSDIFF(Op);		return ExpandABSDIFF(Op);
		jmolloy: Can't you just call ExpandHADD() here, or at least share the unroll and expand code?
		ashahid: Yes, I can probably share it.
		case ISD::HADD:
		case ISD::FHADD:
		return UnrollHADD(Op);
default:		default:
return DAG.UnrollVectorOp(Op.getNode());		return DAG.UnrollVectorOp(Op.getNode());
}		}
}		}

SDValue VectorLegalizer::ExpandABSDIFF(SDValue Op) {		SDValue VectorLegalizer::ExpandABSDIFF(SDValue Op) {
SDLoc dl(Op);		SDLoc dl(Op);
SDValue Tmp1, Tmp2, Tmp3, Tmp4;		SDValue Tmp1, Tmp2, Tmp3, Tmp4;
for (unsigned i = 0; i < NumElems; ++i) {
Ops[i] = DAG.getSelect(dl, EltVT, Ops[i],		Ops[i] = DAG.getSelect(dl, EltVT, Ops[i],
DAG.getConstant(APInt::getAllOnesValue		DAG.getConstant(APInt::getAllOnesValue
(EltVT.getSizeInBits()), dl, EltVT),		(EltVT.getSizeInBits()), dl, EltVT),
DAG.getConstant(0, dl, EltVT));		DAG.getConstant(0, dl, EltVT));
}		}
return DAG.getNode(ISD::BUILD_VECTOR, dl, VT, Ops);		return DAG.getNode(ISD::BUILD_VECTOR, dl, VT, Ops);
}		}

		SDValue VectorLegalizer::UnrollHADD(SDValue Op) {
		if (Op->getOpcode() == ISD::FHADD)
		assert(DAG.getTarget().Options.UnsafeFPMath &&
		"Floating point horizontal sum only supported for fast-math");
		bruno: Using the assert inside this "if" will be a problem when asserts aren't enabled.
		SDLoc dl(Op);
		hfinkel: Remove the UnsafeFPMath check.
		SDValue OpVal = Op.getOperand(0);
		EVT VT = OpVal.getValueType();
		unsigned NumElems = VT.getVectorNumElements();
		hfinkel: If we can get rid of the undefined behavior for unsigned overflow, then remove this line.
		EVT EltVT = VT.getVectorElementType();

		SmallVector<int, 16> ShuffleMask(NumElems, -1);
		for (unsigned i = NumElems; i != 1; i >>= 1) {
		// Move the upper half of the vector to the lower half.
		for (unsigned j = 0; j != i / 2; ++j)
		hfinkel: Given that the order of additions is undefined, we can add NoSignedWrap or NoUnsignedWrap in the integer case.
		ShuffleMask[j] = (i / 2 + j);
		SDValue Shuffle =
		DAG.getVectorShuffle(VT, dl, OpVal, DAG.getUNDEF(VT), &ShuffleMask[0]);
		OpVal = DAG.getNode(Op->getOpcode() == ISD::HADD ? ISD::ADD : ISD::FADD, dl,
		VT, OpVal, Shuffle);
		}
		SDValue Tmp1 = DAG.getNode(
		ISD::EXTRACT_VECTOR_ELT, dl, EltVT, OpVal,
		DAG.getConstant(0, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
		if (Op->getOpcode() == ISD::HADD) {
		Tmp1 = DAG.getAnyExtOrTrunc(Tmp1, dl, Op.getValueType());
		}
		return Tmp1;
		}
}		}

		bruno: For the integer part (ISD::HADD), I believe you could use "vector shifts + vector adds" instead of "extracts + scalar adds"; it is probably better not to leave the vector domain. If the current target doesn't support "vector shifts + vector adds" for the element type, your implementation should fall back to "extracts + scalar adds". To check that, you can use the following in UnrollHADD: if (TLI.getOperationAction(ISD::SHL, VT) == TargetLowering::Expand || TLI.getOperationAction(ISD::ADD, VT) == TargetLowering::Expand) ....
bool SelectionDAG::LegalizeVectors() {		bool SelectionDAG::LegalizeVectors() {
return VectorLegalizer(*this).Run();		return VectorLegalizer(*this).Run();
}		}
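
The UnrollHADD expansion above halves the live width each round: it shuffles the upper half of the lanes down onto the lower half, adds, and finally extracts lane 0. The following is a minimal scalar model of that idea, not the patch's code. The name `hsumByHalving` and the use of `std::vector` are assumptions made for illustration, and the model assumes a power-of-two lane count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a halving reduction. Each round adds the upper half of the
// live lanes onto the lower half, standing in for one shuffle plus one vector
// add. Lanes at or above Half are never read again, which plays the role of
// the undef (-1) entries in the shuffle mask.
std::uint32_t hsumByHalving(std::vector<std::uint32_t> Lanes) {
  std::size_t N = Lanes.size();
  assert(N != 0 && (N & (N - 1)) == 0 && "lane count must be a power of two");
  for (std::size_t Width = N; Width != 1; Width >>= 1) {
    std::size_t Half = Width / 2;
    for (std::size_t J = 0; J != Half; ++J)
      Lanes[J] += Lanes[Half + J]; // unsigned arithmetic wraps on overflow
  }
  return Lanes[0]; // corresponds to extracting element 0
}
```

For N lanes this takes log2(N) rounds, which is why the 4-lane integer tests in this review show two shuffle/add pairs before the final move to a scalar register.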

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp

#endif
case ISD::FRINT:		case ISD::FRINT:
case ISD::FROUND:		case ISD::FROUND:
case ISD::FSIN:		case ISD::FSIN:
case ISD::FSQRT:		case ISD::FSQRT:
case ISD::FTRUNC:		case ISD::FTRUNC:
case ISD::SINT_TO_FP:		case ISD::SINT_TO_FP:
case ISD::TRUNCATE:		case ISD::TRUNCATE:
case ISD::UINT_TO_FP:		case ISD::UINT_TO_FP:
		case ISD::HADD:
		case ISD::FHADD:
SplitVecRes_UnaryOp(N, Lo, Hi);		SplitVecRes_UnaryOp(N, Lo, Hi);
break;		break;

case ISD::ANY_EXTEND:		case ISD::ANY_EXTEND:
case ISD::SIGN_EXTEND:		case ISD::SIGN_EXTEND:
case ISD::ZERO_EXTEND:		case ISD::ZERO_EXTEND:
SplitVecRes_ExtendOp(N, Lo, Hi);		SplitVecRes_ExtendOp(N, Lo, Hi);
break;		break;

void DAGTypeLegalizer::SplitVecRes_BinOp(SDNode *N, SDValue &Lo,		void DAGTypeLegalizer::SplitVecRes_BinOp(SDNode *N, SDValue &Lo,
SDValue &Hi) {		SDValue &Hi) {
SDValue LHSLo, LHSHi;		SDValue LHSLo, LHSHi;
GetSplitVector(N->getOperand(0), LHSLo, LHSHi);		GetSplitVector(N->getOperand(0), LHSLo, LHSHi);
SDValue RHSLo, RHSHi;		SDValue RHSLo, RHSHi;
GetSplitVector(N->getOperand(1), RHSLo, RHSHi);		GetSplitVector(N->getOperand(1), RHSLo, RHSHi);
SDLoc dl(N);		SDLoc dl(N);

		hfinkel: Same comment here as above (we can add signed or unsigned nowrap here).
Lo = DAG.getNode(N->getOpcode(), dl, LHSLo.getValueType(), LHSLo, RHSLo);		Lo = DAG.getNode(N->getOpcode(), dl, LHSLo.getValueType(), LHSLo, RHSLo);
Hi = DAG.getNode(N->getOpcode(), dl, LHSHi.getValueType(), LHSHi, RHSHi);		Hi = DAG.getNode(N->getOpcode(), dl, LHSHi.getValueType(), LHSHi, RHSHi);
}		}
		hfinkel: Same comment here too.

void DAGTypeLegalizer::SplitVecRes_TernaryOp(SDNode *N, SDValue &Lo,		void DAGTypeLegalizer::SplitVecRes_TernaryOp(SDNode *N, SDValue &Lo,
SDValue &Hi) {		SDValue &Hi) {
SDValue Op0Lo, Op0Hi;		SDValue Op0Lo, Op0Hi;
GetSplitVector(N->getOperand(0), Op0Lo, Op0Hi);		GetSplitVector(N->getOperand(0), Op0Lo, Op0Hi);
SDValue Op1Lo, Op1Hi;		SDValue Op1Lo, Op1Hi;
GetSplitVector(N->getOperand(1), Op1Lo, Op1Hi);		GetSplitVector(N->getOperand(1), Op1Lo, Op1Hi);
SDValue Op2Lo, Op2Hi;		SDValue Op2Lo, Op2Hi;
#endif
case ISD::CTPOP:		case ISD::CTPOP:
case ISD::FP_EXTEND:		case ISD::FP_EXTEND:
case ISD::SIGN_EXTEND:		case ISD::SIGN_EXTEND:
case ISD::ZERO_EXTEND:		case ISD::ZERO_EXTEND:
case ISD::ANY_EXTEND:		case ISD::ANY_EXTEND:
case ISD::FTRUNC:		case ISD::FTRUNC:
Res = SplitVecOp_UnaryOp(N);		Res = SplitVecOp_UnaryOp(N);
break;		break;
		case ISD::HADD:
		case ISD::FHADD:
		Res = SplitVecOp_HADD(N);
		break;
}		}
}		}

// If the result is null, the sub-method took care of registering results etc.		// If the result is null, the sub-method took care of registering results etc.
if (!Res.getNode()) return false;		if (!Res.getNode()) return false;

// If the result is N, the sub-method updated N in place. Tell the legalizer		// If the result is N, the sub-method updated N in place. Tell the legalizer
// core about this.		// core about this.
SDValue DAGTypeLegalizer::SplitVecOp_BITCAST(SDNode *N) {

if (DAG.getDataLayout().isBigEndian())		if (DAG.getDataLayout().isBigEndian())
std::swap(Lo, Hi);		std::swap(Lo, Hi);

return DAG.getNode(ISD::BITCAST, SDLoc(N), N->getValueType(0),		return DAG.getNode(ISD::BITCAST, SDLoc(N), N->getValueType(0),
JoinIntegers(Lo, Hi));		JoinIntegers(Lo, Hi));
}		}

		SDValue DAGTypeLegalizer::SplitVecOp_HADD(SDNode *N) {
		SDValue Lo, Hi;
		EVT SubVT = N->getValueType(0);
		SDValue OpVal = N->getOperand(0);
		SDLoc dl(N);
		GetSplitVector(OpVal, Lo, Hi);
		Lo = DAG.getNode(N->getOpcode(), dl, SubVT, Lo);
		Hi = DAG.getNode(N->getOpcode(), dl, SubVT, Hi);
		return DAG.getNode(N->getOpcode() == ISD::HADD ? ISD::ADD : ISD::FADD, dl,
		SubVT, Lo, Hi);
		}
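
SplitVecOp_HADD relies on the identity that the horizontal sum of a vector equals the sum of the horizontal sums of its low and high halves. A minimal scalar sketch of that recursion follows; `hsumSplit`, `hsumLegal`, and the `MaxLegalLanes` threshold are hypothetical names standing in for the type legalizer's notion of a legal vector width.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the widest vector the target can handle directly.
constexpr std::size_t MaxLegalLanes = 2;

// Reduce a vector that is already narrow enough to be "legal".
std::uint64_t hsumLegal(const std::vector<std::uint64_t> &V) {
  std::uint64_t Sum = 0;
  for (std::uint64_t X : V)
    Sum += X;
  return Sum;
}

// Split an over-wide vector into low and high halves, reduce each half
// separately, and combine the two scalar results with one more add.
std::uint64_t hsumSplit(const std::vector<std::uint64_t> &V) {
  if (V.size() <= MaxLegalLanes)
    return hsumLegal(V);
  auto Mid = static_cast<std::ptrdiff_t>(V.size() / 2);
  std::vector<std::uint64_t> Lo(V.begin(), V.begin() + Mid);
  std::vector<std::uint64_t> Hi(V.begin() + Mid, V.end());
  return hsumSplit(Lo) + hsumSplit(Hi);
}
```

The integer case is insensitive to this regrouping because unsigned addition is associative; the floating-point case is not, which is the concern behind the fast-math discussion in this review.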

SDValue DAGTypeLegalizer::SplitVecOp_EXTRACT_SUBVECTOR(SDNode *N) {		SDValue DAGTypeLegalizer::SplitVecOp_EXTRACT_SUBVECTOR(SDNode *N) {
// We know that the extracted result type is legal.		// We know that the extracted result type is legal.
EVT SubVT = N->getValueType(0);		EVT SubVT = N->getValueType(0);
SDValue Idx = N->getOperand(1);		SDValue Idx = N->getOperand(1);
SDLoc dl(N);		SDLoc dl(N);
SDValue Lo, Hi;		SDValue Lo, Hi;
GetSplitVector(N->getOperand(0), Lo, Hi);		GetSplitVector(N->getOperand(0), Lo, Hi);

#endif
case ISD::FP_EXTEND:		case ISD::FP_EXTEND:
case ISD::FP_TO_SINT:		case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT:		case ISD::FP_TO_UINT:
case ISD::SINT_TO_FP:		case ISD::SINT_TO_FP:
case ISD::UINT_TO_FP:		case ISD::UINT_TO_FP:
case ISD::TRUNCATE:		case ISD::TRUNCATE:
Res = WidenVecOp_Convert(N);		Res = WidenVecOp_Convert(N);
break;		break;
		case ISD::HADD:
		hfinkel: Why is ISD::HADD not handled here?
		ashahid: Probably my tests were not sufficient to reveal the need to handle ISD::HADD. I will check.
		case ISD::FHADD:
		Res = WidenVecOp_HADD(N);
		break;
}		}

// If Res is null, the sub-method took care of registering the result.		// If Res is null, the sub-method took care of registering the result.
if (!Res.getNode()) return false;		if (!Res.getNode()) return false;

// If the result is N, the sub-method updated N in place. Tell the legalizer		// If the result is N, the sub-method updated N in place. Tell the legalizer
// core about this.		// core about this.
if (Res.getNode() == N)		if (Res.getNode() == N)
return true;		return true;


assert(Res.getValueType() == N->getValueType(0) && N->getNumValues() == 1 &&		assert(Res.getValueType() == N->getValueType(0) && N->getNumValues() == 1 &&
"Invalid operand expansion");		"Invalid operand expansion");

ReplaceValueWith(SDValue(N, 0), Res);		ReplaceValueWith(SDValue(N, 0), Res);
return false;		return false;
}		}

		SDValue DAGTypeLegalizer::WidenVecOp_HADD(SDNode *N) {
		if (N->getOpcode() == ISD::FHADD)
		assert(DAG.getTarget().Options.UnsafeFPMath &&
		"Floating point horizontal sum only supported for fast-math");
		hfinkel: What exactly does this do? Widening normally introduces undefs, but you can't add a bunch of undefs and get anything other than an undef out. You might need the SDAG node to carry an extra parameter indicating how many of the vector lanes are actually to be added, to properly support widening.
		ashahid: Yes, it does introduce undefs. Sorry, I did not understand your suggestion regarding the extra SDAG parameter.
		// Since widen introduces undefs, scalarize the horizontal sum for actual
		// number of vector elements.
		// TODO: Improve scalarization using vector shift and add on supported
		// targets.
		SDValue Op = N->getOperand(0);
		EVT VT = Op.getValueType();
		unsigned NumElems = VT.getVectorNumElements();
		EVT EltVT = VT.getVectorElementType();
		SDLoc dl(Op);
		hfinkel: Same here.

		SDValue LHSElem = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, EltVT, Op,
		DAG.getConstant(0, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
		SDValue RHSElem = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, EltVT, Op,
		hfinkel: Same here.
		DAG.getConstant(1, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
		SDValue Ops;
		Ops = DAG.getNode(N->getOpcode() == ISD::HADD ? ISD::ADD : ISD::FADD, dl,
		EltVT, LHSElem, RHSElem);
		for (unsigned i = 2; i < NumElems; ++i) {
		hfinkel: And here.
		LHSElem = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, EltVT, Op,
		DAG.getConstant(i, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
		Ops = DAG.getNode(N->getOpcode() == ISD::HADD ? ISD::ADD : ISD::FADD, dl,
		EltVT, LHSElem, Ops);
		}
		return Ops;
		}
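
The widening concern raised above is that padding lanes hold undefined values, so a reduction over the full widened register would be meaningless. WidenVecOp_HADD sidesteps this by scalarizing over the original element count only. A minimal sketch of that behaviour, with the hypothetical helper `hsumWidened` and an explicit `NumOrigElems` argument playing the role of the extra lane-count parameter hfinkel suggests:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum only the lanes that existed before widening. Lanes at index
// NumOrigElems and above are padding and are deliberately never read.
std::uint32_t hsumWidened(const std::vector<std::uint32_t> &Widened,
                          std::size_t NumOrigElems) {
  assert(NumOrigElems >= 1 && NumOrigElems <= Widened.size() &&
         "original element count must fit in the widened vector");
  std::uint32_t Sum = Widened[0];
  for (std::size_t I = 1; I < NumOrigElems; ++I)
    Sum += Widened[I];
  return Sum;
}
```

For example, a `<3 x i32>` value widened to four lanes would be reduced with `NumOrigElems` equal to 3, so whatever happens to sit in the fourth lane cannot affect the result.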


SDValue DAGTypeLegalizer::WidenVecOp_EXTEND(SDNode *N) {		SDValue DAGTypeLegalizer::WidenVecOp_EXTEND(SDNode *N) {
SDLoc DL(N);		SDLoc DL(N);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);

SDValue InOp = N->getOperand(0);		SDValue InOp = N->getOperand(0);
// If some legalization strategy other than widening is used on the operand,		// If some legalization strategy other than widening is used on the operand,
// we can't safely assume that just extending the low lanes is the correct		// we can't safely assume that just extending the low lanes is the correct
// transformation.		// transformation.
if (getTypeAction(InOp.getValueType()) != TargetLowering::TypeWidenVector)		if (getTypeAction(InOp.getValueType()) != TargetLowering::TypeWidenVector)
return WidenVecOp_Convert(N);		return WidenVecOp_Convert(N);
InOp = GetWidenedVector(InOp);		InOp = GetWidenedVector(InOp);
assert(VT.getVectorNumElements() <		assert(VT.getVectorNumElements() <
InOp.getValueType().getVectorNumElements() &&		InOp.getValueType().getVectorNumElements() &&
"Input wasn't widened!");		"Input wasn't widened!");

// We may need to further widen the operand until it has the same total		// We may need to further widen the operand until it has the same total
// vector size as the result.		// vector size as the result.
EVT InVT = InOp.getValueType();		EVT InVT = InOp.getValueType();
if (InVT.getSizeInBits() != VT.getSizeInBits()) {		if (InVT.getSizeInBits() != VT.getSizeInBits()) {
EVT InEltVT = InVT.getVectorElementType();		EVT InEltVT = InVT.getVectorElementType();
for (int i = MVT::FIRST_VECTOR_VALUETYPE, e = MVT::LAST_VECTOR_VALUETYPE; i < e; ++i) {		for (int i = MVT::FIRST_VECTOR_VALUETYPE, e = MVT::LAST_VECTOR_VALUETYPE; i < e; ++i) {
EVT FixedVT = (MVT::SimpleValueType)i;		EVT FixedVT = (MVT::SimpleValueType)i;
		hfinkel: Remove the UnsafeFPMath check.
EVT FixedEltVT = FixedVT.getVectorElementType();		EVT FixedEltVT = FixedVT.getVectorElementType();
if (TLI.isTypeLegal(FixedVT) &&		if (TLI.isTypeLegal(FixedVT) &&
FixedVT.getSizeInBits() == VT.getSizeInBits() &&		FixedVT.getSizeInBits() == VT.getSizeInBits() &&
FixedEltVT == InEltVT) {		FixedEltVT == InEltVT) {
assert(FixedVT.getVectorNumElements() >= VT.getVectorNumElements() &&		assert(FixedVT.getVectorNumElements() >= VT.getVectorNumElements() &&
"Not enough elements in the fixed type for the operand!");		"Not enough elements in the fixed type for the operand!");
assert(FixedVT.getVectorNumElements() != InVT.getVectorNumElements() &&		assert(FixedVT.getVectorNumElements() != InVT.getVectorNumElements() &&
"We can't have the same type as we started with!");		"We can't have the same type as we started with!");

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp

setValue(&I, DAG.getNode(ISD::UABSDIFF, sdl,
getValue(I.getArgOperand(1))));		getValue(I.getArgOperand(1))));
return nullptr;		return nullptr;
case Intrinsic::sabsdiff:		case Intrinsic::sabsdiff:
setValue(&I, DAG.getNode(ISD::SABSDIFF, sdl,		setValue(&I, DAG.getNode(ISD::SABSDIFF, sdl,
getValue(I.getArgOperand(0)).getValueType(),		getValue(I.getArgOperand(0)).getValueType(),
getValue(I.getArgOperand(0)),		getValue(I.getArgOperand(0)),
getValue(I.getArgOperand(1))));		getValue(I.getArgOperand(1))));
return nullptr;		return nullptr;
		case Intrinsic::hadd: {
		SDValue Arg = getValue(I.getArgOperand(0));
		EVT Ty = Arg.getValueType();
		EVT EltVT = Ty.getVectorElementType();
		setValue(&I, DAG.getNode(
		EltVT.isFloatingPoint() ? ISD::FHADD : ISD::HADD,
		sdl, TLI.getValueType(DAG.getDataLayout(), I.getType()),
		Arg));
		return nullptr;
		}
case Intrinsic::cttz: {		case Intrinsic::cttz: {
SDValue Arg = getValue(I.getArgOperand(0));		SDValue Arg = getValue(I.getArgOperand(0));
ConstantInt *CI = cast<ConstantInt>(I.getArgOperand(1));		ConstantInt *CI = cast<ConstantInt>(I.getArgOperand(1));
EVT Ty = Arg.getValueType();		EVT Ty = Arg.getValueType();
setValue(&I, DAG.getNode(CI->isZero() ? ISD::CTTZ : ISD::CTTZ_ZERO_UNDEF,		setValue(&I, DAG.getNode(CI->isZero() ? ISD::CTTZ : ISD::CTTZ_ZERO_UNDEF,
sdl, Ty, Arg));		sdl, Ty, Arg));
return nullptr;		return nullptr;
}		}

lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp

#endif
case ISD::FRINT: return "frint";		case ISD::FRINT: return "frint";
case ISD::FNEARBYINT: return "fnearbyint";		case ISD::FNEARBYINT: return "fnearbyint";
case ISD::FROUND: return "fround";		case ISD::FROUND: return "fround";
case ISD::FEXP: return "fexp";		case ISD::FEXP: return "fexp";
case ISD::FEXP2: return "fexp2";		case ISD::FEXP2: return "fexp2";
case ISD::FLOG: return "flog";		case ISD::FLOG: return "flog";
case ISD::FLOG2: return "flog2";		case ISD::FLOG2: return "flog2";
case ISD::FLOG10: return "flog10";		case ISD::FLOG10: return "flog10";
		case ISD::HADD: return "hadd";
		case ISD::FHADD: return "fhadd";

// Binary operators		// Binary operators
case ISD::ADD: return "add";		case ISD::ADD: return "add";
case ISD::SUB: return "sub";		case ISD::SUB: return "sub";
case ISD::MUL: return "mul";		case ISD::MUL: return "mul";
case ISD::MULHU: return "mulhu";		case ISD::MULHU: return "mulhu";
case ISD::MULHS: return "mulhs";		case ISD::MULHS: return "mulhs";
case ISD::SDIV: return "sdiv";		case ISD::SDIV: return "sdiv";

lib/CodeGen/TargetLoweringBase.cpp

for (MVT VT : MVT::all_valuetypes()) {
setOperationAction(ISD::SADDO, VT, Expand);		setOperationAction(ISD::SADDO, VT, Expand);
setOperationAction(ISD::SSUBO, VT, Expand);		setOperationAction(ISD::SSUBO, VT, Expand);
setOperationAction(ISD::UADDO, VT, Expand);		setOperationAction(ISD::UADDO, VT, Expand);
setOperationAction(ISD::USUBO, VT, Expand);		setOperationAction(ISD::USUBO, VT, Expand);
setOperationAction(ISD::SMULO, VT, Expand);		setOperationAction(ISD::SMULO, VT, Expand);
setOperationAction(ISD::UMULO, VT, Expand);		setOperationAction(ISD::UMULO, VT, Expand);
setOperationAction(ISD::UABSDIFF, VT, Expand);		setOperationAction(ISD::UABSDIFF, VT, Expand);
setOperationAction(ISD::SABSDIFF, VT, Expand);		setOperationAction(ISD::SABSDIFF, VT, Expand);
		setOperationAction(ISD::HADD, VT, Expand);
		setOperationAction(ISD::FHADD, VT, Expand);

// These library functions default to expand.		// These library functions default to expand.
setOperationAction(ISD::FROUND, VT, Expand);		setOperationAction(ISD::FROUND, VT, Expand);

// These operations default to expand for vector types.		// These operations default to expand for vector types.
if (VT.isVector()) {		if (VT.isVector()) {
setOperationAction(ISD::FCOPYSIGN, VT, Expand);		setOperationAction(ISD::FCOPYSIGN, VT, Expand);
setOperationAction(ISD::ANY_EXTEND_VECTOR_INREG, VT, Expand);		setOperationAction(ISD::ANY_EXTEND_VECTOR_INREG, VT, Expand);

test/CodeGen/X86/vec-hadd-float-128.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -enable-unsafe-fp-math | FileCheck --check-prefix=UNSAFE %s

				bruno: Could you please update your tests to be more target neutral? I mean, use -mtriple=x86_64-unknown-unknown instead. One question: what code does it emit if one removes -enable-unsafe-fp-math? If it currently makes no difference, you can remove it; otherwise you should be testing both versions.
				ashahid: OK, I will make it more target neutral accordingly. In fact, the user is not supposed to use the float version of this intrinsic at all; if they do, the compiler will 'assert'.
				bruno: Thanks for updating the tests. Could you please use ./utils/update_llc_test_checks.py to rewrite the test checks? This will make it easier later on to see the improvements from target customization or widening (as you mentioned in a TODO).

				declare float @llvm.hadd.f32.v2f32(<2 x float>)

				define float @test1_hadd_float_f32(<2 x float> %a1) {
				; UNSAFE-LABEL: test1_hadd_float_f32
				; UNSAFE: movaps %xmm0, %xmm1
				; UNSAFE-NEXT: shufps $229, %xmm1, %xmm1 # xmm1 = xmm1[1,1,2,3]
				; UNSAFE-NEXT: addss %xmm0, %xmm1
				; UNSAFE-NEXT: movaps %xmm1, %xmm0
				davidxl: Should it be shufps .... xmm1 = xmm1[1, ?, ?, ?]?
				; UNSAFE-NEXT: retq
				%1 = call float @llvm.hadd.f32.v2f32(<2 x float> %a1)
				ret float %1
				davidxl: This shufps and addps should not be expected.
				}

				declare float @llvm.hadd.f32.v4f32(<4 x float>)

				define float @test2_hadd_float_f32(<4 x float> %a1) {
				; UNSAFE-LABEL: test2_hadd_float_f32
				; UNSAFE: movapd %xmm0, %xmm1
				; UNSAFE-NEXT: shufpd $1, %xmm1, %xmm1 # xmm1 = xmm1[1,0]
				; UNSAFE-NEXT: addps %xmm0, %xmm1
				; UNSAFE-NEXT: movaps %xmm1, %xmm0
				; UNSAFE-NEXT: shufps $237, %xmm0, %xmm0 # xmm0 = xmm0[1,3,2,3]
				; UNSAFE-NEXT: addps %xmm1, %xmm0
				; UNSAFE-NEXT: retq
				%1 = call float @llvm.hadd.f32.v4f32(<4 x float> %a1)
				ret float %1
				}

				declare double @llvm.hadd.f64.v2f64(<2 x double>)

				define double @test1_hadd_float_f64(<2 x double> %a1) {
				; UNSAFE-LABEL: test1_hadd_float_f64
				; UNSAFE: movapd %xmm0, %xmm1
				; UNSAFE-NEXT: shufpd $1, %xmm1, %xmm1 # xmm1 = xmm1[1,0]
				; UNSAFE-NEXT: addpd %xmm0, %xmm1
				; UNSAFE-NEXT: movapd %xmm1, %xmm0
				; UNSAFE-NEXT: retq
				%1 = call double @llvm.hadd.f64.v2f64(<2 x double> %a1)
				ret double %1
				}
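
The reason the floating-point variant is tied to fast-math in this review is that the halving expansion regroups the additions, and float addition is not associative. The sketch below contrasts a strict left-to-right sum with the pairing a 4-lane halving reduction would use. The function names are hypothetical, and the stated results assume IEEE-754 single precision with each intermediate rounded to `float`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Strict order: accumulate element 0, then 1, and so on.
float hsumSequential(const std::vector<float> &V) {
  float Acc = 0.0f;
  for (float X : V)
    Acc += X;
  return Acc;
}

// Pairing used by a 4-lane halving reduction: (v0 + v2) + (v1 + v3).
float hsumHalving4(const std::vector<float> &V) {
  assert(V.size() == 4 && "this sketch only models the 4-lane case");
  float Lane0 = V[0] + V[2];
  float Lane1 = V[1] + V[3];
  return Lane0 + Lane1;
}
```

With the input `{1e8f, 1.0f, -1e8f, 1.0f}`, the sequential sum loses the first `1.0f` when it is added to `1e8f` and ends at `1.0f`, while the halving order cancels the large terms first and ends at `2.0f`. Either answer may be acceptable under fast-math, but only one of them matches a specified evaluation order.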

test/CodeGen/X86/vec-hadd-float-256.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -enable-unsafe-fp-math | FileCheck --check-prefix=UNSAFE %s

				declare double @llvm.hadd.f64.v8f64(<8 x double>)

				define double @test2_hadd_float_f64(<8 x double> %a1) {
				RKSimon: The test file is vec-hadd-float-256, yet you are testing a 512-bit vector? Change to <4 x double>.
				; UNSAFE-LABEL: test2_hadd_float_f64
				; UNSAFE: movapd {{.*}}
				; UNSAFE-NEXT: shufpd {{.*#+}} xmm4 = xmm4[1,0]
				; UNSAFE-NEXT: addpd {{.*}}
				; UNSAFE-NEXT: movapd {{.*}}
				; UNSAFE-NEXT: shufpd {{.*}} xmm3 = xmm3[1,0]
				; UNSAFE-NEXT: addpd {{.*}}
				; UNSAFE-NEXT: addsd {{.*}}
				; UNSAFE-NEXT: movapd {{.*}}
				; UNSAFE-NEXT: shufpd {{.*#+}} xmm2 = xmm2[1,0]
				; UNSAFE-NEXT: addpd {{.*}}
				; UNSAFE-NEXT: movapd {{.*}}
				; UNSAFE-NEXT: shufpd {{.*}} xmm1 = xmm1[1,0]
				; UNSAFE-NEXT: addpd {{.*}}
				; UNSAFE-NEXT: addsd {{.*}}
				; UNSAFE-NEXT: addsd {{.*}}
				; UNSAFE-NEXT: movapd {{.*}}
				; UNSAFE-NEXT: retq
				%1 = call double @llvm.hadd.f64.v8f64(<8 x double> %a1)
				ret double %1
				}

test/CodeGen/X86/vec-hadd-int-128.ll

This file was added.

				; RUN: llc -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s -check-prefix=CHECK

				declare i8 @llvm.hadd.i8.v4i8(<4 x i8>)

				define i8 @test_hadd_int_i8(<4 x i8> %a1) {
				; CHECK-LABEL: test_hadd_int_i8
				; CHECK: pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; CHECK-NEXT: paddd {{.*}}
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,3,2,3]
				davidxl: The result does not look right. Should pshufb be generated instead?
				congh: I think a shift operation is required here, given that we only have SSE2 support for x86_64.
				; CHECK-NEXT: paddd {{.*}}
				; CHECK-NEXT: movd %xmm0, %eax
				; CHECK-NEXT: retq
				%1 = call i8 @llvm.hadd.i8.v4i8(<4 x i8> %a1)
				ret i8 %1
				}

				declare i16 @llvm.hadd.i16.v4i16(<4 x i16>)

				define i16 @test_hadd_int_i16(<4 x i16> %a1) {
				; CHECK-LABEL: test_hadd_int_i16
				; CHECK: pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; CHECK-NEXT: paddd {{.*}}
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,3,2,3]
				; CHECK-NEXT: paddd {{.*}}
				; CHECK-NEXT: movd %xmm0, %eax
				davidxl: Should pshufw be generated? Or would phaddw be more efficient?
				congh: In SSE2, pshuflw should be generated here. phaddw was introduced in SSSE3.
				; CHECK-NEXT: retq
				%1 = call i16 @llvm.hadd.i16.v4i16(<4 x i16> %a1)
				ret i16 %1
				}

				declare i32 @llvm.hadd.i32.v3i32(<3 x i32>)

				define i32 @test1_hadd_int_i32(<3 x i32> %a1) {
				; CHECK-LABEL: test1_hadd_int_i32
				; CHECK: movd %xmm0, %eax
				; CHECK-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]
				; CHECK-NEXT: movd %xmm1, %ecx
				; CHECK-NEXT: addl %eax, %ecx
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
				; CHECK-NEXT: movd %xmm0, %eax
				; CHECK-NEXT: addl %ecx, %eax
				; CHECK-NEXT: retq
				%1 = call i32 @llvm.hadd.i32.v3i32(<3 x i32> %a1)
				ret i32 %1
				}

				declare i32 @llvm.hadd.i32.v4i32(<4 x i32>)

				define i32 @test2_hadd_int_i32(<4 x i32> %a1) {
				; CHECK-LABEL: test2_hadd_int_i32
				; CHECK: pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; CHECK-NEXT: paddd {{.*}}
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,3,2,3]
				; CHECK-NEXT: paddd {{.*}}
				; CHECK-NEXT: movd %xmm0, %eax
				; CHECK-NEXT: retq
				%1 = call i32 @llvm.hadd.i32.v4i32(<4 x i32> %a1)
				ret i32 %1
				}

				declare i64 @llvm.hadd.i64.v2i64(<2 x i64>)

				define i64 @test1_hadd_int_i64(<2 x i64> %a1) {
				; CHECK-LABEL: test1_hadd_int_i64
				; CHECK: pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; CHECK-NEXT: paddq {{.*}}
				; CHECK-NEXT: movd %xmm1, %rax
				; CHECK-NEXT: retq
				%1 = call i64 @llvm.hadd.i64.v2i64(<2 x i64> %a1)
				ret i64 %1
				}

test/CodeGen/X86/vec-hadd-int-256.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu | FileCheck %s --check-prefix=CHECK

				declare i64 @llvm.hadd.i64.v4i64(<4 x i64>)

				define i64 @test2_hadd_int_i64(<4 x i64> %a1) {
				; CHECK-LABEL: test2_hadd_int_i64:
				; CHECK: pshufd {{.*#+}} xmm2 = xmm1[2,3,0,1]
				; CHECK-NEXT: paddq {{.*}}
				; CHECK-NEXT: movd %xmm2, %rcx
				; CHECK-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; CHECK-NEXT: paddq {{.*}}
				; CHECK-NEXT: movd %xmm1, %rax
				; CHECK-NEXT: addq %rcx, %rax
				; CHECK-NEXT: retq
				%1 = call i64 @llvm.hadd.i64.v4i64(<4 x i64> %a1)
				RKSimon: This codegen is the same as for the test1_hsum_int_i64 <2 x i64> version in vec-hadd-int-128.ll, so something is going wrong. You should probably compare against codegen from an AVX2 target.
				ashahid: With AVX2 the generated code differs as below.
				Case V2i64
				vpshufd $78, %xmm0, %xmm1 # xmm1 = xmm0[2,3,0,1]
				vpaddq %xmm1, %xmm0, %xmm0
				vmovq %xmm0, %rax
				retq
				Case V4i64
				vextracti128 $1, %ymm0, %xmm1
				vpaddq %ymm1, %ymm0, %ymm0
				vpermq $237, %ymm0, %ymm1 # ymm1 = ymm0[1,3,2,3]
				vpaddq %ymm1, %ymm0, %ymm0
				vmovq %xmm0, %rax
				vzeroupper
				retq
				RKSimon: So yes, it appears that something is wrong with the legalization. When you build for SSE you only get the hsum of the bottom <2 x i64>; when you build for AVX (which legalizes <4 x i64>) you get the hsum of the whole <2 x i64>.
				ret i64 %1
				}

Revision Contents

Diff 30691

docs/LangRef.rst

include/llvm/CodeGen/ISDOpcodes.h

include/llvm/IR/Intrinsics.td

include/llvm/Target/TargetSelectionDAG.td

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp

lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp

lib/CodeGen/SelectionDAG/LegalizeTypes.h

lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp

lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp

lib/CodeGen/TargetLoweringBase.cpp

test/CodeGen/X86/vec-hadd-float-128.ll

test/CodeGen/X86/vec-hadd-float-256.ll

test/CodeGen/X86/vec-hadd-int-128.ll

test/CodeGen/X86/vec-hadd-int-256.ll