This is an archive of the discontinued LLVM Phabricator instance.

[Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.
Abandoned, Public

Authored by jmolloy on Jul 6 2015, 9:37 AM.


Details

Reviewers

rengolin
• ashahid
hfinkel

Summary

This adds new intrinsics "hadd_*" for a horizontal (reduction) sum operation, to facilitate efficient code generation for the "sum of absolute differences" operation.
The patch also introduces the corresponding SDNodes and basic legalization support. Sanity of the generated code is tested on X86.

This is the 2nd of three patches. The 1st patch can be found here: http://reviews.llvm.org/D10867

Diff Detail

Event Timeline

• ashahid updated this revision to Diff 29096. Jul 6 2015, 9:37 AM

• ashahid retitled this revision to "[Codegen] Add intrinsics 'hadd*' and corresponding SDNodes for horizontal sum operation."

• ashahid updated this object.

• ashahid added reviewers: jmolloy, hfinkel, rengolin.

• ashahid set the repository for this revision to rL LLVM.

• ashahid added a subscriber: llvm-commits.

Hi,

Generally looks good, but I have some initial comments.

Cheers,

James

docs/LangRef.rst
10918	You need to be very explicit about the behaviour of this intrinsic with floating point arguments. What order, if any, does it perform the adds in? If there is no guaranteed order, it can only be used in fast-math mode.
include/llvm/IR/Intrinsics.td
620	Just having one intrinsic here would be good; there's no need for a separate int and float version.
628	Blank line missing here.
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
721	Can't you just call ExpandHADD() here? or at least share the unroll and expand code?

This revision now requires changes to proceed. Jul 7 2015, 2:45 AM

Hi James,

Thanks for your comments. I will do the needful. Please see my response below.

Regards,
Shahid

docs/LangRef.rst
10918	Ah, I did not think about it. Instead of restricting it to fast-math, I would prefer to define an order, such as "add each element of the vector, from element 0 to n-1, to an accumulated sum which is initialized to zero". Does that make sense?
include/llvm/IR/Intrinsics.td
620	In that case what should be the return type of intrinsic?
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
721	Yes, probably I can share.

• ashahid added a subscriber: • ashahid. Jul 7 2015, 8:35 AM

Hi James,

Pls find the response below.

Shahid

From: James Molloy [mailto:james@jamesmolloy.co.uk]
Sent: Tuesday, July 07, 2015 7:09 PM
To: reviews+D10964+public+8430e6553a631d12@reviews.llvm.org; Shahid, Asghar-ahmad; hfinkel@anl.gov; renato.golin@linaro.org; james.molloy@arm.com
Cc: llvm-commits@cs.uiuc.edu
Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hadd*' and corresponding SDNodes for horizontal sum operation.

Hi,

Ah, I did not think about it. Instead of restricting it to fast-math, I would prefer to define an order, such as "add each element of the vector, from element 0 to n-1, to an accumulated sum which is initialized to zero". Does that make sense?

That would mean you wouldn't be able to lower it using a lg(n)-shuffles algorithm, as that does it in the wrong order. You'd have to use a linear algorithm which would perform quite poorly.
That’s right.

It would also stop horizontal add instructions being used on architectures that support them (I don't know of any that do for FP types - probably for this reason!). I'd probably go with the fast-math version personally.
In that case I would support fast-math version.
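
The trade-off discussed here can be made concrete with a small simulation. The sketch below is not LLVM code: `hsum_linear` and `hsum_log_steps` are hypothetical helper names, and Python floats (IEEE doubles) stand in for vector lanes. It shows that an in-order sum and a lg(n)-step sum that folds the upper half onto the lower half can disagree for floating-point inputs, which is why fixing the order would rule out a shuffle-based lowering.

```python
def hsum_linear(lanes):
    """Add lanes 0..n-1 in order to an accumulator initialised to zero."""
    acc = 0.0
    for x in lanes:
        acc += x
    return acc


def hsum_log_steps(lanes):
    """Reduce in lg(n) steps by adding the upper half onto the lower half.

    Assumes the lane count is a power of two, mirroring a shuffle + add lowering.
    """
    n = len(lanes)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    work = list(lanes)
    while len(work) > 1:
        half = len(work) // 2
        work = [work[i] + work[i + half] for i in range(half)]
    return work[0]


lanes = [1e16, 1.0, -1e16, 1.0]
linear_result = hsum_linear(lanes)   # ((1e16 + 1) - 1e16) + 1
log_result = hsum_log_steps(lanes)   # (1e16 - 1e16) + (1 + 1)
print(linear_result, log_result)
```

The two results differ because `1e16 + 1.0` rounds back to `1e16` in double precision, so the in-order sum loses one of the two small addends.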

In that case what should be the return type of intrinsic?

llvm_any_ty ?
Initially I thought of this, however did not use because it will allow any other type also. Hence provided the float intrinsic separately.
If that is not a concern I would use llvm_any_ty.

hfinkel added inline comments. Jul 9 2015, 7:35 PM

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2925	Why is ISD::HADD not handled here?
2949	What exactly does this do? Widening normally introduces undefs, but you can't add a bunch of undefs and get anything other than an undef out. You might need the SDAG node to carry an extra parameter indicating how many of the vector lanes are actually to be added, in order to properly support widening.
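
To illustrate the widening concern, here is a minimal model in plain Python (not SelectionDAG code). `UNDEF`, `widen`, and `hsum_widened` are hypothetical names; the `valid_lanes` argument plays the role of the extra node parameter suggested above.

```python
UNDEF = None  # stand-in for an undefined padding lane


def widen(lanes, new_width):
    """Pad a vector with undefined lanes, as type widening would."""
    return list(lanes) + [UNDEF] * (new_width - len(lanes))


def hsum_widened(lanes, valid_lanes):
    """Sum only the first `valid_lanes` lanes, ignoring the padding."""
    return sum(lanes[:valid_lanes])


v3 = [10, 20, 30]
v4 = widen(v3, 4)                  # [10, 20, 30, UNDEF]
total = hsum_widened(v4, len(v3))  # the padding lane never enters the sum
```

Another option with the same effect would be to fill the padding lanes with zero, the identity for addition, before reducing.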

• ashahid added inline comments. Jul 10 2015, 4:56 AM

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2925	Probably my tests were not sufficient to expose the need to handle ISD::HADD here. I will look into it.
2949	Yes, it does introduce undefs. Sorry, I could not follow your suggestion regarding the extra SDAG node parameter.

bruno added a subscriber: bruno. Jul 13 2015, 8:31 AM

Hi,

Thanks for working on this. Comments below.

docs/LangRef.rst
10945	I don't know if this discussion already happened, but I've been thinking about this and I'm wondering whether we should have a vector result instead of a scalar one: the result in the first element of the vector type and the other elements undef. Then an extractelement follows to get the scalar result. IMO, this is more natural given the way architectures implement variants of HADD; they usually leave the results in vectors. One advantage of doing this is that we can also use this ISD::HADD while lowering other vector operations (the CTPOP case) and don't have to write a DAGCombine or any other extra logic to recognise the vector back from an extract. I might be biased on one side of the history here though, I'd appreciate hearing the other side :-)
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2876 ↗	(On Diff #29096)	Note: If we use vector result instead of scalar, this won't be needed here and everything could be handled in LegalizeVectorOps.
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1093	For the integer part (ISD::HADD) I believe you could do "vector shifts + vector adds" instead of "extracts + scalar adds"; it is probably better not to leave the vector domain. If the current target doesn't support "vector shifts + vector adds" for the element type, then your implementation should fall back to "extracts + scalar adds". To check that, you can use in UnrollHADD: if (TLI.getOperationAction(ISD::SHL, VT) == TargetLowering::Expand || TLI.getOperationAction(ISD::ADD, VT) == TargetLowering::Expand) ....
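
As a rough model of the "vector shifts + vector adds" idea, the sketch below keeps everything in a list that stands in for one vector register. It is plain Python, not LLVM code; `shift_lanes_down` and `hsum_in_vector` are hypothetical names, and treating the "vector shift" as a whole-register shift by whole lanes is an assumption made for illustration. The sum ends up in lane 0, while the remaining lanes hold partial sums that a consumer should treat as don't-care.

```python
def shift_lanes_down(vec, count):
    """Move every lane `count` positions toward lane 0, filling the top with zeros."""
    return vec[count:] + [0] * count


def hsum_in_vector(vec, bits):
    """Reduce with whole-vector shifts and adds; the sum ends up in lane 0."""
    n = len(vec)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    mask = (1 << bits) - 1  # integer lanes wrap at the element width
    step = n // 2
    while step >= 1:
        shifted = shift_lanes_down(vec, step)
        vec = [(a + b) & mask for a, b in zip(vec, shifted)]
        step //= 2
    return vec


result = hsum_in_vector([1, 2, 3, 4, 5, 6, 7, 8], bits=16)
lane0 = result[0]  # plays the role of extractelement(result, 0)
```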

• ashahid added inline comments. Jul 15 2015, 1:49 AM

docs/LangRef.rst
10945	No, this discussion did not happen earlier. IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also. In fact on X86, we need to do a DAGCombine of ABSDIFF and HADD to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.

jmolloy mentioned this in D10273: Add intrinsics and SDNodes for signed/unsigned absolute difference (absdiff) and horizontal add (hadd). Jul 17 2015, 2:24 AM

Updated the patch with the following changes:

Overloaded the intrinsic properly.
Updated the WidenVecOp to avoid summing up of undefs.
Restricted the floating point version of intrinsic only for fast-math.
Updated the doc accordingly

IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also.
In fact on X86, we need to do a DAGCombine of *ABSDIFF* and *HADD* to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.

I don't see how making it return a vector type would affect performance. You will need a DAG combine anyway here and you don't need to deal with the extractelement at all, except in the case where you really want to bring it back to a scalar result; this should be up to the front-end to generate. Also, we have precedent in using inserts/extracts with other vector nodes as the canonical way to represent lower level vector instructions.

My point is that if we make the result available in the vector we have two advantages: (1) if there are additional vector operations on the result, we don't need to re-insert it into another vector to continue vector work and (2) if you want the result on a scalar, you might just use extractelement.

In x86, PSAD returns a vector and if you want to keep the result on a scalar you need to generate the appropriate movs to do the work, I believe the same applies to other archs with similar instructions. IMO, it doesn't seem natural to always return the value on a scalar and then having to insert it back to a vector to proceed with vector work.

Regarding the tests, please split and rename it to vector-hadd-128.ll and vector-hadd-256.ll, no need to split files based on the fact that they are testing the expansion. Once you have custom versions, just drive them by subtarget features (+ssse3, sse4*, etc). Take a look at vector-blend.ll and others for examples.

In D10964#210687, @bruno wrote:

IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also.
In fact on X86, we need to do a DAGCombine of *ABSDIFF* and *HADD* to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.

I don't see how making it return a vector type would affect performance. You will need a DAG combine anyway here and you don't need to deal with the extractelement at all, except in the case where you really want to bring it back to a scalar result; this should be up to the front-end to generate. Also, we have precedent in using inserts/extracts with other vector nodes as the canonical way to represent lower level vector instructions.

My point is that if we make the result available in the vector we have two advantages: (1) if there are additional vector operations on the result, we don't need to re-insert it into another vector to continue vector work and (2) if you want the result on a scalar, you might just use extractelement.

In x86, PSAD returns a vector and if you want to keep the result on a scalar you need to generate the appropriate movs to do the work, I believe the same applies to other archs with similar instructions. IMO, it doesn't seem natural to always return the value on a scalar and then having to insert it back to a vector to proceed with vector work.

I understand your point, but I think that the intrinsic should return a scalar for conceptual clarity. It is, fundamentally, computing a scalar quantity. I understand, however, that doing this will require more work on our part to produce code of equivalent quality.

Specifically, we'll need code in CodeGenPrep to push and replicate insertelement(undef, hadd(x), 0) instructions 'up' (closer to the hadd(x)) so that CodeGen will always see the pair together.

Then for all backends such that the underlying hadd returns its result in a vector register, will need to pattern match the insertelement away to a noop instead of actually moving the result into the scalar register file. Not all backends will have hadds that work like this, but I believe X86 and AArch64 will, for example.

However, I believe this bit of extra work is worthwhile. The fact that some common ISAs have an horizontal add that happens to return the result of the add in some lane of an output vector is not something that we should expose at the IR level.

Split the tests by integer and float 128/256-bit data types.

Ping !!

In D10964#212074, @hfinkel wrote:

In D10964#210687, @bruno wrote:

IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also.
In fact on X86, we need to do a DAGCombine of *ABSDIFF* and *HADD* to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.

I don't see how making it return a vector type would affect performance. You will need a DAG combine anyway here and you don't need to deal with the extractelement at all, except in the case where you really want to bring it back to a scalar result; this should be up to the front-end to generate. Also, we have precedent in using inserts/extracts with other vector nodes as the canonical way to represent lower level vector instructions.

My point is that if we make the result available in the vector we have two advantages: (1) if there are additional vector operations on the result, we don't need to re-insert it into another vector to continue vector work and (2) if you want the result on a scalar, you might just use extractelement.

In x86, PSAD returns a vector and if you want to keep the result on a scalar you need to generate the appropriate movs to do the work, I believe the same applies to other archs with similar instructions. IMO, it doesn't seem natural to always return the value on a scalar and then having to insert it back to a vector to proceed with vector work.

I understand your point, but I think that the intrinsic should return a scalar for conceptual clarity. It is, fundamentally, computing a scalar quantity. I understand, however, that doing this will require more work on our part to produce code of equivalent quality.

Specifically, we'll need code in CodeGenPrep to push and replicate insertelement(undef, hadd(x), 0) instructions 'up' (closer to the hadd(x)) so that CodeGen will always see the pair together.

I see, agreed that it semantically makes more sense to have this in the IR level.

Then for all backends such that the underlying hadd returns its result in a vector register, will need to pattern match the insertelement away to a noop instead of actually moving the result into the scalar register file. Not all backends will have hadds that work like this, but I believe X86 and AArch64 will, for example.

However, I believe this bit of extra work is worthwhile. The fact that some common ISAs have an horizontal add that happens to return the result of the add in some lane of an output vector is not something that we should expose at the IR level.

What if the node result is a vector only at the ISD level? ISD nodes are supposed to represent lower level behaviour, and then we can canonicalize it to HADD + extractelement, which I believe should be easier to deal with.

test/CodeGen/X86/vec-hadd-float-128.ll
3	Could you please update your tests to be more target neutral? I mean, use -mtriple=x86_64-unknown-unknown instead. One question, what code does it emit if one removes -enable-unsafe-fp-math? If it currently makes no difference, you can remove it, otherwise you should be testing both versions.

Hi Bruno,

My response inlined.

Regards,
Shahid

test/CodeGen/X86/vec-hadd-float-128.ll
3	Ok, will make it more target neutral accordingly. In fact, user is not supposed to use float version of this intrinsic at all, if it is done so, compiler will 'assert'.

Hi Hal,

Could you pls comment on the updates?

Regards,
Shahid

Apologies for joining this discussion so late.

I'm worried that this intrinsic is over specific to the PSAD (sum_of) cases - I would have thought a pairwise style horizontal add would fit in much better with most target hardware and could still make locating PSAD style patterns pretty straightforward.

Another alternative: instead of adding a new intrinsic/SDNode, you could focus on providing common infrastructure to detect general horizontal reduction/reassociation patterns - PR23116 and PR21975 would benefit from these.

Failing that, would you consider renaming the opcode ISD::SUM or similar to avoid ambiguity with SSE + NEON HADD instructions?

In D10964#216661, @RKSimon wrote:

Apologies for joining this discussion so late.

I'm worried that this intrinsic is over specific to the PSAD (sum_of) cases - I would have thought a pairwise style horizontal add would fit in much better with most target hardware and could still make locating PSAD style patterns pretty straightforward.

IMO, this intrinsic is generic in terms of the semantics of a horizontal sum, and PSAD happens to use this semantic. Also, when the number of elements is a power of 2, the computation takes O(log n) steps, which is better than pairwise computation.

Another alternative: instead of adding a new intrinsic/SDNode, you could focus on providing common infrastructure to detect general horizontal reduction/reassociation patterns - PR23116 and PR21975 would benefit from these.

At this point in time, I would like to defer this possibility.

Failing that, would you consider renaming the opcode ISD::SUM or similar to avoid ambiguity with SSE + NEON HADD instructions?

Sure.

In D10964#213526, @bruno wrote:

In D10964#212074, @hfinkel wrote:

In D10964#210687, @bruno wrote:

IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also.
In fact on X86, we need to do a DAGCombine of *ABSDIFF* and *HADD* to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.

I don't see how making it return a vector type would affect performance. You will need a DAG combine anyway here and you don't need to deal with the extractelement at all, except in the case where you really want to bring it back to a scalar result; this should be up to the front-end to generate. Also, we have precedent in using inserts/extracts with other vector nodes as the canonical way to represent lower level vector instructions.

My point is that if we make the result available in the vector we have two advantages: (1) if there are additional vector operations on the result, we don't need to re-insert it into another vector to continue vector work and (2) if you want the result on a scalar, you might just use extractelement.

In x86, PSAD returns a vector and if you want to keep the result on a scalar you need to generate the appropriate movs to do the work, I believe the same applies to other archs with similar instructions. IMO, it doesn't seem natural to always return the value on a scalar and then having to insert it back to a vector to proceed with vector work.

I understand your point, but I think that the intrinsic should return a scalar for conceptual clarity. It is, fundamentally, computing a scalar quantity. I understand, however, that doing this will require more work on our part to produce code of equivalent quality.

Specifically, we'll need code in CodeGenPrep to push and replicate insertelement(undef, hadd(x), 0) instructions 'up' (closer to the hadd(x)) so that CodeGen will always see the pair together.

I see, agreed that it semantically makes more sense to have this in the IR level.

Then for all backends such that the underlying hadd returns its result in a vector register, will need to pattern match the insertelement away to a noop instead of actually moving the result into the scalar register file. Not all backends will have hadds that work like this, but I believe X86 and AArch64 will, for example.

However, I believe this bit of extra work is worthwhile. The fact that some common ISAs have an horizontal add that happens to return the result of the add in some lane of an output vector is not something that we should expose at the IR level.

What if the node result is a vector only at the ISD level? ISD nodes are supposed to represent lower level behaviour, and then we can canonicalize it to HADD + extractelement, which I believe should be easier to deal with.

I am ok with it but I would like to know Hal's opinion before proceeding.

In D10964#217742, @ashahid wrote:

In D10964#213526, @bruno wrote:

In D10964#212074, @hfinkel wrote:

In D10964#210687, @bruno wrote:

IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also.
In fact on X86, we need to do a DAGCombine of *ABSDIFF* and *HADD* to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.

I don't see how making it return a vector type would affect performance. You will need a DAG combine anyway here and you don't need to deal with the extractelement at all, except in the case where you really want to bring it back to a scalar result; this should be up to the front-end to generate. Also, we have precedent in using inserts/extracts with other vector nodes as the canonical way to represent lower level vector instructions.

My point is that if we make the result available in the vector we have two advantages: (1) if there are additional vector operations on the result, we don't need to re-insert it into another vector to continue vector work and (2) if you want the result on a scalar, you might just use extractelement.

In x86, PSAD returns a vector and if you want to keep the result on a scalar you need to generate the appropriate movs to do the work, I believe the same applies to other archs with similar instructions. IMO, it doesn't seem natural to always return the value on a scalar and then having to insert it back to a vector to proceed with vector work.

I understand your point, but I think that the intrinsic should return a scalar for conceptual clarity. It is, fundamentally, computing a scalar quantity. I understand, however, that doing this will require more work on our part to produce code of equivalent quality.

Specifically, we'll need code in CodeGenPrep to push and replicate insertelement(undef, hadd(x), 0) instructions 'up' (closer to the hadd(x)) so that CodeGen will always see the pair together.

I see, agreed that it semantically makes more sense to have this in the IR level.

Then for all backends such that the underlying hadd returns its result in a vector register, will need to pattern match the insertelement away to a noop instead of actually moving the result into the scalar register file. Not all backends will have hadds that work like this, but I believe X86 and AArch64 will, for example.

However, I believe this bit of extra work is worthwhile. The fact that some common ISAs have an horizontal add that happens to return the result of the add in some lane of an output vector is not something that we should expose at the IR level.

What if the node result is a vector only at the ISD level? ISD nodes are supposed to represent lower level behaviour, and then we can canonicalize it to HADD + extractelement, which I believe should be easier to deal with.

I am ok with it but I would like to know Hal's opinion before proceeding.

I think this is okay; we should clearly document the motivation. This does not address any problems with scalar-valued PHIs, but should make the pattern matching easier to implement in the common case for backends with legal horizontal adds.

Updated the patch with:

Renaming of intrinsic *hadd to *hsum and its related code/doc.
Renaming of ISD node *HADD to *HSUM and its related code.
Updated the SDAG builder to transform "llvm.*hsum" into two nodes, *HSUM & EXTRACT_VECTOR_ELT.
Test case updated accordingly.

bruno added inline comments. Aug 13 2015, 12:40 PM

include/llvm/CodeGen/ISDOpcodes.h
346	Space after the dot.
347	Same here.
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1067	Using the assert inside this "if" will be a problem when asserts aren't enabled.
test/CodeGen/X86/vec-hadd-float-128.ll
2	Thanks for updating the tests. Could you please use ./utils/update_llc_test_checks.py to rewrite the tests checking? This will make it easier later on to see the improvements from target customization or widening (as you mentioned in a TODO).

RKSimon added inline comments. Aug 14 2015, 5:12 AM

test/CodeGen/X86/vec-hadd-float-256.ll
5	The test file is vec-hadd-float-256 yet you are testing a 512 bit vector? Change to <4 x double>

Updated the patch regarding the comments given by Bruno and Simon.

RKSimon added inline comments. Aug 18 2015, 5:31 AM

test/CodeGen/X86/vec-hadd-int-256.ll
15	This codegen is the same as for the test1_hsum_int_i64 <2 x i64> version in vec-hadd-int-128.ll - something is going wrong. You should probably compare against codegen from an AVX2 target.

• ashahid added inline comments. Aug 18 2015, 11:55 PM

test/CodeGen/X86/vec-hadd-int-256.ll

With AVX2 the generated code differs, as shown below.

Case V2i64

vpshufd $78, %xmm0, %xmm1       # xmm1 = xmm0[2,3,0,1]
vpaddq  %xmm1, %xmm0, %xmm0
vmovq   %xmm0, %rax
retq

Case V4i64

vextracti128    $1, %ymm0, %xmm1
vpaddq  %ymm1, %ymm0, %ymm0
vpermq  $237, %ymm0, %ymm1      # ymm1 = ymm0[1,3,2,3]
vpaddq  %ymm1, %ymm0, %ymm0
vmovq   %xmm0, %rax
vzeroupper
retq

RKSimon added inline comments. Aug 19 2015, 1:12 AM

test/CodeGen/X86/vec-hadd-int-256.ll
15	So yes, it appears that something is wrong with the legalization. When you build for SSE you only get the hsum of the bottom <2 x i64>; when you build for AVX (which legalizes <4 x i64>) you get the hsum of the whole <4 x i64>.

Updated the patch to handle the legalization of vector type split properly.
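
The symptom reported above can be modelled in a few lines. This is a sketch in plain Python, not the patch's legalization code; `hsum_split_buggy` and `hsum_split` are hypothetical names, and combining the halves lane-wise before reducing is only one possible way to handle a split, not necessarily what the updated patch does.

```python
def hsum(lanes):
    """Reference horizontal sum over a legal-width vector."""
    return sum(lanes)


def hsum_split_buggy(lanes):
    """Mimics the reported symptom: only the low half survives the split."""
    half = len(lanes) // 2
    return hsum(lanes[:half])


def hsum_split(lanes):
    """Split into low and high halves, add them lane-wise, then reduce."""
    half = len(lanes) // 2
    lo, hi = lanes[:half], lanes[half:]
    combined = [a + b for a, b in zip(lo, hi)]
    return hsum(combined)


v4i64 = [1, 2, 3, 4]
good = hsum_split(v4i64)        # sums all four lanes
bad = hsum_split_buggy(v4i64)   # sums only the bottom two lanes
```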

Ping !

Hi All,

Pls review as this is pending for quite some time.

Regards,
Shahid

Ping !!

Please upload future patches with full context, see: http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface for instructions.

docs/LangRef.rst
10794	This constraint is not acceptable, you'll need to remove it. (plus, we're moving to a model where fast-math semantics are per-instruction flags). However, it is perfectly reasonable to document the fact that the ordering of the additions performed with the intrinsic is undefined.
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1058	Remove the UnsafeFPMath check.
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2928	Remove the UnsafeFPMath check.

In D10964#238370, @hfinkel wrote:

Please upload future patches with full context, see: http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface for instructions.

Thanks for looking into it. I will update the patch accordingly and upload with full context.

Updated the docs/LangRef.rst and related code for Hal's comment.

Minor update in test case.

Hi Hal & others,

Please review, waiting for your comments / clearance.

Regards,
Shahid

I have no further issues, this is fine from my perspective.

Thanks James.

Looking forward to others' responses.

Ping !!

Hi Hal,

Waiting for your response.

Regards,
Shahid

hfinkel added inline comments. Sep 18 2015, 5:12 PM

docs/LangRef.rst
10974	Nothing is being loaded here. You can just say that, "The argument is a vector of any integer or floating-point type."
10987	Signed or unsigned overflow?
10992	Missing space before "They"
10994	No need for a comma after that.
11000	integer or floating point number -> integer or floating-point type.
11009	I think this is unhelpful. Given that the order of additions is undefined, it might not be exactly equivalent to this code. I think describing this in words would be better in this case.
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1077	Given that the order of additions is undefined, we can add NoSignedWrap or NoUnsignedWrap in the integer case.
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
711	Same comment here as above (we can add signed or unsigned nowrap here).
2963	Same here.
2968	And here.

Hi Hal,

Updated the LangRef.rst and code accordingly. Please review.

Regards,
Shahid

hfinkel added inline comments. Sep 23 2015, 11:57 AM

docs/LangRef.rst
10988	Unsigned overflow is also undefined? Is this really necessary? I doubt that the vectorizer will be able to prove no-unsigned-overflow in most cases, and we don't get it from C's semantics, and thus, would not be able to generate this intrinsic.
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1071	If we can get rid of the undefined behavior for unsigned overflow; then remove this line.
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
714	Same comment here too.
2959	Same here.

Hi Hal,

Thanks for pointing this out.

Unsigned overflow is also undefined? Is this really necessary? I doubt that the vectorizer will be able to prove no-unsigned-overflow in most cases, and we don't get it from C's semantics, and thus, would not be able to generate this intrinsic.

I referred to the C's semantics of "unsigned overflow" and realized that it is not necessary, will update the patch accordingly.

Regards,
Shahid

Hi Hal,

Updated the patch accordingly.

Regards,
Shahid

I need to take a step back here; why are we doing this again?

I read again the RFC thread (http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html), and it ended with the following (from Renato):

BTW, now my plan is to just add the two intrinsics for 'absolute difference'
and 'horizontal add'.

That's ok, as long as they're impossible to represent in plain IR.

and I think that we were all in agreement on this point. But now I'm not sure you've demonstrated the prerequisite. The underlying operations here (and in D10867) seem like they are representable using IR (as demonstrated by the fact that you provide potentially-equivalent IR sequences in the documentation), except for the ordering freedom here.

And this, I fear, is where we run into trouble. The thing that is not representable in the IR is that the order of operations in the horizontal sum is undefined, and at the same time, signed overflow is undefined. This cannot be represented in the IR because nsw adds don't reassociate, and thus, there's no way to represent the sequence of nsw adds such that they can be reassociated while retaining their nsw property. But the problem is that, because this freedom cannot be represented in the IR, we also can't generate it from IR in a semantics-preserving way; and, thus, it would not be legal to generate it in the vectorizers.
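
The reassociation problem can be seen with three i8 values. The sketch below is a simplified model in plain Python, not LLVM semantics in full: `add_nsw_i8` and `add_wrap_u8` are hypothetical helpers, and poison is represented by a marker string. It shows that regrouping nsw adds can turn a poison result into a well-defined one (so the two groupings are not interchangeable), whereas wrapping adds give the same answer in any grouping.

```python
INT8_MIN, INT8_MAX = -128, 127
POISON = "poison"  # stand-in for the result of a signed-overflowing nsw add


def add_nsw_i8(a, b):
    """Model an i8 `add nsw`: overflow (or a poison operand) yields poison."""
    if a == POISON or b == POISON:
        return POISON
    result = a + b
    if result < INT8_MIN or result > INT8_MAX:
        return POISON
    return result


def add_wrap_u8(a, b):
    """Model a plain i8 add on unsigned values: wraps modulo 2**8."""
    return (a + b) & 0xFF


x, y, z = 127, 1, -1
left_to_right = add_nsw_i8(add_nsw_i8(x, y), z)  # 127 + 1 overflows first
reassociated = add_nsw_i8(x, add_nsw_i8(y, z))   # 1 + -1 = 0, then 127 + 0

a, b, c = 200, 100, 50
wrap_lr = add_wrap_u8(add_wrap_u8(a, b), c)
wrap_re = add_wrap_u8(a, add_wrap_u8(b, c))
```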

Thus, this change does not seem right, and approving D10867 seems like a mistake as well. We could certainly fix the definition here to make it exactly representable in the IR, but then what's the point of doing so?

In the RFC, you mentioned cost modeling as a major motivation for adding intrinsics, but that seems like an unrelated problem (at least in part). During vectorization, we can use special interfaces to estimate the cost of complex patterns. In fact, we already have an interface for reductions: TTI.getReductionCost. There is a second relevant code model: That used by the unroller and inliner. Vectorization happens after inlining, so that interaction is not really relevant, but partial unrolling happens after vectorization, and so the cost model there might want to understand that a complex sequence of shuffles, extracts and adds has a disproportionately-low cost. The same is true of the inliner if the input IR uses vector types and initially contains these operations, but even in that case, you'd not want to canonicalize on the intrinsics too early in case other optimizations remove the need for most of the operations. Thus, in the end, you need pattern-matching code near the end of the pipeline anyway to account for input IR directly containing relevant operations on vector types.

In short, I don't understand why we're going in this direction. You can match these operations in the backend, and you can match them in the cost-model code. If we can't do the latter, then we should fix that. And we already have special interfaces for vectorization costs for complex operations such as this.

Hi Hal,

Response inlined.

Regards,
Shahid

davidxl added a subscriber: davidxl. Oct 27 2015, 8:56 PM

davidxl added inline comments.

docs/LangRef.rst
10979	For the integer case, having scalar result type (with the same size as the vector element) make this intrinsic less useful -- due to overflow conditions. The vectorizer will have difficulty proving overflow does not happen and won't be able to generate it in many cases. As Bruno commented, having vector result type may be the way to go. For instance, for the input type of v4i8, if the result type can be v2i16 -- the hsum is split into 2 horizontal adds each one producing a 16 bit result. If the result type is v1i32, the hsum adds four i8 integers and produces a 32bit result. Limiting this to power of 2 number of elements seems reasonable.
test/CodeGen/X86/vec-hadd-float-128.ll
10	Should it be shufps .... xmm1 = xmm1[1, ?, ?, ?]
13	this shufps and addps should not be expected
test/CodeGen/X86/vec-hadd-int-128.ll
8	The result does not look right -- should pshufb be generated instead?
24	Should pshufw be generated? Or would phaddw be more efficient?
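
The overflow concern raised for docs/LangRef.rst line 10979 can be illustrated outside LLVM. The following is a minimal C++ sketch, not code from the patch: the helper names are hypothetical, the pairing of adjacent lanes in the v2i16-style variant is an assumption (the comment does not say which lanes are paired), and unsigned wrap-around is used only to make the lost bits visible, whereas the proposed intrinsic leaves overflow undefined.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: sum four 8-bit lanes into a result of the same width
// as the element type. The accumulator wraps modulo 256.
inline std::uint8_t hsum_same_width(const std::array<std::uint8_t, 4> &v) {
  std::uint8_t acc = 0;
  for (std::uint8_t x : v)
    acc = static_cast<std::uint8_t>(acc + x);
  return acc;
}

// Hypothetical helper in the style of a v4i8 -> v2i16 result: each adjacent
// pair (an assumed pairing) is added in 16 bits, so a pair sum cannot overflow.
inline std::array<std::uint16_t, 2>
hsum_pairs_widened(const std::array<std::uint8_t, 4> &v) {
  return {static_cast<std::uint16_t>(v[0] + v[1]),
          static_cast<std::uint16_t>(v[2] + v[3])};
}

// Hypothetical helper in the style of a v4i8 -> v1i32 result: all four lanes
// are accumulated in 32 bits.
inline std::uint32_t hsum_widened(const std::array<std::uint8_t, 4> &v) {
  std::uint32_t acc = 0;
  for (std::uint8_t x : v)
    acc += x;
  return acc;
}
```

For the lanes {200, 100, 50, 10}, the same-width sum wraps to 104 while the widened sum is 360, which is the information a vectorizer would have to prove is never lost before it could emit a same-width reduction.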

congh added a subscriber: congh. Oct 28 2015, 4:53 PM

congh added inline comments. Oct 28 2015, 5:18 PM

test/CodeGen/X86/vec-hadd-int-128.ll
8	I think a shift operation is required here, given that we only have SSE2 support for x86_64.
24	In SSE2, pshuflw should be generated here. phaddw is introduced in SSSE3.

This revision has been abandoned; Cong Yuo is now taking this forward in a different direction.

Revision Contents

Path	Size
docs/LangRef.rst	54 lines
include/llvm/CodeGen/ISDOpcodes.h	6 lines
include/llvm/IR/Intrinsics.td	3 lines
include/llvm/Target/TargetSelectionDAG.td	2 lines
lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp	15 lines
lib/CodeGen/SelectionDAG/LegalizeTypes.h	4 lines
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp	26 lines
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp	59 lines
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp	18 lines
lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp	2 lines
lib/CodeGen/TargetLoweringBase.cpp	2 lines
test/CodeGen/X86/vec-hadd-float-128.ll	48 lines
test/CodeGen/X86/vec-hadd-float-256.ll	19 lines
test/CodeGen/X86/vec-hadd-int-128.ll	75 lines
test/CodeGen/X86/vec-hadd-int-256.ll	17 lines

Diff 35627

docs/LangRef.rst


	[10,785 lines hidden]
	Specialised Arithmetic Intrinsics			Specialised Arithmetic Intrinsics
	---------------------------------			---------------------------------

	'``llvm.canonicalize.*``' Intrinsic			'``llvm.canonicalize.*``' Intrinsic
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	Syntax:			Syntax:
	"""""""			"""""""

				hfinkel: This constraint is not acceptable; you'll need to remove it (plus, we're moving to a model where fast-math semantics are per-instruction flags). However, it is perfectly reasonable to document the fact that the ordering of the additions performed with the intrinsic is undefined.
	::			::

	declare float @llvm.canonicalize.f32(float %a)			declare float @llvm.canonicalize.f32(float %a)
	declare double @llvm.canonicalize.f64(double %b)			declare double @llvm.canonicalize.f64(double %b)

	Overview:			Overview:
	"""""""""			"""""""""

	[107 lines hidden]

	Syntax:			Syntax:
	"""""""			"""""""
	This is an overloaded intrinsic. The loaded data is a vector of any integer bit width.			This is an overloaded intrinsic. The loaded data is a vector of any integer bit width.

	.. code-block:: llvm			.. code-block:: llvm

	declare <4 x integer> @llvm.uabsdiff.v4i32(<4 x integer> %a, <4 x integer> %b)			declare <4 x integer> @llvm.uabsdiff.v4i32(<4 x integer> %a, <4 x integer> %b)

				jmolloy: You need to be very explicit about the behaviour of this intrinsic with floating point arguments. What order, if any, does it perform the adds in? If there is no guaranteed order, it can only be used in fast-math mode.
				ashahid: Ah, I did not think about it. Instead of restricting it to fast-math, I would prefer to have an order such as "add each element of vector, starting from element 0 to n-1, to an accumulated sum which is initialized to zero". Does it make sense?
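
The reason the order of additions matters for floating-point arguments is that floating-point addition is not associative. The following is a small C++ sketch for illustration only, not code from the patch; the helper names are hypothetical, and it assumes IEEE-754 single precision with every addition rounded to `float` and no fast-math reassociation by the compiler.

```cpp
#include <array>

// Left-to-right order as proposed in the reply above:
// ((((0 + v0) + v1) + v2) + v3).
inline float hsum_sequential(const std::array<float, 4> &v) {
  float acc = 0.0f;
  for (float x : v)
    acc += x;
  return acc;
}

// Pairwise order, as a shuffle-and-add expansion would compute it:
// (v0 + v2) + (v1 + v3).
inline float hsum_pairwise(const std::array<float, 4> &v) {
  const float lo = v[0] + v[2];
  const float hi = v[1] + v[3];
  return lo + hi;
}
```

Under those assumptions, the input {1e8, 1, -1e8, 1} sums to 1 in left-to-right order (the first 1 is absorbed by 1e8) and to 2 in pairwise order, so the two orders are only interchangeable when reassociation is permitted or the order is documented as undefined.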

	Overview:			Overview:
	"""""""""			"""""""""

	The ``llvm.uabsdiff`` intrinsic returns a vector result of the absolute difference			The ``llvm.uabsdiff`` intrinsic returns a vector result of the absolute difference
	of the two operands, treating them both as unsigned integers. The intermediate			of the two operands, treating them both as unsigned integers. The intermediate
	calculations are computed using infinitely precise unsigned arithmetic. The final			calculations are computed using infinitely precise unsigned arithmetic. The final
	result will be truncated to the given type.			result will be truncated to the given type.
	[10 lines hidden]

	Arguments:			Arguments:
	""""""""""			""""""""""

	Both intrinsics take two integer of the same bitwidth.			Both intrinsics take two integer of the same bitwidth.

	Semantics:			Semantics:
	""""""""""			""""""""""

				bruno: I don't know if this discussion already happened, but I've been thinking about this and I'm wondering whether we should have a vector result instead of a scalar one, with the result in the first element of the vector type and the other elements undef. Then an extractelement follows to get the scalar result. IMO, this is more natural given the way architectures implement variants of HADD: they usually leave the results in vectors. One advantage of doing this is that we can also use this ISD::HADD while lowering other vector operations (the CTPOP case) and don't have to write a DAGCombine or any other extra logic to recognise the vector back from an extract. I might be biased on one side of the history here though; I'd appreciate hearing the other side :-)
				ashahid: No, this discussion did not happen earlier. IMO, the scalar version is more natural w.r.t. the HADD operation itself and also more canonical. Also, as you mentioned, the vector version will need an extractelement, which may have some performance impact. In fact, on X86 we need to do a DAGCombine of ABSDIFF and HADD to generate the PSAD instruction, which is our main objective for adding the two intrinsics. I feel a vector version of HADD will complicate this DAGCombine.
	The expression::			The expression::

	call <4 x i32> @llvm.uabsdiff.v4i32(<4 x i32> %a, <4 x i32> %b)			call <4 x i32> @llvm.uabsdiff.v4i32(<4 x i32> %a, <4 x i32> %b)

	is equivalent to::			is equivalent to::

	%1 = zext <4 x i32> %a to <4 x i64>			%1 = zext <4 x i32> %a to <4 x i64>
	%2 = zext <4 x i32> %b to <4 x i64>			%2 = zext <4 x i32> %b to <4 x i64>
	%sub = sub <4 x i64> %1, %2			%sub = sub <4 x i64> %1, %2
	%trunc = trunc <4 x i64> to <4 x i32>			%trunc = trunc <4 x i64> to <4 x i32>

	and the expression::			and the expression::

	call <4 x i32> @llvm.sabsdiff.v4i32(<4 x i32> %a, <4 x i32> %b)			call <4 x i32> @llvm.sabsdiff.v4i32(<4 x i32> %a, <4 x i32> %b)

	is equivalent to::			is equivalent to::

	%sub = sub nsw <4 x i32> %a, %b			%sub = sub nsw <4 x i32> %a, %b
	%ispos = icmp sge <4 x i32> %sub, zeroinitializer			%ispos = icmp sge <4 x i32> %sub, zeroinitializer
	%neg = sub nsw <4 x i32> zeroinitializer, %sub			%neg = sub nsw <4 x i32> zeroinitializer, %sub
	%1 = select <4 x i1> %ispos, <4 x i32> %sub, <4 x i32> %neg			%1 = select <4 x i1> %ispos, <4 x i32> %sub, <4 x i32> %neg
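
The review thread states that the goal of pairing the absolute-difference intrinsics with a horizontal sum is a "sum of absolute differences". As a scalar reference for what that combination computes, here is a minimal C++ sketch; it is not code from the patch, the names `uabsdiff_lane` and `sad` are hypothetical, and the 32-bit accumulator is an assumption chosen so that the sum itself cannot overflow for small vectors.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Unsigned absolute difference of one lane: the subtraction is done in a
// wider type and the result is truncated back to the lane type.
inline std::uint8_t uabsdiff_lane(std::uint8_t a, std::uint8_t b) {
  const int wide = static_cast<int>(a) - static_cast<int>(b);
  return static_cast<std::uint8_t>(wide < 0 ? -wide : wide);
}

// Sum of absolute differences: a horizontal sum over the per-lane absolute
// differences, accumulated in 32 bits.
template <std::size_t N>
inline std::uint32_t sad(const std::array<std::uint8_t, N> &a,
                         const std::array<std::uint8_t, N> &b) {
  std::uint32_t acc = 0;
  for (std::size_t i = 0; i < N; ++i)
    acc += uabsdiff_lane(a[i], b[i]);
  return acc;
}
```

For a = {10, 200, 0, 255} and b = {20, 100, 255, 0} the per-lane differences are 10, 100, 255 and 255, giving a sum of 620, a value that would not fit in a result the same width as the lanes.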


				'``llvm.hsum.*``'
				^^^^^^^^^^^^^^^^^

				Syntax:
				"""""""
				This is an overloaded intrinsic. The argument is a vector of any integer or floating-point type.
				hfinkel: Nothing is being loaded here. You can just say that, "The argument is a vector of any integer or floating-point type."
				Order of additions performed by the intrinsic is undefined.

				.. code-block:: llvm

				declare <integer> @llvm.hsum.i32.v4i32(<4 x integer> %a)
				davidxl: For the integer case, having a scalar result type (with the same size as the vector element) makes this intrinsic less useful, due to overflow conditions. The vectorizer will have difficulty proving that overflow does not happen and won't be able to generate it in many cases. As Bruno commented, having a vector result type may be the way to go. For instance, for the input type of v4i8, if the result type can be v2i16, the hsum is split into 2 horizontal adds, each one producing a 16-bit result. If the result type is v1i32, the hsum adds four i8 integers and produces a 32-bit result. Limiting this to a power-of-2 number of elements seems reasonable.
				declare <float> @llvm.hsum.f32.v4f32(<4 x float> %a)


				Overview:
				"""""""""

				The ``llvm.hsum`` intrinsic returns the result of the horizontal or reduction sum of the elements
				of the vector operand, treating it as integers or floats. For signed result overflow, the behavior
				hfinkel: Signed or unsigned overflow?
				is undefined.
				hfinkel: Unsigned overflow is also undefined? Is this really necessary? I doubt that the vectorizer will be able to prove no-unsigned-overflow in most cases, and we don't get it from C's semantics, and thus, would not be able to generate this intrinsic.

				.. note::

				These intrinsics are primarily used during the code generation stage of compilation. They are
				hfinkel: Missing space before "They"
				generated by the compiler passes such as the Loop and SLP vectorizers. The expectation is that
				the frontends should not need to generate these intrinsics themselves.
				hfinkel: No need for a comma after that.

				Arguments:
				""""""""""

				The argument is vector of integer or floating-point type.

				hfinkel: integer or floating point number -> integer or floating-point type.
				Semantics:
				""""""""""

				The ``llvm.hsum`` intrinsic function performs the splitting reduction sum on the elements of the
				vector argument using 'shufflevector' and 'add' instructions if the vector length is power of 2.

				For example the expression::

				call i32 @llvm.hsum.i32.v4i32(<4 x i32> %a)
				hfinkel: I think this is unhelpful. Given that the order of additions is undefined, it might not be exactly equivalent to this code. I think describing this in words would be better in this case.

				might be expanded to::

				%1 = shufflevector <4 x i32> %a, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
				%2 = add <4 x i32> %a, %1
				%3 = shufflevector <4 x i32> %2, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
				%4 = add <4 x i32> %2, %3
				%5 = extractelement <4 x i32> %4, i32 0


				Otherwise the reduction sum is performed by scalarisation.
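
For readers less familiar with shuffle-based reductions, the splitting scheme described in the Semantics text above can be mirrored on a plain array. This is a C++ sketch for illustration only, not the patch's implementation; `hsum_log2` and `hsum_scalar` are hypothetical names, and the inputs are assumed not to overflow a signed 32-bit sum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Splitting reduction: at each step the upper half of the live lanes is
// added onto the lower half, so 2^k lanes are reduced in k steps and the
// result ends up in lane 0.
inline std::int32_t hsum_log2(std::vector<std::int32_t> v) {
  const std::size_t n = v.size();
  assert(n != 0 && (n & (n - 1)) == 0 && "length must be a power of 2");
  for (std::size_t half = n / 2; half >= 1; half /= 2) {
    for (std::size_t i = 0; i < half; ++i)
      v[i] += v[i + half]; // plays the role of shufflevector followed by add
  }
  return v[0]; // plays the role of extractelement ..., i32 0
}

// Scalarised fallback for lengths that are not a power of 2: accumulate the
// lanes one by one.
inline std::int32_t hsum_scalar(const std::vector<std::int32_t> &v) {
  std::int32_t acc = 0;
  for (std::int32_t x : v)
    acc += x;
  return acc;
}
```

For four lanes the first step forms lanes 0+2 and 1+3 and the second step adds those two partial sums, which matches the two shufflevector/add pairs in the expansion shown above.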


	Half Precision Floating Point Intrinsics			Half Precision Floating Point Intrinsics
	----------------------------------------			----------------------------------------

	For most target platforms, half precision floating point is a			For most target platforms, half precision floating point is a
	storage-only format. This means that it is a dense encoding (in memory)			storage-only format. This means that it is a dense encoding (in memory)
	but does not support computation in the format.			but does not support computation in the format.

	This means that code must first load the half-precision floating point			This means that code must first load the half-precision floating point
	[989 lines hidden]

include/llvm/CodeGen/ISDOpcodes.h

[336 lines hidden]	enum NodeType {

/// Byte Swap and Counting operators.		/// Byte Swap and Counting operators.
BSWAP, CTTZ, CTLZ, CTPOP,		BSWAP, CTTZ, CTLZ, CTPOP,

/// [SU]ABSDIFF - Signed/Unsigned absolute difference of two input integer		/// [SU]ABSDIFF - Signed/Unsigned absolute difference of two input integer
/// vector. These nodes are generated from llvm.absdiff intrinsics.		/// vector. These nodes are generated from llvm.absdiff intrinsics.
SABSDIFF, UABSDIFF,		SABSDIFF, UABSDIFF,

		/// HSUM/FHSUM(VEC) - Horizontal sum across the elements of sole integer
		/// or float input vector. Returns a vector, of the same type as VEC.
		bruno: Space after the dot.
		/// These nodes are generated from llvm.hsum* intrinsics. The scalar result
		bruno: Same here.
		/// of the intrinsic is stored in first element of a vector of type VEC.
		HSUM, FHSUM,

/// Bit counting operators with an undefined result for zero inputs.		/// Bit counting operators with an undefined result for zero inputs.
CTTZ_ZERO_UNDEF, CTLZ_ZERO_UNDEF,		CTTZ_ZERO_UNDEF, CTLZ_ZERO_UNDEF,

/// Select(COND, TRUEVAL, FALSEVAL). If the type of the boolean COND is not		/// Select(COND, TRUEVAL, FALSEVAL). If the type of the boolean COND is not
/// i1 then the high bits must conform to getBooleanContents.		/// i1 then the high bits must conform to getBooleanContents.
SELECT,		SELECT,

/// Select with a vector condition (op #0) and two vector operands (ops #1		/// Select with a vector condition (op #0) and two vector operands (ops #1
[577 lines hidden]

include/llvm/IR/Intrinsics.td

[611 lines hidden]	def int_convertuu : Intrinsic<[llvm_anyint_ty],
[llvm_anyint_ty, llvm_i32_ty, llvm_i32_ty]>;		[llvm_anyint_ty, llvm_i32_ty, llvm_i32_ty]>;

// Clear cache intrinsic, default to ignore (ie. emit nothing)		// Clear cache intrinsic, default to ignore (ie. emit nothing)
// maps to void __clear_cache() on supporting platforms		// maps to void __clear_cache() on supporting platforms
def int_clear_cache : Intrinsic<[], [llvm_ptr_ty, llvm_ptr_ty],		def int_clear_cache : Intrinsic<[], [llvm_ptr_ty, llvm_ptr_ty],
[], "llvm.clear_cache">;		[], "llvm.clear_cache">;

// Calculate the Absolute Differences of the two input vectors.		// Calculate the Absolute Differences of the two input vectors.
def int_sabsdiff : Intrinsic<[llvm_anyvector_ty],		def int_sabsdiff : Intrinsic<[llvm_anyvector_ty],
		jmolloy: Just having one intrinsic here would be good; there's no need for a separate int and float version.
		ashahid: In that case, what should be the return type of the intrinsic?
[ LLVMMatchType<0>, LLVMMatchType<0> ], [IntrNoMem]>;		[ LLVMMatchType<0>, LLVMMatchType<0> ], [IntrNoMem]>;
def int_uabsdiff : Intrinsic<[llvm_anyvector_ty],		def int_uabsdiff : Intrinsic<[llvm_anyvector_ty],
[ LLVMMatchType<0>, LLVMMatchType<0> ], [IntrNoMem]>;		[ LLVMMatchType<0>, LLVMMatchType<0> ], [IntrNoMem]>;

		// Calculate the horizontal/reduction sum across the elements of input vector.
		def int_hsum : Intrinsic<[llvm_any_ty], [llvm_anyvector_ty], [IntrNoMem]>;

//===-------------------------- Masked Intrinsics -------------------------===//		//===-------------------------- Masked Intrinsics -------------------------===//
		jmolloy: Blank line missing here.
//		//
def int_masked_store : Intrinsic<[], [llvm_anyvector_ty, LLVMPointerTo<0>,		def int_masked_store : Intrinsic<[], [llvm_anyvector_ty, LLVMPointerTo<0>,
llvm_i32_ty,		llvm_i32_ty,
LLVMVectorSameWidth<0, llvm_i1_ty>],		LLVMVectorSameWidth<0, llvm_i1_ty>],
[IntrReadWriteArgMem]>;		[IntrReadWriteArgMem]>;

def int_masked_load : Intrinsic<[llvm_anyvector_ty],		def int_masked_load : Intrinsic<[llvm_anyvector_ty],
[LLVMPointerTo<0>, llvm_i32_ty,		[LLVMPointerTo<0>, llvm_i32_ty,
[35 lines hidden]

include/llvm/Target/TargetSelectionDAG.td

[382 lines hidden]	def sube : SDNode<"ISD::SUBE" , SDTIntBinOp,
[SDNPOutGlue, SDNPInGlue]>;		[SDNPOutGlue, SDNPInGlue]>;
def smin : SDNode<"ISD::SMIN" , SDTIntBinOp>;		def smin : SDNode<"ISD::SMIN" , SDTIntBinOp>;
def smax : SDNode<"ISD::SMAX" , SDTIntBinOp>;		def smax : SDNode<"ISD::SMAX" , SDTIntBinOp>;
def umin : SDNode<"ISD::UMIN" , SDTIntBinOp>;		def umin : SDNode<"ISD::UMIN" , SDTIntBinOp>;
def umax : SDNode<"ISD::UMAX" , SDTIntBinOp>;		def umax : SDNode<"ISD::UMAX" , SDTIntBinOp>;

def sabsdiff : SDNode<"ISD::SABSDIFF" , SDTIntBinOp>;		def sabsdiff : SDNode<"ISD::SABSDIFF" , SDTIntBinOp>;
def uabsdiff : SDNode<"ISD::UABSDIFF" , SDTIntBinOp>;		def uabsdiff : SDNode<"ISD::UABSDIFF" , SDTIntBinOp>;
		def hsum : SDNode<"ISD::HSUM" , SDTIntUnaryOp>;
		def fhsum : SDNode<"ISD::FHSUM" , SDTIntUnaryOp>;
def sext_inreg : SDNode<"ISD::SIGN_EXTEND_INREG", SDTExtInreg>;		def sext_inreg : SDNode<"ISD::SIGN_EXTEND_INREG", SDTExtInreg>;
def bswap : SDNode<"ISD::BSWAP" , SDTIntUnaryOp>;		def bswap : SDNode<"ISD::BSWAP" , SDTIntUnaryOp>;
def ctlz : SDNode<"ISD::CTLZ" , SDTIntUnaryOp>;		def ctlz : SDNode<"ISD::CTLZ" , SDTIntUnaryOp>;
def cttz : SDNode<"ISD::CTTZ" , SDTIntUnaryOp>;		def cttz : SDNode<"ISD::CTTZ" , SDTIntUnaryOp>;
def ctpop : SDNode<"ISD::CTPOP" , SDTIntUnaryOp>;		def ctpop : SDNode<"ISD::CTPOP" , SDTIntUnaryOp>;
def ctlz_zero_undef : SDNode<"ISD::CTLZ_ZERO_UNDEF", SDTIntUnaryOp>;		def ctlz_zero_undef : SDNode<"ISD::CTLZ_ZERO_UNDEF", SDTIntUnaryOp>;
def cttz_zero_undef : SDNode<"ISD::CTTZ_ZERO_UNDEF", SDTIntUnaryOp>;		def cttz_zero_undef : SDNode<"ISD::CTTZ_ZERO_UNDEF", SDTIntUnaryOp>;
def sext : SDNode<"ISD::SIGN_EXTEND", SDTIntExtendOp>;		def sext : SDNode<"ISD::SIGN_EXTEND", SDTIntExtendOp>;
[737 lines hidden]

lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp

[144 lines hidden]	#endif
case ISD::ATOMIC_CMP_SWAP:		case ISD::ATOMIC_CMP_SWAP:
case ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS:		case ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS:
Res = PromoteIntRes_AtomicCmpSwap(cast<AtomicSDNode>(N), ResNo);		Res = PromoteIntRes_AtomicCmpSwap(cast<AtomicSDNode>(N), ResNo);
break;		break;
case ISD::UABSDIFF:		case ISD::UABSDIFF:
case ISD::SABSDIFF:		case ISD::SABSDIFF:
Res = PromoteIntRes_SimpleIntBinOp(N);		Res = PromoteIntRes_SimpleIntBinOp(N);
break;		break;
		case ISD::HSUM:
		Res = PromoteIntRes_HSUM(N);
		break;
}		}

// If the result is null then the sub-method took care of registering it.		// If the result is null then the sub-method took care of registering it.
if (Res.getNode())		if (Res.getNode())
SetPromotedInteger(SDValue(N, ResNo), Res);		SetPromotedInteger(SDValue(N, ResNo), Res);
}		}

		SDValue DAGTypeLegalizer::PromoteIntRes_HSUM(SDNode *N) {
		SDValue OpVal = GetPromotedInteger(N->getOperand(0));
		return DAG.getNode(N->getOpcode(), SDLoc(N),
		OpVal.getValueType(), OpVal);
		}

SDValue DAGTypeLegalizer::PromoteIntRes_MERGE_VALUES(SDNode *N,		SDValue DAGTypeLegalizer::PromoteIntRes_MERGE_VALUES(SDNode *N,
unsigned ResNo) {		unsigned ResNo) {
SDValue Op = DisintegrateMERGE_VALUES(N, ResNo);		SDValue Op = DisintegrateMERGE_VALUES(N, ResNo);
return GetPromotedInteger(Op);		return GetPromotedInteger(Op);
}		}

SDValue DAGTypeLegalizer::PromoteIntRes_AssertSext(SDNode *N) {		SDValue DAGTypeLegalizer::PromoteIntRes_AssertSext(SDNode *N) {
// Sign-extend the new bits, and continue the assertion.		// Sign-extend the new bits, and continue the assertion.
[717 lines hidden]	bool DAGTypeLegalizer::PromoteIntegerOperand(SDNode *N, unsigned OpNo) {
case ISD::ZERO_EXTEND: Res = PromoteIntOp_ZERO_EXTEND(N); break;		case ISD::ZERO_EXTEND: Res = PromoteIntOp_ZERO_EXTEND(N); break;
case ISD::EXTRACT_SUBVECTOR: Res = PromoteIntOp_EXTRACT_SUBVECTOR(N); break;		case ISD::EXTRACT_SUBVECTOR: Res = PromoteIntOp_EXTRACT_SUBVECTOR(N); break;

case ISD::SHL:		case ISD::SHL:
case ISD::SRA:		case ISD::SRA:
case ISD::SRL:		case ISD::SRL:
case ISD::ROTL:		case ISD::ROTL:
case ISD::ROTR: Res = PromoteIntOp_Shift(N); break;		case ISD::ROTR: Res = PromoteIntOp_Shift(N); break;
		case ISD::HSUM: Res = PromoteIntOp_HSUM(N); break;
}		}

// If the result is null, the sub-method took care of registering results etc.		// If the result is null, the sub-method took care of registering results etc.
if (!Res.getNode()) return false;		if (!Res.getNode()) return false;

// If the result is N, the sub-method updated N in place. Tell the legalizer		// If the result is N, the sub-method updated N in place. Tell the legalizer
// core about this.		// core about this.
if (Res.getNode() == N)		if (Res.getNode() == N)
[317 lines hidden]
SDValue DAGTypeLegalizer::PromoteIntOp_ZERO_EXTEND(SDNode *N) {		SDValue DAGTypeLegalizer::PromoteIntOp_ZERO_EXTEND(SDNode *N) {
SDLoc dl(N);		SDLoc dl(N);
SDValue Op = GetPromotedInteger(N->getOperand(0));		SDValue Op = GetPromotedInteger(N->getOperand(0));
Op = DAG.getNode(ISD::ANY_EXTEND, dl, N->getValueType(0), Op);		Op = DAG.getNode(ISD::ANY_EXTEND, dl, N->getValueType(0), Op);
return DAG.getZeroExtendInReg(Op, dl,		return DAG.getZeroExtendInReg(Op, dl,
N->getOperand(0).getValueType().getScalarType());		N->getOperand(0).getValueType().getScalarType());
}		}

		SDValue DAGTypeLegalizer::PromoteIntOp_HSUM(SDNode *N) {
		SDValue Op = GetPromotedInteger(N->getOperand(0));
		return DAG.getNode(N->getOpcode(), SDLoc(N),
		N->getValueType(0), Op);
		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Integer Result Expansion		// Integer Result Expansion
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

/// ExpandIntegerResult - This method is called when the specified result of the		/// ExpandIntegerResult - This method is called when the specified result of the
/// specified node is found to need expansion. At this point, the node may also		/// specified node is found to need expansion. At this point, the node may also
/// have invalid operands or may have other results that need promotion, we just		/// have invalid operands or may have other results that need promotion, we just
[1,981 lines hidden]

lib/CodeGen/SelectionDAG/LegalizeTypes.h

[258 lines hidden]	private:
SDValue PromoteIntRes_SRA(SDNode *N);		SDValue PromoteIntRes_SRA(SDNode *N);
SDValue PromoteIntRes_SRL(SDNode *N);		SDValue PromoteIntRes_SRL(SDNode *N);
SDValue PromoteIntRes_TRUNCATE(SDNode *N);		SDValue PromoteIntRes_TRUNCATE(SDNode *N);
SDValue PromoteIntRes_UADDSUBO(SDNode *N, unsigned ResNo);		SDValue PromoteIntRes_UADDSUBO(SDNode *N, unsigned ResNo);
SDValue PromoteIntRes_UDIV(SDNode *N);		SDValue PromoteIntRes_UDIV(SDNode *N);
SDValue PromoteIntRes_UNDEF(SDNode *N);		SDValue PromoteIntRes_UNDEF(SDNode *N);
SDValue PromoteIntRes_VAARG(SDNode *N);		SDValue PromoteIntRes_VAARG(SDNode *N);
SDValue PromoteIntRes_XMULO(SDNode *N, unsigned ResNo);		SDValue PromoteIntRes_XMULO(SDNode *N, unsigned ResNo);
		SDValue PromoteIntRes_HSUM(SDNode *N);

// Integer Operand Promotion.		// Integer Operand Promotion.
bool PromoteIntegerOperand(SDNode *N, unsigned OperandNo);		bool PromoteIntegerOperand(SDNode *N, unsigned OperandNo);
SDValue PromoteIntOp_ANY_EXTEND(SDNode *N);		SDValue PromoteIntOp_ANY_EXTEND(SDNode *N);
SDValue PromoteIntOp_ATOMIC_STORE(AtomicSDNode *N);		SDValue PromoteIntOp_ATOMIC_STORE(AtomicSDNode *N);
SDValue PromoteIntOp_BITCAST(SDNode *N);		SDValue PromoteIntOp_BITCAST(SDNode *N);
SDValue PromoteIntOp_BUILD_PAIR(SDNode *N);		SDValue PromoteIntOp_BUILD_PAIR(SDNode *N);
SDValue PromoteIntOp_BR_CC(SDNode *N, unsigned OpNo);		SDValue PromoteIntOp_BR_CC(SDNode *N, unsigned OpNo);
[14 lines hidden]	private:
SDValue PromoteIntOp_SIGN_EXTEND(SDNode *N);		SDValue PromoteIntOp_SIGN_EXTEND(SDNode *N);
SDValue PromoteIntOp_SINT_TO_FP(SDNode *N);		SDValue PromoteIntOp_SINT_TO_FP(SDNode *N);
SDValue PromoteIntOp_STORE(StoreSDNode *N, unsigned OpNo);		SDValue PromoteIntOp_STORE(StoreSDNode *N, unsigned OpNo);
SDValue PromoteIntOp_TRUNCATE(SDNode *N);		SDValue PromoteIntOp_TRUNCATE(SDNode *N);
SDValue PromoteIntOp_UINT_TO_FP(SDNode *N);		SDValue PromoteIntOp_UINT_TO_FP(SDNode *N);
SDValue PromoteIntOp_ZERO_EXTEND(SDNode *N);		SDValue PromoteIntOp_ZERO_EXTEND(SDNode *N);
SDValue PromoteIntOp_MSTORE(MaskedStoreSDNode *N, unsigned OpNo);		SDValue PromoteIntOp_MSTORE(MaskedStoreSDNode *N, unsigned OpNo);
SDValue PromoteIntOp_MLOAD(MaskedLoadSDNode *N, unsigned OpNo);		SDValue PromoteIntOp_MLOAD(MaskedLoadSDNode *N, unsigned OpNo);
		SDValue PromoteIntOp_HSUM(SDNode *N);

void PromoteSetCCOperands(SDValue &LHS,SDValue &RHS, ISD::CondCode Code);		void PromoteSetCCOperands(SDValue &LHS,SDValue &RHS, ISD::CondCode Code);

//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
// Integer Expansion Support: LegalizeIntegerTypes.cpp		// Integer Expansion Support: LegalizeIntegerTypes.cpp
//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//

/// GetExpandedInteger - Given a processed operand Op which was expanded into		/// GetExpandedInteger - Given a processed operand Op which was expanded into
[325 lines hidden]	private:
void SplitVecRes_MLOAD(MaskedLoadSDNode *N, SDValue &Lo, SDValue &Hi);		void SplitVecRes_MLOAD(MaskedLoadSDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_MGATHER(MaskedGatherSDNode *N, SDValue &Lo, SDValue &Hi);		void SplitVecRes_MGATHER(MaskedGatherSDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_SCALAR_TO_VECTOR(SDNode *N, SDValue &Lo, SDValue &Hi);		void SplitVecRes_SCALAR_TO_VECTOR(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_SIGN_EXTEND_INREG(SDNode *N, SDValue &Lo, SDValue &Hi);		void SplitVecRes_SIGN_EXTEND_INREG(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_SETCC(SDNode *N, SDValue &Lo, SDValue &Hi);		void SplitVecRes_SETCC(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_UNDEF(SDNode *N, SDValue &Lo, SDValue &Hi);		void SplitVecRes_UNDEF(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_VECTOR_SHUFFLE(ShuffleVectorSDNode *N, SDValue &Lo,		void SplitVecRes_VECTOR_SHUFFLE(ShuffleVectorSDNode *N, SDValue &Lo,
SDValue &Hi);		SDValue &Hi);
		void SplitVecRes_HSUM(SDNode *N, SDValue &Lo, SDValue &Hi);

// Vector Operand Splitting: <128 x ty> -> 2 x <64 x ty>.		// Vector Operand Splitting: <128 x ty> -> 2 x <64 x ty>.
bool SplitVectorOperand(SDNode *N, unsigned OpNo);		bool SplitVectorOperand(SDNode *N, unsigned OpNo);
SDValue SplitVecOp_VSELECT(SDNode *N, unsigned OpNo);		SDValue SplitVecOp_VSELECT(SDNode *N, unsigned OpNo);
SDValue SplitVecOp_UnaryOp(SDNode *N);		SDValue SplitVecOp_UnaryOp(SDNode *N);
SDValue SplitVecOp_TruncateHelper(SDNode *N);		SDValue SplitVecOp_TruncateHelper(SDNode *N);

SDValue SplitVecOp_BITCAST(SDNode *N);		SDValue SplitVecOp_BITCAST(SDNode *N);
[64 lines hidden]	private:
SDValue WidenVecOp_EXTRACT_VECTOR_ELT(SDNode *N);		SDValue WidenVecOp_EXTRACT_VECTOR_ELT(SDNode *N);
SDValue WidenVecOp_EXTRACT_SUBVECTOR(SDNode *N);		SDValue WidenVecOp_EXTRACT_SUBVECTOR(SDNode *N);
SDValue WidenVecOp_STORE(SDNode* N);		SDValue WidenVecOp_STORE(SDNode* N);
SDValue WidenVecOp_MSTORE(SDNode* N, unsigned OpNo);		SDValue WidenVecOp_MSTORE(SDNode* N, unsigned OpNo);
SDValue WidenVecOp_SETCC(SDNode* N);		SDValue WidenVecOp_SETCC(SDNode* N);

SDValue WidenVecOp_Convert(SDNode *N);		SDValue WidenVecOp_Convert(SDNode *N);
SDValue WidenVecOp_FCOPYSIGN(SDNode *N);		SDValue WidenVecOp_FCOPYSIGN(SDNode *N);
		SDValue WidenVecOp_HSUM(SDNode *N);

//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
// Vector Widening Utilities Support: LegalizeVectorTypes.cpp		// Vector Widening Utilities Support: LegalizeVectorTypes.cpp
//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//

/// Helper GenWidenVectorLoads - Helper function to generate a set of		/// Helper GenWidenVectorLoads - Helper function to generate a set of
/// loads to load a vector with a resulting wider type. It takes		/// loads to load a vector with a resulting wider type. It takes
/// LdChain: list of chains for the load to be generated.		/// LdChain: list of chains for the load to be generated.
[102 lines hidden]

lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp

[100 lines hidden]	class VectorLegalizer {
/// \brief Implement vselect in terms of XOR, AND, OR when blend is not		/// \brief Implement vselect in terms of XOR, AND, OR when blend is not
/// supported by the target.		/// supported by the target.
SDValue ExpandVSELECT(SDValue Op);		SDValue ExpandVSELECT(SDValue Op);
SDValue ExpandSELECT(SDValue Op);		SDValue ExpandSELECT(SDValue Op);
SDValue ExpandLoad(SDValue Op);		SDValue ExpandLoad(SDValue Op);
SDValue ExpandStore(SDValue Op);		SDValue ExpandStore(SDValue Op);
SDValue ExpandFNEG(SDValue Op);		SDValue ExpandFNEG(SDValue Op);
SDValue ExpandABSDIFF(SDValue Op);		SDValue ExpandABSDIFF(SDValue Op);
		SDValue ExpandHSUM(SDValue Op);

/// \brief Implements vector promotion.		/// \brief Implements vector promotion.
///		///
/// This is essentially just bitcasting the operands to a different type and		/// This is essentially just bitcasting the operands to a different type and
/// bitcasting the result back to the original type.		/// bitcasting the result back to the original type.
SDValue Promote(SDValue Op);		SDValue Promote(SDValue Op);

/// \brief Implements [SU]INT_TO_FP vector promotion.		/// \brief Implements [SU]INT_TO_FP vector promotion.
[209 lines hidden]	SDValue VectorLegalizer::LegalizeOp(SDValue Op) {
case ISD::SIGN_EXTEND_VECTOR_INREG:		case ISD::SIGN_EXTEND_VECTOR_INREG:
case ISD::ZERO_EXTEND_VECTOR_INREG:		case ISD::ZERO_EXTEND_VECTOR_INREG:
case ISD::SMIN:		case ISD::SMIN:
case ISD::SMAX:		case ISD::SMAX:
case ISD::UMIN:		case ISD::UMIN:
case ISD::UMAX:		case ISD::UMAX:
case ISD::UABSDIFF:		case ISD::UABSDIFF:
case ISD::SABSDIFF:		case ISD::SABSDIFF:
		case ISD::HSUM:
		case ISD::FHSUM:
QueryType = Node->getValueType(0);		QueryType = Node->getValueType(0);
break;		break;
case ISD::FP_ROUND_INREG:		case ISD::FP_ROUND_INREG:
QueryType = cast<VTSDNode>(Node->getOperand(1))->getVT();		QueryType = cast<VTSDNode>(Node->getOperand(1))->getVT();
break;		break;
case ISD::SINT_TO_FP:		case ISD::SINT_TO_FP:
case ISD::UINT_TO_FP:		case ISD::UINT_TO_FP:
QueryType = Node->getOperand(0).getValueType();		QueryType = Node->getOperand(0).getValueType();
[368 lines hidden]	SDValue VectorLegalizer::Expand(SDValue Op) {
case ISD::UINT_TO_FP:		case ISD::UINT_TO_FP:
return ExpandUINT_TO_FLOAT(Op);		return ExpandUINT_TO_FLOAT(Op);
case ISD::FNEG:		case ISD::FNEG:
return ExpandFNEG(Op);		return ExpandFNEG(Op);
case ISD::SETCC:		case ISD::SETCC:
return UnrollVSETCC(Op);		return UnrollVSETCC(Op);
case ISD::UABSDIFF:		case ISD::UABSDIFF:
case ISD::SABSDIFF:		case ISD::SABSDIFF:
return ExpandABSDIFF(Op);		return ExpandABSDIFF(Op);
		jmolloy: Can't you just call ExpandHADD() here? Or at least share the unroll and expand code?
		ashahid: Yes, I can probably share it.
		case ISD::HSUM:
		case ISD::FHSUM:
		return ExpandHSUM(Op);
default:		default:
return DAG.UnrollVectorOp(Op.getNode());		return DAG.UnrollVectorOp(Op.getNode());
}		}
}		}

SDValue VectorLegalizer::ExpandABSDIFF(SDValue Op) {		SDValue VectorLegalizer::ExpandABSDIFF(SDValue Op) {
SDLoc dl(Op);		SDLoc dl(Op);
SDValue Op0 = Op.getOperand(0);		SDValue Op0 = Op.getOperand(0);
▲ Show 20 Lines • Show All 317 Lines • ▼ Show 20 Lines	SDValue RHSElem = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, dl, TmpEltVT, RHS,		ISD::EXTRACT_VECTOR_ELT, dl, TmpEltVT, RHS,
DAG.getConstant(i, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));		DAG.getConstant(i, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
Ops[i] = DAG.getNode(ISD::SETCC, dl,		Ops[i] = DAG.getNode(ISD::SETCC, dl,
TLI.getSetCCResultType(DAG.getDataLayout(),		TLI.getSetCCResultType(DAG.getDataLayout(),
*DAG.getContext(), TmpEltVT),		*DAG.getContext(), TmpEltVT),
LHSElem, RHSElem, CC);		LHSElem, RHSElem, CC);
Ops[i] = DAG.getSelect(dl, EltVT, Ops[i],		Ops[i] = DAG.getSelect(dl, EltVT, Ops[i],
DAG.getConstant(APInt::getAllOnesValue		DAG.getConstant(APInt::getAllOnesValue
(EltVT.getSizeInBits()), dl, EltVT),		(EltVT.getSizeInBits()), dl, EltVT),
		hfinkel: Remove the UnsafeFPMath check.
DAG.getConstant(0, dl, EltVT));		DAG.getConstant(0, dl, EltVT));
}		}
return DAG.getNode(ISD::BUILD_VECTOR, dl, VT, Ops);		return DAG.getNode(ISD::BUILD_VECTOR, dl, VT, Ops);
}		}

		SDValue VectorLegalizer::ExpandHSUM(SDValue Op) {
		SDLoc dl(Op);
		SDValue OpVal = Op.getOperand(0);
		EVT VT = OpVal.getValueType();
		bruno: Using the assert inside this "if" will be a problem when asserts aren't enabled.
		unsigned NumElems = VT.getVectorNumElements();
		SDNodeFlags Flags;
		Flags.setNoSignedWrap(Op->getOpcode() == ISD::HSUM);

		hfinkel: If we can get rid of the undefined behavior for unsigned overflow, then remove this line.
		SmallVector<int, 16> ShuffleMask(NumElems, -1);
		for (unsigned i = NumElems; i != 1; i >>= 1) {
		// Move the upper half of the vector to the lower half.
		for (unsigned j = 0; j != i / 2; ++j)
		ShuffleMask[j] = (i / 2 + j);
		SDValue Shuffle =
		hfinkel: Given that the order of additions is undefined, we can add NoSignedWrap or NoUnsignedWrap in the integer case.
		DAG.getVectorShuffle(VT, dl, OpVal, DAG.getUNDEF(VT), &ShuffleMask[0]);
		OpVal = DAG.getNode(Op->getOpcode() == ISD::HSUM ? ISD::ADD : ISD::FADD, dl,
		VT, OpVal, Shuffle, &Flags);
		}
		return OpVal;
		}
}		}

bool SelectionDAG::LegalizeVectors() {		bool SelectionDAG::LegalizeVectors() {
return VectorLegalizer(*this).Run();		return VectorLegalizer(*this).Run();
}		}
		bruno: For the integer part (ISD::HADD) I believe you could do "vector shifts + vector adds" instead of "extracts + scalar adds"; it is probably better not to leave the vector domain. If the current target doesn't support "vector shifts + vector adds" for the element type, then your implementation should fall back to "extracts + scalar adds". To check that, you can use the following in UnrollHADD: if (TLI.getOperationAction(ISD::SHL, VT) == TargetLowering::Expand || TLI.getOperationAction(ISD::ADD, VT) == TargetLowering::Expand) ....
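The shuffle-and-add loop in ExpandHSUM above can be modelled outside of SelectionDAG on a plain array. The following is only a sketch under stated assumptions: `hsumByHalving` and `hsumScalar` are hypothetical names that do not appear in the patch, the lanes are modelled as `uint32_t` so that overflow wraps, and the lane count is assumed to be a power of two. It contrasts the in-vector halving reduction with the "extracts + scalar adds" fallback mentioned in the review.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the patch): models the shuffle-and-add
// loop of ExpandHSUM on a plain array. Lane 0 of the result holds the sum.
std::vector<std::uint32_t> hsumByHalving(std::vector<std::uint32_t> v) {
  const std::size_t n = v.size();
  // The halving scheme only covers every lane when n is a power of two.
  assert(n != 0 && (n & (n - 1)) == 0 && "lane count must be a power of two");
  for (std::size_t i = n; i != 1; i >>= 1) {
    // "Shuffle" the upper half of the live lanes down and add it lane-wise
    // onto the lower half. Lanes at or above i / 2 become don't-care.
    for (std::size_t j = 0; j != i / 2; ++j)
      v[j] += v[i / 2 + j];
  }
  return v;
}

// Hypothetical reference: the "extracts + scalar adds" fallback.
std::uint32_t hsumScalar(const std::vector<std::uint32_t> &v) {
  std::uint32_t sum = 0;
  for (std::uint32_t x : v)
    sum += x;
  return sum;
}
```

For unsigned integers both functions agree, because wrapping addition is associative and commutative. The halving form needs log2(n) vector adds instead of n - 1 scalar adds.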

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp

Show First 20 Lines • Show All 643 Lines • ▼ Show 20 Lines	#endif
case ISD::FSQRT:		case ISD::FSQRT:
case ISD::FTRUNC:		case ISD::FTRUNC:
case ISD::SINT_TO_FP:		case ISD::SINT_TO_FP:
case ISD::TRUNCATE:		case ISD::TRUNCATE:
case ISD::UINT_TO_FP:		case ISD::UINT_TO_FP:
SplitVecRes_UnaryOp(N, Lo, Hi);		SplitVecRes_UnaryOp(N, Lo, Hi);
break;		break;

		case ISD::HSUM:
		case ISD::FHSUM:
		SplitVecRes_HSUM(N, Lo, Hi);
		break;

case ISD::ANY_EXTEND:		case ISD::ANY_EXTEND:
case ISD::SIGN_EXTEND:		case ISD::SIGN_EXTEND:
case ISD::ZERO_EXTEND:		case ISD::ZERO_EXTEND:
SplitVecRes_ExtendOp(N, Lo, Hi);		SplitVecRes_ExtendOp(N, Lo, Hi);
break;		break;

case ISD::ADD:		case ISD::ADD:
case ISD::SUB:		case ISD::SUB:
Show All 31 Lines	case ISD::FMA:
break;		break;
}		}

// If Lo/Hi is null, the sub-method took care of registering results etc.		// If Lo/Hi is null, the sub-method took care of registering results etc.
if (Lo.getNode())		if (Lo.getNode())
SetSplitVector(SDValue(N, ResNo), Lo, Hi);		SetSplitVector(SDValue(N, ResNo), Lo, Hi);
}		}

		void DAGTypeLegalizer::SplitVecRes_HSUM(SDNode *N, SDValue &Lo, SDValue &Hi) {
		SDValue OpVal = N->getOperand(0);
		SDLoc dl(N);
		GetSplitVector(OpVal, Lo, Hi);
		EVT NewVT = Lo.getValueType();
		Lo = DAG.getNode(N->getOpcode(), dl, NewVT, Lo);
		Hi = DAG.getNode(N->getOpcode(), dl, NewVT, Hi);

		hfinkel: Same comment here as above (we can add signed or unsigned nowrap here).
		SDNodeFlags Flags;
		Flags.setNoSignedWrap(N->getOpcode() == ISD::HSUM);
		Lo = DAG.getNode(N->getOpcode() == ISD::HSUM ? ISD::ADD : ISD::FADD, dl,
		hfinkel: Same comment here too.
		NewVT, Lo, Hi, &Flags);
		Hi = DAG.getUNDEF(NewVT);
		}
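The shape of SplitVecRes_HSUM above (reduce each half, then add the two partial results lane-wise) can be sketched on plain arrays. This is an illustrative model only: `splitHalves`, `hsumNode`, and `hsumViaSplit` are hypothetical names, and the don't-care lanes are set to zero purely to keep the model deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Lanes = std::vector<std::uint64_t>;

// Hypothetical stand-in for GetSplitVector: cut a vector into low and high halves.
std::pair<Lanes, Lanes> splitHalves(const Lanes &v) {
  assert(v.size() >= 2 && v.size() % 2 == 0 && "expected an even lane count");
  const std::ptrdiff_t half = static_cast<std::ptrdiff_t>(v.size() / 2);
  return {Lanes(v.begin(), v.begin() + half), Lanes(v.begin() + half, v.end())};
}

// Hypothetical stand-in for an HSUM node on a legal type: the sum lands in
// lane 0. The remaining lanes are don't-care; zero is used only for the model.
Lanes hsumNode(const Lanes &v) {
  assert(!v.empty() && "need at least one lane");
  std::uint64_t sum = 0;
  for (std::uint64_t x : v)
    sum += x;
  Lanes out(v.size(), 0);
  out[0] = sum;
  return out;
}

// Mirrors the shape of SplitVecRes_HSUM: reduce each half, then add lane-wise.
std::uint64_t hsumViaSplit(const Lanes &v) {
  const std::pair<Lanes, Lanes> halves = splitHalves(v);
  Lanes lo = hsumNode(halves.first);
  const Lanes hi = hsumNode(halves.second);
  for (std::size_t i = 0; i < lo.size(); ++i)
    lo[i] += hi[i];
  return lo[0];  // only lane 0 is meaningful
}
```

In this model a <4 x i64> input reduces to the sum of all four lanes, which is the behaviour the split legalization is expected to preserve.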

void DAGTypeLegalizer::SplitVecRes_BinOp(SDNode *N, SDValue &Lo,		void DAGTypeLegalizer::SplitVecRes_BinOp(SDNode *N, SDValue &Lo,
SDValue &Hi) {		SDValue &Hi) {
SDValue LHSLo, LHSHi;		SDValue LHSLo, LHSHi;
GetSplitVector(N->getOperand(0), LHSLo, LHSHi);		GetSplitVector(N->getOperand(0), LHSLo, LHSHi);
SDValue RHSLo, RHSHi;		SDValue RHSLo, RHSHi;
GetSplitVector(N->getOperand(1), RHSLo, RHSHi);		GetSplitVector(N->getOperand(1), RHSLo, RHSHi);
SDLoc dl(N);		SDLoc dl(N);

▲ Show 20 Lines • Show All 1,340 Lines • ▼ Show 20 Lines	#endif
case ISD::FSIN:		case ISD::FSIN:
case ISD::FSQRT:		case ISD::FSQRT:
case ISD::FTRUNC:		case ISD::FTRUNC:
Res = WidenVecRes_Unary(N);		Res = WidenVecRes_Unary(N);
break;		break;
case ISD::FMA:		case ISD::FMA:
Res = WidenVecRes_Ternary(N);		Res = WidenVecRes_Ternary(N);
break;		break;
		case ISD::HSUM:
		case ISD::FHSUM:
		Res = WidenVecRes_Unary(N);
		break;
}		}

// If Res is null, the sub-method took care of registering the result.		// If Res is null, the sub-method took care of registering the result.
if (Res.getNode())		if (Res.getNode())
SetWidenedVector(SDValue(N, ResNo), Res);		SetWidenedVector(SDValue(N, ResNo), Res);
}		}

SDValue DAGTypeLegalizer::WidenVecRes_Ternary(SDNode *N) {		SDValue DAGTypeLegalizer::WidenVecRes_Ternary(SDNode *N) {
▲ Show 20 Lines • Show All 830 Lines • ▼ Show 20 Lines	#endif
case ISD::FP_EXTEND:		case ISD::FP_EXTEND:
case ISD::FP_TO_SINT:		case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT:		case ISD::FP_TO_UINT:
case ISD::SINT_TO_FP:		case ISD::SINT_TO_FP:
case ISD::UINT_TO_FP:		case ISD::UINT_TO_FP:
case ISD::TRUNCATE:		case ISD::TRUNCATE:
Res = WidenVecOp_Convert(N);		Res = WidenVecOp_Convert(N);
break;		break;
		case ISD::HSUM:
		hfinkel: Why is ISD::HADD not handled here?
		ashahid: My tests were probably not sufficient to expose the need to handle ISD::HADD. I will look into it.
		case ISD::FHSUM:
		Res = WidenVecOp_HSUM(N);
		break;
		hfinkel: Remove the UnsafeFPMath check.
}		}

// If Res is null, the sub-method took care of registering the result.		// If Res is null, the sub-method took care of registering the result.
if (!Res.getNode()) return false;		if (!Res.getNode()) return false;

// If the result is N, the sub-method updated N in place. Tell the legalizer		// If the result is N, the sub-method updated N in place. Tell the legalizer
// core about this.		// core about this.
if (Res.getNode() == N)		if (Res.getNode() == N)
return true;		return true;


assert(Res.getValueType() == N->getValueType(0) && N->getNumValues() == 1 &&		assert(Res.getValueType() == N->getValueType(0) && N->getNumValues() == 1 &&
"Invalid operand expansion");		"Invalid operand expansion");

ReplaceValueWith(SDValue(N, 0), Res);		ReplaceValueWith(SDValue(N, 0), Res);
return false;		return false;
}		}

		SDValue DAGTypeLegalizer::WidenVecOp_HSUM(SDNode *N) {
		// Since widening introduces undefs, scalarize the horizontal sum over the
		// actual number of vector elements.
		hfinkel: What exactly does this do? Widening normally introduces undefs, but you can't add a bunch of undefs and get anything other than an undef out. You might need the SDAG node to carry an extra parameter indicating how many of the vector lanes are actually to be added to properly support widening.
		ashahid: Yes, it does introduce undefs. Sorry, I did not understand your suggestion regarding the extra SDAG parameter.
		// TODO: Improve scalarization using vector shift and add on supported
		// targets.
		SDValue Op = N->getOperand(0);
		EVT VT = Op.getValueType();
		unsigned NumElems = VT.getVectorNumElements();
		EVT EltVT = VT.getVectorElementType();
		SDLoc dl(Op);
		SDNodeFlags Flags;
		Flags.setNoSignedWrap(N->getOpcode() == ISD::HSUM);

		hfinkel: Same here.
		SDValue LHSElem = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, EltVT, Op,
		DAG.getConstant(0, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
		SDValue RHSElem = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, EltVT, Op,
		DAG.getConstant(1, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
		hfinkel: Same here.

		SDValue Ops;
		Ops = DAG.getNode(N->getOpcode() == ISD::HSUM ? ISD::ADD : ISD::FADD, dl,
		EltVT, LHSElem, RHSElem, &Flags);
		for (unsigned i = 2; i < NumElems; ++i) {
		hfinkel: And here.
		LHSElem = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, EltVT, Op,
		DAG.getConstant(i, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
		Ops = DAG.getNode(N->getOpcode() == ISD::HSUM ? ISD::ADD : ISD::FADD, dl,
		EltVT, LHSElem, Ops, &Flags);
		}
		return Ops;
		}
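The widening concern raised on WidenVecOp_HSUM above can be illustrated with a small model. This is a sketch of one reading of the suggestion, not the patch's implementation: `hsumWidened` and its `realLanes` parameter are hypothetical, and the padding lane is filled with an arbitrary value to stand in for undef.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: a <3 x i32> value widened to <4 x i32> has one padding
// lane whose content is arbitrary. Summing every lane would fold that padding
// into the result, so the reduction carries the number of real lanes with it.
std::int32_t hsumWidened(const std::vector<std::int32_t> &widened,
                         std::size_t realLanes) {
  assert(realLanes <= widened.size() && "more real lanes than lanes");
  std::int32_t sum = 0;
  for (std::size_t i = 0; i < realLanes; ++i)
    sum += widened[i];  // padding lanes at index >= realLanes are skipped
  return sum;
}
```

With lanes {1, 2, 3, padding}, passing realLanes = 3 gives the sum of the original three elements, while summing all four lanes depends on whatever the padding happens to hold.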


SDValue DAGTypeLegalizer::WidenVecOp_EXTEND(SDNode *N) {		SDValue DAGTypeLegalizer::WidenVecOp_EXTEND(SDNode *N) {
SDLoc DL(N);		SDLoc DL(N);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);

SDValue InOp = N->getOperand(0);		SDValue InOp = N->getOperand(0);
// If some legalization strategy other than widening is used on the operand,		// If some legalization strategy other than widening is used on the operand,
// we can't safely assume that just extending the low lanes is the correct		// we can't safely assume that just extending the low lanes is the correct
// transformation.		// transformation.
▲ Show 20 Lines • Show All 712 Lines • Show Last 20 Lines

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp


Show First 20 Lines • Show All 4,806 Lines • ▼ Show 20 Lines	setValue(&I, DAG.getNode(ISD::UABSDIFF, sdl,
getValue(I.getArgOperand(1))));		getValue(I.getArgOperand(1))));
return nullptr;		return nullptr;
case Intrinsic::sabsdiff:		case Intrinsic::sabsdiff:
setValue(&I, DAG.getNode(ISD::SABSDIFF, sdl,		setValue(&I, DAG.getNode(ISD::SABSDIFF, sdl,
getValue(I.getArgOperand(0)).getValueType(),		getValue(I.getArgOperand(0)).getValueType(),
getValue(I.getArgOperand(0)),		getValue(I.getArgOperand(0)),
getValue(I.getArgOperand(1))));		getValue(I.getArgOperand(1))));
return nullptr;		return nullptr;
		case Intrinsic::hsum: {
		/// At IR level, this intrinsic is defined to return a scalar value.
		/// However, at SDAG level it is transformed into two nodes:
		/// HSUM + EXTRACT_VECTOR_ELT. The idea is to keep the result of HSUM
		/// in a vector so that, if required, additional vector operations can
		/// be performed on this vector result.
		SDValue Arg = getValue(I.getArgOperand(0));
		EVT Ty = Arg.getValueType();
		EVT EltVT = Ty.getVectorElementType();
		SDValue Hsum = DAG.getNode(
		EltVT.isFloatingPoint() ? ISD::FHSUM : ISD::HSUM,
		sdl, Ty,
		Arg);
		Hsum = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, sdl, EltVT, Hsum,
		DAG.getConstant(0, sdl, TLI.getVectorIdxTy(DAG.getDataLayout())));
		setValue(&I, Hsum);
		return nullptr;
		}
case Intrinsic::cttz: {		case Intrinsic::cttz: {
SDValue Arg = getValue(I.getArgOperand(0));		SDValue Arg = getValue(I.getArgOperand(0));
ConstantInt *CI = cast<ConstantInt>(I.getArgOperand(1));		ConstantInt *CI = cast<ConstantInt>(I.getArgOperand(1));
EVT Ty = Arg.getValueType();		EVT Ty = Arg.getValueType();
setValue(&I, DAG.getNode(CI->isZero() ? ISD::CTTZ : ISD::CTTZ_ZERO_UNDEF,		setValue(&I, DAG.getNode(CI->isZero() ? ISD::CTTZ : ISD::CTTZ_ZERO_UNDEF,
sdl, Ty, Arg));		sdl, Ty, Arg));
return nullptr;		return nullptr;
}		}
▲ Show 20 Lines • Show All 3,664 Lines • Show Last 20 Lines
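The two-node lowering of Intrinsic::hsum above (a vector-typed HSUM followed by EXTRACT_VECTOR_ELT of lane 0) can be modelled as two small functions. The names `hsumVectorResult`, `extractElement`, and `hsumIntrinsic` are hypothetical and exist only for this sketch; lanes are modelled as `int`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>

// Hypothetical model of the vector-typed HSUM node: the result has the same
// type as the input, with the sum in lane 0 and the other lanes don't-care
// (zero here only because the array is value-initialised).
template <std::size_t N>
std::array<int, N> hsumVectorResult(const std::array<int, N> &v) {
  static_assert(N > 0, "need at least one lane");
  std::array<int, N> out{};
  out[0] = std::accumulate(v.begin(), v.end(), 0);
  return out;
}

// Hypothetical model of EXTRACT_VECTOR_ELT.
template <std::size_t N>
int extractElement(const std::array<int, N> &v, std::size_t idx) {
  assert(idx < N && "lane index out of range");
  return v[idx];
}

// The scalar-returning intrinsic is the composition of the two nodes.
template <std::size_t N>
int hsumIntrinsic(const std::array<int, N> &v) {
  return extractElement(hsumVectorResult(v), 0);
}
```

Keeping the intermediate value in vector form is what lets later lane-wise operations, such as the add in the split legalization, consume it without first moving to a scalar register.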

lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp

Show First 20 Lines • Show All 163 Lines • ▼ Show 20 Lines	#endif
case ISD::FRINT: return "frint";		case ISD::FRINT: return "frint";
case ISD::FNEARBYINT: return "fnearbyint";		case ISD::FNEARBYINT: return "fnearbyint";
case ISD::FROUND: return "fround";		case ISD::FROUND: return "fround";
case ISD::FEXP: return "fexp";		case ISD::FEXP: return "fexp";
case ISD::FEXP2: return "fexp2";		case ISD::FEXP2: return "fexp2";
case ISD::FLOG: return "flog";		case ISD::FLOG: return "flog";
case ISD::FLOG2: return "flog2";		case ISD::FLOG2: return "flog2";
case ISD::FLOG10: return "flog10";		case ISD::FLOG10: return "flog10";
		case ISD::HSUM: return "hsum";
		case ISD::FHSUM: return "fhsum";

// Binary operators		// Binary operators
case ISD::ADD: return "add";		case ISD::ADD: return "add";
case ISD::SUB: return "sub";		case ISD::SUB: return "sub";
case ISD::MUL: return "mul";		case ISD::MUL: return "mul";
case ISD::MULHU: return "mulhu";		case ISD::MULHU: return "mulhu";
case ISD::MULHS: return "mulhs";		case ISD::MULHS: return "mulhs";
case ISD::SDIV: return "sdiv";		case ISD::SDIV: return "sdiv";
▲ Show 20 Lines • Show All 530 Lines • Show Last 20 Lines

lib/CodeGen/TargetLoweringBase.cpp

Show First 20 Lines • Show All 823 Lines • ▼ Show 20 Lines	for (MVT VT : MVT::all_valuetypes()) {
setOperationAction(ISD::SADDO, VT, Expand);		setOperationAction(ISD::SADDO, VT, Expand);
setOperationAction(ISD::SSUBO, VT, Expand);		setOperationAction(ISD::SSUBO, VT, Expand);
setOperationAction(ISD::UADDO, VT, Expand);		setOperationAction(ISD::UADDO, VT, Expand);
setOperationAction(ISD::USUBO, VT, Expand);		setOperationAction(ISD::USUBO, VT, Expand);
setOperationAction(ISD::SMULO, VT, Expand);		setOperationAction(ISD::SMULO, VT, Expand);
setOperationAction(ISD::UMULO, VT, Expand);		setOperationAction(ISD::UMULO, VT, Expand);
setOperationAction(ISD::UABSDIFF, VT, Expand);		setOperationAction(ISD::UABSDIFF, VT, Expand);
setOperationAction(ISD::SABSDIFF, VT, Expand);		setOperationAction(ISD::SABSDIFF, VT, Expand);
		setOperationAction(ISD::HSUM, VT, Expand);
		setOperationAction(ISD::FHSUM, VT, Expand);

// These library functions default to expand.		// These library functions default to expand.
setOperationAction(ISD::FROUND, VT, Expand);		setOperationAction(ISD::FROUND, VT, Expand);

// These operations default to expand for vector types.		// These operations default to expand for vector types.
if (VT.isVector()) {		if (VT.isVector()) {
setOperationAction(ISD::FCOPYSIGN, VT, Expand);		setOperationAction(ISD::FCOPYSIGN, VT, Expand);
setOperationAction(ISD::ANY_EXTEND_VECTOR_INREG, VT, Expand);		setOperationAction(ISD::ANY_EXTEND_VECTOR_INREG, VT, Expand);
▲ Show 20 Lines • Show All 865 Lines • Show Last 20 Lines

test/CodeGen/X86/vec-hadd-float-128.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck --check-prefix=UNSAFE %s

				bruno: Thanks for updating the tests. Could you please use ./utils/update_llc_test_checks.py to rewrite the test checks? This will make it easier later on to see the improvements from target customization or widening (as you mentioned in a TODO).

				bruno: Could you please update your tests to be more target neutral? I mean, use -mtriple=x86_64-unknown-unknown instead. One question: what code does it emit if one removes -enable-unsafe-fp-math? If it currently makes no difference, you can remove it; otherwise you should be testing both versions.
				ashahid: OK, I will make it more target neutral accordingly. In fact, the user is not supposed to use the float version of this intrinsic at all; if that is done, the compiler will assert.
				declare float @llvm.hsum.f32.v2f32(<2 x float>)

				define float @test1_hsum_float_f32(<2 x float> %a1) {
				; UNSAFE-LABEL: test1_hsum_float_f32:
				; UNSAFE: # BB#0:
				; UNSAFE-NEXT: movapd %xmm0, %xmm1
				; UNSAFE-NEXT: shufpd {{.*#+}} xmm1 = xmm1[1,0]
				davidxl: Should it be shufps .... xmm1 = xmm1[1, ?, ?, ?]
				; UNSAFE-NEXT: addps %xmm0, %xmm1
				; UNSAFE-NEXT: movaps %xmm1, %xmm0
				; UNSAFE-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,3,2,3]
				davidxl: This shufps and addps should not be expected.
				; UNSAFE-NEXT: addps %xmm1, %xmm0
				; UNSAFE-NEXT: retq
				%1 = call float @llvm.hsum.f32.v2f32(<2 x float> %a1)
				ret float %1
				}

				declare float @llvm.hsum.f32.v4f32(<4 x float>)

				define float @test2_hsum_float_f32(<4 x float> %a1) {
				; UNSAFE-LABEL: test2_hsum_float_f32:
				; UNSAFE: # BB#0:
				; UNSAFE-NEXT: movapd %xmm0, %xmm1
				; UNSAFE-NEXT: shufpd {{.*#+}} xmm1 = xmm1[1,0]
				; UNSAFE-NEXT: addps %xmm0, %xmm1
				; UNSAFE-NEXT: movaps %xmm1, %xmm0
				; UNSAFE-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,3,2,3]
				; UNSAFE-NEXT: addps %xmm1, %xmm0
				; UNSAFE-NEXT: retq
				%1 = call float @llvm.hsum.f32.v4f32(<4 x float> %a1)
				ret float %1
				}

				declare double @llvm.hsum.f64.v2f64(<2 x double>)

				define double @test1_hsum_float_f64(<2 x double> %a1) {
				; UNSAFE-LABEL: test1_hsum_float_f64:
				; UNSAFE: # BB#0:
				; UNSAFE-NEXT: movapd %xmm0, %xmm1
				; UNSAFE-NEXT: shufpd {{.*#+}} xmm1 = xmm1[1,0]
				; UNSAFE-NEXT: addpd %xmm0, %xmm1
				; UNSAFE-NEXT: movapd %xmm1, %xmm0
				; UNSAFE-NEXT: retq
				%1 = call double @llvm.hsum.f64.v2f64(<2 x double> %a1)
				ret double %1
				}
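The floating-point tests above are sensitive to the order in which the lanes are added, which is why the thread keeps returning to ordering and unsafe-fp-math. The sketch below is an illustration under stated assumptions, not part of the patch: `sumInOrder` and `sumByHalving` are hypothetical names, and `float` arithmetic is assumed to be evaluated in IEEE-754 single precision without fast-math reassociation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential order: (((0 + v0) + v1) + v2) + ..., one element at a time.
float sumInOrder(const std::vector<float> &v) {
  float acc = 0.0f;
  for (float x : v)
    acc += x;
  return acc;
}

// Pairwise order produced by a halving reduction:
// for four lanes this computes (v0 + v2) + (v1 + v3).
float sumByHalving(std::vector<float> v) {
  const std::size_t n = v.size();
  assert(n != 0 && (n & (n - 1)) == 0 && "lane count must be a power of two");
  for (std::size_t i = n; i != 1; i >>= 1)
    for (std::size_t j = 0; j != i / 2; ++j)
      v[j] += v[i / 2 + j];
  return v[0];
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f}, the sequential order loses the first 1.0f when it is added to 1e8f and ends at 1.0f, whereas the halving order cancels the large values first and ends at 2.0f. The two orders are therefore not interchangeable for floating-point lanes unless reassociation is explicitly permitted.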

test/CodeGen/X86/vec-hadd-float-256.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck --check-prefix=UNSAFE %s

				declare double @llvm.hsum.f64.v4f64(<4 x double>)

				define double @test_hsum_float_f64(<4 x double> %a1) {
				RKSimon: The test file is vec-hadd-float-256 yet you are testing a 512 bit vector? Change to <4 x double>
				; UNSAFE-LABEL: test_hsum_float_f64:
				; UNSAFE: # BB#0:
				; UNSAFE-NEXT: movapd %xmm1, %xmm2
				; UNSAFE-NEXT: shufpd {{.*#+}} xmm2 = xmm2[1,0]
				; UNSAFE-NEXT: addpd %xmm1, %xmm2
				; UNSAFE-NEXT: movapd %xmm0, %xmm1
				; UNSAFE-NEXT: shufpd {{.*#+}} xmm1 = xmm1[1,0]
				; UNSAFE-DAG: addpd %xmm2, %xmm1
				; UNSAFE-DAG: addpd %xmm0, %xmm1
				; UNSAFE-NEXT: movapd %xmm1, %xmm0
				; UNSAFE-NEXT: retq
				%1 = call double @llvm.hsum.f64.v4f64(<4 x double> %a1)
				ret double %1
				}

test/CodeGen/X86/vec-hadd-int-128.ll

This file was added.

				; RUN: llc -mtriple=x86_64-unknown-unknown < %s | FileCheck %s -check-prefix=CHECK

				declare i8 @llvm.hsum.i8.v4i8(<4 x i8>)

				define i8 @test_hsum_int_i8(<4 x i8> %a1) {
				; CHECK-LABEL: test_hsum_int_i8:
				; CHECK: # BB#0:
				; CHECK-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				davidxl: The result does not look right -- should pshufb be generated instead?
				congh: I think a shift operation is required here, given we only have SSE2 support for x86_64.
				; CHECK-NEXT: paddd %xmm0, %xmm1
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,3,2,3]
				; CHECK-NEXT: paddd %xmm1, %xmm0
				; CHECK-NEXT: movdqa %xmm0, -{{[0-9]+}}(%rsp)
				; CHECK-NEXT: movb -{{[0-9]+}}(%rsp), %al
				; CHECK-NEXT: retq
				%1 = call i8 @llvm.hsum.i8.v4i8(<4 x i8> %a1)
				ret i8 %1
				}

				declare i16 @llvm.hsum.i16.v4i16(<4 x i16>)

				define i16 @test_hsum_int_i16(<4 x i16> %a1) {
				; CHECK-LABEL: test_hsum_int_i16:
				; CHECK: # BB#0:
				; CHECK-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				davidxl: Should pshufw be generated? Or would phaddw be more efficient?
				congh: In SSE2, pshuflw should be generated here. phaddw is introduced in SSSE3.
				; CHECK-NEXT: paddd %xmm0, %xmm1
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,3,2,3]
				; CHECK-NEXT: paddd %xmm1, %xmm0
				; CHECK-NEXT: movd %xmm0, %eax
				; CHECK-NEXT: retq
				%1 = call i16 @llvm.hsum.i16.v4i16(<4 x i16> %a1)
				ret i16 %1
				}

				declare i32 @llvm.hsum.i32.v3i32(<3 x i32>)

				define i32 @test1_hsum_int_i32(<3 x i32> %a1) {
				; CHECK-LABEL: test1_hsum_int_i32:
				; CHECK: # BB#0:
				; CHECK-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; CHECK-NEXT: paddd %xmm0, %xmm1
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,3,2,3]
				; CHECK-NEXT: paddd %xmm1, %xmm0
				; CHECK-NEXT: movd %xmm0, %eax
				; CHECK-NEXT: retq
				%1 = call i32 @llvm.hsum.i32.v3i32(<3 x i32> %a1)
				ret i32 %1
				}

				declare i32 @llvm.hsum.i32.v4i32(<4 x i32>)

				define i32 @test2_hsum_int_i32(<4 x i32> %a1) {
				; CHECK-LABEL: test2_hsum_int_i32:
				; CHECK: # BB#0:
				; CHECK-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; CHECK-NEXT: paddd %xmm0, %xmm1
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,3,2,3]
				; CHECK-NEXT: paddd %xmm1, %xmm0
				; CHECK-NEXT: movd %xmm0, %eax
				; CHECK-NEXT: retq
				%1 = call i32 @llvm.hsum.i32.v4i32(<4 x i32> %a1)
				ret i32 %1
				}

				declare i64 @llvm.hsum.i64.v2i64(<2 x i64>)

				define i64 @test1_hsum_int_i64(<2 x i64> %a1) {
				; CHECK-LABEL: test1_hsum_int_i64:
				; CHECK: # BB#0:
				; CHECK-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; CHECK-NEXT: paddq %xmm0, %xmm1
				; CHECK-NEXT: movd %xmm1, %rax
				; CHECK-NEXT: retq
				%1 = call i64 @llvm.hsum.i64.v2i64(<2 x i64> %a1)
				ret i64 %1
				}

test/CodeGen/X86/vec-hadd-int-256.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s --check-prefix=CHECK

				declare i64 @llvm.hsum.i64.v4i64(<4 x i64>)

				define i64 @test_hsum_int_i64(<4 x i64> %a1) {
				; CHECK-LABEL: test_hsum_int_i64:
				; CHECK: # BB#0:
				; CHECK-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,3,0,1]
				; CHECK-NEXT: paddq %xmm1, %xmm2
				; CHECK-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; CHECK-NEXT: paddq %xmm0, %xmm1
				; CHECK-NEXT: paddq %xmm2, %xmm1
				; CHECK-NEXT: movd %xmm1, %rax
				; CHECK-NEXT: retq
				%1 = call i64 @llvm.hsum.i64.v4i64(<4 x i64> %a1)
				RKSimon: This codegen is the same as for the test1_hsum_int_i64 <2x i64> version in vec-hadd-int-128.ll - something is going wrong. You should probably compare against codegen from an AVX2 target.
				ashahid: With AVX2 the generated code differs as below.
				Case V2i64:
				  vpshufd $78, %xmm0, %xmm1 # xmm1 = xmm0[2,3,0,1]
				  vpaddq %xmm1, %xmm0, %xmm0
				  vmovq %xmm0, %rax
				  retq
				Case V4i64:
				  vextracti128 $1, %ymm0, %xmm1
				  vpaddq %ymm1, %ymm0, %ymm0
				  vpermq $237, %ymm0, %ymm1 # ymm1 = ymm0[1,3,2,3]
				  vpaddq %ymm1, %ymm0, %ymm0
				  vmovq %xmm0, %rax
				  vzeroupper
				  retq
				RKSimon: So yes, it appears that something is wrong with the legalization. When you build for SSE you only get the hsum of the bottom <2 x i64>, when you build for AVX (which legalizes <4 x i64>) you get the hsum of the whole <2 x i64>.
				ret i64 %1
				}


Revision Contents

Diff 35627

docs/LangRef.rst

include/llvm/CodeGen/ISDOpcodes.h

include/llvm/IR/Intrinsics.td

include/llvm/Target/TargetSelectionDAG.td

lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp

lib/CodeGen/SelectionDAG/LegalizeTypes.h

lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp

lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp

lib/CodeGen/TargetLoweringBase.cpp

test/CodeGen/X86/vec-hadd-float-128.ll

test/CodeGen/X86/vec-hadd-float-256.ll

test/CodeGen/X86/vec-hadd-int-128.ll

test/CodeGen/X86/vec-hadd-int-256.ll
