This is an archive of the discontinued LLVM Phabricator instance.

[X86] Changes to extract Horizontal addition operation for AVX-512.
Needs RevisionPublic

Authored by jbhateja on Aug 8 2017, 3:12 AM.

Details

Summary

vphadd is not a supported instruction for AVX-512, but if result of add (512 bits)
is partially consumed such that less than half of the bits are used by the user of an add
instruction, in that case we can perform horizontal addition and concatenate the result
with undef.

This will fix PR33758

Event Timeline

jbhateja created this revision.Aug 8 2017, 3:12 AM

Ping @ reviewers.

craig.topper added inline comments.Aug 10 2017, 5:23 PM
lib/Target/X86/X86ISelLowering.cpp
35597

Don't we need to make sure this only happens on v32i16 and v16i32 types? combineAdd can get called on all sorts of types.

35610

Don't use a SmallVector for a hardcoded two elements. Just use a plain old array.

craig.topper edited edge metadata.Aug 10 2017, 5:29 PM

Are there any cases where this should happen where Op0 is different that Op1? All of these test changes are unary shuffles.

And really shouldn't we have some more generic way of shrinking adds like this? How many of the other adds in these tests could have been done in a smaller type to use a VEX encoding?

Thinking about this some more. Do we really want to use a horizontal add instruction for a register with itself? Horizontal add is suboptimally implemented in microcode. It's 3 uops while the pshufd and the add are only 2 uops. The 3 uops also mean its limited to the complex decoder on Intel hardware.

jbhateja updated this revision to Diff 112822.Aug 27 2017, 6:22 AM
jbhateja added a comment.EditedAug 27 2017, 6:27 AM

Thinking about this some more. Do we really want to use a horizontal add instruction for a register with itself? Horizontal add is suboptimally implemented in microcode. It's 3 uops while the pshufd and the add are only 2 uops. The 3 uops also mean its limited to the complex decoder on Intel hardware.

Yes, I agree. latency and number of micro codes are better without hadd, only lesser code size is advantage with hadd with same operands.
Do you suggest adding a check in isHorizontalBinOp to not recognize a valid pttern for horizontal add/sub if operands of are same.

jbhateja updated this revision to Diff 112823.Aug 27 2017, 6:30 AM
  • Removing a file got added in last patch.
jbhateja updated this revision to Diff 112829.Aug 27 2017, 8:22 AM
  • Stashed change leftout in last checkin + formatting changes.
  • Updating test reference

ping @reviewers

Ping reviewers

I think we should try to combine based on the add only being used by the extract_vector_elt. Turn the add into a 128-bit add being fed by extract_subvectors. Similarly if we see an add only being used by an extract_subvector we can shrink that add too and push the extracts up. This type of transform feels more generally useful because it will allow us to narrow many more adds in this code. This will enable EVEX->VEX to use a smaller encoding. We can apply this to many other opcodes as well.

If we do this early enough we should be able to shrink the add before the horizontal add detection.

jbhateja added a comment.EditedSep 6 2017, 12:43 AM

I think we should try to combine based on the add only being used by the extract_vector_elt. Turn the add into a 128-bit add being fed by extract_subvectors. Similarly if we see an add only being used by an extract_subvector we can shrink that add too and push the extracts up. This type of transform feels more generally useful because it will allow us to narrow many more adds in this code. This will enable EVEX->VEX to use a smaller encoding. We can apply this to many other opcodes as well.

If we do this early enough we should be able to shrink the add before the horizontal add detection.

Two cases for DAG node reduction:-

 a / Look at operands and try squeezing them up (EXTRACT_SUBVECTOR) for narrower operation  which is then concatinated with a pad to make the final result size same as original. Here we only look at the operands and not the uses of the operation. Which means it could break valid patterns being checked at the use nodes due to insertion of CONCAT_VECTORS now. 

b / Look at both the uses of the DAG node along with the operands of the node while narrowing down the operation. Idea here is to avoid insertion of extra concat operation for padding which shall keep the pattern matches at use node happy.

My initial patch was based on strategy (b), What I get from you above comment is to change it for any generic OPCODE instead of HADD. Correct?

Currently patch is based on strategy (a) with a wrapper over generic subroutine for scaling down operation only for HADD/HSUB perticular vector types which is also safe.

jbhateja updated this revision to Diff 115795.Sep 19 2017, 12:10 AM

Why this be solved by just combining extract_subvectors through things like binops and onto their inputs.

Like this:
combine (extract_subvector (add X, Y)) -> (narrower_add (extract_subvector X, extract_subvector Y))

lib/Target/X86/X86ISelLowering.cpp
30111

I think element indices should use DAG.getIntPtrConstant to make the type correct. It's not supposed to be i32.

33648

Can we teach combineExtractSubvector to narrow an extract of an add by extracting from the inputs? Then we shouldn't need any change here because we'll already have the narrow add?

33724

Why is this variable now upper case?

33792

Why was this moved out of the if.

test/CodeGen/X86/madd.ll
303

Why is this horizontal add not narrowed?

RKSimon edited edge metadata.Sep 19 2017, 2:43 AM

Would we be better off focussing on developing the @llvm.experimental.vector.reduce.add.* intrinsics? Or looking to get slpvectorizer to do extract_subvector style shuffles as part of the reduction?

jbhateja added inline comments.Sep 19 2017, 4:15 AM
lib/Target/X86/X86ISelLowering.cpp
30111

Yes this will be fixed.

33648

I think each Node specific combiner should look at pattern starting from itself.

i.e. combineExtractSubvector could be taught to remove extract_subvector from a vector shuffle (if has only one use) thus producing a smaller vector shuffle that will be generic change.

Looking for opportunity like you mentioned "extract of an add by extracting from input" appears as a specialization.

As of now narrowing down has be done genrically at following two places
1/ Extract_vector_elt : It looks at its input vector and try scaling down if possible.

2/ Narrow down binary operation (add/sub/fadd/fsub) [this is implicitly doing whay your comment says).

33724

For better visual clarity of variable name.

33792

This will be fixed.

I don't think we should shrinking operations based on undef inputs. I think we should be shrinking them based on what elements are consumed by their users. There's no reason the shuffles in these reductions have to have undef elements. For integers we may rewrite the shuffle mask to undef in InstCombine if the elements aren't used down stream. But we don't do that for FP.

As an example, why shouldn't we be able use a horizontal add for this

define <4 x double> @fadd_noundef(<8 x double> %x225, <8 x double> %x227) {

  %x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
  %x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5 ,i32 13, i32 7, i32 15>
  %x229 = fadd <8 x double> %x226, %x228
  %x230 = shufflevector <8 x double> %x229, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x double> %x230
}
lib/Target/X86/X86ISelLowering.cpp
1694

Use of MMX here is weird. We explicitly don't generate any optimized code for MMX. So making references to it in terms of SSE/AVX is misleading.

jbhateja updated this revision to Diff 116634.Sep 25 2017, 7:58 PM

I don't think we should shrinking operations based on undef inputs. I think we should be shrinking them based on what elements are consumed by their users. There's no reason the shuffles in these reductions have to have undef elements. For integers we may rewrite the shuffle mask to undef in InstCombine if the elements aren't used down stream. But we don't do that for FP.

As an example, why shouldn't we be able use a horizontal add for this

define <4 x double> @fadd_noundef(<8 x double> %x225, <8 x double> %x227) {

  %x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
  %x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5 ,i32 13, i32 7, i32 15>
  %x229 = fadd <8 x double> %x226, %x228
  %x230 = shufflevector <8 x double> %x229, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x double> %x230
}

We do generate horizontal addition in this case.

lib/Target/X86/X86ISelLowering.cpp
1694

Reference to MMX here is signifying one of the denominations of a smaller vector register.

The test case I gave does not generate horizontal add with your patch with avx512f enabled. It does with avx2 but that's only because type legalization did the dirty work of splitting the result.

jbhateja updated this revision to Diff 116998.Sep 28 2017, 8:14 AM
  • Changes to cover more patterns for [f]hadd/[f]sub for AVX512 vector types.
  • Generic routines added which looks at uses / undef operands of an operation for scaling it down.
jbhateja added inline comments.Sep 28 2017, 8:22 AM
lib/Target/X86/X86ISelLowering.cpp
33638

Just realized that DAG argument is not used here, it shall be removed with other comments over patch.

RKSimon requested changes to this revision.Sep 29 2017, 3:49 AM

I'm not sure about this approach; I don't think most of this needs to be done in the X86 backend at all, and much of it shouldn't even be done in the DAG. Most of the code appears to be better handled in a mixture of SimplifyDemandedVectorElts and SLPVectorizer.

PR33758 was about improving codegen for horizontal reductions, so we'd probably be better off having the backend optimize for @llvm.experimental.vector.reduce.add.* (or the legalized patterns it produces), and then getting the vectorizers to create these properly.

This revision now requires changes to proceed.Sep 29 2017, 3:49 AM
jbhateja added a comment.EditedSep 29 2017, 7:47 AM

I'm not sure about this approach; I don't think most of this needs to be done in the X86 backend at all, and much of it shouldn't even be done in the DAG. Most of the code appears to be better handled in a mixture of SimplifyDemandedVectorElts and SLPVectorizer.

PR33758 was about improving codegen for horizontal reductions, so we'd probably be better off having the backend optimize for @llvm.experimental.vector.reduce.add.* (or the legalized patterns it produces), and then getting the vectorizers to create these properly.

Hi Simon,

Thanks for pointing me to code references, I tired a simple case, which was not optimized by InstCombiner::SimpliefyDemandedVectorElts.
It works over knownbits mechanism. In fact none of the test cases provided in avx512-hadd-hsub.ll were optimized by InstCombiner.

define float @fhsub_16(<16 x float> %x225) {
;define <16 x float> @fhsub_16(<16 x float> %x225) {

%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x227 = fadd <16 x float> %x225, %x226
%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x229 = fsub <16 x float> %x227, %x228
%x230 = extractelement <16 x float> %x229, i32 0
ret float %x230

}

This patch provided two generic routines which try to scale down operation in denominations of X86 vector register sizes, patch D36650 is also suggesting for similar effort.

PR 33758 is specially about infrence of horizontal operations for AVX512 vector types.

Kindly elaborate how add reduction is helful here for cases in provided in testcase avx512-hadd-hsub.ll.

Thanks

I'm not sure about this approach; I don't think most of this needs to be done in the X86 backend at all, and much of it shouldn't even be done in the DAG. Most of the code appears to be better handled in a mixture of SimplifyDemandedVectorElts and SLPVectorizer.

PR33758 was about improving codegen for horizontal reductions, so we'd probably be better off having the backend optimize for @llvm.experimental.vector.reduce.add.* (or the legalized patterns it produces), and then getting the vectorizers to create these properly.

Hi Simon,

Thanks for pointing me to code references, I tired a simple case, which was not optimized by InstCombiner::SimpliefyDemandedVectorElts.
It works over knownbits mechanism. In fact none of the test cases provided in avx512-hadd-hsub.ll were optimized by InstCombiner.

In my opinion, @llvm.experimental.vector.reduce.fadd and friends should be treated as the canonical forms of those operations. InstCombine should form these intrinsics upon encountering these shuffle patterns.

The SLPVectorizer, LoopVectorizer, etc. should use the intrinsics when handling relevant reductions.

Is there an advantage to handling the shuffle patterns in the backend directly as opposed to forming the intrinsics earlier and then handling them in the backend?

define float @fhsub_16(<16 x float> %x225) {
;define <16 x float> @fhsub_16(<16 x float> %x225) {

%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x227 = fadd <16 x float> %x225, %x226
%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x229 = fsub <16 x float> %x227, %x228
%x230 = extractelement <16 x float> %x229, i32 0
ret float %x230

}

This patch provided two generic routines which try to scale down operation in denominations of X86 vector register sizes, patch D36650 is also suggesting for similar effort.

PR 33758 is specially about infrence of horizontal operations for AVX512 vector types.

Kindly elaborate how add reduction is helful here for cases in provided in testcase avx512-hadd-hsub.ll.

Thanks

RKSimon resigned from this revision.Oct 11 2018, 8:56 AM

@jbhateja Abandon this? @spatel's recent work seems to have covered everything already