This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] Reduce Shuffle_Vector Nodes Count
Needs ReviewPublic

Authored by mmarjieh on Feb 10 2021, 3:25 AM.

Details

Summary

[DAGCombiner] Reduce Shuffle_Vector Nodes Count

This patch adds a DAG combine on vector shuffles that tries to reduce
the number of shuffle vector nodes.
For example, if we have the following pattern:

Instead of generating the following nodes:
t0: v8i32 = vector_shuffle<0,5,u,u,u,u,u,u> t3, undef
t1: v8i32 = vector_shuffle<0,1,10,11,u,u,u,u> t0, t4

Combine to this node:
t2: v8i32 = vector_shuffle<0,5,10,11,u,u,u,u> t3, t4

Diff Detail

Event Timeline

mmarjieh created this revision.Feb 10 2021, 3:25 AM
mmarjieh requested review of this revision.Feb 10 2021, 3:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 10 2021, 3:25 AM
mmarjieh updated this revision to Diff 322641.Feb 10 2021, 3:28 AM
mmarjieh edited the summary of this revision. (Show Details)

Fix typo in commit message.

RKSimon added inline comments.Feb 10 2021, 6:44 AM
llvm/test/CodeGen/X86/vector-shuffle-combining-avx512bwvl.ll
115

At quick glance - this looks wrong, I'd expect this still to be the same vshufpd?

mmarjieh added inline comments.Feb 11 2021, 1:32 AM
llvm/test/CodeGen/X86/vector-shuffle-combining-avx512bwvl.ll
115

I am not familiar with X86's ISA.
Can you explain why?
Meanwhile, I will show you the difference in the DAG after my patch:

Before this patch:
SelectionDAG has 28 nodes:

t0: ch = EntryToken
t5: v4i64,ch = load<(load 32 from `<4 x i64>* null`, align 8)> t0, Constant:i32<0>, undef:i32
t6: v4i64,ch = load<(load 32 from `<4 x i64>* undef`, align 8)> t0, undef:i32, undef:i32
    t21: ch = TokenFactor t5:1, t6:1
            t36: v8i16 = BUILD_VECTOR Constant:i16<0>, Constant:i16<0>, Constant:i16<0>, Constant:i16<0>, undef:i16, undef:i16, undef:i16, undef:i16
          t52: v2f64 = bitcast t36
        t62: v4f64 = concat_vectors t52, undef:v2f64
      t63: v4f64 = vector_shuffle<u,u,0,0> t62, undef:v4f64
            t37: v8i16 = X86ISD::VTRUNC t5
          t39: v8i16 = sign_extend_inreg t37, ValueType:ch:v8i8
        t44: v2f64 = bitcast t39
            t40: v8i16 = X86ISD::VTRUNC t6
          t41: v8i16 = sign_extend_inreg t40, ValueType:ch:v8i8
        t49: v2f64 = bitcast t41
      t59: v4f64 = concat_vectors t44, t49
    t68: v4f64 = vector_shuffle<4,6,2,3> t63, t59
    t3: i32,ch = load<(load 4 from %fixed-stack.0)> t0, FrameIndex:i32<-1>, undef:i32
  t57: ch = store<(store 32 into %ir.10, align 2)> t21, t68, t3, undef:i32
t29: ch = X86ISD::RET_FLAG t57, TargetConstant:i32<0>

After this patch:
SelectionDAG has 26 nodes:

t0: ch = EntryToken
t5: v4i64,ch = load<(load 32 from `<4 x i64>* null`, align 8)> t0, Constant:i32<0>, undef:i32
t6: v4i64,ch = load<(load 32 from `<4 x i64>* undef`, align 8)> t0, undef:i32, undef:i32
    t21: ch = TokenFactor t5:1, t6:1
            t37: v8i16 = X86ISD::VTRUNC t5
          t39: v8i16 = sign_extend_inreg t37, ValueType:ch:v8i8
        t44: v2f64 = bitcast t39
            t40: v8i16 = X86ISD::VTRUNC t6
          t41: v8i16 = sign_extend_inreg t40, ValueType:ch:v8i8
        t49: v2f64 = bitcast t41
      t59: v4f64 = concat_vectors t44, t49
          t36: v8i16 = BUILD_VECTOR Constant:i16<0>, Constant:i16<0>, Constant:i16<0>, Constant:i16<0>, undef:i16, undef:i16, undef:i16, undef:i16
        t52: v2f64 = bitcast t36
      t62: v4f64 = concat_vectors t52, undef:v2f64
    t64: v4f64 = vector_shuffle<0,2,4,4> t59, t62
    t3: i32,ch = load<(load 4 from %fixed-stack.0)> t0, FrameIndex:i32<-1>, undef:i32
  t57: ch = store<(store 32 into %ir.10, align 2)> t21, t64, t3, undef:i32
t29: ch = X86ISD::RET_FLAG t57, TargetConstant:i32<0>
RKSimon added inline comments.Feb 11 2021, 3:22 AM
llvm/test/CodeGen/X86/vector-shuffle-combining-avx512bwvl.ll
115

My mistake - I missed that we were implicitly zeroing the upper elements (xmm -> ymm) - sorry about that

Hey guys, I would appreciate it if you can review.
I think this change is beneficial for all targets, since we are reducing the number of shuffle_vector dag nodes and hence reducing code size.
Since I am not familiar with all targets, can you go over the target assembly and verify that it is beneficial for you?
I counted the number of instructions in each lit test and saw an improvement in the number of instructions.

This doesn't seem like the right direction,
i'd expect that to be a new fold to reduce shuffle count,
because if we only teach some existing fold to do this,
we'll miss such shuffle patterns that appear via other means.

mmarjieh updated this revision to Diff 327711.Mar 3 2021, 1:38 AM

Instead of reducing the number of vector_shuffle DAG nodes
in reduceBuildVecToShuffle, do this in a separate combine.

lebedev.ri added a comment.EditedMar 3 2021, 1:42 AM

I wonder if we should be checking if the resulting shuffle is legal, i.e. see buildLegalVectorShuffle() and it's uses.

This doesn't seem like the right direction,
i'd expect that to be a new fold to reduce shuffle count,
because if we only teach some existing fold to do this,
we'll miss such shuffle patterns that appear via other means.

Agreed.
I added a separate combine for this.
I am still not sure what is the best DAGCombine Level to run this combine.
Currently, I run it after LegalizeDAG.
I would like to receive advice from you on when to run it.

mmarjieh retitled this revision from [DAGCombiner] Improve reduceBuildVecToShuffle Performance to [DAGCombiner] Reduce Shuffle_Vector Nodes Count.Mar 3 2021, 3:05 AM
mmarjieh edited the summary of this revision. (Show Details)
mmarjieh edited the summary of this revision. (Show Details)Mar 3 2021, 3:05 AM

This doesn't seem like the right direction,
i'd expect that to be a new fold to reduce shuffle count,
because if we only teach some existing fold to do this,
we'll miss such shuffle patterns that appear via other means.

Agreed.
I added a separate combine for this.
I am still not sure what is the best DAGCombine Level to run this combine.
Currently, I run it after LegalizeDAG.
I would like to receive advice from you on when to run it.

I'm not sure VECTOR_SHUFFLE makes is to the DAGCombine after LegalizeDAG on most targets. Definitely not on X86.

If all types are legal and LegalizeVectorOps doesn't change anything, there are only 2 DAG combine runs that will happen, the very first one before LegalizeTypes and the very last one after LegalizeDAG. The other 2 DAG combiners are conditional on the previous legalizer step making changes. Given that and the X86 behavior I'm not sure you really have an option to delay it.

I'm not sure how useful this really is - we already have something very similar later in the function that uses the MergeInnerShuffle lambda - maybe we're just missing an edge case there? But as @craig.topper said we need to be careful about performing this in later legalization stages.

RKSimon resigned from this revision.Mar 29 2021, 3:44 AM
RKSimon added a subscriber: Restricted Project.

@mmarjieh You're probably better off moving this into PPCISelLowering.cpp and making it a PPC only shuffle combine

@RKSimon @craig.topper After investigating and trying out this combine in different DAG combine phases, I noticed that this optimization is not always beneficial for all targets.
Since this combine is beneficial in our target, I already implemented it in the target specific ISelLowering.
If we want to continue with this patch and commit it to the community, I suggest that we introduce a target hook for it.

@RKSimon @craig.topper After investigating and trying out this combine in different DAG combine phases, I noticed that this optimization is not always beneficial for all targets.
Since this combine is beneficial in our target, I already implemented it in the target specific ISelLowering.
If we want to continue with this patch and commit it to the community, I suggest that we introduce a target hook for it.

I assume your target is out of tree? Keeping this as a generic combine with a target hook could be fine, assuming the PowerPC team accept the hook being enabled and we retain test coverage.