This is an archive of the discontinued LLVM Phabricator instance.

[X86] Lower vector interleave into unpck and perm
ClosedPublic

Authored by zhuhan0 on Sep 22 2022, 1:53 PM.

Details

Summary

This Godbolt link shows different codegen between clang and gcc for a transpose operation.

clang result:

vmovdqu xmm0, xmmword ptr [rcx + rax]
vmovdqu xmm1, xmmword ptr [rcx + rax + 16]
vmovdqu xmm2, xmmword ptr [r8 + rax]
vmovdqu xmm3, xmmword ptr [r8 + rax + 16]
vpunpckhbw      xmm4, xmm2, xmm0
vpunpcklbw      xmm0, xmm2, xmm0
vpunpcklbw      xmm2, xmm3, xmm1
vpunpckhbw      xmm1, xmm3, xmm1
vmovdqu xmmword ptr [rdi + 2*rax + 48], xmm1
vmovdqu xmmword ptr [rdi + 2*rax + 32], xmm2
vmovdqu xmmword ptr [rdi + 2*rax], xmm0
vmovdqu xmmword ptr [rdi + 2*rax + 16], xmm4

gcc result:

vmovdqu ymm3, YMMWORD PTR [rdi+rax]
vpunpcklbw      ymm1, ymm3, YMMWORD PTR [rsi+rax]
vpunpckhbw      ymm0, ymm3, YMMWORD PTR [rsi+rax]
vperm2i128      ymm2, ymm1, ymm0, 32
vperm2i128      ymm1, ymm1, ymm0, 49
vmovdqu YMMWORD PTR [rcx+rax*2], ymm2
vmovdqu YMMWORD PTR [rcx+32+rax*2], ymm1

clang's code is roughly 15% slower than gcc's when evaluated on an internal compression benchmark.

The loop vectorizer generates the following shufflevector intrinsic:

%interleaved.vec = shufflevector <32 x i8> %a, <32 x i8> %b, <64 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>

which is lowered to SelectionDAG:

t2: v32i8,ch = CopyFromReg t0, Register:v32i8 %0
t6: v64i8 = concat_vectors t2, undef:v32i8
t4: v32i8,ch = CopyFromReg t0, Register:v32i8 %1
t7: v64i8 = concat_vectors t4, undef:v32i8
t8: v64i8 = vector_shuffle<0,64,1,65,2,66,3,67,4,68,5,69,6,70,7,71,8,72,9,73,10,74,11,75,12,76,13,77,14,78,15,79,16,80,17,81,18,82,19,83,20,84,21,85,22,86,23,87,24,88,25,89,26,90,27,91,28,92,29,93,30,94,31,95> t6, t7

So far this vector_shuffle is good enough for us to pattern-match and transform, but as we go down the SelectionDAG pipeline, it got split into smaller shuffles. During dagcombine1, the shuffle is split by foldShuffleOfConcatUndefs.

  // shuffle (concat X, undef), (concat Y, undef), Mask -->
  // concat (shuffle X, Y, Mask0), (shuffle X, Y, Mask1)
t2: v32i8,ch = CopyFromReg t0, Register:v32i8 %0
t4: v32i8,ch = CopyFromReg t0, Register:v32i8 %1
t19: v32i8 = vector_shuffle<0,32,1,33,2,34,3,35,4,36,5,37,6,38,7,39,8,40,9,41,10,42,11,43,12,44,13,45,14,46,15,47> t2, t4
t15: ch,glue = CopyToReg t0, Register:v32i8 $ymm0, t19
t20: v32i8 = vector_shuffle<16,48,17,49,18,50,19,51,20,52,21,53,22,54,23,55,24,56,25,57,26,58,27,59,28,60,29,61,30,62,31,63> t2, t4
t17: ch,glue = CopyToReg t15, Register:v32i8 $ymm1, t20, t15:1

With foldShuffleOfConcatUndefs commented out, the vector is still split later by the type legalizer, which comes after dagcombine1, because v64i8 is not a legal type in AVX2 (64 * 8 = 512 bits while ymm = 256 bits). There doesn't seem to be a good way to avoid this split. Lowering the vector_shuffle into unpck and perm during dagcombine1 is too early. Therefore, although somewhat inconvenient, we decided to go with pattern-matching a pair vector shuffles later in the SelectionDAG pipeline, as part of lowerV32I8Shuffle.

The code looks at the two operands of the first shuffle it encounters, iterates through the users of the operands, and tries to find two shuffles that are consecutive interleaves. Once the pattern is found, it lowers them into unpcks and perms. It returns the perm for the shuffle that's currently being lowered (have ISel modify the DAG), and replaces the other shuffle in place.

Diff Detail

Unit TestsFailed

Event Timeline

zhuhan0 created this revision.Sep 22 2022, 1:53 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 22 2022, 1:53 PM
zhuhan0 requested review of this revision.Sep 22 2022, 1:53 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 22 2022, 1:53 PM
zhuhan0 edited the summary of this revision. (Show Details)Sep 22 2022, 2:52 PM
zhuhan0 edited the summary of this revision. (Show Details)Sep 22 2022, 2:58 PM
RKSimon added inline comments.Sep 24 2022, 6:06 AM
llvm/test/CodeGen/X86/vector-interleave.ll
176

Please can you pre-commit this and rebase so we see the codegen change?

RKSimon added inline comments.Sep 24 2022, 8:27 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
17814

I'd probably replace these hard coded numbers with more general NumElts / NumEltsPerLane variables.

17853

Why are the ReplaceAllUsesWith calls necessary? We usually just rely on combines / demandedelts to replace these.

18550

I'm still not certain if we're better off trying to perform this in lowering or via shuffle combining or maybe combineConcatVectorOps?

RKSimon retitled this revision from [isel] Lower vector interleave into unpck and perm to [X86] Lower vector interleave into unpck and perm.Sep 24 2022, 8:27 AM
zhuhan0 updated this revision to Diff 462696.Sep 24 2022, 10:36 PM

Pre-commit test. Use NumElts instead of hard-coding.

zhuhan0 added inline comments.Sep 24 2022, 10:48 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
17853

Because in SelectionDAGLegalize::LegalizeOp only the currently visited node is replaced in the DAG, but here we're trying to transform both vector shuffle nodes, so I replaced the other (not currently visited) node on the fly. This does seem weird so I'm happy to change it if there's a better way. Is there an example of using "combines / demandedelts" to replace multiple nodes at the same time?

18550

We had the same question. The thing is, the original v64i8 vector_shuffle is split during generic dag combine, so it never gets to the X86 dag combiner.

After the v64i8 shuffle is split, the v32i8 shuffles do get to the X86 dag combiner, and they go through the combineShuffle path. That does come earlier than the legalizer. But combineShuffle doesn't call combineConcatVectorOps though and I'm not sure if any existing combine there is a good place to add this change. I can try adding a new combine there if you think that's a better place.

Hi @RKSimon, any thought on the new version? Thanks.

Friendly ping for a review of this.

MatzeB added a comment.Oct 6 2022, 3:58 PM

LGTM. I'm not super familiar with vector code lowering, so let's wait some more days to give others a chance to chime in if necessary, will accept in a few days then.

MatzeB added inline comments.Oct 6 2022, 4:00 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
17853

I think this is normal code when dealing with patterns that have multiple "root nodes".

RKSimon added inline comments.Oct 10 2022, 6:18 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
17798

Have you investigated using this for other types?

17831

auto *SVN1
auto *SVN2

zhuhan0 updated this revision to Diff 466709.Oct 10 2022, 11:38 PM

ShuffleVectorSDNode -> auto

zhuhan0 added inline comments.Oct 10 2022, 11:40 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
17798

Not yet. I'll submit another patch to generalize this. Let me first run some benchmarks to confirm that it's profitable.

zhuhan0 updated this revision to Diff 467293.Oct 12 2022, 4:21 PM

Generalize to other 256-bit vector types; Limit change to AVX2 only.

I decided to put the changes in the same diff so that it's easier to review.
This change showed wins across all other 256-bit vector types as well, when
measured on an internal compression benchmark (basically a loop as shown in
https://godbolt.org/z/s17Kv1s9T). We saw 40.6% and 33.4% improvement on
v16i16 and v8i32 respectively. For v4i64 though, we see merely 0.9%
improvement as the current perm + blend codegen seems already very good. So
I don't think this change is worth it for the 64-bit types.

RKSimon added inline comments.Oct 13 2022, 12:58 AM
llvm/test/CodeGen/X86/vector-interleave.ll
247

Wouldn't AVX1 benefit from the v8f32 case as well?

zhuhan0 added inline comments.Oct 13 2022, 12:05 PM
llvm/test/CodeGen/X86/vector-interleave.ll
247

Ah yes. I missed it.

zhuhan0 updated this revision to Diff 467561.Oct 13 2022, 12:06 PM

Remove AVX2 check on v8f32.

zhuhan0 added inline comments.Oct 13 2022, 12:10 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
18582–18606

v8i32 is also covered now because of the cast to v8f32 here.

RKSimon added inline comments.Oct 14 2022, 12:54 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
18171

Yeah, those AVX1 v8i32 regressions need addressing first - please can you add the hasAVX2 check back and a TODO saying this should be expanded to AVX1?

zhuhan0 updated this revision to Diff 467883.Oct 14 2022, 12:19 PM

Add back AVX2 check for v8f32. Add TODO.

RKSimon accepted this revision.Oct 15 2022, 6:44 AM

LGTM - cheers

This revision is now accepted and ready to land.Oct 15 2022, 6:44 AM

LGTM - cheers

Thanks for reviewing!

This revision was automatically updated to reflect the committed changes.