This is an archive of the discontinued LLVM Phabricator instance.

[mlir][Vector] Add 16x16 strategy to vector.transpose lowering.
ClosedPublic

Authored by hanchung on Apr 18 2023, 11:18 PM.

Details

Summary

It adds a shuffle_16x16 strategy LowerVectorTranspose and renames shuffle to shuffle_1d. The idea is similar to 8x8 cases in x86Vector::avx2. The general algorithm is:

interleave 32-bit lanes using
    8x _mm512_unpacklo_epi32
    8x _mm512_unpackhi_epi32
interleave 64-bit lanes using
    8x _mm512_unpacklo_epi64
    8x _mm512_unpackhi_epi64
permute 128-bit lanes using
   16x _mm512_shuffle_i32x4
permute 256-bit lanes using again
   16x _mm512_shuffle_i32x4

After the first stage, they got transposed to

 0  16   1  17   4  20   5  21   8  24   9  25  12  28  13  29
 2  18   3  19   6  22   7  23  10  26  11  27  14  30  15  31
32  48  33  49 ...
34  50  35  51 ...
64  80  65  81 ...
...

After the second stage, they got transposed to

 0  16  32  48 ...
 1  17  33  49 ...
 2  18  34  49 ...
 3  19  35  51 ...
64  80  96 112 ...
65  81  97 114 ...
66  82  98 113 ...
67  83  99 115 ...
...

After the thrid stage, they got transposed to

  0  16  32  48   8  24  40  56  64  80  96  112 ...
  1  17  33  49 ...
  2  18  34  50 ...
  3  19  35  51 ...
  4  20  36  52 ...
  5  21  37  53 ...
  6  22  38  54 ...
  7  23  39  55 ...
128 144 160 176 ...
129 145 161 177 ...
...

After the last stage, they got transposed to

0  16  32  48  64  80  96 112 ... 240
1  17  33  49  66  81  97 113 ... 241
2  18  34  50  67  82  98 114 ... 242
...
15  31  47  63  79  96 111 127 ... 255

Diff Detail

Event Timeline

hanchung created this revision.Apr 18 2023, 11:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 18 2023, 11:18 PM
hanchung requested review of this revision.Apr 18 2023, 11:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 18 2023, 11:18 PM

I'm quite new to x86 transpose lowering area. After exploring papers and resources, I found that the 8x8 case is similar to the idea in https://stackoverflow.com/questions/29519222/how-to-transpose-a-16x16-matrix-using-simd-instructions. Thus, I implemented the 16x16 lowering mostly based on it. I passed one integration tests in IREE, which is a MLIR downstream project; I can test enable it in IREE by default after landing the revision. Some perf improvements are observed in my recent tensor.pack/linalg.transpose codegen as well.

goldstein.w.n added inline comments.
mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp
50

It might pay to make number of elements an argument (or template parameter) and then reusing this for mm256* variants.

82

static?

dcaballe requested changes to this revision.Apr 19 2023, 10:24 AM

Super excited to see this! Awesome!

A few requests:

  • This pattern doesn't generate ASM blocks so it's really not AVX512 specific. That's great because it can be retargeted to any ISA, including AVX2 or even ARM or RISC-V. Could you please move this pattern to LowerVectorTranspose.cpp?
  • Also in this regard, we have a shuffle based lowering (TransposeOp2DToShuffleLowering, IIRC) for transposes. I think it only generates a single giant shuffle that is then split by the LLVM backend. Have you tried to enable that option for a 16x16 transpose and see if the generated assembly is the same as the one from this patch when targeting AVX512?
  • We would also need an integration test for this lowering as shuffles are very sensitive to correctness issues.
mlir/test/Dialect/Vector/vector-transpose-lowering.mlir
616

Really cool that we don't have to use ASM for this pattern. That makes it retargetable to AVX2 and also to other targets!

This revision now requires changes to proceed.Apr 19 2023, 10:24 AM

This pattern doesn't generate ASM blocks so it's really not AVX512 specific. That's great because it can be retargeted to any ISA, including AVX2 or even ARM or RISC-V. Could you please move this pattern to LowerVectorTranspose.cpp?

I'm not sure if moving it to LowerVectorTranspose is a good idea or not. Here are two points that I can think of.

  1. The 4x8 lowering is not AVX specific either. We maybe should move it to LowerVectorTranspose too? The reason of adding it to the file is that I'd like to keep this category lowering in the same place.
  2. Some utils are targeting Intel intrinsics, .e.g, mm512UnpackLoPd : https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_unpacklo_epi64&expand=6087. Is it applicable to other ISA? What would be the good naming for those utils? If they are AVX specific, maybe we should keep them in this file?

Also in this regard, we have a shuffle based lowering (TransposeOp2DToShuffleLowering, IIRC) for transposes. I think it only generates a single giant shuffle that is then split by the LLVM backend. Have you tried to enable that option for a 16x16 transpose and see if the generated assembly is the same as the one from this patch when targeting AVX512?

I tried a single giant shuffle, and they do not generate same code. The performance drops significantly comparing to this approach. I think the reason is that the shuffle ops can be mapped to target instructions in this approach.

We would also need an integration test for this lowering as shuffles are very sensitive to correctness issues.

Totally agree! Can you send me a pointer for doing it in MLIR repo?

Some utils are targeting Intel intrinsics, .e.g, mm512UnpackLoPd

This is unfortunately not true in clang / llvm, see:
https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237

Do we have good confidence that the assembly generated uses the right
instruction mix beyond just getting better perf?
In my experience there can still be quite a lot left on the table.
Can we get a test that ensures the assembly generated is what we expect?

Some utils are targeting Intel intrinsics, .e.g, mm512UnpackLoPd

This is unfortunately not true in clang / llvm, see:
https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237

Do we have good confidence that the assembly generated uses the right
instruction mix beyond just getting better perf?
In my experience there can still be quite a lot left on the table.
Can we get a test that ensures the assembly generated is what we expect?

The shuffle patterns proposed here are generally about as good as they get
for any x86 ISA I know of.

It minimized cross-lane shuffles and does repeated in-lane shuffles. Those
two traits for the most part are exactly what the ISA is best suited for.

hanchung added a comment.EditedApr 19 2023, 2:46 PM
Some utils are targeting Intel intrinsics, .e.g, mm512UnpackLoPd

This is unfortunately not true in clang / llvm, see:
https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237

Do we have good confidence that the assembly generated uses the right
instruction mix beyond just getting better perf?
In my experience there can still be quite a lot left on the table.
Can we get a test that ensures the assembly generated is what we expect?

Here is the ASM dump from IREE: https://gist.githubusercontent.com/hanhanW/c5fefa20151c27da113181e6748697a3/raw

As expected, there are 8x (vunpcklps, vunpckhps) pairs, 8x (vunpcklpd, vunpckhpd) pairs, and 32x vshuff64x2 in the dump. The mask of vshuff64x2 are all 0x88 and 0xdd, which are as same as the implementation.

This is unfortunately not true in clang / llvm, see:
https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237

The 4x8 lowering is not AVX specific either. We maybe should move it to LowerVectorTranspose too? The reason of adding it to the file is that I'd like to keep this category lowering in the same place.

Yep, different targets will lower these shuffles differently and even x86 will lower in its own way in some cases :) That's what it led to writing asm versions in some cases.
I think we placed everything here and use x86 specific names because it's where the experiment started but we should move generic patterns to a more generic place to make sure that people don't reinvent the wheel.

Some utils are targeting Intel intrinsics, .e.g, mm512UnpackLoPd : https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_unpacklo_epi64&expand=6087. Is it applicable to other ISA? What would be the good naming for those utils? If they are AVX specific, maybe we should keep them in this file?

x86 unpack and pack are vector interleaving/deinterleaving instructions. Some targets have them. Unpack and pack sounds reasonable to me, though.

Do we have good confidence that the assembly generated uses the right
instruction mix beyond just getting better perf?
In my experience there can still be quite a lot left on the table.
Can we get a test that ensures the assembly generated is what we expect?

I assumed this was the case :). That's the first thing we have to figure out.

OTR, if blends are needed, we should also consider (not sure if Nicolas already tried it):

The only thing I can think of is you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPACK*PS and the SHUFPS/BLENDPS:

8 x UNPCKLPS/UNPCKHPS
4 x INSERTF128
4 x PERM2F128
4 x SHUFPS
8 x BLENDPS

I tried a single giant shuffle, and they do not generate same code. The performance drops significantly comparing to this approach. I think the reason is that the shuffle ops can be mapped to target instructions in this approach.

The giant shuffle op should also be split and mapped to target shuffle instructions but probably a different sequence.

Totally agree! Can you send me a pointer for doing it in MLIR repo?

https://github.com/llvm/llvm-project/tree/main/mlir/test/Integration/Dialect/Vector/CPU

This is unfortunately not true in clang / llvm, see:
https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237

The 4x8 lowering is not AVX specific either. We maybe should move it to LowerVectorTranspose too? The reason of adding it to the file is that I'd like to keep this category lowering in the same place.

Yep, different targets will lower these shuffles differently and even x86 will lower in its own way in some cases :) That's what it led to writing asm versions in some cases.
I think we placed everything here and use x86 specific names because it's where the experiment started but we should move generic patterns to a more generic place to make sure that people don't reinvent the wheel.

Some utils are targeting Intel intrinsics, .e.g, mm512UnpackLoPd : https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_unpacklo_epi64&expand=6087. Is it applicable to other ISA? What would be the good naming for those utils? If they are AVX specific, maybe we should keep them in this file?

x86 unpack and pack are vector interleaving/deinterleaving instructions. Some targets have them. Unpack and pack sounds reasonable to me, though.

Do we have good confidence that the assembly generated uses the right
instruction mix beyond just getting better perf?
In my experience there can still be quite a lot left on the table.
Can we get a test that ensures the assembly generated is what we expect?

I assumed this was the case :). That's the first thing we have to figure out.

OTR, if blends are needed, we should also consider (not sure if Nicolas already tried it):

The only thing I can think of is you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPACK*PS and the SHUFPS/BLENDPS:

8 x UNPCKLPS/UNPCKHPS
4 x INSERTF128
4 x PERM2F128

Think you would want to insert/perm2f128 before the unpck actually. Some X86 targets have a quirk where insert has better
throughput when micro-fused with loads. Will be easier to detect / optimize in codegen if the inputs the transpose (potentially
memory) have their first shuffle as the insertf128 pattern.

4 x SHUFPS
8 x BLENDPS

> I tried a single giant shuffle, and they do not generate same code. The performance drops significantly comparing to this approach. I think the reason is that the shuffle ops can be mapped to target instructions in this approach.

The giant shuffle op should also be split and mapped to target shuffle instructions but probably a different sequence.

> Totally agree! Can you send me a pointer for doing it in MLIR repo?

https://github.com/llvm/llvm-project/tree/main/mlir/test/Integration/Dialect/Vector/CPU

Here is the ASM dump from IREE: https://gist.githubusercontent.com/hanhanW/c5fefa20151c27da113181e6748697a3/raw

As expected, there are 8x (vunpcklps, vunpckhps) pairs, 8x (vunpcklpd, vunpckhpd) pairs, and 32x vshuff64x2 in the dump. The mask of vshuff64x2 are all 0x88 and 0xdd, which are as same as the implementation.

We should be able to replace the vunpck*pd ones with a combination of shufps + blends that should be faster, in theory. That's what led to using asm in the past but perhaps we should give the shuffle + reordering approach a try before doing the same:

The only thing I can think of is you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPACK*PS and the SHUFPS/BLENDPS:

8 x UNPCKLPS/UNPCKHPS
4 x INSERTF128
4 x PERM2F128
4 x SHUFPS
8 x BLENDPS

Can we get a test that ensures the assembly generated is what we expect?

This is an anti-pattern in LLVM historically, clang does not have any such tests for example.
IIRC the rational is that such tests are fragile and put the burden of maintenance on possibly unrelated part of the project (that is any pass in the middle end or backend that would be able to break this), and that it is over-constrained vs setting up benchmarks (we shouldn't care about the actual assembly, only about the perf).

hanchung updated this revision to Diff 515541.Apr 20 2023, 5:22 PM
  • Move the implementation to LowerVectorTranspose.cpp
  • Add a Shuffle16x16 strategy
  • Rename Shuffle strategy to Shuffle1D
  • Add an e2e integration test
hanchung retitled this revision from [mlir][X86Vector] Add specialized vector.transpose lowering for AVX512. to [mlir][Vector] Add 16x16 strategy to vector.transpose lowering..Apr 20 2023, 5:22 PM

@dcaballe please take another look, thank you!

mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp
50

SG, I added the method to LowerVectorTranspose.cpp. I'll prepare another PR that moves 4x8 case to LowerVectorTranspose.cpp and reuse the method for mm256*.

hanchung updated this revision to Diff 515545.Apr 20 2023, 5:26 PM

clean transposeToShuffle1D logic

hanchung updated this revision to Diff 515546.Apr 20 2023, 5:26 PM
This comment was removed by hanchung.
hanchung edited the summary of this revision. (Show Details)Apr 20 2023, 5:53 PM
dcaballe accepted this revision.Apr 21 2023, 11:33 AM

LGTM. Mostly doc and minor comments. Happy to take a look again before landing.

mlir/lib/Dialect/Vector/Transforms/LowerVectorTranspose.cpp
55 ↗(On Diff #515546)

option -> lowering option or lowering approach?
vector.shuffle -> vector shuffle based?

61 ↗(On Diff #515546)

Can you elaborate a bit more in the doc. I'm not sure I get what vals is.

67 ↗(On Diff #515546)

Add message to asserts

134 ↗(On Diff #515546)

128-bits -> 128-bit lanes

150 ↗(On Diff #515546)

mm512 is x86 specific. Perhaps we can call this create4x128BitSuffle or something like that. This should be able to handle any element type, right?

156 ↗(On Diff #515546)

switch?

182 ↗(On Diff #515546)

doc

191 ↗(On Diff #515546)

doc

Should this be transposeToShuffle16x16, following the naming rule above?

380 ↗(On Diff #515546)

Add doc for the 16x16 side?

mlir/test/Dialect/Vector/vector-transpose-lowering.mlir
615

transpose_16x16xf32 -> transpose_shuffle16x16xf32?

mlir/test/Integration/Dialect/Vector/CPU/test-transpose-16x16.mlir
4 ↗(On Diff #515546)

file name: -> shuffle16x16?

27 ↗(On Diff #515546)

Awesome! I think a correctness test should be enough and would align with Mehdi's comment. Thanks!

This revision is now accepted and ready to land.Apr 21 2023, 11:33 AM
hanchung updated this revision to Diff 515884.Apr 21 2023, 1:16 PM
hanchung marked 12 inline comments as done.

Improve docs and address comments

dcaballe accepted this revision.Apr 21 2023, 1:45 PM

LGTM! THanks for addressing the feedback

mlir/lib/Dialect/Vector/Transforms/LowerVectorTranspose.cpp
66 ↗(On Diff #515884)

some typos above

hanchung updated this revision to Diff 515938.Apr 21 2023, 3:36 PM

fix typo and rebase

This revision was landed with ongoing or failed builds.Apr 23 2023, 11:06 AM
This revision was automatically updated to reflect the committed changes.

Can we get a test that ensures the assembly generated is what we expect?

This is an anti-pattern in LLVM historically, clang does not have any such tests for example.
IIRC the rational is that such tests are fragile and put the burden of maintenance on possibly unrelated part of the project (that is any pass in the middle end or backend that would be able to break this), and that it is over-constrained vs setting up benchmarks (we shouldn't care about the actual assembly, only about the perf).

Ah yes, forgot about this, good point; this is also why I didn't add such tests in the past.

It adds a shuffle_16x16 strategy LowerVectorTranspose and renames shuffle to shuffle_1d. The idea is similar to 8x8 cases in x86Vector::avx2. The general algorithm is:

Coming back from vacation I am quite confused by the state in which this PR has landed: this is mentioned to be similar to x86Vector::avx2 but we now have 2 very different implementations that live in separate places for similar algorithms.
I would expect such an implementation to reuse and/or extend utils defined in X86Vector/Transforms/AVXTranspose.cpp such as MaskHelper::shuffle/blend/permute. Instead I seem to see a new implementation with magic assembly constants, one-off new helpers etc.

Please refactor this to be in line with the existing implementation (and/or evolve the existing one accordingly): it is not sustainable to have 2 completely separate implementations for simple variations of the same algorithm.

mlir/test/Dialect/Vector/vector-transpose-lowering.mlir
616

The only thing that uses asm atm is the avx2 vblendps-based lowering that I wasn't able to get through LLVM in any other way than inline asm (as per the discussion I posted). If we have a good reliable way of generating the vblendps instructions without asm that would be great.
I do not see any blend instruction in the gist though: https://gist.githubusercontent.com/hanhanW/c5fefa20151c27da113181e6748697a3/raw

Re "retargetable", I don't see it; isn't this implementation quite specific to AVX512 where we want to really be careful about crossing 128b boundaries? Isn't this also at risk of spilling severely on architectures with smaller vector sizes? I see this much more naturally living under x86vector, under an avx512 namespace.

652

CHECK-COUNT where possible plz

OTR, if blends are needed, we should also consider (not sure if Nicolas already tried it):

The only thing I can think of is you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPACK*PS and the SHUFPS/BLENDPS:

8 x UNPCKLPS/UNPCKHPS
4 x INSERTF128
4 x PERM2F128

I did try different variations but did not get LLVM to ever emit the blend-based version, hence the addition of an inline asm version.

Think you would want to insert/perm2f128 before the unpck actually. Some X86 targets have a quirk where insert has better
throughput when micro-fused with loads. Will be easier to detect / optimize in codegen if the inputs the transpose (potentially
memory) have their first shuffle as the insertf128 pattern.

Nice insight, I'd definitely be interested in seeing this tried out and measured.

dcaballe added inline comments.May 8 2023, 10:55 AM
mlir/test/Dialect/Vector/vector-transpose-lowering.mlir
616

Interleaving/deinterleaving ops are common shuffle ops in most architectures and shuffling across >128-bit lanes are also common limitations (see, for example, slide 13 here: https://www.stonybrook.edu/commcms/ookami/support/_docs/ARM_SVE_tutorial.pdf). This pattern won't be a perfect fit for, let's say SVE and RISC-V but I would expect certain level of applicability, esp. if we compare it with a scalar transfer ops or a single giant shuffle.

Totally agree, though, that we should refactor the common components instead of reimplementing them. That would be a great follow-up.