This is an archive of the discontinued LLVM Phabricator instance.

Differential D148685

[mlir][Vector] Add 16x16 strategy to vector.transpose lowering.
ClosedPublic

Authored by hanchung on Apr 18 2023, 11:18 PM.

Details

Reviewers

nicolasvasilache
aartbik
springerm
dcaballe
ftynse
Benoit
mravishankar

Commits

rG8d163e504507: [mlir][Vector] Add 16x16 strategy to vector.transpose lowering.

Summary

It adds a shuffle_16x16 strategy to LowerVectorTranspose and renames shuffle to shuffle_1d. The idea is similar to the 8x8 cases in x86Vector::avx2. The general algorithm is:

interleave 32-bit lanes using
    8x _mm512_unpacklo_epi32
    8x _mm512_unpackhi_epi32
interleave 64-bit lanes using
    8x _mm512_unpacklo_epi64
    8x _mm512_unpackhi_epi64
permute 128-bit lanes using
   16x _mm512_shuffle_i32x4
permute 256-bit lanes, again using
   16x _mm512_shuffle_i32x4

After the first stage, the rows are:

 0  16   1  17   4  20   5  21   8  24   9  25  12  28  13  29
 2  18   3  19   6  22   7  23  10  26  11  27  14  30  15  31
32  48  33  49 ...
34  50  35  51 ...
64  80  65  81 ...
...

After the second stage, the rows are:

 0  16  32  48 ...
 1  17  33  49 ...
 2  18  34  50 ...
 3  19  35  51 ...
64  80  96 112 ...
65  81  97 113 ...
66  82  98 114 ...
67  83  99 115 ...
...

After the third stage, the rows are:

  0  16  32  48   8  24  40  56  64  80  96  112 ...
  1  17  33  49 ...
  2  18  34  50 ...
  3  19  35  51 ...
  4  20  36  52 ...
  5  21  37  53 ...
  6  22  38  54 ...
  7  23  39  55 ...
128 144 160 176 ...
129 145 161 177 ...
...

After the last stage, the rows are:

 0  16  32  48  64  80  96 112 ... 240
 1  17  33  49  65  81  97 113 ... 241
 2  18  34  50  66  82  98 114 ... 242
...
15  31  47  63  79  95 111 127 ... 255
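
The four stages above can be checked on plain integer arrays. What follows is a minimal standalone C++ sketch, not the code from the patch: it models each intrinsic as an index shuffle over 16-element rows. All helper names (`shuffle`, `perLane`, `shuffleLanes`, `transpose16x16`, `makeInput`) are hypothetical, and the way rows are paired in the last two stages is an assumption taken from the Stack Overflow answer linked later in this review.

```cpp
#include <array>
#include <cassert>

using Row = std::array<int, 16>;

// Models a two-source shuffle: mask indices 0-15 read `a`, 16-31 read `b`.
Row shuffle(const Row &a, const Row &b, const Row &mask) {
  Row out{};
  for (int i = 0; i < 16; ++i) {
    assert(mask[i] >= 0 && mask[i] < 32 && "mask index out of range");
    out[i] = mask[i] < 16 ? a[mask[i]] : b[mask[i] - 16];
  }
  return out;
}

// Repeats a 4-element pattern in each of the four 128-bit lanes.
Row perLane(const std::array<int, 4> &pattern) {
  Row mask{};
  for (int lane = 0; lane < 4; ++lane)
    for (int k = 0; k < 4; ++k)
      mask[4 * lane + k] = pattern[k] + 4 * lane;
  return mask;
}

Row unpackLo32(const Row &a, const Row &b) { return shuffle(a, b, perLane({0, 16, 1, 17})); }
Row unpackHi32(const Row &a, const Row &b) { return shuffle(a, b, perLane({2, 18, 3, 19})); }
Row unpackLo64(const Row &a, const Row &b) { return shuffle(a, b, perLane({0, 1, 16, 17})); }
Row unpackHi64(const Row &a, const Row &b) { return shuffle(a, b, perLane({2, 3, 18, 19})); }

// Models a 128-bit lane shuffle with an 8-bit immediate: the two low result
// lanes are picked from `a`, the two high result lanes from `b`.
Row shuffleLanes(const Row &a, const Row &b, unsigned imm) {
  Row mask{};
  for (int lane = 0; lane < 4; ++lane) {
    int sel = static_cast<int>((imm >> (2 * lane)) & 3u);
    int base = lane < 2 ? 0 : 16;
    for (int k = 0; k < 4; ++k)
      mask[4 * lane + k] = base + 4 * sel + k;
  }
  return shuffle(a, b, mask);
}

std::array<Row, 16> transpose16x16(const std::array<Row, 16> &in) {
  std::array<Row, 16> t{}, r{}, s{}, out{};
  // Stage 1: interleave 32-bit elements of adjacent rows.
  for (int i = 0; i < 16; i += 2) {
    t[i] = unpackLo32(in[i], in[i + 1]);
    t[i + 1] = unpackHi32(in[i], in[i + 1]);
  }
  // Stage 2: interleave 64-bit pairs of rows that are two apart.
  for (int i = 0; i < 16; i += 4) {
    r[i] = unpackLo64(t[i], t[i + 2]);
    r[i + 1] = unpackHi64(t[i], t[i + 2]);
    r[i + 2] = unpackLo64(t[i + 1], t[i + 3]);
    r[i + 3] = unpackHi64(t[i + 1], t[i + 3]);
  }
  // Stage 3: permute 128-bit lanes of rows that are four apart.
  for (int half = 0; half < 16; half += 8)
    for (int i = 0; i < 4; ++i) {
      s[half + i] = shuffleLanes(r[half + i], r[half + i + 4], 0x88);
      s[half + i + 4] = shuffleLanes(r[half + i], r[half + i + 4], 0xdd);
    }
  // Stage 4: permute 128-bit lanes of rows that are eight apart.
  for (int i = 0; i < 8; ++i) {
    out[i] = shuffleLanes(s[i], s[i + 8], 0x88);
    out[i + 8] = shuffleLanes(s[i], s[i + 8], 0xdd);
  }
  return out;
}

// Builds the row-major input 0, 1, ..., 255.
std::array<Row, 16> makeInput() {
  std::array<Row, 16> in{};
  for (int i = 0; i < 16; ++i)
    for (int j = 0; j < 16; ++j)
      in[i][j] = 16 * i + j;
  return in;
}
```

With this model, element `[i][j]` of `transpose16x16(makeInput())` equals `16 * j + i`, and the loops issue 16 lane shuffles in each of the last two stages, matching the counts in the summary.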

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hanchung created this revision. Apr 18 2023, 11:18 PM

Herald added a project: Restricted Project. Apr 18 2023, 11:18 PM

Herald added subscribers: bviyer, Moerafaat, zero9178 and 23 others.

hanchung requested review of this revision. Apr 18 2023, 11:18 PM

Herald added a subscriber: stephenneuendorffer.

I'm quite new to the x86 transpose lowering area. After exploring papers and resources, I found that the 8x8 case is similar to the idea in https://stackoverflow.com/questions/29519222/how-to-transpose-a-16x16-matrix-using-simd-instructions. Thus, I implemented the 16x16 lowering mostly based on it. It passes one integration test in IREE, which is a downstream MLIR project; I can try enabling it in IREE by default after landing the revision. Some perf improvements are observed in my recent tensor.pack/linalg.transpose codegen as well.

hanchung added a reviewer: mravishankar. Apr 18 2023, 11:23 PM

Harbormaster completed remote builds in B226518: Diff 514828. Apr 18 2023, 11:35 PM

goldstein.w.n added a subscriber: goldstein.w.n. Apr 19 2023, 12:16 AM

goldstein.w.n added inline comments.

mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp
50 ↗	(On Diff #514828)	It might pay to make the number of elements an argument (or template parameter) and then reuse this for the `mm256*` variants.
82 ↗	(On Diff #514828)	static?

Super excited to see this! Awesome!

A few requests:

This pattern doesn't generate ASM blocks so it's really not AVX512 specific. That's great because it can be retargeted to any ISA, including AVX2 or even ARM or RISC-V. Could you please move this pattern to LowerVectorTranspose.cpp?
Also in this regard, we have a shuffle based lowering (TransposeOp2DToShuffleLowering, IIRC) for transposes. I think it only generates a single giant shuffle that is then split by the LLVM backend. Have you tried to enable that option for a 16x16 transpose and see if the generated assembly is the same as the one from this patch when targeting AVX512?
We would also need an integration test for this lowering as shuffles are very sensitive to correctness issues.

mlir/test/Dialect/Vector/vector-transpose-lowering.mlir
616	Really cool that we don't have to use ASM for this pattern. That makes it retargetable to AVX2 and also to other targets!

This revision now requires changes to proceed. Apr 19 2023, 10:24 AM

This pattern doesn't generate ASM blocks so it's really not AVX512 specific. That's great because it can be retargeted to any ISA, including AVX2 or even ARM or RISC-V. Could you please move this pattern to LowerVectorTranspose.cpp?

I'm not sure if moving it to LowerVectorTranspose is a good idea or not. Here are two points that I can think of.

The 4x8 lowering is not AVX specific either. Maybe we should move it to LowerVectorTranspose too? The reason for adding it to this file is that I'd like to keep this category of lowering in the same place.
Some utils are targeting Intel intrinsics, e.g., mm512UnpackLoPd : https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_unpacklo_epi64&expand=6087. Are they applicable to other ISAs? What would be a good naming for those utils? If they are AVX specific, maybe we should keep them in this file?

Also in this regard, we have a shuffle based lowering (TransposeOp2DToShuffleLowering, IIRC) for transposes. I think it only generates a single giant shuffle that is then split by the LLVM backend. Have you tried to enable that option for a 16x16 transpose and see if the generated assembly is the same as the one from this patch when targeting AVX512?

I tried a single giant shuffle, and the two do not generate the same code. Performance drops significantly compared to this approach. I think the reason is that the shuffle ops in this approach can be mapped directly to target instructions.
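
For reference, the "single giant shuffle" is one permutation over the flattened matrix. The sketch below mirrors the loop of `transposeToShuffle1D` from this diff in standalone C++, with `std::vector` standing in for the MLIR types; the names `flatTransposeMask` and `applyMask` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds the permutation that transposes a row-major m x n matrix stored as
// a flat vector: output position j * m + i reads input position i * n + j.
std::vector<int64_t> flatTransposeMask(int64_t m, int64_t n) {
  std::vector<int64_t> mask;
  mask.reserve(m * n);
  for (int64_t j = 0; j < n; ++j)
    for (int64_t i = 0; i < m; ++i)
      mask.push_back(i * n + j);
  return mask;
}

// Applies the mask the way a shuffle of one source with itself would.
std::vector<int> applyMask(const std::vector<int> &src,
                           const std::vector<int64_t> &mask) {
  std::vector<int> dst;
  dst.reserve(mask.size());
  for (int64_t idx : mask) {
    assert(idx >= 0 && idx < static_cast<int64_t>(src.size()));
    dst.push_back(src[idx]);
  }
  return dst;
}
```

For a 16x16 input this is a single 256-entry mask, which the backend then has to split into instructions on its own. The staged lowering instead emits 64 small shuffles whose masks already have the shape of unpack and lane-permute instructions.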

We would also need an integration test for this lowering as shuffles are very sensitive to correctness issues.

Totally agree! Can you send me a pointer for doing it in MLIR repo?

Some utils are targeting Intel intrinsics, e.g., mm512UnpackLoPd

This is unfortunately not true in clang / llvm, see:
https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237

Do we have good confidence that the assembly generated uses the right
instruction mix beyond just getting better perf?
In my experience there can still be quite a lot left on the table.
Can we get a test that ensures the assembly generated is what we expect?

In D148685#4281659, @nicolasvasilache wrote:
Some utils are targeting Intel intrinsics, e.g., mm512UnpackLoPd
This is unfortunately not true in clang / llvm, see:
https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237

Do we have good confidence that the assembly generated uses the right
instruction mix beyond just getting better perf?
In my experience there can still be quite a lot left on the table.
Can we get a test that ensures the assembly generated is what we expect?

The shuffle patterns proposed here are generally about as good as they get
for any x86 ISA I know of.

It minimizes cross-lane shuffles and does repeated in-lane shuffles. Those
two traits for the most part are exactly what the ISA is best suited for.

In D148685#4281659, @nicolasvasilache wrote:
Some utils are targeting Intel intrinsics, e.g., mm512UnpackLoPd
This is unfortunately not true in clang / llvm, see:
https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237

Do we have good confidence that the assembly generated uses the right
instruction mix beyond just getting better perf?
In my experience there can still be quite a lot left on the table.
Can we get a test that ensures the assembly generated is what we expect?

Here is the ASM dump from IREE: https://gist.githubusercontent.com/hanhanW/c5fefa20151c27da113181e6748697a3/raw

As expected, there are 8x (vunpcklps, vunpckhps) pairs, 8x (vunpcklpd, vunpckhpd) pairs, and 32x vshuff64x2 in the dump. The masks of vshuff64x2 are all 0x88 or 0xdd, the same as in the implementation.
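
To read the two immediates, it helps to split them into 2-bit lane selectors. This is a small standalone C++ sketch under the usual encoding of the 128-bit lane shuffles (bits 1:0 and 3:2 pick lanes of the first operand, bits 5:4 and 7:6 pick lanes of the second); the function names are hypothetical and the element mask assumes 16 32-bit elements per operand.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Splits an 8-bit shuffle immediate into four 2-bit lane selectors, lowest
// bits first.
std::array<int, 4> decodeLaneSelectors(uint8_t imm) {
  return {imm & 0x3, (imm >> 2) & 0x3, (imm >> 4) & 0x3, (imm >> 6) & 0x3};
}

// Expands the immediate into a 16-element two-source shuffle mask, where
// indices 0-15 refer to the first operand and 16-31 to the second.
std::array<int, 16> laneShuffleMask(uint8_t imm) {
  std::array<int, 16> mask{};
  std::array<int, 4> sel = decodeLaneSelectors(imm);
  for (int lane = 0; lane < 4; ++lane) {
    int base = lane < 2 ? 0 : 16;
    for (int k = 0; k < 4; ++k)
      mask[4 * lane + k] = base + 4 * sel[lane] + k;
  }
  return mask;
}
```

Under this decoding, 0x88 gives selectors {0, 2, 0, 2} (the even lanes of each operand) and 0xdd gives {1, 3, 1, 3} (the odd lanes).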

This is unfortunately not true in clang / llvm, see:
https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237

The 4x8 lowering is not AVX specific either. Maybe we should move it to LowerVectorTranspose too? The reason for adding it to this file is that I'd like to keep this category of lowering in the same place.

Yep, different targets will lower these shuffles differently and even x86 will lower in its own way in some cases :) That's what led to writing asm versions in some cases.
I think we placed everything here and use x86 specific names because it's where the experiment started but we should move generic patterns to a more generic place to make sure that people don't reinvent the wheel.

Some utils are targeting Intel intrinsics, e.g., mm512UnpackLoPd : https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_unpacklo_epi64&expand=6087. Are they applicable to other ISAs? What would be a good naming for those utils? If they are AVX specific, maybe we should keep them in this file?

x86 unpack and pack are vector interleaving/deinterleaving instructions. Some targets have them. Unpack and pack sounds reasonable to me, though.
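
As an illustration of why unpack is just an interleave confined to each 128-bit lane, the sketch below mirrors the loop of `getUnpackShufflePermFor128Lane` from this diff in standalone C++ (`std::vector` replaces the MLIR types; the wrapper names `unpackLo32Mask` and `unpackHi32Mask` are hypothetical). Because the vector width is a parameter, the same code produces the masks for 256-bit vectors as well.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Repeats the 4-element pattern `vals` once per 128-bit lane (four 32-bit
// elements), offsetting it by the lane's starting element.
std::vector<int64_t> unpackMaskPer128BitLane(const std::vector<int64_t> &vals,
                                             int numBits) {
  assert(numBits % 128 == 0 && "numBits must be a multiple of 128");
  int numElem = numBits / 32;
  std::vector<int64_t> res;
  for (int i = 0; i < numElem; i += 4)
    for (int64_t v : vals)
      res.push_back(v + i);
  return res;
}

// Interleaves the two low 32-bit elements of each lane of both operands.
// Indices >= numElem refer to the second operand.
std::vector<int64_t> unpackLo32Mask(int numBits) {
  int64_t n = numBits / 32;
  return unpackMaskPer128BitLane({0, n, 1, n + 1}, numBits);
}

// Interleaves the two high 32-bit elements of each lane of both operands.
std::vector<int64_t> unpackHi32Mask(int numBits) {
  int64_t n = numBits / 32;
  return unpackMaskPer128BitLane({2, n + 2, 3, n + 3}, numBits);
}
```

For example, `unpackLo32Mask(256)` yields `[0, 8, 1, 9, 4, 12, 5, 13]`, and `unpackLo32Mask(512)` yields the first-stage pattern `[0, 16, 1, 17, 4, 20, 5, 21, ...]` from the summary.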

Do we have good confidence that the assembly generated uses the right
instruction mix beyond just getting better perf?
In my experience there can still be quite a lot left on the table.
Can we get a test that ensures the assembly generated is what we expect?

I assumed this was the case :). That's the first thing we have to figure out.

OTR, if blends are needed, we should also consider (not sure if Nicolas already tried it):

The only thing I can think of is you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPACK*PS and the SHUFPS/BLENDPS:

8 x UNPCKLPS/UNPCKHPS
4 x INSERTF128
4 x PERM2F128
4 x SHUFPS
8 x BLENDPS

I tried a single giant shuffle, and the two do not generate the same code. Performance drops significantly compared to this approach. I think the reason is that the shuffle ops in this approach can be mapped directly to target instructions.

The giant shuffle op should also be split and mapped to target shuffle instructions but probably a different sequence.

Totally agree! Can you send me a pointer for doing it in MLIR repo?

https://github.com/llvm/llvm-project/tree/main/mlir/test/Integration/Dialect/Vector/CPU

In D148685#4281722, @dcaballe wrote:
This is unfortunately not true in clang / llvm, see:
https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237

The 4x8 lowering is not AVX specific either. Maybe we should move it to LowerVectorTranspose too? The reason for adding it to this file is that I'd like to keep this category of lowering in the same place.

Yep, different targets will lower these shuffles differently and even x86 will lower in its own way in some cases :) That's what led to writing asm versions in some cases.
I think we placed everything here and use x86 specific names because it's where the experiment started but we should move generic patterns to a more generic place to make sure that people don't reinvent the wheel.

Some utils are targeting Intel intrinsics, e.g., mm512UnpackLoPd : https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_unpacklo_epi64&expand=6087. Are they applicable to other ISAs? What would be a good naming for those utils? If they are AVX specific, maybe we should keep them in this file?

x86 unpack and pack are vector interleaving/deinterleaving instructions. Some targets have them. Unpack and pack sounds reasonable to me, though.

Do we have good confidence that the assembly generated uses the right
instruction mix beyond just getting better perf?
In my experience there can still be quite a lot left on the table.
Can we get a test that ensures the assembly generated is what we expect?

I assumed this was the case :). That's the first thing we have to figure out.

OTR, if blends are needed, we should also consider (not sure if Nicolas already tried it):
The only thing I can think of is you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPACK*PS and the SHUFPS/BLENDPS:

8 x UNPCKLPS/UNPCKHPS
4 x INSERTF128
4 x PERM2F128

I think you would actually want the insert/perm2f128 before the unpck. Some X86 targets have a quirk where insert has better
throughput when micro-fused with loads. It will be easier to detect / optimize in codegen if the inputs to the transpose (potentially
memory) have their first shuffle as the insertf128 pattern.

4 x SHUFPS
8 x BLENDPS

> I tried a single giant shuffle, and the two do not generate the same code. Performance drops significantly compared to this approach. I think the reason is that the shuffle ops in this approach can be mapped directly to target instructions.

The giant shuffle op should also be split and mapped to target shuffle instructions but probably a different sequence.

> Totally agree! Can you send me a pointer for doing it in MLIR repo?

https://github.com/llvm/llvm-project/tree/main/mlir/test/Integration/Dialect/Vector/CPU

Here is the ASM dump from IREE: https://gist.githubusercontent.com/hanhanW/c5fefa20151c27da113181e6748697a3/raw

As expected, there are 8x (vunpcklps, vunpckhps) pairs, 8x (vunpcklpd, vunpckhpd) pairs, and 32x vshuff64x2 in the dump. The masks of vshuff64x2 are all 0x88 or 0xdd, the same as in the implementation.

We should be able to replace the vunpck*pd ones with a combination of shufps + blends that should be faster, in theory. That's what led to using asm in the past but perhaps we should give the shuffle + reordering approach a try before doing the same:

The only thing I can think of is you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPACK*PS and the SHUFPS/BLENDPS:

8 x UNPCKLPS/UNPCKHPS
4 x INSERTF128
4 x PERM2F128
4 x SHUFPS
8 x BLENDPS

In D148685#4281659, @nicolasvasilache wrote:

Can we get a test that ensures the assembly generated is what we expect?

This is an anti-pattern in LLVM historically, clang does not have any such tests for example.
IIRC the rationale is that such tests are fragile and put the burden of maintenance on possibly unrelated parts of the project (that is, any pass in the middle end or backend that would be able to break this), and that it is over-constrained vs setting up benchmarks (we shouldn't care about the actual assembly, only about the perf).

Move the implementation to LowerVectorTranspose.cpp
Add a Shuffle16x16 strategy
Rename Shuffle strategy to Shuffle1D
Add an e2e integration test

Herald added a subscriber: awarzynski. Apr 20 2023, 5:22 PM

@dcaballe please take another look, thank you!

mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp
50 ↗	(On Diff #514828)	SG, I added the method to LowerVectorTranspose.cpp. I'll prepare another PR that moves 4x8 case to LowerVectorTranspose.cpp and reuse the method for `mm256*`.

clean transposeToShuffle1D logic

hanchung updated this revision to Diff 515546. Apr 20 2023, 5:26 PM

This comment was removed by hanchung.

Harbormaster completed remote builds in B227031: Diff 515546. Apr 20 2023, 5:42 PM

hanchung edited the summary of this revision. Apr 20 2023, 5:53 PM

Herald added a subscriber: pengfei. Apr 20 2023, 5:53 PM

LGTM. Mostly doc and minor comments. Happy to take a look again before landing.

mlir/lib/Dialect/Vector/Transforms/LowerVectorTranspose.cpp
55	option -> lowering option or lowering approach? vector.shuffle -> vector shuffle based?
61	Can you elaborate a bit more in the doc. I'm not sure I get what `vals` is.
67	Add message to asserts
134	128-bits -> 128-bit lanes
150	mm512 is x86 specific. Perhaps we can call this `create4x128BitSuffle` or something like that. This should be able to handle any element type, right?
156	switch?
182	doc
191	doc Should this be transposeToShuffle16x16, following the naming rule above?
398–400	Add doc for the 16x16 side?
mlir/test/Dialect/Vector/vector-transpose-lowering.mlir
615	transpose_16x16xf32 -> transpose_shuffle16x16xf32?
mlir/test/Integration/Dialect/Vector/CPU/test-transpose-16x16.mlir
4 ↗	(On Diff #515546)	file name: -> shuffle16x16?
27 ↗	(On Diff #515546)	Awesome! I think a correctness test should be enough and would align with Mehdi's comment. Thanks!

This revision is now accepted and ready to land. Apr 21 2023, 11:33 AM

Improve docs and address comments

LGTM! Thanks for addressing the feedback.

mlir/lib/Dialect/Vector/Transforms/LowerVectorTranspose.cpp
66	some typos above

Harbormaster completed remote builds in B227298: Diff 515884. Apr 21 2023, 1:57 PM

fix typo and rebase

This revision was landed with ongoing or failed builds. Apr 23 2023, 11:06 AM

Closed by commit rG8d163e504507: [mlir][Vector] Add 16x16 strategy to vector.transpose lowering. (authored by hanchung).

This revision was automatically updated to reflect the committed changes.

hanchung added a commit: rG8d163e504507: [mlir][Vector] Add 16x16 strategy to vector.transpose lowering..

In D148685#4282679, @mehdi_amini wrote:

In D148685#4281659, @nicolasvasilache wrote:

Can we get a test that ensures the assembly generated is what we expect?

This is an anti-pattern in LLVM historically, clang does not have any such tests for example.
IIRC the rationale is that such tests are fragile and put the burden of maintenance on possibly unrelated parts of the project (that is, any pass in the middle end or backend that would be able to break this), and that it is over-constrained vs setting up benchmarks (we shouldn't care about the actual assembly, only about the perf).

Ah yes, forgot about this, good point; this is also why I didn't add such tests in the past.

It adds a shuffle_16x16 strategy to LowerVectorTranspose and renames shuffle to shuffle_1d. The idea is similar to the 8x8 cases in x86Vector::avx2. The general algorithm is:

Coming back from vacation I am quite confused by the state in which this PR has landed: this is mentioned to be similar to x86Vector::avx2 but we now have 2 very different implementations that live in separate places for similar algorithms.
I would expect such an implementation to reuse and/or extend utils defined in X86Vector/Transforms/AVXTranspose.cpp such as MaskHelper::shuffle/blend/permute. Instead I seem to see a new implementation with magic assembly constants, one-off new helpers etc.

Please refactor this to be in line with the existing implementation (and/or evolve the existing one accordingly): it is not sustainable to have 2 completely separate implementations for simple variations of the same algorithm.

mlir/test/Dialect/Vector/vector-transpose-lowering.mlir
616	The only thing that uses asm atm is the avx2 vblendps-based lowering that I wasn't able to get through LLVM in any other way than inline asm (as per the discussion I posted). If we have a good reliable way of generating the vblendps instructions without asm that would be great. I do not see any blend instruction in the gist though: https://gist.githubusercontent.com/hanhanW/c5fefa20151c27da113181e6748697a3/raw Re "retargetable", I don't see it; isn't this implementation quite specific to AVX512 where we want to really be careful about crossing 128b boundaries? Isn't this also at risk of spilling severely on architectures with smaller vector sizes? I see this much more naturally living under x86vector, under an avx512 namespace.
652	CHECK-COUNT where possible plz

OTR, if blends are needed, we should also consider (not sure if Nicolas already tried it):

The only thing I can think of is you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPACK*PS and the SHUFPS/BLENDPS:

8 x UNPCKLPS/UNPCKHPS
4 x INSERTF128
4 x PERM2F128

I did try different variations but did not get LLVM to ever emit the blend-based version, hence the addition of an inline asm version.

I think you would actually want the insert/perm2f128 before the unpck. Some X86 targets have a quirk where insert has better
throughput when micro-fused with loads. It will be easier to detect / optimize in codegen if the inputs to the transpose (potentially
memory) have their first shuffle as the insertf128 pattern.

Nice insight, I'd definitely be interested in seeing this tried out and measured.

dcaballe added inline comments. May 8 2023, 10:55 AM

mlir/test/Dialect/Vector/vector-transpose-lowering.mlir
616	Interleaving/deinterleaving ops are common shuffle ops in most architectures, and limitations on shuffling across 128-bit lanes are also common (see, for example, slide 13 here: https://www.stonybrook.edu/commcms/ookami/support/_docs/ARM_SVE_tutorial.pdf). This pattern won't be a perfect fit for, let's say, SVE and RISC-V, but I would expect a certain level of applicability, esp. if we compare it with scalar transfer ops or a single giant shuffle. Totally agree, though, that we should refactor the common components instead of reimplementing them. That would be a great follow-up.

Revision Contents

Path                                                                  Size
mlir/include/mlir/Dialect/Vector/Transforms/VectorTransformsBase.td   11 lines
mlir/lib/Dialect/Vector/Transforms/LowerVectorTranspose.cpp           291 lines
mlir/test/Dialect/LLVM/transform-e2e.mlir                             2 lines
mlir/test/Dialect/Vector/transform-vector.mlir                        2 lines
mlir/test/Dialect/Vector/vector-transpose-lowering.mlir               80 lines
mlir/test/Integration/Dialect/Vector/CPU/test-shuffle16x16.mlir       38 lines

Diff 516189

mlir/include/mlir/Dialect/Vector/Transforms/VectorTransformsBase.td

  // Lower transpose into element-wise extract and inserts.
  def VectorTransposeLowering_Elementwise:
    I32EnumAttrCase<"EltWise", 0, "eltwise">;
  // Lower 2-D transpose to `vector.flat_transpose`, maps 1-1 to LLVM matrix
  // intrinsics.
  def VectorTransposeLowering_FlatTranspose:
    I32EnumAttrCase<"Flat", 1, "flat_transpose">;
- // Lower 2-D transpose to `vector.shuffle`.
- def VectorTransposeLowering_Shuffle:
-   I32EnumAttrCase<"Shuffle", 2, "shuffle">;
+ // Lower 2-D transpose to `vector.shuffle` on 1-D vector.
+ def VectorTransposeLowering_Shuffle1D:
+   I32EnumAttrCase<"Shuffle1D", 2, "shuffle_1d">;
+ // Lower 2-D transpose to `vector.shuffle` on 16x16 vector.
+ def VectorTransposeLowering_Shuffle16x16:
+   I32EnumAttrCase<"Shuffle16x16", 3, "shuffle_16x16">;
  def VectorTransposeLoweringAttr : I32EnumAttr<
      "VectorTransposeLowering",
      "control the lowering of `vector.transpose` operations.",
      [VectorTransposeLowering_Elementwise, VectorTransposeLowering_FlatTranspose,
-      VectorTransposeLowering_Shuffle]> {
+      VectorTransposeLowering_Shuffle1D, VectorTransposeLowering_Shuffle16x16]> {
    let cppNamespace = "::mlir::vector";
  }

  // Lower multi_reduction into outer-reduction and inner-parallel ops.
  def VectorMultiReductionLowering_InnerParallel:
    I32EnumAttrCase<"InnerParallel", 0, "innerparallel">;
  // Lower multi_reduction into outer-parallel and inner-reduction ops.
  def VectorMultiReductionLowering_InnerReduction:

mlir/lib/Dialect/Vector/Transforms/LowerVectorTranspose.cpp

  for (size_t transpDim : llvm::reverse(transpose)) {
    if (transpDim != numTransposedDims - 1)
      break;
    numTransposedDims--;
  }

  result.append(transpose.begin(), transpose.begin() + numTransposedDims);
}

/// Returns true if the lowering option is a vector shuffle based approach.
static bool isShuffleLike(VectorTransposeLowering lowering) {
  return lowering == VectorTransposeLowering::Shuffle1D ||
         lowering == VectorTransposeLowering::Shuffle16x16;
}

/// Returns a shuffle mask that builds on `vals`. `vals` is the offset base of
/// shuffle ops, i.e., the unpack pattern. The method iterates with `vals` to
/// create the mask for `numBits` bits vector. The `numBits` have to be a
/// multiple of 128. For example, if `vals` is {0, 1, 16, 17} and `numBits` is
/// 512, there should be 16 elements in the final result. It constructs the
/// below mask to get the unpack elements.
///   [0,    1,    16,    17,
///    0+4,  1+4,  16+4,  17+4,
///    0+8,  1+8,  16+8,  17+8,
///    0+12, 1+12, 16+12, 17+12]
static SmallVector<int64_t>
getUnpackShufflePermFor128Lane(ArrayRef<int64_t> vals, int numBits) {
  assert(numBits % 128 == 0 && "expected numBits is a multiple of 128");
  int numElem = numBits / 32;
  SmallVector<int64_t> res;
  for (int i = 0; i < numElem; i += 4)
    for (int64_t v : vals)
      res.push_back(v + i);
  return res;
}

/// Lower to vector.shuffle on v1 and v2 with UnpackLoPd shuffle mask. For
/// example, if it is targeting 512 bit vector, returns
///   vector.shuffle on v1, v2, [0,    1,    16,    17,
///                              0+4,  1+4,  16+4,  17+4,
///                              0+8,  1+8,  16+8,  17+8,
///                              0+12, 1+12, 16+12, 17+12].
static Value createUnpackLoPd(ImplicitLocOpBuilder &b, Value v1, Value v2,
                              int numBits) {
  int numElem = numBits / 32;
  return b.create<vector::ShuffleOp>(
      v1, v2,
      getUnpackShufflePermFor128Lane({0, 1, numElem, numElem + 1}, numBits));
}

/// Lower to vector.shuffle on v1 and v2 with UnpackHiPd shuffle mask. For
/// example, if it is targeting 512 bit vector, returns
///   vector.shuffle, v1, v2, [2,    3,    18,    19,
///                            2+4,  3+4,  18+4,  19+4,
///                            2+8,  3+8,  18+8,  19+8,
///                            2+12, 3+12, 18+12, 19+12].
static Value createUnpackHiPd(ImplicitLocOpBuilder &b, Value v1, Value v2,
                              int numBits) {
  int numElem = numBits / 32;
  return b.create<vector::ShuffleOp>(
      v1, v2,
      getUnpackShufflePermFor128Lane({2, 3, numElem + 2, numElem + 3},
                                     numBits));
}

/// Lower to vector.shuffle on v1 and v2 with UnpackLoPs shuffle mask. For
/// example, if it is targeting 512 bit vector, returns
///   vector.shuffle, v1, v2, [0,    16,    1,    17,
///                            0+4,  16+4,  1+4,  17+4,
///                            0+8,  16+8,  1+8,  17+8,
///                            0+12, 16+12, 1+12, 17+12].
static Value createUnpackLoPs(ImplicitLocOpBuilder &b, Value v1, Value v2,
                              int numBits) {
  int numElem = numBits / 32;
  auto shuffle = b.create<vector::ShuffleOp>(
      v1, v2,
      getUnpackShufflePermFor128Lane({0, numElem, 1, numElem + 1}, numBits));
  return shuffle;
}

/// Lower to vector.shuffle on v1 and v2 with UnpackHiPs shuffle mask. For
/// example, if it is targeting 512 bit vector, returns
///   vector.shuffle, v1, v2, [2,    18,    3,    19,
///                            2+4,  18+4,  3+4,  19+4,
///                            2+8,  18+8,  3+8,  19+8,
///                            2+12, 18+12, 3+12, 19+12].
static Value createUnpackHiPs(ImplicitLocOpBuilder &b, Value v1, Value v2,
                              int numBits) {
  int numElem = numBits / 32;
  return b.create<vector::ShuffleOp>(
      v1, v2,
      getUnpackShufflePermFor128Lane({2, numElem + 2, 3, numElem + 3},
                                     numBits));
}

/// Returns a vector.shuffle that shuffles 128-bit lanes (composed of 4 32-bit
/// elements) selected by `mask` from `v1` and `v2`. I.e.,
///
///   DEFINE SELECT4(src, control) {
///     CASE(control[1:0]) OF
///       0: tmp[127:0] := src[127:0]
///       1: tmp[127:0] := src[255:128]
///       2: tmp[127:0] := src[383:256]
///       3: tmp[127:0] := src[511:384]
///     ESAC
///     RETURN tmp[127:0]
///   }
///   dst[127:0]   := SELECT4(v1[511:0], mask[1:0])
///   dst[255:128] := SELECT4(v1[511:0], mask[3:2])
///   dst[383:256] := SELECT4(v2[511:0], mask[5:4])
///   dst[511:384] := SELECT4(v2[511:0], mask[7:6])
static Value create4x128BitSuffle(ImplicitLocOpBuilder &b, Value v1, Value v2,
                                  uint8_t mask) {
  assert(v1.getType().cast<VectorType>().getShape()[0] == 16 &&
         "expected a vector with length=16");
  SmallVector<int64_t> shuffleMask;
  auto appendToMask = [&](int64_t base, uint8_t control) {
    switch (control) {
    case 0:
      llvm::append_range(shuffleMask, ArrayRef<int64_t>{base + 0, base + 1,
                                                        base + 2, base + 3});
      break;
    case 1:
      llvm::append_range(shuffleMask, ArrayRef<int64_t>{base + 4, base + 5,
                                                        base + 6, base + 7});
      break;
    case 2:
      llvm::append_range(shuffleMask, ArrayRef<int64_t>{base + 8, base + 9,
                                                        base + 10, base + 11});
      break;
    case 3:
      llvm::append_range(shuffleMask, ArrayRef<int64_t>{base + 12, base + 13,
                                                        base + 14, base + 15});
      break;
    default:
      llvm_unreachable("control > 3 : overflow");
    }
  };
  uint8_t b01 = mask & 0x3;
  uint8_t b23 = (mask >> 2) & 0x3;
  uint8_t b45 = (mask >> 4) & 0x3;
  uint8_t b67 = (mask >> 6) & 0x3;
  appendToMask(0, b01);
  appendToMask(0, b23);
  appendToMask(16, b45);
  appendToMask(16, b67);
  return b.create<vector::ShuffleOp>(v1, v2, shuffleMask);
}

/// Lowers the value to a vector.shuffle op. The `source` is expected to be a
/// 1-D vector and have `m`x`n` elements.
static Value transposeToShuffle1D(OpBuilder &b, Value source, int m, int n) {
  SmallVector<int64_t> mask;
  mask.reserve(m * n);
  for (int64_t j = 0; j < n; ++j)
    for (int64_t i = 0; i < m; ++i)
      mask.push_back(i * n + j);
  return b.create<vector::ShuffleOp>(source.getLoc(), source, source, mask);
}

/// Lowers the value to a sequence of vector.shuffle ops. The `source` is
/// expected to be a 16x16 vector.
static Value transposeToShuffle16x16(OpBuilder &builder, Value source, int m,
                                     int n) {
  ImplicitLocOpBuilder b(source.getLoc(), builder);
  SmallVector<Value> vs;
  for (int64_t i = 0; i < m; ++i)
    vs.push_back(b.create<vector::ExtractOp>(source, i));

  // Interleave 32-bit lanes using
  //   8x _mm512_unpacklo_epi32
  //   8x _mm512_unpackhi_epi32
  Value t0 = createUnpackLoPs(b, vs[0x0], vs[0x1], 512);
  Value t1 = createUnpackHiPs(b, vs[0x0], vs[0x1], 512);
  Value t2 = createUnpackLoPs(b, vs[0x2], vs[0x3], 512);
  Value t3 = createUnpackHiPs(b, vs[0x2], vs[0x3], 512);
  Value t4 = createUnpackLoPs(b, vs[0x4], vs[0x5], 512);
  Value t5 = createUnpackHiPs(b, vs[0x4], vs[0x5], 512);
  Value t6 = createUnpackLoPs(b, vs[0x6], vs[0x7], 512);
  Value t7 = createUnpackHiPs(b, vs[0x6], vs[0x7], 512);
  Value t8 = createUnpackLoPs(b, vs[0x8], vs[0x9], 512);
  Value t9 = createUnpackHiPs(b, vs[0x8], vs[0x9], 512);
  Value ta = createUnpackLoPs(b, vs[0xa], vs[0xb], 512);
  Value tb = createUnpackHiPs(b, vs[0xa], vs[0xb], 512);
  Value tc = createUnpackLoPs(b, vs[0xc], vs[0xd], 512);
  Value td = createUnpackHiPs(b, vs[0xc], vs[0xd], 512);
  Value te = createUnpackLoPs(b, vs[0xe], vs[0xf], 512);
  Value tf = createUnpackHiPs(b, vs[0xe], vs[0xf], 512);

  // Interleave 64-bit lanes using
  //   8x _mm512_unpacklo_epi64
  //   8x _mm512_unpackhi_epi64
  Value r0 = createUnpackLoPd(b, t0, t2, 512);
  Value r1 = createUnpackHiPd(b, t0, t2, 512);
  Value r2 = createUnpackLoPd(b, t1, t3, 512);
  Value r3 = createUnpackHiPd(b, t1, t3, 512);
  Value r4 = createUnpackLoPd(b, t4, t6, 512);
  Value r5 = createUnpackHiPd(b, t4, t6, 512);
  Value r6 = createUnpackLoPd(b, t5, t7, 512);
  Value r7 = createUnpackHiPd(b, t5, t7, 512);
  Value r8 = createUnpackLoPd(b, t8, ta, 512);
  Value r9 = createUnpackHiPd(b, t8, ta, 512);
  Value ra = createUnpackLoPd(b, t9, tb, 512);
  Value rb = createUnpackHiPd(b, t9, tb, 512);
  Value rc = createUnpackLoPd(b, tc, te, 512);
  Value rd = createUnpackHiPd(b, tc, te, 512);
		Value re = createUnpackLoPd(b, td, tf, 512);
		Value rf = createUnpackHiPd(b, td, tf, 512);

		// Permute 128-bit lanes using
		// 16x _mm512_shuffle_i32x4
		t0 = create4x128BitSuffle(b, r0, r4, 0x88);
		t1 = create4x128BitSuffle(b, r1, r5, 0x88);
		t2 = create4x128BitSuffle(b, r2, r6, 0x88);
		t3 = create4x128BitSuffle(b, r3, r7, 0x88);
		t4 = create4x128BitSuffle(b, r0, r4, 0xdd);
		t5 = create4x128BitSuffle(b, r1, r5, 0xdd);
		t6 = create4x128BitSuffle(b, r2, r6, 0xdd);
		t7 = create4x128BitSuffle(b, r3, r7, 0xdd);
		t8 = create4x128BitSuffle(b, r8, rc, 0x88);
		t9 = create4x128BitSuffle(b, r9, rd, 0x88);
		ta = create4x128BitSuffle(b, ra, re, 0x88);
		tb = create4x128BitSuffle(b, rb, rf, 0x88);
		tc = create4x128BitSuffle(b, r8, rc, 0xdd);
		td = create4x128BitSuffle(b, r9, rd, 0xdd);
		te = create4x128BitSuffle(b, ra, re, 0xdd);
		tf = create4x128BitSuffle(b, rb, rf, 0xdd);

		// Permute 256-bit lanes using again
		// 16x _mm512_shuffle_i32x4
		vs[0x0] = create4x128BitSuffle(b, t0, t8, 0x88);
		vs[0x1] = create4x128BitSuffle(b, t1, t9, 0x88);
		vs[0x2] = create4x128BitSuffle(b, t2, ta, 0x88);
		vs[0x3] = create4x128BitSuffle(b, t3, tb, 0x88);
		vs[0x4] = create4x128BitSuffle(b, t4, tc, 0x88);
		vs[0x5] = create4x128BitSuffle(b, t5, td, 0x88);
		vs[0x6] = create4x128BitSuffle(b, t6, te, 0x88);
		vs[0x7] = create4x128BitSuffle(b, t7, tf, 0x88);
		vs[0x8] = create4x128BitSuffle(b, t0, t8, 0xdd);
		vs[0x9] = create4x128BitSuffle(b, t1, t9, 0xdd);
		vs[0xa] = create4x128BitSuffle(b, t2, ta, 0xdd);
		vs[0xb] = create4x128BitSuffle(b, t3, tb, 0xdd);
		vs[0xc] = create4x128BitSuffle(b, t4, tc, 0xdd);
		vs[0xd] = create4x128BitSuffle(b, t5, td, 0xdd);
		vs[0xe] = create4x128BitSuffle(b, t6, te, 0xdd);
		vs[0xf] = create4x128BitSuffle(b, t7, tf, 0xdd);

		auto reshInputType = VectorType::get(
		{m, n}, source.getType().cast<VectorType>().getElementType());
		Value res =
		b.create<arith::ConstantOp>(reshInputType, b.getZeroAttr(reshInputType));
		for (int64_t i = 0; i < m; ++i)
		res = b.create<vector::InsertOp>(vs[i], res, i);
		return res;
		}
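
The four stages above can be sanity-checked outside MLIR with a small simulation. The sketch below is not part of the patch: it models `vector.shuffle` on plain `std::array` rows and hard-codes the six index patterns that appear in the CHECK lines of the `shuffle_16x16` lit test. The names `shuffle`, `transpose16x16ByShuffles`, and the `k...` constants are assumptions introduced for illustration.

```cpp
#include <array>
#include <cassert>

using Row = std::array<int, 16>;
using Mask = std::array<int, 16>;
using Matrix = std::array<Row, 16>;

// Two-input shuffle: index k < 16 reads a[k], index k >= 16 reads b[k - 16].
Row shuffle(const Row &a, const Row &b, const Mask &mask) {
  Row out{};
  for (int i = 0; i < 16; ++i) {
    assert(mask[i] >= 0 && mask[i] < 32 && "shuffle index out of range");
    out[i] = mask[i] < 16 ? a[mask[i]] : b[mask[i] - 16];
  }
  return out;
}

// Index patterns taken from the lit test CHECK lines.
constexpr Mask kUnpackLoPs = {0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29};
constexpr Mask kUnpackHiPs = {2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31};
constexpr Mask kUnpackLoPd = {0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29};
constexpr Mask kUnpackHiPd = {2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31};
constexpr Mask kShuffle0x88 = {0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27};
constexpr Mask kShuffle0xdd = {4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31};

Matrix transpose16x16ByShuffles(const Matrix &vs) {
  Matrix t{}, r{}, out{};
  // Stage 1: interleave 32-bit elements of adjacent row pairs.
  for (int k = 0; k < 16; k += 2) {
    t[k] = shuffle(vs[k], vs[k + 1], kUnpackLoPs);
    t[k + 1] = shuffle(vs[k], vs[k + 1], kUnpackHiPs);
  }
  // Stage 2: interleave 64-bit pairs within each group of four rows.
  for (int g = 0; g < 16; g += 4) {
    r[g + 0] = shuffle(t[g], t[g + 2], kUnpackLoPd);
    r[g + 1] = shuffle(t[g], t[g + 2], kUnpackHiPd);
    r[g + 2] = shuffle(t[g + 1], t[g + 3], kUnpackLoPd);
    r[g + 3] = shuffle(t[g + 1], t[g + 3], kUnpackHiPd);
  }
  // Stage 3: permute 128-bit lanes within each half of eight rows.
  for (int h = 0; h < 16; h += 8)
    for (int i = 0; i < 4; ++i) {
      t[h + i] = shuffle(r[h + i], r[h + 4 + i], kShuffle0x88);
      t[h + 4 + i] = shuffle(r[h + i], r[h + 4 + i], kShuffle0xdd);
    }
  // Stage 4: permute lanes across the two halves.
  for (int i = 0; i < 8; ++i) {
    out[i] = shuffle(t[i], t[8 + i], kShuffle0x88);
    out[8 + i] = shuffle(t[i], t[8 + i], kShuffle0xdd);
  }
  return out;
}
```

Feeding the simulation a matrix whose entry (i, j) is 16 * i + j should give back rows that start 0, 16, 32, 48, ..., matching the expected output of the integration test. The loops are just a compact restatement of the 64 explicit calls in the patch, so any divergence between the two would point at a transcription error in this sketch rather than in the lowering.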

namespace {		namespace {
/// Progressive lowering of TransposeOp.		/// Progressive lowering of TransposeOp.
/// One:		/// One:
/// %x = vector.transpose %y, [1, 0]		/// %x = vector.transpose %y, [1, 0]
/// is replaced by:		/// is replaced by:
/// %z = arith.constant dense<0.000000e+00>		/// %z = arith.constant dense<0.000000e+00>
/// %0 = vector.extract %y[0, 0]		/// %0 = vector.extract %y[0, 0]
/// %1 = vector.insert %0, %z [0, 0]		/// %1 = vector.insert %0, %z [0, 0]
[16 lines not shown]	LogicalResult matchAndRewrite(vector::TransposeOp op,
VectorType inputType = op.getSourceVectorType();		VectorType inputType = op.getSourceVectorType();
VectorType resType = op.getResultVectorType();		VectorType resType = op.getResultVectorType();

// Set up convenience transposition table.		// Set up convenience transposition table.
SmallVector<int64_t> transp;		SmallVector<int64_t> transp;
for (auto attr : op.getTransp())		for (auto attr : op.getTransp())
transp.push_back(attr.cast<IntegerAttr>().getInt());		transp.push_back(attr.cast<IntegerAttr>().getInt());

if (vectorTransformOptions.vectorTransposeLowering ==		if (isShuffleLike(vectorTransformOptions.vectorTransposeLowering) &&
vector::VectorTransposeLowering::Shuffle &&
resType.getRank() == 2 && transp[0] == 1 && transp[1] == 0)		resType.getRank() == 2 && transp[0] == 1 && transp[1] == 0)
return rewriter.notifyMatchFailure(		return rewriter.notifyMatchFailure(
op, "Options specifies lowering to shuffle");		op, "Options specifies lowering to shuffle");

// Handle a true 2-D matrix transpose differently when requested.		// Handle a true 2-D matrix transpose differently when requested.
if (vectorTransformOptions.vectorTransposeLowering ==		if (vectorTransformOptions.vectorTransposeLowering ==
vector::VectorTransposeLowering::Flat &&		vector::VectorTransposeLowering::Flat &&
resType.getRank() == 2 && transp[0] == 1 && transp[1] == 0) {		resType.getRank() == 2 && transp[0] == 1 && transp[1] == 0) {
[43 lines not shown]	LogicalResult matchAndRewrite(vector::TransposeOp op,
return success();		return success();
}		}

private:		private:
/// Options to control the vector patterns.		/// Options to control the vector patterns.
vector::VectorTransformsOptions vectorTransformOptions;		vector::VectorTransformsOptions vectorTransformOptions;
};		};

/// Rewrite a 2-D vector.transpose as a sequence of:		/// Rewrite a 2-D vector.transpose as a sequence of shuffle ops.
		/// If the strategy is Shuffle1D, it will be lowered to:
/// vector.shape_cast 2D -> 1D		/// vector.shape_cast 2D -> 1D
/// vector.shuffle		/// vector.shuffle
/// vector.shape_cast 1D -> 2D		/// vector.shape_cast 1D -> 2D
		/// If the strategy is Shuffle16x16, it will be lowered to a sequence of shuffle
		/// ops on 16xf32 vectors.
		dcaballe (Done): Add doc for the 16x16 side?
class TransposeOp2DToShuffleLowering		class TransposeOp2DToShuffleLowering
: public OpRewritePattern<vector::TransposeOp> {		: public OpRewritePattern<vector::TransposeOp> {
public:		public:
using OpRewritePattern::OpRewritePattern;		using OpRewritePattern::OpRewritePattern;

TransposeOp2DToShuffleLowering(		TransposeOp2DToShuffleLowering(
vector::VectorTransformsOptions vectorTransformOptions,		vector::VectorTransformsOptions vectorTransformOptions,
MLIRContext *context, PatternBenefit benefit = 1)		MLIRContext *context, PatternBenefit benefit = 1)
[9 lines not shown]	if (srcType.getRank() != 2)
return rewriter.notifyMatchFailure(op, "Not a 2D transpose");		return rewriter.notifyMatchFailure(op, "Not a 2D transpose");

SmallVector<int64_t> transp;		SmallVector<int64_t> transp;
for (auto attr : op.getTransp())		for (auto attr : op.getTransp())
transp.push_back(attr.cast<IntegerAttr>().getInt());		transp.push_back(attr.cast<IntegerAttr>().getInt());
if (transp[0] != 1 && transp[1] != 0)		if (transp[0] != 1 && transp[1] != 0)
return rewriter.notifyMatchFailure(op, "Not a 2D transpose permutation");		return rewriter.notifyMatchFailure(op, "Not a 2D transpose permutation");

if (vectorTransformOptions.vectorTransposeLowering !=		Value res;
VectorTransposeLowering::Shuffle)
return rewriter.notifyMatchFailure(op, "Options do not ask for Shuffle");

int64_t m = srcType.getShape().front(), n = srcType.getShape().back();		int64_t m = srcType.getShape().front(), n = srcType.getShape().back();
		switch (vectorTransformOptions.vectorTransposeLowering) {
		case VectorTransposeLowering::Shuffle1D: {
Value casted = rewriter.create<vector::ShapeCastOp>(		Value casted = rewriter.create<vector::ShapeCastOp>(
loc, VectorType::get({m * n}, srcType.getElementType()),		loc, VectorType::get({m * n}, srcType.getElementType()),
op.getVector());		op.getVector());
SmallVector<int64_t> mask;		res = transposeToShuffle1D(rewriter, casted, m, n);
mask.reserve(m * n);		break;
for (int64_t j = 0; j < n; ++j)		}
for (int64_t i = 0; i < m; ++i)		case VectorTransposeLowering::Shuffle16x16:
mask.push_back(i * n + j);		if (m != 16 || n != 16)
		return failure();
		res = transposeToShuffle16x16(rewriter, op.getVector(), m, n);
		break;
		case VectorTransposeLowering::EltWise:
		case VectorTransposeLowering::Flat:
		return failure();
		}

Value shuffled =
rewriter.create<vector::ShuffleOp>(loc, casted, casted, mask);
rewriter.replaceOpWithNewOp<vector::ShapeCastOp>(		rewriter.replaceOpWithNewOp<vector::ShapeCastOp>(
op, op.getResultVectorType(), shuffled);		op, op.getResultVectorType(), res);

return success();		return success();
}		}

private:		private:
/// Options to control the vector patterns.		/// Options to control the vector patterns.
vector::VectorTransformsOptions vectorTransformOptions;		vector::VectorTransformsOptions vectorTransformOptions;
};		};
} // namespace		} // namespace

void mlir::vector::populateVectorTransposeLoweringPatterns(		void mlir::vector::populateVectorTransposeLoweringPatterns(
RewritePatternSet &patterns, VectorTransformsOptions options,		RewritePatternSet &patterns, VectorTransformsOptions options,
PatternBenefit benefit) {		PatternBenefit benefit) {
patterns.add<TransposeOpLowering, TransposeOp2DToShuffleLowering>(		patterns.add<TransposeOpLowering, TransposeOp2DToShuffleLowering>(
options, patterns.getContext(), benefit);		options, patterns.getContext(), benefit);
}		}

mlir/test/Dialect/LLVM/transform-e2e.mlir

[48 lines not shown]	^bb1(%module_op: !pdl.operation):
%func_6 = transform.vector.lower_transfer %func_5		%func_6 = transform.vector.lower_transfer %func_5
max_transfer_rank = 1		max_transfer_rank = 1
: (!pdl.operation) -> !pdl.operation		: (!pdl.operation) -> !pdl.operation

%func_7 = transform.vector.lower_shape_cast %func_6		%func_7 = transform.vector.lower_shape_cast %func_6
: (!pdl.operation) -> !pdl.operation		: (!pdl.operation) -> !pdl.operation

%func_8 = transform.vector.lower_transpose %func_7		%func_8 = transform.vector.lower_transpose %func_7
lowering_strategy = "shuffle"		lowering_strategy = "shuffle_1d"
: (!pdl.operation) -> !pdl.operation		: (!pdl.operation) -> !pdl.operation
}		}

mlir/test/Dialect/Vector/transform-vector.mlir

[51 lines not shown]	^bb1(%module_op: !pdl.operation):
%func_6 = transform.vector.lower_transfer %func_5		%func_6 = transform.vector.lower_transfer %func_5
max_transfer_rank = 1		max_transfer_rank = 1
: (!pdl.operation) -> !pdl.operation		: (!pdl.operation) -> !pdl.operation

%func_7 = transform.vector.lower_shape_cast %func_6		%func_7 = transform.vector.lower_shape_cast %func_6
: (!pdl.operation) -> !pdl.operation		: (!pdl.operation) -> !pdl.operation

%func_8 = transform.vector.lower_transpose %func_7		%func_8 = transform.vector.lower_transpose %func_7
lowering_strategy = "shuffle"		lowering_strategy = "shuffle_1d"
: (!pdl.operation) -> !pdl.operation		: (!pdl.operation) -> !pdl.operation
}		}

mlir/test/Dialect/Vector/vector-transpose-lowering.mlir

[94 lines not shown]	func.func @transpose(%arg0: vector<2x4xf32>) -> vector<4x2xf32> {
%0 = vector.transpose %arg0, [1, 0] : vector<2x4xf32> to vector<4x2xf32>		%0 = vector.transpose %arg0, [1, 0] : vector<2x4xf32> to vector<4x2xf32>
return %0 : vector<4x2xf32>		return %0 : vector<4x2xf32>
}		}


transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb1(%module_op: !pdl.operation):		^bb1(%module_op: !pdl.operation):
transform.vector.lower_transpose %module_op		transform.vector.lower_transpose %module_op
lowering_strategy = "shuffle"		lowering_strategy = "shuffle_1d"
: (!pdl.operation) -> !pdl.operation		: (!pdl.operation) -> !pdl.operation
}		}

// -----		// -----

// CHECK-LABEL: func @transpose(		// CHECK-LABEL: func @transpose(
func.func @transpose(%arg0: vector<2x4xf32>) -> vector<4x2xf32> {		func.func @transpose(%arg0: vector<2x4xf32>) -> vector<4x2xf32> {
// CHECK: vector.shape_cast {{.*}} : vector<2x4xf32> to vector<8xf32>		// CHECK: vector.shape_cast {{.*}} : vector<2x4xf32> to vector<8xf32>
[492 lines not shown]
}		}

transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb1(%module_op: !pdl.operation):		^bb1(%module_op: !pdl.operation):
transform.vector.lower_transpose %module_op		transform.vector.lower_transpose %module_op
avx2_lowering_strategy = true		avx2_lowering_strategy = true
: (!pdl.operation) -> !pdl.operation		: (!pdl.operation) -> !pdl.operation
}		}

		// -----

		func.func @transpose_shuffle16x16xf32(%arg0: vector<16x16xf32>) -> vector<16x16xf32> {
		dcaballe (Done): transpose_16x16xf32 -> transpose_shuffle16x16xf32?
		// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
		dcaballe (Done): Really cool that we don't have to use ASM for this pattern. That makes it retargetable to AVX2 and also to other targets!
		nicolasvasilache (Not Done): The only thing that uses asm atm is the avx2 vblendps-based lowering that I wasn't able to get through LLVM in any other way than inline asm (as per the discussion I posted). If we have a good reliable way of generating the vblendps instructions without asm that would be great. I do not see any blend instruction in the gist though: https://gist.githubusercontent.com/hanhanW/c5fefa20151c27da113181e6748697a3/raw Re "retargetable", I don't see it; isn't this implementation quite specific to AVX512 where we want to really be careful about crossing 128b boundaries? Isn't this also at risk of spilling severely on architectures with smaller vector sizes? I see this much more naturally living under x86vector, under an avx512 namespace.
		dcaballe (Not Done): Interleaving/deinterleaving ops are common shuffle ops in most architectures and shuffling across >128-bit lanes are also common limitations (see, for example, slide 13 here: https://www.stonybrook.edu/commcms/ookami/support/_docs/ARM_SVE_tutorial.pdf). This pattern won't be a perfect fit for, let's say SVE and RISC-V but I would expect certain level of applicability, esp. if we compare it with a scalar transfer ops or a single giant shuffle. Totally agree, though, that we should refactor the common components instead of reimplementing them. That would be a great follow-up.
		// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		nicolasvasilache (Not Done): CHECK-COUNT where possible plz
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
		%0 = vector.transpose %arg0, [1, 0] : vector<16x16xf32> to vector<16x16xf32>
		return %0 : vector<16x16xf32>
		}

		transform.sequence failures(propagate) {
		^bb1(%module_op: !pdl.operation):
		transform.vector.lower_transpose %module_op
		lowering_strategy = "shuffle_16x16"
		: (!pdl.operation) -> !pdl.operation
		}

mlir/test/Integration/Dialect/Vector/CPU/test-shuffle16x16.mlir

This file was added.

				// RUN: mlir-opt %s -convert-scf-to-cf \
				// RUN: -test-transform-dialect-interpreter \
				// RUN: -test-transform-dialect-erase-schedule \
				// RUN: -convert-vector-to-llvm -convert-func-to-llvm -reconcile-unrealized-casts | \
				// RUN: mlir-cpu-runner -e entry -entry-point-result=void \
				// RUN: -shared-libs=%mlir_c_runner_utils | \
				// RUN: FileCheck %s

				func.func @entry() {
				%in = arith.constant dense<[[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0], [16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0], [32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0], [48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0], [64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0], [80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 94.0, 95.0], [96.0, 97.0, 98.0, 99.0, 100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0, 110.0, 111.0], [112.0, 113.0, 114.0, 115.0, 116.0, 117.0, 118.0, 119.0, 120.0, 121.0, 122.0, 123.0, 124.0, 125.0, 126.0, 127.0], [128.0, 129.0, 130.0, 131.0, 132.0, 133.0, 134.0, 135.0, 136.0, 137.0, 138.0, 139.0, 140.0, 141.0, 142.0, 143.0], [144.0, 145.0, 146.0, 147.0, 148.0, 149.0, 150.0, 151.0, 152.0, 153.0, 154.0, 155.0, 156.0, 157.0, 158.0, 159.0], [160.0, 161.0, 162.0, 163.0, 164.0, 165.0, 166.0, 167.0, 168.0, 169.0, 170.0, 171.0, 172.0, 173.0, 174.0, 175.0], [176.0, 177.0, 178.0, 179.0, 180.0, 181.0, 182.0, 183.0, 184.0, 185.0, 186.0, 187.0, 188.0, 189.0, 190.0, 191.0], [192.0, 193.0, 194.0, 195.0, 196.0, 197.0, 198.0, 199.0, 200.0, 201.0, 202.0, 203.0, 204.0, 205.0, 206.0, 207.0], [208.0, 209.0, 210.0, 211.0, 212.0, 213.0, 214.0, 215.0, 216.0, 217.0, 218.0, 219.0, 220.0, 221.0, 222.0, 223.0], [224.0, 225.0, 226.0, 227.0, 228.0, 229.0, 230.0, 231.0, 232.0, 233.0, 234.0, 235.0, 236.0, 237.0, 238.0, 239.0], [240.0, 241.0, 242.0, 243.0, 244.0, 245.0, 246.0, 247.0, 248.0, 249.0, 250.0, 251.0, 252.0, 253.0, 254.0, 255.0]]> : vector<16x16xf32>
				%0 = vector.transpose %in, [1, 0] : vector<16x16xf32> to vector<16x16xf32>
				vector.print %0 : vector<16x16xf32>
				// CHECK: ( ( 0, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240 ),
				// CHECK-SAME: ( 1, 17, 33, 49, 65, 81, 97, 113, 129, 145, 161, 177, 193, 209, 225, 241 ),
				// CHECK-SAME: ( 2, 18, 34, 50, 66, 82, 98, 114, 130, 146, 162, 178, 194, 210, 226, 242 ),
				// CHECK-SAME: ( 3, 19, 35, 51, 67, 83, 99, 115, 131, 147, 163, 179, 195, 211, 227, 243 ),
				// CHECK-SAME: ( 4, 20, 36, 52, 68, 84, 100, 116, 132, 148, 164, 180, 196, 212, 228, 244 ),
				// CHECK-SAME: ( 5, 21, 37, 53, 69, 85, 101, 117, 133, 149, 165, 181, 197, 213, 229, 245 ),
				// CHECK-SAME: ( 6, 22, 38, 54, 70, 86, 102, 118, 134, 150, 166, 182, 198, 214, 230, 246 ),
				// CHECK-SAME: ( 7, 23, 39, 55, 71, 87, 103, 119, 135, 151, 167, 183, 199, 215, 231, 247 ),
				// CHECK-SAME: ( 8, 24, 40, 56, 72, 88, 104, 120, 136, 152, 168, 184, 200, 216, 232, 248 ),
				// CHECK-SAME: ( 9, 25, 41, 57, 73, 89, 105, 121, 137, 153, 169, 185, 201, 217, 233, 249 ),
				// CHECK-SAME: ( 10, 26, 42, 58, 74, 90, 106, 122, 138, 154, 170, 186, 202, 218, 234, 250 ),
				// CHECK-SAME: ( 11, 27, 43, 59, 75, 91, 107, 123, 139, 155, 171, 187, 203, 219, 235, 251 ),
				// CHECK-SAME: ( 12, 28, 44, 60, 76, 92, 108, 124, 140, 156, 172, 188, 204, 220, 236, 252 ),
				// CHECK-SAME: ( 13, 29, 45, 61, 77, 93, 109, 125, 141, 157, 173, 189, 205, 221, 237, 253 ),
				// CHECK-SAME: ( 14, 30, 46, 62, 78, 94, 110, 126, 142, 158, 174, 190, 206, 222, 238, 254 ),
				// CHECK-SAME: ( 15, 31, 47, 63, 79, 95, 111, 127, 143, 159, 175, 191, 207, 223, 239, 255 ) )
				return
				}

				transform.sequence failures(propagate) {
				^bb1(%module_op: !pdl.operation):
				transform.vector.lower_transpose %module_op
				lowering_strategy = "shuffle_16x16"
				: (!pdl.operation) -> !pdl.operation
				}

Revision Contents

Diff 516189
