This is an archive of the discontinued LLVM Phabricator instance.

Differential D148685

[mlir][Vector] Add 16x16 strategy to vector.transpose lowering.
Closed · Public

Authored by hanchung on Apr 18 2023, 11:18 PM.

Details

Reviewers

nicolasvasilache
aartbik
springerm
dcaballe
ftynse
Benoit
mravishankar

Commits

rG8d163e504507: [mlir][Vector] Add 16x16 strategy to vector.transpose lowering.

Summary

It adds a shuffle_16x16 strategy to LowerVectorTranspose and renames shuffle to shuffle_1d. The idea is similar to the 8x8 cases in x86Vector::avx2. The general algorithm is:

interleave 32-bit lanes using
    8x _mm512_unpacklo_epi32
    8x _mm512_unpackhi_epi32
interleave 64-bit lanes using
    8x _mm512_unpacklo_epi64
    8x _mm512_unpackhi_epi64
permute 128-bit lanes using
   16x _mm512_shuffle_i32x4
permute 256-bit lanes, again using
   16x _mm512_shuffle_i32x4

After the first stage, the elements are arranged as:

 0  16   1  17   4  20   5  21   8  24   9  25  12  28  13  29
 2  18   3  19   6  22   7  23  10  26  11  27  14  30  15  31
32  48  33  49 ...
34  50  35  51 ...
64  80  65  81 ...
...

After the second stage, the elements are arranged as:

 0  16  32  48 ...
 1  17  33  49 ...
 2  18  34  50 ...
 3  19  35  51 ...
64  80  96 112 ...
65  81  97 113 ...
66  82  98 114 ...
67  83  99 115 ...
...

After the third stage, the elements are arranged as:

  0  16  32  48   8  24  40  56  64  80  96  112 ...
  1  17  33  49 ...
  2  18  34  50 ...
  3  19  35  51 ...
  4  20  36  52 ...
  5  21  37  53 ...
  6  22  38  54 ...
  7  23  39  55 ...
128 144 160 176 ...
129 145 161 177 ...
...

After the last stage, the elements are arranged as:

 0  16  32  48  64  80  96 112 ... 240
 1  17  33  49  65  81  97 113 ... 241
 2  18  34  50  66  82  98 114 ... 242
...
15  31  47  63  79  95 111 127 ... 255
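
The four stages can be checked with a small scalar model. The sketch below is not the MLIR implementation: `unpack32`, `unpack64`, `shuffleLanes` and `transpose16x16` are hypothetical helpers that mimic the named intrinsics on arrays of 16 integers, and the row pairing used in each stage is an assumption chosen to reproduce the intermediate layouts listed above. The 0x88 and 0xdd immediates are the ones reported later in this thread.

```cpp
#include <array>
#include <cassert>

using Row = std::array<int, 16>;     // one 512-bit register of 32-bit elements
using Matrix = std::array<Row, 16>;  // sixteen such registers

// Models _mm512_unpacklo_epi32 / _mm512_unpackhi_epi32: within each 128-bit
// lane, interleave the low (or high) two 32-bit elements of a and b.
Row unpack32(const Row &a, const Row &b, bool high) {
  Row r{};
  for (int lane = 0; lane < 4; ++lane) {
    int base = lane * 4 + (high ? 2 : 0);
    r[lane * 4 + 0] = a[base];
    r[lane * 4 + 1] = b[base];
    r[lane * 4 + 2] = a[base + 1];
    r[lane * 4 + 3] = b[base + 1];
  }
  return r;
}

// Models _mm512_unpacklo_epi64 / _mm512_unpackhi_epi64: within each 128-bit
// lane, take the low (or high) 64 bits of a followed by those of b.
Row unpack64(const Row &a, const Row &b, bool high) {
  Row r{};
  for (int lane = 0; lane < 4; ++lane) {
    int base = lane * 4 + (high ? 2 : 0);
    r[lane * 4 + 0] = a[base];
    r[lane * 4 + 1] = a[base + 1];
    r[lane * 4 + 2] = b[base];
    r[lane * 4 + 3] = b[base + 1];
  }
  return r;
}

// Models _mm512_shuffle_i32x4(a, b, imm): result lanes 0 and 1 come from a,
// lanes 2 and 3 from b, each picked by a 2-bit field of imm.
Row shuffleLanes(const Row &a, const Row &b, unsigned imm) {
  assert(imm <= 0xffu && "immediate must fit in 8 bits");
  Row r{};
  for (int lane = 0; lane < 4; ++lane) {
    const Row &src = lane < 2 ? a : b;
    int sel = static_cast<int>((imm >> (2 * lane)) & 3u);
    for (int k = 0; k < 4; ++k)
      r[lane * 4 + k] = src[sel * 4 + k];
  }
  return r;
}

Matrix transpose16x16(const Matrix &m) {
  Matrix t{}, u{}, v{}, w{};
  // Stage 1: interleave 32-bit elements of adjacent row pairs.
  for (int i = 0; i < 16; i += 2) {
    t[i] = unpack32(m[i], m[i + 1], false);
    t[i + 1] = unpack32(m[i], m[i + 1], true);
  }
  // Stage 2: interleave 64-bit pairs within groups of four rows.
  for (int i = 0; i < 16; i += 4) {
    u[i] = unpack64(t[i], t[i + 2], false);
    u[i + 1] = unpack64(t[i], t[i + 2], true);
    u[i + 2] = unpack64(t[i + 1], t[i + 3], false);
    u[i + 3] = unpack64(t[i + 1], t[i + 3], true);
  }
  // Stage 3: permute 128-bit lanes within groups of eight rows.
  for (int i = 0; i < 16; i += 8)
    for (int j = 0; j < 4; ++j) {
      v[i + j] = shuffleLanes(u[i + j], u[i + j + 4], 0x88);
      v[i + j + 4] = shuffleLanes(u[i + j], u[i + j + 4], 0xdd);
    }
  // Stage 4: permute 128-bit lanes again, now across the two 8-row halves.
  for (int j = 0; j < 8; ++j) {
    w[j] = shuffleLanes(v[j], v[j + 8], 0x88);
    w[j + 8] = shuffleLanes(v[j], v[j + 8], 0xdd);
  }
  return w;
}
```

Filling the input with `m[i][j] = 16 * i + j` and calling `transpose16x16(m)` should give rows of the form `j, j + 16, j + 32, ...`, which is the final layout shown in the summary.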

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hanchung created this revision. Apr 18 2023, 11:18 PM

Herald added a project: Restricted Project. Apr 18 2023, 11:18 PM

Herald added subscribers: bviyer, Moerafaat, zero9178 and 23 others.

hanchung requested review of this revision. Apr 18 2023, 11:18 PM

Herald added a subscriber: stephenneuendorffer.

I'm quite new to the x86 transpose lowering area. After exploring papers and other resources, I found that the 8x8 case is similar to the idea in https://stackoverflow.com/questions/29519222/how-to-transpose-a-16x16-matrix-using-simd-instructions, so I implemented the 16x16 lowering mostly based on it. It passes one integration test in IREE, which is an MLIR downstream project; I can try enabling it in IREE by default after landing the revision. I have also observed some perf improvements in my recent tensor.pack/linalg.transpose codegen.

hanchung added a reviewer: mravishankar. Apr 18 2023, 11:23 PM

Harbormaster completed remote builds in B226518: Diff 514828. Apr 18 2023, 11:35 PM

goldstein.w.n added a subscriber: goldstein.w.n. Apr 19 2023, 12:16 AM

goldstein.w.n added inline comments.

mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp
50	It might pay to make the number of elements an argument (or template parameter) and then reuse this for the `mm256*` variants.
82	static?

Super excited to see this! Awesome!

A few requests:

This pattern doesn't generate ASM blocks so it's really not AVX512 specific. That's great because it can be retargeted to any ISA, including AVX2 or even ARM or RISC-V. Could you please move this pattern to LowerVectorTranspose.cpp?
Also in this regard, we have a shuffle based lowering (TransposeOp2DToShuffleLowering, IIRC) for transposes. I think it only generates a single giant shuffle that is then split by the LLVM backend. Have you tried to enable that option for a 16x16 transpose and see if the generated assembly is the same as the one from this patch when targeting AVX512?
We would also need an integration test for this lowering as shuffles are very sensitive to correctness issues.

mlir/test/Dialect/Vector/vector-transpose-lowering.mlir
616	Really cool that we don't have to use ASM for this pattern. That makes it retargetable to AVX2 and also to other targets!

This revision now requires changes to proceed. Apr 19 2023, 10:24 AM

> This pattern doesn't generate ASM blocks so it's really not AVX512 specific. That's great because it can be retargeted to any ISA, including AVX2 or even ARM or RISC-V. Could you please move this pattern to LowerVectorTranspose.cpp?

I'm not sure whether moving it to LowerVectorTranspose is a good idea. Here are two points that I can think of.

The 4x8 lowering is not AVX specific either. Maybe we should move it to LowerVectorTranspose too? I added it to this file because I'd like to keep this category of lowering in the same place.
Some utils target Intel intrinsics, e.g., mm512UnpackLoPd: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_unpacklo_epi64&expand=6087. Are they applicable to other ISAs? What would be good names for those utils? If they are AVX specific, maybe we should keep them in this file?

> Also in this regard, we have a shuffle based lowering (TransposeOp2DToShuffleLowering, IIRC) for transposes. I think it only generates a single giant shuffle that is then split by the LLVM backend. Have you tried to enable that option for a 16x16 transpose and see if the generated assembly is the same as the one from this patch when targeting AVX512?

I tried a single giant shuffle, and it does not generate the same code. Performance drops significantly compared to this approach. I think the reason is that the shuffle ops in this approach can be mapped to target instructions.

> We would also need an integration test for this lowering as shuffles are very sensitive to correctness issues.

Totally agree! Can you send me a pointer for doing it in the MLIR repo?

> Some utils are targeting Intel intrinsics, e.g., mm512UnpackLoPd

This is unfortunately not true in clang / llvm, see:
https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237

Do we have good confidence that the assembly generated uses the right
instruction mix beyond just getting better perf?
In my experience there can still be quite a lot left on the table.
Can we get a test that ensures the assembly generated is what we expect?

In D148685#4281659, @nicolasvasilache wrote:
> > Some utils are targeting Intel intrinsics, e.g., mm512UnpackLoPd
>
> This is unfortunately not true in clang / llvm, see:
> https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237
>
> Do we have good confidence that the assembly generated uses the right
> instruction mix beyond just getting better perf?
> In my experience there can still be quite a lot left on the table.
> Can we get a test that ensures the assembly generated is what we expect?

The shuffle patterns proposed here are generally about as good as they get
for any x86 ISA I know of.

They minimize cross-lane shuffles and rely on repeated in-lane shuffles. Those
two traits are, for the most part, exactly what the ISA is best suited for.
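
The in-lane versus cross-lane distinction can be made concrete with a small check on a shuffle mask. This is only an illustrative sketch: `isInLaneShuffle` is a hypothetical helper, and it assumes a two-operand shuffle whose mask indexes into the concatenation of both operands.

```cpp
#include <cassert>
#include <vector>

// Returns true if every result element is read from the same 128-bit lane
// position it is written to, in either operand. `elemsPerLane` is 4 for
// 32-bit elements.
bool isInLaneShuffle(const std::vector<int> &mask, int elemsPerLane) {
  assert(elemsPerLane > 0 && "a lane must hold at least one element");
  const int numElems = static_cast<int>(mask.size());
  for (int r = 0; r < numElems; ++r) {
    int src = mask[r] % numElems;  // position inside whichever operand is read
    if (src / elemsPerLane != r / elemsPerLane)
      return false;
  }
  return true;
}
```

Under this definition the unpack masks of the first two stages are in-lane, while the 128-bit lane permutes of the last two stages are the only cross-lane steps.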

In D148685#4281659, @nicolasvasilache wrote:
> > Some utils are targeting Intel intrinsics, e.g., mm512UnpackLoPd
>
> This is unfortunately not true in clang / llvm, see:
> https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237
>
> Do we have good confidence that the assembly generated uses the right
> instruction mix beyond just getting better perf?
> In my experience there can still be quite a lot left on the table.
> Can we get a test that ensures the assembly generated is what we expect?

Here is the ASM dump from IREE: https://gist.githubusercontent.com/hanhanW/c5fefa20151c27da113181e6748697a3/raw

As expected, there are 8x (vunpcklps, vunpckhps) pairs, 8x (vunpcklpd, vunpckhpd) pairs, and 32x vshuff64x2 in the dump. The vshuff64x2 masks are all 0x88 and 0xdd, the same as in the implementation.
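
To read those two immediates: for the 128-bit lane shuffles, each 2-bit field of the immediate selects one source lane, with the two low fields reading the first operand and the two high fields reading the second. A minimal decoder, with `decodeLaneSelectors` as a hypothetical name:

```cpp
#include <array>
#include <cassert>

// Splits an 8-bit shuffle immediate into four 2-bit lane selectors.
// Selectors 0 and 1 index lanes of the first operand, 2 and 3 index lanes
// of the second operand.
std::array<int, 4> decodeLaneSelectors(unsigned imm) {
  assert(imm <= 0xffu && "immediate must fit in 8 bits");
  std::array<int, 4> sel{};
  for (int lane = 0; lane < 4; ++lane)
    sel[lane] = static_cast<int>((imm >> (2 * lane)) & 3u);
  return sel;
}
```

So 0x88 gathers the even lanes (0 and 2) of both operands and 0xdd gathers the odd lanes (1 and 3), which is the split the last two stages of the algorithm rely on.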

> This is unfortunately not true in clang / llvm, see:
> https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237

> The 4x8 lowering is not AVX specific either. We maybe should move it to LowerVectorTranspose too? The reason of adding it to the file is that I'd like to keep this category lowering in the same place.

Yep, different targets will lower these shuffles differently and even x86 will lower in its own way in some cases :) That's what led to writing asm versions in some cases.
I think we placed everything here and used x86 specific names because that's where the experiment started, but we should move generic patterns to a more generic place to make sure that people don't reinvent the wheel.

> Some utils are targeting Intel intrinsics, e.g., mm512UnpackLoPd : https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_unpacklo_epi64&expand=6087. Is it applicable to other ISA? What would be the good naming for those utils? If they are AVX specific, maybe we should keep them in this file?

x86 unpack and pack are vector interleaving/deinterleaving instructions. Some targets have them. Unpack and pack sounds reasonable to me, though.

> Do we have good confidence that the assembly generated uses the right
> instruction mix beyond just getting better perf?
> In my experience there can still be quite a lot left on the table.
> Can we get a test that ensures the assembly generated is what we expect?

I assumed this was the case :). That's the first thing we have to figure out.

OTR, if blends are needed, we should also consider (not sure if Nicolas already tried it):

> The only thing I can think of is you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPACK*PS and the SHUFPS/BLENDPS:
>
> 8 x UNPCKLPS/UNPCKHPS
> 4 x INSERTF128
> 4 x PERM2F128
> 4 x SHUFPS
> 8 x BLENDPS

> I tried a single giant shuffle, and they do not generate same code. The performance drops significantly comparing to this approach. I think the reason is that the shuffle ops can be mapped to target instructions in this approach.

The giant shuffle op should also be split and mapped to target shuffle instructions but probably a different sequence.

> Totally agree! Can you send me a pointer for doing it in MLIR repo?

https://github.com/llvm/llvm-project/tree/main/mlir/test/Integration/Dialect/Vector/CPU

In D148685#4281722, @dcaballe wrote:
> > This is unfortunately not true in clang / llvm, see:
> > https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237
>
> > The 4x8 lowering is not AVX specific either. We maybe should move it to LowerVectorTranspose too? The reason of adding it to the file is that I'd like to keep this category lowering in the same place.
>
> Yep, different targets will lower these shuffles differently and even x86 will lower in its own way in some cases :) That's what led to writing asm versions in some cases.
> I think we placed everything here and used x86 specific names because that's where the experiment started, but we should move generic patterns to a more generic place to make sure that people don't reinvent the wheel.
>
> > Some utils are targeting Intel intrinsics, e.g., mm512UnpackLoPd : https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_unpacklo_epi64&expand=6087. Is it applicable to other ISA? What would be the good naming for those utils? If they are AVX specific, maybe we should keep them in this file?
>
> x86 unpack and pack are vector interleaving/deinterleaving instructions. Some targets have them. Unpack and pack sounds reasonable to me, though.
>
> > Do we have good confidence that the assembly generated uses the right
> > instruction mix beyond just getting better perf?
> > In my experience there can still be quite a lot left on the table.
> > Can we get a test that ensures the assembly generated is what we expect?
>
> I assumed this was the case :). That's the first thing we have to figure out.
>
> OTR, if blends are needed, we should also consider (not sure if Nicolas already tried it):
> The only thing I can think of is you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPACK*PS and the SHUFPS/BLENDPS:
>
> 8 x UNPCKLPS/UNPCKHPS
> 4 x INSERTF128
> 4 x PERM2F128

I think you would actually want the insert/perm2f128 before the unpck. Some x86 targets have a quirk where insert has better
throughput when micro-fused with loads. It will be easier to detect and optimize in codegen if the inputs to the transpose (potentially
memory) have the insertf128 pattern as their first shuffle.

> 4 x SHUFPS
> 8 x BLENDPS
>
> > I tried a single giant shuffle, and they do not generate same code. The performance drops significantly comparing to this approach. I think the reason is that the shuffle ops can be mapped to target instructions in this approach.
>
> The giant shuffle op should also be split and mapped to target shuffle instructions but probably a different sequence.
>
> > Totally agree! Can you send me a pointer for doing it in MLIR repo?
>
> https://github.com/llvm/llvm-project/tree/main/mlir/test/Integration/Dialect/Vector/CPU

> Here is the ASM dump from IREE: https://gist.githubusercontent.com/hanhanW/c5fefa20151c27da113181e6748697a3/raw
>
> As expected, there are 8x (vunpcklps, vunpckhps) pairs, 8x (vunpcklpd, vunpckhpd) pairs, and 32x vshuff64x2 in the dump. The vshuff64x2 masks are all 0x88 and 0xdd, the same as in the implementation.

We should be able to replace the vunpck*pd ones with a combination of shufps + blends that should be faster, in theory. That's what led to using asm in the past but perhaps we should give the shuffle + reordering approach a try before doing the same:

> The only thing I can think of is you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPACK*PS and the SHUFPS/BLENDPS:
>
> 8 x UNPCKLPS/UNPCKHPS
> 4 x INSERTF128
> 4 x PERM2F128
> 4 x SHUFPS
> 8 x BLENDPS

In D148685#4281659, @nicolasvasilache wrote:

> Can we get a test that ensures the assembly generated is what we expect?

This is historically an anti-pattern in LLVM; clang does not have any such tests, for example.
IIRC the rationale is that such tests are fragile and put the burden of maintenance on possibly unrelated parts of the project (that is, any pass in the middle end or backend that could break this), and that they over-constrain compared with setting up benchmarks (we shouldn't care about the actual assembly, only about the perf).

Move the implementation to LowerVectorTranspose.cpp
Add a Shuffle16x16 strategy
Rename Shuffle strategy to Shuffle1D
Add an e2e integration test

Herald added a subscriber: awarzynski. Apr 20 2023, 5:22 PM

@dcaballe please take another look, thank you!

mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp
50	SG, I added the method to LowerVectorTranspose.cpp. I'll prepare another PR that moves the 4x8 case to LowerVectorTranspose.cpp and reuses the method for `mm256*`.

clean transposeToShuffle1D logic

hanchung updated this revision to Diff 515546. Apr 20 2023, 5:26 PM

This comment was removed by hanchung.

Harbormaster completed remote builds in B227031: Diff 515546. Apr 20 2023, 5:42 PM

hanchung edited the summary of this revision. Apr 20 2023, 5:53 PM

Herald added a subscriber: pengfei. Apr 20 2023, 5:53 PM

LGTM. Mostly doc and minor comments. Happy to take a look again before landing.

mlir/lib/Dialect/Vector/Transforms/LowerVectorTranspose.cpp
55 ↗	(On Diff #515546)	option -> lowering option or lowering approach? vector.shuffle -> vector shuffle based?
61 ↗	(On Diff #515546)	Can you elaborate a bit more in the doc. I'm not sure I get what `vals` is.
67 ↗	(On Diff #515546)	Add message to asserts
134 ↗	(On Diff #515546)	128-bits -> 128-bit lanes
150 ↗	(On Diff #515546)	mm512 is x86 specific. Perhaps we can call this `create4x128BitSuffle` or something like that. This should be able to handle any element type, right?
156 ↗	(On Diff #515546)	switch?
182 ↗	(On Diff #515546)	doc
191 ↗	(On Diff #515546)	doc. Should this be transposeToShuffle16x16, following the naming rule above?
380 ↗	(On Diff #515546)	Add doc for the 16x16 side?
mlir/test/Dialect/Vector/vector-transpose-lowering.mlir
615	transpose_16x16xf32 -> transpose_shuffle16x16xf32?
mlir/test/Integration/Dialect/Vector/CPU/test-transpose-16x16.mlir
4 ↗	(On Diff #515546)	file name: -> shuffle16x16?
27 ↗	(On Diff #515546)	Awesome! I think a correctness test should be enough and would align with Mehdi's comment. Thanks!

This revision is now accepted and ready to land. Apr 21 2023, 11:33 AM

Improve docs and address comments

LGTM! Thanks for addressing the feedback.

mlir/lib/Dialect/Vector/Transforms/LowerVectorTranspose.cpp
66 ↗	(On Diff #515884)	some typos above

Harbormaster completed remote builds in B227298: Diff 515884. Apr 21 2023, 1:57 PM

fix typo and rebase

This revision was landed with ongoing or failed builds. Apr 23 2023, 11:06 AM

Closed by commit rG8d163e504507: [mlir][Vector] Add 16x16 strategy to vector.transpose lowering. (authored by hanchung).

This revision was automatically updated to reflect the committed changes.

hanchung added a commit: rG8d163e504507: [mlir][Vector] Add 16x16 strategy to vector.transpose lowering.

In D148685#4282679, @mehdi_amini wrote:

> In D148685#4281659, @nicolasvasilache wrote:
>
> > Can we get a test that ensures the assembly generated is what we expect?
>
> This is historically an anti-pattern in LLVM; clang does not have any such tests, for example.
> IIRC the rationale is that such tests are fragile and put the burden of maintenance on possibly unrelated parts of the project (that is, any pass in the middle end or backend that could break this), and that they over-constrain compared with setting up benchmarks (we shouldn't care about the actual assembly, only about the perf).

Ah yes, forgot about this, good point; this is also why I didn't add such tests in the past.

> It adds a shuffle_16x16 strategy LowerVectorTranspose and renames shuffle to shuffle_1d. The idea is similar to 8x8 cases in x86Vector::avx2. The general algorithm is:

Coming back from vacation I am quite confused by the state in which this PR has landed: this is mentioned to be similar to x86Vector::avx2 but we now have 2 very different implementations that live in separate places for similar algorithms.
I would expect such an implementation to reuse and/or extend utils defined in X86Vector/Transforms/AVXTranspose.cpp such as MaskHelper::shuffle/blend/permute. Instead I seem to see a new implementation with magic assembly constants, one-off new helpers etc.

Please refactor this to be in line with the existing implementation (and/or evolve the existing one accordingly): it is not sustainable to have 2 completely separate implementations for simple variations of the same algorithm.

mlir/test/Dialect/Vector/vector-transpose-lowering.mlir
616	The only thing that uses asm atm is the avx2 vblendps-based lowering that I wasn't able to get through LLVM in any other way than inline asm (as per the discussion I posted). If we have a good reliable way of generating the vblendps instructions without asm that would be great. I do not see any blend instruction in the gist though: https://gist.githubusercontent.com/hanhanW/c5fefa20151c27da113181e6748697a3/raw Re "retargetable", I don't see it; isn't this implementation quite specific to AVX512 where we want to really be careful about crossing 128b boundaries? Isn't this also at risk of spilling severely on architectures with smaller vector sizes? I see this much more naturally living under x86vector, under an avx512 namespace.
652	CHECK-COUNT where possible plz

> OTR, if blends are needed, we should also consider (not sure if Nicolas already tried it):
>
> The only thing I can think of is you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPACK*PS and the SHUFPS/BLENDPS:
>
> 8 x UNPCKLPS/UNPCKHPS
> 4 x INSERTF128
> 4 x PERM2F128

I did try different variations but did not get LLVM to ever emit the blend-based version, hence the addition of an inline asm version.

> I think you would actually want the insert/perm2f128 before the unpck. Some x86 targets have a quirk where insert has better
> throughput when micro-fused with loads. It will be easier to detect and optimize in codegen if the inputs to the transpose (potentially
> memory) have the insertf128 pattern as their first shuffle.

Nice insight, I'd definitely be interested in seeing this tried out and measured.

dcaballe added inline comments. May 8 2023, 10:55 AM

mlir/test/Dialect/Vector/vector-transpose-lowering.mlir
616	Interleaving/deinterleaving ops are common shuffle ops in most architectures, and restrictions on shuffling across 128-bit lanes are also common (see, for example, slide 13 here: https://www.stonybrook.edu/commcms/ookami/support/_docs/ARM_SVE_tutorial.pdf). This pattern won't be a perfect fit for, let's say, SVE and RISC-V, but I would expect a certain level of applicability, esp. if we compare it with scalar transfer ops or a single giant shuffle. Totally agree, though, that we should refactor the common components instead of reimplementing them. That would be a great follow-up.

Revision Contents

Path	Size
mlir/include/mlir/Dialect/Vector/TransformOps/VectorTransformOps.h	6 lines
mlir/include/mlir/Dialect/Vector/TransformOps/VectorTransformOps.td	4 lines
mlir/include/mlir/Dialect/X86Vector/Transforms.h	38 lines
mlir/lib/Dialect/Vector/TransformOps/VectorTransformOps.cpp	17 lines
mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp	155 lines
mlir/test/Dialect/Vector/vector-transpose-lowering.mlir	78 lines

Diff 514828

mlir/include/mlir/Dialect/Vector/TransformOps/VectorTransformOps.h

(leading lines elided; enclosing context: struct LowerVectorsOptions : public VectorTransformsOptions {)
   /// @}

   bool transposeAVX2Lowering = false;
   LowerVectorsOptions &setTransposeAVX2Lowering(bool opt) {
     transposeAVX2Lowering = opt;
     return *this;
   }

+  bool transposeAVX512Lowering = false;
+  LowerVectorsOptions &setTransposeAVX512Lowering(bool opt) {
+    transposeAVX512Lowering = opt;
+    return *this;
+  }
+
   bool unrollVectorTransfers = true;
   LowerVectorsOptions &setUnrollVectorTransfers(bool opt) {
     unrollVectorTransfers = opt;
     return *this;
   }
 };
 } // namespace vector
 } // namespace mlir

 #endif // MLIR_DIALECT_VECTOR_TRANSFORMOPS_VECTORTRANSFORMOPS_H

mlir/include/mlir/Dialect/Vector/TransformOps/VectorTransformOps.td

(leading lines elided; enclosing context: let description = [{)

     This is usually a late step that is run after bufferization as part of the
     process of lowering to e.g. LLVM or NVVM.
   }];

   let arguments = (ins TransformHandleTypeInterface:$target,
      DefaultValuedAttr<VectorTransposeLoweringAttr,
        "vector::VectorTransposeLowering::EltWise">:$lowering_strategy,
-     DefaultValuedAttr<BoolAttr, "false">:$avx2_lowering_strategy
+     DefaultValuedAttr<BoolAttr, "false">:$avx2_lowering_strategy,
+     DefaultValuedAttr<BoolAttr, "false">:$avx512_lowering_strategy
   );
   let results = (outs TransformHandleTypeInterface:$results);

   let assemblyFormat = [{
     $target
     oilist (
       `lowering_strategy` `=` $lowering_strategy
       | `avx2_lowering_strategy` `=` $avx2_lowering_strategy
+      | `avx512_lowering_strategy` `=` $avx512_lowering_strategy
     )
     attr-dict
     `:` functional-type($target, results)
   }];
 }

 // TODO: evolve split_transfer_strategy to proper enums.
 def SplitTransferFullPartialOp
(remaining lines elided)

mlir/include/mlir/Dialect/X86Vector/Transforms.h

(leading lines elided)
 /// - clang/test/CodeGen/X86/avx2-builtins.c
 /// - clang/test/CodeGen/X86/avx-shuffle-builtins.c
 /// as well as the Intel Intrinsics Guide
 /// (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html)
 /// make it easier to just implement known good lowerings.
 /// All intrinsics correspond 1-1 to the Intel definition.
 //===----------------------------------------------------------------------===//

+namespace avx512 {
+namespace intrin {
+/// Lower to vector.shuffle v1, v2, [0, 1, 16, 17,
+///                                  0+4, 1+4, 16+4, 17+4,
+///                                  0+8, 1+8, 16+8, 17+8,
+///                                  0+12, 1+12, 16+12, 17+12].
+Value mm512UnpackLoPd(ImplicitLocOpBuilder &b, Value v1, Value v2);
+/// Lower to vector.shuffle, v1, v2, [2, 3, 18, 19,
+///                                   2+4, 3+4, 18+4, 19+4,
+///                                   2+8, 3+8, 18+8, 19+8,
+///                                   2+12, 3+12, 18+12, 19+12].
+Value mm512UnpackHiPd(ImplicitLocOpBuilder &b, Value v1, Value v2);
+/// Lower to vector.shuffle, v1, v2, [0, 16, 1, 17,
+///                                   0+4, 16+4, 1+4, 17+4,
+///                                   0+8, 16+8, 1+8, 17+8,
+///                                   0+12, 16+12, 1+12, 17+12].
+Value mm512UnpackLoPs(ImplicitLocOpBuilder &b, Value v1, Value v2);
+/// Lower to vector.shuffle, v1, v2, [2, 18, 3, 19,
+///                                   2+4, 18+4, 3+4, 19+4,
+///                                   2+8, 18+8, 3+8, 19+8,
+///                                   2+12, 18+12, 3+12, 19+12].
+Value mm512UnpackHiPs(ImplicitLocOpBuilder &b, Value v1, Value v2);
+} // namespace intrin
+
+/// 16x16xf32-specific AVX512 transpose lowering.
+void transpose16x16xf32(ImplicitLocOpBuilder &ib, MutableArrayRef<Value> vs);
+} // namespace avx512
+
 namespace avx2 {

 namespace inline_asm {
 //===----------------------------------------------------------------------===//
 /// Methods in the inline_asm namespace emit calls to LLVM::InlineAsmOp.
 //===----------------------------------------------------------------------===//
 /// If bit i of `mask` is zero, take f32@i from v1 else take it from v2.
 Value mm256BlendPsAsm(ImplicitLocOpBuilder &b, Value v1, Value v2,
(34 lines elided)
 /// Generic lowerings may either use intrin or inline_asm depending on needs.
 //===----------------------------------------------------------------------===//
 /// 4x8xf32-specific AVX2 transpose lowering.
 void transpose4x8xf32(ImplicitLocOpBuilder &ib, MutableArrayRef<Value> vs);

 /// 8x8xf32-specific AVX2 transpose lowering.
 void transpose8x8xf32(ImplicitLocOpBuilder &ib, MutableArrayRef<Value> vs);

+} // namespace avx2
+
 /// Structure to control the behavior of specialized AVX2 transpose lowering.
 struct TransposeLoweringOptions {
   bool lower4x8xf32_ = false;
   TransposeLoweringOptions &lower4x8xf32(bool lower = true) {
     lower4x8xf32_ = lower;
     return *this;
   }
   bool lower8x8xf32_ = false;
   TransposeLoweringOptions &lower8x8xf32(bool lower = true) {
     lower8x8xf32_ = lower;
     return *this;
   }
+  bool lower16x16xf32_ = false;
+  TransposeLoweringOptions &lower16x16xf32(bool lower = true) {
+    lower16x16xf32_ = lower;
+    return *this;
+  }
 };

-/// Options for controlling specialized AVX2 lowerings.
+/// Options for controlling specialized AVX lowerings.
 struct LoweringOptions {
   /// Configure specialized vector lowerings.
   TransposeLoweringOptions transposeOptions;
   LoweringOptions &setTransposeOptions(TransposeLoweringOptions options) {
     transposeOptions = options;
     return *this;
   }
 };

 /// Insert specialized transpose lowering patterns.
 void populateSpecializedTransposeLoweringPatterns(
     RewritePatternSet &patterns, LoweringOptions options = LoweringOptions(),
     int benefit = 10);

-} // namespace avx2
 } // namespace x86vector

 /// Collect a set of patterns to lower X86Vector ops to ops that map to LLVM
 /// intrinsics.
 void populateX86VectorLegalizeForLLVMExportPatterns(
     LLVMTypeConverter &converter, RewritePatternSet &patterns);

 /// Configure the target to support lowering X86Vector ops to ops that map to
 /// LLVM intrinsics.
 void configureX86VectorLegalizeForExportTarget(LLVMConversionTarget &target);

 } // namespace mlir

 #endif // MLIR_DIALECT_X86VECTOR_TRANSFORMS_H

mlir/lib/Dialect/Vector/TransformOps/VectorTransformOps.cpp

	Show First 20 Lines • Show All 130 Lines • ▼ Show 20 Lines
	// LowerTransposeOp			// LowerTransposeOp
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	void transform::LowerTransposeOp::populatePatterns(			void transform::LowerTransposeOp::populatePatterns(
	RewritePatternSet &patterns) {			RewritePatternSet &patterns) {
	vector::populateVectorTransposeLoweringPatterns(			vector::populateVectorTransposeLoweringPatterns(
	patterns, vector::VectorTransformsOptions().setVectorTransposeLowering(			patterns, vector::VectorTransformsOptions().setVectorTransposeLowering(
	getLoweringStrategy()));			getLoweringStrategy()));
				auto transposeOptions = x86vector::TransposeLoweringOptions();
	if (getAvx2LoweringStrategy()) {			if (getAvx2LoweringStrategy()) {
	auto avx2LoweringOptions =			transposeOptions = transposeOptions.lower4x8xf32().lower8x8xf32();
	x86vector::avx2::LoweringOptions().setTransposeOptions(
	x86vector::avx2::TransposeLoweringOptions()
	.lower4x8xf32(true)
	.lower8x8xf32(true));
	x86vector::avx2::populateSpecializedTransposeLoweringPatterns(
	patterns, avx2LoweringOptions, /*benefit=*/10);
	}			}
				if (getAvx512LoweringStrategy()) {
				transposeOptions = transposeOptions.lower16x16xf32();
				}

				auto avxLoweringOptions =
				x86vector::LoweringOptions().setTransposeOptions(transposeOptions);
				x86vector::populateSpecializedTransposeLoweringPatterns(
				patterns, avxLoweringOptions, /*benefit=*/10);
	}			}

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// SplitTransferFullPartialOp			// SplitTransferFullPartialOp
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	void transform::SplitTransferFullPartialOp::populatePatterns(			void transform::SplitTransferFullPartialOp::populatePatterns(
	RewritePatternSet &patterns) {			RewritePatternSet &patterns) {
	[... 46 lines not shown ...]

mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp

[... 21 lines not shown ...]
#include "llvm/Support/FormatVariadic.h"		#include "llvm/Support/FormatVariadic.h"

using namespace mlir;		using namespace mlir;
using namespace mlir::vector;		using namespace mlir::vector;
using namespace mlir::x86vector;		using namespace mlir::x86vector;
using namespace mlir::x86vector::avx2;		using namespace mlir::x86vector::avx2;
using namespace mlir::x86vector::avx2::inline_asm;		using namespace mlir::x86vector::avx2::inline_asm;
using namespace mlir::x86vector::avx2::intrin;		using namespace mlir::x86vector::avx2::intrin;
		using namespace mlir::x86vector::avx512::intrin;
		using namespace mlir::x86vector::avx512;

Value mlir::x86vector::avx2::inline_asm::mm256BlendPsAsm(		Value mlir::x86vector::avx2::inline_asm::mm256BlendPsAsm(
ImplicitLocOpBuilder &b, Value v1, Value v2, uint8_t mask) {		ImplicitLocOpBuilder &b, Value v1, Value v2, uint8_t mask) {
auto asmDialectAttr =		auto asmDialectAttr =
LLVM::AsmDialectAttr::get(b.getContext(), LLVM::AsmDialect::AD_Intel);		LLVM::AsmDialectAttr::get(b.getContext(), LLVM::AsmDialect::AD_Intel);
const auto *asmTp = "vblendps $0, $1, $2, {0}";		const auto *asmTp = "vblendps $0, $1, $2, {0}";
const auto *asmCstr =		const auto *asmCstr =
"=x,x,x"; // Careful: constraint parser is very brittle: no ws!		"=x,x,x"; // Careful: constraint parser is very brittle: no ws!
SmallVector<Value> asmVals{v1, v2};		SmallVector<Value> asmVals{v1, v2};
auto asmStr = llvm::formatv(asmTp, llvm::format_hex(mask, /*width=*/2)).str();		auto asmStr = llvm::formatv(asmTp, llvm::format_hex(mask, /*width=*/2)).str();
auto asmOp = b.create<LLVM::InlineAsmOp>(		auto asmOp = b.create<LLVM::InlineAsmOp>(
v1.getType(), /*operands=*/asmVals, /*asm_string=*/asmStr,		v1.getType(), /*operands=*/asmVals, /*asm_string=*/asmStr,
/*constraints=*/asmCstr, /*has_side_effects=*/false,		/*constraints=*/asmCstr, /*has_side_effects=*/false,
/*is_align_stack=*/false, /*asm_dialect=*/asmDialectAttr,		/*is_align_stack=*/false, /*asm_dialect=*/asmDialectAttr,
/*operand_attrs=*/ArrayAttr());		/*operand_attrs=*/ArrayAttr());
return asmOp.getResult(0);		return asmOp.getResult(0);
}		}

		static SmallVector<int64_t> getMm512UnpackShufflePerm(ArrayRef<int64_t> vals) {
		goldstein.w.n (Not Done): It might pay to make the number of elements an argument (or template parameter) and then reuse this for the `mm256` variants.
		hanchung (Author, Done): Sounds good. I added the method to LowerVectorTranspose.cpp. I'll prepare another PR that moves the 4x8 case to LowerVectorTranspose.cpp and reuses the method for `mm256`.
		SmallVector<int64_t> res;
		for (int i = 0; i < 16; i += 4)
		for (int64_t v : vals)
		res.push_back(v + i);
		return res;
		}
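The parameterization suggested in the inline comment on this helper could look roughly like the sketch below. It is not the code that landed: `getUnpackShufflePerm` and its `numElems` parameter are hypothetical names, and `std::vector` stands in for `SmallVector` so the snippet compiles without LLVM headers. It assumes f32 elements, i.e. four elements per 128-bit lane.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical generalization of getMm512UnpackShufflePerm. `numElems` is the
// vector length (16 for 512-bit f32 vectors, 8 for 256-bit ones). `vals` holds
// the indices for the first 128-bit lane, with second-operand indices already
// offset by `numElems`.
std::vector<int64_t> getUnpackShufflePerm(int64_t numElems,
                                          const std::vector<int64_t> &vals) {
  assert(numElems % 4 == 0 && "expected a whole number of 128-bit f32 lanes");
  std::vector<int64_t> res;
  // Repeat the per-lane pattern once for every 128-bit lane (4 x f32).
  for (int64_t lane = 0; lane < numElems; lane += 4)
    for (int64_t v : vals)
      res.push_back(v + lane);
  return res;
}
```

With `numElems = 16` and `{0, 16, 1, 17}` this yields the mask used by `mm512UnpackLoPs`; with `numElems = 8` and `{0, 8, 1, 9}` it yields the literal mask that `mm256UnpackLoPs` spells out by hand.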

		Value mlir::x86vector::avx512::intrin::mm512UnpackLoPd(ImplicitLocOpBuilder &b,
		Value v1, Value v2) {
		return b.create<vector::ShuffleOp>(v1, v2,
		getMm512UnpackShufflePerm({0, 1, 16, 17}));
		}

		Value mlir::x86vector::avx512::intrin::mm512UnpackHiPd(ImplicitLocOpBuilder &b,
		Value v1, Value v2) {
		return b.create<vector::ShuffleOp>(v1, v2,
		getMm512UnpackShufflePerm({2, 3, 18, 19}));
		}

		Value mlir::x86vector::avx512::intrin::mm512UnpackLoPs(ImplicitLocOpBuilder &b,
		Value v1, Value v2) {
		return b.create<vector::ShuffleOp>(v1, v2,
		getMm512UnpackShufflePerm({0, 16, 1, 17}));
		}

		Value mlir::x86vector::avx512::intrin::mm512UnpackHiPs(ImplicitLocOpBuilder &b,
		Value v1, Value v2) {
		return b.create<vector::ShuffleOp>(v1, v2,
		getMm512UnpackShufflePerm({2, 18, 3, 19}));
		}

		Value mm512ShuffleI32x4(ImplicitLocOpBuilder &b, Value v1, Value v2,
		goldstein.w.n (Not Done): static?
		uint8_t mask) {
		assert(v1.getType().cast<VectorType>().getShape()[0] == 16 &&
		"expected a vector with length=16");
		SmallVector<int64_t> shuffleMask;
		auto appendToMask = [&](int64_t base, uint8_t control) {
		if (control == 0)
		llvm::append_range(shuffleMask, ArrayRef<int64_t>{base + 0, base + 1,
		base + 2, base + 3});
		else if (control == 1)
		llvm::append_range(shuffleMask, ArrayRef<int64_t>{base + 4, base + 5,
		base + 6, base + 7});
		else if (control == 2)
		llvm::append_range(shuffleMask, ArrayRef<int64_t>{base + 8, base + 9,
		base + 10, base + 11});
		else if (control == 3)
		llvm::append_range(shuffleMask, ArrayRef<int64_t>{base + 12, base + 13,
		base + 14, base + 15});
		else
		llvm_unreachable("control > 3 : overflow");
		};
		uint8_t b01, b23, b45, b67;
		MaskHelper::extractShuffle(mask, b01, b23, b45, b67);
		appendToMask(0, b01);
		appendToMask(0, b23);
		appendToMask(16, b45);
		appendToMask(16, b67);
		return b.create<vector::ShuffleOp>(v1, v2, shuffleMask);
		}
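To make the `0x88` and `0xdd` immediates used by `mm512ShuffleI32x4` easier to read, here is a minimal standalone sketch of the same mask expansion. It assumes that `MaskHelper::extractShuffle` splits the immediate into four 2-bit fields, lowest bits first; `shuffleI32x4Mask` is a hypothetical name, and `std::vector` replaces `SmallVector`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Expand an 8-bit _mm512_shuffle_i32x4-style control into a 16-entry shuffle
// mask over two 16-element vectors (indices 0-15 address the first operand,
// 16-31 the second).
std::vector<int64_t> shuffleI32x4Mask(uint8_t imm) {
  std::vector<int64_t> mask;
  for (int field = 0; field < 4; ++field) {
    // Each 2-bit field selects one 128-bit lane (4 x f32) of its source.
    int64_t lane = (imm >> (2 * field)) & 0x3;
    // The two low fields read the first operand, the two high fields the
    // second operand.
    int64_t base = field < 2 ? 0 : 16;
    for (int64_t i = 0; i < 4; ++i)
      mask.push_back(base + 4 * lane + i);
  }
  assert(mask.size() == 16 && "four lanes of four elements each");
  return mask;
}
```

Under that assumption, `0x88` picks lanes 0 and 2 of each operand and `0xdd` picks lanes 1 and 3, which matches the `vector.shuffle` masks checked in the test file.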

		void mlir::x86vector::avx512::transpose16x16xf32(ImplicitLocOpBuilder &b,
		MutableArrayRef<Value> vs) {
		// Interleave 32-bit lanes using
		// 8x _mm512_unpacklo_epi32
		// 8x _mm512_unpackhi_epi32
		Value t0 = mm512UnpackLoPs(b, vs[0], vs[1]);
		Value t1 = mm512UnpackHiPs(b, vs[0], vs[1]);
		Value t2 = mm512UnpackLoPs(b, vs[2], vs[3]);
		Value t3 = mm512UnpackHiPs(b, vs[2], vs[3]);
		Value t4 = mm512UnpackLoPs(b, vs[4], vs[5]);
		Value t5 = mm512UnpackHiPs(b, vs[4], vs[5]);
		Value t6 = mm512UnpackLoPs(b, vs[6], vs[7]);
		Value t7 = mm512UnpackHiPs(b, vs[6], vs[7]);
		Value t8 = mm512UnpackLoPs(b, vs[8], vs[9]);
		Value t9 = mm512UnpackHiPs(b, vs[8], vs[9]);
		Value ta = mm512UnpackLoPs(b, vs[10], vs[11]);
		Value tb = mm512UnpackHiPs(b, vs[10], vs[11]);
		Value tc = mm512UnpackLoPs(b, vs[12], vs[13]);
		Value td = mm512UnpackHiPs(b, vs[12], vs[13]);
		Value te = mm512UnpackLoPs(b, vs[14], vs[15]);
		Value tf = mm512UnpackHiPs(b, vs[14], vs[15]);

		// Interleave 64-bit lanes using
		// 8x _mm512_unpacklo_epi64
		// 8x _mm512_unpackhi_epi64
		Value r0 = mm512UnpackLoPd(b, t0, t2);
		Value r1 = mm512UnpackHiPd(b, t0, t2);
		Value r2 = mm512UnpackLoPd(b, t1, t3);
		Value r3 = mm512UnpackHiPd(b, t1, t3);
		Value r4 = mm512UnpackLoPd(b, t4, t6);
		Value r5 = mm512UnpackHiPd(b, t4, t6);
		Value r6 = mm512UnpackLoPd(b, t5, t7);
		Value r7 = mm512UnpackHiPd(b, t5, t7);
		Value r8 = mm512UnpackLoPd(b, t8, ta);
		Value r9 = mm512UnpackHiPd(b, t8, ta);
		Value ra = mm512UnpackLoPd(b, t9, tb);
		Value rb = mm512UnpackHiPd(b, t9, tb);
		Value rc = mm512UnpackLoPd(b, tc, te);
		Value rd = mm512UnpackHiPd(b, tc, te);
		Value re = mm512UnpackLoPd(b, td, tf);
		Value rf = mm512UnpackHiPd(b, td, tf);

		// Permute 128-bit lanes using
		// 16x _mm512_shuffle_i32x4
		t0 = mm512ShuffleI32x4(b, r0, r4, 0x88);
		t1 = mm512ShuffleI32x4(b, r1, r5, 0x88);
		t2 = mm512ShuffleI32x4(b, r2, r6, 0x88);
		t3 = mm512ShuffleI32x4(b, r3, r7, 0x88);
		t4 = mm512ShuffleI32x4(b, r0, r4, 0xdd);
		t5 = mm512ShuffleI32x4(b, r1, r5, 0xdd);
		t6 = mm512ShuffleI32x4(b, r2, r6, 0xdd);
		t7 = mm512ShuffleI32x4(b, r3, r7, 0xdd);
		t8 = mm512ShuffleI32x4(b, r8, rc, 0x88);
		t9 = mm512ShuffleI32x4(b, r9, rd, 0x88);
		ta = mm512ShuffleI32x4(b, ra, re, 0x88);
		tb = mm512ShuffleI32x4(b, rb, rf, 0x88);
		tc = mm512ShuffleI32x4(b, r8, rc, 0xdd);
		td = mm512ShuffleI32x4(b, r9, rd, 0xdd);
		te = mm512ShuffleI32x4(b, ra, re, 0xdd);
		tf = mm512ShuffleI32x4(b, rb, rf, 0xdd);

		// Permute 256-bit lanes using again
		// 16x _mm512_shuffle_i32x4
		vs[0x0] = mm512ShuffleI32x4(b, t0, t8, 0x88);
		vs[0x1] = mm512ShuffleI32x4(b, t1, t9, 0x88);
		vs[0x2] = mm512ShuffleI32x4(b, t2, ta, 0x88);
		vs[0x3] = mm512ShuffleI32x4(b, t3, tb, 0x88);
		vs[0x4] = mm512ShuffleI32x4(b, t4, tc, 0x88);
		vs[0x5] = mm512ShuffleI32x4(b, t5, td, 0x88);
		vs[0x6] = mm512ShuffleI32x4(b, t6, te, 0x88);
		vs[0x7] = mm512ShuffleI32x4(b, t7, tf, 0x88);
		vs[0x8] = mm512ShuffleI32x4(b, t0, t8, 0xdd);
		vs[0x9] = mm512ShuffleI32x4(b, t1, t9, 0xdd);
		vs[0xa] = mm512ShuffleI32x4(b, t2, ta, 0xdd);
		vs[0xb] = mm512ShuffleI32x4(b, t3, tb, 0xdd);
		vs[0xc] = mm512ShuffleI32x4(b, t4, tc, 0xdd);
		vs[0xd] = mm512ShuffleI32x4(b, t5, td, 0xdd);
		vs[0xe] = mm512ShuffleI32x4(b, t6, te, 0xdd);
		vs[0xf] = mm512ShuffleI32x4(b, t7, tf, 0xdd);
		}
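The four stages of `transpose16x16xf32` can be checked on plain arrays. The sketch below is a scalar model, not MLIR code: `Vec`, `Mat`, `shuffleVec`, `unpack`, `shuffleI32x4` and `transpose16x16` are hypothetical names, the loops are a compressed restatement of the 64 explicit calls above, and the immediate decoding assumes four 2-bit fields, lowest bits first.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec = std::array<int, 16>;
using Mat = std::array<Vec, 16>;

// Scalar stand-in for vector.shuffle: indices 0-15 read `a`, 16-31 read `b`.
Vec shuffleVec(const Vec &a, const Vec &b, const std::array<int, 16> &m) {
  Vec r{};
  for (int i = 0; i < 16; ++i)
    r[i] = m[i] < 16 ? a[m[i]] : b[m[i] - 16];
  return r;
}

// Unpack-style shuffle: repeat a 4-entry pattern across the four 128-bit lanes.
Vec unpack(const Vec &a, const Vec &b, std::array<int, 4> pat) {
  std::array<int, 16> m{};
  for (int lane = 0; lane < 4; ++lane)
    for (int i = 0; i < 4; ++i)
      m[4 * lane + i] = pat[i] + 4 * lane;
  return shuffleVec(a, b, m);
}

// 128-bit lane permute controlled by an 8-bit immediate (2 bits per lane).
Vec shuffleI32x4(const Vec &a, const Vec &b, uint8_t imm) {
  std::array<int, 16> m{};
  for (int f = 0; f < 4; ++f) {
    int lane = (imm >> (2 * f)) & 3;
    for (int i = 0; i < 4; ++i)
      m[4 * f + i] = (f < 2 ? 0 : 16) + 4 * lane + i;
  }
  return shuffleVec(a, b, m);
}

Mat transpose16x16(Mat v) {
  Mat t{}, r{};
  // Stage 1: interleave 32-bit elements of adjacent row pairs.
  for (int i = 0; i < 16; i += 2) {
    t[i] = unpack(v[i], v[i + 1], {0, 16, 1, 17});
    t[i + 1] = unpack(v[i], v[i + 1], {2, 18, 3, 19});
  }
  // Stage 2: interleave 64-bit pairs within each group of four rows.
  for (int g = 0; g < 16; g += 4) {
    r[g] = unpack(t[g], t[g + 2], {0, 1, 16, 17});
    r[g + 1] = unpack(t[g], t[g + 2], {2, 3, 18, 19});
    r[g + 2] = unpack(t[g + 1], t[g + 3], {0, 1, 16, 17});
    r[g + 3] = unpack(t[g + 1], t[g + 3], {2, 3, 18, 19});
  }
  // Stage 3: permute 128-bit lanes within each group of eight rows.
  for (int h = 0; h < 16; h += 8)
    for (int k = 0; k < 4; ++k) {
      t[h + k] = shuffleI32x4(r[h + k], r[h + k + 4], 0x88);
      t[h + k + 4] = shuffleI32x4(r[h + k], r[h + k + 4], 0xdd);
    }
  // Stage 4: permute 128-bit lanes across the two halves.
  for (int k = 0; k < 8; ++k) {
    v[k] = shuffleI32x4(t[k], t[k + 8], 0x88);
    v[k + 8] = shuffleI32x4(t[k], t[k + 8], 0xdd);
  }
  return v;
}
```

Filling the input with `16 * row + col` and comparing `out[col][row]` against `in[row][col]` is a quick way to confirm that the stage ordering produces a full transpose.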

Value mlir::x86vector::avx2::intrin::mm256UnpackLoPs(ImplicitLocOpBuilder &b,		Value mlir::x86vector::avx2::intrin::mm256UnpackLoPs(ImplicitLocOpBuilder &b,
Value v1, Value v2) {		Value v1, Value v2) {
return b.create<vector::ShuffleOp>(		return b.create<vector::ShuffleOp>(
v1, v2, ArrayRef<int64_t>{0, 8, 1, 9, 4, 12, 5, 13});		v1, v2, ArrayRef<int64_t>{0, 8, 1, 9, 4, 12, 5, 13});
}		}

Value mlir::x86vector::avx2::intrin::mm256UnpackHiPs(ImplicitLocOpBuilder &b,		Value mlir::x86vector::avx2::intrin::mm256UnpackHiPs(ImplicitLocOpBuilder &b,
Value v1, Value v2) {		Value v1, Value v2) {
[... 238 lines not shown ...]
	auto applyRewrite = [&]() {
reshInput = ib.create<vector::ShapeCastOp>(reshInputType, reshInput);		reshInput = ib.create<vector::ShapeCastOp>(reshInputType, reshInput);

// Extract 1-D vectors from the higher-order dimension of the input		// Extract 1-D vectors from the higher-order dimension of the input
// vector.		// vector.
for (int64_t i = 0; i < m; ++i)		for (int64_t i = 0; i < m; ++i)
vs.push_back(ib.create<vector::ExtractOp>(reshInput, i));		vs.push_back(ib.create<vector::ExtractOp>(reshInput, i));

// Transpose set of 1-D vectors.		// Transpose set of 1-D vectors.
if (m == 4)		if (m == 4 && n == 8)
transpose4x8xf32(ib, vs);		transpose4x8xf32(ib, vs);
if (m == 8)		if (m == 8 && n == 8)
transpose8x8xf32(ib, vs);		transpose8x8xf32(ib, vs);
		if (m == 16 && n == 16)
		transpose16x16xf32(ib, vs);

// Insert transposed 1-D vectors into the higher-order dimension of the		// Insert transposed 1-D vectors into the higher-order dimension of the
// output vector.		// output vector.
Value res = ib.create<arith::ConstantOp>(reshInputType,		Value res = ib.create<arith::ConstantOp>(reshInputType,
ib.getZeroAttr(reshInputType));		ib.getZeroAttr(reshInputType));
for (int64_t i = 0; i < m; ++i)		for (int64_t i = 0; i < m; ++i)
res = ib.create<vector::InsertOp>(vs[i], res, i);		res = ib.create<vector::InsertOp>(vs[i], res, i);

// The output vector still has the shape of the input vector (e.g., 4x8).		// The output vector still has the shape of the input vector (e.g., 4x8).
// We have to transpose their dimensions and retrieve its original rank		// We have to transpose their dimensions and retrieve its original rank
// (e.g., 1x8x1x4x1).		// (e.g., 1x8x1x4x1).
res = ib.create<vector::ShapeCastOp>(flattenedType, res);		res = ib.create<vector::ShapeCastOp>(flattenedType, res);
res = ib.create<vector::ShapeCastOp>(op.getResultVectorType(), res);		res = ib.create<vector::ShapeCastOp>(op.getResultVectorType(), res);
rewriter.replaceOp(op, res);		rewriter.replaceOp(op, res);
return success();		return success();
};		};

if (loweringOptions.transposeOptions.lower4x8xf32_ && m == 4 && n == 8)		if (loweringOptions.transposeOptions.lower4x8xf32_ && m == 4 && n == 8)
return applyRewrite();		return applyRewrite();
if (loweringOptions.transposeOptions.lower8x8xf32_ && m == 8 && n == 8)		if (loweringOptions.transposeOptions.lower8x8xf32_ && m == 8 && n == 8)
return applyRewrite();		return applyRewrite();
		if (loweringOptions.transposeOptions.lower16x16xf32_ && m == 16 && n == 16)
		return applyRewrite();
return failure();		return failure();
}		}

private:		private:
LoweringOptions loweringOptions;		LoweringOptions loweringOptions;
};		};

void mlir::x86vector::avx2::populateSpecializedTransposeLoweringPatterns(		void mlir::x86vector::populateSpecializedTransposeLoweringPatterns(
RewritePatternSet &patterns, LoweringOptions options, int benefit) {		RewritePatternSet &patterns, LoweringOptions options, int benefit) {
patterns.add<TransposeOpLowering>(options, patterns.getContext(), benefit);		patterns.add<TransposeOpLowering>(options, patterns.getContext(), benefit);
}		}

mlir/test/Dialect/Vector/vector-transpose-lowering.mlir

	[... 603 lines not shown ...]
	}			}

	transform.sequence failures(propagate) {			transform.sequence failures(propagate) {
	^bb1(%module_op: !pdl.operation):			^bb1(%module_op: !pdl.operation):
	transform.vector.lower_transpose %module_op			transform.vector.lower_transpose %module_op
	avx2_lowering_strategy = true			avx2_lowering_strategy = true
	: (!pdl.operation) -> !pdl.operation			: (!pdl.operation) -> !pdl.operation
	}			}

				// -----

				func.func @transpose210_1x16x16xf32(%arg0: vector<1x16x16xf32>) -> vector<16x16x1xf32> {
				dcaballe (Done): transpose_16x16xf32 -> transpose_shuffle16x16xf32?
				// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
				dcaballe (Done): Really cool that we don't have to use ASM for this pattern. That makes it retargetable to AVX2 and also to other targets!
				nicolasvasilache (Not Done): The only thing that uses asm at the moment is the avx2 vblendps-based lowering, which I wasn't able to get through LLVM in any way other than inline asm (as per the discussion I posted). If we have a good, reliable way of generating the vblendps instructions without asm, that would be great. I do not see any blend instruction in the gist, though: https://gist.githubusercontent.com/hanhanW/c5fefa20151c27da113181e6748697a3/raw Re "retargetable": I don't see it. Isn't this implementation quite specific to AVX512, where we really want to be careful about crossing 128b boundaries? Isn't this also at risk of spilling severely on architectures with smaller vector sizes? I see this living much more naturally under x86vector, in an avx512 namespace.
				dcaballe (Not Done): Interleaving/deinterleaving ops are common shuffle ops in most architectures, and limitations on shuffling across >128-bit lanes are also common (see, for example, slide 13 here: https://www.stonybrook.edu/commcms/ookami/support/_docs/ARM_SVE_tutorial.pdf). This pattern won't be a perfect fit for, let's say, SVE and RISC-V, but I would expect a certain level of applicability, especially if we compare it with scalar transfer ops or a single giant shuffle. Totally agree, though, that we should refactor the common components instead of reimplementing them. That would be a great follow-up.
				// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 16, 17, 4, 5, 20, 21, 8, 9, 24, 25, 12, 13, 28, 29] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [2, 3, 18, 19, 6, 7, 22, 23, 10, 11, 26, 27, 14, 15, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				nicolasvasilache (Not Done): Please use CHECK-COUNT where possible.
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				// CHECK: vector.shuffle {{.*}} [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31] : vector<16xf32>, vector<16xf32>
				%0 = vector.transpose %arg0, [2, 1, 0] : vector<1x16x16xf32> to vector<16x16x1xf32>
				return %0 : vector<16x16x1xf32>
				}

				transform.sequence failures(propagate) {
				^bb1(%module_op: !pdl.operation):
				transform.vector.lower_transpose %module_op
				avx512_lowering_strategy = true
				: (!pdl.operation) -> !pdl.operation
				}


Diff 514828
